]> git.ipfire.org Git - thirdparty/kernel/stable.git/log
thirdparty/kernel/stable.git
6 weeks agonet: gso: restore ids of outer ip headers correctly
Richard Gobert [Tue, 23 Sep 2025 08:59:06 +0000 (10:59 +0200)] 
net: gso: restore ids of outer ip headers correctly

Currently, NETIF_F_TSO_MANGLEID indicates that the inner-most ID can
be mangled. Outer IDs can always be mangled.

Make GSO preserve outer IDs by default, with NETIF_F_TSO_MANGLEID allowing
both inner and outer IDs to be mangled.

This commit also modifies a few drivers that use SKB_GSO_FIXEDID directly.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-4-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet: gro: only merge packets with incrementing or fixed outer ids
Richard Gobert [Tue, 23 Sep 2025 08:59:05 +0000 (10:59 +0200)] 
net: gro: only merge packets with incrementing or fixed outer ids

Only merge encapsulated packets if their outer IDs are either
incrementing or fixed, just like for inner IDs and IDs of non-encapsulated
packets.

Add another ip_fixedid bit for a total of two bits: one for outer IDs (and
for unencapsulated packets) and one for inner IDs.

This commit preserves the current behavior of GSO where only the IDs of the
inner-most headers are restored correctly.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-3-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet: gro: remove is_ipv6 from napi_gro_cb
Richard Gobert [Tue, 23 Sep 2025 08:59:04 +0000 (10:59 +0200)] 
net: gro: remove is_ipv6 from napi_gro_cb

Remove is_ipv6 from napi_gro_cb and use sk->sk_family instead.
This frees up space for another ip_fixedid bit that will be added
in the next commit.

udp_sock_create always creates either a AF_INET or a AF_INET6 socket,
so using sk->sk_family is reliable. In IPv6-FOU, cfg->ipv6_v6only is
always enabled.

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250923085908.4687-2-richardbgobert@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoeth: fbnic: Read module EEPROM
Mohsin Bashir [Mon, 22 Sep 2025 23:18:55 +0000 (16:18 -0700)] 
eth: fbnic: Read module EEPROM

Add support to read module EEPROM for fbnic. Towards this, add required
support to issue a new command to the firmware and to receive the response
to the corresponding command.

Create a local copy of the data in the completion struct before writing to
ethtool_module_eeprom to avoid writing to data in case it is freed. Given
that EEPROM pages are small, the overhead of additional copy is
negligible.

Do not block API with explicit checks since API has appropriate checks in
place for length, offset, and page.

Explicitly check bank, page, offset, and length in
fbnic_fw_parse_qsfp_read_resp() to match EEPROM read responses to the
correct request. This is important because if the driver times out waiting
for an EEPROM read response, a subsequent read request with different
values is susceptible to receiving an erroneous response (i.e., the
response to the previous request).

Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Link: https://patch.msgid.link/20250922231855.3717483-1-mohsin.bashr@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoMerge branch 'convert-3-drivers-to-ndo_hwtstamp-api'
Jakub Kicinski [Thu, 25 Sep 2025 01:13:05 +0000 (18:13 -0700)] 
Merge branch 'convert-3-drivers-to-ndo_hwtstamp-api'

Vadim Fedorenko says:

====================
convert 3 drivers to ndo_hwtstamp API

Convert tg3, bnxt_en and mlx5 to use ndo_hwtstamp API. These 3 drivers
were chosen because I have access to the HW and is able to test the
changes. Also there is a selftest provided to validated that the driver
correctly sets up timestamp configuration, according to what is exposed
as supported by the hardware. Selftest allows driver to fallback to some
wider scope of RX timestamping, i.e. it allows the driver to set up
ptpv2-event filter when ptpv2-l2-event is requested.
====================

Link: https://patch.msgid.link/20250923173310.139623-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: drv-net: add HW timestamping tests
Vadim Fedorenko [Tue, 23 Sep 2025 17:33:10 +0000 (17:33 +0000)] 
selftests: drv-net: add HW timestamping tests

Add simple tests to validate that the driver sets up timestamping
configuration according to what is reported in capabilities.

For RX timestamping we allow driver to fallback to wider scope for
timestamping if filter is applied. That actually means that driver
can enable ptpv2-event when it reports ptpv2-l4-event is supported,
but not vice versa.

Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250923173310.139623-5-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobnxt_en: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
Vadim Fedorenko [Tue, 23 Sep 2025 17:33:08 +0000 (17:33 +0000)] 
bnxt_en: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()

Convert bnxt dirver to use new timestamping configuration API.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250923173310.139623-3-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotg3: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
Vadim Fedorenko [Tue, 23 Sep 2025 17:33:07 +0000 (17:33 +0000)] 
tg3: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()

Convert tg3 driver to new timestamping configuration API.

Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250923173310.139623-2-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoDocumentation: rxrpc: Demote three sections
Bagas Sanjaya [Mon, 22 Sep 2025 12:41:37 +0000 (19:41 +0700)] 
Documentation: rxrpc: Demote three sections

Three sections ("Socket Options", "Security", and "Example Client Usage")
use title headings, which increase number of entries in the networking
docs toctree by three, and also make the rest of sections headed under
"Example Client Usage".

Demote these sections back to section headings.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250922124137.5266-1-bagasdotme@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: micrel: Fix default LED behaviour
Horatiu Vultur [Mon, 22 Sep 2025 13:03:14 +0000 (15:03 +0200)] 
net: phy: micrel: Fix default LED behaviour

By default the LED will be ON when there is a link but they are not
blinking when there is any traffic activity. Therefore change this
to blink when there is any traffic.

Fixes: 5a774b64cd6a ("net: phy: micrel: Add support for lan8842")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://patch.msgid.link/20250922130314.758229-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'nf-next-25-09-24' of https://git.kernel.org/pub/scm/linux/kernel/git/netfi...
Jakub Kicinski [Thu, 25 Sep 2025 00:45:14 +0000 (17:45 -0700)] 
Merge tag 'nf-next-25-09-24' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

Florian Westphal says:

====================
netfilter: fixes for net-next

These fixes target next because the bug is either not severe or has
existed for so long that there is no reason to cram them in at the last
minute.

1) Fix IPVS ftp unregistering during netns cleanup, broken since netns
   support was introduced in 2011 in the 2.6.39 kernel.
   From Slavin Liu.

2) nfnetlink must reset the 'nlh' pointer back to the original
   address when a batch is replayed, else we emit bogus ACK messages
   and conceal real errno from userspace.
   From Fernando Fernandez Mancera.  This was broken since 6.10.

3) Recent fix for nftables 'pipapo' set type was incomplete, it only
   made things work for the AVX2 version of the algorithm.

4) Testing revealed another problem with avx2 version that results in
   out-of-bounds read access, this bug always existed since feature was
   added in 5.7 kernel.  This also comes with a selftest update.

Last fix resolves a long-standing bug (since 4.9) in conntrack /proc
interface:
Decrease skip count when we reap an expired entry during dump.
As-is we erronously elide one conntrack entry from dump for every expired
entry seen.  From Eric Dumazet.

* tag 'nf-next-25-09-24' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack
  selftests: netfilter: nft_concat_range.sh: add check for double-create bug
  netfilter: nft_set_pipapo_avx2: fix skip of expired entries
  netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups
  netfilter: nfnetlink: reset nlh pointer during batch replay
  ipvs: Defer ip_vs_ftp unregister during netns cleanup
====================

Link: https://patch.msgid.link/20250924140654.10210-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-stmmac-yet-more-cleanups'
Jakub Kicinski [Thu, 25 Sep 2025 00:40:29 +0000 (17:40 -0700)] 
Merge branch 'net-stmmac-yet-more-cleanups'

Russell King says:

====================
net: stmmac: yet more cleanups

Building on the previous cleanup series, this cleans up yet more stmmac
code.

- Move stmmac_bus_clks_config() into stmmac_platform() which is where
  its onlny user is.

- Move the xpcs Clause 73 test into stmmac_init_phy(), resulting in
  simpler code in __stmmac_open().

- Move "can't attach PHY" error message into stmmac_init_phy().

We then start moving stuff out of __stmac_open() into stmmac_open()
(and correspondingly __stmmac_release() into stmmac_release()) which
is not necessary when re-initialising the interface on e.g. MTU change.

- Move initialisation of tx_lpi_timer
- Move PHY attachment/detachment
- Move PHY error message into stmmac_init_phy()

Finally, simplfy the paths in stmmac_init_phy().
====================

Link: https://patch.msgid.link/aNKDqqI7aLsuDD52@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: simplify stmmac_init_phy()
Russell King (Oracle) [Tue, 23 Sep 2025 11:26:24 +0000 (12:26 +0100)] 
net: stmmac: simplify stmmac_init_phy()

If we fail to attach a PHY, there is no point trying to configure WoL
settings. Exit the function after printing the "cannot attach to PHY"
error, and remove the now unnecessary code indentation for configuring
the LPI timer in phylink. Since we know that "ret" must be zero at this
point, change the final return to use a constant rather than "ret".

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v11A8-0000000774M-3pmH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: move PHY handling out of __stmmac_open()/release()
Russell King (Oracle) [Tue, 23 Sep 2025 11:26:19 +0000 (12:26 +0100)] 
net: stmmac: move PHY handling out of __stmmac_open()/release()

Move the PHY attachment/detachment from the network driver out of
__stmmac_open() and __stmmac_release() into stmmac_open() and
stmmac_release() where these actions will only happen when the
interface is administratively brought up or down. It does not make
sense to detach and re-attach the PHY during a change of MTU.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v11A3-0000000774G-3PKY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: move initialisation of priv->tx_lpi_timer to stmmac_open()
Russell King (Oracle) [Tue, 23 Sep 2025 11:26:14 +0000 (12:26 +0100)] 
net: stmmac: move initialisation of priv->tx_lpi_timer to stmmac_open()

The initialisation of priv->tx_lpi_timer only happens once during the
lifetime of the driver, which is during the initial administrative
open of the device. Move this initialisation out of __stmmac_open()
into stmmac_open().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v119y-0000000774A-2vvl@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: move PHY attachment error message into stmmac_init_phy()
Russell King (Oracle) [Tue, 23 Sep 2025 11:26:09 +0000 (12:26 +0100)] 
net: stmmac: move PHY attachment error message into stmmac_init_phy()

Move the "cannot attach to PHY" error message into stmmac_init_phy()
so we don't end up with multiple error messages printed when things
go wrong. Drop the function name from the message, and use %pe to
print the error code description rather than just a number.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v119t-00000007744-2SxG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: move xpcs clause 73 test into stmmac_init_phy()
Russell King (Oracle) [Tue, 23 Sep 2025 11:26:04 +0000 (12:26 +0100)] 
net: stmmac: move xpcs clause 73 test into stmmac_init_phy()

We avoid binding a PHY if the XPCS is using clause 73 negotiation.
Rather than having this complexity in __stmmac_open(), move it to
stmmac_init_phy() instead. There is no point checking the XPCS
state this unless phylink wants a PHY, so place this appropriately.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v119o-0000000773y-21gs@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: stmmac: move stmmac_bus_clks_config() to stmmac_platform.c
Russell King (Oracle) [Tue, 23 Sep 2025 11:25:59 +0000 (12:25 +0100)] 
net: stmmac: move stmmac_bus_clks_config() to stmmac_platform.c

stmmac_bus_clks_config() is only used by stmmac_platform.c, so rather
than having it in stmmac_main.c and needing to export the symbol,
move it to where it's used.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v119j-0000000773s-1R2v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: ethernet-controller: Fix grammar in comment
J. Neuschäfer [Mon, 22 Sep 2025 23:10:01 +0000 (01:10 +0200)] 
dt-bindings: net: ethernet-controller: Fix grammar in comment

"difference" is a noun, so "sufficient" is an adjective without "ly".

Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250923-dt-net-typo-v1-1-08e1bdd14c74@posteo.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotls: Avoid -Wflex-array-member-not-at-end warning
Gustavo A. R. Silva [Tue, 23 Sep 2025 20:45:10 +0000 (22:45 +0200)] 
tls: Avoid -Wflex-array-member-not-at-end warning

Remove unused flexible-array member in struct tls_rec and, with this,
fix the following warning:

net/tls/tls.h:131:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Also, add a comment to prevent people from adding any members
after struct aead_request, which is a flexible structure --this is
a structure that ends in a flexible-array member.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/aNMG1lyXw4XEAVaE@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Wed, 24 Sep 2025 17:21:39 +0000 (10:21 -0700)] 
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-09-23

We've added 9 non-merge commits during the last 33 day(s) which contain
a total of 10 files changed, 480 insertions(+), 53 deletions(-).

The main changes are:

1) A new bpf_xdp_pull_data kfunc that supports pulling data from
   a frag into the linear area of a xdp_buff, from Amery Hung.

   This includes changes in the xdp_native.bpf.c selftest, which
   Nimrod's future work depends on.

   It is a merge from a stable branch 'xdp_pull_data' which has
   also been merged to bpf-next.

   There is a conflict with recent changes in 'include/net/xdp.h'
   in the net-next tree that will need to be resolved.

2) A compiler warning fix when CONFIG_NET=n in the recent dynptr
   skb_meta support, from Jakub Sitnicki.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests: drv-net: Pull data before parsing headers
  selftests/bpf: Test bpf_xdp_pull_data
  bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
  bpf: Make variables in bpf_prog_test_run_xdp less confusing
  bpf: Clear packet pointers after changing packet data in kfuncs
  bpf: Support pulling non-linear xdp data
  bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
  bpf: Clear pfmemalloc flag when freeing all fragments
  bpf: Return an error pointer for skb metadata when CONFIG_NET=n
====================

Link: https://patch.msgid.link/20250924050303.2466356-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonetfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack
Eric Dumazet [Wed, 24 Sep 2025 07:27:09 +0000 (07:27 +0000)] 
netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack

ct_seq_show() has an opportunistic garbage collector :

if (nf_ct_should_gc(ct)) {
    nf_ct_kill(ct);
    goto release;
}

So if one nf_conn is killed there, next time ct_get_next() runs,
we skip the following item in the bucket, even if it should have
been displayed if gc did not take place.

We can decrement st->skip_elems to tell ct_get_next() one of the items
was removed from the chain.

Fixes: 58e207e4983d ("netfilter: evict stale entries when user reads /proc/net/nf_conntrack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agoselftests: netfilter: nft_concat_range.sh: add check for double-create bug
Florian Westphal [Tue, 9 Sep 2025 23:39:48 +0000 (01:39 +0200)] 
selftests: netfilter: nft_concat_range.sh: add check for double-create bug

Add a test case for bug resolved with:
'netfilter: nft_set_pipapo_avx2: fix skip of expired entries'.

It passes on nf.git (it uses the generic/C version for insertion
duplicate check) but fails on unpatched nf-next if AVX2 is supported:

  cannot create same element twice      0s                        [FAIL]
Could create element twice in same transaction
table inet filter { # handle 8
[..]
  elements = { 1.2.3.4 . 1.2.4.1 counter packets 0 bytes 0,
               1.2.4.1 . 1.2.3.4 counter packets 0 bytes 0,
               1.2.3.4 . 1.2.4.1 counter packets 0 bytes 0,
               1.2.4.1 . 1.2.3.4 counter packets 0 bytes 0 }

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonetfilter: nft_set_pipapo_avx2: fix skip of expired entries
Florian Westphal [Tue, 9 Sep 2025 12:11:48 +0000 (14:11 +0200)] 
netfilter: nft_set_pipapo_avx2: fix skip of expired entries

KASAN reports following splat:
BUG: KASAN: slab-out-of-bounds in pipapo_get_avx2+0x941/0x25d0
Read of size 1 at addr ffff88814c561be0 by task nft/3944
Call Trace:
 pipapo_get_avx2+0x941/0x25d0
 nft_pipapo_insert+0x440/0x11b0
 nf_tables_newsetelem+0x220a/0x3a00
 ..

This bisects to commit 84c1da7b38d9 ("netfilter: nft_set_pipapo: use AVX2
algorithm for insertions too").

However, that change merely uncovers this bug.

When we find a match but that match has expired or timed out, the AVX2
implementation restarts the full match loop.

At that point, the pointer to the key data has already been changed and
points to the keys last field.
This will then result in out-of-bounds read once its incremented again
for the next field.

The restart logic in AVX2 is different compared to the plain C
implementation, but both should follow the same logic.

The C implementation just calls pipapo_refill() again do check the next
entry.  Do the same in the AVX2 implementation.

Note that with this change, due to implementation differences of
pipapo_refill vs. nft_pipapo_avx2_refill, the refill call will return
the same element again. Then, on the next call, it will move to the next
entry as expected.  This is because avx2_refill doesn't clear the bitmap
in the 'last' conditional.  This is harmless. Expired/timed out elements
are also not expected to be frequent.

selftest is added in a followup commit.

Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonetfilter: nft_set_pipapo: use 0 genmask for packetpath lookups
Florian Westphal [Tue, 16 Sep 2025 16:34:01 +0000 (18:34 +0200)] 
netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups

In commit c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from
packetpath lookups") I replaced genmask_cur() with NFT_GENMASK_ANY, but
this change has no effect in the pipapo set type.

New entries are unreachable from the active copy, so NFT_GENMASK_ANY has
same result as genmask_cur():

current-gen elements are disabled and the new-generation
elements cannot be found.

Tests did not catch this incomplete fix because the change also dropped
the genmask test from the AVX2 version of the algorithm, so test only
fails if host cpu lacks AVX2 support.

Use genmask test only from the control plane (inserts, deletions, ..).

Packet path has to skip the check, use of 0 is enough for this because
ext->genmask has a the relevant bit set when the element is INACTIVE
in that generation: using a 0 genmask thus makes nft_set_elem_active()
always return true.

Fix the comment and replace NFT_GENMASK_ANY with 0.

Fixes: c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agonetfilter: nfnetlink: reset nlh pointer during batch replay
Fernando Fernandez Mancera [Fri, 19 Sep 2025 12:40:43 +0000 (14:40 +0200)] 
netfilter: nfnetlink: reset nlh pointer during batch replay

During a batch replay, the nlh pointer is not reset until the parsing of
the commands. Since commit bf2ac490d28c ("netfilter: nfnetlink: Handle
ACK flags for batch messages") that is problematic as the condition to
add an ACK for batch begin will evaluate to true even if NLM_F_ACK
wasn't used for batch begin message.

If there is an error during the command processing, netlink is sending
an ACK despite that. This misleads userspace tools which think that the
return code was 0. Reset the nlh pointer to the original one when a
replay is triggered.

Fixes: bf2ac490d28c ("netfilter: nfnetlink: Handle ACK flags for batch messages")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agoipvs: Defer ip_vs_ftp unregister during netns cleanup
Slavin Liu [Thu, 11 Sep 2025 17:57:59 +0000 (01:57 +0800)] 
ipvs: Defer ip_vs_ftp unregister during netns cleanup

On the netns cleanup path, __ip_vs_ftp_exit() may unregister ip_vs_ftp
before connections with valid cp->app pointers are flushed, leading to a
use-after-free.

Fix this by introducing a global `exiting_module` flag, set to true in
ip_vs_ftp_exit() before unregistering the pernet subsystem. In
__ip_vs_ftp_exit(), skip ip_vs_ftp unregister if called during netns
cleanup (when exiting_module is false) and defer it to
__ip_vs_cleanup_batch(), which unregisters all apps after all connections
are flushed. If called during module exit, unregister ip_vs_ftp
immediately.

Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.")
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Slavin Liu <slavin452@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
6 weeks agotcp: Remove stale locking comment for TFO.
Kuniyuki Iwashima [Tue, 23 Sep 2025 00:54:19 +0000 (00:54 +0000)] 
tcp: Remove stale locking comment for TFO.

The listener -> child locking no longer exists in the fast path
since commit e994b2f0fb92 ("tcp: do not lock listener to process
SYN packets").

Let's remove the stale comment for reqsk_fastopen_remove().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250923005441.4131554-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethtool: tsconfig: set command must provide a reply
Vadim Fedorenko [Mon, 22 Sep 2025 23:19:24 +0000 (16:19 -0700)] 
net: ethtool: tsconfig: set command must provide a reply

Timestamping configuration through ethtool has inconsistent behavior of
skipping the reply for set command if configuration was not changed. Fix
it be providing reply in any case.

Fixes: 6e9e2eed4f39d ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250922231924.2769571-1-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: bridge_fdb_local_vlan_0: Test FDB vs. NET_ADDR_SET behavior
Petr Machata [Mon, 22 Sep 2025 14:14:49 +0000 (16:14 +0200)] 
selftests: bridge_fdb_local_vlan_0: Test FDB vs. NET_ADDR_SET behavior

The previous patch fixed an issue whereby no FDB entry would be created for
the bridge itself on VLAN 0 under some circumstances. This could break
forwarding. Add a test for the fix.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/137cc25396f5a4f407267af895a14bc45552ba5f.1758550408.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: bridge: Install FDB for bridge MAC on VLAN 0
Petr Machata [Mon, 22 Sep 2025 14:14:48 +0000 (16:14 +0200)] 
net: bridge: Install FDB for bridge MAC on VLAN 0

Currently, after the bridge is created, the FDB does not hold an FDB entry
for the bridge MAC on VLAN 0:

 # ip link add name br up type bridge
 # ip -br link show dev br
 br               UNKNOWN        92:19:8c:4e:01:ed <BROADCAST,MULTICAST,UP,LOWER_UP>
 # bridge fdb show | grep 92:19:8c:4e:01:ed
 92:19:8c:4e:01:ed dev br vlan 1 master br permanent

Later when the bridge MAC is changed, or in fact when the address is given
during netdevice creation, the entry appears:

 # ip link add name br up address 00:11:22:33:44:55 type bridge
 # bridge fdb show | grep 00:11:22:33:44:55
 00:11:22:33:44:55 dev br vlan 1 master br permanent
 00:11:22:33:44:55 dev br master br permanent

However when the bridge address is set by the user to the current bridge
address before the first port is enslaved, none of the address handlers
gets invoked, because the address is not actually changed. The address is
however marked as NET_ADDR_SET. Then when a port is enslaved, the address
is not changed, because it is NET_ADDR_SET. Thus the VLAN 0 entry is not
added, and it has not been added previously either:

 # ip link add name br up type bridge
 # ip -br link show dev br
 br               UNKNOWN        7e:f0:a8:1a:be:c2 <BROADCAST,MULTICAST,UP,LOWER_UP>
 # ip link set dev br addr 7e:f0:a8:1a:be:c2
 # ip link add name v up type veth
 # ip link set dev v master br
 # ip -br link show dev br
 br               UNKNOWN        7e:f0:a8:1a:be:c2 <BROADCAST,MULTICAST,UP,LOWER_UP>
 # bridge fdb | grep 7e:f0:a8:1a:be:c2
 7e:f0:a8:1a:be:c2 dev br vlan 1 master br permanent

Then when the bridge MAC is used as DMAC, and br_handle_frame_finish()
looks up an FDB entry with VLAN=0, it doesn't find any, and floods the
traffic instead of passing it up.

Fix this by simply adding the VLAN 0 FDB entry for the bridge itself always
on netdevice creation. This also makes the behavior consistent with how
ports are treated: ports always have an FDB entry for each member VLAN as
well as VLAN 0.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/415202b2d1b9b0899479a502bbe2ba188678f192.1758550408.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-phy-stop-exporting-phy_driver_register'
Jakub Kicinski [Tue, 23 Sep 2025 23:58:44 +0000 (16:58 -0700)] 
Merge branch 'net-phy-stop-exporting-phy_driver_register'

Heiner Kallweit says:

====================
net: phy: stop exporting phy_driver_register

Once the last user of a clock in dp83640 has been removed, the clock should
be removed. So far orphaned clocks are cleaned up in dp83640_free_clocks()
only. Add the logic to remove orphaned clocks in dp83640_remove().
This allows to simplify the code, and use standard macro
module_phy_driver(). dp83640 was the last external user of
phy_driver_register(), so we can stop exporting this function afterwards.
====================

Link: https://patch.msgid.link/b86c2ecc-41f6-4f7f-85db-b7fa684d1fb7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: stop exporting phy_driver_register
Heiner Kallweit [Sat, 20 Sep 2025 21:34:07 +0000 (23:34 +0200)] 
net: phy: stop exporting phy_driver_register

phy_driver_register() isn't used outside phy_device.c any longer,
so we can stop exporting it.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/dff44b83-4a85-4fff-bf6b-f12efd97b56e@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: dp83640: improve phydev and driver removal handling
Heiner Kallweit [Sat, 20 Sep 2025 21:33:16 +0000 (23:33 +0200)] 
net: phy: dp83640: improve phydev and driver removal handling

Once the last user of a clock has been removed, the clock should be
removed. So far orphaned clocks are cleaned up in dp83640_free_clocks()
only. Add the logic to remove orphaned clocks in dp83640_remove().
This allows to simplify the code, and use standard macro
module_phy_driver(). dp83640 was the last external user of
phy_driver_register(), so we can stop exporting this function afterwards.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/6d4e80e7-c684-4d95-abbd-ea62b79a9a8a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: move config symbol MDIO_BUS to drivers/net/phy/Kconfig
Heiner Kallweit [Sat, 20 Sep 2025 21:11:54 +0000 (23:11 +0200)] 
net: phy: move config symbol MDIO_BUS to drivers/net/phy/Kconfig

Config symbol MDIO_BUS isn't used in drivers/net/mdio. It's only used
in drivers/net/phy. So move it there.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/164ff1c6-2cf9-4e30-80fb-da4cc7165dc8@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: rtnetlink: correct error message in rtnetlink.sh fou test
Alok Tiwari [Sun, 21 Sep 2025 19:21:08 +0000 (12:21 -0700)] 
selftests: rtnetlink: correct error message in rtnetlink.sh fou test

The rtnetlink FOU selftest prints an incorrect string:
"FAIL: fou"s. Change it to the intended "FAIL: fou" by
removing a stray character in the end_test string of the test.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250921192111.1567498-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: airoha: Avoid -Wflex-array-member-not-at-end warning
Gustavo A. R. Silva [Mon, 22 Sep 2025 14:08:21 +0000 (16:08 +0200)] 
net: airoha: Avoid -Wflex-array-member-not-at-end warning

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Move the conflicting declaration to the end of the corresponding
structure. Notice that `struct airoha_foe_entry` is a flexible
structure, this is a structure that contains a flexible-array
member.

Fix the following warning:

drivers/net/ethernet/airoha/airoha_eth.h:474:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aNFYVYLXQDqm4yxb@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoudp: remove busylock and add per NUMA queues
Eric Dumazet [Mon, 22 Sep 2025 10:42:40 +0000 (10:42 +0000)] 
udp: remove busylock and add per NUMA queues

busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.

Under stress, many cpus could spin while acquiring the busylock,
and NIC had to drop packets. Or packets would be dropped
in cpu backlog if RPS/RFS were in place.

This patch replaces the busylock by intermediate
lockless queues. (One queue per NUMA node).

This means that fewer number of cpus have to acquire
the UDP receive queue lock.

Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA aware lockless queue.

Then one of the cpu is chosen to process this lockless queue
in a batch.

The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.

Tested:

DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1004179            0.0
Udp6InErrors                    3117               0.0
Udp6RcvbufErrors                3117               0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1116633            0.0
Udp6InErrors                    14197275           0.0
Udp6RcvbufErrors                14197275           0.0

We can see this host can now proces 14.2 M more packets per second
while under attack, and the victim socket can receive 11 % more
packets.

I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().

Before:

@udp_enqueue_us[398]:
[0]                24901 |@@@                                                 |
[1]                63512 |@@@@@@@@@                                           |
[2, 4)            344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[8, 16)            54022 |@@@@@@@@                                            |
[16, 32)          222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[32, 64)          232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[64, 128)           4219 |                                                    |
[128, 256)           188 |                                                    |

After:

@udp_enqueue_us[398]:
[0]              5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]              1111277 |@@@@@@@@@@                                          |
[2, 4)            501439 |@@@@                                                |
[4, 8)            102921 |                                                    |
[8, 16)            29895 |                                                    |
[16, 32)           43500 |                                                    |
[32, 64)           31552 |                                                    |
[64, 128)            979 |                                                    |
[128, 256)            13 |                                                    |

Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net'
Martin KaFai Lau [Tue, 23 Sep 2025 22:46:52 +0000 (15:46 -0700)] 
Merge branch 'bpf-next/xdp_pull_data' into 'bpf-next/net'

Merge the xdp_pull_data stable branch into the net branch. No conflict.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
6 weeks agoMerge branch 'add-kfunc-bpf_xdp_pull_data'
Martin KaFai Lau [Tue, 23 Sep 2025 20:35:13 +0000 (13:35 -0700)] 
Merge branch 'add-kfunc-bpf_xdp_pull_data'

Amery Hung says:

====================
Add kfunc bpf_xdp_pull_data

v7 -> v6
  patch 5 (new patch)
  - Rename variables in bpf_prog_test_run_xdp()

  patch 6
  - Fix bugs (Martin)

v6 -> v5
  patch 6
  - v5 selftest failed on S390 when changing how tailroom occupied by
    skb_shared_info is calculated. Revert selftest to v4, where we get
    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) by running an XDP
    program

Link: https://lore.kernel.org/bpf/20250919230952.3628709-1-ameryhung@gmail.com/
v5 -> v4
  patch 1
  - Add a new patch clearing pfmemalloc bit in xdp->frags when all frags
    are freed in bpf_xdp_adjust_tail() (Maciej)

  patch 2
  - Refactor bpf_xdp_shrink_data() (Maciej)

  patch 3
  - Clear pfmemalloc when all frags are freed in bpf_xdp_pull_data()
    (Maciej)

  patch 6
  - Use BTF to get sizes of skb_shared_info and xdp_frame (Maciej)

Link: https://lore.kernel.org/bpf/20250919182100.1925352-1-ameryhung@gmail.com/
v3 -> v4
  patch 2
  - Improve comments (Jakub)
  - Drop new_end and len_free to simplify code (Jakub)

  patch 4
  - Instead of adding is_xdp to bpf_test_init, move lower-bound check
    of user_size to callers (Martin)
  - Simplify linear data size calculation (Martin)

  patch 5
  - Add static function identifier (Martin)
  - Free calloc-ed buf (Martin)

Link: https://lore.kernel.org/bpf/20250917225513.3388199-1-ameryhung@gmail.com/
v2 -> v3
  Separate mlx5 fixes from the patchset

  patch 2
  - Use headroom for pulling data by shifting metadata and data down
    (Jakub)
  - Drop the flags argument (Martin)

  patch 4
  - Support empty linear xdp data for BPF_PROG_TEST_RUN

Link: https://lore.kernel.org/bpf/20250915224801.2961360-1-ameryhung@gmail.com/
v1 -> v2
  Rebase onto bpf-next

  Try to build on top of the mlx5 patchset that avoids copying payload
  to linear part by Christoph but got a kernel panic. Will rebase on
  that patchset if it got merged first, or separate the mlx5 fix
  from this set.

  patch 1
  - Remove the unnecessary head frag search (Dragos)
  - Rewind the end frag pointer to simplify the change (Dragos)
  - Rewind the end frag pointer and recalculate truesize only when the
    number of frags changed (Dragos)

  patch 3
  - Fix len == zero behavior. To mirror bpf_skb_pull_data() correctly,
    the kfunc should do nothing (Stanislav)
  - Fix a pointer wrap around bug (Jakub)
  - Use memmove() when moving sinfo->frags (Jakub)

Link: https://lore.kernel.org/bpf/20250905173352.3759457-1-ameryhung@gmail.com/
====================

Link: https://patch.msgid.link/20250922233356.3356453-1-ameryhung@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
6 weeks agoselftests: drv-net: Pull data before parsing headers
Amery Hung [Mon, 22 Sep 2025 23:33:56 +0000 (16:33 -0700)] 
selftests: drv-net: Pull data before parsing headers

It is possible for drivers to generate xdp packets with data residing
entirely in fragments. To keep parsing headers using direct packet
access, call bpf_xdp_pull_data() to pull headers into the linear data
area.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-9-ameryhung@gmail.com
6 weeks agoselftests/bpf: Test bpf_xdp_pull_data
Amery Hung [Mon, 22 Sep 2025 23:33:55 +0000 (16:33 -0700)] 
selftests/bpf: Test bpf_xdp_pull_data

Test bpf_xdp_pull_data() with xdp packets with different layouts. The
xdp bpf program first checks if the layout is as expected. Then, it
calls bpf_xdp_pull_data(). Finally, it checks the 0xbb marker at offset
1024 using directly packet access.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-8-ameryhung@gmail.com
6 weeks agobpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN
Amery Hung [Mon, 22 Sep 2025 23:33:54 +0000 (16:33 -0700)] 
bpf: Support specifying linear xdp packet data size for BPF_PROG_TEST_RUN

To test bpf_xdp_pull_data(), an xdp packet containing fragments as well
as free linear data area after xdp->data_end needs to be created.
However, bpf_prog_test_run_xdp() always fills the linear area with
data_in before creating fragments, leaving no space to pull data. This
patch will allow users to specify the linear data size through
ctx->data_end.

Currently, ctx_in->data_end must match data_size_in and will not be the
final ctx->data_end seen by xdp programs. This is because ctx->data_end
is populated according to the xdp_buff passed to test_run. The linear
data area available in an xdp_buff, max_linear_sz, is alawys filled up
before copying data_in into fragments.

This patch will allow users to specify the size of data that goes into
the linear area. When ctx_in->data_end is different from data_size_in,
only ctx_in->data_end bytes of data will be put into the linear area when
creating the xdp_buff.

While ctx_in->data_end will be allowed to be different from data_size_in,
it cannot be larger than the data_size_in as there will be no data to
copy from user space. If it is larger than the maximum linear data area
size, the layout suggested by the user will not be honored. Data beyond
max_linear_sz bytes will still be copied into fragments.

Finally, since it is possible for a NIC to produce a xdp_buff with empty
linear data area, allow it when calling bpf_test_init() from
bpf_prog_test_run_xdp() so that we can test XDP kfuncs with such
xdp_buff. This is done by moving lower-bound check to callers as most of
them already do except bpf_prog_test_run_skb(). The change also fixes a
bug that allows passing an xdp_buff with data < ETH_HLEN. This can
happen when ctx is used and metadata is at least ETH_HLEN.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-7-ameryhung@gmail.com
6 weeks agobpf: Make variables in bpf_prog_test_run_xdp less confusing
Amery Hung [Mon, 22 Sep 2025 23:33:53 +0000 (16:33 -0700)] 
bpf: Make variables in bpf_prog_test_run_xdp less confusing

Change the variable naming in bpf_prog_test_run_xdp() to make the
overall logic less confusing. As different modes were added to the
function over the time, some variables got overloaded, making
it hard to understand and changing the code becomes error-prone.

Replace "size" with "linear_sz" where it refers to the size of metadata
and data. If "size" refers to input data size, use test.data_size_in
directly.

Replace "max_data_sz" with "max_linear_sz" to better reflect the fact
that it is the maximum size of metadata and data (i.e., linear_sz). Also,
xdp_rxq.frags_size is always PAGE_SIZE, so just set it directly instead
of subtracting headroom and tailroom and adding them back.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-6-ameryhung@gmail.com
6 weeks agobpf: Clear packet pointers after changing packet data in kfuncs
Amery Hung [Mon, 22 Sep 2025 23:33:52 +0000 (16:33 -0700)] 
bpf: Clear packet pointers after changing packet data in kfuncs

bpf_xdp_pull_data() may change packet data and therefore packet pointers
need to be invalidated. Add bpf_xdp_pull_data() to the special kfunc
list instead of introducing a new KF_ flag until there are more kfuncs
changing packet data.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-5-ameryhung@gmail.com
6 weeks agobpf: Support pulling non-linear xdp data
Amery Hung [Mon, 22 Sep 2025 23:33:51 +0000 (16:33 -0700)] 
bpf: Support pulling non-linear xdp data

Add kfunc, bpf_xdp_pull_data(), to support pulling data from xdp
fragments. Similar to bpf_skb_pull_data(), bpf_xdp_pull_data() makes
the first len bytes of data directly readable and writable in bpf
programs. If the "len" argument is larger than the linear data size,
data in fragments will be copied to the linear data area when there
is enough room. Specifically, the kfunc will try to use the tailroom
first. When the tailroom is not enough, metadata and data will be
shifted down to make room for pulling data.

A use case of the kfunc is to decapsulate headers residing in xdp
fragments. It is possible for a NIC driver to place headers in xdp
fragments. To keep using direct packet access for parsing and
decapsulating headers, users can pull headers into the linear data
area by calling bpf_xdp_pull_data() and then pop the header with
bpf_xdp_adjust_head().

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250922233356.3356453-4-ameryhung@gmail.com
6 weeks agobpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail
Amery Hung [Mon, 22 Sep 2025 23:33:50 +0000 (16:33 -0700)] 
bpf: Allow bpf_xdp_shrink_data to shrink a frag from head and tail

Move skb_frag_t adjustment into bpf_xdp_shrink_data() and extend its
functionality to be able to shrink an xdp fragment from both head and
tail. In a later patch, bpf_xdp_pull_data() will reuse it to shrink an
xdp fragment from head.

Additionally, in bpf_xdp_frags_shrink_tail(), breaking the loop when
bpf_xdp_shrink_data() returns false (i.e., not releasing the current
fragment) is not necessary as the loop condition, offset > 0, has the
same effect. Remove the else branch to simplify the code.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-3-ameryhung@gmail.com
6 weeks agobpf: Clear pfmemalloc flag when freeing all fragments
Amery Hung [Mon, 22 Sep 2025 23:33:49 +0000 (16:33 -0700)] 
bpf: Clear pfmemalloc flag when freeing all fragments

It is possible for bpf_xdp_adjust_tail() to free all fragments. The
kfunc currently clears the XDP_FLAGS_HAS_FRAGS bit, but not
XDP_FLAGS_FRAGS_PF_MEMALLOC. So far, this has not caused a issue when
building sk_buff from xdp_buff since all readers of xdp_buff->flags
use the flag only when there are fragments. Clear the
XDP_FLAGS_FRAGS_PF_MEMALLOC bit as well to make the flags correct.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250922233356.3356453-2-ameryhung@gmail.com
6 weeks agoMerge branch 'dibs-direct-internal-buffer-sharing'
Paolo Abeni [Tue, 23 Sep 2025 09:13:24 +0000 (11:13 +0200)] 
Merge branch 'dibs-direct-internal-buffer-sharing'

Alexandra Winter says:

====================
dibs - Direct Internal Buffer Sharing

This series introduces a generic abstraction of existing components like:
- the s390 specific ISM device (Internal Shared Memory),
- the SMC-D loopback mechanism (Shared Memory Communication - Direct)
- the client interface of the SMC-D module to the transport devices
This generic shim layer can be extended with more devices, more clients and
more features in the future.

This layer is called 'dibs' for Direct Internal Buffer Sharing based on the
common scheme that these mechanisms enable controlled sharing of memory
buffers within some containing entity such as a hypervisor or a Linux
instance.

Benefits:
- Cleaner separation of ISM and SMC-D functionality
- simpler and less module dependencies
- Clear interface definition.
- Extendable for future devices and clients.

An overview was given at the Netdev 0x19 conference, recordings and slides
are available [1].

Background / Status quo:
------------------------
Currently s390 hardware provides virtual PCI ISM devices (Internal Shared
Memory). Their driver is in drivers/s390/net/ism_drv.c. The main user is
SMC-D (net/smc). The ism driver offers a client interface so other
users/protocols can also use them, but it is still heavily intermingled
with the smc code. Namely, the ism module cannot be used without the smc
module, which feels artificial.

There is ongoing work to extend the ISM concept of shared buffers that can
be accessed directly by another instance on the same hardware: [2] proposed
a loopback interface (ism_lo), that can be used on non-s390 architectures
(e.g. between containers or to test SMC-D). A minimal implementation went
upstream with [3]: ism_lo currently is a part of the smc protocol and
rather hidden.

[4] proposed a virtio definition of ism (ism_virtio) that can be used
between kvm guests.

We will shortly send an RFC for an dibs client that uses dibs as transport
for TTY.

Concept:
--------
Create a shim layer in net/dibs that contains common definitions and code
for all dibs devices and all dibs clients. Any device or client module only
needs to depend on this dibs layer module and any device or client code
only needs to include the definitions in include/linux/dibs.h.

The name dibs was chosen to clearly distinguish it from the existing s390
ism devices. And to emphasize that it is not about sharing whole memory
regions with anybody, but dedicating single buffers for another system.

Implementation:
---------------
The end result of this series is: A dibs shim layer with
- One dibs client: smc-d
- Two dibs device drivers: ism and dibs-loopback
- Everything prepared to add more clients and more device drivers.

Patches 1-2 contain some issues that were found along the way. They make
sense on their own, but also enable a better structured dibs series.

There are three components that exist today:
a) smc module (especially SMC-D functionality, which is an ism client today)
b) ism device driver (supports multiple ism clients today)
c) smc-loopback (integrated with smc today)
In order to preserve existing functionality at each step, these are not
moved to dibs layer by component, instead:
- the dibs layer is established in parallel to existing code [patches 3-6]
- then some service functions are moved to the dibs layer [patches 7-12]
- the actual data movement is moved to the dibs layer [patch 13]
- and last event handling is moved to the dibs layer [patch 14]

Future:
-------
Items that are not part of this patchset but can be added later:
- dynamically add or remove dibs_loopback. That will be allow for simple
  testing of add_dev()/del_dev()
- handle_irq(): Call clients without interrupt context. e.g using
  threaded interrupts. I left this for a follow-on, because it includes
  conceptual changes for the smcd receive code.
- Any improvements of locking scopes. I mainly moved some of the the
  existing locks to dibs layer. I have the feeling there is room for
  improvements.
- The device drivers should not loop through the client array
- dibs_dev_op.*_dmb() functions reveal unnecessary details of the
  internal dmb struct to the clients
- Check whether client calls to dibs_dev_ops should be replaced by
  interface functions that provide additional value
- Check whether device driver calls to dibs_client_ops should be replaced
  by interface functions that provide additional value.

Link: [1] https://netdevconf.info/0x19/sessions/talk/communication-via-internal-shared-memory-ism-time-to-open-up.html
Link: [2] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
Link: [3] https://lore.kernel.org/linux-kernel//20240428060738.60843-1-guwen@linux.alibaba.com/
Link: [4] https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=c060ecf9-ea1a-49a2-9827-c92f0e6447b2&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
====================

Link: https://patch.msgid.link/20250918110500.1731261-1-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Move event handling to dibs layer
Julian Ruess [Thu, 18 Sep 2025 11:05:00 +0000 (13:05 +0200)] 
dibs: Move event handling to dibs layer

Add defines for all event types and subtypes an ism device is known to
produce as it can be helpful for debugging purposes.

Introduces a generic 'struct dibs_event' and adopt ism device driver
and smc-d client accordingly. Tolerate and ignore other type and subtype
values to enable future device extensions.

SMC-D and ISM are now independent.
struct ism_dev can be moved to drivers/s390/net/ism.h.

Note that in smc, the term 'ism' is still used. Future patches could
replace that with 'dibs' or 'smc-d' as appropriate.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Move data path to dibs layer
Alexandra Winter [Thu, 18 Sep 2025 11:04:59 +0000 (13:04 +0200)] 
dibs: Move data path to dibs layer

Use struct dibs_dmb instead of struct smc_dmb and move the corresponding
client tables to dibs_dev. Leave driver specific implementation details
like sba in the device drivers.

Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a
single client, but a dibs device can have dmbs for more than one client.

Trigger dibs clients via dibs_client_ops->handle_irq(), when data is
received into a dmb. For dibs_loopback replace scheduling an smcd receive
tasklet with calling dibs_client_ops->handle_irq().

For loopback devices attach_dmb(), detach_dmb() and move_data() need to
access the dmb tables, so move those to dibs_dev_ops in this patch as well.

Remove remaining definitions of smc_loopback as they are no longer
required, now that everything is in dibs_loopback.

Note that struct ism_client and struct ism_dev are still required in smc
until a follow-on patch moves event handling to dibs. (Loopback does not
use events).

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Move query_remote_gid() to dibs_dev_ops
Alexandra Winter [Thu, 18 Sep 2025 11:04:58 +0000 (13:04 +0200)] 
dibs: Move query_remote_gid() to dibs_dev_ops

Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback
dibs_devices. And call it in smc dibs_client.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Move vlan support to dibs_dev_ops
Alexandra Winter [Thu, 18 Sep 2025 11:04:57 +0000 (13:04 +0200)] 
dibs: Move vlan support to dibs_dev_ops

It can be debated how much benefit definition of vlan ids for dibs devices
brings, as the dmbs are accessible only by a single peer anyhow. But ism
provides vlan support and smcd exploits it, so move it to dibs layer as an
optional feature.

smcd_loopback simply ignores all vlan settings, do the same in
dibs_loopback.

SMC-D and ISM have a method to use the invalid VLAN ID 1FFF
(ISM_RESERVED_VLANID), to indicate that both communication peers support
routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Local gid for dibs devices
Alexandra Winter [Thu, 18 Sep 2025 11:04:56 +0000 (13:04 +0200)] 
dibs: Local gid for dibs devices

Define a uuid_t GID attribute to identify a dibs device.

SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that
need to be sent via the SMC protocol. Because the smc code uses integers,
network endianness and host endianness need to be considered. Avoid this
in the dibs layer by using uuid_t byte arrays. Future patches could change
SMC to use uuid_t. For now conversion helper functions are introduced.

ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this:
 _________________________________________
| 64 Bit ISM-vPCI GID | 00000000_00000000 |
 -----------------------------------------
If interpreted as UUID [1], this would be interpreted as the UIID variant,
that is reserved for NCS backward compatibility. So it will not collide
with UUIDs that were generated according to the standard.

smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to
dibs loopback. A temporary change to smc_lo_query_rgid() is required,
that will be moved to dibs_loopback with a follow-on patch.

Provide gid of a dibs device as sysfs read-only attribute.

Link: https://datatracker.ietf.org/doc/html/rfc4122
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Create class dibs
Julian Ruess [Thu, 18 Sep 2025 11:04:55 +0000 (13:04 +0200)] 
dibs: Create class dibs

Create '/sys/class/dibs' to represent multiple kinds of dibs devices in
sysfs. Show s390/ism devices as well as dibs_loopback devices.

Show attribute fabric_id using dibs_ops.get_fabric_id(). This can help
users understand which dibs devices are connected to the same fabric in
different systems and which dibs devices are loopback devices
(fabric_id 0xffff)

Instead of using the same name as the pci device, give the ism devices
their own readable names based on uid or fid from the HW definition.

smc_loopback was never visible in sysfs. dibs_loopback is now represented
as a virtual device.

For the SMC feature "software defined pnet-id" either the ib device name or
the PCI-ID (actually the parent device name) can be used for SMC-R entries.
Mimic this behaviour for SMC-D, and check the parent device name as well.
So device name or PCI-ID can be used for ism and device name can be used
for dibs-loopback. Note that this:
IB_DEVICE_NAME_MAX - 1 == smc_pnet_policy.[SMC_PNETID_IBNAME].len
is the length of smcd_name. Future SW-pnetid cleanup patches to could use a
meaningful define, but that would touch too much unrelated code here.

Examples:
---------
ism before:
> ls /sys/bus/pci/devices/0000:00:00.0/0000:00:00.0
uevent

ism now:
> ls /sys/bus/pci/devices/0000:00:00.0/dibs/ism30
device -> ../../../0000:00:00.0/
fabric_id
subsystem -> ../../../../../class/dibs/
uevent

dibs loopback:
> ls /sys/devices/virtual/dibs/lo/
fabric_id
subsystem -> ../../../../class/dibs/
uevent

dibs class:
> ls -l /sys/class/dibs/
ism30 -> ../../devices/pci0000:00/0000:00:00.0/dibs/ism30/
lo -> ../../devices/virtual/dibs/lo/

For comparison:
> ls -l /sys/class/net/
enc8410 -> ../../devices/qeth/0.0.8410/net/enc8410/
ens1693 -> ../../devices/pci0001:00/0001:00:00.0/net/ens1693/
lo -> ../../devices/virtual/net/lo/

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-10-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Move struct device to dibs_dev
Julian Ruess [Thu, 18 Sep 2025 11:04:54 +0000 (13:04 +0200)] 
dibs: Move struct device to dibs_dev

Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a
corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev
in smc_lo_dev_remove().

Replace smcd->ops->get_dev(smcd) by using dibs->dev directly.

An alternative design would be to embed dibs_dev as a field in ism_dev and
do the same for other dibs device driver specific structs. However that
would have the disadvantage that each dibs device driver needs to allocate
dibs_dev and each dibs device driver needs a different device release
function. The advantage would be that ism_dev and other device driver
specific structs would be covered by device reference counts.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Define dibs_client_ops and dibs_dev_ops
Alexandra Winter [Thu, 18 Sep 2025 11:04:53 +0000 (13:04 +0200)] 
dibs: Define dibs_client_ops and dibs_dev_ops

Move the device add() and remove() functions from ism_client to
dibs_client_ops and call add_dev()/del_dev() for ism devices and
dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for
the smc_dibs_client. This is the first step to handle ism and loopback
devices alike (as dibs devices) in the smc dibs client.

Define dibs_dev->ops and move smcd_ops->get_chid to
dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for
why this needs to be in the same patch as dibs_client_ops->add_dev().

The following changes contain intermediate steps, that will be obsoleted by
follow-on patches, once more functionality has been moved to dibs:

Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches
will change SMC-D to directly use dibs_ops instead of smcd_ops.

In smcd_register_dev() it is now necessary to identify a dibs_loopback
device before smcd_dev and smcd_ops->get_chid() are available. So provide
dibs_dev_ops->get_fabric_id() in this patch and evaluate it in
smc_ism_is_loopback().

Call smc_loopback_init() in smcd_register_dev() and call
smc_loopback_exit() in smcd_unregister_dev() to handle the functionality
that is still in smc_loopback. Follow-on patches will move all smc_loopback
code to dibs_loopback.

In smcd_[un]register_dev() use only ism device name, this will be replaced
by dibs device name by a follow-on patch.

End of changes with intermediate parts.

Allocate an smcd event workqueue for all dibs devices, although
dibs_loopback does not generate events.

Use kernel memory instead of devres memory for smcd_dev and smcd->conn.
Since commit a72178cfe855 ("net/smc: Fix dependency of SMC on ISM") an ism
device and its driver can have a longer lifetime than the smc module, so
smc should not rely on devres to free its resources [1]. It is now the
responsibility of the smc client to free smcd and smcd->conn for all dibs
devices, ism devices as well as loopback. Call client->ops->del_dev() for
all existing dibs devices in dibs_unregister_client(), so all device
related structures can be freed in the client.

When dibs_unregister_client() is called in the context of smc_exit() or
smc_core_reboot_event(), these functions have already called
smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets
going_away. This is done a second time in smcd_unregister_dev(). This is
analogous to how smcr is handled in these functions, by calling first
smc_lgrs_shutdown() and then smc_ib_unregister_client() >
smc_ib_remove_dev(), so leave it that way. It may be worth investigating,
whether smc_lgrs_shutdown() is still required or useful.

Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback
device exists or not.

Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Define dibs loopback
Alexandra Winter [Thu, 18 Sep 2025 11:04:52 +0000 (13:04 +0200)] 
dibs: Define dibs loopback

The first stage of loopback-ism was implemented as part of the
SMC module [1]. Now that we have the dibs layer, provide access to a
dibs_loopback device to all dibs clients.

This is the first step of moving loopback-ism from net/smc/smc_loopback.*
to drivers/dibs/dibs_loopback.*. One global structure lo_dev is allocated
and added to the dibs devices. Follow-on patches will move functionality.

Same as smc_loopback, dibs_loopback is provided by a config option.
Note that there is no way to dynamically add or remove the loopback
device. That could be a future improvement.

When moving code to drivers/dibs, replace ism_ prefix with dibs_ prefix.
As this is mostly a move of existing code, copyright and authors are
unchanged.

Link: https://lore.kernel.org/lkml/20240428060738.60843-1-guwen@linux.alibaba.com/
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-7-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Register ism as dibs device
Alexandra Winter [Thu, 18 Sep 2025 11:04:51 +0000 (13:04 +0200)] 
dibs: Register ism as dibs device

Register ism devices with the dibs layer. Follow-on patches will move
functionality to the dibs layer.

As DIBS is only a shim layer without any dependencies, we can depend ISM
on DIBS without adding indirect dependencies. A follow-on patch will
remove implication of SMC by ISM.

Define struct dibs_dev. Follow-on patches will move more content into
dibs_dev.  The goal of follow-on patches is that ism_dev will only
contain fields that are special for this device driver. The same concept
will apply to other dibs device drivers.

Define dibs_dev_alloc(), dibs_dev_add() and dibs_dev_del() to be called
by dibs device drivers and call them from ism_drv.c
Use ism_dev.dibs for a pointer to dibs_dev.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-6-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Register smc as dibs_client
Alexandra Winter [Thu, 18 Sep 2025 11:04:50 +0000 (13:04 +0200)] 
dibs: Register smc as dibs_client

Formally register smc as dibs client. Functionality will be moved by
follow-on patches from ism_client to dibs_client until eventually
ism_client can be removed.
As DIBS is only a shim layer without any dependencies, we can depend SMC
on DIBS without adding indirect dependencies. A follow-on patch will
remove dependency of SMC on ISM.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-5-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agodibs: Create drivers/dibs
Alexandra Winter [Thu, 18 Sep 2025 11:04:49 +0000 (13:04 +0200)] 
dibs: Create drivers/dibs

Create the file structure for a 'DIBS - Direct Internal Buffer Sharing'
shim layer that will provide generic functionality and declarations for
dibs device drivers and dibs clients.

Following patches will add functionality.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-4-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet/smc: Decouple sf and attached send_buf in smc_loopback
Alexandra Winter [Thu, 18 Sep 2025 11:04:48 +0000 (13:04 +0200)] 
net/smc: Decouple sf and attached send_buf in smc_loopback

Before this patch there was the following assumption in
smc_loopback.c>smc_lo_move_data():
sf (signalling flag) == 0 : data is already in an attached target dmb
sf == 1 : data is not yet in the target dmb

This is true for the 2 callers in smc client
smcd_cdc_msg_send() : sf=1
smcd_tx_rdma_writes() : sf=0
but should not be a general assumption.

Add a bool to struct smc_buf_desc to indicate whether an SMC-D sndbuf_desc
is an attached buffer. Don't call move_data() for attached send_buffers,
because it is not necessary.

Move the data in smc_lo_move_data() if len != 0 and signal when requested.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-3-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet/smc: Remove error handling of unregister_dmb()
Alexandra Winter [Thu, 18 Sep 2025 11:04:47 +0000 (13:04 +0200)] 
net/smc: Remove error handling of unregister_dmb()

smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and
then unconditionally frees buf_desc.

Remove the cleaning up of fields of buf_desc in
smc_ism_unregister_dmb(), because it is not helpful.

This removes the only usage of ISM_ERROR from the smc module. So move it
to drivers/s390/net/ism.h.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoMerge branch 'tcp-update-bind-bucket-state-on-port-release'
Paolo Abeni [Tue, 23 Sep 2025 08:12:17 +0000 (10:12 +0200)] 
Merge branch 'tcp-update-bind-bucket-state-on-port-release'

Jakub Sitnicki says:

====================
tcp: Update bind bucket state on port release

TL;DR
-----

This is another take on addressing the issue we already raised earlier [1].

This time around, instead of trying to relax the bind-conflict checks in
connect(), we make an attempt to fix the tcp bind bucket state accounting.

The goal of this patch set is to make the bind buckets return to "port
reusable by ephemeral connections" state when all sockets blocking the port
from reuse get unhashed.

Changelog
---------
Changes in v5:
- Fix initial port-addr bucket state on saddr update with ip_dynaddr=1
- Add Kuniyuki's tag for tests
- Link to v4: https://lore.kernel.org/r/20250913-update-bind-bucket-state-on-unhash-v4-0-33a567594df7@cloudflare.com

Changes in v4:
- Drop redundant sk_is_connect_bind helper doc comment
- Link to v3: https://lore.kernel.org/r/20250910-update-bind-bucket-state-on-unhash-v3-0-023caaf4ae3c@cloudflare.com

Changes in v3:
- Move the flag from inet_flags to sk_userlocks (Kuniyuki)
- Rename the flag from AUTOBIND to CONNECT_BIND to avoid a name clash (Kuniyuki)
- Drop unreachable code for sk_state == TCP_NEW_SYN_RECV (Kuniyuki)
- Move the helper to inet_hashtables where it's used
- Reword patch 1 description for conciseness
- Link to v2: https://lore.kernel.org/r/20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com

Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com

Situation
---------

We observe the following scenario in production:

                                                  inet_bind_bucket
                                                state for port 54321
                                                --------------------

                                                (bucket doesn't exist)

// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
                                                tb->fastreuse = -1
                                                tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()

                                                // bucket remains in this
                                                // state even though port
                                                // was released by s2
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)

Problem
-------

We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.

The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.

Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, the reality is that such implementation is not
actually uncommon. Trying to fix each and every such program is like
playing whack-a-mole.

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

        dialer := &net.Dialer{
                LocalAddr: &net.TCPAddr{IP: srcIP},
        }
        conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

        $ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.

Solution
--------

Please see the description in patch 1.

[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com

Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
====================

Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-0-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoselftests/net: Test tcp port reuse after unbinding a socket
Jakub Sitnicki [Wed, 17 Sep 2025 13:22:05 +0000 (15:22 +0200)] 
selftests/net: Test tcp port reuse after unbinding a socket

Exercise the scenario described in detail in the cover letter:

  1) socket A: connect() from ephemeral port X
  2) socket B: explicitly bind() to port X
  3) check that port X is now excluded from ephemeral ports
  4) close socket B to release the port bind
  5) socket C: connect() from ephemeral port X

As well as a corner case to test that the connect-bind flag is cleared:

  1) connect() from ephemeral port X
  2) disconnect the socket with connect(AF_UNSPEC)
  3) bind() it explicitly to port X
  4) check that port X is now excluded from ephemeral ports

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-2-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agotcp: Update bind bucket state on port release
Jakub Sitnicki [Wed, 17 Sep 2025 13:22:04 +0000 (15:22 +0200)] 
tcp: Update bind bucket state on port release

Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.

In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.

The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.

Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoMerge branch 'tcp-move-few-fields-for-data-locality'
Jakub Kicinski [Tue, 23 Sep 2025 00:55:28 +0000 (17:55 -0700)] 
Merge branch 'tcp-move-few-fields-for-data-locality'

Eric Dumazet says:

====================
tcp: move few fields for data locality

After recent additions (PSP and AccECN) I wanted to make another
round on fields locations to increase data locality.

This series manages to shrink TCP and TCPv6 objects by 128 bytes,
but more importantly should reduce number of touched cache lines
in TCP fast paths.

There is more to come.

v2: removed tcp CACHELINE_ASSERT_GROUP_SIZE after a kernel build bot
reported an error.
====================

Link: https://patch.msgid.link/20250919204856.2977245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: reclaim 8 bytes in struct request_sock_queue
Eric Dumazet [Fri, 19 Sep 2025 20:48:56 +0000 (20:48 +0000)] 
tcp: reclaim 8 bytes in struct request_sock_queue

synflood_warned had to be u32 for xchg(), but ensuring
atomicity is not really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: move mtu_info to remove two 32bit holes
Eric Dumazet [Fri, 19 Sep 2025 20:48:55 +0000 (20:48 +0000)] 
tcp: move mtu_info to remove two 32bit holes

This removes 8bytes waste on 64bit builds.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: move tcp_clean_acked to tcp_sock_read_tx group
Eric Dumazet [Fri, 19 Sep 2025 20:48:54 +0000 (20:48 +0000)] 
tcp: move tcp_clean_acked to tcp_sock_read_tx group

tp->tcp_clean_acked is fetched in tx path when snd_una is updated.

This field thus belongs to tcp_sock_read_tx group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: move recvmsg_inq to tcp_sock_read_txrx
Eric Dumazet [Fri, 19 Sep 2025 20:48:53 +0000 (20:48 +0000)] 
tcp: move recvmsg_inq to tcp_sock_read_txrx

Fill a hole in tcp_sock_read_txrx, instead of possibly wasting
a cache line.

Note that tcp_recvmsg_locked() is also reading tp->repair,
so this removes one cache line miss in tcp recvmsg().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: move tcp->rcv_tstamp to tcp_sock_write_txrx group
Eric Dumazet [Fri, 19 Sep 2025 20:48:52 +0000 (20:48 +0000)] 
tcp: move tcp->rcv_tstamp to tcp_sock_write_txrx group

tcp_ack() writes this field, it belongs to tcp_sock_write_txrx.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotcp: remove CACHELINE_ASSERT_GROUP_SIZE() uses
Eric Dumazet [Fri, 19 Sep 2025 20:48:51 +0000 (20:48 +0000)] 
tcp: remove CACHELINE_ASSERT_GROUP_SIZE() uses

Maintaining the CACHELINE_ASSERT_GROUP_SIZE() uses
for struct tcp_sock has been painful.

This had little benefit, so remove them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: move sk->sk_err_soft and sk->sk_sndbuf
Eric Dumazet [Fri, 19 Sep 2025 20:48:50 +0000 (20:48 +0000)] 
net: move sk->sk_err_soft and sk->sk_sndbuf

sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.

sk->sk_err_soft was not identified previously, but
is used from tcp_ack().

Move it to sock_write_tx group for better cache locality.

Also change tcp_ack() to clear sk->sk_err_soft only if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: move sk_uid and sk_protocol to sock_read_tx
Eric Dumazet [Fri, 19 Sep 2025 20:48:49 +0000 (20:48 +0000)] 
net: move sk_uid and sk_protocol to sock_read_tx

sk_uid and sk_protocol are read from inet6_csk_route_socket()
for each TCP transmit.

Also read from udpv6_sendmsg(), udp_sendmsg() and others.

Move them to sock_read_tx for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'add-more-functionality-to-bnge'
Jakub Kicinski [Tue, 23 Sep 2025 00:51:37 +0000 (17:51 -0700)] 
Merge branch 'add-more-functionality-to-bnge'

Bhargava Marreddy says:

====================
Add more functionality to BNGE

This patch series adds the infrastructure to make the netdevice
functional. It allocates data structures for core resources,
followed by their initialisation and registration with the firmware.
The core resources include the RX, TX, AGG, CMPL, and NQ rings,
as well as the VNIC. RX/TX functionality will be introduced in the
next patch series to keep this one at a reviewable size.
====================

Link: https://patch.msgid.link/20250919174742.24969-1-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Configure default VNIC
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:41 +0000 (23:17 +0530)] 
bng_en: Configure default VNIC

Add functions to add a filter to the VNIC to configure unicast
addresses. Also, add multicast, broadcast, and promiscuous settings
to the default VNIC.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-11-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Register default VNIC
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:40 +0000 (23:17 +0530)] 
bng_en: Register default VNIC

Allocate the default VNIC with the firmware and configure its RSS,
HDS, and Jumbo parameters. Add related functions to support VNIC
configuration for these parameters.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-10-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Register rings with the firmware
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:39 +0000 (23:17 +0530)] 
bng_en: Register rings with the firmware

Enable ring functionality by registering RX, AGG, TX, CMPL, and
NQ rings with the firmware. Initialise the doorbells associated
with the rings.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-9-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Allocate stat contexts
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:38 +0000 (23:17 +0530)] 
bng_en: Allocate stat contexts

Allocate the hardware statistics context with the firmware and
register DMA memory required for ring statistics. This helps the
driver to collect ring statistics provided by the firmware.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-8-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Allocate packet buffers
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:37 +0000 (23:17 +0530)] 
bng_en: Allocate packet buffers

Populate packet buffers into the RX and AGG rings while these
rings are being initialized.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-7-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Initialise core resources
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:36 +0000 (23:17 +0530)] 
bng_en: Initialise core resources

Add initial settings to all core resources, such as
the RX, AGG, TX, CQ, and NQ rings, as well as the VNIC.
This will help enable these resources in future patches.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-6-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Introduce VNIC
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:35 +0000 (23:17 +0530)] 
bng_en: Introduce VNIC

Add the VNIC-specific structures and DMA memory necessary to support
UC/MC and RSS functionality.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-5-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Add initial support for CP and NQ rings
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:34 +0000 (23:17 +0530)] 
bng_en: Add initial support for CP and NQ rings

Allocate CP and NQ related data structures and add support to
associate NQ and CQ rings. Also, add the association of NQ, NAPI,
and interrupts.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-4-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: Add initial support for RX and TX rings
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:33 +0000 (23:17 +0530)] 
bng_en: Add initial support for RX and TX rings

Allocate data structures to support RX, AGG, and TX rings.
While data structures for RX/AGG rings are allocated,
initialise the page pool accordingly.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-3-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agobng_en: make bnge_alloc_ring() self-unwind on failure
Bhargava Marreddy [Fri, 19 Sep 2025 17:47:32 +0000 (23:17 +0530)] 
bng_en: make bnge_alloc_ring() self-unwind on failure

Ensure bnge_alloc_ring() frees any intermediate allocations
when it fails. This enables later patches to rely on this
self-unwinding behavior.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-2-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-replace-wq-users-and-add-wq_percpu-to-alloc_workqueue-users'
Jakub Kicinski [Tue, 23 Sep 2025 00:40:32 +0000 (17:40 -0700)] 
Merge branch 'net-replace-wq-users-and-add-wq_percpu-to-alloc_workqueue-users'

Marco Crivellari says:

====================
net: replace wq users and add WQ_PERCPU to alloc_workqueue() users

Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:

  "workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."

Link: https://lore.kernel.org/20250221112003.1dSuoGyc@linutronix.de
=== Current situation: problems ===

Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, for !WQ_UNBOUND the local CPU is selected.

This leads to different scenarios if a work item is scheduled on an isolated
CPU where "delay" value is 0 or greater then 0:
        schedule_delayed_work(, 0);

This will be handled by __queue_work() that will queue the work item on the
current local (isolated) CPU, while:

        schedule_delayed_work(, 1);

Will move the timer on an housekeeping CPU, and schedule the work there.

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

=== Plan and future plans ===

This patchset is the first stone on a refactoring needed in order to
address the points aforementioned; it will have a positive impact also
on the cpu isolation, in the long term, moving away percpu workqueue in
favor to an unbound model.

These are the main steps:
1)  API refactoring (that this patch is introducing)
    -   Make more clear and uniform the system wq names, both per-cpu and
        unbound. This to avoid any possible confusion on what should be
        used.

    -   Introduction of WQ_PERCPU: this flag is the complement of WQ_UNBOUND,
        introduced in this patchset and used on all the callers that are not
        currently using WQ_UNBOUND.

        WQ_UNBOUND will be removed in a future release cycle.

        Most users don't need to be per-cpu, because they don't have
        locality requirements, because of that, a next future step will be
        make "unbound" the default behavior.

2)  Check who really needs to be per-cpu
    -   Remove the WQ_PERCPU flag when is not strictly required.

3)  Add a new API (prefer local cpu)
    -   There are users that don't require a local execution, like mentioned
        above; despite that, local execution yeld to performance gain.

        This new API will prefer the local execution, without requiring it.

=== Introduced Changes by this series ===

1) [P 1-2] Replace use of system_wq and system_unbound_wq

        system_wq is a per-CPU workqueue, but his name is not clear.
        system_unbound_wq is to be used when locality is not required.

        Because of that, system_wq has been renamed in system_percpu_wq, and
        system_unbound_wq has been renamed in system_dfl_wq.

2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users

        Every alloc_workqueue() caller should use one among WQ_PERCPU or
        WQ_UNBOUND.

        WQ_UNBOUND will be removed in a next release cycle.
====================

Link: https://patch.msgid.link/20250918142427.309519-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: WQ_PERCPU added to alloc_workqueue users
Marco Crivellari [Thu, 18 Sep 2025 14:24:27 +0000 (16:24 +0200)] 
net: WQ_PERCPU added to alloc_workqueue users

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly
request the use of the per-CPU behavior. Both flags coexist for one release
cycle to allow callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: replace use of system_wq with system_percpu_wq
Marco Crivellari [Thu, 18 Sep 2025 14:24:26 +0000 (16:24 +0200)] 
net: replace use of system_wq with system_percpu_wq

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: replace use of system_unbound_wq with system_dfl_wq
Marco Crivellari [Thu, 18 Sep 2025 14:24:25 +0000 (16:24 +0200)] 
net: replace use of system_unbound_wq with system_dfl_wq

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-2-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Mon, 22 Sep 2025 23:56:30 +0000 (16:56 -0700)] 
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-09-19 (ice, idpf, iavf, ixgbevf, fm10k)

Paul adds support for Earliest TxTime First (ETF) hardware offload
for E830 devices on ice. ETF is configured per-queue using tc-etf Qdisc;
a new Tx flow mechanism utilizes a dedicated timestamp ring alongside
the standard Tx ring. The timestamp ring contains descriptors that
specify when hardware should transmit packets; up to 2048 Tx queues can
be supported.

Additional info: https://lore.kernel.org/intel-wired-lan/20250818132257.21720-1-paul.greenwalt@intel.com/

Dave removes excess cleanup call to ice_lag_move_new_vf_nodes() in error
path.

Milena adds reporting of timestamping statistics to idpf.

Alex changes error variable type for code clarity for iavf and ixgbevf.

Brahmajit Das removes unused parameter from fm10k_unbind_hw_stats_q().

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  net: intel: fm10k: Fix parameter idx set but not used
  ixgbevf: fix proper type for error code in ixgbevf_resume()
  iavf: fix proper type for error code in iavf_resume()
  idpf: add HW timestamping statistics
  ice: Remove deprecated ice_lag_move_new_vf_nodes() call
  ice: add E830 Earliest TxTime First Offload support
  ice: move ice_qp_[ena|dis] for reuse
====================

Link: https://patch.msgid.link/20250919175412.653707-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: phy: ax88796b: Replace hard-coded values with PHY_ID_MATCH_MODEL()
Thorsten Blum [Fri, 19 Sep 2025 10:39:45 +0000 (12:39 +0200)] 
net: phy: ax88796b: Replace hard-coded values with PHY_ID_MATCH_MODEL()

Use the PHY_ID_MATCH_MODEL() macro instead of hardcoding the values in
asix_driver[] and asix_tbl[].

In asix_tbl[], the macro also uses designated initializers instead of
positional initializers, which allows the struct fields to be reordered.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250919103944.854845-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: xilinx: axienet: Fix kernel-doc warnings for missing return descriptions
Suraj Gupta [Fri, 19 Sep 2025 10:37:54 +0000 (16:07 +0530)] 
net: xilinx: axienet: Fix kernel-doc warnings for missing return descriptions

Add missing "Return:" sections to kernel-doc comments for four functions:
- axienet_calc_cr()
- axienet_device_reset()
- axienet_free_tx_chain()
- axienet_dim_coalesce_count_rx()

Also standardize the return documentation format by replacing inline
"Returns" text with proper "Return:" tags as per kernel documentation
guidelines.

Fixes below kernel-doc warnings:
- Warning: No description found for return value of 'axienet_calc_cr'
- Warning: No description found for return value of 'axienet_device_reset'
- Warning: No description found for return value of 'axienet_free_tx_chain'
- Warning: No description found for return value of
'axienet_dim_coalesce_count_rx'

Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Link: https://patch.msgid.link/20250919103754.434711-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-dsa-microchip-add-strap-description-to-set-spi-as-interface-bus'
Jakub Kicinski [Mon, 22 Sep 2025 23:31:21 +0000 (16:31 -0700)] 
Merge branch 'net-dsa-microchip-add-strap-description-to-set-spi-as-interface-bus'

Bastien Curutchet says:

====================
net: dsa: microchip: Add strap description to set SPI as interface bus

At reset, the KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing
(by mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.

This small series aims to allow to configure the KSZ8463 switch at
reset when the hardware straps are missing.

PATCH 0 and 1 add a new property to the bindings that describes the GPIOs
to be set during reset in order to configure the switch properly.

PATCH 2 implements the use of these properties in the driver.
====================

Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-0-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: dsa: microchip: Set SPI as bus interface during reset for KSZ8463
Bastien Curutchet [Thu, 18 Sep 2025 08:33:52 +0000 (10:33 +0200)] 
net: dsa: microchip: Set SPI as bus interface during reset for KSZ8463

At reset, the KSZ8463 uses a strap-based configuration to set SPI as
bus interface. SPI is the only bus supported by the driver. If the
required pull-ups/pull-downs are missing (by mistake or by design to
save power) the pins may float and the configuration can go wrong
preventing any communication with the switch.

Introduce a ksz8463_configure_straps_spi() function called during the
device reset. It relies on the 'straps-rxd-gpios' OF property and the
'reset' pinmux configuration to enforce SPI as bus interface.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-3-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: dsa: microchip: Add strap description to set SPI mode
Bastien Curutchet (Schneider Electric) [Thu, 18 Sep 2025 08:33:51 +0000 (10:33 +0200)] 
dt-bindings: net: dsa: microchip: Add strap description to set SPI mode

At reset, KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing (by
mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.

Add a 'reset' pinmux state
Add a KSZ8463 specific strap description that can be used by the driver
to drive the strap pins during reset. Two GPIOs are used. Users must
describe either both of them or none of them.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-2-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agodt-bindings: net: dsa: microchip: Group if clause under allOf tag
Bastien Curutchet (Schneider Electric) [Thu, 18 Sep 2025 08:33:50 +0000 (10:33 +0200)] 
dt-bindings: net: dsa: microchip: Group if clause under allOf tag

Upcoming patch adds a new if/then clause. It requires to be grouped with
the already existing if/then clause under an 'allOf:' tag.

Move the if/then clause under the already existing 'allOf:' tag to
prepare next patch.

Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-1-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mella...
Jakub Kicinski [Mon, 22 Sep 2025 23:30:17 +0000 (16:30 -0700)] 
Merge tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Tariq Toukan says:

====================
mlx5-next updates 2025-09-21

* tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Add uar access and odp page fault counters
====================

Link: https://patch.msgid.link/1758443940-708689-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'net-rework-sfp-capability-parsing-and-quirks'
Jakub Kicinski [Mon, 22 Sep 2025 23:05:16 +0000 (16:05 -0700)] 
Merge branch 'net-rework-sfp-capability-parsing-and-quirks'

Russell King says:

====================
net: rework SFP capability parsing and quirks

The original SPF module parsing was implemented prior to gaining any
quirks, and was designed such that the upstream calls the parsing
functions to get the translated capabilities of the module.

SFP quirks were then added to cope with modules that didn't correctly
fill out their ID EEPROM. The quirk function was called from
sfp_parse_support() to allow quirks to modify the ethtool link mode
masks.

Using just ethtool link mode masks eventually lead to difficulties
determining the correct phy_interface_t mode, so a bitmap of these
modes were added - needing both the upstream API and quirks to be
updated.

We have had significantly more SFP module quirks added since, some
which are modifying the ID EEPROM as a way of influencing the data
we provide to the upstream - for example, sfp_fixup_10gbaset_30m()
changes id.base.connector so we report PORT_TP. This could be done
more cleanly if the quirks had access to the parsed SFP port.

In order to improve flexibility, and to simplify some of the upstream
code, we group all module capabilities into a single structure that
the upstream can access via sfp_module_get_caps(). This will allow
the module capabilities to be expanded if required without reworking
all the infrastructure and upstreams again.

In this series, we rework the SFP code to use the capability structure
and then rework all the upstream implementations, finally removing the
old kernel internal APIs.
====================

Link: https://patch.msgid.link/aMnaoPjIuzEAsESZ@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: sfp: remove old sfp_parse_* functions
Russell King (Oracle) [Tue, 16 Sep 2025 21:47:07 +0000 (22:47 +0100)] 
net: sfp: remove old sfp_parse_* functions

Remove the old sfp_parse_*() functions that are now no longer used.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVz-000000061Wj-13Yd@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>