git.ipfire.org Git - thirdparty/kernel/linux.git/log


commit | commitdiff | tree

Gal Pressman [Sun, 25 Jan 2026 12:16:48 +0000 (14:16 +0200)]

net/mlx5e: Remove redundant UDP length adjustment with GSO_PARTIAL

Now that GSO_PARTIAL takes care of updating the UDP header length,
mlx5e_udp_gso_handle_tx_skb() is redundant; remove it.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260125121649.778086-3-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gal Pressman [Sun, 25 Jan 2026 12:16:47 +0000 (14:16 +0200)]

udp: gso: Use single MSS length in UDP header for GSO_PARTIAL

In GSO_PARTIAL segmentation, set the UDP length field to the single
segment size (gso_size + UDP header) instead of the large MSS size.
This provides hardware with a template length value for final
segmentation, similar to how tunnel GSO_PARTIAL handles outer headers
in UDP tunnels.

This will remove the need to manually adjust the UDP header length in
the drivers, as can be seen in subsequent patches.

This was suggested by Alex in 2018:
https://lore.kernel.org/netdev/CAKgT0UcdnUWgr3KQ=RnLKigokkiUuYefmL-ePpDvJOBNpKScFA@mail.gmail.com/

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260125121649.778086-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
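
As a rough illustration of the length arithmetic described in this commit
(a sketch only, not the kernel code; the function names and the example
sizes are hypothetical), the UDP length written for GSO_PARTIAL becomes the
size of a single segment rather than the size covering all full segments:

```python
UDP_HEADER_LEN = 8  # bytes; a UDP header has a fixed size


def udp_template_len(gso_size: int) -> int:
    """UDP length field for one segment: payload (gso_size) plus UDP header."""
    return gso_size + UDP_HEADER_LEN


def udp_aggregate_len(gso_size: int, nr_full_segments: int) -> int:
    """Length covering all full-size segments at once (the 'large MSS size')."""
    return gso_size * nr_full_segments + UDP_HEADER_LEN
```

With a gso_size of 1400 and ten full segments, the template value is 1408
where the aggregate value would be 14008. Hardware can then reuse the
template value for each segment it emits.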

commit | commitdiff | tree

Russell King (Oracle) [Mon, 26 Jan 2026 12:33:14 +0000 (12:33 +0000)]

net: stmmac: don't pass ioaddr to fix_soc_reset() method

As the stmmac_priv struct is passed to the fix_soc_reset() method which
has the ioaddr, there is no need to pass ioaddr separately. Pass just
the stmmac_priv struct. Fix up the glues that use it.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vkLmM-00000005vE1-0nop@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ethan Nelson-Moore [Sat, 24 Jan 2026 03:22:43 +0000 (19:22 -0800)]

net: usb: sr9700: replace magic numbers with register bit macros

The first byte of the Rx frame is a copy of the Rx status register, so
0x40 corresponds to RSR_MF (meaning the frame is multicast). Replace
0x40 with RSR_MF for clarity. (All other bits of the RSR indicate
errors. The fact that the driver ignores these errors will be fixed by
a later patch.)

The first byte of the status URB is a copy of the NSR, so 0x40
corresponds to NSR_LINKST. Replace 0x40 with NSR_LINKST for clarity.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://patch.msgid.link/20260124032248.26807-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
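
A minimal sketch of the idea, written in Python for brevity rather than as
driver code. The value 0x40 for both bits comes from the commit message; the
helper names are hypothetical, and reading NSR_LINKST as "link is up" is an
assumption:

```python
RSR_MF = 0x40      # Rx status register: frame is multicast
NSR_LINKST = 0x40  # network status register: link status (assumed: set = link up)


def rx_frame_is_multicast(frame: bytes) -> bool:
    """The first byte of an Rx frame is a copy of the Rx status register."""
    return bool(frame[0] & RSR_MF)


def status_link_up(status_urb: bytes) -> bool:
    """The first byte of the status URB is a copy of the NSR."""
    return bool(status_urb[0] & NSR_LINKST)
```

Both checks test the same numeric value, which is why naming the bits makes
the two call sites easier to tell apart than a bare 0x40.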

commit | commitdiff | tree

Jakub Kicinski [Tue, 27 Jan 2026 23:47:44 +0000 (15:47 -0800)]

Merge branch 'remove-low-level-sha-1-functions'

Eric Biggers says:

====================
Remove low-level SHA-1 functions

This series updates net/ipv6/addrconf.c to use the regular SHA-1
functions, then removes sha1_init_raw() and sha1_transform().

(These were originally patches 25-26 of the series
https://lore.kernel.org/linux-crypto/20250712232329.818226-1-ebiggers@kernel.org/ )
====================

Link: https://patch.msgid.link/20260123051656.396371-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Biggers [Fri, 23 Jan 2026 05:16:56 +0000 (21:16 -0800)]

lib/crypto: sha1: Remove low-level functions from API

Now that there are no users of the low-level SHA-1 interface, remove it.

Specifically:

- Remove SHA1_DIGEST_WORDS (no longer used)
- Remove sha1_init_raw() (no longer used)
- Rename sha1_transform() to sha1_block_generic() and make it static
- Move SHA1_WORKSPACE_WORDS into lib/crypto/sha1.c

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260123051656.396371-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Biggers [Fri, 23 Jan 2026 05:16:55 +0000 (21:16 -0800)]

ipv6: Switch to higher-level SHA-1 functions

There's now a proper SHA-1 API that follows the usual conventions for
hash function APIs: sha1_init(), sha1_update(), sha1_final(), sha1().
The only remaining user of the older low-level SHA-1 API,
sha1_init_raw() and sha1_transform(), is ipv6_generate_stable_address().
I'd like to remove this older API, which is too low-level.

Unfortunately, ipv6_generate_stable_address() does in fact skip the
SHA-1 finalization for some reason. So the values it computes are not
standard SHA-1 values, and it sort of does want the low-level API.

Still, it's possible to use the higher-level functions sha1_init()
and sha1_update() to get the same result, provided that the resulting
state is used directly, skipping sha1_final().

So, let's do that instead. This will allow removing the low-level API.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260123051656.396371-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
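
To show what "skipping the finalization" means, here is a self-contained
Python sketch (not the kernel API; all function names are hypothetical). It
implements the SHA-1 compression function, a standard SHA-1 built on top of
it, and a variant that stops after absorbing whole blocks and returns the
internal state without padding or length encoding. The whole-block
restriction is an assumption of this sketch:

```python
import hashlib
import struct

SHA1_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def _rol(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_compress(state, block: bytes):
    """Absorb one 64-byte block into the five-word SHA-1 state."""
    assert len(block) == 64
    w = list(struct.unpack(">16I", block))
    for i in range(16, 80):
        w.append(_rol(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    a, b, c, d, e = state
    for i in range(80):
        if i < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif i < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif i < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        tmp = (_rol(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF
        a, b, c, d, e = tmp, a, _rol(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))


def sha1_standard(data: bytes) -> bytes:
    """Regular SHA-1: pad, append the bit length, then serialize the state."""
    padded = data + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", len(data) * 8)
    state = SHA1_INIT
    for off in range(0, len(padded), 64):
        state = sha1_compress(state, padded[off:off + 64])
    return struct.pack(">5I", *state)


def sha1_raw_state(data: bytes):
    """Init + update only: return the state words, with no finalization."""
    assert len(data) % 64 == 0, "sketch handles whole blocks only"
    state = SHA1_INIT
    for off in range(0, len(data), 64):
        state = sha1_compress(state, data[off:off + 64])
    return state
```

For any input, sha1_standard() agrees with hashlib.sha1(), while the words
returned by sha1_raw_state() are not a standard SHA-1 value. That is the
property the commit preserves when it uses the state directly.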

commit | commitdiff | tree

Jakub Kicinski [Tue, 27 Jan 2026 18:40:18 +0000 (10:40 -0800)]

Merge branch 'net-stmmac-rk-simplify-per-soc-configuration'

Russell King says:

====================
net: stmmac: rk: simplify per-SoC configuration [part]
====================

Link: https://patch.msgid.link/aXdTi4ViCkhhXvFI@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Mon, 26 Jan 2026 11:45:18 +0000 (11:45 +0000)]

net: stmmac: rk: group MACPHY register offset and fields together

Group the MACPHY register offsets and associated bitfields together
so that it is self-evident which definitions are associated with
which register.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vkL1y-00000005usW-1TKX@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Mon, 26 Jan 2026 11:45:13 +0000 (11:45 +0000)]

net: stmmac: rk: convert rk3328 to use bsp_priv->id

rk3328 contains two GMAC instances - gmac2io and gmac2phy. While the
gmac2io instance may be connected to an external PHY, the gmac2phy
instance is permanently connected via RMII to an on-SoC integrated PHY.

The driver currently tests for the gmac2phy instance by checking
bsp_priv->integrated_phy (determined from the PHY's phy-is-integrated
property) and sometimes that the interface mode is RMII. This works
because the rk3328.dtsi has:

gmac2phy: ethernet@ff550000 {
    compatible = "rockchip,rk3328-gmac";
    phy-mode = "rmii";
    phy-handle = <&phy>;

    mdio {
        phy: ethernet-phy@0 {
            phy-is-integrated;
        };
    };
};

The driver contains a mechanism to look up the MMIO address in a table
to determine bsp_priv->id, which is used for every other Rockchip
device. Switch rk3328 to use this mechanism to determine bsp_priv->id
and use that to select which GRF register is used for configuration,
similarly to how the other Rockchip SoCs handle such differences.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vkL1t-00000005usQ-0vjt@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
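
The table lookup can be pictured roughly as follows. This is a Python
sketch, not the driver code. Only the address 0xff550000 comes from the
dtsi excerpt in this commit message; the second address, the id numbering,
and the register labels are assumptions made for illustration:

```python
# MMIO base address of each GMAC instance, indexed by instance id.
# 0xff550000 (gmac2phy) is from the dtsi excerpt; 0xff540000 is assumed.
RK3328_GMAC_BASES = (0xFF540000, 0xFF550000)

# Hypothetical labels for the per-instance GRF configuration register.
GRF_REG_BY_ID = {0: "GRF_MAC_CON1 (gmac2io)", 1: "GRF_MAC_CON2 (gmac2phy)"}


def lookup_instance_id(mmio_base: int) -> int:
    """Return the index of mmio_base in the table, like a bsp_priv->id lookup."""
    try:
        return RK3328_GMAC_BASES.index(mmio_base)
    except ValueError:
        raise LookupError(f"unknown GMAC base address {mmio_base:#x}") from None


def grf_register_for(mmio_base: int) -> str:
    """Select the GRF register from the instance id, not from PHY properties."""
    return GRF_REG_BY_ID[lookup_instance_id(mmio_base)]
```

The point of the design is that the instance is identified by where it sits
in the address map, independent of phy-is-integrated or the interface mode.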

commit | commitdiff | tree

Russell King (Oracle) [Mon, 26 Jan 2026 11:45:08 +0000 (11:45 +0000)]

net: stmmac: rk: get rid of rk_phy_power_ctl()

It is not worth having a common rk_phy_power_ctl() when the only
difference is which regulator function is called. Also, passing
true/false is non-descriptive. Split this function, moving the code
appropriately into rk_phy_powerup() and rk_phy_powerdown().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vkL1o-00000005usJ-08hy@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Russell King (Oracle) [Mon, 26 Jan 2026 11:45:02 +0000 (11:45 +0000)]

net: stmmac: rk: avoid phy_power_on()

In https://lore.kernel.org/netdev/aDne1Ybuvbk0AwG0@shell.armlinux.org.uk/
I requested a follow-up patch to change the name of dwmac-rk's
phy_power_on() function, which clashes with the drivers/phy function
of the same name. This can cause confusion when grepping for this
function name, or when reviewing code. Thankfully, stmmac doesn't make
use of drivers/phy which saves this from compile errors.

However, as is the usual case when a request is made as part of a
review, if the review leads to successful application of the patch the
author doesn't bother following up with any such requests, and so the
problem falls back onto the reviewer to address... so here is the
solution.

Rename dwmac-rk's function to rk_phy_power_ctl(), as the function both
powers up and down.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vkL1i-00000005usD-3lhz@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Fri, 23 Jan 2026 11:16:05 +0000 (11:16 +0000)]

tcp: move sk_forced_mem_schedule() to tcp.c

TCP fast path can (auto)inline this helper, instead
of (auto)inlining it from tcp_send_fin().

No change of overall code size, but tcp_sendmsg() is faster.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/1 up/down: 141/-140 (1)
Function                                     old     new   delta
tcp_stream_alloc_skb                         216     357    +141
tcp_send_fin                                 688     548    -140
Total: Before=22236729, After=22236730, chg +0.00%

BTW, we might change tcp_send_fin() to use tcp_stream_alloc_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260123111605.4089200-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Paolo Abeni [Tue, 27 Jan 2026 12:32:35 +0000 (13:32 +0100)]

Merge branch 'extend-bit-width-in-the-flow-director-of-hns3-driver'

Jijie Shao says:

====================
extend bit width in the flow director of HNS3 driver

The bit widths of HCLGE_FD_AD_QID and HCLGE_FD_AD_COUNTER_NUM are
increased to support higher specifications.

Note: The hardware already supports the specifications.
====================

Link: https://patch.msgid.link/20260123094756.3718516-1-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Jijie Shao [Fri, 23 Jan 2026 09:47:56 +0000 (17:47 +0800)]

net: hns3: extend HCLGE_FD_AD_COUNTER_NUM to 8 bits

Currently, HCLGE_FD_AD_COUNTER_NUM has only 7 bits and supports a
maximum counter_id of 127. However, there are actually scenarios
where the counter_id exceeds 127.

This patch adds an additional bit to HCLGE_FD_AD_COUNTER_NUM to ensure
that counter_id values greater than 127 are supported.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260123094756.3718516-3-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Jijie Shao [Fri, 23 Jan 2026 09:47:55 +0000 (17:47 +0800)]

net: hns3: extend HCLGE_FD_AD_QID to 11 bits

Currently, HCLGE_FD_AD_QID has only 10 bits and supports a
maximum queue_id of 1023. However, there are actually scenarios
where the queue_id exceeds 1023.

This patch adds an additional bit to HCLGE_FD_AD_QID to ensure
that queue_id values greater than 1023 are supported.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20260123094756.3718516-2-shaojijie@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
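
The limits quoted in these two commits follow directly from the field
widths. A small Python sketch of the arithmetic (the helper names are
hypothetical, and the actual bit positions of the fields are not modelled):

```python
def field_max(width_bits: int) -> int:
    """Largest unsigned value representable in a field of the given width."""
    return (1 << width_bits) - 1


def fits(value: int, width_bits: int) -> bool:
    """True if value can be stored in the field without truncation."""
    return 0 <= value <= field_max(width_bits)
```

A 7-bit counter field tops out at 127 and an 8-bit one at 255; a 10-bit
queue field tops out at 1023 and an 11-bit one at 2047.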

commit | commitdiff | tree

Paolo Abeni [Tue, 27 Jan 2026 10:52:46 +0000 (11:52 +0100)]

Merge branch 'net-dsa-lantiq-add-support-for-intel-gsw150'

Daniel Golle says:

====================
net: dsa: lantiq: add support for Intel GSW150

The Intel GSW150 Ethernet Switch (aka. Lantiq PEB7084) is the predecessor of
MaxLinear's GSW1xx series of switches. It shares most features, but has a
slightly different port layout and different MII interfaces.
Adding support for this switch to the mxl-gsw1xx driver is quite trivial.
====================

Link: https://patch.msgid.link/cover.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:39:30 +0000 (16:39 +0000)]

net: dsa: mxl-gsw1xx: add support for Intel GSW150

Add support for the Intel GSW150 (aka. Lantiq PEB7084) switch IC to
the mxl-gsw1xx driver. This switch comes with 5 Gigabit Ethernet
copper ports (Intel XWAY PHY11G (xRX v1.2 integrated) PHYs) as well as
one GMII/RGMII and one RGMII port.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/c84cf94337bf1be30940841b338b6368468c6e17.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:39:23 +0000 (16:39 +0000)]

net: dsa: mxl-gsw1xx: only setup SerDes PCS if it exists

The older Intel GSW150 chip doesn't have a SGMII/1000Base-X/2500Base-X PCS.
Prepare for supporting Intel GSW150 by skipping PCS reset and
initialization in case no .mac_select_pcs operation is defined.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/fd46a821b1535751cd7b478a04a9ffe1e9d4d289.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:39:16 +0000 (16:39 +0000)]

net: dsa: lantiq: clean up phylink_get_caps switch statement

Use case ranges for phylink_get_caps and remove the redundant "port N:"
from the comments.

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/423daf99b3d60f510ff048a261c62d3de7d39321.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:39:09 +0000 (16:39 +0000)]

net: dsa: lantiq: allow arbitrary MII registers

The Lantiq GSWIP and MaxLinear GSW1xx drivers are currently relying on a
hard-coded mapping of MII ports to their respective MII_CFG and MII_PCDU
registers and only allow applying an offset to the port index.

While this is sufficient for the currently supported hardware, the very
similar Intel GSW150 (aka. Lantiq PEB7084) cannot be described using
this arrangement.

Introduce two arrays to specify the MII_CFG and MII_PCDU registers for
each port, replacing the current bitmap used to safeguard MII ports as
well as the port index offset.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/63fc01195196384f5e244a0ce9ec2ae3a6c08fe3.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:38:53 +0000 (16:38 +0000)]

dt-bindings: net: dsa: lantiq,gswip: add Intel GSW150

Add compatible strings for the Intel GSW150 which is apparently
identical or at least compatible with the Lantiq PEB7084 Ethernet
switch IC.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/1dc62de5263e8536d5960b837bc5dad7b8f42fad.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Daniel Golle [Thu, 22 Jan 2026 16:38:45 +0000 (16:38 +0000)]

dt-bindings: net: dsa: lantiq,gswip: use correct node name

Ethernet PHYs should use nodes named 'ethernet-phy@'.
Rename the Ethernet PHY nodes in the example to comply.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/94f439aa17d7b51fb367877df4fb84c8c07c7ce4.1769099517.git.daniel@makrotopia.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Paolo Abeni [Tue, 27 Jan 2026 09:45:40 +0000 (10:45 +0100)]

Merge branch 'vsock-add-namespace-support-to-vhost-vsock-and-loopback'

Bobby Eshleman says:

====================
vsock: add namespace support to vhost-vsock and loopback

This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).

The current revision supports two modes: local and global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).

The mode is set using the parent namespace's
/proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
created. The mode of the current namespace can be queried by reading
/proc/sys/net/vsock/ns_mode. The mode can not change after the namespace
has been created.

Modes are per-netns. This allows a system to configure namespaces
independently (some may share CIDs, others are completely isolated).
This also supports future possible mixed use cases, where there may be
namespaces in global mode spinning up VMs while there are mixed mode
namespaces that provide services to the VMs, but are not allowed to
allocate from the global CID pool (this mode is not implemented in this
series).

Additionally, added tests for the new namespace features:

tools/testing/selftests/vsock/vmtest.sh
1..25
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 ns_host_vsock_ns_mode_ok
ok 5 ns_host_vsock_child_ns_mode_ok
ok 6 ns_global_same_cid_fails
ok 7 ns_local_same_cid_ok
ok 8 ns_global_local_same_cid_ok
ok 9 ns_local_global_same_cid_ok
ok 10 ns_diff_global_host_connect_to_global_vm_ok
ok 11 ns_diff_global_host_connect_to_local_vm_fails
ok 12 ns_diff_global_vm_connect_to_global_host_ok
ok 13 ns_diff_global_vm_connect_to_local_host_fails
ok 14 ns_diff_local_host_connect_to_local_vm_fails
ok 15 ns_diff_local_vm_connect_to_local_host_fails
ok 16 ns_diff_global_to_local_loopback_local_fails
ok 17 ns_diff_local_to_global_loopback_fails
ok 18 ns_diff_local_to_local_loopback_fails
ok 19 ns_diff_global_to_global_loopback_ok
ok 20 ns_same_local_loopback_ok
ok 21 ns_same_local_host_connect_to_local_vm_ok
ok 22 ns_same_local_vm_connect_to_local_host_ok
ok 23 ns_delete_vm_ok
ok 24 ns_delete_host_ok
ok 25 ns_delete_both_ok
SUMMARY: PASS=25 SKIP=0 FAIL=0

Thanks again for everyone's help and reviews!

Suggested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
v15: https://lore.kernel.org/r/20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com
v14: https://lore.kernel.org/r/20260112-vsock-vmtest-v14-0-a5c332db3e2b@meta.com
v13: https://lore.kernel.org/all/20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com/
v12: https://lore.kernel.org/r/20251126-vsock-vmtest-v12-0-257ee21cd5de@meta.com
v11: https://lore.kernel.org/r/20251120-vsock-vmtest-v11-0-55cbc80249a7@meta.com
v10: https://lore.kernel.org/r/20251117-vsock-vmtest-v10-0-df08f165bf3e@meta.com
v9: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-0-852787a37bed@meta.com
v8: https://lore.kernel.org/r/20251023-vsock-vmtest-v8-0-dea984d02bb0@meta.com
v7: https://lore.kernel.org/r/20251021-vsock-vmtest-v7-0-0661b7b6f081@meta.com
v6: https://lore.kernel.org/r/20250916-vsock-vmtest-v6-0-064d2eb0c89d@meta.com
v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com
v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com
v2: https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
====================

Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-0-2859a7512097@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
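
The connectivity rules implied by the cover letter and the test names can
be modelled in a few lines of Python. This is a simplified sketch of the
described behavior, not the kernel's lookup code; the Netns type and
can_connect() are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Netns:
    name: str
    mode: str  # "global" or "local"


def can_connect(src: Netns, dst: Netns) -> bool:
    """Model: same namespace always works; across namespaces only global<->global."""
    if src == dst:
        return True
    return src.mode == "global" and dst.mode == "global"
```

Under this model, two different global namespaces can reach each other
(matching the ns_diff_global_*_ok tests), a local namespace reaches only
itself (ns_same_local_*_ok), and any pairing of different namespaces that
involves a local one is rejected (the *_fails tests).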

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:52 +0000 (14:11 -0800)]

selftests/vsock: add tests for namespace deletion

Add tests that validate vsock sockets are resilient to deleting
namespaces. The vsock sockets should still function normally.

The function check_ns_delete_doesnt_break_connection() is added to
re-use the step-by-step logic of 1) setup connections, 2) delete ns,
3) check that the connections are still ok.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-12-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:51 +0000 (14:11 -0800)]

selftests/vsock: add tests for host <-> vm connectivity with namespaces

Add tests to validate namespace correctness using vsock_test and socat.
The vsock_test tool is used to validate expected success tests, but
socat is used for expected failure tests. socat is used to ensure that
connections are rejected outright instead of failing due to some other
socket behavior (as tested in vsock_test). Additionally, socat is
already required for tunneling TCP traffic from vsock_test. Using only
one of the vsock_test tests like 'test_stream_client_close_client' would
have yielded a similar result, but doing so wouldn't remove the socat
dependency.

Additionally, check for the dependency socat. socat needs special
handling beyond just checking if it is on the path because it must be
compiled with support for both vsock and unix. The function
check_socat() checks that this support exists.

Add more padding to test name printf strings because the tests added in
this patch would otherwise overflow.

Add vm_dmesg_* helpers to encapsulate checking dmesg
for oops and warnings.

Add ability to pass extra args to host-side vsock_test so that tests
that cause false positives may be skipped with arg --skip.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-11-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:50 +0000 (14:11 -0800)]

selftests/vsock: add namespace tests for CID collisions

Add tests to verify CID collision rules across different vsock namespace
modes.

1. Two VMs with the same CID cannot start in different global namespaces
   (ns_global_same_cid_fails)
2. Two VMs with the same CID can start in different local namespaces
   (ns_local_same_cid_ok)
3. VMs with the same CID can coexist when one is in a global namespace
   and another is in a local namespace (ns_global_local_same_cid_ok and
   ns_local_global_same_cid_ok)

The tests ns_global_local_same_cid_ok and ns_local_global_same_cid_ok
make sure that ordering does not matter.

The tests use a shared helper function namespaces_can_boot_same_cid()
that attempts to start two VMs with identical CIDs in the specified
namespaces and verifies whether VM initialization failed or succeeded.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-10-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
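
The three rules listed in this commit reduce to a single predicate. A
hedged Python sketch (the function name is hypothetical, and the model
covers only two VMs in two different namespaces):

```python
def same_cid_can_coexist(mode_a: str, mode_b: str) -> bool:
    """Can two VMs in different namespaces both boot with the same CID?

    Only the global/global pairing collides, because global namespaces
    share one CID pool; a local namespace has its own.
    """
    for mode in (mode_a, mode_b):
        if mode not in ("global", "local"):
            raise ValueError(f"unknown vsock ns mode: {mode!r}")
    return not (mode_a == "global" and mode_b == "global")
```

The predicate is symmetric in its arguments, which mirrors the pair of
tests that check that ordering does not matter.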

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:49 +0000 (14:11 -0800)]

selftests/vsock: add tests for proc sys vsock ns_mode

Add tests for the /proc/sys/net/vsock/{ns_mode,child_ns_mode}
interfaces. Namely, that they accept/report "global" and "local" strings
and enforce their access policies.

Start a convention of commenting the test name over the test
description. Add test name comments over test descriptions that existed
before this convention.

Add a check_netns() function that checks if the test requires namespaces
and if the current kernel supports namespaces. Skip tests that require
namespaces if the system does not have namespace support.

This patch is the first to add tests that do *not* re-use the same
shared VM. For that reason, it adds a run_ns_tests() function to run
these tests and filter out the shared VM tests.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-9-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:48 +0000 (14:11 -0800)]

selftests/vsock: use ss to wait for listeners instead of /proc/net

Replace /proc/net parsing with ss(8) for detecting listening sockets in
wait_for_listener() functions and add support for TCP, VSOCK, and Unix
socket protocols.

The previous implementation parsed /proc/net/tcp using awk to detect
listening sockets, but this approach could not support vsock because
vsock does not export socket information to /proc/net/.

Instead, use ss so that we can detect listeners on tcp, vsock, and unix.

The protocol parameter is now required for all wait_for_listener family
functions (wait_for_listener, vm_wait_for_listener,
host_wait_for_listener) to explicitly specify which socket type to wait
for.

ss is added to the dependency check in check_deps().

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-8-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:47 +0000 (14:11 -0800)]

selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers

These functions are reused by the VM tests to collect and compare dmesg
warnings and oops counts. The future VM-specific tests use them heavily.
This patch relies on vm_ssh() already supporting namespaces.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-7-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:46 +0000 (14:11 -0800)]

selftests/vsock: prepare vm management helpers for namespaces

Add namespace support to vm management, ssh helpers, and vsock_test
wrapper functions. This enables running VMs and test helpers in specific
namespaces, which is required for upcoming namespace isolation tests.

The functions still work correctly within the init ns, though the caller
must now pass "init_ns" explicitly.

No functional changes for existing tests. All have been updated to pass
"init_ns" explicitly.

Affected functions (such as vm_start() and vm_ssh()) now wrap their
commands with 'ip netns exec' when executing commands in non-init
namespaces.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-6-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
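
The wrapping described above amounts to conditionally prefixing a command.
A Python sketch of that logic (the helper name is hypothetical; the real
helpers are shell functions in vmtest.sh):

```python
def in_namespace(ns: str, cmd: list[str]) -> list[str]:
    """Return cmd unchanged for the init namespace, else wrap it in ip netns exec."""
    if ns == "init_ns":
        return list(cmd)
    return ["ip", "netns", "exec", ns, *cmd]
```

For example, in_namespace("local0", ["ss", "-l"]) yields
["ip", "netns", "exec", "local0", "ss", "-l"], while passing "init_ns"
leaves the command as it was.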

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:45 +0000 (14:11 -0800)]

selftests/vsock: add namespace helpers to vmtest.sh

Add functions for initializing namespaces with the different vsock NS
modes. Callers can use add_namespaces() and del_namespaces() to create
namespaces global0, global1, local0, and local1.

The add_namespaces() function initializes global0, local0, etc... with
their respective vsock NS mode by toggling child_ns_mode before creating
the namespace.

Remove namespaces upon exiting the program in cleanup(). This is
unlikely to be needed for a healthy run, but it is useful for tests that
are manually killed mid-test.

This patch is in preparation for later namespace tests.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-5-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
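
The ordering matters here: the mode has to be written before the namespace
exists, because a namespace inherits its mode at creation. A dry-run Python
sketch of that sequence (it builds a plan instead of touching the system;
the function and tuple layout are hypothetical, while the sysctl path and
namespace names come from the series):

```python
CHILD_NS_MODE = "/proc/sys/net/vsock/child_ns_mode"

NAMESPACES = (
    ("global0", "global"),
    ("global1", "global"),
    ("local0", "local"),
    ("local1", "local"),
)


def add_namespaces_plan(namespaces=NAMESPACES):
    """List the steps in order: set child_ns_mode, then create the namespace."""
    plan = []
    for name, mode in namespaces:
        plan.append(("write", CHILD_NS_MODE, mode))
        plan.append(("run", ["ip", "netns", "add", name]))
    return plan
```

Executing the plan would need root privileges and a kernel with vsock
namespace support, which is why the sketch stops at describing the steps.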

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:44 +0000 (14:11 -0800)]

selftests/vsock: increase timeout to 1200

Increase the timeout from 300s to 1200s. On a modern bare metal server
my last run showed the new set of tests taking ~400s. Multiply by an
(arbitrary) factor of three to account for slower/nested runners.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-4-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:43 +0000 (14:11 -0800)]

vsock: add netns support to virtio transports

Add netns support to loopback and vhost. Keep netns disabled for
virtio-vsock, but add necessary changes to comply with common API
updates.

This is the patch in the series when vhost-vsock namespaces actually
come online.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-3-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:42 +0000 (14:11 -0800)]

virtio: set skb owner of virtio_transport_reset_no_sock() reply

Associate reply packets with the sending socket. When vsock must reply
with an RST packet and there exists a sending socket (e.g., for
loopback), setting the skb owner to the socket correctly handles
reference counting between the skb and sk (i.e., the sk stays alive
until the skb is freed).

This allows the net namespace to be used for socket lookups for the
duration of the reply skb's lifetime, preventing race conditions between
the namespace lifecycle and vsock socket search using the namespace
pointer.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-2-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

Bobby Eshleman [Wed, 21 Jan 2026 22:11:41 +0000 (14:11 -0800)]

vsock: add netns to vsock core

Add netns logic to vsock core. Additionally, modify transport hook
prototypes to be used by later transport-specific patches (e.g.,
*_seqpacket_allow()).

Namespaces are supported primarily by changing socket lookup functions
(e.g., vsock_find_connected_socket()) to take into account the socket
namespace and the namespace mode before considering a candidate socket a
"match".

This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
for new namespaces.

Add netns functionality (initialization, passing to transports, procfs,
etc...) to the af_vsock socket layer. Later patches that add netns
support to transports depend on this patch.

This patch changes the allocation of random ports for connectible vsocks
in order to avoid leaking the random port range starting point to other
namespaces.

dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
modified to take a vsk in order to perform logic on namespace modes. In
future patches, the net will also be used for socket
lookups in these functions.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260121-vsock-vmtest-v16-1-2859a7512097@meta.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

commit | commitdiff | tree

David Yang [Fri, 23 Jan 2026 16:48:36 +0000 (00:48 +0800)]

net: ethernet: ti: netcp: Use u64_stats_t with u64_stats_sync properly

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. Convert to u64_stats_t to ensure atomic
operations.

Note that this does not make the code tear-free: there are still u32
counters unprotected by u64_stats or anything else.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260123164841.2890054-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Fri, 23 Jan 2026 21:10:55 +0000 (05:10 +0800)]

netdevsim: use u64_stats_t with u64_stats_sync properly

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. Convert to u64_stats_t to ensure atomic
operations.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260123211101.2929547-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gal Pressman [Sun, 25 Jan 2026 10:55:24 +0000 (12:55 +0200)]

selftests: net: fix wrong boolean evaluation in __exit__

The __exit__ method receives ex_type as the exception class when an
exception occurs. The previous code used implicit boolean evaluation:

terminate = self.terminate or (self._exit_wait and ex_type)
^^^^^^^^^^^

In Python, the and operator can be used with non-boolean values, but it
does not always return a boolean result.

This is probably not what we want, because 'self._exit_wait and ex_type'
could return the actual ex_type value (the exception class) rather than
a boolean True when an exception occurs.

Use explicit `ex_type is not None` check to properly evaluate whether
an exception occurred, returning a boolean result.
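The difference can be seen in a minimal standalone sketch. The helper names
old_terminate() and new_terminate() are hypothetical and only mirror the
expression quoted above; they are not the selftest code itself.

```python
def old_terminate(terminate, exit_wait, ex_type):
    # Implicit truthiness: `and` returns one of its operands, not a bool.
    return terminate or (exit_wait and ex_type)


def new_terminate(terminate, exit_wait, ex_type):
    # Explicit check: `ex_type is not None` always evaluates to a bool.
    return terminate or (exit_wait and ex_type is not None)


# An exception occurred: __exit__ receives the exception class as ex_type.
print(old_terminate(False, True, ValueError))  # <class 'ValueError'>
print(new_terminate(False, True, ValueError))  # True

# No exception occurred: __exit__ receives None as ex_type.
print(old_terminate(False, True, None))        # None
print(new_terminate(False, True, None))        # False
```

Both variants are equally truthy or falsy in an `if`, so the change matters
mainly where the value is stored, compared, or passed on as a flag.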

Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20260125105524.773993-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Heiner Kallweit [Sat, 24 Jan 2026 21:30:15 +0000 (22:30 +0100)]

r8169: remove optional size argument in calls to strscpy

Minor simplification of the code by removing the optional size argument.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/d560ec66-848e-4290-818a-ce28f39de493@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Javen Xu [Sat, 24 Jan 2026 21:27:24 +0000 (22:27 +0100)]

r8169: add support for extended chip version id and RTL9151AS

The bits in register TxConfig used for chip identification aren't
sufficient for the number of upcoming chip versions. Therefore a register
is added with extended chip version information, for compatibility
purposes it's called TX_CONFIG_V2. First chip to use the extended chip
identification is RTL9151AS.

Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
[hkallweit1@gmail.com: add support for extended XID where XID is printed]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/a3525b74-a1aa-43f6-8413-56615f6fa795@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ethan Nelson-Moore [Sat, 24 Jan 2026 08:22:06 +0000 (00:22 -0800)]

net: usb: replace unnecessary get_link functions with usbnet_get_link

usbnet_get_link calls mii_link_ok if the device has a MII defined in
its usbnet struct and no check_connect function defined there. This is
true of these drivers, so their custom get_link functions which call
mii_link_ok are useless. Remove them in favor of usbnet_get_link.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://patch.msgid.link/20260124082217.82351-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ethan Nelson-Moore [Sat, 24 Jan 2026 08:07:51 +0000 (00:07 -0800)]

net: usb: smsc95xx: use phy_do_ioctl_running function

The smsc95xx_ioctl function behaves identically to the
phy_do_ioctl_running function. Remove it and use the
phy_do_ioctl_running function directly instead.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260124080751.78488-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Tue, 27 Jan 2026 02:58:23 +0000 (18:58 -0800)]

Merge branch 'code-clean-up'

Justin Chen says:

====================
code clean up

Clean up and streamline some code that is no longer needed due to
older HW support being dropped.
====================

Link: https://patch.msgid.link/20260122194949.1145107-1-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Justin Chen [Thu, 22 Jan 2026 19:49:49 +0000 (11:49 -0800)]

net: bcmasp: streamline early exit in probe

Streamline the bcmasp_probe early exit. As support for other
functionality is added (i.e. PTP), it is easier to keep track of early
exit cleanup when it is all in one place.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20260122194949.1145107-3-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Justin Chen [Thu, 22 Jan 2026 19:49:48 +0000 (11:49 -0800)]

net: bcmasp: clean up some legacy logic

Removed wol_irq check. This was needed for brcm,asp-v2.0, which was
removed in previous commits.

Removed bcmasp_intf_ops. These function pointers were added to make
it easier to implement pseudo channels. These channels were removed
in newer versions of the hardware and were never implemented.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20260122194949.1145107-2-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Thu, 22 Jan 2026 18:51:07 +0000 (02:51 +0800)]

net: alacritech: Use u64_stats_t with u64_stats_sync properly

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. Convert to u64_stats_t to ensure atomic
operations.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Link: https://patch.msgid.link/20260122185113.2760355-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Mon, 26 Jan 2026 17:47:31 +0000 (17:47 +0000)]

net: include <linux/hex.h> from sysctl_net_core.c

Needed for hex_byte_pack().

x86_64 was already including it, but some arches were not.

Fixes: 37b0ea8fef56 ("net: expand NETDEV_RSS_KEY_LEN to 256 bytes")
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/netdev/aXeka0KYBnrkwUcF@sirena.org.uk/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260126174731.2767372-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Akiyoshi Kurita [Fri, 23 Jan 2026 15:02:11 +0000 (00:02 +0900)]

dt-bindings: net: dsa: fix typos in bindings docs

Fix "alway" -> "always" in lan9303.txt and marvell,mv88e6xxx.yaml.

Signed-off-by: Akiyoshi Kurita <weibu@redadmin.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20260123150211.2646235-1-weibu@redadmin.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Sun, 25 Jan 2026 23:09:09 +0000 (15:09 -0800)]

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
refactor IDPF resource access

Pavan Kumar Linga says:

Queue and vector resources for a given vport are stored in the
idpf_vport structure. At the time of configuration, these
resources are accessed using the vport pointer. This means all the
config path functions are tied to the default queue and vector
resources of the vport.

There are use cases which can make use of config path functions
to configure queue and vector resources that are not tied to any
vport. One such use case is PTP secondary mailbox creation
(it would be in a followup series). To configure queue and interrupt
resources for such cases, we can make use of the existing config
infrastructure by passing the necessary queue and vector resources info.

To achieve this, group the existing queue and vector resources into
default resource group and refactor the code to pass the resource
pointer to the config path functions.

This series also includes patches which generalize the send virtchnl
message APIs and mailbox API that are necessary for the implementation
of PTP secondary mailbox.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  idpf: generalize mailbox API
  idpf: avoid calling get_rx_ptypes for each vport
  idpf: generalize send virtchnl message API
  idpf: remove vport pointer from queue sets
  idpf: add rss_data field to RSS function parameters
  idpf: reshuffle idpf_vport struct members to avoid holes
  idpf: move some iterator declarations inside for loops
  idpf: move queue resources to idpf_q_vec_rsrc structure
  idpf: introduce idpf_q_vec_rsrc struct and move vector resources to it
  idpf: introduce local idpf structure to store virtchnl queue chunks
====================

Link: https://patch.msgid.link/20260122223601.2208759-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ethan Nelson-Moore [Fri, 23 Jan 2026 08:03:58 +0000 (00:03 -0800)]

net: usb: sr9700: rename register write commands for clarity

SR_WR_REG and SR_WR_REGS may be confused at a cursory glance. Rename
them to be more easily differentiated to prevent this.

Suggested-by: Andrew Lunn <andrew+netdev@lunn.ch>
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://patch.msgid.link/20260123080409.64165-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ethan Nelson-Moore [Fri, 23 Jan 2026 07:06:39 +0000 (23:06 -0800)]

net: usb: sr9700: use ETH_ALEN instead of magic number

The driver hardcodes the number 6 as the number of bytes to write to
the SR_PAR register, which stores the MAC address. Use ETH_ALEN instead
to make the code clearer.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://patch.msgid.link/20260123070645.56434-1-enelsonmoore@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Sun, 25 Jan 2026 22:57:41 +0000 (14:57 -0800)]

Merge branch 'net-neighbour-notify-changes-atomically'

Petr Machata says:

====================
net: neighbour: Notify changes atomically

Andy Roulin and Francesco Ruggeri have apparently independently both hit an
issue with the current neighbor notification scheme. Francesco reported the
issue in [1]. In a response[2] to that report, Andy said:

    neigh_update sends a rtnl notification if an update, e.g.,
    nud_state change, was done but there is no guarantee of
    ordering of the rtnl notifications. Consider the following
    scenario:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
       neigh_notify
         neigh_fill_info
           read_lock_bh(n->lock)
           ndm->nud_state = STALE
           read_unlock_bh(n->lock)
         -------------------------->
                                      neigh_update
                                      write_lock_bh(n->lock)
                                      n->nud_state = REACHABLE
                                      write_unlock_bh(n->lock)
                                      neigh_notify
                                        neigh_fill_info
                                           read_lock_bh(n->lock)
                                           ndm->nud_state = REACHABLE
                                           read_unlock_bh(n->lock)
                                        rtnl_notify
                                      RTNL REACHABLE sent
                            <--------
        rtnl_notify
        RTNL STALE sent

    In this scenario, the kernel neigh is updated first to STALE and
    then REACHABLE but the netlink notifications are sent out of order,
    first REACHABLE and then STALE.

The solution presented in [2] was to extend the critical region to include
both the call to neigh_fill_info(), as well as rtnl_notify(). Then we have
a guarantee that whatever state was captured by neigh_fill_info(), will be
sent right away. The above scenario can thus not happen.
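The effect of extending the critical region can be checked with a small
userspace model. This is only a sketch under stated assumptions: the step
names ("update", "fill", "send", "fill+send") and the helpers below are
hypothetical stand-ins for neigh_update(), neigh_fill_info() and
rtnl_notify(), and every step is assumed to run under one hold of the
neighbor lock. The model enumerates all interleavings of two threads and
counts those in which the last message sent disagrees with the final state.

```python
from itertools import combinations


def interleavings(a_steps, b_steps):
    """Yield every merge of two step lists that keeps each list's own order."""
    total = len(a_steps) + len(b_steps)
    for a_slots in combinations(range(total), len(a_steps)):
        a_iter, b_iter = iter(a_steps), iter(b_steps)
        yield [next(a_iter) if i in a_slots else next(b_iter)
               for i in range(total)]


def run(schedule):
    """Replay one schedule; return (final nud_state, messages in send order)."""
    nud_state = "NONE"
    snapshot = {}  # per-thread formatted message, filled by the "fill" step
    sent = []
    for thread, op, new_state in schedule:
        if op == "update":
            nud_state = new_state
        elif op == "fill":
            snapshot[thread] = nud_state
        elif op == "send":
            sent.append(snapshot[thread])
        elif op == "fill+send":  # format and send under one lock hold
            sent.append(nud_state)
    return nud_state, sent


def thread_steps(name, new_state, locked_notify):
    ops = ["fill+send"] if locked_notify else ["fill", "send"]
    return [(name, "update", new_state)] + [(name, op, new_state) for op in ops]


def count_stale_last(locked_notify):
    """Count schedules whose last notification contradicts the final state."""
    a = thread_steps("A", "STALE", locked_notify)
    b = thread_steps("B", "REACHABLE", locked_notify)
    bad = 0
    for schedule in interleavings(a, b):
        final, sent = run(schedule)
        if sent[-1] != final:
            bad += 1
    return bad


print(count_stale_last(locked_notify=False) > 0)   # True: reordering possible
print(count_stale_last(locked_notify=True) == 0)   # True: last message is final
```

In the locked variant both updates necessarily precede the last
notification, so the last message always carries the final state. A message
can still report a state that its sender did not set.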

This is how this patchset begins: patches #1 and #2 add helper duals to
neigh_fill_info() and __neigh_notify() such that the __-prefixed function
assumes the neighbor lock is held, and the unprefixed one is a thin wrapper
that manages locking. This extends locking further than Andy's patch, but
makes for clearer code and supports the following part.

At that point, the original race is gone. But what can happen is the
following race, where the notification does not reflect the change that was
made:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
         -------------------------->
                                      neigh_update
                                      write_lock_bh(n->lock)
                                      n->nud_state = REACHABLE
                                      write_unlock_bh(n->lock)
                                      neigh_notify
                                        read_lock_bh(n->lock)
                                        __neigh_fill_info
                                           ndm->nud_state = REACHABLE
                                        rtnl_notify
                                        read_unlock_bh(n->lock)
                                      RTNL REACHABLE sent
                            <--------
       neigh_notify
         read_lock_bh(n->lock)
         __neigh_fill_info
            ndm->nud_state = REACHABLE
         rtnl_notify
         read_unlock_bh(n->lock)
       RTNL REACHABLE sent again

Here, even though neigh_update() made a change to STALE, it later sends a
notification with a NUD of REACHABLE. The obvious solution to fix this race
is to move the notifier to the same critical section that actually makes
the change.
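A deterministic replay of this second race, and of the proposed ordering,
might look as follows. It is a simplified model, not kernel code: the op
names "update", "notify" and "update+notify" are hypothetical, and each
step is assumed to execute under a single hold of the neighbor lock.

```python
def run(steps):
    """Replay steps in order; return (sender, reported state) per message."""
    nud_state = "NONE"
    sent = []
    for who, op, new_state in steps:
        if op in ("update", "update+notify"):
            nud_state = new_state
        if op in ("notify", "update+notify"):
            # The message reports whatever the state is at this moment.
            sent.append((who, nud_state))
    return sent


# Notification in its own critical section, after the one making the change.
split = [
    ("user", "update", "STALE"),
    ("kernel", "update", "REACHABLE"),
    ("kernel", "notify", None),
    ("user", "notify", None),
]

# Notification moved into the critical section that makes the change.
merged = [
    ("user", "update+notify", "STALE"),
    ("kernel", "update+notify", "REACHABLE"),
]

print(run(split))   # [('kernel', 'REACHABLE'), ('user', 'REACHABLE')]
print(run(merged))  # [('user', 'STALE'), ('kernel', 'REACHABLE')]
```

In the split case the userspace change to STALE is never reported. In the
merged case each message carries the state its sender just wrote.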

Sending a notification in fact involves two things: invoking the internal
notifier chain, and sending the netlink notification. The overall approach
in this patchset is to move the netlink notification to the critical
section of the change, while keeping the internal notifier intact. Since
the motion is not obviously correct, the patchset presents the change in a
series of incremental steps with discussion in commit messages. Please see
details in the patches themselves.

Reproducer
==========

To consistently reproduce, I injected an mdelay before the rtnl_notify()
call. Since only one thread should delay, a bit of instrumentation was
needed to see where the call originates. The mdelay was then only issued on
the call stack rooted in the RTNL request.

Then the general idea is to issue an "ip neigh replace" to mark a neighbor
entry as failed. In parallel to that, inject an ARP burst that validates
the entry. This is all observed with an "ip monitor neigh", where one can
see either a REACHABLE->FAILED transition, or FAILED->REACHABLE, while the
actual state at the end of the sequence is always REACHABLE.

With the patchset, only FAILED->REACHABLE is ever observed in the monitor.

Alternatives
============

Another approach to solving the issue would be to have a per-neighbor queue
of notification digests, each with a set of fields necessary for formatting
a notification. In pseudocode, a neighbor update would look something like
this:

  neighbor_update:
    - lock
    -   do update
    -   allocate notification digest, fill partially, mark not-committed
    - unlock
    - critical-section-breaking stuff (probes, ARP Q, etc.)
    - lock
    -   fill in missing details to the digest (notably neigh->probes)
    -   mark the digest as committed
    -   while (front of the digest queue is committed)
    -     pop it, convert to notifier, send the notification
    - unlock

This adds more complexity and would imply more changes to the code, which
is why I think the approach presented in this patchset is better. But it
would allow us to retain the overall structure of the code while giving us
accurate notifications.
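For comparison, the digest-queue alternative sketched in the pseudocode
above could be modeled roughly like this. All names here (Digest, Neighbor,
begin_update(), finish_update()) are hypothetical; the model only shows how
committing digests in queue order keeps notifications ordered even when the
second critical section runs out of order.

```python
import threading
from collections import deque
from dataclasses import dataclass


@dataclass
class Digest:
    nud_state: str
    probes: int = 0
    committed: bool = False


class Neighbor:
    def __init__(self):
        self.lock = threading.Lock()
        self.nud_state = "NONE"
        self.probes = 0
        self.pending = deque()  # digests in the order the updates happened
        self.sent = []          # stands in for the netlink notifications

    def begin_update(self, new_state):
        """First critical section: do the update and queue a partial digest."""
        with self.lock:
            self.nud_state = new_state
            digest = Digest(nud_state=new_state)
            self.pending.append(digest)
        return digest

    def finish_update(self, digest):
        """Second critical section: complete the digest, flush the queue."""
        with self.lock:
            digest.probes = self.probes
            digest.committed = True
            while self.pending and self.pending[0].committed:
                done = self.pending.popleft()
                self.sent.append((done.nud_state, done.probes))


n = Neighbor()
a = n.begin_update("STALE")
b = n.begin_update("REACHABLE")
n.probes += 1                # critical-section-breaking work happens here
n.finish_update(b)           # b is ready, but a is still ahead of it
after_b = list(n.sent)
n.finish_update(a)           # a commits; both are flushed in update order
print(after_b)               # []
print(n.sent)                # [('STALE', 1), ('REACHABLE', 1)]
```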

A third approach would be to consider the second race not very serious and
be OK with seeing a notification that does not reflect the change that
prompted it. Then a two-patch prefix of this patchset would be all that is
needed.

[1]: https://lore.kernel.org/20220606230107.D70B55EC0B30@us226.sjc.aristanetworks.com
[2]: https://lore.kernel.org/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com
====================

Link: https://patch.msgid.link/cover.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:42 +0000 (17:43 +0100)]

net: core: neighbour: Make another netlink notification atomically

Similarly to the issue from the previous patch, neigh_timer_handler() also
updates the neighbor separately from formatting and sending the netlink
notification message. We have not seen reports of this causing trouble,
but in theory the same sort of issue could come up: neigh_timer_handler()
makes changes as necessary, but before it formats and sends a
notification, it is interrupted by another thread, which makes a parallel
change and sends its own message. The message sent for the earlier change
thus contains information that does not reflect that change.

To solve this, the netlink notification needs to be in the same critical
section that updates the neighbor. The critical section is ended by the
neigh_probe() call which drops the lock before calling solicit. Stretching
the critical section over the solicit call is problematic, because that can
then involve all sorts of forwarding callbacks. Therefore, like in the
previous patch, split the netlink notification away from the internal one
and move it ahead of the probe call.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e440118511cbdbe1d88eb0d71c9047116feb96e0.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:41 +0000 (17:43 +0100)]

net: core: neighbour: Make one netlink notification atomically

As noted in a previous patch, one race remains in the current code. A
kernel thread might interrupt a userspace thread after the change is done,
but before formatting and sending the message. Then what we would see is
two messages with the same contents:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
         -------------------------->
                                      neigh_update
                                      write_lock_bh(n->lock)
                                      n->nud_state = REACHABLE
                                      write_unlock_bh(n->lock)
                                      neigh_notify
                                        read_lock_bh(n->lock)
                                        __neigh_fill_info
                                           ndm->nud_state = REACHABLE
                                        rtnl_notify
                                        read_unlock_bh(n->lock)
                                      RTNL REACHABLE sent
                            <--------
       neigh_notify
         read_lock_bh(n->lock)
         __neigh_fill_info
            ndm->nud_state = REACHABLE
         rtnl_notify
         read_unlock_bh(n->lock)
       RTNL REACHABLE sent again

The solution is to send the netlink message inside the critical section
where the neighbor is changed, so that it reflects the notified-upon
neighbor state.

To that end, in __neigh_update(), move the current neigh_notify() call up
to said critical section, and convert it to __neigh_notify(), because the
lock is held. This motion crosses calls to neigh_update_managed_list(),
neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which
potentially unlock and give an opportunity for the above race.

This also crosses a call to neigh_update_process_arp_queue(), which calls
neigh->output(), which might be neigh_resolve_output(), which calls
neigh_event_send(), which calls neigh_event_send_probe(), which calls
__neigh_event_send(), which calls neigh_probe(). That last call touches
neigh->probes, an update which will now not be visible in the notification.

However, there are indications that these changes were never promised to
be accurately reflected in notifications: fib6_table_lookup() indirectly
calls route.c's find_match(), which calls rt6_probe(), which looks up a
neighbor and calls __neigh_set_probe_once(), which sets neigh->probes to 0,
but neither this nor the caller seems to send a notification.

Additionally, the neighbor object that the neigh_probe() mentioned above is
called on, might be the alternative neighbor looked up for the ARP queue
packet destination. If that is the case, the changed value of n1->probes is
not notified anywhere.

So at least in some circumstances, the reported number of probes needs to
be assumed to change without notification.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ceb44995498eb52375cb2d46c3245bdb9e74b355.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:40 +0000 (17:43 +0100)]

net: core: neighbour: Reorder netlink & internal notification

The netlink message needs to be sent inside the critical section where the
neighbor is changed, so that it reflects the notified-upon neighbor state.
On the other hand, there is no such need in the case of the notifier chain:
the listeners do not assume the lock is held, and often in fact just
schedule delayed work to act on the neighbor later. At least one in fact
also takes the neighbor lock.

This requires that the netlink notification be done before the internal
notifier chain message is sent. That is safe to do, because the current
listeners, as well as __neigh_notify(), only read the updated neighbor
fields, and never modify them. (Apart from locking.)

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/f3ef74d5460f14c4d102b8a5857d4a6624da9a5a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:39 +0000 (17:43 +0100)]

net: core: neighbour: Inline neigh_update_notify() calls

The obvious idea behind the helper is to keep together the two bits that
should be done either both or neither: the internal notifier chain message,
and the netlink notification.

To make sure that the notification sent reflects the change being made, the
netlink message needs to be sent inside the critical section where the
neighbor is changed. But for the notifier chain, there is no such need: the
listeners do not assume the lock is held, and often in fact just schedule
delayed work to act on the neighbor later. At least one in fact also takes
the neighbor lock. Therefore these two items each have different locking
needs.

Now we could unlock inside the helper, but I find that error prone, and the
fact that the notification is conditional in the first place does not help
to make the call site obvious.

So in this patch, the helper is instead removed and the body, which is just
these two calls, inlined. That way we can use each notifier independently.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/e65dce5882bc6f4aa2530b8a4877d0e003071a1a.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:38 +0000 (17:43 +0100)]

net: core: neighbour: Process ARP queue later

ARP queue processing unlocks the neighbor lock, which can allow another
thread to asynchronously perform a neighbor update and send an out of order
notification. Therefore this needs to be done after the notification is
sent.

Move it just before the end of the critical section. Since
neigh_update_process_arp_queue() unlocks, it does not form a part of the
critical section anymore but it can benefit from the lock being taken. The
intention is to eventually do the RTNL notification before this call.

This motion crosses a call to neigh_update_is_router(), which should not
influence processing of the ARP queue.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/9ea7159e71430ebdc837ebcc880a76b7e82e52a4.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:37 +0000 (17:43 +0100)]

net: core: neighbour: Extract ARP queue processing to a helper function

In order to make manipulation with this bit of code clearer, extract it
to a helper function, neigh_update_process_arp_queue().

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/8b0fa0abe2cf0e24484903f5436fe0ac64163057.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:36 +0000 (17:43 +0100)]

net: core: neighbour: Call __neigh_notify() under a lock

Andy Roulin has described an issue with the current neighbor notification
scheme as follows. This was also presented publicly at the link below.

    neigh_update sends a rtnl notification if an update, e.g.,
    nud_state change, was done but there is no guarantee of
    ordering of the rtnl notifications. Consider the following
    scenario:

    userspace thread                   kernel thread
    ================                   =============
    neigh_update
       write_lock_bh(n->lock)
       n->nud_state = STALE
       write_unlock_bh(n->lock)
       neigh_notify
         neigh_fill_info
           read_lock_bh(n->lock)
           ndm->nud_state = STALE
           read_unlock_bh(n->lock)
         -------------------------->
                                      neigh_update
                                      write_lock_bh(n->lock)
                                      n->nud_state = REACHABLE
                                      write_unlock_bh(n->lock)
                                      neigh_notify
                                        neigh_fill_info
                                           read_lock_bh(n->lock)
                                           ndm->nud_state = REACHABLE
                                           read_unlock_bh(n->lock)
                                        rtnl_notify
                                      RTNL REACHABLE sent
                            <--------
        rtnl_notify
        RTNL STALE sent

    In this scenario, the kernel neigh is updated first to STALE and
    then REACHABLE but the netlink notifications are sent out of order,
    first REACHABLE and then STALE.

The solution is to send the netlink message inside the same critical
section that formats the message. That way both the contents and ordering
of the message reflect the same state, and we cannot see the abovementioned
out-of-order delivery.

Even with this patch, a remaining issue is that the contents of the message
may not reflect the changes made to the neighbor. A kernel thread might
still interrupt a userspace thread after the change is done, but before
formatting and sending the message. Then what we would see is two messages
with the same contents. The following patches will attempt to address that
issue.

To support those future patches, convert __neigh_notify() to a helper that
assumes that the neighbor lock is already taken by having it call
__neigh_fill_info() instead of neigh_fill_info(). Add a new helper,
neigh_notify(), which takes the lock before calling __neigh_notify().
Migrate all callers to use the latter.

Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/4b4368dcc5f5a7e407009cb6c36b69cfb5282864.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Petr Machata [Wed, 21 Jan 2026 16:43:35 +0000 (17:43 +0100)]

net: core: neighbour: Add a neigh_fill_info() helper for when lock not held

The netlink message needs to be formatted and sent inside the critical
section where the neighbor is changed, so that it reflects the
notified-upon neighbor state. Because it will happen inside an already
existing critical section, it has to assume that the neighbor lock is held.
Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but
makes this assumption. Convert neigh_fill_info() to a wrapper around this
new helper.
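The split between a helper that assumes the lock and a wrapper that takes it
can be illustrated with a userspace analogue. This is a sketch only: the
kernel uses a rwlock taken with read_lock_bh()/write_lock_bh(), whereas the
model below uses a plain mutex, and the names _fill_info_locked(),
fill_info() and update_and_format() are hypothetical.

```python
import threading


class Neigh:
    def __init__(self, nud_state):
        self.lock = threading.Lock()
        self.nud_state = nud_state


def _fill_info_locked(neigh):
    """Format a message; the caller must already hold neigh.lock."""
    assert neigh.lock.locked()
    return {"nud_state": neigh.nud_state}


def fill_info(neigh):
    """Thin wrapper for callers that do not hold the lock."""
    with neigh.lock:
        return _fill_info_locked(neigh)


def update_and_format(neigh, new_state):
    """Change and format in one critical section, using the locked helper."""
    with neigh.lock:
        neigh.nud_state = new_state
        return _fill_info_locked(neigh)


n = Neigh("STALE")
print(fill_info(n))                       # {'nud_state': 'STALE'}
print(update_and_format(n, "REACHABLE"))  # {'nud_state': 'REACHABLE'}
```

Calling fill_info() from inside update_and_format() would deadlock on the
non-reentrant lock, which is why the locked variant is needed.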

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/7ec20113d5d809200e3534d3ed8f0004514914b8.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 17:22:47 +0000 (17:22 +0000)]

ipv4: igmp: annotate data-races around idev->mr_maxdelay

idev->mr_maxdelay is read and written locklessly,
add READ_ONCE()/WRITE_ONCE() annotations.

While we are at it, make this field a u32.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 16:50:49 +0000 (16:50 +0000)]

ipvlan: remove ipvlan_ht_addr_lookup()

ipvlan_ht_addr_lookup() is called four times and not inlined.

Split it to ipvlan_ht_addr_lookup6() and ipvlan_ht_addr_lookup4()
and rework ipvlan_addr_lookup() to call these helpers once,
so that they are (auto)inlined.

After this change, ipvlan_addr_lookup() is faster, and we save
350 bytes of text on x86_64.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 123/-473 (-350)
Function                                     old     new   delta
ipvlan_addr_lookup                           467     590    +123
__pfx_ipvlan_ht_addr_lookup                   16       -     -16
ipvlan_ht_addr_lookup                        457       -    -457
Total: Before=22571833, After=22571483, chg -0.00%
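The totals in the report above follow directly from the per-function rows.
As a quick cross-check, with the row values copied from the output and a
removed function treated as having a new size of 0:

```python
# (function, old size, new size) as listed in the bloat-o-meter output.
rows = [
    ("ipvlan_addr_lookup", 467, 590),
    ("__pfx_ipvlan_ht_addr_lookup", 16, 0),
    ("ipvlan_ht_addr_lookup", 457, 0),
]

deltas = [new - old for _, old, new in rows]
up = sum(d for d in deltas if d > 0)     # growth
down = sum(d for d in deltas if d < 0)   # shrinkage
net = up + down

before = 22571833
after = before + net

print(up, down, net, after)  # 123 -473 -350 22571483
```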

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260122165049.2366985-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 19:03:49 +0000 (19:03 +0000)]

net: expand NETDEV_RSS_KEY_LEN to 256 bytes

NETDEV_RSS_KEY_LEN has been 52 bytes since 2014.

Jakub suggested we bump the size to 128 bytes or more.

Some drivers (like idpf) were already working around the core limit.

Since this change might cause some issues in admin scripts,
bump it directly to 256 in one go.

tjbp26:~# cat /proc/sys/net/core/netdev_rss_key | wc -c
768

tjbp26:~# ethtool -x eth1
RX flow hash indirection table for eth1 with 32 RX ring(s):
...
RSS hash key:
fe:16:5b:2f:93:85:c2:c9:c1:ef:bd:60:c6:e0:2b:99:4d:bf:b7:14:c8:1e:8d:cb:31:17:51:da:55:eb:91:d9:9e:f9:89:9b:44:a1:dc:08:72:3a:b3:d6:31:86:9a:fe:02:3a:0d:eb:a1:7c:f5:a3:51:3b:08:56:c9:3f:71:69:01:ba:70:38
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off
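
The 768 reported by wc -c above is consistent with a 256-byte key,
assuming the proc file prints every byte as two hex digits followed by
either ':' or the final newline. That format is inferred from the output
shown here, and the helper name is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Characters produced when key_len bytes are printed as "xx:xx:...:xx\n":
 * two hex digits per byte plus one separator (':' or the final newline). */
static size_t rss_key_text_len(size_t key_len)
{
	return key_len * 3;
}
```

Under the same assumption the old 52-byte key would have produced 156
characters, which is the kind of difference an admin script that parses
this file might trip over.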

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20260122075206.504ec591@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260122190349.2771064-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Sun, 25 Jan 2026 21:18:59 +0000 (13:18 -0800)]

Merge branch 'net-few-critical-helpers-are-inlined-again'

Eric Dumazet says:

====================
net: few critical helpers are inlined again

Recent devmem additions increased stack depth. Some helpers
that were inlined in the past are now out-of-line.
====================

Link: https://patch.msgid.link/20260122045720.1221017-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 04:57:19 +0000 (04:57 +0000)]

net: inline get_netmem() and put_netmem()

These helpers are used in network fast paths.

Only call out-of-line helpers for netmem case.

We might consider inlining __get_netmem() and __put_netmem()
in the future.
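
A minimal sketch of the "inline fast path, out-of-line slow path" shape
described above. The struct, its fields and both function names are
hypothetical; the real helpers operate on netmem references, which this
sketch does not try to model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reference holder, for illustration only. */
struct mem_ref_sketch {
	bool is_iov;	/* stand-in for the less common, non-page case */
	int refs;
	int slow_calls;	/* counts out-of-line calls, for illustration */
};

/* Slow path: kept out of line so it does not bloat every caller. */
static __attribute__((noinline)) void get_ref_slow(struct mem_ref_sketch *m)
{
	m->slow_calls++;
	m->refs++;
}

/* Fast path: inlined into callers; only the rare case pays for a call. */
static inline void get_ref(struct mem_ref_sketch *m)
{
	if (m->is_iov) {
		get_ref_slow(m);
		return;
	}
	m->refs++;
}
```

The trade-off is visible in the bloat-o-meter output that follows:
callers grow a little because the common case is now expanded in place.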

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                                     old     new   delta
pskb_carve                                  1669    1894    +225
gro_pull_from_frag0                            -     206    +206
get_page                                     190     380    +190
skb_segment                                 3561    3747    +186
put_page                                     595     765    +170
skb_copy_ubufs                              1683    1822    +139
__pskb_trim_head                             276     401    +125
__pskb_copy_fclone                           734     858    +124
skb_zerocopy                                1092    1215    +123
pskb_expand_head                             892    1008    +116
skb_split                                    828     940    +112
skb_release_data                             297     409    +112
___pskb_trim                                 829     941    +112
__skb_zcopy_downgrade_managed                120     226    +106
tcp_clone_payload                            530     634    +104
esp_ssg_unref                                191     294    +103
dev_gro_receive                             1464    1514     +50
__put_netmem                                   -      41     +41
__get_netmem                                   -      41     +41
skb_shift                                   1139    1175     +36
skb_try_coalesce                             681     714     +33
__pfx_put_page                               112     144     +32
__pfx_get_page                                32      64     +32
__pskb_pull_tail                            1137    1168     +31
veth_xdp_get                                 250     267     +17
__pfx_gro_pull_from_frag0                      -      16     +16
__pfx___put_netmem                             -      16     +16
__pfx___get_netmem                             -      16     +16
__pfx_put_netmem                              16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx_get_netmem                              16       -     -16
put_netmem                                   114       -    -114
get_netmem                                   130       -    -130
napi_gro_frags                               929     771    -158
gro_try_pull_from_frag0                      196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 04:57:18 +0000 (04:57 +0000)]

net: inline net_is_devmem_iov()

1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function                                     old     new   delta
validate_xmit_skb                            866     857      -9
__pfx_net_is_devmem_iov                       16       -     -16
net_is_devmem_iov                             22       -     -22
get_netmem                                   152     130     -22
put_netmem                                   140     114     -26
tcp_recvmsg_locked                          3860    3797     -63
Total: Before=22566015, After=22565857, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 04:57:17 +0000 (04:57 +0000)]

gro: change the BUG_ON() in gro_pull_from_frag0()

Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE()
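
To illustrate the difference in behaviour, here is a userspace sketch of
a warn-once check. WARN_ON_ONCE_SKETCH is a simplified stand-in, not the
kernel macro, and pull_bytes() and warnings_emitted are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

static int warnings_emitted;	/* test hook, not a kernel concept */

/* Report a violated invariant once per call site, then keep running. */
#define WARN_ON_ONCE_SKETCH(cond) do {					\
	static int warned_;						\
	if ((cond) && !warned_) {					\
		warned_ = 1;						\
		warnings_emitted++;					\
		fprintf(stderr, "warning: %s\n", #cond);		\
	}								\
} while (0)

/* Hypothetical pull helper: 'grow' bytes should fit in 'avail' bytes. */
static int pull_bytes(int avail, int grow)
{
	WARN_ON_ONCE_SKETCH(avail < grow);	/* a BUG_ON() would stop here */
	return avail < grow ? avail : grow;
}
```

Unlike a BUG_ON(), the check logs the first violation and lets the
caller continue; repeated violations at the same site stay silent.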

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function                                     old     new   delta
gro_try_pull_from_frag0                        -     196    +196
napi_gro_frags                               771     929    +158
__pfx_gro_try_pull_from_frag0                  -      16     +16
__pfx_gro_pull_from_frag0                     16       -     -16
dev_gro_receive                             1514    1464     -50
gro_pull_from_frag0                          188       -    -188
Total: Before=22565899, After=22566015, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Dumazet [Thu, 22 Jan 2026 04:57:16 +0000 (04:57 +0000)]

net: always inline skb_frag_unref() and __skb_frag_unref()

clang is not inlining skb_frag_unref() and __skb_frag_unref()
in gro fast path.

It also does not inline gro_try_pull_from_frag0().

Using __always_inline fixes this issue, makes the
kernel faster _and_ smaller.

Also change __skb_frag_ref(), skb_frag_ref() and skb_page_unref()
so that they can be inlined for the last patch in this series.
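
For readers unfamiliar with the annotation, a small userspace sketch of
forcing a wrapper/inner helper pair inline. ALWAYS_INLINE_SKETCH spells
out the GCC/clang attribute directly; struct frag_sketch and the three
functions are hypothetical and only mimic the shape of the skb helpers.

```c
#include <assert.h>

/* Userspace spelling of a forced-inline annotation (GCC/clang attribute). */
#define ALWAYS_INLINE_SKETCH inline __attribute__((__always_inline__))

/* Hypothetical refcounted fragment standing in for an skb frag. */
struct frag_sketch {
	int refcnt;
};

/* Inner helper: forced inline so the fast path pays no call overhead. */
static ALWAYS_INLINE_SKETCH int frag_unref_inner(struct frag_sketch *f)
{
	return --f->refcnt == 0;	/* true when the last reference is dropped */
}

/* Outer helper, also forced inline; mirrors the wrapper/inner pairing. */
static ALWAYS_INLINE_SKETCH int frag_unref(struct frag_sketch *f)
{
	return frag_unref_inner(f);
}

/* Hypothetical fast-path caller: both helpers are expanded in place here. */
static int release_frags(struct frag_sketch *frags, int n)
{
	int freed = 0;

	for (int i = 0; i < n; i++)
		freed += frag_unref(&frags[i]);
	return freed;
}
```

With a plain "inline" the compiler is free to emit a call anyway; the
attribute removes that choice.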

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 2/6 grow/shrink: 1/2 up/down: 218/-511 (-293)
Function                                     old     new   delta
gro_pull_from_frag0                            -     188    +188
__pfx_gro_pull_from_frag0                      -      16     +16
skb_shift                                   1125    1139     +14
__pfx_skb_frag_unref                          16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx___skb_frag_unref                        16       -     -16
__skb_frag_unref                              36       -     -36
skb_frag_unref                                59       -     -59
dev_gro_receive                             1608    1514     -94
napi_gro_frags                               892     771    -121
gro_try_pull_from_frag0                      153       -    -153
Total: Before=22566192, After=22565899, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Sun, 25 Jan 2026 21:14:14 +0000 (13:14 -0800)]

Merge branch 'u64_stats-introduce-u64_stats_copy'

David Yang says:

====================
u64_stats: Introduce u64_stats_copy()

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
====================

Link: https://patch.msgid.link/20260120092137.2161162-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Tue, 20 Jan 2026 09:21:32 +0000 (17:21 +0800)]

vxlan: vnifilter: fix memcpy with u64_stats

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-5-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Tue, 20 Jan 2026 09:21:31 +0000 (17:21 +0800)]

macsec: fix memcpy with u64_stats

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Tue, 20 Jan 2026 09:21:30 +0000 (17:21 +0800)]

net: bridge: mcast: fix memcpy with u64_stats

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

David Yang [Tue, 20 Jan 2026 09:21:29 +0000 (17:21 +0800)]

u64_stats: Introduce u64_stats_copy()

The following (anti-)pattern was observed in the code tree:

        do {
                start = u64_stats_fetch_begin(&pstats->syncp);
                memcpy(&temp, &pstats->stats, sizeof(temp));
        } while (u64_stats_fetch_retry(&pstats->syncp, start));

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing, especially for memcpy(), for which arches may
provide their own highly optimized implementations.

In theory the affected code should convert to u64_stats_t, or use
READ_ONCE()/WRITE_ONCE() properly.

However, since there is a need to copy chunks of statistics, instead of
writing loops at random places, we provide a safe memcpy() variant for
u64_stats.
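
One way such a copy could look, as a userspace sketch only: the function
below is NOT the u64_stats_copy() implementation, and its name and the
example struct are hypothetical. It assumes the buffers are 8-byte
aligned and that an aligned 64-bit volatile load is a single access on
the target, which typically holds on 64-bit machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy 'len' bytes of u64 counters one word at a time. Each volatile
 * load reads one whole counter, so a counter is not observed
 * half-updated the way it could be with an arbitrary memcpy(). */
static void stats_copy_sketch(void *dst, const void *src, size_t len)
{
	const volatile uint64_t *s = src;
	uint64_t *d = dst;

	assert(len % sizeof(uint64_t) == 0);
	for (size_t i = 0; i < len / sizeof(uint64_t); i++)
		d[i] = s[i];
}

/* Hypothetical block of counters, as in the pattern quoted above. */
struct pstats_sketch {
	uint64_t rx_packets;
	uint64_t rx_bytes;
};
```

In the fetch_begin/fetch_retry loop quoted above, a helper of this kind
would take the place of the memcpy() call.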

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Dimitri Daskalakis [Thu, 22 Jan 2026 22:57:23 +0000 (14:57 -0800)]

Documentation: net: Fix typos in netdevices.rst

Fixes two minor typos. Specifically, on -> or and Devices -> Device.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260122225723.2368698-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 23 Jan 2026 19:51:38 +0000 (11:51 -0800)]

Merge branch 'net-rds-rds-tcp-state-machine-and-message-loss-improvements'

Allison Henderson says:

====================
net/rds: RDS-TCP state machine and message loss improvements

This is subset 2 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.

In this set, we correct a few RDS/TCP connection handling issues, and
message loss issues.
====================

Link: https://patch.msgid.link/20260122055213.83608-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gerd Rausch [Thu, 22 Jan 2026 05:52:13 +0000 (22:52 -0700)]

net/rds: rds_tcp_accept_one ought to not discard messages

RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged
by the TCP stack of a peer, "rds_tcp_write_space()" goes on
to discard prior messages from the send queue.

Which is fine, for as long as the receiver never throws any messages away.

Unfortunately, that is *not* the case since the introduction of MPRDS:
commit 1a0e100fb2c96 "RDS: TCP: Enable multipath RDS for TCP"

A new function "rds_tcp_accept_one_path" was introduced,
which is entitled to return "NULL", if no connection path is currently
available.

Unfortunately, this happens after the "->accept()" call, and the new socket
often already contains messages, since the peer already transitioned
to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED".

That's also the case after this [1]:
commit 1a0e100fb2c96 "RDS: TCP: Force every connection to be initiated by
numerically smaller IP address"

which tried to address the situation of pending data by only transitioning
connections from a smaller IP address to "RDS_CONN_UP".

But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS"
handshake has not occurred yet, and therefore we're working with
"c_npaths <= 1", "c_conn[0]" may be in a state distinct from
"RDS_CONN_DOWN", and therefore all messages on the just accepted socket
will be tossed away.

This fix changes "rds_tcp_accept_one":

* If connected from a peer with a larger IP address, the new socket
  will continue to get closed right away.
  With commit [1] above, there should not be any messages
  in the socket receive buffer, since the peer never transitioned
  to "RDS_CONN_UP".
  Therefore it should be okay to not make any efforts to dispatch
  the socket receive buffer.

* If connected from a peer with a smaller IP address,
  we call "rds_tcp_accept_one_path" to find a free slot/"path".
  If found, business goes on as usual.
  If none was found, we save/stash the newly accepted socket
  into "rds_tcp_accepted_sock", in order to not lose any
  messages that may have arrived already.
  We then return from "rds_tcp_accept_one" with "-ENOBUFS".
  Later on, when a slot/"path" does become available again
  (e.g. state transitioned to "RDS_CONN_DOWN",
   or HS extension header was received with "c_npaths > 1")
  we call "rds_tcp_conn_slots_available" that simply re-issues
  a "rds_tcp_accept_one_path" worker-callback and picks
  up the new socket from "rds_tcp_accepted_sock", thereby
  continuing where it left off with "-ENOBUFS" last time.
  Since a new slot has become available, those messages
  won't be lost, since processing proceeds as if that slot
  had been available the first time around.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Gerd Rausch [Thu, 22 Jan 2026 05:52:12 +0000 (22:52 -0700)]

net/rds: No shortcut out of RDS_CONN_ERROR

RDS connections carry a state "rds_conn_path::cp_state",
and transitions from one state to another are conditional
upon an expected state: "rds_conn_path_transition".

There is one exception to this conditionality, which is
"RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop"
regardless of what state the connection is currently in.

But as soon as a connection enters state "RDS_CONN_ERROR",
the connection handling code expects it to go through the
shutdown-path.

The RDS/TCP multipath changes added a shortcut out of
"RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING"
via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").

A subsequent "rds_tcp_reset_callbacks" can then transition
the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.

That'll trip up "rds_conn_init_shutdown", which was
never adjusted to handle "RDS_CONN_RESETTING" and subsequently
drops the connection with the dreaded "DR_INV_CONN_STATE",
which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.

So we do two things here:

a) Don't shortcut "RDS_CONN_ERROR", but take the longer
   path through the shutdown code.

b) Add "RDS_CONN_RESETTING" to the expected states in
   "rds_conn_init_shutdown" so that we won't error out
   and get stuck, if we ever hit weird state transitions
   like this again.
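
The state handling described in this message can be sketched in
userspace with C11 atomics. Everything here is illustrative: the enum
values, struct conn_path_sketch and the three functions are hypothetical
names, and the real RDS code is more involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative subset of states; the values are not the kernel's. */
enum conn_state_sketch {
	CONN_DOWN, CONN_CONNECTING, CONN_UP,
	CONN_ERROR, CONN_RESETTING, CONN_DISCONNECTING,
};

struct conn_path_sketch {
	_Atomic int state;
};

/* Conditional transition: succeeds only if the path is still in 'old'. */
static bool path_transition(struct conn_path_sketch *cp, int old, int new_state)
{
	return atomic_compare_exchange_strong(&cp->state, &old, new_state);
}

/* The one unconditional transition: force the error state. */
static void path_drop(struct conn_path_sketch *cp)
{
	atomic_store(&cp->state, CONN_ERROR);
}

/* Shutdown entry: accept every state from which shutdown may begin,
 * including RESETTING, as in item b) of the list above. */
static bool shutdown_begin(struct conn_path_sketch *cp)
{
	return path_transition(cp, CONN_UP, CONN_DISCONNECTING) ||
	       path_transition(cp, CONN_ERROR, CONN_DISCONNECTING) ||
	       path_transition(cp, CONN_RESETTING, CONN_DISCONNECTING);
}
```

If shutdown_begin() did not list the resetting state, a path that ends
up there would fail all transitions and never leave it, which is the
stuck condition this message describes.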

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 23 Jan 2026 19:49:58 +0000 (11:49 -0800)]

Merge branch 'net-restore-the-structure-of-driver-facing-qcfg-api'

Jakub Kicinski says:

====================
net: restore the structure of driver-facing qcfg API

The goal of qcfg objects is to let us seamlessly support new use cases
without modifying all the drivers. We want to pull all the logic of
combining configuration supplied via different interfaces into the core
and present the drivers with a flat queue-by-queue configuration.
Additionally we want to separate the current effective configuration
from the user intent (default vs user setting vs memory provider setting).

Restructure the recently added code to re-introduce the pieces that
are missing compared to the old RFC:
https://lore.kernel.org/20250421222827.283737-1-kuba@kernel.org
Namely:
- the netdev_queue_config() helper
- queue config validation callback

I hopefully removed all the more "out there" parts of the RFC.
====================

Link: https://patch.msgid.link/20260122005113.2476634-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:13 +0000 (16:51 -0800)]

eth: bnxt: plug bnxt_validate_qcfg() into qops

Plug bnxt_validate_qcfg() back into qops, where it was in my old RFC.

Link: https://patch.msgid.link/20260122005113.2476634-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:12 +0000 (16:51 -0800)]

net: add queue config validation callback

I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).

Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.

The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so add it now so that we don't forget.
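
A sketch of what "validate after each layer" could look like. All names
here (struct qcfg_sketch, struct queue_ops_sketch, drv_validate(),
qcfg_check_layers()) and the page-size policy are hypothetical and do
not reflect the actual queue ops definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-queue config; the field is illustrative only. */
struct qcfg_sketch {
	unsigned int rx_page_size;
};

/* Hypothetical queue ops with an optional validation callback. */
struct queue_ops_sketch {
	int (*validate)(const struct qcfg_sketch *cfg);
};

/* Example driver policy: page size is a power of two, at most 64 KiB. */
static int drv_validate(const struct qcfg_sketch *cfg)
{
	unsigned int sz = cfg->rx_page_size;

	return (sz && !(sz & (sz - 1)) && sz <= 65536) ? 0 : -1;
}

/* Validate the base config first, then the config with the memory
 * provider override applied. Both must pass, so that removing the
 * override later cannot leave an invalid config behind. */
static int qcfg_check_layers(const struct queue_ops_sketch *ops,
			     const struct qcfg_sketch *base,
			     const unsigned int *mp_page_size)
{
	struct qcfg_sketch eff = *base;

	if (!ops->validate)
		return 0;		/* driver has nothing to check */
	if (ops->validate(&eff))
		return -1;
	if (mp_page_size) {
		eff.rx_page_size = *mp_page_size;
		if (ops->validate(&eff))
			return -1;
	}
	return 0;
}
```

Checking the base layer even when an override is present is what makes
un-binding safe: the fallback config has already been accepted.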

Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:11 +0000 (16:51 -0800)]

net: use netdev_queue_config() for mp restart

We should follow the prepare/commit approach for queue configuration.
The qcfg struct should be added to dev->cfg rather than directly to
queue objects so that we can clone and discard the pending config
easily.

Remove the qcfg in struct netdev_rx_queue, and switch remaining callers
to netdev_queue_config(). netdev_queue_config() will construct the qcfg
on the fly based on device defaults and state of the queue.

ndo_default_qcfg becomes optional because having the callback itself
does not have any meaningful semantics to us.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:10 +0000 (16:51 -0800)]

net: move mp->rx_page_size validation to __net_mp_open_rxq()

Move mp->rx_page_size validation where the rest of MP input
validation lives. No other caller is modifying mp params so
validation logic in queue restarts is out of place.

Link: https://patch.msgid.link/20260122005113.2476634-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:09 +0000 (16:51 -0800)]

net: introduce a trivial netdev_queue_config()

We may choose to extend or reimplement the logic which renders
the per-queue config. The drivers should not poke directly into
the queue state. Add a helper for drivers to use when they want
to query the config for a specific queue.

Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 22 Jan 2026 00:51:08 +0000 (16:51 -0800)]

eth: bnxt: always set the queue mgmt ops

Core provides a centralized callback for validating per-queue settings
but the callback is part of the queue management ops. Having the ops
conditionally set complicates the parts of the driver which could
otherwise lean on the core to feed it the correct settings.

Always set the queue ops, but provide no restart-related callbacks if
queue ops are not supported by the device. This should maintain current
behavior, the check in netdev_rx_queue_restart() looks both at op struct
and individual ops.

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/20260122005113.2476634-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 23 Jan 2026 19:43:31 +0000 (11:43 -0800)]

Merge branch 'selftest-extend-tun-virtio-coverage-for-gso-over-udp-tunnel'

Xu Du says:

====================
selftest: Extend tun/virtio coverage for GSO over UDP tunnel

The design strategy is to extend the existing tun testing infrastructure
to support this new use-case, rather than introducing a new or parallel framework.
This allows for better integration and re-use of existing test logic.
====================

Link: https://patch.msgid.link/cover.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:05:01 +0000 (18:05 +0800)]

selftest: tun: Add test data for success and failure paths

To improve the robustness and coverage of the TUN selftests, this
patch expands the set of test data.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/5054f3ad9f3dbfe33b827183fccc5efeb8fd0da7.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:05:00 +0000 (18:05 +0800)]

selftest: tun: Add test for receiving gso packet from tun

The test validates that GSO information is correctly exposed
when reading packets from a TUN device.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/fe75ac66466380490eba858eef50596a1bfbd071.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:04:59 +0000 (18:04 +0800)]

selftest: tun: Add test for sending gso packet into tun

The test constructs a raw packet, prepends a virtio_net_hdr,
and writes the result to the TUN device. This mimics the behavior
of a vm forwarding a guest's packet to the host networking stack.
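
The "prepend a header, then write" step can be sketched as below. This
is not the selftest code: struct vnet_hdr_sketch is a local mirror of
the basic 10-byte struct virtio_net_hdr layout from <linux/virtio_net.h>,
defined here only to keep the example standalone, and build_tun_frame()
is a hypothetical helper. The selftest may well use a larger header
variant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the basic virtio_net_hdr field layout (10 bytes). */
struct vnet_hdr_sketch {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Build "header + packet" in 'out'. Returns the total length, or 0 if
 * the output buffer is too small. The result is what a single write()
 * to the TUN fd would carry. */
static size_t build_tun_frame(uint8_t *out, size_t out_len,
			      const struct vnet_hdr_sketch *vh,
			      const uint8_t *pkt, size_t pkt_len)
{
	if (out_len < sizeof(*vh) + pkt_len)
		return 0;
	memcpy(out, vh, sizeof(*vh));
	memcpy(out + sizeof(*vh), pkt, pkt_len);
	return sizeof(*vh) + pkt_len;
}
```

The header tells the receiving stack how to treat the payload (for
example the segment size in gso_size), so the packet bytes themselves
stay unmodified.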

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/a988dbc9ca109e4f1f0b33858c5035bce8ebede3.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:04:58 +0000 (18:04 +0800)]

selftest: tun: Add helpers for GSO over UDP tunnel

In preparation for testing GSO over UDP tunnels, enhance the test
infrastructure to support a more complex data path involving a TUN
device and a GENEVE udp tunnel.

This patch introduces a dedicated setup/teardown topology that creates
both a GENEVE tunnel interface and a TUN interface. The TUN device acts
as the VTEP (Virtual Tunnel Endpoint), allowing it to send and receive
virtio-net packets. This setup effectively tests the kernel's data path
for encapsulated traffic.

Note that after adding a new address to the UDP tunnel, we need to wait
a bit until the associated route is available.

Additionally, a new data structure is defined to manage test parameters.
This structure is designed to be extensible, allowing different test
data and configurations to be easily added in subsequent patches.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/b5787b8c269f43ce11e1756f1691cc7fd9a1e901.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:04:57 +0000 (18:04 +0800)]

selftest: tun: Refactor tun_delete to use tuntap_helpers

The previous patch introduced common tuntap helpers to simplify
tun test code. This patch refactors the tun_delete function to
use these new helpers.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/ecc7c0c2d75d87cb814e97579e731650339703ab.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:04:56 +0000 (18:04 +0800)]

selftest: tun: Introduce tuntap_helpers.h header for TUN/TAP testing

Introduce rtnetlink manipulation and packet construction helpers that
will simplify the later creation of more related test cases. This avoids
duplicating logic across different test cases.

This new header will contain:
- YNL-based netlink management utilities.
- Helpers for ip link, ip address, ip neighbor and ip route operations.
- Packet construction and manipulation helpers.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/91f905715c69c75f7bf72d43388921fde6c34989.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Xu Du [Wed, 21 Jan 2026 10:04:55 +0000 (18:04 +0800)]

selftest: tun: Format tun.c existing code

In preparation for adding new tests for GSO over UDP tunnels,
consistently apply the kernel style to the existing code.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/d797de1e5a3d215dd78cb46775772ef682bab60e.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 23 Jan 2026 19:31:15 +0000 (11:31 -0800)]

Merge branch 'geneve-introduce-double-tunnel-gso-gro-support'

Paolo Abeni says:

====================
geneve: introduce double tunnel GSO/GRO support

This is the [belated] incarnation of the topic discussed at the last Netconf
[1].

In container orchestration in virtual environments there is a consistent
usage of double UDP tunneling - specifically geneve. Such setups lack
support of GRO and GSO for inter VM traffic.

After commit b430f6c38da6 ("Merge branch 'virtio_udp_tunnel_08_07_2025'
of https://github.com/pabeni/linux-devel") and the qemu counterpart, VMs
are able to send/receive GSO over UDP aggregated packets.

This series introduces the missing bit for full end-to-end aggregation
in the above mentioned scenario. Specifically:

- introduces a new netdev feature set to generalize existing per device
  driver GSO admission check.
- adds GSO partial support for the geneve and vxlan drivers
- introduces and uses a geneve option to assist double tunnel GRO
- adds some simple functional tests for the above.

The new device features set is not strictly needed for the following
work, but avoids the introduction of trivial `ndo_features_check` to
support GSO partial and thus possible performance regression due to the
additional indirect call. Such feature set could be leveraged by a
number of existing drivers (intel, meta and possibly wangxun) to avoid
duplicate code/tests. Such part has been omitted here to keep the series
small.

Both GSO partial support and double GRO support have some downsides.
With the first in place, GSO partial packets will traverse the network
stack 'downstream' of the outer geneve UDP tunnel and will be visible to
the UDP/IP/IPv6 layers and to netfilter. Currently only H/W NICs implement
GSO partial support and such packets are visible only via software taps.

Double UDP tunnel GRO will cook 'GSO partial' like aggregate packets,
i.e. the inner UDP encapsulation headers set will still carry the
wire-level lengths and csum, so that segmentation considering such
headers parts of a giant, constant encapsulation header will yield the
correct result.

The correct GSO packet layout is applied when the packet traverses the
outermost geneve encapsulation.

Both GSO partial and double UDP encap are disabled by default and must
be explicitly enabled via, respectively ethtool and geneve device
configuration.

Finally note that the GSO partial feature could potentially be applied
to all the other UDP tunnels, but this series limits its usage to geneve
and vxlan devices.

Link: https://netdev.bots.linux.dev/netconf/2024/paolo.pdf [1]
====================

Link: https://patch.msgid.link/cover.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:36 +0000 (17:11 +0100)]

selftests: net: tests for add double tunneling GRO/GSO

Create a simple, netns-based topology with double, nested UDP tunnels and
perform TSO transfers on top.

Explicitly enable GSO and/or GRO and check the skb layout consistency with
different configuration allowing (or not) GSO frames to be delivered on
the other end.

The trickiest part is accounting in a robust way for the aggregated/unaggregated
packets with double encapsulation: use a classic bpf filter for it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/61f2c98ba0f73057c2d6f6cb62eb807abd90bf6b.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:35 +0000 (17:11 +0100)]

geneve: use GRO hint option in the RX path

At the GRO stage, when a valid hint option is found, try to match the whole
nested headers and try to aggregate on the inner protocol; in case of hdr
mismatch, extract the nested address and port to properly flush on a
per-inner flow basis.

On GRO completion, the (unmodified) nested headers will be considered part
of the (constant) outer geneve encap header so that plain UDP tunnel
segmentation will yield valid wire packets.

In the geneve RX path, when processing a GSO packet carrying a GRO hint
option, update the nested header length fields from the wire packet size to
the GSO-packet one. If the nested header additionally carries a checksum,
convert it to CSUM-partial.
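
As an illustration only (this is not the kernel code), the length fixup
described above can be sketched as follows. The sketch assumes a nested
IPv4 + UDP encapsulation; the function name and the nh_off/th_off
parameters are hypothetical, and checksum handling is deliberately left
out.

```python
import struct


def fix_nested_lengths(pkt: bytearray, nh_off: int, th_off: int) -> None:
    """Rewrite nested IPv4/UDP length fields to cover the whole buffer.

    pkt    -- aggregated (GSO-sized) packet, starting at some outer header
    nh_off -- offset of the nested IPv4 header (hypothetical parameter)
    th_off -- offset of the nested UDP header (hypothetical parameter)
    """
    ip_total_len = len(pkt) - nh_off   # IPv4 total length: header + payload
    udp_len = len(pkt) - th_off        # UDP length: header + payload
    if ip_total_len > 0xFFFF:
        raise ValueError("aggregated packet does not fit a 16-bit length")
    # IPv4 total length lives at bytes 2-3 of the IPv4 header.
    struct.pack_into("!H", pkt, nh_off + 2, ip_total_len)
    # UDP length lives at bytes 4-5 of the UDP header.
    struct.pack_into("!H", pkt, th_off + 4, udp_len)


# Usage: 8 bytes of preceding header, 20-byte IPv4, 8-byte UDP, 3000 payload.
pkt = bytearray(8 + 20 + 8 + 3000)
fix_nested_lengths(pkt, nh_off=8, th_off=28)
```

A real implementation would also have to deal with IPv6 (payload length
instead of total length) and with the checksums affected by the rewrite.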

Finally, when the RX path leverages the GRO hints, skip the additional GRO
stage done by GRO cells: otherwise the already set skb->encapsulation flag
will fool the GRO cells complete step into touching the innermost IP header
when it should update the nested csum, corrupting the packet.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/4a9a390588a429191e0ffe48ccdd288bb69e567e.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:34 +0000 (17:11 +0100)]

geneve: extract hint option at GRO stage

Add helpers for finding a GRO hint option in the geneve header, performing
basic sanitization of the option offsets vs the actual packet layout,
validating the option for GRO aggregation and checking the nested header
checksum.

The validation helper closely mirrors the similar checks performed by the
ipv4 and ipv6 gro callbacks, with the additional twist of accessing the
relevant network header via the GRO hint offset.

To validate the nested UDP checksum, leverage the csum completed of the
outer header, similarly to LCO, with the main difference that in this case
we have the outer checksum available.
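
The arithmetic that makes this possible can be sketched as below. This is
a minimal illustration of one's-complement checksum subtraction, not the
kernel helpers; all function names are hypothetical. Given the sum over a
whole region and the sum over an even-length prefix, the sum over the
remaining (nested) part follows without re-reading it.

```python
def csum16(data: bytes) -> int:
    """One's-complement sum of big-endian 16-bit words, folded to 16 bits."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_sub(a: int, b: int) -> int:
    """One's-complement subtraction: add the bitwise complement of b."""
    total = a + (~b & 0xFFFF)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def nested_sum(whole_sum: int, prefix: bytes) -> int:
    """Sum over the part following 'prefix', derived from the whole sum.

    The prefix must have even length so 16-bit word alignment is preserved.
    """
    if len(prefix) % 2:
        raise ValueError("prefix must be 16-bit aligned")
    return csum_sub(whole_sum, csum16(prefix))


# Usage: the first 20 bytes stand in for the headers preceding the nested part.
packet = bytes(range(1, 61))
derived = nested_sum(csum16(packet), packet[:20])
```

Note that in one's-complement arithmetic 0x0000 and 0xFFFF both represent
zero, so a comparison of two sums should treat them as equal.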

Use the helpers to extract the hint info at the GRO stage.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/cd0e9dc42ba83f388b604097cffe268ffcb53351.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:33 +0000 (17:11 +0100)]

geneve: add GRO hint output path

If a geneve egress packet contains nested UDP encap headers, add a geneve
option including the information necessary on the RX side to perform GRO
aggregation of the whole packets: the nested network and transport headers,
and the nested protocol type.

Use geneve option class `netdev`, already registered in the Network
Virtualization Overlay (NVO3) IANA registry:

https://www.iana.org/assignments/nvo3/nvo3.xhtml#Linux-NetDev.
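
For illustration, a geneve option carrying such a hint could be packed as
in the sketch below. The 4-byte option header (16-bit class, 8-bit type,
3 reserved bits plus a 5-bit length in 4-byte words) follows RFC 8926;
the payload layout, the class value and the type value are placeholders
invented for this example, not the ones used by the patch.

```python
import struct

# Placeholder values: NOT the registered class or the type used by the patch.
HYPOTHETICAL_OPT_CLASS = 0xFFFF
HYPOTHETICAL_OPT_TYPE = 0x01


def pack_geneve_option(opt_class: int, opt_type: int, payload: bytes) -> bytes:
    """Pack one geneve option TLV as laid out in RFC 8926."""
    if len(payload) % 4:
        raise ValueError("option payload must be a multiple of 4 bytes")
    words = len(payload) // 4
    if words > 0x1F:
        raise ValueError("option payload too long for the 5-bit length")
    # Class (16 bits), type (8 bits), reserved (3 bits, zero) + length (5 bits).
    return struct.pack("!HBB", opt_class, opt_type, words) + payload


def pack_gro_hint(nested_nh_off: int, nested_th_off: int,
                  nested_proto: int) -> bytes:
    """Hypothetical hint payload: two 1-byte offsets and a 2-byte protocol."""
    payload = struct.pack("!BBH", nested_nh_off, nested_th_off, nested_proto)
    return pack_geneve_option(HYPOTHETICAL_OPT_CLASS, HYPOTHETICAL_OPT_TYPE,
                              payload)


# Usage: nested network header at offset 0x32, transport header at 0x46,
# nested protocol type 0x0800 (IPv4 EtherType).
opt = pack_gro_hint(0x32, 0x46, 0x0800)
```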

To pass the GRO hint information across the different xmit path functions,
store it in the skb control buffer, to avoid adding additional arguments.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/aa614567f7bdb776d693041375bede4990a19649.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:32 +0000 (17:11 +0100)]

geneve: pass the geneve device ptr to geneve_build_skb()

Pass the geneve device pointer to geneve_build_skb() instead of handing it
the geneve configuration in multiple arguments. This already avoids some
code duplication, and we are soon going to pass more arguments to this
function.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/761f05690646181fffc533ee4db59b68e5c3a0c3.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Paolo Abeni [Wed, 21 Jan 2026 16:11:31 +0000 (17:11 +0100)]

geneve: constify geneve_hlen()

This helper does not modify its argument; constifying it will additionally
simplify later patches.

Additionally, move the definition earlier, again for the sake of later
patches.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/ea9e279b9544e8644194508dd9a4320ee455fa95.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

A mirror of Linus' kernel repository