]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
4 weeks agoxsk: add indirect call for xsk_destruct_skb
Jason Xing [Fri, 31 Oct 2025 10:33:28 +0000 (18:33 +0800)] 
xsk: add indirect call for xsk_destruct_skb

Since Eric proposed an idea about adding indirect call wrappers for
UDP and managed to see a huge improvement[1], the same situation can
also be applied in xsk scenario.

This patch adds an indirect call for xsk and helps current copy mode
improve the performance by around 1% stably which was observed with
IXGBE at 10Gb/sec loaded. If the throughput grows, the positive effect
will be magnified. I applied this patch on top of batch xmit series[2],
and was able to see <5% improvement from our internal application
which is a little bit unstable though.

Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to
be when the mitigation config is off.

Be aware of the freeing path that can be very hot since the frequency
can reach around 2,000,000 times per second with the xdpsock test.

[1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/
[2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 weeks agonet: dsa: loop: use new helper fixed_phy_register_100fd to simplify the code
Heiner Kallweit [Sat, 8 Nov 2025 21:59:51 +0000 (22:59 +0100)] 
net: dsa: loop: use new helper fixed_phy_register_100fd to simplify the code

Use new helper fixed_phy_register_100fd to simplify the code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/922f1b45-1748-4dd2-87eb-9d018df44731@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoisdn: kcapi: add WQ_PERCPU to alloc_workqueue users
Marco Crivellari [Fri, 7 Nov 2025 13:44:52 +0000 (14:44 +0100)] 
isdn: kcapi: add WQ_PERCPU to alloc_workqueue users

Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107134452.198378-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'gve-improve-rx-buffer-length-management'
Jakub Kicinski [Tue, 11 Nov 2025 01:36:40 +0000 (17:36 -0800)] 
Merge branch 'gve-improve-rx-buffer-length-management'

Ankit Garg says:

====================
gve: Improve RX buffer length management

This patch series improves the management of the RX buffer length for
the DQO queue format in the gve driver. The goal is to make RX buffer
length config more explicit, easy to change, and performant by default.

We accomplish that in four patches:

1.  Currently, the buffer length is implicitly coupled with the header
    split setting, which is an unintuitive and restrictive design. The
    first patch decouples the RX buffer length from the header split
    configuration.

2.  The second patch is a preparatory step for third. It converts the XDP
    config verification method to use extack for better error reporting.

3.  The third patch exposes the `rx_buf_len` parameter to userspace via
    ethtool, allowing user to directly view or modify the RX buffer length
    if supported by the device.

4.  The final patch improves the out-of-the-box RX single stream throughput
    by >10%  by changing the driver's default behavior to select the
    maximum supported RX buffer length advertised by the device during
    initialization.
====================

Link: https://patch.msgid.link/20251106192746.243525-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agogve: Default to max_rx_buffer_size for DQO if device supported
Ankit Garg [Thu, 6 Nov 2025 19:27:46 +0000 (11:27 -0800)] 
gve: Default to max_rx_buffer_size for DQO if device supported

Change the driver's default behavior to prefer the largest available RX
buffer length supported by the device for DQO format, rather than always
using the hardcoded 2K default.

Previously, the driver would initialize with
`GVE_DEFAULT_RX_BUFFER_SIZE` (2K), even if the device advertised support
for a larger length (e.g., 4K).

Performance observations:
- With LRO disabled, we observed >10% improvement in RX single stream
throughput when MTU >=2048.
- With LRO enabled, we observed >10% improvement in RX single stream
throughput when MTU >=1460.
- No regressions were observed.

Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-5-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agogve: Allow ethtool to configure rx_buf_len
Ankit Garg [Thu, 6 Nov 2025 19:27:45 +0000 (11:27 -0800)] 
gve: Allow ethtool to configure rx_buf_len

Add support for getting and setting the RX buffer length via the
ethtool ring parameters (`ethtool -g`/`-G`). The driver restricts the
allowed buffer length to 2048 (SZ_2K) by default and allows 4096 (SZ_4K)
based on device options.

As XDP is only supported when the `rx_buf_len` is 2048, the driver now
enforces this in two places:
1.  In `gve_xdp_set`, rejecting XDP programs if the current buffer
    length is not 2048.
2.  In `gve_set_rx_buf_len_config`, rejecting buffer length changes if XDP
    is loaded and the new length is not 2048.

Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-4-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agogve: Use extack to log xdp config verification errors
Ankit Garg [Thu, 6 Nov 2025 19:27:44 +0000 (11:27 -0800)] 
gve: Use extack to log xdp config verification errors

Plumb extack as it allows us to send more detailed error messages back
and append 'gve' suffix to method name per convention.

NL_SET_ERR_MSG_FMT_MOD doesn't support format string longer than 80
chars so keeping netdev warning with actual queue count details.

Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-3-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agogve: Decouple header split from RX buffer length
Ankit Garg [Thu, 6 Nov 2025 19:27:43 +0000 (11:27 -0800)] 
gve: Decouple header split from RX buffer length

Previously, enabling header split via `gve_set_hsplit_config` also
implicitly changed the RX buffer length to 4K (if supported by the
device). This coupled two settings that should be orthogonal; this patch
removes that side effect.

After this change, `gve_set_hsplit_config` only toggles the header
split configuration. The RX buffer length is no longer affected and
must be configured independently.

Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-2-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-stmmac-ingenic-convert-to-set_phy_intf_sel'
Jakub Kicinski [Tue, 11 Nov 2025 01:30:43 +0000 (17:30 -0800)] 
Merge branch 'net-stmmac-ingenic-convert-to-set_phy_intf_sel'

Russell King says:

====================
net: stmmac: ingenic: convert to set_phy_intf_sel()

Convert ingenic to use the new ->set_phy_intf_sel() method that was
recently introduced in net-next.

This is the largest of the conversions, as there is scope for cleanups
along with the conversion.
====================

Link: https://patch.msgid.link/aQ2tgEu-dudzlZlg@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: use ->set_phy_intf_sel()
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:26 +0000 (08:29 +0000)] 
net: stmmac: ingenic: use ->set_phy_intf_sel()

Rather than placing the phy_intf_sel() setup in the ->init() method,
move it to the new ->set_phy_intf_sel() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqY-0000000Djrn-1D6H@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: pass ingenic_mac struct rather than plat_dat
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:21 +0000 (08:29 +0000)] 
net: stmmac: ingenic: pass ingenic_mac struct rather than plat_dat

It no longer makes sense to pass a pointer to struct
plat_stmmacenet_data when calling the set_mode() methods to only use it
to get a pointer to the ingenic_mac structure that we already had in
the caller. Simplify this by passing the struct ingenic_mac pointer.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqT-0000000Djrh-0ka3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: simplify x2000 mac_set_mode()
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:16 +0000 (08:29 +0000)] 
net: stmmac: ingenic: simplify x2000 mac_set_mode()

As per the previous commit, we have validated that the phy_intf_sel
value is one that is permissible for this SoC, so there is no need to
handle invalid PHY interface modes. We can also apply the other
configuration based upon the phy_intf_sel value rather than the
PHY interface mode.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqO-0000000Djrb-0DPN@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: simplify mac_set_mode() methods
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:10 +0000 (08:29 +0000)] 
net: stmmac: ingenic: simplify mac_set_mode() methods

x1000, x1600 and x1830 only accept RMII mode. PHY_INTF_SEL_RMII is only
selected with PHY_INTERFACE_MODE_RMII, and PHY_INTF_SEL_RMII has been
validated by the SoC's .valid_phy_intf_sel bitmask. Thus, checking the
interface mode in these functions becomes unnecessary. Remove these.

jz4775 is similar, except for a greater set of PHY_INTF_SEL_x valies.
Also remove the switch statement here.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqI-0000000DjrV-3ygL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: move "MAC PHY control register" debug
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:05 +0000 (08:29 +0000)] 
net: stmmac: ingenic: move "MAC PHY control register" debug

Move the printing of the MAC PHY control register interface mode
setting into ingenic_set_phy_intf_sel(), and use phy_modes() to
print the string rather than using the enum name.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHqD-0000000DjrP-3aaU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: use stmmac_get_phy_intf_sel()
Russell King (Oracle) [Fri, 7 Nov 2025 08:29:00 +0000 (08:29 +0000)] 
net: stmmac: ingenic: use stmmac_get_phy_intf_sel()

Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result against the SoC specific
supported phy_intf_sel values, and pass into the SoC specific
set_mode() methods, replacing the local phy_intf_sel variable. This
provides the value for the MACPHYC_PHY_INFT_MASK field.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHq8-0000000DjrJ-2NRK@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: prep PHY_INTF_SEL_x field after switch()
Russell King (Oracle) [Fri, 7 Nov 2025 08:28:55 +0000 (08:28 +0000)] 
net: stmmac: ingenic: prep PHY_INTF_SEL_x field after switch()

Move the preparation of the PHY_INTF_SEL_x bitfield out of the switch()
statement such that it only appears once.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHq3-0000000DjrD-1u8O@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: use PHY_INTF_SEL_x directly
Russell King (Oracle) [Fri, 7 Nov 2025 08:28:50 +0000 (08:28 +0000)] 
net: stmmac: ingenic: use PHY_INTF_SEL_x directly

Use the PHY_INTF_SEL_x values directly in each of the mac_set_mode
methods rather than the driver private MACPHYC_PHY_INFT_x definitions.
Remove the MACPHYC_PHY_INFT_x definitions.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHpy-0000000Djr7-1R1m@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: use PHY_INTF_SEL_x to select PHY interface
Russell King (Oracle) [Fri, 7 Nov 2025 08:28:45 +0000 (08:28 +0000)] 
net: stmmac: ingenic: use PHY_INTF_SEL_x to select PHY interface

Use the common dwmac definitions for the PHY interface selection field.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHpt-0000000Djr1-0wwr@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: simplify jz4775 mac_set_mode()
Russell King (Oracle) [Fri, 7 Nov 2025 08:28:40 +0000 (08:28 +0000)] 
net: stmmac: ingenic: simplify jz4775 mac_set_mode()

All paths configure the transmit clock as an input. Move this out of
the switch() statement to simplify the code.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHpo-0000000Djqv-0RD4@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: ingenic: move ingenic_mac_init()
Russell King (Oracle) [Fri, 7 Nov 2025 08:28:34 +0000 (08:28 +0000)] 
net: stmmac: ingenic: move ingenic_mac_init()

Move ingenic_mac_init() to between variant specific set_mode()
implementations and ingenic_mac_probe(). No code changes.

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vHHpi-0000000Djqp-4910@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agousbnet: Add support for Byte Queue Limits (BQL)
Simon Schippers [Thu, 6 Nov 2025 17:56:15 +0000 (18:56 +0100)] 
usbnet: Add support for Byte Queue Limits (BQL)

In the current implementation, usbnet uses a fixed tx_qlen of:

USB2: 60 * 1518 bytes = 91.08 KB
USB3: 60 * 5 * 1518 bytes = 454.80 KB

Such large transmit queues can be problematic, especially for cellular
modems. For example, with a typical celluar link speed of 10 Mbit/s, a
fully occupied USB3 transmit queue results in:

454.80 KB / (10 Mbit/s / 8 bit/byte) = 363.84 ms

of additional latency.

This patch adds support for Byte Queue Limits (BQL) [1] to dynamically
manage the transmit queue size and reduce latency without sacrificing
throughput.

Testing was performed on various devices using the usbnet driver for
packet transmission:

- DELOCK 66045: USB3 to 2.5 GbE adapter (ax88179_178a)
- DELOCK 61969: USB2 to 1 GbE adapter (asix)
- Quectel RM520: 5G modem (qmi_wwan)
- USB2 Android tethering (cdc_ncm)

No performance degradation was observed for iperf3 TCP or UDP traffic,
while latency for a prioritized ping application was significantly
reduced. For example, using the USB3 to 2.5 GbE adapter, which was fully
utilized by iperf3 UDP traffic, the prioritized ping was improved from
1.6 ms to 0.6 ms. With the same setup but with a 100 Mbit/s Ethernet
connection, the prioritized ping was improved from 35 ms to 5 ms.

[1] https://lwn.net/Articles/469652/

Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106175615.26948-1-simon.schippers@tu-dortmund.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotg3: Fix num of RX queues being reported by ethtool
Breno Leitao [Fri, 7 Nov 2025 10:36:59 +0000 (02:36 -0800)] 
tg3: Fix num of RX queues being reported by ethtool

Using num_online_cpus() to report number of queues is actually not
correct, as reported by Michael[1].

netif_get_num_default_rss_queues() was used to replace num_online_cpus()
in the past, but tg3 ethtool callbacks didn't get converted. Doing it
now.

Link: https://lore.kernel.org/all/CACKFLim7ruspmqvjr6bNRq5Z_XXVk3vVaLZOons7kMCzsEG23A@mail.gmail.com/#t
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20251107-tg3_counts-v1-1-337fe5c8ccb7@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-dsa-b53-add-support-for-bcm5389-97-98-and-bcm63xx-arl-formats'
Jakub Kicinski [Tue, 11 Nov 2025 01:11:09 +0000 (17:11 -0800)] 
Merge branch 'net-dsa-b53-add-support-for-bcm5389-97-98-and-bcm63xx-arl-formats'

Jonas Gorski says:

====================
net: dsa: b53: add support for BCM5389/97/98 and BCM63XX ARL formats

Currently b53 assumes that all switches apart from BCM5325/5365 use the
same ARL formats, but there are actually multiple formats in use.

Older switches use a format apparently introduced with BCM5387/BCM5389,
while newer chips use a format apparently introduced with BCM5395.

Note that these numbers are not linear, BCM5397/BCM5398 use the older
format.

In addition to that the switches integrated into BCM63XX SoCs use their
own format. While accessing these normal read/write ARL entries are the
same format as BCM5389 one, the search format is different.

So in order to support all these different format, split all code
accessing these entries into chip-family specific functions, and collect
them in appropriate arl ops structs to keep the code cleaner.

Sent as net-next since the ARL accesses have never worked before, and
the extensive refactoring might be too much to warrant a fix.
====================

Link: https://patch.msgid.link/20251107080749.26936-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: add support for bcm63xx ARL entry format
Jonas Gorski [Fri, 7 Nov 2025 08:07:49 +0000 (09:07 +0100)] 
net: dsa: b53: add support for bcm63xx ARL entry format

The ARL registers of BCM63XX embedded switches are somewhat unique. The
normal ARL table access registers have the same format as BCM5389, but
the ARL search registers differ:

* SRCH_CTL is at the same offset of BCM5389, but 16 bits wide. It does
  not have more fields, just needs to be accessed by a 16 bit read.
* SRCH_RSLT_MACVID and SRCH_RSLT are aligned to 32 bit, and have shifted
  offsets.
* SRCH_RSLT has a different format than the normal ARL data entry
  register.
* There is only one set of ENTRY_N registers, implying a 1 bin layout.

So add appropriate ops for bcm63xx and let it use it.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-9-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: add support for 5389/5397/5398 ARL entry format
Jonas Gorski [Fri, 7 Nov 2025 08:07:48 +0000 (09:07 +0100)] 
net: dsa: b53: add support for 5389/5397/5398 ARL entry format

BCM5389, BCM5397 and BCM5398 use a different ARL entry format with just
a 16 bit fwdentry register, as well as different search control and data
offsets.

So add appropriate ops for them and switch those chips to use them.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-8-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: move ARL entry functions into ops struct
Jonas Gorski [Fri, 7 Nov 2025 08:07:47 +0000 (09:07 +0100)] 
net: dsa: b53: move ARL entry functions into ops struct

Now that the differences in ARL entry formats are neatly contained into
functions per chip family, wrap them into an ops struct and add wrapper
functions to access them.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-7-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: split reading search entry into their own functions
Jonas Gorski [Fri, 7 Nov 2025 08:07:46 +0000 (09:07 +0100)] 
net: dsa: b53: split reading search entry into their own functions

Split reading search entries into a function for each format.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-6-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: provide accessors for accessing ARL_SRCH_CTL
Jonas Gorski [Fri, 7 Nov 2025 08:07:45 +0000 (09:07 +0100)] 
net: dsa: b53: provide accessors for accessing ARL_SRCH_CTL

In order to more easily support more formats, move accessing
ARL_SRCH_CTL into helper functions to contain the differences.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-5-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: move writing ARL entries into their own functions
Jonas Gorski [Fri, 7 Nov 2025 08:07:44 +0000 (09:07 +0100)] 
net: dsa: b53: move writing ARL entries into their own functions

Move writing ARL entries into individual functions for each format.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-4-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: move reading ARL entries into their own function
Jonas Gorski [Fri, 7 Nov 2025 08:07:43 +0000 (09:07 +0100)] 
net: dsa: b53: move reading ARL entries into their own function

Instead of duplicating the whole code iterating over all bins for
BCM5325, factor out reading and parsing the entry into its own
functions, and name it the modern one after the first chip with that ARL
format, (BCM53)95.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-3-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: dsa: b53: b53_arl_read{,25}(): use the entry for comparision
Jonas Gorski [Fri, 7 Nov 2025 08:07:42 +0000 (09:07 +0100)] 
net: dsa: b53: b53_arl_read{,25}(): use the entry for comparision

Align the b53_arl_read{,25}() functions by consistently using the
parsed arl entry instead of parsing the raw registers again.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251107080749.26936-2-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf...
Jakub Kicinski [Tue, 11 Nov 2025 00:43:51 +0000 (16:43 -0800)] 
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-11-10

We've added 19 non-merge commits during the last 3 day(s) which contain
a total of 22 files changed, 1345 insertions(+), 197 deletions(-).

The main changes are:

1) Preserve skb metadata after a TC BPF program has changed the skb,
   from Jakub Sitnicki.
   This allows a TC program at the end of a TC filter chain to still see
   the skb metadata, even if another TC program at the front of the chain
   has changed the skb using BPF helpers.

2) Initial af_smc bpf_struct_ops support to control the smc specific
   syn/synack options, from D. Wythe.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  bpf/selftests: Add selftest for bpf_smc_hs_ctrl
  net/smc: bpf: Introduce generic hook for handshake flow
  bpf: Export necessary symbols for modules with struct_ops
  selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
  selftests/bpf: Cover skb metadata access after change_head/tail helper
  selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
  selftests/bpf: Cover skb metadata access after vlan push/pop helper
  selftests/bpf: Expect unclone to preserve skb metadata
  selftests/bpf: Dump skb metadata on verification failure
  selftests/bpf: Verify skb metadata in BPF instead of userspace
  bpf: Make bpf_skb_change_head helper metadata-safe
  bpf: Make bpf_skb_change_proto helper metadata-safe
  bpf: Make bpf_skb_adjust_room metadata-safe
  bpf: Make bpf_skb_vlan_push helper metadata-safe
  bpf: Make bpf_skb_vlan_pop helper metadata-safe
  vlan: Make vlan_remove_tag return nothing
  bpf: Unclone skb head on bpf_dynptr_write to skb metadata
  net: Preserve metadata on pskb_expand_head
  net: Helper to move packet data and metadata after skb_push/pull
====================

Link: https://patch.msgid.link/20251110232427.3929291-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: ravb: Correct bad check of timestamp control flags
Niklas Söderlund [Fri, 7 Nov 2025 20:01:00 +0000 (21:01 +0100)] 
net: ravb: Correct bad check of timestamp control flags

When converting the Renesas network drivers to use flags from enum
hwtstamp_rx_filters to control when to timestamp packages instead of a
driver specific schema with bit-wise flags an error was made.

The bit-wise driver specific flags correct logic to set get_ts was:

  q: RAVB_BE + tstamp_rx_ctrl: 0 => 0
  q: RAVB_NC + tstamp_rx_ctrl: 0 => 0
  q: RAVB_BE + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_V2_L2_EVENT => 0
  q: RAVB_NC + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_V2_L2_EVENT => 1
  q: RAVB_BE + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_ALL => 1
  q: RAVB_NC + tstamp_rx_ctrl: RAVB_RXTSTAMP_TYPE_ALL => 1

The converted logic to use enum flags mapped tstamp_rx_ctrl as

  0 to HWTSTAMP_FILTER_NONE
  RAVB_RXTSTAMP_TYPE_V2_L2_EVENT to HWTSTAMP_FILTER_PTP_V2_L2_EVENT
  RAVB_RXTSTAMP_TYPE_ALL to HWTSTAMP_FILTER_ALL

But the logic was incorrectly changed to:

  q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_NONE => 1 (error)
  q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_NONE => 0
  q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_PTP_V2_L2_EVENT => 0
  q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_PTP_V2_L2_EVENT => 1
  q: RAVB_BE + tstamp_rx_ctrl: HWTSTAMP_FILTER_ALL => 1
  q: RAVB_NC + tstamp_rx_ctrl: HWTSTAMP_FILTER_ALL => 0 (error)

This change restores the converted flag check to the correct logic of
the bit-wise driver specific flags.

Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/linux-renesas-soc/aQ4xSv9629XF-Bt3@horms.kernel.org/
Fixes: 16e2e6cf75e6 ("net: ravb: Use common defines for time stamping control")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251107200100.3637869-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoptp: ocp: Document sysfs output format for backward compatibility
Zhongqiu Han [Fri, 7 Nov 2025 07:45:33 +0000 (15:45 +0800)] 
ptp: ocp: Document sysfs output format for backward compatibility

Add a comment to ptp_ocp_tty_show() explaining that the sysfs output
intentionally does not include a trailing newline. This is required for
backward compatibility with existing userspace software that reads the
sysfs attribute and uses the value directly as a device path.

A previous attempt to add a newline to align with common kernel
conventions broke userspace applications that were opening device paths
like "/dev/ttyS4\n" instead of "/dev/ttyS4", resulting in ENOENT errors.

This comment prevents future attempts to "fix" this behavior, which would
break existing userspace applications.

Link: https://lore.kernel.org/netdev/20251030124519.1828058-1-zhongqiu.han@oss.qualcomm.com/
Link: https://lore.kernel.org/netdev/aef3b850-5f38-4c28-a018-3b0006dc2f08@linux.dev/
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251107074533.416048-1-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agosctp: Don't inherit do_auto_asconf in sctp_clone_sock().
Kuniyuki Iwashima [Thu, 6 Nov 2025 22:34:06 +0000 (22:34 +0000)] 
sctp: Don't inherit do_auto_asconf in sctp_clone_sock().

syzbot reported list_del(&sp->auto_asconf_list) corruption
in sctp_destroy_sock().

The repro calls setsockopt(SCTP_AUTO_ASCONF, 1) to a SCTP
listener, calls accept(), and close()s the child socket.

setsockopt(SCTP_AUTO_ASCONF, 1) sets sp->do_auto_asconf
to 1 and links sp->auto_asconf_list to a per-netns list.

Both fields are placed after sp->pd_lobby in struct sctp_sock,
and sctp_copy_descendant() did not copy the fields before the
cited commit.

Also, sctp_clone_sock() did not set them explicitly.

In addition, sctp_auto_asconf_init() is called from
sctp_sock_migrate(), but it initialises the fields only
conditionally.

The two fields relied on __GFP_ZERO added in sk_alloc(),
but sk_clone() does not use it.

Let's clear newsp->do_auto_asconf in sctp_clone_sock().

[0]:
list_del corruption. prev->next should be ffff8880799e9148, but was ffff8880799e8808. (prev=ffff88803347d9f8)
kernel BUG at lib/list_debug.c:64!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 6008 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025
RIP: 0010:__list_del_entry_valid_or_report+0x15a/0x190 lib/list_debug.c:62
Code: e8 7b 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 7c ee 92 fd 49 8b 17 48 c7 c7 80 0a bf 8b 48 89 de 4c 89 f9 e8 07 c6 94 fc 90 <0f> 0b 4c 89 f7 e8 4c 26 71 fd 43 80 3c 2c 00 74 08 4c 89 ff e8 4d
RSP: 0018:ffffc90003067ad8 EFLAGS: 00010246
RAX: 000000000000006d RBX: ffff8880799e9148 RCX: b056988859ee6e00
RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
RBP: dffffc0000000000 R08: ffffc90003067807 R09: 1ffff9200060cf00
R10: dffffc0000000000 R11: fffff5200060cf01 R12: 1ffff1100668fb3f
R13: dffffc0000000000 R14: ffff88803347d9f8 R15: ffff88803347d9f8
FS:  00005555823e5500(0000) GS:ffff88812613e000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000000480 CR3: 00000000741ce000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 __list_del_entry_valid include/linux/list.h:132 [inline]
 __list_del_entry include/linux/list.h:223 [inline]
 list_del include/linux/list.h:237 [inline]
 sctp_destroy_sock+0xb4/0x370 net/sctp/socket.c:5163
 sk_common_release+0x75/0x310 net/core/sock.c:3961
 sctp_close+0x77e/0x900 net/sctp/socket.c:1550
 inet_release+0x144/0x190 net/ipv4/af_inet.c:437
 __sock_release net/socket.c:662 [inline]
 sock_close+0xc3/0x240 net/socket.c:1455
 __fput+0x44c/0xa70 fs/file_table.c:468
 task_work_run+0x1d4/0x260 kernel/task_work.c:227
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop+0xe9/0x130 kernel/entry/common.c:43
 exit_to_user_mode_prepare include/linux/irq-entry-common.h:225 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:175 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:210 [inline]
 do_syscall_64+0x2bd/0xfa0 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: 16942cf4d3e3 ("sctp: Use sk_clone() in sctp_accept().")
Reported-by: syzbot+ba535cb417f106327741@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/690d2185.a70a0220.22f260.000e.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20251106223418.1455510-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-smc-introduce-smc_hs_ctrl'
Martin KaFai Lau [Mon, 10 Nov 2025 19:10:09 +0000 (11:10 -0800)] 
Merge branch 'net-smc-introduce-smc_hs_ctrl'

D. Wythe says:

====================
net/smc: Introduce smc_hs_ctrl

This patch aims to introduce BPF injection capabilities for SMC and
includes a self-test to ensure code stability.

Since the SMC protocol isn't ideal for every situation, especially
short-lived ones, most applications can't guarantee the absence of
such scenarios. Consequently, applications may need specific strategies
to decide whether to use SMC. For example, an application might limit SMC
usage to certain IP addresses or ports.

To maintain the principle of transparent replacement, we want applications
to remain unaffected even if they need specific SMC strategies. In other
words, they should not require recompilation of their code.

Additionally, we need to ensure the scalability of strategy implementation.
While using socket options or sysctl might be straightforward, it could
complicate future expansions.

Fortunately, BPF addresses these concerns effectively. Users can write
their own strategies in eBPF to determine whether to use SMC, and they can
easily modify those strategies in the future.

This is a rework of the series from [1]. Changes since [1] are limited to
the SMC parts:

1. Rename smc_ops to smc_hs_ctrl and change interface name.
2. Squash SMC patches, removing standalone non-BPF hook capability.
3. Fix typos

[1]: https://lore.kernel.org/bpf/20250123015942.94810-1-alibuda@linux.alibaba.com/#t

v2 -> v1:
  - Removed the fixes patch, which have already been merged on current branch.
  - Fixed compilation warning of smc_call_hsbpf() when CONFIG_SMC_HS_CTRL_BPF
    is not enabled.
  - Changed the default value of CONFIG_SMC_HS_CTRL_BPF to Y.
  - Fix typo and renamed some variables

v3 -> v2:
  - Removed the libbpf patch, which have already been merged on current branch.
  - Fixed sparse warning of smc_call_hsbpf() and xchg().

v4 -> v3:
   - Rebased on latest bpf-next, updated SMC loopback config from SMC_LO to DIBS_LO
     per upstream changes.

v5 -> v4:
    - Removed the redundant sk parameter from smc_call_hsbpf
    - Reject registration when bpf_link is set, link support will be added in the
      future.
    - Updated selftests with new test heplers.
====================

Link: https://patch.msgid.link/20251107035632.115950-1-alibuda@linux.alibaba.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
4 weeks agobpf/selftests: Add selftest for bpf_smc_hs_ctrl
D. Wythe [Fri, 7 Nov 2025 03:56:32 +0000 (11:56 +0800)] 
bpf/selftests: Add selftest for bpf_smc_hs_ctrl

This tests introduces a tiny smc_hs_ctrl for filtering SMC connections
based on IP pairs, and also adds a realistic topology model to verify it.

Also, we can only use SMC loopback under CI test, so an additional
configuration needs to be enabled.

Follow the steps below to run this test.

make -C tools/testing/selftests/bpf
cd tools/testing/selftests/bpf
sudo ./test_progs -t smc

Results shows:
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Tested-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251107035632.115950-4-alibuda@linux.alibaba.com
4 weeks agonet/smc: bpf: Introduce generic hook for handshake flow
D. Wythe [Fri, 7 Nov 2025 03:56:31 +0000 (11:56 +0800)] 
net/smc: bpf: Introduce generic hook for handshake flow

The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.

As a subsequent enhancement, to introduce a new generic hook that
allows decisions on whether to use SMC or not at runtime, including
but not limited to local/remote IP address or ports.

User can write their own implememtion via bpf_struct_ops now to choose
whether to use SMC or not before TCP 3rd handshake to be comleted.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20251107035632.115950-3-alibuda@linux.alibaba.com
4 weeks agobpf: Export necessary symbols for modules with struct_ops
D. Wythe [Fri, 7 Nov 2025 03:56:30 +0000 (11:56 +0800)] 
bpf: Export necessary symbols for modules with struct_ops

Exports three necessary symbols for implementing struct_ops with
tristate subsystem.

To hold or release refcnt of struct_ops refcnt by inline funcs
bpf_try_module_get and bpf_module_put which use bpf_struct_ops_get(put)
conditionally.

And to copy obj name from one to the other with effective checks by
bpf_obj_name_cpy.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com
4 weeks agoMerge branch 'make-tc-bpf-helpers-preserve-skb-metadata'
Martin KaFai Lau [Mon, 10 Nov 2025 18:52:33 +0000 (10:52 -0800)] 
Merge branch 'make-tc-bpf-helpers-preserve-skb-metadata'

Jakub Sitnicki says:

====================
Make TC BPF helpers preserve skb metadata

Changes in v4:
- Fix copy-paste bug in check_metadata() test helper (AI review)
- Add "out of scope" section (at the bottom)
- Link to v3: https://lore.kernel.org/r/20251026-skb-meta-rx-path-v3-0-37cceebb95d3@cloudflare.com

Changes in v3:
- Use the already existing BPF_STREAM_STDERR const in tests (Martin)
- Unclone skb head on bpf_dynptr_write to skb metadata (patch 3) (Martin)
- Swap order of patches 1 & 2 to refer to skb_postpush_data_move() in docs
- Mention in skb_data_move() docs how to move just the metadata
- Note in pskb_expand_head() docs to move metadata after skb_push() (Jakub)
- Link to v2: https://lore.kernel.org/r/20251019-skb-meta-rx-path-v2-0-f9a58f3eb6d6@cloudflare.com

Changes in v2:
- Tweak WARN_ON_ONCE check in skb_data_move() (patch 2)
- Convert all tests to verify skb metadata in BPF (patches 9-10)
- Add test coverage for modified BPF helpers (patches 12-15)
- Link to RFCv1: https://lore.kernel.org/r/20250929-skb-meta-rx-path-v1-0-de700a7ab1cb@cloudflare.com

This patch set continues our work [1] to allow BPF programs and user-space
applications to attach multiple bytes of metadata to packets via the
XDP/skb metadata area.

The focus of this patch set it to ensure that skb metadata remains intact
when packets pass through a chain of TC BPF programs that call helpers
which operate on skb head.

Currently, several helpers that either adjust the skb->data pointer or
reallocate skb->head do not preserve metadata at its expected location,
that is immediately in front of the MAC header. These are:

- bpf_skb_adjust_room
- bpf_skb_change_head
- bpf_skb_change_proto
- bpf_skb_change_tail
- bpf_skb_vlan_pop
- bpf_skb_vlan_push

In TC BPF context, metadata must be moved whenever skb->data changes to
keep the skb->data_meta pointer valid. I don't see any way around
it. Creative ideas how to avoid that would be very welcome.

With that in mind, we can patch the helpers in at least two different ways:

1. Integrate metadata move into header move

   Replace the existing memmove, which follows skb_push/pull, with a helper
   that moves both headers and metadata in a single call. This avoids an
   extra memmove but reduces transparency.

        skb_pull(skb, len);
-       memmove(skb->data, skb->data - len, n);
+       skb_postpull_data_move(skb, len, n);
        skb->mac_header += len;

        skb_push(skb, len)
-       memmove(skb->data, skb->data + len, n);
+       skb_postpush_data_move(skb, len, n);
        skb->mac_header -= len;

2. Move metadata separately

   Add a dedicated metadata move after the header move. This is more
   explicit but costs an additional memmove.

        skb_pull(skb, len);
        memmove(skb->data, skb->data - len, n);
+       skb_metadata_postpull_move(skb, len);
        skb->mac_header += len;

        skb_push(skb, len)
+       skb_metadata_postpush_move(skb, len);
        memmove(skb->data, skb->data + len, n);
        skb->mac_header -= len;

This patch set implements option (1), expecting that "you can have just one
memmove" will be the most obvious feedback, while readability is a,
somewhat subjective, matter of taste, which I don't claim to have ;-)

The structure of the patch set is as follows:

- patches 1-4 prepare ground for safe-proofing the BPF helpers
- patches 5-9 modify the BPF helpers to preserve skb metadata
- patches 10-11 prepare ground for metadata tests with BPF helper calls
- patches 12-16 adapt and expand tests to cover the made changes

Out of scope for this series:
- safe-proofing tunnel & tagging devices - VLAN, GRE, ...
  (next in line, in development preview at [2])
- metadata access after packet foward
  (to do after Rx path - once metadata reliably reaches sk_filter)

Thanks,
-jkbs

[1] https://lore.kernel.org/all/20250814-skb-metadata-thru-dynptr-v7-0-8a39e636e0fb@cloudflare.com/
[2] https://github.com/jsitnicki/linux/commits/skb-meta/safeproof-netdevs/
====================

Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-0-5ceb08a9b37b@cloudflare.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
4 weeks agoselftests/bpf: Cover skb metadata access after bpf_skb_change_proto
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:53 +0000 (21:19 +0100)] 
selftests/bpf: Cover skb metadata access after bpf_skb_change_proto

Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_proto(), which modifies packet headroom to accommodate
different IP header sizes.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-16-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Cover skb metadata access after change_head/tail helper
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:52 +0000 (21:19 +0100)] 
selftests/bpf: Cover skb metadata access after change_head/tail helper

Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_head() and bpf_skb_change_tail(), which modify packet
headroom/tailroom and can trigger head reallocation.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-15-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:51 +0000 (21:19 +0100)] 
selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room

Add a test to verify that skb metadata remains accessible after calling
bpf_skb_adjust_room(), which modifies the packet headroom and can trigger
head reallocation.

The helper expects an Ethernet frame carrying an IP packet so switch test
packet identification by source MAC address since we can no longer rely on
Ethernet proto being set to zero.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-14-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Cover skb metadata access after vlan push/pop helper
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:50 +0000 (21:19 +0100)] 
selftests/bpf: Cover skb metadata access after vlan push/pop helper

Add a test to verify that skb metadata remains accessible after calling
bpf_skb_vlan_push() and bpf_skb_vlan_pop(), which modify the packet
headroom.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-13-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Expect unclone to preserve skb metadata
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:49 +0000 (21:19 +0100)] 
selftests/bpf: Expect unclone to preserve skb metadata

Since pskb_expand_head() no longer clears metadata on unclone, update tests
for cloned packets to expect metadata to remain intact.

Also simplify the clone_dynptr_kept_on_{data,meta}_slice_write tests.
Creating an r/w dynptr slice is sufficient to trigger an unclone in the
prologue, so remove the extraneous writes to the data/meta slice.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-12-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Dump skb metadata on verification failure
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:48 +0000 (21:19 +0100)] 
selftests/bpf: Dump skb metadata on verification failure

Add diagnostic output when metadata verification fails to help with
troubleshooting test failures. Introduce a check_metadata() helper that
prints both expected and received metadata to the BPF program's stderr
stream on mismatch. The userspace test reads and dumps this stream on
failure.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-11-5ceb08a9b37b@cloudflare.com
4 weeks agoselftests/bpf: Verify skb metadata in BPF instead of userspace
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:47 +0000 (21:19 +0100)] 
selftests/bpf: Verify skb metadata in BPF instead of userspace

Move metadata verification into the BPF TC programs. Previously,
userspace read metadata from a map and verified it once at test end.

Now TC programs compare metadata directly using __builtin_memcmp() and
set a test_pass flag. This enables verification at multiple points during
test execution rather than a single final check.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-10-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Make bpf_skb_change_head helper metadata-safe
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:46 +0000 (21:19 +0100)] 
bpf: Make bpf_skb_change_head helper metadata-safe

Although bpf_skb_change_head() doesn't move packet data after skb_push(),
skb metadata still needs to be relocated. Use the dedicated helper to
handle it.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-9-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Make bpf_skb_change_proto helper metadata-safe
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:45 +0000 (21:19 +0100)] 
bpf: Make bpf_skb_change_proto helper metadata-safe

bpf_skb_change_proto reuses the same headroom operations as
bpf_skb_adjust_room, already updated to handle metadata safely.

The remaining step is to ensure that there is sufficient headroom to
accommodate metadata on skb_push().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-8-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Make bpf_skb_adjust_room metadata-safe
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:44 +0000 (21:19 +0100)] 
bpf: Make bpf_skb_adjust_room metadata-safe

bpf_skb_adjust_room() may push or pull bytes from skb->data. In both cases,
skb metadata must be moved accordingly to stay accessible.

Replace existing memmove() calls, which only move payload, with a helper
that also handles metadata. Reserve enough space for metadata to fit after
skb_push.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-7-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Make bpf_skb_vlan_push helper metadata-safe
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:43 +0000 (21:19 +0100)] 
bpf: Make bpf_skb_vlan_push helper metadata-safe

Use the metadata-aware helper to move packet bytes after skb_push(),
ensuring metadata remains valid after calling the BPF helper.

Also, take care to reserve sufficient headroom for metadata to fit.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-6-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Make bpf_skb_vlan_pop helper metadata-safe
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:42 +0000 (21:19 +0100)] 
bpf: Make bpf_skb_vlan_pop helper metadata-safe

Use the metadata-aware helper to move packet bytes after skb_pull(),
ensuring metadata remains valid after calling the BPF helper.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-5-5ceb08a9b37b@cloudflare.com
4 weeks agovlan: Make vlan_remove_tag return nothing
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:41 +0000 (21:19 +0100)] 
vlan: Make vlan_remove_tag return nothing

All callers ignore the return value.

Prepare to reorder memmove() after skb_pull() which is a common pattern.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-4-5ceb08a9b37b@cloudflare.com
4 weeks agobpf: Unclone skb head on bpf_dynptr_write to skb metadata
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:40 +0000 (21:19 +0100)] 
bpf: Unclone skb head on bpf_dynptr_write to skb metadata

Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when
the skb is cloned, preventing writes to metadata.

Remove this restriction and unclone the skb head on bpf_dynptr_write() to
metadata, now that the metadata is preserved during uncloning. This makes
metadata dynptr consistent with skb dynptr, allowing writes regardless of
whether the skb is cloned.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
4 weeks agonet: Preserve metadata on pskb_expand_head
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:39 +0000 (21:19 +0100)] 
net: Preserve metadata on pskb_expand_head

pskb_expand_head() copies headroom, including skb metadata, into the newly
allocated head, but then clears the metadata. As a result, metadata is lost
when BPF helpers trigger an skb head reallocation.

Let the skb metadata remain in the newly created copy of head.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-2-5ceb08a9b37b@cloudflare.com
4 weeks agonet: Helper to move packet data and metadata after skb_push/pull
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:38 +0000 (21:19 +0100)] 
net: Helper to move packet data and metadata after skb_push/pull

Lay groundwork for fixing BPF helpers available to TC(X) programs.

When skb_push() or skb_pull() is called in a TC(X) ingress BPF program, the
skb metadata must be kept in front of the MAC header. Otherwise, BPF
programs using the __sk_buff->data_meta pseudo-pointer lose access to it.

Introduce a helper that moves both metadata and a specified number of
packet data bytes together, suitable as a drop-in replacement for
memmove().

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-1-5ceb08a9b37b@cloudflare.com
4 weeks agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next...
Jakub Kicinski [Sat, 8 Nov 2025 03:15:36 +0000 (19:15 -0800)] 
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-11-06 (i40, ice, iavf)

Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for
controlling the maximum number of MAC address filters allowed by a VF. This
allows administrators to control the VF behavior in a more nuanced manner.

Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF
for VFs running on E800 series ice hardware. This improves performance and
scalability for virtualized network functions in 5G and LTE deployments.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add RSS support for GTP protocol via ethtool
  ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
  ice: improve TCAM priority handling for RSS profiles
  ice: implement GTP RSS context tracking and configuration
  ice: add virtchnl definitions and static data for GTP RSS
  ice: add flow parsing for GTP and new protocol field support
  i40e: support generic devlink param "max_mac_per_vf"
  devlink: Add new "max_mac_per_vf" generic device param
====================

Link: https://patch.msgid.link/20251106225321.1609605-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-stmmac-lpc18xx-and-sti-convert-to-set_phy_intf_sel'
Jakub Kicinski [Sat, 8 Nov 2025 03:05:51 +0000 (19:05 -0800)] 
Merge branch 'net-stmmac-lpc18xx-and-sti-convert-to-set_phy_intf_sel'

Russell King says:

====================
net: stmmac: lpc18xx and sti: convert to set_phy_intf_sel()

This series converts lpc18xx and sti to use the new .set_phy_intf_sel()
method.
====================

Link: https://patch.msgid.link/aQyEs4DAZRWpAz32@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: sti: use ->set_phy_intf_sel()
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:52 +0000 (11:23 +0000)] 
net: stmmac: sti: use ->set_phy_intf_sel()

Rather than placing the phy_intf_sel() setup in the ->init() method,
move it to the new ->set_phy_intf_sel() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5o-0000000DhQn-34JE@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: sti: use stmmac_get_phy_intf_sel()
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:47 +0000 (11:23 +0000)] 
net: stmmac: sti: use stmmac_get_phy_intf_sel()

Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result and use that to set the
control register to select the operating mode for the DWMAC core.

Note that when an unsupported interface mode is used, the array would
decode this to PHY_INTF_SEL_GMII_MII, so preserve this behaviour.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5j-0000000DhQh-2e0x@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: sti: use PHY_INTF_SEL_x directly
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:42 +0000 (11:23 +0000)] 
net: stmmac: sti: use PHY_INTF_SEL_x directly

Use the PHY_INTF_SEL_x values directly rather than the driver private
ETH_PHY_SEL_x values. Move the FIELD_PREP() into sti_dwmac_set_mode().
Use dwmac->interface directly.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5e-0000000DhQb-2B7I@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: sti: use PHY_INTF_SEL_x to select PHY interface
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:37 +0000 (11:23 +0000)] 
net: stmmac: sti: use PHY_INTF_SEL_x to select PHY interface

Use the common dwmac definitions for the PHY interface selection field,
adding MII_PHY_SEL_VAL() temporarily to avoid line wrapping.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5Z-0000000DhQV-1e2l@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: lpc18xx: use ->set_phy_intf_sel()
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:32 +0000 (11:23 +0000)] 
net: stmmac: lpc18xx: use ->set_phy_intf_sel()

Move the configuration of the dwmac PHY interface selection to the new
->set_phy_intf_sel() method.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5U-0000000DhQP-19Hd@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: lpc18xx: validate phy_intf_sel
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:27 +0000 (11:23 +0000)] 
net: stmmac: lpc18xx: validate phy_intf_sel

Validate the phy_intf_sel value rather than the PHY interface mode.
This will allow us to transition to the ->set_phy_intf_sel() method.
Note that this will allow GMII as well as MII as the phy_intf_sel
value is the same for both.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5P-0000000DhQJ-0Oi3@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: lpc18xx: use stmmac_get_phy_intf_sel()
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:21 +0000 (11:23 +0000)] 
net: stmmac: lpc18xx: use stmmac_get_phy_intf_sel()

Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, and use the result to program the ethernet mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5J-0000000DhQD-46Ob@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: lpc18xx: use PHY_INTF_SEL_x directly
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:16 +0000 (11:23 +0000)] 
net: stmmac: lpc18xx: use PHY_INTF_SEL_x directly

Use the PHY_INTF_SEL_x values directly rather than the driver private
LPC18XX_CREG_CREG6_ETHMODE_x definitions, and convert
LPC18XX_CREG_CREG6_ETHMODE_MASK to use GENMASK().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy5E-0000000DhQ7-3cuy@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: stmmac: lpc18xx: convert to PHY_INTF_SEL_x
Russell King (Oracle) [Thu, 6 Nov 2025 11:23:11 +0000 (11:23 +0000)] 
net: stmmac: lpc18xx: convert to PHY_INTF_SEL_x

Use the common dwmac definitions for the PHY interface selection field.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vGy59-0000000DhQ1-393H@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-use-skb_attempt_defer_free-in-napi_consume_skb'
Jakub Kicinski [Sat, 8 Nov 2025 03:02:42 +0000 (19:02 -0800)] 
Merge branch 'net-use-skb_attempt_defer_free-in-napi_consume_skb'

Eric Dumazet says:

====================
net: use skb_attempt_defer_free() in napi_consume_skb()

There is a lack of NUMA awareness and more generally lack
of slab caches affinity on TX completion path.

Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in RX path.

Only use this if the skb was allocated on the same cpu,
otherwise use skb_attempt_defer_free() so that the skb
is freed on the original cpu.

This removes contention on SLUB spinlocks and data structures,
and this makes sure that recycled sk_buff have correct NUMA locality.

After this series, I get ~50% improvement for an UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).

I will later refactor skb_attempt_defer_free()
to no longer have to care of skb_shared() and skb_release_head_state().
====================

Link: https://patch.msgid.link/20251106202935.1776179-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: increase skb_defer_max default to 128
Eric Dumazet [Thu, 6 Nov 2025 20:29:35 +0000 (20:29 +0000)] 
net: increase skb_defer_max default to 128

skb_defer_max value is very conservative, and can be increased
to avoid too many calls to kick_defer_list_purge().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: fix napi_consume_skb() with alien skbs
Eric Dumazet [Thu, 6 Nov 2025 20:29:34 +0000 (20:29 +0000)] 
net: fix napi_consume_skb() with alien skbs

There is a lack of NUMA awareness and more generally lack
of slab caches affinity on TX completion path.

Modern drivers are using napi_consume_skb(), hoping to cache sk_buff
in per-cpu caches so that they can be recycled in RX path.

Only use this if the skb was allocated on the same cpu,
otherwise use skb_attempt_defer_free() so that the skb
is freed on the original cpu.

This removes contention on SLUB spinlocks and data structures.

After this patch, I get ~50% improvement for an UDP tx workload
on an AMD EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues).

80 Mpps -> 120 Mpps.

Profiling one of the 32 cpus servicing NIC interrupts :

Before:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   98.00    0.00    0.00    0.00    2.00

    31.01%  ksoftirqd/511    [kernel.kallsyms]  [k] queued_spin_lock_slowpath
    12.45%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     5.60%  ksoftirqd/511    [kernel.kallsyms]  [k] __slab_free
     3.31%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     3.27%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     2.95%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_start
     2.52%  ksoftirqd/511    [kernel.kallsyms]  [k] fq_dequeue
     2.32%  ksoftirqd/511    [kernel.kallsyms]  [k] read_tsc
     2.25%  ksoftirqd/511    [kernel.kallsyms]  [k] build_detached_freelist
     2.15%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free
     2.11%  swapper          [kernel.kallsyms]  [k] __slab_free
     2.06%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_features_check
     2.01%  ksoftirqd/511    [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     1.97%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_data
     1.52%  ksoftirqd/511    [kernel.kallsyms]  [k] sock_wfree
     1.34%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
     1.23%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     1.15%  ksoftirqd/511    [kernel.kallsyms]  [k] dma_unmap_page_attrs
     1.11%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.03%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     0.94%  swapper          [kernel.kallsyms]  [k] kmem_cache_free
     0.93%  swapper          [kernel.kallsyms]  [k] read_tsc
     0.81%  ksoftirqd/511    [kernel.kallsyms]  [k] napi_consume_skb
     0.79%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
     0.77%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_free_head
     0.76%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     0.72%  swapper          [kernel.kallsyms]  [k] skb_release_data
     0.69%  swapper          [kernel.kallsyms]  [k] build_detached_freelist
     0.58%  ksoftirqd/511    [kernel.kallsyms]  [k] skb_release_head_state
     0.56%  ksoftirqd/511    [kernel.kallsyms]  [k] __put_partials
     0.55%  ksoftirqd/511    [kernel.kallsyms]  [k] kmem_cache_free_bulk
     0.48%  swapper          [kernel.kallsyms]  [k] sock_wfree

After:

mpstat -P 511 1 1

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     511    0.00    0.00    0.00    0.00    0.00   51.49    0.00    0.00    0.00   48.51

    19.10%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_hdr
    13.86%  swapper          [kernel.kallsyms]  [k] idpf_tx_clean_buf_ring
    10.80%  swapper          [kernel.kallsyms]  [k] skb_attempt_defer_free
    10.57%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_clean_all
     7.18%  swapper          [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     6.69%  swapper          [kernel.kallsyms]  [k] sock_wfree
     5.55%  swapper          [kernel.kallsyms]  [k] dma_unmap_page_attrs
     3.10%  swapper          [kernel.kallsyms]  [k] fq_dequeue
     3.00%  swapper          [kernel.kallsyms]  [k] skb_release_head_state
     2.73%  swapper          [kernel.kallsyms]  [k] read_tsc
     2.48%  swapper          [kernel.kallsyms]  [k] idpf_tx_splitq_start
     1.20%  swapper          [kernel.kallsyms]  [k] idpf_features_check
     1.13%  swapper          [kernel.kallsyms]  [k] napi_consume_skb
     0.93%  swapper          [kernel.kallsyms]  [k] idpf_vport_splitq_napi_poll
     0.64%  swapper          [kernel.kallsyms]  [k] native_send_call_func_single_ipi
     0.60%  swapper          [kernel.kallsyms]  [k] acpi_processor_ffh_cstate_enter
     0.53%  swapper          [kernel.kallsyms]  [k] io_idle
     0.43%  swapper          [kernel.kallsyms]  [k] netif_skb_features
     0.41%  swapper          [kernel.kallsyms]  [k] __direct_call_cpuidle_state_enter2
     0.40%  swapper          [kernel.kallsyms]  [k] native_irq_return_iret
     0.40%  swapper          [kernel.kallsyms]  [k] idpf_tx_buf_hw_update
     0.36%  swapper          [kernel.kallsyms]  [k] sched_clock_noinstr
     0.34%  swapper          [kernel.kallsyms]  [k] handle_softirqs
     0.32%  swapper          [kernel.kallsyms]  [k] net_rx_action
     0.32%  swapper          [kernel.kallsyms]  [k] dql_completed
     0.32%  swapper          [kernel.kallsyms]  [k] validate_xmit_skb
     0.31%  swapper          [kernel.kallsyms]  [k] skb_network_protocol
     0.29%  swapper          [kernel.kallsyms]  [k] skb_csum_hwoffload_help
     0.29%  swapper          [kernel.kallsyms]  [k] x2apic_send_IPI
     0.28%  swapper          [kernel.kallsyms]  [k] ktime_get
     0.24%  swapper          [kernel.kallsyms]  [k] __qdisc_run

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: allow skb_release_head_state() to be called multiple times
Eric Dumazet [Thu, 6 Nov 2025 20:29:33 +0000 (20:29 +0000)] 
net: allow skb_release_head_state() to be called multiple times

Currently, only skb dst is cleared (thanks to skb_dst_drop())

Make sure skb->destructor, conntrack and extensions are cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: add prefetch() in skb_defer_free_flush()
Eric Dumazet [Thu, 6 Nov 2025 08:55:00 +0000 (08:55 +0000)] 
net: add prefetch() in skb_defer_free_flush()

skb_defer_free_flush() is becoming more important these days.

Add a prefetch operation to reduce latency a bit on some
platforms like AMD EPYC 7B12.

On more recent cpus, a stall happens when reading skb_shinfo().
Avoiding it will require a more elaborate strategy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20251106085500.2438951-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'psp-track-stats-from-core-and-provide-a-driver-stats-api'
Jakub Kicinski [Sat, 8 Nov 2025 02:54:07 +0000 (18:54 -0800)] 
Merge branch 'psp-track-stats-from-core-and-provide-a-driver-stats-api'

Daniel Zahka says:

====================
psp: track stats from core and provide a driver stats api

This series introduces stats counters for psp. Device key rotations,
and so called 'stale-events' are common to all drivers and are tracked
by the core.

A driver facing api is provided for reporting stats required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Drivers must implement these stats.

Lastly, implementations of the driver stats api for mlx5 and netdevsim
are included.

Here is the output of running the psp selftest suite and then
printing out stats with the ynl cli on system with a psp-capable CX7:

  $ ./ksft-psp-stats/drivers/net/psp.py
  TAP version 13
  1..28
  ok 1 psp.test_case # SKIP Test requires IPv4 connectivity
  ok 2 psp.data_basic_send_v0_ip6
  ok 3 psp.test_case # SKIP Test requires IPv4 connectivity
  ok 4 psp.data_basic_send_v1_ip6
  ok 5 psp.test_case # SKIP Test requires IPv4 connectivity
  ok 6 psp.data_basic_send_v2_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-128')
  ok 7 psp.test_case # SKIP Test requires IPv4 connectivity
  ok 8 psp.data_basic_send_v3_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-256')
  ok 9 psp.test_case # SKIP Test requires IPv4 connectivity
  ok 10 psp.data_mss_adjust_ip6
  ok 11 psp.dev_list_devices
  ok 12 psp.dev_get_device
  ok 13 psp.dev_get_device_bad
  ok 14 psp.dev_rotate
  ok 15 psp.dev_rotate_spi
  ok 16 psp.assoc_basic
  ok 17 psp.assoc_bad_dev
  ok 18 psp.assoc_sk_only_conn
  ok 19 psp.assoc_sk_only_mismatch
  ok 20 psp.assoc_sk_only_mismatch_tx
  ok 21 psp.assoc_sk_only_unconn
  ok 22 psp.assoc_version_mismatch
  ok 23 psp.assoc_twice
  ok 24 psp.data_send_bad_key
  ok 25 psp.data_send_disconnect
  ok 26 psp.data_stale_key
  ok 27 psp.removal_device_rx # XFAIL Test only works on netdevsim
  ok 28 psp.removal_device_bi # XFAIL Test only works on netdevsim
  # Totals: pass:19 fail:0 xfail:2 xpass:0 skip:7 error:0
  #
  # Responder logs (0):
  # STDERR:
  #  Set PSP enable on device 1 to 0x3
  #  Set PSP enable on device 1 to 0x0

  $ ynl --family psp --dump get-stats
  [{'dev-id': 1,
              'key-rotations': 5,
              'rx-auth-fail': 21,
              'rx-bad': 0,
              'rx-bytes': 11844,
              'rx-error': 0,
              'rx-packets': 94,
              'stale-events': 6,
              'tx-bytes': 1128456,
              'tx-error': 0,
              'tx-packets': 780}]
====================

Link: https://patch.msgid.link/20251106002608.1578518-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonetdevsim: implement psp device stats
Daniel Zahka [Thu, 6 Nov 2025 00:26:06 +0000 (16:26 -0800)] 
netdevsim: implement psp device stats

For now only tx/rx packets/bytes are reported. This is not compliant
with the PSP Architecture Specification.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-6-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet/mlx5e: Add PSP stats support for Rx/Tx flows
Jakub Kicinski [Thu, 6 Nov 2025 00:26:05 +0000 (16:26 -0800)] 
net/mlx5e: Add PSP stats support for Rx/Tx flows

Add all statistics described under the "Implementation Requirements"
section of the PSP Architecture Specification:

Rx successfully decrypted PSP packets:
psp_rx_pkts  : Number of packets decrypted successfully
psp_rx_bytes : Number of bytes decrypted successfully

Rx PSP authentication failure statistics:
psp_rx_pkts_auth_fail  : Number of PSP packets that failed authentication
psp_rx_bytes_auth_fail : Number of PSP bytes that failed authentication

Rx PSP bad frame error statistics:
psp_rx_pkts_frame_err;
psp_rx_bytes_frame_err;

Rx PSP drop statistics:
psp_rx_pkts_drop  : Number of PSP packets dropped
psp_rx_bytes_drop : Number of PSP bytes dropped

Tx successfully encrypted PSP packets:
psp_tx_pkts  : Number of packets encrypted successfully
psp_tx_bytes : Number of bytes encrypted successfully

Tx drops:
tx_drop : Number of misc psp related drops

The above can be seen using the ynl cli:
./pyynl/cli.py  --spec netlink/specs/psp.yaml --dump get-stats

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-5-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agopsp: add stats from psp spec to driver facing api
Jakub Kicinski [Thu, 6 Nov 2025 00:26:04 +0000 (16:26 -0800)] 
psp: add stats from psp spec to driver facing api

Provide a driver api for reporting device statistics required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Use a warning to ensure drivers report stats required
by the spec.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoselftests: drv-net: psp: add assertions on core-tracked psp dev stats
Daniel Zahka [Thu, 6 Nov 2025 00:26:03 +0000 (16:26 -0800)] 
selftests: drv-net: psp: add assertions on core-tracked psp dev stats

Add assertions to existing test cases to cover key rotations and
'stale-events'.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-3-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agopsp: report basic stats from the core
Jakub Kicinski [Thu, 6 Nov 2025 00:26:02 +0000 (16:26 -0800)] 
psp: report basic stats from the core

Track and report stats common to all psp devices from the core. A
'stale-event' is when the core marks the rx state of an active
psp_assoc as incapable of authenticating psp encapsulated data.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20251106002608.1578518-2-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: phy: fixed_phy: shrink size of struct fixed_phy_status
Heiner Kallweit [Wed, 5 Nov 2025 22:09:17 +0000 (23:09 +0100)] 
net: phy: fixed_phy: shrink size of struct fixed_phy_status

All three members are effectively of type bool, so make this explicit
and shrink size of struct fixed_phy_status.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/9eca3d7e-fa64-4724-8fdc-f2c1a8f2ae8f@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'net-phy-add-open-alliance-tc14-10base-t1s-phy-cable-diagnostic-support'
Jakub Kicinski [Sat, 8 Nov 2025 02:52:36 +0000 (18:52 -0800)] 
Merge branch 'net-phy-add-open-alliance-tc14-10base-t1s-phy-cable-diagnostic-support'

Parthiban Veerasooran says:

====================
net: phy: Add Open Alliance TC14 10Base-T1S PHY cable diagnostic support

This patch series adds Open Alliance TC14 (OATC14) 10BASE-T1S cable
diagnostic feature support to the Linux kernel PHY subsystem and enable
this feature for Microchip LAN867x Rev.D0 PHYs. These patches provide
standardized cable test functionality for 10BASE-T1S Ethernet PHYs,
allowing users to perform cable diagnostics via ethtool.

Patch Summary:
1. add OATC14 10BASE-T1S PHY cable diagnostic support
- Implements support for the OATC14 cable diagnostic feature in
  Clause 45 PHYs.
- Adds functions to start a cable test and retrieve its status,
  mapping hardware results to ethtool codes.
- Exports these functions for use by PHY drivers.
- Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features.
  https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf

2. add cable diagnostic support for LAN867x Rev.D0
- Integrates the generic OATC14 cable test functions into the
  Microchip LAN867x Rev.D0 PHY driver.
- Enables ethtool cable diagnostics for this PHY, improving
  troubleshooting and maintenance.
====================

Link: https://patch.msgid.link/20251105051213.50443-1-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: phy: microchip_t1s:: add cable diagnostic support for LAN867x Rev.D0
Parthiban Veerasooran [Wed, 5 Nov 2025 05:12:13 +0000 (10:42 +0530)] 
net: phy: microchip_t1s:: add cable diagnostic support for LAN867x Rev.D0

Enable Open Alliance TC14 (OATC14) 10Base-T1S cable diagnostic feature
for Microchip LAN867x Rev.D0 PHY by implementing `cable_test_start` and
`cable_test_get_status` using the generic C45 functions. This allows the
`ethtool` utility to perform cable diagnostic tests directly on the PHY,
improving network troubleshooting and maintenance.

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20251105051213.50443-3-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: phy: phy-c45: add OATC14 10BASE-T1S PHY cable diagnostic support
Parthiban Veerasooran [Wed, 5 Nov 2025 05:12:12 +0000 (10:42 +0530)] 
net: phy: phy-c45: add OATC14 10BASE-T1S PHY cable diagnostic support

Add support for Open Alliance TC14 (OATC14) 10BASE-T1S PHYs cable
diagnostic feature.

This patch implements:
- genphy_c45_oatc14_cable_test_start() to initiate a cable test
- genphy_c45_oatc14_cable_test_get_status() to retrieve test results
- Helper function to map PHY cable test status to ethtool result codes
- Function declarations and exports for use by PHY drivers

This enables ethtool to report ok, open, short, and undetectable cable
conditions on OATC14 10Base-T1S PHYs.

Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features
Specification ref:
https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf

Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Link: https://patch.msgid.link/20251105051213.50443-2-parthiban.veerasooran@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: mana: Fix incorrect speed reported by debugfs
Erni Sri Satya Vennela [Wed, 5 Nov 2025 19:04:28 +0000 (11:04 -0800)] 
net: mana: Fix incorrect speed reported by debugfs

Once the netshaper is created for MANA, the current bandwidth
is reported in debugfs like this:

$ sudo ./tools/net/ynl/pyynl/cli.py \
  --spec Documentation/netlink/specs/net_shaper.yaml \
  --do set \
  --json '{"ifindex":'3',
           "handle":{ "scope": "netdev", "id":'1' },
           "bw-max": 200000000 }'
None

$ sudo cat /sys/kernel/debug/mana/1/vport0/current_speed
200

After the shaper  is deleted, it is expected to report
the maximum speed supported by the SKU. But currently it is
reporting 0, which is incorrect.

Fix this inconsistency, by resetting apc->speed to apc->max_speed
during deletion of the shaper object. This will improve
readability and debuggability.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1762369468-32570-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agonet: airoha: Add the capability to consume out-of-order DMA tx descriptors
Lorenzo Bianconi [Thu, 6 Nov 2025 12:53:23 +0000 (13:53 +0100)] 
net: airoha: Add the capability to consume out-of-order DMA tx descriptors

EN7581 and AN7583 SoCs are capable of DMA mapping non-linear tx skbs on
non-consecutive DMA descriptors. This feature is useful when multiple
flows are queued on the same hw tx queue since it allows to fully utilize
the available tx DMA descriptors and to avoid the starvation of
high-priority flow we have in the current codebase due to head-of-line
blocking introduced by low-priority flows.

Tested-by: Xuegang Lu <xuegang.lu@airoha.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20251106-airoha-tx-linked-list-v2-1-0706d4a322bd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: add net.ipv4.tcp_comp_sack_rtt_percent
Eric Dumazet [Thu, 6 Nov 2025 11:52:36 +0000 (11:52 +0000)] 
tcp: add net.ipv4.tcp_comp_sack_rtt_percent

TCP SACK compression has been added in 2018 in commit
5d9f4262b7ea ("tcp: add SACK compression").

It is working great for WAN flows (with large RTT).
Wifi in particular gets a significant boost _when_ ACK are suppressed.

Add a new sysctl so that we can tune the very conservative 5 % value
that has been used so far in this formula, so that small RTT flows
can benefit from this feature.

delay = min ( 5 % of RTT, 1 ms)

This patch adds new tcp_comp_sack_rtt_percent sysctl
to ease experiments and tuning.

Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl),
set the default value to 33 %.

Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ )

The rationale for 33% is basically to try to facilitate pipelining,
where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
flight" at all times:

+ 1 skb in the qdisc waiting to be sent by the NIC next
+ 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
+ 1 skb being received and aggregated by the receiver machine's
aggregation mechanism (some combination of LRO, GRO, and sack
compression)

Note that this is basically the same magic number (3) and the same
rationales as:

(a) tcp_tso_should_defer() ensuring that we defer sending data for no
longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
and
(b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
skbs to maintain pipelining and full throughput at low RTTs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoMerge branch 'tcp-clean-up-syn-ack-rto-code-and-apply-max-rto'
Jakub Kicinski [Sat, 8 Nov 2025 02:05:28 +0000 (18:05 -0800)] 
Merge branch 'tcp-clean-up-syn-ack-rto-code-and-apply-max-rto'

Kuniyuki Iwashima says:

====================
tcp: Clean up SYN+ACK RTO code and apply max RTO.

Patch 1 - 4 are misc cleanup.

Patch 5 applies max RTO to non-TFO SYN+ACK.

Patch 6 adds a test for max RTO of SYN+ACK.
====================

Link: https://patch.msgid.link/20251106003357.273403-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agoselftest: packetdrill: Add max RTO test for SYN+ACK.
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:45 +0000 (00:32 +0000)] 
selftest: packetdrill: Add max RTO test for SYN+ACK.

This script sets net.ipv4.tcp_rto_max_ms to 1000 and checks
if SYN+ACK RTO is capped at 1s for TFO and non-TFO.

Without the previous patch, the max RTO is applied to TFO
SYN+ACK only, and non-TFO SYN+ACK RTO increases exponentially.

  # selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt
  # TAP version 13
  # 1..2
  # tcp_rto_synack_rto_max.pkt:46: error handling packet: timing error:
     expected outbound packet at 5.091936 sec but happened at 6.107826 sec; tolerance 0.127974 sec
  # script packet:  5.091936 S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
  # actual packet:  6.107826 S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK>
  # not ok 1 ipv4
  # tcp_rto_synack_rto_max.pkt:46: error handling packet: timing error:
     expected outbound packet at 5.075901 sec but happened at 6.091841 sec; tolerance 0.127976 sec
  # script packet:  5.075901 S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
  # actual packet:  6.091841 S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK>
  # not ok 2 ipv6
  # # Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0
  not ok 49 selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt # exit=1

With the previous patch, all SYN+ACKs are retransmitted
after 1s.

  # selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt
  # TAP version 13
  # 1..2
  # ok 1 ipv4
  # ok 2 ipv6
  # # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
  ok 49 selftests: net/packetdrill: tcp_rto_synack_rto_max.pkt

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: Apply max RTO to non-TFO SYN+ACK.
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:44 +0000 (00:32 +0000)] 
tcp: Apply max RTO to non-TFO SYN+ACK.

Since commit 54a378f43425 ("tcp: add the ability to control
max RTO"), TFO SYN+ACK RTO is capped by the TFO full sk's
inet_csk(sk)->icsk_rto_max.

The value is inherited from the parent listener.

Let's apply the same cap to non-TFO SYN+ACK.

Note that req->rsk_listener is always non-NULL when we call
tcp_reqsk_timeout() in reqsk_timer_handler() or tcp_check_req().

It could be NULL for SYN cookie req, but we do not use
req->timeout then.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: Remove timeout arg from reqsk_timeout().
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:43 +0000 (00:32 +0000)] 
tcp: Remove timeout arg from reqsk_timeout().

reqsk_timeout() is always called with @timeout being TCP_RTO_MAX.

Let's remove the arg.

As a prep for the next patch, reqsk_timeout() is moved to tcp.h
and renamed to tcp_reqsk_timeout().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: Remove redundant init for req->num_timeout.
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:42 +0000 (00:32 +0000)] 
tcp: Remove redundant init for req->num_timeout.

Commit 5903123f662e ("tcp: Use BPF timeout setting for SYN ACK
RTO") introduced req->timeout and initialised it in 3 places:

  1. reqsk_alloc() sets 0
  2. inet_reqsk_alloc() sets TCP_TIMEOUT_INIT
  3. tcp_conn_request() sets tcp_timeout_init()

1. has been always redundant as 2. overwrites it immediately.

2. was necessary for TFO SYN+ACK but is no longer needed after
commit 8ea731d4c2ce ("tcp: Make SYN ACK RTO tunable by BPF
programs with TFO").

3. was moved to reqsk_queue_hash_req() in the previous patch.

Now, we always initialise req->timeout just before scheduling
the SYN+ACK timer:

  * For non-TFO SYN+ACK : reqsk_queue_hash_req()
  * For TFO SYN+ACK     : tcp_fastopen_create_child()

Let's remove the redundant initialisation of req->timeout in
reqsk_alloc() and inet_reqsk_alloc().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: Remove timeout arg from reqsk_queue_hash_req().
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:41 +0000 (00:32 +0000)] 
tcp: Remove timeout arg from reqsk_queue_hash_req().

inet_csk_reqsk_queue_hash_add() is no longer shared by DCCP.

We do not need to pass req->timeout down to reqsk_queue_hash_req().

Let's move tcp_timeout_init() from tcp_conn_request() to
reqsk_queue_hash_req().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
4 weeks agotcp: Call tcp_syn_ack_timeout() directly.
Kuniyuki Iwashima [Thu, 6 Nov 2025 00:32:40 +0000 (00:32 +0000)] 
tcp: Call tcp_syn_ack_timeout() directly.

Since DCCP has been removed, we do not need to use
request_sock_ops.syn_ack_timeout().

Let's call tcp_syn_ack_timeout() directly.

Now other function pointers of request_sock_ops are
protocol-dependent.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonetlink: specs: netdev add missing stats to qstat-get
Jakub Kicinski [Tue, 4 Nov 2025 23:23:44 +0000 (15:23 -0800)] 
netlink: specs: netdev add missing stats to qstat-get

Add missing entries in C attribute list.

Link: https://patch.msgid.link/20251104232348.1954349-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agoMerge branch 'net-renesas-cleanup-usage-of-gptp-flags'
Jakub Kicinski [Fri, 7 Nov 2025 01:38:27 +0000 (17:38 -0800)] 
Merge branch 'net-renesas-cleanup-usage-of-gptp-flags'

Niklas Söderlund says:

====================
net: renesas: Cleanup usage of gPTP flags

This series aim is to prepare for future work that will enable the use
of gPTP on R-Car RAVB on Gen4. Currently RAVB have a dedicated gPTP
implementation supported on Gen2 and Gen3 (ravb_ptp.c). For Gen4 a new
implementation that is already upstream (rcar_gen4_ptp.c) and used by
other Gen4 devices such as RTSN and RSWITCH is needed.

Unfortunately the design of the Gen2/Gen3 RAVB driver where driver
specific flags to control gPTP behavior have been mimicked in RTSN and
RSWITCH. This was OK as there was no overlap between the two gPTP
implementations. Now that RAVB needs to be able to use both having to
translate between driver specific flags and common net code flags
becomes even more cumbersome as there are two sets of driver specific
flags to pick from.

This series cleans this up for all Renesas drivers using gPTP by
removing all driver specific flags and using the common flags directly.
This simplifies drivers while at the same time prepare RAVB to be
extended with Gen4 support.

Patch 1/7 is a drive by patch where RSWITCH specific define was added in
the wrong header. Patch 2/7 removes a short-cut used in RTSN and RSWITCH
that prevents extending Gen4 support to RAVB without fuss. While patch
3/7 to 7/7 rework the Renesas drivers to use the common flags instead of
driver specific ones.

There is no intentional behavior change and only a small rework in logic
in the RAVB driver. Looking at patch 3/7, 4/7 and 7/7 one can clearly
see how the code have been copied from RAVB to the later implementations
in RTSN and RSWITCH.
====================

Link: https://patch.msgid.link/20251104222420.882731-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: ravb: Use common defines for time stamping control
Niklas Söderlund [Tue, 4 Nov 2025 22:24:20 +0000 (23:24 +0100)] 
net: ravb: Use common defines for time stamping control

Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.

The change from a device specific bit-field track variable to the common
enum datatypes forces us to touch the ravb_rx_rcar_hwstamp() in a non
trivial way. To make this cleaner and easier to understand expand the
nested conditions.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-8-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: ravb: Break out Rx hardware timestamping
Niklas Söderlund [Tue, 4 Nov 2025 22:24:19 +0000 (23:24 +0100)] 
net: ravb: Break out Rx hardware timestamping

Prepare for moving away from device specific bit-fields to track how to
do hardware Rx timestamping to using net common enums by breaking out
the timestamping to a helper function. This is done to create cleaner
code and prepare for easier changes improving the hardware timestapming.

There is no functional change.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-7-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: rcar_gen4_ptp: Remove unused defines
Niklas Söderlund [Tue, 4 Nov 2025 22:24:18 +0000 (23:24 +0100)] 
net: rcar_gen4_ptp: Remove unused defines

The driver specific flags to control packet time stamps have all been
replaced by values from enum hwtstamp_tx_types and enum
hwtstamp_rx_filters. Remove the driver specific flags as there are no
more users.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251104222420.882731-6-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: rtsn: Use common defines for time stamping control
Niklas Söderlund [Tue, 4 Nov 2025 22:24:17 +0000 (23:24 +0100)] 
net: rtsn: Use common defines for time stamping control

Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.

One thing to note is that the bit-wise and check in rtsn_rx() of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT is replaced with a not set check of
HWTSTAMP_FILTER_NONE. This is okay as the bit of device specific event
replaced was set for all modes except HWTSTAMP_FILTER_NONE.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-5-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: rswitch: Use common defines for time stamping control
Niklas Söderlund [Tue, 4 Nov 2025 22:24:16 +0000 (23:24 +0100)] 
net: rswitch: Use common defines for time stamping control

Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.

One thing to note is that the bit-wise and check in rswitch_rx() of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT is replaced with a not set check of
HWTSTAMP_FILTER_NONE. This is okay as the bit of device specific event
replaced was set for all modes except HWTSTAMP_FILTER_NONE.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251104222420.882731-4-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 weeks agonet: rcar_gen4_ptp: Move control fields to users
Niklas Söderlund [Tue, 4 Nov 2025 22:24:15 +0000 (23:24 +0100)] 
net: rcar_gen4_ptp: Move control fields to users

The struct rcar_gen4_ptp_private provides two fields for convenience of
its users, tstamp_tx_ctrl and tstamp_rx_ctrl. These fields are not used
by the rcar_gen4_ptp driver itself but only by the drivers using it.

Upcoming work will enable the RAVB driver currently only supporting gPTP
on pre-Gen4 SoCs to use the Gen4 implementation as well. To facilitate
this the convenience of having these fields in struct
rcar_gen4_ptp_private becomes a problem as the RAVB driver already have
it's own driver specific fields for the same thing.

Move the fields from struct rcar_gen4_ptp_private to each driver using
the Gen4 gPTP clocks own private data structures. There is no functional
change.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251104222420.882731-3-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>