Jason Xing [Fri, 31 Oct 2025 10:33:28 +0000 (18:33 +0800)]
xsk: add indirect call for xsk_destruct_skb
Eric proposed adding indirect call wrappers for UDP and saw a huge
improvement[1]; the same approach can also be applied in the xsk
scenario.
This patch adds an indirect call for xsk and improves copy-mode
performance by a stable ~1%, as observed with IXGBE loaded at
10Gb/sec. If the throughput grows, the positive effect will be
magnified. With this patch applied on top of the batch xmit series[2],
our internal application showed <5% improvement, although that result
is a little unstable.
Use INDIRECT wrappers to keep xsk_destruct_skb static as it used to
be when the mitigation config is off.
Be aware of the freeing path that can be very hot since the frequency
can reach around 2,000,000 times per second with the xdpsock test.
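The idea behind an indirect call wrapper can be sketched in userspace as
follows. This is a simplified model, not the kernel code: the macro mirrors
the shape of INDIRECT_CALL_1() without the likely() hint and without the
config gating, and struct fake_skb, hot_destruct(), other_destruct() and
run_destructor() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* If the pointer equals the expected target, call the target by name so
 * the compiler can emit a direct call; otherwise fall back to the
 * indirect call through the pointer. */
#define INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

struct fake_skb {
	int (*destructor)(struct fake_skb *skb);
	int freed;
};

/* Stand-in for the hot destructor (xsk_destruct_skb in the patch). */
static int hot_destruct(struct fake_skb *skb)
{
	skb->freed = 1;
	return 1;
}

/* Stand-in for any other destructor an skb might carry. */
static int other_destruct(struct fake_skb *skb)
{
	skb->freed = 1;
	return 2;
}

/* Freeing path: the common destructor is reached via a direct call. */
static int run_destructor(struct fake_skb *skb)
{
	assert(skb != NULL && skb->destructor != NULL);
	return INDIRECT_CALL_1(skb->destructor, hot_destruct, skb);
}
```

Both destructors still run correctly; only the way the common one is reached
changes, which is where the saving on a very hot path comes from.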
Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
isdn: kcapi: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
This patch series improves the management of the RX buffer length for
the DQO queue format in the gve driver. The goal is to make RX buffer
length config more explicit, easy to change, and performant by default.
We accomplish that in four patches:
1. Currently, the buffer length is implicitly coupled with the header
split setting, which is an unintuitive and restrictive design. The
first patch decouples the RX buffer length from the header split
configuration.
2. The second patch is a preparatory step for the third. It converts the XDP
config verification method to use extack for better error reporting.
3. The third patch exposes the `rx_buf_len` parameter to userspace via
ethtool, allowing users to directly view or modify the RX buffer length
if supported by the device.
4. The final patch improves the out-of-the-box RX single stream throughput
by >10% by changing the driver's default behavior to select the
maximum supported RX buffer length advertised by the device during
initialization.
====================
Ankit Garg [Thu, 6 Nov 2025 19:27:46 +0000 (11:27 -0800)]
gve: Default to max_rx_buffer_size for DQO if device supported
Change the driver's default behavior to prefer the largest available RX
buffer length supported by the device for DQO format, rather than always
using the hardcoded 2K default.
Previously, the driver would initialize with
`GVE_DEFAULT_RX_BUFFER_SIZE` (2K), even if the device advertised support
for a larger length (e.g., 4K).
Performance observations:
- With LRO disabled, we observed >10% improvement in RX single stream
throughput when MTU >=2048.
- With LRO enabled, we observed >10% improvement in RX single stream
throughput when MTU >=1460.
- No regressions were observed.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-5-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ankit Garg [Thu, 6 Nov 2025 19:27:45 +0000 (11:27 -0800)]
gve: Allow ethtool to configure rx_buf_len
Add support for getting and setting the RX buffer length via the
ethtool ring parameters (`ethtool -g`/`-G`). The driver restricts the
allowed buffer length to 2048 (SZ_2K) by default and allows 4096 (SZ_4K)
based on device options.
As XDP is only supported when the `rx_buf_len` is 2048, the driver now
enforces this in two places:
1. In `gve_xdp_set`, rejecting XDP programs if the current buffer
length is not 2048.
2. In `gve_set_rx_buf_len_config`, rejecting buffer length changes if XDP
is loaded and the new length is not 2048.
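The two checks above can be modelled roughly as below. This is a
hypothetical sketch, not the driver code: check_rx_buf_len(),
check_xdp_attach(), the dev_supports_4k flag and the use of -EINVAL as the
error code are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SZ_2K 2048
#define SZ_4K 4096

/* Model of a buffer length change request (the `ethtool -G` path). */
static int check_rx_buf_len(unsigned int new_len, bool dev_supports_4k,
			    bool xdp_loaded)
{
	if (new_len != SZ_2K && new_len != SZ_4K)
		return -EINVAL;		/* only 2K and 4K are modelled */
	if (new_len == SZ_4K && !dev_supports_4k)
		return -EINVAL;		/* 4K depends on device options */
	if (xdp_loaded && new_len != SZ_2K)
		return -EINVAL;		/* XDP requires 2K buffers */
	return 0;
}

/* Model of attaching an XDP program with the current buffer length. */
static int check_xdp_attach(unsigned int cur_len)
{
	return cur_len == SZ_2K ? 0 : -EINVAL;
}

/* Expected outcomes under the assumptions above. */
static void self_check(void)
{
	assert(check_rx_buf_len(SZ_2K, false, true) == 0);
	assert(check_rx_buf_len(SZ_4K, true, false) == 0);
	assert(check_rx_buf_len(SZ_4K, false, false) == -EINVAL);
	assert(check_rx_buf_len(SZ_4K, true, true) == -EINVAL);
	assert(check_xdp_attach(SZ_4K) == -EINVAL);
}
```

Checking in both directions keeps the invariant regardless of whether the
XDP program or the buffer length is changed last.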
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-4-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ankit Garg [Thu, 6 Nov 2025 19:27:43 +0000 (11:27 -0800)]
gve: Decouple header split from RX buffer length
Previously, enabling header split via `gve_set_hsplit_config` also
implicitly changed the RX buffer length to 4K (if supported by the
device). This coupled two settings that should be orthogonal; this patch
removes that side effect.
After this change, `gve_set_hsplit_config` only toggles the header
split configuration. The RX buffer length is no longer affected and
must be configured independently.
Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Jordan Rhee <jordanrhee@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Link: https://patch.msgid.link/20251106192746.243525-2-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: stmmac: ingenic: pass ingenic_mac struct rather than plat_dat
It no longer makes sense to pass a pointer to struct
plat_stmmacenet_data when calling the set_mode() methods to only use it
to get a pointer to the ingenic_mac structure that we already had in
the caller. Simplify this by passing the struct ingenic_mac pointer.
As per the previous commit, we have validated that the phy_intf_sel
value is one that is permissible for this SoC, so there is no need to
handle invalid PHY interface modes. We can also apply the other
configuration based upon the phy_intf_sel value rather than the
PHY interface mode.
x1000, x1600 and x1830 only accept RMII mode. PHY_INTF_SEL_RMII is only
selected with PHY_INTERFACE_MODE_RMII, and PHY_INTF_SEL_RMII has been
validated by the SoC's .valid_phy_intf_sel bitmask. Thus, checking the
interface mode in these functions becomes unnecessary. Remove these.
jz4775 is similar, except for a greater set of PHY_INTF_SEL_x values.
Also remove the switch statement here.
net: stmmac: ingenic: move "MAC PHY control register" debug
Move the printing of the MAC PHY control register interface mode
setting into ingenic_set_phy_intf_sel(), and use phy_modes() to
print the string rather than using the enum name.
net: stmmac: ingenic: use stmmac_get_phy_intf_sel()
Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result against the SoC specific
supported phy_intf_sel values, and pass into the SoC specific
set_mode() methods, replacing the local phy_intf_sel variable. This
provides the value for the MACPHYC_PHY_INFT_MASK field.
Use the PHY_INTF_SEL_x values directly in each of the mac_set_mode
methods rather than the driver private MACPHYC_PHY_INFT_x definitions.
Remove the MACPHYC_PHY_INFT_x definitions.
Such large transmit queues can be problematic, especially for cellular
modems. For example, with a typical cellular link speed of 10 Mbit/s, a
fully occupied USB3 transmit queue results in:
454.80 KB / (10 Mbit/s / 8 bit/byte) = 363.84 ms
of additional latency.
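The arithmetic above can be reproduced with a small helper. This is only an
illustrative sketch: queue_drain_ms() is a hypothetical name, and it assumes
decimal units (1 KB = 1000 bytes, 1 Mbit = 10^6 bits), which is what makes
the figures above come out exactly.

```c
#include <assert.h>

/* Time in milliseconds needed to drain queued_bytes over a link running
 * at link_bps bits per second. */
static double queue_drain_ms(double queued_bytes, double link_bps)
{
	assert(link_bps > 0.0);
	return queued_bytes * 8.0 / link_bps * 1000.0;
}
```

Called as queue_drain_ms(454800.0, 10e6), it yields the 363.84 ms quoted
above; the same queue on a ten times faster link drains in a tenth of the
time, which is why slow cellular links are hit hardest.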
This patch adds support for Byte Queue Limits (BQL) [1] to dynamically
manage the transmit queue size and reduce latency without sacrificing
throughput.
Testing was performed on various devices using the usbnet driver for
packet transmission:
No performance degradation was observed for iperf3 TCP or UDP traffic,
while latency for a prioritized ping application was significantly
reduced. For example, using the USB3 to 2.5 GbE adapter, which was fully
utilized by iperf3 UDP traffic, the prioritized ping was improved from
1.6 ms to 0.6 ms. With the same setup but with a 100 Mbit/s Ethernet
connection, the prioritized ping was improved from 35 ms to 5 ms.
====================
net: dsa: b53: add support for BCM5389/97/98 and BCM63XX ARL formats
Currently b53 assumes that all switches apart from BCM5325/5365 use the
same ARL formats, but there are actually multiple formats in use.
Older switches use a format apparently introduced with BCM5387/BCM5389,
while newer chips use a format apparently introduced with BCM5395.
Note that these numbers are not linear, BCM5397/BCM5398 use the older
format.
In addition, the switches integrated into BCM63XX SoCs use their own
format. While the normal read/write ARL entries use the same format as
the BCM5389 one, the search format is different.
So in order to support all these different formats, split all code
accessing these entries into chip-family specific functions, and collect
them in appropriate arl ops structs to keep the code cleaner.
Sent as net-next since the ARL accesses have never worked before, and
the extensive refactoring might be too much to warrant a fix.
====================
Jonas Gorski [Fri, 7 Nov 2025 08:07:49 +0000 (09:07 +0100)]
net: dsa: b53: add support for bcm63xx ARL entry format
The ARL registers of BCM63XX embedded switches are somewhat unique. The
normal ARL table access registers have the same format as BCM5389, but
the ARL search registers differ:
* SRCH_CTL is at the same offset as on BCM5389, but 16 bits wide. It does
not have more fields, it just needs to be accessed by a 16 bit read.
* SRCH_RSLT_MACVID and SRCH_RSLT are aligned to 32 bit, and have shifted
offsets.
* SRCH_RSLT has a different format than the normal ARL data entry
register.
* There is only one set of ENTRY_N registers, implying a 1 bin layout.
So add appropriate ops for bcm63xx and let it use it.
Jonas Gorski [Fri, 7 Nov 2025 08:07:48 +0000 (09:07 +0100)]
net: dsa: b53: add support for 5389/5397/5398 ARL entry format
BCM5389, BCM5397 and BCM5398 use a different ARL entry format with just
a 16 bit fwdentry register, as well as different search control and data
offsets.
So add appropriate ops for them and switch those chips to use them.
Jonas Gorski [Fri, 7 Nov 2025 08:07:47 +0000 (09:07 +0100)]
net: dsa: b53: move ARL entry functions into ops struct
Now that the differences in ARL entry formats are neatly contained in
functions per chip family, wrap them into an ops struct and add wrapper
functions to access them.
Jonas Gorski [Fri, 7 Nov 2025 08:07:43 +0000 (09:07 +0100)]
net: dsa: b53: move reading ARL entries into their own function
Instead of duplicating the whole code iterating over all bins for
BCM5325, factor out reading and parsing the entry into its own function
for each format, and name the modern one after the first chip with that
ARL format, (BCM53)95.
We've added 19 non-merge commits during the last 3 day(s) which contain
a total of 22 files changed, 1345 insertions(+), 197 deletions(-).
The main changes are:
1) Preserve skb metadata after a TC BPF program has changed the skb,
from Jakub Sitnicki.
This allows a TC program at the end of a TC filter chain to still see
the skb metadata, even if another TC program at the front of the chain
has changed the skb using BPF helpers.
2) Initial af_smc bpf_struct_ops support to control the smc specific
syn/synack options, from D. Wythe.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
bpf/selftests: Add selftest for bpf_smc_hs_ctrl
net/smc: bpf: Introduce generic hook for handshake flow
bpf: Export necessary symbols for modules with struct_ops
selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
selftests/bpf: Cover skb metadata access after change_head/tail helper
selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
selftests/bpf: Cover skb metadata access after vlan push/pop helper
selftests/bpf: Expect unclone to preserve skb metadata
selftests/bpf: Dump skb metadata on verification failure
selftests/bpf: Verify skb metadata in BPF instead of userspace
bpf: Make bpf_skb_change_head helper metadata-safe
bpf: Make bpf_skb_change_proto helper metadata-safe
bpf: Make bpf_skb_adjust_room metadata-safe
bpf: Make bpf_skb_vlan_push helper metadata-safe
bpf: Make bpf_skb_vlan_pop helper metadata-safe
vlan: Make vlan_remove_tag return nothing
bpf: Unclone skb head on bpf_dynptr_write to skb metadata
net: Preserve metadata on pskb_expand_head
net: Helper to move packet data and metadata after skb_push/pull
====================
net: ravb: Correct bad check of timestamp control flags
When converting the Renesas network drivers to use flags from enum
hwtstamp_rx_filters to control when to timestamp packets, instead of a
driver specific schema with bit-wise flags, an error was made.
The bit-wise driver specific flags correct logic to set get_ts was:
This change restores the converted flag check to the correct logic of
the bit-wise driver specific flags.
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/linux-renesas-soc/aQ4xSv9629XF-Bt3@horms.kernel.org/
Fixes: 16e2e6cf75e6 ("net: ravb: Use common defines for time stamping control")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Link: https://patch.msgid.link/20251107200100.3637869-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhongqiu Han [Fri, 7 Nov 2025 07:45:33 +0000 (15:45 +0800)]
ptp: ocp: Document sysfs output format for backward compatibility
Add a comment to ptp_ocp_tty_show() explaining that the sysfs output
intentionally does not include a trailing newline. This is required for
backward compatibility with existing userspace software that reads the
sysfs attribute and uses the value directly as a device path.
A previous attempt to add a newline to align with common kernel
conventions broke userspace applications that were opening device paths
like "/dev/ttyS4\n" instead of "/dev/ttyS4", resulting in ENOENT errors.
This comment prevents future attempts to "fix" this behavior, which would
break existing userspace applications.
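For illustration only, a userspace reader that wants to tolerate either
output format could strip a trailing line ending before using the value as
a path. This is a defensive sketch and not part of the patch; chomp() is a
hypothetical helper name.

```c
#include <assert.h>
#include <string.h>

/* Cut the string at the first '\r' or '\n', in place, and return the
 * resulting length. A string without a line ending is left unchanged. */
static size_t chomp(char *s)
{
	assert(s != NULL);
	s[strcspn(s, "\r\n")] = '\0';
	return strlen(s);
}
```

With this, both "/dev/ttyS4\n" and "/dev/ttyS4" end up as the same path.
Existing applications that do not do this are the reason the kernel side
has to keep the output without a newline.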
This patch aims to introduce BPF injection capabilities for SMC and
includes a self-test to ensure code stability.
Since the SMC protocol isn't ideal for every situation, especially
short-lived ones, most applications can't guarantee the absence of
such scenarios. Consequently, applications may need specific strategies
to decide whether to use SMC. For example, an application might limit SMC
usage to certain IP addresses or ports.
To maintain the principle of transparent replacement, we want applications
to remain unaffected even if they need specific SMC strategies. In other
words, they should not require recompilation of their code.
Additionally, we need to ensure the scalability of strategy implementation.
While using socket options or sysctl might be straightforward, it could
complicate future expansions.
Fortunately, BPF addresses these concerns effectively. Users can write
their own strategies in eBPF to determine whether to use SMC, and they can
easily modify those strategies in the future.
This is a rework of the series from [1]. Changes since [1] are limited to
the SMC parts:
v2 -> v1:
- Removed the fixes patch, which has already been merged on current branch.
- Fixed compilation warning of smc_call_hsbpf() when CONFIG_SMC_HS_CTRL_BPF
is not enabled.
- Changed the default value of CONFIG_SMC_HS_CTRL_BPF to Y.
- Fixed typos and renamed some variables.
v3 -> v2:
- Removed the libbpf patch, which has already been merged on current branch.
- Fixed sparse warning of smc_call_hsbpf() and xchg().
v4 -> v3:
- Rebased on latest bpf-next, updated SMC loopback config from SMC_LO to DIBS_LO
per upstream changes.
v5 -> v4:
- Removed the redundant sk parameter from smc_call_hsbpf
- Reject registration when bpf_link is set, link support will be added in the
future.
- Updated selftests with new test helpers.
====================
D. Wythe [Fri, 7 Nov 2025 03:56:31 +0000 (11:56 +0800)]
net/smc: bpf: Introduce generic hook for handshake flow
The introduction of IPPROTO_SMC enables eBPF programs to determine
whether to use SMC based on the context of socket creation, such as
network namespaces, PID and comm name, etc.
As a subsequent enhancement, introduce a new generic hook that
allows decisions on whether to use SMC or not at runtime, including
but not limited to local/remote IP address or ports.
Users can now write their own implementation via bpf_struct_ops to choose
whether to use SMC or not before the TCP 3rd handshake is completed.
====================
Make TC BPF helpers preserve skb metadata
Changes in v4:
- Fix copy-paste bug in check_metadata() test helper (AI review)
- Add "out of scope" section (at the bottom)
- Link to v3: https://lore.kernel.org/r/20251026-skb-meta-rx-path-v3-0-37cceebb95d3@cloudflare.com
Changes in v3:
- Use the already existing BPF_STREAM_STDERR const in tests (Martin)
- Unclone skb head on bpf_dynptr_write to skb metadata (patch 3) (Martin)
- Swap order of patches 1 & 2 to refer to skb_postpush_data_move() in docs
- Mention in skb_data_move() docs how to move just the metadata
- Note in pskb_expand_head() docs to move metadata after skb_push() (Jakub)
- Link to v2: https://lore.kernel.org/r/20251019-skb-meta-rx-path-v2-0-f9a58f3eb6d6@cloudflare.com
Changes in v2:
- Tweak WARN_ON_ONCE check in skb_data_move() (patch 2)
- Convert all tests to verify skb metadata in BPF (patches 9-10)
- Add test coverage for modified BPF helpers (patches 12-15)
- Link to RFCv1: https://lore.kernel.org/r/20250929-skb-meta-rx-path-v1-0-de700a7ab1cb@cloudflare.com
This patch set continues our work [1] to allow BPF programs and user-space
applications to attach multiple bytes of metadata to packets via the
XDP/skb metadata area.
The focus of this patch set is to ensure that skb metadata remains intact
when packets pass through a chain of TC BPF programs that call helpers
which operate on skb head.
Currently, several helpers that either adjust the skb->data pointer or
reallocate skb->head do not preserve metadata at its expected location,
that is immediately in front of the MAC header. These are:
In TC BPF context, metadata must be moved whenever skb->data changes to
keep the skb->data_meta pointer valid. I don't see any way around
it. Creative ideas how to avoid that would be very welcome.
With that in mind, we can patch the helpers in at least two different ways:
1. Integrate metadata move into header move
Replace the existing memmove, which follows skb_push/pull, with a helper
that moves both headers and metadata in a single call. This avoids an
extra memmove but reduces transparency.
This patch set implements option (1), expecting that "you can have just one
memmove" will be the most obvious feedback, while readability is a,
somewhat subjective, matter of taste, which I don't claim to have ;-)
The structure of the patch set is as follows:
- patches 1-4 prepare ground for safe-proofing the BPF helpers
- patches 5-9 modify the BPF helpers to preserve skb metadata
- patches 10-11 prepare ground for metadata tests with BPF helper calls
- patches 12-16 adapt and expand tests to cover the made changes
Out of scope for this series:
- safe-proofing tunnel & tagging devices - VLAN, GRE, ...
(next in line, in development preview at [2])
- metadata access after packet forward
(to do after Rx path - once metadata reliably reaches sk_filter)
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:53 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after bpf_skb_change_proto
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_proto(), which modifies packet headroom to accommodate
different IP header sizes.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:52 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after change_head/tail helper
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_change_head() and bpf_skb_change_tail(), which modify packet
headroom/tailroom and can trigger head reallocation.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:51 +0000 (21:19 +0100)]
selftests/bpf: Cover skb metadata access after bpf_skb_adjust_room
Add a test to verify that skb metadata remains accessible after calling
bpf_skb_adjust_room(), which modifies the packet headroom and can trigger
head reallocation.
The helper expects an Ethernet frame carrying an IP packet so switch test
packet identification by source MAC address since we can no longer rely on
Ethernet proto being set to zero.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:49 +0000 (21:19 +0100)]
selftests/bpf: Expect unclone to preserve skb metadata
Since pskb_expand_head() no longer clears metadata on unclone, update tests
for cloned packets to expect metadata to remain intact.
Also simplify the clone_dynptr_kept_on_{data,meta}_slice_write tests.
Creating an r/w dynptr slice is sufficient to trigger an unclone in the
prologue, so remove the extraneous writes to the data/meta slice.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:48 +0000 (21:19 +0100)]
selftests/bpf: Dump skb metadata on verification failure
Add diagnostic output when metadata verification fails to help with
troubleshooting test failures. Introduce a check_metadata() helper that
prints both expected and received metadata to the BPF program's stderr
stream on mismatch. The userspace test reads and dumps this stream on
failure.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:47 +0000 (21:19 +0100)]
selftests/bpf: Verify skb metadata in BPF instead of userspace
Move metadata verification into the BPF TC programs. Previously,
userspace read metadata from a map and verified it once at test end.
Now TC programs compare metadata directly using __builtin_memcmp() and
set a test_pass flag. This enables verification at multiple points during
test execution rather than a single final check.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:46 +0000 (21:19 +0100)]
bpf: Make bpf_skb_change_head helper metadata-safe
Although bpf_skb_change_head() doesn't move packet data after skb_push(),
skb metadata still needs to be relocated. Use the dedicated helper to
handle it.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:44 +0000 (21:19 +0100)]
bpf: Make bpf_skb_adjust_room metadata-safe
bpf_skb_adjust_room() may push or pull bytes from skb->data. In both cases,
skb metadata must be moved accordingly to stay accessible.
Replace existing memmove() calls, which only move payload, with a helper
that also handles metadata. Reserve enough space for metadata to fit after
skb_push.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:40 +0000 (21:19 +0100)]
bpf: Unclone skb head on bpf_dynptr_write to skb metadata
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when
the skb is cloned, preventing writes to metadata.
Remove this restriction and unclone the skb head on bpf_dynptr_write() to
metadata, now that the metadata is preserved during uncloning. This makes
metadata dynptr consistent with skb dynptr, allowing writes regardless of
whether the skb is cloned.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:39 +0000 (21:19 +0100)]
net: Preserve metadata on pskb_expand_head
pskb_expand_head() copies headroom, including skb metadata, into the newly
allocated head, but then clears the metadata. As a result, metadata is lost
when BPF helpers trigger an skb head reallocation.
Let the skb metadata remain in the newly created copy of head.
Jakub Sitnicki [Wed, 5 Nov 2025 20:19:38 +0000 (21:19 +0100)]
net: Helper to move packet data and metadata after skb_push/pull
Lay groundwork for fixing BPF helpers available to TC(X) programs.
When skb_push() or skb_pull() is called in a TC(X) ingress BPF program, the
skb metadata must be kept in front of the MAC header. Otherwise, BPF
programs using the __sk_buff->data_meta pseudo-pointer lose access to it.
Introduce a helper that moves both metadata and a specified number of
packet data bytes together, suitable as a drop-in replacement for
memmove().
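The combined move can be pictured with a small userspace model. This is a
sketch under stated assumptions, not the kernel helper: struct pkt and
pkt_postpush_move() are hypothetical, the buffer is a flat array, and only
the push case is modelled, with metadata stored directly in front of the
packet data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct pkt {
	unsigned char buf[64];
	size_t data;		/* offset of the first packet byte */
	size_t meta_len;	/* metadata lives in buf[data - meta_len, data) */
};

/* Call right after a push of len bytes, that is after p->data -= len.
 * Moves the metadata plus the first n bytes of old packet data toward
 * the buffer start in a single memmove(), so the metadata again ends
 * exactly where the packet data begins. */
static void pkt_postpush_move(struct pkt *p, size_t len, size_t n)
{
	size_t dst = p->data - p->meta_len;	/* new start of metadata */
	size_t src = dst + len;			/* old start of metadata */

	assert(p->data >= p->meta_len);
	assert(src + p->meta_len + n <= sizeof(p->buf));
	memmove(p->buf + dst, p->buf + src, p->meta_len + n);
}
```

A plain memmove() of only the n header bytes would leave the metadata
behind at its old offset; moving both in one call keeps them adjacent and
opens a gap of len bytes after the moved headers.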
Jakub Kicinski [Sat, 8 Nov 2025 03:15:36 +0000 (19:15 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2025-11-06 (i40e, ice, iavf)
Mohammad Heib introduces a new devlink parameter, max_mac_per_vf, for
controlling the maximum number of MAC address filters allowed by a VF. This
allows administrators to control the VF behavior in a more nuanced manner.
Aleksandr and Przemek add support for Receive Side Scaling of GTP to iAVF
for VFs running on E800 series ice hardware. This improves performance and
scalability for virtualized network functions in 5G and LTE deployments.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: add RSS support for GTP protocol via ethtool
ice: Extend PTYPE bitmap coverage for GTP encapsulated flows
ice: improve TCAM priority handling for RSS profiles
ice: implement GTP RSS context tracking and configuration
ice: add virtchnl definitions and static data for GTP RSS
ice: add flow parsing for GTP and new protocol field support
i40e: support generic devlink param "max_mac_per_vf"
devlink: Add new "max_mac_per_vf" generic device param
====================
Use stmmac_get_phy_intf_sel() to decode the PHY interface mode to the
phy_intf_sel value, validate the result and use that to set the
control register to select the operating mode for the DWMAC core.
Note that when an unsupported interface mode is used, the array would
decode this to PHY_INTF_SEL_GMII_MII, so preserve this behaviour.
Use the PHY_INTF_SEL_x values directly rather than the driver private
ETH_PHY_SEL_x values. Move the FIELD_PREP() into sti_dwmac_set_mode().
Use dwmac->interface directly.
Validate the phy_intf_sel value rather than the PHY interface mode.
This will allow us to transition to the ->set_phy_intf_sel() method.
Note that this will allow GMII as well as MII as the phy_intf_sel
value is the same for both.
Use the PHY_INTF_SEL_x values directly rather than the driver private
LPC18XX_CREG_CREG6_ETHMODE_x definitions, and convert
LPC18XX_CREG_CREG6_ETHMODE_MASK to use GENMASK().
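As a rough illustration of what the GENMASK() and FIELD_PREP() conversions
express, here is a userspace sketch. GENMASK32() and field_prep() are
simplified stand-ins written for this example, not the kernel macros, and
the bit positions used in the usage note are assumptions rather than values
taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Contiguous 32-bit mask covering bits h down to l, inclusive. */
#define GENMASK32(h, l) \
	(((~UINT32_C(0)) << (l)) & (~UINT32_C(0) >> (31 - (h))))

/* Place val into the field described by mask: shift it up to the
 * mask's lowest set bit, then drop any bits outside the mask. */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	uint32_t shift = 0;

	assert(mask != 0);
	while (!((mask >> shift) & 1))
		shift++;
	return (val << shift) & mask;
}
```

For a hypothetical 3-bit mode field in bits 2:0, GENMASK32(2, 0) evaluates
to 0x7, and field_prep(GENMASK32(7, 4), 0x4) evaluates to 0x40. Describing
the field by its bit range rather than a literal makes the register layout
visible at the definition.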
====================
psp: track stats from core and provide a driver stats api
This series introduces stats counters for psp. Device key rotations,
and so-called 'stale-events' are common to all drivers and are tracked
by the core.
A driver facing api is provided for reporting stats required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Drivers must implement these stats.
Lastly, implementations of the driver stats api for mlx5 and netdevsim
are included.
Here is the output of running the psp selftest suite and then
printing out stats with the ynl cli on a system with a psp-capable CX7:
$ ./ksft-psp-stats/drivers/net/psp.py
TAP version 13
1..28
ok 1 psp.test_case # SKIP Test requires IPv4 connectivity
ok 2 psp.data_basic_send_v0_ip6
ok 3 psp.test_case # SKIP Test requires IPv4 connectivity
ok 4 psp.data_basic_send_v1_ip6
ok 5 psp.test_case # SKIP Test requires IPv4 connectivity
ok 6 psp.data_basic_send_v2_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-128')
ok 7 psp.test_case # SKIP Test requires IPv4 connectivity
ok 8 psp.data_basic_send_v3_ip6 # SKIP ('PSP version not supported', 'hdr0-aes-gmac-256')
ok 9 psp.test_case # SKIP Test requires IPv4 connectivity
ok 10 psp.data_mss_adjust_ip6
ok 11 psp.dev_list_devices
ok 12 psp.dev_get_device
ok 13 psp.dev_get_device_bad
ok 14 psp.dev_rotate
ok 15 psp.dev_rotate_spi
ok 16 psp.assoc_basic
ok 17 psp.assoc_bad_dev
ok 18 psp.assoc_sk_only_conn
ok 19 psp.assoc_sk_only_mismatch
ok 20 psp.assoc_sk_only_mismatch_tx
ok 21 psp.assoc_sk_only_unconn
ok 22 psp.assoc_version_mismatch
ok 23 psp.assoc_twice
ok 24 psp.data_send_bad_key
ok 25 psp.data_send_disconnect
ok 26 psp.data_stale_key
ok 27 psp.removal_device_rx # XFAIL Test only works on netdevsim
ok 28 psp.removal_device_bi # XFAIL Test only works on netdevsim
# Totals: pass:19 fail:0 xfail:2 xpass:0 skip:7 error:0
#
# Responder logs (0):
# STDERR:
# Set PSP enable on device 1 to 0x3
# Set PSP enable on device 1 to 0x0
Jakub Kicinski [Thu, 6 Nov 2025 00:26:05 +0000 (16:26 -0800)]
net/mlx5e: Add PSP stats support for Rx/Tx flows
Add all statistics described under the "Implementation Requirements"
section of the PSP Architecture Specification:
Rx successfully decrypted PSP packets:
psp_rx_pkts : Number of packets decrypted successfully
psp_rx_bytes : Number of bytes decrypted successfully
Rx PSP authentication failure statistics:
psp_rx_pkts_auth_fail : Number of PSP packets that failed authentication
psp_rx_bytes_auth_fail : Number of PSP bytes that failed authentication
Rx PSP bad frame error statistics:
psp_rx_pkts_frame_err;
psp_rx_bytes_frame_err;
Rx PSP drop statistics:
psp_rx_pkts_drop : Number of PSP packets dropped
psp_rx_bytes_drop : Number of PSP bytes dropped
Tx successfully encrypted PSP packets:
psp_tx_pkts : Number of packets encrypted successfully
psp_tx_bytes : Number of bytes encrypted successfully
Tx drops:
tx_drop : Number of misc psp related drops
The above can be seen using the ynl cli:
./pyynl/cli.py --spec netlink/specs/psp.yaml --dump get-stats
Jakub Kicinski [Thu, 6 Nov 2025 00:26:04 +0000 (16:26 -0800)]
psp: add stats from psp spec to driver facing api
Provide a driver api for reporting device statistics required by the
"Implementation Requirements" section of the PSP Architecture
Specification. Use a warning to ensure drivers report stats required
by the spec.
Jakub Kicinski [Thu, 6 Nov 2025 00:26:02 +0000 (16:26 -0800)]
psp: report basic stats from the core
Track and report stats common to all psp devices from the core. A
'stale-event' is when the core marks the rx state of an active
psp_assoc as incapable of authenticating psp encapsulated data.
====================
net: phy: Add Open Alliance TC14 10Base-T1S PHY cable diagnostic support
This patch series adds Open Alliance TC14 (OATC14) 10BASE-T1S cable
diagnostic feature support to the Linux kernel PHY subsystem and enables
this feature for Microchip LAN867x Rev.D0 PHYs. These patches provide
standardized cable test functionality for 10BASE-T1S Ethernet PHYs,
allowing users to perform cable diagnostics via ethtool.
Patch Summary:
1. add OATC14 10BASE-T1S PHY cable diagnostic support
- Implements support for the OATC14 cable diagnostic feature in
Clause 45 PHYs.
- Adds functions to start a cable test and retrieve its status,
mapping hardware results to ethtool codes.
- Exports these functions for use by PHY drivers.
- Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features.
https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf
2. add cable diagnostic support for LAN867x Rev.D0
- Integrates the generic OATC14 cable test functions into the
Microchip LAN867x Rev.D0 PHY driver.
- Enables ethtool cable diagnostics for this PHY, improving
troubleshooting and maintenance.
====================
net: phy: microchip_t1s: add cable diagnostic support for LAN867x Rev.D0
Enable Open Alliance TC14 (OATC14) 10BASE-T1S cable diagnostic feature
for Microchip LAN867x Rev.D0 PHY by implementing `cable_test_start` and
`cable_test_get_status` using the generic C45 functions. This allows the
`ethtool` utility to perform cable diagnostic tests directly on the PHY,
improving network troubleshooting and maintenance.
net: phy: phy-c45: add OATC14 10BASE-T1S PHY cable diagnostic support
Add support for Open Alliance TC14 (OATC14) 10BASE-T1S PHYs cable
diagnostic feature.
This patch implements:
- genphy_c45_oatc14_cable_test_start() to initiate a cable test
- genphy_c45_oatc14_cable_test_get_status() to retrieve test results
- Helper function to map PHY cable test status to ethtool result codes
- Function declarations and exports for use by PHY drivers
This enables ethtool to report ok, open, short, and undetectable cable
conditions on OATC14 10BASE-T1S PHYs.
Open Alliance TC14 10BASE-T1S Advanced Diagnostic PHY Features
Specification ref:
https://opensig.org/wp-content/uploads/2025/06/OPEN_Alliance_10BASE-T1S_Advanced_PHY_features_for-automotive_Ethernet_V2.1b.pdf
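The status-to-result mapping described above can be sketched as follows. This is only an illustration: the DEMO_HW_* encoding is hypothetical and not taken from the OATC14 specification, the DEMO_RESULT_* names merely mirror the ethtool ETHTOOL_A_CABLE_RESULT_CODE_* names so the sketch builds standalone, and mapping "undetectable" to an unspecified result is an assumption.

```c
#include <stdint.h>

/* Stand-ins named after the ethtool cable result codes
 * (ETHTOOL_A_CABLE_RESULT_CODE_*); redefined locally for this sketch.
 */
enum demo_cable_result {
	DEMO_RESULT_UNSPEC,
	DEMO_RESULT_OK,
	DEMO_RESULT_OPEN,
	DEMO_RESULT_SAME_SHORT,
};

/* Hypothetical hardware status encoding, NOT taken from the spec. */
enum demo_hw_status {
	DEMO_HW_OK,
	DEMO_HW_OPEN,
	DEMO_HW_SHORT,
	DEMO_HW_UNDETECTABLE,
};

/* Map a raw PHY status value to a result code; anything unknown or
 * undetectable is reported as unspecified (an assumption of this sketch).
 */
static enum demo_cable_result demo_map_status(uint8_t hw_status)
{
	switch (hw_status) {
	case DEMO_HW_OK:
		return DEMO_RESULT_OK;
	case DEMO_HW_OPEN:
		return DEMO_RESULT_OPEN;
	case DEMO_HW_SHORT:
		return DEMO_RESULT_SAME_SHORT;
	case DEMO_HW_UNDETECTABLE:
	default:
		return DEMO_RESULT_UNSPEC;
	}
}
```

A driver-side get_status callback would call such a helper once the PHY signals that the test has finished and pass the result on to ethtool.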
After the shaper is deleted, the driver is expected to report
the maximum speed supported by the SKU. Currently it reports 0,
which is incorrect.
Fix this inconsistency by resetting apc->speed to apc->max_speed
when the shaper object is deleted. This will improve
readability and debuggability.
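A minimal sketch of the intended behavior, assuming a simplified port context that only carries the two fields named above. The struct and the demo_* helpers are hypothetical and are not the driver's actual code.

```c
#include <stdint.h>

/* Simplified stand-in for the port context; only the two fields
 * mentioned in the commit message are modeled.
 */
struct demo_port_ctx {
	uint32_t speed;     /* speed currently reported to user space */
	uint32_t max_speed; /* maximum speed supported by the SKU */
};

/* Installing a shaper limits the reported speed to the shaper rate. */
static void demo_shaper_set(struct demo_port_ctx *apc, uint32_t rate)
{
	apc->speed = rate;
}

/* Deleting the shaper falls back to the SKU maximum instead of 0. */
static void demo_shaper_delete(struct demo_port_ctx *apc)
{
	apc->speed = apc->max_speed;
}
```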
net: airoha: Add the capability to consume out-of-order DMA tx descriptors
EN7581 and AN7583 SoCs are capable of DMA mapping non-linear tx skbs on
non-consecutive DMA descriptors. This feature is useful when multiple
flows are queued on the same hw tx queue, since it allows the available
tx DMA descriptors to be fully utilized and avoids the starvation of
high-priority flows that the current codebase suffers from due to
head-of-line blocking introduced by low-priority flows.
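The idea of consuming descriptors out of order can be illustrated with a free-index stack: any completed descriptor can be returned and reused immediately, even while older descriptors are still in flight. This is a sketch under assumptions only; the demo_txq structure and helpers are hypothetical and do not reflect the data structures used by the airoha driver.

```c
#include <stdint.h>

#define DEMO_NDESC 4

/* Hypothetical tx queue bookkeeping: a stack of free descriptor indices. */
struct demo_txq {
	uint16_t free_idx[DEMO_NDESC];
	uint16_t nfree;
};

static void demo_txq_init(struct demo_txq *q)
{
	/* Fill the stack so that index 0 is handed out first. */
	for (uint16_t i = 0; i < DEMO_NDESC; i++)
		q->free_idx[i] = (uint16_t)(DEMO_NDESC - 1 - i);
	q->nfree = DEMO_NDESC;
}

/* Take any free descriptor; returns -1 when the queue is full. */
static int demo_txq_get(struct demo_txq *q)
{
	if (q->nfree == 0)
		return -1;
	return q->free_idx[--q->nfree];
}

/* Return a completed descriptor, regardless of completion order.
 * The caller must only return indices it previously obtained.
 */
static void demo_txq_put(struct demo_txq *q, uint16_t idx)
{
	q->free_idx[q->nfree++] = idx;
}
```

With a strictly in-order ring, descriptor 2 could not be reused until descriptors 0 and 1 had completed; with the free list it becomes available again as soon as it is returned.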
Eric Dumazet [Thu, 6 Nov 2025 11:52:36 +0000 (11:52 +0000)]
tcp: add net.ipv4.tcp_comp_sack_rtt_percent
TCP SACK compression was added in 2018 in commit 5d9f4262b7ea ("tcp: add SACK compression").
It is working great for WAN flows (with large RTT).
Wifi in particular gets a significant boost _when_ ACKs are suppressed.
Add a new sysctl so that we can tune the very conservative 5 % value
that has been used so far in this formula, so that small RTT flows
can benefit from this feature.
delay = min ( 5 % of RTT, 1 ms)
This patch adds new tcp_comp_sack_rtt_percent sysctl
to ease experiments and tuning.
Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl),
set the default value to 33 %.
The rationale for 33% is basically to try to facilitate pipelining,
where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so
that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in
flight" at all times:
+ 1 skb in the qdisc waiting to be sent by the NIC next
+ 1 skb being sent by the NIC (being serialized by the NIC out onto the wire)
+ 1 skb being received and aggregated by the receiver machine's
aggregation mechanism (some combination of LRO, GRO, and sack
compression)
Note that this is basically the same magic number (3) and the same
rationales as:
(a) tcp_tso_should_defer() ensuring that we defer sending data for no
longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3),
and
(b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO
skbs to maintain pipelining and full throughput at low RTTs
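The delay formula above, generalized from the fixed 5 % to a tunable percentage, can be sketched as below. This is a standalone illustration, not the kernel implementation: the function name, the parameter names and the use of microseconds for the RTT input are assumptions.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_USEC 1000ULL

/* delay = min(rtt_percent % of RTT, cap)
 * rtt_us:      smoothed RTT estimate in microseconds
 * rtt_percent: share of the RTT to wait (5 in the old formula, 33 by default now)
 * cap_ns:      upper bound in nanoseconds (1 ms in the commit message)
 */
static uint64_t demo_comp_sack_delay_ns(uint64_t rtt_us, uint32_t rtt_percent,
					uint64_t cap_ns)
{
	uint64_t delay_ns = rtt_us * DEMO_NSEC_PER_USEC * rtt_percent / 100;

	return delay_ns < cap_ns ? delay_ns : cap_ns;
}
```

For a 100 us RTT, 5 % yields a 5 us delay while 33 % yields 33 us, so small-RTT flows get a more useful compression window. For a 100 ms RTT both settings hit the 1 ms cap.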
====================
net: renesas: Cleanup usage of gPTP flags
The aim of this series is to prepare for future work that will enable
the use of gPTP on R-Car RAVB on Gen4. Currently RAVB has a dedicated
gPTP implementation supported on Gen2 and Gen3 (ravb_ptp.c). Gen4 needs
a new implementation that is already upstream (rcar_gen4_ptp.c) and used
by other Gen4 devices such as RTSN and RSWITCH.
Unfortunately, the design of the Gen2/Gen3 RAVB driver, where driver
specific flags control gPTP behavior, has been mimicked in RTSN and
RSWITCH. This was OK as long as there was no overlap between the two
gPTP implementations. Now that RAVB needs to be able to use both, having
to translate between driver specific flags and common net code flags
becomes even more cumbersome, as there are two sets of driver specific
flags to pick from.
This series cleans this up for all Renesas drivers using gPTP by
removing all driver specific flags and using the common flags directly.
This simplifies the drivers while at the same time preparing RAVB to be
extended with Gen4 support.
Patch 1/7 is a drive-by patch for an RSWITCH specific define that was
added in the wrong header. Patch 2/7 removes a shortcut used in RTSN and
RSWITCH that prevents extending Gen4 support to RAVB without fuss.
Patches 3/7 to 7/7 rework the Renesas drivers to use the common flags
instead of driver specific ones.
There is no intentional behavior change and only a small rework of the
logic in the RAVB driver. Looking at patches 3/7, 4/7 and 7/7, one can
clearly see how the code has been copied from RAVB to the later
implementations in RTSN and RSWITCH.
====================
net: ravb: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
The change from a device specific bit-field tracking variable to the
common enum datatypes forces us to touch ravb_rx_rcar_hwstamp() in a
non-trivial way. To make this cleaner and easier to understand, expand
the nested conditions.
Prepare for moving away from device specific bit-fields that track how
to do hardware Rx timestamping to net common enums by breaking out the
timestamping to a helper function. This is done to create cleaner code
and to prepare for easier changes improving the hardware timestamping.
The driver specific flags to control packet time stamps have all been
replaced by values from enum hwtstamp_tx_types and enum
hwtstamp_rx_filters. Remove the driver specific flags as there are no
more users.
net: rtsn: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
One thing to note is that the bitwise AND check of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT in rtsn_rx() is replaced with a
check that the filter is not HWTSTAMP_FILTER_NONE. This is okay because
the device specific event bit was set for all modes except
HWTSTAMP_FILTER_NONE.
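The equivalence claimed above can be checked with a small standalone model. Everything here is a sketch under assumptions: the DEMO_* names are local stand-ins for the common rx filter enum and for the removed driver specific flags, the bit values are illustrative, and only two non-NONE filter modes are modeled.

```c
#include <stdbool.h>

/* Local stand-ins for the common rx filter values. */
enum demo_rx_filter {
	DEMO_FILTER_NONE,
	DEMO_FILTER_ALL,
	DEMO_FILTER_PTP_V2_L2_EVENT,
};

/* Illustrative stand-ins for the removed driver specific flags. */
#define DEMO_RXTSTAMP_ENABLED		(1u << 0)
#define DEMO_RXTSTAMP_TYPE_V2_L2_EVENT	(1u << 1)
#define DEMO_RXTSTAMP_TYPE_ALL		(DEMO_RXTSTAMP_TYPE_V2_L2_EVENT | (1u << 2))

/* Old scheme: translate the common filter into driver flags. */
static unsigned int demo_old_flags(enum demo_rx_filter filter)
{
	switch (filter) {
	case DEMO_FILTER_PTP_V2_L2_EVENT:
		return DEMO_RXTSTAMP_ENABLED | DEMO_RXTSTAMP_TYPE_V2_L2_EVENT;
	case DEMO_FILTER_ALL:
		return DEMO_RXTSTAMP_ENABLED | DEMO_RXTSTAMP_TYPE_ALL;
	default:
		return 0;
	}
}

/* Old rx-path check: bitwise AND against the L2 event bit. */
static bool demo_old_check(enum demo_rx_filter filter)
{
	return (demo_old_flags(filter) & DEMO_RXTSTAMP_TYPE_V2_L2_EVENT) != 0;
}

/* New rx-path check: compare the common filter value directly. */
static bool demo_new_check(enum demo_rx_filter filter)
{
	return filter != DEMO_FILTER_NONE;
}
```

Because the L2 event bit is part of every enabled mode in this model, both checks agree for every filter value, which is why the translation layer can be dropped.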
net: rswitch: Use common defines for time stamping control
Instead of translating to/from driver specific flags for packet time
stamp control use the common flags directly. This simplifies the driver
as the translating code can be removed while at the same time making it
clear the flags are not flags written to hardware registers.
One thing to note is that the bitwise AND check of
RCAR_GEN4_RXTSTAMP_TYPE_V2_L2_EVENT in rswitch_rx() is replaced with a
check that the filter is not HWTSTAMP_FILTER_NONE. This is okay because
the device specific event bit was set for all modes except
HWTSTAMP_FILTER_NONE.
The struct rcar_gen4_ptp_private provides two fields for convenience of
its users, tstamp_tx_ctrl and tstamp_rx_ctrl. These fields are not used
by the rcar_gen4_ptp driver itself but only by the drivers using it.
Upcoming work will enable the RAVB driver, which currently only supports
gPTP on pre-Gen4 SoCs, to use the Gen4 implementation as well. For that
work, the convenience of having these fields in struct
rcar_gen4_ptp_private becomes a problem, as the RAVB driver already has
its own driver specific fields for the same thing.
Move the fields from struct rcar_gen4_ptp_private to the private data
structure of each driver using the Gen4 gPTP clock. There is no
functional change.