Add Xsk counterparts for XDP_TX buffer sending and completion.
The same base structures and functions used from the libeth_xdp core,
with adjustments to that XSk Rx always operates on &xdp_buff_xsk for
both head and frags. And unlike regular Rx, here unlikely() are used
for frags, as the header split gives no benefits for XSk Rx, at
least for now.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
libeth: xdp: add RSS hash hint and XDP features setup helpers
End the XDP section by adding helpers to setup XDP features, flipping
.ndo_xdp_xmit() support at runtime (in case when it's not always on),
and calculating the queue clean/refill threshold.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
libeth: xdp: add templates for building driver-side callbacks
Defining driver-specific functions to pass to libeth_xdp functions can
induce boilerplates and/or look a bit cryptic with all those layers of
indirection. On the other hand, this indirection is needed to allow
compilers to uninline big functions even when passed to __always_inline
helpers (too much inlining also hurts performance in some cases), plus
to reuse some XDP helpers in XSk code.
Add macros to quickly build them, with the detailed kdoc. They take
names of the actual callbacks for filling a Tx descriptor and other
purely HW-specific things and wrap them appropriately.
LIBETH_XDP_DEFINE_{BEGIN,END}() is needed for GCC 8+ unfortunately to
let the drivers control which functions will be static and which global
without hitting `-Wold-style-declaration`.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
libeth: xdp: add XDP prog run and verdict result handling
Running a prog and handling the verdicts, up to napi_gro_receive()
is also pretty generic code not really differing between vendors
(except for Tx descriptor filling and Rx descriptor parsing).
Define a couple inlines to do that. The inline callbacks a driver
needs to pass is mentioned above: Tx descriptor filling for XDP_TX,
populating skb with the descriptor data for XDP_PASS, finalizing
XDPSQs after the polling loop for XDP_TX (kicking the HW to start
sending).
The populate callback passes only &libeth_xdp_buff assuming buff::desc
pointer is enough, plus you can always get the corresponding Rx queue
structure via container_of(buff::rxq). If not, a driver can extend
the buff with more fields directly on the stack without touching
libeth_xdp definitions.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
libeth: xdp: add helpers for preparing/processing &libeth_xdp_buff
Add convenience helpers to build an &xdp_buff. This means: general
initialization before the NAPI loop, adding head, adding frags etc.
libeth_xdp_process_buff() is the same what everybody have in their
drivers:
dma_sync_for_cpu();
if (!frag) {
add_head();
prefetch();
} else {
add_frag();
}
Note that I don't use net_prefetch(), sticking to the original
prefetch(). In none of my tests prefetching 128 bytes yielded better
perf than 64 bytes. That might differ if the headers are huge enough,
but then additional tunneling etc. overhead takes place, you either
way won't win a lot.
&libeth_xdp_stash is for cases when you exit the polling loop without
finishing building the buff. If that happens, you need to store the
buffer in the queue structure until the next loop and then restore it.
It makes no sense to place a whole full &xdp_buff there. Define a
minimal structure, which would store only the fields essential to
restore it.
I was able to pack it into 16 bytes, which is only 8 bytes bigger
than `struct sk_buff *skb` on x64.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When XDP Tx queues are not interrupt-driven but use lazy cleaning,
i.e. only when there are less than `threshold` free descriptors left,
we also need cleanup timers to avoid &xdp_buff and &xdp_frame stall
for too long, especially with Page Pool (it warns every about inflight
pages every 60 second).
Let's say we sent 256 frames and don't need to send more, but we clean
only when the number of pending items >= 384. In that case, those 256
will stall until 128 more are sent. For this, add simple helpers to
run a timer which will clean the queue regardless, after 1 second of
the last send.
The timer is triggered when finalizing the queue. As long as there is
regular active traffic, the timer doesn't fire.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Unfortunately, it's not always possible to allocate
max(num_rxqs, nr_cpu_ids) even on hi-end NICs.
To mitigate this, add simple locking helpers to libeth_xdp.
As long as XDPSQs are not shared, the whole functionality is gated
behind a static lock. Otherwise, each bulk flush locks the queue for
the time of cleaning and filling the descriptors.
As long as this particular queue is not used by more than 1 CPU,
the impact is minimal (runtime check for boolean twice per 16+
descriptors).
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # static key Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Similarly to libeth_tx_complete(), add libeth_xdp_complete_tx() to
handle XDP_TX and xmit buffers. Both use bulk return under the hood.
Also add out of line libeth_tx_complete_any() which handles both
regular and XDP frames (if libeth_xdp is loaded), for example,
to call on queue destroy, where we don't need inlining but
convenience.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add helpers for implementing .ndo_xdp_xmit().
Same as for XDP_TX, accumulate up to 16 DMA-mapped frames on the stack,
then flush. If DMA mapping is failed for some reason, don't try mapping
further frames, but still flush what was already prepared.
DMA address of a head frame is stored in its headroom, assuming it
has enough of it for an 8 (or 4) byte value.
In addition to @prep and @xmit driver callbacks in XDP_TX, xmit also
needs @finalize to kick the XDPSQ after filling.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Start adding XDP-specific code to libeth, namely handling XDP_TX buffers
(only sending).
The idea is that we accumulate up to 16 buffers on the stack, then,
if either the limit is reached or the polling is finished, flush them
at once with only one XDPSQ cleaning (if needed). The main sending
function will be aware of the sending budget and already have all the
info to send the buffers, so it can't fail.
Drivers need to provide 2 inline callbacks to the main sending function:
for cleaning an XDPSQ and for filling descriptors; the library code
takes care of the rest.
Note that unlike the generic code, multi-buffer support is not wrapped
here with unlikely() to not hurt header split setups.
&libeth_xdp_buff is a simple extension over &xdp_buff which has a direct
pointer to the corresponding Rx descriptor (and, luckily, precisely 1 CL
size and 16-byte alignment on x86_64).
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # xmit logic Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
libeth: support native XDP and register memory model
Expand libeth's Page Pool functionality by adding native XDP support.
This means picking the appropriate headroom and DMA direction.
Also, register all the created &page_pools as XDP memory models.
A driver then can call xdp_rxq_info_attach_page_pool() when registering
its RxQ info.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Back when the libeth Rx core was initially written, devmem was a draft
and netmem_ref didn't exist in the mainline. Now that it's here, make
libeth MP-agnostic before introducing any new code or any new library
users.
When it's known that the created PP/FQ is for header buffers, use faster
"unsafe" underscored netmem <--> virt accessors as netmem_is_net_iov()
is always false in that case, but consumes some cycles (bit test +
true branch).
Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Change EXPORT_SYMBOL_NS_GPL(x, "LIBETH") to EXPORT_SYMBOL_GPL(x) +
DEFAULT_SYMBOL_NAMESPACE "LIBETH" to make the code more compact.
Also, explicitly include <linux/export.h> to satisfy new
requirements from scripts/misc-check.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add ethqos_pcs_set_inband() to improve readability, and to allow future
changes when phylink PCS support is properly merged.
Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sa8775p-ride-r3 Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/E1uPkbO-004EyA-EU@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yajun Deng [Thu, 12 Jun 2025 14:27:07 +0000 (14:27 +0000)]
net: sysfs: Implement is_visible for phys_(port_id, port_name, switch_id)
phys_port_id_show, phys_port_name_show and phys_switch_id_show would
return -EOPNOTSUPP if the netdev didn't implement the corresponding
method.
There is no point in creating these files if they are unsupported.
Put these attributes in netdev_phys_group and implement the is_visible
method. make phys_(port_id, port_name, switch_id) invisible if the netdev
dosen't implement the corresponding method.
net: ti: icssg-prueth: Read firmware-names from device tree
Refactor the way firmware names are handled for the ICSSG PRUETH driver.
Instead of using hardcoded firmware name arrays for different modes (EMAC,
SWITCH, HSR), the driver now reads the firmware names from the device tree
property "firmware-name". Only the EMAC firmware names are specified in the
device tree property. The firmware names for all other supported modes are
generated dynamically based on the EMAC firmware names by replacing
substrings (e.g., "eth" with "sw" or "hsr") as appropriate.
Example: Below are the firmwares used currently for PRU0 core
All three firmware names are same except for the operating mode.
In general for PRU0 core, firmware name is,
ti-pruss/am65x-sr2-pru0-pru<mode>-fw.elf
Since the EMAC firmware names are defined in DT, driver will read those
directly and for other modes swap the mode name. i.e. eth -> sw or
eth -> hsr.
This preserves backwards compatibility as ICSSG driver is supported only
by AM65x and AM64x. Both of these have "firmware-name" property
populated in their device tree.
Jakub Kicinski [Sat, 14 Jun 2025 01:23:01 +0000 (18:23 -0700)]
Merge branch 'net-stmmac-rk-much-needed-cleanups'
Russell King says:
====================
net: stmmac: rk: much needed cleanups
This series starts attacking the reams of fairly identical duplicated
code in dwmac-rk. Every new SoC that comes along seems to need more
code added to this file because e.g. the way the clock is controlled
is different in every SoC.
The first thing to realise is that the driver only supports RMII and
RGMII interface modes. So, the first patch adds a .get_interfaces()
implementation which reports this for phylink's usage, thus ensuring
that we error out during initialisation should something that isn't
supported be specified. Note that there is one case where there are
a pair of interfaces, one supports only RMII the other supports RMII
and RGMII, but we report both anyway - something that the existing
driver allows. A future patch may attempt to fix this.
Rather than writing code, let's realise that there are two major
implementations here:
1. a struct clk that needs to be set.
2. writing a register with settings for RGMII and RMII speeds.
Provide implementations for these, Also realise that as a result
of doing this, we can kill off the .set_rgmii_speed() and
.set_rmii_speed() methods by combining them together - indeed,
this is what later SoCs already do by pointing both these methods
at the same function.
Overall, this patch series shrinks the file LOC by almost 8.7%
by removing 175 lines from over 2000 lines.
Apart from the error reporting changing and restricting interface
modes to those that the driver supports, no functional change is
anticipated with this patch. However, I have no hardware to test
this.
====================
Now that no SoC implements the .set_*_speed() methods, we can get rid
of these methods and the now unused code in rk_set_clk_tx_rate().
Arrange for the function to return an error when the .set_speed()
method is not implemented.
px30_set_rmii_speed() doesn't need to be as verbose as it is - it
merely needs the values for the register and clock rate which depend
on the speed, and then call the appropriate functions. Rewrite the
function to make it so.
As a result of the previous patches, many of the .set_rgmii_speed()
and .set_rmii_speed() implementations are identical apart from the
interface mode. Add a new .set_speed() function which takes the
interface mode in addition to the speed, and use it to combine the
separate implementations, calling the common rk_set_reg_speed()
function.
Also convert rk_set_clk_mac_speed() to be called by this new method
pointer, rather than having these implementations called from both
.set_*_speed() methods.
Remove all the error messages from the .set_speed() methods, as these
return an error code which is propagated up to stmmac_mac_link_up()
which will print the error.
Just like rk3568, there is no need to have separate RGMII and RMII
methods to set clk_mac_speed() as rgmii_clock() can be used to return
the clock rate for both RGMII and RMII interface modes. Combine these
two methods.
net: stmmac: rk: add struct for programming register based speeds
There is a common pattern in the driver where many SoCs need to write a
single register with a value dependent on the interface mode and speed.
Rather than having a lot of repeated code, add some common functions
and a struct to contain the values to be written to a register to
select the RGMII and RMII speeds.
Rather than having lots of regmap_write()s to the same register but
with different values depending on the speed, reorganise the
functions to use a local variable for the value, and then have one
regmap_write() call to write it to the register. This reduces the
amount of code and is a step towards further reducing the code size.
RK platforms support RGMII and/or RMII depending on the SoC. Detect
whether support for a SoC exists by whether the interface specific
set_to functions have been populated, and set the appropriate bits in
phylink's bitmap of interfaces.
This assumes all dwmac interfaces on a SoC have identical support,
but it should be noted that this is not true for RK3528 which only
supports RGMII on GMAC1. However, the existing code structure
permits RGMII to be configured on GMAC0 without complaint, so
preserve this behaviour even though it is incorrect to avoid
functional change.
Phase offset measurement is typically performed against the current active
source. However, some DPLL (Digital Phase-Locked Loop) devices may offer
the capability to monitor phase offsets across all available inputs.
The attribute and current feature state shall be included in the response
message of the ``DPLL_CMD_DEVICE_GET`` command for supported DPLL devices.
In such cases, users can also control the feature using the
``DPLL_CMD_DEVICE_SET`` command by setting the ``enum dpll_feature_state``
values for the attribute.
Once enabled the phase offset measurements for the input shall be returned
in the ``DPLL_A_PIN_PHASE_OFFSET`` attribute.
Implement feature support in ice driver for dpll-enabled devices.
ice: add phase offset monitor for all PPS dpll inputs
Implement a new admin command and helper function to handle and obtain
CGU measurements for input pins.
Add new callback operations to control the dpll device-level feature
"phase offset monitor," allowing it to be enabled or disabled. If the
feature is enabled, provide users with measured phase offsets and
notifications.
Initialize PPS DPLL with new callback operations if the feature is
supported by the firmware.
Add new callback operations for a dpll device:
- phase_offset_monitor_get(..) - to obtain current state of phase offset
monitor feature from dpll device,
- phase_offset_monitor_set(..) - to allow feature configuration.
Obtain the feature state value using the get callback and provide it to
the user if the device driver implements callbacks.
dpll: add phase-offset-monitor feature to netlink spec
Add enum dpll_feature_state for control over features.
Add dpll device level attribute:
DPLL_A_PHASE_OFFSET_MONITOR - to allow control over a phase offset monitor
feature. Attribute is present and shall return current state of a feature
(enum dpll_feature_state), if the device driver provides such capability,
otherwie attribute shall not be present.
Improve the .set_clk_tx_rate() method error message to include the
PHY interface mode along with the speed, which will be helpful to
the RK implementations.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1uPjjx-0049r5-NN@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Improve the rgmii_clock() documentation to indicate that it can also
be used for MII, GMII and RMII modes as well as RGMII as the required
clock rates are identical, but note that it won't error out for 1G
speeds for MII and RMII.
RubenKelevra [Thu, 12 Jun 2025 14:50:12 +0000 (16:50 +0200)]
net: pfcp: fix typo in message_priority field name
The field is spelled "message_priprity" in the big-endian bit-field
definition. Nothing in-tree currently references the member, so the
typo does not break kernel builds, but it is clearly incorrect.
Jakub Kicinski [Sat, 14 Jun 2025 01:09:48 +0000 (18:09 -0700)]
Merge branch 'dp83tg720-reduce-link-recovery'
Oleksij Rempel says:
====================
dp83tg720: Reduce link recovery
This patch series improves the link recovery behavior of the TI
DP83TG720 PHY driver.
Previously, we introduced randomized reset delay logic to avoid reset
collisions in multi-PHY setups. While this approach was functional, it
had notable drawbacks: unpredictable behavior, longer and more variable
link recovery times, and overall higher complexity in link handling.
With this new approach, we replace the randomized delay with
deterministic, role-specific delays in the PHY reset logic. This enables
us to:
- Remove the redundant empirical 600 ms delay in read_status()
- Drop the random polling interval logic
- Introduce a clean, adaptive polling strategy with consistent
behavior and improved responsiveness
As a result, the PHY is now able to recover link reliably in under
1000_ms
====================
David Jander [Thu, 12 Jun 2025 10:41:57 +0000 (12:41 +0200)]
net: phy: dp83tg720: switch to adaptive polling and remove random delays
Now that the PHY reset logic includes a role-specific asymmetric delay
to avoid synchronized reset deadlocks, the previously used randomized
polling intervals are no longer necessary.
This patch removes the get_random_u32_below()-based logic and introduces
an adaptive polling strategy:
- Fast polling for a short time after link-down
- Slow polling if the link remains down
- Slower polling when the link is up
This balances CPU usage and responsiveness while avoiding reset
collisions. Additionally, the driver still relies on polling for
all link state changes, as interrupt support is not implemented,
and link-up events are not reliably signaled by the PHY.
The polling parameters are now documented in the updated top-of-file
comment.
Co-developed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David Jander <david@protonic.nl> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250612104157.2262058-4-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that dp83tg720_soft_reset() introduces role-specific delays to avoid
reset synchronization deadlocks, the fixed 600ms post-reset delay in
dp83tg720_read_status() is no longer needed.
The new logic provides both the required MDC timing and link stabilization,
making the old empirical delay redundant and unnecessarily long.
Co-developed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David Jander <david@protonic.nl> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250612104157.2262058-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Jander [Thu, 12 Jun 2025 10:41:55 +0000 (12:41 +0200)]
net: phy: dp83tg720: implement soft reset with asymmetric delay
Add a .soft_reset callback for the DP83TG720 PHY that issues a hardware
reset followed by an asymmetric post-reset delay. The delay differs
based on the PHY's master/slave role to avoid synchronized reset
deadlocks, which are known to occur when both link partners use
identical reset intervals.
The delay includes:
- a fixed 1ms wait to satisfy MDC access timing per datasheet, and
- an empirically chosen extra delay (97ms for master, 149ms for slave).
Co-developed-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David Jander <david@protonic.nl> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250612104157.2262058-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Wed, 11 Jun 2025 20:11:21 +0000 (22:11 +0200)]
net: phy: improve mdio-boardinfo.h
There's no need to include phy.h and mutex.h in mdio-boardinfo.h.
However mdio-boardinfo.c included phy.h indirectly this way so far,
include it explicitly instead. Whilst at it, sort the included
headers properly.
Shannon Nelson [Mon, 9 Jun 2025 21:46:44 +0000 (14:46 -0700)]
ionic: cancel delayed work earlier in remove
Cancel any entries on the delayed work queue before starting
to tear down the lif to be sure there is no race with any
other events.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Mon, 9 Jun 2025 21:46:43 +0000 (14:46 -0700)]
ionic: clean dbpage in de-init
Since the kern_dbpage gets set up in ionic_lif_init() and that
function's error path will clean it if needed, the kern_dbpage
on teardown should be cleaned in ionic_lif_deinit(), not in
ionic_lif_free(). As it is currently we get a double call
to iounmap() on kern_dbpage if the PCI ionic fails setting up
the lif. One example of this is when firmware isn't responding
to AdminQ requests and ionic's first AdminQ call fails to
setup the NotifyQ.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Mon, 9 Jun 2025 21:46:42 +0000 (14:46 -0700)]
ionic: print firmware heartbeat as unsigned
The firmware heartbeat value is an unsigned number, and seeing
a negative number when it gets big is a little disconcerting.
Example:
ionic 0000:24:00.0: FW heartbeat stalled at -1342169688
Print using the unsigned flag.
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Joe Damato <joe@dama.to> Signed-off-by: David S. Miller <davem@davemloft.net>
irq_domain_create_simple() takes fwnode as the first argument. It can be
extracted from the struct device using dev_fwnode() helper instead of
using of_node with of_fwnode_handle().
John Madieu [Wed, 11 Jun 2025 06:12:04 +0000 (08:12 +0200)]
dt-bindings: net: renesas-gbeth: Add support for RZ/G3E (R9A09G047) SoC
Document support for the GBETH IP found on the Renesas RZ/G3E (R9A09G047) SoC.
The GBETH block on RZ/G3E is equivalent in functionality to the GBETH found on
RZ/V2H(P) (R9A09G057).
Florian Fainelli [Wed, 11 Jun 2025 21:27:29 +0000 (14:27 -0700)]
net: bcmasp: Utilize napi_complete_done() return value
Make use of the return value from napi_complete_done(). This allows
users to use the gro_flush_timeout and napi_defer_hard_irqs sysfs
attributes for configuring software interrupt coalescing.
Simplify the arguments passed to phy_get_internal_delay() - the "dev"
argument is always &phydev->mdio.dev, and as the phydev is passed in,
there's no need to also pass in the struct device, especially when this
function is the only reason for the caller to have a local "dev"
variable.
Remove the redundant "dev" argument, and update the callers.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1uPLwB-003VzR-4C@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Tue, 10 Jun 2025 21:34:53 +0000 (23:34 +0200)]
net: phy: move definition of genphy_c45_driver to phy_device.c
genphy_c45_read_status() is exported, so we can move definition of
genphy_c45_driver to phy_device.c and make it static. This helps
to clean up phy.h a little.
Nikunj Kela [Tue, 10 Jun 2025 20:04:11 +0000 (13:04 -0700)]
net: stmmac: extend use of snps,multicast-filter-bins property to xgmac
Hash based multicast filtering is an optional feature. Currently,
driver overrides the value of multicast_filter_bins based on the hash
table size. If the feature is not supported, hash table size reads 0
however the value of multicast_filter_bins remains set to default
HASH_TABLE_SIZE which is incorrect. Let's extend the use of the property
snps,multicast-filter-bins to xgmac so it can be set to 0 via devicetree
to indicate multicast filtering is not supported.
Hari Kalavakunta [Tue, 10 Jun 2025 19:33:38 +0000 (12:33 -0700)]
net: ncsi: Fix buffer overflow in fetching version id
In NC-SI spec v1.2 section 8.4.44.2, the firmware name doesn't
need to be null terminated while its size occupies the full size
of the field. Fix the buffer overflow issue by adding one
additional byte for null terminator.
Heiner Kallweit [Tue, 10 Jun 2025 06:03:43 +0000 (08:03 +0200)]
net: phy: assign default match function for non-PHY MDIO devices
Make mdio_device_bus_match() the default match function for non-PHY
MDIO devices. Benefit is that we don't have to export this function
any longer. As long as mdiodev->modalias isn't set, there's no change
in behavior. mdiobus_create_device() is the only place where
mdiodev->modalias gets set, but this function sets
mdio_device_bus_match() as match function anyway.
Andrew asked me to plumb the RXFH header fields configuration
thru to netlink. Before we do that we need to clean up the driver
facing API a little bit. Right now RXFH configuration shares the
callbacks with n-tuple filters. The future of n-tuple filters
is uncertain within netlink. Separate the two for clarity both
of the core code and the driver facing API.
This series adds the new callbacks and converts the initial
handful of drivers. There is 31 more driver patches to come,
then we can stop calling rxnfc in the core for rxfh.
====================
Jakub Kicinski [Wed, 11 Jun 2025 14:59:49 +0000 (07:59 -0700)]
net: drv: hyperv: migrate to new RXFH callbacks
Add support for the new rxfh_fields callbacks, instead of de-muxing
the rxnfc calls. This driver does not support flow filtering so
the set_rxnfc callback is completely removed.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:48 +0000 (07:59 -0700)]
net: drv: virtio: migrate to new RXFH callbacks
Add support for the new rxfh_fields callbacks, instead of de-muxing
the rxnfc calls. This driver does not support flow filtering so
the set_rxnfc callback is completely removed.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:47 +0000 (07:59 -0700)]
net: drv: vmxnet3: migrate to new RXFH callbacks
Add support for the new rxfh_fields callbacks, instead of de-muxing
the rxnfc calls. This driver does not support flow filtering so
the set_rxnfc callback is completely removed.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:46 +0000 (07:59 -0700)]
eth: fbnic: migrate to new RXFH callbacks
Add support for the new rxfh_fields callbacks, instead of de-muxing
the rxnfc calls. The code is moved as we try to declare the functions
in the order ing which they appear in the ops struct.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:44 +0000 (07:59 -0700)]
net: ethtool: add dedicated callbacks for getting and setting rxfh fields
We mux multiple calls to the drivers via the .get_nfc and .set_nfc
callbacks. This is slightly inconvenient to the drivers as they
have to de-mux them back. It will also be awkward for netlink code
to construct struct ethtool_rxnfc when it wants to get info about
RX Flow Hash, from the RSS module.
Add dedicated driver callbacks. Create struct ethtool_rxfh_fields
which contains only data relevant to RXFH. Maintain the names of
the fields to avoid having to heavily modify the drivers.
For now support both callbacks, once all drivers are converted
ethtool_*et_rxfh_fields() will stop using the rxnfc callbacks.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:43 +0000 (07:59 -0700)]
net: ethtool: require drivers to opt into the per-RSS ctx RXFH
RX Flow Hashing supports using different configuration for different
RSS contexts. Only two drivers seem to support it. Make sure we
uniformly error out for drivers which don't.
Jakub Kicinski [Wed, 11 Jun 2025 14:59:41 +0000 (07:59 -0700)]
net: ethtool: copy the rxfh flow handling
RX Flow Hash configuration uses the same argument structure
as flow filters. This is probably why ethtool IOCTL handles
them together. The more checks we add the more convoluted
this code is getting (as some of the checks apply only
to flow filters and others only to the hashing).
Copy the code to separate the handling. This is an exact
copy, the next change will remove unnecessary handling.
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set"
* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
net: ethtool: Don't check if RSS context exists in case of context 0
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
ipv6: Move fib6_config_validate() to ip6_route_add().
net: drv: netdevsim: don't napi_complete() from netpoll
net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
veth: prevent NULL pointer dereference in veth_xdp_rcv
net_sched: remove qdisc_tree_flush_backlog()
net_sched: ets: fix a race in ets_qdisc_change()
net_sched: tbf: fix a race in tbf_change()
net_sched: red: fix a race in __red_change()
net_sched: prio: fix a race in prio_tune()
net_sched: sch_sfq: reject invalid perturb period
net: phy: phy_caps: Don't skip better duplex macth on non-exact match
MAINTAINERS: Update Kuniyuki Iwashima's email address.
selftests: net: add test case for NAT46 looping back dst
net: clear the dst when changing skb protocol
net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
net/mlx5e: Fix leak of Geneve TLV option object
net/mlx5: HWS, make sure the uplink is the last destination
...
Linus Torvalds [Thu, 12 Jun 2025 15:21:13 +0000 (08:21 -0700)]
Merge tag 'pinctrl-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Add some missing pins on the Qualcomm QCM2290, along with a managed
resources patch that make it clean and nice
- Drop an unused function in the ST Micro driver
- Drop bouncing MAINTAINER entry
- Drop of_match_ptr() macro to rid compile warnings in the TB10x
driver
- Fix up calculation of pin numbers from base in the Sunxi driver
* tag 'pinctrl-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: sunxi: dt: Consider pin base when calculating bank number from pin
pinctrl: tb10x: Drop of_match_ptr for ID table
pinctrl: MAINTAINERS: Drop bouncing Jianlong Huang
pinctrl: st: Drop unused st_gpio_bank() function
pinctrl: qcom: pinctrl-qcm2290: Add missing pins
pinctrl: qcom: switch to devm_gpiochip_add_data()
Linus Torvalds [Thu, 12 Jun 2025 15:17:56 +0000 (08:17 -0700)]
Merge tag 'arc-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- arch_atomic64_cmpxchg relaxed variant [Jason]
- use of inbuilt swap in stack unwinder [Yu-Chun Lin]
- use of __ASSEMBLER__ in kernel headers [Thomas Huth]
* tag 'arc-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: Replace __ASSEMBLY__ with __ASSEMBLER__ in the non-uapi headers
ARC: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
ARC: unwind: Use built-in sort swap to reduce code size and improve performance
ARC: atomics: Implement arch_atomic64_cmpxchg using _relaxed
Jakub Kicinski [Thu, 12 Jun 2025 15:16:47 +0000 (08:16 -0700)]
Merge tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Another quick round of updates:
- revert mwifiex HT40 that was causing issues
- many ath10k/ath11k/ath12k fixes
- re-add some iwlwifi code I lost in a merge
- use kfree_sensitive() on an error path in cfg80211
* tag 'wireless-2025-06-12' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: use kfree_sensitive() for connkeys cleanup
wifi: iwlwifi: fix merge damage related to iwl_pci_resume
Revert "wifi: mwifiex: Fix HT40 bandwidth issue."
wifi: ath12k: fix uaf in ath12k_core_init()
wifi: ath12k: Fix hal_reo_cmd_status kernel-doc
wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850
wifi: ath11k: validate ath11k_crypto_mode on top of ath11k_core_qmi_firmware_ready
wifi: ath11k: consistently use ath11k_mac_get_fw_stats()
wifi: ath11k: move locking outside of ath11k_mac_get_fw_stats()
wifi: ath11k: adjust unlock sequence in ath11k_update_stats_event()
wifi: ath11k: move some firmware stats related functions outside of debugfs
wifi: ath11k: don't wait when there is no vdev started
wifi: ath11k: don't use static variables in ath11k_debugfs_fw_stats_process()
wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
wil6210: fix support for sparrow chipsets
wifi: ath10k: Avoid vdev delete timeout when firmware is already down
ath10k: snoc: fix unbalanced IRQ enable in crash recovery
====================
This series addresses a regression in ethtool flow steering where rules
targeting the default RSS context (context 0) were incorrectly rejected.
The default RSS context always exists but is not stored in the rss_ctx
xarray like additional contexts. The current validation logic was
checking for the existence of context 0 in this array, causing valid
flow steering rules to be rejected.
This prevented configurations such as:
- High priority rules directing specific traffic to the default context
- Low priority catch-all rules directing remaining traffic to additional
contexts
Patch 1 fixes the validation logic to skip the existence check for
context 0.
Patch 2 adds a selftest that verifies this behavior.
Gal Pressman [Thu, 12 Jun 2025 07:19:58 +0000 (10:19 +0300)]
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
Add test_rss_default_context_rule() to verify that ntuple rules can
correctly direct traffic to the default RSS context (context 0).
The test creates two ntuple rules with explicit location priorities:
- A high-priority rule (loc 0) directing specific port traffic to
context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
1.
This validates that:
1. Rules targeting the default context function properly.
2. Traffic steering works as expected when mixing default and
additional RSS contexts.
The test was written by AI, and reviewed by humans.
Gal Pressman [Thu, 12 Jun 2025 07:19:57 +0000 (10:19 +0300)]
net: ethtool: Don't check if RSS context exists in case of context 0
Context 0 (default context) always exists, there is no need to check
whether it exists or not when adding a flow steering rule.
The existing check fails when creating a flow steering rule for context
0 as it is not stored in the rss_ctx xarray.
For example:
$ ethtool --config-ntuple eth2 flow-type tcp4 dst-ip 194.237.147.23 dst-port 19983 context 0 loc 618
rmgr: Cannot insert RX class rule: Invalid argument
Cannot insert classification rule
An example usecase for this could be:
- A high-priority rule (loc 0) directing specific port traffic to
context 0.
- A low-priority rule (loc 1) directing all other TCP traffic to context
1.
This is a user-visible regression that was caught in our testing
environment, it was not reported by a user yet.
Fixes: de7f7582dff2 ("net: ethtool: prevent flow steering to RSS contexts which don't exist") Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250612071958.1696361-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 12 Jun 2025 15:13:47 +0000 (08:13 -0700)]
Merge tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- eir: Fix NULL pointer deference on eir_get_service_data
- eir: Fix possible crashes on eir_create_adv_data
- hci_sync: Fix broadcast/PA when using an existing instance
- ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
- ISO: Fix not using bc_sid as advertisement SID
- MGMT: Fix sparse errors
* tag 'for-net-2025-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Fix sparse errors
Bluetooth: ISO: Fix not using bc_sid as advertisement SID
Bluetooth: ISO: Fix using BT_SK_PA_SYNC to detect BIS sockets
Bluetooth: eir: Fix possible crashes on eir_create_adv_data
Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance
Bluetooth: Fix NULL pointer deference on eir_get_service_data
====================
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
Before the cited commit, the kernel unconditionally embedded SCM
credentials to skb for embryo sockets even when both the sender
and listener disabled SO_PASSCRED and SO_PASSPIDFD.
Now, the credentials are added to skb only when configured by the
sender or the listener.
However, as reported in the link below, it caused a regression for
some programs that assume credentials are included in every skb,
but sometimes not now.
The only problematic scenario would be that a socket starts listening
before setting the option. Then, there will be 2 types of non-small
race window, where a client can send skb without credentials, which
the peer receives as an "invalid" message (and aborts the connection
it seems ?):
Client Server
------ ------
s1.listen() <-- No SO_PASS{CRED,PIDFD}
s2.connect()
s2.send() <-- w/o cred
s1.setsockopt(SO_PASS{CRED,PIDFD})
s2.send() <-- w/ cred
or
Client Server
------ ------
s1.listen() <-- No SO_PASS{CRED,PIDFD}
s2.connect()
s2.send() <-- w/o cred
s3, _ = s1.accept() <-- Inherit cred options
s2.send() <-- w/o cred but not set yet
Jakub Kicinski [Wed, 11 Jun 2025 17:46:43 +0000 (10:46 -0700)]
net: drv: netdevsim: don't napi_complete() from netpoll
netdevsim supports netpoll. Make sure we don't call napi_complete()
from it, since it may not be scheduled. Breno reports hitting a
warning in napi_complete_done():
WARNING: CPU: 14 PID: 104 at net/core/dev.c:6592 napi_complete_done+0x2cc/0x560
__napi_poll+0x2d8/0x3a0
handle_softirqs+0x1fe/0x710
This is presumably after netpoll stole the SCHED bit prematurely.
Reported-by: Breno Leitao <leitao@debian.org> Fixes: 3762ec05a9fb ("netdevsim: add NAPI support") Tested-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250611174643.2769263-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
veth: prevent NULL pointer dereference in veth_xdp_rcv
The veth peer device is RCU protected, but when the peer device gets
deleted (veth_dellink) then the pointer is assigned NULL (via
RCU_INIT_POINTER).
This patch adds a necessary NULL check in veth_xdp_rcv when accessing
the veth peer net_device.
This fixes a bug introduced in commit dc82a33297fc ("veth: apply qdisc
backpressure on full ptr_ring to reduce TX drops"). The bug is a race
and only triggers when having inflight packets on a veth that is being
deleted.
Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Closes: https://lore.kernel.org/all/fecfcad0-7a16-42b8-bff2-66ee83a6e5c4@linux.dev/ Reported-by: syzbot+c4c7bf27f6b0c4bd97fe@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/683da55e.a00a0220.d8eae.0052.GAE@google.com/ Fixes: dc82a33297fc ("veth: apply qdisc backpressure on full ptr_ring to reduce TX drops") Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://patch.msgid.link/174964557873.519608.10855046105237280978.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock") Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250611111515.1983366-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock") Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Zhengchao Shao <shaozhengchao@huawei.com> Link: https://patch.msgid.link/20250611111515.1983366-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: 0c8d13ac9607 ("net: sched: red: delay destroying child qdisc on replace") Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250611111515.1983366-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Calling qdisc_purge_queue() instead of qdisc_tree_flush_backlog()
should fix the race, because all packets will be purged from the qdisc
before releasing the lock.
Fixes: 7b8e0b6e6599 ("net: sched: prio: delay destroying child qdiscs on change") Reported-by: Gerrard Tai <gerrard.tai@starlabs.sg> Suggested-by: Gerrard Tai <gerrard.tai@starlabs.sg> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250611111515.1983366-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: phy_caps: Don't skip better duplex macth on non-exact match
When performing a non-exact phy_caps lookup, we are looking for a
supported mode that matches as closely as possible the passed speed/duplex.
Blamed patch broke that logic by returning a match too early in case
the caller asks for half-duplex, as a full-duplex linkmode may match
first, and returned as a non-exact match without even trying to mach on
half-duplex modes.
Reported-by: Jijie Shao <shaojijie@huawei.com> Closes: https://lore.kernel.org/netdev/20250603102500.4ec743cf@fedora/T/#m22ed60ca635c67dc7d9cbb47e8995b2beb5c1576 Tested-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Fixes: fc81e257d19f ("net: phy: phy_caps: Allow looking-up link caps based on speed and duplex") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250606094321.483602-1-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: bcmgenet: add support for GRO software interrupt coalescing
Enable support for software IRQ coalescing and GRO aggregation
and apply conservative defaults which can help improve system
and network performance by reducing the number of hardware
interrupts and improving GRO aggregation ratio.
Zak Kemble [Tue, 10 Jun 2025 22:04:02 +0000 (23:04 +0100)]
net: bcmgenet: use napi_complete_done return value
Make use of the return value from napi_complete_done(). This allows users to
use the gro_flush_timeout and napi_defer_hard_irqs sysfs attributes for
configuring software interrupt coalescing.
Abin Joseph [Tue, 10 Jun 2025 11:41:11 +0000 (17:11 +0530)]
net: macb: Add shutdown operation support
Implement the shutdown hook to ensure clean and complete deactivation of
MACB controller. The shutdown sequence is protected with 'rtnl_lock()'
to serialize access and prevent race conditions while detaching and
closing the network device. This ensure a safe transition when the Kexec
utility calls the shutdown hook, facilitating seamless loading and
booting of a new kernel from the currently running one.
====================
net: phy: micrel: add extended PHY support for KSZ9477-class devices
This patch series extends the PHY driver support for the Microchip
KSZ9477-class switch-integrated PHYs. These changes enhance ethtool
functionality and diagnostic capabilities by implementing the following
features:
- MDI/MDI-X configuration support
All crossover modes (auto, MDI, MDI-X) are now configurable.
- RX error counter reporting
The RXER counter (reg 0x15) is now accumulated and exported via
ethtool stats.
- Cable test support
Reuses the KSZ9131 implementation to enable open/short fault
detection and approximate fault length reporting.
====================