git.ipfire.org Git - thirdparty/kernel/stable.git/log


Jakub Kicinski [Wed, 27 Aug 2025 23:43:19 +0000 (16:43 -0700)]

eth: mlx5: remove Kconfig co-dependency with VXLAN

mlx5 has a Kconfig co-dependency on VXLAN, even though it doesn't
call any VXLAN function (unlike mlxsw). Perhaps this dates back
to very old days when tunnel ports were fetched directly from
VXLAN.

Remove the dependency to allow MLX5=y + VXLAN=m kernel configs.
But still avoid compiling in the lib/vxlan code if VXLAN=n.

Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://patch.msgid.link/20250827234319.3504852-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Russell King (Oracle) [Wed, 27 Aug 2025 13:27:47 +0000 (14:27 +0100)]

net: stmmac: mdio: clean up c22/c45 accessor split

The C45 accessors were setting the GR (register number) field twice,
once with the 16-bit register address truncated to five bits, and
then overwritten with the C45 devad. This is harmless since the field
was being cleared prior to being updated with the C45 devad, except
for the extra work.

Remove the redundant code.
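
As a rough user-space illustration of why the first write was dead code,
the sketch below models a 5-bit GR field. The field position, the mask
names and both helper functions are assumptions made for this example
and are not taken from the stmmac sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a 5-bit GR field at bits 20:16.
 * Not taken from the stmmac register map.
 */
#define GR_SHIFT	16
#define GR_MASK		(0x1fu << GR_SHIFT)

/* Old flow: GR is written with the truncated register address,
 * then cleared and rewritten with the C45 devad.
 */
static uint32_t c45_gr_old(uint32_t value, uint16_t reg, uint8_t devad)
{
	value |= ((uint32_t)reg << GR_SHIFT) & GR_MASK;	/* redundant write */
	value &= ~GR_MASK;				/* field cleared */
	value |= ((uint32_t)devad << GR_SHIFT) & GR_MASK;
	return value;
}

/* New flow: only the devad is written. */
static uint32_t c45_gr_new(uint32_t value, uint8_t devad)
{
	value &= ~GR_MASK;
	value |= ((uint32_t)devad << GR_SHIFT) & GR_MASK;
	return value;
}
```

Both helpers produce the same word for any register address, which is
why dropping the first write should not change behaviour.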

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/E1urGBn-00000000DCH-3swS@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Thu, 28 Aug 2025 23:46:25 +0000 (16:46 -0700)]

Merge branch 'net_sched-extend-rcu-use-in-dump-methods-ii'

Eric Dumazet says:

====================
net_sched: extend RCU use in dump() methods (II)

Second series adding RCU dump() to three actions

First patch removes BH blocking on modules done in the first series.
====================

Link: https://patch.msgid.link/20250827125349.3505302-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Eric Dumazet [Wed, 27 Aug 2025 12:53:49 +0000 (12:53 +0000)]

net_sched: act_skbmod: use RCU in tcf_skbmod_dump()

Also storing tcf_action into struct tcf_skbmod_params
makes sure there is no discrepancy in tcf_skbmod_act().

No longer block BH in tcf_skbmod_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Eric Dumazet [Wed, 27 Aug 2025 12:53:48 +0000 (12:53 +0000)]

net_sched: act_tunnel_key: use RCU in tunnel_key_dump()

Also storing tcf_action into struct tcf_tunnel_key_params
makes sure there is no discrepancy in tunnel_key_act().

No longer block BH in tunnel_key_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Eric Dumazet [Wed, 27 Aug 2025 12:53:47 +0000 (12:53 +0000)]

net_sched: act_vlan: use RCU in tcf_vlan_dump()

Also storing tcf_action into struct tcf_vlan_params
makes sure there is no discrepancy in tcf_vlan_act().

No longer block BH in tcf_vlan_init() when acquiring tcf_lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Eric Dumazet [Wed, 27 Aug 2025 12:53:46 +0000 (12:53 +0000)]

net_sched: remove BH blocking in eight actions

Followup of f45b45cbfae3 ("Merge branch
'net_sched-act-extend-rcu-use-in-dump-methods'")

We never grab tcf_lock from BH context in these modules:

act_connmark
act_csum
act_ct
act_ctinfo
act_mpls
act_nat
act_pedit
act_skbedit

No longer block BH when acquiring tcf_lock from init functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827125349.3505302-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Russell King (Oracle) [Wed, 27 Aug 2025 08:54:51 +0000 (09:54 +0100)]

net: stmmac: minor cleanups to stmmac_bus_clks_config()

stmmac_bus_clks_config() doesn't need to repeatedly dereference
priv->plat as this remains the same throughout this function. Not only
does this detract from the function's readability, but it could cause
the value to be reloaded each time. Use a local variable.

Also, the final return can simply return zero, and we can dispense
with the initialiser for 'ret'.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/E1urBvf-000000002ii-37Ce@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Russell King (Oracle) [Wed, 27 Aug 2025 08:41:48 +0000 (09:41 +0100)]

net: stmmac: mdio: use netdev_priv() directly

netdev_priv() is an inline function, taking a struct net_device
pointer. When passing in the MII bus->priv, which is a void pointer,
there is no need to go via a local ndev variable to type it first.

Thus, instead of:

	struct net_device *ndev = bus->priv;
	struct stmmac_priv *priv;
	...
	priv = netdev_priv(ndev);

we can simply do:

	struct stmmac_priv *priv = netdev_priv(bus->priv);

which simplifies the code.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/E1urBj2-000000002as-0pod@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Sky Huang [Wed, 27 Aug 2025 04:47:55 +0000 (12:47 +0800)]

net: phy: mtk-2p5ge: Add LED support for MT7988

Add LED support for MT7988's built-in 2.5Gphy. LED hardware has almost
the same design as MT7981's/MT7988's built-in GbE. So hook the same
helper function here.

Before mtk_phy_leds_state_init(), set correct default values of LED0
and LED1.

Signed-off-by: Sky Huang <skylake.huang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250827044755.3256991-1-SkyLake.Huang@mediatek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Wed, 27 Aug 2025 17:35:58 +0000 (10:35 -0700)]

selftests: drv-net: rss_ctx: fix the queue count check

Commit 0d6ccfe6b319 ("selftests: drv-net: rss_ctx: check for all-zero keys")
added a skip exception if the NIC has fewer than 3 queues enabled,
but it's just constructing the object, it's not actually raising
this exception.

Before:

  # Exception| net.lib.py.utils.CmdExitFailure: Command failed: ethtool -X enp1s0 equal 3 hkey d1:cc:77:47:9d:ea:15:f2:b9:6c:ef:68:62:c0:45:d5:b0:99:7d:cf:29:53:40:06:3d:8e:b9:bc:d4:70:89:b8:8d:59:04:ea:a9:c2:21:b3:55:b8:ab:6b:d9:48:b4:bd:4c:ff:a5:f0:a8:c2
  not ok 1 rss_ctx.test_rss_key_indir

After:

  ok 1 rss_ctx.test_rss_key_indir # SKIP Device has fewer than 3 queues (or doesn't support queue stats)

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250827173558.3259072-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Thu, 28 Aug 2025 23:05:34 +0000 (16:05 -0700)]

Merge branch 'devmem-io_uring-allow-more-flexibility-for-zc-dma-devices'

Dragos Tatulea says:

====================
devmem/io_uring: allow more flexibility for ZC DMA devices

For TCP zerocopy rx (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
- Scalable Function netdevs [1] have the DMA device in the grandparent.
- For Multi-PF netdevs [2] queues can be associated to different DMA
devices.

The series adds an API for getting the DMA device for a netdev queue.
Drivers that have special requirements can implement the newly added
queue management op. Otherwise the parent will still be used as before.

This series continues with switching to this API for io_uring zcrx and
devmem and adds a ndo_queue_dma_dev op for mlx5.

The last part of the series changes devmem rx bind to get the DMA device
per queue and blocks the case when multiple queues use different DMA
devices. The tx bind is left as is.

[1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
[2] Documentation/networking/multi-pf-netdev.rst
====================

Link: https://patch.msgid.link/20250827144017.1529208-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:40:01 +0000 (17:40 +0300)]

net: devmem: allow binding on rx queues with same DMA devices

Multi-PF netdevs have queues belonging to different PFs which also means
different DMA devices. This means that the binding on the DMA buffer can
be done to the incorrect device.

This change allows devmem binding to multiple queues only when the
queues have the same DMA device. Otherwise an error is returned.
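
The rule can be sketched in user-space C as follows. struct dma_dev and
common_dma_dev() are hypothetical stand-ins for this illustration, not
the kernel's types or helpers: the bind is accepted only if every
requested queue reports the same DMA device.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct device; hypothetical. */
struct dma_dev {
	int id;
};

/* Return the DMA device shared by all queues, or NULL when the
 * list is empty, a queue has no DMA device, or the queues disagree.
 */
static const struct dma_dev *
common_dma_dev(const struct dma_dev *const *per_queue, size_t n)
{
	const struct dma_dev *dev = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!per_queue[i])
			return NULL;
		if (dev && dev != per_queue[i])
			return NULL;	/* mixed DMA devices: reject */
		dev = per_queue[i];
	}
	return dev;
}
```

A caller would treat a NULL result as the error case and refuse the
bind.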

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20250827144017.1529208-9-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:40:00 +0000 (17:40 +0300)]

net: devmem: pre-read requested rx queues during bind

Instead of reading the requested rx queues after binding the buffer,
read the rx queues in advance in a bitmap and iterate over them when
needed.

This is a preparation for fetching the DMA device for each queue.

This patch has no functional changes besides adding an extra
rq index bounds check.
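
A minimal user-space sketch of the idea, assuming a fixed-size bitmap
and hypothetical names (MAX_RXQ, rxq_collect(), rxq_test()); the kernel
code uses its own bitmap helpers and netlink parsing instead:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_RXQ		256	/* hypothetical upper bound */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define RXQ_WORDS	((MAX_RXQ + WORD_BITS - 1) / WORD_BITS)

struct rxq_bitmap {
	unsigned long words[RXQ_WORDS];
};

/* Return non-zero if queue idx was requested. */
static int rxq_test(const struct rxq_bitmap *bm, unsigned int idx)
{
	return (bm->words[idx / WORD_BITS] >> (idx % WORD_BITS)) & 1UL;
}

/* Collect the requested rx queue indices into a bitmap up front.
 * Returns 0 on success, -1 if an index is out of range.
 */
static int rxq_collect(struct rxq_bitmap *bm, const unsigned int *req,
		       size_t n, unsigned int num_rx_queues)
{
	memset(bm, 0, sizeof(*bm));
	for (size_t i = 0; i < n; i++) {
		if (req[i] >= num_rx_queues || req[i] >= MAX_RXQ)
			return -1;	/* the extra bounds check */
		bm->words[req[i] / WORD_BITS] |= 1UL << (req[i] % WORD_BITS);
	}
	return 0;
}
```

Duplicate requests collapse into a single bit, and later steps can
iterate over the set bits as often as needed.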

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-8-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:39:59 +0000 (17:39 +0300)]

net: devmem: pull out dma_dev out of net_devmem_bind_dmabuf

Fetch the DMA device before calling net_devmem_bind_dmabuf()
and pass it on as a parameter.

This is needed for an upcoming change which will read the
DMA device per queue.

This patch has no functional changes.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-7-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:39:58 +0000 (17:39 +0300)]

net/mlx5e: add op for getting netdev DMA device

For zero-copy (devmem, io_uring), the netdev DMA device used
is the parent device of the net device. However that is not
always accurate for mlx5 devices:
- SFs: The parent device is an auxdev.
- Multi-PF netdevs: The DMA device should be determined by
the queue.

This change implements the DMA device queue API that returns the DMA
device appropriately for all cases.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-6-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:39:57 +0000 (17:39 +0300)]

net: devmem: get netdev DMA device via new API

Switch to the new API for fetching DMA devices for a netdev. The API is
called with queue index 0 for now, which is equivalent to the previous
behavior.

This patch will allow devmem to work with devices where the DMA device
is not stored in the parent device. mlx5 SFs are an example of such a
device.

Multi-PF netdevs are still problematic (as they were before this
change). Upcoming patches will address this for the rx binding.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-5-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:39:56 +0000 (17:39 +0300)]

io_uring/zcrx: add support for custom DMA devices

Use the new API for getting a DMA device for a specific netdev queue.

This patch will allow io_uring zero-copy rx to work with devices
where the DMA device is not stored in the parent device. mlx5 SFs
are an example of such a device.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/20250827144017.1529208-4-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Dragos Tatulea [Wed, 27 Aug 2025 14:39:55 +0000 (17:39 +0300)]

queue_api: add support for fetching per queue DMA dev

For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
- Scalable Function netdevs [1] have the DMA device in the grandparent.
- For Multi-PF netdevs [2] queues can be associated to different DMA
devices.

This patch introduces a queue-based interface for allowing drivers
to expose a different DMA device for zerocopy.
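
The shape of such an interface might look like the user-space sketch
below: an optional per-queue op, with a fallback to the parent device
when the driver does not provide one. All struct and function names
here are hypothetical and do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct device / struct net_device. */
struct device_sketch {
	struct device_sketch *parent;
};

struct netdev_sketch;

struct queue_ops_sketch {
	/* Optional: return the DMA device for rx queue idx. */
	struct device_sketch *(*get_dma_dev)(struct netdev_sketch *dev, int idx);
};

struct netdev_sketch {
	struct device_sketch *parent;
	const struct queue_ops_sketch *queue_ops;
};

/* Generic lookup: prefer the driver op, else use the parent as before. */
static struct device_sketch *queue_dma_dev(struct netdev_sketch *dev, int idx)
{
	if (dev->queue_ops && dev->queue_ops->get_dma_dev)
		return dev->queue_ops->get_dma_dev(dev, idx);
	return dev->parent;
}

/* Example driver op for an SF-like netdev: DMA device is the grandparent. */
static struct device_sketch *sf_get_dma_dev(struct netdev_sketch *dev, int idx)
{
	(void)idx;
	return dev->parent ? dev->parent->parent : NULL;
}

static const struct queue_ops_sketch sf_ops = {
	.get_dma_dev = sf_get_dma_dev,
};
```

Drivers without special requirements leave the op unset and keep the
previous behaviour.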

[1] Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
[2] Documentation/networking/multi-pf-netdev.rst

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250827144017.1529208-3-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Paolo Abeni [Thu, 28 Aug 2025 12:51:09 +0000 (14:51 +0200)]

Merge branch 'fbnic-synchronize-address-handling-with-bmc'

Alexander Duyck says:

====================
fbnic: Synchronize address handling with BMC

The fbnic driver needs to communicate with the BMC if it is operating on
the RMII-based transport (RBT) of the same port the host is on. To enable
this we need to add rules that will route BMC traffic to the RBT/BMC and
the BMC and firmware need to configure rules on the RBT side of the
interface to route traffic from the BMC to the host instead of the MAC.

To enable that, this patch set addresses two issues. First it will cause the
TCAM to be reconfigured in the event that the BMC was not previously
present when the driver was loaded, but the FW sends a notification that
the FW capabilities have changed and a BMC w/ various MAC addresses is now
present. Second it adds support for sending a message to the firmware so
that if the host adds additional MAC addresses the FW can be made aware and
route traffic for those addresses from the RBT to the host instead of the
MAC.
====================

Link: https://patch.msgid.link/175623715978.2246365.7798520806218461199.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Alexander Duyck [Tue, 26 Aug 2025 19:45:07 +0000 (12:45 -0700)]

fbnic: Push local unicast MAC addresses to FW to populate TCAMs

The MACDA TCAM can only be accessed by one entity at a time and as such we
cannot have simultaneous reads from the firmware to probe for changes from
the host. As such we have to send a message indicating what the state of
the MACDA is to the firmware when we update it so that the firmware can
sync up the TCAMs it owns to route BMC packets to the host.

To support that we are adding a new message that is invoked when we write
the MACDA that will notify the firmware of updates from the host and allow
it to sync up the TCAM configuration to match the one on the host side.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/175623750782.2246365.9178255870985916357.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Alexander Duyck [Tue, 26 Aug 2025 19:45:01 +0000 (12:45 -0700)]

fbnic: Add logic to repopulate RPC TCAM if BMC enables channel

The BMC itself can decide to abandon a link and move onto another link in
the event of things such as a link flap. As a result the driver may load
with the BMC not present, and then needs to update things to support the
BMC being present while the link is up and the NIC is passing traffic.

To support this we add support to the watchdog to reinitialize the RPC to
support adding the BMC unicast, multicast, and multicast promiscuous
filters while the link is up and the NIC owns the link.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/175623750101.2246365.8518307324797058580.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Alexander Duyck [Tue, 26 Aug 2025 19:44:54 +0000 (12:44 -0700)]

fbnic: Pass fbnic_dev instead of netdev to __fbnic_set/clear_rx_mode

To make the __fbnic_set_rx_mode and __fbnic_clear_rx_mode calls usable by
more points in the code we can make it so that they expect a fbnic_dev pointer
instead of a netdev pointer.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/175623749436.2246365.6068665520216196789.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Alexander Duyck [Tue, 26 Aug 2025 19:44:47 +0000 (12:44 -0700)]

fbnic: Move promisc_sync out of netdev code and into RPC path

In order for us to support the BMC possibly connecting, disconnecting, and
then reconnecting we need to be able to support entities outside of just
the NIC setting up promiscuous mode as the BMC can use a multicast
promiscuous setup.

To support that we should move the promisc_sync code out of the netdev and
into the RPC section of the driver so that it is reachable from more paths.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/175623748769.2246365.2130394904175851458.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Paolo Abeni [Thu, 28 Aug 2025 12:42:02 +0000 (14:42 +0200)]

Merge branch 'add-si3474-pse-controller-driver'

Piotr Kubik says:

====================
Add Si3474 PSE controller driver

From: Piotr Kubik <piotr.kubik@adtran.com>

This patch series provides support for the Skyworks Si3474 I2C Power
Sourcing Equipment controller.

Based on the TPS23881 driver code.

Supported features of Si3474:
- get port status,
- get port power,
- get port voltage,
- enable/disable port power

Signed-off-by: Piotr Kubik <piotr.kubik@adtran.com>
====================

Link: https://patch.msgid.link/6af537dc-8a52-4710-8a18-dcfbb911cf23@adtran.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Piotr Kubik [Tue, 26 Aug 2025 14:41:58 +0000 (14:41 +0000)]

net: pse-pd: Add Si3474 PSE controller driver

Add a driver for the Skyworks Si3474 I2C Power Sourcing Equipment
controller.

Driver supports basic features of Si3474 IC:
- get port status,
- get port power,
- get port voltage,
- enable/disable port power.

Only 4p configurations are supported at this moment.

Signed-off-by: Piotr Kubik <piotr.kubik@adtran.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/9b72c8cd-c8d3-4053-9c80-671b9481d166@adtran.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Piotr Kubik [Tue, 26 Aug 2025 14:41:44 +0000 (14:41 +0000)]

dt-bindings: net: pse-pd: Add bindings for Si3474 PSE controller

Add the Si3474 I2C Power Sourcing Equipment controller device tree
bindings documentation.

Signed-off-by: Piotr Kubik <piotr.kubik@adtran.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/71a67c6f-6fce-49c7-96ec-554602dbd4f1@adtran.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Paolo Abeni [Thu, 28 Aug 2025 11:14:52 +0000 (13:14 +0200)]

Merge branch 'net-better-drop-accounting'

Eric Dumazet says:

====================
net: better drop accounting

Incrementing sk->sk_drops for every dropped packet can
cause serious cache line contention under DoS.

Add optional sk->sk_drop_counters pointer so that
protocols can opt-in to use two dedicated cache lines
to hold drop counters.

Convert UDP and RAW to use this infrastructure.

Tested on UDP (see patch 4/5 for details)

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 816281             0.0
Udp6InErrors                    7497093            0.0
Udp6RcvbufErrors                7497093            0.0
====================

Link: https://patch.msgid.link/20250826125031.1578842-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Eric Dumazet [Tue, 26 Aug 2025 12:50:31 +0000 (12:50 +0000)]

inet: raw: add drop_counters to raw sockets

When a packet flood hits one or more RAW sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to raw sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per RAW socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Eric Dumazet [Tue, 26 Aug 2025 12:50:30 +0000 (12:50 +0000)]

udp: add drop_counters to udp socket

When a packet flood hits one or more UDP sockets, many cpus
have to update sk->sk_drops.

This slows down other cpus, because currently
sk_drops is in sock_write_rx group.

Add a socket_drop_counters structure to udp sockets.

Using dedicated cache lines to hold drop counters
makes sure that consumers no longer suffer from
false sharing if/when producers only change sk->sk_drops.

This adds 128 bytes per UDP socket.

Tested with the following stress test, sending about 11 Mpps
to a dual socket AMD EPYC 7B13 64-Core.

super_netperf 20 -t UDP_STREAM -H DUT -l10 -- -n -P,1000 -m 120
Note: due to socket lookup, only one UDP socket is receiving
packets on DUT.

Then measure receiver (DUT) behavior. We can see both
consumer and BH handlers can process more packets per second.

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 615091             0.0
Udp6InErrors                    3904277            0.0
Udp6RcvbufErrors                3904277            0.0

After:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 816281             0.0
Udp6InErrors                    7497093            0.0
Udp6RcvbufErrors                7497093            0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Eric Dumazet [Tue, 26 Aug 2025 12:50:29 +0000 (12:50 +0000)]

net: add sk->sk_drop_counters

Some sockets suffer from heavy false sharing on sk->sk_drops,
and fields in the same cache line.

Add sk->sk_drop_counters to:

- move the drop counter(s) to dedicated cache lines.
- Add basic NUMA awareness to these drop counter(s).
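
A user-space sketch of the layout, assuming 64-byte cache lines and
C11 atomics. struct drop_counters, struct sock_sketch and the node
argument are illustrative assumptions, not the kernel definitions;
picking the counter by node parity is one simple way to get basic NUMA
awareness.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE	64	/* assumed cache line size */

/* Two counters, each alone on its own cache line. */
struct drop_counters {
	_Alignas(CACHE_LINE) atomic_int drops0;
	_Alignas(CACHE_LINE) atomic_int drops1;
};

struct sock_sketch {
	atomic_int drops;		/* legacy shared counter */
	struct drop_counters *counters;	/* optional, opt-in */
};

/* Add n drops; use the dedicated counters when the socket has them. */
static void drops_add(struct sock_sketch *sk, unsigned int node, int n)
{
	struct drop_counters *c = sk->counters;

	if (!c)
		atomic_fetch_add(&sk->drops, n);
	else if (node % 2)
		atomic_fetch_add(&c->drops1, n);
	else
		atomic_fetch_add(&c->drops0, n);
}

/* Read the total, summing both counters when they are in use. */
static int drops_read(struct sock_sketch *sk)
{
	struct drop_counters *c = sk->counters;

	if (!c)
		return atomic_load(&sk->drops);
	return atomic_load(&c->drops0) + atomic_load(&c->drops1);
}
```

With this layout the counters occupy two full cache lines, so writers
updating them no longer dirty the line that readers of other socket
fields depend on.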

Following patches will use this infrastructure for UDP and RAW sockets.

sk_clone_lock() is not yet ready, it would need to properly
set newsk->sk_drop_counters if we plan to use this for TCP sockets.

v2: used Paolo suggestion from https://lore.kernel.org/netdev/8f09830a-d83d-43c9-b36b-88ba0a23e9b2@redhat.com/

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Eric Dumazet [Tue, 26 Aug 2025 12:50:28 +0000 (12:50 +0000)]

net: add sk_drops_skbadd() helper

Existing sk_drops_add() helper is renamed to sk_drops_skbadd().

Add sk_drops_add() and convert sk_drops_inc() to use it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Eric Dumazet [Tue, 26 Aug 2025 12:50:27 +0000 (12:50 +0000)]

net: add sk_drops_read(), sk_drops_inc() and sk_drops_reset() helpers

We want to split sk->sk_drops in the future to reduce
potential contention on this field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250826125031.1578842-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Jakub Kicinski [Mon, 25 Aug 2025 20:18:28 +0000 (13:18 -0700)]

uapi: wrap compiler_types.h in an ifdef instead of the implicit strip

The uAPI stddef header includes compiler_types.h, a kernel-only
header, to make sure that kernel definitions of annotations
like __counted_by() take precedence.

There is a hack in scripts/headers_install.sh which strips includes
of compiler.h and compiler_types.h when installing uAPI headers.
While explicit handling makes sense for compiler.h, which is included
all over the uAPI, compiler_types.h is only included by stddef.h
(within the uAPI, obviously it's included in kernel code a lot).

Remove the stripping from scripts/headers_install.sh and wrap
the include of compiler_types.h in #ifdef __KERNEL__ instead.
This should be equivalent functionally, but is easier to understand
to a casual reader of the code. It also makes it easier to work
with kernel headers directly from under tools/
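
The resulting pattern could look roughly like this. The fallback
definition of __counted_by() and struct msg are assumptions added so
the fragment also builds in user space; they are not quoted from the
uAPI header.

```c
#include <assert.h>
#include <stddef.h>

/* Only pull in the kernel-only header when building kernel code. */
#ifdef __KERNEL__
#include <linux/compiler_types.h>
#endif

/* Assumed fallback for builds where no definition was provided. */
#ifndef __counted_by
#define __counted_by(m)
#endif

/* Hypothetical uAPI-style struct using the annotation. */
struct msg {
	unsigned int len;
	unsigned char data[] __counted_by(len);
};
```

Outside the kernel the include is skipped by the preprocessor, so no
post-processing of the installed header is needed for it.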

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250825201828.2370083-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Jakub Kicinski [Thu, 28 Aug 2025 01:56:27 +0000 (18:56 -0700)]

Merge branch 'eth-fbnic-extend-hw-stats-support'

Jakub Kicinski says:

====================
eth: fbnic: Extend hw stats support

Mohsin says:

Extend hardware stats support for fbnic by adding the ability to reset
hardware stats when the device experiences a reset due to a PCI error and
include MAC stats in the hardware stats reset. Additionally, expand
hardware stats coverage to include FEC, PHY, and Pause stats.

v1: https://lore.kernel.org/20250822164731.1461754-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250825200206.2357713-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:06 +0000 (13:02 -0700)]

eth: fbnic: Add pause stats support

Add support to read pause stats for fbnic. Unlike FEC and PCS stats,
pause stats won't wrap, so do not fetch them under the service task. Since
they are exclusively accessed via the ethtool API, don't include them in
fbnic_get_hw_stats().

]# ethtool -I -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: off
TX: off
Statistics:
tx_pause_frames: 0
rx_pause_frames: 0

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:05 +0000 (13:02 -0700)]

eth: fbnic: Read PHY stats via the ethtool API

Provide support to read PHY stats (FEC and PCS) via the ethtool API.

]# ethtool -I --show-fec eth0
FEC parameters for eth0:
Supported/Configured FEC encodings: RS
Active FEC encoding: RS
Statistics:
corrected_blocks: 0
uncorrectable_blocks: 0

]# ethtool -S eth0 --groups eth-phy
Standard stats for eth0:
eth-phy-SymbolErrorDuringCarrier: 0

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:04 +0000 (13:02 -0700)]

eth: fbnic: Fetch PHY stats from device

Add support to fetch PHY stats consisting of PCS and FEC stats from the
device. When reading the stats counters, the lo part is read first, which
latches the hi part to ensure consistent reading of the stats counter.

FEC and PCS stats can wrap depending on the access frequency. To prevent
wrapping, fetch these stats periodically under the service task. Also to
maintain consistency fetch these stats along with other 32b stats under
__fbnic_get_hw_stats32().
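
The lo-then-hi ordering can be illustrated with a simulated counter.
struct fake_counter_hw and the helpers below are hypothetical and only
model the latching behaviour described above, not the fbnic register
interface.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated device counter; hypothetical. */
struct fake_counter_hw {
	uint64_t live;		/* free-running 64-bit counter */
	uint32_t latched_hi;	/* hi half captured when lo is read */
};

/* Reading lo returns the low half and latches the matching hi half. */
static uint32_t hw_read_lo(struct fake_counter_hw *hw)
{
	hw->latched_hi = (uint32_t)(hw->live >> 32);
	return (uint32_t)hw->live;
}

/* Reading hi returns the latched value, not the live one. */
static uint32_t hw_read_hi(const struct fake_counter_hw *hw)
{
	return hw->latched_hi;
}

/* Consistent 64-bit read: lo first, then the latched hi. */
static uint64_t read_stat64(struct fake_counter_hw *hw)
{
	uint64_t lo = hw_read_lo(hw);
	uint64_t hi = hw_read_hi(hw);

	return (hi << 32) | lo;
}
```

Even if the counter advances between the two reads, the hi half still
belongs to the same sample as the lo half.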

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:03 +0000 (13:02 -0700)]

eth: fbnic: Reset MAC stats

Reset the MAC stats as part of the hardware stats reset to ensure
consistency. Currently, hardware stats are reset during device bring-up
and upon experiencing PCI errors; however, MAC stats are being skipped
during these resets.

When fbnic_reset_hw_stats() is called upon recovering from PCI error,
MAC stats are accessed outside the rtnl_lock. The only other access to
MAC stats is via the ethtool API, which is protected by rtnl_lock. This
can result in concurrent access to MAC stats and a potential race. Protect
the fbnic_reset_hw_stats() call in __fbnic_pm_attach() with rtnl_lock to
avoid this.

Note that fbnic_reset_hw_mac_stats() is called outside the hardware
stats lock which protects access to the fbnic_hw_stats. This is intentional
because MAC stats are fetched from the device outside this lock and are
exclusively read via the ethtool API.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:02 +0000 (13:02 -0700)]

eth: fbnic: Reset hw stats upon PCI error

Upon experiencing a PCI error, fbnic resets the device to recover from
the failure. Reset the hardware stats as part of the device reset to
ensure accurate stats reporting.

Note that the reset is not really resetting the aggregate value to 0,
which may result in a spike for a system collecting deltas in stats.
Rather, the reset re-latches the current value as previous, in case HW
got reset.
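
One way to picture "re-latching" is a software aggregate that
accumulates deltas of a 32-bit hardware counter. struct stat_counter
and its helpers are a hypothetical sketch, not the driver's data
structures.

```c
#include <assert.h>
#include <stdint.h>

/* Software view of one 32-bit hardware counter; hypothetical. */
struct stat_counter {
	uint64_t value;	/* aggregate reported to users */
	uint32_t prev;	/* last raw hardware reading */
};

/* Fold the delta since the last reading into the aggregate.
 * Unsigned 32-bit subtraction handles a single counter wrap.
 */
static void stat_update(struct stat_counter *s, uint32_t hw_now)
{
	s->value += (uint32_t)(hw_now - s->prev);
	s->prev = hw_now;
}

/* Reset: keep the aggregate, only re-latch the raw reading. */
static void stat_reset(struct stat_counter *s, uint32_t hw_now)
{
	s->prev = hw_now;
}
```

Calling stat_reset() after the hardware restarted keeps the next
stat_update() from interpreting the restarted counter as a huge delta,
while the aggregate itself is left untouched.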

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mohsin Bashir [Mon, 25 Aug 2025 20:02:01 +0000 (13:02 -0700)]

eth: fbnic: Move hw_stats_lock out of fbnic_dev

Move hw_stats_lock out of fbnic_dev to a more appropriate struct
fbnic_hw_stats since the only use of this lock is to protect access to
the hardware stats. While at it, enclose the lock and stats
initialization in a single init call.

Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825200206.2357713-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 28 Aug 2025 01:34:55 +0000 (18:34 -0700)]

Merge branch 'macsec-replace-custom-netlink-attribute-checks-with-policy-level-checks'

Sabrina Dubroca says:

====================
macsec: replace custom netlink attribute checks with policy-level checks

We can simplify attribute validation a lot by describing the accepted
ranges more precisely in the policies, using NLA_POLICY_MAX etc.

Some of the checks still need to be done later on, because the
attribute length and acceptable range can vary based on values that
can't be known when the policy is validated (cipher suite determines
the key length and valid ICV length, presence of XPN changes the PN
length, detection of duplicate SCIs or ANs, etc).

As a bonus, we get a few extack messages from the policy
validation. I'll add extack to the rest of the checks (mostly in the
genl commands) in a future series.

v1: https://lore.kernel.org/netdev/cover.1664379352.git.sd@queasysnail.net
====================

Link: https://patch.msgid.link/cover.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:31 +0000 (15:16 +0200)]

macsec: replace custom check on IFLA_MACSEC_ENCODING_SA with NLA_POLICY_MAX

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/085bc642136cf3d267ddbb114e6f0c4a9247c797.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:30 +0000 (15:16 +0200)]

macsec: replace custom checks for IFLA_MACSEC_* flags with NLA_POLICY_MAX

Those are all off/on flags.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/95707fb36adc1904fa327bc8f4eb055895aa6eff.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:29 +0000 (15:16 +0200)]

macsec: validate IFLA_MACSEC_VALIDATION with NLA_POLICY_MAX

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/629efe0b2150b30abc6472074018cbd521b46578.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:28 +0000 (15:16 +0200)]

macsec: use NLA_POLICY_VALIDATE_FN to validate IFLA_MACSEC_CIPHER_SUITE

Unfortunately, since the value of MACSEC_DEFAULT_CIPHER_ID doesn't fit
near the others, we can't use a simple range in the policy.
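
A rough userspace sketch of why a callback is needed instead of a simple
range: one accepted value sits far away from a contiguous block of others.
The SKETCH_* constants and the function name are assumptions made for
illustration; the real values live in the MACsec UAPI header and should be
checked there.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative values only; verify against the MACsec UAPI header. */
#define SKETCH_DEFAULT_CIPHER_ID  0x0080020001000001ULL
#define SKETCH_CIPHER_RANGE_FIRST 0x0080C20001000001ULL
#define SKETCH_CIPHER_RANGE_LAST  0x0080C20001000004ULL

/* Accept the lone default ID or anything inside the contiguous block. */
static int sketch_validate_cipher_suite(uint64_t csid)
{
	if (csid == SKETCH_DEFAULT_CIPHER_ID)
		return 0;
	if (csid >= SKETCH_CIPHER_RANGE_FIRST &&
	    csid <= SKETCH_CIPHER_RANGE_LAST)
		return 0;
	return -EINVAL;
}
```

A single min/max pair wide enough to cover both the default ID and the
contiguous block would also admit every value in between, which is why a
validation function is the better fit here.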

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/015e43ade9548c7682c9739087eba0853b3a1331.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:27 +0000 (15:16 +0200)]

macsec: replace custom checks on IFLA_MACSEC_ICV_LEN with NLA_POLICY_RANGE

The existing checks already force this range.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/398cf16191a634ab343ecd811c481d7bdd44a933.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:26 +0000 (15:16 +0200)]

macsec: add NLA_POLICY_MAX for MACSEC_OFFLOAD_ATTR_TYPE and IFLA_MACSEC_OFFLOAD

This is equivalent to the existing checks allowing either
MACSEC_OFFLOAD_OFF or calling macsec_check_offload.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/37e1f1716f1d1d46d3d06c52317564b393fe60e6.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:25 +0000 (15:16 +0200)]

macsec: remove validate_add_rxsc

It's not doing much anymore.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/218147f2f11cab885abc86b779dcefcd3208a2f8.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:24 +0000 (15:16 +0200)]

macsec: use NLA_UINT for MACSEC_SA_ATTR_PN

MACSEC_SA_ATTR_PN is either a u32 or a u64, we can now use NLA_UINT
for this instead of a custom binary type. We can then use a min check
within the policy.

We need to keep the length checks done in macsec_{add,upd}_{rx,tx}sa
based on whether the device is set up for XPN (with 64b PNs instead of
32b).

On the dump side, keep the existing custom code as userspace may
expect a u64 when using XPN, and nla_put_uint may only output a u32
attribute if the value fits.
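
The dump-side concern can be sketched as follows. This is a userspace model
with hypothetical helper names, assuming a variable-width unsigned attribute
is emitted as 32 bits whenever the value fits, while the custom dump code
picks the width from the device's XPN setting.

```c
#include <stddef.h>
#include <stdint.h>

/* Width a variable-size unsigned attribute would use for this value. */
static size_t sketch_uint_attr_len(uint64_t value)
{
	return value <= UINT32_MAX ? sizeof(uint32_t) : sizeof(uint64_t);
}

/* Width chosen by the custom dump code: fixed by XPN, not by the value. */
static size_t sketch_pn_dump_len(int xpn)
{
	return xpn ? sizeof(uint64_t) : sizeof(uint32_t);
}
```

For a small PN on an XPN device the two helpers disagree (4 versus 8 bytes),
which is the case where userspace expecting a u64 could be surprised.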

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/c9d32bd479cd4464e09010fbce1becc75377c8a0.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:23 +0000 (15:16 +0200)]

macsec: use NLA_POLICY_MAX_LEN for MACSEC_SA_ATTR_KEY

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/192227ca0047b643d6530ece0a3679998b010fac.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:22 +0000 (15:16 +0200)]

macsec: replace custom checks on MACSEC_SA_ATTR_KEYID with NLA_POLICY_EXACT_LEN

The existing checks already specify that MACSEC_SA_ATTR_KEYID must
have length MACSEC_KEYID_LEN.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/c4c113328962aae4146183e7a27854e854c796fb.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:21 +0000 (15:16 +0200)]

macsec: replace custom checks on MACSEC_SA_ATTR_SALT with NLA_POLICY_EXACT_LEN

The existing checks already specify that MACSEC_SA_ATTR_SALT must have
length MACSEC_SALT_LEN.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/9699c5fd72322118b164cc8777fadabcce3b997c.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:20 +0000 (15:16 +0200)]

macsec: replace custom checks on MACSEC_*_ATTR_ACTIVE with NLA_POLICY_MAX

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/2b07434304c725c72a7d81a8460d0bbe8af384a2.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Sabrina Dubroca [Tue, 26 Aug 2025 13:16:19 +0000 (15:16 +0200)]

macsec: replace custom checks on MACSEC_SA_ATTR_AN with NLA_POLICY_MAX

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/22a7820cfc2cbfe5e33f030f1a3276e529cc70dc.1756202772.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 28 Aug 2025 01:23:04 +0000 (18:23 -0700)]

Merge branch 'net-prevent-rps-table-overwrite-of-active-flows'

Krishna Kumar says:

====================
net: Prevent RPS table overwrite of active flows

This series splits the original RPS patch [1] into two patches for
net-next. It also addresses a kernel test robot warning by defining
rps_flow_is_active() only when aRFS is enabled. I tested v3 with
four builds and reboots: two for [PATCH 1/2] with aRFS enabled &
disabled, and two for [PATCH 2/2]. There are no code changes in v4
and v5, only documentation. Patch v6 has one line change to keep
'hash' field under #ifdef, and was test built with aRFS=on and
aRFS=off. The same two builds were done for v7, along with 15m load
testing with aRFS=on to ensure the new changes are correct.

The first patch prevents RPS table overwrite for active flows thereby
improving aRFS stability.

The second patch caches hash & flow_id in get_rps_cpu() to avoid
recalculating it in set_rps_cpu().

[1] lore.kernel.org/netdev/20250708081516.53048-1-krikku@gmail.com/
[2] lore.kernel.org/netdev/20250729104109.1687418-1-krikku@gmail.com/
====================

Link: https://patch.msgid.link/20250825031005.3674864-1-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Krishna Kumar [Mon, 25 Aug 2025 03:10:05 +0000 (08:40 +0530)]

net: Cache hash and flow_id to avoid recalculation

get_rps_cpu() can cache flow_id and hash, as both are required by
set_rps_cpu(), instead of having set_rps_cpu() recalculate them.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-3-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Krishna Kumar [Mon, 25 Aug 2025 03:10:04 +0000 (08:40 +0530)]

net: Prevent RPS table overwrite of active flows

This patch fixes an issue where two different flows on the same RXq
produce the same hash resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 comes in, kernel calls the steering
         function. The driver gives back a filter id. The kernel saves
         this filter id in the selected slot. Later, the driver's
         service task checks if any filters have expired and then
         installs the rule for Flow #1.
Flow #2: A packet for Flow #2 comes in. It goes through the same steps.
         But this time, the chosen slot is being used by Flow #1. The
         driver gives a new filter id and the kernel saves it in the
         same slot. When the driver's service task runs, it runs through
         all the flows, checks if Flow #1 should be expired, the kernel
         returns True as the slot has a different filter id, and then
         the driver installs the rule for Flow #2.
Flow #1: Another packet for Flow #1 comes in. The same thing repeats.
         The slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets,
wasting CPU cycles while not improving performance. This problem happens
at higher rates when the RPS table is small, but tests show it still
happens even with 12,000 connections and an RPS size of 16K per queue
(global table size = 144x16K = 64K).

This patch prevents overwriting an rps_dev_flow entry if it is active.
The intention is that it is better to do aRFS for the first flow instead
of hurting all flows on the same hash. Without this, two (or more) flows
on one RX queue with the same hash can keep overwriting each other. This
causes the driver to reprogram the flow repeatedly.

Changes:
  1. Add a new 'hash' field to struct rps_dev_flow.
  2. Add rps_flow_is_active(): a helper function to check if a flow is
     active or not, extracted from rps_may_expire_flow(). It is further
     simplified as per reviewer feedback.
  3. In set_rps_cpu():
     - Avoid overwriting by programming a new filter if:
        - The slot is not in use, or
        - The slot is in use but the flow is not active, or
        - The slot has an active flow with the same hash, but target CPU
          differs.
     - Save the hash in the rps_dev_flow entry.
  4. rps_may_expire_flow(): Use earlier extracted rps_flow_is_active().
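
The conditions listed under item 3 can be sketched as a single predicate.
This is a simplified userspace model, not the kernel code: the struct layout
and names are hypothetical, and "active" stands in for whatever
rps_flow_is_active() reports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one flow table slot. */
struct sketch_flow_slot {
	bool in_use;   /* a filter was programmed for this slot */
	bool active;   /* the stored flow is still considered active */
	uint32_t hash; /* hash saved when the filter was programmed */
	int cpu;       /* CPU the stored flow is steered to */
};

/* Return true when programming a new filter into the slot is allowed. */
static bool sketch_may_program_filter(const struct sketch_flow_slot *slot,
				      uint32_t hash, int next_cpu)
{
	if (!slot->in_use)
		return true;  /* empty slot */
	if (!slot->active)
		return true;  /* stale flow may be replaced */
	/* Active flow: only the same flow moving to another CPU may reprogram. */
	return slot->hash == hash && slot->cpu != next_cpu;
}
```

An active flow with a different hash is therefore left alone, which is what
stops two colliding flows from evicting each other in a loop.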

Testing & results:
  - Driver: ice (E810 NIC), Kernel: net-next
  - #CPUs = #RXq = 144 (1:1)
  - Number of flows: 12K
  - Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal,
    it is still sufficient to cover 12K flows (256*144 rx-queues = 64K
    global table slots)
  - Global Table Size = 144 * RPS (effectively equal to 256 * RPS)
  - Each RPS test duration = 8 mins (org code) + 8 mins (new code).
  - Metrics captured on client

Legend for following tables:
Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
Units:   K = 1,000 times, M = 1 million times

  |-------|---------|------|     Org Code    |---------|---------|
  | RPS   | Latency | CPU  | Add    |  Del   | Steer-C | Steer-L |
  |-------|---------|------|--------|--------|---------|---------|
  | 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
  | 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
  | 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
  | 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
  | 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
  | 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
  | 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
  | 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |
  |-------|---------|------|   New Code      |---------|---------|
  | 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
  | 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
  | 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
  | 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
  | 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
  | 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
  | 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  | 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
  |-------|---------|------|--------|--------|---------|---------|

Some observations:
  1. Overall Latency improved: (1790.19-1634.94)/1790.19*100 = 8.67%
  2. Overall CPU increased:    (777.32-751.49)/751.45*100    = 3.44%
  3. Flow Management (add/delete) remained almost constant at ~11K
     compared to values in millions.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Qianfeng Rong [Tue, 26 Aug 2025 13:50:19 +0000 (21:50 +0800)]

net: wwan: iosm: use int type to store negative error codes

The 'ret' variable in ipc_pcie_resources_request() either stores '-EBUSY'
directly or holds returns from pci_request_regions() and ipc_acquire_irq().
Storing negative error codes in u32 causes no runtime issues but is
stylistically inconsistent and very ugly. Change 'ret' from u32 to int
type - this has no runtime impact.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20250826135021.510767-1-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Heiner Kallweit [Tue, 26 Aug 2025 19:24:44 +0000 (21:24 +0200)]

net: phy: fixed_phy: simplify fixed_mdio_read

swphy_read_reg() doesn't change the passed struct fixed_phy_status,
so we can pass &fp->status directly.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/c49195c7-a3a1-485c-baed-9b33740752de@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Qianfeng Rong [Tue, 26 Aug 2025 14:21:59 +0000 (22:21 +0800)]

amd-xgbe: Use int type to store negative error codes

Use int instead of unsigned int for the 'ret' variable to store return
values from functions that either return zero on success or negative error
codes on failure. Storing negative error codes in an unsigned int causes
no runtime issues, but it's ugly as pants. Change 'ret' from unsigned int
to int type - this change has no runtime impact.

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250826142159.525059-1-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Andre Przywara [Mon, 25 Aug 2025 17:20:55 +0000 (18:20 +0100)]

net: stmmac: sun8i: drop unneeded default syscon value

For some odd reason we were very jealous about the value of the EMAC
clock register from the syscon block, insisting on a reset value and
only doing read-modify-write operations on that register, even though we
pretty much know the register layout.
This already led to a basically redundant entry for the H6, which only
differs by that value. We seem to have the same situation with the new
A523 SoC, which again is compatible with the A64, but has a different
syscon reset value.

Drop any assumptions about that value, and set or clear the bits that we
want to program, from scratch (starting with a value of 0). For the
remove() implementation, we just turn on the POWERDOWN bit, and deselect
the internal PHY, which mimics the existing code.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Acked-by: Corentin LABBE <clabbe.montjoie@gmail.com>
Tested-by: Corentin LABBE <clabbe.montjoie@gmail.com>
Tested-by: Paul Kocialkowski <paulk@sys-base.io>
Reviewed-by: Paul Kocialkowski <paulk@sys-base.io>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250825172055.19794-1-andre.przywara@arm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Fabio Estevam [Tue, 26 Aug 2025 14:17:36 +0000 (11:17 -0300)]

dt-bindings: nfc: ti,trf7970a: Restrict the ti,rx-gain-reduction-db values

Instead of stating the supported values for the ti,rx-gain-reduction-db
property in free text format, add an enum entry that can help validating
the devicetree files.

Signed-off-by: Fabio Estevam <festevam@gmail.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250826141736.712827-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Alok Tiwari [Tue, 26 Aug 2025 10:22:15 +0000 (03:22 -0700)]

net: stmmac: rk: remove incorrect _DLY_DISABLE bit definition

The RK3328 GMAC clock delay macros define enable/disable controls for
TX and RX clock delay. While the TX definitions are correct, the
RXCLK_DLY_DISABLE macro incorrectly clears bit 0.

The macros RK3328_GMAC_TXCLK_DLY_DISABLE and
RK3328_GMAC_RXCLK_DLY_DISABLE are not referenced anywhere
in the driver code. Remove them to clean up unused definitions.

No functional change.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250826102219.49656-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Aleksander Jan Bajkowski [Mon, 25 Aug 2025 21:09:49 +0000 (23:09 +0200)]

net: phy: realtek: support for TRIGGER_NETDEV_LINK on RTL8211E and RTL8211F

This patch adds support for the TRIGGER_NETDEV_LINK trigger. It activates
the LED when a link is established, regardless of the speed.

Tested on Orange Pi PC2 with RTL8211E PHY.

Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250825211059.143231-1-olek2@wp.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Wed, 27 Aug 2025 01:11:31 +0000 (18:11 -0700)]

Merge branch 'ipv6-sr-simplify-and-optimize-hmac-calculations'

Eric Biggers says:

====================
ipv6: sr: Simplify and optimize HMAC calculations

This series simplifies and optimizes the HMAC calculations in
IPv6 Segment Routing.
====================

Link: https://patch.msgid.link/20250824013644.71928-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Biggers [Sun, 24 Aug 2025 01:36:44 +0000 (21:36 -0400)]

ipv6: sr: Prepare HMAC key ahead of time

Prepare the HMAC key when it is added to the kernel, instead of
preparing it implicitly for every packet. This significantly improves
the performance of seg6_hmac_compute(). A microbenchmark on x86_64
shows seg6_hmac_compute() (with HMAC-SHA256) dropping from ~1978 cycles
to ~1419 cycles, a 28% improvement.

The size of 'struct seg6_hmac_info' increases by 128 bytes, but that
should be fine, since there should not be a massive number of keys.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Eric Biggers [Sun, 24 Aug 2025 01:36:43 +0000 (21:36 -0400)]

ipv6: sr: Use HMAC-SHA1 and HMAC-SHA256 library functions

Use the HMAC-SHA1 and HMAC-SHA256 library functions instead of
crypto_shash. This is simpler and faster. Pre-allocating per-CPU hash
transformation objects and descriptors is no longer needed, and a
microbenchmark on x86_64 shows seg6_hmac_compute() (with HMAC-SHA256)
dropping from ~2494 cycles to ~1978 cycles, a 20% improvement.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20250824013644.71928-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Wed, 27 Aug 2025 00:35:30 +0000 (17:35 -0700)]

Merge branch 'selftests-drv-net-ncdevmem-fix-error-paths'

Jakub Kicinski says:

====================
selftests: drv-net: ncdevmem: fix error paths

Make ncdevmem clean up after itself. While at it make sure it sets
HDS threshold to 0 automatically.

v1: https://lore.kernel.org/20250822200052.1675613-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20250825180447.2252977-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 18:04:47 +0000 (11:04 -0700)]

selftests: drv-net: ncdevmem: explicitly set HDS threshold to 0

Make sure we set HDS threshold to 0 if the device supports changing it.
It's required for ZC.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250825180447.2252977-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 18:04:46 +0000 (11:04 -0700)]

selftests: drv-net: ncdevmem: restore original HDS setting before exiting

Restore HDS settings if we modified them.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250825180447.2252977-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 18:04:45 +0000 (11:04 -0700)]

selftests: drv-net: ncdevmem: restore old channel config

In case changing channel count with provider bound succeeds
unexpectedly - make sure we return to original settings.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250825180447.2252977-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 18:04:44 +0000 (11:04 -0700)]

selftests: drv-net: ncdevmem: save IDs of flow rules we added

In prep for more selective resetting of ntuple filters
try to save the rule IDs to a table.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250825180447.2252977-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 18:04:43 +0000 (11:04 -0700)]

selftests: drv-net: ncdevmem: remove use of error()

Using error() makes it impossible for callers to unwind their
changes. Replace error() calls with proper error handling.
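
As a generic illustration of the pattern (not the ncdevmem code itself), the
sketch below returns error codes and unwinds earlier steps instead of exiting
on the spot. All names are hypothetical.

```c
/* Counts how many times the undo step ran, so the sketch can be checked. */
static int sketch_undo_count;

/* Stand-in for a configuration step that may fail. */
static int sketch_step(int fail)
{
	return fail ? -1 : 0;
}

/* Stand-in for reverting the first configuration step. */
static void sketch_undo_first(void)
{
	sketch_undo_count++;
}

/* Run two steps; if the second fails, revert the first before returning. */
static int sketch_configure(int fail_second)
{
	int ret;

	ret = sketch_step(0);
	if (ret)
		return ret;

	ret = sketch_step(fail_second);
	if (ret)
		goto err_undo_first;

	return 0;

err_undo_first:
	sketch_undo_first();
	return ret;
}
```

A helper that exits the process at the point of failure would never reach
the err_undo_first label, leaving the first step's changes in place.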

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250825180447.2252977-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Mon, 25 Aug 2025 17:59:39 +0000 (10:59 -0700)]

selftests: drv-net: hds: restore hds settings

The test currently modifies the HDS settings and doesn't restore them.
This may cause subsequent tests to fail (or pass when they should not).
Add defer()ed reset handling.

Link: https://patch.msgid.link/20250825175939.2249165-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Guillaume Nault [Mon, 25 Aug 2025 13:37:43 +0000 (15:37 +0200)]

ipv4: Convert ->flowi4_tos to dscp_t.

Convert the ->flowic_tos field of struct flowi_common from __u8 to
dscp_t, rename it ->flowic_dscp and propagate these changes to struct
flowi and struct flowi4.

We've had several bugs in the past where ECN bits could interfere with
IPv4 routing, because these bits were not properly cleared when setting
->flowi4_tos. These bugs should be fixed now and the dscp_t type has
been introduced to ensure that variables carrying DSCP values don't
accidentally have any ECN bits set. Several variables and structure
fields have been converted to dscp_t already, but the main IPv4 routing
structure, struct flowi4, is still using a __u8. To avoid any future
regression, this patch converts it to dscp_t.

There are many users to convert at once. Fortunately, around half of
->flowi4_tos users already have a dscp_t value at hand, which they
currently convert to __u8 using inet_dscp_to_dsfield(). For all of
these users, we just need to drop that conversion.

But, although we try to do the __u8 <-> dscp_t conversions at the
boundaries of the network or of user space, some places still store
TOS/DSCP variables as __u8 in core networking code. Those can hardly be
converted either because the data structure is part of UAPI or because
the same variable or field is also used for handling ECN in other parts
of the code. In all of these cases where we don't have a dscp_t
variable at hand, we need to use inet_dsfield_to_dscp() when
interacting with ->flowi4_dscp.
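
The idea behind the conversion can be modelled in userspace as below. This is
a sketch under assumptions: the type and helper names are hypothetical
stand-ins for dscp_t and the inet_dsfield_to_dscp()/inet_dscp_to_dsfield()
helpers, and the mask assumes DSCP occupies the upper six bits of the DS
field with ECN in the lower two.

```c
#include <stdint.h>

#define SKETCH_DSCP_MASK 0xfc /* upper six bits of the DS field */

/* A distinct wrapper type so raw bytes cannot be assigned by accident. */
typedef struct {
	uint8_t v;
} sketch_dscp_t;

/* Convert a raw DS field byte to a DSCP value, dropping the ECN bits. */
static sketch_dscp_t sketch_dsfield_to_dscp(uint8_t dsfield)
{
	sketch_dscp_t d = { (uint8_t)(dsfield & SKETCH_DSCP_MASK) };

	return d;
}

/* Convert back to a raw byte; the ECN bits are always zero here. */
static uint8_t sketch_dscp_to_dsfield(sketch_dscp_t d)
{
	return d.v;
}
```

Since the only way to build a sketch_dscp_t is through the masking helper, a
value of that type cannot carry ECN bits into a routing lookup.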

Changes since v1:
* Fix space alignment in __bpf_redirect_neigh_v4() (Ido).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/29acecb45e911d17446b9a3dbdb1ab7b821ea371.1756128932.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Wed, 27 Aug 2025 00:24:18 +0000 (17:24 -0700)]

Merge branch 'expose-burst-period-for-devlink-health-reporter'

Mark Bloch says:

====================
Expose burst period for devlink health reporter

Shahar writes:
--------------------------------------------------------------------------

Currently, the devlink health reporter initiates the grace period
immediately after recovering an error, which blocks further recovery
attempts until the grace period concludes. Since additional errors
are not generally expected during this short interval, any new error
reported during the grace period is not only rejected but also causes
the reporter to enter an error state that requires manual intervention.

This approach poses a problem in scenarios where a single root cause
triggers multiple related errors in quick succession - for example,
a PCI issue affecting multiple hardware queues. Because these errors
are closely related and occur rapidly, it is more effective to handle
them together rather than handling only the first one reported and
blocking any subsequent recovery attempts. Furthermore, setting the
reporter to an error state in this context can be misleading, as these
multiple errors are manifestations of a single underlying issue, making
it unlike the general case where additional errors are not expected
during the grace period.

To resolve this, introduce a configurable burst period attribute to the
devlink health reporter. This period starts when the first error
is recovered and lasts for a user-defined duration. Once this error
burst period expires, the grace period begins. After the grace period
ends, a new reported error will start the same flow again.

Timeline summary:

----|--------|------------------------------/----------------------/--
error is  error is      burst period             grace period
reported  recovered  (recoveries allowed)     (recoveries blocked)

With burst period, create a time window during which recovery attempts
are permitted, allowing all reported errors to be handled sequentially
before the grace period starts. Once the grace period begins, it
prevents any further error recoveries until it ends.

When burst period is set to 0, current behavior is preserved.
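
The timeline above can be sketched as a small decision function. This is a
simplified model rather than the devlink implementation: the function name
and millisecond parameters are assumptions, and now_ms is assumed to be at or
after first_recovery_ms.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a recovery attempt is allowed at now_ms, given when the
 * first error of the current flow was recovered.
 */
static bool sketch_recovery_allowed(uint64_t now_ms,
				    uint64_t first_recovery_ms,
				    uint64_t burst_ms, uint64_t grace_ms)
{
	uint64_t elapsed = now_ms - first_recovery_ms;

	if (elapsed < burst_ms)
		return true;  /* burst period: recoveries allowed */
	if (elapsed < burst_ms + grace_ms)
		return false; /* grace period: recoveries blocked */
	return true;          /* both expired: a new flow starts */
}
```

With burst_ms set to 0 the first branch never triggers, so the grace period
starts right after the first recovery, matching the existing behavior.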

Design alternatives considered:

1. Recover all queues upon any error:
   A brute-force approach that recovers all queues on any error.
   While simple, it is overly aggressive and disrupts unaffected queues
   unnecessarily. Also, because this is handled entirely within the
   driver, it leads to a driver-specific implementation rather than a
   generic one.

2. Per-queue reporter:
   This design would isolate recovery handling per SQ or RQ, effectively
   removing interdependencies between queues. While conceptually clean,
   it introduces significant scalability challenges as the number of
   queues grows, as well as synchronization challenges across multiple
   reporters.

3. Error aggregation with delayed handling:
   Errors arriving during the grace period are saved and processed after
   it ends. While addressing the issue of related errors whose recovery
   is aborted as grace period started, this adds complexity due to
   synchronization needs and contradicts the assumption that no errors
   should occur during a healthy system’s grace period. Also, this
   breaks the important role of grace period in preventing an infinite
   loop of immediate error detection following recovery. In such cases
   we want to stop.

4. Allowing a fixed burst of errors before starting grace period:
   Allows a set number of recoveries before the grace period begins.
   However, it also requires limiting the error reporting window.
   To keep the design simple, the burst threshold becomes redundant.

The burst period design was chosen for its simplicity and precision in
addressing the problem at hand. It effectively captures the temporal
correlation of related errors and aligns with the original intent of
the grace period as a stabilization window where further errors are
unexpected, and if they do occur, they indicate an abnormal system
state.

v3: https://lore.kernel.org/1755111349-416632-1-git-send-email-tariqt@nvidia.com
v2: https://lore.kernel.org/1753390134-345154-1-git-send-email-tariqt@nvidia.com
v1: https://lore.kernel.org/1752768442-264413-1-git-send-email-tariqt@nvidia.com
====================

Link: https://patch.msgid.link/20250824084354.533182-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Shahar Shitrit [Sun, 24 Aug 2025 08:43:54 +0000 (11:43 +0300)]

net/mlx5e: Set default burst period for TX and RX reporters

System errors can sometimes cause multiple errors to be reported
to the TX reporter at the same time. For instance, lost interrupts
may cause several SQs to time out simultaneously. When dev_watchdog
notifies the driver for that, it iterates over all SQs to trigger
recovery for the timed-out ones, via TX health reporter.
However, grace period allows only one recovery at a time, so only
the first SQ recovers while others remain blocked. Since no further
recoveries are allowed during the grace period, subsequent errors
cause the reporter to enter an ERROR state, requiring manual
intervention.

To address this, set the TX reporter's default burst period
to 0.5 seconds. This allows the reporter to detect and handle all
timed-out SQs within this window before initiating the grace period.

To account for the possibility of a similar issue in the RX reporter,
its default burst period is also configured.

Additionally, while here, align the TX definition prefix with the RX,
as these are used only in EN driver.

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Shahar Shitrit [Sun, 24 Aug 2025 08:43:53 +0000 (11:43 +0300)]

devlink: Make health reporter burst period configurable

Enable configuration of the burst period — a time window starting
from the first error recovery, during which the reporter allows
recovery attempts for each reported error.

This feature is helpful when a single underlying issue causes multiple
errors, as it delays the start of the grace period to allow sufficient
time for recovering all related errors. For example, if multiple TX
queues time out simultaneously, a sufficient burst period could allow
all affected TX queues to be recovered within that window. Without this
period, only the first TX queue that reports a timeout will undergo
recovery, while the remaining TX queues will be blocked once the grace
period begins.

Configuration example:
$ devlink health set pci/0000:00:09.0 reporter tx burst_period 500

Configuration example with ynl:
./tools/net/ynl/pyynl/cli.py \
--spec Documentation/netlink/specs/devlink.yaml \
--do health-reporter-set --json '{
  "bus-name": "auxiliary",
  "dev-name": "mlx5_core.eth.0",
  "port-index": 65535,
  "health-reporter-name": "tx",
  "health-reporter-burst-period": 500
}'

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Shahar Shitrit [Sun, 24 Aug 2025 08:43:52 +0000 (11:43 +0300)]

devlink: Introduce burst period for health reporter

Currently, the devlink health reporter starts the grace period
immediately after handling an error, blocking any further recoveries
until it finishes.

However, when a single root cause triggers multiple errors in a short
time frame, it is desirable to treat them as a bulk of errors and to
allow their recoveries, avoiding premature blocking of subsequent
related errors, and reducing the risk of inconsistent or incomplete
error handling.

To address this, introduce a configurable burst period for devlink
health reporter. Start this period when the first error is handled,
and allow recovery attempts for reported errors during this window.
Once burst period expires, begin the grace period to block further
recoveries until it concludes.

Timeline summary:

----|--------|------------------------------/----------------------/--
error is  error is          burst period           grace period
reported  recovered     (recoveries allowed)    (recoveries blocked)

For calculating the burst period duration, use the same
last_recovery_ts as the grace period. Update it on recovery only
when the burst period is inactive (either disabled or at the
first error).

This patch implements the framework for the burst period and
effectively sets its value to 0 at reporter creation, so the current
behavior remains unchanged, which ensures backward compatibility.

A downstream patch will make the burst period configurable.
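
Note: the timeline above can be modeled with a small userspace sketch. It is
not the kernel implementation; the struct, its fields, and the millisecond
timestamps are assumptions made for illustration, with window_start playing
the role of last_recovery_ts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one reporter; all times are in milliseconds. */
struct burst_model {
	uint64_t burst_ms;      /* burst period; 0 disables it */
	uint64_t grace_ms;      /* grace period that follows the burst period */
	uint64_t window_start;  /* plays the role of last_recovery_ts */
	bool     window_open;   /* false until the first recovery */
};

/*
 * Decide whether an error reported at now_ms may be recovered.
 * The window start is updated only when no burst period is active,
 * i.e. at the first error or once the grace period has ended.
 */
bool recovery_allowed(struct burst_model *m, uint64_t now_ms)
{
	if (m->window_open) {
		uint64_t elapsed = now_ms - m->window_start;

		if (elapsed < m->burst_ms)
			return true;   /* inside the burst period */
		if (elapsed < m->burst_ms + m->grace_ms)
			return false;  /* inside the grace period */
	}

	m->window_start = now_ms;      /* start a new window */
	m->window_open = true;
	return true;
}
```

With burst_ms set to 0 the model degenerates to the previous behavior: the
first recovery is allowed and the grace period starts immediately.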

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Shahar Shitrit [Sun, 24 Aug 2025 08:43:51 +0000 (11:43 +0300)]

devlink: Move health reporter recovery abort logic to a separate function

Extract the health reporter recovery abort logic into a separate
function devlink_health_recover_abort().
The function encapsulates the conditions for aborting recovery:
- When auto-recovery is disabled
- When previous error wasn't recovered
- When within the grace period after last recovery
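
Note: a minimal sketch of such an abort predicate, written for userspace.
The struct and field names are hypothetical stand-ins rather than the
kernel's definitions, and timestamps are plain milliseconds instead of
jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified reporter state; not the kernel structure. */
struct reporter_model {
	bool     auto_recover;         /* auto-recovery enabled */
	bool     prev_error_unhealed;  /* previous error was not recovered */
	uint64_t last_recovery_ms;     /* timestamp of the last recovery */
	uint64_t grace_period_ms;      /* 0 means no grace period */
};

/* Return true when a new recovery attempt should be aborted. */
bool recover_abort(const struct reporter_model *r, uint64_t now_ms)
{
	if (!r->auto_recover)
		return true;
	if (r->prev_error_unhealed)
		return true;
	if (r->grace_period_ms &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return true;
	return false;
}
```

Keeping the three checks in one predicate lets the caller read as a single
"abort or recover" decision.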

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Shahar Shitrit [Sun, 24 Aug 2025 08:43:50 +0000 (11:43 +0300)]

devlink: Move graceful period parameter to reporter ops

Move the default graceful period from a parameter of
devlink_health_reporter_create() to a field in the
devlink_health_reporter_ops structure.

This change improves consistency, as the graceful period is inherently
tied to the reporter's behavior and recovery policy. It simplifies the
signature of devlink_health_reporter_create() and its internal helper
functions. It also centralizes the reporter configuration at the ops
structure, preparing the groundwork for a downstream patch that will
introduce a devlink health reporter burst period attribute whose
default value will similarly be provided by the driver via the ops
structure.
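
Note: the shape of this change can be sketched in userspace as follows. The
types and field names are illustrative assumptions, not the devlink
definitions; the point is only that the creator reads the default from the
ops structure instead of taking it as an argument.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down ops structure; field names are illustrative. */
struct reporter_ops_model {
	const char *name;
	uint64_t    default_graceful_period_ms;
};

struct reporter_instance_model {
	const struct reporter_ops_model *ops;
	uint64_t graceful_period_ms;  /* runtime value, may be changed later */
};

/* The creator takes no period argument; the default comes from ops. */
void reporter_init(struct reporter_instance_model *r,
		   const struct reporter_ops_model *ops)
{
	r->ops = ops;
	r->graceful_period_ms = ops->default_graceful_period_ms;
}
```

A further default, such as a burst period, would then be one more field in
the ops model rather than one more function parameter.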

Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250824084354.533182-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Heiner Kallweit [Sat, 23 Aug 2025 21:25:05 +0000 (23:25 +0200)]

net: phy: fixed_phy: let fixed_phy_unregister free the phy_device

fixed_phy_register() creates and registers the phy_device. To be
symmetric, we should not only unregister, but also free the phy_device
in fixed_phy_unregister(). This allows simplifying the code in users.

Note wrt of_phy_deregister_fixed_link():
put_device(&phydev->mdio.dev) and phy_device_free(phydev) are identical.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/ad8dda9a-10ed-4060-916b-3f13bdbb899d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Heiner Kallweit [Fri, 22 Aug 2025 20:36:11 +0000 (22:36 +0200)]

net: phy: fixed: let fixed_phy_add always use addr 0 and remove return value

We have only two users of fixed_phy_add(), both use address 0 and
ignore the return value. So simplify fixed_phy_add() accordingly.

Whilst at it, constify the fixed_phy_status configs.

Note:
fixed_phy_add() is a legacy function which shouldn't be used in new
code, as its use may be problematic:
- No check whether a fixed phy exists already at the given address
- If fixed_phy_register() is called afterwards by any other driver,
then it will also use phy_addr 0, because fixed_phy_add() ignores
the ida which manages address assignment
Drivers using a fixed phy created by fixed_phy_add() in platform code,
should dynamically create a fixed phy with fixed_phy_register()
instead.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/762700e5-a0b1-41af-aa03-929822a39475@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Qianfeng Rong [Mon, 25 Aug 2025 14:27:52 +0000 (22:27 +0800)]

net: hns3: use kcalloc() instead of kzalloc()

As noted in the kernel documentation, open-coded multiplication in
allocator arguments is discouraged because it can lead to integer overflow.

Use devm_kcalloc() to gain built-in overflow protection, making memory
allocation safer when calculating allocation size compared to explicit
multiplication.
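
Note: a userspace analogue of the overflow check, for illustration only.
checked_calloc() is a hypothetical helper, not a kernel API; it shows the
kind of test an array allocator performs before multiplying count by
element size.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Overflow-checked array allocation: return NULL instead of
 * allocating a too-small buffer when n * size does not fit in size_t.
 */
void *checked_calloc(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;  /* n * size would overflow */
	return calloc(n, size);
}
```

With an open-coded n * size, an oversized n wraps around silently and the
allocator is asked for a much smaller buffer than the caller expects.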

Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20250825142753.534509-1-rongqianfeng@vivo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Christian Marangi [Sat, 23 Aug 2025 13:44:29 +0000 (15:44 +0200)]

net: phy: as21xxx: better handle PHY HW reset on soft-reboot

On soft-reboot, with a reset GPIO defined for an Aeonsemi PHY, the
special match_phy_device fails to correctly identify that the PHY
needs to load the firmware again.

This is caused by the fact that the PHY ID is read BEFORE the PHY reset
GPIO (if present) is asserted, so we can end up in the scenario where the
phydev has the previous PHY ID (with the PHY firmware loaded) but, after
reset, the generic AS21xxx PHY is present in the PHY ID registers.

To better handle this, skip reading the PHY ID register only for PHYs
that are not AS21xxx (by matching on the Aeonsemi vendor) and always
read the PHY ID in the other case, to handle both firmware already
loaded and a HW reset.

Fixes: 830877d89edc ("net: phy: Add support for Aeonsemi AS21xxx PHYs")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://patch.msgid.link/20250823134431.4854-2-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Christian Marangi [Sat, 23 Aug 2025 13:44:28 +0000 (15:44 +0200)]

net: phy: introduce phy_id_compare_vendor() PHY ID helper

Introduce phy_id_compare_vendor() PHY ID helper to compare a PHY ID with
the PHY ID Vendor using the generic PHY ID Vendor mask.

While at it, also rework the PHY_ID_MATCH macro and move the mask to a
dedicated define so that PHY drivers can make use of the mask if needed.
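
Note: a userspace sketch of a masked vendor comparison. The function and
macro names are illustrative, and the mask value (bits 31..10 of the 32-bit
PHY ID) is an assumption about the generic vendor mask; the IDs used with it
below are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed vendor mask: bits 31..10 of the 32-bit PHY ID. */
#define VENDOR_MASK_MODEL 0xfffffc00u

/* Two IDs match under a mask when they agree on every masked bit. */
bool id_compare(uint32_t id1, uint32_t id2, uint32_t mask)
{
	return !((id1 ^ id2) & mask);
}

/* Compare only the vendor part, ignoring model and revision bits. */
bool id_compare_vendor(uint32_t id, uint32_t vendor_id)
{
	return id_compare(id, vendor_id, VENDOR_MASK_MODEL);
}
```

For example, id_compare_vendor(0x75009410, 0x75009400) is true because the
two values differ only below bit 10.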

Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250823134431.4854-1-ansuelsmth@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Mingming Cao [Thu, 21 Aug 2025 13:02:15 +0000 (06:02 -0700)]

ibmvnic: Increase max subcrq indirect entries with fallback

POWER8 supports a maximum of 16 subcrq indirect descriptor entries per
H_SEND_SUB_CRQ_INDIRECT call, while POWER9 and newer hypervisors
support up to 128 entries. Increasing the max number of indirect
descriptor entries improves batching efficiency and reduces
hcall overhead, which enhances throughput under large workloads on POWER9+.

Currently, ibmvnic driver always uses a fixed number of max indirect
descriptor entries (16). send_subcrq_indirect() treats all hypervisor
errors the same:
- Clean up and drop the entire batch of descriptors.
- Return an error to the caller.
- Rely on TCP/IP retransmissions to recover.
- If the hypervisor returns H_PARAMETER (e.g., because 128
   entries are not supported on POWER8), the driver will continue
   to drop batches, resulting in unnecessary packet loss.

In this patch:
Raise the default maximum indirect entries to 128 to improve ibmvnic
batching on modern platforms, but also gracefully fall back to
16 entries on POWER8 systems.

Since there is no VIO interface to query the hypervisor’s supported
limit, vnic handles send_subcrq_indirect() H_PARAMETER errors:
- On first H_PARAMETER failure, log the failure context
- Reduce max_indirect_entries to 16 and allow the single batch to drop.
- Subsequent calls automatically use the correct lower limit,
    avoiding repeated drops.
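
Note: the fallback described above can be sketched in userspace like this.
Everything here is a hypothetical model, not driver code: the return codes
stand in for hcall results, and fake_hcall() plays a hypervisor whose real
limit the driver cannot query.

```c
#include <stddef.h>

#define IND_ENTRIES_DEFAULT 128  /* limit assumed for POWER9 and newer */
#define IND_ENTRIES_LEGACY   16  /* limit assumed for POWER8 */

enum hcall_rc_model { RC_OK, RC_PARAMETER };  /* stand-ins for hcall results */

struct vnic_model {
	size_t max_ind_entries;  /* batching limit the driver currently uses */
	size_t hv_limit;         /* hypervisor's real limit, unknown to the driver */
};

/* Fake hypervisor: rejects batches larger than its own limit. */
enum hcall_rc_model fake_hcall(const struct vnic_model *v, size_t nr_entries)
{
	return nr_entries <= v->hv_limit ? RC_OK : RC_PARAMETER;
}

/*
 * Send one batch. On a parameter error the batch is dropped and the
 * limit is lowered, so later batches fit the older hypervisor.
 */
int send_batch(struct vnic_model *v, size_t nr_entries)
{
	enum hcall_rc_model rc = fake_hcall(v, nr_entries);

	if (rc == RC_PARAMETER && v->max_ind_entries > IND_ENTRIES_LEGACY)
		v->max_ind_entries = IND_ENTRIES_LEGACY;

	return rc == RC_OK ? 0 : -1;
}
```

Only the first oversized batch is lost; once the limit has been lowered, a
caller that sizes batches by max_ind_entries no longer triggers the error.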

The goal is to optimize performance on modern systems while handling
the fallback for older POWER8 hypervisors.

Performance improves by 40% with MTU 1500 under a large workload.

Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Reviewed-by: Haren Myneni <haren@linux.ibm.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250821130215.97960-1-mmc@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


Yue Haibing [Wed, 20 Aug 2025 12:30:07 +0000 (20:30 +0800)]

octeontx2-af: Remove unused declarations

Commit 1845ada47f6d ("octeontx2-af: cn10k: Add RPM LMAC pause frame
support") removed cgx_lmac_[s|g]et_pause_frm() but left their
declarations behind, unused.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://patch.msgid.link/20250820123007.1705047-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


David Yang [Sun, 24 Aug 2025 01:30:03 +0000 (09:30 +0800)]

net: phylink: remove stale an_enabled from doc

state->an_enabled was removed by
commit 4ee9b0dcf09f ("net: phylink: remove an_enabled")
but is left in mac_config() doc, so clean it.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250824013009.2443580-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Tue, 26 Aug 2025 00:53:40 +0000 (17:53 -0700)]

Merge branch 'tcp-follow-up-for-dccp-removal'

Kuniyuki Iwashima says:

====================
tcp: Follow up for DCCP removal.

As I mentioned in [0], TCP still has code for DCCP.

This series cleans up such leftovers.

[0]: https://patch.msgid.link/20250410023921.11307-3-kuniyu@amazon.com

v1: https://lore.kernel.org/20250821061540.2876953-1-kuniyu@google.com
====================

Link: https://patch.msgid.link/20250822190803.540788-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:07:01 +0000 (19:07 +0000)]

tcp: Move TCP-specific diag functions to tcp_diag.c.

tcp_diag_dump() / tcp_diag_dump_one() is just a wrapper of
inet_diag_dump_icsk() / inet_diag_dump_one_icsk(), respectively.

Let's inline them in tcp_diag.c and move static callees as well.

Note that inet_sk_attr_size() is merged into tcp_diag_get_aux_size(),
and we remove inet_diag_handler.idiag_get_aux_size() accordingly.

While at it, BUG_ON() is replaced with DEBUG_NET_WARN_ON_ONCE().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250822190803.540788-7-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:07:00 +0000 (19:07 +0000)]

tcp: Don't pass hashinfo to inet_diag helpers.

These inet_diag functions required struct inet_hashinfo because
they are shared by TCP and DCCP:

  * inet_diag_dump_icsk()
  * inet_diag_dump_one_icsk()
  * inet_diag_find_one_icsk()

DCCP has gone, and we don't need to pass hashinfo down to them.

Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the first
2 functions.

Note that inet_diag_find_one_icsk() doesn't need hashinfo since the
previous patch.

We will move TCP-specific functions to tcp_diag.c in the next patch.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:06:59 +0000 (19:06 +0000)]

tcp: Don't pass hashinfo to socket lookup helpers.

These socket lookup functions required struct inet_hashinfo because
they are shared by TCP and DCCP.

  * __inet_lookup_established()
  * __inet_lookup_listener()
  * __inet6_lookup_established()
  * inet6_lookup_listener()

DCCP has gone, and we don't need to pass hashinfo down to them.

Let's fetch net->ipv4.tcp_death_row.hashinfo directly in the above
4 functions.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:06:58 +0000 (19:06 +0000)]

tcp: Remove hashinfo test for inet6?_lookup_run_sk_lookup().

Commit 6c886db2e78c ("net: remove duplicate sk_lookup helpers")
started to check if hashinfo == net->ipv4.tcp_death_row.hashinfo
in __inet_lookup_listener() and inet6_lookup_listener() and
stopped invoking BPF sk_lookup prog for DCCP.

DCCP has gone and the condition is always true.

Let's remove the hashinfo test.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:06:57 +0000 (19:06 +0000)]

tcp: Remove timewait_sock_ops.twsk_destructor().

Since DCCP has been removed, sk->sk_prot->twsk_prot->twsk_destructor
is always tcp_twsk_destructor().

Let's call tcp_twsk_destructor() directly in inet_twsk_free() and
remove ->twsk_destructor().

While at it, tcp_twsk_destructor() is un-exported.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Kuniyuki Iwashima [Fri, 22 Aug 2025 19:06:56 +0000 (19:06 +0000)]

tcp: Remove sk_protocol test for tcp_twsk_unique().

Commit 383eed2de529 ("tcp: get rid of twsk_unique()") added
sk->sk_protocol test in __inet_check_established() and
__inet6_check_established() to remove twsk_unique() and call
tcp_twsk_unique() directly.

DCCP has gone, and the condition is always true.

Let's remove the sk_protocol test.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250822190803.540788-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Jakub Kicinski [Tue, 26 Aug 2025 00:16:03 +0000 (17:16 -0700)]

Merge branch 'net-airoha-add-ppe-support-for-rx-wlan-offload'

Lorenzo Bianconi says:

====================
net: airoha: Add PPE support for RX wlan offload

Introduce the missing bits to airoha ppe driver to offload traffic received
by the MT76 driver (wireless NIC) and forwarded by the Packet Processor
Engine (PPE) to the ethernet interface.

v2: https://lore.kernel.org/20250822-airoha-en7581-wlan-rx-offload-v2-0-8a76e1d3fec2@kernel.org
v1: https://lore.kernel.org/20250819-airoha-en7581-wlan-rx-offload-v1-0-71a097e0e2a1@kernel.org
====================

Link: https://patch.msgid.link/20250823-airoha-en7581-wlan-rx-offload-v3-0-f78600ec3ed8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Lorenzo Bianconi [Sat, 23 Aug 2025 07:56:04 +0000 (09:56 +0200)]

net: airoha: Introduce check_skb callback in ppe_dev ops

Export airoha_ppe_check_skb routine in ppe_dev ops. check_skb callback
will be used by the MT76 driver in order to offload the traffic received
by the wlan NIC and forwarded to the ethernet one.
Add rx_wlan parameter to airoha_ppe_check_skb routine signature.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250823-airoha-en7581-wlan-rx-offload-v3-3-f78600ec3ed8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


Lorenzo Bianconi [Sat, 23 Aug 2025 07:56:03 +0000 (09:56 +0200)]

net: airoha: Add airoha_ppe_dev struct definition

Introduce airoha_ppe_dev struct as container for PPE offload callbacks
consumed by the MT76 driver during flowtable offload for traffic
received by the wlan NIC and forwarded to the wired one.
Add airoha_ppe_setup_tc_block_cb routine to PPE offload ops for MT76
driver.
Rely on airoha_ppe_dev pointer in airoha_ppe_setup_tc_block_cb
signature.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250823-airoha-en7581-wlan-rx-offload-v3-2-f78600ec3ed8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Mirror https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
