git.ipfire.org Git - thirdparty/kernel/linux.git/log

fbnic: Add SW shim for MDIO interface to PMD and PCS

In order for us to support a PCS device we need to add an MDIO bus to allow
the drivers to have access to the registers for the device. This change
adds such an interface.

The interface will consist of 2 PHY addrs, the first one consisting of a
PMD and PCS, and the second just being a PCS. There is a need for 2 PHYs
addrs due to the fact that in order to support the 50GBase-CR2 mode we will
need to access and configure the PCS vendor registers and RSFEC registers
from the second lane identical to the first.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374324532.959489.15389723111560978054.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

fbnic: Add handler for reporting link down event statistics

We were previously not displaying the number of link_down_events tracked by
the device. With this change we should now be able to display the value.
The value itself tracks the calls from the phylink interface to the
mac_link_down call.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374323824.959489.6915296616773178954.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

fbnic: Add logic to track PMD state via MAC/PCS signals

One complication with the design of our part is that the PMD doesn't
provide a direct signal to the host. Instead we have visibility to signals
that the PCS provides to the MAC that allow us to check the link state
through that.

We will need to account for several things in the PMD and firmware when
managing the link. Specifically when the link first starts to come up the
PMD will cause the link to flap. This is due to the firmware starting a
training cycle when the link is first detected. This will cause link
flapping if we were to immediately report link up when the PCS first
detects it.

To address that we are adding a pmd_state variable that is meant to be a
countdown of sorts indicating the state of the PMD. If the link is down or
has been reconfigured the PMD will start out in the initialize state. By
default the link is assumed to be in the SEND_DATA state if it is available
on initial link inspection. If link is detected while in the initialize
state the PMD state will switch to training, and if after 4 seconds the
link is still stable we will transition to link_ready, and finally the
send_data state. With this we can avoid link flapping when a cable is
first connected to the NIC.

One side effect of this is that we need to pull the link state away from
the PCS. For now we use a union of the PCS link state register value and
the pmd_state. The plan is to add a PMD register to report the pmd_state
to the phylink interface. With that we can then look at switching over to
the use of the XPCS driver for fbnic instead of having an internal one.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374323107.959489.14951134213387615059.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

fbnic: Rename PCS IRQ to MAC IRQ as it is actually a MAC interrupt

Throughout several spots in the code I had called out the IRQ as being
related to the PCS. However the actual IRQ is a part of the MAC and it is
just exposing PCS data. To more accurately reflect the owner of the calls
this change makes it so that we rename the functions and values that are
taking in the interrupt value and processing it to reflect that it is a MAC
call and not a PCS one.

This change is mostly motivated by the fact that we will be moving the
handling of this interrupt from being PCS focused to being more PMA/PMD
focused as this will drive the phydev driver that I am adding instead of
driving the PCS directly.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374322373.959489.12018231545479053860.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: Add support for FBNIC 25G, 50G, 100G PMD

The fbnic driver is planning to make use of the XPCS driver to enable
support for PCS and better integration with phylink. To do this though we
will need to enable several workarounds since the PMD interface for fbnic
is likely to be unique since it is a mix of two different vendor products
with a unique wrapper around the IP.

I have generated a PHY identifier based on IEEE 802.3-2022 22.2.4.3.1 using
an OUI belonging to Meta Platforms and used with our NICs. Using this we
will provide it as the PMD ID via the SW based MDIO interface so that
the fbnic device can be identified and necessary workarounds enabled in the
XPCS driver.

As an initial workaround this change adds an exception so that soft_reset
is not set when the driver is initially bound to the PCS.

In addition I have added logic to integrate the PMD Rx signal detect state
into the link state for the PCS. With this we can avoid the link coming up
too soon on the FBNIC PMD and as a result of it being in the training state
so we can avoid link flaps.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374321695.959489.6648161125012056619.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: Fix PMA identifier handling in XPCS

The XPCS driver was mangling the PMA identifier as the original code
appears to have been focused on just capturing the OUI. Rather than store a
mangled ID it is better to work with the actual PMA ID and instead just
mask out the values that don't apply rather than shifting them and
reordering them as you still don't get the original OUI for the NIC without
having to bitswap the values as per the definition of the layout in IEEE
802.3-2022 22.2.4.3.1.

By laying it out as it was in the hardware it is also less likely for us to
have an unintentional collision as the enum values will occupy the revision
number area while the OUI occupies the upper 22 bits.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374320920.959489.17267159479370601070.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: pcs: xpcs: Add support for 25G, 50G, and 100G interfaces

With this change we are adding support for 25G, 50G, and 100G interface
types to the XPCS driver. This had supposedly been enabled with the
addition of XLGMII but I don't see any capability for configuration there
so I suspect it may need to be refactored in the future.

With this change we can enable the XPCS driver with the selected interface
and it should be able to detect link, speed, and report the link status to
the phylink interface.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374320248.959489.11649590675011158859.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: phy: Add MDIO_PMA_CTRL1_SPEED for 2.5G and 5G to reflect PMA values

The 2.5G and 5G values are not consistent between the PCS CTRL1 and PMA
CTRL1 values. In order to avoid confusion between the two I am updating the
values to include "PMA" in the name similar to values used in similar
places.

To avoid breaking UAPI I have retained the original macros and just defined
them as the new PMA based defines.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://patch.msgid.link/176374319569.959489.6610469879021800710.stgit@ahduyck-xeon-server.home.arpa
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

amd-xgbe: let the MAC manage PHY PM

Use the MAC managed PM flag to indicate that MAC driver takes care of
suspending/resuming the PHY, and reset it when the device is brought up.

Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Link: https://patch.msgid.link/20251123163721.442162-1-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-hibmcge-add-support-for-tracepoint-and-pagepool-on-hibmcge-driver'

Jijie Shao says:

====================
net: hibmcge: Add support for tracepoint and pagepool on hibmcge driver

In this patch set:
1: add support for tracepoint for rx descriptor
2: double the rx queue depth to reduce packet drop
3: add support for pagepool on rx
====================

Link: https://patch.msgid.link/20251122034657.3373143-1-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: add support for pagepool on rx

add support for pagepool on rx, and remove the legacy path

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20251122034657.3373143-4-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: reduce packet drop under stress testing

Under stress test scenarios, hibmcge driver may not receive packets
in a timely manner, which can lead to the buffer of the hardware queue
being exhausted, resulting in packet drop.

This patch doubles the software queue depth and uses half of the buffer
to fill the hardware queue before receiving packets, thus preventing
packet loss caused by the hardware queue buffer being exhausted.

Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20251122034657.3373143-3-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hibmcge: add support for tracepoint to dump some fields of rx_desc

add support for tracepoint to dump some fields of rx_desc

Signed-off-by: Tao Lan <lantao5@huawei.com>
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Link: https://patch.msgid.link/20251122034657.3373143-2-shaojijie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fman_memac: report structured ethtool counters

The FMan driver has support for 2 MACs: mEMAC (newer, present on
Layerscape and PowerPC T series) and dTSEC/TGEC (older, present on
PowerPC P series). I only have handy access to the mEMAC, and this adds
support for MAC counters for those platforms.

MAC counters are necessary for any kind of low-level debugging, and
currently there is no mechanism to dump them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251122115931.151719-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dpaa: fman_memac: complete phylink support with 2500base-x

The DPAA phylink conversion in the following commits partially developed
code for handling the 2500base-x host interface mode (called "2.5G
SGMII" in LS1043A/LS1046A reference manuals).

- 0fc83bd79589 ("net: fman: memac: Add serdes support")
- 5d93cfcf7360 ("net: dpaa: Convert to phylink")

In principle, having phy-interface-mode = "2500base-x" and a pcsphy-handle
(unnamed or with pcs-handle-names = "sgmii") in the MAC device tree node
results in PHY_INTERFACE_MODE_2500BASEX being set in phylink_config ::
supported_interfaces, but this isn't sufficient.

Because memac_select_pcs() returns no PCS for PHY_INTERFACE_MODE_2500BASEX,
the Lynx PCS code never engages. There's a chance the PCS driver doesn't
have any configuration to change for 2500base-x fixed-link (based on
bootloader pre-initialization), but there's an even higher chance that
this is not the case, and the PCS remains misconfigured.

More importantly, memac_if_mode() does not handle
PHY_INTERFACE_MODE_2500BASEX, and it should be telling the mEMAC to
configure itself in GMII mode (which is upclocked by the PCS). Currently
it prints a WARN_ON() and returns zero, aka IF_MODE_10G (incorrect).

The additional case statement in memac_prepare() for calling
phy_set_mode_ext() does not make any difference, because there is no
generic PHY driver for the Lynx 10G SerDes from LS1043A/LS1046A. But we
add it nonetheless, for consistency.

Regarding the question "did 2500base-x ever work with the FMan mEMAC
mainline code prior to the phylink conversion?" - the answer is more
nuanced.

For context, the previous phylib-based implementation was unable to
describe the fixed-link speed as 2500, because the software PHY
implementation is limited to 1G. However, improperly describing the link
as an sgmii fixed-link with speed = <1000> would have resulted in a
functional 2.5G speed, because there is no other difference than the
SerDes lane clock net frequency (3.125 GHz for 2500base-x) - all the
other higher-level settings are the same, and the SerDes lane frequency
is currently handled by the RCW.

But this hack cannot be extended towards a phylib PHY such as Aquantia
operating in OCSGMII, because the latter requires phy-mode = "2500base-x",
which the mEMAC driver did not support prior to the phylink conversion.
So I do not really consider this a regression, just completing support
for a missing feature.

The FMan mEMAC driver sets phylink's "default_an_inband" property to
true, making it as if the device tree node had the managed =
"in-band-status" property anyway. This default made sense for SGMII,
where it was added to avoid regressions, but for 2500base-x we learned
only recently how to enable in-band autoneg:
https://lore.kernel.org/netdev/20251122113433.141930-1-vladimir.oltean@nxp.com/

so the driver needs to opt out of this default in-band enabled
behaviour, and only enable in-band based on the device tree property.

Suggested-by: Russell King (Oracle) <linux@armlinux.org.uk>
Link: https://lore.kernel.org/netdev/aIyx0OLWGw5zKarX@shell.armlinux.org.uk/#t
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251122115523.150260-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83867: implement configurability for SGMII in-band auto-negotiation

Implement the inband_caps() and config_inband() PHY driver methods, to
allow working with PCS devices that do not support or want in-band to be
used.

There is a complication due to existing logic from commit c76acfb7e19d
("net: phy: dp83867: retrigger SGMII AN when link change") which might
re-enable what dp83867_config_inband() has disabled. So we need to
modify dp83867_link_change_notify() to use phy_modify_changed() when
temporarily disabling in-band autoneg. If the return code is 0, it means
the original in-band was disabled and we need to keep it disabled.
If the return code is 1, the original was enabled and we need to
re-enable it. If negative, there was an error, which was silent before,
and remains silent now.

dp83867_config_inband() and dp83867_link_change_notify() are serialized
by the phydev->lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251122110427.133035-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add YNL test framework

Add a test framework for YAML Netlink (YNL) tools, covering both CLI and
ethtool functionality. The framework includes:

1) cli: family listing, netdev, ethtool, rt-* families, and nlctrl
operations
2) ethtool: device info, statistics, ring/coalesce/pause parameters, and
feature gettings

The current YNL syntax is a bit obscure, and end users may not always know
how to use it. This test framework provides usage examples and also serves
as a regression test to catch potential breakages caused by future changes.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20251124022055.33389-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: add big-endian byte-order for u32 IPv4 addresses

The fix commit converted several IPv4 address attributes from binary
to u32, but forgot to specify byte-order: big-endian. Without this,
YNL tools display IPv4 addresses incorrectly due to host-endian
interpretation.

Add the missing byte-order: big-endian to all affected u32 IPv4
address fields to ensure correct parsing and display.

Fixes: 1064d521d177 ("netlink: specs: support ipv4-or-v6 for dual-stack fields")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://patch.msgid.link/20251125112048.37631-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-intel-migrate-to-get_rx_ring_count-ethtool-callback'

Breno Leitao says:

====================
net: intel: migrate to .get_rx_ring_count() ethtool callback

This series migrates Intel network drivers to use the new .get_rx_ring_count()
ethtool callback introduced in commit 84eaf4359c36 ("net: ethtool: add
get_rx_ring_count callback to optimize RX ring queries").

The new callback simplifies the .get_rxnfc() implementation by removing
ETHTOOL_GRXRINGS handling and moving it to a dedicated callback. This provides
a cleaner separation of concerns and aligns these drivers with the modern
ethtool API.

The series updates the following Intel drivers:
  - idpf
  - igb
  - igc
  - ixgbevf
  - fm10k
====================

Link: https://patch.msgid.link/20251125-gxring_intel-v2-0-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fm10k: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns fm10k with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-8-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbevf: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns ixgbevf with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-7-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

igc: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns igc with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-6-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

igb: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns igb with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-5-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

idpf: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns idpf with the new
ethtool API for querying RX ring parameters.

I was not totally convinced I needed to have the lock, but, I decided to
be on the safe side and get the exact same behaviour it was before.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-4-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns ice with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-3-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

iavf: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns iavf with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-2-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

i40e: extract GRXRINGS from .get_rxnfc

Commit 84eaf4359c36 ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns i40e with the new
ethtool API for querying RX ring parameters.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Link: https://patch.msgid.link/20251125-gxring_intel-v2-1-f55cd022d28b@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'unify-platform-suspend-resume-routines-for-pci-dwmac-glue'

Yao Zi says:

====================
Unify platform suspend/resume routines for PCI DWMAC glue

There are currently three PCI-based DWMAC glue drivers in tree,
stmmac_pci.c, dwmac-intel.c, and dwmac-loongson.c. Both stmmac_pci.c and
dwmac-intel.c implements the same and duplicated platform suspend/resume
routines.

This series introduces a new PCI helper library, stmmac_libpci.c,
providing a pair of helpers, stmmac_pci_plat_{suspend,resume}, and
replaces the driver-specific implementation with the helpers to reduce
code duplication. The helper will also simplify the Motorcomm DWMAC glue
driver which I'm working on.

The glue driver for Intel controllers isn't covered by the series, since
its suspend routine doesn't call pci_disable_device() and thus is a
little different from the new generic helpers.

I only have Loongson hardware on hand, thus the series is only tested on
Loongson 3A5000 machine. I could confirm the controller works after
resume, and WoL works as expected. This shouldn't break stmmac_pci.c,
either, since the new helpers have the exactly same code as the old
driver-specific suspend/resume hooks.
====================

Link: https://patch.msgid.link/20251124160417.51514-1-ziyao@disroot.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: pci: Use generic PCI suspend/resume routines

Convert STMMAC PCI glue driver to use the generic platform
suspend/resume routines for PCI controllers, instead of implementing its
own one.

Signed-off-by: Yao Zi <ziyao@disroot.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Link: https://patch.msgid.link/20251124160417.51514-4-ziyao@disroot.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: loongson: Use generic PCI suspend/resume routines

Convert glue driver for Loongson DWMAC controller to use the generic
platform suspend/resume routines for PCI controllers, instead of
implementing its own one.

Signed-off-by: Yao Zi <ziyao@disroot.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Link: https://patch.msgid.link/20251124160417.51514-3-ziyao@disroot.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Add generic suspend/resume helper for PCI-based controllers

Most glue driver for PCI-based DWMAC controllers utilize similar
platform suspend/resume routines. Add a generic implementation to reduce
duplicated code.

Signed-off-by: Yao Zi <ziyao@disroot.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cn>
Link: https://patch.msgid.link/20251124160417.51514-2-ziyao@disroot.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-hwtstamp_get-callback-to-phy-drivers'

Vadim Fedorenko says:

====================
add hwtstamp_get callback to phy drivers

PHY drivers are able to configure HW time stamping and are not able to
report configuration back to user space. Add callback to report
configuration like it's done for net_device and add implementation to
the drivers.
====================

Link: https://patch.msgid.link/20251124181151.277256-1-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ptp_ines: add HW timestamp configuration reporting

The driver partially stores HW timestamping configuration, but missing
pieces can be read from HW. Add callback to report configuration.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-8-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: nxp-c45-tja11xx: add HW timestamp configuration reporting

The driver stores HW timestamping configuration and can technically
report it. Add callback to do it.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-7-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

phy: mscc: add HW timestamp configuration reporting

The driver stores HW configuration and can technically report it.
Add callback to do it.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-6-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83640: add HW timestamp configuration reporting

The driver stores configuration of TX timestamping and can technically
report it. Patch RX timestamp configuration storage to be more precise
on reporting and add callback to actually report it.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-5-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: broadcom: add HW timestamp configuration reporting

The driver stores configuration information and can technically report
it. Implement hwtstamp_get callback to report the configuration.

Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251124181151.277256-4-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

phy: add hwtstamp_get callback to phy drivers

PHY devices had lack of hwtstamp_get callback even though most of them
are tracking configuration info. Introduce new call back to
mii_timestamper.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-3-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

phy: rename hwtstamp callback to hwtstamp_set

PHY devices has hwtstamp callback which actually performs set operation.
Rename it to better reflect the action.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251124181151.277256-2-vadim.fedorenko@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: packetdrill: pass send_omit_free to MSG_ZEROCOPY tests

The --send_omit_free flag is needed for TCP zero copy tests, to ensure
that packetdrill doesn't free the send() buffer after the send() call.

Fixes: 1e42f73fd3c2 ("selftests/net: packetdrill: import tcp/zerocopy")
Closes: https://lore.kernel.org/netdev/20251124071831.4cbbf412@kernel.org/
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20251125234029.1320984-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: initialize char variable to null

char variable in 'so_txtime.c' & 'txtimestamp.c' were left uninitilized
when switch default case taken. which raises following warning.

txtimestamp.c:240:2: warning: variable 'tsname' is used uninitialized
whenever switch default is taken [-Wsometimes-uninitialized]

so_txtime.c:210:3: warning: variable 'reason' is used uninitialized
whenever switch default is taken [-Wsometimes-uninitialized]

initializing these variables to NULL to fix this.

Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com>
Link: https://patch.msgid.link/20251125165302.20079-1-ankitkhushwaha.linux@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: fbnic: Return the true error in fbnic_alloc_napi_vectors.

The error case in fbnic_alloc_napi_vectors defaulted to returning
ENOMEM. This can mask the true error case, causing confusion.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20251124200518.1848029-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftest-af_unix-misc-updates'

Kuniyuki Iwashima says:

====================
selftest: af_unix: Misc updates.

Patch 1 add .gitignore under tools/testing/selftests/net/af_unix/.

Patch 2 make so_peek_off.c less flaky.
====================

Link: https://patch.msgid.link/20251124212805.486235-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: af_unix: Extend recv() timeout in so_peek_off.c.

so_peek_off.c is reported to be flaky on NIPA:

  # # so_peek_off.c:149:two_chunks_overlap_blocking:Expected -1 (-1) != bytes (-1)
  # # two_chunks_overlap_blocking: Test terminated by assertion
  # #          FAIL  so_peek_off.stream.two_chunks_overlap_blocking

The test fork()s a child process to send() data after 1ms to
wake up the parent process being blocked (up to 3ms) on recv().

But, from the log, the parent woke up after 3ms timeout, so it
could be too short when the host is overloaded.

Let's extend it to 5s.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20251124070722.1e828c53@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124212805.486235-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: af_unix: Create its own .gitignore.

Somehow AF_UNIX tests have reused ../.gitignore,
but now NIPA warns about it.

Let's create .gitignore under af_unix/.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124212805.486235-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

virtio_net: enhance wake/stop tx queue statistics accounting

This patch refines and strengthens the statistics collection of TX queue
wake/stop events introduced by commit c39add9b2423 ("virtio_net: Add TX
stopped and wake counters").

Previously, the driver only recorded partial wake/stop statistics
for TX queues. Some wake events triggered by 'skb_xmit_done()' or resume
operations were not counted, which made the per-queue metrics incomplete.

Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20251120015320.1418-1-liming.wu@jaguarmicro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-provide-better-locality-for-retransmit-timer'

Eric Dumazet says:

====================
tcp: provide better locality for retransmit timer

TCP stack uses three timers per flow, currently spread this way:

- sk->sk_timer : keepalive timer
- icsk->icsk_retransmit_timer : retransmit timer
- icsk->icsk_delack_timer : delayed ack timer

This series moves the retransmit timer to sk->sk_timer location,
to increase data locality in TX paths.

keepalive timers are not often used, this change should be neutral for them.

After the series we have following fields:

- sk->tcp_retransmit_timer : retransmit timer, in sock_write_tx group
- icsk->icsk_delack_timer : delayed ack timer
- icsk->icsk_keepalive_timer : keepalive timer

Moving icsk_delack_timer in a beter location would also be welcomed.
====================

Link: https://patch.msgid.link/20251124175013.1473655-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: remove icsk->icsk_retransmit_timer

Now sk->sk_timer is no longer used by TCP keepalive, we can use
its storage for TCP and MPTCP retransmit timers for better
cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: introduce icsk->icsk_keepalive_timer

sk->sk_timer has been used for TCP keepalives.

Keepalive timers are not in fast path, we want to use sk->sk_timer
storage for retransmit timers, for better cache locality.

Create icsk->icsk_keepalive_timer and change keepalive
code to no longer use sk->sk_timer.

Added space is reclaimed in the following patch.

This includes changes to MPTCP, which was also using sk_timer.

Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer
for inet_sk_diag_fill() sake.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx group

These two fields are mostly read in TCP tx path, move them
in an more appropriate group for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: rename icsk_timeout() to tcp_timeout_expires()

In preparation of sk->tcp_timeout_timer introduction,
rename icsk_timeout() helper and change its argument to plain
'const struct sock *sk'.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ice: fix broken Rx on VFs

Since the tagged commit, ice stopped respecting Rx buffer length
passed from VFs.
At that point, the buffer length was hardcoded in ice, so VFs still
worked up to some point (until, for example, a VF wanted an MTU
larger than its PF).
The next commit 93f53db9f9dc ("ice: switch to Page Pool"), broke
Rx on VFs completely since ice started accounting per-queue buffer
lengths again, but now VF queues always had their length zeroed, as
ice was already ignoring what iavf was passing to it.

Restore the line that initializes the buffer length on VF queues
basing on the virtchnl messages.

Fixes: 3a4f419f7509 ("ice: drop page splitting and recycling")
Reported-by: Jakub Slepecki <jakub.slepecki@intel.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Jakub Slepecki <jakub.slepecki@intel.com>
Link: https://patch.msgid.link/20251124170735.3077425-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

chtls: Avoid -Wflex-array-member-not-at-end warning

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Use the `DEFINE_RAW_FLEX()` helper for on-stack definitions of
a flexible structure where the size of the flexible-array member
is known at compile-time, and refactor the rest of the code,
accordingly.

So, with these changes, fix the following warning:

drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c:163:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/aSQocKoJGkN0wzEj@kspp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-ynl-gen-regeneration-comment-function-prefix'

Asbjørn Sloth Tønnesen says:

====================
tools: ynl-gen: regeneration comment + function prefix

It looks like these two patches are the last ones needed
for YNL, before the WireGuard patches can go in.

These patches was both requested by Jason, during review
of the WireGuard YNL conversion patchset[1].
====================

Link: https://patch.msgid.link/20251120174429.390574-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: add regeneration comment

Add a comment on regeneration to the generated files.

The comment is placed after the YNL-GEN line[1], as to not interfere
with ynl-regen.sh's detection logic.

[1] and after the optional YNL-ARG line.

Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl-gen: add function prefix argument

This patch adds a new CLI argument for overriding the default
function prefix, as used for naming the doit/dumpit functions
in the generated kernel code.

When not specified the default "$(FAMILY)-nl" is used.

This can also be specified persistently in generated files:
  /* YNL-ARG --function-prefix wg */

In the above example it causes the following changes:
  wireguard_nl_get_device_dumpit() -> wg_get_device_dumpit()
  wireguard_nl_get_device_doit()   -> wg_get_device_doit()

The variable name fn_prefix, was chosen as it relates to op_prefix
which is used to prefix the UAPI commands enum entries.

Link: https://lore.kernel.org/r/aRvWzC8qz3iXDAb3@zx2c4.com/
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Link: https://patch.msgid.link/20251120174429.390574-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ptp-ocp-a-fix-and-refactoring'

Andy Shevchenko says:

====================
ptp: ocp: A fix and refactoring

Here is the fix for incorrect use of %ptT with the associated
refactoring and additional cleanups.

Note, %ptS, which is introduced in another series that is already
applied to PRINTK tree, doesn't fit here, that's why this fix
is separated from that series.
====================

Link: https://patch.msgid.link/20251124084816.205035-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ocp: Reuse META's PCI vendor ID

The META's PCI vendor ID is listed already in the pci_ids.h.
Reuse it here.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251124084816.205035-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ocp: Apply standard pattern for cleaning up loop

The while (i--) is a standard pattern for the cleaning up loops.
Apply this pattern where it makes sense in the driver.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251124084816.205035-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ocp: Make ptp_ocp_unregister_ext() NULL-aware

It's a common practice to make resource release functions be NULL-aware.
Make ptp_ocp_unregister_ext() NULL-aware.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251124084816.205035-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: ocp: Refactor signal_show() and fix %ptT misuse

Refactor signal_show() to avoid sequential calls to sysfs_emit*()
and use the same pattern to get the index of a signal as it's done
in signal_store().

While at it, fix wrong use of %ptT against struct timespec64.
It's kinda lucky that it worked just because the first member
there 64-bit and it's of time64_t type. Now with %ptS it may
be used correctly.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20251124084816.205035-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: Extend transport change null-ptr-deref test

syzkaller reported a lockdep lock order inversion warning[1] due to
commit 687aa0c5581b ("vsock: Fix transport_* TOCTOU"). This was fixed in
commit f7c877e75352 ("vsock: fix lock inversion in
vsock_assign_transport()").

Redo syzkaller's repro by piggybacking on a somewhat related test
implemented in commit 3a764d93385c ("vsock/test: Add test for null ptr
deref when transport changes").

[1]: https://lore.kernel.org/netdev/68f6cdb0.a70a0220.205af.0039.GAE@google.com/

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20251123-vsock_test-linger-lockdep-warn-v1-1-4b1edf9d8cdc@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: improve MAC EEE handling

Let phydev->enable_tx_lpi control whether MAC enables TX LPI, instead of
enabling it unconditionally. This way TX LPI is disabled if e.g. link
partner doesn't support EEE. This helps to avoid potential issues like
link flaps.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/91bcb837-3fab-4b4e-b495-038df0932e44@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mxl-gpy: add support for MxL86252 and MxL86282

Add PHY driver support for Maxlinear MxL86252 and MxL86282 switches.
The PHYs built-into those switches are just like any other GPY 2.5G PHYs
with the exception of the temperature sensor data being encoded in a
different way.

Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/a6cd7fe461b011cec2b59dffaf34e9c8b0819059.1763818120.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: mxl-gpy: add support for MxL86211C

MxL86211C is a smaller and more efficient version of the GPY211C.
Add the PHY ID and phy_driver instance to the mxl-gpy driver.

Signed-off-by: Chad Monroe <chad@monroe.io>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/cabf3559d6511bed6b8a925f540e3162efc20f6b.1763818120.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: remove redundant fwnode cleanup

Remove redundant fwnode cleanup in of_mdiobus_register_device()
and xpcs_plat_init_dev().

mdio_device_free() eventually calls mdio_device_release(),
which already performs fwnode_handle_put(), making the manual
cleanup unnecessary.

Combine fwnode_handle_get() with device_set_node() in
of_mdiobus_register_device() for clarity.

Signed-off-by: Buday Csaba <buday.csaba@prolan.hu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/00847693daa8f7c8ff5dfa19dd35fc712fa4e2b5.1763995734.git.buday.csaba@prolan.hu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: eliminate kdoc warnings in mdio_device.c and mdio_bus.c

Fix all warnings reported by scripts/kernel-doc in
mdio_device.c and mdio_bus.c

Signed-off-by: Buday Csaba <buday.csaba@prolan.hu>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/7ef7b80669da2b899d38afdb6c45e122229c3d8c.1763968667.git.buday.csaba@prolan.hu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-enetc-add-port-mdio-support-for-both-i-mx94-and-i-mx95'

Wei Fang says:

====================
net: enetc: add port MDIO support for both i.MX94 and i.MX95

The NETC IP has one external master MDIO interface (eMDIO) for managing
external PHYs, all ENETC ports share this eMDIO. The EMDIO function and
the ENETC port MDIO are the virtual ports of this eMDIO, ENETC can use
these virtual ports to access their PHYs. The difference is that EMDIO
function is a 'global port', it can access all the PHYs on the eMDIO, so
it provides a means for different software modules to share a single set
of MDIO signals to access their PHYs.

The ENETC port MDIO can only access its own external PHY. Furthermore,
its PHY address must be set to its corresponding LaBCR register in IERB
module, which is is a 64 KB size page containing registers that are used
for pre-boot initialization for all NETC PCIe functions. And this IERB
is owned by the host OS and it will be locked after the initialization,
so it cannot be configured at running time any more. The port MDIO can
only work properly when the PHY address accessed by it matches the value
of its corresponding LaBCR[MDIO_PHYAD_PRTAD]. Otherwise, the MDIO access
by the port MDIO will not take effect.

Note that the same PHY is either controlled by port MDIO or by the EMDIO
function. The netc-blk-ctrl driver will only set the PHY address in the
LaBCR register corresponding to the ENETC when the ENETC node contains
an mdio child node, and the ENETC driver will only create the port MDIO
bus then. An example in DTS is as follows, the EMDIO function will not\
access this PHY.

enetc_port0 {
        phy-handle = <&ethphy0>;
        phy-mode = "rgmii-id";

        mdio {
                #address-cells = <1>;
                #size-cells = <0>;

                ethphy0: ethernet-phy@1 {
                        reg = <1>;
                };
        };
};

If users want to use EMDIO funtion to manage the PHY, they only need to
place the PHY node in the emdio node. The same PHY must not be placed
simultaneously within the ENETC node. An example in DTS to use EMDIO
is as below.

netc_emdio {
        ethphy0: ethernet-phy@1 {
                reg = <1>;
        };

        ethphy2: ethernet-phy@8 {
                reg = <8>;
        };
};

In the host OS, when there are multiple ENETCs, they can all access their
PHYs using their own port MDIO, or they can all access their PHYs using
the EMDIO function, or they can partially use port MDIO and partially use
the EMDIO function.

Another typical use case of port MDIO is the Jailhouse usage. An ENETC is
assigned to a guest OS. The EMDIO function will be unavailable in the
guest OS because EMDIO is controlled by the host OS. Therefore, the ENETC
can use its port MDIO to manage its external PHY in this situation. In
this use case, the host OS's root dtb will disable the ENETC node, so the
host OS's ENETC driver will not probe the ENETC and its PHY.

In addition, this series also adds the internal MDIO bus support, each
ENETC has an internal MDIO interface for managing on-die PHY (PCS) if it
has PCS layer.
====================

Link: https://patch.msgid.link/20251119102557.1041881-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: update the base address of port MDIO registers for ENETC v4

Each ENETC has a set of external MDIO registers to access its external
PHY based on its port EMDIO bus, these registers are used for MDIO bus
access, such as setting the PHY address, PHY register address and value,
read or write operations, C22 or C45 format, etc. The base address of
this set of registers has been modified in ENETC v4 and is different
from that in ENETC v1. So the base address needs to be updated so that
ENETC v4 can use port MDIO to manage its own external PHY.

Additionally, if ENETC has the PCS layer, it also has a set of internal
MDIO registers for managing its on-die PHY (PCS/Serdes). The base address
of this set of registers is also different from that of ENETC v1, so the
base address also needs to be updated so that ENETC v4 can support the
management of on-die PHY through the internal MDIO bus.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-4-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: set external PHY address in IERB for i.MX94 ENETC

NETC IP has only one external master MDIO interface (eMDIO) for managing
the external PHYs. ENETC can use the interfaces provided by the EMDIO
function or its port MDIO to access and manage its external PHY. Both
the EMDIO function and the port MDIO are all virtual ports of the eMDIO.

The difference is that the EMDIO function is a 'global port', it can
access all the PHYs on the eMDIO, but port MDIO can only access its own
PHY. To ensure that ENETC can only access its own PHY through port MDIO,
LaBCR[MDIO_PHYAD_PRTAD] needs to be set, which represents the address of
the external PHY connected to ENETC. If the accessed PHY address is not
consistent with LaBCR[MDIO_PHYAD_PRTAD], then the MDIO access initiated
by port MDIO will be invalid.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: set the external PHY address in IERB for port MDIO usage

The ENETC supports managing its own external PHY through its port MDIO
functionality. To use this function, the PHY address needs be set in the
corresponding LaBCR register in the Integrated Endpoint Register Block
(IERB), which is used for pre-boot initialization of NETC PCIe functions.
The port MDIO can only work properly when the PHY address accessed by the
port MDIO matches the corresponding LaBCR[MDIO_PHYAD_PRTAD] value.

Because the ENETC driver only registers the MDIO bus (port MDIO bus) when
it detects an MDIO child node in its node, similarly, the netc-blk-ctrl
driver only resolves the PHY address and sets it in the corresponding
LaBCR when it detects an MDIO child node in the ENETC node.

Co-developed-by: Aziz Sellami <aziz.sellami@nxp.com>
Signed-off-by: Aziz Sellami <aziz.sellami@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20251119102557.1041881-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'improvements-over-dsa-conduit-ethtool-ops'

Vladimir Oltean says:

====================
Improvements over DSA conduit ethtool ops

DSA interceps 'ethtool -S eth0', where eth0 is the host port of the
switch (called 'conduit'). It does this because otherwise there is no
way to report port counters for the CPU port, which is a MAC like any
other of that switch, except Linux exposes no net_device for it, thus no
ethtool hook.

Having understood all downsides of this debugging interface, when we
need it we needed, so the proposed changes here are to make it more
useful by dumping more counters in it: not just the switch CPU port,
but all other switch ports in the tree which lack a net_device. Not
reinventing any wheel, just putting more output in an existing command.

That is patch 3/3. The other 2 are cleanup.
====================

Link: https://patch.msgid.link/20251122112311.138784-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: append ethtool counters of all hidden ports to conduit

Currently there is no way to see packet counters on cascade ports, and
no clarity on how the API for that would look like.

Because it's something that is currently needed, just extend the hack
where ethtool -S on the conduit interface dumps CPU port counters, and
also use it to dump counters of cascade ports.

Note that the "pXX_" naming convention changes to "sXX_pYY", to
distinguish between ports having the same index but belonging to
different switches. This has a slight chance of causing regressions to
existing tooling:

- grepping for "p04_counter_name" still works, but might return more
than one string now
- grepping for " p04_counter_name" no longer works

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251122112311.138784-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: use kernel data types for ethtool ops on conduit

Suppress some checkpatch 'CHECK' messages about u8 being preferable over
uint8_t, etc. No functional change.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251122112311.138784-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: cpu_dp->orig_ethtool_ops might be NULL

In theory this would have been seen by now, but it seems that all
drivers used as DSA conduit interfaces thus far have had ethtool_ops
set, and it's hard to even find modern Ethernet drivers (and not VF
ones) which don't use ethtool.

Here is the unfiltered list of drivers which register any sort of
net_device but don't set its ethtool_ops pointer. I don't think any of
them 'risks' being used as a DSA conduit, maybe except for moxart,
rnpbge and icssm, I'm not sure.

- drivers/net/can/dev/dev.c
- drivers/net/wwan/qcom_bam_dmux.c
- drivers/net/wwan/t7xx/t7xx_netdev.c
- drivers/net/arcnet/arcnet.c
- drivers/net/hamradio/
- drivers/net/slip/slip.c
- drivers/net/ethernet/ezchip/nps_enet.c
- drivers/net/ethernet/moxa/moxart_ether.c
- drivers/net/ethernet/wangxun/txgbevf/txgbevf_main.c
- drivers/net/ethernet/wangxun/ngbevf/ngbevf_main.c
- drivers/net/ethernet/huawei/hinic3/hinic3_main.c
- drivers/net/ethernet/i825xx/
- drivers/net/ethernet/ti/icssm/icssm_prueth.c
- drivers/net/ethernet/seeq/
- drivers/net/ethernet/litex/litex_liteeth.c
- drivers/net/ethernet/sunplus/spl2sw_driver.c
- drivers/net/ethernet/mucse/rnpgbe/rnpgbe_main.c
- drivers/net/ipa/
- drivers/net/wireless/microchip/wilc1000/
- drivers/net/wireless/mediatek/mt76/dma.c
- drivers/net/wireless/ath/ath12k/
- drivers/net/wireless/ath/ath11k/
- drivers/net/wireless/ath/ath6kl/
- drivers/net/wireless/ath/ath10k/
- drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/trans.c
- drivers/net/wireless/virtual/mac80211_hwsim.c
- drivers/net/wireless/quantenna/qtnfmac/pcie/pcie.c
- drivers/net/wireless/realtek/rtw89/core.c
- drivers/net/wireless/realtek/rtw88/pci.c
- drivers/net/caif/
- drivers/net/plip/
- drivers/net/wan/
- drivers/net/mctp/
- drivers/net/ppp/
- drivers/net/thunderbolt/

Nonetheless, it's good for the framework not to make such assumptions,
and not panic when coming across such kind of host device in the future.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251122112311.138784-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cxgb4: Rename sched_class to avoid type clash

drivers/net/ethernet/chelsio/cxgb4/sched.h declares a sched_class
struct which has a type name clash with struct sched_class
in kernel/sched/sched.h (a type used in a field in task_struct).

When cxgb4 is a builtin we end up with both sched_class types,
and as a result of this we wind up with DWARF (and derived from
that BTF) with a duplicate incorrect task_struct representation.
When cxgb4 is built-in this type clash can cause kernel builds to
fail as resolve_btfids will fail when confused which task_struct
to use. See [1] for more details.

As such, renaming sched_class to ch_sched_class (in line with
other structs like ch_sched_flowc) makes sense.

[1] https://lore.kernel.org/bpf/2412725b-916c-47bd-91c3-c2d57e3e6c7b@acm.org/

Reported-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://patch.msgid.link/20251121181231.64337-1-alan.maguire@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: add support for RTL9151A

This adds support for chip RTL9151A. Its XID is 0x68b. It is bascially
basd on the one with XID 0x688, but with different firmware file.

Signed-off-by: Javen Xu <javen_xu@realsil.com.cn>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/20251121090104.3753-1-javen_xu@realsil.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net_sched-speedup-qdisc-dequeue'

Eric Dumazet says:

====================
net_sched: speedup qdisc dequeue

Avoid up to two cache line misses in qdisc dequeue() to fetch
skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held.

Idea is to cache gso_segs at enqueue time before spinlock is
acquired, in the first skb cache line, where we already
have qdisc_skb_cb(skb)->pkt_len.

This series gives a 8 % improvement in a TX intensive workload.

(120 Mpps -> 130 Mpps on a Turin host, IDPF with 32 TX queues)

v2: https://lore.kernel.org/netdev/20251111093204.1432437-1-edumazet@google.com/
v1: https://lore.kernel.org/netdev/20251110094505.3335073-1-edumazet@google.com/T/#m8f562ed148f807c02fd02c6cd243604d449615b9
====================

Link: https://patch.msgid.link/20251121083256.674562-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codel

cake, codel and fq_codel can drop many packets from dequeue().

Use qdisc_dequeue_drop() so that the freeing can happen
outside of the qdisc spinlock scope.

Add TCQ_F_DEQUEUE_DROPS to sch->flags.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-15-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: add qdisc_dequeue_drop() helper

Some qdisc like cake, codel, fq_codel might drop packets
in their dequeue() method.

This is currently problematic because dequeue() runs with
the qdisc spinlock held. Freeing skbs can be extremely expensive.

Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS
so that these qdiscs can opt-in to defer the skb frees
after the socket spinlock is released.

TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs
with an extra cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: add tcf_kfree_skb_list() helper

Using kfree_skb_list_reason() to free list of skbs from qdisc
operations seems wrong as each skb might have a different drop reason.

Cleanup __dev_xmit_skb() to call tcf_kfree_skb_list() once
in preparation of the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: annotate a data-race in __dev_xmit_skb()

q->limit is read locklessly, add a READ_ONCE().

Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-12-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: prefech skb->priority in __dev_xmit_skb()

Most qdiscs need to read skb->priority at enqueue time().

In commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
I added a prefetch(next), lets add another one for the second
half of skb.

Note that skb->priority and skb->hash share a common cache line,
so this patch helps qdiscs needing both fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-11-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: sch_fq: prefetch one skb ahead in dequeue()

prefetch the skb that we are likely to dequeue at the next dequeue().

Also call fq_dequeue_skb() a bit sooner in fq_dequeue().

This reduces the window between read of q.qlen and
changes of fields in the cache line that could be dirtied
by another cpu trying to queue a packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-10-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()

Group together changes to qdisc fields to reduce chances of false sharing
if another cpu attempts to acquire the qdisc spinlock.

  qdisc_qstats_backlog_dec(sch, skb);
  sch->q.qlen--;
  qdisc_bstats_update(sch, skb);

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-9-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: add Qdisc_read_mostly and Qdisc_write groups

It is possible to reorg Qdisc to avoid always dirtying 2 cache lines in
fast path by reducing this to a single dirtied cache line.

In current layout, we change only four/six fields in the first cache line:
- q.spinlock
- q.qlen
- bstats.bytes
- bstats.packets
- some Qdisc also change q.next/q.prev

In the second cache line we change in the fast path:
- running
- state
- qstats.backlog

        /* --- cacheline 2 boundary (128 bytes) --- */
        struct sk_buff_head        gso_skb __attribute__((__aligned__(64))); /*  0x80  0x18 */
        struct qdisc_skb_head      q;                    /*  0x98  0x18 */
        struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xb0  0x10 */

        /* --- cacheline 3 boundary (192 bytes) --- */
        struct gnet_stats_queue    qstats;               /*  0xc0  0x14 */
        bool                       running;              /*  0xd4   0x1 */

        /* XXX 3 bytes hole, try to pack */

        unsigned long              state;                /*  0xd8   0x8 */
        struct Qdisc *             next_sched;           /*  0xe0   0x8 */
        struct sk_buff_head        skb_bad_txq;          /*  0xe8  0x18 */
        /* --- cacheline 4 boundary (256 bytes) --- */

Reorganize things to have a first cache line mostly read,
then a mostly written one.

This gives a ~3% increase of performance under tx stress.

Note that there is an additional hole because @qstats now spans over a third cache line.

/* --- cacheline 2 boundary (128 bytes) --- */
__u8                       __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /*  0x80     0 */
struct sk_buff_head        gso_skb;              /*  0x80  0x18 */
struct Qdisc *             next_sched;           /*  0x98   0x8 */
struct sk_buff_head        skb_bad_txq;          /*  0xa0  0x18 */
__u8                       __cacheline_group_end__Qdisc_read_mostly[0]; /*  0xb8     0 */

/* XXX 8 bytes hole, try to pack */

/* --- cacheline 3 boundary (192 bytes) --- */
__u8                       __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /*  0xc0     0 */
struct qdisc_skb_head      q;                    /*  0xc0  0x18 */
unsigned long              state;                /*  0xd8   0x8 */
struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /*  0xe0  0x10 */
bool                       running;              /*  0xf0   0x1 */

/* XXX 3 bytes hole, try to pack */

struct gnet_stats_queue    qstats;               /*  0xf4  0x14 */
/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
__u8                       __cacheline_group_end__Qdisc_write[0]; /* 0x108     0 */

/* XXX 56 bytes hole, try to pack */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-8-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: cake: use qdisc_pkt_segs()

Use new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue()
for non GSO packets.

cake_overhead() does not have to recompute it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-7-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()

Avoid up to two cache line misses in qdisc dequeue() to fetch
skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held.

This gives a 5 % improvement in a TX intensive workload.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-6-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()

sch_handle_ingress() sets qdisc_skb_cb(skb)->pkt_len.

We also need to initialize qdisc_skb_cb(skb)->pkt_segs.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-5-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init()

qdisc_pkt_len_init() is currently initalizing qdisc_skb_cb(skb)->pkt_len.

Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function
to qdisc_pkt_len_segs_init().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: init shinfo->gso_segs from qdisc_pkt_len_init()

Qdisc use shinfo->gso_segs for their pkts stats in bstats_update(),
but this field needs to be initialized for SKB_GSO_DODGY users.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net_sched: make room for (struct qdisc_skb_cb)->pkt_segs

Add a new u16 field, next to pkt_len : pkt_segs

This will cache shinfo->gso_segs to speed up qdisc deqeue().

Move slave_dev_queue_mapping at the end of qdisc_skb_cb,
and move three bits from tc_skb_cb :
- post_ct
- post_ct_snat
- post_ct_dnat

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: aspeed: add AST2700 MDIO compatible

Add "aspeed,ast2700-mdio" compatible to the binding schema with a fallback
to "aspeed,ast2600-mdio".

Although the MDIO controller on AST2700 is functionally the same as the
one on AST2600, it's good practice to add a SoC-specific compatible for
new silicon. This allows future driver updates to handle any 2700-specific
integration issues without requiring devicetree changes or complex
runtime detection logic.

For now, the driver continues to bind via the existing
"aspeed,ast2600-mdio" compatible, so no driver changes are needed.

Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Jacky Chou <jacky_chou@aspeedtech.com>
Link: https://patch.msgid.link/20251120-aspeed_mdio_ast2700-v2-1-0d722bfb2c54@aspeedtech.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'mptcp-memcg-accounting-for-passive-sockets-backlog-processing'

Matthieu Baerts says:

====================
mptcp: memcg accounting for passive sockets & backlog processing

This series is split in two: the 4 first patches are linked to memcg
accounting for passive sockets, and the rest introduce the backlog
processing. They are sent together, because the first one appeared to be
needed to get the second one fully working.

The second part includes RX path improvement built around backlog
processing. The main goals are improving the RX performances _and_
increase the long term maintainability.

- Patches 1-3: preparation work to ease the introduction of the next
  patch.

- Patch 4: fix memcg accounting for passive sockets. Note that this is a
  (non-urgent) fix, but it depends on material that is currently only in
  net-next, e.g. commit 4a997d49d92a ("tcp: Save lock_sock() for memcg
  in inet_csk_accept().").

- Patches 5-6: preparation of the stack for backlog processing, removing
  assumptions that will not hold true any more after the backlog
  introduction.

- Patches 7,8,10,11,12 are more cleanups that will make the backlog
  patch a little less huge.

- Patch 9: somewhat an unrelated cleanup, included here not to forget
  about it.

- Patches 13-14: The real work is done by them. Patch 13 introduces the
  helpers needed to manipulate the msk-level backlog, and the data
  struct itself, without any actual functional change. Patch 14 finally
  uses the backlog for RX skb processing. Note that MPTCP can't use the
  sk_backlog, as the MPTCP release callback can also release and
  re-acquire the msk-level spinlock and core backlog processing works
  under the assumption that such event is not possible.
  A relevant point is memory accounts for skbs in the backlog. It's
  somewhat "original" due to MPTCP constraints. Such skbs use space from
  the incoming subflow receive buffer, do not use explicitly any forward
  allocated memory, as we can't update the msk fwd mem while enqueuing,
  nor we want to acquire again the ssk socket lock while processing the
  skbs. Instead the msk borrows memory from the subflow and reserve it
  for the backlog, see patch 5 and 14 for the gory details.
====================

Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-0-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: leverage the backlog for RX packet processing

When the msk socket is owned or the msk receive buffer is full,
move the incoming skbs in a msk level backlog list. This avoid
traversing the joined subflows and acquiring the subflow level
socket lock at reception time, improving the RX performances.

When processing the backlog, use the fwd alloc memory borrowed from
the incoming subflow. skbs exceeding the msk receive space are
not dropped; instead they are kept into the backlog until the receive
buffer is freed. Dropping packets already acked at the TCP level is
explicitly discouraged by the RFC and would corrupt the data stream
for fallback sockets.

Special care is needed to avoid adding skbs to the backlog of a closed
msk and to avoid leaving dangling references into the backlog
at subflow closing time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: introduce mptcp-level backlog

We are soon using it for incoming data processing.
MPTCP can't leverage the sk_backlog, as the latter is processed
before the release callback, and such callback for MPTCP releases
and re-acquire the socket spinlock, breaking the sk_backlog processing
assumption.

Add a skb backlog list inside the mptcp sock struct, and implement
basic helper to transfer packet to and purge such list.

Packets in the backlog are memory accounted and still use the incoming
subflow receive memory, to allow back-pressure. The backlog size is
implicitly bounded to the sum of subflows rcvbuf.

When a subflow is closed, references from the backlog to such sock
are removed.

No packet is currently added to the backlog, so no functional changes
intended here.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-13-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: borrow forward memory from subflow

In the MPTCP receive path, we release the subflow allocated fwd
memory just to allocate it again shortly after for the msk.

That could increases the failures chances, especially when we will
add backlog processing, with other actions could consume the just
released memory before the msk socket has a chance to do the
rcv allocation.

Replace the skb_orphan() call with an open-coded variant that
explicitly borrows, the fwd memory from the subflow socket instead
of releasing it.

The borrowed memory does not have PAGE_SIZE granularity; rounding to
the page size will make the fwd allocated memory higher than what is
strictly required and could make the incoming subflow fwd mem
consistently negative. Instead, keep track of the accumulated frag and
borrow the full page at subflow close time.

This allow removing the last drop in the TCP to MPTCP transition and
the associated, now unused, MIB.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-12-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: handle first subflow closing consistently

Currently, as soon as the PM closes a subflow, the msk stops accepting
data from it, even if the TCP socket could be still formally open in the
incoming direction, with the notable exception of the first subflow.

The root cause of such behavior is that code currently piggy back two
separate semantic on the subflow->disposable bit: the subflow context
must be released and that the subflow must stop accepting incoming
data.

The first subflow is never disposed, so it also never stop accepting
incoming data. Use a separate bit to mark the latter status and set such
bit in __mptcp_close_ssk() for all subflows.

Beyond making per subflow behaviour more consistent this will also
simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-11-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: drop the __mptcp_data_ready() helper

It adds little clarity and there is a single user of such helper,
just inline it in the caller.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-10-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: make mptcp_destroy_common() static

Such function is only used inside protocol.c, there is no need
to expose it to the whole stack.

Note that the function definition most be moved earlier to avoid
forward declaration.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Geliang Tang <geliang@kernel.org>
Tested-by: Geliang Tang <geliang@kernel.org>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-9-1f34b6c1e0b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>