git.ipfire.org Git - thirdparty/kernel/linux.git/log

net: dev_queue_xmit() llist adoption

Remove busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to the minimum.

Idea is that only one cpu might spin on the qdisc spinlock,
while others simply add their skb in the llist.

After this patch, we get a 300 % improvement on heavy TX workloads.
- Sending twice the number of packets per second.
- While consuming 50 % less cycles.

Note that this also allows in the future to submit batches
to various qdisc->enqueue() methods.

Tested:

- Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Before:

16 Mpps (41 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
243  0      0 2368988672  51036 1100852    0    0   146     1  242   60  0  9 91  0  0
244  0      0 2368988672  51036 1100852    0    0   536    10 487745 14718  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 503067 46033  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 494807 12107  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   702    26 492845 10110  0 52 48  0  0

Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
contended   total wait     max wait     avg wait         type   caller

    442111      6.79 s     162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
      5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
       244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
        13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8

If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
contended   total wait     max wait     avg wait         type   caller

    964508      7.10 s     147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
       201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
        12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b

@__dev_queue_xmit_ns:
[256, 512)            21 |                                                    |
[512, 1K)            631 |                                                    |
[1K, 2K)           27328 |@                                                   |
[2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
[4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)         19055 |@                                                   |
[64K, 128K)        17240 |@                                                   |
[128K, 256K)       25633 |@                                                   |
[256K, 512K)           4 |                                                    |

After:

29 Mpps (57 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
78  0      0 2369573632  32896 1350988    0    0    22     0  331  254  0  8 92  0  0
75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0

perf lock record -C0-7 sleep 1; perf lock contention
contended   total wait     max wait     avg wait         type   caller

      2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
      5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
      2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
       923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
       121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
         6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b

If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
     32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
      4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
      2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
      3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
       233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
       153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
        84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
       140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0

@__dev_queue_xmit_ns:
[128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)         483936 |@@@@@@@@@@                                          |
[1K, 2K)          265345 |@@@@@@                                              |
[2K, 4K)          145463 |@@@                                                 |
[4K, 8K)           54571 |@                                                   |
[8K, 16K)          10270 |                                                    |
[16K, 32K)          9385 |                                                    |
[32K, 64K)          7749 |                                                    |
[64K, 128K)        26799 |                                                    |
[128K, 256K)        2665 |                                                    |
[256K, 512K)         665 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: claim one cache line in Qdisc

Replace state2 field with a boolean.

Move it to a hole between qstats and state so that
we shrink Qdisc by a full cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net/sched: Fix mirred deadlock on device recursion"

This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
and 44180feaccf266d9b0b28cc4ceaac019817deb5c.

Prior patch in this series implemented loop detection
in act_mirred, we can remove q->owner to save some cycles
in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: act_mirred: add loop detection

Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
added code in the fast path, even when act_mirred is not used.

Prepare its revert by implementing loop detection in act_mirred.

Adds an array of device pointers in struct netdev_xmit.

tcf_mirred_is_act_redirect() can detect if the array
already contains the target device.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add add indirect call wrapper in skb_release_head_state()

While stress testing UDP senders on a host with expensive indirect
calls, I found cpus processing TX completions where showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.

Take care of TCP and UDP TX destructors and use INDIRECT_CALL_3() macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: packetdrill: unflake tcp_user_timeout_user-timeout-probe.pkt

This test fails the first time I am running it after a fresh virtme-ng boot.

tcp_user_timeout_user-timeout-probe.pkt:33: runtime error in write call: Expected result -1 but got 24 with errno 2 (No such file or directory)

Tweaks the timings a bit, to reduce flakiness.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soham Chakradeo <sohamch@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: qcom: ethernet: Add interconnect properties

Add documentation for the interconnect and interconnect-names
properties required when voting for AHB and AXI buses.

Suggested-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Sagar Cheluvegowda <quic_scheluve@quicinc.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://patch.msgid.link/20251015-topic-qc_stmmac_icc_bindings-v5-1-da39126cff28@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-driver-support-for-eswin-eic7700-soc-ethernet-controller'

Shangjuan Wei says:

====================
Add driver support for Eswin eic7700 SoC ethernet controller
====================

Link: https://patch.msgid.link/20251015113751.1114-1-weishangjuan@eswincomputing.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: add Eswin EIC7700 glue driver

Add Ethernet controller support for Eswin's eic7700 SoC. The driver
implements hardware initialization, clock configuration, delay
adjustment functions based on DWC Ethernet controller, and supports
device tree configuration and platform driver integration.

Signed-off-by: Zhi Li <lizhi2@eswincomputing.com>
Signed-off-by: Shangjuan Wei <weishangjuan@eswincomputing.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251015114101.1218-1-weishangjuan@eswincomputing.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: ethernet: eswin: Document for EIC7700 SoC

Add ESWIN EIC7700 Ethernet controller, supporting clock
configuration, delay adjustment and speed adaptive functions.

Signed-off-by: Zhi Li <lizhi2@eswincomputing.com>
Signed-off-by: Shangjuan Wei <weishangjuan@eswincomputing.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20251015114041.1166-1-weishangjuan@eswincomputing.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-more-cleanups'

Russell King says:

====================
net: stmmac: more cleanups

The subject for the cover message is wearing thin as I've used it a
number of times, but the scope for cleaning up the driver continues,
and continue it will do, because this is just a small fraction of the
queue.

1. make a better job of one of my previous commits, moving the holding
   of the lock into stmmac_mdio.c

2. move the mac_finish() method to be in-order with the layout of
   struct phylink_mac_ops - this order was chosen because it reflects
   the order that the methods are called, thus making the flow more
   obvious when reading code.

3. continuing on the "removal of stuff that doesn't need to happen",
   patch 3 removes the phylink_speed_(up|down) out of the path that
   is used for MTU changes - we really don't need to fiddle with the
   PHY advertisement when changing the MTU!

4. clean up tc_init()'s initialisation of flow_entries_max - this is
   the sole place that this is written, and we might as well make the
   code more easy to follow.

5. stmmac_phy_setup() really confuses me when I read the code, it's
   not really about PHY setup, but about phylink setup. So, name its
   name reflect its functionality.
====================

Link: https://patch.msgid.link/aO_HIwT_YvxkDS8D@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rename stmmac_phy_setup() to include phylink

stmmac_phy_setup() does not set up any PHY, but does setup phylink.
Rename this function to stmmac_phylink_setup() to reflect more what
it is doing.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945d-0000000Ameh-3Bs7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rearrange tc_init()

To make future changes easier, rearrange the use of dma_cap->l3l4fnum
vs priv->flow_entries_max.

Always initialise priv->flow_entries_max from dma_cap->l3l4fnum, then
use priv->flow_entries_max to determine whether we allocate
priv->flow_entries and set it up.

This change is safe because tc_init() is only called once from
stmmac_dvr_probe().

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1v945Y-0000000Ameb-2gDI@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: avoid PHY speed change when configuring MTU

There is no need to do the speed-down, speed-up dance when changing
the MTU as there is little power saving that can be gained from such
a brief interval between these, and the autonegotiation they cause
takes much longer.

Move the calls to phylink_speed_up() and phylink_speed_down() into
stmmac_open() and stmmac_release() respectively, reducing the work
done in the __-variants of these functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Gatien Chevallier <gatien.chevallier@foss.st.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945T-0000000AmeV-2BvU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: place .mac_finish() method more appropriately

Place the .mac_finish() initialiser and implementation after the
.mac_config() initialiser and method which reflects the order that
they appear in struct phylink_mac_ops, and the order in which they
are called. This keeps logically similar code together.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945O-0000000AmeP-1k0t@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwc-qos-eth: move MDIO bus locking into stmmac_mdio

Rather than dwc-qos-eth manipulating the MDIO bus lock directly, add
helpers to the stmmac MDIO layer and use them in dwc-qos-eth. This
improves my commit 87f43e6f06a2 ("net: stmmac: dwc-qos: calibrate tegra
with mdio bus idle").

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/E1v945J-0000000AmeJ-1GOb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/hsr: add interlink to fill_info output

Currently, it is possible to configure the interlink
port, but no way to read it back from userspace.

Add it to the output of hsr_fill_info(), so it can be
read from userspace, for example:

$ ip -d link show hsr0
12: hsr0: <BROADCAST,MULTICAST> mtu ...
...
hsr slave1 veth0 slave2 veth1 interlink veth2 ...

Signed-off-by: Jan Vaclav <jvaclav@redhat.com>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20251015101001.25670-2-jvaclav@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Octeontx2-af: Fix pci_alloc_irq_vectors() return value check

In cgx_probe() when pci_alloc_irq_vectors() fails the error value will
be negative and that check is sufficient.

err = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_MSIX);
if (err < 0 || err != nvec) {
...
}

When pci_alloc_irq_vectors() fail to allocate nvec number of vectors,
-ENOSPC is returned, so it would be safe to remove the check that
compares err with nvec.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251015090117.1557870-1-harshit.m.mogalapalli@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: add ipsec hw_features

Currently, netdevsim only sets dev->features, which makes the ESP features
fixed. For example:

  # ethtool -k eni0np1 | grep esp
  tx-esp-segmentation: on [fixed]
  esp-hw-offload: on [fixed]
  esp-tx-csum-hw-offload: on [fixed]

This patch adds the ESP features to hw_features, allowing them to be
changed manually. For example:

  # ethtool -k eni0np1 | grep esp
  tx-esp-segmentation: on
  esp-hw-offload: on
  esp-tx-csum-hw-offload: on

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251015083649.54744-1-liuhangbin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: amd-xgbe: use EOPNOTSUPP instead of ENOTSUPP in xgbe_phy_mii_read_c45

The MDIO read callback xgbe_phy_mii_read_c45() can propagate its return
value up through phylink_mii_ioctl() to user space via netdev ioctls such
as SIOCGMIIREG. Returning ENOTSUPP results in user space seeing
"Unknown error", since ENOTSUPP is not a standard errno value.

Replace ENOTSUPP with EOPNOTSUPP to align with the MDIO core’s
usage and ensure user space receives a proper "Operation not supported"
error instead of an unknown code.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Link: https://patch.msgid.link/20251015025751.1532149-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.18-rc2).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from CAN

  Current release - regressions:

    - udp: do not use skb_release_head_state() before
      skb_attempt_defer_free()

    - gro_cells: use nested-BH locking for gro_cell

    - dpll: zl3073x: increase maximum size of flash utility

  Previous releases - regressions:

    - core: fix lockdep splat on device unregister

    - tcp: fix tcp_tso_should_defer() vs large RTT

    - tls:
        - don't rely on tx_work during send()
        - wait for pending async decryptions if tls_strp_msg_hold fails

    - can: j1939: add missing calls in NETDEV_UNREGISTER notification
      handler

    - eth: lan78xx: fix lost EEPROM write timeout in
      lan78xx_write_raw_eeprom

  Previous releases - always broken:

    - ip6_tunnel: prevent perpetual tunnel growth

    - dpll: zl3073x: handle missing or corrupted flash configuration

    - can: m_can: fix pm_runtime and CAN state handling

    - eth:
        - ixgbe: fix too early devlink_free() in ixgbe_remove()
        - ixgbevf: fix mailbox API compatibility
        - gve: Check valid ts bit on RX descriptor before hw timestamping
        - idpf: cleanup remaining SKBs in PTP flows
        - r8169: fix packet truncation after S4 resume on RTL8168H/RTL8111H"

* tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
  udp: do not use skb_release_head_state() before skb_attempt_defer_free()
  net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
  netdevsim: set the carrier when the device goes up
  selftests: tls: add test for short splice due to full skmsg
  selftests: net: tls: add tests for cmsg vs MSG_MORE
  tls: don't rely on tx_work during send()
  tls: wait for pending async decryptions if tls_strp_msg_hold fails
  tls: always set record_type in tls_process_cmsg
  tls: wait for async encrypt in case of error during latter iterations of sendmsg
  tls: trim encrypted message to match the plaintext on short splice
  tg3: prevent use of uninitialized remote_adv and local_adv variables
  MAINTAINERS: new entry for IPv6 IOAM
  gve: Check valid ts bit on RX descriptor before hw timestamping
  net: core: fix lockdep splat on device unregister
  MAINTAINERS: add myself as maintainer for b53
  selftests: net: check jq command is supported
  net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()
  tcp: fix tcp_tso_should_defer() vs large RTT
  r8152: add error handling in rtl8152_driver_init
  usbnet: Fix using smp_processor_id() in preemptible code warnings
  ...

Merge tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fix from Niklas Cassel:

- Do not print an error message (and assume that the General Purpose
   Log Directory log page is not supported) for a device that reports a
   bogus General Purpose Logging Version.

   Unsurprisingly, many vendors fail to report the only valid General
   Purpose Logging Version (Damien)

* tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-core: relax checks in ata_read_log_directory()

udp: do not use skb_release_head_state() before skb_attempt_defer_free()

Michal reported and bisected an issue after recent adoption
of skb_attempt_defer_free() in UDP.

The issue here is that skb_release_head_state() is called twice per skb,
one time from skb_consume_udp(), then a second time from skb_defer_free_flush()
and napi_consume_skb().

As Sabrina suggested, remove skb_release_head_state() call from
skb_consume_udp().

Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in skb_attempt_defer_free()

Many thanks to Michal, Sabrina, Paolo and Florian for their help.

Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()")
Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz>
Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'txgbe-feat-new-aml-firmware'

Jiawen Wu says:

====================
TXGBE feat new AML firmware

The firmware of AML devices are redesigned to adapt to more PHY
interfaces. Optimize the driver to be compatible with the new firmware.

v1: https://lore.kernel.org/all/20250928093923.30456-1-jiawenwu@trustnetic.com/
====================

Link: https://patch.msgid.link/20251014061726.36660-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: txgbe: rename txgbe_get_phy_link()

The function txgbe_get_phy_link() is more appropriately named
txgbe_get_mac_link(), since it reads the link status from the MAC
register.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20251014061726.36660-4-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: txgbe: optimize the flow to setup PHY for AML devices

To adapt to new firmware for AML devices, the driver should send the
"SET_LINK_CMD" to the firmware only once when switching PHY interface
mode, and no longer needs to re-trigger PHY configuration based on the
RX signal interrupt (TXGBE_GPIOBIT_3).

In previous firmware versions, the PHY was configured only after receiving
"SET_LINK_CMD", and might remain incomplete if the RX signal was lost.
To handle this case, the driver used TXGBE_GPIOBIT_3 interrupt to resend
the command. This workaround is no longer necessary with the new firmware.

And the unknown link speed is permitted in the mailbox buffer.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20251014061726.36660-3-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: txgbe: expend SW-FW mailbox buffer size to identify QSFP module

Recent firmware updates introduce additional fields in the mailbox message
to provide more information for identifying 40G and 100G QSFP modules.
To accommodate these new fields, expand the mailbox buffer size by 4 bytes.

Without this change, drivers built against the updated firmware cannot
properly identify modules due to mismatched mailbox message lengths.

The old firmware version that used the smaller mailbox buffer has never
been publicly released, so there are no backward-compatibility concerns.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251014061726.36660-2-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: Convert amd,xgbe-seattle-v1a to DT schema

Convert amd,xgbe-seattle-v1a binding to DT schema format. It's a
straight-forward conversion.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251013213049.686797-2-robh@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'add-aarch64-support-for-fbnic'

Dimitri Daskalakis says:

====================
Add aarch64 support for FBNIC

We need to support aarch64 with 64K PAGE_SIZE, and I uncovered an issue
during testing.
====================

Link: https://patch.msgid.link/20251013211449.1377054-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fbnic: Allow builds for all 64 bit architectures

This enables aarch64 testing, but there's no reason we cannot support other
architectures.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013211449.1377054-3-dimitri.daskalakis1@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: fbnic: Fix page chunking logic when PAGE_SIZE > 4K

The HW always works on a 4K page size. When the OS supports larger
pages, we fragment them across multiple BDQ descriptors.
We were not properly incrementing the descriptor, which resulted in us
specifying the last chunks id/addr and then 15 zero descriptors. This
would cause packet loss and driver crashes. This is not a fix since the
Kconfig prevents use outside of x86.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013211449.1377054-2-dimitri.daskalakis1@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset

dev->chipid is used in lan78xx_init_mac_address before it's initialized:

lan78xx_reset() {
    lan78xx_init_mac_address()
        lan78xx_read_eeprom()
            lan78xx_read_raw_eeprom() <- dev->chipid is used here

    dev->chipid = ... <- dev->chipid is initialized correctly here
}

Reorder initialization so that dev->chipid is set before calling
lan78xx_init_mac_address().

Fixes: a0db7d10b76e ("lan78xx: Add to handle mux control per chip id")
Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Khalid Aziz <khalid@kernel.org>
Link: https://patch.msgid.link/20251013181648.35153-1-viswanathiyyappan@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2025-10-14

The first 2 paches are by Celeste Liu and target the gS_usb driver.
The first patch remove the limitation to 3 CAN interface per USB
device. The second patch adds the missing population of
net_device->dev_port.

The next 4 patches are by me and fix the m_can driver. They add a
missing pm_runtime_disable(), fix the CAN state transition back to
Error Active and fix the state after ifup and suspend/resume.

Another patch by me targets the m_can driver, too and replaces Dong
Aisheng's old email address.

The next 2 patches are by Vincent Mailhol and update the CAN
networking Documentation.

Tetsuo Handa contributes the last patch that add missing cleanup calls
in the NETDEV_UNREGISTER notification handler.

* tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: j1939: add missing calls in NETDEV_UNREGISTER notification handler
  can: add Transmitter Delay Compensation (TDC) documentation
  can: remove false statement about 1:1 mapping between DLC and length
  can: m_can: replace Dong Aisheng's old email address
  can: m_can: fix CAN state in system PM
  can: m_can: m_can_chip_config(): bring up interface in correct state
  can: m_can: m_can_handle_state_errors(): fix CAN state transition to Error Active
  can: m_can: m_can_plat_remove(): add missing pm_runtime_disable()
  can: gs_usb: gs_make_candev(): populate net_device->dev_port
  can: gs_usb: increase max interface to U8_MAX
====================

Link: https://patch.msgid.link/20251014122140.990472-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-airoha-npu-introduce-support-for-airoha-7583-npu'

Lorenzo Bianconi says:

====================
net: airoha: npu: Introduce support for Airoha 7583 NPU

Introduce support for Airoha 7583 SoC NPU.
====================

Link: https://patch.msgid.link/20251013-airoha-npu-7583-v3-0-00f748b5a0c7@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: npu: Add 7583 SoC support

Introduce support for Airoha 7583 SoC NPU selecting proper firmware images.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013-airoha-npu-7583-v3-3-00f748b5a0c7@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: npu: Add airoha_npu_soc_data struct

Introduce airoha_npu_soc_data structure in order to generalize per-SoC
NPU firmware info. Introduce airoha_npu_load_firmware utility routine.
This is a preliminary patch in order to introduce AN7583 NPU support.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013-airoha-npu-7583-v3-2-00f748b5a0c7@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: airoha: npu: Add AN7583 support

Introduce AN7583 NPU support to Airoha EN7581 NPU device-tree bindings.

Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20251013-airoha-npu-7583-v3-1-00f748b5a0c7@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'preserve-pse-pd692x0-configuration-across-reboots'

Kory Maincent says:

====================
Preserve PSE PD692x0 configuration across reboots

Previously, the driver would always reconfigure the PSE hardware on
probe, causing a port matrix reflash that resulted in temporary power
loss to all connected devices. This change maintains power continuity
by preserving existing configuration when the PSE has been previously
initialized.
====================

Link: https://patch.msgid.link/20251013-feature_pd692x0_reboot_keep_conf-v2-0-68ab082a93dd@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pse-pd: pd692x0: Preserve PSE configuration across reboots

Detect when PSE hardware is already configured (user byte == 42) and
skip hardware initialization to prevent power interruption to connected
devices during system reboots.

Previously, the driver would always reconfigure the PSE hardware on
probe, causing a port matrix reflash that resulted in temporary power
loss to all connected devices. This change maintains power continuity
by preserving existing configuration when the PSE has been previously
initialized.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013-feature_pd692x0_reboot_keep_conf-v2-3-68ab082a93dd@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pse-pd: pd692x0: Separate configuration parsing from hardware setup

Cache the port matrix configuration in driver private data to enable
PSE controller reconfiguration. This refactoring separates device tree
parsing from hardware configuration application, allowing settings to be
reapplied without reparsing the device tree.

This refactoring is a prerequisite for preserving PSE configuration
across reboots to prevent power disruption to connected devices.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013-feature_pd692x0_reboot_keep_conf-v2-2-68ab082a93dd@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pse-pd: pd692x0: Replace __free macro with explicit kfree calls

Replace __free(kfree) with explicit kfree() calls to follow the net
subsystem policy of avoiding automatic cleanup macros as described in
the documentation.

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013-feature_pd692x0_reboot_keep_conf-v2-1-68ab082a93dd@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bcmasp: Add support for PHY-based Wake-on-LAN

If available, interrogate the PHY to find out whether we can use it for
Wake-on-LAN. This can be a more power efficient way of implementing that
feature, especially when the MAC is powered off in low power states.

Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251013172306.2250223-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netdevsim: set the carrier when the device goes up

Bringing a linked netdevsim device down and then up causes communication
failure because both interfaces lack carrier. Basically a ifdown/ifup on
the interface make the link broken.

Commit 3762ec05a9fbda ("netdevsim: add NAPI support") added supported
for NAPI, calling netif_carrier_off() in nsim_stop(). This patch
re-enables the carrier symmetrically on nsim_open(), in case the device
is linked and the peer is up.

Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: 3762ec05a9fbda ("netdevsim: add NAPI support")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251014-netdevsim_fix-v2-1-53b40590dae1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tls-misc-bugfixes'

Sabrina Dubroca says:

====================
tls: misc bugfixes

Jann Horn reported multiple bugs in kTLS. This series addresses them,
and adds some corresponding selftests for those that are reproducible
(and without failure injection).
====================

Link: https://patch.msgid.link/cover.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: tls: add test for short splice due to full skmsg

We don't have a test triggering a partial splice caused by a full
skmsg. Add one, based on a program by Jann Horn.

Use MAX_FRAGS=48 to make sure the skmsg will be full for any allowed
value of CONFIG_MAX_SKB_FRAGS (17..45).

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/1d129a15f526ea3602f3a2b368aa0b6f7e0d35d5.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: tls: add tests for cmsg vs MSG_MORE

We don't have a test to check that MSG_MORE won't let us merge records
of different types across sendmsg calls.

Add new tests that check:
- MSG_MORE is only allowed for DATA records
- a pending DATA record gets closed and pushed before a non-DATA
record is processed

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/b34feeadefe8a997f068d5ed5617afd0072df3c0.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: don't rely on tx_work during send()

With async crypto, we rely on tx_work to actually transmit records
once encryption completes. But while send() is running, both the
tx_lock and socket lock are held, so tx_work_handler cannot process
the queue of encrypted records, and simply reschedules itself. During
a large send(), this could last a long time, and use a lot of memory.

Transmit any pending encrypted records before restarting the main
loop of tls_sw_sendmsg_locked.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/8396631478f70454b44afb98352237d33f48d34d.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: wait for pending async decryptions if tls_strp_msg_hold fails

Async decryption calls tls_strp_msg_hold to create a clone of the
input skb to hold references to the memory it uses. If we fail to
allocate that clone, proceeding with async decryption can lead to
various issues (UAF on the skb, writing into userspace memory after
the recv() call has returned).

In this case, wait for all pending decryption requests.

Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/b9fe61dcc07dab15da9b35cf4c7d86382a98caf2.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: always set record_type in tls_process_cmsg

When userspace wants to send a non-DATA record (via the
TLS_SET_RECORD_TYPE cmsg), we need to send any pending data from a
previous MSG_MORE send() as a separate DATA record. If that DATA record
is encrypted asynchronously, tls_handle_open_record will return
-EINPROGRESS. This is currently treated as an error by
tls_process_cmsg, and it will skip setting record_type to the correct
value, but the caller (tls_sw_sendmsg_locked) handles that return
value correctly and proceeds with sending the new message with an
incorrect record_type (DATA instead of whatever was requested in the
cmsg).

Always set record_type before handling the open record. If
tls_handle_open_record returns an error, record_type will be
ignored. If it succeeds, whether with synchronous crypto (returning 0)
or asynchronous (returning -EINPROGRESS), the caller will proceed
correctly.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/0457252e578a10a94e40c72ba6288b3a64f31662.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: wait for async encrypt in case of error during latter iterations of sendmsg

If we hit an error during the main loop of tls_sw_sendmsg_locked (eg
failed allocation), we jump to send_end and immediately
return. Previous iterations may have queued async encryption requests
that are still pending. We should wait for those before returning, as
we could otherwise be reading from memory that userspace believes
we're not using anymore, which would be a sort of use-after-free.

This is similar to what tls_sw_recvmsg already does: failures during
the main loop jump to the "wait for async" code, not straight to the
unlock/return.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/c793efe9673b87f808d84fdefc0f732217030c52.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tls: trim encrypted message to match the plaintext on short splice

During tls_sw_sendmsg_locked, we pre-allocate the encrypted message
for the size we're expecting to send during the current iteration, but
we may end up sending less, for example when splicing: if we're
getting the data from small fragments of memory, we may fill up all
the slots in the skmsg with less data than expected.

In this case, we need to trim the encrypted message to only the length
we actually need, to avoid pushing uninitialized bytes down the
underlying TCP socket.

Fixes: fe1e81d4f73b ("tls/sw: Support MSG_SPLICE_PAGES")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/66a0ae99c9efc15f88e9e56c1f58f902f442ce86.1760432043.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove obsolete WARN_ON(refcount_read(&sk->sk_refcnt) == 1)

sk->sk_refcnt has been converted to refcount_t in 2017.

__sock_put(sk) being refcount_dec(&sk->sk_refcnt), it will complain
loudly if the current refcnt is 1 (or less) in a non racy way.

We can remove four WARN_ON() in favor of the generic refcount_dec()
check.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xuanqiang Luo<luoxuanqiang@kylinos.cn>
Link: https://patch.msgid.link/20251014140605.2982703-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tg3: prevent use of uninitialized remote_adv and local_adv variables

Some execution paths that jump to the fiber_setup_done label
could leave the remote_adv and local_adv variables uninitialized
and then use it.

Initialize this variables at the point of definition to avoid this.

Fixes: 85730a631f0c ("tg3: Add SGMII phy support for 5719/5718 serdes")
Co-developed-by: Alexandr Sapozhnikov <alsp705@gmail.com>
Signed-off-by: Alexandr Sapozhnikov <alsp705@gmail.com>
Signed-off-by: Alexey Simakov <bigalex934@gmail.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20251014164736.5890-1-bigalex934@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: new entry for IPv6 IOAM

Create a maintainer entry for IPv6 IOAM. Add myself as I authored most
if not all of the IPv6 IOAM code in the kernel and actively participate
in the related IETF groups.

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Link: https://patch.msgid.link/20251014170650.27679-1-justin.iurman@uliege.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'vfs-6.18-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

- Handle inode number mismatches in nsfs file handles

- Update the comment to init_file()

- Add documentation link for EBADF in the rust file code

- Skip read lock assertion for read-only filesystems when using dax

- Don't leak disconnected dentries during umount

- Fix new coredump input pattern validation

- Handle ENOIOCTLCMD conversion in vfs_fileattr_{g,s}et() correctly

- Remove redundant IOCB_DIO_CALLER_COMP clearing in overlayfs

* tag 'vfs-6.18-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ovl: remove redundant IOCB_DIO_CALLER_COMP clearing
  fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
  Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"
  coredump: fix core_pattern input validation
  vfs: Don't leak disconnected dentries on umount
  dax: skip read lock assertion for read-only filesystems
  rust: file: add intra-doc link for 'EBADF'
  fs: update comment in init_file()
  nsfs: handle inode number mismatches gracefully in file handles

net: bcmgenet: remove unused platform code

This effectively reverts b0ba512e25d7 ("net: bcmgenet: enable driver to
work without a device tree"). There has never been an in-tree user of
struct bcmgenet_platform_data, all devices use OF or ACPI.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/108b4e64-55d4-4b4e-9a11-3c810c319d66@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8152: Advertise software timestamp information.

Driver calls skb_tx_timestamp(skb) in rtl8152_start_xmit(), but does not advertise the capability in ethtool.
Advertise software timestamp capabilities on struct ethtool_ops.

Signed-off-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Reviewed-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251014055234.46527-1-rawal.abhishek92@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gve: Check valid ts bit on RX descriptor before hw timestamping

The device returns a valid bit in the LSB of the low timestamp byte in
the completion descriptor that the driver should check before
setting the SKB's hardware timestamp. If the timestamp is not valid, do not
hardware timestamp the SKB.

Cc: stable@vger.kernel.org
Fixes: b2c7aeb49056 ("gve: Implement ndo_hwtstamp_get/set for RX timestamping")
Reviewed-by: Joshua Washington <joshwash@google.com>
Signed-off-by: Tim Hostetler <thostet@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20251014004740.2775957-1-hramamurthy@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-deal-with-sticky-tx-queues'

Eric Dumazet says:

====================
net: deal with sticky tx queues

Back in 2010, Tom Herbert added skb->ooo_okay to TCP flows.

Extend the feature to connected flows for other protocols like UDP.

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue of NIC queue,
especially if TX completion is delayed because of a stressed cpu
or aggressive interrupt mitigation.

The so-called "strange attractors" has caused many performance
issues, we need to do better now that TCP reacts better to
potential reorders.

Add new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this series, we no longer have to make sure threads
are pinned to cpus, they can migrate without adding
too much [spinlock, qdisc, TX completion] pressure anymore.
====================

Link: https://patch.msgid.link/20251013152234.842065-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: allow busy connected flows to switch tx queues

This is a followup of commit 726e9e8b94b9 ("tcp: refine
skb->ooo_okay setting") and of prior commit in this series
("net: control skb->ooo_okay from skb_set_owner_w()")

skb->ooo_okay might never be set for bulk flows that always
have at least one skb in a qdisc queue of NIC queue,
especially if TX completion is delayed because of a stressed cpu.

The so-called "strange attractors" has caused many performance
issues (see for instance 9b462d02d6dd ("tcp: TCP Small Queues
and strange attractors")), we need to do better.

We have tried very hard to avoid reorders because TCP was
not dealing with them nicely a decade ago.

Use the new net.core.txq_reselection_ms sysctl to let
flows follow XPS and select a more efficient queue.

After this patch, we no longer have to make sure threads
are pinned to cpus, they now can be migrated without
adding too much spinlock/qdisc/TX completion pressure anymore.

TX completion part was problematic, because it added false sharing
on various socket fields, but also added false sharing and spinlock
contention in mm layers. Calling skb_orphan() from ndo_start_xmit()
is not an option unfortunately.

Note for later:

1) move sk->sk_tx_queue_mapping closer
to sk_tx_queue_mapping_jiffies for better cache locality.

2) Study if 9b462d02d6dd ("tcp: TCP Small Queues
and strange attractors") could be revised.

Tested:

Used a host with 32 TX queues, shared by groups of 8 cores.
XPS setup :

echo ff >/sys/class/net/eth1/queue/tx-0/xps_cpus
echo ff00 >/sys/class/net/eth1/queue/tx-1/xps_cpus
echo ff0000 >/sys/class/net/eth1/queue/tx-2/xps_cpus
echo ff000000 >/sys/class/net/eth1/queue/tx-3/xps_cpus
echo ff,00000000 >/sys/class/net/eth1/queue/tx-4/xps_cpus
echo ff00,00000000 >/sys/class/net/eth1/queue/tx-5/xps_cpus
echo ff0000,00000000 >/sys/class/net/eth1/queue/tx-6/xps_cpus
echo ff000000,00000000 >/sys/class/net/eth1/queue/tx-7/xps_cpus
...

Launched a tcp_stream with 15 threads and 1000 flows, initially affined to core 0-15

taskset -c 0-15 tcp_stream -T15 -F1000 -l1000 -c -H target_host

Checked that only queues 0 and 1 are used as instructed by XPS :
tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 123489410b 1890p
backlog 69809026b 1064p
backlog 52401054b 805p

Then force each thread to run on cpu 1,9,17,25,33,41,49,57,65,73,81,89,97,105,113,121

C=1;PID=`pidof tcp_stream`;for P in `ls /proc/$PID/task`; do taskset -pc $C $P; C=$(($C + 8));done

Set txq_reselection_ms to 1000
echo 1000 > /proc/sys/net/core/txq_reselection_ms

Check that the flows have migrated nicely:

tc -s qdisc show dev eth1|grep backlog|grep -v "backlog 0b 0p"
backlog 130508314b 1916p
backlog 8584380b 126p
backlog 8584380b 126p
backlog 8379990b 123p
backlog 8584380b 126p
backlog 8487484b 125p
backlog 8584380b 126p
backlog 8448120b 124p
backlog 8584380b 126p
backlog 8720640b 128p
backlog 8856900b 130p
backlog 8584380b 126p
backlog 8652510b 127p
backlog 8448120b 124p
backlog 8516250b 125p
backlog 7834950b 115p

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add /proc/sys/net/core/txq_reselection_ms control

Add a new sysctl to control how often a queue reselection
can happen even if a flow has a persistent queue of skbs
in a Qdisc or NIC queue.

A value of zero means the feature is disabled.

Default is 1000 (1 second).

This sysctl is used in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: control skb->ooo_okay from skb_set_owner_w()

15 years after Tom Herbert added skb->ooo_okay, only TCP transport
benefits from it.

We can support other transports directly from skb_set_owner_w().

If no other TX packet for this socket is in a host queue (qdisc, NIC queue)
there is no risk of self-inflicted reordering, we can set skb->ooo_okay.

This allows netdev_pick_tx() to choose a TX queue based on XPS settings,
instead of reusing the queue chosen at the time the first packet was sent
for connected sockets.

Tested:
  500 concurrent UDP_RR connected UDP flows, host with 32 TX queues,
  512 cpus, XPS setup.

  super_netperf 500 -t UDP_RR -H <host> -l 1000 -- -r 100,100 -Nn &

This patch saves between 10% and 20% of cycles, depending on how
process scheduler migrates threads among cpus.

Using following bpftrace script, we can see the effect on Qdisc/NIC tx queues
being better used (less cache line misses).

bpftrace -e '
k:__dev_queue_xmit { @start[cpu] = nsecs; }
kr:__dev_queue_xmit {
if (@start[cpu]) {
    $delay = nsecs - @start[cpu];
    delete(@start[cpu]);
    @__dev_queue_xmit_ns = hist($delay);
}
}
END { clear(@start); }'

Before:
@__dev_queue_xmit_ns:
[128, 256)             6 |                                                    |
[256, 512)        116283 |                                                    |
[512, 1K)        1888205 |@@@@@@@@@@@                                         |
[1K, 2K)         8106167 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |
[2K, 4K)         8699293 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)         2600676 |@@@@@@@@@@@@@@@                                     |
[8K, 16K)         721688 |@@@@                                                |
[16K, 32K)        122995 |                                                    |
[32K, 64K)         10639 |                                                    |
[64K, 128K)          119 |                                                    |
[128K, 256K)           1 |                                                    |

After:
@__dev_queue_xmit_ns:
[128, 256)             3 |                                                    |
[256, 512)        651112 |@@                                                  |
[512, 1K)        8109938 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[1K, 2K)        16081031 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)         2411692 |@@@@@@@                                             |
[4K, 8K)           98994 |                                                    |
[8K, 16K)           1536 |                                                    |
[16K, 32K)           587 |                                                    |
[32K, 64K)             2 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add SK_WMEM_ALLOC_BIAS constant

sk->sk_wmem_alloc is initialized to 1, and sk_wmem_alloc_get()
takes care of this initial value.

Add SK_WMEM_ALLOC_BIAS define to not spread this magic value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251013152234.842065-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: better handle TCP_TX_DELAY on established flows

Some applications uses TCP_TX_DELAY socket option after TCP flow
is established.

Some metrics need to be updated, otherwise TCP might take time to
adapt to the new (emulated) RTT.

This patch adjusts tp->srtt_us, tp->rtt_min, icsk_rto
and sk->sk_pacing_rate.

This is best effort, and for instance icsk_rto is reset
without taking backoff into account.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251013145926.833198-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: nxp,sja1105: Add optional clock

Add optional clock for OSC_IN and fix the below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6qp-prtwd3.dtb: switch@0 (nxp,sja1105q): Unevaluated properties are not allowed ('clocks' was unexpected)

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20251010183418.2179063-1-Frank.Li@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'bitmap-for-v6.18-rc2' of https://github.com/norov/linux

Pull bitmap fixes from Yury Norov:
"A "unnecessary `unsafe`" warning fix for bitmap/rust, and one leftover
  patch for FIELD_PREP_WM16() conversion.

   - rust: bitmap: clean Rust 1.92.0 `unused_unsafe` warning (Miguel)

   - FIELD_PREP_WM16() rework leftover (Nicolas)"

* tag 'bitmap-for-v6.18-rc2' of https://github.com/norov/linux:
  PM / devfreq: rockchip-dfi: switch to FIELD_PREP_WM16 macro
  rust: bitmap: clean Rust 1.92.0 `unused_unsafe` warning

Remove long-stale ext3 defconfig option

Inspired by commit c065b6046b34 ("Use CONFIG_EXT4_FS instead of
CONFIG_EXT3_FS in all of the defconfigs") I looked around for any other
left-over EXT3 config options, and found some old defconfig files still
mentioned CONFIG_EXT3_DEFAULTS_TO_ORDERED.

That config option was removed a decade ago in commit c290ea01abb7 ("fs:
Remove ext3 filesystem driver"). It had a good run, but let's remove it
for good.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:

- Fix regression caused by removing CONFIG_EXT3_FS when testing some
   very old defconfigs

- Avoid a BUG_ON when opening a file on a maliciously corrupted file
   system

- Avoid mm warnings when freeing a very large orphan file metadata

- Avoid a theoretical races between metadata writeback and checkpoints
   (it's very hard to hit in practice, since the race requires that the
   writeback take a very long time)

* tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
  ext4: free orphan info with kvfree
  ext4: detect invalid INLINE_DATA + EXTENTS flag combination
  ext4, doc: fix and improve directory hash tree description
  ext4: wait for ongoing I/O to complete before freeing blocks
  jbd2: ensure that all ongoing I/O complete before freeing blocks

PM / devfreq: rockchip-dfi: switch to FIELD_PREP_WM16 macro

The era of hand-rolled HIWORD_UPDATE macros is over, at least for those
drivers that use constant masks.

Like many other Rockchip drivers, rockchip-dfi brings with it its own
HIWORD_UPDATE macro. This variant doesn't shift the value (and like the
others, doesn't do any checking).

Remove it, and replace instances of it with hw_bitfield.h's
FIELD_PREP_WM16. Since FIELD_PREP_WM16 requires contiguous masks and
shifts the value for us, some reshuffling of definitions needs to
happen.

This gives us better compile-time error checking, and in my opinion,
nicer code.

Tested on an RK3568 ODROID-M1 board (LPDDR4X at 1560 MHz, an RK3588
Radxa ROCK 5B board (LPDDR4X at 2112 MHz) and an RK3588 Radxa ROCK 5T
board (LPDDR5 at 2400 MHz). perf measurements were consistent with the
measurements of stress-ng --stream in all cases.

Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>

rust: bitmap: clean Rust 1.92.0 `unused_unsafe` warning

Starting with Rust 1.92.0 (expected 2025-12-11), Rust allows to safely
take the address of a union field [1][2]:

      CLIPPY L rust/kernel.o
    error: unnecessary `unsafe` block
       --> rust/kernel/bitmap.rs:169:13
        |
    169 |             unsafe { core::ptr::addr_of!(self.repr.bitmap) }
        |             ^^^^^^ unnecessary `unsafe` block
        |
        = note: `-D unused-unsafe` implied by `-D warnings`
        = help: to override `-D warnings` add `#[allow(unused_unsafe)]`

    error: unnecessary `unsafe` block
       --> rust/kernel/bitmap.rs:185:13
        |
    185 |             unsafe { core::ptr::addr_of_mut!(self.repr.bitmap) }
        |             ^^^^^^ unnecessary `unsafe` block

Thus allow both instances to clean the warning in newer compilers.

Link: https://github.com/rust-lang/rust/issues/141264
Link: https://github.com/rust-lang/rust/pull/141469
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>

Merge branch 'net-airoha-add-some-new-ethtool-bits'

Lorenzo Bianconi says:

====================
net: airoha: Add some new ethtool bits

- add missing stats to ethtool ethtool_eth_mac_stats struct
- set get_link ethtool callback to ethtool_op_get_link routine
====================

Link: https://patch.msgid.link/20251013-airoha-ethtool-improvements-v1-0-fdd1c6fc9be1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Add get_link ethtool callback

Set get_link ethtool callback to ethtool_op_get_link routine.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251013-airoha-ethtool-improvements-v1-2-fdd1c6fc9be1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Add missing stats to ethtool_eth_mac_stats

Add the following stats to ethtool ethtool_eth_mac_stats stats:
- FramesTransmittedOK
- OctetsTransmittedOK
- FramesReceivedOK
- OctetsReceivedOK

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251013-airoha-ethtool-improvements-v1-1-fdd1c6fc9be1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: core: fix lockdep splat on device unregister

Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.

If the device list is too long, this will lock more device mutexes than
lockdep can handle:

unshare -n \
bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'

BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by kworker/u16:1/69:
#0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
#1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
#2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
#3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
#4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]

Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.

Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.

Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20251013185052.14021-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fealnx: fix typo in comments

There are a few typos in comments:
- replace "avilable" with "available"
- replace "mutlicast" with "multicast"

Signed-off-by: Denis Benato <benato.denis96@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013183632.1226627-1-benato.denis96@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mdio: use macro module_driver to avoid boilerplate code

Use macro module_driver to avoid boilerplate code.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/e5c37417-4984-4b57-8154-264deef61e0d@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: add myself as maintainer for b53

I wrote the original OpenWrt driver that Florian used as the base for
the dsa driver, I might as well take responsibility for it.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20251013180347.133246-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: b53: implement port isolation support

Implement port isolation support via the Protected Ports register.

Protected ports can only communicate with unprotected ports, but not
with each other, matching the expected behaviour of isolated ports.

Tested on BCM963268BU.

Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20251013152834.100169-1-jonas.gorski@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: fix various typos in comments and strings

Fix several minor typos and grammatical errors in comments and log
(in fbnic firmware, PCI, and time modules)

Changes include:
- "cordeump" -> "coredump"
- "of" -> "off" in RPC config comment
- "healty" -> "healthy" in firmware heartbeat comment
- "Firmware crashed detected!" -> "Firmware crash detected!"
- "The could be caused" -> "This could be caused"
- "lockng" -> "locking" in fbnic_time.c

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251013160507.768820-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: correct debug message function name in br_fill_ifinfo

The debug message in br_fill_ifinfo() incorrectly refers to br_fill_info
instead of the actual function name. Update it for clarity in debugging
output.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20251013100121.755899-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

- Fix a crasher reported by rtm@csail.mit.edu

* tag 'nfsd-6.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Define a proc_layoutcommit for the FlexFiles layout type

Merge tag 'for-linus-6.18-2' of https://github.com/cminyard/linux-ipmi

Pull IPMI fixes from Corey Minyard:
"A few bug fixes for patches that went in this release: a refcount
  error and some missing or incorrect error checks"

* tag 'for-linus-6.18-2' of https://github.com/cminyard/linux-ipmi:
  ipmi: Fix handling of messages with provided receive message pointer
  mfd: ls2kbmc: check for devm_mfd_add_devices() failure
  mfd: ls2kbmc: Fix an IS_ERR() vs NULL check in probe()

selftests: net: check jq command is supported

The jq command is used in vlan_bridge_binding.sh, if it is not supported,
the test will spam the following log.

  # ./vlan_bridge_binding.sh: line 51: jq: command not found
  # ./vlan_bridge_binding.sh: line 51: jq: command not found
  # ./vlan_bridge_binding.sh: line 51: jq: command not found
  # ./vlan_bridge_binding.sh: line 51: jq: command not found
  # ./vlan_bridge_binding.sh: line 51: jq: command not found
  # TEST: Test bridge_binding on->off when lower down                   [FAIL]
  #       Got operstate of , expected 0

The rtnetlink.sh has the same problem. It makes sense to check if jq is
installed before running these tests. After this patch, the
vlan_bridge_binding.sh skipped if jq is not supported:

  # timeout set to 3600
  # selftests: net: vlan_bridge_binding.sh
  # TEST: jq not installed                                              [SKIP]

Fixes: dca12e9ab760 ("selftests: net: Add a VLAN bridge binding selftest")
Fixes: 6a414fd77f61 ("selftests: rtnetlink: Add an address proto test")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20251013080039.3035898-1-wangliang74@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/hsr: add protocol version to fill_info output

Currently, it is possible to configure IFLA_HSR_VERSION, but
there is no way to check in userspace what the currently
configured HSR protocol version is.

Add it to the output of hsr_fill_info(), when the interface
is using the HSR protocol. Let's not expose it when using
the PRP protocol, since it only has one version and it's
not possible to set it from userspace.

This info could then be used by e.g. ip(8), like so:
$ ip -d link show hsr0
12: hsr0: <BROADCAST,MULTICAST> mtu ...
...
hsr slave1 veth0 slave2 veth1 ... proto 0 version 1
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Jan Vaclav <jvaclav@redhat.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20251009210903.1055187-6-jvaclav@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netmem: replace __netmem_clear_lsb() with netmem_to_nmdesc()

Now that we have struct netmem_desc, it'd better access the pp fields
via struct netmem_desc rather than struct net_iov.

Introduce netmem_to_nmdesc() for safely converting netmem_ref to
netmem_desc regardless of the type underneath e.i. netmem_desc, net_iov.

While at it, remove __netmem_clear_lsb() and make netmem_to_nmdesc()
used instead.

Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20251013044133.69472-1-byungchul@sk.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hdlc_ppp: fix potential null pointer in ppp_cp_event logging

drivers/net/wan/hdlc_ppp.c: In function ‘ppp_cp_event’:
drivers/net/wan/hdlc_ppp.c:353:17: warning: ‘%s’ directive argument is null [-Wformat-overflow=]
  353 |                 netdev_info(dev, "%s down\n", proto_name(pid));
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/wan/hdlc_ppp.c:342:17: warning: ‘%s’ directive argument is null [-Wformat-overflow=]
  342 |                 netdev_info(dev, "%s up\n", proto_name(pid));
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Update proto_name() to return "LCP" by default instead of NULL.
This change silences the compiler without changing existing behavior
and removes the need for the local 'pname' variable in ppp_cp_event.

Suggested-by: Krzysztof Hałasa <khalasa@piap.pl>
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Acked-by: Krzysztof Hałasa <khalasa@piap.pl>
Link: https://patch.msgid.link/20251013014319.1608706-1-kriish.sharma2006@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()

Completion napi can free out-of-order tx descriptors if hw QoS is
enabled and packets with different priority are queued to same DMA ring.
Take into account possible out-of-order reports checking if the tx queue
is full using circular buffer head/tail pointer instead of the number of
queued packets.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251012-airoha-tx-busy-queue-v2-1-a600b08bab2d@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: fix tcp_tso_should_defer() vs large RTT

Neal reported that using neper tcp_stream with TCP_TX_DELAY
set to 50ms would often lead to flows stuck in a small cwnd mode,
regardless of the congestion control.

While tcp_stream sets TCP_TX_DELAY too late after the connect(),
it highlighted two kernel bugs.

The following heuristic in tcp_tso_should_defer() seems wrong
for large RTT:

delta = tp->tcp_clock_cache - head->tstamp;
/* If next ACK is likely to come too late (half srtt), do not defer */
if ((s64)(delta - (u64)NSEC_PER_USEC * (tp->srtt_us >> 4)) < 0)
goto send_now;

If next ACK is expected to come in more than 1 ms, we should
not defer because we prefer a smooth ACK clocking.

While blamed commit was a step in the good direction, it was not
generic enough.

Another patch fixing TCP_TX_DELAY for established flows
will be proposed when net-next reopens.

Fixes: 50c8339e9299 ("tcp: tso: restore IW10 after TSO autosizing")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20251011115742.1245771-1-edumazet@google.com
[pabeni@redhat.com: fixed whitespace issue]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

r8152: add error handling in rtl8152_driver_init

rtl8152_driver_init() is missing the error handling.
When rtl8152_driver registration fails, rtl8152_cfgselector_driver
should be deregistered.

Fixes: ec51fbd1b8a2 ("r8152: add USB device driver for config selection")
Cc: stable@vger.kernel.org
Signed-off-by: Yi Cong <yicong@kylinos.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251011082415.580740-1-yicongsrfy@163.com
[pabeni@redhat.com: clarified the commit message]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

usbnet: Fix using smp_processor_id() in preemptible code warnings

Syzbot reported the following warning:

BUG: using smp_processor_id() in preemptible [00000000] code: dhcpcd/2879
caller is usbnet_skb_return+0x74/0x490 drivers/net/usb/usbnet.c:331
CPU: 1 UID: 0 PID: 2879 Comm: dhcpcd Not tainted 6.15.0-rc4-syzkaller-00098-g615dca38c2ea #0 PREEMPT(voluntary)
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x16c/0x1f0 lib/dump_stack.c:120
check_preemption_disabled+0xd0/0xe0 lib/smp_processor_id.c:49
usbnet_skb_return+0x74/0x490 drivers/net/usb/usbnet.c:331
usbnet_resume_rx+0x4b/0x170 drivers/net/usb/usbnet.c:708
usbnet_change_mtu+0x1be/0x220 drivers/net/usb/usbnet.c:417
__dev_set_mtu net/core/dev.c:9443 [inline]
netif_set_mtu_ext+0x369/0x5c0 net/core/dev.c:9496
netif_set_mtu+0xb0/0x160 net/core/dev.c:9520
dev_set_mtu+0xae/0x170 net/core/dev_api.c:247
dev_ifsioc+0xa31/0x18d0 net/core/dev_ioctl.c:572
dev_ioctl+0x223/0x10e0 net/core/dev_ioctl.c:821
sock_do_ioctl+0x19d/0x280 net/socket.c:1204
sock_ioctl+0x42f/0x6a0 net/socket.c:1311
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl fs/ioctl.c:892 [inline]
__x64_sys_ioctl+0x190/0x200 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xcd/0x260 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

For historical and portability reasons, the netif_rx() is usually
run in the softirq or interrupt context, this commit therefore add
local_bh_disable/enable() protection in the usbnet_resume_rx().

Fixes: 43daa96b166c ("usbnet: Stop RX Q on MTU change")
Link: https://syzkaller.appspot.com/bug?id=81f55dfa587ee544baaaa5a359a060512228c1e1
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Link: https://patch.msgid.link/20251011070518.7095-1-qiang.zhang@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Octeontx2-af: Fix missing error code in cgx_probe()

When CGX fails mapping to NIX, set the error code to -ENODEV, currently
err is zero and that is treated as success path.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/aLAdlCg2_Yv7Y-3h@stanley.mountain/
Fixes: d280233fc866 ("Octeontx2-af: Fix NIX X2P calibration failures")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20251010204239.94237-1-harshit.m.mogalapalli@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

amd-xgbe: Avoid spurious link down messages during interface toggle

During interface toggle operations (ifdown/ifup), the driver currently
resets the local helper variable 'phy_link' to -1. This causes the link
state machine to incorrectly interpret the state as a link change event,
resulting in spurious "Link is down" messages being logged when the
interface is brought back up.

Preserve the phy_link state across interface toggles to avoid treating
the -1 sentinel value as a legitimate link state transition.

Fixes: 88131a812b16 ("amd-xgbe: Perform phy connect/disconnect at dev open/stop")
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Link: https://patch.msgid.link/20251010065142.1189310-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs

Commit d6ace46c82fd ("ext4: remove obsolete EXT3 config options")
removed the obsolete EXT3_CONFIG options, since it had been over a
decade since fs/ext3 had been removed.  Unfortunately, there were a
number of defconfigs that still used CONFIG_EXT3_FS which the cleanup
commit didn't fix up.  This led to a large number of defconfig test
builds to fail.  Oops.

Fixes: d6ace46c82fd ("ext4: remove obsolete EXT3 config options")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

net: phy: realtek: Avoid PHYCR2 access if PHYCR2 not present

The driver is currently checking for PHYCR2 register presence in
rtl8211f_config_init(), but it does so after accessing PHYCR2 to
disable EEE. This was introduced in commit bfc17c165835 ("net:
phy: realtek: disable PHY-mode EEE"). Move the PHYCR2 presence
test before the EEE disablement and simplify the code.

Fixes: bfc17c165835 ("net: phy: realtek: disable PHY-mode EEE")
Signed-off-by: Marek Vasut <marek.vasut@mailbox.org>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20251011110309.12664-1-marek.vasut@mailbox.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'intel-wired-lan-driver-updates-2025-10-01-idpf-ixgbe-ixgbevf'

Jacob Keller says:

====================
Intel Wired LAN Driver Updates 2025-10-01 (idpf, ixgbe, ixgbevf)

For idpf:
Milena fixes a memory leak in the idpf reset logic when the driver resets
with an outstanding Tx timestamp.

For ixgbe and ixgbevf:
Jedrzej fixes an issue with reporting link speed on E610 VFs.

Jedrzej also fixes the VF mailbox API incompatibilities caused by the
confusion with API v1.4, v1.5, and v1.6. The v1.4 API introduced IPSEC
offload, but this was only supported on Linux hosts. The v1.5 API
introduced a new mailbox API which is necessary to resolve issues on ESX
hosts. The v1.6 API introduced a new link management API for E610. Jedrzej
introduces a new v1.7 API with a feature negotiation which enables properly
checking if features such as IPSEC or the ESX mailbox APIs are supported.
This resolves issues with compatibility on different hosts, and aligns the
API across hosts instead of having Linux require custom mailbox API
versions for IPSEC offload.

Koichiro fixes a KASAN use-after-free bug in ixgbe_remove().
====================

Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-0-ef32a425b92a@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: fix too early devlink_free() in ixgbe_remove()

Since ixgbe_adapter is embedded in devlink, calling devlink_free()
prematurely in the ixgbe_remove() path can lead to UAF. Move devlink_free()
to the end.

KASAN report:

BUG: KASAN: use-after-free in ixgbe_reset_interrupt_capability+0x140/0x180 [ixgbe]
Read of size 8 at addr ffff0000adf813e0 by task bash/2095
CPU: 1 UID: 0 PID: 2095 Comm: bash Tainted: G S  6.17.0-rc2-tnguy.net-queue+ #1 PREEMPT(full)
[...]
Call trace:
  show_stack+0x30/0x90 (C)
  dump_stack_lvl+0x9c/0xd0
  print_address_description.constprop.0+0x90/0x310
  print_report+0x104/0x1f0
  kasan_report+0x88/0x180
  __asan_report_load8_noabort+0x20/0x30
  ixgbe_reset_interrupt_capability+0x140/0x180 [ixgbe]
  ixgbe_clear_interrupt_scheme+0xf8/0x130 [ixgbe]
  ixgbe_remove+0x2d0/0x8c0 [ixgbe]
  pci_device_remove+0xa0/0x220
  device_remove+0xb8/0x170
  device_release_driver_internal+0x318/0x490
  device_driver_detach+0x40/0x68
  unbind_store+0xec/0x118
  drv_attr_store+0x64/0xb8
  sysfs_kf_write+0xcc/0x138
  kernfs_fop_write_iter+0x294/0x440
  new_sync_write+0x1fc/0x588
  vfs_write+0x480/0x6a0
  ksys_write+0xf0/0x1e0
  __arm64_sys_write+0x70/0xc0
  invoke_syscall.constprop.0+0xcc/0x280
  el0_svc_common.constprop.0+0xa8/0x248
  do_el0_svc+0x44/0x68
  el0_svc+0x54/0x160
  el0t_64_sync_handler+0xa0/0xe8
  el0t_64_sync+0x1b0/0x1b8

Fixes: a0285236ab93 ("ixgbe: add initial devlink support")
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Tested-by: Rinitha S <sx.rinitha@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-6-ef32a425b92a@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: handle IXGBE_VF_FEATURES_NEGOTIATE mbox cmd

Send to VF information about features supported by the PF driver.

Increase API version to 1.7.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-5-ef32a425b92a@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbevf: fix mailbox API compatibility by negotiating supported features

There was backward compatibility in the terms of mailbox API. Various
drivers from various OSes supporting 10G adapters from Intel portfolio
could easily negotiate mailbox API.

This convention has been broken since introducing API 1.4.
Commit 0062e7cc955e ("ixgbevf: add VF IPsec offload code") added support
for IPSec which is specific only for the kernel ixgbe driver. None of the
rest of the Intel 10G PF/VF drivers supports it. And actually lack of
support was not included in the IPSec implementation - there were no such
code paths. No possibility to negotiate support for the feature was
introduced along with introduction of the feature itself.

Commit 339f28964147 ("ixgbevf: Add support for new mailbox communication
between PF and VF") increasing API version to 1.5 did the same - it
introduced code supported specifically by the PF ESX driver. It altered API
version for the VF driver in the same time not touching the version
defined for the PF ixgbe driver. It led to additional discrepancies,
as the code provided within API 1.6 cannot be supported for Linux ixgbe
driver as it causes crashes.

The issue was noticed some time ago and mitigated by Jake within the commit
d0725312adf5 ("ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5").
As a result we have regression for IPsec support and after increasing API
to version 1.6 ixgbevf driver stopped to support ESX MBX.

To fix this mess add new mailbox op asking PF driver about supported
features. Basing on a response determine whether to set support for IPSec
and ESX-specific enhanced mailbox.

New mailbox op, for compatibility purposes, must be added within new API
revision, as API version of OOT PF & VF drivers is already increased to
1.6 and doesn't incorporate features negotiate op.

Features negotiation mechanism gives possibility to be extended with new
features when needed in the future.

Reported-by: Jacob Keller <jacob.e.keller@intel.com>
Closes: https://lore.kernel.org/intel-wired-lan/20241101-jk-ixgbevf-mailbox-v1-5-fixes-v1-0-f556dc9a66ed@intel.com/
Fixes: 0062e7cc955e ("ixgbevf: add VF IPsec offload code")
Fixes: 339f28964147 ("ixgbevf: Add support for new mailbox communication between PF and VF")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-4-ef32a425b92a@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ixgbe: handle IXGBE_VF_GET_PF_LINK_STATE mailbox operation

Update supported API version and provide handler for
IXGBE_VF_GET_PF_LINK_STATE cmd.
Simply put stored values of link speed and link_up from adapter context.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Link: https://lore.kernel.org/stable/20250828095227.1857066-3-jedrzej.jagielski%40intel.com
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-3-ef32a425b92a@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>