git.ipfire.org Git - thirdparty/kernel/stable.git/log

wifi: mac80211: fix rx link assignment for non-MLO stations

Currently, ieee80211_rx_data_set_sta() does not correctly handle the
case where the interface supports multiple links (MLO), but the station
does not (non-MLO). This can lead to incorrect link assignment or
unexpected warnings when accessing link information.

Hence, add a fix to check if the station lacks valid link support and
use its default link ID for rx->link assignment. If the station
unexpectedly has valid links, fall back to the default link.

This ensures correct link association and prevents potential issues
in mixed MLO/non-MLO environments.

Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com>
Signed-off-by: Sarika Sharma <quic_sarishar@quicinc.com>
Link: https://patch.msgid.link/20250630084119.3583593-1-quic_sarishar@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

wifi: cfg80211: move away from using a fake platform device

Downloading regulatory "firmware" needs a device to hang off of, and so
a platform device seemed like the simplest way to do this. Now that we
have a faux device interface, use that instead as this "regulatory
device" is not anything resembling a platform device at all.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: <linux-wireless@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2025070116-growing-skeptic-494c@gregkh
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

Merge tag 'mt76-next-2025-07-07' of https://github.com/nbd168/wireless

Felix Fietkay says:
===================
mt76 patches for 6.17

- firmware recovery improvements for mt7915
- mlo improvements
- fixes
===================

Signed-off-by: Johannes Berg <johannes.berg@intel.com>

wifi: mt76: mt7921s: Introduce SDIO WiFi/BT combo module card reset

Add a hardware reset method to recover from the SDIO bus error that cannot
be resolved by the current WiFi/BT subsystem reset.

Signed-off-by: Leon Yen <leon.yen@mediatek.com>
Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com>
Link: https://patch.msgid.link/20250418093740.3814909-1-mingyen.hsieh@mediatek.com
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt792x: improve monitor interface handling

Enable IEEE80211_HW_NO_VIRTUAL_MONITOR to ensure the driver is notified of
all monitor interfaces and their channels.

Signed-off-by: Ming Yen Hsieh <mingyen.hsieh@mediatek.com>
Link: https://patch.msgid.link/20250625075611.1407687-1-mingyen.hsieh@mediatek.com
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: Get rid of dma_sync_single_for_device() for MMIO devices

Since the page_pool for MT76 MMIO devices are created with
PP_FLAG_DMA_SYNC_DEV flag, we do not need to sync_for_device each page
received from the pool since it is already done by the page_pool
codebase.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250625-mt76-sync-for-device-v1-1-e687e3278e1a@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Move num_sta accounting in mt7996_mac_sta_{add,remove}_links

Move phy num_sta accounting in mt7996_mac_sta_add() and
mt7996_mac_sta_remove() routines in order to take into account all
possibles MLO links.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-9-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()

Generalize mt7996_tx_check_aggr() and mt7996_txwi_free() routines to
introduce MLO support for MT7996 driver.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-8-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Fix valid_links bitmask in mt7996_mac_sta_{add,remove}

sta->valid_links bitmask can be set even for non-MLO client.

Fixes: dd82a9e02c054 ("wifi: mt76: mt7996: Rely on mt7996_sta_link in sta_add/sta_remove callbacks")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-7-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Fix possible OOB access in mt7996_tx()

Fis possible Out-Of-Boundary access in mt7996_tx routine if link_id is
set to IEEE80211_LINK_UNSPECIFIED

Fixes: 3ce8acb86b661 ("wifi: mt76: mt7996: Update mt7996_tx to MLO support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-6-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Fix mlink lookup in mt7996_tx_prepare_skb

Use proper link_id in mt7996_tx_prepare_skb routine for mlink lookup.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-5-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()

msta_link->wcid.sta is already set to 1 in mt7996_mac_sta_init_link
routine.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-4-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Rely on for_each_sta_active_link() in mt7996_mcu_sta_mld_setup_tlv()

Reuse for_each_sta_active_link utility macro in
mt7996_mcu_sta_mld_setup_tlv routine.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-3-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7996: Fix secondary link lookup in mt7996_mcu_sta_mld_setup_tlv()

Use proper link_id value for secondary link lookup in
mt7996_mcu_sta_mld_setup_tlv routine.

Fixes: 00cef41d9d8f5 ("wifi: mt76: mt7996: Add mt7996_mcu_sta_mld_setup_tlv() and mt7996_mcu_sta_eht_mld_tlv()")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-2-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: fix vif link allocation

Reuse the vif deflink for link_id = 0 in order to avoid confusion with
vif->bss_conf, which also gets a link id of 0.

Link: https://patch.msgid.link/20250704-mt7996-mlo-fixes-v1-1-356456c73f43@kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.16-rc5).

No conflicts.

No adjacent changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()

The ssid->ssids[] and sreq->ssids[] arrays have MT7925_RNR_SCAN_MAX_BSSIDS
elements so this >= needs to be > to prevent an out of bounds access.

Fixes: 8284815ca161 ("wifi: mt76: mt7925: add RNR scan support for 6GHz")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/aDVT2tPhG_8T0Qla@stanley.mountain
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7915: mcu: re-init MCU before loading FW patch

Restart the MCU and release the patch semaphore before loading the MCU
patch firmware from the host.

This fixes failures upon error recovery in case the semaphore was
previously taken and never released by the host.

This happens from time to time upon triggering a full-chip error
recovery. Under this circumstance, the hardware restart fails and the
radio is rendered inoperational.

Signed-off-by: David Bauer <mail@david-bauer.net>
Link: https://patch.msgid.link/20250402004528.1036715-3-mail@david-bauer.net
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7915: mcu: lower default timeout

The default timeout set in mt76_connac2_mcu_fill_message of 20 seconds
leads to excessive stalling in case messages are lost.

Testing showed that a smaller timeout of 5 seconds is sufficient in
normal operation.

Signed-off-by: David Bauer <mail@david-bauer.net>
Link: https://patch.msgid.link/20250402004528.1036715-1-mail@david-bauer.net
Signed-off-by: Felix Fietkau <nbd@nbd.name>

wifi: mt76: mt7915: mcu: increase eeprom command timeout

Increase the timeout for MCU_EXT_CMD_EFUSE_BUFFER_MODE command.

Regular retries upon hardware-recovery have been observed. Increasing
the timeout slightly remedies this problem.

Signed-off-by: David Bauer <mail@david-bauer.net>
Link: https://patch.msgid.link/20250402004528.1036715-2-mail@david-bauer.net
Signed-off-by: Felix Fietkau <nbd@nbd.name>

Merge tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.

  Current release - new code bugs:

    - eth:
       - txgbe: fix the issue of TX failure
       - ngbe: specify IRQ vector when the number of VFs is 7

  Previous releases - regressions:

    - sched: always pass notifications when child class becomes empty

    - ipv4: fix stat increase when udp early demux drops the packet

    - bluetooth: prevent unintended pause by checking if advertising is active

    - virtio: fix error reporting in virtqueue_resize

    - eth:
       - virtio-net:
          - ensure the received length does not exceed allocated size
          - fix the xsk frame's length check
       - lan78xx: fix WARN in __netif_napi_del_locked on disconnect

  Previous releases - always broken:

    - bluetooth: mesh: check instances prior disabling advertising

    - eth:
       - idpf: convert control queue mutex to a spinlock
       - dpaa2: fix xdp_rxq_info leak
       - amd-xgbe: align CL37 AN sequence as per databook"

* tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  vsock/vmci: Clear the vmci transport packet properly when initializing it
  dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example
  net: ngbe: specify IRQ vector when the number of VFs is 7
  net: wangxun: revert the adjustment of the IRQ vector sequence
  net: txgbe: request MISC IRQ in ndo_open
  virtio_net: Enforce minimum TX ring size for reliability
  virtio_net: Cleanup '2+MAX_SKB_FRAGS'
  virtio_ring: Fix error reporting in virtqueue_resize
  virtio-net: xsk: rx: fix the frame's length check
  virtio-net: use the check_mergeable_len helper
  virtio-net: remove redundant truesize check with PAGE_SIZE
  virtio-net: ensure the received length does not exceed allocated size
  net: ipv4: fix stat increase when udp early demux drops the packet
  net: libwx: fix the incorrect display of the queue number
  amd-xgbe: do not double read link status
  net/sched: Always pass notifications when child class becomes empty
  nui: Fix dma_mapping_error() check
  rose: fix dangling neighbour pointers in rose_rt_device_down()
  enic: fix incorrect MTU comparison in enic_change_mtu()
  amd-xgbe: align CL37 AN sequence as per databook
  ...

Merge tag 'xfs-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

- Fix umount hang with unflushable inodes (and add new tracepoint used
   for debugging this)

- Fix ABBA deadlock in xfs_reclaim_inode() vs xfs_ifree_cluster()

- Fix dquot buffer pin deadlock

* tag 'xfs-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask
  xfs: fix unmount hang with unflushable inodes stuck in the AIL
  xfs: factor out stale buffer item completion
  xfs: rearrange code in xfs_buf_item.c
  xfs: add tracepoints for stale pinned inode state debug
  xfs: avoid dquot buffer pin deadlock
  xfs: catch stale AGF/AGF metadata
  xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
  xfs: actually use the xfs_growfs_check_rtgeom tracepoint
  xfs: Improve error handling in xfs_mru_cache_create()
  xfs: move xfs_submit_zoned_bio a bit
  xfs: use xfs_readonly_buftarg in xfs_remount_rw
  xfs: remove NULL pointer checks in xfs_mru_cache_insert
  xfs: check for shutdown before going to sleep in xfs_select_zone

ipv6: Cleanup fib6_drop_pcpu_from()

Since commit 0e2338749192 ("ipv6: fix races in ip6_dst_destroy()"),
'table' is unused in __fib6_drop_pcpu_from(), no need pass it from
fib6_drop_pcpu_from().

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701041235.1333687-1-yuehaibing@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'another-ip-sysctl-docs-cleanup'

Bagas Sanjaya says:

====================
Another ip-sysctl docs cleanup

Inspired by Abdelrahman's cleanup [1]. This time, mostly formatting
conversion to bullet lists.

[1]: https://lore.kernel.org/linux-doc/20250624150923.40590-1-abdelrahmanfekry375@gmail.com/
====================

Link: https://patch.msgid.link/20250701031300.19088-1-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ip-sysctl: Add link to SCTP IPv4 scoping draft

addr_scope_policy description contains pointer to SCTP IPv4 scoping
draft but not its IETF Datatracker link. Add it.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701031300.19088-6-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ip-sysctl: Format SCTP-related memory parameters description as bullet list

The description for vector elements of SCTP-related memory usage
parameters (sctp{r,w,}mem) is formatted as normal paragraphs rather than
bullet list. Convert the description to the latter.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701031300.19088-5-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ip-sysctl: Format pf_{enable,expose} boolean lists as bullet lists

These lists' items were separated by newlines but without bullet list
marker. Turn the lists into proper bullet list.

While at it, also reword values description for pf_expose to not repeat
mentioning SCTP_PEER_ADDR_CHANGE and SCTP_GET_PEER_ADDR_INFO sockopt.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701031300.19088-4-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ip-sysctl: Format possible value range of ioam6_id{,_wide} as bullet list

Format possible value range bounds of ioam6_id and ioam6_id_wide as
bullet list instead of running paragraph.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701031300.19088-3-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ip-sysctl: Format Private VLAN proxy arp aliases as bullet list

Alias names list for private VLAN proxy arp technology is formatted as
indented paragraph instead. Make it bullet list as it is better fit for
this purpose.

Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701031300.19088-2-bagasdotme@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'ptp-provide-support-for-auxiliary-clocks-for-ptp_sys_offset_extended'

Thomas Gleixner says:

====================
ptp: Provide support for auxiliary clocks for PTP_SYS_OFFSET_EXTENDED

This is a follow up to the V1 series, which can be found here:

     https://lore.kernel.org/all/20250626124327.667087805@linutronix.de

to address the merge logistics problem, which I created myself.

Changes vs. V1:

    - Make patch 1, which provides the timestamping function temporarily
      define CLOCK_AUX* if undefined so that it can be merged independently,

    - Add a missing check for CONFIG_POSIX_AUX_CLOCK in the PTP IOCTL

    - Picked up tags

Merge logistics if agreed on:

    1) Patch #1 is applied to the tip tree on top of plain v6.16-rc1 and
       tagged

    2) That tag is merged into tip:timers/ptp and the temporary CLOCK_AUX
       define is removed in a subsequent commit

    3) Network folks merge the tag and apply patches #2 + #3

So the only fallout from this are the extra merges in both trees and the
cleanup commit in the tip tree. But that way there are no dependencies and
no duplicate commits with different SHAs.

Thoughts?

Due to the above constraints there is no branch offered to pull from right
now. Sorry for the inconveniance. Should have thought about that earlier.
====================

Link: https://patch.msgid.link/20250701130923.579834908@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ptp: Enable auxiliary clocks for PTP_SYS_OFFSET_EXTENDED

Allow ioctl(PTP_SYS_OFFSET_EXTENDED*) to select CLOCK_AUX clock ids for
generating the pre and post hardware readout timestamps.

Aside of adding these clocks to the clock ID validation, this also requires
to check the timestamp to be valid, i.e. the seconds value being greater
than or equal zero. This is necessary because AUX clocks can be
asynchronously enabled or disabled, so there is no way to validate the
availability upfront.

The same could have been achieved by handing the return value of
ktime_get_aux_ts64() all the way down to the IOCTL call site, but that'd
require to modify all existing ptp::gettimex64() callbacks and their inner
call chains. The timestamp check achieves the same with less churn and less
complicated code all over the place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250701132628.491315452@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ptp: Use ktime_get_clock_ts64() for timestamping

The inlined ptp_read_system_[pre|post]ts() switch cases expand to a copious
amount of text in drivers, e.g. ~500 bytes in e1000e. Adding auxiliary
clock support to the inlines would increase it further.

Replace the inline switch case with a call to ktime_get_clock_ts64(), which
reduces the code size in drivers and allows to access auxiliary clocks once
they are enabled in the IOCTL parameter filter.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20250701132628.426168092@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'ktime-get-clock-ts64-for-ptp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Base implementation for PTP with a temporary CLOCK_AUX* workaround to
allow integration of depending changes into the networking tree.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bonding: don't force LACPDU tx to ~333 ms boundaries

The timer which ensures that no more than 3 LACPDUs are transmitted in
a second rearms itself every 333ms regardless of whether an LACPDU is
transmitted when the timer expires. This causes LACPDU tx to be delayed
until the next expiration of the timer, which effectively aligns LACPDUs
to ~333ms boundaries. This results in a variable amount of jitter in the
timing of periodic LACPDUs.

Change this to only rearm the timer when an LACPDU is actually sent,
allowing tx at any point after the timer has expired.

Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Reviewed-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Link: https://patch.msgid.link/20250625-fix-lacpdu-jitter-v1-1-4d0ee627e1ba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

timekeeping: Provide ktime_get_clock_ts64()

PTP implements an inline switch case for taking timestamps from various
POSIX clock IDs, which already consumes quite some text space. Expanding it
for auxiliary clocks really becomes too big for inlining.

Provide a out of line version.

The function invalidates the timestamp in case the clock is invalid. The
invalidation allows to implement a validation check without the need to
propagate a return value through deep existing call chains.

Due to merge logistics this temporarily defines CLOCK_AUX[_LAST] if
undefined, so that the plain branch, which does not contain any of the core
timekeeper changes, can be pulled into the networking tree as prerequisite
for the PTP side changes. These temporary defines are removed after that
branch is merged into the tip::timers/ptp branch. That way the result in
-next or upstream in the next merge window has zero dependencies.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20250701132628.357686408@linutronix.de

vsock/vmci: Clear the vmci transport packet properly when initializing it

In vmci_transport_packet_init memset the vmci_transport_packet before
populating the fields to avoid any uninitialised data being left in the
structure.

Cc: Bryan Tan <bryan-bt.tan@broadcom.com>
Cc: Vishnu Dasa <vishnu.dasa@broadcom.com>
Cc: Broadcom internal kernel review list
Cc: Stefano Garzarella <sgarzare@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: virtualization@lists.linux.dev
Cc: netdev@vger.kernel.org
Cc: stable <stable@kernel.org>
Signed-off-by: HarshaVardhana S A <harshavardhana.sa@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250701122254.2397440-1-gregkh@linuxfoundation.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example

Examples should be complete and should not have a 'status' property,
especially a disabled one because this disables the dt_binding_check of
the example against the schema. Dropping 'status' property shows
missing other properties - phy-mode and phy-handle.

Fixes: 114508a89ddc ("dt-bindings: net: Add support for Sophgo SG2044 dwmac")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Reviewed-by: Chen Wang <unicorn_wang@outlook.com>
Link: https://patch.msgid.link/20250701063621.23808-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'fix-irq-vectors'

Jiawen Wu says:

====================
Fix IRQ vectors

The interrupt vector order was adjusted by [1]commit 937d46ecc5f9 ("net:
wangxun: add ethtool_ops for channel number") in Linux-6.8. Because at
that time, the MISC interrupt acts as the parent interrupt in the GPIO
IRQ chip. When the number of Rx/Tx ring changes, the last MISC
interrupt must be reallocated. Then the GPIO interrupt controller would
be corrupted. So the initial plan was to adjust the sequence of the
interrupt vectors, let MISC interrupt to be the first one and do not
free it.

Later, irq_domain was introduced in [2]commit aefd013624a1 ("net: txgbe:
use irq_domain for interrupt controller") to avoid this problem.
However, the vector sequence adjustment was not reverted. So there is
still one problem that has been left unresolved.

Due to hardware limitations of NGBE, queue IRQs can only be requested
on vector 0 to 7. When the number of queues is set to the maximum 8,
the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by
MISC interrupt, and althrough the vector 8 is used by queue interrupt,
it is unable to receive packets. This will cause some packets to be
dropped when RSS is enabled and they are assigned to queue 8.

This patch set fix the above problems.

[1] https://git.kernel.org/netdev/net-next/c/937d46ecc5f9
[2] https://git.kernel.org/netdev/net-next/c/aefd013624a1
====================

Link: https://patch.msgid.link/20250701063030.59340-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ngbe: specify IRQ vector when the number of VFs is 7

For NGBE devices, the queue number is limited to be 1 when SRIOV is
enabled. In this case, IRQ vector[0] is used for MISC and vector[1] is
used for queue, based on the previous patches. But for the hardware
design, the IRQ vector[1] must be allocated for use by the VF[6] when
the number of VFs is 7. So the IRQ vector[0] should be shared for PF
MISC and QUEUE interrupts.

+-----------+----------------------+
| Vector    | Assigned To          |
+-----------+----------------------+
| Vector 0  | PF MISC and QUEUE    |
| Vector 1  | VF 6                 |
| Vector 2  | VF 5                 |
| Vector 3  | VF 4                 |
| Vector 4  | VF 3                 |
| Vector 5  | VF 2                 |
| Vector 6  | VF 1                 |
| Vector 7  | VF 0                 |
+-----------+----------------------+

Minimize code modifications, only adjust the IRQ vector number for this
case.

Fixes: 877253d2cbf2 ("net: ngbe: add sriov function support")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20250701063030.59340-4-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: wangxun: revert the adjustment of the IRQ vector sequence

Due to hardware limitations of NGBE, queue IRQs can only be requested
on vector 0 to 7. When the number of queues is set to the maximum 8,
the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by
MISC interrupt, and althrough the vector 8 is used by queue interrupt,
it is unable to receive packets. This will cause some packets to be
dropped when RSS is enabled and they are assigned to queue 8.

So revert the adjustment of the MISC IRQ location, to make it be the
last one in IRQ vectors.

Fixes: 937d46ecc5f9 ("net: wangxun: add ethtool_ops for channel number")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Link: https://patch.msgid.link/20250701063030.59340-3-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: txgbe: request MISC IRQ in ndo_open

Move the creating of irq_domain for MISC IRQ from .probe to .ndo_open,
and free it in .ndo_stop, to maintain consistency with the queue IRQs.
This it for subsequent adjustments to the IRQ vectors.

Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250701063030.59340-2-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'virtio-fixes-for-tx-ring-sizing-and-resize-error-reporting'

Laurent Vivier says:

====================
virtio: Fixes for TX ring sizing and resize error reporting

This patch series contains two fixes and a cleanup for the virtio subsystem.

The first patch fixes an error reporting bug in virtio_ring's
virtqueue_resize() function. Previously, errors from internal resize
helpers could be masked if the subsequent re-enabling of the virtqueue
succeeded. This patch restores the correct error propagation, ensuring that
callers of virtqueue_resize() are properly informed of underlying resize
failures.

The second patch does a cleanup of the use of '2+MAX_SKB_FRAGS'

The third patch addresses a reliability issue in virtio_net where the TX
ring size could be configured too small, potentially leading to
persistently stopped queues and degraded performance. It enforces a
minimum TX ring size to ensure there's always enough space for at least one
maximally-fragmented packet plus an additional slot.
====================

Link: https://patch.msgid.link/20250521092236.661410-1-lvivier@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio_net: Enforce minimum TX ring size for reliability

The `tx_may_stop()` logic stops TX queues if free descriptors
(`sq->vq->num_free`) fall below the threshold of (`MAX_SKB_FRAGS` + 2).
If the total ring size (`ring_num`) is not strictly greater than this
value, queues can become persistently stopped or stop after minimal
use, severely degrading performance.

A single sk_buff transmission typically requires descriptors for:
- The virtio_net_hdr (1 descriptor)
- The sk_buff's linear data (head) (1 descriptor)
- Paged fragments (up to MAX_SKB_FRAGS descriptors)

This patch enforces that the TX ring size ('ring_num') must be strictly
greater than (MAX_SKB_FRAGS + 2). This ensures that the ring is
always large enough to hold at least one maximally-fragmented packet
plus at least one additional slot.

Reported-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-4-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio_net: Cleanup '2+MAX_SKB_FRAGS'

Improve consistency by using everywhere it is needed
'MAX_SKB_FRAGS + 2' rather than '2+MAX_SKB_FRAGS' or
'2 + MAX_SKB_FRAGS'.

No functional change.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-3-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio_ring: Fix error reporting in virtqueue_resize

The virtqueue_resize() function was not correctly propagating error codes
from its internal resize helper functions, specifically
virtqueue_resize_packet() and virtqueue_resize_split(). If these helpers
returned an error, but the subsequent call to virtqueue_enable_after_reset()
succeeded, the original error from the resize operation would be masked.
Consequently, virtqueue_resize() could incorrectly report success to its
caller despite an underlying resize failure.

This change restores the original code behavior:

       if (vdev->config->enable_vq_after_reset(_vq))
               return -EBUSY;

       return err;

Fix: commit ad48d53b5b3f ("virtio_ring: separate the logic of reset/enable from virtqueue_resize")
Cc: xuanzhuo@linux.alibaba.com
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250521092236.661410-2-lvivier@redhat.com
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio-net: xsk: rx: fix the frame's length check

When calling buf_to_xdp, the len argument is the frame data's length
without virtio header's length (vi->hdr_len). We check that len with

xsk_pool_get_rx_frame_size() + vi->hdr_len

to ensure the provided len does not larger than the allocated chunk
size. The additional vi->hdr_len is because in virtnet_add_recvbuf_xsk,
we use part of XDP_PACKET_HEADROOM for virtio header and ask the vhost
to start placing data from

hard_start + XDP_PACKET_HEADROOM - vi->hdr_len
not
hard_start + XDP_PACKET_HEADROOM

But the first buffer has virtio_header, so the maximum frame's length in
the first buffer can only be

xsk_pool_get_rx_frame_size()
not
xsk_pool_get_rx_frame_size() + vi->hdr_len

like in the current check.

This commit adds an additional argument to buf_to_xdp differentiate
between the first buffer and other ones to correctly calculate the maximum
frame's length.

Cc: stable@vger.kernel.org
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: a4e7ba702701 ("virtio_net: xsk: rx: support recv small mode")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250630151315.86722-2-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'virtio-net-fixes-for-mergeable-xdp-receive-path'

Bui Quang Minh says:

====================
virtio-net: fixes for mergeable XDP receive path

This series contains fixes for XDP receive path in virtio-net
- Patch 1: add a missing check for the received data length with our
allocated buffer size in mergeable mode.
- Patch 2: remove a redundant truesize check with PAGE_SIZE in mergeable
mode
- Patch 3: make the current repeated code use the check_mergeable_len to
check for received data length in mergeable mode
====================

Link: https://patch.msgid.link/20250630144212.48471-1-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio-net: use the check_mergeable_len helper

Replace the current repeated code to check received length in mergeable
mode with the new check_mergeable_len helper.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250630144212.48471-4-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio-net: remove redundant truesize check with PAGE_SIZE

The truesize is guaranteed not to exceed PAGE_SIZE in
get_mergeable_buf_len(). It is saved in mergeable context, which is not
changeable by the host side, so the check in receive path is quite
redundant.

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250630144212.48471-3-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

virtio-net: ensure the received length does not exceed allocated size

In xdp_linearize_page, when reading the following buffers from the ring,
we forget to check the received length with the true allocate size. This
can lead to an out-of-bound read. This commit adds that missing check.

Cc: <stable@vger.kernel.org>
Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250630144212.48471-2-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: ipv6: Fix spelling mistake

change 'Maximium' to 'Maximum'

Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Link: https://patch.msgid.link/20250702055820.112190-1-zhaochenguang@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'support-rate-management-on-traffic-classes-in-devlink-and-mlx5'

Mark Bloch says:

====================
Support rate management on traffic classes in devlink and mlx5

This patch series extends the devlink-rate API to support traffic class
(TC) bandwidth management, enabling more granular control over traffic
shaping and rate limiting across multiple TCs. The API now allows users
to specify bandwidth proportions for different traffic classes in a
single command. This is particularly useful for managing Enhanced
Transmission Selection (ETS) for groups of Virtual Functions (VFs),
allowing precise bandwidth allocation across traffic classes.

Additionally the series refines the QoS handling in net/mlx5 to support
TC arbitration and bandwidth management on vports and rate nodes.

Discussions on traffic class shaping in net-shapers began in V5 [1],
where we discussed with maintainers whether net-shapers should support
traffic classes and how this could be implemented.

Later, after further conversations with Paolo Abeni and Simon Horman,
Cosmin provided an update [2], confirming that net-shapers' tree-based
hierarchy aligns well with traffic classes when treated as distinct
subsets of netdev queues. Since mlx5 enforces a 1:1 mapping between TX
queues and traffic classes, this approach seems feasible, though some
open questions remain regarding queue reconfiguration and certain mlx5
scheduling behaviors.

Building on that discussion, Cosmin has now shared a concrete
implementation plan on the netdev mailing list [3]. The plan, developed
in collaboration with Paolo and Simon, outlines how net-shapers can be
extended to support the same use cases currently covered by
devlink-rate, with the eventual goal of aligning both and simplifying
the shaping infrastructure in the kernel.

This work was presented at Netdev 0x19 in Zagreb [4].
There we presented how TC scheduling is enforced in mlx5 hardware,
which led to discussions on the mailing list.

A summary of how things work:

Classification means labeling a packet with a traffic class based on
the packet's DSCP or VLAN PCP field, then treating packets with
different traffic classes differently during transmit processing.

In a virtualized setup, VFs are untrusted and do not control
classification or shaping. Classification is done by the hardware using
a prio-to-TC mapping set by the hypervisor. VFs only select which send
queue to use and are expected to respect the classification logic by
sending each traffic class on its dedicated queue. As stated in the
net-shapers plan [3], each transmit queue should carry only a single
traffic class. Mixing classes in a single queue can lead to HOL
blocking.

In the mlx5 implementation, if the queue used does not match the
classified traffic class, the hardware moves the queue to the correct
TC scheduler. This movement is not a reclassification; it’s a necessary
enforcement step to ensure traffic class isolation is maintained.

Extend devlink-rate API to support rate management on TCs:
- devlink: Extend the devlink rate API to support traffic class
bandwidth management

Introduce a no-op implementation:
- net/mlx5: Add no-op implementation for setting tc-bw on rate objects

Add support for enabling and disabling TC QoS on vports and nodes:
- net/mlx5: Add support for setting tc-bw on nodes
- net/mlx5: Add traffic class scheduling support for vport QoS

Support for setting tc-bw on rate objects:
- net/mlx5: Manage TC arbiter nodes and implement full support for
tc-bw

[1]
https://lore.kernel.org/netdev/20241204220931.254964-1-tariqt@nvidia.com/
[2]
https://lore.kernel.org/netdev/67df1a562614b553dcab043f347a0d7c5393ff83.camel@nvidia.com/
[3]
https://lore.kernel.org/netdev/d9831d0c940a7b77419abe7c7330e822bbfd1cfb.camel@nvidia.com/T/
[4]
https://netdevconf.info/0x19/sessions/talk/optimizing-bandwidth-allocation-with-ets-and-traffic-classes.html
====================

Link: https://patch.msgid.link/20250629142138.361537-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: Add test for devlink-rate traffic class bandwidth distribution

This test suite validates the functionality of the devlink-rate API for
traffic class (TC) bandwidth allocation. It ensures that bandwidth can
be distributed between different traffic classes as configured, and
verifies that explicit TC-to-queue mapping is required for the
allocation to be effective.

The first test (test_no_tc_mapping_bandwidth) is marked as expected
failure on mlx5, since the hardware automatically enforces traffic
class separation by dynamically moving queues to the correct TC
scheduler, even without explicit TC-to-queue mapping configuration.

Test output on mlx5:
1..2
# Created VF interface: eth5
# Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2
# Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10
# Set representor eth4 up and added to bridge
# Bandwidth check results without TC mapping:
# TC 3: 0.19 Gbits/sec
# TC 4: 0.76 Gbits/sec
# Total bandwidth: 0.95 Gbits/sec
# TC 3 percentage: 20.0%
# TC 4 percentage: 80.0%
ok 1 devlink_rate_tc_bw.test_no_tc_mapping_bandwidth # XFAIL Bandwidth matched 80/20 split without TC mapping
# Created VF interface: eth5
# Created VLAN eth5.101 on eth5 with tc 3 and IP 198.51.100.2
# Created VLAN eth5.102 on eth5 with tc 4 and IP 198.51.100.10
# Set representor eth4 up and added to bridge
# Bandwidth check results with TC mapping:
# TC 3: 0.21 Gbits/sec
# TC 4: 0.78 Gbits/sec
# Total bandwidth: 0.98 Gbits/sec
# TC 3 percentage: 21.1%
# TC 4 percentage: 78.9%
# Bandwidth is distributed as 80/20 with TC mapping
ok 2 devlink_rate_tc_bw.test_tc_mapping_bandwidth
# Totals: pass:1 fail:0 xfail:1 xpass:0 skip:0 error:0

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-9-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw

Introduce support for managing Traffic Class (TC) arbiter nodes and
associated vports TC nodes within the E-Switch QoS hierarchy. This
patch adds support for the new scheduling node type,
`SCHED_NODE_TYPE_VPORTS_TC_TSAR`, and implements full support for
setting tc-bw on both vports and nodes.

Key changes include:

- Introduced the new scheduling node type,
  `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, for managing vports within the TC
  arbiter node.

- New helper functions for creating and destroying vports TC nodes
  under the TC arbiter.

- Updated the minimum rate normalization function to skip nodes of type
  `SCHED_NODE_TYPE_VPORTS_TC_TSAR`. Vports TC TSARs have bandwidth
  shares configured on them but not minimum rates, so their `min_rate`
  cannot be normalized.

- Implementation of `esw_qos_tc_arbiter_scheduling_setup()` and
  `esw_qos_tc_arbiter_scheduling_teardown()` for initializing and
  cleaning up TC arbiter scheduling elements. These functions now fully
  support tc-bw configuration on TC arbiter nodes.

- Introduced a new helper `esw_qos_calculate_tc_bw_divider()` to
  compute the total TC bandwidth share, which is used as a divider for
  normalizing each TC's share.

- Added `esw_qos_tc_arbiter_get_bw_shares()` and
  `esw_qos_set_tc_arbiter_bw_shares()` to handle the settings of
  bandwidth shares for vports traffic class TSARs.

- `esw_qos_set_tc_arbiter_bw_shares()` normalizes  each TC share based
  on the total and the firmware's maximum allowed TSAR bandwidth share.

- Refactored `mlx5_esw_devlink_rate_node_tc_bw_set()` and
  `mlx5_esw_devlink_rate_leaf_tc_bw_set()` to fully support configuring
  tc-bw on devlink rate nodes and vports, respectively.

- Refactored `mlx5_esw_qos_node_update_parent()` to ensure that tc-bw
  configuration remains compatible with setting a parent on a rate
  node, preserving level hierarchy functionality.

- Refactored `esw_qos_calc_bw_share()` to generalize its input so it
  can be used for both minimum rate and bandwidth share calculations.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-8-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add traffic class scheduling support for vport QoS

Introduce support for traffic class (TC) scheduling on vports by
allowing the vport to own multiple TC scheduling nodes. This patch
enables more granular control of QoS by defining three distinct QoS
states for vports, each providing unique scheduling behavior:

1. Regular QoS: The `sched_node` represents the vport directly,
   handling QoS as a single scheduling entity.
2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling
   TC scheduling directly on the vport.
3. TC QoS on the parent node: The `sched_node` functions as a rate
   limiter, with TC arbitration enabled at the parent level, associating
   multiple scheduling nodes with each vport.

Key changes include:

- Added support for new scheduling elements, vport traffic class and
  rate limiter.

- New helper functions for creating, destroying, and restoring vport TC
  scheduling nodes, handling transitions between regular QoS and TC
  arbitration states.

- Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to
  support both regular QoS and TC arbitration states, ensuring consistent
  transitions between scheduling modes.

- Introduced a `sched_nodes` array under `vport->qos` to store multiple
  TC scheduling nodes per vport, enabling finer control over per-TC QoS.

- Enhanced `esw_qos_vport_update_parent()` to handle transitions between
  the three QoS states based on the current and new parent node types.

This patch lays the groundwork for future support for configuring tc-bw
on vports. Although the infrastructure is in place, full support for
tc-bw is not yet implemented; attempts to set tc-bw on vports will
return `-EOPNOTSUPP`.

No functional changes are introduced at this stage.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add support for setting tc-bw on nodes

Introduce support for enabling and disabling Traffic Class (TC)
arbitration for existing devlink rate nodes. This patch adds support
for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`.

Key changes include:

- New helper functions for transitioning existing rate nodes to TC
  arbiter nodes and vice versa. These functions handle the allocation
  of TC arbiter nodes, copying of child nodes, and restoring vport QoS
  settings when TC arbitration is disabled.

- Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage
  tc-bw configuration on nodes.

- Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and
  `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in
  future patches to provide full support for tc-bw on devlink rate
  objects.

- Validation functions for tc-bw settings, allowing graceful handling
  of unsupported traffic class bandwidth configurations.

- Updated `__esw_qos_alloc_node()` to insert the new node into the
  parent’s children list only if the parent is not NULL. For the root
  TSAR, the new node is inserted directly after the allocation call.

- Don't allow `tc-bw` configuration for nodes containing non-leaf
  children.

This patch lays the groundwork for future support for configuring tc-bw
on devlink rate nodes. Although the infrastructure is in place, full
support for tc-bw is not yet implemented; attempts to set tc-bw on
nodes will return `-EOPNOTSUPP`.

No functional changes are introduced at this stage.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Add no-op implementation for setting tc-bw on rate objects

Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and
`mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic.

Future patches will add support for setting traffic class bandwidth
on rate objects.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: netdevsim: Add devlink rate tc-bw test

Test verifies that netdevsim correctly implements devlink ops callbacks
that set tc-bw on leaf or node rate object.

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

devlink: Extend devlink rate API with traffic classes bandwidth management

Introduce support for specifying relative bandwidth shares between
traffic classes (TC) in the devlink-rate API. This new option allows
users to allocate bandwidth across multiple traffic classes in a
single command.

This feature provides a more granular control over traffic management,
especially for scenarios requiring Enhanced Transmission Selection.

Users can now define a relative bandwidth share for each traffic class.
For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5
(RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the
total bandwidth. The actual percentage each class receives depends on
the ratio of its share value to the sum of all shares.

Example:
DEV=pci/0000:08:00.0

$ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \
  tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0

$ devlink port function rate set $DEV/vfs_group \
  tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0

Example usage with ynl:

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-set --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1,
  "rate-tc-bws": [
    {"rate-tc-index": 0, "rate-tc-bw": 50},
    {"rate-tc-index": 1, "rate-tc-bw": 50},
    {"rate-tc-index": 2, "rate-tc-bw": 0},
    {"rate-tc-index": 3, "rate-tc-bw": 0},
    {"rate-tc-index": 4, "rate-tc-bw": 0},
    {"rate-tc-index": 5, "rate-tc-bw": 0},
    {"rate-tc-index": 6, "rate-tc-bw": 0},
    {"rate-tc-index": 7, "rate-tc-bw": 0}
  ]
}'

./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \
  --do rate-get --json '{
  "bus-name": "pci",
  "dev-name": "0000:08:00.0",
  "port-index": 1
}'

output for rate-get:
{'bus-name': 'pci',
'dev-name': '0000:08:00.0',
'port-index': 1,
'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0},
                 {'rate-tc-bw': 50, 'rate-tc-index': 1},
                 {'rate-tc-bw': 0, 'rate-tc-index': 2},
                 {'rate-tc-bw': 0, 'rate-tc-index': 3},
                 {'rate-tc-bw': 0, 'rate-tc-index': 4},
                 {'rate-tc-bw': 0, 'rate-tc-index': 5},
                 {'rate-tc-bw': 0, 'rate-tc-index': 6},
                 {'rate-tc-bw': 0, 'rate-tc-index': 7}],
'rate-tx-max': 0,
'rate-tx-priority': 0,
'rate-tx-share': 0,
'rate-tx-weight': 0,
'rate-type': 'leaf'}

Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: introduce type-checking attribute iteration for nlmsg

Add the nlmsg_for_each_attr_type() macro to simplify iteration over
attributes of a specific type in a Netlink message.

Convert existing users in vxlan and nfsd to use the new macro.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vhost-net: reduce one userspace copy when building XDP buff

We used to do twice copy_from_iter() to copy virtio-net and packet
separately. This introduce overheads for userspace access hardening as
well as SMAP (for x86 it's stac/clac). So this patch tries to use one
copy_from_iter() to copy them once and move the virtio-net header
afterwards to reduce overheads.

Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tun: remove unnecessary tun_xdp_hdr structure

With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"),
buffer length could be stored as frame size so there's no need to have
a dedicated tun_xdp_hdr structure. We can simply store virtio net
header instead.

Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'preserve-msg_zerocopy-with-forwarding'

Willem de Bruijn says:

====================
preserve MSG_ZEROCOPY with forwarding

Avoid false positive copying of zerocopy skb frags when entering the
ingress path if the skb is not queued locally but forwarded.

Patch 1 for more details and feature.

Patch 2 converts the existing selftest to a pass/fail test and adds
coverage for this new feature.
====================

Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftest: net: extend msg_zerocopy test with forwarding

Zerocopy skbs are converted to regular copy skbs when data is queued
to a local socket. This happens in the existing test with a sender and
receiver communicating over a veth device.

Zerocopy skbs are sent without copying if egressing a device. Verify
that this behavior is maintained even in the common container setup
where data is forwarded over a veth to the physical device.

Update msg_zerocopy.sh to

1. Have a dummy network device to simulate a physical device.
2. Have forwarding enabled between veth and dummy.
3. Add a tx-only test that sends out dummy via the forwarding path.
4. Verify the exitcode of the sender, which signals zerocopy success.

As dummy drops all packets, this cannot be a TCP connection. Test
the new case with unconnected UDP only.

Update msg_zerocopy.c to
- Accept an argument whether send with zerocopy is expected.
- Return an exitcode whether behavior matched that expectation.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: preserve MSG_ZEROCOPY with forwarding

MSG_ZEROCOPY data must be copied before data is queued to local
sockets, to avoid indefinite timeout until memory release.

This test is performed by skb_orphan_frags_rx, which is called when
looping an egress skb to packet sockets, error queue or ingress path.

To preserve zerocopy for skbs that are looped to ingress but are then
forwarded to an egress device rather than delivered locally, defer
this last check until an skb enters the local IP receive path.

This is analogous to existing behavior of skb_clear_delivery_time.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'vsock-test-check-for-null-ptr-deref-when-transport-changes'

Luigi Leonardi says:

====================
vsock/test: check for null-ptr-deref when transport changes

This series introduces a new test that checks for a null pointer
dereference that may happen when there is a transport change[1]. This
bug was fixed in [2].

Note that this test *cannot* fail, it hangs if it triggers a kernel
oops. The intended use-case is to run it and then check if there is any
oops in the dmesg.

This test is based on Hyunwoo Kim's[3] and Michal's python
reproducers[4].

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/
[2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/
[3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t
[4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/

v4: https://lore.kernel.org/20250624-test_vsock-v4-1-087c9c8e25a2@redhat.com
v3: https://lore.kernel.org/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com
v2: https://lore.kernel.org/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com
v1: https://lore.kernel.org/20250306-test_vsock-v1-0-0320b5accf92@redhat.com
====================

Link: https://patch.msgid.link/20250630-test_vsock-v5-0-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: Add test for null ptr deref when transport changes

Add a new test to ensure that when the transport changes a null pointer
dereference does not occur. The bug was reported upstream [1] and fixed
with commit 2cb7c756f605 ("vsock/virtio: discard packets if the
transport changes").

KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067]
CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted
Workqueue: vsock-loopback vsock_loopback_work
RIP: 0010:vsock_stream_has_data+0x44/0x70
Call Trace:
virtio_transport_do_close+0x68/0x1a0
virtio_transport_recv_pkt+0x1045/0x2ae4
vsock_loopback_work+0x27d/0x3f0
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x35a/0x700
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30

Note that this test may not fail in a kernel without the fix, but it may
hang on the client side if it triggers a kernel oops.

This works by creating a socket, trying to connect to a server, and then
executing a second connect operation on the same socket but to a
different CID (0). This triggers a transport change. If the connect
operation is interrupted by a signal, this could cause a null-ptr-deref.

Since this bug is non-deterministic, we need to try several times. It
is reasonable to assume that the bug will show up within the timeout
period.

If there is a G2H transport loaded in the system, the bug is not
triggered and this test will always pass. This is because
`vsock_assign_transport`, when using CID 0, like in this case, sets
vsk->transport to `transport_g2h` that is not NULL if a G2H transport is
available.

[1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/

Suggested-by: Hyunwoo Kim <v4bel@theori.io>
Suggested-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

vsock/test: Add macros to identify transports

Add three new macros: TRANSPORTS_G2H, TRANSPORTS_H2G and
TRANSPORTS_LOCAL.
They can be used to identify the type of the transport(s) loaded when
using the `get_transports()` function.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Luigi Leonardi <leonardi@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250630-test_vsock-v5-1-2492e141e80b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: Convert socfpga-dwmac bindings to yaml

Convert the bindings for socfpga-dwmac to yaml. Since the original
text contained descriptions for two separate nodes, two separate
yaml files were created.

Signed-off-by: Mun Yew Tham <mun.yew.tham@altera.com>
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250630213748.71919-1-matthew.gerlach@altera.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-07-01 (idpf, igc)

For idpf:
Michal returns 0 for key size when RSS is not supported.

Ahmed changes control queue to a spinlock due to sleeping calls.

For igc:
Vitaly disables L1.2 PCI-E link substate on I226 devices to resolve
performance issues.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igc: disable L1.2 PCI-E link substate to avoid performance issue
  idpf: convert control queue mutex to a spinlock
  idpf: return 0 size for RSS key if not supported
====================

Link: https://patch.msgid.link/20250701164317.2983952-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

amd-xgbe: add support for giant packet size

AMD XGBE hardware supports giant Ethernet frames up to 16K bytes.
Add support for configuring and enabling giant packet handling
in the driver.

- Define new register fields and macros for giant packet support.
- Update the jumbo frame configuration logic to enable giant
packet mode when MTU exceeds the jumbo threshold.

Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: fix stat increase when udp early demux drops the packet

udp_v4_early_demux now returns drop reasons as it either returns 0 or
ip_mc_validate_source, which returns itself a drop reason. However its
use was not converted in ip_rcv_finish_core and the drop reason is
ignored, leading to potentially skipping increasing LINUX_MIB_IPRPFILTER
if the drop reason is SKB_DROP_REASON_IP_RPFILTER.

This is a fix and we're not converting udp_v4_early_demux to explicitly
return a drop reason to ease backports; this can be done as a follow-up.

Fixes: d46f827016d8 ("net: ip: make ip_mc_validate_source() return drop reason")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250701074935.144134-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ifb: support BIG TCP packets

Set the driver limit to GSO_MAX_SIZE (512 KB).

This allows the admin/user to set a GSO limit up to this value, to
avoid segmenting too large GRO packets in the netem -> ifb path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250701084540.459261-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: libwx: fix the incorrect display of the queue number

When setting "ethtool -L eth0 combined 1", the number of RX/TX queue is
changed to be 1. RSS is disabled at this moment, and the indices of FDIR
have not be changed in wx_set_rss_queues(). So the combined count still
shows the previous value. This issue was introduced when supporting
FDIR. Fix it for those devices that support FDIR.

Fixes: 34744a7749b3 ("net: txgbe: add FDIR info to ethtool ops")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/A5C8FE56D6C04608+20250701070625.73680-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

amd-xgbe: do not double read link status

The link status is latched low so that momentary link drops
can be detected. Always double-reading the status defeats this
design feature. Only double read if link was already down

This prevents unnecessary duplicate readings of the link status.

Fixes: 4f3b20bfbb75 ("amd-xgbe: add support for rx-adaptation")
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250701065016.4140707-1-Raju.Rangoju@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: Always pass notifications when child class becomes empty

Certain classful qdiscs may invoke their classes' dequeue handler on an
enqueue operation. This may unexpectedly empty the child qdisc and thus
make an in-flight class passive via qlen_notify(). Most qdiscs do not
expect such behaviour at this point in time and may re-activate the
class eventually anyways which will lead to a use-after-free.

The referenced fix commit attempted to fix this behavior for the HFSC
case by moving the backlog accounting around, though this turned out to
be incomplete since the parent's parent may run into the issue too.
The following reproducer demonstrates this use-after-free:

    tc qdisc add dev lo root handle 1: drr
    tc filter add dev lo parent 1: basic classid 1:1
    tc class add dev lo parent 1: classid 1:1 drr
    tc qdisc add dev lo parent 1:1 handle 2: hfsc def 1
    tc class add dev lo parent 2: classid 2:1 hfsc rt m1 8 d 1 m2 0
    tc qdisc add dev lo parent 2:1 handle 3: netem
    tc qdisc add dev lo parent 3:1 handle 4: blackhole

    echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888
    tc class delete dev lo classid 1:1
    echo 1 | socat -u STDIN UDP4-DATAGRAM:127.0.0.1:8888

Since backlog accounting issues leading to a use-after-frees on stale
class pointers is a recurring pattern at this point, this patch takes
a different approach. Instead of trying to fix the accounting, the patch
ensures that qdisc_tree_reduce_backlog always calls qlen_notify when
the child qdisc is empty. This solves the problem because deletion of
qdiscs always involves a call to qdisc_reset() and / or
qdisc_purge_queue() which ultimately resets its qlen to 0 thus causing
the following qdisc_tree_reduce_backlog() to report to the parent. Note
that this may call qlen_notify on passive classes multiple times. This
is not a problem after the recent patch series that made all the
classful qdiscs qlen_notify() handlers idempotent.

Fixes: 3f981138109f ("sch_hfsc: Fix qlen accounting bug when using peek in hfsc_enqueue()")
Signed-off-by: Lion Ackermann <nnamrec@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/d912cbd7-193b-4269-9857-525bee8bbb6a@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-add-data-race-annotations-around-dst-fields'

Eric Dumazet says:

====================
net: add data-race annotations around dst fields

Add annotations around various dst fields, which can change under us.

Add four helpers to prepare better dst->dev protection,
and start using them. More to come later.

v1: https://lore.kernel.org/20250627112526.3615031-1-edumazet@google.com
====================

Link: https://patch.msgid.link/20250630121934.3399505-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: ip6_mc_input() and ip6_mr_input() cleanups

Both functions are always called under RCU.

We remove the extra rcu_read_lock()/rcu_read_unlock().

We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net().

We use dev_net_rcu() instead of dev_net().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpers

Use the new helpers as a step to deal with potential dst->dev races.

v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: adopt dst_dev() helper

Use the new helper as a step to deal with potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]

Use the new helpers as a first step to deal with
potential dst->dev races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: add four helpers to annotate data-races around dst->dev

dst->dev is read locklessly in many contexts,
and written in dst_dev_put().

Fixing all the races is going to need many changes.

We probably will have to add full RCU protection.

Add three helpers to ease this painful process.

static inline struct net_device *dst_dev(const struct dst_entry *dst)
{
       return READ_ONCE(dst->dev);
}

static inline struct net_device *skb_dst_dev(const struct sk_buff *skb)
{
       return dst_dev(skb_dst(skb));
}

static inline struct net *skb_dst_dev_net(const struct sk_buff *skb)
{
       return dev_net(skb_dst_dev(skb));
}

static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb)
{
       return dev_net_rcu(skb_dst_dev(skb));
}

Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: annotate data-races around dst->output

dst_dev_put() can overwrite dst->output while other
cpus might read this field (for instance from dst_output())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need RCU protection in the future.

Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: annotate data-races around dst->input

dst_dev_put() can overwrite dst->input while other
cpus might read this field (for instance from dst_input())

Add READ_ONCE()/WRITE_ONCE() annotations to suppress
potential issues.

We will likely need full RCU protection later.

Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: annotate data-races around dst->lastuse

(dst_entry)->lastuse is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: annotate data-races around dst->expires

(dst_entry)->expires is read and written locklessly,
add corresponding annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dst: annotate data-races around dst->obsolete

(dst_entry)->obsolete is read locklessly, add corresponding
annotations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-introduce-net_aligned_data'

Eric Dumazet says:

====================
net: introduce net_aligned_data

____cacheline_aligned_in_smp on small fields like
tcp_memory_allocated and udp_memory_allocated is not good enough.

It makes sure to put these fields at the start of a cache line,
but does not prevent the linker from using the cache line for other
fields, with potential performance impact.

nm -v vmlinux|egrep -5 "tcp_memory_allocated|udp_memory_allocated"

...
ffffffff83e35cc0 B tcp_memory_allocated
ffffffff83e35cc8 b __key.0
ffffffff83e35cc8 b __tcp_tx_delay_enabled.1
ffffffff83e35ce0 b tcp_orphan_timer
...
ffffffff849dddc0 B udp_memory_allocated
ffffffff849dddc8 B udp_encap_needed_key
ffffffff849dddd8 B udpv6_encap_needed_key
ffffffff849dddf0 b inetsw_lock

One solution is to move these sensitive fields to a structure,
so that the compiler is forced to add empty holes between each member.

nm -v vmlinux|egrep -2 "tcp_memory_allocated|udp_memory_allocated|net_aligned_data"

ffffffff885af970 b mem_id_init
ffffffff885af980 b __key.0
ffffffff885af9c0 B net_aligned_data
ffffffff885afa80 B page_pool_mem_providers
ffffffff885afa90 b __key.2

v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/
====================

Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp: move udp_memory_allocated into net_aligned_data

____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp_memory_allocated into net_aligned_data

____cacheline_aligned_in_smp attribute only makes sure to align
a field to a cache line. It does not prevent the linker to use
the remaining of the cache line for other variables, causing
potential false sharing.

Move tcp_memory_allocated into a dedicated cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move net_cookie into net_aligned_data

Using per-cpu data for net->net_cookie generation is overkill,
because even busy hosts do not create hundreds of netns per second.

Make sure to put net_cookie in a private cache line to avoid
potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: add struct net_aligned_data

This structure will hold networking data that must
consume a full cache line to avoid accidental false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: thunderbolt: Enable end-to-end flow control also in transmit

According to USB4 specification, if E2E flow control is disabled for
the Transmit Descriptor Ring, the Host Interface Adapter Layer shall
not require any credits to be available before transmitting a Tunneled
Packet from this Transmit Descriptor Ring, so e2e flow control should
be enabled in both directions.

Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://lore.kernel.org/20250624153805.GC2824380@black.fi.intel.com
Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com>
Link: https://patch.msgid.link/20250628093813.647005-1-zhangjianrong5@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: thunderbolt: Fix the parameter passing of tb_xdomain_enable_paths()/tb_xdomain_disable_paths()

According to the description of tb_xdomain_enable_paths(), the third
parameter represents the transmit ring and the fifth parameter represents
the receive ring. tb_xdomain_disable_paths() is the same case.

[Jakub] Mika says: it works now because both rings ->hop is the same

Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Link: https://lore.kernel.org/20250625051149.GD2824380@black.fi.intel.com
Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com>
Link: https://patch.msgid.link/20250628094920.656658-1-zhangjianrong5@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mmc-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
"MMC core:
   - Apply BROKEN_SD_DISCARD quirk earlier during init
   - Silence some confusing error messages for SD UHS-II cards

  MMC host:
   - mtk-sd:
       - Prevent memory corruption from DMA map failure
       - Fix a pagefault in dma_unmap_sg() for not prepared data
   - sdhci: Revert "Disable SD card clock before changing parameters"
   - sdhci-of-k1: Fix error code in probe()
   - sdhci-uhs2: Silence some confusing error messages for SD UHS-II cards"

* tag 'mmc-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mtk-sd: reset host->mrq on prepare_data() error
  Revert "mmc: sdhci: Disable SD card clock before changing parameters"
  mmc: sdhci-uhs2: Adjust some error messages and register dump for SD UHS-II card
  mmc: sdhci: Add a helper function for dump register in dynamic debug mode
  mmc: core: Adjust some error messages for SD UHS-II cards
  mtk-sd: Prevent memory corruption from DMA map failure
  mtk-sd: Fix a pagefault in dma_unmap_sg() for not prepared data
  mmc: sdhci-of-k1: Fix error code in probe()
  mmc: core: sd: Apply BROKEN_SD_DISCARD quirk earlier

Merge tag 's390-6.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Alexander Gordeev:

- Fix PCI error recovery and bring it in line with AER/EEH

* tag 's390-6.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pci: Allow automatic recovery with minimal driver support
  s390/pci: Do not try re-enabling load/store if device is disabled
  s390/pci: Fix stale function handles in error handling

Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd

Pull iommufd fixes from Jason Gunthorpe:
"Some changes to the userspace selftest framework cause the iommufd
  tests to start failing. This turned out to be bugs in the iommufd side
  that were just getting uncovered.

   - Deal with MAP_HUGETLB mmaping more than requested even when in
     MAP_FIXED mode

   - Fixup missing error flow cleanup in the test

   - Check that the memory allocations suceeded

   - Suppress some bogus gcc 'may be used uninitialized' warnings"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
  iommufd/selftest: Fix build warnings due to uninitialized mfd
  iommufd/selftest: Add asserts testing global mfd
  iommufd/selftest: Add missing close(mfd) in memfd_mmap()
  iommufd/selftest: Fix iommufd_dirty_tracking with large hugepage sizes

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
"Several mlx5 bugs, crashers, and reports:

   - Limit stack usage

   - Fix mis-use of __xa_store/erase() without holding the lock to a
     locked version

   - Rate limit prints in the gid cache error cases

   - Fully initialize the event object before making it globally visible
     in an xarray

   - Fix deadlock inside the ODP code if the MMU notifier was called
     from a reclaim context

   - Include missed counters for some switchdev configurations and
     mulit-port MPV mode

   - Fix loopback packet support when in mulit-port MPV mode"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/mlx5: Fix vport loopback for MPV device
  RDMA/mlx5: Fix CC counters query for MPV
  RDMA/mlx5: Fix HW counters query for non-representor devices
  IB/core: Annotate umem_mutex acquisition under fs_reclaim for lockdep
  IB/mlx5: Fix potential deadlock in MR deregistration
  RDMA/mlx5: Initialize obj_event->obj_sub_list before xa_insert
  RDMA/core: Rate limit GID cache warning messages
  RDMA/mlx5: Fix unsafe xarray access in implicit ODP handling
  RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup

nui: Fix dma_mapping_error() check

dma_map_XXX() functions return values DMA_MAPPING_ERROR as error values
which is often ~0. The error value should be tested with
dma_mapping_error().

This patch creates a new function in niu_ops to test if the mapping
failed. The test is fixed in niu_rbr_add_page(), added in
niu_start_xmit() and the successfully mapped pages are unmaped upon error.

Fixes: ec2deec1f352 ("niu: Fix to check for dma mapping errors.")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tulip: Rename PCI driver struct to end in _driver

This is not only a cosmetic change because the section mismatch checks
also depend on the name and for drivers the checks are stricter than for
ops.

However xircom_driver also passes the stricter checks just fine, so no
further changes needed.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20250627102220.1937649-2-u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>