====================
net_sched: act: extend RCU use in dump() methods
We are trying to get away from central RTNL in favor of fine-grained
mutexes. While looking at net/sched, I found that act already uses
RCU in the fast path for the most cases, and could also be used
in dump() methods.
This series is not complete and will be followed by a second one.
Eric Dumazet [Wed, 9 Jul 2025 09:01:57 +0000 (09:01 +0000)]
net_sched: act_ctinfo: use atomic64_t for three counters
Commit 21c167aa0ba9 ("net/sched: act_ctinfo: use percpu stats")
missed that stats_dscp_set, stats_dscp_error and stats_cpmark_set
might be written (and read) locklessly.
Use atomic64_t for these three fields, I doubt act_ctinfo is used
heavily on big SMP hosts anyway.
Fixes: 24ec483cec98 ("net: sched: Introduce act_ctinfo action") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Pedro Tammela <pctammela@mojatatu.com> Link: https://patch.msgid.link/20250709090204.797558-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 9 Jul 2025 20:59:10 +0000 (13:59 -0700)]
eth: fbnic: fix ubsan complaints about OOB accesses
UBSAN complains that we reach beyond the end of the log entry:
UBSAN: array-index-out-of-bounds in drivers/net/ethernet/meta/fbnic/fbnic_fw_log.c:94:50
index 71 is out of range for type 'char [*]'
Call Trace:
<TASK>
ubsan_epilogue+0x5/0x2b
fbnic_fw_log_write+0x120/0x960
fbnic_fw_parse_logs+0x161/0x210
We're just taking the address of the character after the array,
so this really seems like something that should be legal.
But whatever, easy enough to silence by doing direct pointer math.
Fixes: c2b93d6beca8 ("eth: fbnic: Create ring buffer for firmware logs") Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250709205910.3107691-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
virtio_net: simplify tx queue wake condition check
Consolidate the two nested if conditions for checking tx queue wake
conditions into a single combined condition. This improves code
readability without changing functionality. And move netif_tx_wake_queue
into if condition to reduce unnecessary checks for queue stops.
William Liu [Tue, 8 Jul 2025 16:44:05 +0000 (16:44 +0000)]
selftests/tc-testing: Add tests for restrictions on netem duplication
Ensure that a duplicating netem cannot exist in a tree with other netems
in both qdisc addition and change. This is meant to prevent the soft
lockup and OOM loop scenario discussed in [1]. Also adjust a HFSC's
re-entrancy test case with netem for this new restriction - KASAN
still triggers upon its failure.
Signed-off-by: William Liu <will@willsroot.io> Reviewed-by: Savino Dicanosa <savy@syst3mfailure.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250708164219.875521-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
William Liu [Tue, 8 Jul 2025 16:43:26 +0000 (16:43 +0000)]
net/sched: Restrict conditions for adding duplicating netems to qdisc tree
netem_enqueue's duplication prevention logic breaks when a netem
resides in a qdisc tree with other netems - this can lead to a
soft lockup and OOM loop in netem_dequeue, as seen in [1].
Ensure that a duplicating netem cannot exist in a tree with other
netems.
Previous approaches suggested in discussions in chronological order:
1) Track duplication status or ttl in the sk_buff struct. Considered
too specific a use case to extend such a struct, though this would
be a resilient fix and address other previous and potential future
DOS bugs like the one described in loopy fun [2].
2) Restrict netem_enqueue recursion depth like in act_mirred with a
per cpu variable. However, netem_dequeue can call enqueue on its
child, and the depth restriction could be bypassed if the child is a
netem.
3) Use the same approach as in 2, but add metadata in netem_skb_cb
to handle the netem_dequeue case and track a packet's involvement
in duplication. This is an overly complex approach, and Jamal
notes that the skb cb can be overwritten to circumvent this
safeguard.
4) Prevent the addition of a netem to a qdisc tree if its ancestral
path contains a netem. However, filters and actions can cause a
packet to change paths when re-enqueued to the root from netem
duplication, leading us to the current solution: prevent a
duplicating netem from inhabiting the same tree as other netems.
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication") Reported-by: William Liu <will@willsroot.io> Reported-by: Savino Dicanosa <savy@syst3mfailure.io> Signed-off-by: William Liu <will@willsroot.io> Signed-off-by: Savino Dicanosa <savy@syst3mfailure.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20250708164141.875402-1-will@willsroot.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2).
No conflicts.
Adjacent changes:
drivers/net/wireless/mediatek/mt76/mt7925/mcu.c c701574c5412 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan") b3a431fe2e39 ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()")
drivers/net/wireless/mediatek/mt76/mt7996/mac.c 62da647a2b20 ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()") dc66a129adf1 ("wifi: mt76: add a wrapper for wcid access with validation")
drivers/net/wireless/mediatek/mt76/mt7996/main.c 3dd6f67c669c ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()") 8989d8e90f5f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()")
net/mac80211/cfg.c 58fcb1b4287c ("wifi: mac80211: reject VHT opmode for unsupported channel widths") 037dc18ac3fb ("wifi: mac80211: add support for storing station S1G capabilities")
Merge tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking fixes from Jakub Kicinski
"Big chunk of fixes for WiFi, Johannes says probably the last for the
release.
The Netlink fixes (on top of the tree) restore operation of iw (WiFi
CLI) which uses sillily small recv buffer, and is the reason for this
'emergency PR'.
The GRE multicast fix also stands out among the user-visible
regressions.
Current release - fix to a fix:
- netlink: make sure we always allow at least one skb to be queued,
even if the recvbuf is (mis)configured to be tiny
Previous releases - regressions:
- gre: fix IPv6 multicast route creation
Previous releases - always broken:
- wifi: prevent A-MSDU attacks in mesh networks
- wifi: cfg80211: fix S1G beacon head validation and detection
- wifi: mac80211:
- always clear frame buffer to prevent stack leak in cases which
hit a WARN()
- fix monitor interface in device restart
- wifi: mwifiex: discard erroneous disassoc frames on STA interface
- wifi: mt76:
- prevent null-deref in mt7925_sta_set_decap_offload()
- add missing RCU annotations, and fix sleep in atomic
- fix decapsulation offload
- fixes for scanning
- phy: microchip: improve link establishment and reset handling
- eth: mlx5e: fix race between DIM disable and net_dim()
- bnxt_en: correct DMA unmap len for XDP_REDIRECT"
* tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
netlink: make sure we allow at least one dump skb
netlink: Fix rmem check in netlink_broadcast_deliver().
bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
bnxt_en: Flush FW trace before copying to the coredump
bnxt_en: Fix DCB ETS validation
net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
net/mlx5e: Add new prio for promiscuous mode
net/mlx5e: Fix race between DIM disable and net_dim()
net/mlx5: Reset bw_share field when changing a node's parent
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
selftests: net: lib: fix shift count out of range
selftests: Add IPv6 multicast route generation tests for GRE devices.
gre: Fix IPv6 multicast route creation.
net: phy: microchip: limit 100M workaround to link-down events on LAN88xx
net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
net: appletalk: Fix device refcount leak in atrtr_create()
netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()
wifi: mac80211: add the virtual monitor after reconfig complete
wifi: mac80211: always initialize sdata::key_list
...
Merge tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix performance regression when setting values of multiple GPIO lines
at once
- make sure the GPIO OF xlate code doesn't end up passing an
uninitialized local variable to GPIO core
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: remove bouncing address for Nandor Han
gpio: of: initialize local variable passed to the .of_xlate() callback
gpiolib: fix performance regression when using gpio_chip_get_multiple()
Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- small fix relevant to arm64 server and custom CMA configuration (Feng
Tang)
* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-contiguous: hornor the cma address limit setup by user
Jakub Kicinski [Fri, 11 Jul 2025 00:11:21 +0000 (17:11 -0700)]
netlink: make sure we allow at least one dump skb
Commit under Fixes tightened up the memory accounting for Netlink
sockets. Looks like the accounting is too strict for some existing
use cases, Marek reported issues with nl80211 / WiFi iw CLI.
To reduce number of iterations Netlink dumps try to allocate
messages based on the size of the buffer passed to previous
recvmsg() calls. If user space uses a larger buffer in recvmsg()
than sk_rcvbuf we will allocate an skb we won't be able to queue.
Make sure we always allow at least one skb to be queued.
Same workaround is already present in netlink_attachskb().
Alternative would be to cap the allocation size to
rcvbuf - rmem_alloc
but as I said, the workaround is already present in other places.
Jakub Kicinski [Fri, 11 Jul 2025 14:28:36 +0000 (07:28 -0700)]
Merge branch 'bnxt_en-3-bug-fixes'
Michael Chan says:
====================
bnxt_en: 3 bug fixes
The first one fixes a possible failure when setting DCB ETS.
The second one fixes the ethtool coredump (-W 2) not containing
all the FW traces. The third one fixes the DMA unmap length when
transmitting XDP_REDIRECT packets.
====================
bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
When transmitting an XDP_REDIRECT packet, call dma_unmap_len_set()
with the proper length instead of 0. This bug triggers this warning
on a system with IOMMU enabled:
bnxt_en: Flush FW trace before copying to the coredump
bnxt_fill_drv_seg_record() calls bnxt_dbg_hwrm_log_buffer_flush()
to flush the FW trace buffer. This needs to be done before we
call bnxt_copy_ctx_mem() to copy the trace data.
Without this fix, the coredump may not contain all the FW
traces.
Fixes: 3c2179e66355 ("bnxt_en: Add FW trace coredump segments to the coredump") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In bnxt_ets_validate(), the code incorrectly loops over all possible
traffic classes to check and add the ETS settings. Fix it to loop
over the configured traffic classes only.
The unconfigured traffic classes will default to TSA_ETS with 0
bandwidth. Looping over these unconfigured traffic classes may
cause the validation to fail and trigger this error message:
"rejecting ETS config starving a TC\n"
The .ieee_setets() will then fail.
Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.") Reviewed-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
The function ll_temac_ethtools_set_ringparam() incorrectly checked
rx_pending twice, once correctly for RX and once mistakenly in place
of tx_pending. This caused tx_pending to be left unchecked against
TX_BD_NUM_MAX.
As a result, invalid TX ring sizes may have been accepted or valid
ones wrongly rejected based on the RX limit, leading to potential
misconfiguration or unexpected results.
This patch corrects the condition to properly validate tx_pending.
Jianbo Liu [Thu, 10 Jul 2025 13:53:44 +0000 (16:53 +0300)]
net/mlx5e: Add new prio for promiscuous mode
An optimization for promiscuous mode adds a high-priority steering
table with a single catch-all rule to steer all traffic directly to
the TTC table.
However, a gap exists between the creation of this table and the
insertion of the catch-all rule. Packets arriving in this brief window
would miss as no rule was inserted yet, unnecessarily incrementing the
'rx_steer_missed_packets' counter and dropped.
This patch resolves the issue by introducing a new prio for this
table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing
so, packets arriving during the window now fall through to the next
prio (at MLX5E_NIC_PRIO) instead of being dropped.
Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Fix race between DIM disable and net_dim()
There's a race between disabling DIM and NAPI callbacks using the dim
pointer on the RQ or SQ.
If NAPI checks the DIM state bit and sees it still set, it assumes
`rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after
that check, the pointer might already be set to NULL, leading to a NULL
pointer dereference in net_dim().
Fix this by calling `synchronize_net()` before freeing the DIM context.
This ensures all in-progress NAPI callbacks are finished before the
pointer is cleared.
net/mlx5: Reset bw_share field when changing a node's parent
When changing a node's parent, its scheduling element is destroyed and
re-created with bw_share 0. However, the node's bw_share field was not
updated accordingly.
Set the node's bw_share to 0 after re-creation to keep the software
state in sync with the firmware configuration.
Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Jul 2025 14:07:56 +0000 (07:07 -0700)]
Merge tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-07-11
Sean Nyekjaer's patch targets the m_can driver and demotes the "msg
lost in rx" message to debug level to prevent flooding the kernel log
with error messages.
* tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
====================
However, during review, we noticed that the patch conflicts with another
patch in netdev tree:
https://lore.kernel.org/netdev/1749651015-9668-1-git-send-email-shradhagupta@linux.microsoft.com/
As this series has no dependency with the rest of the series, we think it
is best to split out this one and send it to netdev, to avoid conflict
resolution headache later on.
Can netdev maintainers please pick it up?
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nam Cao [Mon, 7 Jul 2025 08:20:16 +0000 (10:20 +0200)]
PCI: hv: Switch to msi_create_parent_irq_domain()
Move away from the legacy MSI domain setup, switch to use
msi_create_parent_irq_domain().
While doing the conversion, I noticed that hv_compose_msi_msg() is doing
more than it is supposed to (composing message). This function also
allocates and populates struct tran_int_desc, which should be done in
hv_pcie_domain_alloc() instead. It works, but it is not the correct design.
However, I have no hardware to test such change, therefore I leave a TODO
note.
Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nam Cao [Mon, 7 Jul 2025 08:20:15 +0000 (10:20 +0200)]
irqdomain: Export irq_domain_free_irqs_top()
Export irq_domain_free_irqs_top(), making it usable for drivers compiled as
modules.
Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 11 Jul 2025 01:18:40 +0000 (18:18 -0700)]
Merge tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next (v2)
The following series contains an initial small batch of Netfilter
updates for net-next:
1) Remove DCCP conntrack support, keep DCCP matches around in order to
avoid breakage when loading ruleset, add Kconfig to wrap the code
so it can be disabled by distributors.
2) Remove buggy code aiming at shrinking netlink deletion event, then
re-add it correctly in another patch. This is to prevent -stable to
pick up on a fix that breaks old userspace. From Phil Sutter.
3) Missing WARN_ON_ONCE() to check for lockdep_commit_lock_is_held()
to uncover bugs. From Fedor Pchelkin.
* tag 'nf-next-25-07-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: adjust lockdep assertions handling
netfilter: nf_tables: Reintroduce shortened deletion notifications
netfilter: nf_tables: Drop dead code from fill_*_info routines
netfilter: conntrack: remove DCCP protocol support
====================
====================
net: ftgmac100: Add SoC reset support for RMII mode
This patch series adds support for an optional reset line to the
ftgmac100 ethernet controller, as used on Aspeed SoCs. On these SoCs,
the internal MAC reset is not sufficient to reset the RMII interface.
By providing a SoC-level reset via the device tree "resets" property,
the driver can properly reset both the MAC and RMII logic, ensuring
correct operation in RMII mode.
The series includes:
- Device tree binding update to document the new "resets" property.
- Addition of MAC1/2/3/4 reset definitions for AST2600.
- Driver changes to assert/deassert the reset line as needed.
This improves reliability and initialization of the MAC in RMII mode
on Aspeed platforms.
====================
net: ftgmac100: Add optional reset control for RMII mode on Aspeed SoCs
On Aspeed SoCs, the internal MAC reset is insufficient to fully reset the
RMII interface; only the SoC-level reset line can properly reset the RMII
logic. This patch adds support for an optional "resets" property in the
device tree, allowing the driver to assert and deassert the SoC reset line
when operating in RMII mode. This ensures the MAC and RMII interface are
correctly reset and initialized.
dt-bindings: clock: ast2600: Add reset definitions for MAC1 and MAC2
Add ASPEED_RESET_MAC1 and ASPEED_RESET_MAC2 reset definitions to
the ast2600-clock binding header. These are required for proper
reset control of the MAC1 and MAC2 ethernet controllers on the
AST2600 SoC.
In Aspeed AST2600 design, the MAC internal delay on MAC register cannot
fully reset the RMII interfaces, it may cause the RMII incompletely.
Therefore, we need to add resets property to do SoC-level reset line to
reset the whole MAC function that includes ftgmac, RGMII and RMII.
Hangbin Liu [Wed, 9 Jul 2025 09:12:44 +0000 (09:12 +0000)]
selftests: net: lib: fix shift count out of range
I got the following warning when writing other tests:
+ handle_test_result_pass 'bond 802.3ad' '(lacp_active off)'
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ log_test_result 'bond 802.3ad' '(lacp_active off)' ' OK '
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ local 'result= OK '
+ shift
+ local retmsg=
+ shift
/net/tools/testing/selftests/net/forwarding/../lib.sh: line 315: shift: shift count out of range
This happens because an extra shift is executed even after all arguments
have been consumed. Remove the last shift in log_test_result() to avoid
this warning.
When fixing IPv6 link-local address generation on GRE devices with
commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), I accidentally broke the default IPv6 multicast route
creation on these GRE devices.
Fix that in patch 1, making the GRE specific code yet a bit closer to
the generic code used by most other network interface types.
Then extend the selftest in patch 2 to cover this case.
====================
selftests: Add IPv6 multicast route generation tests for GRE devices.
The previous patch fixes a bug that prevented the creation of the
default IPv6 multicast route (ff00::/8) for some GRE devices. Now let's
extend the GRE IPv6 selftests to cover this case.
Also, rename check_ipv6_ll_addr() to check_ipv6_device_config() and
adapt comments and script output to take into account the fact that
we're not limited to link-local address generation.
Use addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config() so that we don't just get the inet6_dev, but also
install the default ff00::/8 multicast route.
Before commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), the multicast route was created at the end of the
function by addrconf_add_mroute(). But this code path is now only taken
in one particular case (gre devices not bound to a local IP address and
in EUI64 mode). For all other cases, the function exits early and
addrconf_add_mroute() is not called anymore.
Using addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config(), fixes the problem as it will create the default
multicast route for all gre devices. This also brings
addrconf_gre_config() a bit closer to the normal netdevice IPv6
configuration code (addrconf_dev_config()).
Cc: stable@vger.kernel.org Fixes: 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.") Reported-by: Aiden Yang <ling@moedove.com> Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/ Reviewed-by: Gary Guo <gary@garyguo.net> Tested-by: Gary Guo <gary@garyguo.net> Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch series improves the reliability of the Microchip LAN88xx
PHYs, particularly in edge cases involving fixed link configurations or
forced speed modes.
Patch 1 assigns genphy_soft_reset() to the .soft_reset hook to ensure
that stale link partner advertisement (LPA) bits are properly cleared
during reconfiguration. Without this, outdated autonegotiation bits may
remain visible in some parallel detection cases.
Patch 2 restricts the 100 Mbps workaround (originally intended to handle
cable length switching) to only run when the link transitions to the
PHY_NOLINK state. This prevents repeated toggling that can confuse
autonegotiating link partners such as the Intel i350, leading to
unstable link cycles.
Both patches were tested on a LAN7850 (with integrated LAN88xx PHY)
against an Intel I350 NIC. The full test suite - autonegotiation, fixed
link, and parallel detection - passed successfully.
====================
net: phy: microchip: limit 100M workaround to link-down events on LAN88xx
Restrict the 100Mbit forced-mode workaround to link-down transitions
only, to prevent repeated link reset cycles in certain configurations.
The workaround was originally introduced to improve signal reliability
when switching cables between long and short distances. It temporarily
forces the PHY into 10 Mbps before returning to 100 Mbps.
However, when used with autonegotiating link partners (e.g., Intel i350),
executing this workaround on every link change can confuse the partner
and cause constant renegotiation loops. This results in repeated link
down/up transitions and the PHY never reaching a stable state.
Limit the workaround to only run during the PHY_NOLINK state. This ensures
it is triggered only once per link drop, avoiding disruptive toggling
while still preserving its intended effect.
Note: I am not able to reproduce the original issue that this workaround
addresses. I can only confirm that 100 Mbit mode works correctly in my
test setup. Based on code inspection, I assume the workaround aims to
reset some internal state machine or signal block by toggling speeds.
However, a PHY reset is already performed earlier in the function via
phy_init_hw(), which may achieve a similar effect. Without a reproducer,
I conservatively keep the workaround but restrict its conditions.
Fixes: e57cf3639c32 ("net: lan78xx: fix accessing the LAN7800's internal phy specific registers from the MAC driver") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250709130753.3994461-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
Enable .soft_reset for the LAN88xx PHY driver by assigning
genphy_soft_reset() to ensure that the phylib core performs a proper
soft reset during reconfiguration.
Previously, the driver left .soft_reset unimplemented, so calls to
phy_init_hw() (e.g., from lan88xx_link_change_notify()) did not fully
reset the PHY. As a result, stale contents in the Link Partner Ability
(LPA) register could persist, causing the PHY to incorrectly report
that the link partner advertised autonegotiation even when it did not.
Using genphy_soft_reset() guarantees a clean reset of the PHY and
corrects the false autoneg reporting in these scenarios.
Fixes: ccb989e4d1ef ("net: phy: microchip: Reset LAN88xx PHY to ensure clean link state on LAN7800/7850") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250709130753.3994461-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mingming Cao [Wed, 9 Jul 2025 15:33:32 +0000 (08:33 -0700)]
ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
The previous hardcoded definitions of NUM_RX_STATS and
NUM_TX_STATS were not updated when new fields were added
to the ibmvnic_{rx,tx}_queue_stats structures. Specifically,
commit 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx
batched") added a fourth TX stat, but NUM_TX_STATS remained 3,
leading to a mismatch.
This patch replaces the static defines with dynamic sizeof-based
calculations to ensure the stat arrays are correctly sized.
This fixes incorrect indexing and prevents incomplete stat
reporting in tools like ethtool.
Fixes: 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx batched") Signed-off-by: Mingming Cao <mmc@linux.ibm.com> Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250709153332.73892-1-mmc@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: pse-pd: pd692x0: reduce stack usage in pd692x0_setup_pi_matrix
The pd692x0_manager array in this function is really too big to fit on the
stack, though this never triggered a warning until a recent patch made
it slightly bigger:
drivers/net/pse-pd/pd692x0.c: In function 'pd692x0_setup_pi_matrix':
drivers/net/pse-pd/pd692x0.c:1210:1: error: the frame size of 1584 bytes is larger than 1536 bytes [-Werror=frame-larger-than=]
Change the function to dynamically allocate the array here.
Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250709153210.1920125-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kito Xu [Wed, 9 Jul 2025 03:52:51 +0000 (03:52 +0000)]
net: appletalk: Fix device refcount leak in atrtr_create()
When updating an existing route entry in atrtr_create(), the old device
reference was not being released before assigning the new device,
leading to a device refcount leak. Fix this by calling dev_put() to
release the old device reference before holding the new one.
Jakub Kicinski [Tue, 8 Jul 2025 22:06:40 +0000 (15:06 -0700)]
selftests: drv-net: test RSS header field configuration
Test reading RXFH fields over IOCTL and netlink.
# ./tools/testing/selftests/drivers/net/hw/rss_api.py
TAP version 13
1..3
ok 1 rss_api.test_rxfh_indir_ntf
ok 2 rss_api.test_rxfh_indir_ctx_ntf
ok 3 rss_api.test_rxfh_fields
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Jakub Kicinski [Tue, 8 Jul 2025 22:06:39 +0000 (15:06 -0700)]
ethtool: rss: report which fields are configured for hashing
Implement ETHTOOL_GRXFH over Netlink. The number of flow types is
reasonable (around 20) so report all of them at once for simplicity.
Do not maintain the flow ID mapping with ioctl at the uAPI level.
This gives us a chance to clean up the confusion that come from
RxNFC vs RxFH (flow direction vs hashing) in the ioctl.
Try to align with the names used in ethtool CLI, they seem to have
stood the test of time just fine. One annoyance is that we still
call L4 ports the weird names, but I guess they also apply to IPSec
(where they cover the SPI) so it is what it is.
Jakub Kicinski [Tue, 8 Jul 2025 22:06:38 +0000 (15:06 -0700)]
ethtool: mark ETHER_FLOW as usable for Rx hash
Looks like some drivers (ena, enetc, fbnic.. there's probably more)
consider ETHER_FLOW to be legitimate target for flow hashing.
I'm not sure how intentional that is from the uAPI perspective
vs just an effect of ethtool IOCTL doing minimal input validation.
But Netlink will do strict validation, so we need to decide whether
we allow this use case or not. I don't see a strong reason against
it, and rejecting it would potentially regress a number of drivers.
So update the comments and flow_type_hashable().
Jakub Kicinski [Tue, 8 Jul 2025 22:06:37 +0000 (15:06 -0700)]
tools: ynl: decode enums in auto-ints
Use enum decoding on auto-ints. Looks like we only had enum
auto-ints for input values until now. Upcoming RSS work will
need this to declare an attribute with flags as a uint.
Jakub Kicinski [Tue, 8 Jul 2025 22:06:36 +0000 (15:06 -0700)]
ethtool: rss: make sure dump takes the rss lock
After commit 040cef30b5e6 ("net: ethtool: move get_rxfh callback
under the rss_lock") we're expected to take rss_lock around get.
Switch dump to using the new prep helper and move the locking into it.
Jakub Kicinski [Fri, 11 Jul 2025 00:24:21 +0000 (17:24 -0700)]
Merge tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a bit more work, notably:
- mt76: firmware recovery improvements, MLO work
- iwlwifi: use embedded PNVM in (to be released) FW images
to fix compatibility issues
- cfg80211/mac80211: extended regulatory info support (6 GHz)
- cfg80211: use "faux device" for regulatory
* tag 'wireless-next-2025-07-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (48 commits)
wifi: mac80211: don't complete management TX on SAE commit
wifi: cfg80211/mac80211: implement dot11ExtendedRegInfoSupport
wifi: mac80211: send extended MLD capa/ops if AP has it
wifi: mac80211: copy first_part into HW scan
wifi: cfg80211: add a flag for the first part of a scan
wifi: mac80211: remove DISALLOW_PUNCTURING_5GHZ code
wifi: cfg80211: only verify part of Extended MLD Capabilities
wifi: nl80211: make nl80211_check_scan_flags() type safe
wifi: cfg80211: hide scan internals
wifi: mac80211: fix deactivated link CSA
wifi: mac80211: add mandatory bitrate support for 6 GHz
wifi: mac80211: remove spurious blank line
wifi: mac80211: verify state before connection
wifi: mac80211: avoid weird state in error path
wifi: iwlwifi: mvm: remove support for iwl_wowlan_info_notif_v4
wifi: iwlwifi: bump minimum API version in BZ
wifi: iwlwifi: mvm: remove unneeded argument
wifi: iwlwifi: mvm: remove MLO GTK rekey code
wifi: iwlwifi: pcie: rename iwl_pci_gen1_2_probe() argument
wifi: iwlwifi: match discrete/integrated to fix some names
...
====================
Wang Liang [Tue, 8 Jul 2025 03:33:42 +0000 (11:33 +0800)]
net: replace ND_PRINTK with dynamic debug
ND_PRINTK with val > 1 only works when the ND_DEBUG was set in compilation
phase. Replace it with dynamic debug. Convert ND_PRINTK with val <= 1 to
net_{err,warn}_ratelimited, and convert the rest to net_dbg_ratelimited.
Jakub Kicinski [Thu, 10 Jul 2025 22:02:23 +0000 (15:02 -0700)]
Merge branch 'further-mt7988-devicetree-work'
Frank Wunderlich says:
====================
further mt7988 devicetree work
This series continues mt7988 devicetree work
- Extend cpu frequency scaling with CCI
- GPIO leds
- Basic network-support (ethernet controller + builtin switch + SFP Cages)
depencies (i hope this list is complete and latest patches/series linked):
support interrupt-names is optional again as i re-added the reserved IRQs
(they are not unusable as i thought and can allow features in future)
https://patchwork.kernel.org/project/netdevbpf/patch/20250619132125.78368-2-linux@fw-web.de/
needs change in mtk ethernet driver for the sram to be read from separate node:
https://patchwork.kernel.org/project/netdevbpf/patch/c2b9242229d06af4e468204bcf42daa1535c3a72.1751461762.git.daniel@makrotopia.org/
for SFP-Function (macs currently disabled):
PCS clearance which is a 1.5 year discussion currently ongoing
Daniel asked netdev for a way 2 go:
https://lore.kernel.org/netdev/aEwfME3dYisQtdCj@pidgin.makrotopia.org/
e.g. something like this (one of):
* https://patchwork.kernel.org/project/netdevbpf/patch/20250610233134.3588011-4-sean.anderson@linux.dev/ (v6)
* https://patchwork.kernel.org/project/netdevbpf/patch/20250511201250.3789083-4-ansuelsmth@gmail.com/ (v4)
* https://patchwork.kernel.org/project/netdevbpf/patch/ba4e359584a6b3bc4b3470822c42186d5b0856f9.1721910728.git.daniel@makrotopia.org/
dt-bindings: net: dsa: mediatek,mt7530: add internal mdio bus
Mt7988 buildin switch has own mdio bus where ge-phys are connected.
Add related property for this.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://patch.msgid.link/20250709111147.11843-7-linux@fw-web.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dt-bindings: net: dsa: mediatek,mt7530: add dsa-port definition for mt7988
Add own dsa-port binding for SoC with internal switch where only phy-mode
'internal' is valid.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://patch.msgid.link/20250709111147.11843-6-linux@fw-web.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Meditak Filogic SoCs (MT798x) have dedicated MMIO-SRAM for dma operations.
MT7981 and MT7986 currently use static offset to ethernet MAC register
which will be changed in separate patch once this way is accepted.
Add "sram" property to map ethernet controller to dedicated mmio-sram node.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250709111147.11843-5-linux@fw-web.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for MT7988 and RSS/LRO allow the interrupt-names
property.
In this way driver can request the interrupts by name which is much
more readable in the driver code and SoC's dtsi than relying on a
specific order.
Frame-engine-IRQs (fe0..3):
MT7621, MT7628: 1 FE-IRQ
MT7622, MT7623: 3 FE-IRQs (only two used by the driver for now)
MT7981, MT7986: 4 FE-IRQs (only two used by the driver for now)
RSS/LRO IRQs (pdma0..3) additional only on Filogic (MT798x) with
count of 4. So all IRQ-names (8) for Filogic.
Set boundaries for all compatibles same as irq count.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250709111147.11843-4-linux@fw-web.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dt-bindings: net: mediatek,net: allow up to 8 IRQs
Increase the maximum IRQ count to 8 (4 FE + 4 RSS/LRO).
Frame-engine-IRQs (max 4):
MT7621, MT7628: 1 FE-IRQ
MT7622, MT7623: 3 FE-IRQs (only two used by the driver for now)
MT7981, MT7986, MT7988: 4 FE-IRQs (only two used by the driver for now)
Mediatek Filogic SoCs (mt798x) have 4 additional IRQs for RSS and/or
LRO. So MT798x have 8 IRQs in total.
MT7981 does not have a ethernet-node yet.
MT7986 Ethernet node is updated with RSS/LRO IRQs in this series.
MT7988 Ethernet node is added in this series.
Jakub Kicinski [Thu, 10 Jul 2025 20:32:35 +0000 (13:32 -0700)]
Merge branch 'virtio_udp_tunnel_08_07_2025' of https://github.com/pabeni/linux-devel
Paolo Abeni says:
====================
virtio: introduce GSO over UDP tunnel
Some virtualized deployments use UDP tunnel pervasively and are impacted
negatively by the lack of GSO support for such kind of traffic in the
virtual NIC driver.
The virtio_net specification recently introduced support for GSO over
UDP tunnel, this series updates the virtio implementation to support
such a feature.
Currently the kernel virtio support limits the feature space to 64,
while the virtio specification allows for a larger number of features.
Specifically the GSO-over-UDP-tunnel-related virtio features use bits
65-69.
The first four patches in this series rework the virtio and vhost
feature support to cope with up to 128 bits. The limit is set by
a define and could be easily raised in future, as needed.
This implementation choice is aimed at keeping the code churn as
limited as possible. For the same reason, only the virtio_net driver is
reworked to leverage the extended feature space; all other
virtio/vhost drivers are unaffected, but could be upgraded to support
the extended features space in a later time.
The last four patches bring in the actual GSO over UDP tunnel support.
As per specification, some additional fields are introduced into the
virtio net header to support the new offload. The presence of such
fields depends on the negotiated features.
New helpers are introduced to convert the UDP-tunneled skb metadata to
an extended virtio net header and vice versa. Such helpers are used by
the tun and virtio_net driver to cope with the newly supported offloads.
Tested with basic stream transfer with all the possible permutations of
host kernel/qemu/guest kernel with/without GSO over UDP tunnel support.
====================
Merge tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - regressions:
- tcp: refine sk_rcvbuf increase for ooo packets
- bluetooth: fix attempting to send HCI_Disconnect to BIS handle
- rxrpc: fix over large frame size warning
- eth: bcmgenet: initialize u64 stats seq counter
Previous releases - regressions:
- tcp: correct signedness in skb remaining space calculation
- sched: abort __tc_modify_qdisc if parent class does not exist
- vsock: fix transport_{g2h,h2g} TOCTOU
- rxrpc: fix bug due to prealloc collision
- tipc: fix use-after-free in tipc_conn_close().
- bluetooth: fix not marking Broadcast Sink BIS as connected
- phy: qca808x: fix WoL issue by utilizing at8031_set_wol()
- eth: am65-cpsw-nuss: fix skb size by accounting for skb_shared_info
Previous releases - always broken:
- netlink: fix wraparounds of sk->sk_rmem_alloc.
- atm: fix infinite recursive call of clip_push().
- eth:
- stmmac: fix interrupt handling for level-triggered mode in DWC_XGMAC2
- rtsn: fix a null pointer dereference in rtsn_probe()"
* tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
net/sched: sch_qfq: Fix null-deref in agg_dequeue
rxrpc: Fix oops due to non-existence of prealloc backlog struct
rxrpc: Fix bug due to prealloc collision
MAINTAINERS: remove myself as netronome maintainer
selftests/net: packetdrill: add tcp_ooo-before-and-after-accept.pkt
tcp: refine sk_rcvbuf increase for ooo packets
net/sched: Abort __tc_modify_qdisc if parent class does not exist
net: ethernet: ti: am65-cpsw-nuss: Fix skb size by accounting for skb_shared_info
net: thunderx: avoid direct MTU assignment after WRITE_ONCE()
selftests/tc-testing: Create test case for UAF scenario with DRR/NETEM/BLACKHOLE chain
atm: clip: Fix NULL pointer dereference in vcc_sendmsg()
atm: clip: Fix infinite recursive call of clip_push().
atm: clip: Fix memory leak of struct clip_vcc.
atm: clip: Fix potential null-ptr-deref in to_atmarpd().
net: phy: smsc: Fix link failure in forced mode with Auto-MDIX
net: phy: smsc: Force predictable MDI-X state on LAN87xx
net: phy: smsc: Fix Auto-MDIX configuration when disabled by strap
net: stmmac: Fix interrupt handling for level-triggered mode in DWC_XGMAC2
rxrpc: Fix over large frame size warning
net: airoha: Fix an error handling path in airoha_probe()
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Many patches, pretty much all of them small, that accumulated while I
was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host
on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested
mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are
actively being created, so as not to create a non-SEV{-ES} vCPU in
an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification
of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring
IRQ routing, to avoid dictating a specific ioctl() ordering to
userspace
- Conditionally reschedule when setting memory attributes to avoid
soft lockups when userspace converts huge swaths of memory to/from
private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that
resulted in the guest-visible workarounds field being filled at the
wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes
to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: avoid underflow when scaling TSC frequency
KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
KVM: arm64: Don't free hyp pages with pKVM on GICv2
KVM: arm64: Fix error path in init_hyp_mode()
KVM: arm64: Adjust range correctly during host stage-2 faults
KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
Documentation: KVM: Fix unexpected unindent warnings
KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
KVM: Allow CPU to reschedule while setting per-page memory attributes
KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.
Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.
The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].
The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.
Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572
wifi: mac80211: add the virtual monitor after reconfig complete
In reconfig we add the virtual monitor in 2 cases:
1. If we are resuming (it was deleted on suspend)
2. If it was added after an error but before the reconfig
(due to the last non-monitor interface removal).
In the second case, the removal of the non-monitor interface will succeed
but the addition of the virtual monitor will fail, so we add it in the
reconfig.
The problem is that we mislead the driver to think that this is an existing
interface that is getting re-added - while it is actually a completely new
interface from the drivers' point of view.
Some drivers act differently when a interface is re-added. For example, it
might not initialize things because they were already initialized.
Such drivers will - in this case - be left with a partialy initialized vif.
To fix it, add the virtual monitor after reconfig_complete, so the
driver will know that this is a completely new interface.
This is currently not initialized for a virtual monitor, leading to a
NULL pointer dereference when - for example - iterating over all the
keys of all the vifs.
Xiang Mei [Sat, 5 Jul 2025 21:21:43 +0000 (14:21 -0700)]
net/sched: sch_qfq: Fix null-deref in agg_dequeue
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.
To avoid code duplication, the following changes are made:
1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.
2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.
3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.
RQTs (Receive Queue Table) should redirect traffic to the channels' RQs
when they're active. Otherwise, redirect to the designated "drop RQ".
RQTs are created in "inactive" state, pointing to the "drop RQ".
In activate and de-activate flows, do not "deactivate" the rest of RQTs
(beyond the num of channels), as they are already inactive.
This cuts down unnecessary execution of FW commands (MODIFY_RQT), and
improves the latency of open/close channels or configuration change.
Perf:
NIC: Connect-X7.
Configuration: 1 combined channel, max num channels 248.
Measure time for "interface up + interface down".
Before: 0.313 sec
After: 0.057 sec (5.5x faster)
247 MODIFY_RQT commands saved in interface up.
247 MODIFY_RQT commands saved in interface down.
Maor Gottlieb [Tue, 8 Jul 2025 21:16:26 +0000 (00:16 +0300)]
net/mlx5: Warn when write combining is not supported
Warn if write combining is not supported, as it can impact latency.
Add the warning message to be printed only when the driver actually
run the test and detect unsupported state, rather than when
inheriting parent's result for SFs.
Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gal Pressman [Tue, 8 Jul 2025 21:16:25 +0000 (00:16 +0300)]
net/mlx5e: Replace recursive VLAN push handling with an iterative loop
mlx5e_tc_act_vlan_add_push_action() uses tail-recursion to walk through
a stack of VLAN devices.
There is no need for a complicated recursion with unnecessary stack
consumption and less obvious code flow, rewrite the function so that it
uses a do while loop instead.
Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Remove unused VLAN insertion logic in TX path
The VLAN insertion capability (`wqe_vlan_insert`) was never enabled on
all mlx5 devices. When VLAN TX offload is advertised but this
capability is not supported, the driver uses inline headers to insert
the VLAN tag.
To support this, the driver used to set the
`MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE` bit to enforce L2 inline mode
when `wqe_vlan_insert` was not supported. Since the capability is
disabled on all devices, this logic was always active, and the SQ flag
has become redundant. L2 inline is enforced unconditionally for
VLAN-tagged packets.
The `skb_vlan_tag_present()` check in the else-if block of
`mlx5e_sq_xmit_wqe()` is never true by this point in the TX flow,
as the VLAN tag has already been inserted by the driver using inline
headers. As a result, this code is never executed.
Remove the redundant SQ state, dead VLAN insertion code block, and
related logic.
David Howells [Tue, 8 Jul 2025 21:15:04 +0000 (22:15 +0100)]
rxrpc: Fix oops due to non-existence of prealloc backlog struct
If an AF_RXRPC service socket is opened and bound, but calls are
preallocated, then rxrpc_alloc_incoming_call() will oops because the
rxrpc_backlog struct doesn't get allocated until the first preallocation is
made.
Fix this by returning NULL from rxrpc_alloc_incoming_call() if there is no
backlog struct. This will cause the incoming call to be aborted.
Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Suggested-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Willy Tarreau <w@1wt.eu>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 8 Jul 2025 21:15:03 +0000 (22:15 +0100)]
rxrpc: Fix bug due to prealloc collision
When userspace is using AF_RXRPC to provide a server, it has to preallocate
incoming calls and assign to them call IDs that will be used to thread
related recvmsg() and sendmsg() together. The preallocated call IDs will
automatically be attached to calls as they come in until the pool is empty.
To the kernel, the call IDs are just arbitrary numbers, but userspace can
use the call ID to hold a pointer to prepared structs. In any case, the
user isn't permitted to create two calls with the same call ID (call IDs
become available again when the call ends) and EBADSLT should result from
sendmsg() if an attempt is made to preallocate a call with an in-use call
ID.
However, the cleanup in the error handling will trigger both assertions in
rxrpc_cleanup_call() because the call isn't marked complete and isn't
marked as having been released.
Fix this by setting the call state in rxrpc_service_prealloc_one() and then
marking it as being released before calling the cleanup function.
Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Junvyyang, Tencent Zhuque Lab <zhuque@tencent.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: LePremierHomme <kwqcheii@proton.me>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/20250708211506.2699012-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
vsock/test: fix test for null ptr deref when transport changes
In test_stream_transport_change_client(), the client sends CONTROL_CONTINUE
on each iteration, even when connect() is unsuccessful. This causes a flood
of control messages in the server that hangs around for more than 10
seconds after the test finishes, triggering several timeouts and causing
subsequent tests to fail. This was discovered in testing a newly proposed
test that failed in this way on the client side:
...
33 - SOCK_STREAM transport change null-ptr-deref...ok
34 - SOCK_STREAM ioctl(SIOCINQ) functionality...recv timed out
The CONTROL_CONTINUE message is used only to tell to the server to call
accept() to consume successful connections, so that subsequent connect()
will not fail for finding the queue full.
Send CONTROL_CONTINUE message only when the connect() has succeeded, or
found the queue full. Note that the second connect() can also succeed if
the first one was interrupted after sending the request.
Fixes: 3a764d93385c ("vsock/test: Add test for null ptr deref when transport changes") Cc: leonardi@redhat.com Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708111701.129585-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Proper bcm54811 PHY driver initialization for MII-Lite.
The bcm54811 PHY in MLP package must be setup for MII-Lite interface
mode by software. Normally, the PHY to MAC interface is selected in
hardware by setting the bootstrap pins of the PHY. However, MII and
MII-Lite share the same hardware setup and must be distinguished by
software, setting appropriate bit in a configuration register.
The MII-Lite interface mode is non-standard one, defined by Broadcom
for some of their PHYs. The MII-Lite lightness consist in omitting
RXER, TXER, CRS and COL signals of the standard MII interface.
Absence of COL them makes half-duplex links modes impossible but
does not interfere with Broadcom's BroadR-Reach link modes, because
they are full-duplex only.
To do it in a clean way, MII-Lite must be introduced first, including
its limitation to link modes (no half-duplex), because it is a
prerequisite for the patch #3 of this series. The patch #4 does not
depend on MII-Lite directly but both #3 and #4 are necessary for
bcm54811 to work properly without additional configuration steps to be
done - for example in the bootloader, before the kernel starts.
PATCH 1 - Add MII-Lite PHY interface mode as defined by Broadcom for
their two-wire PHYs. It can be used with most Ethernet controllers
under certain limitations (no half-duplex link modes etc.).
PATCH 2 - Add MII-Lite PHY interface type
PATCH 3 - Activation of MII-Lite interface mode on Broadcom bcm5481x
PHYs
PATCH 4 - Initialize the BCM54811 PHY properly so that it conforms
to the datasheet regarding a reserved bit in the LRE Control
register, which must be written to zero after every device reset.
Ignore the LDS capability bit in LRE Status register on bcm54811.
====================
Reset the bit 12 in PHY's LRE Control register upon initialization.
According to the datasheet, this bit must be written to zero after
every device reset.
Broadcom PHYs featuring the BroadR-Reach two-wire link mode are usually
capable to operate in simplified MII mode, without TXER, RXER, CRS and
COL signals as defined for the MII. The absence of COL signal makes
half-duplex link modes impossible, however, the BroadR-Reach modes are
all full-duplex only.
Depending on the IC encapsulation, there exist MII-Lite-only PHYs such
as bcm54811 in MLP. The PHY itself is hardware-strapped to select among
multiple RGMII and MII-Lite modes, but the MII-Lite mode must be also
activated by software.
Add MII-Lite activation for bcm5481x PHYs.
Signed-off-by: Kamil Horák - 2N <kamilh@axis.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20250708090140.61355-4-kamilh@axis.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>