net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses
Emails to the maintainer of the Qualcomm PPE Ethernet driver (Luo Jie
<quic_luoj@quicinc.com>) bounce permanently (full mailbox), because the
"quicinc.com" addresses were deprecated for public work. All Qualcomm
contributors are aware of that and were asked to fix their addresses.
A driver is not "supported", in the sense netdev understands that
commitment, if its maintainer does not care to receive patches for the
code, so demote it to "maintained" to reflect its true status.
Fix all occurrences of Luo Jie's email address to the preferred, working
domain.
Wei Fang [Wed, 24 Jun 2026 07:27:26 +0000 (15:27 +0800)]
net: enetc: fix potential divide-by-zero when num_vsi is zero
For i.MX94 series, all the standalone ENETCs do not support SR-IOV, so
pf->caps.num_vsi is zero. This leads to a divide-by-zero in
enetc4_default_rings_allocation() when distributing rings among PF and
VFs.
Division by zero is undefined behavior in C. On ARM64, the UDIV/SDIV
instructions silently return zero rather than raising an exception, so
the issue does not cause a visible crash. However, relying on this
behavior is incorrect and poses a cross-platform compatibility risk.
Add an explicit check for num_vsi == 0 and return early after the PF's
rings have been configured.
Fixes: 2d673b0e2f8d ("net: enetc: add standalone ENETC support for i.MX94") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260624072726.1238903-1-wei.fang@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
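The early return described in the enetc fix above can be sketched in user-space C. This is only an illustration under stated assumptions: struct ring_plan, plan_rings() and the even split between PF and VSIs are hypothetical and are not taken from enetc4_default_rings_allocation().

```c
#include <assert.h>

/* Hypothetical, simplified result of a ring distribution. */
struct ring_plan {
	unsigned int pf_rings;      /* rings kept by the PF */
	unsigned int rings_per_vsi; /* rings given to each VSI, 0 if none */
};

/*
 * Split total_rings between the PF and num_vsi VSIs.
 * The even split is an assumption made for illustration only.
 */
static struct ring_plan plan_rings(unsigned int total_rings,
				   unsigned int num_vsi)
{
	struct ring_plan plan = { .pf_rings = total_rings, .rings_per_vsi = 0 };

	/* No SR-IOV: the PF keeps every ring and no division is attempted. */
	if (num_vsi == 0)
		return plan;

	plan.pf_rings = total_rings / (num_vsi + 1);
	plan.rings_per_vsi = (total_rings - plan.pf_rings) / num_vsi;

	/* The plan never hands out more rings than exist. */
	assert(plan.pf_rings + plan.rings_per_vsi * num_vsi <= total_rings);
	return plan;
}
```

With num_vsi equal to zero the function returns before any division, so the result does not depend on how a given architecture treats division by zero.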
dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback
Fix the Micrel PHY in the example which shouldn't have the
fallback "ethernet-phy-ieee802.3-c22" compatible:
Documentation/devicetree/bindings/net/renesas,ether.example.dtb: ethernet-phy@1 \
(ethernet-phy-id0022.1537): compatible: ['ethernet-phy-id0022.1537', 'ethernet-phy-ieee802.3-c22'] is too long
from schema $id: http://devicetree.org/schemas/net/micrel.yaml
Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Conor Dooley <conor.dooley@microchip.com> Acked-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Fixes: 37a2fce09001 ("dt-bindings: sh_eth convert bindings to json-schema") Link: https://patch.msgid.link/20260624150250.131966-2-robh@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ct_limit_set() is documented as being called with ovs_mutex held. It
walks the ct limit hlist with hlist_for_each_entry_rcu(), but the
iterator does not currently pass the OVS lockdep condition used
elsewhere for RCU-protected OVS objects.
Pass lockdep_ovsl_is_held() to the iterator. This matches the function's
existing caller contract and lets CONFIG_PROVE_RCU_LIST distinguish the
ovs_mutex-protected update path from the RCU read-side ct_limit_get()
path.
This was found by our static analysis tool and then manually reviewed
against the current tree. In the reviewed CONFIG_PROVE_RCU_LIST triage
run, the writer-side ct limit update produced the expected "RCU-list
traversed in non-reader section!!" warning while ovs_mutex was held,
with the stack matching ct_limit_set() and ovs_ct_limit_set_zone_limit().
The change is limited to documenting the existing protection contract.
This is a lockdep annotation cleanup. It does not change the conntrack
limit list update or release behavior.
Eric Dumazet [Thu, 25 Jun 2026 06:59:36 +0000 (06:59 +0000)]
net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync
Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() during concurrent device operations.
The workqueue core clears the internal pending bit before invoking the
worker. At that point, a concurrent thread can queue the work again.
When the already running worker eventually clears the work_pending flag
to 0, it mistakenly clears the flag for the newly queued instance.
udp_tunnel_nic_unregister() then observes work_pending as 0 and frees
the structure while the second work item is still active in the queue,
leading to UAF.
Fix this by returning early in udp_tunnel_nic_device_sync() if
work_pending is already set, preventing redundant work queueing.
Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure") Reported-by: Yue Sun <samsun1006219@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260625065938.654652-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
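A single-threaded model of the early return in the udp_tunnel fix above, as a rough sketch only. struct nic_model and model_device_sync() are hypothetical names; the real function has further conditions and runs under locking that this model ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-NIC state discussed above. */
struct nic_model {
	bool work_pending;         /* set while a work item is outstanding */
	unsigned int times_queued; /* how often work was actually queued */
};

/* Queue the sync work unless an instance is already pending. */
static void model_device_sync(struct nic_model *nic)
{
	assert(nic);

	/* Already pending: do not queue a second instance. */
	if (nic->work_pending)
		return;

	nic->work_pending = true;
	nic->times_queued++;
}
```

Because a second request is dropped while work_pending is set, the flag always describes exactly one outstanding work item in this model.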
Jakub Kicinski [Thu, 25 Jun 2026 02:56:58 +0000 (19:56 -0700)]
Merge tag 'nf-26-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Add a workaround to avoid a possible crash if nf_nat and nft_chain_nat are
compiled built-in and nf_nat fails to register, allowing nft_chain_nat to
access the incorrect pernetns area. This crash is specific to fully
built-in compilation. From Matias Krause.
2) Revisit conncount GC optimization for confirmed conntracks, skip GC round
if IPS_ASSURED is set. This addresses a corner case involving locally
generated traffic. No crash, just a functionality fix. From Fernando F.
Mancera.
3) Validate iph->ihl in flowtable IPIP tunnel support, from Lorenzo Bianconi.
This is a sanity check that bounces malformed IPIP packets back to the
classic forwarding path.
4) Kdoc fixes for x_tables.h, from Randy Dunlap.
5) Use info->options so nft_synproxy_tcp_options() stays on the same local
snapshot, otherwise eval path can observe inconsistent mix of mss and
timestamps. From Runyu Xiao.
6) Add conntrack_sctp_collision.sh to cover for SCTP INIT collisions.
From Yi Chen.
7) Do not allow NFPROTO_UNSPEC targets if family is NFPROTO_BRIDGE in
nft_compat. Otherwise nonsensical targets such as xt_nat can be used,
leading to a crash. From Florian Westphal.
8) Add a selftest for queueing from the bridge family. From Florian Westphal.
9) Do not allow resetting a conntrack helper via ctnetlink. This feature
predates the creation of conntrack-tools and is not used. I don't have
a use case for it, so I prefer removing it to fixing it.
10) Add deprecation warning for IPv4 only conntrack helpers for PPTP
and IRC. From Florian Westphal.
11) Store the master tuple in the expectation object and use it,
otherwise SLAB_TYPESAFE_RCU rules allow to display incorrect
master tuple information through ctnetlink.
12) Run expectation eviction when inserting an expectation with no
helper, this is a fix for the nft_ct custom expectation support.
13) Fix nft_ct custom expectation timeouts, userspace provides a
timeout in milliseconds but kernel assumes this comes in seconds.
From Florian Westphal.
14) Cap maximum number of expectations per class to 255 expectations
per master conntrack at helper registration. This is a fix to
restrict the maximum number of expectations per master conntrack
which can be an issue for the new lazy GC expectation approach.
* tag 'nf-26-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration
netfilter: nft_ct: expectation timeouts are passed in milliseconds
netfilter: nf_conntrack_expect: run expectation eviction with no helper
netfilter: nf_conntrack_expect: store master_tuple in expectation
netfilter: conntrack: add deprecation warnings for irc and pptp trackers
netfilter: ctnetlink: do not allow to reset helper on existing conntrack
selftests: nft_queue.sh: add a bridge queue test
netfilter: nft_compat: ebtables emulation must reject non-bridge targets
selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test
netfilter: nft_synproxy: stop bypassing the priv->info snapshot
netfilter: x_tables.h: fix all kernel-doc warnings
netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto()
netfilter: nf_conncount: prevent connlimit drops for early confirmed ct
netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init()
====================
Jakub Kicinski [Thu, 25 Jun 2026 02:42:43 +0000 (19:42 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-06-22 (ice, i40e, e1000e)
For ice:
Dawid changes call to release control VSI during reset to prevent
leaking it.
Lukasz fixes the flow control error check to compare the value rather than
treat it as a bitmap.
Paul makes link related errors non-fatal to probe to allow for recovery
in certain NVM update situations.
Marcin moves netif_keep_dst() to only be called once when entering
switchdev mode.
ZhaoJinming adds a cleanup path for ice_dpll_init_info() to prevent
memory leaks on error path.
For i40e:
Mohamed Khalfella corrects the argument used inside the macro to match the
one provided to the macro.
For e1000e:
Dima resolves power state issues by adjusting value of PLL clock gate
and re-enabling K1; a quirk table is added to keep it off for known bad
systems.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake
i40e: Fix i40e_debug() to use struct i40e_hw argument
ice: dpll: fix memory leak in ice_dpll_init_info error paths
ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info
ice: call netif_keep_dst() once when entering switchdev mode
ice: fix ice_init_link() error return preventing probe
ice: fix AQ error code comparison in ice_set_pauseparam()
ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()
====================
The current MII interface register definition from the vendor is wrong;
use the right number for the macro. Also, correct the interface mask
in spacemit_set_phy_intf_sel() so it can update the register with the
right number.
Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC") Signed-off-by: Inochi Amaoto <inochiama@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260623074637.503864-2-inochiama@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: ethernet: sunplus: spl2sw: fix phy_node refcount leak in remove
mac->phy_node is acquired via of_parse_phandle() in spl2sw_probe() and
stored in the mac private data, transferring ownership of the
device_node reference to mac. On driver removal, spl2sw_phy_remove()
disconnects the PHY but never drops that reference, so each
probe-then-remove cycle leaks one of_node refcount per port permanently.
Drop the reference after phy_disconnect(). While at it, remove the
redundant inner "if (ndev)" check; comm->ndev[i] was just verified
non-NULL on the line above.
Compile-tested only; no SP7021 hardware available.
tools/ynl: add missing uapi header deps in Makefile.deps
drm_ras includes drm/drm_ras.h, which is a relatively new header not yet
shipped in most distro kernel-header packages. Without the explicit
entry, the build might fail with a message like this:
drm_ras-user.c:19:10: error: ‘DRM_RAS_CMD_CLEAR_ERROR_COUNTER’ \
undeclared here (not in a function); did you mean \
‘DRM_RAS_CMD_GET_ERROR_COUNTER’
Ruoyu Wang [Tue, 23 Jun 2026 02:57:59 +0000 (10:57 +0800)]
net: sungem: fix probe error cleanup
gem_init_one() calls gem_remove_one() when register_netdev() fails.
gem_remove_one() unregisters and frees resources owned by the net_device,
including the DMA block, MMIO mapping, PCI regions, and the net_device
itself. gem_init_one() then falls through to its own cleanup labels and
frees the same resources again.
Keep the register_netdev() error path in gem_init_one(): clear drvdata so
PM/remove paths do not see a half-registered device, remove the NAPI
instance added during probe, and let the existing cleanup labels release
the resources once.
The issue was found by a local static-analysis checker for probe error
paths. The reported path was manually inspected before sending this fix.
Compile-tested with CONFIG_SUNGEM=y. Runtime testing was not performed
because no sungem hardware is available.
HanQuan [Tue, 23 Jun 2026 01:52:08 +0000 (01:52 +0000)]
net/tcp-ao: fix use-after-free of key in del_async path
In tcp_ao_delete_key(), the del_async path skips the current_key
and rnext_key validity checks present in the synchronous path,
assuming these pointers are always NULL on LISTEN sockets. However,
if a key was added with set_current=1/set_rnext=1 while the socket
was in CLOSE state, current_key and rnext_key will be non-NULL
after listen() transitions the socket to LISTEN.
When such a key is deleted with del_async=1, hlist_del_rcu() and
call_rcu() free the key without clearing the dangling pointers.
After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
current_key->sndid and rnext_key->rcvid from freed slab memory.
Clear current_key and rnext_key in the del_async path when they
reference the key being deleted.
Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)") Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260623015208.1191687-1-eilaimemedsnaimel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
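The pointer clearing described in the tcp-ao fix above can be sketched as follows. The structures are simplified stand-ins and ao_forget_key() is a hypothetical helper; the kernel code additionally deals with RCU and concurrent readers, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the key and per-socket info. */
struct ao_key {
	int sndid;
	int rcvid;
};

struct ao_info {
	struct ao_key *current_key;
	struct ao_key *rnext_key;
};

/*
 * Drop every reference to 'key' from 'info' before the key is freed,
 * so that later readers cannot follow a dangling pointer.
 */
static void ao_forget_key(struct ao_info *info, const struct ao_key *key)
{
	assert(info && key);

	if (info->current_key == key)
		info->current_key = NULL;
	if (info->rnext_key == key)
		info->rnext_key = NULL;
}
```

Only pointers that reference the deleted key are cleared; a different key that is still selected as current or rnext is left in place.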
Greg Thelen [Mon, 22 Jun 2026 16:16:59 +0000 (09:16 -0700)]
tools: ynl: build archives with $(AR)
Use $(AR) to allow build system to override the archiver tool (e.g.,
when cross-compiling for a different architecture) by setting the AR
environment variable.
GNU Make defaults AR to ar, so this change will not break existing build
environments that do not explicitly set AR.
Fixes: 07c3cc51a085 ("tools: net: package libynl for use in selftests") Fixes: 86878f14d71a ("tools: ynl: user space helpers") Signed-off-by: Greg Thelen <gthelen@google.com> Link: https://patch.msgid.link/20260622161659.145047-1-gthelen@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arnd Bergmann [Mon, 22 Jun 2026 12:41:07 +0000 (14:41 +0200)]
eth: mlx5: fix macsec dependency
Configurations with mlx5 built-in but macsec=m fail to link:
x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_add_gid_macsec_operations':
macsec.c:(.text+0x77d): undefined reference to `macsec_netdev_is_offloaded'
x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_del_gid_macsec_operations':
macsec.c:(.text+0xe81): undefined reference to `macsec_netdev_is_offloaded'
Fix the dependency so this configuration cannot happen.
Maoyi Xie [Mon, 22 Jun 2026 08:01:57 +0000 (16:01 +0800)]
net: usb: kalmia: bound RX frame length in kalmia_rx_fixup()
kalmia_rx_fixup() computes usb_packet_length = skb->len - (2 *
KALMIA_HEADER_LENGTH) as a u16, guarded only by a pre-loop check that
skb->len is at least KALMIA_HEADER_LENGTH, which is 6. A device can
deliver a short bulk-IN frame with skb->len in the 6 to 11 range, or
leave a short trailing remainder on a later loop iteration. Either case
underflows usb_packet_length to about 65530.
That bypasses the usb_packet_length < ether_packet_length truncation path.
The device-supplied ether_packet_length, a le16 up to 65535 read from
header_start[2], then drives a memcmp() and the following skb_trim() and
skb_pull() past the end of the rx buffer. The rx buffer is hard_mtu * 10,
which is 14000 bytes. That is an out of bounds read.
Require both the start and end framing headers to be present before
subtracting them, on every loop iteration.
Fixes: d40261236e8e ("net/usb: Add Samsung Kalmia driver for Samsung GT-B3730") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/178211531778.2216480.12637613349790980750@maoyixie.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
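The u16 wrap-around described in the kalmia fix above is easy to reproduce in user space. In this sketch KALMIA_HEADER_LENGTH uses the value 6 stated in the message; both helper functions and KALMIA_FRAMING_LENGTH are hypothetical and only mirror the arithmetic, not the driver's loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KALMIA_HEADER_LENGTH  6u /* value stated in the commit message */
#define KALMIA_FRAMING_LENGTH (2u * KALMIA_HEADER_LENGTH)

/* Unchecked: wraps around when skb_len < KALMIA_FRAMING_LENGTH. */
static uint16_t payload_len_unchecked(uint32_t skb_len)
{
	return (uint16_t)(skb_len - KALMIA_FRAMING_LENGTH);
}

/* Checked: subtract only when both framing headers are present. */
static bool payload_len_checked(uint32_t skb_len, uint16_t *out)
{
	assert(out != NULL);

	if (skb_len < KALMIA_FRAMING_LENGTH)
		return false;

	*out = (uint16_t)(skb_len - KALMIA_FRAMING_LENGTH);
	return true;
}
```

For skb_len values from 6 to 11 the unchecked helper yields 65530 to 65535, matching the "about 65530" figure in the message, while the checked helper rejects the frame.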
Xiang Mei [Thu, 18 Jun 2026 03:26:22 +0000 (20:26 -0700)]
geneve: validate inner network offset in geneve_gro_complete()
Even with both paths gated on gs->gro_hint, geneve_gro_complete()
re-derives the inner dispatch type and length from the packet and the
current gs->gro_hint, independently of geneve_gro_receive(). The two can
disagree if gs->gro_hint flips under a concurrent geneve_quiesce()/
geneve_unquiesce() (sk_user_data is NULL across a synchronize_net()), or if
the re-read option bytes differ from the ones receive parsed.
geneve_gro_receive() already records the inner network header position in
NAPI_GRO_CB()->inner_network_offset. Have geneve_gro_complete() compute the
offset it is about to dispatch at, adding ETH_HLEN in the ETH_P_TEB case
where eth_gro_complete() steps over the inner MAC header, and bail out if
it lands past inner_network_offset.
Use a lower bound rather than exact equality: between gh_len and the inner
L3 header, geneve_gro_receive() may also have pulled an inner VLAN tag
(vlan_gro_receive() advances the recorded offset past it), which only moves
inner_network_offset further out. A valid frame therefore always satisfies
inner_nh <= inner_network_offset, while a gh_len inflated by a hint
gro_receive() did not honour dispatches past the validated inner header,
i.e. the out-of-bounds completion. Only the latter is rejected.
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path") Suggested-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260618032622.484720-2-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
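The lower-bound check described in the geneve_gro_complete() fix above can be written as a small predicate. This is a sketch under assumptions: inner_offset_is_valid() and its parameters are hypothetical, and the offsets used in any call are illustrative values rather than ones taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN 14u /* Ethernet header length in bytes */

/*
 * Return true when the offset about to be dispatched at does not land
 * past the inner network header position recorded by the receive path.
 */
static bool inner_offset_is_valid(unsigned int nhoff, unsigned int gh_len,
				  bool is_teb,
				  unsigned int inner_network_offset)
{
	unsigned int inner_nh = nhoff + gh_len;

	/* For ETH_P_TEB the inner MAC header is stepped over as well. */
	if (is_teb)
		inner_nh += ETH_HLEN;

	assert(inner_nh >= nhoff); /* no wrap expected for packet offsets */
	return inner_nh <= inner_network_offset;
}
```

An inner VLAN tag only moves the recorded offset further out, so such a frame still passes, whereas an inflated gh_len pushes inner_nh beyond the recorded offset and is rejected.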
Xiang Mei [Thu, 18 Jun 2026 03:26:21 +0000 (20:26 -0700)]
geneve: gate GRO hint in geneve_gro_complete() on gs->gro_hint
geneve_gro_receive() reads the GRO hint through geneve_sk_gro_hint_off(),
which honours it only when the socket enabled IFLA_GENEVE_GRO_HINT
(gs->gro_hint). geneve_gro_complete() instead calls the low-level
geneve_opt_gro_hint_off() and acts on the hint unconditionally.
On a tunnel without the hint, receive aggregates the frames as plain
ETH_P_TEB while complete still honours an attacker-supplied hint option: it
inflates gh_len by gro_hint->nested_hdr_len (u8) and redirects the dispatch
type, so the inner gro_complete handler runs at nhoff + gh_len, an offset
receive never pulled nor validated, reading out of bounds of the skb head:
BUG: KASAN: slab-out-of-bounds in ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
Read of size 1 at addr ffff88800fe91980 by task exploit/153
ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
geneve_gro_complete (drivers/net/geneve.c:965)
udp_gro_complete (net/ipv4/udp_offload.c:940)
inet_gro_complete (net/ipv4/af_inet.c:1621)
__gro_flush (net/core/gro.c:306)
Gate the complete path on gs->gro_hint too via geneve_sk_gro_hint_off(), so
both paths agree. Tunnels that enable the hint are unaffected.
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path") Reported-by: Weiming Shi <bestswngs@gmail.com> Reported-by: Kyle Zeng <kylebot@openai.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260618032622.484720-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yun Zhou [Mon, 22 Jun 2026 07:43:50 +0000 (15:43 +0800)]
net: mvneta: re-enable percpu interrupt on resume
On Marvell MPIC platforms (Armada 370/XP/38x), mvneta uses a percpu
IRQ disable/enable scheme for NAPI: the ISR (mvneta_percpu_isr) calls
disable_percpu_irq() to mask the MPIC per-CPU interrupt and schedules
NAPI poll, which calls enable_percpu_irq() on completion to unmask.
If suspend occurs while NAPI poll is pending (between
disable_percpu_irq in the ISR and enable_percpu_irq in poll
completion), the interrupt is never re-enabled:
1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
=> MPIC masked, percpu_enabled cpumask bit cleared
2. NAPI poll does not complete before suspend proceeds
(on PREEMPT_RT this is highly likely since softirqs run in
ksoftirqd which gets frozen; on non-RT it can happen when
softirq processing is deferred to ksoftirqd)
3. mvneta_stop_dev => napi_disable(): cancels the pending poll
without executing the completion path
4. suspend_device_irqs => IRQCHIP_MASK_ON_SUSPEND: masks MPIC
(already masked, but records IRQS_SUSPENDED)
5. Resume: mpic_resume checks irq_percpu_is_enabled() => false
(bit was cleared in step 1) => skips unmask
6. mvneta_start_dev only restores device-level INTR_NEW_MASK,
does not touch the MPIC per-CPU mask
Result: MPIC per-CPU interrupt stays masked permanently. The NIC
generates interrupts (INTR_NEW_CAUSE != 0) but the CPU never
receives them, causing complete loss of network connectivity.
Fix by calling on_each_cpu(mvneta_percpu_enable) in the resume path
to unconditionally unmask the MPIC per-CPU interrupt regardless of
pre-suspend state.
Fixes: 12bb03b436da ("net: mvneta: Handle per-cpu interrupts") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260622074350.1666290-1-yun.zhou@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ratheesh Kannoth [Mon, 22 Jun 2026 03:42:29 +0000 (09:12 +0530)]
octeontx2-af: fix CGX debugfs RVU AF PCI reference leaks
CGX per-lmac debugfs seq readers obtained struct rvu via
pci_get_drvdata(pci_get_device(..., PCI_DEVID_OCTEONTX2_RVU_AF, ...)),
which leaks a PCI device reference on every read. Store rvu and the CGX
handle in debugfs inode private data when creating stats, mac_filter,
and fwdata files (one context per CGX), and use debugfs aux numbers for
fwdata so lmac_id matches the other CGX debugfs entries.
Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters") Fixes: dbc52debf95f ("octeontx2-af: Debugfs support for DMAC filters") Fixes: 49f02e6877d1 ("Octeontx2-af: Debugfs support for firmware data") Cc: Linu Cherian <lcherian@marvell.com> Reported-by: Yuho Choi <dbgh9129@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260622034229.2254145-1-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The maximum number of NIX LFs can be set via a devlink command, but only
before any LFs are assigned to a PF/VF. The condition used to check
whether any LFs are assigned is incorrect. This patch fixes that
condition.
Haoxiang Li [Sun, 21 Jun 2026 03:17:14 +0000 (11:17 +0800)]
net: wwan: t7xx: destroy DMA pool on CLDMA late init failure
t7xx_cldma_late_init() creates md_ctrl->gpd_dmapool before
initializing the TX and RX rings. If any ring initialization
fails, the error path frees the already initialized rings but
leaves the DMA pool allocated.
Destroy md_ctrl->gpd_dmapool on the late-init failure path
to avoid leaking the DMA pool.
Fixes: 39d439047f1d ("net: wwan: t7xx: Add control DMA interface") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Link: https://patch.msgid.link/20260621031714.3605022-1-haoxiang_li2024@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
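The unwind order described in the t7xx fix above follows the usual goto-based cleanup pattern. The model below is hypothetical: struct fake_ctrl and fake_late_init() only track flags, with comments marking where dma_pool_create() and dma_pool_destroy() would sit in kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping instead of real DMA resources. */
struct fake_ctrl {
	bool pool_allocated;
	int rings_ready;
};

/* Initialise nr_rings rings; ring number failing_ring fails (-1: none). */
static int fake_late_init(struct fake_ctrl *c, int nr_rings, int failing_ring)
{
	int i;

	assert(c);
	c->pool_allocated = true; /* stands in for dma_pool_create() */
	c->rings_ready = 0;

	for (i = 0; i < nr_rings; i++) {
		if (i == failing_ring)
			goto err_free_rings;
		c->rings_ready++;
	}
	return 0;

err_free_rings:
	c->rings_ready = 0;        /* rings were already released before the fix */
	c->pool_allocated = false; /* added step, stands in for dma_pool_destroy() */
	return -1;
}
```

After a failed call nothing remains allocated, which is the property the fix restores for the DMA pool.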
Lorenzo Bianconi [Sat, 20 Jun 2026 15:04:51 +0000 (17:04 +0200)]
net: airoha: fix BQL underflow in shared QDMA TX ring
When multiple netdevs share a QDMA TX ring and one device is stopped,
netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
pending skbs remain in the shared HW TX ring. When NAPI later completes
those skbs via netdev_tx_completed_queue(), the already-zeroed
dql->num_queued counter underflows.
Fix the issue:
- Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
skbs are completed naturally by NAPI with proper BQL accounting.
- Rework airoha_qdma_tx_cleanup() to disable TX DMA, flush BQL
counters, DMA-unmap and free all pending skbs while skb->dev
references are still valid. Use a per-queue flushing flag checked
under q->lock in airoha_dev_xmit() to prevent races between teardown
and transmit. Call airoha_qdma_stop_napi() before
airoha_qdma_tx_cleanup() at the call sites.
- Move DMA engine start into probe. Split DMA teardown so TX DMA is
disabled in airoha_qdma_tx_cleanup() and RX DMA in
airoha_qdma_cleanup().
- Remove qdma->users counter since DMA lifetime is now tied to
probe/cleanup rather than per-netdev open/stop.
Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260620-airoha-bql-fixes-v3-1-76b95374e63e@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jan Klos [Sat, 20 Jun 2026 01:19:53 +0000 (03:19 +0200)]
net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit
On an RTL8127A connected to a link partner that advertises 10000baseT,
the speed cannot be changed to anything other than 10000baseT because
10GbE is always advertised regardless of any setting. Fix this by
clearing the MDIO_AN_10GBT_CTRL_ADV10G bit in rtl822x_config_aneg()'s
call to phy_modify_mmd_changed().
Fixes: 83d962316128 ("net: phy: realtek: add RTL8127-internal PHY") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: Jan Klos <honza.klos@gmail.com> Link: https://patch.msgid.link/20260620011956.37181-1-honza.klos@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
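Why a bit can stay set across a masked register update, as in the RTL8127A fix above, is sketched below. modify_bits() is a hypothetical helper that models a generic "clear mask, then set" update; the register bit values are quoted from memory of the uapi MDIO header and should be treated as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values; check them against the MDIO header in your tree. */
#define MDIO_AN_10GBT_CTRL_ADV2_5G 0x0080
#define MDIO_AN_10GBT_CTRL_ADV5G   0x0100
#define MDIO_AN_10GBT_CTRL_ADV10G  0x1000

/* Generic masked update: clear the bits in 'mask', then OR in 'set'. */
static uint16_t modify_bits(uint16_t old, uint16_t mask, uint16_t set)
{
	assert((set & ~mask) == 0); /* only masked bits may be set */
	return (uint16_t)((old & ~mask) | set);
}
```

A bit that is absent from the mask survives the update unchanged, so a 10G advertisement bit left out of the mask would remain set whatever the requested advertisement is. Including it in the mask lets the update clear it.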
Ratheesh Kannoth [Fri, 19 Jun 2026 09:51:00 +0000 (15:21 +0530)]
octeontx2-af: npc: cn20k: Fix subbank free list indexing for search order
subbank_srch_order[i] is the physical subbank at search-order slot i,
so each subbank's arr_idx must be i (its slot), not
subbank_srch_order[sb->idx]. The old logic mis-keyed xa_sb_free
and broke allocation traversal order.
Populate arr_idx and xa_sb_free in a single pass over the search
order after subbank structs are initialized.
Fixes: 7ac9d4c4075c ("octeontx2-af: npc: cn20k: add subbank search order control") Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260619095100.1864440-1-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
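The indexing rule in the cn20k subbank fix above amounts to inverting the search-order array. A minimal sketch, assuming a simplified struct subbank and hypothetical helper names; the xa_sb_free bookkeeping is not modelled.

```c
#include <assert.h>

/* Simplified stand-in: idx is the physical subbank number. */
struct subbank {
	unsigned int idx;
	unsigned int arr_idx; /* slot in the search order */
};

/*
 * srch_order[i] names the physical subbank at search slot i, so that
 * subbank's arr_idx is i. One pass over the search order is enough.
 */
static void assign_search_slots(struct subbank *sb,
				const unsigned int *srch_order,
				unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		assert(srch_order[i] < n);
		sb[srch_order[i]].arr_idx = i;
	}
}

/* The mis-keyed variant described above, kept only for comparison. */
static unsigned int old_arr_idx(const struct subbank *sb,
				const unsigned int *srch_order)
{
	return srch_order[sb->idx];
}
```

For the search order {2, 0, 3, 1}, subbank 0 sits at slot 1, while the old expression would have reported 2 for it.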
net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0
Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().
Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.
MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.
Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.
Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG") Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260619055348.467224-1-ernis@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
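The three cases described in the mana fix above (zero, too small, valid) can be expressed as one validation helper. resolve_adapter_mtu() is a hypothetical function written for illustration; the constants use the standard Ethernet values.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN      14u   /* Ethernet header length */
#define ETH_MIN_MTU   68u   /* minimum IPv4 MTU */
#define ETH_FRAME_LEN 1514u /* maximum frame length without FCS */

/*
 * Turn the value reported by the PF into the MTU to use.
 * Returns 0 on success or -EPROTO for small but non-zero values.
 */
static int resolve_adapter_mtu(uint32_t reported, uint32_t *adapter_mtu)
{
	assert(adapter_mtu != NULL);

	/* Zero means "not advertised": fall back to the standard size. */
	if (reported == 0) {
		*adapter_mtu = ETH_FRAME_LEN;
		return 0;
	}

	/* Small non-zero values would make later subtractions wrap. */
	if (reported < ETH_MIN_MTU + ETH_HLEN)
		return -EPROTO;

	*adapter_mtu = reported;
	return 0;
}
```

Checking for zero first keeps the fallback separate from the hardening check, so bogus non-zero values are still rejected.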
dsa_unregister_switch() frees the dsa_port objects. If a CRC error
schedules the work during teardown it can run after the ports have been
freed and dereference freed memory.
Guard the port walk with MXL862XX_FLAG_WORK_STOPPED, which is already set
before dsa_unregister_switch(). DSA tears the ports down under
rtnl_lock(), so checking the flag under rtnl_lock() means the work either
runs before teardown and sees valid ports, or runs afterwards, observes
the flag and skips the walk. This mirrors the host_flood_work handler,
which skips torn-down ports under rtnl_lock().
Daniel Golle [Fri, 19 Jun 2026 03:39:25 +0000 (04:39 +0100)]
net: dsa: mxl862xx: avoid unaligned 16-bit access in api_wrap
The MXL862XX_API_* macros pass the address of a stack-allocated, __packed
firmware-ABI struct to mxl862xx_api_wrap() as a void *. The struct has an
alignment of 1, so the compiler is free to place it at an odd address.
mxl862xx_api_wrap() reinterprets that buffer as a __le16 * and accesses it
with data[i], for which the compiler assumes the natural 2-byte alignment
of __le16 and emits aligned 16-bit loads/stores (e.g. lhu/sh on MIPS).
When the buffer lands on an odd address these fault on architectures that
do not support unaligned access, such as MIPS32.
-Waddress-of-packed-member does not catch this: the packed origin is
laundered through the void * parameter, so the cast inside api_wrap looks
alignment-safe to the compiler and no warning is emitted.
Use get_unaligned_le16()/put_unaligned_le16() for the three 16-bit word
accesses. The byte accesses (*(u8 *)&data[i], crc16()) are already safe
and are left unchanged.
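A user-space analogue of the unaligned accessors used in the mxl862xx fix above, as a sketch. load_le16() and store_le16() are hypothetical names; they assemble the value byte by byte, which is one portable way to get the effect the kernel's get_unaligned_le16()/put_unaligned_le16() provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value from any address, aligned or not. */
static uint16_t load_le16(const void *p)
{
	const uint8_t *b = p;

	assert(p != NULL);
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Write a little-endian 16-bit value to any address, aligned or not. */
static void store_le16(uint16_t v, void *p)
{
	uint8_t *b = p;

	assert(p != NULL);
	b[0] = (uint8_t)(v & 0xff);
	b[1] = (uint8_t)(v >> 8);
}
```

Only byte accesses are issued, so the helpers work at odd addresses on architectures that fault on unaligned 16-bit loads and stores.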
David Yang [Thu, 18 Jun 2026 14:01:55 +0000 (22:01 +0800)]
net: dsa: realtek: fix memory leak in rtl8366rb_setup_led()
led_classdev_register_ext() only reads init_data.devicename; it never
stores the pointer. The caller, however, allocated devicename with
kasprintf() and never freed it, leaking the string memory.
Fix it with a stack buffer to avoid dynamic buffers completely.
Fixes: 32d617005475 ("net: dsa: realtek: add LED drivers for rtl8366rb") Signed-off-by: David Yang <mmyangfl@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260618140200.1888707-1-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
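The stack-buffer approach from the rtl8366rb fix above might look like the sketch below. format_led_name() and the "port%u-led%u" naming scheme are hypothetical and are not the driver's actual format string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a device name into a caller-provided buffer. Returns false
 * if the name was truncated or formatting failed.
 */
static bool format_led_name(char *buf, size_t len,
			    unsigned int port, unsigned int group)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "port%u-led%u", port, group);
	return n >= 0 && (size_t)n < len;
}
```

A caller can keep the buffer as a local array that lives only for the duration of the registration call, so there is nothing to free afterwards.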
ixp4xx_hss_probe() allocates two HDLC netdevs. The first one is stored
in ndev, initialized, and registered with register_hdlc_device(). The
second one is stored in port->netdev and later used by the remove path
for unregister_hdlc_device() and free_netdev().
This means that the registered netdev is not the same object that is
unregistered and freed on remove. It also leaks the first allocation if
the second alloc_hdlcdev() call fails, and the first allocation is not
checked before ndev is used.
Older code allocated the HDLC netdev only once and stored the same object
in both the local variable and port->netdev. The buggy conversion split
this into two alloc_hdlcdev() calls. A later rename changed the local
variable name to ndev, but the underlying mismatch remained.
Fix this by allocating the HDLC netdev only once and assigning the same
object to port->netdev.
Fixes: 99ebe65eb9c0 ("net: ixp4xx_hss: move out assignment in if condition") Cc: stable@vger.kernel.org Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260622043015.643637-1-haoxiang_li2024@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Fri, 19 Jun 2026 11:37:14 +0000 (13:37 +0200)]
net: airoha: fix netif_set_real_num_tx_queues for sparse QoS channels
airoha_tc_htb_alloc_leaf_queue() assigns queue IDs based on the channel
index (opt->qid = AIROHA_NUM_TX_RING + channel), but updates
real_num_tx_queues with a simple increment (num_tx_queues + 1). When QoS
channels are allocated sparsely (e.g., channels 0 and 3 without 1 and
2), the returned qid can exceed real_num_tx_queues, causing out-of-bounds
accesses in the networking stack.
For example, allocating channel 0 then channel 3 results in
real_num_tx_queues = 34 but qid = 35, which is out of range [0, 34).
Fix this by computing real_num_tx_queues based on the highest active
channel index rather than using a simple counter, in both the allocation
and deletion paths.
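Deriving the queue count from the highest active channel, as the airoha fix above describes, can be sketched like this. real_num_tx_queues() is a hypothetical helper, the bitmap parameter is a simplified stand-in for qos_sq_bmap, and AIROHA_NUM_TX_RING is set to 32 because that value is implied by the example above (channel 3 gives qid 35).

```c
#include <assert.h>
#include <stdint.h>

#define AIROHA_NUM_TX_RING 32u /* inferred from the qid = 35 example */

/* Number of TX queues needed so that every active channel's qid is valid. */
static unsigned int real_num_tx_queues(uint32_t qos_sq_bmap)
{
	unsigned int highest = 0;
	unsigned int bit;

	/* No QoS channel in use: only the base rings are exposed. */
	if (qos_sq_bmap == 0)
		return AIROHA_NUM_TX_RING;

	for (bit = 0; bit < 32; bit++)
		if (qos_sq_bmap & (UINT32_C(1) << bit))
			highest = bit;

	assert(highest < 32);
	return AIROHA_NUM_TX_RING + highest + 1;
}
```

With channels 0 and 3 active the result is 36, so qid 35 falls inside [0, 36). A plain counter would have stopped at 34.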
Lorenzo Bianconi [Fri, 19 Jun 2026 11:37:13 +0000 (13:37 +0200)]
net: airoha: Fix off-by-one in airoha_tc_remove_htb_queue()
airoha_tc_htb_alloc_leaf_queue() computes the HTB QoS channel index
as opt->classid % AIROHA_NUM_QOS_CHANNELS and stores it in qos_sq_bmap.
However, airoha_tc_remove_htb_queue() clears the HTB configuration
using queue + 1 as the channel index, causing an off-by-one error.
Use queue directly as the QoS channel index to match the allocation
logic.
Jakub Kicinski [Mon, 22 Jun 2026 15:47:53 +0000 (08:47 -0700)]
eth: fbnic: fix ordering of heartbeat vs ownership
When requesting ownership of the NIC (MAC/PHY control), we set up
the heartbeat to look stale:
/* Initialize heartbeat, set last response to 1 second in the past
* so that we will trigger a timeout if the firmware doesn't respond
*/
fbd->last_heartbeat_response = req_time - HZ;
fbd->last_heartbeat_request = req_time;
The scheme is a bit odd, but it should work in principle.
Fix the ordering of operations. We have to set up the stale heartbeat
before we send the message. Otherwise if the response is very fast
we will override it. This triggers on QEMU if we run on the core
that handles the IRQ, and results in ndo_open failing with ETIMEDOUT.
The change in ordering doesn't impact releasing the ownership.
Both ndo_stop and heartbeat check are under rtnl_lock.
Fixes: 20d2e88cc746 ("eth: fbnic: Add initial messaging to notify FW of our presence") Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20260622154753.827506-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ipv6: fix error handling in disable_ipv6 sysctl
While working on a different IPv6 patch series I have spotted multiple
minor bugs around sysctl error handling and notifications. In general,
they are not serious issues.
In addition, there is one more issue in forwarding sysctl as it does not
check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
of this series and I am aiming it at the net-next tree once it re-opens.
During v3, Ido pointed out that it is unnecessary to reset the
position pointer when the return value is negative, as in
new_sync_write() the ppos is only advanced when the return value is
positive. That means we can get rid of that operation in ipv4/ipv6
sysctls. That is going to be sent to net-next too.
====================
ipv6: fix missing notification for ignore_routes_with_linkdown
When changing the ignore_routes_with_linkdown sysctl for a specific
interface, the RTM_NEWNETCONF netlink notification was not being emitted
to userspace. Fix this by emitting the notification when needed.
In addition, fix bogus return value for successful "all" and specific
interface write operation leading to a wrong reset of the position
pointer.
Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-7-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6: fix state corruption during proxy_ndp sysctl restart
When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but as the value was already modified by the initial
proc_dointvec() call, the restarted syscall will read the newly modified
value as the 'old' state.
Fix this by taking the RTNL lock before parsing the input value if the
operation is a write.
Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-6-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.
Fix this by checking the return value of proc_dointvec() and returning
early on failure.
Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-5-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When writing to the forwarding sysctl, if proc_dointvec() fails to parse
the input, it returns a negative error code. The current implementation
is overwriting that error for write operations.
This results in a silent failure: it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.
Fix this by checking the return value of proc_dointvec() and returning
early on failure. In addition, adjust return code of
addrconf_fixup_forwarding() for successful operation.
Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-4-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6: fix error handling in ignore_routes_with_linkdown sysctl
When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.
This results in a silent failure: it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.
Fix this by checking the return value of proc_dointvec() and returning
early on failure.
Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-3-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.
This results in a silent failure: it returns a successful write although
the configuration was not modified at all. When modifying the "all"
variant it can also modify the configuration of existing interfaces to
the wrong value.
Fix this by checking the return value of proc_dointvec() and returning
early on failure.
Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters") Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260622130857.5115-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
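The "check the return value and return early" pattern used across this series can be modelled in userspace C. This is a sketch under assumptions: parse_int() is a hypothetical stand-in for proc_dointvec() built on strtol(), and struct conf_model and sysctl_write() are illustrative names, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for proc_dointvec(): parse one integer and
 * return a negative errno value when the input is not a valid int. */
static int parse_int(const char *buf, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (end == buf || *end != '\0' || errno || v < INT_MIN || v > INT_MAX)
		return -EINVAL;
	*out = (int)v;
	return 0;
}

struct conf_model {
	int disable_ipv6;
};

/* Write handler: a parse error is propagated and the stored
 * configuration is left untouched. */
static int sysctl_write(struct conf_model *conf, const char *buf)
{
	int val;
	int ret;

	assert(conf != NULL && buf != NULL);

	ret = parse_int(buf, &val);
	if (ret)
		return ret;	/* return early instead of reporting success */

	conf->disable_ipv6 = val;
	return 0;
}
```

Writing "1" stores 1 and returns 0, while writing "abc" returns -EINVAL and keeps the previous value, so the caller no longer sees a successful write for input that was never applied.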
Runyu Xiao [Fri, 19 Jun 2026 15:18:16 +0000 (23:18 +0800)]
net: au1000: move free_irq out of the close-time spinlocked section
au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.
This was found by our static analysis tool and then confirmed by manual
review of the in-tree au1000_close() .ndo_stop path. The reviewed path
keeps aup->lock held across the MAC reset, queue stop and
free_irq(dev->irq, dev).
A directed runtime validation kept that ndo_stop carrier and the same
free_irq(dev->irq, dev) operation under the driver lock. Lockdep reported
"BUG: sleeping function called from invalid context" and "Invalid wait
context" while free_irq() was taking desc->request_mutex, with
au1000_close() and free_irq() on the stack.
Drop aup->lock before freeing the IRQ. The protected close-time work still
stops the device and queue before IRQ teardown, but the sleepable IRQ core
path now runs outside the spinlocked section.
Xin Long [Sat, 20 Jun 2026 15:48:54 +0000 (11:48 -0400)]
sctp: fix err_chunk memory leaks in INIT handling
When sctp_verify_init() encounters unrecognized parameters, it allocates an
err_chunk to report them. However, this chunk is leaked in several code
paths:
1. In sctp_sf_do_5_1B_init(), if security_sctp_assoc_request() fails after
sctp_verify_init() has populated err_chunk, the function returns
immediately without freeing it.
2. In sctp_sf_do_unexpected_init(), the same leak occurs on the
security_sctp_assoc_request() failure path.
3. In sctp_sf_do_unexpected_init(), on the success path after copying
unrecognized parameters to the INIT-ACK, the function returns without
freeing err_chunk, unlike sctp_sf_do_5_1B_init() which properly frees
it.
Fix all three leaks by adding sctp_chunk_free(err_chunk) calls before
returning in the error paths and on the success path in
sctp_sf_do_unexpected_init().
Fixes: c081d53f97a1 ("security: pass asoc to sctp_assoc_request and sctp_sk_clone") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/0656704f1b0158287c98aec09ba36c83e4a537ab.1781970534.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Sat, 20 Jun 2026 13:07:49 +0000 (09:07 -0400)]
net/sched: cls_api: Handle TC_ACT_CONSUMED in tcf_qevent_handle
tcf_classify() can return TC_ACT_CONSUMED while the skb is held by the
defragmentation engine (e.g. act_ct on out-of-order fragments). When
that happens the skb is no longer owned by the caller and must not be
touched again.
tcf_qevent_handle() did not handle TC_ACT_CONSUMED: it fell through the
switch and returned the skb to the caller as if classification had
passed. The only qdisc that wires up qevents today is RED, via three
call sites (qe_mark on RED_PROB_MARK/HARD_MARK, qe_early_drop on
congestion_drop). In this case red_enqueue() continued to operate on an
skb it no longer owned (enqueueing it, dropping it, or updating
statistics), resulting in a UAF. This can be triggered with ct defrag
enabled and traffic that produces out-of-order fragments, e.g. a
fragmented UDP stream.
Handle TC_ACT_CONSUMED in tcf_qevent_handle() the same way the ingress
and egress fast paths do: treat it as stolen and return NULL without
touching the skb. Unlike the TC_ACT_STOLEN case, the skb must not be
dropped/freed here, as it is no longer owned by us.
Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags") Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260620130749.226642-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
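The ownership rule described above can be illustrated with a small userspace model. The enum values and struct below are hypothetical stand-ins for the TC_ACT_* verdicts and the skb; the sketch only shows the shape of the switch, not the real tcf_qevent_handle().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts modelling the TC_ACT_* values named in the text. */
enum act_model { ACT_OK, ACT_SHOT, ACT_STOLEN, ACT_CONSUMED };

struct pkt_model {
	int drop_count;	/* incremented when this handler drops the packet */
};

/* Returns the packet if the caller may keep using it, NULL otherwise. */
static struct pkt_model *qevent_handle(struct pkt_model *pkt,
				       enum act_model verdict)
{
	assert(pkt != NULL);

	switch (verdict) {
	case ACT_SHOT:
	case ACT_STOLEN:
		pkt->drop_count++;	/* still ours, so we drop it here */
		return NULL;
	case ACT_CONSUMED:
		/* Owned by someone else now: do not read, write or free. */
		return NULL;
	case ACT_OK:
		break;
	}
	return pkt;
}
```

In this model both ACT_STOLEN and ACT_CONSUMED make the caller stop using the packet, but only the former touches it. Without the ACT_CONSUMED case the function would fall out of the switch and hand the packet back.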
Jakub Sitnicki [Fri, 19 Jun 2026 17:09:29 +0000 (19:09 +0200)]
selftests/bpf: Add LWT encap tests for skb metadata
Test that an LWT encapsulation does not silently corrupt XDP metadata
sitting in the skb headroom. Exercise all three LWT dispatch paths:
- BPF LWT xmit prog reserves headroom on the LWT .xmit redirect,
- mpls pushes an MPLS label on the LWT .xmit redirect,
- seg6 in encap mode runs on the LWT .input redirect,
- ioam6 encap inserts an IOAM Hop-by-Hop option on LWT .output redirect.
Jakub Sitnicki [Fri, 19 Jun 2026 17:09:28 +0000 (19:09 +0200)]
net: lwtunnel: Drop skb metadata before LWT encapsulation
skb metadata is meant for passing information between XDP and TC. It lives
in the skb headroom, immediately before skb->data. LWT programs cannot
access the __sk_buff->data_meta pseudo-pointer to metadata.
However, LWT encapsulation prepends outer headers, moving skb->data back
over the headroom where the metadata sits. On an RX-originated (forwarded)
packet that still carries XDP metadata this goes wrong in two different
ways, depending on the encap type:
1. Non-BPF LWT encaps (mpls, seg6, ioam6 ...) call skb_push()/skb_pull()
and silently overwrite the metadata that sits in the headroom.
2. BPF LWT xmit calls bpf_skb_change_head(), which uses skb_data_move().
That helper expects metadata immediately before skb->data. But since
the IP output path runs LWT xmit before neighbour output has built
the outgoing L2 header, for forwarded packets skb->data points at the
L3 header while skb_mac_header() still points at the old L2 header.
skb_data_move() sees metadata ending at skb_mac_header(), not before
skb->data, warns and clears metadata:
Every encap funnels through the three LWT dispatch helpers, so drop the
metadata there, right before handing the skb to the encap op. This
single chokepoint covers all encap types and all three redirect modes:
Alternatively, we could clear the metadata right after the TC ingress
hook. That would require a compromise, however. Metadata would become
inaccessible from TC egress (in setups where it actually reaches the
hook intact, that is, without any L2 tunnels on the path).
1) xfrm: use compat translator only for u64 alignment mismatch
Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT
so 32-bit compat tasks on arches whose 32-bit ABI already matches
the native 64-bit layout are no longer rejected with -EOPNOTSUPP.
From Sanman Pradhan.
2) net: af_key: initialize alg_key_len for IPComp states
Initialize the alg_key_len to 0 in the IPComp branch of
pfkey_msg2xfrm_state() so an uninitialized value cannot drive
xfrm_alg_len() into a slab-out-of-bounds kmemdup during
XFRM_MSG_MIGRATE. From Zijing Yin.
3) xfrm: Fix dev use-after-free in xfrm async resumption
Stash the original skb->dev and extend the RCU critical section
across xfrm_rcv_cb() and transport_finish() to prevent a
tunnel-device UAF and original-device refcount leak when a
callback replaces skb->dev. From Dong Chenchen.
4) xfrm: Fix xfrm state cache insertion race
Move the state-validity check inside xfrm_state_lock in the
input state cache insertion path so a state cannot be killed
between the check and the insert. From Herbert Xu.
5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count
and xfrm_policy_default to silence the KCSAN data race reported
on net->xfrm.policy_count. From Eric Dumazet.
6) espintcp: use sk_msg_free_partial to fix partial send
Replace the manual skmsg accounting in espintcp with
sk_msg_free_partial() so the skmsg stays consistent on every
iteration and the partial-send accounting bugs go away.
From Sabrina Dubroca.
7) xfrm: validate selector family and prefixlen during match
Reject mismatched address families in xfrm_selector_match() and
bound prefixlen in addr4_match()/addr_match() to prevent the
shift-out-of-bounds syzbot reported when an AF_UNSPEC selector
with a large prefixlen is matched against an IPv4 flow.
From Eric Dumazet.
* tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: validate selector family and prefixlen during match
espintcp: use sk_msg_free_partial to fix partial send
xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
xfrm: Fix xfrm state cache insertion race
xfrm: Fix dev use-after-free in xfrm async resumption
net: af_key: initialize alg_key_len for IPComp states
xfrm: use compat translator only for u64 alignment mismatch
====================
Nicolai Buchwitz [Mon, 22 Jun 2026 10:29:11 +0000 (12:29 +0200)]
net: usb: lan78xx: restore VLAN and hash filters after link up
Configured VLANs intermittently stop receiving traffic after a link
down/up cycle, e.g. when the network cable is unplugged and plugged back
in. VLAN filtering stays enabled but all VLAN-tagged frames are dropped
until a VLAN is added or removed again.
The LAN7801 datasheet (DS00002123E) states:
"A portion of the MAC operates on clocks generated by the Ethernet
PHY. During a PHY reset event, this portion of the MAC is designed to
not be taken out of reset until the PHY clocks are operational"
(section 8.10, MAC Reset Watchdog Timer)
"After a reset event, the RFE will automatically initialize the
contents of the VHF to 0h."
(section 7.1.4, VHF Organization)
Thus a link down/up cycle stops and restarts the PHY clock, resets the
PHY-clocked portion of the MAC, and the RFE clears its VLAN/DA hash
filter (VHF) memory. The VHF holds both the VLAN filter table and the
multicast hash table, but the driver never reprograms either from its
shadow copy once the link is back, so both stay empty.
Reprogram the VLAN filter and multicast hash tables on link up.
Reported-by: Sven Schuchmann <schuchmann@schleissheimer.de> Closes: https://lore.kernel.org/netdev/BEZP281MB224501E38B30BFDC4BD3D364D9E32@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM/T/#u Tested-by: Sven Schuchmann <schuchmann@schleissheimer.de> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver") Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de> Link: https://patch.msgid.link/20260622102911.484045-1-nb@tipi-net.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Mon, 22 Jun 2026 11:18:25 +0000 (11:18 +0000)]
veth: fix NAPI leak in XDP enable error path
During XDP enablement in veth, if xdp_rxq_info_reg() or
xdp_rxq_info_reg_mem_model() fails, the driver rolls back the changes.
However, the rollback loop:
for (i--; i >= start; i--) {
decrements the loop index 'i' before the first iteration. This
correctly skips unregistering the rxq for the failed index 'i' (as
registration failed or was already cleaned up), but it also
erroneously skips calling netif_napi_del() for rq[i].xdp_napi.
Since netif_napi_add() was already called for index 'i', this leaves
a dangling napi_struct in the device's napi_list. When the veth
device is later destroyed, the freed queue memory (which contains the
leaked NAPI structure) can be reused.
The subsequent device teardown iterates the NAPI list and
corrupts the reallocated memory, leading to UAF.
Fix this by explicitly deleting the NAPI association for the failed
index 'i' before rolling back the successfully configured queues.
Fixes: b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path") Reported-by: Guenter Roeck <groeck@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://patch.msgid.link/20260622111825.88337-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
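The rollback problem can be reproduced with a small userspace model. The struct and function names are hypothetical; the two booleans stand in for the netif_napi_add() and xdp_rxq_info_reg() state of each queue, and the `fixed` flag selects between the old and the corrected error path.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_QUEUES 4

struct rq_model {
	bool napi_added;	/* models netif_napi_add() having run */
	bool rxq_registered;	/* models a successful rxq registration */
};

/* Enable queues [start, end). Registration fails at index fail_at
 * (pass -1 for no failure). Returns 0 on success, -1 after rollback. */
static int enable_range(struct rq_model *rq, int start, int end,
			int fail_at, bool fixed)
{
	int i;

	assert(start >= 0 && start <= end && end <= NUM_QUEUES);

	for (i = start; i < end; i++) {
		rq[i].napi_added = true;
		if (i == fail_at)
			goto err;
		rq[i].rxq_registered = true;
	}
	return 0;

err:
	if (fixed)
		rq[i].napi_added = false;	/* undo the failed index too */
	for (i--; i >= start; i--) {
		rq[i].rxq_registered = false;
		rq[i].napi_added = false;
	}
	return -1;
}
```

With a failure at index 2 and fixed set to false, queues 0 and 1 are cleaned up but rq[2].napi_added stays true, which is the leaked NAPI. With fixed set to true every queue ends up fully torn down.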
Meghana Malladi [Thu, 18 Jun 2026 10:03:48 +0000 (15:33 +0530)]
net: ti: icssg: Fix XSK zero copy TX during application wakeup
emac_xsk_xmit_zc() handles TX for zero copy and is called
in napi context. The user application wakes up the kernel when
initiating a transmit, which triggers napi to start processing
the TX packets. The num_tx check inside emac_tx_complete_packets()
returns early if no packets were transferred, preventing the call
to emac_xsk_xmit_zc(). Remove this check to let application
wakeup initiate zero copy xmit traffic.
Add __netif_tx_lock() to ensure that the TX queue is protected
from concurrent access during the transmission of XDP frames.
This fixes netdev watchdog timeout for long runs.
net: dsa: sja1105: round up PTP perout pin duration
pin_duration is converted from the user-provided period to SJA1105
clock ticks and is later passed as the cycle_time argument to
future_base_time().
Very small period values may become zero after the conversion,
which can lead to a division by zero in future_base_time().
Round zero pin_duration up to 1 tick so that the smallest unsupported
periods use the minimum non-zero hardware duration instead of passing
zero to future_base_time().
Fixes: 747e5eb31d59 ("net: dsa: sja1105: configure the PTP_CLK pin as EXT_TS or PER_OUT") Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru> Link: https://patch.msgid.link/20260618110508.53094-1-aga@itb.spb.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
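A minimal sketch of the rounding, in userspace C. The 8 ns tick length is an assumption of this sketch, and next_cycle_start() is a hypothetical stand-in for a helper that divides by cycle_time; it is not the driver's future_base_time().

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NS 8	/* assumed tick length, for illustration only */

static uint64_t ns_to_ticks(uint64_t ns)
{
	return ns / TICK_NS;
}

/* Convert a period to ticks, rounding zero up to one tick. */
static uint64_t period_to_pin_duration(uint64_t period_ns)
{
	uint64_t pin_duration = ns_to_ticks(period_ns);

	if (!pin_duration)
		pin_duration = 1;
	return pin_duration;
}

/* Hypothetical helper that divides by cycle_time: first cycle start
 * at or after `now`, counting from `base`. */
static uint64_t next_cycle_start(uint64_t base, uint64_t cycle_time,
				 uint64_t now)
{
	assert(cycle_time != 0);	/* a zero here is the reported bug */

	if (now <= base)
		return base;

	uint64_t n = (now - base + cycle_time - 1) / cycle_time;

	return base + n * cycle_time;
}
```

In this model a 3 ns period would truncate to 0 ticks and trip the division; after rounding it becomes 1 tick, so next_cycle_start(0, period_to_pin_duration(3), 10) is well defined and returns 10.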
Eric Dumazet [Mon, 22 Jun 2026 11:01:08 +0000 (11:01 +0000)]
net: do not acquire dev->tx_global_lock in netdev_watchdog_up()
Marek Szyprowski reported a deadlock during system resume when virtio_net
driver is used.
The deadlock occurs because netif_device_attach() is called while holding
dev->tx_global_lock (via netif_tx_lock_bh() in virtnet_restore_up()).
netif_device_attach() calls __netdev_watchdog_up(), which now also tries
to acquire dev->tx_global_lock to synchronize with dev_watchdog().
This recursive lock acquisition results in a deadlock.
Fix this by removing the tx_global_lock acquisition from netdev_watchdog_up().
The critical state (watchdog_timer and watchdog_ref_held) is already
protected by dev->watchdog_lock, which was introduced in the blamed commit.
Fixes: 8eed5519e496 ("net: watchdog: fix refcount tracking races") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Closes: https://lore.kernel.org/netdev/a443376e-5187-4268-93b3-58047ef113a8@samsung.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260622110108.69541-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xiang Mei [Sat, 20 Jun 2026 20:15:31 +0000 (13:15 -0700)]
net, bpf: check master for NULL in xdp_master_redirect()
xdp_master_redirect() dereferences the result of
netdev_master_upper_dev_get_rcu() without a NULL check, but that helper
returns NULL when the receiving device has no upper-master adjacency.
The reach guard only checks netif_is_bond_slave(). On bond slave release
bond_upper_dev_unlink() drops the upper-master adjacency before clearing
IFF_SLAVE, so an XDP_TX reaching xdp_master_redirect() in that window
still passes netif_is_bond_slave() while master is already NULL, and
faults on master->flags at offset 0xb0:
The missing check dates back to the original code; commit 1921f91298d1
("net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master")
later added the master->flags read where the fault now lands but kept the
unconditional deref. Check master for NULL before use; a NULL master is
treated the same as one that is not up.
Fixes: 879af96ffd72 ("net, core: Add support for XDP redirection to slave device") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260620201531.180123-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
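The shape of the check can be shown with a userspace model. Everything here is hypothetical (struct dev_model, FLAG_UP, the verdict enum); the sketch only illustrates treating a NULL master like one that is not up, and does not reproduce xdp_master_redirect().

```c
#include <assert.h>
#include <stddef.h>

#define FLAG_UP 0x1u	/* stand-in for IFF_UP */

struct dev_model {
	unsigned int flags;
	struct dev_model *master;	/* NULL once the upper link is gone */
};

enum verdict_model { VERDICT_ABORT, VERDICT_REDIRECT };

/* Decide what to do with a frame received on a bond slave. */
static enum verdict_model master_redirect(const struct dev_model *rx_dev)
{
	const struct dev_model *master;

	assert(rx_dev != NULL);
	master = rx_dev->master;

	/* A NULL master is handled exactly like a master that is down. */
	if (!master || !(master->flags & FLAG_UP))
		return VERDICT_ABORT;

	return VERDICT_REDIRECT;
}
```

A slave whose master pointer has already been cleared takes the same path as a slave whose master is down, so master->flags is never read through a NULL pointer.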
====================
selftests/xsk: stabilize timeout test behavior
This series improves AF_XDP selftests by making timeout handling
explicit and fixing sources of non-determinism in xsk timeout tests.
Patch 1 introduces test_spec::poll_tmout and removes implicit
dependence on RX UMEM setup state for timeout behavior.
Patch 2 fixes thread harness sequencing by attaching XDP programs
before worker startup, removing signal-based termination, and using
barrier synchronization only for dual-thread runs.
Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
configuration does not leak into subsequent cases on shared-netdev
runs.
Together these changes make timeout handling easier to follow and
improve selftest stability, especially on real NIC runs.
====================
Tushar Vyavahare [Tue, 16 Jun 2026 15:49:52 +0000 (21:19 +0530)]
selftests/xsk: make poll timeout mode explicit
Stop inferring timeout behavior from RX UMEM initialization state.
That ties timeout semantics to setup internals and obscures intent.
Use test_spec::poll_tmout as the explicit timeout-mode selector in
TX and RX paths.
In RX, treat poll timeout as expected only in timeout mode.
In TX, let send_pkts() own loop completion in non-timeout mode
and use __send_pkts() only for progress and timeout detection.
This makes timeout logic explicit and keeps control flow predictable.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260616154955.1492560-2-tushar.vyavahare@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration
On helper registration, the maximum number of expectations cannot go over
NF_CT_EXPECT_MAX_CNT (255), but zero can be specified, in which case
nf_conntrack_expect_max applies. Turn zero into NF_CT_EXPECT_MAX_CNT;
otherwise, expectation LRU eviction on insertion is disabled.
Moreover, expand this sanity check to all expectation classes.
This max_expected policy is only tunable since userspace helpers are
available, so set the Fixes: tag to the commit that adds such infrastructure.
Remove the check for p->max_expected given this field must always
be non-zero after this patch.
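The registration-time policy can be sketched as follows. This is a userspace model with hypothetical names; the cap of 255 is the value quoted above, and returning -EINVAL for an oversized value is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EXPECT_MAX_CNT 255	/* NF_CT_EXPECT_MAX_CNT as quoted in the text */

/* Hypothetical stand-in for one expectation class policy. */
struct expect_policy_model {
	unsigned int max_expected;
};

/* Validate every class first, then replace zero with the cap, so a
 * rejected registration leaves the policies unmodified. */
static int normalize_expect_policies(struct expect_policy_model *p, size_t n)
{
	size_t i;

	assert(p != NULL);

	for (i = 0; i < n; i++)
		if (p[i].max_expected > EXPECT_MAX_CNT)
			return -EINVAL;

	for (i = 0; i < n; i++)
		if (p[i].max_expected == 0)
			p[i].max_expected = EXPECT_MAX_CNT;

	return 0;
}
```

After a successful call no class has max_expected equal to zero, which is why a later check for a non-zero p->max_expected becomes redundant.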
Florian Westphal [Tue, 23 Jun 2026 05:30:34 +0000 (07:30 +0200)]
netfilter: nft_ct: expectation timeouts are passed in milliseconds
Userspace passes '5000' in case user asks for 5 seconds.
Allowing for sub-second expectation lifetimes makes sense to me, so
fix up the kernel side instead of munging nft to send a value rounded
up to the next second.
Also note that this violates nft convention of passing integers in
network byte order, but we can't change this anymore.
netfilter: nf_conntrack_expect: store master_tuple in expectation
Store master conntrack tuple in the expectation since exp->master might
refer to a different conntrack when accessed from rcu read side lock
area due to typesafe rcu rules.
Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 12 Jun 2026 06:03:50 +0000 (08:03 +0200)]
netfilter: conntrack: add deprecation warnings for irc and pptp trackers
IRC Direct client-to-client requires plaintext. IRC over TLS should be
preferred, making this helper ineffective. Add a deprecation warning and
update the help text to better reflect that this is needed for the DCC
extension, not IRC itself.
PPTP is esoteric these days and it is the only helper that requires the
destroy callback in the conntrack helper API.
Removal would simplify the conntrack core.
Both helpers are IPv4 only.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: ctnetlink: do not allow to reset helper on existing conntrack
This feature allows resetting a helper for an existing conntrack, but it
is not safe. This requires a synchronize_rcu() call after resetting the
helper, which is going to be expensive for a large batch of conntrack
entries. This also needs to call the .destroy callback to release the
GRE/PPTP mappings to fix it.
This feature antedates the creation of the conntrack-tools and I cannot
find a good use-case for this. Given that I cannot find any user in the
netfilter.org userspace tree, I prefer to remove this feature.
Fixes: c1d10adb4a52 ("[NETFILTER]: Add ctnetlink port for nf_conntrack") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Mon, 15 Jun 2026 18:10:44 +0000 (20:10 +0200)]
netfilter: nft_compat: ebtables emulation must reject non-bridge targets
xtables targets return netfilter verdicts: NF_ACCEPT, NF_DROP, and so
on. ebtables targets return incompatible verdicts: EBT_ACCEPT,
EBT_DROP, ... We cannot allow fallback to NFPROTO_UNSPEC.
ebtables doesn't permit this since 11ff7288beb2 ("netfilter: ebtables: reject non-bridge targets")
but that commit missed the nft_compat layer.
Reported-by: Ren Wei <n05ec@lzu.edu.cn> Reported-by: Wyatt Feng <bronzed_45_vested@icloud.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Yi Chen [Thu, 11 Jun 2026 14:50:13 +0000 (16:50 +0200)]
selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test
The existing test covered a scenario where a delayed INIT_ACK chunk
updates the vtag in conntrack after the association has already been
established.
A similar issue can occur with a delayed SCTP INIT chunk.
Add a new simultaneous-open test case where the client's INIT is
delayed, allowing conntrack to establish the association based on
the server-initiated handshake.
When the stale INIT arrives later, it may get recorded and cause a
following INIT_ACK from the peer to be accepted instead of dropped.
This INIT_ACK overwrites the vtag in conntrack, causing subsequent
SCTP DATA chunks to be considered as invalid and then dropped by
nft rules matching on ct state invalid.
This test verifies such stale INIT chunks do not cause problems.
Signed-off-by: Yi Chen <yiche.cy@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Runyu Xiao [Thu, 11 Jun 2026 04:21:20 +0000 (12:21 +0800)]
netfilter: nft_synproxy: stop bypassing the priv->info snapshot
nft_synproxy_eval_v4() and nft_synproxy_eval_v6() already take a
whole-object READ_ONCE() snapshot of the shared priv->info state before
building the SYNACK reply, but nft_synproxy_tcp_options() still masks
opts->options with priv->info.options from the live shared object.
When a named synproxy object is updated concurrently with SYN traffic,
the eval path can then mix mss and timestamp handling from the local
snapshot with an options mask taken from a newer configuration, so one
SYNACK no longer reflects a coherent synproxy configuration.
Use info->options so nft_synproxy_tcp_options() stays on the same local
snapshot that the eval path already copied from priv->info.
Randy Dunlap [Sun, 14 Jun 2026 05:25:24 +0000 (22:25 -0700)]
netfilter: x_tables.h: fix all kernel-doc warnings
- use correct names in kernel-doc comments
- add missing struct members to kernel-doc comments
Warning: include/linux/netfilter/x_tables.h:41 struct member 'targinfo' not described in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:41 Excess struct member 'targetinfo' description in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'family' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'nft_compat' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:101 expecting prototype for struct xt_mdtor_param. Prototype was for struct xt_mtdtor_param instead
Warning: include/linux/netfilter/x_tables.h:121 struct member 'net' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'table' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'target' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'targinfo' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'hook_mask' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'family' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'nft_compat' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:345 expecting prototype for xt_recseq(). Prototype was for DECLARE_PER_CPU() instead
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto()
Add sanity check for iph->ihl field in nf_flow_ip4_tunnel_proto() before
using it to compute the header size, avoiding out-of-bounds access with
malformed IP headers.
While at it, use iph->protocol instead of the hardcoded IPPROTO_IPIP
constant when setting ctx->tun.proto and reference ctx->tun.hdr_size
when updating ctx->offset.
Fixes: ab427db178858 ("netfilter: flowtable: Add IPIP rx sw acceleration") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
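The kind of sanity check described above can be sketched in userspace C as follows. This is a minimal illustration, not the flowtable code: ipv4_hdr_len() is a hypothetical helper operating on a raw byte buffer, and it assumes the standard IPv4 layout where the low nibble of the first byte is IHL in 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the IPv4 header length in bytes, or -1 if the header is
 * malformed or does not fit inside the buffer.
 */
int ipv4_hdr_len(const uint8_t *pkt, size_t len)
{
	unsigned int ihl;

	if (len < 20)			/* minimum IPv4 header size */
		return -1;
	if ((pkt[0] >> 4) != 4)		/* version nibble must be 4 */
		return -1;

	ihl = pkt[0] & 0x0f;		/* header length in 32-bit words */
	if (ihl < 5)			/* shorter than the fixed header */
		return -1;
	if ((size_t)ihl * 4 > len)	/* options would run past the buffer */
		return -1;

	return (int)(ihl * 4);
}
```

Only after such a check passes is it safe to use ihl * 4 as an offset into the packet.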
netfilter: nf_conncount: prevent connlimit drops for early confirmed ct
Commit 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add
was skipped") introduced a regression where packets for valid
connections are dropped when using connlimit for soft-limiting
scenarios.
The issue occurs when a new connection reuses a socket currently in
the TIME_WAIT state. In this scenario, the connection tracking entry
is evaluated as already confirmed. Previously, __nf_conncount_add()
assumed that if a connection was confirmed and did not originate from
the loopback interface, it should skip the addition and return -EEXIST.
Skipping the addition triggers a garbage collection run that cleans up
the TIME_WAIT connection. Consequently, the active connection count
drops to 0, which xt_connlimit mishandles, leading to the false rejection
of the perfectly valid new connection.
Fix this by replacing the interface check with protocol-agnostic state
checks. We now skip the tree insertion and preserve the lockless garbage
collection optimization only if the connection is IPS_ASSURED. This
allows early-confirmed setup packets (such as reused TIME_WAIT sockets
or locally generated SYN-ACKs) to be properly evaluated and counted
without falsely dropping. The goto check_connections path is maintained
to ensure these setup packets are deduplicated correctly.
This has been tested with slowhttptest and HTTP server configured
locally to ensure we are not breaking soft-limiting scenarios for local
or external connections. In addition, it was tested with an OVS zone
limit too.
Fixes: 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped") Reported-by: Alejandro Olivan Alvarez <alejandro.olivan.alvarez@gmail.com> Closes: https://lore.kernel.org/netfilter-devel/177349610461.3071718.4083978280323144323@eldamar.lan/ Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init()
We ran into the KASAN splat below, which is mostly uninteresting, except
for having nf_nat_register_fn() in the call chain as a cause for the
offending access:
==================================================================
BUG: KASAN: slab-out-of-bounds in nf_nat_register_fn+0x5f9/0x640
Read of size 8 at addr ffff890031e54c20 by task iptables/9510
The out-of-bounds report, though, is a red herring as it is for an
access that shouldn't have happened in the first place.
When nf_nat_init() fails to register its BPF kfuncs, it'll unwind and,
among others, call unregister_pernet_subsys() to deregister its per-net
ops. This makes the previously allocated net id available for reuse by
the next caller of register_pernet_subsys(), in our case, synproxy.
However, 'nat_net_id' will still hold the previously allocated value.
If nf_nat.o gets built as a module, all this doesn't matter. A failed
initialization routine makes the module fail to load and any dependent
module won't be able to load either. However, if nf_nat.o is built-in,
a failing init won't /completely/ make its functionality unavailable to
dependent modules, namely the code and static data is still there, free
to be called by modules like nft_chain_nat.ko.
Case in point, nft_chain_nat registers hooks that'll call into nf_nat
which, in our case, failed to initialize and therefore won't have a
valid net id nor related net_nat object any more.
Code in nf_nat, namely nf_nat_register_fn() and nf_nat_unregister_fn(),
still making use of the reallocated net id, leads to a type confusion as
the call to net_generic() will no longer return memory belonging to an
object suited to fit 'struct nat_net' but 'struct synproxy_net' instead.
The latter is only 24 bytes on 64-bit systems, much smaller than struct
nat_net which is 176 bytes, perfectly explaining the OOB KASAN report.
Detect and handle a failed nf_nat_init() by testing the 'nf_nat_hook'
pointer which will be reset to NULL on initialization errors to prevent
the usage of an invalid nat_net pointer.
As this check is only needed when nf_nat.o is built-in, guard it by
'#ifndef MODULE...'.
Fixes: cbc1dd5b659f ("netfilter: nf_nat: Fix possible memory leak in nf_nat_init()") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The net-next-hw spinners on netdev.bots.linux.dev observe failing
so-txtime-py tests. A review of stdout shows most failures to be
due to exceeding the 4ms grace period. All I saw were within 8ms.
So increase to that.
Double the bounds from 4 to 8ms. This is still small enough to
differentiate the delays programmed by the test, 10 and 20ms.
Fixes: 5c6baef3885c ("selftests: drv-net: convert so_txtime to drv-net") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260610170651.1b644001@kernel.org/ Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260621200137.1564776-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).
Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
- channel 0: clears bit 0..31 (all 4 channels) instead of 0..7
- channel 1: clears bit 8..31 (channels 1-3) instead of 8..15
- channel 2: clears bit 16..31 (channels 2-3) instead of 16..23
- channel 3: clears bit 24..31 (channel 3 only) - correct by accident
While BIT(32+) on arm64 produces 64-bit values truncated to 0 in u32
mask parameter, the loop still incorrectly clears queues within the
same channel beyond queue 7.
Even though this is functionally harmless (the register resets to 0
and is only ever cleared, never set — so clearing extra bits is a
no-op), the loop bound is semantically wrong and should be fixed for
correctness and clarity.
Fix by using AIROHA_NUM_QOS_QUEUES (8) as the loop upper bound.
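The per-channel bit ranges listed above can be reproduced with a small userspace sketch. CHAN_QUEUE_BIT() is a hypothetical stand-in for the driver macro, built only from the BIT(i + (channel * 8)) formula quoted in the text, and cleared_bits() is an illustrative helper.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUES_PER_CHAN	8
#define NUM_CHANNELS	4

/* Hypothetical stand-in for TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i). */
#define CHAN_QUEUE_BIT(chan, i) (1ULL << ((i) + (chan) * QUEUES_PER_CHAN))

/* OR together the bits the loop would clear for a given upper bound. */
uint32_t cleared_bits(unsigned int chan, unsigned int loop_bound)
{
	uint32_t mask = 0;
	unsigned int i;

	assert(chan < NUM_CHANNELS && loop_bound <= 32);	/* keeps shifts below 64 */

	for (i = 0; i < loop_bound; i++)
		mask |= (uint32_t)CHAN_QUEUE_BIT(chan, i);	/* bits >= 32 truncate to 0 */

	return mask;
}
```

With a bound of 32, channel 1 yields 0xffffff00 (bits 8..31); with a bound of 8 it yields 0x0000ff00 (bits 8..15), which is the intended range.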
Abdun Nihaal [Sat, 20 Jun 2026 06:23:50 +0000 (11:53 +0530)]
bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()
If the allocation of fp[i].tpa_info fails, the error path will not free
the struct bnx2x_fastpath allocated earlier, as it is not linked to the
bp structure yet. Fix that by linking it immediately after allocation.
Cc: stable@vger.kernel.org Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure") Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260620062402.89549-1-nihaal@cse.iitm.ac.in Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv4: fib: Don't ignore error route in local/main tables.
When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
fib_lookup() performs route lookup directly on two tables.
Since the first lookup does not properly bail out, the result
of an error route in the merged local/main table could be
overwritten by another route in the default table:
# unshare -n
# ip link set lo up
# ip route add 192.168.0.0/24 dev lo table 253
# ip route add unreachable 192.168.0.0/24
# ip route get 192.168.0.1
192.168.0.1 dev lo table default uid 0
cache <local>
Once a random rule is added, the error route is respected:
# ip rule add table 0
# ip rule del table 0
# ip route get 192.168.0.1
RTNETLINK answers: No route to host
Let's fix the inconsistent behaviour.
Fixes: f4530fa574df ("ipv4: Avoid overhead when no custom FIB rules are installed.") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260619212753.3367244-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 19 Jun 2026 19:15:38 +0000 (12:15 -0700)]
eth: bnxt: improve the timing of stats
Kernel selftests wait 1.25x of the promised stats refresh time
(as read from ethtool -c). bnxt reports 1sec by default, but
the stats update process has two steps. First device DMAs the
new values, then the service task performs update in full-width
SW counters. So the worst case delay is actually 2x.
Note that the behavior is different for ring stats and port stats.
Port stats are fetched synchronously by the service worker, so
there's no risk of doubling up the delay there.
The problem of stale stats impacts not only tests but real workloads
which monitor egress bandwidth of a NIC. The inaccuracy causes double
counting in the next cycle and spurious overload alarms.
Try to read from the DMA buffer more aggressively, to mitigate
timing issues between DMA and service task. The SW update should
be cheap.
Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.") Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20260619191538.104165-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xiang Mei [Fri, 19 Jun 2026 04:53:34 +0000 (21:53 -0700)]
ipv6: Fix null-ptr-deref in fib6_nh_mtu_change().
fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and
dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown()
clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has
released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and
oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(),
but this re-fetch is unguarded; nexthop-backed routes survive
addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is
nulled.
Return 0 when idev is NULL, matching rt6_mtu_change_route() and the
fib6_mtu() fix in commit 5ad509c1fdad ("ipv6: Fix null-ptr-deref in
fib6_mtu().").
Oops: general protection fault, ... KASAN: null-ptr-deref in range
[0x00000000000002a8-0x00000000000002af]
RIP: 0010:fib6_nh_mtu_change+0x203/0x990
rt6_mtu_change_route+0x141/0x1d0
__fib6_clean_all+0xd0/0x160
rt6_mtu_change+0xb4/0x100
ndisc_router_discovery+0x24b5/0x2cb0
icmpv6_rcv+0x12e9/0x1710
ipv6_rcv+0x39b/0x410
Fixes: c0b220cf7d80 ("ipv6: Refactor exception functions") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260619045334.2427073-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fan Wu [Wed, 17 Jun 2026 02:05:18 +0000 (02:05 +0000)]
hdlc_ppp: sync per-proto timers before freeing hdlc state
Each PPP control protocol (LCP/IPCP/IPV6CP) embedded in struct ppp
registers a timer via timer_setup(). That struct ppp is the
hdlc->state allocation, which detach_hdlc_protocol() frees with kfree()
in both teardown paths: unregister_hdlc_device() and the re-attach inside
attach_hdlc_protocol().
The ppp proto never registered a .detach callback, so
detach_hdlc_protocol() performs no timer synchronization before the
kfree(). The only cancel, timer_delete(&proto->timer) in ppp_cp_event(),
is partial (it does not wait for a running callback) and only runs on the
->CLOSED transition; ppp_stop()/ppp_close() do not sync either. A
ppp_timer callback already executing (blocked on ppp->lock) survives the
kfree and then dereferences proto->state / ppp->lock in freed memory,
leading to a use-after-free.
Fix this by adding a .detach helper that calls timer_shutdown_sync() on
every per-proto timer. detach_hdlc_protocol() invokes proto->detach(dev)
before kfree(hdlc->state), so timer_shutdown_sync()
now runs on both free paths.
timer_shutdown_sync() is used instead of timer_delete_sync() because the
keepalive path re-arms the timer through add_timer()/mod_timer() and
shutdown blocks any re-activation during teardown.
Initialize the per-protocol timers in ppp_ioctl() when the protocol is
attached, and remove the now-redundant timer_setup() from ppp_start(), so
that the timers are initialized exactly once at attach time and
ppp_timer_release() never operates on uninitialized timer_list
structures. attach_hdlc_protocol() uses kmalloc() (not kzalloc), so
struct ppp's protos[i].timer is uninitialized garbage until the first
timer_setup(); without this init-at-attach, attaching the PPP protocol
without ever bringing the device up would leave timer_shutdown_sync()
operating on uninitialized memory in .detach. Moving the init out of
ppp_start() (which only runs on NETDEV_UP) into the attach path makes the
initialization unconditional and avoids initializing the same timer_list
twice.
icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
with FW PA stat names regardless of whether the PA stats block is
present on the hardware. emac_get_stat_by_name() already guards the
PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
is NULL the lookup falls through to netdev_err() and returns -EINVAL.
Because ndo_get_stats64 is polled regularly by the networking stack
this produces thousands of log entries of the form:
A secondary consequence is that the int(-EINVAL) return value is
implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
into the __u64 fields of rtnl_link_stats64, silently corrupting the
rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.
Every other PA-aware code path in the driver is already guarded with
the same `if (emac->prueth->pa_stats)` check. Apply the same guard
here.
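The counter corruption comes from C's usual arithmetic conversions and can be shown outside the kernel. In this sketch get_stat_or_err() is a hypothetical lookup that returns -EINVAL when the stats block is absent; it is not the driver function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lookup: a small counter value, or -EINVAL if unavailable. */
int get_stat_or_err(int have_stats)
{
	return have_stats ? 7 : -EINVAL;
}

/* Unguarded: a negative int is converted to a huge unsigned value. */
uint64_t accumulate(uint64_t total, int have_stats)
{
	total += get_stat_or_err(have_stats);
	return total;
}

/* Guarded: only add the value when the stats block exists. */
uint64_t accumulate_guarded(uint64_t total, int have_stats)
{
	if (have_stats)
		total += get_stat_or_err(have_stats);
	return total;
}
```

Starting from zero, the unguarded version ends up EINVAL below 2^64 after a single failed lookup, which is the near-ULLONG_MAX value described above.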
Ziran Zhang [Tue, 16 Jun 2026 01:32:45 +0000 (09:32 +0800)]
rocker: Fix memory leak in ofdpa_port_fdb()
In ofdpa_port_fdb(), the hash_del() only unlinks the node from
hash table, but does not free it.
Fix this by adding kfree(found) after the !found == removing check,
where the pointer value is no longer needed.
Found by Coccinelle kfree script.
Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems Signed-off-by: Ziran Zhang <zhangcoder@yeah.net> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake
Commit 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1")
disabled K1 by default on Meteor Lake and newer systems due to packet
loss observed on various platforms. However, disabling K1 caused an
increase in power consumption.
To mitigate this, reconfigure the PLL clock gate value so that K1 can
remain enabled without incurring the additional power consumption.
Re-enable K1 by default, but keep the private flag to support disabling
it via ethtool. Additionally, introduce a DMI quirk table, so that K1 may
be disabled by default on known problematic systems. Currently, this
includes the Dell Pro 16 Plus, where the issue has been reported to persist
despite the changes to the PLL lock timeout.
i40e: Fix i40e_debug() to use struct i40e_hw argument
i40e_debug() macro takes struct i40e_hw *h as first argument. But the
macro body uses hw instead of h. This has been working so far because hw
happens to be the name of the variable in the context where the macro is
expanded. Fix the macro to use the passed argument.
Fixes: 5dfd37c37a44 ("i40e: Split i40e_osdep.h") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
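The macro bug is easy to reproduce in isolation. The sketch below uses hypothetical names (struct hw_ctx, DBG_BUGGY, DBG_FIXED) rather than the i40e definitions; it only shows how a macro body that ignores its parameter silently binds to a caller's local variable.

```c
/* Hypothetical context structure for illustration. */
struct hw_ctx {
	unsigned int debug_mask;
	int hits;
};

/* Buggy: ignores parameter 'h' and uses whatever 'hw' is in scope. */
#define DBG_BUGGY(h, mask) \
	do { if ((hw)->debug_mask & (mask)) (hw)->hits++; } while (0)

/* Fixed: uses the argument that was actually passed. */
#define DBG_FIXED(h, mask) \
	do { if ((h)->debug_mask & (mask)) (h)->hits++; } while (0)

void poke_both(struct hw_ctx *hw, struct hw_ctx *other)
{
	DBG_BUGGY(other, 1);	/* expands to hw->..., so it touches 'hw' */
	DBG_FIXED(other, 1);	/* touches 'other' as intended */
}
```

The buggy macro also fails to compile in any scope that has no variable named hw, which is how such bugs usually surface.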
ZhaoJinming [Fri, 29 May 2026 05:37:33 +0000 (13:37 +0800)]
ice: dpll: fix memory leak in ice_dpll_init_info error paths
Several error return paths in ice_dpll_init_info() directly return
without freeing previously allocated resources, causing memory leaks:
- When de->input_prio allocation fails, d->inputs is leaked
- When dp->input_prio allocation fails, d->inputs and de->input_prio
are leaked
- When ice_get_cgu_rclk_pin_info() fails, all previously allocated
inputs/outputs/input_prio are leaked
- When ice_dpll_init_pins_info(RCLK_INPUT) fails, same resources
are leaked
Fix this by jumping to the deinit_info label which properly calls
ice_dpll_deinit_info() to free all allocated resources.
Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ZhaoJinming [Fri, 29 May 2026 05:37:32 +0000 (13:37 +0800)]
ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info
ice_dpll_deinit_info() calls kfree() on several pf->dplls fields
(inputs, outputs, eec.input_prio, pps.input_prio) but does not set
the pointers to NULL afterward. This leaves dangling pointers in the
pf->dplls structure.
While not currently exploitable through existing code paths, this is
unsafe because:
1. If ice_dpll_init_info() is called again after a deinit (e.g. during
driver recovery), and a subsequent allocation within init fails, the
error path will jump to deinit_info and call ice_dpll_deinit_info()
again. Since some pointers still hold the old freed addresses, this
would result in a double-free.
2. Any future code that checks these pointers before use or after free
would be unprotected against use-after-free.
Follow the common kernel convention of setting pointers to NULL after
kfree() so that:
- kfree(NULL) is a safe no-op, preventing double-free
- NULL checks on these pointers become meaningful
This is a preparatory fix for a subsequent patch that routes additional
error paths in ice_dpll_init_info() to the deinit_info label.
Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu") Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
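The re-init scenario from point 1 can be modelled in userspace with malloc()/free(). This is a sketch under stated assumptions: struct dplls_info, info_init() and info_deinit() are hypothetical, and the fail_inputs flag simulates an allocation failure.

```c
#include <stdlib.h>

/* Hypothetical container with two separately allocated arrays. */
struct dplls_info {
	int *inputs;
	int *outputs;
};

void info_deinit(struct dplls_info *d)
{
	free(d->inputs);
	d->inputs = NULL;	/* free(NULL) is a no-op on a second call */
	free(d->outputs);
	d->outputs = NULL;
}

int info_init(struct dplls_info *d, int fail_inputs)
{
	/* fail_inputs simulates the first allocation failing. */
	d->inputs = fail_inputs ? NULL : calloc(4, sizeof(*d->inputs));
	if (!d->inputs)
		goto deinit;

	d->outputs = calloc(4, sizeof(*d->outputs));
	if (!d->outputs)
		goto deinit;

	return 0;

deinit:
	/* Safe only because deinit left no stale pointers behind. */
	info_deinit(d);
	return -1;
}
```

Without the two NULL assignments, a failed second info_init() would free the stale outputs pointer from the first round a second time.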
Marcin Szycik [Wed, 8 Apr 2026 14:14:29 +0000 (16:14 +0200)]
ice: call netif_keep_dst() once when entering switchdev mode
netif_keep_dst() only needs to be called once for the uplink VSI, not
once for each port representor. Move it from ice_eswitch_setup_repr()
to ice_eswitch_enable_switchdev().
Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path") Signed-off-by: Marcin Szycik <marcin.szycik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice_init_link() can return an error status from ice_update_link_info()
or ice_init_phy_user_cfg(), causing probe to fail.
An incorrect NVM update procedure can result in link/PHY errors, and
the recommended resolution is to update the NVM using the correct
procedure. If the driver fails probe due to link errors, the user
cannot update the NVM to recover. The link/PHY errors logged are
non-fatal: they are already annotated as 'not a fatal error if this
fails'.
Since none of the errors inside ice_init_link() should prevent probe
from completing, convert it to void and remove the error check in the
caller. All failures are already logged; callers have no meaningful
recovery path for link init errors.
Fixes: 5b246e533d01 ("ice: split probe into smaller functions") Cc: stable@vger.kernel.org Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Alexander Nowlin <alexander.nowlin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Lukasz Czapnik [Fri, 27 Mar 2026 07:22:35 +0000 (08:22 +0100)]
ice: fix AQ error code comparison in ice_set_pauseparam()
Fix unreachable code: the conditionals in ice_set_pauseparam() used
the bitwise-AND operator suggesting aq_failures is a bitmap, but it
is actually an enum, making the third condition logically unreachable.
Replace the if-else ladder with a switch statement. Also move the
aq_failures initialization to the variable declaration and remove the
redundant zeroing from ice_set_fc().
Fixes: fcea6f3da546 ("ice: Add stats and ethtool support") Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
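Why the third branch can never run is easiest to see with concrete values. The enum below is hypothetical and assumes default sequential numbering (0, 1, 2, 3), as a plain enum without explicit bit values would have; the names are not the driver's.

```c
/* Hypothetical failure codes: sequential values, not bit flags. */
enum fc_aq_failure {
	FC_FAIL_NONE = 0,
	FC_FAIL_GET,		/* 1 */
	FC_FAIL_SET,		/* 2 */
	FC_FAIL_UPDATE,		/* 3 == GET | SET */
};

/* Bitwise ladder: UPDATE (3) already matches the GET test (3 & 1). */
enum fc_aq_failure classify_bitwise(enum fc_aq_failure f)
{
	if (f & FC_FAIL_GET)
		return FC_FAIL_GET;
	else if (f & FC_FAIL_SET)
		return FC_FAIL_SET;
	else if (f & FC_FAIL_UPDATE)
		return FC_FAIL_UPDATE;	/* never reached */
	return FC_FAIL_NONE;
}

/* Switch: each enumerator is compared for equality. */
enum fc_aq_failure classify_switch(enum fc_aq_failure f)
{
	switch (f) {
	case FC_FAIL_GET:
		return FC_FAIL_GET;
	case FC_FAIL_SET:
		return FC_FAIL_SET;
	case FC_FAIL_UPDATE:
		return FC_FAIL_UPDATE;
	default:
		return FC_FAIL_NONE;
	}
}
```

Any value that has neither bit 0 nor bit 1 set cannot have a bit in common with 3, so the third test is dead code under this numbering.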
Dawid Osuchowski [Fri, 27 Mar 2026 07:22:32 +0000 (08:22 +0100)]
ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()
Resetting all VFs causes resource leak on VFs with FDIR filters
enabled as CTRL VSIs are only invalidated and not freed. Fix by using
ice_vf_ctrl_vsi_release() instead of ice_vf_ctrl_invalidate_vsi() which
aligns behavior with the ice_reset_vf() function.
Fixes: da62c5ff9dcd ("ice: Add support for per VF ctrl VSI enabling") Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jakub Kicinski [Mon, 22 Jun 2026 17:33:38 +0000 (10:33 -0700)]
Merge tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net. This batches
fixes for real crashes with trivial/correctness fixes. There is too
a rework of the conntrack expectation timeout strategy to deal with
a possible race when removing an expectation.
1) Fix the incorrect flowtable timeout extension for entries in
hw offload, from Adrian Bente. This is correcting a defect in
the functionality, no crash.
2) Hold reference to device under the fake dst in br_netfilter,
from Haoze Xie. This is fixing a possible UaF if the device
is removed while packet is sitting in nfqueue.
3) Reject template conntrack in xt_cluster, otherwise access to
uninitialized conntrack fields is possible, leading to WARN_ON
due to unset layer 3 protocol. From Wyatt Feng.
4) Make sure the IPv6 tunnel header is in the linear skb data
area before pulling. While at it remove incomplete NEXTHDR_DEST
support. From Lorenzo Bianconi. This could possibly lead to a crash
if IPv4 header is not in the linear area.
5) Use test_bit_acquire in ipset hash set to avoid reordering
of subsequent memory access. This is addressing an LLM-related
report, no crash has been observed. From Jozsef Kadlecsik.
6) Use test_bit_acquire in ipset bitmap set too, for the same
reason as in the previous patch, from Jozsef Kadlecsik.
7) Call kfree_rcu() after rcu_assign_pointer() to address a
possible UaF if kfree_rcu() runs immediately, which to my
understanding never happens. Never observed in practice,
reported by LLM. Also from Jozsef Kadlecsik.
8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync()
to avoid that ipset GC handler re-queues work as reported by LLM.
From Jozsef Kadlecsik. This is for correctness.
9) Restore the check in nft_payload for exceeding payload offset
over 2^16. From Florian Westphal. This fixes a silent truncation,
not a big deal, but better be assertive and reject it.
10) Validate NFT_META_BRI_IIFHWADDR can only run from bridge
prerouting. From Florian Westphal. Harmless but it could allow
to read bytes from skb->cb.
11) Zero out destination hardware address during the flowtable
path setup, also from Florian. This is a correctness fix; an LLM
points out that an infoleak is possible, but the topology needed to
achieve it is not clear.
12) Skip IPv4 options if present when building the IPV4 reject reply.
Otherwise bytes in the IPv4 options header can be sent back to
origin where the ICMP header is being expected. Again from
Florian Westphal.
13) Replace timer API for expectation by GC worker approach. This
is implicitly fixing a race between nf_ct_remove_expectations()
which might fail to remove the expectation due to timer_del()
returning false because timer has expired and callback is
being run concurrently. This fix is addressing a crash that has
been already reported with a reproducer.
14) Check if br_vlan_get_pvid_rcu() fails, otherwise possible stack
infoleak of 4-bytes. From Florian Westphal.
* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
netfilter: nf_reject: skip iphdr options when looking for icmp header
netfilter: nft_flow_offload: zero device address for non-ether case
netfilter: nft_meta_bridge: add validate callback for get operations
netfilter: nft_payload: reject offsets exceeding 65535 bytes
netfilter: ipset: make sure gc is properly stopped
netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
netfilter: xt_cluster: reject template conntracks in hash match
netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
netfilter: flowtable: fix offloaded ct timeout never being extended
====================
Ioana Ciornei [Thu, 18 Jun 2026 09:28:12 +0000 (12:28 +0300)]
dpaa2-switch: do not accept VLAN uppers while bridged
The dpaa2-switch driver does not support VLAN uppers while its ports are
bridged. An attempt was made to prevent this scenario by rejecting a bridge
join while VLAN uppers exist, but the reverse order was still possible.
This patch adds a check so that the dpaa2-switch also does not accept
VLAN uppers while bridged.
Jiayuan Chen [Thu, 18 Jun 2026 10:43:35 +0000 (18:43 +0800)]
ipv6: ioam: fix type confusion of dst_entry
IOAM uses a dummy dst_entry(null_dst) to mark that the destination should
not be changed after the transformation. This dst is stored in the IOAM lwt
state and may be passed to dst_cache_set_ip6().
However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which
treats the dst_entry as part of a struct rt6_info. Since the null_dst was
embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted
in an invalid cast and rt6_get_cookie() reading fields from the wrong
object.
In practice, the wrong cookie is not used while dst->obsolete is zero, but
rt6_get_cookie() may also access per-cpu value when rt->sernum is
zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which
can become zero, making this a potential invalid pointer access.
Fix this by embedding a full struct rt6_info for the dummy IPv6 route and
passing its dst member to the dst APIs.
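The layout issue can be sketched with toy structures. Everything here is hypothetical (struct toy_dst, struct toy_rt, struct toy_lwt and the toy_rt_of() macro); it only mirrors the idea that code receiving a dst pointer casts back to the enclosing route object, so the dummy must really be embedded in one.

```c
#include <stddef.h>

/* Toy versions of the generic dst and the IPv6 route that embeds it. */
struct toy_dst {
	int obsolete;
};

struct toy_rt {
	struct toy_dst dst;
	unsigned int sernum;
};

/* container_of-style cast from the embedded dst back to the route. */
#define toy_rt_of(d) \
	((struct toy_rt *)((char *)(d) - offsetof(struct toy_rt, dst)))

/* Fixed layout: the dummy is a full route, not a bare dst. */
struct toy_lwt {
	struct toy_rt null_rt;
	unsigned long reset_ts;
};

/* Reads a route field through a dst pointer, as a cookie lookup would. */
unsigned int cookie_of(struct toy_dst *d)
{
	return toy_rt_of(d)->sernum;
}
```

With a bare struct toy_dst embedded instead, the same cast would make sernum alias whatever member happens to follow it, which is the kind of aliasing with reset_ts described above.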
datalen already includes fraggap (datalen = length + fraggap). When
fraggap is non-zero, this is not the first skb and transhdrlen is zero.
The fraggap bytes carried over from the previous skb are copied just past
the fragment headers in the new skb's linear area. The linear area is
therefore undersized by fraggap bytes while pagedlen is overstated by the
same amount, and the copy writes past skb->end into the trailing
skb_shared_info.
An unprivileged user can trigger this via a UDPv6 socket using
MSG_MORE together with MSG_SPLICE_PAGES.
The bad accounting was introduced by commit 773ba4fe9104 ("ipv6:
avoid partial copy for zc"). Before commit ce650a166335 ("udp6: Fix
__ip6_append_data()'s handling of MSG_SPLICE_PAGES"), the negative
copy value caused -EINVAL to be returned. That later commit allowed
MSG_SPLICE_PAGES to proceed in this case, making the corruption
triggerable.
The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.
After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.
Since a negative copy is no longer expected for a valid MSG_SPLICE_PAGES
case, remove the MSG_SPLICE_PAGES exception from the negative copy check.
Fixes: 773ba4fe9104 ("ipv6: avoid partial copy for zc") Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com> Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/ajFTqRljatR17fFy@DESKTOP-19IMU7U.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
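The length arithmetic can be worked through with concrete numbers. This sketch is an assumption-laden model, not the __ip6_append_data() code: plan_paged() is a hypothetical helper, and the copy formula (datalen - transhdrlen - fraggap - pagedlen) is inferred from the statement that copy used to collapse to -fraggap.

```c
/* Sizes chosen for one fragment on the paged path. */
struct frag_plan {
	int alloclen;	/* bytes allocated for the linear area */
	int pagedlen;	/* bytes expected to live in page frags */
	int copy;	/* bytes still to copy into the linear area */
};

/* fixed == 0 models the old accounting, fixed != 0 the corrected one. */
struct frag_plan plan_paged(int fragheaderlen, int transhdrlen,
			    int datalen, int fraggap, int fixed)
{
	struct frag_plan p;
	int gap = fixed ? fraggap : 0;

	p.alloclen = fragheaderlen + transhdrlen + gap;
	p.pagedlen = datalen - transhdrlen - gap;
	/* Assumed formula, see the lead-in above. */
	p.copy = datalen - transhdrlen - fraggap - p.pagedlen;
	return p;
}
```

For example, with fragheaderlen = 40, transhdrlen = 0, fraggap = 8 and datalen = 1008, the old accounting gives alloclen 40, pagedlen 1008 and copy -8, so the 8 carried-over bytes have no room in the linear area. The corrected accounting gives alloclen 48, pagedlen 1000 and copy 0.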
datalen already includes fraggap, but the fraggap bytes carried over
from the previous skb are copied into the new skb's linear area at
offset transhdrlen by the subsequent skb_copy_and_csum_bits(). The
linear area is therefore undersized by fraggap bytes while pagedlen is
overstated by the same amount.
The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.
After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.
Fixes: 8eb77cc73977 ("ipv4: avoid partial copy for zc") Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com> Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/ajFR1eLAIs42TN3g@DESKTOP-19IMU7U.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xingquan Liu [Fri, 19 Jun 2026 15:13:48 +0000 (11:13 -0400)]
selftests/tc-testing: Add DualPI2 GSO backlog accounting test
Add a regression test for DualPI2 GSO backlog accounting when it is
used as a child qdisc of QFQ.
The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
the leaf qdisc. DualPI2 splits the skb into two segments. After the
traffic drains, both QFQ and DualPI2 must report zero backlog and zero
qlen.
On kernels with the broken accounting, QFQ can keep a stale non-zero
qlen after all real packets have been dequeued.
Signed-off-by: Xingquan Liu <b1n@b1n.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260619151447.223640-2-b1n@b1n.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xingquan Liu [Fri, 19 Jun 2026 15:13:47 +0000 (11:13 -0400)]
net/sched: dualpi2: fix GSO backlog accounting
When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.
With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.
Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.
Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc") Cc: stable@vger.kernel.org Signed-off-by: Xingquan Liu <b1n@b1n.io> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260619151447.223640-1-b1n@b1n.io Signed-off-by: Jakub Kicinski <kuba@kernel.org>
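The off-by-one in the parent's qlen can be modelled with a few lines of arithmetic. This is a toy model under stated assumptions, not qdisc code: parent_qlen_after_enqueue() is a hypothetical helper, and it assumes the parent adds one packet to its own qlen whenever the child reports success.

```c
#include <assert.h>

/*
 * Parent qlen after one GSO skb is split by the child into
 * 'queued_segs' successfully queued segments.
 * fixed == 0 models the old accounting, fixed != 0 the corrected one.
 */
int parent_qlen_after_enqueue(int queued_segs, int fixed)
{
	int parent_qlen = 0;

	assert(queued_segs >= 0);

	if (!fixed) {
		parent_qlen += queued_segs;	/* child propagates N extra packets */
		parent_qlen += 1;		/* parent counts the original skb */
		return parent_qlen;
	}

	if (queued_segs == 0)
		return parent_qlen;		/* nothing queued, nothing counted */

	parent_qlen += queued_segs - 1;		/* child propagates only the difference */
	parent_qlen += 1;			/* parent counts the original skb */
	return parent_qlen;
}
```

With two segments the old accounting leaves the parent at 3 while only 2 packets exist, so one phantom packet remains after both are dequeued. The corrected accounting yields exactly 2.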
Weiming Shi [Wed, 17 Jun 2026 06:55:13 +0000 (14:55 +0800)]
ipv6: ndisc: fix NULL deref in accept_untracked_na()
accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
and dereferences idev->cnf.accept_untracked_na without a NULL check,
even though its only caller ndisc_recv_na() already fetched and
NULL-checked idev for the same device.
Both reads of dev->ip6_ptr run in the same RCU read-side critical
section, but a concurrent addrconf_ifdown() can clear dev->ip6_ptr
between them: lowering the MTU below IPV6_MIN_MTU calls addrconf_ifdown()
without the synchronize_net() that orders the unregister path, so the
re-fetch returns NULL and oopses:
It is reachable by an unprivileged user via a network namespace.
Pass the caller's already validated idev instead of re-fetching it; the
idev stays alive for the whole RCU critical section, so it is safe even
after dev->ip6_ptr has been cleared.
Fixes: aaa5f515b16b ("net: ipv6: new accept_untracked_na option to accept na only if in-network") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260617065512.2529757-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>