net: ethernet: qualcomm: ppe: Demote from supported and fix maintainer addresses

Emails to the maintainer of the Qualcomm PPE Ethernet driver (Luo Jie
<quic_luoj@quicinc.com>) bounce permanently (full mailbox), because the
"quicinc.com" addresses were deprecated for public work. All Qualcomm
contributors are aware of that and were asked to fix their addresses.

A driver is not supported, in the sense netdev gives to the "supported"
commitment, if its maintainer does not care to receive patches for the
code, so demote it to "maintained" to reflect its true status.

Fix all occurrences of Luo Jie's email address to use the preferred,
working domain.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Acked-by: Luo Jie <jie.luo@oss.qualcomm.com>
Link: https://patch.msgid.link/20260623073307.36483-2-krzysztof.kozlowski@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: fix potential divide-by-zero when num_vsi is zero

For the i.MX94 series, none of the standalone ENETCs support SR-IOV, so
pf->caps.num_vsi is zero. This leads to a divide-by-zero in
enetc4_default_rings_allocation() when distributing rings among the PF
and VFs.

Division by zero is undefined behavior in C. On ARM64, the UDIV/SDIV
instructions silently return zero rather than raising an exception, so
the issue does not cause a visible crash. However, relying on this
behavior is incorrect and poses a cross-platform compatibility risk.

Add an explicit check for num_vsi == 0 and return early after the PF's
rings have been configured.
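
As a rough illustration of the early return, here is a minimal userspace
sketch. The struct and function names (ring_plan, plan_rings) are
hypothetical and do not come from the enetc driver; the sketch only
assumes that the PF takes its rings first and the remainder is divided
evenly among the VSIs.

```c
#include <assert.h>

/* Hypothetical ring plan; these are not the driver's real structures. */
struct ring_plan {
	unsigned int pf_rings;      /* rings kept by the PF */
	unsigned int rings_per_vsi; /* rings given to each VSI, 0 if none */
};

/* Split total_rings between one PF and num_vsi VSIs (assumed layout). */
static struct ring_plan plan_rings(unsigned int total_rings,
				   unsigned int pf_rings,
				   unsigned int num_vsi)
{
	struct ring_plan plan = { .pf_rings = pf_rings, .rings_per_vsi = 0 };

	assert(pf_rings <= total_rings);

	/* No VSIs (no SR-IOV): the PF is configured, nothing to divide. */
	if (num_vsi == 0)
		return plan;

	plan.rings_per_vsi = (total_rings - pf_rings) / num_vsi;
	return plan;
}
```

With num_vsi == 0 the division is never reached, so the result no longer
depends on how a given architecture handles division by zero.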

Fixes: 2d673b0e2f8d ("net: enetc: add standalone ENETC support for i.MX94")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260624072726.1238903-1-wei.fang@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: renesas,ether: Drop example "ethernet-phy-ieee802.3-c22" fallback

Fix the Micrel PHY in the example which shouldn't have the
fallback "ethernet-phy-ieee802.3-c22" compatible:

Documentation/devicetree/bindings/net/renesas,ether.example.dtb: ethernet-phy@1 \
(ethernet-phy-id0022.1537): compatible: ['ethernet-phy-id0022.1537', 'ethernet-phy-ieee802.3-c22'] is too long
from schema $id: http://devicetree.org/schemas/net/micrel.yaml

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Acked-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Fixes: 37a2fce09001 ("dt-bindings: sh_eth convert bindings to json-schema")
Link: https://patch.msgid.link/20260624150250.131966-2-robh@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

openvswitch: conntrack: annotate ct limit hlist traversal

ct_limit_set() is documented as being called with ovs_mutex held. It
walks the ct limit hlist with hlist_for_each_entry_rcu(), but the
iterator does not currently pass the OVS lockdep condition used
elsewhere for RCU-protected OVS objects.

Pass lockdep_ovsl_is_held() to the iterator. This matches the function's
existing caller contract and lets CONFIG_PROVE_RCU_LIST distinguish the
ovs_mutex-protected update path from the RCU read-side ct_limit_get()
path.

This was found by our static analysis tool and then manually reviewed
against the current tree. In the reviewed CONFIG_PROVE_RCU_LIST triage
run, the writer-side ct limit update produced the expected "RCU-list
traversed in non-reader section!!" warning while ovs_mutex was held,
with the stack matching ct_limit_set() and ovs_ct_limit_set_zone_limit().
The change is limited to documenting the existing protection contract.

This is a lockdep annotation cleanup. It does not change the conntrack
limit list update or release behavior.

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://patch.msgid.link/20260624150149.3510541-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync

Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() during concurrent device operations.

The workqueue core clears the internal pending bit before invoking the
worker. At that point, a concurrent thread can queue the work again.
When the already running worker eventually clears the work_pending flag
to 0, it mistakenly clears the flag for the newly queued instance.
udp_tunnel_nic_unregister() then observes work_pending as 0 and frees
the structure while the second work item is still active in the queue,
leading to UAF.

Fix this by returning early in udp_tunnel_nic_device_sync() if
work_pending is already set, preventing redundant work queueing.
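
The guard can be sketched in plain C as below. This is a simplified
model, not the kernel code: nic_state, sketch_device_sync() and
sketch_work_done() are hypothetical names, the times_queued counter
stands in for the real queue_work() call, and callers are assumed to be
serialized by a lock that is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-NIC state; not the kernel structure. */
struct nic_state {
	bool work_pending;         /* set while a sync work item is owed */
	unsigned int times_queued; /* stands in for queue_work() calls */
};

/* Returns true if work was queued, false if one was already pending. */
static bool sketch_device_sync(struct nic_state *s)
{
	assert(s);

	/* A work item is already pending: do not queue a second one. */
	if (s->work_pending)
		return false;

	s->work_pending = true;
	s->times_queued++;
	return true;
}

/* Called by the worker once it has finished syncing. */
static void sketch_work_done(struct nic_state *s)
{
	assert(s);
	s->work_pending = false;
}
```

Because at most one work item can be outstanding per flag, clearing
work_pending in the worker can no longer hide a second queued instance.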

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Reported-by: Yue Sun <samsun1006219@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260625065938.654652-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'nf-26-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Add a workaround to avoid a possible crash if nf_nat and nft_chain_nat are
   compiled built-in and nf_nat fails to register, allowing nft_chain_nat to
   access the incorrect pernetns area. This crash is specific to all-built-in
   compilation. From Matias Krause.

2) Revisit conncount GC optimization for confirmed conntracks, skip GC round
   if IPS_ASSURED is set. This addresses a corner-case scenario involving
   locally generated traffic. No crash, just a functionality fix.
   From Fernando F. Mancera.

3) Validate iph->ihl in flowtable IPIP tunnel support, from Lorenzo Bianconi.
   This is a sanity check that bounces malformed IPIP packets back to the
   classic forwarding path.

4) Kdoc fixes for x_tables.h, from Randy Dunlap.

5) Use info->options so nft_synproxy_tcp_options() stays on the same local
   snapshot, otherwise eval path can observe inconsistent mix of mss and
   timestamps. From Runyu Xiao.

6) Add conntrack_sctp_collision.sh to cover for SCTP INIT collisions.
   From Yi Chen.

7) Do not allow NFPROTO_UNSPEC targets if family is NFPROTO_BRIDGE in
   nft_compat. Otherwise, nonsensical targets such as xt_nat can be used,
   leading to a crash. From Florian Westphal.

8) Add a selftest for queueing from the bridge family. From Florian Westphal.

9) Do not allow resetting a conntrack helper via ctnetlink. This feature
   predates the creation of conntrack-tools and is not used. I don't have
   a use case for it, so I prefer removing it to fixing it.

10) Add deprecation warning for IPv4 only conntrack helpers for PPTP
    and IRC. From Florian Westphal.

11) Store the master tuple in the expectation object and use it,
    otherwise SLAB_TYPESAFE_RCU rules allow incorrect master tuple
    information to be displayed through ctnetlink.

12) Run expectation eviction when inserting an expectation with no
    helper, this is a fix for the nft_ct custom expectation support.

13) Fix nft_ct custom expectation timeouts: userspace provides a
    timeout in milliseconds but the kernel assumes it is in seconds.
    From Florian Westphal.

14) Cap the maximum number of expectations per class to 255 per master
    conntrack at helper registration. This restricts the maximum number
    of expectations per master conntrack, which can be an issue for the
    new lazy GC expectation approach.

* tag 'nf-26-06-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration
  netfilter: nft_ct: expectation timeouts are passed in milliseconds
  netfilter: nf_conntrack_expect: run expectation eviction with no helper
  netfilter: nf_conntrack_expect: store master_tuple in expectation
  netfilter: conntrack: add deprecation warnings for irc and pptp trackers
  netfilter: ctnetlink: do not allow to reset helper on existing conntrack
  selftests: nft_queue.sh: add a bridge queue test
  netfilter: nft_compat: ebtables emulation must reject non-bridge targets
  selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test
  netfilter: nft_synproxy: stop bypassing the priv->info snapshot
  netfilter: x_tables.h: fix all kernel-doc warnings
  netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto()
  netfilter: nf_conncount: prevent connlimit drops for early confirmed ct
  netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init()
====================

Link: https://patch.msgid.link/20260623221548.701545-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2026-06-22 (ice, i40e, e1000e)

For ice:
Dawid changes call to release control VSI during reset to prevent
leaking it.

Lukasz fixes the flow control error check to compare the value rather
than treat it as a bitmap.

Paul makes link related errors non-fatal to probe to allow for recovery
in certain NVM update situations.

Marcin moves netif_keep_dst() to only be called once when entering
switchdev mode.

ZhaoJinming adds a cleanup path for ice_dpll_init_info() to prevent
memory leaks on error path.

For i40e:
Mohamed Khalfella corrects argument passed in macro to match the
one provided to the macro.

For e1000e:
Dima resolves power state issues by adjusting value of PLL clock gate
and re-enabling K1; a quirk table is added to keep it off for known bad
systems.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake
  i40e: Fix i40e_debug() to use struct i40e_hw argument
  ice: dpll: fix memory leak in ice_dpll_init_info error paths
  ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info
  ice: call netif_keep_dst() once when entering switchdev mode
  ice: fix ice_init_link() error return preventing probe
  ice: fix AQ error code comparison in ice_set_pauseparam()
  ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()
====================

Link: https://patch.msgid.link/20260622220059.2471844-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-dwmac-spacemit-fix-wrong-macro-definition'

Inochi Amaoto says:

====================
net: stmmac: dwmac-spacemit: Fix wrong macro definition

Fix wrong macro definitions for the Spacemit K3.
====================

Link: https://patch.msgid.link/20260623074637.503864-1-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-spacemit: Fix wrong irq definition

The current irq definitions of the wake irq and the lpi irq
are wrong; replace them with the right numbers and names.

Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260623074637.503864-3-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-spacemit: Fix wrong phy interface definition

The current MII interface register definition from the vendor is wrong;
use the right number for the macro. Also, correct the interface mask
in spacemit_set_phy_intf_sel() so it can update the register with the
right number.

Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260623074637.503864-2-inochiama@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: sunplus: spl2sw: fix phy_node refcount leak in remove

mac->phy_node is acquired via of_parse_phandle() in spl2sw_probe() and
stored in the mac private data, transferring ownership of the
device_node reference to mac. On driver removal, spl2sw_phy_remove()
disconnects the PHY but never drops that reference, so each
probe-then-remove cycle leaks one of_node refcount per port permanently.

Drop the reference after phy_disconnect(). While at it, remove the
redundant inner "if (ndev)" check; comm->ndev[i] was just verified
non-NULL on the line above.

Compile-tested only; no SP7021 hardware available.

Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021")
Signed-off-by: Shitalkumar Gandhi <shitalkumar.gandhi@cambiumnetworks.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/f3bdd4c91f3e2269b4e256075f9dc70808b1b8e9.1782195965.git.shitalkumar.gandhi@cambiumnetworks.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools/ynl: add missing uapi header deps in Makefile.deps

drm_ras includes drm/drm_ras.h, which is a relatively new header not yet
shipped in most distro kernel-header packages. Without the explicit
entry, the build might fail with a message like this:

  drm_ras-user.c:19:10: error: ‘DRM_RAS_CMD_CLEAR_ERROR_COUNTER’ \
   undeclared here (not in a function); did you mean \
  ‘DRM_RAS_CMD_GET_ERROR_COUNTER’

Signed-off-by: Thorsten Leemhuis <linux@leemhuis.info>
Link: https://patch.msgid.link/20260623070818.2161810-1-linux@leemhuis.info
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sungem: fix probe error cleanup

gem_init_one() calls gem_remove_one() when register_netdev() fails.
gem_remove_one() unregisters and frees resources owned by the net_device,
including the DMA block, MMIO mapping, PCI regions, and the net_device
itself. gem_init_one() then falls through to its own cleanup labels and
frees the same resources again.

Keep the register_netdev() error path in gem_init_one(): clear drvdata so
PM/remove paths do not see a half-registered device, remove the NAPI
instance added during probe, and let the existing cleanup labels release
the resources once.

The issue was found by a local static-analysis checker for probe error
paths. The reported path was manually inspected before sending this fix.

Compile-tested with CONFIG_SUNGEM=y. Runtime testing was not performed
because no sungem hardware is available.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260623025759.3468566-1-ruoyuw560@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tcp-ao: fix use-after-free of key in del_async path

In tcp_ao_delete_key(), the del_async path skips the current_key
and rnext_key validity checks present in the synchronous path,
assuming these pointers are always NULL on LISTEN sockets. However,
if a key was added with set_current=1/set_rnext=1 while the socket
was in CLOSE state, current_key and rnext_key will be non-NULL
after listen() transitions the socket to LISTEN.

When such a key is deleted with del_async=1, hlist_del_rcu() and
call_rcu() free the key without clearing the dangling pointers.
After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
current_key->sndid and rnext_key->rcvid from freed slab memory.

Clear current_key and rnext_key in the del_async path when they
reference the key being deleted.
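
A minimal sketch of that clearing step, using simplified stand-in types
rather than the real TCP-AO structures. The names ao_key, ao_info and
ao_forget_key() are hypothetical, and the locking and memory-ordering
annotations the kernel would need are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the TCP-AO key and per-socket info. */
struct ao_key {
	unsigned char sndid;
	unsigned char rcvid;
};

struct ao_info {
	struct ao_key *current_key;
	struct ao_key *rnext_key;
};

/* Drop any cached reference to key before the key itself is freed. */
static void ao_forget_key(struct ao_info *info, const struct ao_key *key)
{
	assert(info);

	if (info->current_key == key)
		info->current_key = NULL;
	if (info->rnext_key == key)
		info->rnext_key = NULL;
}
```

Only pointers that match the deleted key are cleared; a reference to a
different key is left untouched.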

Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)")
Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260623015208.1191687-1-eilaimemedsnaimel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: build archives with $(AR)

Use $(AR) to allow the build system to override the archiver tool (e.g.,
when cross-compiling for a different architecture) by setting the AR
environment variable.

GNU Make defaults AR to ar, so this change will not break existing build
environments that do not explicitly set AR.

Fixes: 07c3cc51a085 ("tools: net: package libynl for use in selftests")
Fixes: 86878f14d71a ("tools: ynl: user space helpers")
Signed-off-by: Greg Thelen <gthelen@google.com>
Link: https://patch.msgid.link/20260622161659.145047-1-gthelen@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: mlx5: fix macsec dependency

Configurations with mlx5 built-in but macsec=m fail to link:

x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_add_gid_macsec_operations':
macsec.c:(.text+0x77d): undefined reference to `macsec_netdev_is_offloaded'
x86_64-linux-ld: drivers/infiniband/hw/mlx5/macsec.o: in function `mlx5r_del_gid_macsec_operations':
macsec.c:(.text+0xe81): undefined reference to `macsec_netdev_is_offloaded'

Fix the dependency so this configuration cannot happen.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260622124229.2444502-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: kalmia: bound RX frame length in kalmia_rx_fixup()

kalmia_rx_fixup() computes usb_packet_length = skb->len - (2 *
KALMIA_HEADER_LENGTH) as a u16, guarded only by a pre-loop check that
skb->len is at least KALMIA_HEADER_LENGTH, which is 6. A device can
deliver a short bulk-IN frame with skb->len in the 6 to 11 range, or
leave a short trailing remainder on a later loop iteration. Either case
underflows usb_packet_length to about 65530.

That bypasses the usb_packet_length < ether_packet_length truncation path.
The device-supplied ether_packet_length, a le16 up to 65535 read from
header_start[2], then drives a memcmp() and the following skb_trim() and
skb_pull() past the end of the rx buffer. The rx buffer is hard_mtu * 10,
which is 14000 bytes. That is an out of bounds read.

Require both the start and end framing headers to be present before
subtracting them, on every loop iteration.
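
The wraparound and the added bound can be reproduced in userspace. In
this sketch only KALMIA_HEADER_LENGTH and its value of 6 come from the
description above; the two helper functions are hypothetical and model
just the length arithmetic, not the driver's loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KALMIA_HEADER_LENGTH 6

/* Models the old computation: no check that both headers are present. */
static uint16_t payload_len_unchecked(uint32_t skb_len)
{
	return (uint16_t)(skb_len - 2 * KALMIA_HEADER_LENGTH);
}

/* Models the fix: refuse frames too short to hold both framing headers. */
static bool payload_len_checked(uint32_t skb_len, uint16_t *out)
{
	assert(out);

	if (skb_len < 2 * KALMIA_HEADER_LENGTH)
		return false;

	*out = (uint16_t)(skb_len - 2 * KALMIA_HEADER_LENGTH);
	return true;
}
```

For skb_len == 6 the unchecked helper yields 65530, the value mentioned
above, while the checked helper rejects every length below 12.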

Fixes: d40261236e8e ("net/usb: Add Samsung Kalmia driver for Samsung GT-B3730")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/178211531778.2216480.12637613349790980750@maoyixie.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: validate inner network offset in geneve_gro_complete()

Even with both paths gated on gs->gro_hint, geneve_gro_complete()
re-derives the inner dispatch type and length from the packet and the
current gs->gro_hint, independently of geneve_gro_receive(). The two can
disagree if gs->gro_hint flips under a concurrent geneve_quiesce()/
geneve_unquiesce() (sk_user_data is NULL across a synchronize_net()), or if
the re-read option bytes differ from the ones receive parsed.

geneve_gro_receive() already records the inner network header position in
NAPI_GRO_CB()->inner_network_offset. Have geneve_gro_complete() compute the
offset it is about to dispatch at, adding ETH_HLEN in the ETH_P_TEB case
where eth_gro_complete() steps over the inner MAC header, and bail out if
it lands past inner_network_offset.

Use a lower bound rather than exact equality: between gh_len and the inner
L3 header, geneve_gro_receive() may also have pulled an inner VLAN tag
(vlan_gro_receive() advances the recorded offset past it), which only moves
inner_network_offset further out. A valid frame therefore always satisfies
inner_nh <= inner_network_offset, while a gh_len inflated by a hint
gro_receive() did not honour dispatches past the validated inner header,
i.e. the out-of-bounds completion. Only the latter is rejected.
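
The lower-bound test can be sketched as a small predicate. This is an
illustration only: inner_offset_ok() is a hypothetical helper, all
offsets are assumed to be measured from the same base, and ETH_HLEN is
the standard 14-byte Ethernet header length.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN 14

/*
 * Returns true if the offset that complete is about to dispatch at does
 * not land past the inner network header recorded by receive.
 */
static bool inner_offset_ok(int nhoff, int gh_len, bool is_teb,
			    int inner_network_offset)
{
	int inner_nh;

	assert(nhoff >= 0 && gh_len >= 0);

	inner_nh = nhoff + gh_len;
	if (is_teb)
		inner_nh += ETH_HLEN; /* step over the inner MAC header */

	return inner_nh <= inner_network_offset;
}
```

An inner VLAN tag only pushes inner_network_offset further out, so such
frames still pass, while an inflated gh_len fails the comparison.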

Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260618032622.484720-2-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

geneve: gate GRO hint in geneve_gro_complete() on gs->gro_hint

geneve_gro_receive() reads the GRO hint through geneve_sk_gro_hint_off(),
which honours it only when the socket enabled IFLA_GENEVE_GRO_HINT
(gs->gro_hint). geneve_gro_complete() instead calls the low-level
geneve_opt_gro_hint_off() and acts on the hint unconditionally.

On a tunnel without the hint, receive aggregates the frames as plain
ETH_P_TEB while complete still honours an attacker-supplied hint option: it
inflates gh_len by gro_hint->nested_hdr_len (u8) and redirects the dispatch
type, so the inner gro_complete handler runs at nhoff + gh_len, an offset
receive never pulled nor validated, reading out of bounds of the skb head:

  BUG: KASAN: slab-out-of-bounds in ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
  Read of size 1 at addr ffff88800fe91980 by task exploit/153
   ipv6_gro_complete (net/ipv6/ip6_offload.c:196)
   geneve_gro_complete (drivers/net/geneve.c:965)
   udp_gro_complete (net/ipv4/udp_offload.c:940)
   inet_gro_complete (net/ipv4/af_inet.c:1621)
   __gro_flush (net/core/gro.c:306)

Gate the complete path on gs->gro_hint too via geneve_sk_gro_hint_off(), so
both paths agree. Tunnels that enable the hint are unaffected.

Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Reported-by: Kyle Zeng <kylebot@openai.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260618032622.484720-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvneta: re-enable percpu interrupt on resume

On Marvell MPIC platforms (Armada 370/XP/38x), mvneta uses a percpu
IRQ disable/enable scheme for NAPI: the ISR (mvneta_percpu_isr) calls
disable_percpu_irq() to mask the MPIC per-CPU interrupt and schedules
NAPI poll, which calls enable_percpu_irq() on completion to unmask.

If suspend occurs while NAPI poll is pending (between
disable_percpu_irq in the ISR and enable_percpu_irq in poll
completion), the interrupt is never re-enabled:

  1. mvneta_percpu_isr: disable_percpu_irq() + napi_schedule()
     => MPIC masked, percpu_enabled cpumask bit cleared
  2. NAPI poll does not complete before suspend proceeds
     (on PREEMPT_RT this is highly likely since softirqs run in
     ksoftirqd which gets frozen; on non-RT it can happen when
     softirq processing is deferred to ksoftirqd)
  3. mvneta_stop_dev => napi_disable(): cancels the pending poll
     without executing the completion path
  4. suspend_device_irqs => IRQCHIP_MASK_ON_SUSPEND: masks MPIC
     (already masked, but records IRQS_SUSPENDED)
  5. Resume: mpic_resume checks irq_percpu_is_enabled() => false
     (bit was cleared in step 1) => skips unmask
  6. mvneta_start_dev only restores device-level INTR_NEW_MASK,
     does not touch the MPIC per-CPU mask

Result: MPIC per-CPU interrupt stays masked permanently. The NIC
generates interrupts (INTR_NEW_CAUSE != 0) but the CPU never
receives them, causing complete loss of network connectivity.

Fix by calling on_each_cpu(mvneta_percpu_enable) in the resume path
to unconditionally unmask the MPIC per-CPU interrupt regardless of
pre-suspend state.

Fixes: 12bb03b436da ("net: mvneta: Handle per-cpu interrupts")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260622074350.1666290-1-yun.zhou@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: fix CGX debugfs RVU AF PCI reference leaks

CGX per-lmac debugfs seq readers obtained struct rvu via
pci_get_drvdata(pci_get_device(..., PCI_DEVID_OCTEONTX2_RVU_AF, ...)),
which leaks a PCI device reference on every read. Store rvu and the CGX
handle in debugfs inode private data when creating stats, mac_filter,
and fwdata files (one context per CGX), and use debugfs aux numbers for
fwdata so lmac_id matches the other CGX debugfs entries.

Fixes: f967488d095e ("octeontx2-af: Add per CGX port level NIX Rx/Tx counters")
Fixes: dbc52debf95f ("octeontx2-af: Debugfs support for DMAC filters")
Fixes: 49f02e6877d1 ("Octeontx2-af: Debugfs support for firmware data")
Cc: Linu Cherian <lcherian@marvell.com>
Reported-by: Yuho Choi <dbgh9129@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Link: https://patch.msgid.link/20260622034229.2254145-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Validate NIX maximum LFs correctly

The maximum number of NIX LFs can be set via a devlink command,
but only before any LFs are assigned to a PF/VF.
The condition used to check whether any LFs are assigned is
incorrect. Fix that condition.

Fixes: dd7842878633 ("octeontx2-af: Add new devlink param to configure maximum usable NIX block LFs")
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/1782082853-6941-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wwan: t7xx: destroy DMA pool on CLDMA late init failure

t7xx_cldma_late_init() creates md_ctrl->gpd_dmapool before
initializing the TX and RX rings. If any ring initialization
fails, the error path frees the already initialized rings but
leaves the DMA pool allocated.

Destroy md_ctrl->gpd_dmapool on the late-init failure path
to avoid leaking the DMA pool.
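
The shape of the error path can be modelled with counters in place of
real allocations. Everything here is hypothetical (cldma_model,
model_late_init(), the fail_at parameter); the flags merely stand in for
the DMA pool and the rings so the unwind order can be checked.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: flags instead of a real DMA pool and rings. */
struct cldma_model {
	int pool_live;  /* 1 while the pool exists */
	int rings_live; /* number of initialized rings */
};

/* Initialize nrings rings; ring number fail_at fails (-1: none fail). */
static int model_late_init(struct cldma_model *c, int nrings, int fail_at)
{
	int i;

	assert(c);
	c->pool_live = 1; /* stands in for creating the DMA pool */
	c->rings_live = 0;

	for (i = 0; i < nrings; i++) {
		if (i == fail_at)
			goto err_free_rings;
		c->rings_live++;
	}
	return 0;

err_free_rings:
	c->rings_live = 0; /* free the rings initialized so far */
	c->pool_live = 0;  /* the added step: destroy the pool as well */
	return -ENOMEM;
}
```

After a failed call nothing is left allocated, which is the property the
fix restores.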

Fixes: 39d439047f1d ("net: wwan: t7xx: Add control DMA interface")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com>
Link: https://patch.msgid.link/20260621031714.3605022-1-haoxiang_li2024@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: fix BQL underflow in shared QDMA TX ring

When multiple netdevs share a QDMA TX ring and one device is stopped,
netdev_tx_reset_subqueue() zeroes that device's BQL counters while its
pending skbs remain in the shared HW TX ring. When NAPI later completes
those skbs via netdev_tx_completed_queue(), the already-zeroed
dql->num_queued counter underflows.

Fix the issue:
- Remove netdev_tx_reset_subqueue() from airoha_dev_stop() so pending
  skbs are completed naturally by NAPI with proper BQL accounting.
- Rework airoha_qdma_tx_cleanup() to disable TX DMA, flush BQL
  counters, DMA-unmap and free all pending skbs while skb->dev
  references are still valid. Use a per-queue flushing flag checked
  under q->lock in airoha_dev_xmit() to prevent races between teardown
  and transmit. Call airoha_qdma_stop_napi() before
  airoha_qdma_tx_cleanup() at the call sites.
- Move DMA engine start into probe. Split DMA teardown so TX DMA is
  disabled in airoha_qdma_tx_cleanup() and RX DMA in
  airoha_qdma_cleanup().
- Remove qdma->users counter since DMA lifetime is now tied to
  probe/cleanup rather than per-netdev open/stop.

Fixes: a9c2ca61fec7 ("net: airoha: Support multiple net_devices for a single FE GDM port")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260620-airoha-bql-fixes-v3-1-76b95374e63e@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: Clear MDIO_AN_10GBT_CTRL_ADV10G bit

An RTL8127A connected to a link partner that advertises 10000baseT
cannot have its speed changed to anything other than 10000baseT,
because 10GbE is always advertised regardless of any setting. Fix
this by clearing the MDIO_AN_10GBT_CTRL_ADV10G bit in
rtl822x_config_aneg()'s call to phy_modify_mmd_changed().

Fixes: 83d962316128 ("net: phy: realtek: add RTL8127-internal PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: Jan Klos <honza.klos@gmail.com>
Link: https://patch.msgid.link/20260620011956.37181-1-honza.klos@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: npc: cn20k: Fix subbank free list indexing for search order

subbank_srch_order[i] is the physical subbank at search-order slot i,
so each subbank's arr_idx must be i (its slot), not
subbank_srch_order[sb->idx]. The old logic mis-keyed xa_sb_free
and broke allocation traversal order.

Populate arr_idx and xa_sb_free in a single pass over the search
order after subbank structs are initialized.
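
A small model of the single pass, assuming a plain array in place of the
xa_sb_free xarray. The subbank_model struct and assign_search_slots()
are hypothetical; only the rule "arr_idx is the search-order slot" is
taken from the description above.

```c
#include <assert.h>

/* Hypothetical, reduced subbank: physical index plus search slot. */
struct subbank_model {
	int idx;     /* physical subbank number */
	int arr_idx; /* slot in the search order */
};

/*
 * srch_order[i] names the physical subbank at search slot i, so the
 * subbank's arr_idx is i. free_by_slot[] stands in for xa_sb_free.
 */
static void assign_search_slots(struct subbank_model *sb,
				const int *srch_order,
				int *free_by_slot, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		int phys = srch_order[i];

		assert(phys >= 0 && phys < n);
		sb[phys].arr_idx = i;   /* key by slot, not by lookup */
		free_by_slot[i] = phys; /* traversal follows slot order */
	}
}
```

With a search order of {2, 0, 3, 1}, subbank 2 gets slot 0. Under the
rule attributed above to the old logic, srch_order[sb->idx], it would
have been given 3.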

Fixes: 7ac9d4c4075c ("octeontx2-af: npc: cn20k: add subbank search order control")
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260619095100.1864440-1-rkannoth@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mana: Fall back to standard MTU when PF reports adapter_mtu of 0

Commit d7709812e13d ("net: mana: hardening: Validate adapter_mtu from
MANA_QUERY_DEV_CONFIG") rejected any adapter_mtu value smaller than
ETH_MIN_MTU + ETH_HLEN, including 0, returning -EPROTO and failing
mana_probe().

Some older PF firmware versions still in the field report
adapter_mtu as 0 in the MANA_QUERY_DEV_CONFIG response. With the
hardening check in place, the MANA VF driver now fails to load on
those hosts, breaking networking entirely for guests.

MANA hardware always supports the standard Ethernet MTU. Treat a
reported adapter_mtu of 0 as "the PF did not advertise a value" and
fall back to ETH_FRAME_LEN, the same value used for the pre-V2
message version path. Only jumbo frames remain unavailable until
the PF reports a valid MTU.

Other small-but-nonzero bogus values are still rejected, preserving
the original protection against the unsigned-subtraction wrap that
would otherwise let ndev->max_mtu underflow to a huge value.
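
The resulting policy can be sketched as a standalone helper. The
function resolve_adapter_mtu() is hypothetical and is not the driver's
code; the ETH_* constants use their standard Linux values (68, 14 and
1514).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define ETH_MIN_MTU   68
#define ETH_HLEN      14
#define ETH_FRAME_LEN 1514

/* Map the MTU reported by the PF to the value the VF should use. */
static int resolve_adapter_mtu(uint32_t reported, uint32_t *adapter_mtu)
{
	assert(adapter_mtu);

	/* 0 means the PF did not advertise a value: use the standard MTU. */
	if (reported == 0) {
		*adapter_mtu = ETH_FRAME_LEN;
		return 0;
	}

	/* Small nonzero values are still bogus and stay rejected. */
	if (reported < ETH_MIN_MTU + ETH_HLEN)
		return -EPROTO;

	*adapter_mtu = reported;
	return 0;
}
```

A report of 0 yields 1514, a report of 50 is rejected with -EPROTO, and
a valid jumbo value such as 9000 is passed through unchanged.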

Fixes: d7709812e13d ("net: mana: hardening: Validate adapter_mtu from MANA_QUERY_DEV_CONFIG")
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260619055348.467224-1-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: fix use-after-free of DSA ports in crc_err_work

Upon an MDIO CRC error mxl862xx_crc_err_work_fn() walks the DSA ports
and closes the CPU port conduits:

  dsa_switch_for_each_cpu_port(dp, priv->ds)
          dev_close(dp->conduit);

mxl862xx_remove() unregisters the switch before cancelling this work:

  set_bit(MXL862XX_FLAG_WORK_STOPPED, &priv->flags);
  cancel_delayed_work_sync(&priv->stats_work);
  dsa_unregister_switch(ds);
  mxl862xx_host_shutdown(priv);

dsa_unregister_switch() frees the dsa_port objects. If a CRC error
schedules the work during teardown it can run after the ports have been
freed and dereference freed memory.

Guard the port walk with MXL862XX_FLAG_WORK_STOPPED, which is already set
before dsa_unregister_switch(). DSA tears the ports down under
rtnl_lock(), so checking the flag under rtnl_lock() means the work either
runs before teardown and sees valid ports, or runs afterwards, observes
the flag and skips the walk. This mirrors the host_flood_work handler,
which skips torn-down ports under rtnl_lock().

Link: https://sashiko.dev/#/patchset/cover.1780968180.git.daniel%40makrotopia.org?part=2
Fixes: a319d0c8c8ce ("net: dsa: mxl862xx: add CRC for MDIO communication")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/5e55169926c02f2b914e5ada529d7453b943cda4.1781702256.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mxl862xx: avoid unaligned 16-bit access in api_wrap

The MXL862XX_API_* macros pass the address of a stack-allocated, __packed
firmware-ABI struct to mxl862xx_api_wrap() as a void *. The struct has an
alignment of 1, so the compiler is free to place it at an odd address.

mxl862xx_api_wrap() reinterprets that buffer as a __le16 * and accesses it
with data[i], for which the compiler assumes the natural 2-byte alignment
of __le16 and emits aligned 16-bit loads/stores (e.g. lhu/sh on MIPS).
When the buffer lands on an odd address these fault on architectures that
do not support unaligned access, such as MIPS32.

-Waddress-of-packed-member does not catch this: the packed origin is
laundered through the void * parameter, so the cast inside api_wrap looks
alignment-safe to the compiler and no warning is emitted.

Use get_unaligned_le16()/put_unaligned_le16() for the three 16-bit word
accesses. The byte accesses (*(u8 *)&data[i], crc16()) are already safe
and are left unchanged.
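
As a rough userspace illustration of the idea (not the driver code), the
sketch below assembles 16-bit little-endian words one byte at a time, which
is valid at any address. The sketch_* names are hypothetical stand-ins for
the kernel's get_unaligned_le16()/put_unaligned_le16() helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for get_unaligned_le16(): byte loads only, so the
 * pointer may be odd.
 */
static inline uint16_t sketch_get_unaligned_le16(const void *p)
{
	const uint8_t *b = p;

	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Hypothetical stand-in for put_unaligned_le16(): byte stores only. */
static inline void sketch_put_unaligned_le16(uint16_t val, void *p)
{
	uint8_t *b = p;

	b[0] = (uint8_t)(val & 0xff);
	b[1] = (uint8_t)(val >> 8);
}

/* Illustrative helper: sum the 16-bit words of a buffer that may start at
 * an odd address, as a __packed struct on the stack can.
 */
static inline uint32_t sketch_sum_words(const void *buf, size_t nwords)
{
	const uint8_t *p = buf;
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL);

	for (i = 0; i < nwords; i++)
		sum += sketch_get_unaligned_le16(p + 2 * i);

	return sum;
}
```

Casting the same buffer to a 16-bit pointer and indexing it would compile,
but would let the compiler assume 2-byte alignment again.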

Link: https://sashiko.dev/#/patchset/cover.1781319534.git.daniel%40makrotopia.org?part=4
Fixes: 23794bec1cb6 ("net: dsa: add basic initial driver for MxL862xx switches")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Link: https://patch.msgid.link/599327521db465a534d277de53ab9b6cac01928b.1781702256.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: realtek: fix memory leak in rtl8366rb_setup_led()

led_classdev_register_ext() only reads init_data.devicename; it never
stores the pointer. However, the caller allocated devicename with
kasprintf() but never freed it, leaking the string memory.

Fix it by using a stack buffer, which avoids dynamic allocation completely.
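
A minimal userspace sketch of the pattern, assuming a registration call
that only reads the name while it runs. The sketch_* names and the
"sw%d:port%d" format are made up for illustration and are not the driver's.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical LED object that keeps its own copy of the name. */
struct sketch_led {
	char name[64];
};

/* Hypothetical registration call: reads devicename during the call and
 * never stores the pointer itself.
 */
static int sketch_led_register(struct sketch_led *led, const char *devicename)
{
	if (strlen(devicename) >= sizeof(led->name))
		return -1;

	strcpy(led->name, devicename);
	return 0;
}

static int sketch_setup_led(struct sketch_led *led, int sw_index, int port)
{
	char devicename[32];	/* on the stack: nothing to free or leak */

	assert(led != NULL);

	snprintf(devicename, sizeof(devicename), "sw%d:port%d", sw_index, port);

	/* devicename goes out of scope after this call, which is fine
	 * because the callee does not keep the pointer.
	 */
	return sketch_led_register(led, devicename);
}
```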

Fixes: 32d617005475 ("net: dsa: realtek: add LED drivers for rtl8366rb")
Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260618140200.1888707-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ixp4xx_hss: fix duplicate HDLC netdev allocation

ixp4xx_hss_probe() allocates two HDLC netdevs. The first one is stored
in ndev, initialized, and registered with register_hdlc_device(). The
second one is stored in port->netdev and later used by the remove path
for unregister_hdlc_device() and free_netdev().

This means that the registered netdev is not the same object that is
unregistered and freed on remove. It also leaks the first allocation if
the second alloc_hdlcdev() call fails, and the first allocation is not
checked before ndev is used.

Older code allocated the HDLC netdev only once and stored the same object
in both the local variable and port->netdev. The buggy conversion split
this into two alloc_hdlcdev() calls. A later rename changed the local
variable name to ndev, but the underlying mismatch remained.

Fix this by allocating the HDLC netdev only once and assigning the same
object to port->netdev.

Fixes: 99ebe65eb9c0 ("net: ixp4xx_hss: move out assignment in if condition")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/20260622043015.643637-1-haoxiang_li2024@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'airoha-fixes-for-sched-htb-offload-support'

Lorenzo Bianconi says:

====================
airoha: fixes for sched HTB offload support
====================

Link: https://patch.msgid.link/20260619-airoha-qos-fixes-v2-0-5c43485038f9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: fix netif_set_real_num_tx_queues for sparse QoS channels

airoha_tc_htb_alloc_leaf_queue() assigns queue IDs based on the channel
index (opt->qid = AIROHA_NUM_TX_RING + channel), but updates
real_num_tx_queues with a simple increment (num_tx_queues + 1). When QoS
channels are allocated sparsely (e.g., channels 0 and 3 without 1 and
2), the returned qid can exceed real_num_tx_queues, causing out-of-bounds
accesses in the networking stack.

For example, allocating channel 0 then channel 3 results in
real_num_tx_queues = 34 but qid = 35, which is out of range [0, 34).

Fix this by computing real_num_tx_queues based on the highest active
channel index rather than using a simple counter, in both the allocation
and deletion paths.
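
The arithmetic can be sketched as follows. This is an illustration only:
the base of 32 TX rings is inferred from the example above (34 and 35), the
channel count of 4 is an assumption, and the sketch_* names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_TX_RING	32	/* inferred from the example above */
#define SKETCH_NUM_QOS_CHANNELS	4	/* assumed for illustration */

/* qid handed back for a QoS channel: base ring count plus channel index */
static unsigned int sketch_qid(unsigned int channel)
{
	assert(channel < SKETCH_NUM_QOS_CHANNELS);

	return SKETCH_NUM_TX_RING + channel;
}

/* real_num_tx_queues derived from the highest active channel in the
 * bitmap rather than from a running counter.
 */
static unsigned int sketch_real_num_tx_queues(uint32_t qos_sq_bmap)
{
	unsigned int highest = 0;	/* highest active channel index + 1 */
	unsigned int ch;

	for (ch = 0; ch < SKETCH_NUM_QOS_CHANNELS; ch++)
		if (qos_sq_bmap & (1u << ch))
			highest = ch + 1;

	return SKETCH_NUM_TX_RING + highest;
}
```

With channels 0 and 3 active (bitmap 0x9) this yields 36, so qid 35 is in
range; after channel 3 is deleted (bitmap 0x1) it drops back to 33.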

Fixes: ef1ca9271313b ("net: airoha: Add sched HTB offload support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260619-airoha-qos-fixes-v2-2-5c43485038f9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix off-by-one in airoha_tc_remove_htb_queue()

airoha_tc_htb_alloc_leaf_queue() computes the HTB QoS channel index
as opt->classid % AIROHA_NUM_QOS_CHANNELS and stores it in qos_sq_bmap.
However, airoha_tc_remove_htb_queue() clears the HTB configuration
using queue + 1 as the channel index, causing an off-by-one error.
Use queue directly as the QoS channel index to match the allocation
logic.

Fixes: ef1ca9271313b ("net: airoha: Add sched HTB offload support")
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260619-airoha-qos-fixes-v2-1-5c43485038f9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: fix ordering of heartbeat vs ownership

When requesting ownership of the NIC (MAC/PHY control), we set up
the heartbeat to look stale:

  /* Initialize heartbeat, set last response to 1 second in the past
   * so that we will trigger a timeout if the firmware doesn't respond
   */
  fbd->last_heartbeat_response = req_time - HZ;
  fbd->last_heartbeat_request = req_time;

The response handler then sets:

  fbd->last_heartbeat_response = jiffies;

for which we wait via:

  fbnic_fw_init_heartbeat() -> fbnic_fw_heartbeat_current()

The scheme is a bit odd, but it should work in principle.

Fix the ordering of operations. We have to set up the stale heartbeat
before we send the message. Otherwise, if the response arrives very
quickly, the setup overwrites the fresh response timestamp. This triggers
on QEMU if we run on the core that handles the IRQ, and results in
ndo_open failing with ETIMEDOUT.

The change in ordering doesn't impact releasing the ownership.
Both ndo_stop and the heartbeat check are under rtnl_lock.
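
The race can be modelled in a few lines of userspace C. This is a
simplified sketch, not the driver: the firmware is assumed to respond
before the send call returns, the "current" check is reduced to a plain
comparison, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_HZ 1000UL	/* assumed tick rate for the model */

struct sketch_fbd {
	unsigned long last_heartbeat_request;
	unsigned long last_heartbeat_response;
};

/* Models a firmware that answers before the send call even returns,
 * so the response handler has already run by then.
 */
static void sketch_send_ownership_req(struct sketch_fbd *fbd, unsigned long now)
{
	assert(fbd != NULL);

	fbd->last_heartbeat_response = now;
}

/* Old ordering: send first, then mark the heartbeat stale. */
static void sketch_request_buggy(struct sketch_fbd *fbd, unsigned long now)
{
	sketch_send_ownership_req(fbd, now);
	/* too late: this wipes out the response recorded above */
	fbd->last_heartbeat_response = now - SKETCH_HZ;
	fbd->last_heartbeat_request = now;
}

/* New ordering: mark the heartbeat stale, then send. */
static void sketch_request_fixed(struct sketch_fbd *fbd, unsigned long now)
{
	fbd->last_heartbeat_response = now - SKETCH_HZ;
	fbd->last_heartbeat_request = now;
	sketch_send_ownership_req(fbd, now);
}

/* Simplified check: the response is not older than the request. */
static bool sketch_heartbeat_current(const struct sketch_fbd *fbd)
{
	return (long)(fbd->last_heartbeat_response -
		      fbd->last_heartbeat_request) >= 0;
}
```

In this model the old ordering leaves the heartbeat looking stale even
though the firmware answered, which is the timeout described above.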

Fixes: 20d2e88cc746 ("eth: fbnic: Add initial messaging to notify FW of our presence")
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20260622154753.827506-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv6-fix-error-handling-in-disable_ipv6-sysctl'

Fernando Fernandez Mancera says:

====================
ipv6: fix error handling in disable_ipv6 sysctl

While working on a different IPv6 patch series I have spotted multiple
minor bugs around sysctl error handling and notifications. In general,
they are not serious issues.

In addition, there is one more issue in forwarding sysctl as it does not
check for CAP_NET_ADMIN for the namespace. I am keeping that patch out
of this series and I am aiming it at the net-next tree once it re-opens.

During v3, Ido pointed out that it is unnecessary to reset the
position pointer when the return value is negative, because
new_sync_write() only advances ppos when the return value is
positive. That means we can get rid of that operation in ipv4/ipv6
sysctls. That is going to be sent to net-next too.
====================

Link: https://patch.msgid.link/20260622130857.5115-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix missing notification for ignore_routes_with_linkdown

When changing the ignore_routes_with_linkdown sysctl for a specific
interface, the RTM_NEWNETCONF netlink notification was not being emitted
to userspace. Fix this by emitting the notification when needed.

In addition, fix the bogus return value for successful writes to the
"all" and per-interface variants, which led to a wrong reset of the
position pointer.

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-7-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix state corruption during proxy_ndp sysctl restart

When handling proxy_ndp, if rtnl_net_trylock() fails, the operation is
retried but as the value was already modified by the initial
proc_dointvec() call, the restarted syscall will read the newly modified
value as the 'old' state.

Fix this by taking the RTNL lock before parsing the input value if the
operation is a write.

Fixes: c92d5491a6d9 ("netconf: add support for IPv6 proxy_ndp")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-6-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix error handling in disable_policy sysctl

When writing to the disable_policy sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is resetting the position argument even if an error
occurred during proc_dointvec() and not only during sysctl restart.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: df789fe75206 ("ipv6: Provide ipv6 version of "disable_policy" sysctl")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-5-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix error handling in forwarding sysctl

When writing to the forwarding sysctl, if proc_dointvec() fails to parse
the input, it returns a negative error code. The current implementation
is overwriting that error for write operations.

This results in a silent failure: the write is reported as successful
although the configuration was not modified at all. When modifying the
"all" variant it can also modify the configuration of existing interfaces
to the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure. In addition, adjust return code of
addrconf_fixup_forwarding() for successful operation.

Fixes: b325fddb7f86 ("ipv6: Fix sysctl unregistration deadlock")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-4-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix error handling in ignore_routes_with_linkdown sysctl

When writing to the ignore_routes_with_linkdown sysctl, if
proc_dointvec() fails to parse the input, it returns a negative error
code. The current implementation is overwriting that error for write
operations.

This results in a silent failure: the write is reported as successful
although the configuration was not modified at all. When modifying the
"all" variant it can also modify the configuration of existing interfaces
to the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.
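
The shape of the fix, reduced to a userspace sketch. The parser below is a
hypothetical stand-in for proc_dointvec(), and sketch_sysctl_write() is not
the real handler; it only shows the "check the return value and bail out
before applying anything" pattern.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the parsing step: returns 0 and fills *val on
 * success, a negative errno on malformed input.
 */
static int sketch_parse_int(const char *buf, int *val)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf || *end != '\0' || v < INT_MIN || v > INT_MAX)
		return -EINVAL;

	*val = (int)v;
	return 0;
}

/* Write path: propagate the parse error instead of overwriting it. */
static int sketch_sysctl_write(int *conf, const char *buf)
{
	int val;
	int ret;

	assert(conf != NULL && buf != NULL);

	ret = sketch_parse_int(buf, &val);
	if (ret)
		return ret;	/* early return: configuration untouched */

	*conf = val;		/* only reached with a valid value */
	return 0;
}
```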

Fixes: 35103d11173b ("net: ipv6 sysctl option to ignore routes when nexthop link is down")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-3-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: fix error handling in disable_ipv6 sysctl

When writing to the disable_ipv6 sysctl, if proc_dointvec() fails to
parse the input, it returns a negative error code. The current
implementation is overwriting that error for write operations.

This results in a silent failure: the write is reported as successful
although the configuration was not modified at all. When modifying the
"all" variant it can also modify the configuration of existing interfaces
to the wrong value.

Fix this by checking the return value of proc_dointvec() and returning
early on failure.

Fixes: 56d417b12e57 ("IPv6: Add 'autoconf' and 'disable_ipv6' module parameters")
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260622130857.5115-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: Orphan SUNPLUS ETHERNET DRIVER

I have left Sunplus and no longer have access to the relevant hardware
to test or maintain this driver. Mark the driver as orphaned.

Signed-off-by: Wells Lu <wellslutw@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260622180721.28334-1-wellslutw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: au1000: move free_irq out of the close-time spinlocked section

au1000_close() calls free_irq() while aup->lock is still held with
spin_lock_irqsave(). free_irq() can sleep because it takes the IRQ
descriptor request mutex, so it does not belong inside the close-time
spinlocked section.

This was found by our static analysis tool and then confirmed by manual
review of the in-tree au1000_close() .ndo_stop path. The reviewed path
keeps aup->lock held across the MAC reset, queue stop and
free_irq(dev->irq, dev).

A directed runtime validation kept that ndo_stop carrier and the same
free_irq(dev->irq, dev) operation under the driver lock. Lockdep reported
"BUG: sleeping function called from invalid context" and "Invalid wait
context" while free_irq() was taking desc->request_mutex, with
au1000_close() and free_irq() on the stack.

Drop aup->lock before freeing the IRQ. The protected close-time work still
stops the device and queue before IRQ teardown, but the sleepable IRQ core
path now runs outside the spinlocked section.

Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260619151816.1144289-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: fix err_chunk memory leaks in INIT handling

When sctp_verify_init() encounters unrecognized parameters, it allocates an
err_chunk to report them. However, this chunk is leaked in several code
paths:

1. In sctp_sf_do_5_1B_init(), if security_sctp_assoc_request() fails after
   sctp_verify_init() has populated err_chunk, the function returns
   immediately without freeing it.

2. In sctp_sf_do_unexpected_init(), the same leak occurs on the
   security_sctp_assoc_request() failure path.

3. In sctp_sf_do_unexpected_init(), on the success path after copying
   unrecognized parameters to the INIT-ACK, the function returns without
   freeing err_chunk, unlike sctp_sf_do_5_1B_init() which properly frees
   it.

Fix all three leaks by adding sctp_chunk_free(err_chunk) calls before
returning in the error paths and on the success path in
sctp_sf_do_unexpected_init().

Fixes: c081d53f97a1 ("security: pass asoc to sctp_assoc_request and sctp_sk_clone")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/0656704f1b0158287c98aec09ba36c83e4a537ab.1781970534.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: cls_api: Handle TC_ACT_CONSUMED in tcf_qevent_handle

tcf_classify() can return TC_ACT_CONSUMED while the skb is held by the
defragmentation engine (e.g. act_ct on out-of-order fragments). When
that happens the skb is no longer owned by the caller and must not be
touched again.

tcf_qevent_handle() did not handle TC_ACT_CONSUMED: it fell through the
switch and returned the skb to the caller as if classification had
passed. The only qdisc that wires up qevents today is RED, via three call
sites (qe_mark on RED_PROB_MARK/HARD_MARK, qe_early_drop on
congestion_drop). In this case red_enqueue() continued to operate on an
skb it no longer owned (enqueueing it, dropping it, or updating
statistics), resulting in a UAF.

  tc qdisc add dev eth0 root handle 1: red ... qevent early_drop block 10
  tc filter add block 10 ... action ct

  (with ct defrag enabled and traffic that produces out-of-order
  fragments, e.g. a fragmented UDP stream)

Handle TC_ACT_CONSUMED in tcf_qevent_handle() the same way the ingress
and egress fast paths do: treat it as stolen and return NULL without
touching the skb. Unlike the TC_ACT_STOLEN case, the skb must not be
dropped/freed here, as it is no longer owned by us.
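
The ownership rule can be sketched with a toy verdict switch. The enum
values and the sketch_* names are hypothetical and do not match the real
TC_ACT_* constants; the "freed" counter only records whether this handler
touched the skb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts, standing in for the TC_ACT_* values */
enum sketch_act {
	SKETCH_ACT_OK,
	SKETCH_ACT_SHOT,
	SKETCH_ACT_STOLEN,
	SKETCH_ACT_CONSUMED,
};

struct sketch_skb {
	int freed;	/* how many times this handler dropped the skb */
};

/* Returns the skb if the caller may keep using it, NULL otherwise. */
static struct sketch_skb *sketch_qevent_handle(struct sketch_skb *skb,
					       enum sketch_act verdict)
{
	assert(skb != NULL);

	switch (verdict) {
	case SKETCH_ACT_SHOT:
	case SKETCH_ACT_STOLEN:
		skb->freed++;	/* we still own the skb, so we drop it */
		return NULL;
	case SKETCH_ACT_CONSUMED:
		/* someone else owns the skb now: do not touch it at all */
		return NULL;
	default:
		return skb;	/* classification passed */
	}
}
```

Without the CONSUMED case the verdict would hit the default branch and hand
the skb back to the caller, which is the bug described above.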

Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")
Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260620130749.226642-1-jhs@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'drop-skb-metadata-before-lwt-encapsulation'

Jakub Sitnicki says:

====================
Drop skb metadata before LWT encapsulation

See description for patch 1.
====================

Link: https://patch.msgid.link/20260619-bpf-lwt-drop-skb-metadata-v3-0-71d6a33ab76b@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/bpf: Add LWT encap tests for skb metadata

Test that an LWT encapsulation does not silently corrupt XDP metadata
sitting in the skb headroom. Exercise all three LWT dispatch paths:

- BPF LWT xmit prog reserves headroom on the LWT .xmit redirect,
- mpls pushes an MPLS label on the LWT .xmit redirect,
- seg6 in encap mode runs on the LWT .input redirect,
- ioam6 encap inserts an IOAM Hop-by-Hop option on LWT .output redirect.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20260619-bpf-lwt-drop-skb-metadata-v3-2-71d6a33ab76b@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: lwtunnel: Drop skb metadata before LWT encapsulation

skb metadata is meant for passing information between XDP and TC. It lives
in the skb headroom, immediately before skb->data. LWT programs cannot
access the __sk_buff->data_meta pseudo-pointer to metadata.

However, LWT encapsulation prepends outer headers, moving skb->data back
over the headroom where the metadata sits. On an RX-originated (forwarded)
packet that still carries XDP metadata this goes wrong in two different
ways, depending on the encap type:

1. Non-BPF LWT encaps (mpls, seg6, ioam6 ...) call skb_push()/skb_pull()
   and silently overwrite the metadata that sits in the headroom.

2. BPF LWT xmit calls bpf_skb_change_head(), which uses skb_data_move().
   That helper expects metadata immediately before skb->data. But since
   the IP output path runs LWT xmit before neighbour output has built
   the outgoing L2 header, for forwarded packets skb->data points at the
   L3 header while skb_mac_header() still points at the old L2 header.
   skb_data_move() sees metadata ending at skb_mac_header(), not before
   skb->data, warns and clears metadata:

  WARNING: CPU: 21 PID: 454557 at include/linux/skbuff.h:4609 skb_data_move+0x47/0x90
  CPU: 21 UID: 0 PID: 454557 Comm: napi/iconduit-g Tainted: G           O        6.18.21 #1
  RIP: 0010:skb_data_move+0x47/0x90
  Call Trace:
   <IRQ>
   bpf_skb_change_head+0xe6/0x1a0
   bpf_prog_...+0x213/0x2e3
   run_lwt_bpf.isra.0+0x1d3/0x360
   bpf_xmit+0x46/0xe0
   lwtunnel_xmit+0xa1/0xf0
   ip_finish_output2+0x1e7/0x5e0
   ip_output+0x63/0x100
   __netif_receive_skb_one_core+0x85/0xa0
   process_backlog+0x9c/0x150
   __napi_poll+0x2b/0x190
   net_rx_action+0x40b/0x7f0
   handle_softirqs+0xd2/0x270
   do_softirq+0x3f/0x60
   </IRQ>

That is what goes wrong. As for the fix: a received packet that
carries metadata can reach an encap through any of the three LWT
redirect modes:

  LWTUNNEL_STATE_INPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       lwtunnel_input

  LWTUNNEL_STATE_OUTPUT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             lwtunnel_output

  LWTUNNEL_STATE_XMIT_REDIRECT
   ip6_rcv_finish
     dst_input
       ip6_forward
         ip6_forward_finish
           dst_output
             ip6_output
               ip6_finish_output
                 ip6_finish_output2
                   lwtunnel_xmit

Every encap funnels through the three LWT dispatch helpers, so drop the
metadata there, right before handing the skb to the encap op. This
single chokepoint covers all encap types and all three redirect modes:

  - lwtunnel_input():  seg6, rpl, ila, seg6_local
  - lwtunnel_output(): ioam6
  - lwtunnel_xmit():   mpls, LWT BPF xmit

Alternatively, we could clear the metadata right after TC ingress hook.
That would require a compromise, however. Metadata would become
inaccessible from TC egress (in setups where it actually reaches the
hook intact, that is, without any L2 tunnels on the path).
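
A toy model of the chosen approach, for illustration only. The buffer
layout and the sketch_* names are hypothetical; the point is merely that
the metadata length is cleared before the encap reuses the headroom, so
nothing later treats overwritten bytes as metadata.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_LEN 64

struct sketch_skb {
	unsigned char buf[SKETCH_BUF_LEN];
	size_t data_off;	/* offset of the packet data within buf */
	size_t meta_len;	/* metadata sits in buf[data_off - meta_len, data_off) */
};

/* Stand-in for dropping the metadata: only the length is reset. */
static void sketch_metadata_clear(struct sketch_skb *skb)
{
	skb->meta_len = 0;
}

/* Hypothetical dispatch helper: drop metadata, then let the encap prepend
 * its outer header into the headroom.
 */
static void sketch_lwtunnel_xmit(struct sketch_skb *skb,
				 const unsigned char *hdr, size_t hdr_len)
{
	assert(hdr_len <= skb->data_off);

	sketch_metadata_clear(skb);

	skb->data_off -= hdr_len;	/* moves data back over the headroom */
	memcpy(skb->buf + skb->data_off, hdr, hdr_len);
}
```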

Fixes: 8989d328dfe7 ("net: Helper to move packet data and metadata after skb_push/pull")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20260619-bpf-lwt-drop-skb-metadata-v3-1-71d6a33ab76b@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2026-06-22

1) xfrm: use compat translator only for u64 alignment mismatch
   Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT
   so 32-bit compat tasks on arches whose 32-bit ABI already matches
   the native 64-bit layout are no longer rejected with -EOPNOTSUPP.
   From Sanman Pradhan.

2) net: af_key: initialize alg_key_len for IPComp states
   Initialize the alg_key_len to 0 in the IPComp branch of
   pfkey_msg2xfrm_state() so an uninitialized value cannot drive
   xfrm_alg_len() into a slab-out-of-bounds kmemdup during
   XFRM_MSG_MIGRATE. From Zijing Yin.

3) xfrm: Fix dev use-after-free in xfrm async resumption
   Stash the original skb->dev and extend the RCU critical section
   across xfrm_rcv_cb() and transport_finish() to prevent a
   tunnel-device UAF and original-device refcount leak when a
   callback replaces skb->dev. From Dong Chenchen.

4) xfrm: Fix xfrm state cache insertion race
   Move the state-validity check inside xfrm_state_lock in the
   input state cache insertion path so a state cannot be killed
   between the check and the insert. From Herbert Xu.

5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
   Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count
   and xfrm_policy_default to silence the KCSAN data race reported
   on net->xfrm.policy_count. From Eric Dumazet.

6) espintcp: use sk_msg_free_partial to fix partial send
   Replace the manual skmsg accounting in espintcp with
   sk_msg_free_partial() so the skmsg stays consistent on every
   iteration and the partial-send accounting bugs go away.
   From Sabrina Dubroca.

7) xfrm: validate selector family and prefixlen during match
   Reject mismatched address families in xfrm_selector_match() and
   bound prefixlen in addr4_match()/addr_match() to prevent the
   shift-out-of-bounds syzbot reported when an AF_UNSPEC selector
   with a large prefixlen is matched against an IPv4 flow.
   From Eric Dumazet.

* tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: validate selector family and prefixlen during match
  espintcp: use sk_msg_free_partial to fix partial send
  xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
  xfrm: Fix xfrm state cache insertion race
  xfrm: Fix dev use-after-free in xfrm async resumption
  net: af_key: initialize alg_key_len for IPComp states
  xfrm: use compat translator only for u64 alignment mismatch
====================

Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: lan78xx: restore VLAN and hash filters after link up

Configured VLANs intermittently stop receiving traffic after a link
down/up cycle, e.g. when the network cable is unplugged and plugged back
in. VLAN filtering stays enabled but all VLAN-tagged frames are dropped
until a VLAN is added or removed again.

The LAN7801 datasheet (DS00002123E) states:

  "A portion of the MAC operates on clocks generated by the Ethernet
   PHY. During a PHY reset event, this portion of the MAC is designed to
   not be taken out of reset until the PHY clocks are operational"
  (section 8.10, MAC Reset Watchdog Timer)

  "After a reset event, the RFE will automatically initialize the
   contents of the VHF to 0h."
  (section 7.1.4, VHF Organization)

Thus a link down/up cycle stops and restarts the PHY clock, resets the
PHY-clocked portion of the MAC, and the RFE clears its VLAN/DA hash
filter (VHF) memory. The VHF holds both the VLAN filter table and the
multicast hash table, but the driver never reprograms either from its
shadow copy once the link is back, so both stay empty.

Reprogram the VLAN filter and multicast hash tables on link up.

Reported-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Closes: https://lore.kernel.org/netdev/BEZP281MB224501E38B30BFDC4BD3D364D9E32@BEZP281MB2245.DEUP281.PROD.OUTLOOK.COM/T/#u
Tested-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260622102911.484045-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

veth: fix NAPI leak in XDP enable error path

During XDP enablement in veth, if xdp_rxq_info_reg() or
xdp_rxq_info_reg_mem_model() fails, the driver rolls back the changes.

However, the rollback loop:

  for (i--; i >= start; i--) {

decrements the loop index 'i' before the first iteration. This
correctly skips unregistering the rxq for the failed index 'i' (as
registration failed or was already cleaned up), but it also
erroneously skips calling netif_napi_del() for rq[i].xdp_napi.

Since netif_napi_add() was already called for index 'i', this leaves
a dangling napi_struct in the device's napi_list. When the veth
device is later destroyed, the freed queue memory (which contains the
leaked NAPI structure) can be reused.

The subsequent device teardown iterates the NAPI list and
corrupts the reallocated memory, leading to UAF.

Fix this by explicitly deleting the NAPI association for the failed
index 'i' before rolling back the successfully configured queues.
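
The rollback can be sketched with plain flags instead of real NAPI and rxq
objects. This is a simplified model with hypothetical sketch_* names; the
fail_at parameter exists only to inject a registration failure.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_rq {
	bool napi_added;	/* models netif_napi_add() having run */
	bool rxq_registered;	/* models a successful rxq registration */
};

/* Enable queues [start, end). Registration fails at index fail_at;
 * pass -1 for no failure.
 */
static int sketch_enable_range(struct sketch_rq *rq, int start, int end,
			       int fail_at)
{
	int i;

	assert(start >= 0 && start <= end);

	for (i = start; i < end; i++) {
		rq[i].napi_added = true;
		if (i == fail_at)
			goto err_rxq_reg;
		rq[i].rxq_registered = true;
	}
	return 0;

err_rxq_reg:
	/* the fix: the failed index has a NAPI but no rxq, so undo the
	 * NAPI here before the loop below steps past it
	 */
	rq[i].napi_added = false;
	for (i--; i >= start; i--) {
		rq[i].rxq_registered = false;
		rq[i].napi_added = false;
	}
	return -1;
}
```

Dropping the line right after the err_rxq_reg label reproduces the leak:
the failed index keeps napi_added set after the function returns.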

Fixes: b02e5a0ebb17 ("xsk: Propagate napi_id to XDP socket Rx path")
Reported-by: Guenter Roeck <groeck@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Link: https://patch.msgid.link/20260622111825.88337-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ti: icssg: Fix XSK zero copy TX during application wakeup

emac_xsk_xmit_zc() handles zero-copy TX and is called in NAPI context.
The user application wakes up the kernel when it initiates a transmit,
which triggers NAPI to start processing the TX packets. The num_tx check
inside emac_tx_complete_packets() returns early if no packets were
transferred, which prevents the call to emac_xsk_xmit_zc(). Remove this
check so that an application wakeup can initiate zero-copy transmit
traffic.

Add __netif_tx_lock() to ensure that the TX queue is protected
from concurrent access during the transmission of XDP frames.
This fixes a netdev watchdog timeout seen on long runs.

Fixes: e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Link: https://patch.msgid.link/20260618100348.2209907-1-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: sja1105: round up PTP perout pin duration

pin_duration is converted from the user-provided period to SJA1105
clock ticks and is later passed as the cycle_time argument to
future_base_time().

Very small period values may become zero after the conversion,
which can lead to a division by zero in future_base_time().

Round zero pin_duration up to 1 tick so that the smallest unsupported
periods use the minimum non-zero hardware duration instead of passing
zero to future_base_time().
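
A sketch of the clamp, under stated assumptions: the 8 ns tick length is
assumed for illustration, and sketch_future_base_time() is a simplified
stand-in that only shares the division by cycle_time with the real helper.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NS_PER_TICK 8ULL	/* assumed tick length */

static uint64_t sketch_ns_to_ticks(uint64_t ns)
{
	return ns / SKETCH_NS_PER_TICK;	/* truncates: small values become 0 */
}

/* Convert the period to ticks and round a zero result up to one tick. */
static uint64_t sketch_pin_duration(uint64_t period_ns)
{
	uint64_t pin_duration = sketch_ns_to_ticks(period_ns);

	if (!pin_duration)
		pin_duration = 1;

	return pin_duration;
}

/* Simplified stand-in: first base + n * cycle_time that is not before now.
 * A zero cycle_time would divide by zero, hence the clamp above.
 */
static uint64_t sketch_future_base_time(uint64_t base, uint64_t cycle_time,
					uint64_t now)
{
	uint64_t n;

	assert(cycle_time != 0);

	if (base >= now)
		return base;

	n = (now - base + cycle_time - 1) / cycle_time;
	return base + n * cycle_time;
}
```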

Fixes: 747e5eb31d59 ("net: dsa: sja1105: configure the PTP_CLK pin as EXT_TS or PER_OUT")
Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru>
Link: https://patch.msgid.link/20260618110508.53094-1-aga@itb.spb.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: do not acquire dev->tx_global_lock in netdev_watchdog_up()

Marek Szyprowski reported a deadlock during system resume when the
virtio_net driver is used.

The deadlock occurs because netif_device_attach() is called while holding
dev->tx_global_lock (via netif_tx_lock_bh() in virtnet_restore_up()).
netif_device_attach() calls __netdev_watchdog_up(), which now also tries
to acquire dev->tx_global_lock to synchronize with dev_watchdog().

This recursive lock acquisition results in a deadlock.

Fix this by removing the tx_global_lock acquisition from netdev_watchdog_up().

The critical state (watchdog_timer and watchdog_ref_held) is already
protected by dev->watchdog_lock, which was introduced in the blamed commit.

Fixes: 8eed5519e496 ("net: watchdog: fix refcount tracking races")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/netdev/a443376e-5187-4268-93b3-58047ef113a8@samsung.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260622110108.69541-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net, bpf: check master for NULL in xdp_master_redirect()

xdp_master_redirect() dereferences the result of
netdev_master_upper_dev_get_rcu() without a NULL check, but that helper
returns NULL when the receiving device has no upper-master adjacency.

The reach guard only checks netif_is_bond_slave(). On bond slave release
bond_upper_dev_unlink() drops the upper-master adjacency before clearing
IFF_SLAVE, so an XDP_TX reaching xdp_master_redirect() in that window
still passes netif_is_bond_slave() while master is already NULL, and
faults on master->flags at offset 0xb0:

  BUG: kernel NULL pointer dereference, address: 00000000000000b0
  RIP: 0010:xdp_master_redirect (net/core/filter.c:4432)
  Call Trace:
   xdp_master_redirect (net/core/filter.c:4432)
   bpf_prog_run_generic_xdp (include/net/xdp.h:700)
   do_xdp_generic (net/core/dev.c:5608)
   __netif_receive_skb_one_core (net/core/dev.c:6204)
   process_backlog (net/core/dev.c:6319)
   __napi_poll (net/core/dev.c:7729)
   net_rx_action (net/core/dev.c:7792)
   handle_softirqs (kernel/softirq.c:622)
   __dev_queue_xmit (include/linux/bottom_half.h:33)
   packet_sendmsg (net/packet/af_packet.c:3082)
   __sys_sendto (net/socket.c:2252)
  Kernel panic - not syncing: Fatal exception in interrupt

The missing check dates back to the original code; commit 1921f91298d1
("net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master")
later added the master->flags read where the fault now lands but kept the
unconditional deref. Check master for NULL before use; a NULL master is
treated the same as one that is not up.
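
In sketch form, with a hypothetical struct standing in for the net_device
and its upper-master lookup, the check amounts to:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_IFF_UP 0x1u	/* stand-in for the IFF_UP flag */

/* Hypothetical device: "master" models the result of the upper-master
 * lookup and may be NULL while a slave is being released.
 */
struct sketch_netdev {
	unsigned int flags;
	struct sketch_netdev *master;
};

/* A NULL master is treated the same as a master that is not up. */
static bool sketch_master_usable(const struct sketch_netdev *slave)
{
	const struct sketch_netdev *master;

	assert(slave != NULL);

	master = slave->master;
	return master != NULL && (master->flags & SKETCH_IFF_UP) != 0;
}
```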

Fixes: 879af96ffd72 ("net, core: Add support for XDP redirection to slave device")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260620201531.180123-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-xsk-stabilize-timeout-test-behavior'

Tushar Vyavahare says:

====================
selftests/xsk: stabilize timeout test behavior

This series improves AF_XDP selftests by making timeout handling
explicit and fixing sources of non-determinism in xsk timeout tests.

Patch 1 introduces test_spec::poll_tmout and removes implicit
dependence on RX UMEM setup state for timeout behavior.

Patch 2 fixes thread harness sequencing by attaching XDP programs
before worker startup, removing signal-based termination, and using
barrier synchronization only for dual-thread runs.

Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
configuration does not leak into subsequent cases on shared-netdev
runs.

Together these changes make timeout handling easier to follow and
improve selftest stability, especially on real NIC runs.
====================

Link: https://patch.msgid.link/20260616154955.1492560-1-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/xsk: restore shared_umem after POLL_TXQ_FULL

POLL_TXQ_FULL temporarily disables shared_umem on TX to exercise the
TX timeout path in isolation.

With shared_umem enabled, TX setup expects RX UMEM to be initialized
first and fails with: "RX UMEM is not initialized before shared-UMEM TX
setup".

Save and restore shared_umem around POLL_TXQ_FULL execution, and restore
it on both success and pkt_stream_replace() failure paths.

Also add an in-code comment explaining why shared_umem is temporarily
disabled in this test.

This keeps timeout setup local and prevents cross-test state leakage.
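
A minimal sketch of the save/restore pattern, assuming a simplified test
descriptor. fake_test and run_txq_full() are hypothetical names, and the
replace_fails flag stands in for a pkt_stream_replace() failure; the real
selftest code is more involved.

```c
#include <stdbool.h>

/* Hypothetical, reduced test descriptor. */
struct fake_test {
	bool shared_umem;
};

/*
 * Temporarily disables shared_umem for the duration of the test case
 * and restores the saved value on both the success and failure paths.
 */
static int run_txq_full(struct fake_test *t, bool replace_fails)
{
	bool saved = t->shared_umem;
	int ret = 0;

	t->shared_umem = false;

	if (replace_fails)
		ret = -1;	/* stand-in for a pkt_stream_replace() failure */
	/* otherwise the timeout test body would run here */

	t->shared_umem = saved;	/* single exit: restore happens either way */
	return ret;
}
```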

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260616154955.1492560-4-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/xsk: fix timeout thread harness sequencing

Prevent workers from running before XDP program attachment completes.
The previous ordering allowed races between worker startup and setup.

Attach XDP programs before entering traffic validation.

Remove SIGUSR1-based worker termination and use pthread_join() for
thread shutdown so blocking syscalls are not interrupted.

Use barriers only for dual-thread runs so participants match and
teardown ordering stays deterministic.

This removes setup/startup races and stabilizes harness sequencing.

Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260616154955.1492560-3-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/xsk: make poll timeout mode explicit

Stop inferring timeout behavior from RX UMEM initialization state.
That ties timeout semantics to setup internals and obscures intent.

Use test_spec::poll_tmout as the explicit timeout-mode selector in
TX and RX paths.

In RX, treat poll timeout as expected only in timeout mode.
In TX, let send_pkts() own loop completion in non-timeout mode
and use __send_pkts() only for progress and timeout detection.

This makes timeout logic explicit and keeps control flow predictable.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Tushar Vyavahare <tushar.vyavahare@intel.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20260616154955.1492560-2-tushar.vyavahare@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration

On helper registration, the maximum number of expectations cannot exceed
NF_CT_EXPECT_MAX_CNT (255), but zero can be specified, in which case
nf_conntrack_expect_max applies. Turn zero into NF_CT_EXPECT_MAX_CNT;
otherwise, expectation LRU eviction on insertion is disabled.

Moreover, expand this sanity check to all expectation classes.

This max_expected policy has only been tunable since userspace helpers
became available, so set the Fixes: tag to the commit that adds such
infrastructure.

Remove the check for p->max_expected given this field must always
be non-zero after this patch.
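
The normalization can be sketched as below. normalize_max_expected() is
a hypothetical helper, and rejecting oversized values with an error is
an assumption about how the existing upper-bound check behaves; only the
255 limit and the zero-to-maximum mapping come from the text above.

```c
#define NF_CT_EXPECT_MAX_CNT 255	/* value quoted in the commit message */

/*
 * Hypothetical helper: normalize the limit of one expectation class.
 * Returns 0 on success, -1 if the requested limit exceeds the cap.
 * After a successful call the limit is always non-zero.
 */
static int normalize_max_expected(unsigned int *max_expected)
{
	if (*max_expected > NF_CT_EXPECT_MAX_CNT)
		return -1;

	/* Zero used to mean "no per-helper cap"; map it to the cap. */
	if (*max_expected == 0)
		*max_expected = NF_CT_EXPECT_MAX_CNT;

	return 0;
}
```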

Fixes: 12f7a505331e ("netfilter: add user-space connection tracking helper infrastructure")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_ct: expectation timeouts are passed in milliseconds

Userspace passes '5000' in case user asks for 5 seconds.

Allowing for sub-second expectation lifetimes makes sense to me, so
fix up the kernel side instead of munging nft to send a value rounded
up to next second.

Also note that this violates nft convention of passing integers in
network byte order, but we can't change this anymore.
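
For illustration only, a millisecond value can be turned into timer
ticks as sketched below. FAKE_HZ and ms_to_ticks() are hypothetical; this
is not the kernel's conversion helper, just the rounding-up arithmetic a
millisecond interpretation implies.

```c
#include <stdint.h>

#define FAKE_HZ 250u	/* hypothetical tick rate, chosen for illustration */

/*
 * Converts milliseconds to ticks, rounding up so that a non-zero
 * sub-tick timeout never collapses to zero ticks.
 */
static uint64_t ms_to_ticks(uint32_t ms)
{
	return ((uint64_t)ms * FAKE_HZ + 999u) / 1000u;
}
```

With this tick rate, the '5000' passed by userspace maps to 1250 ticks,
i.e. 5 seconds, and a 500 ms lifetime remains representable.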

Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack_expect: run expectation eviction with no helper

Run expectation eviction if no helper is specified to deal with the
nft_ct expectation support.

Cap the maximum expectation limit per master conntrack to
NF_CT_EXPECT_MAX_CNT (255).

Fixes: 857b46027d6f ("netfilter: nft_ct: add ct expectations support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conntrack_expect: store master_tuple in expectation

Store master conntrack tuple in the expectation since exp->master might
refer to a different conntrack when accessed from rcu read side lock
area due to typesafe rcu rules.

Fixes: 02a3231b6d82 ("netfilter: nf_conntrack_expect: store netns and zone in expectation")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: add deprecation warnings for irc and pptp trackers

IRC Direct client-to-client requires plaintext. IRC over TLS should be
preferred, making this helper ineffective. Add a deprecation warning and
update the help text to better reflect that this is needed for the DCC
extension, not IRC itself.

PPTP is esoteric these days and it is the only helper that requires the
destroy callback in the conntrack helper API.

Removal would simplify the conntrack core.

Both helpers are IPv4 only.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: do not allow to reset helper on existing conntrack

This feature allows resetting a helper for an existing conntrack, but it
is not safe. This requires a synchronize_rcu() call after resetting the
helper, which is going to be expensive for a large batch of conntrack
entries. Fixing it would also require calling the .destroy callback to
release the GRE/PPTP mappings.

This feature antedates the creation of the conntrack-tools and I cannot
find a good use-case for this. Given that I cannot find any user in the
netfilter.org userspace tree, I prefer to remove this feature.

Fixes: c1d10adb4a52 ("[NETFILTER]: Add ctnetlink port for nf_conntrack")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

selftests: nft_queue.sh: add a bridge queue test

Add a test queueing from bridge family.
This was lacking: we queued from inet for ipv4 and ipv6 but
we had no bridge queue test so far.

Given that the kernel MUST validate that the in/out ports are still part
of a bridge device on reinject, add a test case for this before adding
this check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_compat: ebtables emulation must reject non-bridge targets

xtables targets return netfilter verdicts: NF_ACCEPT, NF_DROP, and so
on. ebtables targets return incompatible verdicts: EBT_ACCEPT,
EBT_DROP, ... We cannot allow fallback to NFPROTO_UNSPEC.

ebtables doesn't permit this since
11ff7288beb2 ("netfilter: ebtables: reject non-bridge targets")
but that commit missed the nft_compat layer.

Reported-by: Ren Wei <n05ec@lzu.edu.cn>
Reported-by: Wyatt Feng <bronzed_45_vested@icloud.com>
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Zhengchuan Liang <zcliangcn@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test

The existing test covered a scenario where a delayed INIT_ACK chunk
updates the vtag in conntrack after the association has already been
established.

A similar issue can occur with a delayed SCTP INIT chunk.

Add a new simultaneous-open test case where the client's INIT is
delayed, allowing conntrack to establish the association based on
the server-initiated handshake.

When the stale INIT arrives later, it may get recorded and cause a
following INIT_ACK from the peer to be accepted instead of dropped.
This INIT_ACK overwrites the vtag in conntrack, causing subsequent
SCTP DATA chunks to be considered as invalid and then dropped by
nft rules matching on ct state invalid.

This test verifies such stale INIT chunks do not cause problems.

Signed-off-by: Yi Chen <yiche.cy@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_synproxy: stop bypassing the priv->info snapshot

nft_synproxy_eval_v4() and nft_synproxy_eval_v6() already take a
whole-object READ_ONCE() snapshot of the shared priv->info state before
building the SYNACK reply, but nft_synproxy_tcp_options() still masks
opts->options with priv->info.options from the live shared object.

When a named synproxy object is updated concurrently with SYN traffic,
the eval path can then mix mss and timestamp handling from the local
snapshot with an options mask taken from a newer configuration, so one
SYNACK no longer reflects a coherent synproxy configuration.

Use info->options so nft_synproxy_tcp_options() stays on the same local
snapshot that the eval path already copied from priv->info.
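
The idea can be sketched in plain C as follows. fake_synproxy_info and
mask_options() are hypothetical stand-ins, and the single-threaded copy
merely models what a READ_ONCE() snapshot achieves in the kernel.

```c
#include <stdint.h>

/* Hypothetical, reduced version of the shared synproxy configuration. */
struct fake_synproxy_info {
	uint8_t  options;
	uint16_t mss;
};

/*
 * Masks the requested TCP options using only the caller's local
 * snapshot, so every field of one reply comes from the same
 * configuration even if the shared object changes meanwhile.
 */
static uint8_t mask_options(uint8_t requested,
			    const struct fake_synproxy_info *snapshot)
{
	return (uint8_t)(requested & snapshot->options);
}
```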

Fixes: ee394f96ad75 ("netfilter: nft_synproxy: add synproxy stateful object support")
Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: x_tables.h: fix all kernel-doc warnings

- use correct names in kernel-doc comments
- add missing struct members to kernel-doc comments

Warning: include/linux/netfilter/x_tables.h:41 struct member 'targinfo' not described in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:41 Excess struct member 'targetinfo' description in 'xt_action_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'family' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:90 struct member 'nft_compat' not described in 'xt_mtchk_param'
Warning: include/linux/netfilter/x_tables.h:101 expecting prototype for struct xt_mdtor_param. Prototype was for struct xt_mtdtor_param instead

Warning: include/linux/netfilter/x_tables.h:121 struct member 'net' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'table' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'target' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'targinfo' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'hook_mask' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'family' not described in 'xt_tgchk_param'
Warning: include/linux/netfilter/x_tables.h:121 struct member 'nft_compat' not described in 'xt_tgchk_param'

Warning: include/linux/netfilter/x_tables.h:345 expecting prototype for xt_recseq(). Prototype was for DECLARE_PER_CPU() instead

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: Validate iph->ihl in nf_flow_ip4_tunnel_proto()

Add sanity check for iph->ihl field in nf_flow_ip4_tunnel_proto() before
using it to compute the header size, avoiding out-of-bounds access with
malformed IP headers.
While at it, use iph->protocol instead of the hardcoded IPPROTO_IPIP
constant when setting ctx->tun.proto and reference ctx->tun.hdr_size
when updating ctx->offset.
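
A hedged sketch of such a sanity check, based on the IPv4 header format
(ihl counts 32-bit words, with a minimum of 5). ipv4_hdr_size() and the
avail parameter are hypothetical; the exact condition used in the patch
may differ.

```c
#include <stddef.h>

/*
 * Returns the IPv4 header length in bytes, or 0 when ihl is out of
 * range or the header would not fit in the avail bytes of data.
 */
static size_t ipv4_hdr_size(unsigned int ihl, size_t avail)
{
	size_t size;

	if (ihl < 5 || ihl > 15)	/* 4-bit field, at least 20 bytes */
		return 0;

	size = (size_t)ihl * 4;
	return size <= avail ? size : 0;
}
```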

Fixes: ab427db178858 ("netfilter: flowtable: Add IPIP rx sw acceleration")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_conncount: prevent connlimit drops for early confirmed ct

Commit 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add
was skipped") introduced a regression where packets for valid
connections are dropped when using connlimit for soft-limiting
scenarios.

The issue occurs when a new connection reuses a socket currently in
the TIME_WAIT state. In this scenario, the connection tracking entry
is evaluated as already confirmed. Previously, __nf_conncount_add()
assumed that if a connection was confirmed and did not originate from
the loopback interface, it should skip the addition and return -EEXIST.

Skipping the addition triggers a garbage collection run that cleans up
the TIME_WAIT connection. Consequently, the active connection count
drops to 0, which xt_connlimit mishandles, leading to the false rejection
of the perfectly valid new connection.

Fix this by replacing the interface check with protocol-agnostic state
checks. We now skip the tree insertion and preserve the lockless garbage
collection optimization only if the connection is IPS_ASSURED. This
allows early-confirmed setup packets (such as reused TIME_WAIT sockets
or locally generated SYN-ACKs) to be properly evaluated and counted
without falsely dropping. The goto check_connections path is maintained
to ensure these setup packets are deduplicated correctly.

This has been tested with slowhttptest and HTTP server configured
locally to ensure we are not breaking soft-limiting scenarios for local
or external connections. In addition, it was tested with an OVS zone
limit too.

Fixes: 69894e5b4c5e ("netfilter: nft_connlimit: update the count if add was skipped")
Reported-by: Alejandro Olivan Alvarez <alejandro.olivan.alvarez@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/177349610461.3071718.4083978280323144323@eldamar.lan/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_nat: avoid invalid nat_net pointer use on failed nf_nat_init()

We ran into the KASAN splat below, which is mostly uninteresting,
besides having nf_nat_register_fn() in the call chain as a cause for the
offending access:

==================================================================
BUG: KASAN: slab-out-of-bounds in nf_nat_register_fn+0x5f9/0x640
Read of size 8 at addr ffff890031e54c20 by task iptables/9510

CPU: 0 UID: 0 PID: 9510 Comm: iptables Not tainted 6.18.18-grsec-full-20260320181326 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
[…] dump_stack_lvl+0xee/0x160 ffff88004117eeb8
[…] print_report+0x6e/0x640 ffff88004117eee0
[…] ? __phys_addr+0x8e/0x140 ffff88004117eef0
[…] ? kasan_addr_to_slab+0x51/0xe0 ffff88004117ef08
[…] ? complete_report_info+0xec/0x1c0 ffff88004117ef20
[…] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef48
[…] kasan_report+0xbc/0x140 ffff88004117ef50
[…] ? nf_nat_register_fn+0x5f9/0x640 ffff88004117ef90
[…] nf_nat_register_fn+0x5f9/0x640 ffff88004117eff8
[…] ? nf_nat_icmp_reply_translation+0x6e0/0x6e0 ffff88004117f070
[…] nf_tables_register_hook.part.0+0xa0/0x220 ffff88004117f080
[…] nf_tables_addchain.constprop.0+0x1054/0x1fc0 ffff88004117f0b8
[…] ? nft_chain_lookup.part.0+0x4ce/0xac0 ffff88004117f130
[…] ? nf_tables_abort+0x3d80/0x3d80 ffff88004117f190
[…] ? nf_tables_dumpreset_obj+0x100/0x100 ffff88004117f1c8
[…] ? nft_table_lookup.part.0+0x255/0x300 ffff88004117f310
[…] ? nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f358
[…] nf_tables_newchain+0x21a4/0x2fa0 ffff88004117f360
[…] ? nf_tables_addchain.constprop.0+0x1fc0/0x1fc0 ffff88004117f458
[…] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f488
[…] ? lock_acquire+0x16f/0x320 ffff88004117f490
[…] ? find_held_lock+0x3b/0xe0 ffff88004117f4b0
[…] ? __nla_parse+0x45/0x80 ffff88004117f500
[…] nfnetlink_rcv_batch+0xbca/0x19a0 ffff88004117f550
[…] ? nfnetlink_net_exit_batch+0x120/0x120 ffff88004117f618
[…] ? __sanitizer_cov_trace_switch+0x63/0xe0 ffff88004117f720
[…] ? gr_acl_handle_mmap+0x1c4/0x320 ffff88004117f7c0
[…] ? nla_get_range_signed+0x4a0/0x4a0 ffff88004117f7e8
[…] ? gr_is_capable+0x6f/0xe0 ffff88004117f830
[…] ? __nla_parse+0x45/0x80 ffff88004117f860
[…] ? skb_pull+0x103/0x1a0 ffff88004117f880
[…] nfnetlink_rcv+0x3db/0x4a0 ffff88004117f8b0
[…] ? nfnetlink_rcv_batch+0x19a0/0x19a0 ffff88004117f8d8
[…] ? netlink_lookup+0xe2/0x240 ffff88004117f900
[…] netlink_unicast+0x74b/0xb00 ffff88004117f930
[…] ? netlink_attachskb+0xb20/0xb20 ffff88004117f980
[…] ? __check_object_size+0x3e/0xaa0 ffff88004117f998
[…] ? security_netlink_send+0x51/0x160 ffff88004117f9c8
[…] netlink_sendmsg+0xa03/0x1200 ffff88004117f9f8
[…] ? netlink_unicast+0xb00/0xb00 ffff88004117fa70
[…] ? netlink_unicast+0xb00/0xb00 ffff88004117fac8
[…] ? ____sys_sendmsg+0xe2a/0x1040 ffff88004117faf8
[…] ____sys_sendmsg+0xe2a/0x1040 ffff88004117fb00
[…] ? kernel_recvmsg+0x300/0x300 ffff88004117fb60
[…] ? reacquire_held_locks+0xe9/0x260 ffff88004117fbc8
[…] ___sys_sendmsg+0x138/0x200 ffff88004117fbf8
[…] ? do_recvmmsg+0x7e0/0x7e0 ffff88004117fc30
[…] ? lockdep_hardirqs_on_prepare+0x101/0x1e0 ffff88004117fc50
[…] ? lock_acquire+0x16f/0x320 ffff88004117fd20
[…] ? lock_acquire+0x16f/0x320 ffff88004117fd58
[…] ? find_held_lock+0x3b/0xe0 ffff88004117fd70
[…] __sys_sendmsg+0x17a/0x260 ffff88004117fdc8
[…] ? __sys_sendmsg_sock+0x80/0x80 ffff88004117fdf0
[…] ? syscall_trace_enter+0x15e/0x2c0 ffff88004117fe98
[…] do_syscall_64+0x7d/0x400 ffff88004117fec8
[…] entry_SYSCALL_64_safe_stack+0x4a/0x60 ffff88004117fef8
</TASK>
==================================================================

The out-of-bounds report, though, is a red herring as it is for an
access that shouldn't have happened in the first place.

When nf_nat_init() fails to register its BPF kfuncs, it'll unwind and,
among others, call unregister_pernet_subsys() to deregister its per-net
ops. This makes the previously allocated net id available for reuse by
the next caller of register_pernet_subsys(), in our case, synproxy.
However, 'nat_net_id' will still hold the previously allocated value.

If nf_nat.o gets built as a module, all this doesn't matter. A failed
initialization routine makes the module fail to load and any dependent
module won't be able to load either. However, if nf_nat.o is built-in,
a failing init won't /completely/ make its functionality unavailable to
dependent modules, namely the code and static data is still there, free
to be called by modules like nft_chain_nat.ko.

Case in point, nft_chain_nat registers hooks that'll call into nf_nat
which, in our case, failed to initialize and therefore won't have a
valid net id nor related net_nat object any more.

Code in nf_nat, namely nf_nat_register_fn() and nf_nat_unregister_fn(),
still makes use of the reallocated net id, leading to a type confusion as
the call to net_generic() will no longer return memory belonging to an
object suited to fit 'struct nat_net' but 'struct synproxy_net' instead.
The latter is only 24 bytes on 64-bit systems, much smaller than struct
nat_net which is 176 bytes, perfectly explaining the OOB KASAN report.

Detect and handle a failed nf_nat_init() by testing the 'nf_nat_hook'
pointer which will be reset to NULL on initialization errors to prevent
the usage of an invalid nat_net pointer.

As this check is only needed when nf_nat.o is built-in, guard it by
'#ifndef MODULE...'.

Fixes: cbc1dd5b659f ("netfilter: nf_nat: Fix possible memory leak in nf_nat_init()")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

selftests: drv-net: so_txtime: relax variance bounds

The net-next-hw spinners on netdev.bots.linux.dev observe failing
so-txtime-py tests. A review of stdout shows most failures to be
due to exceeding the 4ms grace period. All I saw were within 8ms.
So increase to that.

Double the bounds from 4 to 8ms. This is still small enough to
differentiate the delays programmed by the test, 10 and 20ms.

Fixes: 5c6baef3885c ("selftests: drv-net: convert so_txtime to drv-net")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20260610170651.1b644001@kernel.org/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260621200137.1564776-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Fix TX scheduler queue mask loop upper bound

In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).

Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
- channel 0: clears bit 0..31 (all 4 channels) instead of 0..7
- channel 1: clears bit 8..31 (channels 1-3) instead of 8..15
- channel 2: clears bit 16..31 (channels 2-3) instead of 16..23
- channel 3: clears bit 24..31 (channel 3 only) - correct by accident

While BIT(32+) on arm64 produces 64-bit values truncated to 0 in u32
mask parameter, the loop still incorrectly clears queues within the
same channel beyond queue 7.

Even though this is functionally harmless (the register resets to 0
and is only ever cleared, never set — so clearing extra bits is a
no-op), the loop bound is semantically wrong and should be fixed for
correctness and clarity.

Fix by using AIROHA_NUM_QOS_QUEUES (8) as the loop upper bound.
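
The bit arithmetic above can be reproduced with a small model.
chan_queue_mask() is a hypothetical helper that ORs together
BIT(i + channel * 8) for i below the given bound and truncates to 32
bits, mirroring the u32 mask parameter mentioned above.

```c
#include <stdint.h>

#define QUEUES_PER_CHANNEL 8

/*
 * Builds the mask covered by a loop over nqueues queues of one channel.
 * The shift is done in 64 bits and then truncated, so bits at position
 * 32 and above simply vanish.
 */
static uint32_t chan_queue_mask(unsigned int channel, unsigned int nqueues)
{
	uint32_t mask = 0;
	unsigned int i;

	for (i = 0; i < nqueues; i++)
		mask |= (uint32_t)(UINT64_C(1) <<
				   (i + channel * QUEUES_PER_CHANNEL));

	return mask;
}
```

With a bound of 8, channel 1 yields 0x0000ff00; with a bound of 32 it
yields 0xffffff00, which also covers channels 2 and 3.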

Fixes: ef1ca9271313 ("net: airoha: Add sched HTB offload support")
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Wayen Yan <win847@gmail.com>
Link: https://patch.msgid.link/178187479434.2400840.1312143943526335838@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnx2x: fix potential memory leak in bnx2x_alloc_mem_bp()

If the allocation of fp[i].tpa_info fails, the error path will not free
the struct bnx2x_fastpath allocated earlier, as it is not linked to the
bp structure yet. Fix that by linking it immediately after allocation.
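
The pattern can be sketched in userspace like this. The owner/child
structures and the fail_second switch are hypothetical; they only model
"link the allocation to its owner right away so the common error path
can find and free it".

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the fastpath array and its owner. */
struct child {
	int *info;
};

struct owner {
	struct child *fp;
};

/* Common cleanup: frees whatever is currently linked to the owner. */
static void owner_free(struct owner *o)
{
	if (o->fp)
		free(o->fp->info);
	free(o->fp);
	o->fp = NULL;
}

/* fail_second simulates a failure of the second allocation. */
static int owner_alloc(struct owner *o, int fail_second)
{
	struct child *fp = calloc(1, sizeof(*fp));

	if (!fp)
		return -1;
	o->fp = fp;	/* link immediately, before anything else can fail */

	fp->info = fail_second ? NULL : calloc(4, sizeof(int));
	if (!fp->info) {
		owner_free(o);	/* sees o->fp, so nothing is leaked */
		return -1;
	}
	return 0;
}
```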

Cc: stable@vger.kernel.org
Fixes: 15192a8cf8a8 ("bnx2x: Split the FP structure")
Signed-off-by: Abdun Nihaal <nihaal@cse.iitm.ac.in>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260620062402.89549-1-nihaal@cse.iitm.ac.in
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: fib: Don't ignore error route in local/main tables.

When CONFIG_IP_MULTIPLE_TABLES is enabled but no rule is added,
fib_lookup() performs route lookup directly on two tables.

Since the first lookup does not properly bail out, the result
of an error route in the merged local/main table could be
overwritten by another route in the default table:

  # unshare -n
  # ip link set lo up
  # ip route add 192.168.0.0/24 dev lo table 253
  # ip route add unreachable 192.168.0.0/24
  # ip route get 192.168.0.1
  192.168.0.1 dev lo table default uid 0
      cache <local>

Once a random rule is added, the error route is respected:

  # ip rule add table 0
  # ip rule del table 0
  # ip route get 192.168.0.1
  RTNETLINK answers: No route to host

Let's fix the inconsistent behaviour.

Fixes: f4530fa574df ("ipv4: Avoid overhead when no custom FIB rules are installed.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260619212753.3367244-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: bnxt: improve the timing of stats

Kernel selftests wait 1.25x of the promised stats refresh time
(as read from ethtool -c). bnxt reports 1sec by default, but
the stats update process has two steps. First device DMAs the
new values, then the service task performs update in full-width
SW counters. So the worst case delay is actually 2x.

Note that the behavior is different for ring stats and port stats.
Port stats are fetched synchronously by the service worker, so
there's no risk of doubling up the delay there.

The problem of stale stats impacts not only tests but real workloads
which monitor egress bandwidth of a NIC. The inaccuracy causes double
counting in the next cycle and spurious overload alarms.

Try to read from the DMA buffer more aggressively, to mitigate
timing issues between DMA and service task. The SW update should
be cheap.

Fixes: 51f307856b60 ("bnxt_en: Allow statistics DMA to be configurable using ethtool -C.")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260619191538.104165-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: Fix null-ptr-deref in fib6_nh_mtu_change().

fib6_nh_mtu_change() re-fetches idev via __in6_dev_get(arg->dev) and
dereferences idev->cnf.mtu6 without a NULL check. addrconf_ifdown()
clears dev->ip6_ptr with RCU_INIT_POINTER() after rt6_disable_ip() has
released tb6_lock, so the RA-driven MTU walk can observe a NULL idev and
oops. The caller rt6_mtu_change_route() guards its own __in6_dev_get(),
but this re-fetch is unguarded; nexthop-backed routes survive
addrconf_ifdown()'s flush, so the walk still reaches it after ip6_ptr is
nulled.

Return 0 when idev is NULL, matching rt6_mtu_change_route() and the
fib6_mtu() fix in commit 5ad509c1fdad ("ipv6: Fix null-ptr-deref in
fib6_mtu().").

  Oops: general protection fault, ... KASAN: null-ptr-deref in range
        [0x00000000000002a8-0x00000000000002af]
  RIP: 0010:fib6_nh_mtu_change+0x203/0x990
   rt6_mtu_change_route+0x141/0x1d0
   __fib6_clean_all+0xd0/0x160
   rt6_mtu_change+0xb4/0x100
   ndisc_router_discovery+0x24b5/0x2cb0
   icmpv6_rcv+0x12e9/0x1710
   ipv6_rcv+0x39b/0x410

Fixes: c0b220cf7d80 ("ipv6: Refactor exception functions")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260619045334.2427073-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hdlc_ppp: sync per-proto timers before freeing hdlc state

Each PPP control protocol (LCP/IPCP/IPV6CP) embedded in struct ppp
registers a timer via timer_setup(). That struct ppp is the
hdlc->state allocation, which detach_hdlc_protocol() frees with kfree()
in both teardown paths: unregister_hdlc_device() and the re-attach inside
attach_hdlc_protocol().

The ppp proto never registered a .detach callback, so
detach_hdlc_protocol() performs no timer synchronization before the
kfree(). The only cancel, timer_delete(&proto->timer) in ppp_cp_event(),
is partial (it does not wait for a running callback) and only runs on the
->CLOSED transition; ppp_stop()/ppp_close() do not sync either. A
ppp_timer callback already executing (blocked on ppp->lock) survives the
kfree and then dereferences proto->state / ppp->lock in freed memory,
leading to a use-after-free.

Fix this by adding a .detach helper that calls timer_shutdown_sync() on
every per-proto timer. detach_hdlc_protocol() invokes proto->detach(dev)
before kfree(hdlc->state), so timer_shutdown_sync() now runs on both
free paths. timer_shutdown_sync() is used instead of timer_delete_sync()
because the keepalive path re-arms the timer through
add_timer()/mod_timer() and shutdown blocks any re-activation during
teardown.

Initialize the per-protocol timers in ppp_ioctl() when the protocol is
attached, and remove the now-redundant timer_setup() from ppp_start(), so
that the timers are initialized exactly once at attach time and
ppp_timer_release() never operates on uninitialized timer_list
structures. attach_hdlc_protocol() uses kmalloc() (not kzalloc), so
struct ppp's protos[i].timer is uninitialized garbage until the first
timer_setup(); without this init-at-attach, attaching the PPP protocol
without ever bringing the device up would leave timer_shutdown_sync()
operating on uninitialized memory in .detach. Moving the init out of
ppp_start() (which only runs on NETDEV_UP) into the attach path makes the
initialization unconditional and avoids initializing the same timer_list
twice.

This bug was found by static analysis.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
Link: https://patch.msgid.link/20260617020518.116319-1-fanwu01@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: icssg: guard PA stat lookups

icssg_ndo_get_stats64() unconditionally calls emac_get_stat_by_name()
with FW PA stat names regardless of whether the PA stats block is
present on the hardware.  emac_get_stat_by_name() already guards the
PA stats lookup with `if (emac->prueth->pa_stats)`; when that pointer
is NULL the lookup falls through to netdev_err() and returns -EINVAL.
Because ndo_get_stats64 is polled regularly by the networking stack
this produces thousands of log entries of the form:

  icssg-prueth icssg1-eth end0: Invalid stats FW_RX_ERROR

A secondary consequence is that the int(-EINVAL) return value is
implicitly widened to a near-ULLONG_MAX unsigned value when accumulated
into the __u64 fields of rtnl_link_stats64, silently corrupting the
rx_errors, rx_dropped and tx_dropped counters reported by `ip -s link`.

Every other PA-aware code path in the driver is already guarded with
the same `if (emac->prueth->pa_stats)` check.  Apply the same guard
here.
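
The counter corruption follows from C's integer conversions and can be
shown in isolation. Both helpers below are hypothetical; the actual fix
guards the lookup on pa_stats rather than checking the sign, so the
second function only illustrates one way to avoid the wrap.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Adds an int return value to a 64-bit counter without checking it.
 * A negative errno is converted to uint64_t and wraps to a value
 * close to UINT64_MAX.
 */
static uint64_t accumulate_unchecked(uint64_t counter, int stat)
{
	return counter + (uint64_t)stat;
}

/* Skips negative (error) values instead of accumulating them. */
static uint64_t accumulate_checked(uint64_t counter, int stat)
{
	return stat < 0 ? counter : counter + (uint64_t)stat;
}
```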

Fixes: 0d15a26b247d ("net: ti: icssg-prueth: Add ICSSG FW Stats")
Signed-off-by: Philippe Schenker <philippe.schenker@impulsing.ch>
Reviewed-by: Simon Horman <horms@kernel.org>
Cc: danishanwar@ti.com
Cc: rogerq@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260618093037.3448858-1-dev@pschenker.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rocker: Fix memory leak in ofdpa_port_fdb()

In ofdpa_port_fdb(), the hash_del() only unlinks the node from
hash table, but does not free it.

Fix this by adding kfree(found) after the !found == removing check,
where the pointer value is no longer needed.

Found by Coccinelle kfree script.

Cc: <stable+noautosel@kernel.org> # rocker is a test harness, it's never loaded on production systems
Signed-off-by: Ziran Zhang <zhangcoder@yeah.net>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260616013245.7098-1-zhangcoder@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

e1000e: Reconfigure PLL clock gate timeout and re-enable K1 on Meteor Lake

Commit 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1")
disabled K1 by default on Meteor Lake and newer systems due to packet
loss observed on various platforms. However, disabling K1 caused an
increase in power consumption.

To mitigate this, reconfigure the PLL clock gate value so that K1 can
remain enabled without incurring the additional power consumption.
Re-enable K1 by default, but keep the private flag to support disabling
it via ethtool. Additionally, introduce a DMI quirk table, so that K1 may
be disabled by default on known problematic systems. Currently, this
includes the Dell Pro 16 Plus, where the issue has been reported to persist
despite the changes to the PLL lock timeout.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220954
Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20250623/048860.html
Link: https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20260330/054059.html
Signed-off-by: Dima Ruinskiy <dima.ruinskiy@intel.com>
Co-developed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Fixes: 3c7bf5af21960 ("e1000e: Introduce private flag to disable K1")
Tested-by: Moriya Kadosh <moriyax.kadosh@intel.com>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Fix i40e_debug() to use struct i40e_hw argument

i40e_debug() macro takes struct i40e_hw *h as first argument. But the
macro body uses hw instead of h. This has been working so far because hw
happens to be the name of the variable in the context where the macro is
expanded. Fix the macro to use the passed argument.
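
The failure mode can be reproduced outside the kernel. The following is a
minimal sketch, not the driver code: struct demo_hw, demo_debug_buggy(),
demo_debug_fixed() and the two helper functions are hypothetical names
invented for illustration. It shows that a macro body naming hw instead of
its parameter silently acts on the caller's local hw whenever a different
pointer is passed.

```c
#include <assert.h>

/* Hypothetical stand-in for struct i40e_hw, reduced to what the sketch needs. */
struct demo_hw {
	unsigned int debug_mask;
	unsigned int hits;	/* number of messages that passed the mask */
};

/* Buggy shape: parameter h is ignored, the body uses whatever "hw" is in scope. */
#define demo_debug_buggy(h, m)				\
	do {						\
		if ((m) & hw->debug_mask)		\
			hw->hits++;			\
	} while (0)

/* Fixed shape: the body refers only to its own parameter. */
#define demo_debug_fixed(h, m)				\
	do {						\
		if ((m) & (h)->debug_mask)		\
			(h)->hits++;			\
	} while (0)

/* The caller has a local named hw but wants to log against "other". */
static unsigned int log_other_buggy(struct demo_hw *hw, struct demo_hw *other)
{
	assert(hw && other);
	demo_debug_buggy(other, 0x1);	/* updates hw, not other */
	return other->hits;
}

static unsigned int log_other_fixed(struct demo_hw *hw, struct demo_hw *other)
{
	assert(hw && other);
	demo_debug_fixed(other, 0x1);	/* updates other, as requested */
	return other->hits;
}
```

With the buggy shape the code also fails to compile in any scope that has
no variable named hw, which is the other way this class of bug usually
surfaces.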

Fixes: 5dfd37c37a44 ("i40e: Split i40e_osdep.h")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: dpll: fix memory leak in ice_dpll_init_info error paths

Several error return paths in ice_dpll_init_info() directly return
without freeing previously allocated resources, causing memory leaks:

- When de->input_prio allocation fails, d->inputs is leaked
- When dp->input_prio allocation fails, d->inputs and de->input_prio
  are leaked
- When ice_get_cgu_rclk_pin_info() fails, all previously allocated
  inputs/outputs/input_prio are leaked
- When ice_dpll_init_pins_info(RCLK_INPUT) fails, same resources
  are leaked

Fix this by jumping to the deinit_info label which properly calls
ice_dpll_deinit_info() to free all allocated resources.

Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: dpll: set pointers to NULL after kfree in ice_dpll_deinit_info

ice_dpll_deinit_info() calls kfree() on several pf->dplls fields
(inputs, outputs, eec.input_prio, pps.input_prio) but does not set
the pointers to NULL afterward. This leaves dangling pointers in the
pf->dplls structure.

While not currently exploitable through existing code paths, this is
unsafe because:

1. If ice_dpll_init_info() is called again after a deinit (e.g. during
   driver recovery), and a subsequent allocation within init fails, the
   error path will jump to deinit_info and call ice_dpll_deinit_info()
   again. Since some pointers still hold the old freed addresses, this
   would result in a double-free.

2. Any future code that checks these pointers before use or after free
   would be unprotected against use-after-free.

Follow the common kernel convention of setting pointers to NULL after
kfree() so that:
- kfree(NULL) is a safe no-op, preventing double-free
- NULL checks on these pointers become meaningful
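
The re-init scenario from point 1 can be modelled in userspace. This is a
minimal sketch under stated assumptions: struct demo_dplls, demo_init() and
demo_deinit() are hypothetical names, free()/calloc() stand in for
kfree()/kcalloc(), and the fail_inputs flag simulates an allocation failure.
It is not the ice driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for the pf->dplls allocations. */
struct demo_dplls {
	int *inputs;
	int *outputs;
};

static void demo_deinit(struct demo_dplls *d)
{
	free(d->inputs);
	d->inputs = NULL;	/* a second deinit now sees NULL */
	free(d->outputs);
	d->outputs = NULL;	/* free(NULL) is a no-op, like kfree(NULL) */
}

/* fail_inputs simulates a failure of the first allocation. */
static int demo_init(struct demo_dplls *d, int fail_inputs)
{
	assert(d);

	d->inputs = fail_inputs ? NULL : calloc(4, sizeof(*d->inputs));
	if (!d->inputs)
		goto deinit;	/* d->outputs still holds its previous value */

	d->outputs = calloc(4, sizeof(*d->outputs));
	if (!d->outputs)
		goto deinit;

	return 0;

deinit:
	demo_deinit(d);
	return -ENOMEM;
}
```

If demo_deinit() did not reset the pointers, the sequence init, deinit,
failed init would pass the stale outputs pointer to free() a second time.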

This is a preparatory fix for a subsequent patch that routes additional
error paths in ice_dpll_init_info() to the deinit_info label.

Fixes: d7999f5ea64b ("ice: implement dpll interface to control cgu")
Signed-off-by: ZhaoJinming <zhaojinming@uniontech.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: call netif_keep_dst() once when entering switchdev mode

netif_keep_dst() only needs to be called once for the uplink VSI, not
once for each port representor. Move it from ice_eswitch_setup_repr()
to ice_eswitch_enable_switchdev().

Fixes: defd52455aee ("ice: do Tx through PF netdev in slow-path")
Signed-off-by: Marcin Szycik <marcin.szycik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: fix ice_init_link() error return preventing probe

ice_init_link() can return an error status from ice_update_link_info()
or ice_init_phy_user_cfg(), causing probe to fail.

An incorrect NVM update procedure can result in link/PHY errors, and
the recommended resolution is to update the NVM using the correct
procedure. If the driver fails probe due to link errors, the user
cannot update the NVM to recover. The link/PHY errors logged are
non-fatal: they are already annotated as 'not a fatal error if this
fails'.

Since none of the errors inside ice_init_link() should prevent probe
from completing, convert it to void and remove the error check in the
caller. All failures are already logged; callers have no meaningful
recovery path for link init errors.

Fixes: 5b246e533d01 ("ice: split probe into smaller functions")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: fix AQ error code comparison in ice_set_pauseparam()

Fix unreachable code: the conditionals in ice_set_pauseparam() used
the bitwise-AND operator suggesting aq_failures is a bitmap, but it
is actually an enum, making the third condition logically unreachable.

Replace the if-else ladder with a switch statement. Also move the
aq_failures initialization to the variable declaration and remove the
redundant zeroing from ice_set_fc().
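
Why the third condition cannot be reached is easiest to see with concrete
values. The sketch below assumes sequential enum values 0..3; the enum
demo_fail and both classify functions are hypothetical and only mirror the
shape of the problem, not the driver's actual definitions.

```c
#include <assert.h>

/* Hypothetical error codes: sequential enum values, not bit flags. */
enum demo_fail {
	DEMO_FAIL_NONE,		/* 0 */
	DEMO_FAIL_GET,		/* 1 */
	DEMO_FAIL_SET,		/* 2 */
	DEMO_FAIL_UPDATE,	/* 3 */
};

/* Buggy shape: bitwise AND treats the enum as a bitmap.
 * Returns the code whose branch (and log message) would be taken.
 */
static enum demo_fail classify_ladder(enum demo_fail f)
{
	if (f & DEMO_FAIL_GET)
		return DEMO_FAIL_GET;
	else if (f & DEMO_FAIL_SET)
		return DEMO_FAIL_SET;
	else if (f & DEMO_FAIL_UPDATE)	/* unreachable: any value with bit 0
					 * or bit 1 set matched above
					 */
		return DEMO_FAIL_UPDATE;
	return DEMO_FAIL_NONE;
}

/* Fixed shape: compare the enum by value. */
static enum demo_fail classify_switch(enum demo_fail f)
{
	switch (f) {
	case DEMO_FAIL_GET:
	case DEMO_FAIL_SET:
	case DEMO_FAIL_UPDATE:
		return f;
	default:
		assert(f == DEMO_FAIL_NONE);
		return DEMO_FAIL_NONE;
	}
}
```

Under these assumed values, an UPDATE failure (3) is reported as a GET
failure by the ladder because 3 & 1 is non-zero.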

Fixes: fcea6f3da546 ("ice: Add stats and ethtool support")
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

ice: fix FDIR CTRL VSI resource leak in ice_reset_all_vfs()

Resetting all VFs causes a resource leak on VFs with FDIR filters
enabled, as CTRL VSIs are only invalidated and not freed. Fix by using
ice_vf_ctrl_vsi_release() instead of ice_vf_ctrl_invalidate_vsi(), which
aligns the behavior with the ice_reset_vf() function.

Reproduction:
  echo 1 > /sys/class/net/$pf/device/sriov_numvfs
  ethtool -N $vf flow-type ether proto 0x9000 action 0
  echo 1 > /sys/class/net/$pf/device/reset

Fixes: da62c5ff9dcd ("ice: Add support for per VF ctrl VSI enabling")
Signed-off-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

Merge tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net. It batches
fixes for real crashes together with trivial/correctness fixes. There is
also a rework of the conntrack expectation timeout strategy to deal with
a possible race when removing an expectation.

1) Fix the incorrect flowtable timeout extension for entries in
   hw offload, from Adrian Bente. This is correcting a defect in
   the functionality, no crash.

2) Hold reference to device under the fake dst in br_netfilter,
   from Haoze Xie. This is fixing a possible UaF if the device
   is removed while packet is sitting in nfqueue.

3) Reject template conntrack in xt_cluster, otherwise access to
   uninitialized conntrack fields is possible, leading to WARN_ON
   due to unset layer 3 protocol. From Wyatt Feng.

4) Make sure the IPv6 tunnel header is in the linear skb data
   area before pulling. While at it, remove incomplete NEXTHDR_DEST
   support. From Lorenzo Bianconi. This possibly leads to a crash
   if the IPv4 header is not in the linear area.

5) Use test_bit_acquire in ipset hash set to avoid reordering
   of subsequent memory accesses. This is addressing an LLM-related
   report, no crash has been observed. From Jozsef Kadlecsik.

6) Use test_bit_acquire in ipset bitmap set too, for the same
   reason as in the previous patch, from Jozsef Kadlecsik.

7) Call kfree_rcu() after rcu_assign_pointer() to address a
   possible UaF if kfree_rcu() runs immediately, which to my
   understanding never happens. Never observed in practice,
   reported by LLM. Also from Jozsef Kadlecsik.

8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync()
   to prevent the ipset GC handler from re-queueing work, as reported
   by LLM. From Jozsef Kadlecsik. This is for correctness.

9) Restore the check in nft_payload for payload offsets exceeding
   2^16. From Florian Westphal. This fixes a silent truncation,
   not a big deal, but better be assertive and reject it.

10) Validate that NFT_META_BRI_IIFHWADDR can only run from bridge
    prerouting. From Florian Westphal. Harmless, but it could allow
    reading bytes from skb->cb.

11) Zero out destination hardware address during the flowtable
    path setup, also from Florian. This is a correctness fix; an LLM
    points out that an infoleak is possible, but the topology needed
    to achieve it is not clear.

12) Skip IPv4 options if present when building the IPV4 reject reply.
    Otherwise bytes in the IPv4 options header can be sent back to
    origin where the ICMP header is being expected. Again from
    Florian Westphal.

13) Replace timer API for expectation by GC worker approach. This
    is implicitly fixing a race between nf_ct_remove_expectations()
    which might fail to remove the expectation due to timer_del()
    returning false because timer has expired and callback is
    being run concurrently. This fix is addressing a crash that has
    been already reported with a reproducer.

14) Check if br_vlan_get_pvid_rcu() fails, otherwise a 4-byte stack
    infoleak is possible. From Florian Westphal.

* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
  netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
  netfilter: nf_reject: skip iphdr options when looking for icmp header
  netfilter: nft_flow_offload: zero device address for non-ether case
  netfilter: nft_meta_bridge: add validate callback for get operations
  netfilter: nft_payload: reject offsets exceeding 65535 bytes
  netfilter: ipset: make sure gc is properly stopped
  netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
  netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
  netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
  netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
  netfilter: xt_cluster: reject template conntracks in hash match
  netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
  netfilter: flowtable: fix offloaded ct timeout never being extended
====================

Link: https://patch.msgid.link/20260620222738.112506-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dpaa2-switch: do not accept VLAN uppers while bridged

The dpaa2-switch driver does not support VLAN uppers while its ports are
bridged. The driver tried to prevent this scenario by rejecting a bridge
join while VLAN uppers exist, but the reverse order was still possible.

This patch adds a check so that the dpaa2-switch also does not accept
VLAN uppers while bridged.

Fixes: f48298d3fbfa ("staging: dpaa2-switch: move the driver out of staging")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Link: https://patch.msgid.link/20260618092813.432535-2-ioana.ciornei@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: ioam: fix type confusion of dst_entry

IOAM uses a dummy dst_entry(null_dst) to mark that the destination should
not be changed after the transformation. This dst is stored in the IOAM lwt
state and may be passed to dst_cache_set_ip6().

However, the IPv6 dst cache path eventually calls rt6_get_cookie(), which
treats the dst_entry as part of a struct rt6_info. Since the null_dst was
embedded directly as a struct dst_entry in struct ioam6_lwt, this resulted
in an invalid cast and rt6_get_cookie() reading fields from the wrong
object.

In practice, the wrong cookie is not used while dst->obsolete is zero, but
rt6_get_cookie() may also access per-cpu value when rt->sernum is
zero. In this case, rt->sernum aliases ioam6_lwt::cache::reset_ts, which
can become zero, making this a potential invalid pointer access.

Fix this by embedding a full struct rt6_info for the dummy IPv6 route and
passing its dst member to the dst APIs.

Fixes: 47ce7c854563 ("net: ipv6: ioam6: fix double reallocation")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260618104336.48934-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ipv4-ipv6-account-for-fraggap-on-paged-allocation-paths'

Wongi Lee says:

====================
ipv4/ipv6: account for fraggap on paged allocation paths

Fix fraggap accounting in the paged-allocation paths of IPv4 and IPv6.

The IPv6 patch is the v4 update of the previously posted patch. The IPv4
patch handles the same code pattern (by Ido).

v3: https://lore.kernel.org/aiq3f7UZGFp0F3MV@DESKTOP-19IMU7U.localdomain
v2: https://lore.kernel.org/aigx83czv+UJZA0d@DESKTOP-19IMU7U.localdomain
v1: https://lore.kernel.org/aibiIYMAwUErTw5U@DESKTOP-19IMU7U.localdomain
====================

Link: https://patch.msgid.link/ajFQn6yh43eDeQm9@DESKTOP-19IMU7U.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: account for fraggap on the paged allocation path

In __ip6_append_data(), when the paged-allocation branch is taken
(MSG_MORE / NETIF_F_SG / large fraglen), alloclen and pagedlen are
computed as

alloclen = fragheaderlen + transhdrlen;
pagedlen = datalen - transhdrlen;

datalen already includes fraggap (datalen = length + fraggap). When
fraggap is non-zero, this is not the first skb and transhdrlen is zero.
The fraggap bytes carried over from the previous skb are copied just past
the fragment headers in the new skb's linear area. The linear area is
therefore undersized by fraggap bytes while pagedlen is overstated by the
same amount, and the copy writes past skb->end into the trailing
skb_shared_info.

An unprivileged user can trigger this via a UDPv6 socket using
MSG_MORE together with MSG_SPLICE_PAGES.

The bad accounting was introduced by commit 773ba4fe9104 ("ipv6:
avoid partial copy for zc"). Before commit ce650a166335 ("udp6: Fix
__ip6_append_data()'s handling of MSG_SPLICE_PAGES"), the negative
copy value caused -EINVAL to be returned. That later commit allowed
MSG_SPLICE_PAGES to proceed in this case, making the corruption
triggerable.

The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.
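
The arithmetic can be checked in isolation. The sketch below is a
userspace model of the two length computations only, assuming nothing
about the rest of __ip6_append_data(); struct demo_lens and the three
functions are hypothetical names, and any concrete byte counts fed to them
are illustrative.

```c
#include <assert.h>

struct demo_lens {
	int alloclen;	/* bytes reserved in the linear area */
	int pagedlen;	/* bytes expected to live in page frags */
};

/* Paged branch before the fix: fraggap is not accounted for. */
static struct demo_lens paged_before(int fragheaderlen, int transhdrlen,
				     int datalen, int fraggap)
{
	struct demo_lens l;

	(void)fraggap;
	l.alloclen = fragheaderlen + transhdrlen;
	l.pagedlen = datalen - transhdrlen;
	return l;
}

/* Paged branch after the fix: move fraggap from pagedlen to alloclen. */
static struct demo_lens paged_after(int fragheaderlen, int transhdrlen,
				    int datalen, int fraggap)
{
	struct demo_lens l;

	l.alloclen = fragheaderlen + transhdrlen + fraggap;
	l.pagedlen = datalen - transhdrlen - fraggap;
	return l;
}

/* Bytes written to the linear area: headers plus the carried-over fraggap. */
static int linear_bytes_written(int fragheaderlen, int transhdrlen, int fraggap)
{
	assert(fraggap >= 0);
	return fragheaderlen + transhdrlen + fraggap;
}
```

For example, with fragheaderlen = 48, transhdrlen = 0, fraggap = 8 and
datalen = 1008, the old computation reserves 48 linear bytes while 56 are
written; the new one reserves 56. The sum alloclen + pagedlen is the same
in both versions, so only the split between linear and paged data changes.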

After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.
Since a negative copy is no longer expected for a valid MSG_SPLICE_PAGES
case, remove the MSG_SPLICE_PAGES exception from the negative copy check.

Fixes: 773ba4fe9104 ("ipv6: avoid partial copy for zc")
Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com>
Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/ajFTqRljatR17fFy@DESKTOP-19IMU7U.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv4: account for fraggap on the paged allocation path

In __ip_append_data(), when the paged-allocation branch is taken,
alloclen and pagedlen are computed as

alloclen = fragheaderlen + transhdrlen;
pagedlen = datalen - transhdrlen;

datalen already includes fraggap, but the fraggap bytes carried over
from the previous skb are copied into the new skb's linear area at
offset transhdrlen by the subsequent skb_copy_and_csum_bits(). The
linear area is therefore undersized by fraggap bytes while pagedlen is
overstated by the same amount.

The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.

After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.

Fixes: 8eb77cc73977 ("ipv4: avoid partial copy for zc")
Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com>
Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/ajFR1eLAIs42TN3g@DESKTOP-19IMU7U.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Add DualPI2 GSO backlog accounting test

Add a regression test for DualPI2 GSO backlog accounting when it is
used as a child qdisc of QFQ.

The test sends one UDP GSO datagram through a QFQ class with DualPI2 as
the leaf qdisc. DualPI2 splits the skb into two segments. After the
traffic drains, both QFQ and DualPI2 must report zero backlog and zero
qlen.

On kernels with the broken accounting, QFQ can keep a stale non-zero
qlen after all real packets have been dequeued.

Signed-off-by: Xingquan Liu <b1n@b1n.io>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260619151447.223640-2-b1n@b1n.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: dualpi2: fix GSO backlog accounting

When DualPI2 splits a GSO skb into N segments, it propagates N
additional packets to its parent before returning NET_XMIT_SUCCESS.
The parent then accounts for the original skb once more, leaving its
qlen one larger than the number of packets actually queued.

With QFQ as the parent, after all real packets are dequeued, QFQ still
has a non-zero qlen while its in-service aggregate has no active
classes. qfq_choose_next_agg() returns NULL and qfq_dequeue() passes
the result to qfq_peek_skb(), causing a NULL pointer dereference.

Follow the same pattern used by tbf_segment() and taprio: count only
successfully queued segments, propagate the difference between the
original skb and those segments, and return NET_XMIT_SUCCESS whenever
at least one segment was queued.
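
The off-by-one can be illustrated with a toy model of the parent's qlen.
Everything below is a simplified sketch under stated assumptions:
struct demo_parent, demo_tree_reduce() (a stand-in for the kernel's
backlog propagation helper), the DEMO_XMIT_* codes and the enqueue
functions are hypothetical, and byte backlog is left out entirely.

```c
#include <assert.h>

#define DEMO_XMIT_SUCCESS	0
#define DEMO_XMIT_DROP		1

struct demo_parent {
	int qlen;
};

/* Lower the parent's qlen by n; a negative n raises it. */
static void demo_tree_reduce(struct demo_parent *p, int n)
{
	p->qlen -= n;
}

/* Before: all nsegs segments are propagated, then success is returned. */
static int child_enqueue_before(struct demo_parent *p, int nsegs)
{
	demo_tree_reduce(p, -nsegs);
	return DEMO_XMIT_SUCCESS;
}

/* After: propagate only the difference to the one skb the parent counts. */
static int child_enqueue_after(struct demo_parent *p, int nqueued)
{
	assert(nqueued >= 0);
	if (nqueued == 0)
		return DEMO_XMIT_DROP;
	demo_tree_reduce(p, 1 - nqueued);
	return DEMO_XMIT_SUCCESS;
}

/* The parent counts the original skb once when the child reports success. */
static void parent_account(struct demo_parent *p, int ret)
{
	if (ret == DEMO_XMIT_SUCCESS)
		p->qlen++;
}
```

For a GSO skb split into two segments, the first variant leaves the parent
at qlen 3, while the second leaves it at 2, matching the number of packets
actually queued.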

Fixes: 8f9516daedd6 ("sched: Add enqueue/dequeue of dualpi2 qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Xingquan Liu <b1n@b1n.io>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260619151447.223640-1-b1n@b1n.io
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: ndisc: fix NULL deref in accept_untracked_na()

accept_untracked_na() re-fetches the inet6_dev with __in6_dev_get(dev)
and dereferences idev->cnf.accept_untracked_na without a NULL check,
even though its only caller ndisc_recv_na() already fetched and
NULL-checked idev for the same device.

Both reads of dev->ip6_ptr run in the same RCU read-side critical
section, but a concurrent addrconf_ifdown() can clear dev->ip6_ptr
between them: lowering the MTU below IPV6_MIN_MTU calls addrconf_ifdown()
without the synchronize_net() that orders the unregister path, so the
re-fetch returns NULL and oopses:

BUG: KASAN: null-ptr-deref in ndisc_recv_na (net/ipv6/ndisc.c:974)
Read of size 4 at addr 0000000000000364
Call Trace:
  <IRQ>
  ndisc_recv_na (net/ipv6/ndisc.c:974)
  icmpv6_rcv (net/ipv6/icmp.c:1193)
  ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:479)
  ip6_input_finish (net/ipv6/ip6_input.c:534)
  ip6_input (net/ipv6/ip6_input.c:545)
  ip6_mc_input (net/ipv6/ip6_input.c:635)
  ipv6_rcv (net/ipv6/ip6_input.c:351)
  </IRQ>

It is reachable by an unprivileged user via a network namespace.

Pass the caller's already validated idev instead of re-fetching it; the
idev stays alive for the whole RCU critical section, so it is safe even
after dev->ip6_ptr has been cleared.

Fixes: aaa5f515b16b ("net: ipv6: new accept_untracked_na option to accept na only if in-network")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260617065512.2529757-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>