Linus Torvalds [Thu, 4 Jun 2026 21:35:55 +0000 (14:35 -0700)]
Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter, wireless and Bluetooth.
Current release - fix to a fix:
- Bluetooth: MGMT: fix backward compatibility with bluetoothd
which adds stray bytes to MGMT_OP_ADD_EXT_ADV_DATA
Previous releases - regressions:
- af_unix: fix inq_len update inaccuracy on partial read
- eth: fec: fix pinctrl default state restore order on resume
- wifi: iwlwifi:
- mvm: don't support the reset handshake for old firmwares
- pcie: simplify the resume flow if fast resume is not used,
work around NIC access failures
Previous releases - always broken:
- Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
- sctp: fix a couple of bugs in COOKIE_ECHO processing
- sched: fix pedit partial COW leading to page cache corruption
- wifi: nl80211: reject oversized EMA RNR lists
- netfilter:
- conntrack_irc: fix possible out-of-bounds read
- bridge: make ebt_snat ARP rewrite writable
- appletalk: zero-initialize aarp_entry to prevent heap info leak
- ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options
- mptcp: fix number of bugs reported by AI scans and discovered
during NVMe over MPTCP testing"
* tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
Reapply "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"
udp: clear skb->dev before running a sockmap verdict
sctp: purge outqueue on stale COOKIE-ECHO handling
bonding: annotate data-races arcound churn variables
net/802/mrp: fix vector attribute parsing in mrp_pdu_parse_vecattr
rtase: Avoid sleeping in get_stats64()
ieee802154: 6lowpan: only accept IPv6 packets in lowpan_xmit()
ipv6: mcast: Fix use-after-free when processing MLD queries
selftests: net: add vxlan vnifilter notification test
vxlan: vnifilter: fix spurious notification on VNI update
vxlan: vnifilter: send notification on VNI add
rtase: Reset TX subqueue when clearing TX ring
octeontx2-af: npc: Fix CPT channel mask in npc_install_flow
dt-bindings: ethernet: eswin: fix hsp-sp-csr backward compatibility
sctp: validate cached peer INIT chunk length in COOKIE_ECHO processing
net/sched: fix pedit partial COW leading to page cache corruption
vsock/vmci: fix sk_ack_backlog leak on failed handshake
net: bonding: fix NULL pointer dereference in bond_do_ioctl()
geneve: fix length used in GRO hint UDP checksum adjustment
net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown
...
Linus Torvalds [Thu, 4 Jun 2026 20:38:42 +0000 (13:38 -0700)]
Merge tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix CFI violation in probestub function
The probestub is a function to allow tprobes to hook to a tracepoint
to gain access to its parameters.
The function itself is only referenced by the tracepoint structure
which lives in the __tracepoint section. objtool explicitly ignores
that section and when processing functions in the kernel, if it
detects one that has no references it will seal it to have its ENDBR
stripped on boot up.
This means the probstub function will have its ENDBR stripped and if
a tprobe is attached to it with IBT enabled, it will go *boom*.
* tag 'trace-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix CFI violation in probestub being called by tprobes
Linus Torvalds [Thu, 4 Jun 2026 19:31:20 +0000 (12:31 -0700)]
Merge tag 's390-7.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Enable IOMMUFD and VFIO cdev such that PCI pass-through to
QEMU/KVM can optionally utilize native IOMMUFD
- With HAVE_ARCH_BUG_FORMAT enabled the BUG infrastructure might
misinterpret flags or fault. Fix this by moving the "format"
field emission into __BUG_ENTRY()
- The generic version of _THIS_IP_ is known to be brittle and may
break with current and future GCC and Clang optimizations. Fix
it by overriding _THIS_IP_
* tag 's390-7.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: Implement _THIS_IP_ using inline asm
s390/bug: Always emit format word in __BUG_ENTRY
s390/configs: Enable IOMMUFD and VFIO cdev in defconfigs
Breno reports a lockdep warning in bnxt. During FW reset the driver
may end up calling netif_set_real_num_tx_queues() (if queue count
changes), so calls to bnxt_open() still require rtnl_lock.
Sechang Lim [Wed, 3 Jun 2026 16:27:33 +0000 (16:27 +0000)]
udp: clear skb->dev before running a sockmap verdict
On the UDP receive path skb->dev is repurposed as dev_scratch (the
truesize/state cache set by udp_set_dev_scratch()), through the
union { struct net_device *dev; unsigned long dev_scratch; } in sk_buff.
When a UDP socket is in a sockmap, sk_data_ready is
sk_psock_verdict_data_ready(), which calls udp_read_skb() -> recv_actor()
(sk_psock_verdict_recv) to run the attached SK_SKB verdict program in softirq.
If that program calls a socket-lookup helper (bpf_sk_lookup_tcp/udp,
bpf_skc_lookup_tcp), bpf_skc_lookup() does:
if (skb->dev)
caller_net = dev_net(skb->dev);
skb->dev still holds the dev_scratch value (a non-NULL integer), so dev_net()
dereferences it as a struct net_device * and the kernel takes a general
protection fault on a non-canonical address in softirq:
The rmem charge that dev_scratch accounted for is released by skb_recv_udp() on
dequeue, just above, so the scratch is dead by the time recv_actor() runs. Clear
skb->dev so bpf_skc_lookup() falls back to sock_net(skb->sk), which
skb_set_owner_sk_safe() set just above.
Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Cc: stable@vger.kernel.org Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260603162737.697215-1-rhkrqnwk98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Wed, 3 Jun 2026 18:11:44 +0000 (14:11 -0400)]
sctp: purge outqueue on stale COOKIE-ECHO handling
sctp_stream_update() is only invoked when the association is moved into
COOKIE_WAIT during association setup/reconfiguration. In this path, the
outbound stream scheduler state (stream->out_curr) is expected to be
clean, since no user data should have been transmitted yet unless the
state machine has already partially progressed.
However, a corner case exists in sctp_sf_do_5_2_6_stale(): when a
Stale Cookie ERROR is received, the association is rolled back from
COOKIE_ECHOED to COOKIE_WAIT. In this scenario, user data may already
have been queued and even bundled with the COOKIE-ECHO chunk.
During the rollback, sctp_stream_update() frees the old stream table
and installs a new one, but it does not invalidate stream->out_curr.
As a result, out_curr may still point to a freed sctp_stream_out
entry from the previous stream state.
Later, SCTP scheduler dequeue paths (FCFS, RR, PRIO, etc.) rely on
stream->out_curr->ext, which can lead to use-after-free once the old
stream state has been released via sctp_stream_free().
This results in crashes such as (reported by Yuqi):
BUG: KASAN: slab-use-after-free in sctp_sched_fcfs_dequeue+0x13a/0x140
Read of size 8 at addr ff1100004d4d3208 by task mini_poc/9312
CPU: 1 UID: 1001 PID: 9312 Comm: mini_poc Not tainted 7.1.0-rc1-00305-gbd3a4795d574 #5 PREEMPT(full)
sctp_sched_fcfs_dequeue+0x13a/0x140
sctp_outq_flush+0x1603/0x33e0
sctp_do_sm+0x31c9/0x5d30
sctp_assoc_bh_rcv+0x392/0x6f0
sctp_inq_push+0x1db/0x270
sctp_rcv+0x138d/0x3c10
Fix this by fully purging the association outqueue when handling the
Stale Cookie case. This ensures all pending transmit and retransmit
state is dropped, and any scheduler cached pointers are invalidated,
making it safe to rebuild stream state during COOKIE_WAIT restart.
Updating only stream->out_curr would be insufficient, since queued
and retransmittable data would still reference the old stream state and
trigger later use-after-free in dequeue paths.
Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Yuqi Xu <xuyq21@lenovo.com> Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/94318159b9052907a6cbb7256aee8b5f8dfbfccb.1780510304.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These fields are updated asynchronously by the bonding state machine
in ad_churn_machine() while holding bond->mode_lock.
bond_info_show_slave() and bond_fill_slave_info() read them without
bond->mode_lock being held, we need to add READ_ONCE() and
WRITE_ONCE() annotations.
Note that AD_CHURN_MONITOR, AD_CHURN, and AD_NO_CHURN are defined
exclusively in (kernel private) include/net/bond_3ad.h header.
They should be moved to include/uapi/linux/if_bonding.h or userspace
tools will have to hardcode their values.
Fixes: 4916f2e2f3fc ("bonding: print churn state via netlink") Fixes: 14c9551a32eb ("bonding: Implement port churn-machine (AD standard 43.4.17).") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260603123514.388226-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yizhou Zhao [Wed, 3 Jun 2026 06:00:13 +0000 (14:00 +0800)]
net/802/mrp: fix vector attribute parsing in mrp_pdu_parse_vecattr
In mrp_pdu_parse_vecattr(), vector attribute events are encoded three
per byte and valen tracks the number of events left to process.
The parser decrements valen after processing the first and second events
from each event byte, but not after processing the third one. When valen
is exactly a multiple of three, the loop continues after the last valid
event and consumes the next byte as a new event byte, applying a
spurious event to the MRP applicant state.
Additionally, when valen is zero the parser unconditionally consumes
attrlen bytes as FirstValue and advances the offset, even though per
IEEE 802.1ak a VectorAttribute with only a LeaveAllEvent has valen of
zero and no FirstValue or Vector fields. This corrupts the offset for
subsequent PDU parsing.
Also, when valen exceeds three the loop crosses byte boundaries but
the attribute value is not incremented between the last event of one
byte and the first event of the next. This causes the first event of
the next byte to use the same attribute value as the third event
rather than the next consecutive value.
Decrement valen after processing the third event, skip FirstValue
consumption when valen is zero, and increment the attribute value at
the end of each loop iteration.
Fixes: febf018d2234 ("net/802: Implement Multiple Registration Protocol (MRP)") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Link: https://patch.msgid.link/20260603060016.21522-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Justin Lai [Wed, 3 Jun 2026 06:18:16 +0000 (14:18 +0800)]
rtase: Avoid sleeping in get_stats64()
The .ndo_get_stats64 callback must not sleep because it can be
called when reading /proc/net/dev.
rtase_get_stats64() calls rtase_dump_tally_counter(), which polls
the tally counter dump bit with read_poll_timeout(). This may
sleep while waiting for the hardware counter dump to complete.
Use read_poll_timeout_atomic() instead to avoid sleeping in the
get_stats64() path.
Eric Dumazet [Wed, 3 Jun 2026 07:29:55 +0000 (07:29 +0000)]
ieee802154: 6lowpan: only accept IPv6 packets in lowpan_xmit()
The aoe driver (or similar) generates a non-IPv6 packet
(e.g., ETH_P_AOE) and queues it for transmission via dev_queue_xmit()
on a 6LoWPAN interface (configured by the user or test case).
Since the packet is not IPv6, the 6LoWPAN header_ops->create function
(lowpan_header_create or header_create) returns early without initializing
the lowpan_addr_info structure in the skb headroom.
In the transmit function (lowpan_xmit), the driver calls lowpan_header
(or setup_header) which unconditionally copies and uses the lowpan_addr_info
from the headroom, which contains uninitialized data.
Fix this by dropping non IPv6 packets.
A similar fix is needed in net/bluetooth/6lowpan.c bt_xmit().
Ido Schimmel [Wed, 3 Jun 2026 10:18:11 +0000 (13:18 +0300)]
ipv6: mcast: Fix use-after-free when processing MLD queries
When processing an MLD query, a pointer to the multicast group address
is retrieved when initially parsing the packet. This pointer is later
dereferenced without being reloaded despite the fact that the skb header
might have been reallocated following the pskb_may_pull() calls, leading
to a use-after-free [1].
Fix by copying the multicast group address when the packet is initially
parsed.
[1]
BUG: KASAN: slab-use-after-free in __mld_query_work (net/ipv6/mcast.c:1512)
Read of size 8 at addr ffff8881154b8e90 by task kworker/4:1/118
When a vxlan device has vnifilter enabled, userspace observers
(e.g., bridge monitor vni) miss VNI add events and see spurious
notifications on no-op VNI re-adds.
Patch 1 fixes the missing notification on VNI add: vxlan_vni_add()
guarded the notification on a 'changed' flag that vxlan_vni_update_group()
only sets when a multicast group or remote is supplied, so VNIs added
without a group (e.g., L3 VXLAN) were silently created.
Patch 2 fixes the spurious notification on VNI update: vxlan_vni_update()
tested 'if (changed)' against a bool pointer instead of dereferencing it,
so every re-add produced a notification regardless of whether anything
actually changed.
Patch 3 adds a selftest covering both bugs along with a few related
cases (add with remote, remote update, delete-nonexistent).
====================
Andy Roulin [Tue, 2 Jun 2026 18:51:38 +0000 (11:51 -0700)]
selftests: net: add vxlan vnifilter notification test
Add a selftest for VXLAN vnifilter netlink notifications that verifies
RTM_NEWTUNNEL and RTM_DELTUNNEL are sent correctly when VNIs are added,
deleted, or updated, and that no spurious notifications are sent when
a VNI is re-added with the same attributes.
Andy Roulin [Tue, 2 Jun 2026 18:51:37 +0000 (11:51 -0700)]
vxlan: vnifilter: fix spurious notification on VNI update
When a VNI is re-added with the same attributes (e.g. same group or no
group), vxlan_vni_update() sends a spurious RTM_NEWTUNNEL notification
even though nothing changed.
The bug is that 'if (changed)' tests whether the pointer is non-NULL,
not the bool value it points to. Since every caller passes a valid
pointer, the condition is always true and the notification fires
unconditionally.
Fix by dereferencing the pointer: 'if (*changed)'.
Reproducer:
# ip link add vxlan100 type vxlan dstport 4789 local 10.0.0.1 \
nolearning external vnifilter
# ip link set vxlan100 up
# bridge monitor vni &
# bridge vni add vni 1000 dev vxlan100
# bridge vni add vni 1000 dev vxlan100 # spurious notification
Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Signed-off-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260602185138.253265-3-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andy Roulin [Tue, 2 Jun 2026 18:51:36 +0000 (11:51 -0700)]
vxlan: vnifilter: send notification on VNI add
When a new VNI is added to a vxlan device with vnifilter enabled,
no RTM_NEWTUNNEL notification is sent to userspace. This means
'bridge monitor vni' never shows VNI add events, even though
VNI delete events are reported correctly.
The bug is in vxlan_vni_add(), where the notification is guarded by
'if (changed)'. The 'changed' flag is set by vxlan_vni_update_group()
only when the multicast group or remote IP is modified, but for a
new VNI added without a group (e.g. in L3 VxLAN interface scenarios),
the function returns early without setting changed=true. Since this
is a new VNI, the notification should be sent unconditionally.
The notification is not guarded by the return value of
vxlan_vni_update_group() because, at this point, the VNI has already
been inserted into the hash table and list with no rollback on error.
The VNI will be visible in 'bridge vni show' regardless, so userspace
should be informed. This is consistent with vxlan_vni_del() which also
notifies unconditionally.
The 'if (changed)' guard remains correct in vxlan_vni_update(), which
handles the case where a VNI already exists and is being re-added --
there, we only want to notify if the group/remote actually changed.
Reproducer:
# ip link add vxlan100 type vxlan dstport 4789 local 10.0.0.1 \
nolearning external vnifilter
# ip link set vxlan100 up
# bridge monitor vni &
# bridge vni add vni 1000 dev vxlan100 # no notification
# bridge vni delete vni 1000 dev vxlan100 # notification received
Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Reported-by: Chirag Shah <chirag@nvidia.com> Signed-off-by: Andy Roulin <aroulin@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20260602185138.253265-2-aroulin@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
octeontx2-af: npc: Fix CPT channel mask in npc_install_flow
Use the CPT-aware NIX channel mask in the npc_install_flow path so that
when the host PF installs steering rules in kernel for a VF used from
userspace (e.g. DPDK), MCAM entries see the same channel mask semantics as
other RX paths.
Fixes: 56bcef528bd8 ("octeontx2-af: Use npc_install_flow API for promisc and broadcast entries") Cc: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/20260602045853.1558530-1-rkannoth@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Tue, 2 Jun 2026 01:06:06 +0000 (21:06 -0400)]
sctp: validate cached peer INIT chunk length in COOKIE_ECHO processing
When a listening SCTP server processes a COOKIE_ECHO chunk, the cached
peer INIT chunk embedded after the cookie is parsed and its parameters
are later walked by sctp_process_init() using sctp_walk_params().
However, the chunk header length of this cached INIT chunk was not
validated against the remaining buffer in the COOKIE_ECHO payload. If
the length field is inflated, the parameter walk can run beyond the
actual received data, leading to out-of-bounds reads and potential
memory corruption during later parameter handling (e.g. STATE_COOKIE
processing and kmemdup() copies).
Add a bounds check in sctp_unpack_cookie() to ensure the cached INIT
chunk length does not exceed the available data in the COOKIE_ECHO
buffer before it is used.
Rajat Gupta [Sun, 31 May 2026 12:32:21 +0000 (08:32 -0400)]
net/sched: fix pedit partial COW leading to page cache corruption
tcf_pedit_act() computes the COW range for skb_ensure_writable()
once before the key loop using tcfp_off_max_hint, but the hint does
not account for the runtime header offset added by typed keys. This
can leave part of the write region un-COW'd.
Fix by moving skb_ensure_writable() inside the per-key loop where
the actual write offset is known, and add overflow checking on the
offset arithmetic. For negative offsets (e.g. Ethernet header edits
at ingress), use skb_cow() to COW the headroom instead. Guard
offset_valid() against INT_MIN, where negation is undefined.
Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Reported-by: Yiming Qian <yimingqian591@gmail.com> Reported-by: Keenan Dong <keenanat2000@gmail.com> Reported-by: Han Guidong <2045gemini@gmail.com> Reported-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Han Guidong <2045gemini@gmail.com> Tested-by: Han Guidong <2045gemini@gmail.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/20260531123221.48732-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raf Dickson [Tue, 26 May 2026 10:43:56 +0000 (10:43 +0000)]
vsock/vmci: fix sk_ack_backlog leak on failed handshake
When vmci_transport_recv_connecting_server() returns an error,
vmci_transport_recv_listen() calls vsock_remove_pending() but never
calls sk_acceptq_removed(). This leaves sk_ack_backlog incremented
permanently.
Repeated handshake failures (malformed packets, queue pair alloc
failure, event subscribe failure) cause sk_ack_backlog to climb
toward sk_max_ack_backlog. Once it reaches the limit the listener
permanently refuses all new connections with -ECONNREFUSED, a
silent denial of service requiring a process restart to recover.
The two existing sk_acceptq_removed() calls in af_vsock.c do not
cover this path: line 764 checks vsock_is_pending() which returns
false after vsock_remove_pending(), and line 1889 is only reached
on successful accept().
Fix by balancing sk_acceptq_added() with sk_acceptq_removed() on
the error path.
ZhaoJinming [Mon, 1 Jun 2026 08:56:49 +0000 (16:56 +0800)]
net: bonding: fix NULL pointer dereference in bond_do_ioctl()
In bond_do_ioctl(), slave_dev is obtained via __dev_get_by_name() which
can return NULL if the requested interface name does not exist. However,
the subsequent slave_dbg() call is placed before the NULL check:
The slave_dbg() macro expands to netdev_dbg(bond_dev, "(slave %s): " fmt,
(slave_dev)->name, ...) which unconditionally dereferences slave_dev->name
before the NULL check is performed. This results in a NULL pointer
dereference kernel oops when a user calls bonding ioctl (e.g.
SIOCBONDENSLAVE, SIOCBONDRELEASE, etc.) with a non-existent slave
interface name.
This is reachable from userspace via the bonding ioctl interface with
CAP_NET_ADMIN capability, making it a potential local denial-of-service
vector.
Fix by moving the slave_dbg() call after the NULL check.
Eva Kurchatova [Wed, 3 Jun 2026 15:31:42 +0000 (18:31 +0300)]
tracing: Fix CFI violation in probestub being called by tprobes
The probestub is a function to allow tprobes to hook to a tracepoint to
gain access to its parameters. The function itself is only referenced by
the tracepoint structure which lives in the __tracepoint section. objtool
explicitly ignores that section and when processing functions in the
kernel, if it detects one that has no references it will seal it to have
its ENDBR stripped on boot up.
This means when a tprobe is attached to the sched_wakeup tracepoint, when it
is triggered it will call __probestub_sched_wakeup and due to the missing
ENDBR on a CFI-enabled machine it will take a #CP exception.
Fix this by adding CFI_NOSEAL annotation to probestub declaration.
Cc: stable@vger.kernel.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260603153147.573589-1-eva.kurchatova@virtuozzo.com Fixes: d5173f753750 ("objtool: Exclude __tracepoints data from ENDBR checks") Signed-off-by: Eva Kurchatova <eva.kurchatova@virtuozzo.com>
[ Updated change log ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Antoine Tenart [Fri, 29 May 2026 14:47:00 +0000 (16:47 +0200)]
geneve: fix length used in GRO hint UDP checksum adjustment
In geneve_post_decap_hint the length used for adjusting the UDP checksum
should be 'skb->len - gro_hint->nested_tp_offset' (UDP length) instead
of 'skb->len - gro_hint->nested_nh_offset' (IP length).
Fixes: fd0dd796576e ("geneve: use GRO hint option in the RX path") Cc: Paolo Abeni <pabeni@redhat.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://sashiko.dev/#/patchset/20260521131436.748832-1-jhs%40mojatatu.com Signed-off-by: Antoine Tenart <atenart@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260529144713.780938-1-atenart@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Fix use-after-free in metadata dst teardown in airoha_eth and mtk_eth_soc drivers
airoha_metadata_dst_free() and mtk_free_dev() call metadata_dst_free()
which frees the metadata_dst with kfree() immediately, bypassing the RCU
grace period.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path and runs call_rcu_hurry() if refcount goes to
zero.
====================
net: ethernet: mtk_eth_soc: Fix use-after-free in metadata dst teardown
mtk_free_dev() calls metadata_dst_free() which frees the metadata_dst
with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, a use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.
net: airoha: Fix use-after-free in metadata dst teardown
airoha_metadata_dst_free() runs metadata_dst_free() which frees the
metadata_dst with kfree() immediately, bypassing the RCU grace period.
In the RX path, skb_dst_set_noref() sets a non-refcounted pointer from
the skb to the metadata_dst. This function requires RCU read-side
protection and the dst must remain valid until all RCU readers complete.
Since metadata_dst_free() calls kfree() directly, an use-after-free can
occur if any skb still holds a noref pointer to the dst when the driver
tears it down.
Replace metadata_dst_free() with dst_release() which properly goes
through the refcount path: when the refcount drops to zero, it schedules
the actual free via call_rcu_hurry(), ensuring all RCU readers have
completed before the memory is freed.
Jakub Kicinski [Thu, 4 Jun 2026 02:07:46 +0000 (19:07 -0700)]
Merge tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_core: fix memory leak in error path of hci_alloc_dev()
- hci_sync: reject oversized Broadcast Announcement prepend
- MGMT: Fix backward compatibility with userspace
- MGMT: validate advertising TLV before type checks
- L2CAP: reject BR/EDR signaling packets over MTUsig
- RFCOMM: validate skb length in MCC handlers
- RFCOMM: hold listener socket in rfcomm_connect_ind()
- ISO: Fix not releasing hdev reference on iso_conn_big_sync
- ISO: Fix a use-after-free of the hci_conn pointer
- ISO: Fix data-race on iso_pi fields in hci_get_route calls
- SCO: Fix data-race on sco_pi fields in sco_connect
- BNEP: reject short frames before parsing
* tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: MGMT: Fix backward compatibility with userspace
Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect
Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls
Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer
Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync
Bluetooth: fix memory leak in error path of hci_alloc_dev()
Bluetooth: bnep: reject short frames before parsing
Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend
Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
Bluetooth: RFCOMM: validate skb length in MCC handlers
Bluetooth: MGMT: validate advertising TLV before type checks
Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind()
====================
Jakub Kicinski [Thu, 4 Jun 2026 02:07:34 +0000 (19:07 -0700)]
Merge tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Things are finally quieting down:
- iwlwifi:
- FW reset handshake removal for older devices
- NIC access fix in fast resume
- avoid too large command for some BIOSes
- fix TX power constraints in AP mode
- cfg80211:
- fix netlink parse overflow
- fix potential 6 GHz scan memory leak
- enforce HE/EHT consistency to avoid mac80211 crash
- mac80211: guard radiotap antenna parsing
* tag 'wireless-2026-06-03' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: enforce HE/EHT cap/oper consistency
wifi: fix leak if split 6 GHz scanning fails
wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
wifi: nl80211: reject oversized EMA RNR lists
wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used
wifi: iwlwifi: mvm: avoid oversized UATS command copy
wifi: iwlwifi: mld: send tx power constraints before link activation
wifi: iwlwifi: mvm: don't support the reset handshake for old firmwares
====================
Jakub Kicinski [Thu, 4 Jun 2026 02:04:46 +0000 (19:04 -0700)]
Merge branch 'mptcp-misc-fixes-for-v7-1-rc7'
Matthieu Baerts says:
====================
mptcp: misc fixes for v7.1-rc7
Here are various unrelated fixes:
- Patch 1: fix missing wakeups when multiple threads are reading from
the same fd. A fix for v5.7.
- Patch 2: fix retransmission loop when MPTCP checksum is enabled. A fix
for v5.14.
- Patch 3: fix a TOCTOU race while computing rcv_wnd. A fix for v5.11.
- Patch 4: allow subflows receive window to shrink if needed. A fix for
v5.19.
- Patches 5-6: avoid 'extra_subflows' to underflow with the userspace
PM. A fix for v5.19.
- Patch 7: report errors if one subflow cannot set SO_TIMESTAMPING. A
fix for v5.14.
- Patch 8: try to set TCP_MAXSEG on all subflows, before reporting
errors, if any. A fix for v6.17.
- Patch 9: check desc->count in read_sock, to act as expected. A fix
for v7.0.
- Patch 10: fix an uninit value in mptcp_established_options, reported
by syzbot. A fix for v7.1-rc1.
- Patch 11: fix a similar issue than the previous patch, exposed by the
same modification from v7.1-rc1, but was already causing issues since
v5.15.
====================
When an ADD_ADDR needs to be sent, it could be prepared if there is
enough remaining space and even if the packet is not a pure ACK. But it
would be dropped soon after.
Indeed, in mptcp_pm_add_addr_signal(), there is enough space to fit a
DSS of 20 octets and an ADD_ADDR echo containing an IPv4 address on 8
octets for example. In this case, the packet would be prepared, the
MPTCP_ADD_ADDR_ECHO bit would be removed from pm->addr_signal, but the
option would be silently dropped in mptcp_established_options_add_addr()
not to override DSS info in the union from 'struct mptcp_out_options',
and also because mptcp_write_options() will enforce mutually exclusion
with DSS.
Instead, don't even try to send an ADD_ADDR if it is not a pure ACK.
Retry for each new packet until a pure-ACK is emitted. That's fine to do
that, because each time an ADD_ADDR (echo) is scheduled, a pure ACK is
queued.
This also simplifies the code, and the skb checks can be done earlier,
before the lock.
Note: also, since commit 6d0060f600ad ("mptcp: Write MPTCP DSS headers
to outgoing data packets"), opts->ahmac would not have been set to 0
when other suboptions were not dropped, and when sending an ADD_ADDR
echo. That would have resulted in sending an ADD_ADDR using garbage
info, where there was not enough space, instead of an echo one without
the ADD_ADDR HMAC.
Local variable opts created at:
__tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536
__tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499
The output path currently omits initializing the mptcp extension
`use_map` flag in a few corner cases.
Address the issue always zeroing all the extensions flags before
eventually initializing the individual bits. To that extent, introduce
and use a struct_group to avoid multiple bitwise operations.
Gang Yan [Tue, 2 Jun 2026 12:14:16 +0000 (22:14 +1000)]
mptcp: check desc->count in read_sock
__tcp_read_sock() checks desc->count after each skb is consumed and
breaks the loop when it reaches 0. The MPTCP variant lacks this check.
This is a functional bug, other subsystems also rely on this check:
TLS strparser sets desc->count to 0 once a full TLS record is assembled
and depends on this break to stop reading.
Add the same desc->count check to __mptcp_read_sock(), mirroring
__tcp_read_sock().
The mptcp_setsockopt_all_sf(), currently used only with TCP_MAXSEG,
stopped when one subflow returned an error.
Even if it is not wrong, this is different from the other helpers trying
to set the option on all subflows, and then returning an error if at
least one of them had an issue.
Follow this behaviour, for a question of uniformity.
sock_set_timestamping() can fail for different reasons. The returned
value should then be checked.
If sock_set_timestamping() fails for at least one subflow, the first
error is now reported to the userspace, similar to what is done with
other socket options.
Fixes: 9061f24bf82e ("mptcp: sockopt: propagate timestamp request to subflows") Cc: stable@vger.kernel.org Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Closes: https://lore.kernel.org/willemdebruijn.kernel.178a41a53d041@gmail.com Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-7-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tao Cui [Tue, 2 Jun 2026 12:14:13 +0000 (22:14 +1000)]
selftests: mptcp: add test for extra_subflows underflow on userspace PM
Add a test to verify that when userspace PM fails to create a subflow
(e.g. using an unreachable address), the extra_subflows counter is not
decremented below zero.
Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos") Cc: stable@vger.kernel.org Signed-off-by: Tao Cui <cuitao@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-6-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tao Cui [Tue, 2 Jun 2026 12:14:12 +0000 (22:14 +1000)]
mptcp: pm: fix extra_subflows underflow on userspace PM subflow creation
The userspace PM increments extra_subflows after __mptcp_subflow_connect()
succeeds, but __mptcp_subflow_connect() calls mptcp_pm_close_subflow()
on failure to roll back the pre-increment done by the kernel PM's fill_*()
helpers. Because the userspace PM hasn't incremented yet at that point,
this decrement is spurious and causes extra_subflows to underflow.
Fix it by aligning the userspace PM with the kernel PM: increment
extra_subflows before calling __mptcp_subflow_connect(), so the existing
error path in subflow.c correctly rolls it back on failure. Also simplify
the error handling by taking pm.lock only when needed for cleanup.
Fixes: 77e4b94a3de6 ("mptcp: update userspace pm infos") Cc: stable@vger.kernel.org Signed-off-by: Tao Cui <cuitao@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-5-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:11 +0000 (22:14 +1000)]
mptcp: allow subflow rcv wnd to shrink
In MPTCP connection, the `window` field in the TCP header refers to the
MPTCP-level rcv_nxt and it's right edge should not move backward. Such
constraint is enforced at DSS option generation time.
At the same time, the TCP stack ensures independently that the TCP-level
rcv wnd right's edge does not move backward. That in turn causes artificial
inflating of the MPTCP rcv window when the incoming data is acked at the
TCP level and is OoO in the MPTCP sequence space (or lands in the backlog).
As a consequence, the incoming traffic can exceed the receiver rcvbuf size
even when the sender is not misbehaving.
Prevent such scenario forcibly allowing the TCP subflow to shrink the
TCP-level rcv wnd regardless of the current netns setting.
Fixes: f3589be0c420 ("mptcp: never shrink offered window") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-4-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:10 +0000 (22:14 +1000)]
mptcp: close TOCTOU race while computing rcv_wnd
The MPTCP output path access locklessly the MPTCP-level ack_seq
in multiple times, using possibly different values for the data_ack
in the DSS option and to compute the announced rcv wnd for the same
packet.
Refactor the cote to avoid inconsistencies which may confuse the
peer. Also ensure that the MPTCP level rcv wnd is updated only when
the egress packet actually contains a DSS ack.
Fixes: fa3fe2b15031 ("mptcp: track window announced to peer") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-3-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 2 Jun 2026 12:14:09 +0000 (22:14 +1000)]
mptcp: fix retransmission loop when csum is enabled
Sashiko noted that retransmission with csum enabled can actually
transmit new data, but currently the relevant code does not update
accordingly snd_nxt.
The may cause incoming ack drop and an endless retransmission loop.
Paolo Abeni [Tue, 2 Jun 2026 12:14:08 +0000 (22:14 +1000)]
mptcp: fix missing wakeups in edge scenarios
The mptcp_recvmsg() can fill MPTCP socket receive queue via
mptcp_move_skbs(), but currently does not try to wakeup any listener,
because the same process is going to check the receive queue soon.
When multiple threads are reading from the same fd, the above can
cause stall. Add the missing wakeup.
Fixes: 6771bfd9ee24 ("mptcp: update mptcp ack sequence from work queue") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-1-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kurt Kanzenbach [Fri, 29 May 2026 17:11:47 +0000 (19:11 +0200)]
ptp: vclock: Switch from RCU to SRCU
The usage of PTP vClocks leads immediately to the following issues with
ptp4l with LOCKDEP and DEBUG_ATOMIC_SLEEP enabled: "BUG: sleeping function
called from invalid context".
ptp_convert_timestamp() acquires a mutex_t within a RCU read section. This
is illegal, because acquiring a mutex_t can result in voluntary scheduling
request which is not allowed within a RCU read section.
Replace the RCU usage with SRCU where sleeping is allowed.
Reported-by: Florian Zeitz <florian.zeitz@schettke.com> Closes: https://lore.kernel.org/all/00a8cce8-410e-4038-98af-49be6d93d7bd@schettke.com/ Fixes: 67d93ffc0f3c ("ptp: vclock: use mutex to fix "sleep on atomic" bug") Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260529-vclock_rcu-v2-1-02a5531fab92@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 2 Jun 2026 16:15:47 +0000 (16:15 +0000)]
ipv4: restrict IPOPT_SSRR and IPOPT_LSRR options
This patch restricts setting Loose Source and Record Route (LSRR)
and Strict Source and Record Route (SSRR) IP options to users
with CAP_NET_RAW capability.
This prevents unprivileged applications from forcing packets to route
through attacker-controlled nodes to leak TCP ISN and possibly other
protocol information.
While LSRR and SSRR are commonly filtered in many network environments,
they may still be supported and forwarded along some network paths.
RFC 7126 (Recommendations on Filtering of IPv4 Packets Containing
IPv4 Options) recommend to drop these options in 4.3 and 4.4.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Tamir Shahar <tamirthesis@gmail.com> Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260602161547.2642155-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianyu Li [Mon, 1 Jun 2026 11:36:40 +0000 (19:36 +0800)]
af_unix: Add test for SCM_INQ on partial read
Add test to verify that when a skb is partially consumed,
unix_inq_len() return correct remaining byte count.
Before:
# RUN scm_inq.stream.partial_read ...
# scm_inq.c:165:partial_read:Expected remain (512) == *(int *)CMSG_DATA(cmsg) (768)
# partial_read: Test terminated by assertion
# FAIL scm_inq.stream.partial_read
not ok 2 scm_inq.stream.partial_read
After:
# RUN scm_inq.stream.partial_read ...
# OK scm_inq.stream.partial_read
ok 2 scm_inq.stream.partial_read
Jianyu Li [Mon, 1 Jun 2026 11:36:39 +0000 (19:36 +0800)]
af_unix: Fix inq_len update problem in partial read
Currently inq_len is updated only when the whole skb is consumed.
If only part of the data is read, following SIOCINQ query would
get value greater than what actually left.
This change update inq_len timely in unix_stream_read_generic(),
and adjust unix_stream_read_skb() accordingly to prevent
repetitive update.
Fixes: f4e1fb04c123 ("af_unix: Use cached value for SOCK_STREAM in unix_inq_len().") Signed-off-by: Jianyu Li <jianyu.li@mediatek.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260601113640.231897-2-jianyu.li@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Suman Ghosh [Fri, 29 May 2026 11:37:05 +0000 (17:07 +0530)]
octeontx2-af: Fix initialization of mcam's entry2target_pffunc field
NPC mcam entry stores a mapping between mcam entry and target pcifunc.
During initialization of this field, API kmalloc_array has been used which
caused some junk values to array. Whereas, the array is expected to be
initialized by 0. This patch fixes the same by using kcalloc instead of
kmalloc_array.
Fixes: 55307fcb9258 ("octeontx2-af: Add mbox messages to install and delete MCAM rules") Signed-off-by: Suman Ghosh <sumang@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1780054625-17090-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geetha sowjanya [Fri, 29 May 2026 11:37:57 +0000 (17:07 +0530)]
octeontx2-pf: Fix NDC sync operation errors
On system reboot "rvu_nicpf 0002:03:00.0: NDC sync operation failed"
error messages are shown, even if the operations is successful.
This is due to wrong if error check in ndc_syc() function.
Fixes: 42c45ac1419c ("octeontx2-af: Sync NIX and NPA contexts from NDC to LLC/DRAM") Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1780054677-17249-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yizhou Zhao [Fri, 29 May 2026 10:50:16 +0000 (18:50 +0800)]
appletalk: aarp: zero-initialize aarp_entry to prevent heap info leak
aarp_alloc() allocates struct aarp_entry without zeroing it, but only
initializes refcnt and packet_queue. When an unresolved AARP entry is
created, hwaddr[ETH_ALEN] is left uninitialized.
aarp_seq_show() later prints this field with %pM when users read
/proc/net/atalk/arp. This can expose 6 bytes of stale heap data for
each unresolved entry.
Fix this by zero-initializing struct aarp_entry at allocation time.
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260529105017.81531-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Jelonek [Thu, 28 May 2026 20:52:40 +0000 (20:52 +0000)]
net: sfp: initialize i2c_block_size at adapter configure time
sfp->i2c_block_size is only assigned in sfp_sm_mod_probe(), which runs
from the state machine timer after SFP_F_PRESENT has been set. Between
those two points, sfp_module_eeprom() (the ethtool -m callback) gates
only on SFP_F_PRESENT and can be entered with i2c_block_size still at
its kzalloc'd value of 0.
On a pure-I2C adapter, sfp_i2c_read() then issues an i2c_transfer()
with msgs[1].len = 0 inside a loop that subtracts this_len from len
each iteration; on adapters that succeed a zero-length read the loop
never advances, spinning while holding rtnl_lock.
This was previously addressed by initializing i2c_block_size in
sfp_alloc() (commit 813c2dd78618), but the initialization was dropped
when i2c_block_size was split from i2c_max_block_size.
Initialize sfp->i2c_block_size from sfp->i2c_max_block_size in
sfp_i2c_configure(), so the field is valid as soon as the adapter is
known. sfp_sm_mod_probe() still reassigns it on each module insertion
to recover from a per-module clamp to 1 (sfp_id_needs_byte_io).
Jason Xing [Sat, 30 May 2026 04:26:30 +0000 (12:26 +0800)]
xsk: cache csum_start/csum_offset to fix TOCTOU in xsk_skb_metadata()
The TX metadata area resides in the UMEM buffer which is memory-mapped
and concurrently writable by userspace. In xsk_skb_metadata(),
csum_start and csum_offset are read from shared memory for bounds
validation, then read again for skb assignment. A malicious userspace
application can race to overwrite these values between the two reads,
bypassing the bounds check and causing out-of-bounds memory access
during checksum computation in the transmit path.
Fix this by reading csum_start and csum_offset into local variables
once, then using the local copies for both validation and assignment.
Note that other metadata fields (flags, launch_time) and the cached
csum fields may be mutually inconsistent due to concurrent userspace
writes, but this is benign: the only security-critical invariant is
that each field's validated value is the same one used, which local
caching guarantees.
Closes: https://lore.kernel.org/all/20260503200927.73EA1C2BCB4@smtp.kernel.org/ Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support") Link: https://patch.msgid.link/20260530042630.80626-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 3 Jun 2026 16:09:24 +0000 (09:09 -0700)]
Merge tag 'mmc-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Fix host controller programming for eMMC fixed driver type
MMC host:
- dw_mmc-rockchip: Add missing private data for very old controllers
- litex_mmc: Fix clock management
- renesas_sdhi: Add OF entry for RZ/G2H SoC
- sdhci: Manage signal voltage switch during system resume for some hosts
- sdhci-of-dwcmshc: Fix reset, clk and SDIO support for Eswin EIC7700"
* tag 'mmc-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci: add signal voltage switch in sdhci_resume_host
mmc: dw_mmc-rockchip: Add missing private data for very old controllers
mmc: litex_mmc: Set mandatory idle clocks before CMD0
mmc: litex_mmc: Use DIV_ROUND_UP for more accurate clock calculation
mmc: renesas_sdhi: Add OF entry for RZ/G2H SoC
mmc: sdhci-of-dwcmshc: Fix reset, clk, and SDIO support for Eswin EIC7700
mmc: core: Fix host controller programming for fixed driver type
Linus Torvalds [Wed, 3 Jun 2026 15:59:24 +0000 (08:59 -0700)]
Merge tag 'cgroup-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"One cpuset fix and a maintenance update, both low-risk:
- Fix cpuset partition CPU accounting under sibling CPU exclusion
that could produce wrong CPU assignments and trigger
scheduling-domain warnings. Includes selftests.
- Update an email address in MAINTAINERS"
* tag 'cgroup-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Change Ridong's email
cgroup/cpuset: Add test cases for sibling CPU exclusion on partition update
cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation
Linus Torvalds [Wed, 3 Jun 2026 15:52:26 +0000 (08:52 -0700)]
Merge tag 'sched_ext-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"Two low-risk fixes:
- Drop a spurious warning that can fire during cgroup migration while
a sched_ext scheduler is loaded
- Fix a drgn-based debug script that broke after scheduler state
moved into a per-scheduler struct"
* tag 'sched_ext-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task()
tools/sched_ext: Fix scx_show_state per-scheduler state reads
Bluetooth: MGMT: Fix backward compatibility with userspace
bluetoothd has a bug with makes it send extra bytes as part of
MGMT_OP_ADD_EXT_ADV_DATA which are now being checked to be the
exact the expected length, relax this so only when the expected
length is greater than the data length to cause an error since
that would result in accessing invalid memory, otherwise just
ignore the extra bytes.
SeungJu Cheon [Mon, 1 Jun 2026 11:19:08 +0000 (20:19 +0900)]
Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect
sco_sock_connect() copies the destination address into sco_pi(sk)->dst
under lock_sock(), then releases the lock and calls sco_connect(),
which reads dst, src, setting, and codec without holding lock_sock() in
hci_get_route() and hci_connect_sco().
These fields may be modified concurrently by connect(), bind(), or
setsockopt() on the same socket, resulting in data-races reported by
KCSAN.
Fix this by snapshotting dst, src, setting, and codec under lock_sock()
at the start of sco_connect() before passing them to hci_get_route()
and hci_connect_sco().
BUG: KCSAN: data-race in memcmp+0x45/0xb0
race at unknown origin, with read to 0xffff88800e6b0dd0 of 1 bytes
by task 315 on cpu 0:
memcmp+0x45/0xb0
hci_connect_acl+0x1b7/0x6b0
hci_connect_sco+0x4d/0xb30
sco_sock_connect+0x27b/0xd60
__sys_connect_file+0xbd/0xe0
__sys_connect+0xe0/0x110
__x64_sys_connect+0x40/0x50
x64_sys_call+0xcad/0x1c60
do_syscall_64+0x133/0x590
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 9a8ec9e8ebb5 ("Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm") Signed-off-by: SeungJu Cheon <suunj1331@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
SeungJu Cheon [Mon, 1 Jun 2026 11:19:07 +0000 (20:19 +0900)]
Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls
iso_connect_bis(), iso_connect_cis(), iso_listen_bis(), and
iso_conn_big_sync() call hci_get_route() using iso_pi(sk)->dst,
iso_pi(sk)->src, and iso_pi(sk)->src_type without holding lock_sock().
These fields may be modified concurrently by connect() or setsockopt()
on the same socket, resulting in data-races reported by KCSAN.
Fix this by snapshotting the required fields under lock_sock() before
calling hci_get_route().
BUG: KCSAN: data-race in memcmp+0x45/0xb0
race at unknown origin, with read to 0xffff8880122135cf of 1 bytes
by task 333 on cpu 1:
memcmp+0x45/0xb0
hci_get_route+0x27e/0x490
iso_connect_cis+0x4c/0xa10
iso_sock_connect+0x60e/0xb30
__sys_connect_file+0xbd/0xe0
__sys_connect+0xe0/0x110
__x64_sys_connect+0x40/0x50
x64_sys_call+0xcad/0x1c60
do_syscall_64+0x133/0x590
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 241f51931c35 ("Bluetooth: ISO: Avoid circular locking dependency") Signed-off-by: SeungJu Cheon <suunj1331@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer
In iso_sock_rebind_bc(), the bis pointer is cached, then the socket lock is
dropped:
bis = iso_pi(sk)->conn->hcon;
/* Release the socket before lookups since that requires hci_dev_lock
* which shall not be acquired while holding sock_lock for proper
* ordering.
*/
release_sock(sk);
hci_dev_lock(bis->hdev);
During the unlocked window, could a concurrent close() destroy the connection
and free the bis structure, causing hci_dev_lock(bis->hdev) to access memory
after it is freed, fix this by using the hdev reference which was safely
acquired via iso_conn_get_hdev().
Fixes: d3413703d5f8 ("Bluetooth: ISO: Add support to bind to trigger PAST") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync
hci_get_route() returns a reference-counted hci_dev pointer via
hci_dev_hold(). The function exits normally or with an error without ever
releasing it.
Fixes: 07a9342b94a9 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bharath Reddy [Mon, 1 Jun 2026 03:24:26 +0000 (08:54 +0530)]
Bluetooth: fix memory leak in error path of hci_alloc_dev()
Early failures in Bluetooth HCI UART configuration leak SRCU percpu
memory.
When device initialization fails before hci_register_dev() completes,
the HCI_UNREGISTER flag is never set. As a result, when the device
reference count reaches zero, bt_host_release() evaluates this flag as
false and falls back to a direct kfree(hdev).
Because hci_release_dev() is bypassed, the SRCU struct initialized
early in hci_alloc_dev() is never cleaned up, resulting in a leak of
percpu memory.
Fix the leak by explicitly calling cleanup_srcu_struct() in the
fallback (unregistered) branch of bt_host_release() before freeing
the device.
Reported-by: syzbot+535ecc844591e50588a5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=535ecc844591e50588a5 Tested-by: syzbot+535ecc844591e50588a5@syzkaller.appspotmail.com Fixes: 1d6123102e9f ("Bluetooth: hci_core: Fix use-after-free in vhci_flush()") Signed-off-by: Bharath Reddy <kbreddy.rpbc@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Zhang Cen [Fri, 29 May 2026 03:22:09 +0000 (11:22 +0800)]
Bluetooth: bnep: reject short frames before parsing
A BNEP peer can send a short BNEP SDU. bnep_rx_frame() reads the
packet type byte immediately and, for control packets, reads the control
opcode and setup UUID-size byte before proving that those bytes are
present. bnep_rx_control() also dereferences the control opcode without
rejecting an empty control payload.
Use skb_pull_data() for the fixed fields in bnep_rx_frame() so a NULL
return gates each dereference. Split the control handler so the frame
path can pass an opcode that has already been pulled, and keep the
byte-buffer wrapper for extension control payloads.
For BNEP_SETUP_CONN_REQ, name the UUID-size byte before pulling the
setup payload. struct bnep_setup_conn_req carries destination and source
service UUIDs after that byte, each uuid_size bytes, so the parser now
documents that tuple explicitly instead of leaving the pull length as an
opaque multiplication.
Validation reproduced this kernel report:
KASAN slab-out-of-bounds in bnep_rx_frame.isra.0+0x130c/0x1790
The buggy address belongs to the object at ffff88800c0f7908 which belongs
to the cache kmalloc-8 of size 8
The buggy address is located 0 bytes to the right of allocated 1-byte
region [ffff88800c0f7908, ffff88800c0f7909)
Read of size 1
Call trace:
dump_stack_lvl+0xb3/0x140 (?:?)
print_address_description+0x57/0x3a0 (?:?)
bnep_rx_frame+0x130c/0x1790 (net/bluetooth/bnep/core.c:306)
print_report+0xb9/0x2b0 (?:?)
__virt_addr_valid+0x1ba/0x3a0 (?:?)
srso_alias_return_thunk+0x5/0xfbef5 (?:?)
kasan_addr_to_slab+0x21/0x60 (?:?)
kasan_report+0xe0/0x110 (?:?)
process_one_work+0xfce/0x17e0 (kernel/workqueue.c:3200)
worker_thread+0x65c/0xe40 (?:?)
__kthread_parkme+0x184/0x230 (?:?)
kthread+0x35e/0x470 (?:?)
_raw_spin_unlock_irq+0x28/0x50 (?:?)
ret_from_fork+0x586/0x870 (?:?)
__switch_to+0x74f/0xdc0 (?:?)
ret_from_fork_asm+0x1a/0x30 (?:?)
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Assisted-by: Codex:gpt-5.5 Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Existing advertising instances can already hold the maximum extended
advertising payload. When hci_adv_bcast_annoucement() prepends the
Broadcast Announcement service data to that payload, the combined data
may no longer fit in the temporary buffer used to rebuild the
advertising data.
Reject that case before copying the existing payload and report the
failure through the device log. This keeps the existing advertising
data intact and avoids overrunning the temporary buffer.
Fixes: 5725bc608252 ("Bluetooth: hci_sync: Fix broadcast/PA when using an existing instance") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Yuqi Xu <xuyq21@lenovo.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig
net/bluetooth/l2cap_core.c:l2cap_sig_channel() accepts BR/EDR
signaling packets up to the channel MTU and dispatches each command
without enforcing the signaling MTU (MTUsig). A Bluetooth BR/EDR peer
within radio range can send a fixed-channel CID 0x0001 packet that is
larger than MTUsig and contains many L2CAP_ECHO_REQ commands before
pairing. In a real-radio stock-kernel run, one 681-byte signaling
packet containing 168 zero-length ECHO_REQ commands made the target
transmit 168 ECHO_RSP frames over about 220 ms.
Impact: a Bluetooth BR/EDR peer within radio range, before pairing, can
force 168 ECHO_RSP frames from one 681-byte fixed-channel signaling
packet containing packed ECHO_REQ commands.
Define Linux's BR/EDR signaling MTU as the spec minimum of 48 bytes and
reject any larger signaling packet with one L2CAP_COMMAND_REJECT_RSP
carrying L2CAP_REJ_MTU_EXCEEDED before any command is dispatched.
The Bluetooth Core spec wording for MTUExceeded says the reject
identifier shall match the first request command in the packet, and
that packets containing only responses shall be silently discarded.
Linux intentionally deviates from that prescription: silently
discarding desynchronizes the peer because the remote stack never
learns its responses were dropped, and locating the first request
command requires walking command headers past MTUsig, i.e. processing
bytes from a packet we have already decided is too large to process.
We therefore always emit one reject and use the identifier from the
first command header, a single fixed-offset byte read.
The unrestricted BR/EDR signaling parser and ECHO_REQ response path both
trace to the initial git import; no later introducing commit is
available for a Fixes tag.
SeungJu Cheon [Mon, 25 May 2026 11:04:43 +0000 (20:04 +0900)]
Bluetooth: RFCOMM: validate skb length in MCC handlers
The RFCOMM MCC handlers cast skb->data to protocol-specific structs
without validating skb->len first. A malicious remote device can send
truncated MCC frames and trigger out-of-bounds reads in these handlers.
Fix this by using skb_pull_data() to validate and access the required
data before dereferencing it.
rfcomm_recv_rpn() requires special handling since ETSI TS 07.10 allows
1-byte RPN requests. Handle this by validating only the DLCI byte first,
and validating the full struct only when len > 1.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Muhammad Bilal <meatuni001@gmail.com> Signed-off-by: SeungJu Cheon <suunj1331@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Zhang Cen [Thu, 28 May 2026 09:45:06 +0000 (17:45 +0800)]
Bluetooth: MGMT: validate advertising TLV before type checks
tlv_data_is_valid() reads each advertising data field length from
data[i], then inspects data[i + 1] for managed EIR types before
checking that the current field still fits inside the supplied buffer.
A malformed field whose length byte is the last byte of the buffer can
therefore make the parser read one byte past the advertising data.
KASAN reported the following when a malformed MGMT_OP_ADD_ADVERTISING
request reached that path:
BUG: KASAN: vmalloc-out-of-bounds in tlv_data_is_valid()
Read of size 1
Call trace:
tlv_data_is_valid()
add_advertising()
hci_mgmt_cmd()
hci_sock_sendmsg()
Move the existing element-length check before any type-octet inspection
so each non-empty element is proven to contain its type byte before the
parser looks at data[i + 1].
Fixes: 2bb36870e8cb ("Bluetooth: Unify advertising instance flags check") Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Zhang Cen [Thu, 28 May 2026 07:56:41 +0000 (15:56 +0800)]
Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind()
rfcomm_get_sock_by_channel() scans rfcomm_sk_list under the list lock,
but returns the selected listener after dropping that lock without
taking a reference. rfcomm_connect_ind() then locks the listener,
queues a child socket on it, and may notify it after unlocking it.
The buggy scenario involves two paths, with each column showing the
order within that path:
rfcomm_connect_ind(): listener close:
1. Find parent in 1. close() enters
rfcomm_get_sock_by_channel() rfcomm_sock_release().
2. Drop rfcomm_sk_list.lock 2. rfcomm_sock_shutdown()
without pinning parent. closes the listener.
3. Call lock_sock(parent) and 3. rfcomm_sock_kill()
bt_accept_enqueue(parent, unlinks and puts parent.
sk, true).
4. Read parent flags and may 4. parent can be freed.
call sk_state_change().
If close wins the race, parent can be freed before
rfcomm_connect_ind() reaches lock_sock(), bt_accept_enqueue(), or the
deferred-setup callback.
Take a reference on the listener before leaving rfcomm_sk_list.lock.
After lock_sock() succeeds, recheck that it is still in BT_LISTEN
before queueing a child, cache the deferred-setup bit while the parent
is locked, and drop the reference after the last parent use.
KASAN reported a slab-use-after-free in lock_sock_nested() from
rfcomm_connect_ind(), with the freeing stack going through
rfcomm_sock_kill() and rfcomm_sock_release().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Xiang Mei reports that mac80211 could crash if eht_cap is set
but eht_oper isn't. Rather than fixing that for the individual
user(s), enforce that both HE/EHT have consistent elements.
Johannes Berg [Tue, 2 Jun 2026 11:28:38 +0000 (13:28 +0200)]
Merge tag 'iwlwifi-fixes-2026-05-31' of https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next
wifi: iwlwifi: fixes - 2026-05-31
Miri Korenblit says:
====================
This contains a few fixes:
- Don't grab nic access in non-fast-resume
- Don't send a large hcmd than transport supports
- In AP mode, don't send tx power constraints command before activating
the link
- Don't do sw reset handshake on older firmwares.
====================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fedor Pchelkin [Mon, 1 Jun 2026 09:41:56 +0000 (12:41 +0300)]
wifi: fix leak if split 6 GHz scanning fails
rdev->int_scan_req is leaked if cfg80211_scan() fails. Note that it's
supposed to be released at ___cfg80211_scan_done() but this doesn't happen
as rdev->scan_req is NULL at that point, too, leading to the early return
from the freeing function.
Jiayuan Chen [Fri, 29 May 2026 15:22:18 +0000 (23:22 +0800)]
ipv6: anycast: insert aca into global hash under idev->lock
syzbot reported a splat [1]: a slab-use-after-free in
ipv6_chk_acast_addr(), which walks the global inet6_acaddr_lst[] hash
under RCU and dereferences a struct ifacaddr6 that has already been
freed while still linked in the hash, so a later reader walks into a
dangling node.
In __ipv6_dev_ac_inc() the aca is allocated with refcount 1, then
aca_get() bumps it to 2 to keep it alive across the unlocked region.
It is published to idev->ac_list under idev->lock, but
ipv6_add_acaddr_hash() runs after write_unlock_bh(). A concurrent
teardown (ipv6_ac_destroy_dev() from addrconf_ifdown(), under RTNL)
can slip into that window:
CPU0 __ipv6_dev_ac_inc CPU1 ipv6_ac_destroy_dev (RTNL)
------------------------------ ------------------------------------
aca_alloc() refcnt 1
aca_get() refcnt 2
write_lock_bh(idev->lock)
add aca to ac_list
write_unlock_bh(idev->lock)
write_lock_bh(idev->lock)
pull aca off ac_list
write_unlock_bh(idev->lock)
ipv6_del_acaddr_hash(aca)
hlist_del_init_rcu() is a no-op,
aca is not in the hash yet
aca_put() refcnt 2->1
ipv6_add_acaddr_hash(aca)
aca now inserted into the hash
aca_put() refcnt 1->0
call_rcu(aca_free_rcu) -> kfree(aca)
The hash removal becomes a no-op because the insertion has not
happened yet, so once CPU0 inserts and drops the last reference, the
aca is freed while still linked in inet6_acaddr_lst[], and readers
dereference freed memory after the slab slot is reused.
This window opened once RTNL stopped serializing the join path against
device teardown. Move ipv6_add_acaddr_hash() inside the idev->lock
section so the ac_list and hash insertions are atomic with respect to
teardown: a racing remover now either misses the aca entirely or finds
it in both lists.
acaddr_hash_lock is now nested under idev->lock, which is acquired in
softirq context, so switch all acaddr_hash_lock sites to spin_lock_bh()
to avoid the irq lock inversion reported in [2].
Tejun Heo [Mon, 1 Jun 2026 19:22:37 +0000 (09:22 -1000)]
sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task()
A WARN fires when systemd's user manager writes "+cpu +memory +pids" to
its own subtree_control while a sched_ext scheduler is loaded:
WARNING: at kernel/sched/ext.c:3227 scx_cgroup_move_task+0xa8/0xb0
scx_cgroup_move_task+0xa8/0xb0
sched_move_task+0x134/0x290
cpu_cgroup_attach+0x39/0x70
cgroup_migrate_execute+0x37d/0x450
cgroup_update_dfl_csses+0x1e3/0x270
cgroup_subtree_control_write+0x3e7/0x440
scx_cgroup_can_attach() arms cgrp_moving_from only when a task's cpu
cgroup changes. It can still be NULL when scx_cgroup_move_task() runs,
through this sequence:
Step Result
--------------------------------- ----------------------------------
1. cpu enabled on cgroup G cpu css = A
2. cpu toggled off then on for G A killed, B created (same cgroup)
3. an exiting task keeps A alive migration skips it, A now stale
4. +memory migrates G stale A vs current B pulls cpu in
5. cpu attach runs for all tasks hits a live, cpu-unchanged task
6. scx_cgroup_move_task() on it cgrp_moving_from NULL -> WARN
The mismatch is that scx_cgroup_can_attach() keys on cgroup identity
while migration drives the move on css identity, so a NULL cgrp_moving_from
here is a legitimate css-only migration, not a missing prep.
The call is already gated on cgrp_moving_from, so just drop the warning.
ops.cgroup_prep_move() and ops.cgroup_move() stay paired.
Zhao Zhang [Sat, 30 May 2026 15:57:14 +0000 (23:57 +0800)]
sctp: diag: reject stale associations in dump_one path
The SCTP exact sock_diag lookup can hold a transport reference, block on
lock_sock(sk), and then resume after sctp_association_free() has marked
the association dead and freed its bind address list.
When that happens, inet_assoc_attr_size() and
inet_diag_msg_sctpasoc_fill() can still dereference association state
that is no longer valid for reporting. In particular,
inet_diag_msg_sctpasoc_fill() may read an empty bind-address list as a
real sctp_sockaddr_entry and trigger an out-of-bounds read from
unrelated association memory.
Reject the association after taking the socket lock if it has been
reaped or detached from the endpoint, and report the lookup as stale.
This keeps the exact dump-one path from formatting torn association
state.
Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Zhao Zhang <zzhan461@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/fac6043fa20a2ff68e12958c431836f692c51268.1780113823.git.zzhan461@ucr.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tapio Reijonen [Fri, 29 May 2026 06:18:57 +0000 (06:18 +0000)]
net: fec: fix pinctrl default state restore order on resume
In fec_resume(), fec_enet_clk_enable() is called before
pinctrl_pm_select_default_state() in the non-WoL path, inverting the
ordering used in fec_suspend() which correctly switches to the sleep
pinctrl state before disabling clocks.
For PHYs with the PHY_RST_AFTER_CLK_EN flag (e.g. TI DP83848 or
SMSC LAN87xx), fec_enet_clk_enable() triggers a hardware reset pulse
via the phy-reset GPIO. With the GPIO pin still in sleep pinctrl state
at that point, the GPIO write has no physical effect and the PHY never
receives the required reset after clock enable, leading to unreliable
link establishment after system resume.
Fix by restoring the default pinctrl state before enabling clocks,
making resume the proper mirror of suspend. The call is made
unconditionally: fec_suspend() only switches to the sleep pinctrl state
on the non-WoL path and leaves the pins in the default state when WoL
is enabled, so on a WoL resume the device is already in the default
state and pinctrl_pm_select_default_state() is a no-op.
David Thompson [Fri, 29 May 2026 21:03:00 +0000 (21:03 +0000)]
net: lan743x: permit VLAN-tagged packets up to configured MTU
VLAN-tagged interfaces on lan743x devices were previously unreachable via
SSH and failed to respond to large ping packets (e.g. "ping -s 1469" given
MTU=1500). In these scenarios, "ethtool -S" reports non-zero "RX Oversize
Frame Errors". According to Microchip AN2948, the MAC_RX FSE (VLAN field
size enforcement) bit determines whether frames with VLAN tags exceeding
the base MTU plus tag length are discarded.
The driver must set the MAC_RX.FSE bit before setting MAC_RX.RXEN to allow
VLAN-tagged frames up to the interface MTU, preventing them from being
treated as oversized. As a result, both the base and VLAN-tagged interfaces
can use the same MTU without receive errors.
Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Signed-off-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Thangaraj Samynathan <Thangaraj.s@microchip.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> # lan7430 on arm64 (RevPi Link: https://patch.msgid.link/20260529210300.433135-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yuqi Xu [Fri, 29 May 2026 13:01:44 +0000 (21:01 +0800)]
net: rds: clear i_sends on setup unwind
The RDS IB connection teardown path is written so it can run during
partial startup and on repeated shutdown attempts. It uses NULL
pointers to distinguish resources that are still owned from resources
that have already been released.
When rds_ib_setup_qp() fails after allocating i_sends but before
allocating i_recvs, the sends_out path frees i_sends without clearing
the pointer. A later shutdown pass can still treat that stale pointer
as a live send ring allocation.
Clear i_sends after vfree() in the error unwind path so the existing
shutdown logic continues to use the correct ownership state.
Yizhou Zhao [Wed, 27 May 2026 08:31:58 +0000 (16:31 +0800)]
net: garp: fix unsigned integer underflow in garp_pdu_parse_attr
The receive-side GARP attribute parser computes dlen with reversed
operands:
dlen = sizeof(*ga) - ga->len;
ga->len is the on-wire attribute length and includes the GARP attribute
header. For normal attributes with data, ga->len is larger than
sizeof(*ga), so the subtraction underflows in unsigned arithmetic.
The resulting value is later passed to garp_attr_lookup(), whose length
argument is u8. After truncation, the parsed data length usually no
longer matches the length stored for locally registered attributes, so
received Join/Leave events are ignored. This breaks the GARP receive path
for common attributes, such as GVRP VLAN registration attributes.
Compute the data length as the attribute length minus the header length.
Fixes: eca9ebac651f ("net: Add GARP applicant-only participant") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260527083200.42861-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported the warning [0] in hsr_addr_is_self(),
whose assumption is simply wrong.
hsr->self_node is cleared in hsr_del_self_node(), which
is called from hsr_dellink().
Since dev->rtnl_link_ops->dellink() is called before
unregister_netdevice_many(), there is a window when
user can find the device but without hsr->self_node.
Jakub Kicinski [Tue, 2 Jun 2026 18:57:21 +0000 (11:57 -0700)]
Merge tag 'nf-26-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for net:
1) Fix splat with PREEMPT_RCU because smp_processor_id() in nfqueue,
from Fernando Fernandez Mancera.
2) Fix possible use of pointer to old IPVS scheduler after RCU grace
period when editing service, from Julian Anastasov.
3) Fix possible forever RCU walk over rt->fib6_siblings in nft_fib6,
if rt is unlinked mid-iteration, apparently same issue happens in
the fib6 core. From Jiayuan Chen.
4) Add mutex to guard refcount in synproxy infrastructure, since
concurrent hook {un}registration can happen.
From Fernando Fernandez Mancera.
5) Bail out if IRC conntrack helper fails to parse a command, do not
try parsing using other command handlers, from Florian Westphal.
This fixes a possible out-of-bound read.
6) Possible use-after-free in nft_tunnel by releasing template dst
after all references has been dropped, from Tristan Madani.
7) Ignore conntrack template in nft_ct, from Jiayuan Chen.
8) Missing skb_ensure_writable() in ebt_snat, Yiming Qian.
9) Remove multi-register byteorder support, this allows for kernel
stack info leak, from Florian Westphal.
* tag 'nf-26-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_byteorder: remove multi-register support
netfilter: bridge: make ebt_snat ARP rewrite writable
netfilter: nft_ct: bail out on template ct in get eval
netfilter: nft_tunnel: fix use-after-free on object destroy
netfilter: conntrack_irc: fix possible out-of-bounds read
netfilter: synproxy: add mutex to guard hook reference counting
netfilter: nft_fib_ipv6: bail out of sibling walk if rt got unlinked
ipvs: clear the svc scheduler ptr early on edit
netfilter: xt_NFQUEUE: prefer raw_smp_processor_id
====================
tcp: Add preempt_{disable,enable}_nested() in reqsk_queue_hash_req().
syzbot reported a weird reqsk->rsk_refcnt underflow in
__inet_csk_reqsk_queue_drop().
The captured reqsk_put() in __inet_csk_reqsk_queue_drop()
is called only when it successfully removes reqsk from ehash.
Moreover, reqsk_timer_handler() calls another reqsk_put()
after that.
This indicates that the reqsk was missing both refcnts for
ehash and the timer itself.
Since all the syzbot reports had PREEMPT_RT enabled, the only
possible scenario is that reqsk_queue_hash_req() is preempted
after mod_timer() and before refcount_set(), and then the timer
triggered after 1s aborts the reqsk due to its listener's close().
Let's wrap mod_timer() and refcount_set() with
preempt_disable_nested() and preempt_enable_nested().
Note that inet_ehash_insert() holds the normal spin_lock()
(mutex in PREEMPT_RT), so it must be called outside of
preempt_disable_nested(), but this is fine.
The lookup path just ignores 0 sk_refcnt entries in ehash
and tries to create another reqsk, but this will fail at
inet_ehash_insert().
Oscar Maes [Thu, 28 May 2026 14:03:20 +0000 (16:03 +0200)]
pcnet32: stop holding device spin lock during napi_complete_done
napi_complete_done may call gro_flush_normal (though not currently, as GRO
is unsupported at the moment), which may result in packet TX. This will
eventually result in calling pcnet32_start_xmit - resulting in a deadlock
while trying to re-acquire the already locked spin lock.
It is safe to split the spinlock block into two, because the hardware
registers are still protected from concurrent access, and the two blocks
perform unrelated operations that don't need to happen atomically.
Fixes: 5b2ec6f2be51 ("pcnet32: use napi_complete_done()") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Oscar Maes <oscmaes92@gmail.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260528140320.5556-1-oscmaes92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 2 Jun 2026 17:54:11 +0000 (10:54 -0700)]
Merge tag 'soc-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"Following the previous set of fixes, this addresses another
significant number of small issues found in firmware drivers (tee,
optee, qcomtee, qcom ice, exynos acpm) drivers through various tools.
This is about error handling, resource leaks, concurrency and a
use-after-free bug.
The fixes for the Qualcomm ICE driver also introduce interface changes
in the UFS and MMC drivers using it.
Outside of firmware drivers, there are a few fixes across the tree:
- Minor driver code mistakes in the Atmel EBI memory controller, the
i.MX soc ID driver and socfpga boot logic
- A defconfig change to avoid a boot time regression on multiple
qualcomm boards
- Device tree fixes for qualcomm, at91 and gemini, addressing mostly
minor configuration mistakes"
* tag 'soc-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (28 commits)
firmware: samsung: acpm: Fix infinite loop on sequence number exhaustion
firmware: samsung: acpm: Fix missing LKMM barriers in sequence allocator
firmware: samsung: acpm: Fix false timeouts and Use-After-Free in polling
ARM: dts: gemini: Fix partition offsets
ARM: socfpga: Fix OF node refcount leak in SMP setup
soc: qcom: ice: Fix the error code when 'qcom,ice' property is not found
arm64: dts: qcom: eliza: Add power-domain and iface clk for ice node
arm64: dts: qcom: milos: Add power-domain and iface clk for ice node
tee: qcomtee: add missing va_end in early return qcomtee_object_user_init()
tee: fix params_from_user() error path in tee_ioctl_supp_recv
tee: shm: fix shm leak in register_shm_helper()
tee: fix tee_ioctl_object_invoke_arg padding
arm64: defconfig: Enable PCI M.2 power sequencing driver
scsi: ufs: ufs-qcom: Remove NULL check from devm_of_qcom_ice_get()
mmc: sdhci-msm: Remove NULL check from devm_of_qcom_ice_get()
soc: qcom: ice: Return proper error codes from devm_of_qcom_ice_get() instead of NULL
soc: qcom: ice: Return -ENODEV if the ICE platform device is not found
soc: qcom: ice: Fix race between qcom_ice_probe() and of_qcom_ice_get()
ARM: dts: microchip: sam9x7: fix GMAC clock configuration
firmware: samsung: acpm: Fix mailbox channel leak on probe error
...
Linus Torvalds [Tue, 2 Jun 2026 15:59:35 +0000 (08:59 -0700)]
Merge tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
"13 hotfixes. All are for MM. 10 are cc:stable and the remaining 3
address post-7.1 issues or aren't considered suitable for backporting.
There's a three-patch series "userfaultfd: verify VMA state across
UFFDIO_COPY retry" from Mike Rapoport which fixes a few uffd things.
The rest are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-06-01-20-58' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
userfaultfd: remove redundant check in vm_uffd_ops()
userfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAs
userfaultfd: verify VMA state across UFFDIO_COPY retry
mm/huge_memory: update file PMD counter before folio_put()
mm/huge_memory: update file PUD counter before folio_put()
mm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollback
mm/damon/ops-common: call folio_test_lru() after folio_get()
mm/cma: fix reserved page leak on activation failure
mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
mm/hugetlb: restore reservation on error in hugetlb folio copy paths
mm/cma_debug: fix invalid accesses for inactive CMA areas
memcg: use round-robin victim selection in refill_stock
mm/hugetlb: avoid false positive lockdep assertion
wifi: mac80211: limit injected antenna index in ieee80211_parse_tx_radiotap
When parsing the radiotap header of an injected frame,
ieee80211_parse_tx_radiotap() uses the IEEE80211_RADIOTAP_ANTENNA value
directly as a shift count:
*iterator.this_arg is an 8-bit value taken straight from the frame
supplied by userspace, so BIT() can be asked to shift by up to 255. That
is undefined behaviour on the unsigned long and is reported by UBSAN:
UBSAN: shift-out-of-bounds in net/mac80211/tx.c:2174:30
shift exponent 235 is too large for 64-bit type 'unsigned long'
Call Trace:
ieee80211_parse_tx_radiotap+0xadb/0x1950 net/mac80211/tx.c:2174
ieee80211_monitor_start_xmit+0xb1f/0x1250 net/mac80211/tx.c:2451
...
packet_sendmsg+0x3eb6/0x50f0 net/packet/af_packet.c:3109
info->control.antennas is a 2-bit bitmap (u8 antennas:2), so only antenna
indices 0 and 1 can ever be represented. Ignore any larger value instead
of shifting out of bounds.
Reported-by: syzbot+8e0622f6d9446420271f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8e0622f6d9446420271f Fixes: ef246a1480cc ("wifi: mac80211: support antenna control in injection") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Link: https://patch.msgid.link/20260531011721.102941-1-kartikey406@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Yuqi Xu [Fri, 29 May 2026 15:25:37 +0000 (23:25 +0800)]
wifi: nl80211: reject oversized EMA RNR lists
nl80211_parse_rnr_elems() stores the parsed element count in a
u8-backed cfg80211_rnr_elems::cnt field and uses that count to size
the flexible array allocation.
Reject nested NL80211_ATTR_EMA_RNR_ELEMS input once the count reaches
255, before incrementing it again. This keeps the parser aligned with
the data structure it fills and matches the existing bound check used
by nl80211_parse_mbssid_elems().
Fixes: dbbb27e183b1 ("cfg80211: support RNR for EMA AP") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:gpt-5.4 Signed-off-by: Yuqi Xu <xuyuqiabc@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Link: https://patch.msgid.link/20260529152542.1412734-1-n05ec@lzu.edu.cn Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Linus Torvalds [Tue, 2 Jun 2026 02:55:30 +0000 (19:55 -0700)]
Merge tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fix from Mikulas Patocka:
- fix race condition in dm-cache-policy-smq
* tag 'for-7.1/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm cache policy smq: check allocation under invalidate lock
Mark Bloch [Thu, 28 May 2026 19:14:10 +0000 (22:14 +0300)]
devlink: Release nested relation on devlink free
devlink relation state is normally released from devl_unregister(), which
calls devlink_rel_put(). This misses devlink instances that get a nested
relation before registration and then fail probe before devl_register() is
reached.
That flow can happen for SFs. The child devlink gets linked to its
parent before registration, then a later probe error calls devlink_free()
directly. Since the instance was never registered, devl_unregister() is not
called and devlink->rel is leaked.
Release any pending relation from devlink_free() as well. The registered
path is unchanged because devl_unregister() already clears devlink->rel
before devlink_free() runs.
Fixes: c137743bce02 ("devlink: introduce object and nested devlink relationship infra") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260528191411.3270532-1-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 2 Jun 2026 02:50:33 +0000 (19:50 -0700)]
Merge tag 'auxdisplay-v7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay
Pull auxdisplay updates from Andy Shevchenko:
- Fix potential out-of-bound access in line-display library
- Miscellaneous refactoring and cleaning up
[ Andy says this could easily be delayed until 7.2, but it's _so_ tiny
that it's more work for me to schedule it for later than to just take
it now, and just doesn't seem worth delaying - Linus ]
* tag 'auxdisplay-v7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
auxdisplay: Kconfig: drop unneeded quotes in PANEL_BOOT_MESSAGE dep
auxdisplay: line-display: fix OOB read on zero-length message_store()
auxdisplay: max6959: use regmap_assign_bits() in max6959_enable()
Lee Jones [Wed, 27 May 2026 13:36:29 +0000 (13:36 +0000)]
l2tp: pppol2tp: hold reference to session in pppol2tp_ioctl()
pppol2tp_ioctl() read sock->sk->sk_user_data directly without any
locks or reference counting. If a controllable sleep was induced during
copy_from_user() (e.g. via a userfaultfd page fault sleep), a concurrent
socket close could trigger pppol2tp_session_close() asynchronously. This
frees the l2tp_session structure via the l2tp_session_del_work workqueue.
Upon resuming, the ioctl thread dereferences the stale session pointer,
resulting in a Use-After-Free (UAF).
Fix this by securely fetching the session reference using the RCU-safe,
refcounted helper pppol2tp_sock_to_session(sk) on entry. This locks the
session's refcount across the sleep. We structured the function to exit
via standard err breaks, guaranteeing that l2tp_session_put() is cleanly
called on all return paths to drop the reference.
To preserve existing behavior we validate the session and its magic
signature only for the specific L2TP commands that require it. This
ensures that generic/unknown ioctls called on an unconnected socket
still return -ENOIOCTLCMD and correctly fall back to generic handlers
(e.g. in sock_do_ioctl()).
Yizhou Zhao [Wed, 27 May 2026 08:18:01 +0000 (16:18 +0800)]
6lowpan: fix off-by-one in multicast context address compression
The second memcpy in lowpan_iphc_mcast_ctx_addr_compress() uses
&data[1] as destination and &ipaddr->s6_addr[11] as source, but
both should be offset by one: &data[2] and &ipaddr->s6_addr[12]
respectively.
This off-by-one has two consequences:
1. data[1] is overwritten with s6_addr[11], corrupting the RIID
field in the compressed multicast address
2. data[5] is never written, so uninitialized kernel stack memory
is transmitted over the network via lowpan_push_hc_data(),
leaking kernel stack contents
The correct inline data layout must match what the decompression
function lowpan_uncompress_multicast_ctx_daddr() expects:
data[0..1] = s6_addr[1..2] (flags/scope + RIID)
data[2..5] = s6_addr[12..15] (group ID)
Also zero-initialize the data array as a defensive measure against
similar bugs in the future.
Fixes: 5609c185f24d ("6lowpan: iphc: add support for stateful compression") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://patch.msgid.link/20260527081806.42747-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Sun, 31 May 2026 16:08:12 +0000 (12:08 -0400)]
net/sched: act_api: use RCU with deferred freeing for action lifecycle
When NEWTFILTER and DELFILTER are run concurrently it is possible to create a
race with an associated action.
Let's illustrate with CPU0 running NEWTFILTER and CPU1 running DELFILTER:
0: mutex_lock() <-- holds the idr lock
0: rcu_read_lock()
0: p = idr_find(idr, index) <-- action p is valid (RCU protects IDR)
0: mutex_unlock() <-- releases the idr lock
1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held
1: idr_remove(idr, index) <-- Action removed from IDR
1: mutex_unlock() <-- mutex released allowing us to delete the action
1: tcf_action_cleanup(p); kfree(p) <-- Kfrees p immediately, no deferral
0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- ouch, UAF p points to freed memory
This patch fixes the race condition between NEWTFILTER and DELFILTER by
adding struct rcu_head to tc_action used in the deferral and introducing a
call_rcu() in the delete path to defer the final kfree().
Note: this is a revert of commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu")
but also modernization/simplification to directly use kfree_rcu().
Let's illustrate the new restored code path:
0: rcu_read_lock()
1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held
1: idr_remove(idr, index)
1: mutex_unlock()
1: call_rcu(&p->tcfa_rcu, tcf_action_rcu_free) <-- defer kfree after grace period
0: p = idr_find(idr, index)
0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- fails, refcnt already 0
1: rcu_read_unlock() <-- release so freeing can run after grace period
After CPU1 calls idr_remove(), the object is no longer reachable through the IDR.
CPU0's subsequent idr_find() will return NULL, and even if it still held a
stale pointer, the immediate kfree() is now deferred until after the RCU grace
period, so no UAF can occur.
Fixes: d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") Suggested-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Kyle Zeng <kylebot@openai.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260531160812.68020-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guangshuo Li [Fri, 29 May 2026 15:57:45 +0000 (23:57 +0800)]
dm cache policy smq: check allocation under invalidate lock
commit 2d1f7b65f5de ("dm cache policy smq: fix missing locks in
invalidating cache blocks") added mq->lock around the destructive part of
smq_invalidate_mapping(), but left the e->allocated check outside the
critical section.
That leaves a check-then-act race. Two concurrent invalidators can both
observe e->allocated as true before either of them takes mq->lock. The
first invalidator that acquires the lock removes the entry from the
queues and hash table and then calls free_entry(), which clears
e->allocated and puts the entry back on the free list. The second
invalidator can then acquire mq->lock and continue with the stale result
of the unlocked check.
This can corrupt the SMQ queues or hash table by deleting an entry that
is no longer on those structures. It can also hit the allocation check in
free_entry() when the same entry is freed again.
Move the allocation check under mq->lock so the predicate and the
destructive operations are serialized by the same lock.
Nikolay Kuratov [Tue, 26 May 2026 16:29:32 +0000 (19:29 +0300)]
net/mlx5: Reorder completion before putting command entry in cmd_work_handler
Assuming callback != NULL && !page_queue, cmd_work_handler takes
command entry with refcnt == 1 from mlx5_cmd_invoke.
If either semaphore timeout or index allocation error happens,
it does final cmd_ent_put(ent). To avoid access to freed memory,
notify slotted completion before cmd_ent_put.
This is theoretical issue found by Svace static analyser.
Cc: stable@vger.kernel.org Fixes: 485d65e135712 ("net/mlx5: Add a timeout to acquire the command queue semaphore") Fixes: 0e2909c6bec90 ("net/mlx5: Fix variable not being completed when function returns") Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260526162932.501584-1-kniv@yandex-team.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Florian Westphal [Tue, 12 May 2026 13:36:14 +0000 (15:36 +0200)]
netfilter: nft_byteorder: remove multi-register support
64bit byteorder conversion is broken when several registers need to be
converted because the source register array advances in steps for 4 bytes
instead of 8:
for (i = ...
src64 = nft_reg_load64(&src[i]);
~~~~~ u32 *src
nft_reg_store64(&dst64[i],
Remove the multi-register support, it has other issues as well:
Pablo points out that commit caf3ef7468f7 ("netfilter: nf_tables: prevent OOB access in nft_byteorder_eval")
alters semantics: before the loop operated on registers, i.e.
for ( ... )
dst32[i] = htons((u16)src32[i])
.. but after the patch it will operate on bytes, which makes this
useless to convert e.g. concatenations, which store each compound
in its own register.
Multi-convert of u32 has one theoretical application:
ct mark . meta mark . tcp dport @intervalset
Because ct mark and meta mark are host byte order, use with
intervals has to convert the byteorder for ct/meta mark value
to network byte order (bigendian).
I.e. two separate calls. Theoretically it could be changed to do:
[ meta load mark => reg 1 ]
[ ct load mark => reg 9 ]
[ byteorder reg 1 = htonl(reg 1, 4, 8) ]
...
But then all it would take to change the set to
meta mark . tcp dport . ct mark
... and we'd be back to two "byteorder" calls. IOW, support to
convert a range of registers is both dysfunctional and dubious.
Simplify this: remove the feature.
Pablo Neira Ayuso points out that nftables before 1.1.0 can generate
incorrect byteorder conversions, see 9fe58952c45a,
"evaluate: skip byteorder conversion for selector smaller than 2 bytes"
in nftables.git). Affected rulesets fail to load with this change and
old userspace due to 'len != size' check.
Fixes: c301f0981fdd ("netfilter: nf_tables: fix pointer math issue in nft_byteorder_eval()") Cc: <stable+noautosel@kernel.org> # may break rule load with old nftables versions Reported-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/netfilter-devel/20240206104336.ctigqpkunom2ufmn@lion.mk-sys.cz/ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Yiming Qian [Sat, 23 May 2026 12:29:10 +0000 (12:29 +0000)]
netfilter: bridge: make ebt_snat ARP rewrite writable
The ebtables SNAT target keeps the Ethernet source address rewrite
behind skb_ensure_writable(skb, 0). This is intentional: at the bridge
ebtables hooks the Ethernet header is addressed through
skb_mac_header()/eth_hdr(), while skb->data points at the Ethernet
payload. Asking skb_ensure_writable() for ETH_HLEN bytes would check
the payload, not the Ethernet header, and would reintroduce the small
packet regression fixed by commit 63137bc5882a.
However, the optional ARP sender hardware address rewrite is different.
It writes through skb_store_bits() at an offset relative to skb->data:
skb_header_pointer() only safely reads the ARP header; it does not make
the later sender hardware address range writable. If that range is
still held in a nonlinear skb fragment backed by a splice-imported file
page, skb_store_bits() maps the frag page and copies the new MAC address
directly into it.
Ensure the ARP SHA range is writable before reading the ARP header and
before calling skb_store_bits().
Fixes: 63137bc5882a ("netfilter: ebtables: Fixes dropping of small packets in bridge nat") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jiayuan Chen [Thu, 28 May 2026 11:09:19 +0000 (19:09 +0800)]
netfilter: nft_ct: bail out on template ct in get eval
I noticed this issue while looking at a historic syzbot report [1].
A rule like the one below is enough to trigger the bug:
table ip t {
chain pre {
type filter hook prerouting priority raw;
ct zone set 1
ct original saddr 1.2.3.4 accept
}
}
The first expression attaches a per-cpu template ct via
nft_ct_set_zone_eval() (nf_ct_tmpl_alloc -> kzalloc, tuple is all
zero, nf_ct_l3num(ct) == 0). The next expression then calls
nft_ct_get_eval() on the same skb, treats the template as a real ct
and hits the 16-byte memcpy path. With dreg at NFT_REG32_15 this
overflows past struct nft_regs on the kernel stack; with smaller
dreg values it silently clobbers adjacent registers.
Reject template ct at the eval entry and in nft_ct_get_fast_eval(),
mirroring the check nft_ct_set_eval() already has. Additionally,
bound the address copy in NFT_CT_SRC / NFT_CT_DST by priv->len
instead of by nf_ct_l3num(ct): nf_ct_get_tuple() zeroes the tuple
before pkt_to_tuple() fills in only the protocol-relevant leading
bytes, so the trailing bytes of tuple->{src,dst}.u3.all are
well-defined zero. priv->len is validated at rule load, so the
copy size is now bounded by the destination register rather than
by an untrusted field on the conntrack.
Tristan Madani [Wed, 27 May 2026 13:57:50 +0000 (13:57 +0000)]
netfilter: nft_tunnel: fix use-after-free on object destroy
nft_tunnel_obj_destroy() calls metadata_dst_free() which directly
kfree()s the metadata_dst, ignoring the dst_entry refcount. Packets
that took a reference via dst_hold() in nft_tunnel_obj_eval() and
are still queued (e.g. in a netem qdisc) are left with a dangling
pointer. When these packets are eventually dequeued, dst_release()
operates on freed memory.
Replace metadata_dst_free() with dst_release() so the metadata_dst
is freed only after all references are dropped. The dst subsystem
already handles metadata_dst cleanup in dst_destroy() when
DST_METADATA is set.