Eric Dumazet [Tue, 5 May 2026 13:32:33 +0000 (13:32 +0000)]
inetpeer: add a missing read_seqretry() in inet_getpeer()
When performing a lockless lookup over the inet_peer rbtree,
if a matching node is found, inet_getpeer() returns it immediately
without validating the seqlock sequence.
This missing check introduces a race condition:
Trigger Path: When a host receives an incoming fragmented IPv4 packet,
ip4_frag_init() (in net/ipv4/ip_fragment.c) calls inet_getpeer_v4()
to track the peer.
The Race: If the packet is from a new source IP, CPU A acquires the
write_seqlock, allocates a new inet_peer node (p), sets its IP address
(daddr), and links it to the rbtree (rb_link_node).
Uninitialized Access: Due to the lack of memory barriers between
rb_link_node and the initialization of the rest of the struct
(like refcount_set(&p->refcnt, 1)), CPU A can make the node visible
to readers before its refcnt is initialized.
This is especially true on weakly-ordered architectures like ARM64
where the CPU can reorder the memory stores.
Lockless Reader: Concurrently, CPU B processes a second fragmented packet
from the same source IP. CPU B does a lockless lookup, finds the newly
inserted node, and returns it immediately.
Use-After-Free (UAF): CPU B reads p->refcnt as uninitialized garbage
(left over from previous kmalloc-128/192 allocations).
If the garbage is > 0, refcount_inc_not_zero(&p->refcnt) succeeds.
CPU A then executes refcount_set(&p->refcnt, 1), overwriting CPU B's increment.
When CPU B finishes with the fragment queue, it calls inet_putpeer(),
which drops the refcount to 0 and frees the node via RCU.
The node is now freed but remains linked in the rbtree,
resulting in a Use-After-Free in the rbtree.
Fixes: b145425f269a ("inetpeer: remove AVL implementation in favor of RB tree") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260505133233.3039575-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
netdevsim: psp: fix init and uninit bugs
This series has three fixes. The first is a straightforward NULL
pointer dereference that is reachable by creating and destroying some
vfs on a kernel with INET_PSP enabled.
The last two patches deal with nsim_psp_rereg_write(), which is a
debugfs handler that reregisters netdevsim's psp_dev without
aquiescing and disabling tx/rx processing. This was added to enable
some tests in psp.py where a psp device is unregistered while it still
referenced by tcp socket state.
There are two issues with this code:
1. Calls to nsim_psp_uninit() are not properly serialized
2. netdevsim's psp_dev refcount can be released while nsim_do_psp() is
reading from it.
====================
Daniel Zahka [Tue, 5 May 2026 10:42:25 +0000 (03:42 -0700)]
netdevsim: psp: rcu protect psp_dev reference
There are two issues with the way psp_dev is used in nsim_do_psp():
1. There is no check for IS_ERR() on the peers psp_dev, before
dereferencing.
2. The refcount on this psp_dev can be dropped by
nsim_psp_rereg_write()
To fix this, we can make netdevsim's reference to its psp_dev an rcu
reference, and then nsim_do_psp() can read the fields it needs from an
rcu critical section.
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-3-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Tue, 5 May 2026 10:42:24 +0000 (03:42 -0700)]
netdevsim: psp: serialize calls to nsim_psp_uninit()
The debugfs write handler, nsim_psp_rereg_write(), can race against
nsim_destroy() and against itself, causing nsim_psp_uninit() to run
more than once concurrently. Two complementary changes serialize all
callers:
1. Delete the psp_rereg debugfs file from nsim_psp_uninit() before
doing the actual teardown. debugfs_remove() drains any in-flight
writers and prevents new ones from starting.
2. Add a mutex around the body of nsim_psp_rereg_write() so that two
concurrent userspace writers cannot both enter the teardown path
at once.
The teardown work itself is moved into a new __nsim_psp_uninit() that
the rereg handler calls under the mutex, while the public
nsim_psp_uninit() wraps it with the debugfs_remove()/mutex_destroy()
pair so nsim_destroy() doesn't have to know about the psp internals.
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-2-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Tue, 5 May 2026 10:42:23 +0000 (03:42 -0700)]
netdevsim: psp: only call nsim_psp_uninit() on PFs
VFs go through nsim_init_netdevsim_vf() which never calls
nsim_psp_init(), so ns->psp.dev stays NULL. nsim_psp_uninit() guards
with !IS_ERR(ns->psp.dev), so destroying a VF reaches
psp_dev_unregister(NULL) and dereferences NULL on the first
mutex_lock(&psd->lock):
Fixes: f857478d6206 ("netdevsim: a basic test PSP implementation") Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260505-psd-rcu-v1-1-a8f69ec1ab96@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1. Fix an IPv6 encapsulation error path that leaked route references
when UDPv6 ESP decapsulation resolved to an error route.
From Yilin Zhu.
2. Fix AH with ESN on async crypto paths by accounting for the extra
high-order sequence number when reconstructing the temporary
authentication layout in the completion callbacks.
From Michael Bomarito.
3. Fix XFRM output so it does not overwrite already-correct inner header
pointers when a tunnel layer such as VXLAN has already saved them.
The fix comes with new selftests. From Cosmin Ratiu.
4. Add the missing native payload size entry for XFRM_MSG_MAPPING in the
compat translation path. From Ruijie Li.
5. Harden __xfrm_state_delete() against repeated or inconsistent unhashing
of state list nodes by keying the removal on actual list membership and
using delete-and-init helpers. From Michal Kosiorek.
6. Prevent ESP from decrypting shared splice-backed skb fragments in place
by marking UDP splice frags as shared and forcing copy-on-write in ESP
input when needed. From Kuan-Ting Chen.
* tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: esp: avoid in-place decrypt on shared skb frags
xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
xfrm: provide message size for XFRM_MSG_MAPPING
xfrm: Don't clobber inner headers when already set
tools/selftests: Add a VXLAN+IPsec traffic test
tools/selftests: Use a sensible timeout value for iperf3 client
xfrm: ah: account for ESN high bits in async callbacks
ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
====================
Jakub Kicinski [Wed, 6 May 2026 23:10:02 +0000 (16:10 -0700)]
Merge tag 'ovpn-net-20260504' of https://github.com/OpenVPN/ovpn-net-next
Antonio Quartulli says:
====================
Includes changes:
* ensure MAC header offset is reset before delivering packet
* ensure gro_cells_receive() and dstats_dev_add() are called
with BH disabled
* reduce ping count in selftest to ensure it completes within
timeout
* tag 'ovpn-net-20260504' of https://github.com/OpenVPN/ovpn-net-next:
selftests: ovpn: reduce ping count in test.sh
ovpn: ensure packet delivery happens with BH disabled
ovpn: reset MAC header before passing skb up
====================
Bluetooth: HIDP: serialise l2cap_unregister_user via hidp_session_sem
Commit dbf666e4fc9b ("Bluetooth: HIDP: Fix possible UAF") made
hidp_session_remove() drop the L2CAP reference and set
session->conn = NULL once the session is considered removed, and
added a bare if (session->conn) guard around the kthread-exit
l2cap_unregister_user() call in hidp_session_thread(). The sibling
ioctl site in hidp_connection_del() still reads session->conn
unlocked and unguarded, and the kthread-exit guard itself is a
lockless double-read.
hidp_session_find() drops hidp_session_sem before returning, so
hidp_session_remove() can null session->conn between the lookup and
the call in hidp_connection_del(). Worse, since commit 752a6c9596dd
("Bluetooth: L2CAP: Fix use-after-free in l2cap_unregister_user")
takes mutex_lock(&conn->lock) inside l2cap_unregister_user(), a
stale non-NULL snapshot also UAFs on conn->lock. v1 only added an
if (session->conn) guard at the ioctl site, which doesn't address
either race; Luiz suggested snapshotting session->conn under the
sem and clearing it before the call.
Taking hidp_session_sem across l2cap_unregister_user() would be
wrong: l2cap_conn_del() already establishes the lock order
conn->lock -> hidp_session_sem
via l2cap_unregister_all_users() -> user->remove ==
hidp_session_remove(), so taking hidp_session_sem before conn->lock
would AB/BA deadlock.
Factor a helper hidp_session_unregister_conn() that under
down_write(&hidp_session_sem) snapshots session->conn and clears
the member, then outside the sem calls l2cap_unregister_user() and
l2cap_conn_put() on the snapshot. Call it from both
hidp_connection_del() and hidp_session_thread()'s exit path. At
most one consumer wins the write-sem; later callers observe
session->conn == NULL and skip the unregister and put, so the
reference hidp_session_new() took via l2cap_conn_get() is consumed
exactly once. session_free() already tolerates a NULL session->conn.
Fixes: dbf666e4fc9b ("Bluetooth: HIDP: Fix possible UAF") Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/all/20260422011437.176643-1-michael.bommarito@gmail.com/ Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sizeof(conn->num_bis) is wrong - it would make sense to either use
conn->num_bis (before setting that to 0) or sizeof(conn->bis).
Fix it by using sizeof(conn->bis), the least intrusive change.
Luckily, nothing actually depends on this memset() working properly:
Nothing seems to ever read from conn->bis beyond conn->num_bis, and when
conn->num_bis is increased, the corresponding elements of conn->bis are
initialized. So I think this line could also just be removed.
This is a purely theoretical fix and should have no impact on actual
behavior.
Fixes: 42ecf1947135 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: RFCOMM: pull credit byte with skb_pull_data()
rfcomm_recv_data() treats the first payload byte as a credit field when
the UIH frame carries PF and credit-based flow control is enabled.
After the header has been stripped, the PF/CFC path consumes that byte
with a direct skb->data dereference followed by skb_pull(). A malformed
short frame can reach this path without a byte available.
Use skb_pull_data() so the length check and pull happen together before
the returned credit byte is consumed.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
virtbt_rx_handle() reads the leading pkt_type byte from the RX skb
and forwards the remainder to hci_recv_frame() for every
event/ACL/SCO/ISO type, without checking that the remaining payload
is at least the fixed HCI header for that type.
After the preceding patch bounds the backend-supplied used.len to
[1, VIRTBT_RX_BUF_SIZE], a one-byte completion still reaches
hci_recv_frame() with skb->len already pulled to 0. If the byte
happened to be HCI_ACLDATA_PKT, the ACL-vs-ISO classification
fast-path in hci_dev_classify_pkt_type() dereferences
hci_acl_hdr(skb)->handle whenever the HCI device has an active
CIS_LINK, BIS_LINK, or PA_LINK connection, reading two bytes of
uninitialized RX-buffer data. The same hazard exists for every
packet type the driver accepts because none of the switch cases in
virtbt_rx_handle() check skb->len against the per-type minimum HCI
header size before handing the frame to the core.
After stripping pkt_type, require skb->len to cover the fixed
header size for the selected type (event 2, ACL 4, SCO 3, ISO 4)
before calling hci_recv_frame(); drop ratelimited otherwise.
Unknown pkt_type values still take the original kfree_skb() default
path.
Use bt_dev_err_ratelimited() because both the length and pkt_type
values come from an untrusted backend that can otherwise flood the
kernel log.
Fixes: 160fbcf3bfb9 ("Bluetooth: virtio_bt: Use skb_put to set length") Cc: stable@vger.kernel.org Cc: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: virtio_bt: clamp rx length before skb_put
virtbt_rx_work() calls skb_put(skb, len) where len comes directly
from virtqueue_get_buf() with no validation against the buffer we
posted to the device. The RX skb is allocated in virtbt_add_inbuf()
and exposed to virtio as exactly 1000 bytes via sg_init_one().
Checking len against skb_tailroom(skb) is not sufficient because
alloc_skb() can leave more tailroom than the 1000 bytes actually
handed to the device. A malicious or buggy backend can therefore
report used.len between 1001 and skb_tailroom(skb), causing skb_put()
to include uninitialized kernel heap bytes that were never written by
the device.
The same path also accepts len == 0, in which case skb_put(skb, 0)
leaves the skb empty but virtbt_rx_handle() still reads the pkt_type
byte from skb->data, consuming uninitialized memory.
Define VIRTBT_RX_BUF_SIZE once and reuse it in alloc_skb() and
sg_init_one(), and gate virtbt_rx_work() on that same constant so
the bound checked matches the buffer actually exposed to the device.
Reject used.len == 0 in the same gate so an empty completion can
no longer reach virtbt_rx_handle().
Use bt_dev_err_ratelimited() because the length value comes from an
untrusted backend that can otherwise flood the kernel log.
Same class of bug as commit c04db81cd028 ("net/9p: Fix buffer
overflow in USB transport layer"), which hardened the USB 9p
transport against unchecked device-reported length.
Fixes: 160fbcf3bfb9 ("Bluetooth: virtio_bt: Use skb_put to set length") Cc: stable@vger.kernel.org Cc: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: btmtk: validate WMT event SKB length before struct access
btmtk_usb_hci_wmt_sync() casts the WMT event response SKB data to
struct btmtk_hci_wmt_evt (7 bytes) and struct btmtk_hci_wmt_evt_funcc
(9 bytes) without first checking that the SKB contains enough data.
A short firmware response causes out-of-bounds reads from SKB tailroom.
Use skb_pull_data() to validate and advance past the base WMT event
header. For the FUNC_CTRL case, pull the additional status field bytes
before accessing them.
Fixes: d019930b0049 ("Bluetooth: btmtk: move btusb_mtk_hci_wmt_sync to btmtk.c") Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix data-race on iso_pi(sk) in socket and HCI event paths
Several iso_pi(sk) fields (qos, qos_user_set, bc_sid, base, base_len,
sync_handle, bc_num_bis) are written under lock_sock in
iso_sock_setsockopt() and iso_sock_bind(), but read and written under
hci_dev_lock only in two other paths:
- iso_connect_bis() / iso_connect_cis(), invoked from connect(2),
read qos/base/bc_sid and reset qos to default_qos on the
qos_user_set validation failure -- all without lock_sock.
- iso_connect_ind(), invoked from hci_rx_work, writes sync_handle,
bc_sid, qos.bcast.encryption, bc_num_bis, base and base_len on
PA_SYNC_ESTABLISHED / PAST_RECEIVED / BIG_INFO_ADV_REPORT /
PER_ADV_REPORT events. The BIG_INFO handler additionally passes
&iso_pi(sk)->qos together with sync_handle / bc_num_bis / bc_bis
to hci_conn_big_create_sync() while setsockopt may be mutating
them.
Acquire lock_sock around the affected accesses in both paths.
The locking order hci_dev_lock -> lock_sock matches the existing
iso_conn_big_sync() precedent, whose comment documents the same
requirement for hci_conn_big_create_sync(). The HCI connect/bind
helpers do not wait for command completion -- they enqueue work via
hci_cmd_sync_queue{,_once}() / hci_le_create_cis_pending() and
return -- so the added hold time is comparable to iso_conn_big_sync().
KCSAN report:
BUG: KCSAN: data-race in iso_connect_cis / iso_sock_setsockopt
read to 0xffffa3ae8ce3cdc8 of 1 bytes by task 335 on cpu 0:
iso_connect_cis+0x49f/0xa20
iso_sock_connect+0x60e/0xb40
__sys_connect_file+0xbd/0xe0
__sys_connect+0xe0/0x110
__x64_sys_connect+0x40/0x50
x64_sys_call+0xcad/0x1c60
do_syscall_64+0x133/0x590
entry_SYSCALL_64_after_hwframe+0x77/0x7f
write to 0xffffa3ae8ce3cdc8 of 60 bytes by task 334 on cpu 1:
iso_sock_setsockopt+0x69a/0x930
do_sock_setsockopt+0xc3/0x170
__sys_setsockopt+0xd1/0x130
__x64_sys_setsockopt+0x64/0x80
x64_sys_call+0x1547/0x1c60
do_syscall_64+0x133/0x590
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 334 Comm: iso_setup_race Not tainted 7.0.0-10949-g8541d8f725c6 #44 PREEMPT(lazy)
The iso_connect_ind() races were found by inspection.
Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: SeungJu Cheon <suunj1331@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix data-race on dst in iso_sock_connect()
iso_sock_connect() copies the destination address into
iso_pi(sk)->dst under lock_sock, then releases the lock and reads
it back with bacmp() to decide between the CIS and BIS connect
paths:
if (bacmp(&iso_pi(sk)->dst, BDADDR_ANY)) // <- no lock held
This read after release_sock() races with any concurrent write to
iso_pi(sk)->dst on the same socket.
Fix by reading the destination address directly from the local
sockaddr argument (sa->iso_bdaddr) instead of iso_pi(sk)->dst.
Since sa is a function-local argument, reading it requires no
locking and avoids the race.
This patch addresses only the bacmp() race in iso_sock_connect();
other unprotected iso_pi(sk) accesses are fixed separately in the
next patch.
KCSAN report:
BUG: KCSAN: data-race in memcmp+0x39/0xb0
race at unknown origin, with read to 0xffff8f96ea66dde3 of 1 bytes by task 549 on cpu 1:
memcmp+0x39/0xb0
iso_sock_connect+0x275/0xb40
__sys_connect_file+0xbd/0xe0
__sys_connect+0xe0/0x110
__x64_sys_connect+0x40/0x50
x64_sys_call+0xcad/0x1c60
do_syscall_64+0x133/0x590
entry_SYSCALL_64_after_hwframe+0x77/0x7f
value changed: 0x00 -> 0xee
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 549 Comm: iso_race_combin Not tainted 7.0.0-08391-g1d51b370a0f8 #40 PREEMPT(lazy)
Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: SeungJu Cheon <suunj1331@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized
When a fault is injected during hci_uart line discipline setup, the
proto open() callback may fail leaving hu->priv as NULL. A subsequent
TIOCSTI ioctl can trigger the recv() callback before priv is
initialized, causing a NULL pointer dereference.
Fix all four affected HCI UART protocol drivers by adding a NULL check
on hu->priv at the start of their recv() callbacks: h4, h5, ath and
bcsp.
Reported-by: syzbot+ff30eeab8e07b37d524e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ff30eeab8e07b37d524e Signed-off-by: Aurelien DESBRIERES <aurelien@hackers.camp> Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: btintel_pcie: treat boot stage bit 12 as warning
CSR boot stage register bit 12 is documented as a device warning,
not a fatal error. Rename the bit definition accordingly and stop
including it in btintel_pcie_in_error().
This keeps warning-only boot stage values from being classified as
errors while preserving abort-handler state as the actual error
condition.
Fixes: 190377500fde ("Bluetooth: btintel_pcie: Dump debug registers on error") Signed-off-by: Kiran K <kiran.k@intel.com> Signed-off-by: Sai Teja Aluvala <aluvala.sai.teja@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pauli Virtanen [Sat, 18 Apr 2026 15:41:12 +0000 (18:41 +0300)]
Bluetooth: SCO: hold sk properly in sco_conn_ready
sk deref in sco_conn_ready must be done either under conn->lock, or
holding a refcount, to avoid concurrent close. conn->sk and parent sk is
currently accessed without either, and without checking parent->sk_state:
[Task 1] [Task 2]
sco_sock_release
sco_conn_ready
sk = conn->sk
lock_sock(sk)
conn->sk = NULL
lock_sock(sk)
release_sock(sk)
sco_sock_kill(sk)
UAF on sk deref
and similarly for access to sco_get_sock_listen() return value.
Fix possible UAF by holding sk refcount in sco_conn_ready() and making
sco_get_sock_listen() increase refcount. Also recheck after lock_sock
that the socket is still valid. Adjust conn->sk locking so it's
protected also by lock_sock() of the associated socket if any.
Fixes: 27c24fda62b60 ("Bluetooth: switch to lock_sock in SCO") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_conn_ready() [hdev->lock via hci_cb_list_lock]
...
mutex_lock(&conn->lock)
This is a classic AB/BA deadlock which lockdep reports as a circular
locking dependency when connecting a BLE MIDI keyboard (Carry-On FC-49).
Fix this by making hci_le_conn_update() defer the HCI command through
hci_cmd_sync_queue() so it no longer needs to take hdev->lock in the
caller context. The sync callback uses __hci_cmd_sync_status_sk() to
wait for the HCI_EV_LE_CONN_UPDATE_COMPLETE event, then updates the
stored connection parameters (hci_conn_params) and notifies userspace
(mgmt_new_conn_param) only after the controller has confirmed the update.
A reference on hci_conn is held via hci_conn_get()/hci_conn_put() for
the lifetime of the queued work to prevent use-after-free, and
hci_conn_valid() is checked before proceeding in case the connection was
removed while the work was pending. The hci_dev_lock is held across
hci_conn_valid() and all conn field accesses to prevent a concurrent
disconnect from invalidating the connection mid-use.
Fixes: f044eb0524a0 ("Bluetooth: Store latency and supervision timeout in connection params") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Dudu Lu [Wed, 15 Apr 2026 10:43:55 +0000 (18:43 +0800)]
Bluetooth: l2cap: fix MPS check in l2cap_ecred_reconf_req
The L2CAP specification states that if more than one channel is being
reconfigured, the MPS shall not be decreased. The current check has
two issues:
1) The comparison uses >= (greater-than-or-equal), which incorrectly
rejects reconfiguration requests where the MPS stays the same.
Since the spec says MPS "shall be greater than or equal to the
current MPS", only a strict decrease (remote_mps > mps) should be
rejected. Keeping the same MPS is valid.
2) The multi-channel guard uses `&& i` (loop index) to approximate
"more than one channel", but this incorrectly allows MPS decrease
for the first channel (i==0) even when multiple channels are being
reconfigured. Replace with `&& num_scid > 1` which correctly
checks whether the request covers more than one channel.
Fixes: 7accb1c4321a ("Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ") Signed-off-by: Dudu Lu <phx0fer@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Dudu Lu [Wed, 15 Apr 2026 09:39:53 +0000 (17:39 +0800)]
Bluetooth: bnep: fix incorrect length parsing in bnep_rx_frame() extension handling
In bnep_rx_frame(), the BNEP_FILTER_NET_TYPE_SET and
BNEP_FILTER_MULTI_ADDR_SET extension header parsing has two bugs:
1) The 2-byte length field is read with *(u16 *)(skb->data + 1), which
performs a native-endian read. The BNEP protocol specifies this field
in big-endian (network byte order), and the same file correctly uses
get_unaligned_be16() for the identical fields in
bnep_ctrl_set_netfilter() and bnep_ctrl_set_mcfilter().
2) The length is multiplied by 2, but unlike BNEP_SETUP_CONN_REQ where
the length byte counts UUID pairs (requiring * 2 for two UUIDs per
entry), the filter extension length field already represents the total
data size in bytes. This is confirmed by bnep_ctrl_set_netfilter()
which reads the same field as a byte count and divides by 4 to get
the number of filter entries.
The bogus * 2 means skb_pull advances twice as far as it should,
either dropping valid data from the next header or causing the pull
to fail entirely when the doubled length exceeds the remaining skb.
Fix by splitting the pull into two steps: first use skb_pull_data() to
safely pull and validate the 3-byte fixed header (ctrl type + length),
then pull the variable-length data using the properly decoded length.
Fixes: bf8b9a9cb77b ("Bluetooth: bnep: Add support to extended headers of control frames") Signed-off-by: Dudu Lu <phx0fer@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: hci_event: Fix OOB read and infinite loop in hci_le_create_big_complete_evt
hci_le_create_big_complete_evt() iterates over BT_BOUND connections for
a BIG handle using a while loop, accessing ev->bis_handle[i++] on each
iteration. However, there is no check that i stays within ev->num_bis
before the array access.
When a controller sends a LE_Create_BIG_Complete event with fewer
bis_handle entries than there are BT_BOUND connections for that BIG,
or with num_bis=0, the loop reads beyond the valid bis_handle[] flex
array into adjacent heap memory. Since the out-of-bounds values
typically exceed HCI_CONN_HANDLE_MAX (0x0EFF), hci_conn_set_handle()
rejects them and the connection remains in BT_BOUND state. The same
connection is then found again by hci_conn_hash_lookup_big_state(),
creating an infinite loop with hci_dev_lock held.
Fix this by terminating the BIG if in case not all BIS could be setup
properly.
Fixes: a0bfde167b50 ("Bluetooth: ISO: Add support for connecting multiple BISes") Cc: stable@vger.kernel.org Signed-off-by: ZhiTao Ou <hkbinbinbin@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
David Carlier [Sun, 12 Apr 2026 20:29:16 +0000 (21:29 +0100)]
Bluetooth: hci_conn: fix potential UAF in create_big_sync
Add hci_conn_valid() check in create_big_sync() to detect stale
connections before proceeding with BIG creation. Handle the
resulting -ECANCELED in create_big_complete() and re-validate the
connection under hci_dev_lock() before dereferencing, matching the
pattern used by create_le_conn_complete() and create_pa_complete().
Keep the hci_conn object alive across the async boundary by taking
a reference via hci_conn_get() when queueing create_big_sync(), and
dropping it in the completion callback. The refcount and the lock
are complementary: the refcount keeps the object allocated, while
hci_dev_lock() serializes hci_conn_hash_del()'s list_del_rcu() on
hdev->conn_hash, as required by hci_conn_del().
hci_conn_put() is called outside hci_dev_unlock() so the final put
(which resolves to kfree() via bt_link_release) does not run under
hdev->lock, though the release path would be safe either way.
Without this, create_big_complete() would unconditionally
dereference the conn pointer on error, causing a use-after-free
via hci_connect_cfm() and hci_conn_del().
Fixes: eca0ae4aea66 ("Bluetooth: Add initial implementation of BIS connections") Cc: stable@vger.kernel.org Co-developed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pauli Virtanen [Sun, 12 Apr 2026 18:47:42 +0000 (21:47 +0300)]
Bluetooth: SCO: fix sleeping under spinlock in sco_conn_ready
sco_conn_ready calls sleeping functions under conn->lock spinlock.
The critical section can be reduced: conn->hcon is modified only with
hdev->lock held. It is guaranteed to be held in sco_conn_ready, so
conn->lock is not needed to guard it.
Move taking conn->lock after lock_sock(parent). This also follows the
lock ordering lock_sock() > conn->lock elsewhere in the file.
Fixes: 27c24fda62b60 ("Bluetooth: switch to lock_sock in SCO") Signed-off-by: Pauli Virtanen <pav@iki.fi> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
- Fix build failures introduced when allowing to build 32-/64-bit only
VDSO
- Switch to dynamic parisc root device to avoid upcoming warnings
- Fix IRQ leak in LASI driver
* tag 'parisc-for-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix IRQ leak in LASI driver
parisc: Fix 64-bit kernel build when CONFIG_COMPAT=n
parisc: Fix build failure for 32-bit kernel with PA2.0 instruction set
parisc: drivers: switch to dynamic root device
Revert "parisc: led: fix reference leak on failed device registration"
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM/arm64 and KVM/s390 will eventually share some code. Add me as
a cross-reviewer from the s390 team to arm64 to help to keep both
architectures in sync.
KVM: arm64: Handle permission faults with guest_memfd
gmem_abort() calls kvm_pgtable_stage2_map() to make changes to stage 2. It
does this for both relaxing permissions on an existing mapping and to
install a missing mapping.
kvm_pgtable_stage2_map() doesn't make changes to stage 2 if there is an
existing, valid entry and the new entry modifies only the permissions.
This is checked in:
and if only the permissions differ, kvm_pgtable_stage2_map() returns
-EAGAIN and KVM returns to the guest to replay the instruction. The
assumption is that a concurrent fault on a different VCPU already mapped
the faulting IPA, and replaying the instruction will either succeed, or
cause a permission fault, which should be handled with
kvm_pgtable_stage2_relax_perms().
gmem_abort(), on a read or write fault on a system without DIC (instruction
cache invalidation required for data to instruction coherence), installs a
valid entry with read and write permissions, but without executable
permissions. On an execution fault on the same page, gmem_abort() attempts
to relax the permissions to allow execution, but calls
kvm_pgtable_stage2_map() to change the existing, valid, entry.
kvm_pgtable_stage2_map() returns -EAGAIN and KVM resumes execution from the
faulting instruction, which leads to an infinite loop of permission faults
on the same instruction.
Allow the guest to make progress by using kvm_pgtable_stage2_relax_perms()
to relax permissions.
Wei-Lin Chang [Tue, 5 May 2026 14:47:35 +0000 (15:47 +0100)]
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
When running an nVHE L1, TCR_EL2 is mapped to TCR_EL1. Writes to the
register are trapped and written to TCR_EL1 after a translation.
Booting an nVHE L1 with 52-bit VA isn't working because the translation
was ignoring the DS bit set by the guest, hence causing repeating level
0 faults. Add it in the translation function.
James Morse [Tue, 5 May 2026 16:52:03 +0000 (17:52 +0100)]
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
C1-Pro cores with SME have an erratum where TLBI+DSB does not complete
all outstanding SME accesses. Instead a DSB needs to be executed on the
affected CPUs. The implication is that pages cannot be unmapped from the
host Stage 2 and then provided to a protected guest or to the
hypervisor. Host SME accesses may still complete after this point.
This erratum breaks pKVM's guarantees, and the workaround is hard to
implement as EL2 and EL1 share a security state meaning EL1 can mask
IPIs sent by EL2, leading to interrupt blackouts.
Instead, do this in EL3. This has the advantage of a separate security
state, meaning lower EL cannot mask the IPI. It is also simpler for EL3
to know about CPUs that are off or in PSCI's CPU_SUSPEND.
Add the needed hook to host_stage2_set_owner_metadata_locked(). This
covers the cases where the host loses access to a page:
__pkvm_host_donate_guest()
__pkvm_guest_unshare_host()
host_stage2_set_owner_locked() when owner_id == PKVM_ID_HYP
Since pKVM relies on the firmware call for correctness, check for the
firmware counterpart during protected KVM initialisation and fail the
pKVM initialisation if it is missing.
Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: Sudeep Holla <sudeep.holla@kernel.org> Link: https://patch.msgid.link/20260505165205.2690919-1-catalin.marinas@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Vincent Guittot [Sun, 3 May 2026 10:45:03 +0000 (12:45 +0200)]
sched/fair: Fix wakeup_preempt_fair() for not waking up task
Make sure to only call pick_next_entity() on an non-empty cfs_rq.
The assumption that p is always enqueued and not delayed, is only true for
wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and
the cfs might become empty. Test if there are still queued tasks before trying
again to determine if p could be the next one to be picked.
There are at least 2 cases:
When cfs becomes idle, it tries to pull tasks but if those pulled tasks are
delayed, they will be dequeued when attached to cfs. attach_tasks() ->
attach_task() -> wakeup_preempt(rq, p, 0);
A misfit task running on cfs A triggers a load balance to be pulled on a better
cpu, the load balance on cfs B starts an active load balance to pulled the
running misfit task. If there is a delayed dequeue task on cfs A, it can be
pulled instead of the previously running misfit task. attach_one_task() ->
attach_task() -> wakeup_preempt(rq, p, 0);
Zhan Xusheng [Fri, 1 May 2026 10:40:06 +0000 (12:40 +0200)]
sched/fair: Fix overflow in vruntime_eligible()
Zhan Xusheng reported running into sporadic a s64 mult overflow in
vruntime_eligible().
When constructing a worst case scenario:
If you have cgroups, then you can have an entity of weight 2 (per
calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC)
* NICE_0_LOAD, which is around 44 bits as per the comment on entity_key().
The other extreme is 100*NICE_0_LOAD, thus you get:
And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will
overflow s64.
Zhan suggested using __builtin_mul_overflow(), however after staring at
compiler output for various architectures using godbolt, it seems that using an
__int128 multiplication often results in better code.
Specifically, a number of architectures already compute the __int128 product to
determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By
explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern,
which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches
(no OF handling).
Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the
__builtin_mul_overflow() variant as a fallback.
[peterz: Changelog and __int128 bits] Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()") Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com> Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
Due to the incompatibility with TCMalloc the RSEQ optimizations and
extended features (time slice extensions) have been disabled and made
run-time conditional.
The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment
requirement.
The extension safe newer variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.
The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63 set) to check whether the kernel has overwritten
cpu_id_start with an actual CPU id value, which is guaranteed to not have
the top most bit set.
As this is part of their performance tuned magic, it's a pretty safe
assumption, that TCMalloc won't use a larger RSEQ size.
This allows the kernel to declare that registrations with a size greater
than the original size of 32 bytes, which is the cases since time slice
extensions got introduced, as RSEQ ABI v2 with the following differences to
the original behaviour:
1) Unconditional updates of the user read only fields (CPU, node, MMCID)
are removed. Those fields are only updated on registration, task
migration and MMCID changes.
2) Unconditional evaluation of the criticial section pointer is
removed. It's only evaluated when user space was interrupted and was
scheduled out or before delivering a signal in the interrupted
context.
3) The read/only requirement of the ID fields is enforced. When the
kernel detects that userspace manipulated the fields, the process is
terminated. This ensures that multiple entities (libraries) can
utilize RSEQ without interfering.
4) Todays extended RSEQ feature (time slice extensions) and future
extensions are only enabled in the v2 enabled mode.
Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.
Unfortunately that also affects users of older GLIBC versions which
register the original size of 32 bytes and do not evaluate the kernel
required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE.
That's the result of the lack of enforcement in the original implementation
and the unwillingness of a single entity to cooperate with the larger
ecosystem for many years.
Implement the required registration changes by restructuring the spaghetti
code and adding the size/version check. Also add documentation about the
differences of legacy and optimized RSEQ V2 mode.
Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints!
Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org Cc: stable@vger.kernel.org
Thomas Gleixner [Sun, 26 Apr 2026 14:21:02 +0000 (16:21 +0200)]
rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode
The optimized RSEQ V2 mode requires that user space adheres to the ABI
specification and does not modify the read-only fields cpu_id_start,
cpu_id, node_id and mm_cid behind the kernel's back.
While the kernel does not rely on these fields, the adherence to this is a
fundamental prerequisite to allow multiple entities, e.g. libraries, in an
application to utilize the full potential of RSEQ without stepping on each
other toes.
Validate this adherence on every update of these fields. If the kernel
detects that user space modified the fields, the application is force
terminated.
Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org Cc: stable@vger.kernel.org
Thomas Gleixner [Sun, 26 Apr 2026 15:51:07 +0000 (17:51 +0200)]
selftests/rseq: Validate legacy behavior
The RSEQ legacy mode behavior requires that the ID fields in the rseq
region are unconditionally updated on every context switch and before
signal delivery even if not required by the ABI specification.
To ensure that this behavior is preserved for legacy users in the future,
add a test which validates that with a sleep() and a signal sent to self.
Provide a run script which prevents GLIBC from registering a RSEQ region,
so that the test can register it's own legacy sized region.
Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.764705536%40kernel.org Cc: stable@vger.kernel.org
Colin Walters [Tue, 5 May 2026 22:42:57 +0000 (15:42 -0700)]
ovl: fix verity lazy-load guard broken by fsverity_active() semantic change
Commit f77f281b6118 ("fsverity: use a hashtable to find the
fsverity_info") made fsverity_active() check whether the inode has the
verity flag, rather than whether the inode's fsverity_info is loaded.
This broke ovl_ensure_verity_loaded(), which wants to load the
fsverity_info for any verity inodes that haven't had it loaded yet.
Therefore, to check that the fsverity_info hasn't yet been loaded, use
fsverity_get_info(inode) == NULL instead of !fsverity_active(inode).
Also, since fsverity_get_info() now involves a hash table lookup, put
the more lightweight IS_VERITY() flag check first.
Jakub Kicinski [Wed, 6 May 2026 14:29:31 +0000 (07:29 -0700)]
Merge tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Quite a number of fixes now:
- mac80211
- remove HT NSS validation to work with broken APs
(with a kunit fix now)
- remove 'static' that could cause races
- check station link lookup before further processing
- fix use-after-free due to delete in list iteration
- remove AP station on assoc failures to fix crashes
- ath12k
- fix OF node refcount imbalance
- fix queue flush ("REO update") in MLO
- fix RCU assert
- ath12k:
- fix Kconfig with POWER_SEQUENCING
- fix WMI buffer leaks on error conditions
- don't use uninitialized stack data when processing RSSI events
- fix logic for determining the peer ID in the RX path
- ath5k: fix a potential stack buffer overwrite
- rsi: fix thread lifetime race
- brcmfmac: fix potential UAF
- nl80211:
- stricter permissions/checks for PMK and netns
- fix netlink policy vs. code type confusion
- cw1200: revert a broken locking change
- various fixes to not trust values from firmware
* tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (25 commits)
wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation
wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
wifi: nl80211: fix NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST usage
wifi: mac80211: remove station if connection prep fails
wifi: mac80211: use safe list iteration in radar detect work
wifi: libertas: notify firmware load wait on disconnect
wifi: ath5k: do not access array OOB
wifi: ath12k: fix peer_id usage in normal RX path
wifi: ath12k: initialize RSSI dBm conversion event state
wifi: ath12k: fix leak in some ath12k_wmi_xxx() functions
wifi: cw1200: Revert "Fix locking in error paths"
wifi: mac80211: tests: mark HT check strict
wifi: rsi: fix kthread lifetime race between self-exit and external-stop
wifi: mac80211: drop stray 'static' from fast-RX rx_result
wifi: mac80211: check ieee80211_rx_data_set_link return in pubsta MLO path
wifi: nl80211: require admin perm on SET_PMK / DEL_PMK
wifi: libertas: fix integer underflow in process_cmdrequest()
wifi: b43legacy: enforce bounds check on firmware key index in RX path
wifi: b43: enforce bounds check on firmware key index in b43_rx()
wifi: brcmfmac: Fix potential use-after-free issue when stopping watchdog task
...
====================
Linus Torvalds [Wed, 6 May 2026 14:27:30 +0000 (07:27 -0700)]
Merge tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix issues in EFI graceful recovery on x86 introduced by changes to
the kernel mode FPU APIs
- I-cache coherency fixes for the LoongArch EFI stub
- Locking fix for EFI pstore
- Code tweak for efivarfs
* tag 'efi-fixes-for-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: Restore IRQ state in EFI page fault handler
x86/efi: Fix graceful fault handling after FPU softirq changes
efi/libstub: Synchronize instruction cache after kernel relocation
efi/loongarch: Implement efi_cache_sync_image()
efi/libstub: Move efi_relocate_kernel() into its only remaining user
efi: pstore: Drop efivar lock when efi_pstore_open() returns with an error
efivarfs: use QSTR() in efivarfs_alloc_dentry
Takashi Iwai [Wed, 6 May 2026 14:10:00 +0000 (16:10 +0200)]
Merge tag 'asoc-fix-v7.1-rc2' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v7.1
Another batch of fixes, plus a couple of quirks (mostly AMD ones, as has
been the case recently). All driver changes, including fixes for the
KUnit tests for the Cirrus drivers that could cause memory corruption.
ASoC: cs35l56: Don't use devres to unregister component
Manually call snd_soc_unregister_component() from cs35l56_remove()
instead of using devres cleanup. This ensures that the component is
destroyed before cs35l56_remove() starts cleanup of anything the
component code could be using.
Devres cleanup happens after the driver remove() callback, so if
snd_soc_register_component() is used, it will not be destroyed until
after cs35l56_remove() has returned. But there is some cleanup that
must be done in cs35l56_remove(), or wrapped in a custom devres
cleanup handler to ensure correct ordering. The simplest option is
to call snd_soc_unregister_component() at the start of cs35l56_remove().
Fixes: e49611252900 ("ASoC: cs35l56: Add driver for Cirrus Logic CS35L56") Closes: https://sashiko.dev/#/patchset/20260501103002.2843735-1-rf%40opensource.cirrus.com Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Link: https://patch.msgid.link/20260505161124.3621000-2-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
Breno Leitao [Tue, 5 May 2026 16:02:13 +0000 (09:02 -0700)]
arm64/fpsimd: ptrace: zero target's fpsimd_state, not the tracer's
sve_set_common() is the backend for PTRACE_SETREGSET(NT_ARM_SVE) and
PTRACE_SETREGSET(NT_ARM_SSVE). Every write in the function operates on
the tracee (target) - except a single memset that uses current instead,
zeroing the tracer's saved V0-V31 / FPSR / FPCR shadow on every ptrace
SETREGSET call.
The memset is meant to give the tracee a defined zero register image
before the user-supplied payload is copied in (for partial writes,
header-only writes, and FPSIMD<->SVE format switches). Aiming it at
current both denies the tracee that clean slate and silently corrupts
the tracer.
The corruption of the tracer's saved FPSIMD state is not always
observable. Where the tracer's state is live on a CPU, this may be
reused without loading the corrupted state from memory, and will
eventually be written back over the corrupted state. Where the tracer's
state is saved in SVE_PT_REGS_SVE format, only the FPSR and FPCR are
clobbered, and the effective copy of the vectors is in the task's
sve_state.
Reproducible on an arm64 kernel with SVE: a single-threaded tracer that
loads a known pattern into V0-V31, issues PTRACE_SETREGSET(NT_ARM_SVE)
on a child, and reads V0-V31 back observes them all zeroed within tens
of thousands of iterations when a sibling thread keeps stealing the
FPSIMD CPU binding.
Maoyi Xie [Mon, 4 May 2026 15:37:55 +0000 (23:37 +0800)]
io_uring/wait: honour caller's time namespace for IORING_ENTER_ABS_TIMER
io_uring_enter() with IORING_ENTER_ABS_TIMER takes an absolute
timespec from the caller via ext_arg->ts. It arms an ABS mode
hrtimer in __io_cqring_wait_schedule(). The conversion path in
io_uring/wait.c parses ext_arg->ts inline rather than going
through io_parse_user_time(). It therefore does not pick up the
time namespace conversion added by the previous patch.
Apply timens_ktime_to_host() to the parsed time on the
IORING_ENTER_ABS_TIMER branch. This mirrors the IORING_TIMEOUT_ABS
fix in io_parse_user_time(). Use ctx->clockid as the clock id.
ctx->clockid is set either at ring creation or via
IORING_REGISTER_CLOCK.
timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces. It is also a no-op for callers in the initial time
namespace. The fast path is unchanged.
Reproducer: in unshare --user --time, with a -10s monotonic
offset, call io_uring_enter with min_complete=1,
IORING_ENTER_ABS_TIMER, and ts = now + 1s. The call returns
-ETIME after <1ms instead of after the expected ~1s.
Maoyi Xie [Mon, 4 May 2026 15:37:54 +0000 (23:37 +0800)]
io_uring/timeout: honour caller's time namespace for IORING_TIMEOUT_ABS
io_uring's IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT accept a
timespec from the caller via io_parse_user_time(). With
IORING_TIMEOUT_ABS, the timestamp is an absolute deadline on the
selected clock. The clock is CLOCK_MONOTONIC by default.
CLOCK_BOOTTIME and CLOCK_REALTIME are also selectable.
A submitter inside a CLONE_NEWTIME time namespace observes
CLOCK_MONOTONIC and CLOCK_BOOTTIME shifted by the namespace's
offsets relative to the host. Every other ABS timer interface in
the kernel converts the caller's absolute time to host view via
timens_ktime_to_host() before arming an hrtimer:
io_parse_user_time() does not. As a result, an absolute timeout
submitted from within a time namespace is interpreted in host
view. That is generally a different point in time. It may already
be in the past, causing the timer to fire immediately, or far in
the future, causing the timer not to fire when expected.
Reproducer: in unshare --user --time, with a -10s monotonic
offset, submit IORING_OP_TIMEOUT with IORING_TIMEOUT_ABS and
deadline = now + 1s. The CQE is delivered after <1ms instead of
the expected ~1s.
Apply timens_ktime_to_host() to the parsed time when
IORING_TIMEOUT_ABS is set. Split the existing clock id resolver
in io_timeout_get_clock() into a flags only helper
io_flags_to_clock(), so io_parse_user_time() can resolve the
clock without a struct io_timeout_data.
timens_ktime_to_host() is a no-op for clocks not affected by time
namespaces, e.g. CLOCK_REALTIME. It is also a no-op for callers
in the initial time namespace. The fast path is unchanged.
SQPOLL is also covered. The SQPOLL kernel thread is created via
create_io_thread() with CLONE_THREAD and no CLONE_NEW* flag.
copy_namespaces() therefore shares the submitter's nsproxy by
reference. Inside the SQPOLL kthread, current->nsproxy->time_ns
is the submitter's time_ns. timens_ktime_to_host() resolves
correctly.
Ming Lei [Wed, 6 May 2026 08:22:38 +0000 (16:22 +0800)]
ublk: validate physical_bs_shift, io_min_shift and io_opt_shift
ublk_validate_params() checks logical_bs_shift is within
[9, PAGE_SHIFT] but has no upper bound for physical_bs_shift,
io_min_shift, or io_opt_shift. A malicious userspace can set any
of these to a large value (e.g., 44), causing undefined behavior
from `1 << shift` in ublk_ctrl_start_dev() since the result is
stored in 32-bit unsigned int.
Cap all three at ilog2(SZ_256M) (28). 256M is big enough to cover
all practical block sizes, and originates from the maximum physical
block size possible in NVMe (lba_size * (1 + npwg), where npwg is
16-bit).
Also zero out ub->params with memset() when copy_from_user() fails
or ublk_validate_params() returns error, so that no stale or partial
params survive for a subsequent START_DEV to consume.
Maoyi Xie [Wed, 6 May 2026 06:48:54 +0000 (14:48 +0800)]
wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation
NL80211_CMD_GET_SCAN is implemented as a multi-call dumpit. The first
invocation of nl80211_prepare_wdev_dump() validates the requested wdev
against the caller's netns via __cfg80211_wdev_from_attrs(). Subsequent
invocations look up the same wiphy by its global index and do not check
that the wiphy is still in the caller's netns.
Add the same filter to the continuation path. If the wiphy's netns no
longer matches the caller's, return -ENODEV and the netlink dump
machinery terminates the walk cleanly.
Maoyi Xie [Wed, 6 May 2026 06:48:53 +0000 (14:48 +0800)]
wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
NL80211_CMD_SET_WIPHY_NETNS dispatches with GENL_UNS_ADMIN_PERM, which
verifies that the caller has CAP_NET_ADMIN for the source netns. It
doesn't verify that the caller has CAP_NET_ADMIN over the target netns
selected by NL80211_ATTR_NETNS_FD or NL80211_ATTR_PID.
This diverges from the convention enforced in
net/core/rtnetlink.c::rtnl_get_net_ns_capable():
/* For now, the caller is required to have CAP_NET_ADMIN in
* the user namespace owning the target net ns.
*/
if (!sk_ns_capable(sk, net->user_ns, CAP_NET_ADMIN))
return ERR_PTR(-EACCES);
A user with CAP_NET_ADMIN in their own user namespace can therefore
push a wiphy into an arbitrary netns (including init_net) over which
they have no privilege.
Mirror the rtnetlink convention by requiring CAP_NET_ADMIN in the
target netns before calling cfg80211_switch_netns().
This is documented as a u8 and has a policy of NLA_U8, but uses
nla_get_u32() which means it's completely broken on big-endian.
Fix it to use nla_get_u8().
Johannes Berg [Tue, 5 May 2026 13:15:34 +0000 (15:15 +0200)]
wifi: mac80211: remove station if connection prep fails
If connection preparation fails for MLO connections, then the
interface is completely reset to non-MLD. In this case, we must
not keep the station since it's related to the link of the vif
being removed. Delete an existing station. Any "new_sta" is
already being removed, so that doesn't need changes.
This fixes a use-after-free/double-free in debugfs if that's
enabled, because a vif going from MLD (and to MLD, but that's
not relevant here) recreates its entire debugfs.
Cássio Gabriel [Wed, 6 May 2026 03:34:47 +0000 (00:34 -0300)]
ALSA: core: Serialize deferred fasync state checks
snd_fasync_helper() updates fasync->on under snd_fasync_lock, and
snd_fasync_work_fn() now also evaluates fasync->on under the same
lock. snd_kill_fasync() still tests the flag before taking the lock,
leaving an unsynchronized read against FASYNC enable/disable updates.
Move the enabled-state check into the locked section.
Also clear fasync->on under snd_fasync_lock in snd_fasync_free()
before unlinking the pending entry. Together with the locked sender-side
check, this publishes teardown before flushing the deferred work and
prevents a racing sender from requeueing the entry after free has
started.
Fixes: ef34a0ae7a26 ("ALSA: core: Add async signal helpers") Fixes: 8146cd333d23 ("ALSA: core: Fix potential data race at fasync handling") Cc: stable@vger.kernel.org Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com> Link: https://patch.msgid.link/20260506-alsa-core-fasync-on-lock-v1-1-ea48c77d6ca4@gmail.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
Rodrigo Faria [Tue, 5 May 2026 18:55:18 +0000 (19:55 +0100)]
ALSA: hda/realtek: Add mute LED fixup for HP Pavilion 15-cs1xxx
Add a new fixup for the mute LED on the HP Pavilion 15-cs1xxx series
using the VREF on NID 0x1b.
The BIOS on these models (tested up to F.32) incorrectly reports
the mute LED on NID 0x18 via DMI OEM strings, which lacks VREF
capabilities. This fixup overrides the LED pin to the correct
NID 0x1b.
Cássio Gabriel [Wed, 6 May 2026 03:15:48 +0000 (00:15 -0300)]
ALSA: seq: Fix UMP group 16 filtering
The sequencer UAPI defines group_filter as an unsigned int bitmap.
Bit 0 filters groupless messages and bits 1-16 filter UMP groups 1-16.
The internal snd_seq_client storage is only unsigned short, so bit 16
is truncated when userspace sets the filter. The same truncation affects
the automatic UMP client filter used to avoid delivery to inactive
groups, so events for group 16 cannot be filtered.
Store the internal bitmap as unsigned int and keep both userspace-provided
and automatically generated values limited to the defined UAPI bits.
timers/migration: Fix another hotplug activation race
The hotplug control CPU is assumed to be active in the hierarchy but
that doesn't imply that the root is active. If the current CPU is not
the one that activated the current hierarchy, and the CPU performing
this duty is still halfway through the tree, the root may still be
observed inactive. And this can break the activation of a new root as in
the following scenario:
1) Initially, the whole system has 64 CPUs and only CPU 63 is awake.
[GRP1:0]
active
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
idle idle active
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle idle active
2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the
[GRP1:0]->parent dereference (that would be NULL and stop the walk)
in __walk_groups_from().
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
idle idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle idle idle
3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate
up to GRP1:0 due to yet another #VMEXIT.
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
active idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
idle active idle
3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1
role.
[GRP1:0]
idle
/ | \
/ | \
[GRP0:0] [...] [GRP0:7]
active idle idle
/ | \ |
CPU 0 CPU 1 ... CPU 63
active active idle
4) CPU 0 boots CPU 64. It creates a new root for it.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
5) CPU 0 activates the new root, but note that GRP1:0 is still idle,
waiting for CPU 1 to resume from #VMEXIT and activate it.
[GRP2:0]
active
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent.
Therefore it propagates the stale inactive state of GRP1:0 up to
GRP2:0.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
idle idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it
doesn't observe its parent link because no ordering enforced that.
Therefore GRP2:0 is spuriously left idle.
[GRP2:0]
idle
/ \
/ \
[GRP1:0] [GRP1:1]
active idle
/ | \ \
/ | \ \
[GRP0:0] [...] [GRP0:7] [GRP0:8]
active idle idle idle
/ | \ | |
CPU 0 CPU 1 ... CPU 63 CPU 64
active active idle offline
Such races are highly theoretical and the problem would solve itself
once the old root ever becomes idle again. But it still leaves a taste
of discomfort.
Fix it with enforcing a fully ordered atomic read of the old root state
before propagating the activate state up to the new root. It has a two
directions ordering effect:
* Acquire + release of the latest old root state: If the hotplug control
CPU is not the one that woke up the old root, make sure to acquire its
active state and propagate it upwards through the ordered chain of
activation (the acquire pairs with the cmpxchg() in tmigr_active_up()
and subsequent releases will pair with atomic_read_acquire() and
smp_mb__after_atomic() in tmigr_inactive_up()).
* Release: If the hotplug control CPU is not the one that must wake up
the old root, but the CPU covering that is lagging behind its duty,
publish the links from the old root to the new parents. This way the
lagging CPU will propagate the active state itself.
Linus Torvalds [Wed, 6 May 2026 02:44:46 +0000 (19:44 -0700)]
Merge tag 'loongarch-fixes-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Fix some build and runtime issues after 32BIT Kconfig option enabled,
improve the platform-specific PCI controller compatibility, drop
custom __arch_vdso_hres_capable(), and fix a lot of KVM bugs"
* tag 'loongarch-fixes-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Move unconditional delay into timer clear scenery
LoongArch: KVM: Fix HW timer interrupt lost when inject interrupt by software
LoongArch: KVM: Move AVEC interrupt injection into switch loop
LoongArch: KVM: Use kvm_set_pte() in kvm_flush_pte()
LoongArch: KVM: Fix missing EMULATE_FAIL in kvm_emu_mmio_read()
LoongArch: KVM: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
LoongArch: KVM: Fix "unreliable stack" for kvm_exc_entry
LoongArch: KVM: Compile switch.S directly into the kernel
LoongArch: vDSO: Drop custom __arch_vdso_hres_capable()
LoongArch: Fix potential ADE in loongson_gpu_fixup_dma_hang()
LoongArch: Use per-root-bridge PCIH flag to skip mem resource fixup
LoongArch: Fix SYM_SIGFUNC_START definition for 32BIT
LoongArch: Specify -m32/-m64 explicitly for 32BIT/64BIT
LoongArch: Make CONFIG_64BIT as the default option
Jason Xing [Sat, 2 May 2026 20:07:22 +0000 (23:07 +0300)]
xsk: fix u64 descriptor address truncation on 32-bit architectures
In copy mode TX, xsk_skb_destructor_set_addr() stores the 64-bit
descriptor address into skb_shinfo(skb)->destructor_arg (void *) via a
uintptr_t cast:
On 32-bit architectures uintptr_t is 32 bits, so the upper 32 bits of
the descriptor address are silently dropped. In XDP_ZEROCOPY unaligned
mode the chunk offset is encoded in bits 48-63 of the descriptor
address (XSK_UNALIGNED_BUF_OFFSET_SHIFT = 48), meaning the offset is
lost entirely. The completion queue then returns a truncated address to
userspace, making buffer recycling impossible.
Fix this by handling the 32-bit case directly in
xsk_skb_destructor_set_addr(): when !CONFIG_64BIT, allocate an
xsk_addrs struct (the same path already used for multi-descriptor
SKBs) to store the full u64 address. The existing tagged-pointer logic
in xsk_skb_destructor_is_addr() stays unchanged: slab pointers returned
from kmem_cache_zalloc() are always word-aligned and therefore have
bit 0 clear, which correctly identifies them as a struct pointer
rather than an inline tagged address on every architecture.
Factor the shared kmem_cache_zalloc + destructor_arg assignment into
__xsk_addrs_alloc() and add a wrapper xsk_addrs_alloc() that handles
the inline-to-list upgrade (is_addr check + get_addr + num_descs = 1).
The three former open-coded kmem_cache_zalloc call sites now reduce to
a single call each.
Propagate the -ENOMEM from xsk_skb_destructor_set_addr() through
xsk_skb_init_misc() so the caller can clean up the skb via kfree_skb()
before skb->destructor is installed.
The overhead is one extra kmem_cache_zalloc per first descriptor on
32-bit only; 64-bit builds are completely unchanged.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/ Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number") Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-9-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:21 +0000 (23:07 +0300)]
xsk: fix xsk_addrs slab leak on multi-buffer error path
When xsk_build_skb() / xsk_build_skb_zerocopy() sees the first
continuation descriptor, it promotes destructor_arg from an inlined
address to a freshly allocated xsk_addrs (num_descs = 1). The counter
is bumped to >= 2 only at the very end of a successful build (by calling
xsk_inc_num_desc()).
If the build fails in between (e.g. alloc_page() returns NULL with
-EAGAIN, or the MAX_SKB_FRAGS overflow hits), we jump to free_err, skip
calling xsk_inc_num_desc() to increment num_descs and leave the half-built
skb attached to xs->skb for the app to retry. The skb now has
1) destructor_arg = a real xsk_addrs pointer,
2) num_descs = 1
If the app never retries and just close()s the socket, xsk_release()
calls xsk_drop_skb() -> xsk_consume_skb(), which decides whether to
free xsk_addrs by testing num_descs > 1:
if (unlikely(num_descs > 1))
kmem_cache_free(xsk_tx_generic_cache, destructor_arg);
Because num_descs is exactly 1 the branch is skipped and the
xsk_addrs object is leaked to the xsk_tx_generic_cache slab.
Fix it by directly testing if destructor_arg is still addr. Or else it
is modified and used to store the newly allocated memory from
xsk_tx_generic_cache regardless of increment of num_desc, which we
need to handle.
Closes: https://lore.kernel.org/all/20260419045824.D9E5EC2BCAF@smtp.kernel.org/ Fixes: 0ebc27a4c67d ("xsk: avoid data corruption on cq descriptor number") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-8-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:20 +0000 (23:07 +0300)]
xsk: avoid skb leak in XDP_TX_METADATA case
Fix it by explicitly adding kfree_skb() before returning back to its
caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means no frag and xs->skb is
NULL) and users enable metadata feature.
2. xsk_skb_metadata() returns a error code.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'.
4. there is no chance to free this skb anymore.
Closes: https://lore.kernel.org/all/20260415085204.3F87AC19424@smtp.kernel.org/ Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-7-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:19 +0000 (23:07 +0300)]
xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
Once xsk_skb_init_misc() has been called on an skb, its destructor is
set to xsk_destruct_skb(), which submits the descriptor address(es) to
the completion queue and advances the CQ producer. If such an skb is
subsequently freed via kfree_skb() along an error path - before the
skb has ever been handed to the driver - the destructor still runs and
submits a bogus, half-initialized address to the CQ.
Postpone the init phase when we believe the allocation of first frag is
successfully completed. Before this init, skb can be safely freed by
kfree_skb().
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/ Fixes: c30d084960cf ("xsk: avoid overwriting skb fields for multi-buffer traffic") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-6-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:18 +0000 (23:07 +0300)]
xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
When xsk_build_skb() processes multi-buffer packets in copy mode, the
first descriptor stores data into the skb linear area without adding
any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
to accumulate subsequent descriptors.
If a continuation descriptor fails (e.g. alloc_page returns NULL with
-EAGAIN), we jump to free_err where the condition:
if (skb && !skb_shinfo(skb)->nr_frags)
kfree_skb(skb);
evaluates to true because nr_frags is still 0 (the first descriptor
used the linear area, not frags). This frees the skb while xs->skb
still points to it, creating a dangling pointer. On the next transmit
attempt or socket close, xs->skb is dereferenced, causing a
use-after-free or double-free.
Fix by using a !xs->skb check to handle first frag situation, ensuring
we only free skbs that were freshly allocated in this call
(xs->skb is NULL) and never free an in-progress multi-buffer skb that
the caller still references.
Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/ Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-5-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:17 +0000 (23:07 +0300)]
xsk: handle NULL dereference of the skb without frags issue
When a first descriptor (xs->skb == NULL) triggers -EOVERFLOW in
xsk_build_skb_zerocopy() (e.g., MAX_SKB_FRAGS exceeded), the
free_err -EOVERFLOW handler unconditionally dereferences xs->skb
via xsk_inc_num_desc(xs->skb) and xsk_drop_skb(xs->skb), causing
a NULL pointer dereference.
Fix this by guarding the existing xsk_inc_num_desc()/xsk_drop_skb()
calls with an xs->skb check (for the continuation case), and add
an else branch for the first-descriptor case that manually cancels
the one reserved CQ slot and increments invalid_descs by one to
account for the single invalid descriptor.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-4-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:16 +0000 (23:07 +0300)]
xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
Fix it by explicitly adding kfree_skb() before returning back to its
caller.
How to reproduce it in virtio_net:
1. the current skb is the first one (which means xs->skb is NULL) and
hit the limit MAX_SKB_FRAGS.
2. xsk_build_skb_zerocopy() returns -EOVERFLOW.
3. the caller xsk_build_skb() clears skb by using 'skb = NULL;'. This
is why bug can be triggered.
4. there is no chance to free this skb anymore.
Note that if in this case the xs->skb is not NULL, xsk_build_skb() will
call xsk_drop_skb(xs->skb) to do the right thing.
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path") Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20260502200722.53960-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing [Sat, 2 May 2026 20:07:15 +0000 (23:07 +0300)]
xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
skb_checksum_help() is a common helper that writes the folded
16-bit checksum back via skb->data + csum_start + csum_offset,
i.e. it relies on the skb's linear head and fails (with WARN_ONCE
and -EINVAL) when skb_headlen() is 0.
AF_XDP generic xmit takes two very different paths depending on the
netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
skip the "copy payload into a linear head" step on purpose as a
performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
pages as frags and never calls skb_put(), so skb_headlen() stays 0
for the whole skb. For these skbs there is simply no linear area for
skb_checksum_help() to write the csum into - the sw-csum fallback is
structurally inapplicable.
The patch tries to catch this and reject the combination with error at
setup time. Rejecting at bind() converts this silent per-packet failure
into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
because they do not call skb_checksum_help().
Without the patch, every descriptor carrying 'XDP_TX_METADATA |
XDP_TXMD_FLAGS_CHECKSUM' produces:
1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
2) sendmsg() returning -EINVAL without consuming the descriptor
(invalid_descs is not incremented),
3) a wedged TX ring: __xsk_generic_xmit() does not advance the
consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
the same descriptor and re-hits the same WARN until the socket
is closed.
Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function") Link: https://patch.msgid.link/20260502200722.53960-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
powerpc/vdso: Drop -DCC_USING_PATCHABLE_FUNCTION_ENTRY from 32-bit flags with clang
After commit 73cdf24e81e4 ("powerpc64: make clang cross-build
friendly"), building 64-bit little endian + CONFIG_COMPAT=y with clang
results in many warnings along the lines of:
$ make -skj"$(nproc)" ARCH=powerpc LLVM=1 ppc64le_defconfig compat.config arch/powerpc/kernel/vdso/
...
In file included from <built-in>:4:
In file included from lib/vdso/gettimeofday.c:6:
In file included from include/vdso/datapage.h:15:
In file included from include/vdso/cache.h:5:
arch/powerpc/include/asm/cache.h:77:8: warning: unknown attribute 'patchable_function_entry' ignored [-Wunknown-attributes]
77 | static inline u32 l1_icache_bytes(void)
| ^~~~~~
include/linux/compiler_types.h:235:58: note: expanded from macro 'inline'
235 | #define inline inline __gnu_inline __inline_maybe_unused notrace
| ^~~~~~~
include/linux/compiler_types.h:215:34: note: expanded from macro 'notrace'
215 | #define notrace __attribute__((patchable_function_entry(0, 0)))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
arch/powerpc/Makefile adds -DCC_USING_PATCHABLE_FUNCTION_ENTRY to
KBUILD_CPPFLAGS, which is inherited by the 32-bit vDSO. However, the
32-bit little endian target does not support
'-fpatchable-function-entry', resulting in the warnings above.
Remove -DCC_USING_PATCHABLE_FUNCTION_ENTRY from the 32-bit vDSO flags
when building with clang to avoid the warnings.
Tzung-Bi Shih [Tue, 5 May 2026 05:34:03 +0000 (05:34 +0000)]
platform/chrome: cros_ec_typec: Init mutex in Thunderbolt registration
cros_typec_register_thunderbolt() missed initializing the `adata->lock`
mutex. This leads to a NULL dereference when the mutex is later
acquired (e.g. in cros_typec_altmode_work()).
Initialize the mutex in cros_typec_register_thunderbolt() to fix the
issue.
Jakub Kicinski [Wed, 6 May 2026 02:13:12 +0000 (19:13 -0700)]
Merge branch 'net-mlx5-fixes-for-socket-direct'
Tariq Toukan says:
====================
net/mlx5: Fixes for Socket-Direct
This series fixes several race conditions and bugs in the mlx5
Socket-Direct (SD) single netdev flow.
Patch 1 serializes mlx5_sd_init()/mlx5_sd_cleanup() with
mlx5_devcom_comp_lock() and tracks the SD group state on the primary
device, preventing concurrent or duplicate bring-up/tear-down.
Patch 2 fixes the debugfs "multi-pf" directory being stored on the
calling device's sd struct instead of the primary's, which caused
memory leaks and recreation errors when cleanup ran from a different PF.
Patch 3 fixes a race where a secondary PF could access the primary's
auxiliary device after it had been unbound, by holding the primary's
device lock while operating on its auxiliary device.
Patch 4 fixes missing cleanup on ETH probe errors. The analogous gap on
the resume path requires introducing sd_suspend/resume APIs that only
destroy FW resources and is left for a follow-up series.
====================
Shay Drory [Mon, 4 May 2026 18:02:06 +0000 (21:02 +0300)]
net/mlx5e: SD, Fix race condition in secondary device probe/remove
When utilizing Socket-Direct single netdev functionality the driver
resolves the actual auxiliary device using mlx5_sd_get_adev(). However,
the current implementation returns the primary ETH auxiliary device
without holding the device lock, leading to a potential race condition
where the ETH device could be unbound or removed concurrently during
probe, suspend, resume, or remove operations.[1]
Fix this by introducing mlx5_sd_put_adev() and updating
mlx5_sd_get_adev() so that secondaries devices would get a ref and
acquire the device lock of the returned auxiliary device. After the lock
is acquired, a second devcom check is needed[2].
In addition, update The callers to pair the get operation with the new
put operation, ensuring the lock is held while the auxiliary device is
being operated on and released afterwards.
The "primary" designation is determined once in sd_register(). It's set
before devcom is marked ready, and it never changes after that.
In Addition, The primary path never locks a secondary: When the primary
device invoke mlx5_sd_get_adev(), it sees dev == primary and returns.
no additional lock is taken.
Therefore lock ordering is always: secondary_lock -> primary_lock. The
reverse never happens, so ABBA deadlock is impossible.
Shay Drory [Mon, 4 May 2026 18:02:05 +0000 (21:02 +0300)]
net/mlx5e: SD, Fix missing cleanup on probe error
When _mlx5e_probe() fails, the preceding successful mlx5_sd_init() is
not undone. Auxiliary bus probe failure skips binding, so mlx5e_remove()
is never called for that adev and the matching mlx5_sd_cleanup() never
runs - leaking the per-dev SD struct.
Call mlx5_sd_cleanup() on the probe error path to balance
mlx5_sd_init().
A similar gap exists on the resume path: mlx5_sd_init() and
mlx5_sd_cleanup() are currently bundled with both probe/remove and
suspend/resume, even though only the FW alias state actually needs to
follow the suspend/resume lifecycle - the sd struct allocation and
devcom membership are software state that should track the full bound
lifetime. As a result, a failed resume can leave a still-bound device
with sd == NULL, which mlx5_sd_get_adev() can't distinguish from a
non-SD device. Fixing this requires sd_suspend/resume APIs which will
only destroy FW resources and is left for a follow-up series.
Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260504180206.268568-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shay Drory [Mon, 4 May 2026 18:02:04 +0000 (21:02 +0300)]
net/mlx5: SD, Keep multi-pf debugfs entries on primary
mlx5_sd_init() creates the "multi-pf" debugfs directory under the
primary device debugfs root, but stored the dentry in the calling
device's sd struct. When sd_cleanup() run on a different PF,
this leads to using the wrong sd->dfs for removing entries, which
results in memory leak and an error in when re-creating the SD.[1]
Fix it by explicitly storing the debugfs dentry in the primary
device sd struct and use it for all per-group files.
[1]
debugfs: 'multi-pf' already exists in '0000:08:00.1'
Shay Drory [Mon, 4 May 2026 18:02:03 +0000 (21:02 +0300)]
net/mlx5: SD: Serialize init/cleanup
mlx5_sd_init() / mlx5_sd_cleanup() may run from multiple PFs in the same
Socket-Direct group. This can cause the SD bring-up/tear-down sequence
to be executed more than once or interleaved across PFs.
Protect SD init/cleanup with mlx5_devcom_comp_lock() and track the SD
group state on the primary device. Skip init if the primary is already
UP, and skip cleanup unless the primary is UP.
The state check on cleanup is needed because sd_register() drops the
devcom comp lock between marking the comp ready and assigning
primary_dev on each peer. A concurrent cleanup that acquires the lock
in this window would observe devcom_is_ready==true while primary_dev
is still NULL (causing mlx5_sd_get_primary() to return NULL) or while
the FW alias setup performed by mlx5_sd_init()'s body has not yet run
(causing sd_cmd_unset_primary() to dereference a NULL tx_ft). Gate the
cleanup body on primary_sd->state == MLX5_SD_STATE_UP, which is set
only at the very end of mlx5_sd_init() under the same comp lock - so
observing UP guarantees primary_dev, secondaries[], tx_ft, and dfs are
all populated. Also bail explicitly if mlx5_sd_get_primary() returns
NULL, in case state is checked on a peer whose primary_dev hasn't been
assigned yet.
In addition, move mlx5_devcom_comp_set_ready(false) from sd_unregister()
into the cleanup's locked section, including the !primary and
state != UP early-exit paths, so the device cannot unregister and free
its struct mlx5_sd while devcom is still marked ready. A concurrent
init acquiring the devcom lock will now observe devcom is no longer
ready and bail out immediately.
Fixes: 381978d28317 ("net/mlx5e: Create single netdev per SD group") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260504180206.268568-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The reason for failure is the existence of TX keys, which are removed by
the PSP dev unregistration happening in:
profile->cleanup() -> mlx5e_psp_unregister() -> mlx5e_psp_cleanup()
-> psp_dev_unregister()
...but this isn't invoked in the devlink reload flow, only when changing
the NIC profile (e.g. when transitioning to switchdev mode) or on dev
teardown.
Move PSP device registration into mlx5e_nic_enable(), and unregistration
into the corresponding mlx5e_nic_disable(). These functions are called
during netdev attach/detach after RX & TX are set up.
This ensures that the keys will be gone by the time the PD is destroyed.
Cosmin Ratiu [Mon, 4 May 2026 18:10:58 +0000 (21:10 +0300)]
net/mlx5e: psp: Fix invalid access on PSP dev registration fail
priv->psp->psp is initialized with the PSP device as returned by
psp_dev_create(). This could also return an error, in which case a
future psp_dev_unregister() will result in unpleasantness.
Avoid that by using a local variable and only saving the PSP device when
registration succeeds.
In case psp_dev_create() fails, priv->psp and steering structs are left
in place, but they will be inert. The unchecked access of priv->psp in
mlx5e_psp_offload_handle_rx_skb() won't happen because without a PSP
device, there can be no SAs added and therefore no packets will be
successfully decrypted and be handed off to the SW handler.
powerpc/perf: Update check for PERF_SAMPLE_DATA_SRC marked events
The core-book3s PMU sampling code validates the SIER TYPE field
when PERF_SAMPLE_DATA_SRC is requested. The SIER TYPE field
indicates the instruction type and is only valid for
random sampling (marked events). To handle cases observed where
SIER TYPE could be zero even for marked events,validation was
added to drop such samples and increment event->lost_samples.
However, this validation was applied to all samples,
including continuous sampling. In continuous sampling mode,
the PMU does not set the SIER TYPE field, so it remains zero.
As a result, valid continuous samples were incorrectly
treated as invalid and dropped. Fixed this by gating the
SIER TYPE validation with mark_event, so the check runs only
for marked (random) events. Continuous samples now skip this
check and are recorded normally in the final data recording path.
Fixes: 2ffb26afa642 ("arch/powerpc/perf: Check the instruction type before creating sample with perf_mem_data_src") Signed-off-by: Shivani Nittor <shivani@linux.ibm.com> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Reviewed-by: Athira Rajeev <atrajeev@linux.ibm.com>
[Maddy: Fixed reviewed-by tag] Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20260421150628.96500-1-shivani@linux.ibm.com
powerpc/8xx: Fix interrupt mask in cpm1_gpiochip_add16()
Allthough fsl,cpm1-gpio-irq-mask always contains a 16 bits value,
it is a standard u32 OF property as documented in
Documentation/devicetree/bindings/soc/fsl/cpm_qe/gpio.txt
The driver erroneously uses of_property_read_u16() leading to a
mask which is always 0.
Pavitra Jha [Fri, 1 May 2026 11:07:12 +0000 (07:07 -0400)]
net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler
t7xx_port_enum_msg_handler() uses the modem-supplied port_count field as
a loop bound over port_msg->data[] without checking that the message buffer
contains sufficient data. A modem sending port_count=65535 in a 12-byte
buffer triggers a slab-out-of-bounds read of up to 262140 bytes.
Add a sizeof(*port_msg) check before accessing the port message header
fields to guard against undersized messages.
Add a struct_size() check after extracting port_count and before the loop.
In t7xx_parse_host_rt_data(), guard the rt_feature header read with a
remaining-buffer check before accessing data_len, validate feat_data_len
against the actual remaining buffer to prevent OOB reads and signed
integer overflow on offset.
Pass msg_len from both call sites: skb->len at the DPMAIF path after
skb_pull(), and the validated feat_data_len at the handshake path.
powerpc/vmx: avoid KASAN instrumentation in enter_vmx_ops() for kexec
The kexec sequence invokes enter_vmx_ops() via copy_page() with the MMU
disabled. In this context, code must not rely on normal virtual address
translations or trigger page faults.
With KASAN enabled, functions get instrumented and may access shadow
memory using regular address translation. When executed with the MMU
off, this can lead to page faults (bad_page_fault) from which the
kernel cannot recover in the kexec path, resulting in a hang.
The kexec path sets preempt_count to HARDIRQ_OFFSET before entering
the MMU-off copy sequence.
Since kexec sets preempt_count to HARDIRQ_OFFSET, in_interrupt()
evaluates to true and enter_vmx_ops() returns early.
As in_interrupt() (and preempt_count()) are always inlined, mark
enter_vmx_ops() with __no_sanitize_address to avoid KASAN
instrumentation and shadow memory access with MMU disabled, helping
kexec boot fine with KASAN enabled.
powerpc/kdump: fix KASAN sanitization flag for core_$(BITS).o
KASAN instrumentation is intended to be disabled for the kexec core
code, but the existing Makefile entry misses the object suffix. As a
result, the flag is not applied correctly to core_$(BITS).o.
So when KASAN is enabled, kexec_copy_flush and copy_segments in
kexec/core_64.c are instrumented, which can result in accesses to
shadow memory via normal address translation paths. Since these run
with the MMU disabled, such accesses may trigger page faults
(bad_page_fault) that cannot be handled in the kdump path, ultimately
causing a hang and preventing the kdump kernel from booting. The same
is true for kexec as well, since the same functions are used there.
Update the entry to include the “.o” suffix so that KASAN
instrumentation is properly disabled for this object file.
pseries/papr-hvpipe: Fix style and checkpatch issues in enable_hvpipe_IRQ()
While at it let's also fix the similar style issue in
enable_hvpipe_IRQ() function. This also fixes a minor checkpatch warning
which I got due to an extra space before " ==".
pseries/papr-hvpipe: Refactor and simplify hvpipe_rtas_recv_msg()
Simplify hvpipe_rtas_recv_msg() by removing three levels of nesting...
if (!ret)
if (buf)
if (size < bytes_written)
... this refactoring of the function bails out to "out:" label first, in case
of any error. This simplifies the init flow.
pseries/papr-hvpipe: Kill task_struct pointer from struct hvpipe_source_info
We don't really use task_struct pointer for anything meaningful. So just
kill it for now, and we can bring back later if we need this for any
future debug purposes.
pseries/papr-hvpipe: Simplify spin unlock usage in papr_hvpipe_handle_release()
Once the src_info is removed from the global list, no one can access it.
This simplies the usage of spin_unlock_irqrestore() in
papr_hvpipe_handle_release()