Namjae Jeon [Sun, 7 Jun 2026 11:15:51 +0000 (20:15 +0900)]
ksmbd: prevent path traversal bypass by restricting caseless retry
ksmbd_vfs_path_lookup() enforces LOOKUP_BENEATH to restrict path
resolution within the share root. When a crafted path attempts to
escape the share boundary using parent-directory components ('..'),
vfs_path_parent_lookup() detects this and immediately fails,
returning -EXDEV.
However, a bug exists in __ksmbd_vfs_kern_path() under caseless mode.
The function fails to intercept the -EXDEV error and erroneously
falls through to the caseless retry logic, which is intended only
for genuinely missing files. During this retry process, the path
is reconstructed, leading to an unintended LOOKUP_BENEATH bypass
that allows write-capable users to create zero-length files or
directories outside the exported share.
Fix this by ensuring that the execution only proceeds to the caseless
lookup retry when the error is specifically -ENOENT. Any other errors,
such as -EXDEV from a path traversal attempt, must be returned immediately.
Cc: stable@vger.kernel.org
Reported-by: Y s65 <yu4ys@outlook.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
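The retry rule described in this commit can be sketched in userspace C. This is a minimal model, not the ksmbd code: lookup_fn, lookup_with_caseless_retry() and the stand-in callbacks are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical lookup callback: returns 0 or a negative errno. */
typedef int (*lookup_fn)(const char *path);

/*
 * Run the exact lookup first. Fall back to the caseless lookup only when
 * the exact lookup reports a genuinely missing entry (-ENOENT); every
 * other error is returned to the caller unchanged.
 */
static int lookup_with_caseless_retry(const char *path, int caseless,
                                      lookup_fn exact, lookup_fn retry)
{
    int err;

    assert(path != NULL && exact != NULL && retry != NULL);

    err = exact(path);
    if (err == 0)
        return 0;
    if (!caseless || err != -ENOENT)
        return err;            /* e.g. -EXDEV from a traversal attempt */
    return retry(path);
}

/* Stand-ins used to exercise the helper. */
static int retry_calls;
static int exact_escapes(const char *path) { (void)path; return -EXDEV; }
static int exact_missing(const char *path) { (void)path; return -ENOENT; }
static int retry_finds(const char *path) { (void)path; retry_calls++; return 0; }
```

With exact_escapes() as the first lookup, the helper returns -EXDEV and never calls the retry callback.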
Davide Ornaghi [Sat, 6 Jun 2026 07:11:04 +0000 (16:11 +0900)]
ksmbd: fix UAF of struct file_lock in SMB2_LOCK deferred-lock cancellation
When a blocking byte-range lock request is deferred in the
FILE_LOCK_DEFERRED path, ksmbd registers the asynchronous work into
the connection's async_requests list via setup_async_work(). The cancel
callback smb2_remove_blocked_lock() holds a reference to the flock.
If the lock waiter is subsequently woken up but the work state is no
longer KSMBD_WORK_ACTIVE (e.g., due to a concurrent cancellation), the
cleanup path calls locks_free_lock(flock) without dequeuing the work from
the async_requests list. Concurrently, smb2_cancel() walks the list
under conn->request_lock and invokes the cancel callback, which then
dereferences the already freed 'flock'. This leads to a slab-use-after-free
inside __wake_up_common.
Fix this by restructuring the cleanup logic after the worker returns
from ksmbd_vfs_posix_lock_wait(). Move list_del(&smb_lock->llist) and
release_async_work(work) to the top of the cleanup block. This guarantees
that the async work is completely dequeued and serialized under
conn->request_lock before locks_free_lock(flock) is called, rendering
the flock unreachable for any concurrent smb2_cancel().
Cc: stable@vger.kernel.org
Signed-off-by: Davide Ornaghi <d.ornaghi97@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Guangshuo Li [Fri, 5 Jun 2026 04:30:16 +0000 (12:30 +0800)]
ksmbd: fix use-after-free in same_client_has_lease()
same_client_has_lease() returns an opinfo pointer from ci->m_op_list
after dropping ci->m_lock without taking a reference.
smb_grant_oplock() then dereferences that pointer in copy_lease() and
when checking breaking_cnt. A concurrent close can remove the old lease
from ci->m_op_list and drop the last reference before the caller uses
the returned pointer, leading to a use-after-free.
Take a reference when same_client_has_lease() selects an existing lease,
drop any previous match while scanning, and release the returned
reference in smb_grant_oplock() after copying the lease state.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
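The reference handling described in this commit can be sketched as follows. It is a single-threaded model with no locking; struct lease_info, lease_get(), lease_put() and find_lease() are hypothetical stand-ins, not the ksmbd types or helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a refcounted lease entry. */
struct lease_info {
    int refcount;
    int key;
};

static struct lease_info *lease_get(struct lease_info *l)
{
    l->refcount++;
    return l;
}

static void lease_put(struct lease_info *l)
{
    assert(l->refcount > 0);
    l->refcount--;
}

/*
 * Scan the table and return the last matching entry with a reference
 * held. A reference taken on an earlier match is dropped as soon as a
 * later match replaces it, so at most one extra reference is returned.
 */
static struct lease_info *find_lease(struct lease_info *tbl, size_t n, int key)
{
    struct lease_info *found = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].key != key)
            continue;
        if (found)
            lease_put(found);
        found = lease_get(&tbl[i]);
    }
    return found;
}
```

The caller owns the returned reference and calls lease_put() once it has finished reading the entry.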
Hem Parekh [Tue, 2 Jun 2026 23:56:46 +0000 (16:56 -0700)]
ksmbd: fix out-of-bounds read in smb_check_perm_dacl()
The permission-check ACE walk in smb_check_perm_dacl() validates the ACE
header size and caps sid.num_subauth at SID_MAX_SUB_AUTHORITIES, but it
never checks that ace->size is actually large enough to contain
num_subauth sub-authorities before compare_sids() dereferences them.
CIFS_SID_BASE_SIZE covers the SID header up to but excluding the
sub_auth[] array, and offsetof(struct smb_ace, sid) is the ACE header,
so the existing guards only guarantee the 8-byte SID base, i.e. zero
sub-authorities. compare_sids() then reads ace->sid.sub_auth[i] for
i < min(local_sid->num_subauth, ace->sid.num_subauth). The local
comparison SIDs (sid_everyone, sid_unix_NFS_mode, and the id_to_sid()
result) always have at least one sub-authority, and an attacker controls
the ACE revision and authority bytes (which lie within the in-bounds SID
base), so they can match one of those SIDs and force the sub_auth read.
A crafted ACE with size == 16 and num_subauth >= 1 placed at the tail of
the security descriptor therefore causes a heap out-of-bounds read of up
to SID_MAX_SUB_AUTHORITIES * sizeof(__le32) bytes past the pntsd
allocation. The security descriptor is loaded by ksmbd_vfs_get_sd_xattr()
into a buffer sized exactly to the on-disk data (kzalloc(sd_size) in
ndr_decode_v4_ntacl()), so the read lands past the allocation. The
malformed descriptor can be stored verbatim via SMB2_SET_INFO (the DACL
is not normalised before being written to the security.NTACL xattr) and
the read fires on a subsequent SMB2_CREATE access check, making this
reachable by an authenticated client on a share that uses ACL xattrs.
Add the missing num_subauth-versus-ace_size check, mirroring the
identical guards already present in the sibling parsers parse_dacl() and
smb_inherit_dacl().
Fixes: d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()")
Cc: stable@vger.kernel.org
Signed-off-by: Hem Parekh <hemparekh1596@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
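The missing bound can be written out as a small standalone check. The sizes below are assumptions read off the message above (an 8-byte ACE header, an 8-byte SID base, 4-byte sub-authorities); the limit of 15 sub-authorities and the helper name ace_holds_subauths() are likewise assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout sizes, taken from the commit message rather than the headers. */
#define ACE_HDR_SIZE      8u   /* bytes before the SID inside an ACE */
#define SID_BASE_SIZE     8u   /* SID header, excluding sub_auth[] */
#define SID_MAX_SUB_AUTH  15u  /* assumed value of SID_MAX_SUB_AUTHORITIES */

/*
 * Return 1 when an ACE of ace_size bytes is large enough to hold
 * num_subauth sub-authorities, 0 otherwise.
 */
static int ace_holds_subauths(uint16_t ace_size, uint8_t num_subauth)
{
    if (num_subauth > SID_MAX_SUB_AUTH)
        return 0;
    return ace_size >= ACE_HDR_SIZE + SID_BASE_SIZE +
                       sizeof(uint32_t) * num_subauth;
}
```

Under these assumptions an ACE with size == 16 passes only with num_subauth == 0, which is the case the existing guards already covered.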
Both CoreI2C and the hardened versions of it on mpfs and pic64gx have a
reset pin. For the former, usually this is wired to a common fabric
reset not managed by software and for the latter two the platform
firmware takes them out of reset on first-party boards (or those using
modified versions of the vendor firmware), but not all boards may take
this approach. Permit providing a reset in devicetree for Linux, or
other devicetree-consuming software, to use.
Gary Guo [Thu, 11 Jun 2026 19:05:54 +0000 (20:05 +0100)]
rust: bitfield: mark `Debug` impl as `#[inline]`
A `Debug` impl is for debugging and is normally not used, and therefore
should ideally not be code-generated unless used. However, Rust has no way
of knowing if a dependent crate is going to use the trait impl or not, so
unless it is marked as `#[inline]`, it will be code-generated in the
defining crate (as it is not generic).
Mark the impl generated by the bitfield macro as `#[inline]`, so it does
not stay in the binary unless used.
This reduces nova-core.o .text by 17% (from 151922 bytes to 125676 bytes).
Signed-off-by: Gary Guo <gary@garyguo.net>
Fixes: b7b8b4ccdad4 ("rust: extract `bitfield!` macro from `register!`")
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260611190555.2298991-1-gary@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Various names for Qualcomm as a company are used in user-visible config
options: QCOM, Qualcomm and Qualcomm Technologies. Switch to unified
"Qualcomm" so it will be easier for users to identify the options when
for example running menuconfig.
David Carlier [Wed, 6 May 2026 15:40:15 +0000 (16:40 +0100)]
i2c: ls2x-v2: return IRQ_HANDLED after servicing an error
The event ISR reads SR1 and, when an error flag (ARLO/AF/BERR) is set,
calls loongson2_i2c_isr_error() which clears the offending flag, issues
STOP for the AF case, records msg->result, masks every CR2 interrupt
enable and completes the waiter. The handler then returns IRQ_NONE,
declaring to the IRQ core that the device did not interrupt.
That report is wrong. The device did interrupt and the handler fully
serviced it. Because the IRQ is requested with IRQF_SHARED, the genirq
spurious-IRQ tracker counts each error as unhandled. A bus that emits
sporadic NACKs, arbitration losses or bus errors will therefore march
toward the spurious-IRQ threshold and the line can end up disabled,
wedging the controller.
Return IRQ_HANDLED on this path. The other IRQ_NONE site, taken when
neither an event nor an error bit is set, remains correct.
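The return-value rule can be modelled in a few lines. The bit masks are hypothetical (the real SR1 layout is not given here), and enum irqreturn_model stands in for the kernel's irqreturn_t.

```c
#include <assert.h>
#include <stdint.h>

enum irqreturn_model { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical bit positions, for illustration only. */
#define SR1_EVENT_BITS 0x00ffu
#define SR1_ERROR_BITS 0x0700u

/*
 * Model of the event ISR: an error that was serviced counts as handled,
 * and only the "no event, no error" case reports IRQ_NONE.
 */
static enum irqreturn_model event_isr(uint32_t sr1, int *errors_serviced)
{
    if (sr1 & SR1_ERROR_BITS) {
        /* Stands in for clearing the flag and completing the waiter. */
        (*errors_serviced)++;
        return MODEL_IRQ_HANDLED;
    }
    if (sr1 & SR1_EVENT_BITS)
        return MODEL_IRQ_HANDLED;
    return MODEL_IRQ_NONE;
}
```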
ipv4: fib_rule: Move fib4_rules_exit() to ->exit().
syzbot reported use-after-free of net->ipv4.rules_ops. [0]
It can be reproduced with these commands:
while true; do
    ip netns add ns1
    ip -n ns1 link set dev lo up
    ip -n ns1 address add 192.0.2.1/24 dev lo
    ip -n ns1 link add name dummy1 up type dummy
    ip -n ns1 address add 198.51.100.1/24 dev dummy1
    ip -n ns1 rule add ipproto tcp sport 12345 table 12345
    ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321
    ip netns del ns1
done
The cited commit moved fib4_rules_exit() earlier to ->exit_rtnl(),
but the kernel socket destroyed in ->exit() could eventually reach
__fib_lookup().
I left fib4_rules_exit() in ->exit_rtnl() because fib4_rule_delete()
calls fib_unmerge(), which requires RTNL.
However, when ->delete() is called, ->configure() has already been
called, thus fib_unmerge() in ->delete() has no effect.
Let's remove fib_unmerge() in fib4_rule_delete() and move
fib4_rules_exit() to ->exit().
Many thanks to Ido Schimmel for providing the nice repro very quickly.
Note that we can make fib_rules_ops.delete() return void once
net-next opens.
[0]:
BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321
Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641
Eric Dumazet [Tue, 16 Jun 2026 14:13:17 +0000 (14:13 +0000)]
net: serialize netif_running() check in enqueue_to_backlog()
Syzbot reported a KASAN slab-use-after-free in fib_rules_lookup().
The root cause is a race condition where packets can escape the backlog
flushing during device unregistration (e.g., during netns exit).
Commit e9e4dd3267d0 ("net: do not process device backlog during unregistration")
introduced a lockless netif_running() check in enqueue_to_backlog() to
prevent queuing packets to an unregistering device.
However, this creates a TOCTOU race window.
A lockless transmitter (like veth_xmit) can pass
the check before dev_close() clears IFF_UP. If the transmitter is then
delayed, flush_all_backlogs() can run and finish before the transmitter
grabs the backlog lock and queues the packet. The packet then escapes
the flush and triggers UAF later when processed.
Fix this by moving the netif_running() check inside the backlog lock.
This serializes the check with the flush work (which also grabs the lock).
We then either queue the packet before the flush runs (so it gets flushed),
or check netif_running() after the flush/close completes (so it gets dropped).
Fixes: e9e4dd3267d0 ("net: do not process device backlog during unregistration")
Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a315824.b0403584.28d0ff.0000.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260616141317.407791-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
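The ordering argument can be illustrated with a small userspace model. It is only a sketch: struct backlog_model and its helpers are hypothetical, a pthread mutex stands in for the backlog lock, and an atomic flag stands in for netif_running().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct backlog_model {
    pthread_mutex_t lock;     /* stands in for the backlog lock */
    atomic_bool dev_running;  /* stands in for netif_running() */
    int queued;
};

static void backlog_init(struct backlog_model *b)
{
    pthread_mutex_init(&b->lock, NULL);
    atomic_init(&b->dev_running, true);
    b->queued = 0;
}

/* The running check happens with the lock held, so it cannot interleave
 * with a flush that holds the same lock. */
static bool enqueue_checked(struct backlog_model *b)
{
    bool ok;

    pthread_mutex_lock(&b->lock);
    ok = atomic_load(&b->dev_running);
    if (ok)
        b->queued++;
    pthread_mutex_unlock(&b->lock);
    return ok;
}

/* Close clears the flag first; the flush then drains under the lock. */
static void dev_close_model(struct backlog_model *b)
{
    atomic_store(&b->dev_running, false);
}

static void flush_backlog_model(struct backlog_model *b)
{
    pthread_mutex_lock(&b->lock);
    b->queued = 0;
    pthread_mutex_unlock(&b->lock);
}
```

In this model an enqueue either completes before the flush takes the lock, in which case its packet is drained, or runs after it and sees the cleared flag, in which case the packet is dropped.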
Merge in late fixes in preparation for the net-next PR.
Conflicts:
net/tls/tls_sw.c
  406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms")
  79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path")
drivers/net/ethernet/microsoft/mana/mana_en.c
  f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check")
  d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk
io_uring: Use system_dfl_wq instead of system_unbound_wq
Commit de7341ffe49e ("io_uring: switch normal task_work to a mpscq")
added a use of system_unbound_wq, which is deprecated in favor of
system_dfl_wq added by commit 128ea9f6ccfb ("workqueue: Add
system_percpu_wq and system_dfl_wq"). An upcoming warning in the
workqueue tree flags this with:
workqueue: work func io_tctx_fallback_work enqueued on deprecated workqueue. Use system_{percpu|dfl}_wq instead.
Yiming Qian [Wed, 10 Jun 2026 06:21:36 +0000 (06:21 +0000)]
net: skmsg: preserve sg.copy across SG transforms
The sk_msg sg.copy bitmap is part of the scatterlist entry ownership
state. A set bit tells sk_msg_compute_data_pointers() not to expose the
entry through writable BPF ctx->data. This protects entries backed by
pages that are not private to the sk_msg, such as splice-backed file
page-cache pages.
Several sk_msg transform paths move, copy, split, or compact
msg->sg.data[] entries without moving the matching sg.copy bit. This can
make an externally backed entry arrive at a new slot with a clear copy
bit. A later SK_MSG verdict can then expose sg_virt(sge) as writable
ctx->data and BPF stores can modify the original page cache.
Keep sg.copy synchronized with sg.data[] whenever entries are
transferred, shifted, split, or copied into a new sk_msg. Clear the bit
when an entry is replaced by a newly allocated private page or freed.
This covers the BPF pull/push/pop helpers, sk_msg_shift_left/right(),
sk_msg_xfer(), and tls_split_open_record(), including the partial tail
entry created during TLS open-record splitting.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Cc: stable@vger.kernel.org
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Yiming Qian <yimingqian591@gmail.com>
Link: https://patch.msgid.link/20260610062137.49075-1-yimingqian591@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
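Keeping a per-slot bit in step with its entry can be sketched like this. struct sg_model and its helpers are hypothetical and far simpler than struct sk_msg; the sketch only shows the bit travelling with the entry and being cleared on replacement.

```c
#include <assert.h>
#include <stdint.h>

#define SG_SLOTS 8u

struct sg_model {
    int data[SG_SLOTS];   /* stands in for the scatterlist entries */
    uint32_t copy;        /* bit i set: slot i must not be exposed as writable */
};

/* Move slot src into slot dst, carrying the matching copy bit along. */
static void sg_move(struct sg_model *m, unsigned int dst, unsigned int src)
{
    assert(dst < SG_SLOTS && src < SG_SLOTS);
    m->data[dst] = m->data[src];
    if (m->copy & (1u << src))
        m->copy |= 1u << dst;
    else
        m->copy &= ~(1u << dst);
}

/* Replace a slot with a newly allocated private entry: clear its bit. */
static void sg_replace_private(struct sg_model *m, unsigned int i, int entry)
{
    assert(i < SG_SLOTS);
    m->data[i] = entry;
    m->copy &= ~(1u << i);
}

static int sg_writable(const struct sg_model *m, unsigned int i)
{
    return !(m->copy & (1u << i));
}
```

Moving a slot without the bit update in sg_move() is the pattern this commit removes: the destination would keep whatever bit it had before.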
====================
appletalk: move the protocol out of tree
This tiny series moves appletalk out of tree, to:
https://github.com/linux-netdev/mod-orphan
Core maintainers are unable to keep up with the rate of security
bug reports and fixes. Nobody seems to care about appletalk enough
to review the patches.
As Eric pointed out, Mac OS dropped AppleTalk over a decade ago.
====================
Jakub Kicinski [Mon, 15 Jun 2026 22:29:35 +0000 (15:29 -0700)]
appletalk: move the protocol out of tree
AppleTalk was removed from Mac OS X 10.6 (Snow Leopard) in 2009,
according to Wikipedia. We recently got a burst of AI generated
fixes to this protocol which nobody is reviewing.
Let AppleTalk follow AX.25 and hamradio out of the Linux tree.
We will maintain the code at: github.com/linux-netdev/mod-orphan
for anyone interested in playing with it.
Retain the uAPI for now. No strong reason, simply because I suspect
keeping it will be less controversial.
Jakub Kicinski [Mon, 15 Jun 2026 22:29:34 +0000 (15:29 -0700)]
appletalk: stop storing per-interface state in struct net_device
AppleTalk keeps its per-interface control block (struct atalk_iface)
directly in struct net_device (dev->atalk_ptr). This is the only thing
tying the protocol into the core net_device layout and is the sole
blocker to moving AppleTalk out of tree.
Replace dev->atalk_ptr with a small ifindex-keyed hashtable internal
to ddp.c. The existing atalk_interfaces list stays the owner of the iface
objects; the hashtable is purely a fast dev->iface index and reuses
the same atalk_interfaces_lock.
AFAICT this patch does not make this code any more racy than it already
is, though I'm sure Sashiko will point out some bugs that already exist.
AFAICT atalk_interfaces_lock is the innermost lock already.
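An ifindex-keyed index of this kind can be sketched in plain C. The bucket count, struct iface_model and the helper names are hypothetical, and the locking (atalk_interfaces_lock in the description above) is left out of the model.

```c
#include <assert.h>
#include <stddef.h>

#define IFACE_HASH_SIZE 16u   /* hypothetical bucket count */

/* Hypothetical stand-in for the per-interface control block. */
struct iface_model {
    int ifindex;
    struct iface_model *hash_next;   /* chain within one bucket */
};

static struct iface_model *iface_hash[IFACE_HASH_SIZE];

static unsigned int iface_bucket(int ifindex)
{
    return (unsigned int)ifindex % IFACE_HASH_SIZE;
}

static void iface_hash_add(struct iface_model *iface)
{
    unsigned int b = iface_bucket(iface->ifindex);

    iface->hash_next = iface_hash[b];
    iface_hash[b] = iface;
}

static struct iface_model *iface_hash_find(int ifindex)
{
    struct iface_model *i;

    for (i = iface_hash[iface_bucket(ifindex)]; i; i = i->hash_next)
        if (i->ifindex == ifindex)
            return i;
    return NULL;
}

/* Unlink only; the owner list stays responsible for freeing the object. */
static void iface_hash_del(struct iface_model *iface)
{
    struct iface_model **pp = &iface_hash[iface_bucket(iface->ifindex)];

    while (*pp && *pp != iface)
        pp = &(*pp)->hash_next;
    if (*pp)
        *pp = iface->hash_next;
}
```

The table only indexes objects owned elsewhere, which matches the split described above between the owning list and the lookup index.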
i3c: mipi-i3c-hci: Use named initializers for platform_device_id's .driver_data
The assignment in this driver uses a mixed way to initialize the
platform_device_id array. .name is assigned by name and .driver_data by
position. Unify that to use named assignment for both struct members.
This is needed for a planned change to struct platform_device_id
replacing .driver_data by an anonymous union.
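The difference between the two initializer styles can be shown on a cut-down structure. struct device_id_model, the "example-hci" name and QUIRK_A are hypothetical; the real struct platform_device_id has its own definition.

```c
#include <assert.h>
#include <string.h>

/* Cut-down stand-in for struct platform_device_id. */
struct device_id_model {
    char name[20];
    unsigned long driver_data;
};

#define QUIRK_A 0x1ul   /* hypothetical quirk flag */

/* Mixed style: .name by name, driver_data by position. This only works
 * while driver_data is the member that directly follows .name. */
static const struct device_id_model mixed_ids[] = {
    { .name = "example-hci", QUIRK_A },
    { .name = "" },   /* terminator */
};

/* Named style: both members are designated, so the initializer does not
 * depend on member order or on what follows .name. */
static const struct device_id_model named_ids[] = {
    { .name = "example-hci", .driver_data = QUIRK_A },
    { .name = "" },   /* terminator */
};
```

Both tables hold the same values today; only the named form keeps compiling to the same meaning if the member after .name changes.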
Qu Wenruo [Tue, 16 Jun 2026 08:12:36 +0000 (17:42 +0930)]
block: respect iov_iter::nofault flag in bio_iov_iter_bounce_write()
For the upcoming use of IOMAP_DIO_BOUNCE in btrfs, btrfs sets
iov_iter::nofault to prevent a deadlock when a page fault is needed to
read out the buffer.
However, bio_iov_iter_bounce_write() doesn't respect the iov_iter::nofault
flag and just calls a plain copy_from_iter(), so it can still trigger a
page fault and cause a deadlock in btrfs.
Fix it by using copy_folio_from_iter_atomic() if the nofault flag is
set, and copy_folio_from_iter() otherwise.
Qu Wenruo [Tue, 16 Jun 2026 08:12:35 +0000 (17:42 +0930)]
block: revert the iov_iter after a short copy in bio_iov_iter_bounce_write()
For the upcoming use of the IOMAP_DIO_BOUNCE flag in btrfs, it's pretty
easy to hit a short copy inside bio_iov_iter_bounce_write().
This is because btrfs disables page faults to avoid certain deadlocks
during direct writes; instead, btrfs manually faults in the pages and
then retries.
Inside bio_iov_iter_bounce_write(), if we hit a short copy, we
didn't revert the iov_iter, which can cause problems such as unexpected
garbage on the next retry.
Revert the iov_iter after a short copy.
Note that the folio is allocated and then immediately queued into the
bio, so the proper revert size is
(bi_size - this_len + copied).
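The revert size can be checked with a little arithmetic. This sketch assumes that bi_size already includes the folio that was just queued (this_len bytes) and that copied is the number of bytes actually copied into that folio; bounce_revert_len() is a hypothetical helper, not a block-layer function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes consumed from the iterator so far: everything queued before the
 * current folio (bi_size - this_len) plus the part of the current folio
 * that was actually copied.
 */
static size_t bounce_revert_len(size_t bi_size, size_t this_len, size_t copied)
{
    assert(copied <= this_len && this_len <= bi_size);
    return bi_size - this_len + copied;
}
```

For example, with two full 4096-byte folios already copied and a third 4096-byte folio queued but only 1000 bytes copied, bi_size is 12288 and the revert size is 12288 - 4096 + 1000 = 9192.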
Jisheng Zhang [Fri, 12 Jun 2026 00:28:35 +0000 (08:28 +0800)]
spi: dw: fix wrong BAUDR setting after resume
After resuming from suspend-to-RAM, SPI transfers stop working. Further
debugging shows that the BAUDR register isn't set correctly because
dws->current_freq doesn't match the hardware BAUDR setting:
dws->current_freq equals speed_hz, but BAUDR is 0,
so the dw_spi_set_clk() call in the code below is never made:
The mismatch comes from dw_spi_shutdown_chip() when suspending.
Fix this mismatch by setting dws->current_freq to 0 as well when
clearing BAUDR reg in dw_spi_shutdown_chip().
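The stale-cache problem can be reproduced in a small model. struct dw_spi_model and both helpers are hypothetical simplifications; in particular the divider calculation is a placeholder and not the driver's.

```c
#include <assert.h>
#include <stdint.h>

struct dw_spi_model {
    uint32_t baudr;          /* stands in for the hardware BAUDR register */
    uint32_t current_freq;   /* software cache of the programmed speed */
    uint32_t max_freq;
};

/* Program the divider only when the cached speed differs. */
static void model_update_clk(struct dw_spi_model *d, uint32_t speed_hz)
{
    assert(speed_hz != 0 && speed_hz <= d->max_freq);
    if (d->current_freq != speed_hz) {
        d->baudr = d->max_freq / speed_hz;   /* placeholder divider */
        d->current_freq = speed_hz;
    }
}

/* Shutdown clears BAUDR; reset_cache selects the fixed behaviour. */
static void model_shutdown_chip(struct dw_spi_model *d, int reset_cache)
{
    d->baudr = 0;
    if (reset_cache)
        d->current_freq = 0;
}
```

Without the cache reset, a transfer at the same speed after shutdown leaves baudr at 0, which is the mismatch described above.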
Kunihiko Hayashi [Tue, 16 Jun 2026 01:12:23 +0000 (10:12 +0900)]
spi: uniphier: Fix completion initialization order before devm_request_irq()
The driver calls devm_request_irq() before initializing the completion
used by the interrupt handler. Because the interrupt may occur immediately
after devm_request_irq(), the handler may execute before init_completion().
This may result in calling complete() on an uninitialized completion,
causing undefined behavior. This has been observed with KASAN.
Fix this by initializing the completion before registering the IRQ.
Reported-by: Sangyun Kim <sangyun.kim@snu.ac.kr>
Reported-by: Kyungwook Boo <bookyungwook@gmail.com>
Fixes: 5ba155a4d4cc ("spi: add SPI controller driver for UniPhier SoC")
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260616011223.201357-1-hayashi.kunihiko@socionext.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Shuvam Pandey [Mon, 15 Jun 2026 20:18:00 +0000 (02:03 +0545)]
accel/amdxdna: Use caller client for debug BO sync
amdxdna_drm_sync_bo_ioctl() looks up args->handle in the ioctl caller's
drm_file. For SYNC_DIRECT_FROM_DEVICE, it then calls
amdxdna_hwctx_sync_debug_bo(), but passes abo->client.
amdxdna_hwctx_sync_debug_bo() uses the passed client both as the handle
namespace for debug_bo_hdl and as the owner of the hardware context xarray.
Those must match the file that supplied args->handle. The BO's stored
client pointer is object state, not the ioctl context.
Pass filp->driver_priv instead, matching the original handle lookup.
Jacob Moroni [Tue, 16 Jun 2026 15:56:01 +0000 (15:56 +0000)]
RDMA/irdma: Replace waitqueue and flag with completion
The driver previously used a waitqueue along with an explicit
request_done flag, but without proper barriers around request_done.
An earlier patch by Gui-Dong Han <hanguidong02@gmail.com> attempted
to fix this by adding the missing memory barriers. Rather than
adding the barriers, this patch replaces the waitqueue+flag with
a completion, which is designed for this exact purpose.
Junxian Huang [Sat, 13 Jun 2026 10:20:45 +0000 (18:20 +0800)]
RDMA/hns: Fix memory leak of bonding resources
In a corner case of concurrent driver removal and driver reset, the
bonding resource is first released in hns_roce_hw_v2_exit() during
driver removal, and is then allocated again in hns_roce_register_device()
during driver reset. This leads to a memory leak because the point at
which the resource is released has already passed. It may also lead to a
kernel panic as below because of the leaked notifier callback:
Zhenhao Wan [Thu, 11 Jun 2026 17:15:54 +0000 (01:15 +0800)]
RDMA/rtrs-srv: Bound RDMA-Write length to chunk size in rdma_write_sg
When the server answers an RTRS READ, rdma_write_sg() builds the source
scatter/gather entry for the IB_WR_RDMA_WRITE that returns data to the
peer. Its length is taken directly from the wire descriptor:
rd_msg points into the chunk buffer that the remote peer filled via
RDMA-WRITE-WITH-IMM (rtrs_srv_rdma_done() -> process_io_req() ->
process_read()), so desc[0].len is attacker-controlled and, before this
change, was only rejected when zero. The source address is the fixed
chunk start (dma_addr[msg_id]) and the source lkey is the PD-wide
local_dma_lkey, which is not tied to the chunk's MR mapping, so the verbs
layer does not constrain the transfer length to max_chunk_size. msg_id
and off are bounded against queue_depth and max_chunk_size in
rtrs_srv_rdma_done(), but desc[0].len is a separate field that was not
checked against the chunk size.
A peer that advertises desc[0].len larger than max_chunk_size can make
the posted RDMA write read past the chunk's mapped region. The resulting
behaviour depends on the IOMMU configuration: with no IOMMU or in
passthrough mode the read may extend into memory adjacent to the chunk
and be returned to the peer, which can disclose host memory; with a
translating IOMMU the out-of-range access is expected to fault and abort
the connection. In either case the transfer exceeds what the protocol
permits and is driven by a remote peer.
Reject a descriptor length above max_chunk_size, mirroring the existing
off >= max_chunk_size bound in rtrs_srv_rdma_done(). Legitimate clients
do not exceed it: the client sets desc[0].len to its MR length, which is
capped at the negotiated max_io_size (max_chunk_size - MAX_HDR_SIZE).
Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality")
Link: https://patch.msgid.link/r/20260612-master-v1-1-70cde5c6fdc9@gmail.com
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Zhenhao Wan <whi4ed0g@gmail.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
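The added bound amounts to a two-sided check on the wire length. check_rd_desc_len() is a hypothetical helper and the -EINVAL return value is an assumption; the message above only says that such a descriptor is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Accept a descriptor length only when it is non-zero and does not
 * exceed the chunk size; anything else is rejected before a work
 * request is built from it.
 */
static int check_rd_desc_len(uint32_t len, uint32_t max_chunk_size)
{
    if (len == 0 || len > max_chunk_size)
        return -EINVAL;   /* assumed error code */
    return 0;
}
```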
docs: infiniband: correct name of option to enable the ib_uverbs module
The Infiniband documentation states that CONFIG_INFINIBAND_USER_VERBS
should be used to enable the ib_uverbs module. However, this option was
renamed to CONFIG_INFINIBAND_USER_ACCESS in commit 17781cd6186c
("[PATCH] IB: clean up user access config options"). Update the
documentation to reflect this.
Selvin Xavier [Mon, 15 Jun 2026 22:47:51 +0000 (15:47 -0700)]
RDMA/bnxt_re: Reject GET_TOGGLE_MEM when toggle page was not allocated
If a user calls BNXT_RE_METHOD_GET_TOGGLE_MEM on a device that does not
support the CQ/SRQ toggle feature, uctx_cq_page or uctx_srq_page will
be NULL.
Add an explicit -EOPNOTSUPP return after capturing the address from
uctx_cq_page / uctx_srq_page if the address is zero.
Fixes: e275919d9669 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace")
Fixes: 181028a0d84c ("RDMA/bnxt_re: Share a page to expose per SRQ info with userspace")
Link: https://patch.msgid.link/r/20260615224751.232802-16-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
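The guard can be sketched as a check on the captured address. get_toggle_mem_addr() is a hypothetical helper, not the driver's method handler; it only models capturing the address and rejecting a missing page.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Capture the toggle page address and refuse the request when the page
 * was never allocated (feature not supported on this device).
 */
static int get_toggle_mem_addr(const void *toggle_page, uint64_t *addr_out)
{
    uint64_t addr = (uint64_t)(uintptr_t)toggle_page;

    if (!addr)
        return -EOPNOTSUPP;
    *addr_out = addr;
    return 0;
}
```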
Selvin Xavier [Mon, 15 Jun 2026 22:47:47 +0000 (15:47 -0700)]
RDMA/bnxt_re: Avoid repeated requests to allocate WC pages
Applications can request multiple WC pages for the same ucontext.
As of now, only 1 WC page per ucontext is supported. Add a lock to
avoid concurrent access and a check to fail repeated requests.
Also, if the mmap entry insert fails for the WC, free the Doorbell
page index mapped for the WC page.
Fixes: eee6268421a2 ("RDMA/bnxt_re: Move the UAPI methods to a dedicated file")
Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push")
Link: https://patch.msgid.link/r/20260615224751.232802-12-selvin.xavier@broadcom.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Brian Nguyen [Fri, 5 Jun 2026 22:42:58 +0000 (22:42 +0000)]
drm/xe: Add compact-PT and addr mask handling for page reclaim
Current implementation of generate_reclaim_entry() overlooks some
differences between the different page implementations: address masking
and compact 64K page handling.
Address masking of each leaf varies depending on the leaf entry size.
generate_reclaim_entry() is using XE_PTE_ADDR_MASK [51:12] for all leaf
entries. For 2MB PTEs, bit 12 (PAT) is part of the flags so the old mask
corrupts the physical address extraction.
64K pages can be represented either as PS64 or as a compact PT, and the
latter was not handled. Compact pages aren't walked by the unbind walker,
so we walk the compact PT separately to ensure none of the leaf 64K
PTEs are dropped. Previously, a compact PT caused an abort since it
was considered covered and not descended into.
v2:
- Update 64K entry/unbind walker for 64K compact PT handling. (Matthew)
- Rework calculations of reclamation and address mask size.
- Add new func abstracting the error handling before generating the
reclaim entry.
v3:
- Report finer addr granularity in abort debug print for compact.
(Zongyao)
- Add comments for ADDR_MASK usage. (Zongyao)
- Drop existing phys_addr asserts, the new XE_PAGE_ADDR_MASK clears
bits checked, so redundant asserts. (Sashiko)
- WARN_ON to verify compact pt and edge pt won't be possible.
Fixes: b912138df299 ("drm/xe: Create page reclaim list on unbind")
Assisted-by: Sashiko-Review:gemini-3.1-pro-preview
Cc: stable@vger.kernel.org
Cc: Matthew Auld <matthew.auld@intel.com>
Suggested-by: Zongyao Bai <zongyao.bai@intel.com>
Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Zongyao Bai <zongyao.bai@intel.com>
Link: https://patch.msgid.link/20260605224257.2194194-2-brian3.nguyen@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 669252801a4aa4098fbc5dd9dd0bd93f0625abd7)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
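The address-masking problem from this commit can be shown with plain bit arithmetic. The [51:12] mask and the PAT flag at bit 12 of a 2MB PTE come from the message; the [51:21] mask for 2MB leaves, the macro names and leaf_phys_addr() are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Bits [hi:lo] set, in the spirit of the kernel's GENMASK_ULL(). */
#define MASK64(hi, lo) ((~0ull >> (63 - (hi))) & (~0ull << (lo)))

#define ADDR_MASK_4K    MASK64(51, 12)
#define ADDR_MASK_2M    MASK64(51, 21)   /* assumed mask for a 2MB leaf */
#define PTE_2M_PAT_BIT  (1ull << 12)     /* flag bit on 2MB PTEs */

/* Extract the physical address using a mask chosen by leaf size. */
static uint64_t leaf_phys_addr(uint64_t pte, int is_2m)
{
    return pte & (is_2m ? ADDR_MASK_2M : ADDR_MASK_4K);
}
```

Applying the 4K mask to a 2MB PTE with the PAT flag set leaves bit 12 in the result, so the extracted address is off by 0x1000.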
Tejas Upadhyay [Fri, 12 Jun 2026 07:04:02 +0000 (12:34 +0530)]
drm/xe/guc: Fix buffer overflow in steered register list allocation
The size calculation for the steered register extarray uses only the
geometry DSS mask (g_dss_mask) to determine the number of entries to
allocate:
total = bitmap_weight(gt->fuse_topo.g_dss_mask, ...) * steer_reg_num;
However, the filling loop uses for_each_dss_steering(), which iterates
over for_each_dss(), defined as the union of g_dss_mask and c_dss_mask
(geometry + compute DSS). On platforms with compute-only DSS bits, the
loop writes past the allocated buffer, corrupting adjacent slab objects.
This manifests as list_del corruption and SLUB redzone overwrites during
drm_managed_release on device unbind, since the overflow corrupts the
drmres list_head of neighboring allocations.
Fix by computing the allocation size using the union of both DSS masks,
matching the iteration pattern of for_each_dss_steering().
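The sizing mismatch is easy to see with two 64-bit masks standing in for the DSS bitmaps. The helper names are hypothetical, and real DSS masks are kernel bitmaps rather than a single word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count set bits (Kernighan's method). */
static unsigned int popcount64(uint64_t v)
{
    unsigned int n = 0;

    for (; v; v &= v - 1)
        n++;
    return n;
}

/* Old sizing: geometry DSS only. */
static size_t steered_entries_geometry_only(uint64_t g_mask, size_t regs)
{
    return popcount64(g_mask) * regs;
}

/* Fixed sizing: every DSS the fill loop will visit (geometry or compute). */
static size_t steered_entries_union(uint64_t g_mask, uint64_t c_mask,
                                    size_t regs)
{
    return popcount64(g_mask | c_mask) * regs;
}
```

With g_mask = 0x0f, c_mask = 0x30 and three steered registers, the old calculation sizes the buffer for 12 entries while the loop writes 18.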
Matthew Brost [Thu, 11 Jun 2026 23:58:44 +0000 (16:58 -0700)]
drm/xe: Set TTM device beneficial_order to 9 (2M)
Set the TTM device beneficial_order to 9 (2M), which is the sweet
spot for Xe when attempting reclaim on system memory BOs, as it matches
the large GPU page size. This ensures reclaim is attempted at the most
effective order for the driver.
This fixes an issue where an order-10 (4M) allocation cannot be found
despite an abundance of memory. The 4M allocation triggers reclaim,
unnecessarily evicting the working set and hurting performance. Since
the TTM infrastructure was introduced recently, we are tagging the TTM
patch as the Fixes target, even though this resolves an Xe-side problem.
Fixes: 7e9c548d3709 ("drm/ttm: Allow drivers to specify maximum beneficial TTM pool size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20260611235844.3725147-1-matthew.brost@intel.com
(cherry picked from commit 0d81db90d364cb3d733410829118759f28957c5a)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
drm/xe: Fix wa_oob codegen recipe for external module builds
When building with 'make M=drivers/gpu/drm/xe modules', kbuild invokes
scripts/Makefile.build with obj=., causing $(obj) to expand to '.'.
Make normalizes './xe_gen_wa_oob' to 'xe_gen_wa_oob' when constructing
the $^ automatic variable (target name normalization), so the recipe
command becomes just 'xe_gen_wa_oob ...' without any path prefix, and
the shell cannot find the tool.
Fix by replacing $^ with explicit $(obj)/xe_gen_wa_oob and
$(src)/<rules-file> references in both wa_oob recipe commands.
In recipe strings, make does not apply target name normalization, so
$(obj)/xe_gen_wa_oob correctly expands to './xe_gen_wa_oob' and the
shell can execute it. This matches the pattern already used by other
DRM drivers (e.g. radeon's mkregtable).
Fixes: f037e0b78e6d ("drm/xe: add xe_device_wa infrastructure")
Cc: Matt Atwood <matthew.s.atwood@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: intel-xe@lists.freedesktop.org
Assisted-by: GitHub_Copilot:claude-sonnet-4.6
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/20260604074501.172129-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit 3a11a63cc16660d514ff584e7551589655337e87)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Rodrigo Vivi [Wed, 10 Jun 2026 15:25:49 +0000 (11:25 -0400)]
drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
A job that GuC never scheduled (never started) indicates a GuC
scheduling failure; previously such jobs were silently errored out
instead of triggering a GT reset to recover. Trigger a GT reset and
resubmit them, but only when the queue was not already killed or banned:
an unstarted job on an already banned queue is the ban working as
intended and must neither clear the ban nor kick off a reset, otherwise
a banned userspace queue could be resurrected and spam GT resets.
Kernel queues are always recovered this way and wedge the device once
recovery attempts are exhausted, since kernel work must not silently
fail. A started job that times out on a userspace VM bind queue stays
banned rather than being reset and retried.
The queue is banned early in the timeout handler to signal the G2H
scheduling-done handler so it wakes the disable-scheduling waiter;
without it the waiter sleeps the full 5s timeout. When a reset is
warranted the ban is cleared before rearming so that
guc_exec_queue_start() can resubmit jobs after the GT reset - a
still-banned queue would block resubmission and cause an infinite TDR
loop. The already-banned case is gated out before this point via
skip_timeout_check, so it is unaffected.
v2: (Himal) Do it for any queue type, not just kernel/migration
v3: - (Sashiko and Sanjay): don't clear the ban / GT reset for already
killed/banned queues on unstarted-job timeout
- Update commit message
- (Matt) Add Fixes tag
Fixes: fe05cee4d953 ("drm/xe: Don't short circuit TDR on jobs not started")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Assisted-by: GitHub-Copilot:claude-opus-4.8
Tested-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patch.msgid.link/20260610152548.404575-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit b1107d085e7e8ed15ba6f80c102528a9c8a6cb0e)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Wentao Liang [Wed, 10 Jun 2026 17:27:05 +0000 (10:27 -0700)]
drm/xe: fix refcount leak in xe_range_fence_insert()
xe_range_fence_insert() acquires a reference on fence via
dma_fence_get() and stores it in rfence->fence. It then calls
dma_fence_add_callback() and handles two cases: when the callback
is successfully registered (err == 0) the fence is transferred to
the tree for later cleanup; when the fence is already signaled
(err == -ENOENT) it manually drops the extra reference with
dma_fence_put(fence).
However, dma_fence_add_callback() can fail with other errors
(e.g. -EINVAL) and in that case the code falls through to the free:
label without releasing the acquired reference, leaking it.
Fix the leak by adding an else branch that calls dma_fence_put()
before jumping to free: for any error other than -ENOENT.
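
The reference handling described above can be sketched in user-space C. This is a simplified model, not the xe code: `struct fake_fence`, `fake_get()`, `fake_put()`, `fake_add_callback()` and `fake_range_fence_insert()` are hypothetical stand-ins for the dma_fence helpers and the real function, and the sketch returns the error to the caller instead of using a `free:` label.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a refcounted fence; not the real struct dma_fence. */
struct fake_fence {
	int refcount;
	int add_cb_result;	/* value the fake add-callback helper returns */
};

static struct fake_fence *fake_get(struct fake_fence *f)
{
	f->refcount++;
	return f;
}

static void fake_put(struct fake_fence *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Hypothetical model of callback registration: reports a preset result. */
static int fake_add_callback(struct fake_fence *f)
{
	return f->add_cb_result;
}

/*
 * Returns 0 when the reference was handed over to *tracked, otherwise a
 * negative errno with the extra reference already dropped.
 */
static int fake_range_fence_insert(struct fake_fence *fence,
				   struct fake_fence **tracked)
{
	int err;

	*tracked = fake_get(fence);
	err = fake_add_callback(fence);
	if (err == 0)
		return 0;	/* reference is now owned by *tracked */

	/*
	 * Already signaled (-ENOENT) or any other failure (e.g. -EINVAL):
	 * nobody will release the reference later, so release it here.
	 */
	fake_put(fence);
	*tracked = NULL;
	return err;
}
```

With a fence that starts at one reference, a failing registration leaves the count at one, while a successful registration leaves it at two until the tracking structure drops its reference.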
drm/xe: include all registered queues in TLB invalidation
Context-based TLB invalidation currently selects only scheduling-active
exec queues via q->ops->active(). During rebind flows, queues may be
suspended (or transitioning through resume) while still owning valid
translations, causing them to be skipped from invalidation and leading
to missed TLB invalidations on LR rebinds.
The underlying issue is a TOCTOU: q->guc->state bits are flipped lock-free
from enable_scheduling(), disable_scheduling{,_deregister}(), the
suspend/resume sched-msg handlers, handle_sched_done(), and
guc_exec_queue_stop(); nothing in send_tlb_inval_ctx_ppgtt() serializes
against them, so any state-based predicate can race.
Include all the registered queues so that TLB invalidations are not
missed. This is race-free because list membership on vm->exec_queues.list
is stable under vm->exec_queues.lock held by the caller. The performance
impact is expected to be minimal and harmless. If it does turn out to be
a concern, we can come back with a race-safe solution to ignore certain
queues.
Fixes: 6cdaa5346d6f ("drm/xe: Add context-based invalidation to GuC TLB invalidation backend") Assisted-by: Claude:claude-opus-4.6 Suggested-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260608162745.338725-2-tilak.tirumalesh.tangudu@intel.com Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>
(cherry picked from commit aa625e1e9f0710e424fe4f0e3f032807df81b5b0) Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Raag Jadav [Tue, 2 Jun 2026 04:48:43 +0000 (10:18 +0530)]
drm/xe/drm_ras: Add per node cleanup action
cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Add per node cleanup action which guarantees
cleanup on unwind and also simplifies the cleanup logic.
Raag Jadav [Tue, 2 Jun 2026 04:48:42 +0000 (10:18 +0530)]
drm/xe/drm_ras: Make counter allocation drm managed
cleanup_node_param() is not registered for previous node in case of counter
allocation failure, which results in stale memory of previous node that
isn't cleaned up on unwind. Fix this using drm managed allocation, which is
guaranteed to be cleaned up on unwind.
drm/xe: Clear pending_disable before signaling suspend fence
In the schedule-disable done path for suspend, we
signal the suspend fence before clearing pending_disable.
That wakeup can let suspend_wait complete and resume be queued
immediately. The resume path may then reach enable_scheduling()
while pending_disable is still set and hit the
!exec_queue_pending_disable(q) assertion.
Fix this by clearing pending_disable before signaling
the suspend fence, so any resumed transition observes a
consistent state.
Fixes: 87651f31ae4e ("drm/xe/guc_submit: fix race around suspend_pending") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260603065217.3131066-3-tilak.tirumalesh.tangudu@intel.com
(cherry picked from commit 4b1ae138b0e103d753773956a84eebc2edbf62c4) Signed-off-by: Matthew Brost <matthew.brost@intel.com>
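
The ordering fixed by the commit above can be illustrated with a single-threaded toy model. All names here (`struct fake_queue`, `on_suspend_done`, `fake_enable_scheduling()`, `fake_sched_disable_done()`) are hypothetical, and the callback only approximates a waiter that is woken by the suspend fence and resumes immediately; the real code deals with concurrent contexts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_queue {
	bool pending_disable;
	bool suspended;
	/* Hypothetical hook standing in for whoever waits on the suspend fence. */
	void (*on_suspend_done)(struct fake_queue *q);
};

/* Models the resume side: it must never observe pending_disable still set. */
static void fake_enable_scheduling(struct fake_queue *q)
{
	assert(!q->pending_disable);
	q->suspended = false;
}

/* Models the schedule-disable done path for suspend, in the fixed order. */
static void fake_sched_disable_done(struct fake_queue *q)
{
	q->suspended = true;
	q->pending_disable = false;	/* 1. make the state consistent */
	if (q->on_suspend_done)
		q->on_suspend_done(q);	/* 2. only then wake the waiter */
}
```

If the two numbered steps were swapped, the assertion in `fake_enable_scheduling()` would fire as soon as the waiter resumed from inside the wakeup.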
The idle-skip optimization bypasses GuC suspend, so the GPU may not
perform the context switch that flushes TLB entries for invalidated
userptr VMAs. In LR/preempt-fence VM mode, this can lead to missed TLB
invalidation and page faults during userptr invalidation tests.
Restore unconditional schedule toggling on suspend so the context-switch
TLB flush is always performed.
This optimization will be reintroduced with a fix that does not skip
suspend in LR/preempt-fence VM mode.
Fixes: 8533051ce920 ("drm/xe: Skip exec queue schedule toggle if queue is idle during suspend") Cc: stable@vger.kernel.org # v7.0+ Suggested-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Tangudu Tilak Tirumalesh <tilak.tirumalesh.tangudu@intel.com> Reviewed-by: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patch.msgid.link/20260603065217.3131066-2-tilak.tirumalesh.tangudu@intel.com
(cherry picked from commit 6a1e7934d9a6cf46aecae00a99c2603d1295e170) Signed-off-by: Matthew Brost <matthew.brost@intel.com>
The early GuC FW definition meant for our CI branch was accidentally
merged to the drm-xe-next branch instead. This GuC FW will never be
released to linux-firmware, so we do not want the definition to be
available in the mainline Linux codebase.
Fixes: 4e88de313ff4 ("drm/xe/nvls: Define GuC firmware for NVL-S") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Julia Filipchuk <julia.filipchuk@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: stable@vger.kernel.org # v7.0+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20260529193558.185436-11-daniele.ceraolospurio@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit 65b8e0ac86e48cfc9128c04dfc53ea3395d030dd) Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Imre Deak [Fri, 12 Jun 2026 17:26:17 +0000 (20:26 +0300)]
drm/i915/mtl+: Enable PPS before PLL
Enabling PPS after a display port's PLL is enabled leads to PLL / DDI
BUF timeouts when the system resumes after a long (> 45 mins) suspend,
at least on some ARL and MTL laptops, some or all of which also contain
an Nvidia GPU. Enabling PPS first and then the PLL fixes the problem
for all the reporters.
A similar issue is seen when enabling an external DP output on PHY B
(vs. PHY A in the above eDP cases), where this change will not have any
effect (since no PPS is used in that case). There isn't any direct
connection between PPS and PLL, so the fix for eDP works by some
side-effect only. However Bspec does seem to require enabling PPS first,
so let's do that. Further investigation continues on the actual root
cause and a cure for external panels.
Fixes: 1a7fad2aea74 ("drm/i915/cx0: Enable dpll framework for MTL+") Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/16098 Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/16064 Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/work_items/16042 Cc: Mika Kahola <mika.kahola@intel.com> Cc: stable@vger.kernel.org # v7.0+ Tested-by: Jouni Högander <jouni.hogander@intel.com> Tested-by: Marco Nenciarini <mnencia@kcore.it> Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patch.msgid.link/20260612172617.3427027-1-imre.deak@intel.com
(cherry picked from commit 28783a274e886dd6da61419be6020bd9d0384e9f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Guangshuo Li [Fri, 12 Jun 2026 03:53:10 +0000 (11:53 +0800)]
drm/i915: clear CRTC color blob pointers after dropping refs
intel_crtc_put_color_blobs() drops the CRTC color blob references, but
leaves the corresponding pointers unchanged.
This can matter in intel_crtc_prepare_cleared_state(), which frees the
old CRTC hw state before calling intel_dp_tunnel_atomic_clear_stream_bw().
The latter can fail while looking up the DP tunnel group state, for
example with -EDEADLK.
If that happens, the function returns without completing the cleared
state preparation. The failed atomic state will then be cleared by the
atomic core and intel_crtc_free_hw_state() can be called again for the
same state, dropping the same blob references again.
Clear the blob pointers after dropping the references so repeated cleanup
of the same CRTC hw state is safe.
Fixes: 77fcf58df15e ("drm/i915/dp_tunnel: Fix error handling when clearing stream BW in atomic state") Suggested-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Reviewed-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Imre Deak <imre.deak@intel.com> Link: https://patch.msgid.link/20260612035310.3013066-1-lgs201920130244@gmail.com
(cherry picked from commit d5005addb5f68e8a0edce249506757bdc9e3d8c8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
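
The idea behind the commit above, making a cleanup helper safe to run twice on the same state, can be sketched as follows. This is a user-space model under stated assumptions: `struct fake_blob`, `struct fake_hw_state`, the field names and both helpers are hypothetical and do not mirror the i915 structures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted object; not the real DRM property blob. */
struct fake_blob {
	int refcount;
};

static void fake_blob_put(struct fake_blob *blob)
{
	if (!blob)
		return;		/* in this model, putting NULL is a no-op */
	assert(blob->refcount > 0);
	blob->refcount--;
}

/* Hypothetical hw state holding two blob references. */
struct fake_hw_state {
	struct fake_blob *degamma_lut;
	struct fake_blob *gamma_lut;
};

/* Drop each reference and clear the pointer so a repeat call does nothing. */
static void fake_put_color_blobs(struct fake_hw_state *hw)
{
	fake_blob_put(hw->degamma_lut);
	hw->degamma_lut = NULL;

	fake_blob_put(hw->gamma_lut);
	hw->gamma_lut = NULL;
}
```

Calling `fake_put_color_blobs()` a second time on the same state then leaves the reference counts untouched instead of dropping them again.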
Ville Syrjälä [Thu, 9 Apr 2026 10:08:40 +0000 (13:08 +0300)]
drm/i915/mst: Call intel_pfit_compute_config() for sharpness filter
The sharpness filter property is on the CRTC (as opposed to the
connector) so the expectation is that it's usable on all output
types. Since the sharpness filter is now fully integrated into
the normal pfit code intel_pfit_compute_config() must be called
from the encoder .compute_config() on all relevant output types.
Sharpness filter is supported on LNL+ so only HDMI and DP SST/MST
outputs are actually relevant. I already took care of HDMI and
DP SST, but (as usual) forgot about DP MST. Add the missing
intel_pfit_compute_config() call to make the sharpness filter
operational on DP MST as well.
Usama Arif [Tue, 16 Jun 2026 14:15:18 +0000 (07:15 -0700)]
block: invalidate cached plug timestamp after task switch
blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.
The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().
As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.
Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().
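
A toy model of the caching and invalidation described above, for illustration only. `fake_clock_ns` stands in for ktime_get_ns(), the `block_ts` field models PF_BLOCK_TS, and all `fake_*` names are hypothetical; the real helpers live in the block layer and scheduler and differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t fake_clock_ns;	/* stand-in for ktime_get_ns() */

struct fake_plug {
	uint64_t cur_ktime;	/* 0 means "nothing cached" */
};

struct fake_task {
	struct fake_plug *plug;
	bool block_ts;		/* models PF_BLOCK_TS */
};

/* Return the cached timestamp if there is one, otherwise read and cache. */
static uint64_t fake_blk_time_get_ns(struct fake_task *t)
{
	if (!t->plug)
		return fake_clock_ns;
	if (!t->plug->cur_ktime) {
		t->plug->cur_ktime = fake_clock_ns;
		t->block_ts = true;
	}
	return t->plug->cur_ktime;
}

/* Models the hook run for the scheduled-in task after a context switch. */
static void fake_finish_task_switch(struct fake_task *t)
{
	if (t->block_ts) {
		assert(t->plug != NULL);
		t->plug->cur_ktime = 0;	/* drop the stale cached value */
		t->block_ts = false;
	}
}
```

While the task keeps running, repeated calls return the cached value; once `fake_finish_task_switch()` has run, the next call reads the clock again.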
Usama Arif [Tue, 16 Jun 2026 14:15:17 +0000 (07:15 -0700)]
kernel/fork: clear PF_BLOCK_TS in copy_process()
PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is
non-NULL, and blk_finish_plug() clears it via __blk_flush_plug()
before NULLing the plug pointer. copy_process() breaks the
invariant by inheriting PF_BLOCK_TS from the parent while resetting
the child's plug to NULL.
Clear PF_BLOCK_TS alongside that assignment so callers can rely on
"PF_BLOCK_TS set implies current->plug != NULL" and dereference
current->plug unguarded.
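
The invariant can be shown with a minimal flag-copy sketch. `FAKE_PF_BLOCK_TS`, `struct fake_task` and both functions are hypothetical; the flag value is arbitrary and does not match the kernel's definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_PF_BLOCK_TS	0x1u	/* arbitrary value for this sketch */

struct fake_task {
	unsigned int flags;
	void *plug;
};

/* "Flag set implies plug != NULL" */
static bool fake_invariant_holds(const struct fake_task *t)
{
	return !(t->flags & FAKE_PF_BLOCK_TS) || t->plug != NULL;
}

/* Models the fork path: the child starts as a copy of the parent. */
static void fake_copy_task(struct fake_task *child,
			   const struct fake_task *parent)
{
	*child = *parent;		/* flags are inherited here */
	child->plug = NULL;		/* the child starts without a plug */
	child->flags &= ~FAKE_PF_BLOCK_TS;	/* so the flag must go too */
}
```

Without the last line of `fake_copy_task()`, a child forked from a parent that had the flag set would violate the invariant from its first instruction.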
guzebing [Mon, 8 Jun 2026 13:33:16 +0000 (21:33 +0800)]
io_uring/register: preserve SQ array entries on resize
Ring resizing copies pending SQEs from the old SQE array into the new
one so submissions queued before the resize can still be consumed
afterwards.
That copy currently walks the SQ head/tail range directly. This is only
correct when there is no SQ array indirection. With a regular SQ array,
each pending SQ entry contains an index into the SQE array. After resize,
ctx->sq_array is repointed at the newly allocated array, so pending
entries lose their old logical-to-physical mapping and may submit the
wrong SQE.
Remember the old and new SQ arrays while migrating pending SQ entries. For
each pending entry, copy the SQE selected by the old array into the new
destination slot and rebuild the new array entry to point at the copied
SQE. Keep invalid user-provided entries invalid so the normal submission
path still drops them after resize.
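
The migration described above can be sketched in user-space C. This is an illustration under assumptions, not the io_uring implementation: `struct fake_sqe` and `fake_migrate_sq()` are hypothetical, ring sizes are assumed to be powers of two, head and tail are free-running counters, and `UINT32_MAX` is used here as the "still invalid" marker.

```c
#include <assert.h>
#include <stdint.h>

struct fake_sqe {
	uint64_t user_data;
};

/*
 * Copy every pending entry in [head, tail) from the old ring to the new
 * one, following the old SQ array indirection, and rebuild the new SQ
 * array so each pending slot points at the copied SQE.
 */
static void fake_migrate_sq(const struct fake_sqe *old_sqes,
			    const uint32_t *old_array, uint32_t old_entries,
			    struct fake_sqe *new_sqes,
			    uint32_t *new_array, uint32_t new_entries,
			    uint32_t head, uint32_t tail)
{
	uint32_t i;

	/* Both sizes must be powers of two for the masking below. */
	assert(old_entries && !(old_entries & (old_entries - 1)));
	assert(new_entries && !(new_entries & (new_entries - 1)));

	for (i = head; i != tail; i++) {
		uint32_t src = old_array[i & (old_entries - 1)];
		uint32_t dst = i & (new_entries - 1);

		if (src >= old_entries) {
			/* Invalid user-provided index: keep it invalid. */
			new_array[dst] = UINT32_MAX;
			continue;
		}
		new_sqes[dst] = old_sqes[src];	/* SQE chosen by the old array */
		new_array[dst] = dst;		/* new array points at the copy */
	}
}
```

A copy that walked the head/tail range directly would instead take `old_sqes[i & mask]`, which is only the right SQE when the array holds the identity mapping.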
Wen Xiong [Tue, 16 Jun 2026 14:31:21 +0000 (10:31 -0400)]
block: Remove redundant plug in __submit_bio()
The patch removes the automatic plug/unplug operations from __submit_bio()
that were added to cache nsecs time when no explicit plug is used.
The plug mechanism is most effective when batching multiple I/O
operations together. Creating a plug for every bio submission
provides minimal benefit while adding function call overhead and
stack usage for every I/O operation.
Below is performance comparison with the latest upstream kernel.
Yitang Yang [Tue, 16 Jun 2026 15:51:29 +0000 (23:51 +0800)]
block: fix IORING_URING_CMD_REISSUE flags check in blkdev_uring_cmd
blkdev_uring_cmd() checks IORING_URING_CMD_REISSUE to determine whether
this is the first issue. However, this flag lives in cmd->flags instead
of issue_flags.
Coincidentally, IO_URING_F_NONBLOCK shares bit 31 with
IORING_URING_CMD_REISSUE. As a result, the SQE read was never performed,
bic->len remained zero, and every BLOCK_URING_CMD_DISCARD failed with
-EINVAL.
Fix it by checking cmd->flags as intended.
Cc: stable@vger.kernel.org Fixes: 212ec34e4e72 ("block: only read from sqe on initial invocation of blkdev_uring_cmd") Signed-off-by: Yitang Yang <yi1tang.yang@gmail.com> Link: https://patch.msgid.link/20260616155129.406057-1-yi1tang.yang@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
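
The bit collision can be reproduced in a small sketch. The two macros below are hypothetical names that only mirror the fact stated above, that both flags occupy bit 31 of different flag words; `struct fake_cmd` and the two helpers are likewise illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_ISSUE_F_NONBLOCK	(1u << 31)	/* lives in issue_flags */
#define FAKE_CMD_REISSUE	(1u << 31)	/* lives in cmd->flags */

struct fake_cmd {
	uint32_t flags;
};

/* Buggy variant: tests the reissue bit in the wrong flag word. */
static bool fake_first_issue_buggy(const struct fake_cmd *cmd,
				   uint32_t issue_flags)
{
	(void)cmd;
	return !(issue_flags & FAKE_CMD_REISSUE);
}

/* Fixed variant: tests the reissue bit where it is actually stored. */
static bool fake_first_issue_fixed(const struct fake_cmd *cmd,
				   uint32_t issue_flags)
{
	(void)issue_flags;
	return !(cmd->flags & FAKE_CMD_REISSUE);
}
```

For a first, non-blocking issue (`cmd->flags == 0`, `issue_flags` containing the non-blocking bit), the buggy variant reports a reissue and the fixed one reports a first issue.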
====================
tls: reject the combination of TLS and sockmap
There are no known TLS+sockmap users and it has some known
hard to solve bugs. Let's reject this configuration as we
discussed a number of times.
====================
Jakub Kicinski [Sun, 14 Jun 2026 01:41:00 +0000 (18:41 -0700)]
selftests/bpf: test that TLS crypto is rejected on a sockmap socket
TLS and sockmap are mutually exclusive. We already have a test
for the sockmap side rejecting kTLS, add the inverse test matching
patch 1 of this series.
Jakub Kicinski [Sun, 14 Jun 2026 01:40:59 +0000 (18:40 -0700)]
selftests/bpf: drop the unused kTLS program from test_sockmap
With the sockmap + kTLS tests gone, the BPF-side support in test_sockmap
is dead: the tls_sock_map map and bpf_prog3 (which redirected skbs into
it) are no longer referenced. Remove them, along with the now-unused
bpf_write_pass() helper.
bpf_prog3 was progs[2], so renumber the progs[] users in test_sockmap.c:
the sockops program drops to progs[2] and the sk_msg tx programs to
progs[3..7]. Shrink the map/prog arrays from 9 to 8 and drop the
tls_sock_map entry (the last one) from map_names[] to match.
Jakub Kicinski [Sun, 14 Jun 2026 01:40:58 +0000 (18:40 -0700)]
selftests/bpf: remove sockmap + ktls tests
The combination of sockmap and TLS is no longer supported - installing
the TLS ULP on a sockmap socket (and vice versa) is now rejected. Remove
the tests that exercise the combination along with their BPF program;
the file covered nothing but sockmap sockets holding kTLS contexts.
Jakub Kicinski [Sun, 14 Jun 2026 01:40:57 +0000 (18:40 -0700)]
tls: remove dead sockmap (psock) handling from the SW path
TLS and sockmap are now mutually exclusive. Try to delete the code
from the sendmsg and recvmsg paths which is now obviously dead.
The main goal is to delete enough code for AI security scanners
to no longer bother us with sockmap related bugs. At the same
time retain the code in case someone has the cycles to fix
all of this and make the integration work, again.
If the integration does not get restored we can wipe the rest
of the skmsg code from TLS in two or three releases.
The changes on the Tx side are deeper since that's where most
of the bugs are, Rx side simply takes the data from sockmap
and gives it to the user. On Tx split record handling and
rolling back the iterator were the two problem areas.
Jakub Kicinski [Sun, 14 Jun 2026 01:40:56 +0000 (18:40 -0700)]
tls: reject the combination of TLS and sockmap
TLS and sockmap (BPF psock) integration hides a lot of latent bugs.
Bugs which may be more or less relevant for real users but they
are definitely exploitable.
We could not find anyone actively using this integration so let's
reject this config. Adding a TLS socket to a sockmap was already
rejected by sk_psock_init() through the inet_csk_has_ulp() check.
We need to reject the attempts to configure the TLS keys (rather
than adding the ULP itself) because checking prior to the ULP
installation is tricky without risking a race with sockmap getting
added in parallel (sockmap does not hold the socket lock).
This patch is a minimal rejection of the feature. A subsequent patch
in the series will do a light dead code removal. A full cleanup would
require a major rewrite of the Tx path, as we don't need skmsg any more.
Jakub Kicinski [Tue, 16 Jun 2026 15:53:56 +0000 (08:53 -0700)]
Merge branch 'atm-remove-more-dead-code'
Jakub Kicinski says:
====================
atm: remove more dead code
Commit 6deb53595092 ("net: remove unused ATM protocols and legacy
ATM device drivers") removed a good chunk of old ATM drivers.
Our goal going forward is to limit the ATM support to PPPoATM
used in ADSL deployments.
A recent burst of AI generated fixes for net/atm/signaling.c and
net/atm/svc.c made me look closer at the remaining code. PPPoATM runs
over permanent virtual circuits (PF_ATMPVC) with a statically
configured VPI/VCI. We can drop switched virtual circuits (SVCs)
and user-space signaling (atmsigd) support. While digging around
I noticed a few more obviously dead pieces of code.
Annoyingly, I have applied one "fix" to QoS config which will
now make net conflict with this series :/
====================
Jakub Kicinski [Mon, 15 Jun 2026 19:44:16 +0000 (12:44 -0700)]
atm: remove orphaned uAPI for deleted drivers, protocols and SVCs
ATM removals have left a number of uAPI headers and ioctl
definitions with no in-kernel implementation behind them:
- device headers for adapters deleted with the legacy PCI/SBUS drivers:
atm_eni.h, atm_he.h, atm_idt77105.h, atm_nicstar.h, atm_zatm.h and
the atmtcp pair atm_tcp.h / <linux/atm_tcp.h>
- protocol headers for the removed CLIP, LANE and MPOA stacks:
atmarp.h, atmclip.h, atmlec.h, atmmpc.h
- atmsvc.h and the SVC / p2mp / local-address ioctls in atmdev.h
(ATM_{GET,RST,ADD,DEL}ADDR, ATM_{ADD,DEL,GET}LECSADDR,
ATM_{ADD,DROP}PARTY) left behind by the SVC and address-registry
removals
None of these are referenced by any remaining in-tree code.
Let's try to delete all this. Chances are nobody cares about
these headers any more. I'm keeping this separate from the
kernel side code changes for ease of revert, in case I am
proven wrong...
Jakub Kicinski [Mon, 15 Jun 2026 19:44:15 +0000 (12:44 -0700)]
atm: remove unused ATM PHY operations
The PHY operations are vestiges of the SAR/framer split used by the
removed PCI/SBUS ATM adapters:
- atmdev_ops::phy_put / ::phy_get (register accessors) are never called
by the core and solos-pci only listed them as NULL
- struct atmphy_ops and atm_dev::phy have no users at all - nothing
assigns or dereferences them
Remove all of them. atm_dev::phy_data is kept: solos-pci repurposes it
to stash its per-port channel index.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:14 +0000 (12:44 -0700)]
atm: remove the unused pre_send and send_bh device operations
atmdev_ops::pre_send (a TX pre-processing hook) and ::send_bh (a
bottom-half capable send variant) have no implementation behind them:
no remaining ATM driver sets either, so vcc_sendmsg() always skipped
pre_send and the raw AAL0/AAL5 paths always fell back to ->send().
The drivers that used these hooks were removed with the legacy ATM
adapters.
Drop both operations and the dead branches that tested for them.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:13 +0000 (12:44 -0700)]
atm: remove the unused change_qos device operation
atmdev_ops::change_qos() was the hook for renegotiating the traffic
parameters of an already-connected VCC, driven from SO_ATMQOS on a
connected socket (and previously from the SVC as_modify path, now gone).
None of the ATM drivers left in tree implement it - solos-pci only listed
change_qos = NULL - so atm_change_qos() always returned -EOPNOTSUPP.
Drop the operation and return -EOPNOTSUPP directly.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:12 +0000 (12:44 -0700)]
atm: remove SVC socket support and the signaling daemon interface
ATM switched virtual circuits (SVCs) are set up and torn down by a
user-space signaling daemon (atmsigd) which the kernel talks to over
a dedicated "sigd" socket: the kernel marshals Q.2931-style requests
(as_connect, as_listen, as_accept, as_close, ...) to the daemon and
applies the results to PF_ATMSVC sockets. This is the machinery behind
classical SVC use and was the foundation for LANE / MPOA, all of which
have been removed.
DSL deployments do not use any of this. PPPoATM and BR2684 run over
permanent virtual circuits (PF_ATMPVC) with a statically configured
VPI/VCI; no atmsigd, no Q.2931. Neither remaining ATM driver
(solos-pci, the USB DSL modems) is reachable through the SVC path.
Remove the SVC socket family and the signaling interface:
- delete net/atm/svc.c, net/atm/signaling.c and signaling.h
- drop atmsvc_init()/atmsvc_exit() and the PF_ATMSVC registration and
module alias
- drop the ATMSIGD_CTRL ioctl (sigd_attach) and the /proc/net/atm/svc
file
- fold the SVC branch out of atm_change_qos(); all sockets are PVCs now
The obsolete ATM_SETSC ioctl stub is left in place (it already just
warns and returns 0), as is the struct atm_vcc SVC bookkeeping shared
with the queueing layer.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:11 +0000 (12:44 -0700)]
atm: remove the local ATM (NSAP) address registry
net/atm/addr.c maintained the per-device lists of local NSAP addresses
(dev->local) and ILMI-learned LECS addresses (dev->lecs). These exist
solely to serve SVC signaling: the lists are populated through the
ATM_{ADD,DEL,RST}ADDR / ATM_{ADD,DEL,GET}LECSADDR ioctls used by the
atmsigd / ILMI daemons, and consumed when registering addresses with the
signaling daemon. The LECS list belonged to LAN Emulation, which has
been removed.
With no SVC users in a DSL-only configuration these lists are always
empty, so drop the registry entirely:
- remove the ADDR/LECSADDR/RSTADDR ioctls
- drop the now-always-empty "atmaddress" sysfs attribute
- remove the dev->local / dev->lecs lists, structs and enums
- delete net/atm/addr.c and net/atm/addr.h
The device ESI ("MAC" address) and its ATM_{G,S}ETESI ioctls and
"address" sysfs attribute are retained - the USB DSL modems populate
the ESI.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:10 +0000 (12:44 -0700)]
atm: remove dead SONET PHY ioctls
The SONET_* ioctls are SONET/SDH PHY controls that atm_dev_ioctl() and
the compat path only ever forwarded to the driver's ->ioctl() handler.
The PHY drivers that implemented them (the S/UNI library and the framers
on the removed PCI/SBUS adapters) are gone, and neither surviving driver
services them: solos-pci has no ->ioctl, and usbatm handles only
ATM_QUERYLOOP. They now uniformly return an error regardless.
Drop the SONET compat passthrough and the SONET cases in atm_dev_ioctl(),
along with the now-unused linux/sonet.h includes. The SONET_* uAPI
definitions are untouched.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:09 +0000 (12:44 -0700)]
atm: remove the unused send_oam / push_oam callbacks
The atmdev_ops::send_oam device operation and the atm_vcc::push_oam
callback were the kernel's interface for raw F4/F5 OAM cell exchange.
Nothing assigns them a non-NULL value and nothing ever invokes them:
the core only ever initialises push_oam to NULL (in vcc_create() and the
AAL init helpers) and the Solos driver only lists send_oam = NULL for
documentation. The drivers that actually drove OAM through these hooks
were removed along with the legacy ATM adapters.
Jakub Kicinski [Mon, 15 Jun 2026 19:44:08 +0000 (12:44 -0700)]
atm: remove AAL3/4 transport support
AAL3/4 is an obsolete connection-oriented ATM adaptation layer that has
seen no real use since the SMDS-era hardware it was designed for (90s?).
We are only maintaining ATM support in-tree to keep PPPoATM running,
and PPPoATM runs over AAL5.
Drop the "raw" AAL3/4 transport (atm_init_aal34()) and the ATM_AAL34
cases in the connect and traffic-parameter paths. A vcc_connect() with
qos.aal == ATM_AAL34 now fails with -EPROTOTYPE.
Ricardo Robaina [Tue, 16 Jun 2026 12:36:32 +0000 (09:36 -0300)]
io_uring, audit: don't log IORING_OP_RECV_ZC
IORING_OP_RECV_ZC is a read operation. Audit only tracks file/socket
creation, not subsequent reads. Set audit_skip to align with
audit-userspace uringop_table.h.
Fixes: 11ed914bbf94 ("io_uring/zcrx: add io_recvzc request") Suggested-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20260616123632.3209545-1-rrobaina@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Jun 2026 19:43:16 +0000 (13:43 -0600)]
io_uring: get rid of tw_pending for !DEFER task work
The normal task_work path used a tw_pending bit to ensure the callback
was only added once: the mpscq drains incrementally, so a single
tctx_task_work() run can take the queue through empty -> non-empty
several times, and each transition would otherwise re-add the already
pending callback_head. This corrupts the task_work list, and is what
tw_pending protects against.
This can go away, if we stop running the task_work as soon as the queue
empties.
Finalize commit c33c794828f2 ("mm: ptep_get() conversion") and
replace direct page table entry dereferencing with the proper
accessors (ptep_get(), pmdp_get(), etc.).
Override the default getter implementations even though they are
currently identical: pud_clear(), p4d_clear(), and pgd_clear()
require corresponding architecture-specific getters, but these
are not yet defined. This avoids a dependency loop.
Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Selvin Xavier [Mon, 15 Jun 2026 22:47:45 +0000 (15:47 -0700)]
RDMA/bnxt_re: Add a max slot check for SQ
In variable WQE mode, the SQ slot count must be validated against
the maximum number of slots supported by HW, which is 64K. Add a
max and min check and fail if the user-supplied value is more than
the supported maximum or is zero.
Fixes: d8ea645d6984 ("RDMA/bnxt_re: Handle variable WQE support for user applications") Link: https://patch.msgid.link/r/20260615224751.232802-10-selvin.xavier@broadcom.com Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
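
A minimal sketch of such a bounds check, assuming the 64K limit from the message above. `FAKE_MAX_SQ_SLOTS` and `fake_validate_sq_slots()` are hypothetical names, and returning -EINVAL is an assumption; the commit message does not say which error the driver uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FAKE_MAX_SQ_SLOTS	65536u	/* "64K" from the commit message */

/* Reject a user-supplied slot count that is zero or above the HW limit. */
static int fake_validate_sq_slots(uint32_t slots)
{
	if (slots == 0 || slots > FAKE_MAX_SQ_SLOTS)
		return -EINVAL;	/* assumed error code */
	return 0;
}
```

The check accepts exactly the range 1..65536 and is meant to run before the value is used to size any queue memory.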
Selvin Xavier [Mon, 15 Jun 2026 22:47:44 +0000 (15:47 -0700)]
RDMA/bnxt_re: Avoid displaying the kernel pointer
While dumping the MR info using the rdma tool, we
dump mr_hwq, which is a kernel pointer. There is
no need to expose this value to the end user, so stop
dumping it.
Fixes: 7363eb76b7f3 ("RDMA/bnxt_re: Support driver specific data collection using rdma tool") Link: https://patch.msgid.link/r/20260615224751.232802-9-selvin.xavier@broadcom.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>