Jakub Kicinski [Wed, 29 Apr 2026 23:55:57 +0000 (16:55 -0700)]
Merge branch 'net-psp-add-more-validation'
Jakub Kicinski says:
====================
net: psp: add more validation
Address some AI code-scan issues with the PSP code.
I don't think any of these are real bugs, but they may
become bugs in the future. The two real bugs discovered
were posted separately for net. AI reports 3 more which
seem plain wrong (rx SPI "leak" on error etc.).
====================
Jakub Kicinski [Tue, 28 Apr 2026 20:53:52 +0000 (13:53 -0700)]
psp: validate IPv4 header fields in psp_dev_rcv()
psp_dev_rcv() is called from the NIC driver's RX completion path
before the frame reaches ip_rcv_core(), so the IP header has not
been validated in SW, yet. We expect that the device has done
all this validation, but let's also add the SW checks, to avoid
surprises.
Jakub Kicinski [Tue, 28 Apr 2026 20:53:51 +0000 (13:53 -0700)]
psp: add a comment about a psp_dev add netlink notification
In psp_dev_create(), the DEV_ADD_NTF netlink notification is sent
before the device is published to the netdev via rcu_assign_pointer().
IIRC this is intentional because a single PSP device is expected
to be shared with multiple netdevs. So we are trying to default to
not having the netdev info. We can change it if someone complains
but for now just add a comment that it's intentional.
Jakub Kicinski [Tue, 28 Apr 2026 20:53:50 +0000 (13:53 -0700)]
psp: validate protocol before mutating skb in psp_dev_encapsulate()
Code checkers / AI scans will complain that we have already modified
the packet by the time we realize that protocol is not IP.
Move the skb->protocol check to before skb_push()/memmove() so that
the skb is not left in a corrupted state when the function returns
false for an unsupported protocol. psp_dev_rcv() follows similar
pattern.
Today this path is unreachable because both in-tree callers (mlx5 and
netdevsim) only reach psp_dev_encapsulate() from TCP socket TX paths
where skb->protocol is always ETH_P_IP or ETH_P_IPV6, and both drop
the skb on a false return, anyway.
Jakub Kicinski [Tue, 28 Apr 2026 20:39:24 +0000 (13:39 -0700)]
MAINTAINERS: update the IPv4/IPv6 entry and add Ido Schimmel
The IPv4/IPv6 and routing code is not very well separated from
the TCP/UDP code. Scope it down properly by providing a more
accurate file list, instead of net/ipv4/ and net/ipv6/
Now that the entry is more accurately representing layer 3
and routing merge in the nexthop entry into it.
Add Ido Schimmel as a co-maintainer, Ido's git history speaks
for itself.
Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260428203924.1229169-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 28 Apr 2026 20:33:57 +0000 (13:33 -0700)]
selftests: drv-net: clarify linters and frameworks in README
Minor clarifications in the README:
- call out what linters we expect to be clean
- make it clear that by "frameworks" we mean code under lib/
not just factoring code out in the same file
Jakub Kicinski [Tue, 28 Apr 2026 02:53:20 +0000 (19:53 -0700)]
net: add net_iov_init() and use it to initialize ->page_type
Commit db359fccf212 ("mm: introduce a new page type for page pool in
page type") added a page_type field to struct net_iov at the same
offset as struct page::page_type, so that page_pool_set_pp_info() can
call __SetPageNetpp() uniformly on both pages and net_iovs.
The page-type API requires the field to hold the UINT_MAX "no type"
sentinel before a type can be set; for real struct page that invariant
is established by the page allocator on free. struct net_iov is not
allocated through the page allocator, so the field is left as zero
(io_uring zcrx, which uses __GFP_ZERO) or as slab garbage (devmem,
which uses kvmalloc_objs() without zeroing). When the page pool then
calls page_pool_set_pp_info() on a freshly-bound niov,
__SetPageNetpp()'s VM_BUG_ON_PAGE(page->page_type != UINT_MAX) fires
and the kernel BUGs. Triggered in selftests by io_uring zcrx setup
through the fbnic queue restart path:
The same path is reachable through devmem dmabuf binding via
netdev_nl_bind_rx_doit() -> net_devmem_bind_dmabuf_to_queue().
Add a net_iov_init() helper that stamps ->owner, ->type and the
->page_type sentinel, and use it from both the devmem and io_uring
zcrx niov init loops.
Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type") Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Acked-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/20260428025320.853452-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Various names for Qualcomm as a company are used in user-visible config
options: QCOM, Qualcomm and Qualcomm Technologies. Switch to unified
"Qualcomm" so it will be easier for users to identify the options when
for example running menuconfig.
Weiming Shi [Mon, 27 Apr 2026 12:34:50 +0000 (14:34 +0200)]
netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
nft_fwd_neigh can be used in egress chains (NF_NETDEV_EGRESS). When the
forwarding rule targets the same device or two devices forward to each
other, neigh_xmit() triggers dev_queue_xmit() which re-enters
nf_hook_egress(), causing infinite recursion and stack overflow.
Move the nf_get_nf_dup_skb_recursion() accessor and NF_RECURSION_LIMIT
to the shared header nf_dup_netdev.h as a static inline, so that
nft_fwd_netdev can use the recursion counter directly without exported
function call overhead. Guard neigh_xmit() with the same recursion
limit already used in nf_do_netdev_egress().
[ Updated to cache the nf_get_nf_dup_skb_recursion pointer. --pablo ]
Fixes: f87b9464d152 ("netfilter: nft_fwd_netdev: Support egress hook") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
The ttl field has been decremented already and evaluation of this rule
would proceed, just drop this packet instead if there is no destination
device to forwards this packet. This is exactly what nf_dup already does
in this case.
Moreover, check for headroom and call skb_expand_head() like in the IP
output path to ensure there is sufficient headroom when forwarding this
via neigh_xmit().
Fixes: d32de98ea70f ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: replace skb_try_make_writable() by skb_ensure_writable()
skb_try_make_writable() only works on clones and uncloned packets might
have their network header in paged fragments.
nft_fwd needs to work for the ingress and egress hooks, but the egress
hook where skb->data points to the mac header, use skb_network_offset()
to include the mac header. The flowtable is fine since it already uses
the transport offset.
Fixes: d32de98ea70f ("netfilter: nft_fwd_netdev: allow to forward packets via neighbour layer") Fixes: 7d2086871762 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On L1VH, debugfs stats pages are overlay pages: the kernel allocates
them and registers the GPAs with the hypervisor via
HVCALL_MAP_STATS_PAGE2. These overlay mappings persist in the
hypervisor across kexec. If the kexec'd kernel reuses those physical
pages, the hypervisor's overlay semantics cause a machine check
exception.
Fix this by calling mshv_debugfs_exit() from the reboot notifier,
which issues HVCALL_UNMAP_STATS_PAGE for each mapped stats page before
kexec. This releases the overlay bindings so the physical pages can be
safely reused. Guard mshv_debugfs_exit() against being called when
init failed.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
The reboot notifier that tears down the SynIC cpuhp state guards the
cleanup with hv_root_partition(), so on L1VH (where
hv_root_partition() is false) SINT0, SINT5, and SIRBP are never
cleaned up before kexec. The kexec'd kernel then inherits stale
unmasked SINTs and an enabled SIRBP pointing to freed memory.
Remove the hv_root_partition() guard so the cleanup runs for all
parent partitions.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
mshv: limit SynIC management to MSHV-owned resources
The SynIC is shared between VMBus and MSHV. VMBus owns the message
page (SIMP), event flags page (SIEFP), global enable (SCONTROL),
and SINT2. MSHV adds SINT0, SINT5, and the event ring page (SIRBP).
Currently mshv_synic_cpu_init() redundantly enables SIMP, SIEFP, and
SCONTROL that VMBus already configured, and mshv_synic_cpu_exit()
disables all of them. This is wrong because MSHV can be torn down
while VMBus is still active. In particular, a kexec reboot notifier
tears down MSHV first. Disabling SCONTROL, SIMP, and SIEFP out
from under VMBus causes its later cleanup to write SynIC MSRs while
SynIC is disabled, which the hypervisor does not tolerate.
Restrict MSHV to managing only the resources it owns:
- SINT0, SINT5: mask on cleanup, unmask on init
- SIRBP: enable/disable as before
- SIMP, SIEFP, SCONTROL: leave to VMBus when it is active (L1VH
and nested root partition); on a non-nested root partition VMBus
does not run, so MSHV must enable/disable them
While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
register GPAs regardless of the kernel page size, so using PAGE_SHIFT
produces wrong addresses on ARM64 with 64K pages.
Note that initialization order matters - VMBUS first, MSHV second,
and the reverse on de-init. Ideally, we would want a dedicated SYNIC
driver that replaces the cross-dependencies with a clear API and
dynamic tracking. Such refactor should go into its own dedicated
series, outside of this kexec fix series.
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
scripts/x86/intel: Add a script to update the old microcode list
The kernel maintains a table of minimum expected microcode revisions for
Intel CPUs in intel-ucode-defs.h. Systems with microcode older than
these revisions are flagged with X86_BUG_OLD_MICROCODE.
The static list of microcode revisions needs to be updated periodically
in response to releases of the official microcode at:
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git.
Introduce a simple script to extract the revision information from the
microcode files and print it in the precise format expected by the
microcode header.
Maintaining the script in the kernel tree ensures a central location
that a submitter can use to generate the kernel-specific update. This
not only reduces the possibility of errors but also makes it easier to
validate the changes for reviewers and maintainers.
Typically, someone at Intel would see a new public release, wait for at
least three months to ensure the update is stable, run this script to
refresh the intel-ucode-defs.h file, and send a patch upstream to update
the mainline and stable versions.
Having a standard update script and a defined process minimizes the
ambiguity when refreshing the old microcode list. As always, there can
be exceptions to this process which should be supported with appropriate
justification.
x86/microcode/intel: Refresh old_microcode defines with Nov 2025 release
Update the minimum expected revisions of Intel microcode based on the
microcode-20251111 (November 2025) release.
Add an SPDX header as well as a comment to specify that the file is
auto-generated. While at it, remove an extra space before the model
field to match the formatting of the other fields.
Shyam Prasad N [Mon, 30 Mar 2026 10:49:59 +0000 (16:19 +0530)]
cifs: change_conf needs to be called for session setup
Today we skip calling change_conf for negotiates and session setup
requests. This can be a problem for mchan as the immediate next call
after session setup could be due to an I/O that is made on the
mount point. For single channel, this is not a problem as
there will be several calls after setting up session.
This change enforces calling change_conf when the total credits contain
enough for reservations for echoes and oplocks. We expect this to happen
during the last session setup response. This way, echoes and oplocks are
not disabled before the first request to the server. So if that first
request is an open, it does not need to disable requesting leases.
Cc: <stable@vger.kernel.org> Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
smb: client: change allocation requirements in smb2_compound_op
Currently, smb2_compound_op() allocates
struct smb2_compound_vars *vars using GFP_ATOMIC, although
smb2_compound_op() can sleep when it calls compound_send_recv()
before vars is freed.
Allocate vars using GFP_KERNEL.
Signed-off-by: Fredric Cover <fredric.cover.lkernel@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
hv: utils: replace deprecated strcpy with strscpy in kvp_register
strcpy() has been deprecated [1] because it performs no bounds checking
on the destination buffer, which can lead to buffer overflows. While the
current code works correctly, replace strcpy() with the safer strscpy()
to follow secure coding best practices. Use ->body.kvp_register.version
directly as the destination buffer and remove the local variable.
Hyunchul Lee [Tue, 28 Apr 2026 01:34:10 +0000 (10:34 +0900)]
ntfs: drop nlink once for WIN32/DOS aliases
NTFS could store a filename as paired WIN32 and DOS $FILE_NAME attributes
for directories. But ntfs_delete() deleted both attributes for unlinking
a directory, but it also called drop_nlink() for each attributes.
This could trigger warnings when unlinking directories.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Timur Tabi [Tue, 28 Apr 2026 20:28:25 +0000 (15:28 -0500)]
drm/nouveau: expose VBIOS via debugfs on GSP-RM systems
When Nouveau boots with GSP-RM, it bypasses several traditional code
paths to allow GSP-RM to handle those features. In particular,
some VBIOS parsing is skipped, and a side effect is that the VBIOS
is not exposed in the DRM debugfs entries.
Fix this by updating the drm BIOS struct (nvbios) with the VBIOS data
from the nvkm BIOS struct (nvkm_bios). This happens normally in
NVInitVBIOS(), but that function is skipped when booting with GSP-RM.
Eliot Courtney [Tue, 7 Apr 2026 03:59:50 +0000 (12:59 +0900)]
gpu: nova: require little endian
The driver already assumes little endian in a lot of locations. For
example, all the code that reads RPCs out of the command queue just
directly interprets the bytes.
Make this explicit in Kconfig.
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Gary Guo <gary@garyguo.net> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Link: https://patch.msgid.link/20260407-fix-kconfig-v2-1-6b4fb06c690c@nvidia.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
hv: utils: handle and propagate errors in kvp_register
Make kvp_register() return an error code instead of silently ignoring
failures, and propagate the error from kvp_handle_handshake() instead of
returning success.
This propagates both kzalloc_obj() and hvutil_transport_send() failures
to kvp_handle_handshake() and thus to kvp_on_msg().
Fixes: 245ba56a52a3 ("Staging: hv: Implement key/value pair (KVP)") Cc: stable@vger.kernel.org Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Stephen Smalley [Wed, 29 Apr 2026 19:18:40 +0000 (15:18 -0400)]
selinux: switch two allocations to use kzalloc_objs()
These were the only two allocations in the policy loading logic
that were not already using kzalloc_objs() for the policy
data structures. Fix these to be consistent with the rest and
to protect against ill-formed policy.
Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
net/mlx5: Extend query_esw_functions output for multi-function support
Update the query_esw_functions command to support a new response layout
that can report data for multiple network functions. Setting bit 14 of
the op_mod field selects the v1 layout with network_function_params
entries instead of the legacy host_params_context.
The query_host_net_function_v1 read-only capability indicates firmware
support for layout version 1, and query_host_net_function_num_max
advertises the maximum number of network function entries.
Define a new network_function_params layout and a net_function_params
union that groups host_params_context and network_function_params.
Rework the query_esw_functions output to use a flexible array of this
union, and adjust existing driver callers to use it.
Drop the unused host_sf_enable array from
mlx5_ifc_query_esw_functions_out_bits layout. This field has been
deprecated in firmware and is not referenced by the mlx5 driver, so it
can be safely removed.
net/mlx5: Add function_id_type for enable/disable_hca cmds
Add a function_id_type field to the enable_hca and disable_hca command
input layouts in mlx5_ifc.h to allow using vhca_id as the function index
instead of function_id. The new field support by firmware is indicated
by the function_id_type_vhca_id capability bit, which is already exposed
in hca caps.
mlx5: Rename the vport number enums for host PF and VF
Rename the vport number enums MLX5_VPORT_PF to MLX5_VPORT_HOST_PF and
MLX5_VPORT_FIRST_VF to MLX5_VPORT_FIRST_HOST_VF to indicate that these
vport indices represent the host PF and its VFs. This prepares the code
for upcoming support of an additional PF type.
Steven Rostedt [Tue, 28 Apr 2026 16:23:02 +0000 (12:23 -0400)]
tracing/probes: Limit size of event probe to 3K
There currently isn't a max limit an event probe can be. One could make an
event greater than PAGE_SIZE, which makes the event useless because if
it's bigger than the max event that can be recorded into the ring buffer,
then it will never be recorded.
A event probe should never need to be greater than 3K, so make that the
max size. As long as the max is less than the max that can be recorded
onto the ring buffer, it should be fine.
Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events") Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)
alloc_workqueue_va() forwards its va_list to __alloc_workqueue() which
ultimately feeds vsnprintf(). __alloc_workqueue() already carries
__printf(1, 0); the new wrapper needs the same annotation so format
string checking propagates through the forwarding.
Michael Guralnik [Mon, 27 Apr 2026 11:02:36 +0000 (14:02 +0300)]
RDMA/mlx5: Fix null-ptr-deref in Raw Packet QP creation
Raw Packet QPs are unique in that they support separate send and receive
queues, using 2 different user-provided buffers.
They can also be created with one of the queues having size 0, allowing
a send-only or receive-only QP.
The Raw Packet RQ umem is created in the common user QP creation path,
which allows zero-length queues. Add a later validation of the RQ umem
in Raw Packet QP creation path when an RQ was requested.
This prevents possible null-ptr dereference crashes, as seen in the
below trace:
Fixes: 0fb2ed66a14c ("IB/mlx5: Add create and destroy functionality for Raw Packet QP") Link: https://patch.msgid.link/r/20260427-security-bug-fixes-v3-5-4621fa52de0e@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Michael Guralnik [Mon, 27 Apr 2026 11:02:35 +0000 (14:02 +0300)]
RDMA/core: Fix rereg_mr use-after-free race
When a driver creates a new MR during rereg_user_mr, a race window
exists between rdma_alloc_commit_uobject() for the new MR and the point
where the code reads that MR to populate the response keys.
A concurrent rereg_mr or destroy_mr could destroy the MR in this window
and cause UAF in the first thread.
Racing flow between two rereg_mr calls:
CPU0 CPU1
---- ----
rereg_user_mr(mr_handle)
uobj_get_write(mr_handle) -> mr0
mr1 = driver→rereg()
rdma_alloc_commit_uobject(mr1)
// mr1 replaced mr0 and is unlocked
uobj_put_destroy(mr0)
rereg_user_mr(mr_handle)
uobj_get_write(mr_handle) -> mr1
mr2 = driver→rereg()
rdma_alloc_commit_uobject(mr2)
// mr2 replaced mr1 and is unlocked
uobj_put_destroy(mr1)
// Destroys mr1!
resp.lkey = mr1->lkey; // UAF - mr1 was freed!
resp.rkey = mr1->rkey; // UAF - mr1 was freed!
Fix by storing lkey/rkey in local variables before the new MR is
unlocked and using the local variables to set the user response.
Fixes: 6e0954b11c05 ("RDMA/uverbs: Allow drivers to create a new HW object during rereg_mr") Link: https://patch.msgid.link/r/20260427-security-bug-fixes-v3-4-4621fa52de0e@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
IB/core: Fix IPv6 netlink message size in ib_nl_ip_send_msg()
When resolving an RDMA-CM IPv6 address, ib_nl_ip_send_msg() sends a
netlink request to the userspace daemon to perform IP-to-GID
resolution in certain cases. The function allocates the netlink message
buffer using nla_total_size(sizeof(size)), which passes 8 bytes (the
size of size_t) instead of 16 bytes (the size of an IPv6 address).
This results in an 8-byte under-allocation.
This is currently masked by nlmsg_new() over-allocation of the skb
in its internal logic. However, the code remains incorrect.
Fix the issue by supplying the proper IPv6 address length to
nla_total_size().
Fixes: ae43f8286730 ("IB/core: Add IP to GID netlink offload") Link: https://patch.msgid.link/r/20260427-security-bug-fixes-v3-3-4621fa52de0e@nvidia.com Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Edward Srouji [Mon, 27 Apr 2026 11:02:33 +0000 (14:02 +0300)]
RDMA/mlx5: Fix UAF in DCT destroy due to race with create
A potential race condition exists between mlx5_core_destroy_dct() and
mlx5_core_create_dct() that can lead to a use-after-free.
After _mlx5_core_destroy_dct() releases the DCT to firmware, the DCTN
can be immediately reallocated for a new DCT being created concurrently.
If the create path stores the new DCT in the xarray before the destroy path
erases it, the destroy will incorrectly delete the new DCT's entry.
Later accesses then hit freed memory.
Fix by replacing the unconditional xa_erase_irq() with xa_cmpxchg_irq()
that only erases the entry if it hasn't already been replaced (still
contains XA_ZERO_ENTRY), preserving any newly created DCT.
Fixes: afff24899846 ("RDMA/mlx5: Handle DCT QP logic separately from low level QP interface") Link: https://patch.msgid.link/r/20260427-security-bug-fixes-v3-2-4621fa52de0e@nvidia.com Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Edward Srouji [Mon, 27 Apr 2026 11:02:32 +0000 (14:02 +0300)]
RDMA/mlx5: Fix UAF in SRQ destroy due to race with create
A race condition exists between mlx5_cmd_destroy_srq() and
mlx5_cmd_create_srq() that can lead to a use-after-free (UAF) [1].
After destroy_srq_split() releases the SRQ to firmware, the SRQN can be
immediately reallocated for a new SRQ being created concurrently. If the
create path stores the new SRQ in the xarray before the destroy path
erases it, the destroy will incorrectly delete the new SRQ's entry.
Later accesses then hit freed memory.
Fix by replacing the unconditional xa_erase_irq() with xa_cmpxchg_irq()
that only erases the entry if it hasn't already been replaced (still
contains XA_ZERO_ENTRY), preserving any newly created SRQ.
Ashutosh Dixit [Mon, 27 Apr 2026 22:11:32 +0000 (15:11 -0700)]
drm/xe/oa: Use drm_gem_mmap_obj for OA buffer mmap
OA buffer mmap can currently only mmap OA buffer in system memory. CRI
MERTOA buffer can be located in device memory. Switch OA buffer mmap to
using drm_gem_mmap_obj, which can handle mmap's of both system and device
memory buffers.
Ashutosh Dixit [Mon, 27 Apr 2026 22:11:31 +0000 (15:11 -0700)]
drm/xe/oa: Use xe_map layer
OA code should have used xe_map layer to begin with. In CRI, the OA buffer
can be located both in system and device memory. For these reasons, move OA
code to use the xe_map layer when accessing the OA buffer.
v2: Use xe_map layer and put_user in xe_oa_copy_to_user (Umesh)
v3: To avoid performance impact in v2, revert to v1 but move
xe_map_copy_to_user() to xe_map.h
v4: Use bounce buffer and copy_to_user in xe_oa_copy_to_user
v5: Fix offsets in oa_timestamp() and oa_timestamp_clear() (Umesh)
v6: Rename head/tail in helper args to report_offset (Umesh)
v7: Add xe_assert in xe_oa_copy_to_user (Umesh)
Stephen Smalley [Wed, 29 Apr 2026 18:15:31 +0000 (14:15 -0400)]
selinux: fix sel_kill_sb()
Fix the order in sel_kill_sb() to match other pseudo filesystems.
This does not currently matter for selinuxfs per se but could
in the future, e.g. with SELinux namespaces and multiple selinuxfs
instances.
Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Input: elan_i2c - validate firmware size before use
Ensure that the firmware file is large enough to contain the expected
number of pages and the signature (which resides at the end of the
firmware blob) before accessing them to prevent potential out-of-bounds
reads.
The function xe_gt_mcr_steering_info_to_dss_id() has had no callers
since commit fa597710be6e ("drm/xe/guc: Cache DSS info when creating
capture register list") which removed the only call site in
xe_guc_capture.c. Remove the dead function and its declaration.
sched_ext: Require cid-form struct_ops for sub-sched support
Sub-scheduler support is tied to the cid-form struct_ops: sub_attach /
sub_detach will communicate allocation via cmask, and the hierarchy assumes
all participants share a single topological cid space. A cpu-form root that
accepts sub-scheds would need cpu <-> cid translation on every cross-sched
interaction, defeating the purpose.
Enforce this at validate_ops():
- A sub-scheduler (scx_parent(sch) non-NULL) must be cid-form.
- A root that exposes sub_attach / sub_detach must be cid-form.
scx_qmap, which is currently the only scheduler demoing sub-sched support,
was converted to cid-form in the preceding patch, so this doesn't cause
breakage.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
tools/sched_ext: scx_qmap: Port to cid-form struct_ops
Flip qmap's struct_ops to bpf_sched_ext_ops_cid. The kernel now passes
cids and cmasks to callbacks directly, so the per-callback cpu<->cid
translations that the prior patch added drop out and cpu_ctxs[] is
reindexed by cid. Cpu-form kfunc calls switch to their cid-form
counterparts.
The cpu-only kfuncs (idle/any pick, cpumask iteration) have no cid
substitute. Their callers already moved to cmask scans against
qa_idle_cids and taskc->cpus_allowed in the prior patch, so the kfunc
calls drop here without behavior changes.
set_cmask is wired up via cmask_copy_from_kernel() to copy the
kernel-supplied cmask into the arena-resident taskc cmask. The
cpuperf monitor iterates the cid-form perf kfuncs.
v4: Match scx_bpf_cid_override()'s 2-arg form, drop the shard test
plumbing, bound nr_cpu_ids for the verifier, and switch mode 3
from bad-mono to bad-range (Changwoo, Andrea).
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick
Switch qmap's idle-cpu picker from scx_bpf_pick_idle_cpu() to a
BPF-side bitmap scan, still under cpu-form struct_ops. qa_idle_cids
tracks idle cids (updated in update_idle / cpu_offline) and each
task's taskc->cpus_allowed tracks its allowed cids (built in
set_cpumask / init_task); select_cpu / enqueue scan the intersection
for an idle cid. Callbacks translate cpu <-> cid on entry;
cid-qmap-port drops those translations.
The scan is barebone - no core preference or other topology-aware
picks like the in-kernel picker - but qmap is a demo and this is
enough to exercise the plumbing.
v3: qmap_init() refuses to load when nr_cids exceeds SCX_QMAP_MAX_CPUS;
task_ctx's flex array would otherwise overflow into the next slab
entry. (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline
The cid mapping is built from the online cpu set at scheduler enable
and stays valid for that set; routine hotplug invalidates it. The
default cid behavior is to restart the scheduler so the mapping gets
rebuilt against the new online set, and that requires not implementing
cpu_online / cpu_offline (which suppress the kernel's ACT_RESTART).
Drop the two ops along with their print_cpus() helper - the cluster
view was only useful as a hotplug demo and is meaningless over the
dense cid space the scheduler will move to. Wire main() to handle the
ACT_RESTART exit by reopening the skel and reattaching, matching the
pattern in scx_simple / scx_central / scx_flatcg etc. Reset optind so
getopt re-parses argv into the fresh skel rodata each iteration.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Forbid cpu-form kfuncs from cid-form schedulers
cid and cpu are both small s32s, trivially confused when a cid-form
scheduler calls a cpu-keyed kfunc. Reject cid-form programs that
reference any kfunc in the new scx_kfunc_ids_cpu_only at verifier load
time.
The reverse direction is intentionally permissive: cpu-form schedulers
can freely call cid-form kfuncs to ease a gradual cpumask -> cid
migration.
The check sits in scx_kfunc_context_filter() right after the SCX
struct_ops gate and before the any/idle allow and per-op allow-list
checks, so it catches cpu-only kfuncs regardless of which set they
belong to (any, idle, or select_cpu).
v2: Sync per-entry kfunc flags with their primary declarations (Zhao).
pahole intersects flags across BTF_ID_FLAGS() occurrences, so
omitting them drops the flags globally.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Add bpf_sched_ext_ops_cid struct_ops type
cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without a full cid interface,
schedulers end up mixing forms - a subtle-bug factory.
Add sched_ext_ops_cid, which mirrors sched_ext_ops with cid/cmask
replacing cpu/cpumask in the topology-carrying callbacks.
cpu_acquire/cpu_release are deprecated and absent; a prior patch
moved them past @priv so the cid-form can omit them without
disturbing shared-field offsets.
The two structs share byte-identical layout up to @priv, so the
existing bpf_scx init/check hooks, has_op bitmap, and
scx_kf_allow_flags[] are offset-indexed and apply to both.
BUILD_BUG_ON in scx_init() pins the shared-field and renamed-callback
offsets so any future drift trips at boot.
The kernel<->BPF boundary translates between cpu and cid:
- A static key, enabled on cid-form sched load, gates the translation
so cpu-form schedulers pay nothing.
- dispatch, update_idle, cpu_online/offline and dump_cpu translate
the cpu arg at the callsite.
- select_cpu also translates the returned cid back to a cpu.
- set_cpumask is wrapped to synthesize a cmask in a per-cpu scratch
before calling the cid-form callback.
All scheds in a hierarchy share one form. The static key drives the
hot-path branch.
v2: Use struct_size() for the set_cmask_scratch percpu alloc. Move
cid-shard fields and assertions into the later cid-shard patch.
v3: Drop `static` on scx_set_cmask_scratch; add extern in ext_internal.h.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without full cid coverage a
scheduler has to mix cid and cpu forms, which is a subtle-bug factory.
Close the gap with a cid-native interface.
Pair every cpu-form kfunc that takes a cpu id with a cid-form
equivalent (kick, task placement, cpuperf query/set, per-cpu current
task, nr-cpu-ids). Add two cid-natives with no cpu-form sibling:
scx_bpf_this_cid() (cid of the running cpu, scx equivalent of
bpf_get_smp_processor_id) and scx_bpf_nr_online_cids().
scx_bpf_cpu_rq is deprecated; no cid-form counterpart. NUMA node info
is reachable via scx_bpf_cid_topo() on the BPF side.
Each cid-form wrapper is a thin cid -> cpu translation that delegates
to the cpu path, registered in the same context sets so usage
constraints match.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Add cmask, a base-windowed bitmap over cid space
Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
most of its bits for a small window and is awkward in BPF.
scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
cmasks therefore address bits[] against the same global windows, so
cross-cmask word ops reduce to
dest->bits[i] OP= operand->bits[i - delta]
with no bit-shifting, at the cost of up to one extra storage word for
head misalignment. This alignment guarantee is the reason binary ops
can stay word-level; every mutating helper preserves it.
Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/
cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and
adds the extra helpers that basic idle-cpu selection needs.
No callers yet.
v2: Narrow to helpers that will be used in the planned changes;
set/bit/find/zero ops will be added as usage develops.
v3: cmask_copy_from_kernel: validate src->base == 0 via probe-read;
bit-level nr_bits check instead of round-up word count. (Sashiko)
v4: Bump CMASK_CAS_TRIES to 1<<23 so abort fires only after seconds
of real spinning, not on plausible contention. Switch
__builtin_ctzll() to the ctzll() wrapper for clang compat
(Changwoo).
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
tools/sched_ext: Add struct_size() helpers to common.bpf.h
Add flex_array_size(), struct_size() and struct_size_t() to
scx/common.bpf.h so BPF schedulers can size flex-array-containing
structs the same way kernel code does. These are abbreviated forms of
the <linux/overflow.h> macros.
v3: Use offsetof() instead of sizeof() in struct_size() to match kernel
semantics (no inflation from trailing struct padding). (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
The auto-probed cid mapping reflects the kernel's view of topology
(node -> LLC -> core), but a BPF scheduler may want a different layout -
to align cid slices with its own partitioning, or to work around how the
kernel reports a particular machine.
Add scx_bpf_cid_override(), callable from ops.init() of the root
scheduler. It validates the caller-supplied cpu->cid array and replaces
the in-place mapping; topo info is invalidated. A compat.bpf.h wrapper
silently no-ops on kernels that lack the kfunc.
A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the
kfunc to ops.init() at verifier load time.
Raw cpu numbers are clumsy for sharding and cross-sched communication,
especially from BPF. The space is sparse, numerical closeness doesn't
track topological closeness (x86 hyperthreading often scatters SMT
siblings), and a range of cpu ids doesn't describe anything meaningful.
Sub-sched support makes this acute: cpu allocation, revocation, and
state constantly flow across sub-scheds. Passing whole cpumasks scales
poorly (every op scans 4K bits) and cpumasks are awkward in BPF.
cids assign every cpu a dense, topology-ordered id. CPUs sharing a core,
LLC, or NUMA node occupy contiguous cid ranges, so a topology unit
becomes a (start, length) slice. Communication passes slices; BPF can
process a u64 word of cids at a time.
Build the mapping once at root enable by walking online cpus node -> LLC
-> core. Possible-but-not-online cpus tail the space with no-topo cids.
Expose kfuncs to map cpu <-> cid in either direction and to query each
cid's topology metadata.
v2: Use kzalloc_objs()/kmalloc_objs() for the three allocs in
scx_cid_arrays_alloc() (Cheng-Yang Chou).
v3: scx_cid_init() failure path now drops cpus_read_lock();
BUILD_BUG_ON tightened to match BPF cmask helpers' NR_CPUS<=8192.
(Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Pass struct scx_enable_cmd to scx_enable() rather than unpacking @ops
at every call site and re-packing into a fresh cmd inside. bpf_scx_reg()
now builds the cmd on its stack and hands it in; scx_enable() just
wires up the kthread work and waits.
Relocate struct scx_enable_cmd above scx_alloc_and_add_sched() so
upcoming patches that also want the cmd can see it.
No behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops
cpu_acquire and cpu_release are deprecated and slated for removal. Move
their declarations to the end of struct sched_ext_ops so an upcoming
cid-form struct (sched_ext_ops_cid) can omit them entirely without
disturbing the offsets of the shared fields.
Switch the two SCX_HAS_OP() callers for these ops to direct field checks
since the relocated ops sit outside the SCX_OPI_END range covered by the
has_op bitmap.
scx_kf_allow_flags[] auto-sizes to the highest used SCX_OP_IDX, so
SCX_OP_IDX(cpu_release) moving to a higher index just enlarges the
sparse array; the lookup logic is unchanged.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu()
Callers that already know the cpu is valid shouldn't have to pay for a
redundant check. scx_kick_cpu() is called from the in-kernel balance loop
break-out path with the current cpu (trivially valid) and from
scx_bpf_kick_cpu() with a BPF-supplied cpu that does need validation. Move
the check out of scx_kick_cpu() into scx_bpf_kick_cpu() so the backend is
reusable by callers that have already validated.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h
Things shared across multiple .c files belong in a header. scx_exit() /
scx_error() (and their scx_vexit() / scx_verror() siblings) are already
called from ext_idle.c and the upcoming ext_cid.c, and it was only
build_policy.c's textual inclusion of ext.c that made the references
resolve. Move the whole family to ext_internal.h.
Pure visibility change.
v4: Rebased over the exit_cpu plumbing. scx_exit() and scx_verror()
are now macros wrapping raw_smp_processor_id(); move both macros
plus the underlying __scx_exit() / scx_vexit() declarations to
the header.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it
Rename the static ext.c helper and declare it in ext_internal.h so
ext_idle.c and the upcoming cid code can call it directly instead of
relying on build_policy.c textual inclusion.
Pure rename and visibility change.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
sched_ext: Add ext_types.h for early subsystem-wide defs
Introduce kernel/sched/ext_types.h as the early-def header for the
sched_ext compilation unit. Included from kernel/sched/build_policy.c
before ext_internal.h so every later header and source in the unit
sees its content without re-inclusion. Later patches add their types
here (struct scx_cid_topo, scx_cmask, scx_cid_shard, etc.) so the
subsystem has one place to stash types shared across the TU.
Move enum scx_consts (SCX_DSP_DFL_MAX_BATCH, SCX_WATCHDOG_MAX_TIMEOUT,
SCX_SUB_MAX_DEPTH, etc.) here as the initial content. Ops-facing
content stays in ext_internal.h.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
drm/xe/hdcp: Add NULL check for media_gt in intel_hdcp_gsc_check_status()
When media GT is disabled via configfs, there is no allocation for
media_gt, which is kept as NULL. In such scenario,
intel_hdcp_gsc_check_status() results in a kernel pagefault error due to
>->uc.gsc being evaluated as an invalid memory address.
Fix that by introducing a NULL check on media_gt and bailing out early
if so.
While at it, also drop the NULL check for gsc, since it can't be NULL if
media_gt is not NULL.
v2:
- Get address for gsc only after checking that gt is not NULL.
(Shuicheng)
- Drop the NULL check for gsc. (Shuicheng)
v3:
- Add "Fixes" and "Cc: <stable...>" tags. (Matt)
Jia Yao [Fri, 17 Apr 2026 05:59:17 +0000 (05:59 +0000)]
drm/xe/uapi: Reject coh_none PAT index for CPU_ADDR_MIRROR
Add validation in xe_vm_bind_ioctl() to reject PAT indices
with XE_COH_NONE coherency mode when used with
DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR.
CPU address mirror mappings use system memory that is CPU
cached, which makes them incompatible with COH_NONE PAT
indices. Allowing COH_NONE with CPU cached buffers is a
security risk, as the GPU may bypass CPU caches and read
stale sensitive data from DRAM.
Although CPU_ADDR_MIRROR does not create an immediate
mapping, the backing system memory is still CPU cached.
Apply the same PAT coherency restrictions as
DRM_XE_VM_BIND_OP_MAP_USERPTR.
v2:
- Correct fix tag
v6:
- No change
v7:
- Correct fix tag
v8:
- Rebase
v9:
- Limit the restrictions to iGPU
v10:
- Just add the iGPU logic but keep dGPU logic
Fixes: b43e864af0d4 ("drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR") Cc: <stable@vger.kernel.org> # v6.15+ Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Mathew Alwin <alwin.mathew@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Jia Yao <jia.yao@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260417055917.2027459-3-jia.yao@intel.com
(cherry picked from commit 4d58d7535e826a3175527b6174502f0db319d7f6) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Jia Yao [Fri, 17 Apr 2026 05:59:16 +0000 (05:59 +0000)]
drm/xe/uapi: Reject coh_none PAT index for CPU cached memory in madvise
Add validation in xe_vm_madvise_ioctl() to reject PAT indices with
XE_COH_NONE coherency mode when applied to CPU cached memory.
Using coh_none with CPU cached buffers is a security issue. When the
kernel clears pages before reallocation, the clear operation stays in
CPU cache (dirty). GPU with coh_none can bypass CPU caches and read
stale sensitive data directly from DRAM, potentially leaking data from
previously freed pages of other processes.
This aligns with the existing validation in vm_bind path
(xe_vm_bind_ioctl_validate_bo).
v2(Matthew brost)
- Add fixes
- Move one debug print to better place
v3(Matthew Auld)
- Should be drm/xe/uapi
- More Cc
v4(Shuicheng Lin)
- Fix kmem leak issues by the way
v5
- Remove kmem leak because it has been merged by another patch
v6
- Remove the fix which is not related to current fix
v7
- No change
v8
- Rebase
v9
- Limit the restrictions to iGPU
v10
- No change
Fixes: ada7486c5668 ("drm/xe: Implement madvise ioctl for xe") Cc: <stable@vger.kernel.org> # v6.18+ Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Mathew Alwin <alwin.mathew@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Jia Yao <jia.yao@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260417055917.2027459-2-jia.yao@intel.com
(cherry picked from commit 016ccdb674b8c899940b3944952c96a6a490d10a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Shuicheng Lin [Fri, 17 Apr 2026 16:33:08 +0000 (16:33 +0000)]
drm/xe/gsc: Fix BO leak on error in query_compatibility_version()
When xe_gsc_read_out_header() fails, query_compatibility_version()
returns directly instead of jumping to the out_bo label. This skips
the xe_bo_unpin_map_no_vm() call, leaving the BO pinned and mapped
with no remaining reference to free it.
Fix by using goto out_bo so the error path properly cleans up the BO,
consistent with the other error handling in the same function.
Shuicheng Lin [Wed, 15 Apr 2026 22:54:28 +0000 (22:54 +0000)]
drm/xe/eustall: Fix drm_dev_put called before stream disable in close
In xe_eu_stall_stream_close(), drm_dev_put() is called before the
stream is disabled and its resources are freed. If this drops the
last reference, the device structures could be freed while the
subsequent cleanup code still accesses them, leading to a
use-after-free.
Fix this by moving drm_dev_put() after all device accesses are
complete. This matches the ordering in xe_oa_release().
Fixes: 9a0b11d4cf3b ("drm/xe/eustall: Add support to init, enable and disable EU stall sampling") Cc: Harish Chegondi <harish.chegondi@intel.com> Assisted-by: Claude:claude-opus-4.6 Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Harish Chegondi <harish.chegondi@intel.com> Link: https://patch.msgid.link/20260415225428.3399934-1-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 35aff528f7297e949e5e19c9cd7fd748cf1cf21c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Shuicheng Lin [Wed, 8 Apr 2026 02:06:47 +0000 (02:06 +0000)]
drm/xe: Fix error cleanup in xe_exec_queue_create_ioctl()
Two error handling issues exist in xe_exec_queue_create_ioctl():
1. When xe_hw_engine_group_add_exec_queue() fails, the error path jumps
to put_exec_queue which skips xe_exec_queue_kill(). If the VM is in
preempt fence mode, xe_vm_add_compute_exec_queue() has already added
the queue to the VM's compute exec queue list. Skipping the kill
leaves the queue on that list, leading to a dangling pointer after
the queue is freed.
2. When xa_alloc() fails after xe_hw_engine_group_add_exec_queue() has
succeeded, the error path does not call
xe_hw_engine_group_del_exec_queue() to remove the queue from the hw
engine group list. The queue is then freed while still linked into
the hw engine group, causing a use-after-free.
Fix both by:
- Changing the xe_hw_engine_group_add_exec_queue() failure path to jump
to kill_exec_queue so that xe_exec_queue_kill() properly removes the
queue from the VM's compute list.
- Adding a del_hw_engine_group label before kill_exec_queue for the
xa_alloc() failure path, which removes the queue from the hw engine
group before proceeding with the rest of the cleanup.
Shuicheng Lin [Wed, 8 Apr 2026 17:52:55 +0000 (17:52 +0000)]
drm/xe: Fix dma-buf attachment leak in xe_gem_prime_import()
When xe_dma_buf_init_obj() fails, the attachment from
dma_buf_dynamic_attach() is not detached. Add dma_buf_detach() before
returning the error. Note: we cannot use goto out_err here because
xe_dma_buf_init_obj() already frees bo on failure, and out_err would
double-free it.
Shuicheng Lin [Wed, 8 Apr 2026 17:52:54 +0000 (17:52 +0000)]
drm/xe: Fix bo leak in xe_dma_buf_init_obj() on allocation failure
When drm_gpuvm_resv_object_alloc() fails, the pre-allocated storage bo
is not freed. Add xe_bo_free(storage) before returning the error.
xe_dma_buf_init_obj() calls xe_bo_init_locked(), which frees the bo on
error. Therefore, xe_dma_buf_init_obj() must also free the bo on its own
error paths. Otherwise, since xe_gem_prime_import() cannot distinguish
whether the failure originated from xe_dma_buf_init_obj() or from
xe_bo_init_locked(), it cannot safely decide whether the bo should be
freed.
Add comments documenting the ownership semantics: on success, ownership
of storage is transferred to the returned drm_gem_object; on failure,
storage is freed before returning.
Shuicheng Lin [Wed, 8 Apr 2026 17:52:53 +0000 (17:52 +0000)]
drm/xe/bo: Fix bo leak on GGTT flag validation in xe_bo_init_locked()
When XE_BO_FLAG_GGTT_ALL is set without XE_BO_FLAG_GGTT, the function
returns an error without freeing a caller-provided bo, violating the
documented contract that bo is freed on failure.
Shuicheng Lin [Wed, 8 Apr 2026 17:52:52 +0000 (17:52 +0000)]
drm/xe/bo: Fix bo leak on unaligned size validation in xe_bo_init_locked()
When type is ttm_bo_type_device and aligned_size != size, the function
returns an error without freeing a caller-provided bo, violating the
documented contract that bo is freed on failure.
Shuicheng Lin [Thu, 9 Apr 2026 00:34:49 +0000 (00:34 +0000)]
drm/xe: Fix potential NULL deref in xe_exec_queue_tlb_inval_last_fence_put_unlocked
xe_exec_queue_tlb_inval_last_fence_put_unlocked() uses q->vm->xe as the
first argument to xe_assert(). This function is called unconditionally
from xe_exec_queue_destroy() for all queues, including kernel queues
that have q->vm == NULL (e.g., queues created during GT init in
xe_gt_record_default_lrcs() with vm=NULL).
While current compilers optimize away the q->vm->xe dereference (even
in CONFIG_DRM_XE_DEBUG=y builds, the compiler pushes the dereference
into the WARN branch that is only taken when the assert condition is
false), the code is semantically incorrect and constitutes undefined
behavior in the C abstract machine for the NULL pointer case.
Use gt_to_xe(q->gt) instead, which is always valid for any exec queue.
This is consistent with how xe_exec_queue_destroy() itself obtains the
xe_device pointer in its own xe_assert at the top of the function.
drm/xe/vf: Use drm mm instead of drm sa for CCS read/write
The suballocator algorithm tracks a hole cursor at the last allocation
and tries to allocate after it. This is optimized for fence-ordered
progress, where older allocations are expected to become reusable first.
In fence-enabled mode, that ordering assumption holds. In fence-disabled
mode, allocations may be freed in arbitrary order, so limiting allocation
to the current hole window can miss valid free space and fail allocations
despite sufficient total space.
Use DRM memory manager instead of sub-allocator to get rid of this issue
as CCS read/write operations do not use fences.
Fixes: 864690cf4dd6 ("drm/xe/vf: Attach and detach CCS copy commands with BO") Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408110145.1639937-6-satyanarayana.k.v.p@intel.com
(cherry picked from commit 6c84b493012aeb05dec29c709377bf0e17ac6815) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Add a memory pool to allocate sub-ranges from a BO-backed pool
using drm_mm.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Maarten Lankhorst <dev@lankhorst.se> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260408110145.1639937-5-satyanarayana.k.v.p@intel.com
(cherry picked from commit 1ce3229f8f269a245ff3b8c65ffae36b4d6afb93) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
firmware: arm_ffa: Unregister bus notifier on teardown for FF-A v1.0
For FF-A v1.0 the driver registers a bus notifier to backfill UUID
matching, but the notifier was never unregistered on cleanup paths.
Track the registration state and unregister it during teardown and early
partition-setup failure.
firmware: arm_ffa: Fix per-vcpu self notifications handling in workqueue
Per-vcpu notification handling already runs from a per-cpu work item on
the target cpu. Routing that path back through smp_call_function_single()
re-enters the call-function IPI path and executes the notification
handler with interrupts disabled. That makes the framework path unsafe,
since it takes a mutex, allocates memory with GFP_KERNEL, and invokes
client callbacks.
Handle per-vcpu self notifications directly from the existing per-cpu
work item instead. This keeps the per-vcpu path in task context and
avoids the extra IPI hop entirely.
firmware: arm_ffa: Avoid collapsing NPI work from different CPUs
Notification pending interrupts are registered as per-CPU IRQs, but the
driver queues all NPI handling through a single shared work_struct.
That allows queue_work_on() calls from different CPUs to collapse onto a
single pending work item even though the work function uses the CPU it
runs on to fetch and handle per-CPU notifications.
Move notif_pcpu_work into the per-CPU ffa_pcpu_irq state and initialize
one work item per CPU. This keeps NPI handling independent per CPU and
avoids losing notifications when multiple CPUs queue work concurrently.
firmware: arm_ffa: Skip free_pages on RX buffer alloc failure
If the RX buffer allocation fails in ffa_init(), the error path jumps to
free_pages even though no buffer has been allocated yet. Route that case
directly to free_drv_info so the cleanup path is only used after at
least one RX/TX buffer allocation has succeeded.
firmware: arm_ffa: Check for NULL FF-A ID table while driver registration
The bus match callback assumes that every FF-A driver provides an
id_table and dereferences it unconditionally. Enforce that contract at
registration time so a buggy client driver cannot crash the bus during
match.
Matt Roper [Wed, 8 Apr 2026 22:27:44 +0000 (15:27 -0700)]
drm/xe/debugfs: Correct printing of register whitelist ranges
The register-save-restore debugfs prints whitelist entries as offset
ranges. E.g.,
REG[0x39319c-0x39319f]: allow read access
for a single dword-sized register. However the GENMASK value used to
set the lower bits to '1' for the upper bound of the whitelist range
incorrectly included one more bit than it should have, causing the
whitelist ranges to sometimes appear twice as large as they really were.
For example,
REG[0x6210-0x6217]: allow rw access
was also intended to be a single dword-sized register whitelist (with a
range 0x6210-0x6213) but was printed incorrectly as a qword-sized range
because one too many bits was flipped on. Similar 'off by one' logic
was applied when printing 4-dword register ranges and 64-dword register
ranges as well.
Correct the GENMASK logic to print these ranges in debugfs correctly.
No impact outside of correcting the misleading debugfs output.
Matt Roper [Fri, 10 Apr 2026 22:50:30 +0000 (15:50 -0700)]
drm/xe: Mark ROW_CHICKEN5 as a masked register
ROW_CHICKEN5 is a masked register (i.e., to adjust the value of any of
the lower 16 bits, the corresponding bit in the upper 16 bits must also
be set). Add the XE_REG_OPTION_MASKED to its definition; failure to do
so will cause workaround updates of this register to not apply properly.
Matt Roper [Fri, 10 Apr 2026 22:50:29 +0000 (15:50 -0700)]
drm/xe/tuning: Use proper register offset for GAMSTLB_CTRL
From Xe2 onward (i.e., all platforms officially supported by the Xe
driver), the GAMSTLB_CTRL register is located at offset 0x477C and
represented by the macro "GAMSTLB_CTRL" in code. However the register
formerly resided at offset 0xCF4C on Xe1-era platforms, and we also have
macro XEHP_GAMSTLB_CTRL that represents this old offset in the
unofficial/developer-only Xe1 code. When tuning for the register was
added for Xe3p_LPG, the old Xe1-era macro was accidentally used instead
of the proper macro for Xe2 and beyond, causing the tuning to not be
applied properly. Use the proper definition so that the correct offset
is written to.
drm/xe/xe3p_lpg: Add missing indirect ring state feature flag
Even though commit 8fcb7dfb8bbf ("drm/xe/xe3p_lpg: Add support for
graphics IP 35.10") mentions that the support for Indirect Ring State
exists for Xe3p_LPG, it missed actually setting the feature flag in
graphics_xe3p_lpg. Fix that by adding the missing member.
Matt Roper [Wed, 1 Apr 2026 20:12:44 +0000 (13:12 -0700)]
drm/xe: Drop redundant rtp entries for Wa_14019988906 & Wa_14019877138
There appears to have been a silent merge conflict between some commits
updating the workaround tables on Xe's -fixes and -next branches:
- Commit bc6387a2e0c1 ("drm/xe/xe2_hpg: Fix handling of Wa_14019988906
& Wa_14019877138") from the fixes branch moved the Xe2_HPG instance
of two workarounds touching the PSS_CHICKEN register from the
engine_was[] table to the lrc_was[] table; the equivalent
implementation for all other platforms/IPs were already properly
located on lrc_was[]. This commit on the fixes branch is a
cherry-pick of commit e04c609eedf4 ("drm/xe/xe2_hpg: Fix handling of
Wa_14019988906 & Wa_14019877138") that already existed on the next
branch.
- Commit 55b19abb6c44 ("drm/xe: Consolidate workaround entries for
Wa_14019877138") and commit c2142a1a8415 ("drm/xe: Consolidate
workaround entries for Wa_14019988906") consolidated the individual
entries per IP generation for each workaround into single, larger
range-based entries.
During merge conflict resolution the Xe2_HPG-specific entries (i.e.,
those with rule "GRAPHICS_VERSION_RANGE(2001, 2002)") were accidentally
resurrected, even though the table already contains the consolidated
entries that match a superset of thse ranges. These redundant entries
don't cause any build failures but do trigger a dmesg error during probe
on BMG-G21 devices:
Matthew Brost [Thu, 26 Mar 2026 21:01:16 +0000 (14:01 -0700)]
drm/xe: Drop registration of guc_submit_wedged_fini from xe_guc_submit_wedge()
xe_guc_submit_wedge() runs in the DMA-fence signaling path, where
GFP_KERNEL memory allocations are not permitted. However, registering
guc_submit_wedged_fini via drmm_add_action_or_reset() triggers such an
allocation.
Avoid this by moving the logic from guc_submit_wedged_fini() into
guc_submit_fini(), where wedged exec queue references are dropped during
normal teardown.
DaeMyung Kang [Sat, 25 Apr 2026 09:38:28 +0000 (18:38 +0900)]
ksmbd: rewrite stop_sessions() with restartable iteration
stop_sessions() walks conn_list with hash_for_each() and, for every
entry, drops conn_list_lock across the transport ->shutdown() call
before re-acquiring the read lock to continue the loop. The hash
walk relies on cross-iteration state (the current bucket and the
hlist position), which is not preserved across unlock/relock: if
another thread performs a list mutation during the unlocked window,
the ongoing iteration becomes unreliable and can re-visit
connections that have already been handled or skip connections that
have not. The outer `if (!hash_empty(conn_list)) goto again;` retry
masks the symptom in the common case but does not address the
unsafe iteration itself.
Reframe the loop so it never relies on iterator state across
unlock/relock. Under conn_list_lock held for read, pick the first
connection whose ->shutdown() has not yet been issued by this path,
pin it by taking an extra reference, record that fact on the
connection and mark it EXITING while still inside the locked walk,
then drop the lock. Then call ->shutdown() outside the lock, drop
the pin (freeing the connection if the handler already released its
reference), and restart from the top.
Use a new per-connection flag, conn->stop_called, as the "shutdown
issued from stop_sessions()" marker rather than reusing the status
state. ksmbd_conn_set_exiting() is also invoked by
ksmbd_sessions_deregister() on sibling channels of a multichannel
session without issuing a transport shutdown, so treating
KSMBD_SESS_EXITING as "already handled here" would skip connections
that still need shutdown() to wake their handler out of recv(),
leaving the outer retry waiting indefinitely for the hash to drain.
stop_sessions() is serialised by init_lock in
ksmbd_conn_transport_destroy(), so writing stop_called under the
read lock has no other writer.
Set EXITING inside the locked walk so the selection, the stop_called
marker, and the status transition all happen together, and guard
against regressing a connection that has already advanced to
KSMBD_SESS_RELEASING on its own (for example, if the handler exited
its receive loop for an unrelated reason between teardown steps).
When the pin drop is the last put, release the transport and pair
ida_destroy(&target->async_ida) with the ida_init() done in
ksmbd_conn_alloc(), so stop_sessions() retiring a connection on its
own does not leak the xarray backing of the embedded async_ida.
The outer retry with msleep() is kept to wait for handler threads to
reach ksmbd_conn_free() and drain the hash.
Observed with an instrumented build that logs one line per visit and
widens the unlocked window before ->shutdown() by 200 ms, under
five concurrent cifs mounts (nosharesock, one connection each):
* Current code: the same connection address is revisited many
times during a single stop_sessions() call and ->shutdown() is
invoked well beyond the number of live connections before the
hash finally drains.
* Rewritten code: each live connection produces exactly one
->shutdown() call; the function returns as soon as the hash is
empty.
Functional teardown via `ksmbd.control --shutdown` with the same
five mounts completes cleanly on the rewritten path.
Performance is observably unchanged. Tearing down N concurrent
nosharesock cifs connections with `ksmbd.control --shutdown` +
`rmmod ksmbd` takes essentially the same wall time before and after
the rewrite:
N before after
10 4.93s 5.34s
30 7.34s 7.03s
50 7.31s 7.01s (3-run avg: 7.04s vs 7.25s)
100 6.98s 6.78s
200 6.77s 6.89s
and the number of ->shutdown() calls equals the number of live
connections on both paths when the race is not widened. The
teardown is dominated by the msleep(100)-based outer retry waiting
for handler threads to run ksmbd_conn_free(), not by the iteration
itself; the restartable loop's worst-case O(N^2) visit cost is in
the microseconds even at N=200 and sits far below the msleep(100)
granularity.
Applied alone on top of ksmbd-for-next-next, this patch does not
introduce a new leak site. Under the same reproducer (10x
concurrent-holders + ss -K + ksmbd.control --shutdown + rmmod), the
tree still shows the pre-existing per-connection transport leak
count that arises when the last refcount drop lands in one of
ksmbd_conn_r_count_dec(), __free_opinfo() or session_fd_check() -
all of which end with a bare kfree() today. kmemleak backtraces
for the unreferenced objects point into the TCP accept path
(sk_clone -> inet_csk_clone_lock, sock_alloc_inode) and none
involve stop_sessions(). Plugging those bare-kfree sites is the
responsibility of the follow-up patch.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: stable@vger.kernel.org Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Cássio Gabriel [Wed, 29 Apr 2026 13:20:02 +0000 (10:20 -0300)]
ALSA: usb-audio: Update Babyface Pro control caches only after successful writes
snd_bbfpro_ctl_put() and snd_bbfpro_vol_put()
cache the requested packed control state in
kcontrol->private_value before issuing the USB write.
Their get and resume paths use that cached value directly,
so a failed write can leave the driver reporting and later
replaying a setting the hardware never accepted.
Update the cached state only after a successful USB write.
Cássio Gabriel [Wed, 29 Apr 2026 13:20:01 +0000 (10:20 -0300)]
ALSA: usb-audio: Roll back quirk control caches on write errors
Several mixer quirk callbacks cache the requested
control value in kcontrol->private_value before
issuing a single vendor or class write.
Their paired get and resume paths consume that cache
directly, so a failed write currently leaves software
state changed even though the update did not succeed.
That can make later reads report a value the device
never accepted and can replay the stale cache on resume.
Restore the previous cached value on failure in
the Audigy2NX LED, Emu0204 channel switch,
Xonar U1 output switch, Native Instruments controls,
FTU effect program switch, and Sound Blaster E1 input source switch.