Matthew Brost [Sat, 22 Nov 2025 01:25:01 +0000 (17:25 -0800)]
drm/gpusvm: Limit the number of retries in drm_gpusvm_get_pages
drm_gpusvm_get_pages should not be allowed to retry forever, cap the
time spent in the function to HMM_RANGE_DEFAULT_TIMEOUT has this is
essentially a wrapper around hmm_range_fault.
Introduce device xe_caching_pt flag to selectively turn it on for
supported platforms. It allows to eliminate version check and enable
this feature for the future platforms.
drm/xe/vm: Skip ufence association for CPU address mirror VMA during MAP
The MAP operation for a CPU address mirror VMA does not require ufence
association because such mappings are not GPU-synchronized and do not
participate in GPU job completion signaling.
Remove the unnecessary ufence addition for this case to avoid -EBUSY
failure in check_ufence of unbind ops.
drm/xe/svm: Extend MAP range to reduce vma fragmentation
When DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR is set during VM_BIND_OP_MAP,
the mapping logic now checks adjacent cpu_addr_mirror VMAs with default
attributes and expands the mapping range accordingly. This ensures that
bo_unmap operations ideally target the same area and helps reduce
fragmentation by coalescing nearby compatible VMAs into a single mapping.
drm/xe: Merge adjacent default-attribute VMAs during garbage collection
While restoring default memory attributes for VMAs during garbage
collection, extend the target range by checking neighboring VMAs. If
adjacent VMAs are CPU-address-mirrored and have default attributes,
include them in the mergeable range to reduce fragmentation and improve
VMA reuse.
drm/xe: Add helper to extend CPU-mirrored VMA range for merge
Introduce xe_vm_find_cpu_addr_mirror_vma_range(), which computes an
extended range around a given range by including adjacent VMAs that are
CPU-address-mirrored and have default memory attributes. This helper is
useful for determining mergeable range without performing the actual merge.
v2
- Add assert
- Move unmap check to this patch
v3
- Decrease offset to check by SZ_4K to avoid wrong vma return in fast
lookup path
Tomasz Lis [Mon, 24 Nov 2025 22:28:53 +0000 (23:28 +0100)]
drm/xe: Protect against unset LRC when pausing submissions
While pausing submissions, it is possible to encouner an exec queue
which is during creation, and therefore doesn't have a valid xe_lrc
struct reference.
Protect agains such situation, by checking for NULL before access.
Reviewed-by: Matthew Brost <matthew.brost@intel.com> Fixes: c25c1010df88 ("drm/xe/vf: Replay GuC submission state on pause / unpause") Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251124222853.1900800-1-tomasz.lis@intel.com
Matthew Brost [Fri, 21 Nov 2025 15:27:50 +0000 (07:27 -0800)]
drm/xe/vf: Start re-emission from first unsignaled job during VF migration
The LRC software ring tail is reset to the first unsignaled pending
job's head.
Fix the re-emission logic to begin submitting from the first unsignaled
job detected, rather than scanning all pending jobs, which can cause
imbalance.
v2:
- Include missing local changes
v3:
- s/skip_replay/restore_replay (Tomasz)
Lukasz Laguna [Mon, 24 Nov 2025 19:02:37 +0000 (20:02 +0100)]
drm/xe/pf: Handle MERT catastrophic errors
The MERT block triggers an interrupt when a catastrophic error occurs.
Update the interrupt handler to read the MERT catastrophic error type
and log appropriate debug message.
Lukasz Laguna [Mon, 24 Nov 2025 19:02:36 +0000 (20:02 +0100)]
drm/xe/pf: Add TLB invalidation support for MERT
Add support for triggering and handling MERT TLB invalidation. After
LMTT updates, the MERT TLB invalidation is initiated to ensure memory
translations remain coherent.
Completion of the invalidation is signaled via MERT interrupt (bit 13 in
the GFX master interrupt register). Detect and handle this interrupt to
properly synchronize the invalidation flow.
Zhanjun Dong [Mon, 27 Oct 2025 21:42:12 +0000 (17:42 -0400)]
drm/xe/uc: Change assertion to error on huc authentication failure
The fault injection test can cause the xe_huc_auth function to fail.
This is an intentional failure, so in this scenario we don't want to
throw an assert and taint the kernel, because that will impact CI
execution.
The .bulk_profile/sched_priority file is always write-only, unlike
the profile/sched_priority files which can be either read-write or
read-only (in case of PF or VFs respectively).
Shuicheng Lin [Mon, 10 Nov 2025 18:45:23 +0000 (18:45 +0000)]
drm/xe/guc: Fix resource leak in xe_guc_ct_init_noalloc()
xe_guc_ct_init_noalloc() allocates the CT workqueue and other helpers
before it tries to initialize ct->lock. If drmm_mutex_init() fails
we currently bail out without releasing those resources because the
guc_ct_fini() hasn’t been registered yet.
Since destroy_workqueue() in guc_ct_fini() may flush the workqueue, which
in turn can take the ct lock, the initialization sequence is restructured
to first initialize the ct->lock, then set up all CT state, and finally
register guc_ct_fini().
v2: guc_ct_fini() does take ct lock. (Matt)
v3: move primelockdep() together with drmm_mutex_init(). (Lucas)
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patch.msgid.link/20251110184522.1581001-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Mika Kuoppala [Thu, 20 Nov 2025 16:14:35 +0000 (18:14 +0200)]
drm/xe: Fix memory leak when handling pagefault vma
When the pagefault handling code was moved to a new file, an extra
drm_exec_init() was added to the VMA path. This call is unnecessary because
xe_validation_ctx_init() already performs a drm_exec_init(), resulting in a
memory leak reported by kmemleak.
Remove the redundant drm_exec_init() from the VMA pagefault handling code.
Fixes: fb544b844508 ("drm/xe: Implement xe_pagefault_queue_work") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: "Christian König" <christian.koenig@amd.com> Cc: intel-xe@lists.freedesktop.org Cc: linux-media@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: linaro-mm-sig@lists.linaro.org Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251120161435.3674556-1-mika.kuoppala@linux.intel.com
Sanjay Yadav [Tue, 18 Nov 2025 11:49:00 +0000 (17:19 +0530)]
drm/xe/oa: Fix potential UAF in xe_oa_add_config_ioctl()
In xe_oa_add_config_ioctl(), we accessed oa_config->id after dropping
metrics_lock. Since this lock protects the lifetime of oa_config, an
attacker could guess the id and call xe_oa_remove_config_ioctl() with
perfect timing, freeing oa_config before we dereference it, leading to
a potential use-after-free.
Fix this by caching the id in a local variable while holding the lock.
v2: (Matt A)
- Dropped mutex_unlock(&oa->metrics_lock) ordering change from
xe_oa_remove_config_ioctl()
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6614 Fixes: cdf02fe1a94a7 ("drm/xe/oa/uapi: Add/remove OA config perf ops") Cc: <stable@vger.kernel.org> # v6.11+ Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251118114859.3379952-2-sanjay.kumar.yadav@intel.com
Matt Roper [Tue, 18 Nov 2025 16:43:53 +0000 (08:43 -0800)]
drm/xe: Return forcewake reference type from force_wake_get_any_engine()
Adjust the signature of force_wake_get_any_engine() such that it returns
a 'struct xe_force_wake_ref' rather than a boolean success/failure.
Failure cases are now recognized by inspecting the hardware engine
returned by reference; a NULL hwe indicates that no engine's forcewake
could be obtained.
These changes will make it cleaner and easier to incorporate scope-based
cleanup in force_wake_get_any_engine()'s caller in a future patch.
Matt Roper [Tue, 18 Nov 2025 16:43:51 +0000 (08:43 -0800)]
drm/xe/devcoredump: Use scope-based cleanup
Use scope-based cleanup for forcewake and runtime PM in the devcoredump
code. This eliminates some goto-based error handling and slightly
simplifies other functions.
v2:
- Move the forcewake acquisition slightly higher in
devcoredump_snapshot() so that we maintain an easy-to-understand LIFO
cleanup order. (Gustavo)
Matt Roper [Tue, 18 Nov 2025 16:43:46 +0000 (08:43 -0800)]
drm/xe/mocs: Use scope-based cleanup
Using scope-based cleanup for runtime PM and forcewake in the MOCS code
allows us to eliminate some goto-based error handling and simplify some
other functions.
Matt Roper [Tue, 18 Nov 2025 16:43:45 +0000 (08:43 -0800)]
drm/xe/guc_pc: Use scope-based cleanup
Use scope-based cleanup for forcewake and runtime PM in the GuC PC code.
This allows us to eliminate to goto-based cleanup and simplifies some
other functions.
Matt Roper [Tue, 18 Nov 2025 16:43:43 +0000 (08:43 -0800)]
drm/xe/gt_idle: Use scope-based cleanup
Use scope-based cleanup for runtime PM and forcewake in the GT idle
code.
v2:
- Use scoped_guard() over guard() in idle_status_show() and
idle_residency_ms_show(). (Gustavo)
- Eliminate unnecessary 'ret' local variable in name_show().
Matt Roper [Tue, 18 Nov 2025 16:43:42 +0000 (08:43 -0800)]
drm/xe/gt: Use scope-based cleanup
Using scope-based cleanup for forcewake and runtime PM allows us to
reduce or eliminate some of the goto-based error handling and simplify
several functions.
v2:
- Drop changes to do_gt_restart(). This function still has goto-based
logic, making scope-based cleanup unsafe for now. (Gustavo)
Matt Roper [Tue, 18 Nov 2025 16:43:41 +0000 (08:43 -0800)]
drm/xe/pm: Add scope-based cleanup helper for runtime PM
Add a scope-based helpers for runtime PM that may be used to simplify
cleanup logic and potentially avoid goto-based cleanup.
For example, using
guard(xe_pm_runtime)(xe);
will get runtime PM and cause a corresponding put to occur automatically
when the current scope is exited. 'xe_pm_runtime_noresume' can be used
as a guard replacement for the corresponding 'noresume' variant.
There's also an xe_pm_runtime_ioctl conditional guard that can be used
as a replacement for xe_runtime_ioctl():
In a few rare cases (such as gt_reset_worker()) we need to ensure that
runtime PM is dropped when the function is exited by any means
(including error paths), but the function does not need to acquire
runtime PM because that has already been done earlier by a different
function. For these special cases, an 'xe_pm_runtime_release_only'
guard can be used to handle the release without doing an acquisition.
These guards will be used in future patches to eliminate some of our
goto-based cleanup.
v2:
- Specify success condition for xe_pm runtime_ioctl as _RET >= 0 so
that positive values will be properly identified as success and
trigger destructor cleanup properly.
v3:
- Add comments to the kerneldoc for the existing 'get' functions
indicating that scope-based handling should be preferred where
possible. (Gustavo)
Matt Roper [Tue, 18 Nov 2025 16:43:40 +0000 (08:43 -0800)]
drm/xe/forcewake: Add scope-based cleanup for forcewake
Since forcewake uses a reference counting get/put model, there are many
places where we need to be careful to drop the forcewake reference when
bailing out of a function early on an error path. Add scope-based
cleanup options that can be used in place of explicit get/put to help
prevent mistakes in this area.
Obtain forcewake on the XE_FW_GT domain and hold it until the
end of the current block. The wakeref will be dropped
automatically when the current scope is exited by any means
(return, break, reaching the end of the block, etc.).
Hold all forcewake domains for the following block. As with the
CLASS usage, forcewake will be dropped automatically when the
block is exited by any means.
Use of these cleanup helpers should allow us to remove some ugly
goto-based error handling and help avoid mistakes in functions with lots
of early error exits.
An 'xe_force_wake_release_only' class is also added for cases where a
forcewake reference is passed in from another function and the current
function is responsible for releasing it in every flow and error path.
v2:
- Create a separate constructor that just wraps xe_force_wake_get for
use in the class. This eliminates the need to update the signature
of xe_force_wake_get(). (Michal)
v3:
- Wrap xe_with_force_wake's 'done' marker in __UNIQUE_ID. (Gustavo)
- Add a note to xe_force_wake_get()'s kerneldoc explaining that
scope-based cleanup is preferred when possible. (Gustavo)
- Add an xe_force_wake_release_only class. (Gustavo)
v4:
- Add NULL check on fw in release_only variant. (Gustavo)
Matt Roper [Tue, 18 Nov 2025 20:26:05 +0000 (12:26 -0800)]
drm/xe/vm: Use for_each_tlb_inval() to calculate invalidation fences
ops_execute() calculates the size of a fence array based on
XE_MAX_GT_PER_TILE, while the code that actually fills in the fence
array uses a for_each_tlb_inval() iterator. This works out okay today
since both approaches come up with the same number of invalidation
fences (2: primary GT invalidation + media GT invalidation), but could
be problematic in the future if there isn't a 1:1 relationship between
TLBs needing invalidation and potential GTs on the tile.
Adjust the allocation code to use the same for_each_tlb_inval()
counting logic as the code that fills the array to future-proof the
code.
drm/xe/vf: Shadow buffer management for CCS read/write operations
CCS copy command consist of 5-dword sequence. If vCPU halts during
save/restore operations while these sequences are being programmed,
incomplete writes can cause page faults during IGPU CCS metadata saving.
Use shadow buffer management to prevent partial write issues during CCS
operations.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251118120745.3460172-3-satyanarayana.k.v.p@intel.com
drm/xe/sa: Shadow buffer support in the sub-allocator pool
The existing sub-allocator is limited to managing a single buffer object.
This enhancement introduces shadow buffer functionality to support
scenarios requiring dual buffer management.
The changes include added shadow buffer object creation capability,
Management for both primary and shadow buffers, and appropriate locking
mechanisms for thread-safe operations.
This enables more flexible buffer allocation strategies in scenarios where
shadow buffering is required.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251118120745.3460172-2-satyanarayana.k.v.p@intel.com
Current gu2host handler registered as MSI-X vector 0 and as per bspec for
a msix vector 0 interrupt, the driver must check the legacy registers
190008(TILE_INT_REG), 190060h (GT INTR Identity Reg 0) and other registers
mentioned in "Interrupt Service Routine Pseudocode" otherwise it will block
the next interrupts. To overcome this issue replacing guc2host handler
with legacy xe_irq_handler.
Fixes: da889070be7b2 ("drm/xe/irq: Separate MSI and MSI-X flows")
Bspec: 62357 Signed-off-by: Venkata Ramana Nayana <venkata.ramana.nayana@intel.com> Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://patch.msgid.link/20251107083141.2080189-1-venkata.ramana.nayana@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Michał Winiarski [Fri, 14 Nov 2025 12:23:39 +0000 (13:23 +0100)]
drm/xe/pf: Check for fence error on VRAM save/restore
The code incorrectly assumes that the VRAM save/restore fence is valid.
Fix it by checking for error.
Fixes: 49cf1b9b609fe ("drm/xe/pf: Handle VRAM migration data as part of PF control") Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251114122339.1791026-1-michal.winiarski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Michał Winiarski [Fri, 14 Nov 2025 10:07:13 +0000 (11:07 +0100)]
drm/xe/pf: Drop the VF VRAM BO reference on successful restore
The reference is only dropped on error. Fix it by adding the missing
xe_bo_put().
Fixes: 49cf1b9b609fe ("drm/xe/pf: Handle VRAM migration data as part of PF control") Reported-by: Adam Miszczak <adam.miszczak@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251114100713.1776073-1-michal.winiarski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Shuicheng Lin [Mon, 10 Nov 2025 23:26:58 +0000 (23:26 +0000)]
drm/xe: Remove duplicate DRM_EXEC selection from Kconfig
There are 2 identical "select DRM_EXEC" lines for DRM_XE.
Remove one to clean up the configuration.
Fixes: d490ecf57790 ("drm/xe: Rework xe_exec and the VM rebind worker to use the drm_exec helper") Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patch.msgid.link/20251110232657.1807998-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Matt Roper [Thu, 13 Nov 2025 23:40:39 +0000 (15:40 -0800)]
drm/xe/kunit: Fix forcewake assertion in mocs test
The MOCS kunit test calls KUNIT_ASSERT_TRUE_MSG() with a condition of
'true;' this prevents the assertion from ever failing. Replace
KUNIT_ASSERT_TRUE_MSG with KUNIT_FAIL_AND_ABORT to get the intended
failure behavior in cases where forcewake was not acquired successfully.
Fixes: 51c0ee84e4dc ("drm/xe/tests/mocs: Hold XE_FORCEWAKE_ALL for LNCF regs") Cc: Tejas Upadhyay <tejas.upadhyay@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patch.msgid.link/20251113234038.2256106-2-matthew.d.roper@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Michał Winiarski [Fri, 14 Nov 2025 13:40:30 +0000 (14:40 +0100)]
drm/xe/pf: Fix kernel-doc warning in migration_save_consume
The kernel-doc for xe_sriov_pf_migration_save_consume() contained
multiple "Return:" sections, causing a warning.
Fix it by removing the extra line.
Fixes: 67df4a5cbc583 ("drm/xe/pf: Add data structures and handlers for migration rings") Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251114134030.1795947-1-michal.winiarski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Shuicheng Lin [Wed, 12 Nov 2025 18:10:06 +0000 (18:10 +0000)]
drm/xe: Prevent BIT() overflow when handling invalid prefetch region
If user provides a large value (such as 0x80) for parameter
prefetch_mem_region_instance in vm_bind ioctl, it will cause
BIT(prefetch_region) overflow as below:
"
------------[ cut here ]------------
UBSAN: shift-out-of-bounds in drivers/gpu/drm/xe/xe_vm.c:3414:7
shift exponent 128 is too large for 64-bit type 'long unsigned int'
CPU: 8 UID: 0 PID: 53120 Comm: xe_exec_system_ Tainted: G W 6.18.0-rc1-lgci-xe-kernel+ #200 PREEMPT(voluntary)
Tainted: [W]=WARN
Hardware name: ASUS System Product Name/PRIME Z790-P WIFI, BIOS 0812 02/24/2023
Call Trace:
<TASK>
dump_stack_lvl+0xa0/0xc0
dump_stack+0x10/0x20
ubsan_epilogue+0x9/0x40
__ubsan_handle_shift_out_of_bounds+0x10e/0x170
? mutex_unlock+0x12/0x20
xe_vm_bind_ioctl.cold+0x20/0x3c [xe]
...
"
Fix it by validating prefetch_region before the BIT() usage.
v2: Add Closes and Cc stable kernels. (Matt)
Reported-by: Koen Koning <koen.koning@intel.com> Reported-by: Peter Senna Tschudin <peter.senna@linux.intel.com> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6478 Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251112181005.2120521-2-shuicheng.lin@intel.com
Xin Wang [Mon, 10 Nov 2025 22:14:58 +0000 (22:14 +0000)]
drm/xe/pat: Add helper to query compression enable status
Add xe_pat_index_get_comp_en() helper function to check whether
compression is enabled for a given PAT index by extracting the
XE2_COMP_EN bit from the PAT table entry.
There are no current users, however there are multiple in-flight series
which will all use this helper.
CC: Nitin Gote <nitin.r.gote@intel.com> CC: Sanjay Yadav <sanjay.kumar.yadav@intel.com> CC: Matt Roper <matthew.d.roper@intel.com> Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Nitin Gote <nitin.r.gote@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251110221458.1864507-2-x.wang@intel.com
Matt Roper [Mon, 10 Nov 2025 23:20:21 +0000 (15:20 -0800)]
drm/xe/oa: Store forcewake reference in stream structure
Calls to xe_force_wake_put() should generally pass the exact reference
returned by xe_force_wake_get(). Since OA grabs and releases forcewake
in different functions, xe_oa_stream_destroy() is currently calling put
with a hardcoded ALL mask. Although this works for now, it's somewhat
fragile in case OA moves to more precise power domain management in the
future.
Stash the original reference obtained during stream initialization
inside the stream structure so that we can use it directly when the
stream is destroyed.
Matt Roper [Mon, 10 Nov 2025 23:20:20 +0000 (15:20 -0800)]
drm/xe/eustall: Store forcewake reference in stream structure
Calls to xe_force_wake_put() should generally pass the exact reference
returned by xe_force_wake_get(). Since EU stall grabs and releases
forcewake in different functions, xe_eu_stall_disable_locked() is
currently calling put with a hardcoded RENDER domain. Although this
works for now, it's somewhat fragile in case the power domain(s)
required by stall sampling change in the future, or if workarounds show
up that require us to obtain additional domains.
Stash the original reference obtained during stream enable inside the
stream structure so that we can use it directly when the stream is
disabled.
Michal Wajdeczko [Wed, 12 Nov 2025 12:44:08 +0000 (13:44 +0100)]
drm/xe/pf: Use migration-friendly GGTT auto-provisioning
Instead of trying very hard to find the largest fair GGTT size that
could be allocated for VFs on the current tile, pick some smaller
rounded down to power-of-two value that is more likely to be
provisioned in the same manner by the other PF instance:
Note that due to FW/HW limitations we can't share all 4GiB GGTT
address space with VFs, so for the larger (>7) number of the VFs
the change in the outcome is happening at different points than
we have in case of GuC contexts/doorbells IDs.
Michał Winiarski [Wed, 12 Nov 2025 13:22:20 +0000 (14:22 +0100)]
drm/intel/bmg: Allow device ID usage with single-argument macros
When INTEL_BMG_G21_IDS were added as a subplatform, token concatenation
operator usage was omitted, making INTEL_BMG_IDS not usable with
single-argument macros.
Fix that by adding the missing operator.
Michał Winiarski [Wed, 12 Nov 2025 13:22:19 +0000 (14:22 +0100)]
drm/xe/pf: Add wait helper for VF FLR
VF FLR requires additional processing done by PF driver.
The processing is done after FLR is already finished from PCIe
perspective.
In order to avoid a scenario where migration state transitions while
PF processing is still in progress, additional synchronization
point is needed.
Add a helper that will be used as part of VF driver struct
pci_error_handlers .reset_done() callback.
Lukasz Laguna [Wed, 12 Nov 2025 13:22:17 +0000 (14:22 +0100)]
drm/xe/migrate: Add function to copy of VRAM data in chunks
Introduce a new function to copy data between VRAM and sysmem objects.
The existing xe_migrate_copy() is tailored for eviction and restore
operations, which involves additional logic and operates on entire
objects.
The xe_migrate_vram_copy_chunk() allows copying chunks of data to or
from a dedicated buffer object, which is essential in case of VF
migration.
Michał Winiarski [Wed, 12 Nov 2025 13:22:13 +0000 (14:22 +0100)]
drm/xe/pf: Add helpers for VF GGTT migration data handling
In an upcoming change, the VF GGTT migration data will be handled as
part of VF control state machine. Add the necessary helpers to allow the
migration data transfer to/from the HW GGTT resource.
Michał Winiarski [Wed, 12 Nov 2025 13:22:11 +0000 (14:22 +0100)]
drm/xe/pf: Switch VF migration GuC save/restore to struct migration data
In upcoming changes, the GuC VF migration data will be handled as part
of separate SAVE/RESTORE states in VF control state machine.
Now that the data is decoupled from both guc_state debugfs and PAUSE
state, we can safely remove the struct xe_gt_sriov_state_snapshot and
modify the GuC save/restore functions to operate on struct
xe_sriov_migration_data.
Michał Winiarski [Wed, 12 Nov 2025 13:22:10 +0000 (14:22 +0100)]
drm/xe/pf: Don't save GuC VF migration data on pause
In upcoming changes, the GuC VF migration data will be handled as part
of separate SAVE/RESTORE states in VF control state machine.
Remove it from PAUSE state.
Michał Winiarski [Wed, 12 Nov 2025 13:22:09 +0000 (14:22 +0100)]
drm/xe/pf: Remove GuC migration data save/restore from GT debugfs
In upcoming changes, SR-IOV VF migration data will be extended beyond
GuC data and exported to userspace using VFIO interface (with a
vendor-specific variant driver) and a device-level debugfs interface.
Remove the GT-level debugfs.
Michał Winiarski [Wed, 12 Nov 2025 13:22:08 +0000 (14:22 +0100)]
drm/xe/pf: Increase PF GuC Buffer Cache size and use it for VF migration
Contiguous PF GGTT VMAs can be scarce after creating VFs.
Increase the GuC buffer cache size to 8M for PF so that we can fit GuC
migration data (which currently maxes out at just over 4M) and use the
cache instead of allocating fresh BOs.
Michał Winiarski [Wed, 12 Nov 2025 13:22:07 +0000 (14:22 +0100)]
drm/xe: Allow the caller to pass guc_buf_cache size
An upcoming change will use GuC buffer cache as a place where GuC
migration data will be stored, and the memory requirement for that is
larger than indirect data.
Allow the caller to pass the size based on the intended usecase.
Michał Winiarski [Wed, 12 Nov 2025 13:22:06 +0000 (14:22 +0100)]
drm/xe: Add sa/guc_buf_cache sync interface
In upcoming changes the cached buffers are going to be used to read data
produced by the GuC. Add a counterpart to flush, which synchronizes the
CPU-side of suballocation with the GPU data and propagate the interface
to GuC Buffer Cache.
Michał Winiarski [Wed, 12 Nov 2025 13:22:04 +0000 (14:22 +0100)]
drm/xe/pf: Add minimalistic migration descriptor
The descriptor reuses the KLV format used by GuC and contains metadata
that can be used to quickly fail migration when source is incompatible
with destination.
Michał Winiarski [Wed, 12 Nov 2025 13:22:03 +0000 (14:22 +0100)]
drm/xe/pf: Add support for encap/decap of bitstream to/from packet
Add debugfs handlers for migration state and handle bitstream
.read()/.write() to convert from bitstream to/from migration data
packets.
As descriptor/trailer are handled at this layer - add handling for both
save and restore side.
Michał Winiarski [Wed, 12 Nov 2025 13:22:02 +0000 (14:22 +0100)]
drm/xe/pf: Add helpers for migration data packet allocation / free
Now that it's possible to free the packets - connect the restore
handling logic with the ring.
The helpers will also be used in upcoming changes that will start
producing migration data packets.
Michał Winiarski [Wed, 12 Nov 2025 13:22:01 +0000 (14:22 +0100)]
drm/xe/pf: Add data structures and handlers for migration rings
Migration data is queued in a per-GT ptr_ring to decouple the worker
responsible for handling the data transfer from the .read() and .write()
syscalls.
Add the data structures and handlers that will be used in future
commits.
Michał Winiarski [Wed, 12 Nov 2025 13:21:59 +0000 (14:21 +0100)]
drm/xe/pf: Convert control state to bitmap
In upcoming changes, the number of states will increase as a result of
introducing SAVE and RESTORE states.
This means that using unsigned long as underlying storage won't work on
32-bit architectures, as we'll run out of bits.
Use bitmap instead.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202510231918.XlOqymLC-lkp@intel.com/ Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251112132220.516975-4-michal.winiarski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Michał Winiarski [Wed, 12 Nov 2025 13:21:58 +0000 (14:21 +0100)]
drm/xe: Move migration support to device-level struct
Upcoming changes will allow users to control VF state and obtain its
migration data with a device-level granularity (not tile/gt).
Change the data structures to reflect that and move the GT-level
migration init to happen after device-level init.
Michał Winiarski [Wed, 12 Nov 2025 13:21:57 +0000 (14:21 +0100)]
drm/xe/pf: Remove GuC version check for migration support
Since commit 4eb0aab6e4434 ("drm/xe/guc: Bump minimum required GuC
version to v70.29.2"), the minimum GuC version required by the driver
is v70.29.2, which should already include everything that we need for
migration.
Remove the version check.
Sk Anirban [Wed, 12 Nov 2025 18:51:55 +0000 (00:21 +0530)]
drm/xe/guc: Eliminate RPe caching for SLPC parameter handling
RPe is runtime-determined by PCODE and caching it caused stale values,
leading to incorrect GuC SLPC parameter settings.
Drop the cached rpe_freq field and query fresh values from hardware
on each use to ensure GuC SLPC parameters reflect current RPe.
v2: Remove cached RPe frequency field (Rodrigo)
v3: Remove extra variable (Vinay)
Modify function name (Vinay)
v4: Maintain a separate function for PVC (Rodrigo)
v5: Avoid RPn update while fetching RPe frequency (Rodrigo)
v6: Split platform-specific RPe comments (Vinay)
drm/xe/pf: Allow to lockdown the PF using custom guard
Some driver components, like eudebug or ccs-mode, can't be used
when VFs are enabled. Add functions to allow those components
to block the PF from enabling VFs for the requested duration.
Introduce trivial counter to allow lockdown or exclusive access
that can be used in the scenarios where we can't follow the strict
owner semantics as required by the rw_semaphore implementation.
Before enabling VFs, the PF will try to arm the "vfs_enabling"
guard for the exclusive access. This will fail if there are
some lockdown requests already initiated by the other components.
For testing purposes, add debugfs file which will call these new
functions from the file's open/close hooks.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Christoph Manszewski <christoph.manszewski@intel.com> Reviewed-by: Christoph Manszewski <christoph.manszewski@intel.com> Link: https://patch.msgid.link/20251109162451.4779-1-michal.wajdeczko@intel.com
Lucas De Marchi [Mon, 10 Nov 2025 16:41:08 +0000 (08:41 -0800)]
drm/xe/pcode: Rework error mapping
The sparse array used for error decoding from is unnecessarily big. It
should be better handled by a switch statement that will also allow us
to more easily improve this code.
Add a CASE_ERR() macro to keep the table compact and use it instead of
the 256-entries array, which saves some space:
Kriish Sharma [Mon, 10 Nov 2025 18:42:06 +0000 (18:42 +0000)]
drm/xe: fix kernel-doc function name mismatch in xe_pm.c
Documentation build reported:
WARNING: ./drivers/gpu/drm/xe/xe_pm.c:131 expecting prototype for xe_pm_might_block_on_suspend(). Prototype was for xe_pm_block_on_suspend() instead
The kernel-doc comment for xe_pm_block_on_suspend() incorrectly used
the function name xe_pm_might_block_on_suspend(). Fix the header to
match the actual function prototype.
drm/xe/pf: Add runtime registers for GFX ver >= 35
Add a dedicated runtime register list for GFX ver >= 35.
Compared to the list for GFX >= 30, this variant drops
HUC_KERNEL_LOAD_INFO, MIRROR_FUSE1 and adds SERVICE_COPY_ENABLE.
v2:
- drop MIRROR_FUSE1 register
- update commit message
Fixes: 5e0de2dfbc1b ("drm/xe/cri: Add CRI platform definition") Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251107211845.3633633-1-piotr.piorkowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Lucas De Marchi [Fri, 7 Nov 2025 18:23:45 +0000 (10:23 -0800)]
drm/xe/vram: Move forcewake down to get_flat_ccs_offset()
With SG_TILE_ADDR_RANGE use, the only thing requiring GT forcewake while
probing for vram size is the get_flat_ccs_offset(). Move the forcewake
down where it's needed.
Fei Yang [Fri, 7 Nov 2025 18:23:44 +0000 (10:23 -0800)]
drm/xe: Use SG_TILE_ADDR_RANGE instead of TILE_ADDR_RANGE
The TILE_ADDR_RANGE register is not available on all platforms going
forward as it was deprecated and is being replaced by equivalent
registers within SoC MMIO space. While that doesn't happen, the
SG_TILE_ADDR_RANGE (base 0x1083a0) is still valid for all platforms
supported by xe. Use that instead.
Fixes: 50292f9af8ec ("drm/xe: Move 'vm_max_level' flag back to platform descriptor") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Gustavo Sousa <gustavo.sousa@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patch.msgid.link/20251108040634.6376-2-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>