Matt Roper [Thu, 5 Feb 2026 21:41:40 +0000 (13:41 -0800)]
drm/xe: Move number of XeCore fuse registers to graphics descriptor
The number of registers used to express the XeCore mask has some
"special cases" that don't always get inherited by later IP versions so
it's cleaner and simpler to record the numbers in the IP descriptor
rather than adding extra conditions to the standalone get_num_dss_regs()
function.
Note that a minor change here is that we now always treat the number of
registers as 0 for the media GT. Technically a copy of these fuse
registers does exist in the media GT as well (at the usual
0x380000+$offset location), but the value of those is always supposed to
read back as 0 because media GTs never have any XeCores or EUs.
v2:
- Add a kunit assertion to catch descriptors that forget to initialize
either count. (Gustavo)
Dump forcewake status and ref counts for all domains as part
of this debugfs. This is the sample output from gt1-
$ cat /sys/kernel/debug/dri//0/gt1/powergate_info
Media Power Gating Enabled: yes
Media Slice0 Power Gate Status: down
GSC Power Gate Status: down
GT.ref_count=0, GT.forcewake=0x10000
VDBox0.ref_count=0, VDBox0.forcewake=0x10000
VEBox0.ref_count=0, VEBox0.forcewake=0x10000
GSC.ref_count=0, GSC.forcewake=0x10000
Also, extract out the GuC RC related set/unset param functions
into xe_guc_rc file. GuC still allows us to override GuC RC mode
using an SLPC H2G interface. Continue to use that interface, but
move the related code to the newly created xe_guc_rc file.
Move enable/disable GuC RC logic into the new file. This will
allow us to independently enable/disable GuC RC and not rely
on SLPC related functions. GuC already provides separate H2G
interfaces to setup GuC RC and SLPC.
Pallavi Mishra [Thu, 29 Jan 2026 05:47:22 +0000 (05:47 +0000)]
drm/xe/tests: Fix g2g_test_array indexing
The G2G KUnit test allocates a compact N×N
matrix sized by gt_count and verifies entries
using dense indices: idx = (j * gt_count) + i
The producer path currently computes idx using
gt->info.id. However, gt->info.id values
are not guaranteed to be contiguous.
For example, with gt_count=2 and IDs {0,3},
this formula produces indices beyond the
allocated range, causing mismatches and
potential out-of-bounds access.
Update the producer to map each GT to a dense
index in [0..gt_count-1] and compute:
idx = (tx_dense * gt_count) + rx_dense
Additionally, introduce an event-based delay
in g2g_test_in_order() to ensure ordering
between sends.
drm/xe: Mutual exclusivity between CCS-mode and PF
Due to SLA agreement between PF and VFs, currently we block CCS
mode changes if driver is running as PF, even if there are no VFs
enabled yet. Use lockdown mechanism provided by the PF to relax
that limitation and still enforce above VFs related requirements.
In case of devm_add_action_or_reset() failure the provided cleanup
action will be run immediately on the not yet initialized kobject.
This may lead to errors like:
Fix that by calling kobject_init() and kobject_add() separately
and register cleanup action after the kobject is initialized.
Also make this cleanup registration a part of the create helper to
fix another mistake, as in the loop we were wrongly passing parent
kobject while registering cleanup action, and this resulted in some
undetected leaks.
Fixes: 5c170a4d9c53 ("drm/xe/pf: Prepare sysfs for SR-IOV admin attributes") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20260203235332.1350-1-michal.wajdeczko@intel.com
Matthew Brost [Fri, 30 Jan 2026 19:49:28 +0000 (11:49 -0800)]
drm/gpusvm: Allow device pages to be mapped in mixed mappings after system pages
The current code rejects device mappings whenever system pages have
already been encountered. This is not the intended behavior when
allow_mixed is set.
Relax the restriction by permitting a single pagemap to be selected when
allow_mixed is enabled, even if system pages were found earlier.
Matthew Brost [Fri, 30 Jan 2026 19:49:27 +0000 (11:49 -0800)]
drm/gpusvm: Force unmapping on error in drm_gpusvm_get_pages
drm_gpusvm_get_pages() only sets the local flags prior to committing the
pages. If an error occurs mid-mapping, has_dma_mapping will be clear,
causing the unmap function to skip unmapping pages that were
successfully mapped before the error. Fix this by forcibly setting
has_dma_mapping in the error path to ensure all previously mapped pages
are properly unmapped.
Fixes: 99624bdff867 ("drm/gpusvm: Add support for GPU Shared Virtual Memory") Cc: stable@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://patch.msgid.link/20260130194928.3255613-2-matthew.brost@intel.com
Instead of relying on fixed relation to the display probe flag,
add configfs attribute to allow an administrator to configure
desired PF operation mode in a more flexible way.
Michal Wajdeczko [Wed, 21 Jan 2026 21:42:14 +0000 (22:42 +0100)]
drm/xe/configfs: Always return consistent max_vfs value
The max_vfs parameter used by the Xe driver has its default value
definition, but it could be altered by the module parameter or by
the device specific configfs attribute.
To avoid mistakes or code duplication, always rely on the configfs
helper (or stub), which will provide necessary fallback if needed.
Michal Wajdeczko [Wed, 21 Jan 2026 21:42:12 +0000 (22:42 +0100)]
drm/xe: Keep all defaults in single header
We already have most of Xe defaults defined in xe_module.c,
where we use them for the modparam initializations, but some
were defined elsewhere, which breaks the consistency.
Introduce xe_defaults.h file, that will act as a placeholder
for all our default values, and can be used from other places.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The refactoring is going to alter the default behavior of
alloc_workqueue() to be unbound by default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU. For more details see the Link tag below.
In order to keep alloc_workqueue() behavior identical, explicitly request
WQ_PERCPU.
drm/xe: replace use of system_unbound_wq with system_dfl_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
Michal Wajdeczko [Tue, 27 Jan 2026 19:37:26 +0000 (20:37 +0100)]
drm/xe/guc: Allow second H2G retry on FLR
During VF FLR the scratch registers could be cleared both by the
GuC and by the PF driver. Allow to retry more times once we find
out that the HXG header was cleared and wait at least 256ms before
resending the same message again to the GuC.
Michal Wajdeczko [Tue, 27 Jan 2026 19:37:25 +0000 (20:37 +0100)]
drm/xe/guc: Wait before retrying sending H2G
We shall resend H2G message after receiving NO_RESPONSE_RETRY reply,
but since GuC dropped that H2G due to some interim state, we should
give it a little time to stabilize. Wait before sending the same H2G
again, start with 1ms delay, then increase exponentially to 256ms.
Michal Wajdeczko [Tue, 27 Jan 2026 19:37:23 +0000 (20:37 +0100)]
drm/xe/guc: Limit sleep while waiting for H2G credits
Instead of endlessly increasing the sleep timeout while waiting
for the H2G credits, use exponential increase only up to the given
limit, like it was initially done in the GuC submission code.
While here, fix the actual timeout to the 1s as it was documented.
Michal Wajdeczko [Wed, 28 Jan 2026 22:27:13 +0000 (23:27 +0100)]
drm/xe/pf: Simplify IS_SRIOV_PF macro
Instead of two having variants of the IS_SRIOV_PF macro, move the
CONFIG_PCI_IOV check to the xe_device_is_sriov_pf() function and
let the compiler optimize that. This will help us drop poor man's
type check of the macro parameter that fails on const xe pointer.
Shuicheng Lin [Thu, 29 Jan 2026 23:38:38 +0000 (23:38 +0000)]
drm/xe: Fix kerneldoc for xe_tlb_inval_job_alloc_dep
Correct the function name in the kerneldoc.
It is for below warning:
"Warning: drivers/gpu/drm/xe/xe_tlb_inval_job.c:210 expecting prototype for
xe_tlb_inval_alloc_dep(). Prototype was for xe_tlb_inval_job_alloc_dep()
instead"
Fixes: 15366239e2130 ("drm/xe: Decouple TLB invalidations from GT") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-8-shuicheng.lin@intel.com
Shuicheng Lin [Thu, 29 Jan 2026 23:38:37 +0000 (23:38 +0000)]
drm/xe: Fix kerneldoc for xe_gt_tlb_inval_init_early
Correct the function name in the kerneldoc.
It is for below warning:
"Warning: drivers/gpu/drm/xe/xe_tlb_inval.c:136 expecting prototype for
xe_gt_tlb_inval_init(). Prototype was for xe_gt_tlb_inval_init_early()
instead"
v2: add () for the function. (Michal)
Fixes: db16f9d90c1d9 ("drm/xe: Split TLB invalidation code in frontend and backend") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-7-shuicheng.lin@intel.com
Shuicheng Lin [Thu, 29 Jan 2026 23:38:36 +0000 (23:38 +0000)]
drm/xe: Fix kerneldoc for xe_migrate_exec_queue
Correct the function name in the kerneldoc.
It is for below warning:
"Warning: drivers/gpu/drm/xe/xe_migrate.c:1262 expecting prototype for
xe_get_migrate_exec_queue(). Prototype was for xe_migrate_exec_queue()
instead"
Fixes: 916ee4704a865 ("drm/xe/vf: Register CCS read/write contexts with Guc") Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20260129233834.419977-6-shuicheng.lin@intel.com
Shuicheng Lin [Fri, 30 Jan 2026 04:39:08 +0000 (04:39 +0000)]
drm/xe/query: Fix topology query pointer advance
The topology query helper advanced the user pointer by the size
of the pointer, not the size of the structure. This can misalign
the output blob and corrupt the following mask. Fix the increment
to use sizeof(*topo).
There is no issue currently, as sizeof(*topo) happens to be equal
to sizeof(topo) on 64-bit systems (both evaluate to 8 bytes).
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260130043907.465128-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Xin Wang [Fri, 30 Jan 2026 17:53:49 +0000 (17:53 +0000)]
drm/xe: use entry_dump callbacks for xe2+ PAT dumps
Move xe2+ PAT entry printing into the entry_dump op so platform
specific logic stays localized, simplifying future maintenance.
v2:
- Do not null xe->pat.ops for VFs.
- Skip PAT init and dump on VFs (-EOPNOTSUPP), avoiding NULL ops use.
v3:
- fixed typo
v4: (Matt)
- Switch xe2_dump() to use the new ops->entry_dump() vfunc.
- Remove xe3p_xpc_dump() and reuse the common xe2_dump() for Xe3p XPC.
- This also fixes Xe3p_HPM media PAT dumping by using the proper
non-MCR access for the PAT register range (bspec 76445).
Cc: Matt Roper <matthew.d.roper@intel.com> Suggested-by: Brian Nguyen <brian3.nguyen@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20260130175349.2249033-1-x.wang@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
drm/xe/guc: Fix kernel-doc warning in GuC scheduler ABI header
The GuC scheduler ABI header contains a file-level comment that is not
intended to document a kernel-doc symbol. Using kernel-doc comment
syntax (/** */) triggers kernel-doc warnings.
With "-Werror", this causes the build to fail. Convert the comment to a
regular block comment.
HDRTEST drivers/gpu/drm/xe/abi/guc_scheduler_abi.h
Warning: drivers/gpu/drm/xe/abi/guc_scheduler_abi.h:11 This comment starts with '/**', but isn't a kernel-doc comment. Refer to Documentation/doc-guide/kernel-doc.rst
* Generic defines required for registration with and submissions to the GuC
1 warnings as errors
make[6]: *** [drivers/gpu/drm/xe/Makefile:377: drivers/gpu/drm/xe/abi/guc_scheduler_abi.hdrtest] Error 3
make[5]: *** [scripts/Makefile.build:544: drivers/gpu/drm/xe] Error 2
make[4]: *** [scripts/Makefile.build:544: drivers/gpu/drm] Error 2
make[3]: *** [scripts/Makefile.build:544: drivers/gpu] Error 2
make[2]: *** [scripts/Makefile.build:544: drivers] Error 2
make[1]: *** [/home/kbuild2/kernel/Makefile:2088: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2
Michal Wajdeczko [Wed, 17 Dec 2025 15:07:02 +0000 (16:07 +0100)]
drm/xe/pf: Fix typo in function kernel-doc
The function name is missing an underscore, which results in:
Warning: ../drivers/gpu/drm/xe/xe_gt_sriov_pf_control.c:1261
This comment starts with '/**', but isn't a kernel-doc comment.
Refer to Documentation/doc-guide/kernel-doc.rst
* xe_gt_sriov_pf_control_trigger restore_vf() - Start ...
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://patch.msgid.link/20251217150702.2669-1-michal.wajdeczko@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
drm/xe/multi_queue: Protect priority against concurrent access
Use a spinlock to protect multi-queue priority being concurrently
updated by multiple set_priority ioctls and to protect against
concurrent read and write to this field.
Shuicheng Lin [Tue, 20 Jan 2026 18:32:43 +0000 (18:32 +0000)]
drm/xe/nvm: Defer xe->nvm assignment until init succeeds
Allocate and initialize the NVM structure using a local pointer and
assign it to xe->nvm only after all initialization steps succeed.
This avoids exposing a partially initialized xe->nvm and removes the
need to explicitly clear xe->nvm on error paths, simplifying error
handling and making the lifetime rules clearer.
Cc: Alexander Usyskin <alexander.usyskin@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Brian Nguyen <brian3.nguyen@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Brian Nguyen <brian3.nguyen@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20260120183239.2966782-8-shuicheng.lin@intel.com
Shuicheng Lin [Tue, 20 Jan 2026 18:32:42 +0000 (18:32 +0000)]
drm/xe/nvm: Fix double-free on aux add failure
After a successful auxiliary_device_init(), aux_dev->dev.release
(xe_nvm_release_dev()) is responsible for the kfree(nvm). When
there is failure with auxiliary_device_add(), driver will call
auxiliary_device_uninit(), which call put_device(). So that the
.release callback will be triggered to free the memory associated
with the auxiliary_device.
Move the kfree(nvm) into the auxiliary_device_init() failure path
and remove the err goto path to fix below error.
"
[ 13.232905] ==================================================================
[ 13.232911] BUG: KASAN: double-free in xe_nvm_init+0x751/0xf10 [xe]
[ 13.233112] Free of addr ffff888120635000 by task systemd-udevd/273
Shuicheng Lin [Tue, 20 Jan 2026 18:32:41 +0000 (18:32 +0000)]
drm/xe/nvm: Manage nvm aux cleanup with devres
Move nvm teardown to a devm-managed action registered from xe_nvm_init().
This ensures the auxiliary NVM device is deleted on probe failure and
device detach without requiring explicit calls from remove paths.
As part of this, drop xe_nvm_fini() from xe_device_remove() and from the
survivability sysfs teardown, and remove the public xe_nvm_fini() API from
the header.
Shuicheng Lin [Fri, 23 Jan 2026 18:04:26 +0000 (18:04 +0000)]
drm/xe/gt: Use CLASS() for forcewake in xe_gt_enable_comp_1wcoh
Adopt the scoped forcewake management using CLASS(xe_force_wake, ...)
to simplify the code and ensure proper resource release.
Cc: Xin Wang <x.wang@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Xin Wang <x.wang@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20260123180425.3262944-2-shuicheng.lin@intel.com
Michal Wajdeczko [Thu, 22 Jan 2026 15:19:24 +0000 (16:19 +0100)]
drm/xe/vf: Reset VF GuC state on fini
Unlike native/PF driver, which was explicitly triggering full GuC
reset during driver unwind, the VF driver was not notifying GuC that
it is about to unwind, and this could lead GuC to access stale data,
which in turn could be interpreted as VF's malicious activity.
Add managed action to send to GuC VF_RESET message during GT unwind.
drm/xe: Move _THIS_IP_ usage from xe_vm_create() to dedicated function
After commit a3866ce7b122 ("drm/xe: Add vm to exec queues association"),
building for an architecture other than x86 (which defines its own
_THIS_IP_) with clang fails with:
drivers/gpu/drm/xe/xe_vm.c:1586:3: error: cannot jump from this indirect goto statement to one of its possible targets
1586 | drm_exec_retry_on_contention(&exec);
| ^
include/drm/drm_exec.h:123:4: note: expanded from macro 'drm_exec_retry_on_contention'
123 | goto *__drm_exec_retry_ptr; \
| ^
drivers/gpu/drm/xe/xe_vm.c:1542:3: note: possible target of indirect goto statement
1542 | might_lock(&vm->exec_queues.lock);
| ^
include/linux/lockdep.h:553:33: note: expanded from macro 'might_lock'
553 | lock_release(&(lock)->dep_map, _THIS_IP_); \
| ^
include/linux/instruction_pointer.h:10:41: note: expanded from macro '_THIS_IP_'
10 | #define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; })
| ^
drivers/gpu/drm/xe/xe_vm.c:1583:2: note: jump exits scope of variable with __attribute__((cleanup))
1583 | xe_validation_guard(&ctx, &xe->val, &exec, (struct xe_val_flags) {.interruptible = true},
| ^
drivers/gpu/drm/xe/xe_validation.h:189:2: note: expanded from macro 'xe_validation_guard'
189 | scoped_guard(xe_validation, _ctx, _val, _exec, _flags, &_ret) \
| ^
include/linux/cleanup.h:442:2: note: expanded from macro 'scoped_guard'
442 | __scoped_guard(_name, __UNIQUE_ID(label), args)
| ^
include/linux/cleanup.h:433:20: note: expanded from macro '__scoped_guard'
433 | for (CLASS(_name, scope)(args); \
| ^
drivers/gpu/drm/xe/xe_vm.c:1542:3: note: jump enters a statement expression
1542 | might_lock(&vm->exec_queues.lock);
| ^
include/linux/lockdep.h:553:33: note: expanded from macro 'might_lock'
553 | lock_release(&(lock)->dep_map, _THIS_IP_); \
| ^
include/linux/instruction_pointer.h:10:20: note: expanded from macro '_THIS_IP_'
10 | #define _THIS_IP_ ({ __label__ __here; __here: (unsigned long)&&__here; })
| ^
While this is a false positive error because __drm_exec_retry_ptr is
only ever assigned the label in drm_exec_until_all_locked() (thus it can
never jump over the cleanup variable), this error is not unreasonable in
general because the only supported use case for taking the address of a
label is computed gotos [1]. The kernel's use of the address of a label
in _THIS_IP_ is considered problematic by both GCC [2][3] and clang [4]
but they need to provide something equivalent before they can break this
use case.
Hide the usage of _THIS_IP_ by moving the CONFIG_PROVE_LOCKING if
statement to its own function, avoiding the error. This is similar to
commit 187e16f69de2 ("drm/xe: Work around clang multiple goto-label
error") but with the sources of _THIS_IP_.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patch.msgid.link/20260109211041.2446012-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Vinay Belgaumkar [Sat, 24 Jan 2026 00:59:17 +0000 (16:59 -0800)]
drm/xe/ptl: Disable DCC on PTL
On PTL, the recommendation is to disable DCC(Duty Cycle Control) as
it may cause some regressions due to added latencies. Upcoming GuC
releases will disable DCC on PTL as well, but we need to force it in
KMD so that this behavior is propagated to older kernels.
Shuicheng Lin [Thu, 22 Jan 2026 21:40:54 +0000 (21:40 +0000)]
drm/xe: Skip address copy for sync-only execs
For parallel exec queues, xe_exec_ioctl() copied the batch buffer address
array from userspace without checking num_batch_buffer.
If user creates a sync-only exec that doesn't use the address field, the
exec will fail with -EFAULT.
Add num_batch_buffer check to skip the copy, and the exec could be executed
successfully.
Here is the sync-only exec:
struct drm_xe_exec exec = {
.extensions = 0,
.exec_queue_id = qid,
.num_syncs = 1,
.syncs = (uintptr_t)&sync,
.address = 0, /* ignored for sync-only */
.num_batch_buffer = 0, /* sync-only */
};
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260122214053.3189366-2-shuicheng.lin@intel.com
Nitin Gote [Tue, 20 Jan 2026 05:47:25 +0000 (11:17 +0530)]
drm/xe: derive mem copy capability from graphics version
Drop .has_mem_copy_instr from the platform descriptors and set it
in xe_info_init() after handle_gmdid() populates graphics_verx100.
Centralizing the GRAPHICS_VER(xe) >= 20 check keeps MEM_COPY enabled
on Xe2+ and removes redundant per-platform plumbing.
Bspec: 57561
Fixes: 1e12dbae9d72 ("drm/xe/migrate: support MEM_COPY instruction") Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Link: https://patch.msgid.link/20260120054724.1982608-2-nitin.r.gote@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Sanjay Yadav [Wed, 21 Jan 2026 11:14:17 +0000 (16:44 +0530)]
drm/xe: Use DRM_BUDDY_CONTIGUOUS_ALLOCATION for contiguous allocations
The VRAM/stolen memory managers do not currently set
DRM_BUDDY_CONTIGUOUS_ALLOCATION for contiguous allocations. Enabling
this flag activates the buddy allocator's try_harder path, which helps
handle fragmented memory scenarios.
This enables the __alloc_contig_try_harder fallback in the buddy
allocator, allowing contiguous allocation requests to succeed even when
memory is fragmented by combining allocations from both(RHS and LHS)
sides of a large free block.
v2: (Matt B)
- Remove redundant logic for rounding allocation size and trimming when
TTM_PL_FLAG_CONTIGUOUS is set, since drm_buddy now handles this when
DRM_BUDDY_CONTIGUOUS_ALLOCATION is enabled
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6713 Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260121111416.3104399-2-sanjay.kumar.yadav@intel.com
Thomas Hellström [Wed, 21 Jan 2026 09:10:48 +0000 (10:10 +0100)]
drm/xe: Select CONFIG_DEVICE_PRIVATE when DRM_XE_GPUSVM is selected
CONFIG_DEVICE_PRIVATE is a prerequisite for DRM_XE_GPUSVM.
Explicitly select it so that DRM_XE_GPUSVM is not unintentionally
left out from distro configs not explicitly enabling
CONFIG_DEVICE_PRIVATE.
v2:
- Select also CONFIG_ZONE_DEVICE since it's needed by
CONFIG_DEVICE_PRIVATE.
v3:
- Depend on CONFIG_ZONE_DEVICE rather than selecting it.
Cc: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <dri-devel@lists.freedesktop.org> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20260121091048.41371-3-thomas.hellstrom@linux.intel.com
Thomas Hellström [Wed, 21 Jan 2026 09:10:47 +0000 (10:10 +0100)]
drm, drm/xe: Fix xe userptr in the absence of CONFIG_DEVICE_PRIVATE
CONFIG_DEVICE_PRIVATE is not selected by default by some distros,
for example Fedora, and that leads to a regression in the xe driver
since userptr support gets compiled out.
It turns out that DRM_GPUSVM, which is needed for xe userptr support
compiles also without CONFIG_DEVICE_PRIVATE, but doesn't compile
without CONFIG_ZONE_DEVICE.
Exclude the drm_pagemap files from compilation with !CONFIG_ZONE_DEVICE,
and remove the CONFIG_DEVICE_PRIVATE dependency from CONFIG_DRM_GPUSVM and
the xe driver's selection of it, re-enabling xe userptr for those configs.
v2:
- Don't compile the drm_pagemap files unless CONFIG_ZONE_DEVICE is set.
- Adjust the drm_pagemap.h header accordingly.
Fixes: 9e9787414882 ("drm/xe/userptr: replace xe_hmm with gpusvm") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.18+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patch.msgid.link/20260121091048.41371-2-thomas.hellstrom@linux.intel.com
Matthew Auld [Tue, 20 Jan 2026 11:06:11 +0000 (11:06 +0000)]
drm/xe/migrate: fix job lock assert
We are meant to be checking the user vm for the bind queue, but actually
we are checking the migrate vm. For various reasons this is not
currently firing but this will likely change in the future.
Now that we have the user_vm attached to the bind queue, we can fix this
by directly checking that here.
Fixes: dba89840a920 ("drm/xe: Add GT TLB invalidation jobs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Arvind Yadav <arvind.yadav@intel.com> Link: https://patch.msgid.link/20260120110609.77958-4-matthew.auld@intel.com
Matthew Auld [Tue, 20 Jan 2026 11:06:10 +0000 (11:06 +0000)]
drm/xe/uapi: disallow bind queue sharing
Currently this is very broken if someone attempts to create a bind
queue and share it across multiple VMs. For example currently we assume
it is safe to acquire the user VM lock to protect some of the bind queue
state, but if allow sharing the bind queue with multiple VMs then this
quickly breaks down.
To fix this reject using a bind queue with any VM that is not the same
VM that was originally passed when creating the bind queue. This a uAPI
change, however this was more of an oversight on kernel side that we
didn't reject this, and expectation is that userspace shouldn't be using
bind queues in this way, so in theory this change should go unnoticed.
Based on a patch from Matt Brost.
v2 (Matt B):
- Hold the vm lock over queue create, to ensure it can't be closed as
we attach the user_vm to the queue.
- Make sure we actually check for NULL user_vm in destruction path.
v3:
- Fix error path handling.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Reported-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Michal Mrozek <michal.mrozek@intel.com> Cc: Carl Zhang <carl.zhang@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Arvind Yadav <arvind.yadav@intel.com> Acked-by: Michal Mrozek <michal.mrozek@intel.com> Link: https://patch.msgid.link/20260120110609.77958-3-matthew.auld@intel.com
Matthew Brost [Fri, 16 Jan 2026 22:17:31 +0000 (14:17 -0800)]
drm/xe: Add context-based invalidation to GuC TLB invalidation backend
Introduce context-based invalidation support to the GuC TLB invalidation
backend. This is implemented by iterating over each exec queue per GT
within a VM, skipping inactive queues, and issuing a context-based (GuC
ID) H2G TLB invalidation. All H2G messages, except the final one, are
sent with an invalid seqno, which the G2H handler drops to ensure the
TLB invalidation fence is only signaled once all H2G messages are
completed.
A watermark mechanism is also added to switch between context-based TLB
invalidations and full device-wide invalidations, as the return on
investment for context-based invalidation diminishes when many exec
queues are mapped.
v2:
- Fix checkpatch warnings
v3:
- Rebase on PRL
- Use ref counting to avoid racing with deregisters
v4:
- Extra braces (Stuart)
- Use per GT list (CI)
- Reorder put
Matthew Brost [Fri, 16 Jan 2026 22:17:30 +0000 (14:17 -0800)]
drm/xe: Add exec queue active vfunc
If an exec queue is inactive (e.g., not registered or scheduling is
disabled), TLB invalidations are not issued for that queue. Add a
virtual function to determine the active state, which TLB invalidation
logic can hook into.
Matthew Brost [Fri, 16 Jan 2026 22:17:29 +0000 (14:17 -0800)]
drm/xe: Add xe_tlb_inval_idle helper
Introduce the xe_tlb_inval_idle helper to detect whether any TLB
invalidations are currently in flight. This is used in context-based TLB
invalidations to determine whether dummy TLB invalidations need to be
sent to maintain proper TLB invalidation fence ordering..
v2:
- Implement xe_tlb_inval_idle based on pending list
Matthew Brost [Fri, 16 Jan 2026 22:17:28 +0000 (14:17 -0800)]
drm/xe: Add send_tlb_inval_ppgtt helper
Extract the common code that issues a TLB invalidation H2G for PPGTTs
into a helper function. This helper can be reused for both ASID-based
and context-based TLB invalidations.
Matthew Brost [Fri, 16 Jan 2026 22:17:27 +0000 (14:17 -0800)]
drm/xe: Rename send_tlb_inval_ppgtt to send_tlb_inval_asid_ppgtt
Context-based TLB invalidations have their own set of GuC TLB
invalidation operations. Rename the current PPGTT invalidation function,
which operates on ASIDs, to a more descriptive name that reflects its
purpose.
Matthew Brost [Fri, 16 Jan 2026 22:17:25 +0000 (14:17 -0800)]
drm/xe: Add vm to exec queues association
Maintain a list of exec queues per vm which will be used by TLB
invalidation code to do context-ID based tlb invalidations.
v4:
- More asserts (Stuart)
- Per GT list (CI)
- Skip adding / removal if context TLB invalidatiions not supported
(Stuart)
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Tested-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260116221731.868657-6-matthew.brost@intel.com
Matthew Brost [Fri, 16 Jan 2026 22:17:24 +0000 (14:17 -0800)]
drm/xe: Add xe_device_asid_to_vm helper
Introduce the xe_device_asid_to_vm helper, which can be used throughout
the driver to resolve the VM from a given ASID.
v4:
- Move forward declare after includes (Stuart)
Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Tested-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260116221731.868657-5-matthew.brost@intel.com
Matthew Brost [Fri, 16 Jan 2026 22:17:22 +0000 (14:17 -0800)]
drm/xe: Make usm.asid_to_vm allocation use GFP_NOWAIT
Ensure the asid_to_vm lookup is reclaim-safe so it can be performed
during TLB invalidations, which is necessary for context-based TLB
invalidation support.
Matthew Brost [Fri, 16 Jan 2026 22:17:21 +0000 (14:17 -0800)]
drm/xe: Add normalize_invalidation_range
Extract the code that determines the alignment of TLB invalidation into
a helper function — normalize_invalidation_range. This will be useful
when adding context-based invalidations to the GuC TLB invalidation
backend.
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Tested-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260116221731.868657-2-matthew.brost@intel.com
Matthew Brost [Fri, 16 Jan 2026 22:03:32 +0000 (14:03 -0800)]
drm/xe: Ban entire multi-queue group on any job timeout
In multi-queue mode, we only have control over the entire group, so we
cannot ban individual queues or signal fences until the whole group is
removed from hardware. Implement banning of the entire group if any job
within it times out.
v2:
- Fix lock inversion (Niranjana)
- Initialize new queues in group to stopped
v3:
- Blindly call xe_exec_queue_multi_queue_primary (Niranjana)
- More comments around temporary list when stopping (Niranjana)
- Restart group on false timeout (Niranjana)
Nakshtra Goyal [Tue, 13 Jan 2026 09:19:28 +0000 (14:49 +0530)]
drm/xe/xe_query: Remove check for gt
There's no need to check a userspace-provided GT ID (which may come from
any tile) against the number of GTs that can be present on a single
tile. The xe_device_get_gt() lookup already checks that the GT ID passed
is valid for the current device.(Matt Roper)
Matthew Brost [Wed, 14 Jan 2026 18:49:05 +0000 (10:49 -0800)]
drm/xe: Reduce LRC timestamp stuck message on VFs to notice
An LRC timestamp getting stuck is a somewhat normal occurrence. If a
single VF submits a job that does not get timesliced, the LRC timestamp
will not increment. Reduce the LRC timestamp stuck message on VFs to
notice (same log level as job timeout) to avoid false CI bugs in tests
where a VF submits a job that does not get timesliced.
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7032 Fixes: bb63e7257e63 ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR") Suggested-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://patch.msgid.link/20260114184905.4189026-1-matthew.brost@intel.com
Matt Roper [Thu, 15 Jan 2026 03:28:02 +0000 (19:28 -0800)]
drm/xe: Cleanup unused header includes
clangd reports many "unused header" warnings throughout the Xe driver.
Start working to clean this up by removing unnecessary includes in our
.c files and/or replacing them with explicit includes of other headers
that were previously being included indirectly.
By far the most common offender here was unnecessary inclusion of
xe_gt.h. That likely originates from the early days of xe.ko when
xe_mmio did not exist and all register accesses, including those
unrelated to GTs, were done with GT functions.
There's still a lot of additional #include cleanup that can be done in
the headers themselves; that will come as a followup series.
v2:
- Squash the 79-patch series down to a single patch. (MattB)
Michal Wajdeczko [Mon, 12 Jan 2026 18:37:16 +0000 (19:37 +0100)]
drm/xe/mert: Improve handling of MERT CAT errors
All MERT catastrophic errors but VF's LMTT fault are serious, so
we shouldn't limit our handling only to print debug messages.
Change CATERR message to error level and then declare the device
as wedged to match expectation from the design document. For the
LMTT faults, add a note about adding tracking of this unexpected
VF activity.
While at it, rename register fields defnitions to match the BSpec.
Also drop trailing include guard name from the regs.h file.
Fei Yang [Mon, 12 Jan 2026 22:03:30 +0000 (14:03 -0800)]
drm/xe: vram addr range is expanded to bit[17:8]
The bit field used to be [14:8] with [17:15] marked as SPARE and
defaulted to 0. So, simply expand the read to bit[17:8] assuming
the platforms using only bit[14:8] have zeros in the expanded bits.
Marco Crivellari [Mon, 12 Jan 2026 09:44:06 +0000 (10:44 +0100)]
drm/xe: Replace use of system_wq with tlb_inval->timeout_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen, workqueue users must be converted to the better named
new workqueues with no intended behaviour changes:
This way the old obsolete workqueues (system_wq, system_unbound_wq) can be
removed in the future.
After a carefully evaluation, because this is the fence signaling path, we
changed the code in order to use one of the Xe's workqueue.
So, a new workqueue named 'timeout_wq' has been added to
'struct xe_tlb_inval' and has been initialized with 'gt->ordered_wq'
changing the system_wq uses with tlb_inval->timeout_wq.
Karthik Poosa [Mon, 12 Jan 2026 20:35:21 +0000 (02:05 +0530)]
drm/xe/hwmon: Expose individual VRAM channel temperature
Expose individual VRAM temperature attributes.
Update Xe hwmon documentation for this entry.
v2:
- Avoid using default switch case for VRAM individual temperatures.
- Append labels with VRAM channel number.
- Update kernel version in Xe hwmon documentation.
v3:
- Add missing brackets in Xe hwmon documentation from VRAM channel sysfs.
- Reorder BMG_VRAM_TEMPERATURE_N macro in xe_pcode_regs.h.
- Add api to check if VRAM is available on the channel.
v4:
- Improve VRAM label handling to eliminate temp variable by
introducing a dedicated array vram_label in xe_hwmon_thermal_info.
- Remove a magic number.
- Change the label from vram_X to vram_ch_X.
v5:
- Address review comments from Raag.
- Change vram to VRAM in commit title and subject.
- Refactor BMG_VRAM_TEMPERATURE_N macro.
- Refactor is_vram_ch_available().
- Rephrase a comment.
- Check individual VRAM temperature limits in addition to VRAM
availability in xe_hwmon_temp_is_visible. (Raag)
- Move VRAM label change out of this patch.
v6:
- Use in_range() for VRAM_N index check instead of if check. (Raag)
- Minor aesthetic changes.
Karthik Poosa [Mon, 12 Jan 2026 20:35:20 +0000 (02:05 +0530)]
drm/xe/hwmon: Expose GPU PCIe temperature
Expose GPU PCIe average temperature and its limits via hwmon sysfs entry
temp5_xxx.
Update Xe hwmon sysfs documentation for this.
v2: Update kernel version in Xe hwmon documentation. (Raag)
v3:
- Address review comments from Raag.
- Remove redundant debug log.
- Update kernel version in Xe hwmon documentation. (Raag)
v4:
- Address review comments from Raag.
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Use TEMP_MASK instead of TEMP_MASK_MAILBOX.
- Add PCIE_SENSOR_MASK which uses REG_FIELD_GET as replacement of
PCIE_SENSOR_SHIFT.
v5:
- Address review comments from Raag.
- Use REG_FIELD_GET to get PCIe temperature.
- Move PCIE_SENSOR_GROUP_ID and PCIE_SENSOR_MASK to xe_pcode_api.h
- Cosmetic change.
Karthik Poosa [Mon, 12 Jan 2026 20:35:19 +0000 (02:05 +0530)]
drm/xe/hwmon: Expose memory controller temperature
Expose GPU memory controller average temperature and its limits under
temp4_xxx.
Update Xe hwmon documentation for this.
v2:
- Rephrase commit message. (Badal)
- Update kernel version in Xe hwmon documentation. (Raag)
v3:
- Update kernel version in Xe hwmon documentation.
- Address review comments from Raag.
- Remove obvious comments.
- Remove redundant debug logs.
- Remove unnecessary checks.
- Avoid magic numbers.
- Add new comments.
- Use temperature sensors count to make memory controller visible.
- Use temperature limits of package for memory controller.
v4:
- Address review comments from Raag.
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Use DIV_ROUND_UP to calculate dwords needed for temperature limits.
- Minor aesthetic refinements.
- Remove unused TEMP_MASK_MAILBOX.
v5:
- Use REG_FIELD_GET to get count from READ_THERMAL_DATA output. (Raag)
- Change count print from decimal to hexadecimal.
- Cosmetic changes.
Karthik Poosa [Mon, 12 Jan 2026 20:35:18 +0000 (02:05 +0530)]
drm/xe/hwmon: Expose temperature limits
Read temperature limits using pcode mailbox and expose shutdown
temperature limit as tempX_emergency, critical temperature limit as
tempX_crit and GPU max temperature limit as temp2_max.
Update Xe hwmon documentation with above entries.
v2:
- Resolve a documentation warning.
- Address below review comments from Raag.
- Update date and kernel version in Xe hwmon documentation.
- Remove explicit disable of has_mbx_thermal_info for unsupported
platforms.
- Remove unnecessary default case in switches.
- Remove obvious comments.
- Use TEMP_LIMIT_MAX to compute number of dwords needed in
xe_hwmon_thermal_info.
- Remove THERMAL_LIMITS_DWORDS macro.
- Use has_mbx_thermal_info for checking thermal mailbox support.
v3:
- Address below minor comments. (Raag)
- Group new temperature attributes with existing temperature attributes
as per channel index in Xe hwmon documentation.
- Rename enums of xe_temp_limit to improve clarity.
- Use DIV_ROUND_UP to calculate dwords needed for temperature limits.
- Use return instead of breaks in xe_hwmon_temp_read.
- Minor aesthetic refinements.
v4:
- Remove a redundant break. (Raag)
- Update drm_dbg to drm_warn to inform user of unavailability for
thermal mailbox on expected platforms.
drm/xe/gsc: Make GSC FW load optional for newer platforms
On newer platforms GSC FW is only required for content protection
features, so the core driver features work perfectly fine without it
(and we did in fact not enable it to start with on PTL). Therefore, we
can selectively enable the GSC only if the FW is found on disk, without
failing if it is not found.
Note that this means that the FW can now be enabled (i.e., we're looking
for it) but not available (i.e., we haven't found it), so checks on FW
support should use the latter state to decide whether to go on or not.
As part of the rework, the message for FW not found has been cleaned up
to be more readable.
While at it, drop the comment about xe_uc_fw_init() since the code has
been reworked and the statement no longer applies.
drm/xe/device: Convert wait for lmem init into an assert
Prior to lmem init check, driver is waiting for the pcode uncore_init
status. uncore_init status will be flagged after the complete boot and
initialization of the SoC by the pcode. uncore_init confirms that lmem
init and mmio unblock has been already completed.
It makes no sense to check for lmem init after the pcode uncore_init
check. So change the wait for lmem init check into an assert which
confirms lmem init is set.
A careful inspection of __xe_ggtt_insert_bo_at() shows that
the ggtt_node can always be seen as inserted from xe_bo.c
due to the way error handling is performed.
The checks are also a little bit too paranoid, since we
never create a bo with ggtt_node[id] initialised but not
inserted into the GGTT, which can be seen by looking at
__xe_ggtt_insert_bo_at()
Additionally, the size of the GGTT is never bigger than 4 GB,
so adding a check at that level is incorrect.
drm/xe: Convert xe_fb_pin to use a callback for insertion into GGTT
The rotation details belong in xe_fb_pin.c, while the operations involving
GGTT belong to xe_ggtt.c. As directly locking xe_ggtt etc results in
exposing all of xe_ggtt details anyway, create a special function that
allocates a ggtt_node, and allow display to populate it using a callback
as a compromise.
drm/xe: Start using ggtt->start in preparation of balloon removal
Instead of having ggtt->size point to the end of ggtt, have ggtt->size
be the actual size of the GGTT, and introduce ggtt->start to point to
the beginning of GGTT.
This will allow a massive cleanup of GGTT in case of SRIOV-VF.
drm/xe/mert: Use local mert variable to simplify the code
There is no need to always refer to MERT data using tile pointer.
Use of local mert pointer will simplify the code and make it look
like other existing MERT function.
Michal Wajdeczko [Sun, 11 Jan 2026 21:38:47 +0000 (22:38 +0100)]
drm/xe/mert: Always refer to MERT using xe_device
There is only one MERT instance and while it is located on the root
tile, it is safer to refer to it using xe_device rather than xe_tile.
This will also allow to align signature with other MERT function.
Matthew Brost [Sat, 10 Jan 2026 01:27:39 +0000 (17:27 -0800)]
drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR
We now have proper infrastructure to accurately check the LRC timestamp
without toggling the scheduling state for non-VFs. For VFs, it is still
possible to get an inaccurate view if the context is on hardware. We
guard against free-running contexts on VFs by banning jobs whose
timestamps are not moving. In addition, VFs have a timeslice quantum
that naturally triggers context switches when more than one VF is
running, thus updating the LRC timestamp.
For multi-queue, it is desirable to avoid scheduling toggling in the TDR
because this scheduling state is shared among many queues. Furthermore,
this change simplifies the GuC state machine. The trade-off for VF cases
seems worthwhile.
v5:
- Add xe_lrc_timestamp helper (Umesh)
v6:
- Reduce number of tries on stuck timestamp (VF testing)
- Convert job timestamp save to a memory copy (VF testing)
v7:
- Save ctx timestamp to LRC when start VF job (VF testing)