drm/xe/multi_queue: Set QUEUE_DRAIN_MODE for Multi Queue batches
To properly support soft light restore between batches
being arbitrated at the CFEG, PIPE_CONTROL instructions
have a new bit in the first DW, QUEUE_DRAIN_MODE. When
set, this indicates to the CFEG that it should only
drain the current queue.
Additionally, we no longer set the CS_STALL bit for these multi queue
exec queues, as it stalls the entire pipeline waiting for completion of
the prior batch, preventing soft light restore from occurring between
queues in a queue group.
v4: Assert !multi_queue where applicable (Matt Roper)
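As an illustration of the intended flag selection (a sketch only:
PIPE_CONTROL0_QUEUE_DRAIN_MODE is a placeholder name for the new DW0 bit,
and the helper follows the xe_ring_ops pattern of a DW0 bit group plus the
classic flags dword rather than the exact upstream code):

static int emit_flush_for_queue(u32 *dw, int i, u32 flush_flags,
				bool multi_queue)
{
	u32 bit_group_0 = 0;
	u32 bit_group_1 = flush_flags;

	if (multi_queue)
		/* Ask the CFEG to drain only the current queue... */
		bit_group_0 |= PIPE_CONTROL0_QUEUE_DRAIN_MODE;
	else
		/* ...while non multi-queue batches keep the full CS stall. */
		bit_group_1 |= PIPE_CONTROL_CS_STALL;

	dw[i++] = GFX_OP_PIPE_CONTROL(6) | bit_group_0;
	dw[i++] = bit_group_1;
	dw[i++] = 0;	/* post-sync address low */
	dw[i++] = 0;	/* post-sync address high */
	dw[i++] = 0;	/* post-sync value low */
	dw[i++] = 0;	/* post-sync value high */

	return i;
}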
drm/xe/multi_queue: Handle tearing down of a multi queue
All queues of a multi queue group use the group's primary queue to
interface with the GuC, so there is a dependency between the queues of
the group. Hence, when the primary queue of a multi queue group is
cleaned up, also trigger a cleanup of the secondary queues. During
cleanup, stop and restart submission for all queues of the group to
avoid any submission happening in parallel while a queue is being
cleaned up.
v2: Initialize group->list_lock, add fs_reclaim dependency, remove
unwanted secondary queues cleanup (Matt Brost)
v3: Properly handle cleanup of multi-queue group (Matt Brost)
v4: Fix IS_ENABLED(CONFIG_LOCKDEP) check (Matt Brost)
Revert stopping/restarting of submissions on queues of the
group in TDR as it is not needed.
drm/xe/multi_queue: Add support for multi queue dynamic priority change
Support dynamic priority changes for queues of a multi queue group via
the exec queue set_property ioctl. Issue a CGP_SYNC command to the GuC
through the DRM scheduler message interface for the priority change to
take effect.
v2: Move is_multi_queue check to exec_queue layer and assert
is_multi_queue being set in guc submission layer (Matt Brost)
v3: Assert CGP_SYNC message length is valid (Matt Brost)
drm/xe/multi_queue: Add exec_queue set_property ioctl support
Add support for the exec_queue set_property ioctl. It is derived from
the original work that is part of
https://patchwork.freedesktop.org/series/112188/
Currently, only the DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE_PRIORITY
property can be dynamically set.
v2: Check for and update kernel-doc which property this ioctl
supports (Matt Brost)
Only the MULTI_QUEUE_PRIORITY property is valid for secondary queues of
a multi queue group, and MULTI_QUEUE_PRIORITY in turn applies only to
queues of a multi queue group. Detect invalid user property settings and
return an error.
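A hypothetical userspace sketch of a dynamic priority change, assuming the
ioctl keeps the shape of the original series (an exec_queue_id / property /
value triple); the struct layout and ioctl name are illustrative, not the
final uAPI:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

static int set_multi_queue_priority(int fd, uint32_t exec_queue_id,
				    uint64_t priority)
{
	struct drm_xe_exec_queue_set_property prop = {
		.exec_queue_id = exec_queue_id,
		.property = DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE_PRIORITY,
		.value = priority,	/* e.g. a value from enum xe_multi_queue_priority */
	};

	/* The kernel issues a CGP_SYNC to the GuC so the change takes effect. */
	return ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_SET_PROPERTY, &prop);
}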
drm/xe/multi_queue: Add multi queue priority property
Add support for queues of a multi queue group to set
their priority within the queue group by adding property
DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE_PRIORITY.
This is the only other property supported by secondary
queues of a multi queue group, other than
DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE.
v2: Add kernel doc for enum xe_multi_queue_priority,
Add assert for priority values, fix includes and
declarations (Matt Brost)
v3: update uapi kernel-doc (Matt Brost)
v4: uapi change due to rebase
drm/xe/multi_queue: Add GuC interface for multi queue support
Implement the GuC commands and responses along with the Context Group
Page (CGP) interface for multi queue support.
Ensure that only the primary queue (q0) of a multi queue group
communicates with the GuC. The secondary queues of the group only need
to maintain their LRCA and interface with the DRM scheduler.
Use the primary queue's submit_wq for all secondary queues of a multi
queue group. This serialization avoids any locking around CGP
synchronization with the GuC.
v2: Fix G2H_LEN_DW_MULTI_QUEUE_CONTEXT value, add more comments
(Matt Brost)
v3: Minor code refactor, use xe_gt_assert
v4: Use xe_guc_ct_wake_waiters(), remove vf recovery support
(Matt Brost)
drm/xe/multi_queue: Add user interface for multi queue support
Multi Queue is a new mode of execution supported by the compute and
blitter copy command streamers (CCS and BCS, respectively). It is an
enhancement of the existing hardware architecture and leverages the
same submission model. It enables support for efficient, parallel
execution of multiple queues within a single context. All the queues
of a group must use the same address space (VM).
The new DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE execution queue
property supports creating a multi queue group and adding queues to
a queue group. All queues of a multi queue group share the same
context.
An exec queue create ioctl call with the above property set to the value
DRM_XE_SUPER_GROUP_CREATE will create a new multi queue group with the
queue being created as the primary queue (aka q0) of the group. To add
secondary queues to the group, they need to be created with the above
property set to the id of the primary queue. The properties of the
primary queue (like priority and timeslice) apply to the whole group,
so these properties can't be set for secondary queues of a group.
Once destroyed, the secondary queues of a multi queue group can't be
replaced. However, they can be dynamically added to the group up to a
total of 64 queues per group. Once the primary queue is destroyed,
secondary queues can't be added to the queue group.
v2: Remove group->lock, fix xe_exec_queue_group_add()/delete()
function semantics, add additional comments, remove unused
group->list_lock, add XE_BO_FLAG_GGTT_INVALIDATE for cgp bo,
Assert LRC is valid, update uapi kernel doc.
(Matt Brost)
v3: Use XE_BO_FLAG_PINNED_LATE_RESTORE/USER_VRAM/GGTT_INVALIDATE
flags for cgp bo (Matt)
v4: Ensure queue is not a vm_bind queue
uapi change due to rebase
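A userspace sketch of building a group, using the existing
drm_xe_exec_queue_create / drm_xe_ext_set_property uAPI; the MULTI_QUEUE
property and DRM_XE_SUPER_GROUP_CREATE value come from this series, and
error handling is elided:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

static uint32_t create_queue(int fd, uint32_t vm_id,
			     struct drm_xe_engine_class_instance *eci,
			     uint64_t multi_queue_value)
{
	struct drm_xe_ext_set_property ext = {
		.base.name = DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY,
		.property = DRM_XE_EXEC_QUEUE_SET_PROPERTY_MULTI_QUEUE,
		.value = multi_queue_value,
	};
	struct drm_xe_exec_queue_create create = {
		.extensions = (uintptr_t)&ext,
		.width = 1,
		.num_placements = 1,
		.vm_id = vm_id,
		.instances = (uintptr_t)eci,
	};

	ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create);
	return create.exec_queue_id;
}

/* q0 creates the group; secondary queues join it by naming q0 and must use the same VM. */
/*   uint32_t q0 = create_queue(fd, vm_id, &eci, DRM_XE_SUPER_GROUP_CREATE); */
/*   uint32_t q1 = create_queue(fd, vm_id, &eci, q0);                        */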
drm/xe/vf: Reset recovery_queued after issuing RESFIX_START
During VF_RESTORE or VF_RESUME, the GuC sends a migration interrupt and
clears the RESFIX_START marker. If a migration or resume occurs before
the VF issues its own RESFIX_START, the VF KMD may receive two
back-to-back migration interrupts. The VF then sends RESFIX_START to
indicate the beginning of fixups and RESFIX_DONE to mark completion.
However, the second RESFIX_START fails because the GuC is already in the
RUNNING state.
Clear the recovery_queued flag after sending a RESFIX_START message to
ignore duplicated IRQs seen before we start actual recovery.
This ensures the state is reset only after the fixup process begins,
avoiding redundant work item queuing.
Fixes: b5fbb94341a2 ("drm/xe/vf: Introduce RESFIX start marker support") Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251210052546.622809-6-satyanarayana.k.v.p@intel.com
Ensure VF migration recovery work is only queued when no recovery is
already queued and teardown is not in progress.
Fixes: b47c0c07c350 ("drm/xe/vf: Teardown VF post migration worker on driver unload") Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251210052546.622809-5-satyanarayana.k.v.p@intel.com
drm/xe/bo: Don't include the CCS metadata in the dma-buf sg-table
Some Xe bos are allocated with extra backing-store for the CCS
metadata. It's never been the intention to share the CCS metadata
when exporting such bos as dma-buf. Don't include it in the
dma-buf sg-table.
Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Karol Wachowski <karol.wachowski@linux.intel.com> Link: https://patch.msgid.link/20251209204920.224374-1-thomas.hellstrom@linux.intel.com
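A sketch of the idea only (the helper names mirror the xe dma-buf map path
but this is not the exact diff): build the exported sg-table from the
user-visible BO size instead of the full TTM page array, so the CCS pages
appended at the end are left out.

static struct sg_table *xe_dmabuf_pages_to_sg(struct xe_bo *bo)
{
	struct ttm_tt *tt = bo->ttm.ttm;
	/* Pages covering the user-visible BO only, excluding CCS metadata. */
	unsigned long num_pages = xe_bo_size(bo) >> PAGE_SHIFT;

	WARN_ON(num_pages > tt->num_pages);

	return drm_prime_pages_to_sg(bo->ttm.base.dev, tt->pages, num_pages);
}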
Francois Dugast [Wed, 10 Dec 2025 16:50:00 +0000 (17:50 +0100)]
drm/xe/hw_engine_group: Add stats for mode switching
The GT stats interface is extended to include counters of how many
queues are either interrupted or waited on in the hardware engine
groups. This can help application debugging.
v2: Rename to queue as those operations are queue-based (Matthew Brost)
Junxiao Chang [Fri, 7 Nov 2025 03:31:52 +0000 (11:31 +0800)]
drm/xe/gsc: mei interrupt top half should be in irq disabled context
The MEI GSC interrupt comes from the i915 or xe driver. It has a top
half and a bottom half. The top half is called from the i915/xe
interrupt handler and should run in IRQ-disabled context.
With an RT kernel (PREEMPT_RT enabled), IRQ handlers run as threaded
IRQs by default, so the MEI GSC top half might be called in threaded IRQ
context.
The generic_handle_irq_safe() API can be called from either IRQ or
process context; it disables local IRQs and then calls the MEI GSC
interrupt top half.
This change fixes a B580 GPU boot issue with RT enabled.
Fixes: e02cea83d32d ("drm/xe/gsc: add Battlemage support") Tested-by: Baoli Zhang <baoli.zhang@intel.com> Signed-off-by: Junxiao Chang <junxiao.chang@intel.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20251107033152.834960-1-junxiao.chang@intel.com Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
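A sketch of the pattern, assuming the driver forwards the MEI GSC interrupt
by IRQ number as the heci/gsc glue does:

#include <linux/irqdesc.h>

static void forward_gsc_irq(int gsc_irq)
{
	if (gsc_irq <= 0)
		return;

	/*
	 * generic_handle_irq() must run with hard IRQs disabled, which is not
	 * guaranteed under PREEMPT_RT where the caller may itself be a
	 * threaded IRQ. generic_handle_irq_safe() disables local IRQs around
	 * the handler and may be called from IRQ or process context.
	 */
	generic_handle_irq_safe(gsc_irq);
}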
Riana Tauro [Mon, 8 Dec 2025 08:45:42 +0000 (14:15 +0530)]
drm/xe/xe_survivability: Add support for survivability mode v2
Survivability breadcrumbs v2 introduce a new mode called SPI Flash
Descriptor Override mode (FDO). This mode is enabled by PCODE when MEI
itself fails and the firmware cannot be updated via MEI using igsc, and
it provides the ability to update the firmware directly via the SPI
driver.
The Xe KMD initializes the nvm aux driver if FDO mode is enabled.
Userspace should check the FDO mode entry in the survivability info
sysfs before using the SPI driver to update the firmware.
Also add a survivability_info directory to expose the boot breadcrumbs.
Entries in the survivability mode sysfs are only visible when the boot
breadcrumb registers are populated.
One entry provides data about the boot status and has bits that indicate
support for the other breadcrumbs.
Postcode Trace / Postcode Trace Overflow: each postcode is an 8-bit value
representing a boot failure event. When Pcode logs a new failure event,
the existing postcodes are shifted left, so these entries provide a
history of the last 8 postcodes.
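A worked example of reading the history back, assuming the breadcrumb packs
eight 8-bit postcodes into a 64-bit value with the most recent one in the
low byte after the shift-left (the exact packing is an assumption here):

#include <stdint.h>
#include <stdio.h>

static void dump_postcode_trace(uint64_t trace)
{
	for (int i = 0; i < 8; i++) {
		uint8_t postcode = (trace >> (8 * i)) & 0xff;

		printf("postcode[-%d] = 0x%02x\n", i, postcode);
	}
}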
Tomasz Lis [Thu, 4 Dec 2025 20:08:20 +0000 (21:08 +0100)]
drm/xe/vf: Stop waiting for ring space on VF post migration recovery
If wait for ring space started just before migration, it can delay
the recovery process, by waiting without bailout path for up to 2
seconds.
Two second wait for recovery is not acceptable, and if the ring was
completely filled even without the migration temporarily stopping
execution, then such a wait will result in up to a thousand new jobs
(assuming constant flow) being added while the wait is happening.
While this will not cause data corruption, it will lead to warning
messages getting logged due to reset being scheduled on a GT under
recovery. Also several seconds of unresponsiveness, as the backlog
of jobs gets progressively executed.
Add a bailout condition, to make sure the recovery starts without
much delay. The recovery is expected to finish in about 100 ms when
under moderate stress, so the condition verification period needs to be
below that - settling at 64 ms.
The theoretical max time which the recovery can take depends on how
many requests can be emitted to engine rings and be pending execution.
While stress testing, it was possible to reach 10k pending requests
on rings when a platform with two GTs was used. This resulted in max
recovery time of 5 seconds. But in real-life situations it is very
unlikely that the number of pending requests will ever exceed 100, and
in that case the recovery time will be around 50 ms - well within our
claimed limit of 100 ms.
Fixes: a4dae94aad6a ("drm/xe/vf: Wakeup in GuC backend on VF post migration recovery") Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251204200820.2206168-1-tomasz.lis@intel.com
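The shape of the bailed-out wait, not the upstream implementation
(example_ring_space() and example_recovery_pending() are placeholders for
whatever helpers the driver actually uses): poll for ring space in 64 ms
slices and give up early if post-migration recovery has been signalled.

#include <linux/delay.h>
#include <linux/errno.h>

#define RING_SPACE_POLL_MS	64

static int ring_space_wait_with_bailout(void *ring, u32 need,
					unsigned long timeout_ms)
{
	unsigned long waited = 0;

	while (example_ring_space(ring) < need) {
		if (example_recovery_pending(ring))
			return -EAGAIN;	/* bail out; let the recovery run first */
		if (waited >= timeout_ms)
			return -ETIMEDOUT;
		msleep(RING_SPACE_POLL_MS);
		waited += RING_SPACE_POLL_MS;
	}

	return 0;
}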
Raag Jadav [Wed, 3 Dec 2025 12:33:55 +0000 (18:03 +0530)]
drm/xe/throttle: Skip reason prefix while emitting array
The newly introduced "reasons" attribute already signifies possible
reasons for throttling and makes the prefix in individual attribute
names redundant while emitting them as an array. Skip the prefix.
Xin Wang [Fri, 5 Dec 2025 07:06:33 +0000 (07:06 +0000)]
drm/xe: expose PAT software config to debugfs
The existing "pat" debugfs node dumps the live PAT registers. Under
SR-IOV the VF cannot touch those registers, so the file vanishes and
users lose all PAT visibility. Add a VF-safe "pat_sw_config" entry to
the VF-safe debugfs list. It prints the cached PAT table the driver
programmed, rather than poking HW, so PF and VF instances present the
same view.
This lets IGT and other tools query the PAT configuration without
carrying platform-specific tables or mirroring kernel logic.
v2: (Jonathan)
- Only append "(* = reserved entry)" to the PAT table header on Xe2+
platforms where it actually applies.
- Deduplicate the PTA/ATS mode printing by introducing the small
drm_printf_pat_mode() helper macro.
v3: (Matt)
- Print IDX[XE_CACHE_NONE_COMPRESSION] on every Xe2+ platform so the
dump always reflects the value the driver might use (even if it defaults
to 0) and future IP revisions don’t need extra condition tweaks.
v4:
- Drop the drm_printf_pat_mode macro and introduce a real helper
xe2_pat_entry_dump(). (Jani)
- Reuse the helper across all PTA/ATS/PAT dumps for xe2+ entries to keep
output format identical.
v5: (Matt)
- Split the original patch into two: one for refactoring helpers, one for
the new debugfs entry.
CC: Jani Nikula <jani.nikula@intel.com> Suggested-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20251205070633.28072-1-x.wang@intel.com
Xin Wang [Fri, 5 Dec 2025 07:02:19 +0000 (07:02 +0000)]
drm/xe: Refactor PAT dump to use shared helpers
Move the PAT entry formatting into shared helper functions to ensure
consistency and enable code reuse.
This preparation is necessary for a follow-up patch that introduces a
software-based PAT dump, which is required for debugging on VFs where
hardware access is limited.
V2: (Matt)
- Xe3p XPC doesn’t define COMP_EN; omit it to match bspec and avoid
confusion.
Suggested-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patch.msgid.link/20251205070220.27859-1-x.wang@intel.com
Ashutosh Dixit [Tue, 2 Dec 2025 02:51:13 +0000 (18:51 -0800)]
drm/xe/oa: Allow exec_queue's to be specified only for OAG OA unit
Exec_queues are only used for OAR/OAC functionality on the OAG OA unit.
Make this requirement explicit, which avoids complications in the code
for other (non-OAG) OA units.
Ashutosh Dixit [Tue, 2 Dec 2025 02:51:12 +0000 (18:51 -0800)]
drm/xe/oa/uapi: Add gt_id to struct drm_xe_oa_unit
gt_id was previously omitted from 'struct drm_xe_oa_unit' because it could
be determined from the hwe's attached to the OA unit. However, we now have OA
units which don't have any hwe's attached to them. Hence add gt_id to
'struct drm_xe_oa_unit' in order to provide this needed information to
userspace.
Arnd Bergmann [Thu, 4 Dec 2025 09:46:58 +0000 (10:46 +0100)]
drm/xe: fix drm_gpusvm_init() arguments
The Xe driver fails to build when CONFIG_DRM_XE_GPUSVM is disabled
but CONFIG_DRM_GPUSVM is turned on, due to the clash of two commits:
In file included from drivers/gpu/drm/xe/xe_vm_madvise.c:8:
drivers/gpu/drm/xe/xe_svm.h: In function 'xe_svm_init':
include/linux/stddef.h:8:14: error: passing argument 5 of 'drm_gpusvm_init' makes integer from pointer without a cast [-Wint-conversion]
drivers/gpu/drm/xe/xe_svm.h:217:38: note: in expansion of macro 'NULL'
217 | NULL, NULL, 0, 0, 0, NULL, NULL, 0);
| ^~~~
In file included from drivers/gpu/drm/xe/xe_bo_types.h:11,
from drivers/gpu/drm/xe/xe_bo.h:11,
from drivers/gpu/drm/xe/xe_vm_madvise.c:11:
include/drm/drm_gpusvm.h:254:35: note: expected 'long unsigned int' but argument is of type 'void *'
254 | unsigned long mm_start, unsigned long mm_range,
| ~~~~~~~~~~~~~~^~~~~~~~
In file included from drivers/gpu/drm/xe/xe_vm_madvise.c:14:
drivers/gpu/drm/xe/xe_svm.h:216:16: error: too many arguments to function 'drm_gpusvm_init'; expected 10, have 11
216 | return drm_gpusvm_init(&vm->svm.gpusvm, "Xe SVM (simple)", &vm->xe->drm,
| ^~~~~~~~~~~~~~~
217 | NULL, NULL, 0, 0, 0, NULL, NULL, 0);
| ~
include/drm/drm_gpusvm.h:251:5: note: declared here
Adapt the caller to the new argument list by removing the extraneous
NULL argument.
Fixes: 9e9787414882 ("drm/xe/userptr: replace xe_hmm with gpusvm") Fixes: 10aa5c806030 ("drm/gpusvm, drm/xe: Fix userptr to not allow device private pages") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20251204094704.1030933-1-arnd@kernel.org
Arnd Bergmann [Thu, 4 Dec 2025 09:41:36 +0000 (10:41 +0100)]
drm/xe/pf: fix VFIO link error
The Makefile logic for building xe_sriov_vfio.o was added incorrectly,
as setting CONFIG_XE_VFIO_PCI=m means it doesn't get included into a
built-in xe driver:
Sanjay Yadav [Thu, 4 Dec 2025 04:04:03 +0000 (09:34 +0530)]
drm/xe/uapi: Add NO_COMPRESSION BO flag and query capability
Introduce DRM_XE_GEM_CREATE_FLAG_NO_COMPRESSION to let userspace
opt out of CCS compression on a per-BO basis. When set, the driver
maps this to XE_BO_FLAG_NO_COMPRESSION, skips CCS metadata
allocation/clearing, and rejects compressed PAT indices at vm_bind.
This avoids extra memory ops and manual CCS state handling for buffers.
To allow userspace to detect at runtime whether the kernel supports this
feature, add DRM_XE_QUERY_CONFIG_FLAG_HAS_NO_COMPRESSION_HINT and expose
it via query_config() on Xe2+ platforms.
v2
- Changed error code from -EINVAL to -EOPNOTSUPP for unsupported flag
usage on pre-Xe2 platforms
- Fixed checkpatch warning in xe_vm.c
- Fixed kernel-doc formatting in xe_drm.h
v3
- Rebase
- Updated commit title and description
- Added UAPI for DRM_XE_QUERY_CONFIG_FLAG_HAS_NO_COMPRESSION_HINT and
exposed it via query_config()
v4
- Rebase
v5
- Included Mesa PR and IGT PR in the commit description
- Used xe_pat_index_get_comp_en() to extract the compression
v6
- Added XE_IOCTL_DBG() checks for argument validation
Suggested-by: Matthew Auld <matthew.auld@intel.com> Suggested-by: José Roberto de Souza <jose.souza@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Link: https://patch.msgid.link/20251204040402.2692921-2-sanjay.kumar.yadav@intel.com
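A userspace sketch of probing and using the new bits; the two flag names
come from this series, everything else is existing xe uAPI, and error
handling is trimmed:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

static bool has_no_compression_hint(int fd)
{
	struct drm_xe_device_query query = { .query = DRM_XE_DEVICE_QUERY_CONFIG };
	struct drm_xe_query_config *config;
	bool supported;

	ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &query);	/* first call: size */
	config = calloc(1, query.size);
	query.data = (uintptr_t)config;
	ioctl(fd, DRM_IOCTL_XE_DEVICE_QUERY, &query);	/* second call: data */

	supported = config->info[DRM_XE_QUERY_CONFIG_FLAGS] &
		    DRM_XE_QUERY_CONFIG_FLAG_HAS_NO_COMPRESSION_HINT;
	free(config);
	return supported;
}

static uint32_t create_uncompressed_bo(int fd, uint32_t vm_id, uint64_t size)
{
	struct drm_xe_gem_create create = {
		.size = size,
		.vm_id = vm_id,
		.placement = 1 << 0,	/* e.g. the system memory region */
		.cpu_caching = DRM_XE_GEM_CPU_CACHING_WB,
		.flags = DRM_XE_GEM_CREATE_FLAG_NO_COMPRESSION,
	};

	ioctl(fd, DRM_IOCTL_XE_GEM_CREATE, &create);
	return create.handle;
}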
Matt Roper [Tue, 2 Dec 2025 22:25:52 +0000 (14:25 -0800)]
drm/xe/sync: Use for_each_tlb_inval() to calculate invalidation fences
xe_sync_in_fence_get() uses the same kind of mismatched fence array
allocation vs looping logic that was previously noted and changed by
commit 0a4c2ddc711a ("drm/xe/vm: Use for_each_tlb_inval() to calculate
invalidation fences"). As with that commit, the mismatch doesn't cause
any problem at the moment since for_each_tlb_inval() loops the same
number of times as XE_MAX_GT_PER_TILE (2). However we don't want to
assume that these will always be the same in the future, so switch to
using for_each_tlb_inval() in both places to future-proof the code.
Backmerging to bring in a needed dependency for the Xe VFIO
driver variant. This should ideally have been done before we
committed that, so we now have a small window in drm-xe-next
where that driver doesn't compile.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512030331.I8CveRre-lkp@intel.com/ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Fixes: 2a6c826cfeed ("drm/amd: Skip power ungate during suspend for VPE") Cc: stable@vger.kernel.org Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reported-by: Konstantin <answer2019@yandex.ru> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220812 Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We need to call amdgpu_vm_handle_fault() on page fault
on all gfx9 and newer parts to properly update the
page tables, not just for recoverable page faults.
Cc: stable@vger.kernel.org Reviewed-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Brady Norander [Tue, 25 Mar 2025 21:05:17 +0000 (17:05 -0400)]
drm/amdgpu: use static ids for ACP platform devs
mfd_add_hotplug_devices() assigns child platform devices with
PLATFORM_DEVID_AUTO, but the ACP machine drivers expect the platform
device names to never change. Use mfd_add_devices() instead and give
each cell a unique id.
Signed-off-by: Brady Norander <bradynorander@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
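A sketch of the before/after with placeholder cell ids (the cell names below
are illustrative of the ACP child devices): registering with
mfd_add_devices() and explicit per-cell ids keeps the platform device names
stable, whereas mfd_add_hotplug_devices() uses PLATFORM_DEVID_AUTO and lets
them change.

#include <linux/mfd/core.h>

static struct mfd_cell acp_cells[] = {
	{ .name = "acp_audio_dma",  .id = 0 },
	{ .name = "designware-i2s", .id = 1 },
	{ .name = "designware-i2s", .id = 2 },
};

static int acp_register_cells(struct device *parent)
{
	/* Was: mfd_add_hotplug_devices(parent, acp_cells, ARRAY_SIZE(acp_cells)); */
	return mfd_add_devices(parent, 0, acp_cells, ARRAY_SIZE(acp_cells),
			       NULL, 0, NULL);
}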
drm/amdgpu/sdma6: Update SDMA 6.0.3 FW version to include UMQ protected-fence fix
On GFX11.0.3, earlier SDMA firmware versions issue the
PROTECTED_FENCE write from the user VMID (e.g. VMID 8) instead of
VMID 0. This causes a GPU VM protection fault when SDMA tries to
write the secure fence location, as seen in the UMQ SDMA test
(cs-sdma-with-IP-DMA-UMQ).
v2: Updated commit message
v3: s/gfx11.0.3/sdma 6.0.3/ in patch title (Alex)
Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Natalie Vock [Mon, 1 Dec 2025 17:52:38 +0000 (12:52 -0500)]
drm/amdgpu: Forward VMID reservation errors
Otherwise userspace may be fooled into believing it has a reserved VMID
when in reality it doesn't, ultimately leading to GPU hangs when SPM is
used.
Fixes: 80e709ee6ecc ("drm/amdgpu: add option params to enforce process isolation between graphics and compute") Cc: stable@vger.kernel.org Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Natalie Vock <natalie.vock@gmx.de> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:52 +0000 (14:29 +0100)]
drm/amdgpu/gmc8: Delegate VM faults to soft IRQ handler ring
On old GPUs, it may be an issue that handling the interrupts from
VM faults is too slow and the interrupt handler (IH) ring may
overflow, which can cause an eventual hang.
Delegate the processing of all VM faults to the soft
IRQ handler ring.
As a result, we spend much less time in the IRQ handler that
interacts with the HW IH ring, which significantly reduces the
chance of hangs/reboots.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
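A sketch of the delegation pattern, modeled on the gfx9+ GMC fault handler
(gmc_vm_fault_handle() is a placeholder, and the dword count passed to
amdgpu_irq_delegate() depends on the ASIC's IH entry size):

static int gmc_vm_fault_process(struct amdgpu_device *adev,
				struct amdgpu_irq_src *source,
				struct amdgpu_iv_entry *entry)
{
	if (entry->ih == &adev->irq.ih) {
		/* Came in on the HW IH ring: requeue onto the soft ring and bail. */
		amdgpu_irq_delegate(adev, entry, 8);
		return 1;
	}

	/* Running from the soft IH ring worker: do the real fault handling. */
	return gmc_vm_fault_handle(adev, entry);
}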
Timur Kristóf [Wed, 26 Nov 2025 13:29:51 +0000 (14:29 +0100)]
drm/amdgpu/gmc7: Delegate VM faults to soft IRQ handler ring
On old GPUs, it may be an issue that handling the interrupts from
VM faults is too slow and the interrupt handler (IH) ring may
overflow, which can cause an eventual hang.
Delegate the processing of all VM faults to the soft
IRQ handler ring.
As a result, we spend much less time in the IRQ handler that
interacts with the HW IH ring, which significantly reduces the
chance of hangs/reboots.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:50 +0000 (14:29 +0100)]
drm/amdgpu/gmc6: Delegate VM faults to soft IRQ handler ring
On old GPUs, it may be an issue that handling the interrupts from
VM faults is too slow and the interrupt handler (IH) ring may
overflow, which can cause an eventual hang.
Delegate the processing of all VM faults to the soft
IRQ handler ring.
As a result, we spend much less time in the IRQ handler that
interacts with the HW IH ring, which significantly reduces the
chance of hangs/reboots.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:49 +0000 (14:29 +0100)]
drm/amdgpu/gmc6: Cache VM fault info
Call amdgpu_vm_update_fault_cache on GMC v6 similarly to how we
do in GMC v7-v8 so that VM fault info can be used later by
userspace for debugging.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:48 +0000 (14:29 +0100)]
drm/amdgpu/gmc6: Don't print MC client as it's unknown
The VM_CONTEXT1_PROTECTION_FAULT_MCCLIENT register
doesn't exist on GMC v6 so we can't print the MC client as a
string like we do on GMC v7-v8. However, we still print the
mc_id from VM_CONTEXT1_PROTECTION_FAULT_STATUS.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:47 +0000 (14:29 +0100)]
drm/amdgpu/cz_ih: Enable soft IRQ handler ring
We are going to use the soft IRQ handler ring on GMC v8
to process interrupts from VM faults.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:46 +0000 (14:29 +0100)]
drm/amdgpu/tonga_ih: Enable soft IRQ handler ring
We are going to use the soft IRQ handler ring on GMC v8
to process interrupts from VM faults.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:45 +0000 (14:29 +0100)]
drm/amdgpu/iceland_ih: Enable soft IRQ handler ring
We are going to use the soft IRQ handler ring on GMC v8
to process interrupts from VM faults.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:44 +0000 (14:29 +0100)]
drm/amdgpu/cik_ih: Enable soft IRQ handler ring
We are going to use the soft IRQ handler ring on GMC v7 (CIK)
to process interrupts from VM faults.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Timur Kristóf [Wed, 26 Nov 2025 13:29:43 +0000 (14:29 +0100)]
drm/amdgpu/si_ih: Enable soft IRQ handler ring
We are going to use the soft IRQ handler ring on GMC v6 (SI)
to process interrupts from VM faults.
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ian Chen [Thu, 13 Nov 2025 05:07:58 +0000 (13:07 +0800)]
drm/amd/display: fix Smart Power OLED not working after S4
[HOW]
Before enabling Smart Power OLED, we need to call set pipe to let DMUB
get the correct ABM config.
Reviewed-by: Robin Chen <robin.chen@amd.com> Signed-off-by: Ian Chen <ian.chen@amd.com> Signed-off-by: Roman Li <roman.li@amd.com> Tested-by: Dan Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ivan Lipski [Fri, 21 Nov 2025 20:03:57 +0000 (15:03 -0500)]
drm/amd/display: Move RGB-type check for audio sync to DCE HW sequence
[Why&How]
DVI-A & VGA connectors are applicable to DCE ASICs, so move them to
dce110_hwseq.c to block audio sync on SIGNAL_TYPE_RGB for DCE ASICs.
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Tested-by: Dan Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/xe/vf: Add debugfs entries to test VF double migration
VF migration sends a marker to the GUC before resource fixups begin,
and repeats the marker with the RESFIX_DONE notification. This prevents
the GUC from submitting jobs during double migration events.
To reliably test double migration, a second migration must be triggered
while fixups from the first migration are still in progress. Since fixups
complete quickly, reproducing this scenario is difficult. Introduce
debugfs controls to add delays in the post-fixup phase, creating a
deterministic window for subsequent migrations.
Each state will pause with a 1-second delay per iteration, continuing until
its corresponding bit is cleared.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Acked-by: Adam Miszczak <adam.miszczak@linux.intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251201095011.21453-10-satyanarayana.k.v.p@intel.com
drm/xe/vf: Requeue recovery on GuC MIGRATION error during VF post-migration
Handle GuC response `XE_GUC_RESPONSE_VF_MIGRATED` as a special case in the
VF post-migration recovery flow. When this error occurs, it indicates that
a new migration was detected while the resource fixup process was still in
progress. Instead of failing immediately, requeue the VF into the recovery
path to allow proper handling of the new migration event.
This improves robustness of VF recovery in SR-IOV environments where
migrations can overlap with resource fixup steps.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251201095011.21453-9-satyanarayana.k.v.p@intel.com
In scenarios involving double migration, the VF KMD may encounter
situations where it is instructed to re-migrate before having the
opportunity to send RESFIX_DONE for the initial migration. This can occur
when the fix-up for the prior migration is still underway, but the VF KMD
is migrated again.
Consequently, this may lead to the possibility of sending two migration
notifications (i.e., pending fix-up for the first migration and a second
notification for the new migration). Upon receiving the first
RESFIX_DONE notification, the GuC will resume VF submission on the GPU,
potentially resulting in undefined behavior such as system hangs or
crashes.
To avoid this, after migration a marker is sent to the GuC before the
start of resource fixups, indicating that fixups have begun. The same
marker is sent along with the RESFIX_DONE notification so that the GuC
can avoid submitting jobs to HW in case of double migration.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251201095011.21453-8-satyanarayana.k.v.p@intel.com
drm/xe/vf: Enable VF migration only on supported GuC versions
Enable VF migration starting with GuC 70.54.0 (compatibility version
1.27.0) which supports the additional VF2GUC_RESFIX_START message required
to handle migration recovery in a more robust way.
Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Tomasz Lis <tomasz.lis@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patch.msgid.link/20251201095011.21453-7-satyanarayana.k.v.p@intel.com
Dave Airlie [Tue, 2 Dec 2025 08:09:01 +0000 (18:09 +1000)]
Merge tag 'drm-misc-next-2025-12-01-1' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
Extra drm-misc-next for v6.19-rc1:
UAPI Changes:
- Add support for drm colorop pipeline.
- Add COLOR PIPELINE plane property.
- Add DRM_CLIENT_CAP_PLANE_COLOR_PIPELINE.
Cross-subsystem Changes:
- Attempt to use higher order mappings in system heap allocator.
- Always taint kernel with sw-sync.
Core Changes:
- Small fixes to drm/gem.
- Support emergency restore to drm-client.
- Allocate and release fb_info in single place.
- Rework ttm pipelined eviction fence handling.
Driver Changes:
- Support the drm color pipeline in vkms, amdgpu.
- Add NVJPG driver for tegra.
- Assorted small fixes and updates to rockchip, bridge/dw-hdmi-qp,
panthor.
- Add ASL CS5263 DP-to-HDMI simple bridge.
- Add and improve support for LG LD070WX3-SL01 MIPI DSI, Samsung LTL106AL0,
Samsung LTL106AL01, Raystar RFF500F-AWH-DNN, Winstar WF70A8SYJHLNGA,
Wanchanglong w552946aaa, Samsung SOFEF00, Lenovo X13s panel.
- Add support for it66122 to it66121.
- Support mali-G1 gpu in panthor.
Implement DRM_XE_EXEC_QUEUE_SET_HANG_REPLAY_STATE which sets the exec
queue default state to user data passed in. The intent is for a Mesa
tool to use this to replay GPU hangs.
v2:
- Enable the flag DRM_XE_EXEC_QUEUE_SET_HANG_REPLAY_STATE
- Fix the page size math calculation to avoid a crash
v4:
- Use vmemdup_user (Maarten)
- Copy default state first into LRC, then replay state (Testing, Carlos)
Cc: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patch.msgid.link/20251126185952.546277-10-matthew.brost@intel.com
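A hypothetical userspace sketch of feeding the replay state back in at queue
creation; it assumes the property value carries a user pointer to the blob
cut out of a devcoredump via the replay_offset/replay_length HWCTX lines,
and how the blob length is conveyed is not visible in this log, so treat it
purely as the shape of the call:

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

static uint32_t create_replay_queue(int fd, uint32_t vm_id,
				    struct drm_xe_engine_class_instance *eci,
				    void *replay_state)
{
	struct drm_xe_ext_set_property ext = {
		.base.name = DRM_XE_EXEC_QUEUE_EXTENSION_SET_PROPERTY,
		.property = DRM_XE_EXEC_QUEUE_SET_HANG_REPLAY_STATE,
		.value = (uintptr_t)replay_state,
	};
	struct drm_xe_exec_queue_create create = {
		.extensions = (uintptr_t)&ext,
		.width = 1,
		.num_placements = 1,
		.vm_id = vm_id,
		.instances = (uintptr_t)eci,
	};

	ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create);
	return create.exec_queue_id;
}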
Matthew Brost [Wed, 26 Nov 2025 18:59:51 +0000 (10:59 -0800)]
drm/xe: Add replay_offset and replay_length lines to LRC HWCTX snapshot
Add replay_offset and replay_length lines to the LRC HWCTX snapshot,
the idea being that this information can be used to extract the data
which needs to be passed to the exec queue extension
DRM_XE_EXEC_QUEUE_SET_HANG_REPLAY_STATE so a GPU hang can be replayed
via a Mesa tool.
Add DRM_XE_EXEC_QUEUE_SET_HANG_REPLAY_STATE which accepts a user pointer
to populate the exec queue state so that a GPU hang can be replayed via
a Mesa tool.
v2: Update the value for HANG_REPLAY_STATE flag
Cc: José Roberto de Souza <jose.souza@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Carlos Santa <carlos.santa@intel-corp-partner.google.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Acked-by: José Roberto de Souza <jose.souza@intel.com> Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patch.msgid.link/20251126185952.546277-8-matthew.brost@intel.com
Matthew Brost [Wed, 26 Nov 2025 18:59:48 +0000 (10:59 -0800)]
drm/xe: Add cpu_caching to properties line in VM snapshot capture
Add CPU caching to properties line in VM snapshot capture indicating
the BO caching properties. This is useful information for debug and
will help build a robust GPU hang replay tool.
Matthew Brost [Wed, 26 Nov 2025 18:59:47 +0000 (10:59 -0800)]
drm/xe: Add pat_index to properties line in VM snapshot capture
Add pat index to properties line in VM snapshot capture indicating
the VMA caching properties. This is useful information for debug and
will help build a robust GPU hang replay tool.
Matthew Brost [Wed, 26 Nov 2025 18:59:46 +0000 (10:59 -0800)]
drm/xe: Add mem_region to properties line in VM snapshot capture
Add memory region to properties line in VM snapshot capture indicating
where the memory is located. The memory region corresponds to regions in
the uAPI. This is useful information for debug and will help build a
robust GPU hang replay tool.
Matthew Brost [Wed, 26 Nov 2025 18:59:45 +0000 (10:59 -0800)]
drm/xe: Add "null_sparse" type to VM snap properties
Add "null_sparse" type to VM snap properties indicating the VMA reads
zero and writes are dropped. This is useful information for debug and
will help build a robust GPU hang replay tool.
The current format is:
[<vma address>]: <permissions>|<type>
Permissions has two options, either "read_only" or "read_write".
Type has three options, either "userptr", "null_sparse", or "bo".
Matthew Brost [Wed, 26 Nov 2025 18:59:44 +0000 (10:59 -0800)]
drm/xe: Add properties line to VM snapshot capture
Add properties line to VM snapshot capture which includes additional
information about VMA being dumped. This is helpful for debug purposes
but also to build a robust GPU hang replay tool.
The current format is:
[<vma address>]: <permissions>|<type>
Permissions has two options, either "read_only" or "read_write".
Shuicheng Lin [Fri, 14 Nov 2025 20:56:39 +0000 (20:56 +0000)]
drm/xe: Fix freq kobject leak on sysfs_create_files failure
Ensure gt->freq is released when sysfs_create_files() fails
in xe_gt_freq_init(). Without this, the kobject would leak.
Add kobject_put() before returning the error.
Fixes: fdc81c43f0c1 ("drm/xe: use devm_add_action_or_reset() helper") Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Alex Zuo <alex.zuo@intel.com> Reviewed-by: Xin Wang <x.wang@intel.com> Link: https://patch.msgid.link/20251114205638.2184529-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Michał Winiarski [Thu, 27 Nov 2025 09:39:34 +0000 (10:39 +0100)]
vfio/xe: Add device specific vfio_pci driver variant for Intel graphics
In addition to generic VFIO PCI functionality, the driver implements
VFIO migration uAPI, allowing userspace to enable migration for Intel
Graphics SR-IOV Virtual Functions.
The driver binds to VF device and uses API exposed by Xe driver to
transfer the VF migration data under the control of PF device.
Michał Winiarski [Thu, 27 Nov 2025 09:39:31 +0000 (10:39 +0100)]
drm/xe/pf: Enable SR-IOV VF migration
All of the necessary building blocks are now in place to support SR-IOV
VF migration.
Flip the enable/disable logic to match VF code and disable the feature
only for platforms that don't meet the necessary prerequisites.
To allow more testing and experiments, on DEBUG builds any missing
prerequisites will be ignored.
Raag Jadav [Thu, 30 Oct 2025 12:23:57 +0000 (17:53 +0530)]
drm/xe/gt: Introduce runtime suspend/resume
If the power state is retained across a suspend/resume cycle, we don't
need to perform a full GT re-initialization. Introduce runtime helpers
for the GT which greatly reduce the suspend/resume delay.
v2: Drop redundant xe_gt_sanitize() and xe_guc_ct_stop() (Daniele)
Use runtime naming for guc helpers (Daniele)
v3: Drop redundant logging, add kernel doc (Michal)
Use runtime naming for ct helpers (Michal)
v4: Fix tags (Rodrigo)
v5: Include host_l2_vram workaround (Daniele)
Reuse xe_guc_submit_enable/disable() helpers (Daniele)
Raag Jadav [Thu, 30 Oct 2025 12:23:56 +0000 (17:53 +0530)]
drm/xe/pm: Assert on runtime suspend if VFs are enabled
We hold an additional reference to the runtime PM to keep the PF in D0
during the VFs' lifetime, as our VFs do not implement the PM capability.
This means we should never be runtime suspending as long as VFs are
enabled.
v8: Add !IS_SRIOV_VF() assert (Matthew Brost)
Suggested-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20251030122357.128825-4-raag.jadav@intel.com
Raag Jadav [Thu, 30 Oct 2025 12:23:55 +0000 (17:53 +0530)]
drm/xe/guc_submit: Introduce pause/unpause() helpers for PF
Introduce pause/unpause() helpers which stop/start further runs of
submission tasks on given GuC and can be called from PF context. This
is in preparation of usecases where we simply need to stop/start the
scheduler without losing GuC state and don't require dealing with VF
migration.
Raag Jadav [Thu, 30 Oct 2025 12:23:54 +0000 (17:53 +0530)]
drm/xe/vf: Update pause/unpause() helpers with VF naming
Now that the pause/unpause() helpers have been updated for the VF
migration use case, update their naming to match the functionality and,
while at it,
add IS_SRIOV_VF() assert to make sure they are not abused.
v7: Add IS_SRIOV_VF() assert (Matthew Brost)
Use "vf" suffix (Michal)
Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patch.msgid.link/20251030122357.128825-2-raag.jadav@intel.com
Piotr Piórkowski [Thu, 27 Nov 2025 07:36:43 +0000 (08:36 +0100)]
drm/xe: Move VRAM MM debugfs creation to tile level
Previously, VRAM TTM resource manager debugfs entries (vram0_mm / vram1_mm)
were created globally in the XE debugfs root directory. But technically,
each tile has an associated VRAM TTM manager, which it can own.
Let's create VRAM memory manager debugfs entries directly under each tile's
debugfs directory for better alignment with the per-tile memory layout.
Jonathan Cavitt [Mon, 17 Nov 2025 19:01:15 +0000 (19:01 +0000)]
drm/xe/xe_sriov_packet: Return int from pf_descriptor_init
pf_descriptor_init currently returns a size_t, which is an unsigned
integer data type. This conflicts with it returning a negative errno
value on failure.
Make it return an int instead. This mirrors how pf_trailer_init is used
later.
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Alex Zuo <alex.zuo@intel.com> Link: https://patch.msgid.link/20251117190114.69953-2-jonathan.cavitt@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
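A minimal illustration of the bug class being fixed, not the xe code itself:
returning a negative errno through an unsigned size_t turns it into a huge
positive value, so an "if (ret < 0)" check at the caller never fires.

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

static size_t init_buggy(void)
{
	return -ENOMEM;		/* silently wraps to SIZE_MAX - 11 */
}

static int init_fixed(void)
{
	return -ENOMEM;		/* the caller's "if (ret < 0)" works as intended */
}

int main(void)
{
	size_t a = init_buggy();
	int b = init_fixed();

	printf("buggy: %zu, fixed: %d\n", a, b);	/* buggy prints a huge number */
	return 0;
}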