Thomas Hellström [Mon, 29 Sep 2025 11:26:49 +0000 (13:26 +0200)]
drm/xe/bo: Fix an idle assertion for local bos
Before calling ttm_bo_populate() in the CPU fault path of a bo,
we assert that the bo is not being migrated. However, for
local bos we share the reservation object with other local bos
that might be in the process of being migrated. Also some VM
operations may attach USAGE_KERNEL fences to the common
reservation object and trigger false positives from the assert.
So remove the assert and instead wait for bo idle. This may
unnecessarily wait for idle in some cases but since we're
doing this wait later in the fault path anyway we might as
well do it here as well.
This fixes warnings like:
Sep 25 14:56:23 desky kernel: ------------[ cut here ]------------
Sep 25 14:56:23 desky kernel: xe 0000:03:00.0: [drm] Assertion `dma_resv_test_signaled(tbo->base.resv, DMA_RESV_USAGE_KERNEL) || (tbo->ttm && ttm_tt_is_populated(tbo->ttm))` failed!
platform: BATTLEMAGE subplatform: 1
graphics: Xe2_HPG 20.01 step A0
media: Xe2_HPM 13.01 step A1
Sep 25 14:56:23 desky kernel: WARNING: CPU: 6 PID: 24767 at drivers/gpu/drm/xe/xe_bo.c:1748 xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Modules linked in: cpuid dm_crypt xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfr>
Sep 25 14:56:23 desky kernel: snd_soc_sdca snd_seq_midi prime_numbers coretemp snd_seq_midi_event drm_ttm_helper snd_hda_codec drm_buddy drm_exec snd_rawmidi snd_soc_core snd_hda_cor>
Sep 25 14:56:23 desky kernel: CPU: 6 UID: 1000 PID: 24767 Comm: steamwebhelper Tainted: G U W 6.17.0-rc7+ #32 PREEMPT(voluntary)
Sep 25 14:56:23 desky kernel: Tainted: [U]=USER, [W]=WARN
Sep 25 14:56:23 desky kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D36/PRO Z690-P DDR4 (MS-7D36), BIOS A.A1 10/18/2022
Sep 25 14:56:23 desky kernel: RIP: 0010:xe_bo_fault_migrate+0x1bb/0x300 [xe]
Sep 25 14:56:23 desky kernel: Code: fa 64 29 f9 48 c7 c7 40 e0 d3 c1 51 48 c7 c1 c0 e3 d3 c1 52 4c 8b 45 c0 41 50 44 8b 4d c8 4d 89 e0 48 8b 55 a8 e8 25 27 95 ef <0f> 0b 48 83 c4 40 4>
Sep 25 14:56:23 desky kernel: RSP: 0000:ffffae1ca88c7b10 EFLAGS: 00010286
Sep 25 14:56:23 desky kernel: RAX: 0000000000000000 RBX: ffff8d7cfd7e6800 RCX: 0000000000000027
Sep 25 14:56:23 desky kernel: RDX: ffff8d845019cec8 RSI: 0000000000000001 RDI: ffff8d845019cec0
Sep 25 14:56:23 desky kernel: RBP: ffffae1ca88c7bc8 R08: 0000000000000000 R09: 0000000000000000
Sep 25 14:56:23 desky kernel: R10: 0000000000000000 R11: 0000000000000004 R12: ffffffffc1db1faa
Sep 25 14:56:23 desky kernel: R13: ffffffffc1db2ab4 R14: 0000000000000001 R15: ffffae1ca88c7bd8
Sep 25 14:56:23 desky kernel: FS: 00007fb1baf31940(0000) GS:ffff8d849c870000(0000) knlGS:0000000000000000
Sep 25 14:56:23 desky kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 25 14:56:23 desky kernel: CR2: 00007fb1b2860020 CR3: 00000001705a9004 CR4: 0000000000772ef0
Sep 25 14:56:23 desky kernel: PKRU: 55555558
Sep 25 14:56:23 desky kernel: Call Trace:
Sep 25 14:56:23 desky kernel: <TASK>
Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault_fastpath+0x11e/0x220 [xe]
Sep 25 14:56:23 desky kernel: xe_bo_cpu_fault+0x84/0x410 [xe]
Sep 25 14:56:23 desky kernel: ? __x64_sys_mmap+0x33/0x50
Sep 25 14:56:23 desky kernel: ? x64_sys_call+0x1b2e/0x20d0
Sep 25 14:56:23 desky kernel: ? do_syscall_64+0x9d/0x1f0
Sep 25 14:56:23 desky kernel: ? __check_object_size+0x4a/0x2e0
Sep 25 14:56:23 desky kernel: __do_fault+0x36/0x190
Sep 25 14:56:23 desky kernel: do_fault+0xcf/0x570
Sep 25 14:56:23 desky kernel: __handle_mm_fault+0x92b/0xfe0
Sep 25 14:56:23 desky kernel: ? ktime_get_mono_fast_ns+0x39/0xd0
Sep 25 14:56:23 desky kernel: handle_mm_fault+0x164/0x2c0
Sep 25 14:56:23 desky kernel: do_user_addr_fault+0x2cb/0x840
Sep 25 14:56:23 desky kernel: exc_page_fault+0x75/0x180
Sep 25 14:56:23 desky kernel: asm_exc_page_fault+0x27/0x30
Sep 25 14:56:23 desky kernel: RIP: 0033:0x7fb1bc388bb7
Sep 25 14:56:23 desky kernel: Code: 48 ff c7 48 01 fe 48 8d 54 11 80 0f 1f 84 00 00 00 00 00 c5 fe 6f 0e c5 fe 6f 56 20 c5 fe 6f 5e 40 c5 fe 6f 66 60 48 83 ee 80 <c5> fd 7f 0f c5 fd 7>
Sep 25 14:56:23 desky kernel: RSP: 002b:00007ffd7814fad8 EFLAGS: 00010207
Sep 25 14:56:23 desky kernel: RAX: 00007fb1b2860000 RBX: 0000000000000690 RCX: 00007fb1b2860000
Sep 25 14:56:23 desky kernel: RDX: 00007fb1b2860610 RSI: 0000556eda79f4c0 RDI: 00007fb1b2860020
Sep 25 14:56:23 desky kernel: RBP: 00007ffd7814fb60 R08: 0000000000000000 R09: 000000012be0e000
Sep 25 14:56:23 desky kernel: R10: 00007fb1b2860000 R11: 0000000000000246 R12: 0000556edd39a240
Sep 25 14:56:23 desky kernel: R13: 00007fb1b2dcb010 R14: 0000556eda79f420 R15: 0000000000000000
Sep 25 14:56:23 desky kernel: </TASK>
Michal Wajdeczko [Sun, 28 Sep 2025 14:00:28 +0000 (16:00 +0200)]
drm/xe/pf: Make GGTT/LMEM debugfs files per-tile
Due to initial design of the Xe debugfs, the GGTT and LMEM files
were defined on the primary GT, instead of being per-tile.
The PF provisioning code still maintains GGTT and LMEM at the
primary-GT level. That will be refactored soon, but we can fix the
debugfs layout now, as part of the new SR-IOV tree.
For backward compatibility we will provide some symlinks that can
be removed once our tools are fully converted.
As we are making all those changes in the user facing interface,
take this as an opportunity to also start replacing the "LMEM" term,
used by the SR-IOV code, with the "VRAM" term, used by Xe driver.
Michal Wajdeczko [Sun, 28 Sep 2025 14:00:26 +0000 (16:00 +0200)]
drm/xe/pf: Move SR-IOV GT debugfs files to new tree
Instead of expanding GT debugfs directories with large number of
SR-IOV files, as those are replicated per each SR-IOV function,
move them to our new debugfs tree, organized by the function.
But to avoid breaking IGT tests that use current layout, provide
symlinks which could be removed once transition period is over,
or we can leave them for convenience.
Michal Wajdeczko [Sun, 28 Sep 2025 14:00:25 +0000 (16:00 +0200)]
drm/xe/pf: Populate SR-IOV debugfs tree with tiles
Populate new per SR-IOV function debugfs directories with next
level directories that represent tiles. There are no files yet,
but we will continue updating that tree in upcoming patches.
Michal Wajdeczko [Sun, 28 Sep 2025 14:00:24 +0000 (16:00 +0200)]
drm/xe/pf: Create separate debugfs tree for SR-IOV files
Currently we expose debugfs files related to SR-IOV functions
together with other native files, but that approach will not
scale well as we plan to add more attributes and also expose
some of them on the per-tile basis.
Start building separate tree for SR-IOV specific debugfs files
where we can replicate similar files per every SR-IOV function:
Michal Wajdeczko [Sun, 28 Sep 2025 14:00:23 +0000 (16:00 +0200)]
drm/xe/pf: Promote PF debugfs function to its own file
In upcoming patches, we will build a separate debugfs tree on the PF
for all SR-IOV related files, and this new code will need a
dedicated file. To minimize large diffs later, move existing
function now as-is, so any future modifications will be done
directly in target file.
Shuicheng Lin [Thu, 25 Sep 2025 02:31:46 +0000 (02:31 +0000)]
drm/xe/hw_engine_group: Fix double write lock release in error path
In xe_hw_engine_group_get_mode(), a write lock is acquired before
calling switch_mode(), which in turn invokes
xe_hw_engine_group_suspend_faulting_lr_jobs().
On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(),
the write lock is released there, and then again in
xe_hw_engine_group_get_mode(), leading to a double release.
Fix this by keeping both acquire and release operation in
xe_hw_engine_group_get_mode().
Fixes: 770bd1d34113 ("drm/xe/hw_engine_group: Ensure safe transition between execution modes") Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
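The ownership rule in the fix above can be sketched in userspace C. This is only an illustration under stated assumptions: toy_lock, suspend_jobs() and get_mode() are hypothetical names, the lock is a plain counter rather than a kernel rw_semaphore, and the sketch releases the lock on both paths for simplicity, which may differ from what the driver does on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a write lock: tracks the holder so that an
 * unbalanced release is detectable. Not the kernel rw_semaphore. */
struct toy_lock {
	int held;
	int bad_release;	/* releases seen without a matching acquire */
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(l->held == 0);
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	if (l->held == 0)
		l->bad_release++;	/* this would be the double release */
	else
		l->held = 0;
}

/* Hypothetical callee: reports failure but never touches the lock,
 * because it did not take it. */
static int suspend_jobs(bool fail)
{
	return fail ? -EIO : 0;
}

/* Hypothetical caller: the only place that acquires and releases. */
int get_mode(struct toy_lock *l, bool fail)
{
	int err;

	toy_lock_acquire(l);
	err = suspend_jobs(fail);
	toy_lock_release(l);	/* exactly one release on either path */
	return err;
}
```

Keeping acquire and release in the same function means the callee's error path cannot unbalance the lock, whichever branch it takes.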
Lucas De Marchi [Mon, 22 Sep 2025 19:58:33 +0000 (12:58 -0700)]
drm/xe/guc: Refactor GuC load to use poll_timeout_us()
Currently there are 2 wait loops for loading GuC: one in
xe_mmio_wait32_not() and one in guc_wait_ucode(). Now that there's a
generic poll_timeout_us(), refactor the code to use that to be more
readable.
Main change in behavior is that there's no exponential wait anymore:
that is now replaced by a 10msec retry.
Lucas De Marchi [Mon, 22 Sep 2025 19:58:32 +0000 (12:58 -0700)]
drm/xe/guc: Extract function to print load error
Move the error parsing and print out of guc_wait_ucode() into a helper
to clean up the wait function. Since now the `load_done != 1` condition
has a return statement, also simplify the if/else chain.
Lucas De Marchi [Mon, 22 Sep 2025 19:58:31 +0000 (12:58 -0700)]
drm/xe/guc: Drop helper to read freq
As the forcewake is already held during GuC load, there's no need to use
a helper function to call xe_guc_pc_get_cur_freq(). Just call
xe_guc_pc_get_cur_freq_fw() directly.
Lucas De Marchi [Mon, 22 Sep 2025 19:58:30 +0000 (12:58 -0700)]
drm/xe/guc_pc: Use poll_timeout_us() for waiting
Convert wait_for_pc_state() and wait_for_act_freq_limit() to
poll_timeout_us(). This brings 2 changes in behavior: Drop the
exponential wait and fix a potential much longer sleep.
usleep_range() will wait anywhere between `wait` and `wait << 1`, so
it's not correct to assume `slept += wait`. This code is not really
accurate. Pairing this with the exponential wait increase, it could be
waiting much longer than intended.
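The accounting problem described above can be checked with plain arithmetic. The sketch below does not sleep and is not the driver's loop: simulate_backoff() and struct wait_sim are hypothetical names, and the loop shape (account `wait`, then double it) is a simplified assumption based on the description.

```c
#include <assert.h>

/* Result of simulating the old-style wait loop. */
struct wait_sim {
	unsigned int accounted_us;	/* what `slept += wait` believes */
	unsigned int worst_case_us;	/* if every sleep takes `wait << 1` */
};

/*
 * Simulate a loop that sleeps somewhere in [wait, wait << 1], adds
 * only `wait` to its budget, and doubles `wait` each round.
 * No real sleeping happens; this only does the bookkeeping.
 */
struct wait_sim simulate_backoff(unsigned int wait_us, unsigned int timeout_us)
{
	struct wait_sim s = { 0, 0 };

	assert(wait_us > 0);
	while (s.accounted_us < timeout_us) {
		s.accounted_us += wait_us;		/* optimistic accounting */
		s.worst_case_us += wait_us << 1;	/* upper end of the range */
		wait_us <<= 1;				/* exponential increase */
	}
	return s;
}
```

Starting at 10 us with a 5000 us budget, the loop accounts 5110 us while the worst case adds up to 10220 us, twice what the bookkeeping assumes.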
Lucas De Marchi [Wed, 24 Sep 2025 15:27:11 +0000 (08:27 -0700)]
drm/xe/configfs: Improve doc for ctx_restore* attributes
Spell out the syntax instead of only using examples. The
<engine-class> part is particularly important since it differs from
engines_allowed and may confuse users. The same batch buffer is used for
all engines of a certain class.
Cc: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Fixes: e2a9854d806e ("drm/xe/configfs: Allow to select by class only") Link: https://lore.kernel.org/r/20250924152709.659483-4-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Lucas De Marchi [Wed, 24 Sep 2025 15:27:10 +0000 (08:27 -0700)]
drm/xe/configfs: Fix engine class parsing
If mask is NULL, only the engine class should be accepted, so the
pattern string should be completely parsed. This should fix passing e.g.
rcs0 to ctx_restore_post_bb when it's only expecting the engine class.
Reported-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Closes: https://lore.kernel.org/r/20250922155544.67712-1-jonathan.cavitt@intel.com Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aNJKnrCQmL9xS9Gv@stanley.mountain Fixes: e2a9854d806e ("drm/xe/configfs: Allow to select by class only") Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Link: https://lore.kernel.org/r/20250924152709.659483-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
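A minimal sketch of "the pattern string should be completely parsed", written as standalone C. The function name parse_class_only() and the three-entry class table are assumptions for illustration; the driver's lookup table and helpers are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative class table only; not the driver's table. */
static const char *const engine_classes[] = { "rcs", "bcs", "ccs" };

/*
 * Return true only if `pattern` is exactly one known class name.
 * A prefix match is not enough: trailing characters, such as the
 * instance digit in "rcs0", make the pattern invalid.
 */
bool parse_class_only(const char *pattern)
{
	size_t i;

	assert(pattern != NULL);
	for (i = 0; i < sizeof(engine_classes) / sizeof(engine_classes[0]); i++) {
		size_t len = strlen(engine_classes[i]);

		if (strncmp(pattern, engine_classes[i], len) != 0)
			continue;
		if (pattern[len] == '\0')	/* whole string consumed */
			return true;
	}
	return false;
}
```

With this check, "rcs" is accepted while "rcs0" is rejected, which is the behavior the fix describes for class-only attributes.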
Matthew Auld [Fri, 19 Sep 2025 12:20:53 +0000 (13:20 +0100)]
drm/xe/uapi: loosen used tracking restriction
Currently this is hidden behind perfmon_capable() since this is
technically an info leak, given that this is a system wide metric.
However the granularity reported here is always PAGE_SIZE aligned, which
matches what the core kernel is already willing to expose to userspace
if querying how many free RAM pages there are on the system, and that
doesn't need any special privileges. In addition other drm drivers seem
happy to expose this.
The motivation here is with oneAPI where they want to use the system
wide 'used' reporting here, so not the per-client fdinfo stats. This has
also come up with some perf overlay applications wanting this
information.
Fixes: 1105ac15d2a1 ("drm/xe/uapi: restrict system wide accounting") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com
Michal Wajdeczko [Mon, 22 Sep 2025 10:12:07 +0000 (12:12 +0200)]
drm/xe/tests: Fix build break on clang 16.0.6
The following error was reported when building with clang 16.0.6:
In file included from drivers/gpu/drm/xe/xe_pci.c:1104:
>> drivers/gpu/drm/xe/tests/xe_pci.c:214:2: error: initializer \
element is not a compile-time constant
graphics_ip_xelp,
^~~~~~~~~~~~~~~~
drivers/gpu/drm/xe/tests/xe_pci.c:221:2: error: initializer \
element is not a compile-time constant
media_ip_xem,
^~~~~~~~~~~~
2 errors generated.
Fix that by explicit re-definition of pre-GMDID IPs, as there are
not so many of them.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509192041.tQwdE4DS-lkp@intel.com/ Fixes: 5bb5258e357e ("drm/xe/tests: Add pre-GMDID IP descriptors to param generators") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250922101207.192028-1-michal.wajdeczko@intel.com
Michal Wajdeczko [Fri, 19 Sep 2025 16:04:30 +0000 (18:04 +0200)]
drm/xe/debugfs: Improve .show() helper for GT-based attributes
Like we did for tile-based attributes, introduce separate show()
helper that implicitly takes an RPM reference prior to the call
to the actual print() function. This translates into some savings.
Michal Wajdeczko [Fri, 19 Sep 2025 16:04:29 +0000 (18:04 +0200)]
drm/xe/debugfs: Make ggtt file per-tile
Due to initial lack of per-tile debugfs directories, the ggtt file
attribute was created as per-GT file. Fix that since now we have
proper per-tile directories.
Lucas De Marchi [Mon, 22 Sep 2025 22:11:34 +0000 (15:11 -0700)]
drm/xe/psmi: Do not return NULL
The checks for id and bo_size are impossible conditions. If they were
possible, then the caller should not be using IS_ERR(). Just replace
them with asserts which should be compiled out when not debugging and
at the same time prevent other refactors from breaking this assumption.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aK1nZjyAF0s7bnHg@stanley.mountain Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250922221133.109921-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
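Why a NULL return is a problem for a caller that checks IS_ERR() can be shown with a userspace re-creation of the error-pointer convention. The toy_* names below are hypothetical and only mimic the kernel macros; toy_alloc_region() is an invented example function, not the PSMI code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True only for pointers in the top TOY_MAX_ERRNO addresses. */
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static int toy_storage;

/*
 * Hypothetical allocator: the "impossible" precondition is asserted
 * instead of being reported as NULL, so every failure the caller can
 * see is an error pointer.
 */
void *toy_alloc_region(int id, bool fail)
{
	assert(id >= 0);
	if (fail)
		return toy_err_ptr(-ENOMEM);
	return &toy_storage;
}
```

Because NULL is not in the error range, an error-pointer check on a NULL return passes and the caller goes on to dereference it; returning only valid pointers or error pointers avoids that.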
Thomas Hellström [Thu, 18 Sep 2025 14:28:48 +0000 (16:28 +0200)]
drm/xe/pm: Add lockdep annotation for the pm_block completion
Similar to how we annotate dma-fences, add lockdep annotation to
the pm_block completion to ensure we don't wait for it while holding
locks that are needed in the pm notifier or in the device
suspend / resume callbacks.
Thomas Hellström [Thu, 18 Sep 2025 14:28:47 +0000 (16:28 +0200)]
drm/xe/pm: Hold the validation lock around evicting user-space bos for suspend
During pm notifier eviction we may still race with validations.
Ensure those are blocked out during eviction to ensure we have
access to as much system memory as possible.
During the suspend operation itself, we run single-threaded so that
shouldn't be a problem.
Thomas Hellström [Thu, 18 Sep 2025 09:22:07 +0000 (11:22 +0200)]
drm/xe/dma-buf: Allow pinning of p2p dma-buf
RDMA NICs typically require the VRAM dma-bufs to be pinned in
VRAM for pcie-p2p communication, since they don't fully support
the move_notify() scheme. We would like to support that.
However allowing unaccounted pinning of VRAM creates a DOS vector
so up until now we haven't allowed it.
However with cgroups support in TTM, the amount of VRAM allocated
to a cgroup can be limited, and since also the pinned memory is
accounted as allocated VRAM we should be safe.
An analogy with system memory can be made if we observe the
similarity with kernel system memory that is allocated as the
result of user-space action and that is accounted using __GFP_ACCOUNT.
Ideally, to be more flexible, we would add a "pinned_memory",
or possibly "kernel_memory" limit to the dmem cgroups controller,
that would additionally limit the memory that is pinned in this way.
If we let that limit default to the dmem::max limit we can
introduce that without needing to care about regressions.
Considering that we already pin VRAM in this way for at least
page-table memory and LRC memory, and the above path to greater
flexibility, allow this also for dma-bufs.
v2:
- Update comments about pinning in the dma-buf kunit test
(Niranjana Vishwanathapura)
Cc: Dave Airlie <airlied@gmail.com> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://lore.kernel.org/r/20250918092207.54472-4-thomas.hellstrom@linux.intel.com
Thomas Hellström [Thu, 18 Sep 2025 09:22:05 +0000 (11:22 +0200)]
drm/xe: Don't copy pinned kernel bos twice on suspend
We were copying the bo content of the bos on the list
"xe->pinned.late.kernel_bo_present" twice on suspend.
Presumably the intent is to copy the pinned external bos on
the first pass.
This is harmless since we (currently) should have no pinned
external bos needing copy since
a) external system bos don't have compressed content,
b) We do not (yet) allow pinning of VRAM bos.
Still, fix this up so that we copy pinned external bos on
the first pass. We're about to allow bos pinned in VRAM.
Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier") Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250918092207.54472-2-thomas.hellstrom@linux.intel.com
Dave Airlie [Sun, 21 Sep 2025 21:42:05 +0000 (07:42 +1000)]
Merge tag 'drm-xe-next-2025-09-19' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
UAPI Changes:
- Drop L3 bank mask reporting from the media GT on Xe3 and later. Only
do that for the primary GT. No userspace needs or uses it for media
and some platforms may report bogus values.
- Add SLPC power_profile sysfs interface with support for base and
power_saving modes (Vinay Belgaumkar, Rodrigo Vivi)
- Add configfs attributes to add post/mid context-switch commands
(Lucas De Marchi)
Cross-subsystem Changes:
- Fix hmm_pfn_to_map_order() usage in gpusvm and refactor APIs to
align with pieces previously handled by xe_hmm (Matthew Auld)
Core Changes:
- Add MEI driver for Late Binding Firmware Update/Upload
(Alexander Usyskin)
Driver Changes:
- Fix GuC CT teardown wrt TLB invalidation (Satyanarayana)
- Fix CCS save/restore on VF (Satyanarayana)
- Increase default GuC crash buffer size (Zhanjun)
- Allow to clear GT stats in debugfs to aid debugging (Matthew Brost)
- Add more SVM GT stats to debugfs (Matthew Brost)
- Fix error handling in VMA attr query (Himal)
- Move sa_info in debugfs to be per tile (Michal Wajdeczko)
- Limit number of retries upon receiving NO_RESPONSE_RETRY from GuC to
avoid endless loop (Michal Wajdeczko)
- Fix configfs handling for survivability_mode undoing user choice when
unbinding the module (Michal Wajdeczko)
- Refactor configfs attribute visibility to future-proof it and stop
exposing survivability_mode if not applicable (Michal Wajdeczko)
- Constify some functions (Harish Chegondi, Michal Wajdeczko)
- Add/extend more HW workarounds for Xe2 and Xe3
(Harish Chegondi, Tangudu Tilak Tirumalesh)
- Replace xe_hmm with gpusvm (Matthew Auld)
- Improve fake pci and WA kunit handling for testing new platforms
(Michal Wajdeczko)
- Reduce unnecessary PTE writes when migrating (Sanjay Yadav)
- Cleanup GuC interface definitions and log message (John Harrison)
- Small improvements around VF CCS (Michal Wajdeczko)
- Enable bus mastering for the I2C controller (Raag Jadav)
- Prefer devm_mutex over hand rolling it (Christophe JAILLET)
- Drop sysfs and debugfs attributes not available for VF (Michal Wajdeczko)
- GuC CT devm actions improvements (Michal Wajdeczko)
- Recommend new GuC versions for PTL and BMG (Julia Filipchuk)
- Improve driver handling for exhaustive eviction using new
xe_validation wrapper around drm_exec (Thomas Hellström)
- Add and use printk wrappers for tile and device (Michal Wajdeczko)
- Better document workaround handling in Xe (Lucas De Marchi)
- Improvements on ARRAY_SIZE and ERR_CAST usage (Lucas De Marchi,
Fushuai Wang)
- Align CSS firmware headers with the GuC APIs (John Harrison)
- Test GuC to GuC (G2G) communication to aid debug in pre-production
firmware (John Harrison)
- Bail out driver probing if GuC fails to load (John Harrison)
- Allow error injection in xe_pxp_exec_queue_add()
(Daniele Ceraolo Spurio)
- Minor refactors in xe_svm (Shuicheng Lin)
- Fix madvise ioctl error handling (Shuicheng Lin)
- Use attribute groups to simplify sysfs registration
(Michal Wajdeczko)
- Add Late Binding Firmware implementation in Xe to work together with
the MEI component (Badal Nilawar, Daniele Ceraolo Spurio, Rodrigo
Vivi)
- Fix build with CONFIG_MODULES=n (Lucas De Marchi)
Lucas De Marchi [Fri, 12 Sep 2025 21:54:51 +0000 (14:54 -0700)]
drm/xe: Fix build with CONFIG_MODULES=n
When building with CONFIG_MODULES=n, the __exit functions are dropped.
However our init functions may call them for error handling, so they are
not good candidates for the exit sections.
Fix this error reported by 0day:
ld.lld: error: relocation refers to a symbol in a discarded section: xe_configfs_exit
>>> defined in vmlinux.a(drivers/gpu/drm/xe/xe_configfs.o)
>>> referenced by xe_module.c
>>> drivers/gpu/drm/xe/xe_module.o:(init_funcs) in archive vmlinux.a
This is the only exit function using __exit. Drop it to fix the build.
Cc: Riana Tauro <riana.tauro@intel.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506092221.1FmUQmI8-lkp@intel.com/ Fixes: 16280ded45fb ("drm/xe: Add configfs to enable survivability mode") Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://lore.kernel.org/r/20250912-fix-nomodule-build-v1-1-d11b70a92516@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Lucas De Marchi [Tue, 16 Sep 2025 21:15:43 +0000 (14:15 -0700)]
drm/xe/lrc: Allow to add user commands mid context switch
Like done for post-context-restore commands, allow to add commands from
configfs in the middle of context restore. Since currently the indirect
ctx hardcodes the offset to CTX_INDIRECT_CTX_OFFSET_DEFAULT, this is
executed in the very beginning of engine context restore.
Lucas De Marchi [Tue, 16 Sep 2025 21:15:42 +0000 (14:15 -0700)]
drm/xe/lrc: Allow INDIRECT_CTX for more engine classes
Currently it's only allowed for render and compute. Going forward we
want to enable it for more engine classes. Let the XE_LRC_FLAG_INDIRECT_CTX
flag (and thus gt_engine_needs_indirect_ctx()) be the deciding factor
for its availability.
While at it, add the missing const to rcs_funcs array. Since
CTX_INDIRECT_CTX_OFFSET_DEFAULT already matches the HW default and
gt_engine_needs_indirect_ctx() only ever enables it for rcs/ccs, there
is no change in behavior, it's only preparation for future use case.
Lucas De Marchi [Tue, 16 Sep 2025 21:15:41 +0000 (14:15 -0700)]
drm/xe/configfs: Add post context restore bb
Allow the user to specify commands to execute during a context restore.
Currently it's possible to parse 2 types of actions:
- cmd: the instructions are added as is to the bb
- reg: just use the address and value, without worrying about
encoding the right LRI instruction. This is possibly the most
useful use case, so added a dedicated action for that.
This also prepares for future BBs: mid context restore and rc6 context
restore that can re-use the same parsing functions.
Lucas De Marchi [Tue, 16 Sep 2025 21:15:40 +0000 (14:15 -0700)]
drm/xe/lrc: Allow to add user commands on context switch
During validation it's useful to allow additional commands to be
executed on context switch. Fetch the commands from configfs (to be
added) and add them to the WA BB.
Lucas De Marchi [Tue, 16 Sep 2025 21:15:39 +0000 (14:15 -0700)]
drm/xe/configfs: Allow to select by class only
For a future configfs attribute, it's desirable to select by engine mask
only as the instance doesn't make sense.
Rename the function lookup_engine_mask() to lookup_engine_info() and
make it return the entry. This allows parse_engine() to still return an
item if the caller wants to allow parsing a class-only string like
"rcs", "bcs", "ccs", etc.
Matthew Schwartz [Thu, 11 Sep 2025 17:48:51 +0000 (10:48 -0700)]
drm/amd/display: Only restore backlight after amdgpu_dm_init or dm_resume
On clients that utilize AMD_PRIVATE_COLOR properties for HDR support,
brightness sliders can include a hardware controlled portion and a
gamma-based portion. This is the case on the Steam Deck OLED when using
gamescope with Steam as a client.
When a user sets a brightness level while HDR is active, the gamma-based
portion and/or hardware portion are adjusted to achieve the desired
brightness. However, when a modeset takes place while the gamma-based
portion is in-use, restoring the hardware brightness level overrides the
user's overall brightness level and results in a mismatch between what
the slider reports and the display's current brightness.
To avoid overriding gamma-based brightness, only restore HW backlight
level after boot or resume. This ensures that the backlight level is
set correctly after the DC layer resets it while avoiding interference
with subsequent modesets.
Fixes: 7875afafba84 ("drm/amd/display: Fix brightness level not retained over reboot") Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4551 Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Guangshuo Li [Thu, 18 Sep 2025 10:57:05 +0000 (18:57 +0800)]
drm/amdgpu/atom: Check kcalloc() for WS buffer in amdgpu_atom_execute_table_locked()
kcalloc() may fail. When WS is non-zero and allocation fails, ectx.ws
remains NULL while ectx.ws_size is set, leading to a potential NULL
pointer dereference in atom_get_src_int() when accessing WS entries.
Return -ENOMEM on allocation failure to avoid the NULL dereference.
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
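The allocation check can be sketched in userspace C with an injectable allocator, so the failure path is testable. All names here (toy_ctx, toy_execute_table(), toy_failing_alloc()) are hypothetical, and calloc() stands in for kcalloc(); this is not the amdgpu function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocator signature matching calloc(), so failures can be injected. */
typedef void *(*alloc_fn)(size_t nmemb, size_t size);

struct toy_ctx {
	uint32_t *ws;		/* workspace buffer */
	size_t ws_size;		/* number of workspace entries */
};

/* Always-failing allocator used to exercise the error path. */
void *toy_failing_alloc(size_t nmemb, size_t size)
{
	(void)nmemb;
	(void)size;
	return NULL;
}

int toy_execute_table(size_t ws_entries, alloc_fn alloc)
{
	struct toy_ctx ctx = { NULL, 0 };

	assert(alloc != NULL);
	if (ws_entries) {
		ctx.ws = alloc(ws_entries, sizeof(*ctx.ws));
		if (!ctx.ws)
			return -ENOMEM;	/* bail out before ws is indexed */
		ctx.ws_size = ws_entries;
	}

	/* Stand-in for table execution, which would access ctx.ws[]. */
	if (ctx.ws_size)
		ctx.ws[0] = 0;

	free(ctx.ws);	/* free(NULL) is a no-op */
	return 0;
}
```

Setting ws_size only after the allocation succeeds keeps the size and the pointer consistent, so later code never sees a non-zero size paired with a NULL buffer.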
Christian König [Wed, 27 Aug 2025 09:45:45 +0000 (11:45 +0200)]
drm/amdgpu: revert to old status lock handling v3
It turned out that protecting the status of each bo_va with a
spinlock was just hiding problems instead of solving them.
Revert the whole approach, add a separate stats_lock and lockdep
assertions that the correct reservation lock is held all over the place.
This not only allows for better checks if a state transition is properly
protected by a lock, but also switching back to using list macros to
iterate over the state of lists protected by the dma_resv lock of the
root PD.
v2: re-add missing check
v3: split into two patches
Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/xe/xe_late_bind_fw: Introduce debug fs node to disable late binding
Introduce a debug filesystem node to disable late binding fw reload
during the system or runtime resume. This is intended for situations
where the late binding fw needs to be loaded from user mode,
particularly for validation purposes.
Note that xe kmd doesn't participate in late binding flow from user
space. Binary loaded from the userspace will be lost upon entering
D3 cold, hence the user space app needs to handle this situation.
drm/xe/xe_late_bind_fw: Load late binding firmware
Load late binding firmware
v2:
- s/EAGAIN/EBUSY/
- Flush worker in suspend and driver unload (Daniele)
v3:
- Use retry interval of 6s, in steps of 200ms, to allow
other OS components release MEI CL handle (Sasha)
v4:
- return -ENODEV if component not added (Daniele)
- parse and print status returned by csc
v5:
- Use payload to check firmware valid (Daniele)
- Obtain the RPM reference before scheduling the worker to
ensure the device remains awake until the worker completes
firmware loading (Rodrigo)
v6:
- In case of error do not re-attempt fw download (Daniele)
v7 (Rodrigo):
- Rename of mei structs and callback.
Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250905154953.3974335-6-badal.nilawar@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
drm/xe/xe_late_bind_fw: Initialize late binding firmware
Search for late binding firmware binaries and populate the meta data of
firmware structures.
v2 (Daniele):
- drm_err if firmware size is more than max payload size
- s/request_firmware/firmware_request_nowarn/ as firmware will
not be available for all possible cards
v3 (Daniele):
- init firmware from within xe_late_bind_init, propagate error
- switch late_bind_fw to array to handle multiple firmware types
v4 (Daniele):
- Alloc payload dynamically, fix nits
v6 (Daniele)
- %s/MAX_PAYLOAD_SIZE/XE_LB_MAX_PAYLOAD_SIZE/
Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250905154953.3974335-5-badal.nilawar@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Introduce xe_late_bind_fw to enable firmware loading for the devices,
such as the fan controller, during the driver probe. Typically,
firmware for such devices is part of IFWI flash image but can be
replaced at probe after OEM tuning.
This patch binds mei late binding component to enable firmware loading.
v2:
- Add devm_add_action_or_reset to remove the component (Daniele)
- Add INTEL_MEI_GSC check in xe_late_bind_init() (Daniele)
v3:
- Fail driver probe if late bind initialization fails,
add has_late_bind flag (Daniele)
v4:
- %s/I915_COMPONENT_LATE_BIND/INTEL_COMPONENT_LATE_BIND/
v6:
- rebased
v7:
- rebased
- In xe_late_bind_init, use drm_err when returning an error to
stop the probe (Lucas)
- Use imperative mode in commit message (Lucas)
Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250905154953.3974335-4-badal.nilawar@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Introduce a new MEI client driver to support Late Binding firmware
upload/update for Intel discrete graphics platforms.
Late Binding is a runtime firmware upload/update mechanism that allows
payloads, such as fan control and voltage regulator, to be securely
delivered and applied without requiring SPI flash updates or
system reboots. This driver enables the Xe graphics driver and other
user-space tools to push such firmware blobs to the authentication
firmware via the MEI interface.
The driver handles authentication, versioning, and communication
with the authentication firmware, which in turn coordinates with
the PUnit/PCODE to apply the payload.
This is a foundational component for enabling dynamic, secure,
and re-entrant configuration updates on platforms like Battlemage.
Cc: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250905154953.3974335-3-badal.nilawar@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Add a new helper function that allows MEI client drivers
to query the maximum transmission unit (MTU) for a connected
MEI client.
This is useful for clients that need to transmit large payloads,
such as firmware blobs, allowing them to determine the maximum
message size that can be safely sent before starting transmission and
size of the buffer to allocate when receiving data.
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com> Signed-off-by: Badal Nilawar <badal.nilawar@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20250905154953.3974335-2-badal.nilawar@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
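As a rough illustration of why a client would query the MTU before transmitting, the sketch below splits a payload into MTU-sized messages. It is plain userspace C and not the MEI API: chunk_count() and chunk_len() are hypothetical helpers, and the mtu argument is simply assumed to be the value such a query would return.

```c
#include <stddef.h>

/* Hypothetical helper: number of messages needed to send `len` bytes
 * when one message can carry at most `mtu` bytes.
 * Returns 0 if mtu is 0, since nothing can be sent. */
size_t chunk_count(size_t len, size_t mtu)
{
	if (mtu == 0)
		return 0;
	/* Round up without risking overflow in len + mtu - 1. */
	return len / mtu + (len % mtu != 0);
}

/* Hypothetical helper: size of message number `idx` (0-based).
 * Returns 0 when idx is past the last message. */
size_t chunk_len(size_t len, size_t mtu, size_t idx)
{
	size_t n = chunk_count(len, mtu);

	if (idx >= n)
		return 0;
	if (idx == n - 1)
		return len - idx * mtu;	/* last message carries the remainder */
	return mtu;
}
```

A receiver can use the same mtu value to size its buffer, since no single message in this model exceeds it.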
drm/amdgpu: add missing comment for the new argument
In function 'amdgpu_vm_lock_done_list' update the comment
for the new argument 'vm'.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509180211.UAqME0zj-lkp@intel.com/ Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 17 Sep 2025 16:42:11 +0000 (12:42 -0400)]
drm/amdgpu: suspend KFD and KGD user queues for S0ix
We need to make sure the user queues are preempted so
GFX can enter gfxoff.
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 17 Sep 2025 16:42:10 +0000 (12:42 -0400)]
drm/amdgpu/userq: Optimize S0ix handling
In S0i3, GFX state is retained, so it's preferable to
preempt queues rather than unmapping them as the overhead
is lower.
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The AMDGPU_PTE_PRT_GFX12 flag was missed during the page table rework; add it back.
Fixes: 6716a823d18d ("drm/amdgpu: rework how PTE flags are generated v3") Signed-off-by: Joe Wang <joe.wang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 17 Sep 2025 16:42:09 +0000 (12:42 -0400)]
drm/amdkfd: add proper handling for S0ix
When in S0i3, the GFX state is retained, so all we need to do
is stop the runlist so GFX can enter gfxoff.
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: David Perry <david.perry@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 20 Aug 2025 20:04:18 +0000 (16:04 -0400)]
drm/amdgpu: remove non-DC DCE 11 code
DC has been the default for ~8 years now and supports
many things that the non-DC code does not (audio, DP MST, etc.).
No DCE 11.x IPs ever supported analog encoders so that is not
an issue. Finally drop this code.
Acked-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Mon, 15 Sep 2025 12:28:49 +0000 (20:28 +0800)]
drm/amd/pm: Enable npm metrics data
Enable npm metrics data for smu_v13_0_12
v3: Add node id check for setting NPM_CAPS (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Fri, 29 Aug 2025 04:25:54 +0000 (12:25 +0800)]
drm/amd/pm: Fetch npm data from system metrics table
Fetch npm data from system metrics table for smu_v13_0_12
v3: Remove intermittent type for npm data, remove node id check,
move npm caps check to npm_get_data function (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Wed, 27 Aug 2025 13:22:13 +0000 (21:22 +0800)]
drm/amd/pm: Add sysfs node for node power
Add sysfs node to expose node power limit for smu_v13_0_12
v2: Remove support check from visible function (Kevin)
v3: Update comments (Kevin)
Remove sysfs remove file, change format specifier
for sysfs_emit, use attribute_group.name (Lijo)
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Mon, 15 Sep 2025 09:53:19 +0000 (17:53 +0800)]
drm/amd/pm: Allow system metrics table in 1vf mode
Allow fetching system metrics table in 1VF mode
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Thomas Hellström [Thu, 11 Sep 2025 08:03:24 +0000 (10:03 +0200)]
drm/xe: Work around clang multiple goto-label error
When using drm_exec_retry_on_contention(), clang may consider
all labels for which we take addresses in a function as
potential retry goto targets, although strictly only one
is possible. It will then in some situations generate false
positive errors.
In this case, the compiler, for some architectures, considers the
might_lock(&m->job_mutex);
as a potential goto target from drm_exec_retry_on_contention(),
and errors.
Work around that by moving the xe_validate / drm_exec
transaction to a separate function.
v2:
- New commit message based on analysis of Nathan Chancellor
Fixes: 59eabff2a352 ("drm/xe: Convert xe_bo_create_pin_map() for exhaustive eviction") Cc: Matthew Brost <matthew.brost@intel.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509101853.nDmyxTEM-lkp@intel.com/ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> # build Link: https://lore.kernel.org/r/20250911080324.180307-1-thomas.hellstrom@linux.intel.com
Michal Wajdeczko [Tue, 16 Sep 2025 17:00:29 +0000 (19:00 +0200)]
drm/xe/sysfs: Simplify sysfs registration
Instead of manually maintaining each sysfs file, define and use
attribute groups and register them using a device-managed function.
Then use is_visible() to filter out unsupported attributes.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250916170029.3313-3-michal.wajdeczko@intel.com
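The filtering idea can be sketched in userspace C as follows. This is only a model of the is_visible() convention (return a file mode to create the attribute, 0 to hide it); the model_* types, the needs_pcode field and the attribute names used with them are assumptions for illustration, not the driver's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; none of these types are the kernel's. */
struct model_dev {
	bool skip_pcode;	/* device cannot access PCODE */
};

struct model_attr {
	const char *name;
	bool needs_pcode;	/* attribute is backed by a PCODE query */
};

/* Return the file mode to use, or 0 to hide the attribute,
 * following the convention of an is_visible() callback. */
unsigned int model_is_visible(const struct model_dev *dev,
			      const struct model_attr *attr)
{
	if (attr->needs_pcode && dev->skip_pcode)
		return 0;
	return 0444;
}

/* Count how many attributes of a group would actually be created. */
size_t model_count_visible(const struct model_dev *dev,
			   const struct model_attr *attrs, size_t n)
{
	size_t i, visible = 0;

	for (i = 0; i < n; i++)
		if (model_is_visible(dev, &attrs[i]))
			visible++;
	return visible;
}
```

With this shape, the whole group is registered with one call and the per-attribute decision lives in a single callback instead of being scattered across create and remove paths.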
Michal Wajdeczko [Tue, 16 Sep 2025 17:00:28 +0000 (19:00 +0200)]
drm/xe/vf: Don't expose sysfs attributes not applicable for VFs
VFs can't read BMG_PCIE_CAP(0x138340) register nor access PCODE
(already guarded by the info.skip_pcode flag) so we shouldn't
expose attributes that require any of them to avoid errors like:
Fixes: 0e414bf7ad01 ("drm/xe: Expose PCIe link downgrade attributes") Fixes: cdc36b66cd41 ("drm/xe: Expose fan control and voltage regulator version") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250916170029.3313-2-michal.wajdeczko@intel.com
Shuicheng Lin [Thu, 11 Sep 2025 17:31:40 +0000 (17:31 +0000)]
drm/xe/madvise: Fix ioctl argument check
It is "preferred_mem_loc" instead of "atomic" for the ATTR_PREFERRED_LOC
path.
Also include 2 minor changes with no functional impact.
1. Remove the redundant "attr.atomic_access" assignment.
2. Replace down_read_interruptible() with
xe_svm_notifier_lock_interruptible() to pair with
xe_svm_notifier_unlock().
Fixes: ada7486c5668 ("drm/xe: Implement madvise ioctl for xe") Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://lore.kernel.org/r/20250911173139.1405878-2-shuicheng.lin@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Shuicheng Lin [Thu, 11 Sep 2025 03:14:06 +0000 (03:14 +0000)]
drm/xe: Misc refine for svm
These changes should have no functional impact.
1. Correct typo of "operation" in macro range_debug().
2. Combine 2 spin_lock() calls in xe_svm_garbage_collector() into 1.
3. Drop redundant preferred_region_is_vram check in
xe_svm_range_needs_migrate_to_vram().
4. Combine the devmem_possible check in xe_svm_handle_pagefault().
need_vram includes the IS_DGFX() check, so there is no change for
.devmem_only.
v2: revert !ctx.devmem_only change (Matt)
v3: rebase code and refine commit message.
v4: rebase code and refine commit message.
Michal Wajdeczko [Tue, 16 Sep 2025 17:16:45 +0000 (19:16 +0200)]
drm/xe/tests: Add pre-GMDID IP descriptors to param generators
Recently introduced kunit parameter generators were based on
the existing arrays which have only GMDID-based IPs and didn't
take into account IP definitions from the pre-GMDID era.
Add test-only arrays with pre-GMDID IPs (as those will not change)
and extend param generators to start iterating over them.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250916171645.3335-1-michal.wajdeczko@intel.com
Dave Airlie [Wed, 17 Sep 2025 06:09:24 +0000 (16:09 +1000)]
Merge tag 'drm-rust-next-2025-09-16' of https://gitlab.freedesktop.org/drm/rust/kernel into drm-next
DRM Rust changes for v6.18
Alloc
- Add BorrowedPage type and AsPageIter trait
- Implement Vmalloc::to_page() and VmallocPageIter
- Implement AsPageIter for VBox and VVec
DMA & Scatterlist
- Add dma::DataDirection and type alias for dma_addr_t
- Abstraction for struct scatterlist and struct sg_table
DRM
- In the DRM GEM module, simplify overall use of generics, add
DriverFile type alias and drop Object::SIZE.
Nova (Core)
- Various register!() macro improvements (paving the way for lifting
it to common driver infrastructure)
- Minor VBios fixes and refactoring
- Minor firmware request refactoring
- Advance firmware boot stages; process Booter and patch its
signature, process GSP and GSP bootloader
- Switch development firmware version to r570.144
- Add basic firmware bindings for r570.144
- Move GSP boot code to its own module
- Clean up and take advantage of pin-init features to store most of
the driver's private data within a single allocation
- Update ARef import from sync::aref
- Add website to MAINTAINERS entry
Nova (DRM)
- Update ARef import from sync::aref
- Add website to MAINTAINERS entry
Pin-Init
- Merge pin-init PR from Benno
- `#[pin_data]` now generates a `*Projection` struct similar to the
`pin-project` crate.
- Add initializer code blocks to `[try_][pin_]init!` macros: make
initializer macros accept any number of `_: {/* arbitrary code
*/},` & make them run the code at that point.
- Make the `[try_][pin_]init!` macros expose initialized fields via
a `let` binding as `&mut T` or `Pin<&mut T>` for later fields.
Rust
- Various methods for AsBytes and FromBytes traits
Tyr
- Initial Rust driver skeleton for ARM Mali GPUs.
- It can power up the GPU, query for GPU metadata through MMIO and
provide the metadata to userspace via DRM device IOCTL (struct
drm_panthor_dev_query).
Since the PXP start comes after __xe_exec_queue_init() has completed,
we need to cleanup what was done in that function in case of a PXP
start error.
__xe_exec_queue_init calls the submission backend init() function,
so we need to introduce an opposite for that. Unfortunately, while
we already have a fini() function pointer, it performs other
operations in addition to cleaning up what was done by the init().
Therefore, for clarity, the existing fini() has been renamed to
destroy(), while a new fini() has been added to only clean up what was
done by the init(), with the latter being called by the former (via
xe_exec_queue_fini).
Fixes: 72d479601d67 ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250909221240.3711023-3-daniele.ceraolospurio@intel.com
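The init()/fini()/destroy() split described above can be modelled in userspace C roughly like this. All names are illustrative stand-ins rather than the xe driver's functions, and the later_step_fails flag merely plays the role of a failing PXP start.

```c
#include <stdlib.h>

/* Userspace model; names are illustrative, not the xe driver's. */
struct model_queue {
	int *backend_state;	/* set up by init(), released by fini() */
	int extra_teardown;	/* work that only destroy() performs */
};

/* Backend init: allocate backend state. Returns 0 or -1. */
int model_init(struct model_queue *q)
{
	q->backend_state = malloc(sizeof(*q->backend_state));
	if (!q->backend_state)
		return -1;
	*q->backend_state = 1;
	q->extra_teardown = 0;
	return 0;
}

/* fini: undo exactly what init() did, nothing more. */
void model_fini(struct model_queue *q)
{
	free(q->backend_state);
	q->backend_state = NULL;
}

/* destroy: full teardown; does the extra work, then calls fini(). */
void model_destroy(struct model_queue *q)
{
	q->extra_teardown = 1;
	model_fini(q);
}

/* Create path: if a step after init() fails, only init() has run,
 * so only fini() is needed to unwind. */
int model_create(struct model_queue *q, int later_step_fails)
{
	if (model_init(q))
		return -1;
	if (later_step_fails) {
		model_fini(q);
		return -1;
	}
	return 0;
}
```

Keeping fini() as the strict inverse of init() means the error path never runs teardown steps for work that was not yet done.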
Christian König [Wed, 27 Aug 2025 09:45:45 +0000 (11:45 +0200)]
drm/amdgpu: re-order and document VM code
Re-order fields in the VM structure and try to improve the
documentation a bit.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Wed, 27 Aug 2025 08:17:48 +0000 (10:17 +0200)]
drm/amdgpu: remove check for BO reservation add assert instead
We should leave such checks to lockdep and not implement something
manually.
Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Asad Kamal [Wed, 27 Aug 2025 10:19:13 +0000 (18:19 +0800)]
drm/amd/pm: Update pmfw headers for smu_v13_0_12
Update pmfw headers for smu_v13_0_12 to include node power limit
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Rename amdgpu_hwmon_get_sensor_generic to use for generic pm
interfaces
Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/pm: Use devm_i2c_add_adapter() in the V14_0_2 smu
The I2C init for V14_0_2 uses i2c_add_adapter() and i2c_del_adapter(),
this commit replaces the use of these two functions with
devm_i2c_add_adapter(). Notice that V14_0_2 init initializes multiple
I2C buses in a loop; if something goes wrong, the previous adapters are
removed, and the amdgpu load is interrupted. Since I2C init is required
for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
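The reasoning behind relying on device-managed registration can be sketched with a small userspace model. It is not the kernel's devres implementation: model_device, model_devm_add() and the other model_* names are assumptions made for this example. Each successful registration queues its own removal, so a failure partway through the loop needs no manual unwinding.

```c
#include <stddef.h>

#define MODEL_MAX_ACTIONS 8

typedef void (*model_action_fn)(void *data);

/* Userspace model of a device carrying a stack of release actions. */
struct model_device {
	model_action_fn fn[MODEL_MAX_ACTIONS];
	void *data[MODEL_MAX_ACTIONS];
	size_t count;
};

/* Queue a release action; returns 0, or -1 if the stack is full. */
int model_devm_add(struct model_device *dev, model_action_fn fn, void *data)
{
	if (dev->count >= MODEL_MAX_ACTIONS)
		return -1;
	dev->fn[dev->count] = fn;
	dev->data[dev->count] = data;
	dev->count++;
	return 0;
}

/* Run all release actions in reverse registration order. */
void model_devm_release_all(struct model_device *dev)
{
	while (dev->count > 0) {
		dev->count--;
		dev->fn[dev->count](dev->data[dev->count]);
	}
}

/* Stand-in for an adapter: only tracks whether it is registered. */
struct model_adapter {
	int registered;
};

void model_adapter_del(void *data)
{
	((struct model_adapter *)data)->registered = 0;
}

/* "Managed" add: register the adapter and queue its removal. */
int model_devm_adapter_add(struct model_device *dev, struct model_adapter *a)
{
	a->registered = 1;
	if (model_devm_add(dev, model_adapter_del, a)) {
		a->registered = 0;	/* could not queue cleanup: undo now */
		return -1;
	}
	return 0;
}

/* Init n buses; a failure at index fail_at just returns the error and
 * leaves earlier adapters to the device release path. */
int model_i2c_init(struct model_device *dev, struct model_adapter *a,
		   size_t n, size_t fail_at)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == fail_at)
			return -1;
		if (model_devm_adapter_add(dev, &a[i]))
			return -1;
	}
	return 0;
}
```

In this model the adapters registered before the failure stay registered until model_devm_release_all() runs, which corresponds to the device being torn down after the failed load.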
drm/amd/pm: Use devm_i2c_add_adapter() in the V13_0_6 smu
The I2C init for V13_0_6 uses i2c_add_adapter() and i2c_del_adapter(),
this commit replaces the use of these two functions with
devm_i2c_add_adapter(). Notice that V13_0_6 init initializes multiple
I2C buses in a loop; if something goes wrong, the previous adapters are
removed, and the amdgpu load is interrupted. Since I2C init is required
for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/pm: Use devm_i2c_add_adapter() in the V13 smu
The I2C init for SMU_V13 uses i2c_add_adapter() and i2c_del_adapter(),
this commit replaces the use of these two functions with
devm_i2c_add_adapter(). Notice that SMU_V13 init initializes multiple
I2C buses in a loop; if something goes wrong, the previous adapters are
removed, and the amdgpu load is interrupted. Since I2C init is required
for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/pm: Use devm_i2c_add_adapter() in the Sienna smu
The I2C init for Sienna Cichlid uses i2c_add_adapter() and
i2c_del_adapter(), this commit replaces the use of these two functions
with devm_i2c_add_adapter(). Notice that Sienna Cichlid init initializes
multiple I2C buses in a loop; if something goes wrong, the previous
adapters are removed, and the amdgpu load is interrupted. Since I2C init
is required for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/pm: Use devm_i2c_add_adapter() in the Navi10 smu
The I2C init for Navi10 uses i2c_add_adapter() and i2c_del_adapter(),
this commit replaces the use of these two functions with
devm_i2c_add_adapter(). Notice that Navi10 init initializes multiple I2C
buses in a loop; if something goes wrong, the previous adapters are
removed, and the amdgpu load is interrupted. Since I2C init is required
for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/pm: Use devm_i2c_add_adapter() in the Arcturus smu
The I2C init for Arcturus uses i2c_add_adapter() and i2c_del_adapter(),
this commit replaces the use of these two functions with
devm_i2c_add_adapter(). Notice that Arcturus init initializes multiple
I2C buses in a loop; if something goes wrong, the previous adapters are
removed, and the amdgpu load is interrupted. Since I2C init is required
for the correct load of amdgpu, it is safe to rely on
devm_i2c_add_adapter() to handle any previously initialized I2C adapter.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amdgpu/amdgpu_i2c: Use devm_i2c_add_adapter instead of i2c_add_adapter
This commit replaces i2c_add_adapter() with devm_i2c_add_adapter() and
removes part of the cleanup logic since the new function handles the i2c
removal.
Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd/display: Use devm_i2c_add_adapter to simplify i2c cleanup logic
This commit replaces the utilization of i2c_add/del_adapter() with
devm_i2c_add_adapter() to reduce the amount of boilerplate. Using
devm_i2c_add_adapter() has the advantage of removing the manual
manipulation of the I2C adapter.
Suggested-by: Robert Beckett <bob.beckett@collabora.com> Signed-off-by: Rodrigo Siqueira <siqueira@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
James Flowers [Sat, 13 Sep 2025 05:19:52 +0000 (22:19 -0700)]
drm/amd/display: Use kmalloc_array() instead of kmalloc()
Documentation/process/deprecated.rst recommends against the use of kmalloc
with dynamic size calculations due to the risk of overflow and smaller
allocation being made than the caller was expecting. This could lead to
buffer overflow in code similar to the memcpy in
amdgpu_dm_plane_add_modifier().
Signed-off-by: James Flowers <bold.zone2373@fastmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
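The overflow risk can be shown with a userspace stand-in. model_alloc_array() below is a hypothetical helper built on malloc(), written only to illustrate the check an array allocator performs; it is not the kernel's kmalloc_array().

```c
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if n * size does not fit in size_t. */
int model_mul_overflows(size_t n, size_t size)
{
	return size != 0 && n > SIZE_MAX / size;
}

/* Userspace stand-in for an overflow-checked array allocation:
 * returns NULL instead of silently allocating a too-small buffer
 * when the element count times the element size wraps around. */
void *model_alloc_array(size_t n, size_t size)
{
	if (model_mul_overflows(n, size))
		return NULL;
	return malloc(n * size);
}
```

With an open-coded malloc(n * size), the wrapped product would yield a small but valid buffer, and a later copy sized by n elements would write past its end.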
Christian König [Wed, 27 Aug 2025 07:28:40 +0000 (09:28 +0200)]
drm/amdgpu: fix userq VM validation v4
That was actually complete nonsense and not validating the BOs
at all. The code just cleared all VM areas where it couldn't grab the
lock for a BO.
Try to fix this. Only compile tested at the moment.
v2: fix fence slot reservation as well as pointed out by Sunil.
also validate PDs, PTs, per VM BOs and update PDEs
v3: grab the status_lock while working with the done list.
v4: rename functions, add some comments, fix waiting for updates to
complete.
v4: rename amdgpu_vm_lock_done_list(), add some more comments
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Christian König [Wed, 27 Aug 2025 11:14:43 +0000 (13:14 +0200)]
drm/amdgpu: reject gang submissions under SRIOV
Gang submission means that the kernel driver guarantees that multiple
submissions are executed on the HW at the same time on different engines.
Background is that those submissions then depend on each other and
none of them can finish on its own.
SRIOV now uses world switch to preempt submissions on the engines to allow
sharing the HW resources between multiple VFs.
The problem is now that the SRIOV world switch can't know about such inter
dependencies and will cause a timeout if it waits for a partially running
gang submission.
To conclude, SRIOV and gang submissions are fundamentally incompatible at
the moment. For now just disable them.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yang Li [Tue, 16 Sep 2025 02:10:39 +0000 (10:10 +0800)]
drm/xe: Remove duplicate header files
Fix some duplicate includes in xe:
./drivers/gpu/drm/xe/xe_tlb_inval.c: xe_tlb_inval.h is included more than once.
./drivers/gpu/drm/xe/xe_pt.c: xe_tlb_inval_job.h is included more than once.
While at it, also sort the include lines alphabetically.
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=24705 Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=24706 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
[Reword commit message] Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250916021039.1632766-1-yang.lee@linux.alibaba.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
John Harrison [Tue, 9 Sep 2025 22:41:31 +0000 (15:41 -0700)]
drm/xe/guc: Return an error code if the GuC load fails
Due to multiple explosion issues in the early days of the Xe driver,
the GuC load was hacked to never return a failure. That prevented
kernel panics and such initially, but now all it achieves is creating
more confusing errors when the driver tries to submit commands to a
GuC it already knows is not there. So fix that up.
As a stop-gap and to help with debug of load failures due to invalid
GuC init params, a wedge call had been added to the inner GuC load
function. The reason being that it leaves the GuC log accessible via
debugfs. However, for an end user, simply aborting the module load is
much cleaner than wedging and trying to continue. The wedge blocks
user submissions but it seems that various bits of the driver itself
still try to submit to a dead GuC and lots of subsequent errors occur.
And with regards to developers debugging why their particular code
change is being rejected by the GuC, it is trivial to either add the
wedge back in and hack the return code to zero again or to just do a
GuC log dump to dmesg.
v2: Add support for error injection testing and drop the now redundant
wedge call.
Zongyao Bai [Mon, 15 Sep 2025 21:47:15 +0000 (05:47 +0800)]
drm/xe/sysfs: Add cleanup action in xe_device_sysfs_init
On partial failure, some sysfs files created before the failure might
not be removed. Add common cleanup step to remove them all immediately,
as it should be harmless to attempt to remove non-existing files.
Fixes: 0e414bf7ad01 ("drm/xe: Expose PCIe link downgrade attributes") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Zongyao Bai <zongyao.bai@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250915214716.1327379-2-zongyao.bai@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Dave Airlie [Tue, 16 Sep 2025 00:35:41 +0000 (10:35 +1000)]
Merge tag 'exynos-drm-next-for-v6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-next
New feature
- Add glue layer support for Exynos7870 DSIM in Exynos DSI driver
. Introduces Exynos7870 DSIM bridge integration at Exynos DRM DSI layer.
Bug fixups for exynos7_drm_decon.c module
- Remove redundant ctx->suspended state handling
. Cleans up unused state check logic as call flow is now correctly managed.
. Fixes an issue where decon_commit() was blocked from decon_atomic_enable() due to incorrect state setting.