git.ipfire.org Git - thirdparty/kernel/linux.git/log
12 days ago  drm/amdgpu: convert amdgpu_vm_lock_by_pasid() to drm_exec
Mikhail Gavrilov [Fri, 29 May 2026 06:47:38 +0000 (11:47 +0500)] 
drm/amdgpu: convert amdgpu_vm_lock_by_pasid() to drm_exec

amdgpu_vm_lock_by_pasid() looks up a VM by PASID and reserves its root
PD with a bare amdgpu_bo_reserve(), returning the still-reserved root to
the caller. A caller that then needs to reserve further BOs (for example
the devcoredump IB dump) ends up nesting reservation_ww_class_mutex
acquires without a ww_acquire_ctx, which lockdep flags as recursive
locking.

Convert the helper to take a drm_exec context and lock the root PD with
drm_exec_lock_obj(). Callers now run it inside a
drm_exec_until_all_locked() loop and can lock additional BOs in the same
ww ticket, so there is no nested ww_mutex acquire.

The drm_exec context holds its own reference on the locked root BO, so
the helper no longer hands a root reference back to the caller: the
root output parameter is dropped, and the transient reference taken
across the PASID lookup is released before returning.

The only existing caller, amdgpu_vm_handle_fault(), is updated
accordingly. Its is_compute_context path, which previously dropped the
root reservation around svm_range_restore_pages() and re-took it, now
finalises the drm_exec context and re-initialises a fresh one; behaviour
is otherwise unchanged.

No functional change intended for the page-fault path.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 14682de8ad377bf13ea66e47c26dcfea0b19a21d)

12 days ago  drm/amdgpu: Don't use UTS_RELEASE directly
Uwe Kleine-König (The Capable Hub) [Tue, 28 Apr 2026 14:47:03 +0000 (16:47 +0200)] 
drm/amdgpu: Don't use UTS_RELEASE directly

UTS_RELEASE evaluates to a static string and changes quite easily (e.g.
with uncommitted changes in the source tree or new commits). So when
checking whether a patch changes the resulting binary, each usage of
UTS_RELEASE is a source of annoyance.

Instead of using UTS_RELEASE directly, use init_utsname()->release, which
evaluates to the same string, so that a change of UTS_RELEASE no longer
affects amdgpu_dev_coredump.o.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Link: https://patch.msgid.link/20260428144704.1114562-2-u.kleine-koenig@baylibre.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d785df5598fd1d1cc2f2f45c05448271b6d490b7)

12 days ago  drm/amdkfd: Fix NULL deref during sysfs teardown
Geoffrey McRae [Mon, 1 Jun 2026 13:55:53 +0000 (23:55 +1000)] 
drm/amdkfd: Fix NULL deref during sysfs teardown

Move kfd_process_remove_sysfs() earlier in kfd_process_wq_release() so
that all sysfs/procfs entries are removed before tearing down PDDs and
dropping lead_thread. The per-process sysfs attributes are backed by
struct kfd_process_device, and their show/store callbacks dereference
PDD fields. Since sysfs removal waits for active callbacks to complete,
removing these entries first closes a race where userspace reads sdma_*
and stats_* files after PDD teardown.

Previously this cleanup ran after kfd_process_destroy_pdds(), which
resets p->n_pdds to 0. This meant kfd_process_remove_sysfs() could no
longer walk the PDD array, so the per-PDD sysfs cleanup did not run as
intended.

This race caused NULL pointer dereferences observed in
kfd_sdma_activity_worker and kfd_procfs_stats_show.

Also harden kfd_process_remove_sysfs() against partially
initialized or already-freed objects:
- Check kobj_queues before removing PASID and deleting it
- Guard kobj_stats and kobj_counters before use

These checks prevent invalid dereferences during cleanup.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Geoffrey McRae <geoffrey.mcrae@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 674c692702341fed321720b4b92036c5934fb485)

12 days ago  drm/amdgpu: validate CP_GFX_SHADOW chunk size in CS pass1
Mario Limonciello [Sat, 13 Jun 2026 02:07:24 +0000 (21:07 -0500)] 
drm/amdgpu: validate CP_GFX_SHADOW chunk size in CS pass1

Add a minimum-length check for the AMDGPU_CHUNK_ID_CP_GFX_SHADOW chunk in
amdgpu_cs_pass1(), matching the gate already present for the IB, FENCE and
BO_HANDLES chunk types.

The CP_GFX_SHADOW case previously shared a bare break with the dependency
and syncobj chunk types, which do not dereference a fixed-size struct. When
userspace submits this chunk with length_dw == 0, vmemdup_array_user() is
called with size 0 and returns ZERO_SIZE_PTR, which passes the IS_ERR()
check. amdgpu_cs_p2_shadow() then dereferences chunk->kdata as a struct
drm_amdgpu_cs_chunk_cp_gfx_shadow (reading shadow->flags), faulting on the
ZERO_SIZE_PTR and causing a NULL-pointer dereference.

This is reachable by an unprivileged process in the render group. Reject
undersized chunks with -EINVAL during pass1 so the bad submission is
rejected before pass2 ever dereferences the data.
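
The minimum-length gate described above can be sketched in userspace C.
This is only a model under stated assumptions, not the driver code: the
struct layout and the names model_shadow_payload and
model_check_shadow_chunk are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed-size payload the chunk must hold. */
struct model_shadow_payload {
	uint64_t shadow_va;
	uint64_t csa_va;
	uint64_t gds_va;
	uint64_t flags;
};

/*
 * Return 0 when a chunk of length_dw dwords is large enough to be read as
 * struct model_shadow_payload, -EINVAL otherwise. A zero-length chunk is
 * rejected here, before any later pass would dereference its data.
 */
static int model_check_shadow_chunk(uint32_t length_dw)
{
	uint64_t size = (uint64_t)length_dw * sizeof(uint32_t);

	if (size < sizeof(struct model_shadow_payload))
		return -EINVAL;
	return 0;
}
```

The point of checking in the first pass is that the second pass can then
assume the payload is at least as large as the struct it casts to.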

Fixes: ac9287055ff1 ("drm/amdgpu: add gfx shadow CS IOCTL support")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7f61b2eef7415eccdb40850aca0de94211948657)
Cc: stable@vger.kernel.org

12 days ago  drm/amdgpu: check amdgpu_vm_bo_find() result in GET_MAPPING_INFO
Mario Limonciello [Sat, 13 Jun 2026 02:11:53 +0000 (21:11 -0500)] 
drm/amdgpu: check amdgpu_vm_bo_find() result in GET_MAPPING_INFO

The AMDGPU_GEM_OP_GET_MAPPING_INFO path of amdgpu_gem_op_ioctl() looks
up the bo_va for the buffer object in the caller's VM via
amdgpu_vm_bo_find(), but uses the returned pointer without checking it.

amdgpu_vm_bo_find() returns NULL when the BO has no bo_va in that VM,
which is the normal case for a BO that has never been mapped. The result
is fed straight into amdgpu_vm_bo_va_for_each_valid_mapping(), which
expands to list_for_each_entry(mapping, &(bo_va)->valids, list) and
dereferences bo_va, causing a NULL pointer dereference.

This is reachable by any process able to issue the ioctl (render group)
simply by requesting mapping info for an unmapped BO.

Return -ENOENT when no bo_va is found, jumping to out_exec so the
drm_exec context and GEM object reference are released.
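
A minimal userspace sketch of the pattern: check the lookup result, and on
failure leave through the same cleanup label that releases what was taken
earlier. All names here (model_bo_va, model_get_mapping_info, the obj_refs
counter) are hypothetical and only model the control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-VM mapping state of a buffer object. */
struct model_bo_va {
	int num_mappings;
};

/*
 * bo_va is the result of a lookup that may legitimately be NULL (the
 * object was never mapped). obj_refs models a reference taken earlier in
 * the ioctl that every exit path must drop again.
 */
static int model_get_mapping_info(const struct model_bo_va *bo_va,
				  int *obj_refs, int *count)
{
	int r = 0;

	(*obj_refs)++;		/* reference taken before the lookup */

	if (!bo_va) {
		r = -ENOENT;	/* nothing mapped: report it, do not deref */
		goto out;
	}

	*count = bo_va->num_mappings;
out:
	(*obj_refs)--;		/* shared cleanup for success and failure */
	return r;
}
```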

Fixes: 4d82724f7f2b ("drm/amdgpu: Add mapping info option for GEM_OP ioctl")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 528b19377affc1cc7362a70a254c1dda793595f9)
Cc: stable@vger.kernel.org

12 days ago  drm/amdgpu: initialize irq.lock spinlock earlier
Thadeu Lima de Souza Cascardo [Mon, 8 Jun 2026 19:22:35 +0000 (16:22 -0300)] 
drm/amdgpu: initialize irq.lock spinlock earlier

If there is an early failure during amdgpu probe, such as missing firmware,
the error path ends up calling amdgpu_irq_disable_all, which takes the
irq.lock spinlock before it has been initialized.

Initializing irq.lock earlier, in amdgpu_device_init, fixes the issue.

[   79.334079] INFO: trying to register non-static key.
[   79.334081] The code is fine but needs lockdep annotation, or maybe
[   79.334083] you didn't initialize this object before use?
[   79.334084] turning off the locking correctness validator.
[   79.334088] CPU: 2 UID: 0 PID: 1819 Comm: bash Not tainted 7.1.0-rc5-gfd06300b2348 #96 PREEMPT  8e8f461221633dae3c832d6689eaf0546c0ed4cd
[   79.334092] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0133 08/05/2024
[   79.334094] Call Trace:
[   79.334095]  <TASK>
[   79.334097]  dump_stack_lvl+0x5d/0x80
[   79.334103]  register_lock_class+0x7af/0x7c0
[   79.334109]  __lock_acquire+0x416/0x2610
[   79.334114]  lock_acquire+0xcf/0x310
[   79.334117]  ? amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.334503]  ? _raw_spin_lock_irqsave+0x53/0x60
[   79.334508]  _raw_spin_lock_irqsave+0x3f/0x60
[   79.334510]  ? amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.334881]  amdgpu_irq_disable_all+0x3b/0xf0 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.335240]  amdgpu_device_fini_hw+0x90/0x32c [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.335704]  amdgpu_driver_load_kms.cold+0x22/0x44 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.336159]  amdgpu_pci_probe+0x204/0x440 [amdgpu c88bab43d391d519ad0d5c8e5a099b4aceefa180]
[   79.336494]  local_pci_probe+0x3c/0x80
[   79.336500]  pci_call_probe+0x55/0x2e0
[   79.336505]  ? _raw_spin_unlock+0x2d/0x50
[   79.336508]  ? pci_match_device+0x157/0x180
[   79.336512]  pci_device_probe+0x9b/0x170
[   79.336516]  really_probe+0xd5/0x370
[   79.336521]  __driver_probe_device+0x84/0x150
[   79.336525]  device_driver_attach+0x47/0xb0
[   79.336528]  bind_store+0x73/0xc0
[   79.336531]  kernfs_fop_write_iter+0x176/0x250
[   79.336536]  vfs_write+0x24d/0x560
[   79.336542]  ksys_write+0x71/0xe0
[   79.336546]  do_syscall_64+0x122/0x710
[   79.336550]  ? do_syscall_64+0xd1/0x710
[   79.336553]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[   79.336557] RIP: 0033:0x7f92fd675006
[   79.336561] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
[   79.336562] RSP: 002b:00007ffe4fa867a0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[   79.336565] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f92fd675006
[   79.336567] RDX: 000000000000000d RSI: 000055b2dfce59b0 RDI: 0000000000000001
[   79.336568] RBP: 00007ffe4fa867c0 R08: 0000000000000000 R09: 0000000000000000
[   79.336569] R10: 0000000000000000 R11: 0000000000000202 R12: 000000000000000d
[   79.336570] R13: 000055b2dfce59b0 R14: 00007f92fd7ca5c0 R15: 000055b2dfdbaf70
[   79.336574]  </TASK>

Fixes: 9950cda2a018 ("drm/amdgpu: drop the drm irq pre/post/un install callbacks")
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7dba3e10ecdeec85208e255853fcd3890880b10e)

12 days ago  drm/amdkfd: fix list_del corruption in kfd_criu_resume_svm
Mario Limonciello [Sat, 13 Jun 2026 02:22:04 +0000 (21:22 -0500)] 
drm/amdkfd: fix list_del corruption in kfd_criu_resume_svm

The cleanup tail of kfd_criu_resume_svm() walks
svms->criu_svm_metadata_list and kfree()s each struct criu_svm_metadata
without removing it from the list. The list head is left pointing at
freed kmalloc-96 objects.

A second AMDKFD_IOC_CRIU_OP from the same process re-enters: list_empty()
reads the dangling ->next (use-after-free), the loop walks freed entries,
and each is kfree()'d again (double-free). This is reachable by an
unprivileged render-group user via /dev/kfd with no capabilities required.

Add list_del() before the kfree() so the list is properly emptied. The
list_for_each_entry_safe() iterator already caches the next pointer, so
unlinking during the walk is safe.
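
The unlink-before-free walk can be modelled in userspace C. This sketch
uses a hand-rolled circular list rather than the kernel's list.h, and every
name in it (model_node, model_drain, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, modelled on the kernel idiom. */
struct model_node {
	struct model_node *prev, *next;
};

static void model_list_init(struct model_node *head)
{
	head->prev = head->next = head;
}

static int model_list_empty(const struct model_node *head)
{
	return head->next == head;
}

static void model_list_add_tail(struct model_node *n, struct model_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void model_list_del(struct model_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Append n heap-allocated entries; returns 0 on success, -1 on failure. */
static int model_fill(struct model_node *head, int n)
{
	for (int i = 0; i < n; i++) {
		struct model_node *e = malloc(sizeof(*e));

		if (!e)
			return -1;
		model_list_add_tail(e, head);
	}
	return 0;
}

/*
 * Free every entry. The next pointer is cached before the current entry
 * is unlinked and freed, so the walk never touches freed memory and the
 * head ends up pointing at itself again.
 */
static int model_drain(struct model_node *head)
{
	struct model_node *pos, *tmp;
	int freed = 0;

	for (pos = head->next, tmp = pos->next; pos != head;
	     pos = tmp, tmp = pos->next) {
		model_list_del(pos);
		free(pos);
		freed++;
	}
	return freed;
}
```

Because the head is left empty, a second drain finds nothing to walk,
which is the property the missing unlink broke.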

Fixes: 2a909ae71871 ("drm/amdkfd: CRIU resume shared virtual memory ranges")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6322d278a298e2c1430b9d2697743d3a04b788b1)

12 days ago  drm/radeon: fix r100_copy_blit for large BOs
Pavel Ondračka [Wed, 10 Jun 2026 08:32:45 +0000 (10:32 +0200)] 
drm/radeon: fix r100_copy_blit for large BOs

r100_copy_blit() copies BOs as 1024-pixel-wide ARGB8888 blits, so one
GPU page becomes one blit row. Large copies are split into chunks of at
most 8191 rows.

The kernel register header names the packet coordinate dwords SRC_Y_X
and DST_Y_X. In the BITBLT_MULTI description in
R5xx_Acceleration_v1.5.pdf docs, these correspond to [SRC_X1 | SRC_Y1]
and [DST_X1 | DST_Y1], which are signed 13-bit coordinates in the
-8192..8191 range. The old code kept SRC/DST_PITCH_OFFSET at the BO base
and used SRC_Y_X/DST_Y_X as the chunk address, so large BO moves could
exceed that coordinate range.

Compute per-chunk SRC/DST_PITCH_OFFSET bases and emit zero source and
destination coordinates. r100_copy_blit() already packs
SRC/DST_PITCH_OFFSET as pitch plus base offset, so large chunk addresses
belong there rather than in the coordinate fields.
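
The chunking arithmetic can be sketched as follows. This is a userspace
model only: the 4096-byte row size is derived from the 1024-pixel ARGB8888
row described above, the packing of the PITCH_OFFSET registers is left out,
and the names model_blit_chunk and model_split_blit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ROW_BYTES	4096u	/* 1024 pixels * 4 bytes = one GPU page */
#define MODEL_MAX_ROWS	8191u	/* largest chunk height used by the copy */

/* One blit: where it starts in source and destination, and its height. */
struct model_blit_chunk {
	uint64_t src_base;
	uint64_t dst_base;
	uint32_t rows;
};

/*
 * Split a copy of num_pages pages into chunks of at most MODEL_MAX_ROWS
 * rows. Each chunk carries its own base address, so the blit coordinates
 * can stay at zero regardless of how large the buffer is.
 * Returns the number of chunks written to out (at most max_chunks).
 */
static unsigned int model_split_blit(uint64_t src, uint64_t dst,
				     uint32_t num_pages,
				     struct model_blit_chunk *out,
				     unsigned int max_chunks)
{
	unsigned int n = 0;

	while (num_pages > 0 && n < max_chunks) {
		uint32_t rows = num_pages > MODEL_MAX_ROWS ?
				MODEL_MAX_ROWS : num_pages;

		out[n].src_base = src;
		out[n].dst_base = dst;
		out[n].rows = rows;

		src += (uint64_t)rows * MODEL_ROW_BYTES;
		dst += (uint64_t)rows * MODEL_ROW_BYTES;
		num_pages -= rows;
		n++;
	}
	return n;
}
```

With the old scheme the third chunk of a 20000-page copy would have needed
a Y coordinate of 16382, outside the signed 13-bit range; here it is
expressed as a base offset instead.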

This fixes Prison Architect corruption with 4096x4096 mipped textures
after they are evicted to GTT under memory pressure on RV530.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/work_items/6716
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 87be26aee76239c6da03e599f238a426897f78ad)
Cc: stable@vger.kernel.org

12 days ago  drm/amd/display: Fix mem_type change detection for async flips
Matthew Schwartz [Thu, 11 Jun 2026 15:44:38 +0000 (08:44 -0700)] 
drm/amd/display: Fix mem_type change detection for async flips

[Why]
amdgpu_dm_crtc_mem_type_changed() fetches the "old" and "new" plane state
with two drm_atomic_get_plane_state() calls, which both return the new
state. It compares a state against itself, so it never detects a mem_type
change and never rejects the async flip.

On DCN 3.0.1, this shows up as intermittent corruption when a single DCC
plane is scanned out with immediate flips under gamescope and its buffer
moves between the VRAM carveout and GTT.

[How]
Use drm_atomic_get_old_plane_state() and drm_atomic_get_new_plane_state()
to compare the actual old and new states. These return NULL rather than
an error pointer for a plane that is not part of the commit, so the
IS_ERR() check becomes a NULL check that skips those planes, such as an
unmodified cursor still in the CRTC's plane_mask.

Fixes: 4caacd1671b7 ("drm/amd/display: Do not elevate mem_type change to full update")
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 13158e5dbd896281f3e9982b5437cffa5fd621b2)

12 days ago  drm/amd/display: Add IN_FORMATS_ASYNC support for planes
James Lin [Fri, 12 Jun 2026 14:05:29 +0000 (10:05 -0400)] 
drm/amd/display: Add IN_FORMATS_ASYNC support for planes

[Why]
The DRM core exposes an IN_FORMATS_ASYNC plane property describing the
set of format/modifier pairs that are valid for asynchronous (immediate)
page flips. amdgpu already advertises async page flip support via
mode_config.async_page_flip = true, but never implemented the
.format_mod_supported_async plane callback, so the IN_FORMATS_ASYNC
property was not created.

This inconsistency (advertising async flips while exposing IN_FORMATS but
no IN_FORMATS_ASYNC) causes userspace, such as igt-gpu-tools, to emit a
repeated warning during plane initialization, which in turn demotes many
otherwise passing KMS subtests to a WARN result.

[How]
Wire up .format_mod_supported_async to the existing
amdgpu_dm_plane_format_mod_supported callback so the async format list is
populated. amdgpu does not restrict async flips at the format/modifier
level: the async flip constraints are enforced at atomic check and commit
time and only require a fast update (no change to FB pitch, DCC state,
rotation or memory type) between the old and new buffers. Therefore the
set of formats/modifiers valid for async flips is identical to the
regular IN_FORMATS set, and the same callback can be reused.

Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: James Lin <PingLei.Lin@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8e2d7bbd6b184c0c1b0fe7cb404c9b5214d89931)

12 days ago  drm/amdgpu/gfx: fix cleaner shader IB buffer overflow
Asad Kamal [Fri, 5 Jun 2026 15:44:08 +0000 (23:44 +0800)] 
drm/amdgpu/gfx: fix cleaner shader IB buffer overflow

The cleaner shader sysfs path allocates a 16-dword (64 byte) IB but
incorrectly fills (align_mask + 1) dwords. On GFX rings align_mask is
0xff, so the loop wrote 256 dwords into a 64-byte buffer, causing a
kernel page fault.

The IB only needs to be a minimal NOP shell to schedule the job; the
cleaner shader itself is emitted on the ring via emit_cleaner_shader().
Fill 16 dwords to match the allocation.
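
A small userspace sketch of the idea: the fill loop takes its bound from
the same size that was used for the allocation, never from an unrelated
ring property. MODEL_NOP and model_fill_ib are hypothetical names, and the
NOP value is a placeholder rather than a real packet encoding.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NOP 0x80000000u	/* placeholder value, not a real packet */

/*
 * Fill exactly ib_size_dw dwords of ib with the placeholder NOP and
 * return how many dwords were written. The caller passes the same
 * ib_size_dw it used when sizing the buffer.
 */
static unsigned int model_fill_ib(uint32_t *ib, unsigned int ib_size_dw)
{
	unsigned int i;

	for (i = 0; i < ib_size_dw; i++)
		ib[i] = MODEL_NOP;
	return i;
}
```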

v2: Use ib_size_dw variable (Lijo)

Fixes: d361ad5d2fc0 ("drm/amdgpu: Add sysfs interface for running cleaner shader")
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bf21af331ebf72d0935fd70c73192414a422c03a)
CC: stable@vger.kernel.org

12 days ago  drm/amdgpu: allocate lockdep mutex on the heap to fix stack overflow
Prike Liang [Fri, 5 Jun 2026 07:28:40 +0000 (15:28 +0800)] 
drm/amdgpu: allocate lockdep mutex on the heap to fix stack overflow

Replace the stack-allocated amdgpu_lockdep mutex with a heap allocation
via kmalloc to fix a stack overflow caused by the large struct size.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbae980eefb2f46f31cee12f1f8540d0d79f61ae)

12 days ago  drm/amdkfd: Fix SMI event PID reporting for containers
Andrew Martin [Thu, 28 May 2026 14:32:52 +0000 (10:32 -0400)] 
drm/amdkfd: Fix SMI event PID reporting for containers

SMI events were reporting incorrect PIDs in containerized environments,
causing test failures where container processes expected to see their
namespace-local PIDs but instead received global host PIDs.

The issue had two root causes:

1. Event functions were called from kernel context (page fault handlers,
   migration workers) where 'current' refers to the kernel worker thread,
   not the userspace GPU process that triggered the event.

2. PID conversion used task_tgid_vnr() which returns the PID in the
   caller's namespace (init namespace for kernel threads), not the task's
   own namespace.

This patch updates the SMI event interface:

- Change 8 event function signatures to accept task_struct pointer
  instead of pid_t, allowing proper namespace-aware PID conversion

- Convert PIDs using task_tgid_nr_ns(task, task_active_pid_ns(task))
  which returns the PID as the process sees it via getpid()

- Update 10 call sites to pass p->lead_thread (the GPU process)
  instead of p->lead_thread->pid or current (kernel worker)

This ensures SMI events report container-local PIDs, which is critical
for containerized GPU workloads to correctly correlate events with their
processes.

Tested-by: Andrew Martin <andmarti@amd.com>
Assisted-by: Claude:Sonnet 4-5
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 60271ec06e04ba5d69d68714f3abdf637d86c257)

12 days ago  drm/amd/display: Restore periodic detection for DCN35
Ivan Lipski [Thu, 28 May 2026 16:28:51 +0000 (12:28 -0400)] 
drm/amd/display: Restore periodic detection for DCN35

[Why&How]
Periodic detection callbacks were removed from DCN35 for higher IPS
residency, causing some displays to fail to recover after DPMS sleep. The
monitors bounce HPD ~1.2s after link training, and without periodic
detection the system enters IPS with no mechanism to wake and rediscover
the display.

Restore the periodic detection calls in dcn35_clk_mgr for now. It should
be replaced with a proper IPS-aware solution long term using DMUB.

Also remove it from dcn31 and dcn314_clk_mgr.c since they do not have IPS,
so this should not affect them.

Fixes: 3f6c060846be ("drm/amd/display: Remove periodic detection callbacks from dcn35+")
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5318
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0c300e6a76916e944b6b18a64c73f7895a0fee87)
Cc: stable@vger.kernel.org

12 days ago  drm/amd/display: Skip PHY SSC reduction on some 8K panels
Roman Li [Wed, 20 May 2026 20:50:34 +0000 (16:50 -0400)] 
drm/amd/display: Skip PHY SSC reduction on some 8K panels

[Why]
Some 8K displays cannot tolerate the reduced PHY SSC value
at high link utilization and show corruption or a black screen.

[How]
Add an EDID panel-id quirk that uses the existing skip_phy_ssc_reduction flag.

To pass the link into the quirk handler, change the signature of
apply_edid_quirks() to take link as an argument. The dev local in
dm_helpers_parse_edid_caps() becomes unused and is removed.

Fixes: 5fa62c87cffd ("drm/amd/display: Add option to disable PHY SSC reduction on transmitter enable")
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Roman Li <Roman.Li@amd.com>
Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 144169e7be0831e09958a906d08d1856751aa6c6)

12 days ago  drm/amdgpu: skip already suspended IP blocks in ip_suspend_phase2
Yunxiang Li [Fri, 5 Jun 2026 12:59:34 +0000 (08:59 -0400)] 
drm/amdgpu: skip already suspended IP blocks in ip_suspend_phase2

The GPU reload test (S3 / mode1 reset / module reload) triggers a
WARN_ON in amdgpu_irq_put() on gfx10 when unloading amdgpu:

  WARNING: CPU: 0 PID: 2314 at amd/amdgpu/amdgpu_irq.c:676 amdgpu_irq_put+0xc3/0xe0 [amdgpu]
  Call Trace:
   gfx_v10_0_hw_fini+0x41/0x150 [amdgpu]
   amdgpu_ip_block_hw_fini+0x29/0xc0 [amdgpu]
   amdgpu_device_fini_hw+0x315/0x610 [amdgpu]
   amdgpu_driver_unload_kms+0x7c/0x90 [amdgpu]
   amdgpu_pci_remove+0x51/0x90 [amdgpu]

amdgpu_device_ip_resume_phase2() skips IP blocks whose status.hw is
already set, but amdgpu_device_ip_suspend_phase2() never had the
matching guard, so a block can be suspended twice (e.g. a reset or
recovery issued while the device is already suspended).  The second
suspend runs hw_fini again, which now releases the gfx fault IRQs
unconditionally, dropping a refcount that is already zero and tripping
the WARN_ON in amdgpu_irq_put().

The fault/EOP IRQ get/put were balanced through late_init/hw_fini
before, which masked the double-suspend; moving the get into hw_init
made the suspend/resume asymmetry visible as an IRQ refcount underflow.

Honor status.hw in ip_suspend_phase2() so suspend mirrors resume and a
block is only torn down once.
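
The guard can be illustrated with a userspace model in which each block
tracks whether its hardware is up and how many IRQ references it holds.
The struct and function names are hypothetical, and the underflows counter
stands in for the WARN_ON.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one IP block. */
struct model_ip_block {
	bool hw;		/* true while the hardware is initialized */
	int irq_refs;		/* references taken at init, dropped at fini */
	int underflows;		/* counts puts on a zero refcount */
};

/* Tear the block down, dropping the IRQ reference unconditionally. */
static void model_hw_fini(struct model_ip_block *b)
{
	if (b->irq_refs == 0) {
		b->underflows++;	/* would be a warning in the driver */
		return;
	}
	b->irq_refs--;
}

/* Suspend every block that is still up; skip those already torn down. */
static void model_suspend(struct model_ip_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!blocks[i].hw)
			continue;
		model_hw_fini(&blocks[i]);
		blocks[i].hw = false;
	}
}
```

Calling model_suspend() twice in a row is then harmless: the second call
finds hw already cleared and never reaches the unconditional put.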

Fixes: 9117d8be850b ("drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini")
Fixes: 482f0e538580 ("drm/amdgpu: fix double ucode load by PSP(v3)")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f44f2af13c418969be358b15743f939d705de998)

12 days ago  drm/amdkfd: Properly acquire queue buffers in CRIU restore
David Francis [Thu, 4 Jun 2026 19:04:03 +0000 (15:04 -0400)] 
drm/amdkfd: Properly acquire queue buffers in CRIU restore

When kfd_queue_acquire_buffers() was split off from
set_queue_properties_from_user(), set_queue_properties_from_criu()
was missed. As a result, set_queue_properties_from_criu() does not
fill out the buffer fields of queue_properties, which causes
problems when subsequent code expects them to be non-null.

Add the proper call to kfd_queue_acquire_buffers(), and also
use the right cast types in set_queue_properties_from_criu()
(which were missed at the same time).

Signed-off-by: David Francis <David.Francis@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 88ed96abbbe27b70193544fbc1ee06448c274714)

12 days ago  drm/amd/pm: re-enable MC access after PrepareMp1ForUnload on SMU V15 APUs
Shubhankar Milind Sardeshpande [Thu, 21 May 2026 05:25:18 +0000 (10:55 +0530)] 
drm/amd/pm: re-enable MC access after PrepareMp1ForUnload on SMU V15 APUs

During smu_v15_0_0_system_features_control(), the driver sends a
PrepareMp1ForUnload message to PMFW. PMFW then performs nBIF and SYSHUB
function-level resets (FLR), disabling PCIe CFG space reset, which
clears the framebuffer enable bit to zero and disables MC (memory controller)
access from the host.

Re-enable MC access via the nbio mc_access_enable callback right after
PrepareMp1ForUnload completes in smu_v15_0_0_system_features_control().

Signed-off-by: Shubhankar Milind Sardeshpande <Shubhankar.MilindSardeshpande@amd.com>
Signed-off-by: Suresh Guttula <Suresh.Guttula@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 840a3c5aeae779a3bc75d7f747c3ed18b1af6507)
Cc: stable@vger.kernel.org

12 days ago  drm/amdgpu: initialize iter.start in amdgpu_devcoredump_format
Qiang Yu [Tue, 26 May 2026 06:45:48 +0000 (14:45 +0800)] 
drm/amdgpu: initialize iter.start in amdgpu_devcoredump_format

This fixes reads of /sys/class/drm/cardN/device/devcoredump/data
sometimes returning empty content.

amdgpu_devcoredump_format() leaves struct drm_print_iterator's
.start field uninitialized on the stack before passing it to
drm_coredump_printer(). __drm_puts_coredump() compares the running
.offset against .start to decide whether to skip or copy each
chunk:

    if (iterator->offset < iterator->start) {
        if (iterator->offset + len <= iterator->start) {
            iterator->offset += len;
            return;
        }
        ...
    }
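
To see why a stray .start value yields empty output, the skip-or-copy
decision can be modelled in userspace. This is a simplified sketch with
hypothetical names (model_iter, model_puts), not the DRM printer itself.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the iterator: a window [start, start + remain). */
struct model_iter {
	char *data;	/* destination buffer for the window */
	long start;	/* stream offset at which copying begins */
	long offset;	/* running offset into the stream */
	long remain;	/* bytes still free in data */
};

/* Append str to the stream, copying only the part inside the window. */
static void model_puts(struct model_iter *it, const char *str)
{
	long len = (long)strlen(str);

	if (!it->remain)
		return;

	if (it->offset < it->start) {
		long skip, copy;

		/* Chunk lies entirely before the window: just advance. */
		if (it->offset + len <= it->start) {
			it->offset += len;
			return;
		}

		/* Chunk straddles the window start: copy only its tail. */
		skip = it->start - it->offset;
		copy = len - skip;
		if (copy > it->remain)
			copy = it->remain;
		memcpy(it->data, str + skip, (size_t)copy);
		it->offset = it->start + copy;
		it->remain -= copy;
	} else {
		long pos = it->offset - it->start;

		if (len > it->remain)
			len = it->remain;
		memcpy(it->data + pos, str, (size_t)len);
		it->offset += len;
		it->remain -= len;
	}
}
```

With start set to 0 every chunk is copied. With a large leftover value in
start, each chunk takes the early-return branch and nothing reaches the
buffer, which matches the empty reads described above.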

Fixes: 4bbba79a7f1d ("drm/amdgpu: move devcoredump generation to a worker")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Qiang Yu <Qiang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit cd6397b7af8262a380e188dc32e9de11ff897ed2)

12 days ago  drm/amdkfd: Avoid double-unpin of DOORBELL/MMIO BOs on free
Yunxiang Li [Thu, 4 Jun 2026 16:59:11 +0000 (12:59 -0400)] 
drm/amdkfd: Avoid double-unpin of DOORBELL/MMIO BOs on free

amdgpu_amdkfd_gpuvm_free_memory_of_gpu() unpinned DOORBELL and MMIO
remap BOs (which are pinned at allocation time) before checking whether
the BO is still mapped to the GPU. When the BO is still mapped, the
function returns -EBUSY and leaves the BO alive, but it has already
been unpinned. The BO is then unpinned again when it is finally freed
during process teardown, triggering a ttm_bo_unpin() underflow warning:

  WARNING: CPU: 18 PID: 15066 at ttm/ttm_bo.c:650 amdttm_bo_unpin+0x6d/0x80 [amdttm]
  Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
  RIP: 0010:amdttm_bo_unpin+0x6d/0x80 [amdttm]
  Call Trace:
   amdgpu_bo_unpin+0x1a/0x90 [amdgpu]
   amdgpu_amdkfd_gpuvm_unpin_bo+0x31/0xb0 [amdgpu]
   amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x3bf/0x460 [amdgpu]
   kfd_process_free_outstanding_kfd_bos+0xd4/0x170 [amdgpu]
   kfd_process_wq_release+0x109/0x1b0 [amdgpu]
   process_one_work+0x1e2/0x3b0
   worker_thread+0x50/0x3a0
   kthread+0xdd/0x100
   ret_from_fork+0x29/0x50

Move the unpin after the mapped_to_gpu_memory check so it only happens
once we are committed to freeing the BO.
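
The reordering can be sketched as a userspace model: the early -EBUSY
return must come before any state is released. The names model_bo,
model_unpin and model_free are hypothetical, and the underflows counter
stands in for the unpin warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a pinned buffer object. */
struct model_bo {
	int pin_count;	/* pins taken at allocation time */
	int mapped;	/* number of GPU mappings still present */
	int underflows;	/* counts unpins on a zero pin count */
	bool freed;
};

static void model_unpin(struct model_bo *bo)
{
	if (bo->pin_count == 0) {
		bo->underflows++;	/* would be a warning in the driver */
		return;
	}
	bo->pin_count--;
}

/*
 * Free the object. The mapped check comes first, so a refused free
 * leaves the pin untouched and a later successful free unpins once.
 */
static int model_free(struct model_bo *bo)
{
	if (bo->mapped > 0)
		return -EBUSY;

	model_unpin(bo);
	bo->freed = true;
	return 0;
}
```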

Fixes: d25e35bc26c3 ("drm/amdgpu: Pin MMIO/DOORBELL BO's in GTT domain")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 927c5b2defb9b09856444d94bebfd056a002bd75)

2 weeks ago  Merge tag 'drm-misc-next-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm...
Dave Airlie [Fri, 12 Jun 2026 21:58:44 +0000 (07:58 +1000)] 
Merge tag 'drm-misc-next-fixes-2026-06-11' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

drm-misc-next-fixes for v7.2:
- Fix agp_amd64_probe error propagation.
- Require carveout when PASID is not enabled in amdxdna.
- Clear variable to prevent second unbind in amdxdna.
- Add separate Kconfig option for DMABUF_HEAPS_SYSTEM_CC_SHARED.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patch.msgid.link/c7a9dbb0-a5c8-4e67-904e-1a52b3de9bb4@linux.intel.com

2 weeks ago  dma-buf: move system_cc_shared heap under separate Kconfig
Arnd Bergmann [Wed, 10 Jun 2026 14:23:29 +0000 (19:53 +0530)] 
dma-buf: move system_cc_shared heap under separate Kconfig

While system heap and system_cc_shared heap share a lot of code
and hence the same source file, their users have different needs.

system heap users need it to be a loadable module, while
system_cc_shared heap users don't.

Building as a loadable module breaks system_cc_shared heap on
powerpc and s390 due to un-exported set_memory_encrypted /
set_memory_decrypted functions.

Fix this by reorganising the code to put the system_cc_shared heap
under a new Kconfig symbol, which allows either building both
into the kernel or leaving encryption up to the consumers of the
system heap.

Fixes: fd55edff8a0a ("dma-buf: heaps: system: Turn the heap into a module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
  [sumits: updated DMABUF_HEAPS_CC_SYSTEM to DMABUF_HEAPS_SYSTEM_CC_SHARED]
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Acked-by: Maxime Ripard <mripard@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260610142329.3836808-1-sumit.semwal@linaro.org

2 weeks ago  accel/amdxdna: Clear sva pointer after unbind
Lizhi Hou [Thu, 4 Jun 2026 20:28:15 +0000 (13:28 -0700)] 
accel/amdxdna: Clear sva pointer after unbind

Adding client->sva = NULL after the unbind makes it consistent with how
amdxdna_sva_fini() already clears the pointer after unbinding. The
IS_ERR_OR_NULL guard in sva_fini will then correctly skip the second
unbind.

Fixes: 3cc5d7a59519 ("accel/amdxdna: Add carveout memory support for non-IOMMU systems")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260604202815.2425882-1-lizhi.hou@amd.com

3 weeks ago  Merge tag 'drm-misc-next-fixes-2026-06-05' of https://gitlab.freedesktop.org/drm...
Dave Airlie [Tue, 9 Jun 2026 05:00:01 +0000 (15:00 +1000)] 
Merge tag 'drm-misc-next-fixes-2026-06-05' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-next

drm-misc-next-fixes for v7.2-rc1:
- Revert last minute IS_ERR_OR_NULL changes in nouveau/gsp.
- Fix build warning in drm scheduler.
- Flush caches and TLB before v3d runtime suspend.
- Fix a trace and debug command in amdxdna.
- Fix heap buffer address validation when PASID is disabled in amdxdna.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patch.msgid.link/a4a5bf50-3fc8-4faf-884b-08121687124a@linux.intel.com

3 weeks ago  Merge tag 'amd-drm-next-7.2-2026-06-04' of https://gitlab.freedesktop.org/agd5f/linux...
Dave Airlie [Mon, 8 Jun 2026 09:56:59 +0000 (19:56 +1000)] 
Merge tag 'amd-drm-next-7.2-2026-06-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.2-2026-06-04:

amdgpu:
- UserQ fix
- Userptr fix
- MCCS freesync fix
- Remove some triggerable BUG() calls
- DCN 4.2.1 fixes
- Lockdep annotations
- Guilty handling fix
- VCN 5.3 fix
- FRL fixes
- Bounds checking fixes
- HMM fix
- IRQ accounting fix

amdkfd:
- Fix an event information leak
- Events bounds check fix
- Trap cleanup fix
- Bounds checking fixes
- MES fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260604231801.19979-1-alexander.deucher@amd.com
3 weeks agoagp/amd64: Fix broken error propagation in agp_amd64_probe()
Mingyu Wang [Mon, 4 May 2026 07:48:23 +0000 (15:48 +0800)] 
agp/amd64: Fix broken error propagation in agp_amd64_probe()

A NULL pointer dereference was observed in the AMD64 AGP driver when
running in a virtualized environment (e.g. qemu/kvm) without a physical
AMD northbridge. The crash occurs in amd64_fetch_size() when attempting
to dereference the pointer returned by node_to_amd_nb(0).

The root cause of this crash is broken error propagation in
agp_amd64_probe(): When no AMD northbridges are found, cache_nbs()
correctly returns -ENODEV. However, the probe function erroneously
checks the return value against exactly -1, rather than < 0.

As a result, the hardware absence error is masked, allowing the driver
to improperly proceed with initialization. It eventually calls
agp_add_bridge(), which invokes amd64_fetch_size(). Since the hardware
does not exist, node_to_amd_nb(0) returns NULL, leading to a General
Protection Fault (GPF) when accessing its ->misc member.

Fix the issue by correcting the error check in agp_amd64_probe() to
abort properly when cache_nbs() returns any negative error code. This
prevents the driver from erroneously proceeding without hardware, thereby
avoiding the subsequent NULL pointer dereference at its source.
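
The difference between the two checks can be sketched in userspace C as
follows. This is only an illustration under stated assumptions:
fake_cache_nbs(), probe_checks_minus_one() and probe_checks_negative() are
hypothetical stand-ins and not the driver's actual code.

```c
#include <errno.h>

/* Hypothetical stand-in for cache_nbs(): -ENODEV when no northbridge exists. */
static int fake_cache_nbs(int nb_count)
{
	return nb_count > 0 ? 0 : -ENODEV;
}

/* Old-style check: only the literal value -1 is treated as a failure. */
static int probe_checks_minus_one(int nb_count)
{
	if (fake_cache_nbs(nb_count) == -1)
		return -ENODEV;
	return 0; /* the probe carries on with initialization */
}

/* Fixed check: any negative error code aborts the probe. */
static int probe_checks_negative(int nb_count)
{
	int ret = fake_cache_nbs(nb_count);

	if (ret < 0)
		return ret;
	return 0;
}
```

With no northbridge present, the first variant reports success because
-ENODEV is not equal to -1, while the second one propagates the error.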

Fixes: a32073bffc65 ("[PATCH] x86_64: Clean and enhance up K8 northbridge access code")
Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v2.6.18+
Link: https://patch.msgid.link/20260504074823.99377-1-w15303746062@163.com
3 weeks agoaccel/amdxdna: Require carveout when PASID and force_iova are disabled
Lizhi Hou [Thu, 4 Jun 2026 19:54:59 +0000 (12:54 -0700)] 
accel/amdxdna: Require carveout when PASID and force_iova are disabled

When both PASID and force_iova are disabled, carveout memory should be
used. Reject buffer allocations that cannot use carveout memory in this
configuration and return an error.

Fixes: 3cc5d7a59519 ("accel/amdxdna: Add carveout memory support for non-IOMMU systems")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260604195459.2423279-1-lizhi.hou@amd.com
3 weeks agoMerge tag 'drm-rust-next-2026-06-04' of https://gitlab.freedesktop.org/drm/rust/kerne...
Dave Airlie [Thu, 4 Jun 2026 23:09:38 +0000 (09:09 +1000)] 
Merge tag 'drm-rust-next-2026-06-04' of https://gitlab.freedesktop.org/drm/rust/kernel into drm-next

DRM Rust changes for v7.2-rc1

- Driver Core (shared via signed tag dd-lifetimes-7.2-rc1):

  - Introduce Higher-Ranked Lifetime Types (HRT) for Rust device
    drivers, allowing driver structs to hold device resources like
    pci::Bar and IoMem directly with a lifetime tied to the binding
    scope, removing the need for Devres indirection and ARef<Device>.

  - Replace drvdata() with scoped registration data on the auxiliary
    bus, using the new ForLt trait to thread lifetimes through
    registrations. Remove drvdata() and driver_type.

- DRM:

  - Add GPUVM immediate mode abstraction for Rust GPU drivers:
    - In immediate mode, GPU virtual address space state is updated
      during job execution (in the DMA fence signalling critical path),
      keeping the GPUVM and the GPU's address space always in sync.

    - Provide GpuVm, GpuVa, and GpuVmBo types for managing address
      spaces, virtual mappings, and GEM object backing respectively.

    - Provide split-merge map/unmap operations that handle partial
      overlaps with existing mappings.

    - drm_exec integration for dma_resv locking and GEM object
      validation based on the external/evicted object lists are not
      yet covered and are planned as follow-up work.

  - Introduce DeviceContext type state for drm::Device, allowing
    drivers to restrict operations to contexts where the device is
    guaranteed to be registered (or not yet registered) with userspace.

  - Add FEAT_RENDER flag to the Driver trait for render node support.

- Nova:

  - Hopper/Blackwell enablement:
    - Add GPU identification and architecture-based HAL selection for
      Hopper (GH100) and Blackwell (GB100, GB202).

    - Implement the FSP (Foundation Security Processor) boot path used by
      Hopper and Blackwell, including FSP falcon engine support, EMEM
      operations, MCTP/NVDM message infrastructure, and FSP Chain of
      Trust boot with GSP lockdown release.

    - Add support for 32-bit firmware images and auto-detection of
      firmware image format.

    - Add architecture-specific framebuffer, sysmem flush, PCI config
      mirror, DMA mask, and WPR/non-WPR heap sizing.

  - GSP boot and unload:
    - Refactor the GSP boot process into a chipset-specific HAL,
      keeping the SEC2 and FSP boot paths separated cleanly.

    - Implement proper driver unload: send UNLOADING_GUEST_DRIVER
      command, run Booter Unloader and FWSEC-SB upon unbinding, and run
      the unload bundle on Gsp::boot() failure. This removes the need
      for a manual GPU reset between driver unbind and re-probe.

  - GA100 support:
    - Add support for the GA100 GPU, including IFR header detection and
      skipping, correct fwsignature selection, conditional FRTS boot,
      and documentation of the IFR header layout.

  - VBIOS hardening and refactoring:
    - Harden VBIOS parsing with checked arithmetic, bounds-checked
      accesses, and FromBytes-based structure reads throughout the FWSEC
      and Falcon data paths. Simplify the overall VBIOS module
      structure.

  - HRT adoption:
    - Use lifetime-parameterized pci::Bar directly, replacing the
      Arc<Devres<Bar0>> indirection. Replace ARef<Device> with &'bound
      Device in SysmemFlush and the GSP sequencer. Separate the driver
      type from driver data.

  - Misc:
    - Rename module names to kebab-case (nova-drm, nova-core).

    - Require little-endian in Kconfig, making the existing assumption
      explicit.

- Tyr:

  - Define comprehensive typed register blocks for GPU_CONTROL,
    JOB_CONTROL, MMU_CONTROL (including per-address-space registers),
    and DOORBELL_BLOCK using the kernel register!() macro. This replaces
    manual bit manipulation with typed register and field accessors.

  - Add shmem-backed GEM objects and set DMA mask based on GPU physical
    address width.

  - Adopt HRT: separate driver type from driver data, and use IoMem
    directly instead of Devres for register access during probe.

  - Move clock cleanup into a Drop implementation.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: "Danilo Krummrich" <dakr@kernel.org>
Link: https://patch.msgid.link/DJ0IF39U9ETK.PCCUO7ZEQ4S0@kernel.org
3 weeks agodrm/amdkfd: always resume_all after suspend_all
Alex Deucher [Wed, 6 May 2026 20:50:42 +0000 (16:50 -0400)] 
drm/amdkfd: always resume_all after suspend_all

Need to restore any good queues even if the suspend_all
failed for some.  Always run remove_queue as that will
schedule a GPU reset if removing the queue fails.

v2: move resume_all after remove

Fixes: eb067d65c33e ("drm/amdkfd: Update BadOpcode Interrupt handling with MES")
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini
Yunxiang Li [Wed, 27 May 2026 22:05:37 +0000 (18:05 -0400)] 
drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini

priv_reg / priv_inst / bad_op and (on v11+) userq EOP IRQs are
acquired in late_init but released in hw_fini.  This split forced
gfx_v9_0_hw_fini() to defensively guard each put with
amdgpu_irq_enabled() because hw_fini runs on paths that may not
reach late_init.

amdgpu_ip_block_hw_fini() only runs after hw_init returns success,
and suspend / resume cycle the refs through the same path, so
hw_init / hw_fini pair without any extra tracking.  Move the gets
there and drop the guards.

While here, fix the pre-existing partial-failure leak in
set_userq_eop_interrupts() (gfx11 / 12_0 / 12_1).  amdgpu_irq_get()
increments the refcount before calling .set, so a failure partway
through the loop leaves earlier successful gets stranded.  Track
the loop position and roll back on the enable path.
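
The rollback idea can be sketched in plain C as shown here. The names
(src_get(), src_put(), enable_all(), NUM_SRC) are hypothetical, and the
sketch assumes a get that keeps its reference only on success, which may
differ from how amdgpu_irq_get() behaves on failure.

```c
#include <errno.h>

#define NUM_SRC 4

static int refcount[NUM_SRC];

/* Hypothetical get: fails for index fail_at, otherwise takes a reference. */
static int src_get(int i, int fail_at)
{
	if (i == fail_at)
		return -EIO;
	refcount[i]++;
	return 0;
}

static void src_put(int i)
{
	refcount[i]--;
}

/* Take one reference per source; on failure drop the ones already taken. */
static int enable_all(int fail_at)
{
	int i, r;

	for (i = 0; i < NUM_SRC; i++) {
		r = src_get(i, fail_at);
		if (r) {
			/* i is the failing index: undo indices i-1 .. 0 */
			while (--i >= 0)
				src_put(i);
			return r;
		}
	}
	return 0;
}
```

Tracking the loop index is enough here because the gets are taken in a
fixed order, so the unwind simply walks the same order backwards.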

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: Consult MCCS FreeSync cap only if requested & supported
Michel Dänzer [Mon, 18 May 2026 15:48:09 +0000 (17:48 +0200)] 
drm/amd/display: Consult MCCS FreeSync cap only if requested & supported

When the do_mccs parameter is false, we don't call
dm_helpers_read_mccs_caps, so sink->mccs_caps.freesync_supported is
unlikely to be true.

Fixes: 6f71d5dd3206 ("drm/amd/display: Read sink freesync support via mccs")
Bug: https://gitlab.freedesktop.org/drm/amd/-/work_items/5286
Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: Use strscpy in profile mode parsing
Lijo Lazar [Tue, 19 May 2026 13:00:03 +0000 (18:30 +0530)] 
drm/amd/pm: Use strscpy in profile mode parsing

Use strscpy to copy the buffer, which makes it explicit that a valid
NUL-terminated string gets copied. Also, make it explicit that the source
buffer can be copied safely to the temporary buffer by checking against
its size.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: Fix infinite loop parsing CRAT with zero subtype length
Yongqiang Sun [Mon, 1 Jun 2026 19:28:30 +0000 (15:28 -0400)] 
drm/amdkfd: Fix infinite loop parsing CRAT with zero subtype length

Malformed ACPI CRAT tables can advertise a zero or undersized subtype
length. The parser then fails to advance the cursor and loops forever
while the remaining image still looks large enough for a generic header.

Validate sub_type_hdr->length on each iteration before parsing or
advancing. Return -EINVAL and warn when length is zero or smaller than
the generic subtype header.
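
A minimal userspace sketch of such a per-iteration length check is given
here. The two-byte struct sub_hdr layout and count_subtypes() are
assumptions made for illustration and do not reflect the real CRAT
structures; the check against the remaining size is an extra safeguard of
this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generic subtype header: one type byte, one length byte. */
struct sub_hdr {
	uint8_t type;
	uint8_t length;
};

/*
 * Walk a table of variable-length entries. Returns the number of entries,
 * or -EINVAL when an entry's length could not advance the cursor or would
 * run past the end of the image.
 */
static int count_subtypes(const uint8_t *image, size_t size)
{
	size_t off = 0;
	int count = 0;

	while (size - off >= sizeof(struct sub_hdr)) {
		uint8_t length = image[off + 1];

		/* Zero or undersized length: the cursor would not advance. */
		if (length < sizeof(struct sub_hdr))
			return -EINVAL;
		/* Entry claims more bytes than remain (sketch-only check). */
		if (length > size - off)
			return -EINVAL;

		off += length;
		count++;
	}
	return count;
}
```

Without the first check, an entry with length 0 would leave off unchanged
and the loop condition would stay true forever.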

Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: fix sysfs topology prop length on buffer truncation
Yongqiang Sun [Mon, 1 Jun 2026 19:48:44 +0000 (15:48 -0400)] 
drm/amdkfd: fix sysfs topology prop length on buffer truncation

sysfs_show_gen_prop() accumulated snprintf()'s return value into the
offset. snprintf() reports bytes that would have been written, not
bytes actually written, so a truncated sysfs show could over-report
its length. Use sysfs_emit_at(), which returns only the bytes written.
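
The over-reporting can be reproduced in userspace with plain snprintf().
The helpers here (append_unclamped(), append_clamped()) are hypothetical
and only mimic the accounting difference; they are not the kernel's
sysfs_emit_at() implementation.

```c
#include <stddef.h>
#include <stdio.h>

/* Adds snprintf()'s return value to the offset: can exceed the buffer. */
static size_t append_unclamped(char *buf, size_t size, size_t off, const char *s)
{
	if (off >= size)
		return off;
	return off + (size_t)snprintf(buf + off, size - off, "%s", s);
}

/* Adds only the bytes actually stored, not counting the terminating NUL. */
static size_t append_clamped(char *buf, size_t size, size_t off, const char *s)
{
	int n;

	if (off >= size)
		return off;
	n = snprintf(buf + off, size - off, "%s", s);
	if (n < 0)
		return off;
	if ((size_t)n >= size - off)
		n = (int)(size - off - 1); /* output was truncated */
	return off + (size_t)n;
}
```

For an 8-byte buffer and a 10-character string, the first helper reports a
length of 10 while only 7 characters plus the NUL were stored.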

Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: drop retry loop in amdgpu_hmm_range_get_pages
Honglei Huang [Fri, 29 May 2026 02:23:17 +0000 (10:23 +0800)] 
drm/amdgpu: drop retry loop in amdgpu_hmm_range_get_pages

Since commit c08972f55594 ("drm/amdgpu: fix amdgpu_hmm_range_get_pages")
moved mmu_interval_read_begin() out of the per-chunk loop, the
captured notifier_seq is no longer refreshed across retries. As a
result, the existing -EBUSY retry path can never make progress:

  hmm_range_fault() returns -EBUSY only when
  mmu_interval_check_retry(notifier, notifier_seq) reports that the
  sequence is stale. Once the sequence has advanced, the stored seq
  will never match again, so every subsequent call within the same
  invocation returns -EBUSY immediately.

The "goto retry" therefore degenerates into a busy spin that simply
burns CPU for the full HMM_RANGE_DEFAULT_TIMEOUT (~1s) window before
finally bailing out with -EAGAIN. This is pure latency with no chance
of recovery, and it actively hurts the KFD userptr stack: the caller
ends up blocked for a second while holding mmap_lock, only to return
-EAGAIN to the restore worker (or to userspace) which would have
re-driven the operation immediately anyway.

Drop the retry/timeout entirely and let -EBUSY propagate straight to
out_free_pfns, where it is already translated to -EAGAIN. Recovery is
handled at a higher level: the KFD restore_userptr_worker reschedules
itself, and the userptr ioctl path returns -EAGAIN to userspace.

No functional regression: the previous behaviour on -EBUSY was already
to fail with -EAGAIN after a 1s stall; we just skip the stall.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Honglei Huang <honghuan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: bound OD parameter parsing to stack array size
Candice Li [Wed, 20 May 2026 04:33:18 +0000 (12:33 +0800)] 
drm/amd/pm: bound OD parameter parsing to stack array size

Reject inputs once parameter_size reaches the array limit, and pass
ARRAY_SIZE(parameter) into parse_input_od_command_lines() for defense in
depth.
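
A rough userspace sketch of bounding a parser to the size of its output
array is shown here. parse_params() and its signature are hypothetical and
unrelated to the real parse_input_od_command_lines() prototype.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split a space-separated line into at most max numbers. Returns 0 and
 * stores the count, or -EINVAL once the input holds more values than the
 * caller's array can take. Note that strtok() modifies the input line.
 */
static int parse_params(char *line, long *out, size_t max, size_t *count)
{
	char *tok;
	size_t n = 0;

	for (tok = strtok(line, " "); tok; tok = strtok(NULL, " ")) {
		if (n >= max)
			return -EINVAL; /* array limit reached: reject */
		out[n++] = strtol(tok, NULL, 10);
	}
	*count = n;
	return 0;
}
```

Passing the array capacity in as max keeps the limit next to the write,
so the callee does not rely on the caller having checked it.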

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: Stop pp_od_clk_voltage emit at PAGE_SIZE
Asad Kamal [Wed, 3 Jun 2026 07:11:33 +0000 (15:11 +0800)] 
drm/amd/pm: Stop pp_od_clk_voltage emit at PAGE_SIZE

Stop appending OD sections in amdgpu_get_pp_od_clk_voltage()
once the sysfs page is full, instead of checking every sysfs_emit_at()
in SMU helpers. This is purely defensive hardening.

v2: Drop the prior series that checked sysfs_emit_at()
return values in every SMU *_emit_clk_levels() helper and
smu_cmn_print_*(). (Kevin)

v3: Update description, remove all clamping

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: Unwind debug trap enable on copy_to_user failure
Yongqiang Sun [Tue, 2 Jun 2026 13:59:44 +0000 (09:59 -0400)] 
drm/amdkfd: Unwind debug trap enable on copy_to_user failure

If kfd_dbg_trap_enable() fails while copying runtime_info to userspace,
it has already activated the trap, set debug_trap_enabled, taken an extra
process reference, and opened the debug event file. It returns -EFAULT without
unwinding that state, leaving inconsistent trap state and a refcount
imbalance that could break later DISABLE/ENABLE.

On copy_to_user failure, deactivate the trap and undo the rest of the
enable setup before returning.
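
The unwind pattern can be sketched as follows. struct dbg_state and
enable_trap() are hypothetical, and the copy_fails flag stands in for a
failing copy_to_user(); the real function tracks different state.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state touched while enabling the trap. */
struct dbg_state {
	bool trap_active;
	bool enabled;
	int refs;
	bool file_open;
};

static int enable_trap(struct dbg_state *s, bool copy_fails)
{
	s->trap_active = true;
	s->enabled = true;
	s->refs++;
	s->file_open = true;

	if (copy_fails)
		goto unwind;
	return 0;

unwind:
	/* Undo the setup steps in reverse order before reporting the error. */
	s->file_open = false;
	s->refs--;
	s->enabled = false;
	s->trap_active = false;
	return -EFAULT;
}
```

After a failed call the state matches what it was before the call, so a
later enable starts from a clean slate.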

Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: validate the mes firmware version for gfx12.1
Sunil Khatri [Mon, 1 Jun 2026 14:45:34 +0000 (20:15 +0530)] 
drm/amdgpu: validate the mes firmware version for gfx12.1

MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: validate the mes firmware version for gfx12
Sunil Khatri [Mon, 1 Jun 2026 14:44:50 +0000 (20:14 +0530)] 
drm/amdgpu: validate the mes firmware version for gfx12

MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: compare MES firmware version ucode for gfx11
Sunil Khatri [Mon, 1 Jun 2026 14:41:17 +0000 (20:11 +0530)] 
drm/amdgpu: compare MES firmware version ucode for gfx11

MES firmware should report the same version whether read from
the register or from the firmware ucode binary. This is not
always the case, so add a log when they mismatch.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: Add bounds check for AMDKFD_IOC_WAIT_EVENTS
Sunday Clement [Tue, 19 May 2026 14:02:30 +0000 (10:02 -0400)] 
drm/amdkfd: Add bounds check for AMDKFD_IOC_WAIT_EVENTS

The kfd_wait_on_events ioctl passes a user-supplied num_events parameter
directly to alloc_event_waiters() which calls kcalloc() without validation.
This allows unprivileged users with /dev/kfd access to trigger large kernel
memory allocations, potentially causing memory exhaustion and denial of
service via the OOM killer.

Add a check to reject num_events values exceeding KFD_SIGNAL_EVENT_LIMIT
(4096), which is the maximum number of events a single process can create.
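
The shape of such a check can be sketched in userspace C. The names
SKETCH_EVENT_LIMIT, struct waiter and alloc_waiters_checked() are
hypothetical; only the value 4096 is taken from the description above.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_EVENT_LIMIT 4096u /* hypothetical name, value from the text */

struct waiter {
	uint64_t event_id;
	int activated;
};

/* Validate the caller-supplied count before sizing an allocation with it. */
static int alloc_waiters_checked(uint32_t num_events, struct waiter **out)
{
	*out = NULL;
	if (num_events > SKETCH_EVENT_LIMIT)
		return -EINVAL;
	if (num_events == 0)
		return 0; /* nothing to allocate in this sketch */

	*out = calloc(num_events, sizeof(**out));
	return *out ? 0 : -ENOMEM;
}
```

Rejecting oversized counts up front means the allocation size is bounded
by a constant rather than by whatever the caller passed in.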

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: restart the CS if some parts of the VM are still invalidated
Christian König [Wed, 25 Feb 2026 14:12:02 +0000 (15:12 +0100)] 
drm/amdgpu: restart the CS if some parts of the VM are still invalidated

Make sure that we only submit work with fully up-to-date VM page tables.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: use unsigned types for local pipe and REG_GET counters
Aurabindo Pillai [Tue, 2 Jun 2026 19:17:06 +0000 (15:17 -0400)] 
drm/amd/display: use unsigned types for local pipe and REG_GET counters

Two small type fixes that match how the values are actually consumed:

- decide_zstate_support() iterates from 0 to pipe_count, which is
  unsigned. Make the loop index unsigned int.

- hpo_enc401_read_state() reads HDMI_PIXEL_ENCODING and
  HDMI_DEEP_COLOR_DEPTH via REG_GET_2(), which internally casts the
  output pointer to (uint32_t *). Passing the address of an int is a
  strict-aliasing wart even when the sizes match. Declare the locals
  as uint32_t.

No behavioural change since the values are only compared against small
non-negative constants.

Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: widen dc_hdmi_frl_flags.force_frl_rate to unsigned int
Aurabindo Pillai [Tue, 2 Jun 2026 19:16:16 +0000 (15:16 -0400)] 
drm/amd/display: widen dc_hdmi_frl_flags.force_frl_rate to unsigned int

dc_hdmi_frl_flags.force_frl_rate mirrors dc_debug_options.force_frl_rate,
which was just widened to unsigned int. Match the type here too so the
assignment in link_hdmi_frl.c does not narrow from unsigned to signed.

All call sites in link_hdmi_frl.c only compare the value against 0, 0xF,
or an hdmi_frl_link_rate enum whose values are non-negative, so the
change is behaviour-preserving and does not introduce sign-compare
warnings.

Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/userq: Fix reading timeline points in wait ioctl
David Rosca [Sat, 13 Sep 2025 14:51:02 +0000 (16:51 +0200)] 
drm/amdgpu/userq: Fix reading timeline points in wait ioctl

Use correct u64 type.

Signed-off-by: David Rosca <david.rosca@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/vcn5.0.0: enable secure submission on unified ring for VCN 5.3.0
Jeevana Muthyala [Thu, 14 May 2026 10:56:17 +0000 (16:26 +0530)] 
drm/amdgpu/vcn5.0.0: enable secure submission on unified ring for VCN 5.3.0

Enable secure submission support on the unified ring for VCN IP version
5.3.0 by setting `secure_submission_supported = true` in
vcn_v5_0_0_unified_ring_vm_funcs.

Secure IB submission is supported on VCN 5.3.0 hardware/firmware,
allowing protected decode workloads to bypass the common IB gate.
Without this, secure playback submissions can be blocked and fail.

Other VCN 5.x variants using the same vcn_v5_0_0_ip_block
(e.g. IP_VERSION(5, 0, 0)) do not support secure submission
on the unified ring and therefore continue using non-secure paths.

This change only advertises existing hardware/firmware capability;
non-secure decode paths remain unaffected.

Signed-off-by: Jeevana Muthyala <Jeevana.Muthyala2@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: deprecate guilty handling
Christian König [Tue, 5 May 2026 13:40:04 +0000 (15:40 +0200)] 
drm/amdgpu: deprecate guilty handling

The guilty handling tried to establish a second way of signaling problems with
the GPU back to userspace. This caused quite a bunch of issues we had to work
around, especially lifetime issues with the drm_sched_entity.

Just drop the handling altogether and use the dma_fence based approach instead.

v2: fix reversed condition in entity check (Alex)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: Add lockdep annotations for lock ordering validation
Vitaly Prosyak [Wed, 13 May 2026 20:08:30 +0000 (16:08 -0400)] 
drm/amdgpu: Add lockdep annotations for lock ordering validation

Add lockdep annotations to teach lockdep the correct lock hierarchy
and catch ordering violations during development. This follows the
pattern established by dma-resv in drivers/dma-buf/dma-resv.c.

Lock ordering hierarchy (outermost to innermost):

1. userq_sch_mutex   - Global userq scheduler (enforce_isolation)
2. userq_mutex       - Per-context userq (held across queue create/destroy)
3. notifier_lock     - MMU notifier synchronization
4. vram_lock         - VRAM memory allocator
5. reset_domain->sem - GPU reset synchronization
6. reset_lock        - Reset control mutex
7. srbm_mutex        - SRBM register access
8. grbm_idx_mutex    - GRBM index register access
9. mmio_idx_lock     - MMIO index access (spinlock)

The implementation provides:
- Lock ordering training at module init (amdgpu_lockdep_init)
- Lock class association for real driver locks (amdgpu_lockdep_set_class)

Dummy locks are associated with the same class keys as real driver locks
via lockdep_set_class(), ensuring lockdep connects the training ordering
with actual runtime locks.

Testing:
  Build the kernel with CONFIG_PROVE_LOCKING=y (enables CONFIG_LOCKDEP):
    scripts/config --enable PROVE_LOCKING
    scripts/config --enable DEBUG_LOCKDEP
    make -j$(nproc)

  On boot, dmesg should show:
    AMDGPU: Lockdep annotations initialized (9 lock levels)

  The companion IGT test (tests/amdgpu/amd_lockdep) exercises lock-heavy
  GPU code paths concurrently to trigger lockdep warnings on violations:
    sudo ./build/tests/amdgpu/amd_lockdep
    sudo dmesg | grep -A 50 "circular locking dependency"

  IGT subtests:
    concurrent-reset-and-submit  - reset_sem vs submission locks
    concurrent-mmap-and-evict    - mmap_lock vs vram_lock
    concurrent-userptr-and-reset - notifier_lock vs reset_sem
    stress-all-paths             - all of the above simultaneously

  A clean dmesg (no "circular locking dependency" or "possible recursive
  locking detected" messages) confirms no lock ordering violations.

  For CI integration, the test should be run on kernels compiled with
  CONFIG_LOCKDEP=y; dmesg is scanned post-run for lockdep splats.

v2: (Christian)
- Move notifier_lock and vram_lock before reset locks in hierarchy.
  HMM invalidation holds notifier_lock and can wait for GPU reset
  completion, so notifier_lock must be outer to reset_domain->sem.
- Associate dummy locks with lock class keys via lockdep_set_class()
  so lockdep connects training with real driver locks.
- Update commit message to list all 9 lock levels.

Requires CONFIG_PROVE_LOCKING=y to activate.

Cc: Christian Konig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Christian Konig <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: fix SMI event cross-process information leak
Yongqiang Sun [Wed, 27 May 2026 13:50:47 +0000 (09:50 -0400)] 
drm/amdkfd: fix SMI event cross-process information leak

kfd_smi_ev_enabled() skips the suser privilege check when pid=0.
PROCESS_START, PROCESS_END, and VMFAULT events are emitted with
pid=0 while carrying another process's PID and command name, so any
/dev/kfd user in the render group can monitor all GPU workloads.

Pass the target process PID into kfd_smi_event_add() for these events
so the existing per-client filter restricts delivery to the owning
process or CAP_SYS_ADMIN subscribers.

Signed-off-by: Yongqiang Sun <Yongqiang.Sun@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: Add DCN42B to dml21_translation_helper
Matthew Stewart [Thu, 28 May 2026 22:21:54 +0000 (18:21 -0400)] 
drm/amd/display: Add DCN42B to dml21_translation_helper

Needed for DML to function with DCN42B.

Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com>
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: Fix DCN42B version detection
Matthew Stewart [Wed, 27 May 2026 14:07:02 +0000 (10:07 -0400)] 
drm/amd/display: Fix DCN42B version detection

In resource_parse_asic_id, the check for GC_11_0_4 was unbounded, which
caused it to override the detection of DCN42B.

Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com>
Reviewed-by: Roman Li <Roman.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: Fix user-triggerable BUG()/BUG_ON() calls
Ce Sun [Mon, 18 May 2026 08:44:06 +0000 (16:44 +0800)] 
drm/amdgpu: Fix user-triggerable BUG()/BUG_ON() calls

Replace BUG()/BUG_ON() with error logs and safe returns in several
places where they can be triggered by invalid userspace input,
preventing DoS via kernel panic.

Signed-off-by: Ce Sun <cesun102@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agoMerge tag 'amd-drm-next-7.2-2026-06-03' of https://gitlab.freedesktop.org/agd5f/linux...
Dave Airlie [Thu, 4 Jun 2026 02:06:36 +0000 (12:06 +1000)] 
Merge tag 'amd-drm-next-7.2-2026-06-03' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.2-2026-06-03:

amdgpu:
- BT.2020 fix for DCE
- DC bounds checking fixes
- SDMA 7.1 fix
- UserQ fixes
- SI fix
- SMU 13 fixes
- SMU 14 fixes
- GC 12.1 fix
- Userptr fix
- GC 10.1 fix
- GART fix for non-4K pages
- DCN 4.x fixes
- DCN 4.2 updates
- More DC KUnit tests
- PSR cleanup
- Support for connectors without DDC pins
- Initial DCN 4.2.1 support
- Initial HDMI 2.1 FRL support
- Misc bounds check fixes
- RAS fixes
- GC 11.5.6 support
- SDMA 6.4.0 support
- NBIO 7.11.5 support
- IH 6.4.0 support
- HDP 6.4.0 support
- MMHUB 3.4.2 support
- SMU 15.0.5 support
- ATHUB 3.4.2 support
- VPE 2.2 support
- Devcoredump fixes
- _PR3 fix

amdkfd:
- UAF race fix
- Fix a potential NULL pointer dereference
- GC 11 buffer overflow fix for SDMA
- Profiler locking order fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260604013527.2373534-1-alexander.deucher@amd.com
3 weeks agoMerge tag 'drm-msm-next-2026-05-30' of https://gitlab.freedesktop.org/drm/msm into...
Dave Airlie [Wed, 3 Jun 2026 20:41:21 +0000 (06:41 +1000)] 
Merge tag 'drm-msm-next-2026-05-30' of https://gitlab.freedesktop.org/drm/msm into drm-next

Changes for v7.2

Core:
- Fixed documentation for msm_gem_shrinker functions
- IFPC related enablement/fixes for gen8
- PERFCNTR_CONFIG ioctl support

GPU
- Reworked handling of UBWC configuration
- a810 support

MDSS:
- Added Milos platform support
- Reworked handling of UBWC configuration

DisplayPort:
- Reworked HPD handling, preparing for the MST support

DPU:
- Added Milos platform support
- Reworked handling of UBWC configuration

DSI:
- Added Milos platform support

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <rob.clark@oss.qualcomm.com>
Link: https://patch.msgid.link/CACSVV00DXZcvFH2-C3fouve5DGs0DGa-vvsJPuaRmUZZVNKOfg@mail.gmail.com
3 weeks agogpu: nova-core: move lifetime to `Bar0`
Gary Guo [Tue, 2 Jun 2026 17:04:07 +0000 (18:04 +0100)] 
gpu: nova-core: move lifetime to `Bar0`

Currently Nova code uses `&'a Bar0` a lot. This is `&'a Mmio`, where `Mmio`
represents an owned MMIO region; this type only exists as a target for
`Deref` so `Bar` and `IoMem` can share code, and naming it directly should
be avoided. The upcoming I/O projection series would make the `Io` trait
much simpler to implement, and thus the owned MMIO type would be removed
in favour of a direct `Io` implementation on `Bar` and `IoMem`.

Add a lifetime parameter to `Bar0<'a>` and change it to be an alias of `&'a
pci::Bar<'a, ..>`. This also prepares Nova core so that when the I/O projection
series lands, this could be changed to use an MMIO view type directly, which
avoids double indirection.

Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260602170416.2268531-1-gary@kernel.org
[ Rebase onto latest drm-rust-next (Blackwell enablement). - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
3 weeks agoaccel/amdxdna: Return errors for failed debug BO commands
Lizhi Hou [Fri, 29 May 2026 16:21:22 +0000 (09:21 -0700)] 
accel/amdxdna: Return errors for failed debug BO commands

The config and sync debug BO commands currently may report success even
when the operation fails.

Capture the firmware return status and propagate the corresponding error
to userspace.

Fixes: 7ea046838021 ("accel/amdxdna: Support firmware debug buffer")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260529162122.1976376-1-lizhi.hou@amd.com
3 weeks agoaccel/amdxdna: Remove drv_cmd tracing from job free callback
Lizhi Hou [Fri, 29 May 2026 15:28:37 +0000 (08:28 -0700)] 
accel/amdxdna: Remove drv_cmd tracing from job free callback

aie2_sched_job_free() accesses job->drv_cmd for tracing purposes. However,
job->drv_cmd is owned by the caller and may already have been freed when
the job free callback runs, leading to a potential use-after-free.

Remove the job->drv_cmd access from aie2_sched_job_free().

Fixes: 8711eb2dde2e ("accel/amdxdna: Improve tracing for job lifecycle and mailbox RX worker")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260529152837.1973405-1-lizhi.hou@amd.com
3 weeks agodrm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range
Priya Hosur [Thu, 7 May 2026 08:01:37 +0000 (13:31 +0530)] 
drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range

In smu_v14_0_0_set_soft_freq_limited_range(), the gfxclk floor is
programmed via SetHardMinGfxClk together with SetSoftMaxGfxClk. Under
power_dpm_force_performance_level=high this pins HardMin to peak gfxclk.

In PMFW arbitration HardMin has higher priority than SoftMax, so the
firmware thermal/PPT throttler cannot clamp gfxclk via SoftMax once
HardMin is set to peak. Replace SetHardMinGfxClk with SetSoftMinGfxclk
so the driver still requests peak performance but the firmware
throttler retains the ability to clamp gfxclk under thermal/PPT
pressure. SoftMax handling is unchanged and no other clock domains
are affected.

Signed-off-by: Priya Hosur <Priya.Hosur@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems
Donet Tom [Wed, 27 May 2026 13:19:31 +0000 (18:49 +0530)] 
drm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems

When mapping VRAM pages into the GART page table,
amdgpu_gart_map_vram_range() assumes that the system page size is the
same as the GPU page size.

On systems with non-4K page sizes, multiple GPU pages can exist within
a single CPU page. As a result, the mappings are created incorrectly
because fewer page table entries are programmed than required.

Fix this by programming the mappings correctly for non-4K page size
systems.
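
The entry-count mismatch can be illustrated with a small userspace
sketch. It assumes a 4 KiB GPU page; the constant and helper names are
hypothetical and are not taken from the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed GPU page size for this sketch (4 KiB). */
#define GPU_PAGE_SIZE 4096u

/* Number of GPU page table entries needed to cover one CPU page. */
static inline uint32_t gpu_pages_per_cpu_page(uint32_t cpu_page_size)
{
    assert(cpu_page_size >= GPU_PAGE_SIZE);
    assert(cpu_page_size % GPU_PAGE_SIZE == 0);
    return cpu_page_size / GPU_PAGE_SIZE;
}

/* Entries required to map num_cpu_pages CPU pages of VRAM. */
static inline uint64_t gart_entries_needed(uint64_t num_cpu_pages,
                                           uint32_t cpu_page_size)
{
    return num_cpu_pages * gpu_pages_per_cpu_page(cpu_page_size);
}
```

With 64 KiB CPU pages, each CPU page needs 16 entries in this model, so
programming one entry per CPU page would leave most of the range
unmapped.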

Fixes: 237d623ae659 ("drm/amdgpu/gart: Add helper to bind VRAM pages (v2)")
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agoamd/amdkfd: Fix profiler lock init order
Tvrtko Ursulin [Fri, 29 May 2026 09:23:22 +0000 (10:23 +0100)] 
amd/amdkfd: Fix profiler lock init order

A call chain at driver probe exists where profiler lock is used before it
is initialized:

[   12.131440] kfd kfd: Allocated 3969056 bytes on gart
[   12.131561] kfd kfd: Total number of KFD nodes to be created: 1
[   12.132691] ------------[ cut here ]------------
[   12.132703] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[   12.132705] WARNING: kernel/locking/mutex.c:625 at __mutex_lock+0x616/0x1150, CPU#0: (udev-worker)/569
...
[   12.133051] Call Trace:
[   12.133055]  <TASK>
[   12.133059]  ? mark_held_locks+0x40/0x70
[   12.133068]  ? init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.133671]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
[   12.133683]  ? init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.134235]  init_mqd+0xe1/0x1b0 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.134781]  init_mqd_hiq+0x12/0x30 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.135340]  kq_initialize.constprop.0+0x309/0x400 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.135898]  kernel_queue_init+0x44/0x80 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.136439]  pm_init+0x70/0x100 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.136984]  start_cpsch+0x1dc/0x280 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.137525]  kgd2kfd_device_init+0x70f/0xd10 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.138070]  amdgpu_amdkfd_device_init+0x172/0x230 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]
[   12.138618]  amdgpu_device_init+0x246a/0x2960 [amdgpu 5154987db73e842b9b4f761e2bd86e17c7ada65c]

The human readable call chain is:

kgd2kfd_device_init
  kfd_init_node
    kfd_resume
      node->dqm->ops.start

Where start can be start_cpsch, which calls pm_init, etc, which ends up
calling kq->mqd_mgr->init_mqd, which takes the profiler lock:

init_mqd()
{
...
mutex_lock(&mm->dev->kfd->profiler_lock);
...

Fix it by initializing the mutex at the top of kgd2kfd_device_init().

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a789761de305 ("amd/amdkfd: Add kfd_ioctl_profiler to contain profiler kernel driver changes")
Cc: Benjamin Welton <benjamin.welton@amd.com>
Cc: Perry Yuan <perry.yuan@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Yifan Zhang <yifan1.zhang@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/ras: add ras_suspend callback and use it for cp_ecc_error_irq
Yunxiang Li [Wed, 27 May 2026 18:06:00 +0000 (14:06 -0400)] 
drm/amdgpu/ras: add ras_suspend callback and use it for cp_ecc_error_irq

cp_ecc_error_irq is acquired in amdgpu_gfx_ras_late_init() but
released in gfx_v9_0_hw_fini(), so the put site has to query
amdgpu_irq_enabled() because the get is skipped on SR-IOV VF.

ras_late_init / ras_fini have no suspend counterpart, so move the
put to amdgpu_gfx_ras_suspend() / amdgpu_gfx_ras_fini() and add a
matching ras_suspend callback that is invoked from
amdgpu_ras_suspend() before disable_all_features().  The get and
put now sit in the same place and check the same condition (not
VF, funcs registered), no refcount querying needed.

An active flag gates ras_fini so the
suspend-then-unload-without-resume path falls into
amdgpu_ras_block_late_fini_default() instead of double-releasing
what ras_suspend already cleaned up.

Drop the cp_ecc_error_irq put from gfx_v9_0_hw_fini().  gfx_v8_0
manages cp_ecc_error_irq locally and is unaffected; no other GFX
generation has this IRQ.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: set sub_block_index for mca ras sub-blocks
Yunxiang Li [Mon, 1 Jun 2026 19:15:06 +0000 (15:15 -0400)] 
drm/amdgpu: set sub_block_index for mca ras sub-blocks

The mca ras sub-blocks (mp0, mp1, mpio) all share the
AMDGPU_RAS_BLOCK__MCA block id and are distinguished only by
sub_block_index. The ras manager object for an mca block is selected
with:

con->objs[AMDGPU_RAS_BLOCK__LAST + head->sub_block_index]

Since the rework in commit 7f544c5488cf ("drm/amdgpu: Rework mca ras
sw_init") moved the ras_comm setup into amdgpu_mca_mp*_ras_sw_init() but
left sub_block_index unset, mp0/mp1/mpio all default to index 0 and
collide on the same object slot. mp0 grabs the slot and creates its
sysfs node first; mp1 (and mpio) then find the slot already in use, so
amdgpu_ras_block_late_init() -> amdgpu_ras_sysfs_create() returns
-EINVAL:

  amdgpu: mca.mp1 failed to execute ras_block_late_init_default! ret:-22
  amdgpu: amdgpu_ras_late_init failed -22
  amdgpu: amdgpu_device_ip_late_init failed
  amdgpu: Fatal error during GPU init

The error is currently masked because amdgpu_ras_late_init() does not
check the return value of amdgpu_ras_block_late_init_default(), but it
already leaves mp1/mpio without their sysfs nodes and becomes a fatal
init failure as soon as that return value is honored.

Restore the per-sub-block sub_block_index assignment so each mca
sub-block maps to its own object slot.
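
A minimal sketch of the slot arithmetic, with hypothetical stand-ins for
the driver's enum value and structure, shows why an unset
sub_block_index makes the sub-blocks collide:

```c
#include <stddef.h>

/* Hypothetical stand-in for AMDGPU_RAS_BLOCK__LAST. */
enum { RAS_BLOCK_LAST = 16 };

/* Only the field that matters for slot selection. */
struct ras_head {
    unsigned int sub_block_index;
};

/* Mirrors the slot selection quoted above: base offset plus index. */
static inline size_t mca_obj_slot(const struct ras_head *head)
{
    return (size_t)RAS_BLOCK_LAST + head->sub_block_index;
}
```

Two heads left at index 0 map to the same slot; giving each sub-block a
distinct index separates them.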

Fixes: 7f544c5488cf ("drm/amdgpu: Rework mca ras sw_init")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy
Sunil Khatri [Mon, 25 May 2026 04:26:23 +0000 (09:56 +0530)] 
drm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy

When queue_create fails after the mqd has already been allocated,
wptr_obj is not cleaned up.

Move that cleanup into mqd_destroy so that it covers every cleanup
path as well as the normal teardown of the queue.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/ras: chunk UNIRAS CPER debugfs reads
Xiang Liu [Fri, 29 May 2026 14:11:26 +0000 (22:11 +0800)] 
drm/amd/ras: chunk UNIRAS CPER debugfs reads

Legacy CPER ring readers can issue one debugfs read with a buffer larger
than the UNIRAS RAS command payload limit. Passing that full size to
GET_CPER_RECORD makes the command reject the request, so userspace may
only see the ring prefix and treat the CPER stream as empty.

Commit 3c88fb7aa57d ("drm/amd/ras: bound CPER record fetch buffer
size") intentionally bounds CPER record fetch allocation by the command
buffer size. Keep the debugfs ABI as a single contiguous ring read by
splitting the internal GET_CPER_RECORD requests into
RAS_CMD_MAX_CPER_BUF_SZ chunks.

Accumulate the copied payload and update the legacy header write pointers
from the total bytes returned to userspace.
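
The chunking idea can be sketched in userspace as follows. The chunk
limit, the callback type and the demo source are all hypothetical; the
real code issues GET_CPER_RECORD commands instead of calling a callback:

```c
#include <stddef.h>

/* Hypothetical per-command payload limit for this sketch. */
#define MAX_CHUNK 1024u

/* Fetch up to len bytes at offset off; returns the bytes provided. */
typedef size_t (*fetch_fn)(void *ctx, size_t off, size_t len);

/* Split one large read into MAX_CHUNK-sized requests; return the total. */
static size_t read_chunked(fetch_fn fetch, void *ctx, size_t total)
{
    size_t done = 0;

    while (done < total) {
        size_t want = total - done;
        size_t got;

        if (want > MAX_CHUNK)
            want = MAX_CHUNK;
        got = fetch(ctx, done, want);
        done += got;
        if (got < want)     /* short read: the source is exhausted */
            break;
    }
    return done;
}

/* Demo source: ctx points to a size_t holding the bytes available. */
static size_t fetch_from_counter(void *ctx, size_t off, size_t len)
{
    size_t avail = *(size_t *)ctx;

    if (off >= avail)
        return 0;
    return (avail - off < len) ? avail - off : len;
}
```

The accumulated total, not the size of the last chunk, is what a caller
would use to update its notion of how much was returned.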

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: improve the userq seq BO free bit lookup
Prike Liang [Tue, 26 May 2026 02:25:26 +0000 (10:25 +0800)] 
drm/amdgpu: improve the userq seq BO free bit lookup

Use find_next_zero_bit() to locate the next free seq slot bit
instead of the current walk, for more efficient bitmap scanning.
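
For readers unfamiliar with the helper, here is a simplified userspace
model of its contract: return the first clear bit at or after a start
offset, or the bitmap size if there is none. This is an illustrative
reimplementation under the hypothetical name next_zero_bit(), not the
kernel code:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* First clear bit at or after start, or size if every bit is set. */
static size_t next_zero_bit(const unsigned long *map, size_t size,
                            size_t start)
{
    size_t i = start;

    while (i < size) {
        unsigned long word = map[i / BITS_PER_WORD];

        /* Skip a fully set word in one step when aligned to its start. */
        if (i % BITS_PER_WORD == 0 && word == ~0UL) {
            i += BITS_PER_WORD;
            continue;
        }
        if (!(word & (1UL << (i % BITS_PER_WORD))))
            return i;
        i++;
    }
    return size;
}
```

Skipping whole words that are already full is where the saving over a
plain bit-by-bit walk comes from.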

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: Adjust _PR3 detection
Mario Limonciello [Wed, 20 May 2026 15:46:18 +0000 (10:46 -0500)] 
drm/amdgpu: Adjust _PR3 detection

_PR3 detection was changed in commit 134b8c5d8674 ("drm/amd: Fix
detection of _PR3 on the PCIe root port") to look at the root port
of the topology containing the GPU.  This however went too far because
it ignored whether or not all the intermediary bridges could power
off the device.  The original design in commit b10c1c5b3a4e ("drm/amdgpu:
add check for ACPI power resources") was too narrow because it matched
the switches internal to the GPU.

Use the goldilocks approach and look for the first bridge outside of the
GPU and check for _PR3 on that device.

Fixes: 134b8c5d8674 ("drm/amd: Fix detection of _PR3 on the PCIe root port")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: grow VF RAS bad page table with bounded dynamic alloc
Chenglei Xie [Thu, 7 May 2026 14:29:10 +0000 (10:29 -0400)] 
drm/amdgpu: grow VF RAS bad page table with bounded dynamic alloc

The VF RAS error handler used fixed-size bps[] / bps_bo[] arrays (512
slots). When the PF2VF bad-page block listed more entries than fit,
amdgpu_virt_ras_add_bps() could memcpy() past the end of those arrays.

Replace the fixed backing store with a dynamically grown table:
- Add capacity to track allocated slots separately from count.
- Start at 512 slots and realloc bps / bps_bo together when full.
- Refuse growth beyond maximum EEPROM record limit (AMDGPU_VIRT_RAS_BAD_PAGE_TABLE_MAX_CAPACITY).
- Return failure from amdgpu_virt_ras_add_bps() and stop processing
  the PF2VF block if allocation fails or the cap is reached.
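
A rough userspace sketch of the growth policy listed above. The
capacities and names are hypothetical, doubling is just one possible
growth step, and only a single array is shown, whereas the driver grows
bps and bps_bo together:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes for this sketch; the driver's limits may differ. */
#define TABLE_INIT_CAP 512u
#define TABLE_MAX_CAP  4096u

struct bp_table {
    uint64_t *bps;
    size_t count;       /* entries in use */
    size_t capacity;    /* entries allocated */
};

/* Append one entry, growing when full; fail once the cap is reached. */
static int bp_table_add(struct bp_table *t, uint64_t bp)
{
    if (t->count == t->capacity) {
        size_t new_cap = t->capacity ? t->capacity * 2 : TABLE_INIT_CAP;
        uint64_t *tmp;

        if (new_cap > TABLE_MAX_CAP)
            return -1;  /* refuse growth beyond the cap */
        tmp = realloc(t->bps, new_cap * sizeof(*tmp));
        if (!tmp)
            return -1;  /* old table is still valid */
        t->bps = tmp;
        t->capacity = new_cap;
    }
    t->bps[t->count++] = bp;
    return 0;
}
```

A caller would stop processing further entries as soon as the function
reports failure.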

Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/userq: remove the vital queue unmap logging
Sunil Khatri [Mon, 25 May 2026 07:48:00 +0000 (13:18 +0530)] 
drm/amdgpu/userq: remove the vital queue unmap logging

Mesa's userqueue free path does not wait for the free to complete; it
goes ahead and unmaps the vital BOs while the kernel is still freeing
the queue and doing the corresponding cleanup.

This is expected behaviour, and functionally we make sure to wait for
the required fences before the unmap, so the logging is not needed.
Remove the warning message.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11
Andrew Martin [Thu, 28 May 2026 16:54:39 +0000 (12:54 -0400)] 
drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11

The v11 MQD manager incorrectly assigned the CP-compute variants of
checkpoint_mqd/restore_mqd for KFD_MQD_TYPE_SDMA queues. These functions
use sizeof(struct v11_compute_mqd) (2048 bytes) instead of sizeof(struct
v11_sdma_mqd) (512 bytes), causing a 1536-byte overflow.

During CRIU checkpoint of an SDMA queue on Navi3x:
- checkpoint_mqd() reads 2048 bytes from a 512-byte SDMA MQD buffer,
  leaking 1536 bytes of adjacent GTT memory to userspace

During CRIU restore:
- restore_mqd() writes 2048 bytes into a 512-byte SDMA MQD buffer,
  corrupting 1536 bytes of adjacent GTT memory (often the ring buffer
  or neighboring MQDs)

This is a copy-paste regression unique to v11. All other ASIC backends
(cik, vi, v9, v10, v12) correctly use the SDMA-specific variants.

Add checkpoint_mqd_sdma() and restore_mqd_sdma() functions that properly
handle the smaller v11_sdma_mqd structure, matching the pattern used in
other MQD managers.
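
The size mismatch can be reproduced with stand-in structures of the
sizes quoted above. The structure and function names are placeholders
for illustration, not the driver's definitions:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in structures sized like the ones described above. */
struct compute_mqd { unsigned char raw[2048]; };
struct sdma_mqd    { unsigned char raw[512]; };

/* Copy exactly one SDMA MQD; the size comes from the SDMA type. */
static size_t checkpoint_sdma(void *dst, const struct sdma_mqd *src)
{
    memcpy(dst, src, sizeof(*src));
    return sizeof(*src);
}
```

Using sizeof(struct compute_mqd) in the same copy would touch 1536 bytes
beyond the SDMA buffer, which is the overflow described above.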

Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
Assisted-by: Claude:Sonnet 4-5
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdkfd: fix NULL dereference in get_queue_ids()
Muhammad Bilal [Sat, 23 May 2026 16:56:46 +0000 (16:56 +0000)] 
drm/amdkfd: fix NULL dereference in get_queue_ids()

When usr_queue_id_array is NULL and num_queues is non-zero,
get_queue_ids() returns NULL. The callers check only IS_ERR() on the
return value; since IS_ERR(NULL) == false the check passes, and
suspend_queues() calls q_array_invalidate() which immediately
dereferences NULL while iterating num_queues times.

Userspace can trigger this via kfd_ioctl_set_debug_trap() by supplying
num_queues > 0 with a zero queue_array_ptr, causing a kernel panic.

A NULL usr_queue_id_array with num_queues == 0 is a legitimate no-op
(q_array_invalidate never executes, and resume_queues already guards
all queue_ids dereferences behind a NULL check). Return ERR_PTR(-EINVAL)
only when num_queues is non-zero and the pointer is absent; both callers
already propagate IS_ERR() returns correctly to userspace.
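
The distinction between NULL and an error pointer can be sketched in
userspace. The err_ptr()/is_err() helpers imitate the kernel's
ERR_PTR()/IS_ERR() encoding, and copy_queue_ids() is a hypothetical
analogue of get_queue_ids(), not the driver function:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace imitation of the kernel's error-pointer encoding. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline int is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical analogue of get_queue_ids(): copy the caller's id array. */
static uint32_t *copy_queue_ids(uint32_t num, const uint32_t *user_ids)
{
    uint32_t *ids;

    if (!num)
        return NULL;                /* legitimate no-op */
    if (!user_ids)
        return err_ptr(-EINVAL);    /* a count without an array */
    ids = calloc(num, sizeof(*ids));
    if (!ids)
        return err_ptr(-ENOMEM);
    memcpy(ids, user_ids, (size_t)num * sizeof(*ids));
    return ids;
}
```

Because is_err(NULL) is false, a caller that only checks is_err() is
safe here only as long as NULL is never returned for a non-zero count.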

Fixes: a70a93fa568b ("drm/amdkfd: add debug suspend and resume process queues operation")
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/display: widen FRL debug knobs to unsigned int
Aurabindo Pillai [Tue, 2 Jun 2026 18:53:25 +0000 (14:53 -0400)] 
drm/amd/display: widen FRL debug knobs to unsigned int

force_frl_rate, select_ffe and limit_ffe in dc_debug_options carry
non-negative configuration values: an FRL link-rate enum (0..0xF), an
FFE level selector and an FFE level limit. They are only ever compared
against 0/0xF, assigned, or cast to uint8_t before being written to
hardware. No call site relies on signed semantics.

Make the types unsigned int to match how the values are actually used
and to silence MISRA-style signedness warnings on internal builds.

Signed-off-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
Vitaly Prosyak [Fri, 29 May 2026 17:50:38 +0000 (13:50 -0400)] 
drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)

Problem:
While developing the amd_close_race IGT test (which intentionally triggers
execute permission faults by removing VM_PAGE_EXECUTABLE from GPU page table
entries), we discovered that on Navi10 (GFX 10.1.x) these faults produce
zero diagnostic output. The GPU simply hangs silently for ~10s until the
scheduler timeout fires. There is no way to distinguish an execute
permission fault from any other type of GPU hang.

Root cause:
GFX 10.1.x defaults to noretry=0, which sets
RETRY_PERMISSION_OR_INVALID_PAGE_FAULT=1 in the GFXHUB UTCL2 registers
(gfxhub_v2_0.c line 313). With this bit set, permission faults (valid PTE,
wrong R/W/X bits) are handled entirely within the UTCL1/UTCL2 hardware
loop: UTCL2 returns an XNACK to UTCL1, and UTCL1 re-requests the
translation indefinitely, expecting software to eventually fix the
permission bits (as happens in SVM/HMM recovery). No interrupt of any kind
reaches the IH ring.

This is different from invalid-page faults (V=0) which DO generate a retry
fault interrupt that the driver can escalate to a no-retry fault. Permission
faults with valid PTEs loop silently forever in hardware.

GFX 10.3+ already defaults to noretry=1, which makes permission faults
generate immediate L2 protection fault interrupts. GFX 10.1.x was
inadvertently left out of this default.

Fix:
Change the noretry=1 threshold from IP_VERSION(10, 3, 0) to
IP_VERSION(10, 1, 0) in amdgpu_gmc_noretry_set(). This is a one-line
change that aligns GFX 10.1.x behavior with GFX 10.3+ and all newer
generations.

With noretry=1, the existing non-retry fault handler
(gmc_v10_0_process_interrupt) already decodes and prints the full
GCVM_L2_PROTECTION_FAULT_STATUS register including PERMISSION_FAULTS,
faulting address, VMID, PASID, and process name. No additional logging
code is needed — the fix is purely routing permission faults to the
existing, fully-capable non-retry interrupt handler.

v2: Dropped GFX10-specific logging from gmc_v10_0.c and
kfd_int_process_v10.c (Felix Kuehling). v1 added logging in the retry
fault handler, but with noretry=1 permission faults take the non-retry
path — the v1 retry handler code was dead and would never execute.

Tested on Navi10 (GFX 10.1.10):
- Execute permission faults now produce immediate, clear output:
    [gfxhub] page fault (src_id:0 ring:64 vmid:4 pasid:592)
     Process amd_close_race pid 13380 thread amd_close_race pid 13384
      in page at address 0x40001000 from client 0x1b (UTCL2)
    GCVM_L2_PROTECTION_FAULT_STATUS:0x00700881
         PERMISSION_FAULTS: 0x8
- No regressions with properly-mapped GPU workloads

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed
Timur Kristóf [Mon, 25 May 2026 11:45:02 +0000 (13:45 +0200)] 
drm/amdgpu/gfxhub: Program CRASH_ON_*_FAULT bits to 0 as needed

When the fault stop mode isn't AMDGPU_VM_FAULT_STOP_ALWAYS,
these bits should be programmed to 0.

Program CRASH_ON_NO_RETRY_FAULT and CRASH_ON_RETRY_FAULT
always, to make sure to clear the bits when we don't want
to crash.
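
A minimal sketch of the "always program the field" pattern. The bit
positions and the helper name are hypothetical, since the real register
layout is not shown here:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch. */
#define CRASH_ON_NO_RETRY_FAULT (1u << 0)
#define CRASH_ON_RETRY_FAULT    (1u << 1)

/* Always write both bits: set for stop-always, cleared otherwise. */
static inline uint32_t program_crash_bits(uint32_t reg, bool stop_always)
{
    reg &= ~(CRASH_ON_NO_RETRY_FAULT | CRASH_ON_RETRY_FAULT);
    if (stop_always)
        reg |= CRASH_ON_NO_RETRY_FAULT | CRASH_ON_RETRY_FAULT;
    return reg;
}
```

Clearing first means a value left over from an earlier configuration
cannot survive a change of fault stop mode.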

Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: fix waiting for all submissions for userptrs
Christian König [Wed, 18 Feb 2026 12:05:46 +0000 (13:05 +0100)] 
drm/amdgpu: fix waiting for all submissions for userptrs

Wait for all submissions when userptrs need to be invalidated by the MMU
notifier, not just the one the userptr was involved in.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: drm/amdgpu: Set correct DMA mask for gfx12.1
Harish Kasiviswanathan [Tue, 12 May 2026 14:57:49 +0000 (10:57 -0400)] 
drm/amdgpu: drm/amdgpu: Set correct DMA mask for gfx12.1

Set correct DMA mask for gfx12

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: Use asic specific pte_addr_mask
Harish Kasiviswanathan [Tue, 28 Apr 2026 21:45:06 +0000 (17:45 -0400)] 
drm/amdgpu: Use asic specific pte_addr_mask

For PTE creation use asic specific physical page base address mask

v2: Change variable name from pa_mask to pte_addr_mask

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: zero unused SMU argument registers
Yang Wang [Mon, 11 May 2026 08:33:37 +0000 (16:33 +0800)] 
drm/amd/pm: zero unused SMU argument registers

SMU messages may use fewer arguments than there are argument registers.
The previous code only wrote the registers in use and left the rest
unchanged, so stale values from a prior message could persist.

Write all argument registers for each message and zero the unused tail
to keep command arguments deterministic and avoid unintended carry-over.
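
The approach can be sketched with a plain array standing in for the
argument registers. The register count and function name are
hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of argument registers. */
#define NUM_ARG_REGS 4u

/* Write every argument register: used arguments first, then zeros. */
static void write_msg_args(uint32_t regs[NUM_ARG_REGS],
                           const uint32_t *args, size_t nargs)
{
    for (size_t i = 0; i < NUM_ARG_REGS; i++)
        regs[i] = (i < nargs) ? args[i] : 0;
}
```

After the call, no register holds a value from an earlier message,
whatever the argument count of the current one.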

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: mark metrics.energy_accumulator is invalid for smu 14.0.2
Yang Wang [Fri, 29 May 2026 03:47:31 +0000 (11:47 +0800)] 
drm/amd/pm: mark metrics.energy_accumulator is invalid for smu 14.0.2

EnergyAccumulator is unsupported on SMU 14.0.2, mark it invalid.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: fix smu13 power limit default/cap calculation
Yang Wang [Tue, 19 May 2026 03:18:12 +0000 (11:18 +0800)] 
drm/amd/pm: fix smu13 power limit default/cap calculation

smu_v13_0_0_get_power_limit() and smu_v13_0_7_get_power_limit() mix
runtime power_limit with PP table limits when reporting default/min/max.

When current power limit query succeeds, default_power_limit was set to the
runtime value instead of the PP table default, and min/max could be derived
from inconsistent bases (MsgLimits/runtime), leading to incorrect cap info.

Use SocketPowerLimitAc/Dc as the PP default base (pp_limit), keep
current_power_limit as runtime value, and derive min/max from pp_limit with
OD percentages.
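
As a rough illustration of deriving the caps from a single base, the
sketch below assumes that the OD percentages scale pp_limit up and down
linearly. The structure, names and exact formula are assumptions for
illustration, not the driver's code:

```c
#include <assert.h>
#include <stdint.h>

struct power_caps {
    uint32_t def;   /* PP table default */
    uint32_t min;
    uint32_t max;
};

/* Derive default/min/max from the PP table limit and OD percentages. */
static struct power_caps derive_caps(uint32_t pp_limit,
                                     uint32_t od_up_pct,
                                     uint32_t od_down_pct)
{
    struct power_caps c;

    assert(od_down_pct <= 100);
    c.def = pp_limit;
    c.max = (uint32_t)((uint64_t)pp_limit * (100 + od_up_pct) / 100);
    c.min = (uint32_t)((uint64_t)pp_limit * (100 - od_down_pct) / 100);
    return c;
}
```

The runtime limit is deliberately absent from the inputs: in this model
it is reported separately and never feeds into the default or the caps.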

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5227
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: apply SMU 13.0.10 workaround during MP1 unload
Yang Wang [Thu, 21 May 2026 14:36:37 +0000 (22:36 +0800)] 
drm/amd/pm: apply SMU 13.0.10 workaround during MP1 unload

On SMU v13.0.10, sending PrepareMp1ForUnload with the default
parameter may leave the device in an inaccessible state. This can
affect runtime power management and partial PnP flows, e.g. kexec,
driver unload and boco/d3cold.

Pass the required workaround parameter 0x55 when preparing MP1 for
unload on SMU v13.0.10, and keep the existing behavior for other SMU
versions.

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5133
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/amdxcp: use kasprintf for XCP platform device names
Candice Li [Tue, 19 May 2026 04:47:24 +0000 (12:47 +0800)] 
drm/amd/amdxcp: use kasprintf for XCP platform device names

Replace the fixed stack buffer with kasprintf() so platform
device names are always fully formatted.
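
A userspace analogue of the kasprintf() pattern: measure the formatted
length first, then allocate exactly that much. The function name and
the "%s.%d" name pattern are hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>

/* Format "<base>.<index>" into a heap buffer of exactly the right size. */
static char *format_dev_name(const char *base, int index)
{
    int len = snprintf(NULL, 0, "%s.%d", base, index);
    char *name;

    if (len < 0)
        return NULL;
    name = malloc((size_t)len + 1);
    if (name)
        snprintf(name, (size_t)len + 1, "%s.%d", base, index);
    return name;    /* caller frees; NULL on failure */
}
```

Unlike a fixed stack buffer, the result cannot be truncated by a long
base name or a large index.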

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: use kcalloc in phm table copy helpers
Candice Li [Wed, 20 May 2026 04:14:37 +0000 (12:14 +0800)] 
drm/amd/pm: use kcalloc in phm table copy helpers

Use kcalloc() so multiplication overflow is detected
and allocation fails safely for phm table copy helpers.
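
The overflow check that an array allocator performs can be written out
explicitly in userspace. This is a sketch with a hypothetical helper
name, using calloc() in place of kcalloc():

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Duplicate count entries of entry_size bytes; refuse on size overflow. */
static void *copy_table(const void *src, size_t count, size_t entry_size)
{
    void *dst;

    if (entry_size && count > SIZE_MAX / entry_size)
        return NULL;    /* count * entry_size would wrap */
    dst = calloc(count, entry_size);
    if (dst)
        memcpy(dst, src, count * entry_size);
    return dst;
}
```

With an open-coded multiplication passed to a plain allocator, a wrapped
product would yield a buffer that is too small for the copy that
follows.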

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: NUL-terminate securedisplay debugfs input from userspace
Candice Li [Tue, 19 May 2026 10:42:09 +0000 (18:42 +0800)] 
drm/amdgpu: NUL-terminate securedisplay debugfs input from userspace

Use strncpy_from_user() instead of copy_from_user() before sscanf() in
the securedisplay_test debugfs write handler so a full-length write
cannot leave the stack buffer without a terminator.
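
The underlying rule, reserve one byte for the terminator no matter how
much input arrives, can be sketched in userspace. The helper is
hypothetical and does not model the user-copy itself:

```c
#include <stddef.h>
#include <string.h>

/* Copy at most dst_size - 1 bytes of input and always terminate. */
static size_t copy_terminated(char *dst, size_t dst_size,
                              const char *src, size_t src_len)
{
    size_t n;

    if (!dst_size)
        return 0;
    n = (src_len < dst_size - 1) ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

A parser such as sscanf() run on the destination afterwards cannot read
past the end of the buffer.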

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: validate RAS EEPROM tbl_size before record count
Candice Li [Tue, 19 May 2026 09:31:54 +0000 (17:31 +0800)] 
drm/amdgpu: validate RAS EEPROM tbl_size before record count

Corrupt EEPROM data can set tbl_size below the table header size.
Guard the RAS_NUM_RECS macros against undersized tbl_size and reset
the table during init when tbl_size is below the minimum for the table
version instead of trusting the header.
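
The guard amounts to checking the size before subtracting the header
from it. A minimal sketch with hypothetical header and record sizes:

```c
#include <stdint.h>

/* Hypothetical sizes; the real ones depend on the table version. */
#define TBL_HEADER_SIZE 20u
#define TBL_RECORD_SIZE 24u

/* Records that fit after the header; 0 if tbl_size is undersized. */
static inline uint32_t num_recs(uint32_t tbl_size)
{
    if (tbl_size < TBL_HEADER_SIZE)
        return 0;   /* avoids unsigned wrap-around in the subtraction */
    return (tbl_size - TBL_HEADER_SIZE) / TBL_RECORD_SIZE;
}
```

Without the check, an undersized tbl_size would wrap to a huge unsigned
value and yield an enormous record count.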

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/ras: validate RAS EEPROM tbl_size before record count
Candice Li [Tue, 19 May 2026 09:31:21 +0000 (17:31 +0800)] 
drm/amd/ras: validate RAS EEPROM tbl_size before record count

Corrupt EEPROM data can set tbl_size below the table header size.
Guard the RAS_NUM_RECS macros against undersized tbl_size and reset
the table during init when tbl_size is below the minimum for the table
version instead of trusting the header.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu/pm: fix SmartShift bias sysfs store PM refcount on parse error
Candice Li [Mon, 18 May 2026 11:58:10 +0000 (19:58 +0800)] 
drm/amdgpu/pm: fix SmartShift bias sysfs store PM refcount on parse error

Return the parse error before acquiring PM access.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: return -EINVAL on invalid CCLK OD core index
Candice Li [Tue, 19 May 2026 04:19:38 +0000 (12:19 +0800)] 
drm/amd/pm: return -EINVAL on invalid CCLK OD core index

Return -EINVAL after an out-of-range core index for
PP_OD_EDIT_CCLK_VDDC_TABLE.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: bound pp_dpm_set_pp_table() memcpy
Asad Kamal [Tue, 12 May 2026 07:44:32 +0000 (15:44 +0800)] 
drm/amd/pm: bound pp_dpm_set_pp_table() memcpy

The powerplay path allocates hardcode_pp_table once with kmemdup(...,
soft_pp_table_size). memcpy(..., size) used the sysfs store count (up to
PAGE_SIZE) with no upper bound, causing heap overflow. Reject
writes where size exceeds soft_pp_table_size.
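
The check can be sketched in a few lines. The function name is
hypothetical and the buffers stand in for hardcode_pp_table and the
sysfs store buffer:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Copy a caller-supplied table only if it fits the existing allocation. */
static int store_pp_table(void *table, size_t table_size,
                          const void *buf, size_t size)
{
    if (size > table_size)
        return -EINVAL;     /* would write past the allocation */
    memcpy(table, buf, size);
    return 0;
}
```

An oversized write is rejected and leaves the existing table contents
untouched.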

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: fix duplicated buffer allocation for concurrent
Shiwu Zhang [Wed, 13 May 2026 06:45:54 +0000 (14:45 +0800)] 
drm/amdgpu: fix duplicated buffer allocation for concurrent

When the bin file write handler is called concurrently, use the mutex
to avoid allocating the temporary buffer more than once.

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amdgpu: fix buffer overflow during vBIOS update
Shiwu Zhang [Wed, 13 May 2026 05:54:58 +0000 (13:54 +0800)] 
drm/amdgpu: fix buffer overflow during vBIOS update

Clamp the write position by setting the bin attribute size to the
maximum buffer size, so that the VFS layer blocks out-of-bounds
accesses.

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: Reject negative values in thermal_throttling_logging
Vitaly Prosyak [Fri, 24 Apr 2026 02:30:48 +0000 (22:30 -0400)] 
drm/amd/pm: Reject negative values in thermal_throttling_logging

Discovery: Fuzzing for secure supply chain requirements
Tool: amd_fuzzing_sysfs (IGT test)

The thermal_throttling_logging sysfs store function accepts negative
values like -1 and -9999999, which are nonsensical for a logging interval.

Current behavior:
- Values <= 0 disable logging (intended for 0 only)
- Values 1-3600 enable logging with interval in seconds
- Negative values are accepted and treated as disable

Issue:
Large negative values like -9999999 make no semantic sense and could
indicate input validation bypass attempts. While they functionally
disable logging (same as 0), accepting arbitrary negative values
suggests inadequate input validation.

Fix:
Add explicit check to reject values < 0 before processing.
Only accept:
- 0: disable thermal throttling logging
- 1-3600: enable with interval in seconds (existing validation)

This improves input validation and makes the interface more robust.

Test Results Before Fix:
  thermal_throttling_logging: 6 failures
  - Accepted: 0, -1, -9999999, -2147483648, empty string, 0777

Test Results After Fix:
  thermal_throttling_logging: 3 failures
  - Rejected: -1, -9999999, -2147483648 (now return -EINVAL)
  - Remaining: empty string (VFS behavior), 0 (valid), 0777 (octal)

Tested: amd_fuzzing_sysfs IGT test

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks agodrm/amd/pm: Add empty string validation to sysfs store functions
Vitaly Prosyak [Thu, 23 Apr 2026 23:44:33 +0000 (19:44 -0400)] 
drm/amd/pm: Add empty string validation to sysfs store functions

Discovery: Fuzzing for secure supply chain requirements
Tool: amd_fuzzing_sysfs (IGT test)

The AMDGPU power management sysfs store functions accept whitespace-only
strings when they should reject them with -EINVAL. This was discovered via
systematic fuzzing of sysfs interfaces crossing the user/kernel trust
boundary.

Affected functions:
- amdgpu_set_power_dpm_force_performance_level (power_dpm_force_performance_level)
- amdgpu_set_power_dpm_state (power_dpm_state)
- amdgpu_set_pp_power_profile_mode (pp_power_profile_mode)
- amdgpu_read_mask (used by pp_dpm_sclk/mclk/fclk/socclk/pcie)
- amdgpu_set_pp_features (pp_features)

Impact:
- Whitespace-only writes (e.g., "\n", " ") can cause unexpected behavior
- Better input validation at user/kernel trust boundary
- Defense-in-depth improvement

Root Cause:
The sysfs_streq() function matches whitespace-only strings against empty
string, allowing invalid input to be processed.

Fix:
Add explicit validation at the start of each affected store function:

    if (count == 0 || sysfs_streq(buf, ""))
        return -EINVAL;

This rejects whitespace-only inputs before they are processed. Note that
write() calls with count=0 (truly empty strings) are handled by the VFS
layer before reaching the sysfs .store() callback - the VFS returns 0
(success) without calling the kernel function. This is POSIX-compliant
behavior and cannot be changed at the kernel driver level.

What This Patch Fixes:
- Whitespace-only strings: "\n", " ", "  ", etc. are now rejected
- Defense-in-depth: Explicit validation at trust boundary
- Code clarity: Intent to reject invalid input is explicit

What This Patch Cannot Fix:
- write(fd, "", 0) returning success - this is VFS layer behavior
- Fuzzer tests for empty strings (count=0) will still report "accepted"
  because the VFS handles this before the kernel callback

Test Results After Fix:
- Whitespace strings ("\n", " ") now properly rejected
- Empty string tests (count=0) still show as "accepted" due to VFS behavior
- Overall improvement in input validation robustness
- No impact on valid inputs

This is a defense-in-depth improvement that hardens input validation
even though VFS layer behavior prevents catching all edge cases.

Tested: amd_fuzzing_sysfs IGT test

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amdgpu: fix KASAN slab-out-of-bounds in amdgpu_coredump ring dump
Vitaly Prosyak [Thu, 14 May 2026 22:55:42 +0000 (18:55 -0400)] 
drm/amdgpu: fix KASAN slab-out-of-bounds in amdgpu_coredump ring dump

The ring content dump in amdgpu_coredump() uses two separate loops over
adev->rings[]: the first counts rings with unsignalled fences to size
the allocation, and the second copies ring data into the allocated
buffers.

Both loops use the same condition to skip rings:

    atomic_read(&ring->fence_drv.last_seq) == ring->fence_drv.sync_seq

Because last_seq is an atomic that is updated concurrently by the fence
signalling path, additional rings may appear unsignalled in the second
loop that were signalled during the first. When this happens, idx
exceeds the allocated ring_count and the store to coredump->rings[idx]
writes past the end of the kcalloc-ed buffer.

This was found while running the IGT stress test amd_queue_reset, which
triggers random GPU resets. The OVERSIZE subtest
(CMD_STREAM_EXEC_INVALID_PACKET_LENGTH_OVERSIZE on GFX ring) provokes
a ring timeout and subsequent coredump, which hits the race between
the counting and copying loops. The failure is non-deterministic and
depends on fence signalling timing during the reset.

KASAN log:

  BUG: KASAN: slab-out-of-bounds in amdgpu_coredump+0x1274/0x12f0 [amdgpu]
  Write of size 4 at addr ffff888106154258 by task kworker/u128:5/23625
  CPU: 16 UID: 0 PID: 23625 Comm: kworker/u128:5 Not tainted 6.19.0+ #35
  Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
  Call Trace:
   <TASK>
   dump_stack_lvl+0xa5/0x110
   print_report+0xd1/0x660
   kasan_report+0xf3/0x130
   __asan_report_store4_noabort+0x17/0x30
   amdgpu_coredump+0x1274/0x12f0 [amdgpu]
   amdgpu_job_timedout+0xef0/0x16c0 [amdgpu]
   drm_sched_job_timedout+0x194/0x5c0 [gpu_sched]
   process_one_work+0x84b/0x1990
   worker_thread+0x6b8/0x11b0
   </TASK>

  Allocated by task 23625:
   kasan_save_stack+0x39/0x70
   __kasan_kmalloc+0xc3/0xd0
   __kmalloc_noprof+0x2ec/0x910
   amdgpu_coredump+0x5c5/0x12f0 [amdgpu]
   amdgpu_job_timedout+0xef0/0x16c0 [amdgpu]

  The buggy address belongs to the object at ffff888106154200
   which belongs to the cache kmalloc-rnd-09-96 of size 96
  The buggy address is located 16 bytes to the right of
   allocated 72-byte region [ffff888106154200, ffff888106154248)

72 bytes = 3 * sizeof(struct amdgpu_coredump_ring), so ring_count was 3
but idx reached 3+, writing ring_index (at struct offset 16) 16 bytes
past the allocation.
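
The arithmetic can be checked directly from the numbers in the report. The
sketch below only uses the addresses and sizes printed by KASAN; the macro
and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the KASAN report above. */
#define OBJ_START   0xffff888106154200ULL  /* start of the allocation */
#define FAULT_ADDR  0xffff888106154258ULL  /* address of the 4-byte write */
#define REGION_SIZE 72u                    /* bytes actually allocated */
#define RING_COUNT  3u                     /* entries that were allocated */

/* Bytes from the start of the allocation to the faulting store. */
static size_t fault_offset(void)
{
    return (size_t)(FAULT_ADDR - OBJ_START);
}

/* Size of one entry, derived from the region size and entry count. */
static size_t entry_size(void)
{
    return REGION_SIZE / RING_COUNT;
}

/* Index of the array entry the store landed in. */
static size_t fault_index(void)
{
    return fault_offset() / entry_size();
}

/* Offset of the store inside that entry. */
static size_t fault_field_offset(void)
{
    return fault_offset() % entry_size();
}
```

The store lands 88 bytes into the object: entry index 3 (the fourth entry
of a three-entry array), at offset 16 within that entry, which is 16 bytes
beyond the 72-byte region.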

Fix by adding an idx < ring_count guard to the copy loop so it cannot
exceed the allocated count even when the fence state changes between
the two passes.
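
A minimal userspace sketch of the count-then-copy pattern with the added
guard is shown below. The struct layouts, snapshot_rings() and the
between_passes hook are hypothetical stand-ins used to simulate fence
state changing between the two loops; they are not the driver's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's ring and coredump entry types. */
struct ring { unsigned last_seq, sync_seq; };
struct dump_entry { unsigned ring_index; };

static bool ring_is_idle(const struct ring *r)
{
    return r->last_seq == r->sync_seq;
}

/* Two-pass snapshot: count busy rings, allocate, then copy.
 * between_passes (may be NULL) simulates concurrent fence updates.
 * Returns the number of entries written; *out is heap-allocated. */
static size_t snapshot_rings(struct ring *rings, size_t n,
                             void (*between_passes)(struct ring *, size_t),
                             struct dump_entry **out)
{
    size_t ring_count = 0, idx = 0;

    for (size_t i = 0; i < n; i++)
        if (!ring_is_idle(&rings[i]))
            ring_count++;

    *out = calloc(ring_count ? ring_count : 1, sizeof(**out));
    if (!*out)
        return 0;

    if (between_passes)
        between_passes(rings, n);

    /* The idx < ring_count guard keeps the copy inside the allocation
     * even if more rings look busy now than during the counting pass. */
    for (size_t i = 0; i < n && idx < ring_count; i++) {
        if (ring_is_idle(&rings[i]))
            continue;
        (*out)[idx++].ring_index = (unsigned)i;
    }
    return idx;
}

/* Simulation helper: make every ring look busy. */
static void make_all_busy(struct ring *rings, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rings[i].sync_seq = rings[i].last_seq + 1;
}
```

With two busy rings at count time and make_all_busy() run between the
passes, the copy loop still writes only two entries. The trade-off is that
rings which became busy late are left out of the dump rather than
overflowing the buffer.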

Fixes: eea85914d15b (drm/amdgpu: save ring content before resetting the device)
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amdgpu/vpe: add vpe v2.2.0 support
Caden Chien [Mon, 18 May 2026 05:22:23 +0000 (13:22 +0800)] 
drm/amdgpu/vpe: add vpe v2.2.0 support

This initializes VPE IP version 2.2.0.

Signed-off-by: Caden Chien <chih-wei.chien@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amdgpu/nbio: enable doorbell range init for vpe on v7.11.5
Caden Chien [Fri, 1 May 2026 16:09:13 +0000 (00:09 +0800)] 
drm/amdgpu/nbio: enable doorbell range init for vpe on v7.11.5

This initializes doorbell entry 5 for vpe on v7.11.5.

Signed-off-by: Caden Chien <chih-wei.chien@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amdgpu: harden FRU PIA parsing with bounded helpers
Stanley.Yang [Tue, 12 May 2026 11:10:23 +0000 (19:10 +0800)] 
drm/amdgpu: harden FRU PIA parsing with bounded helpers

Replace the open-coded TLV walk with fru_pia_advance()
and fru_pia_copy_field() helpers that bound every read
by the actual EEPROM data length, preventing out-of-bounds
reads on truncated or malformed FRU data.
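
The idea of a bounded TLV walk can be sketched in userspace as follows.
The signatures of fru_pia_advance() and fru_pia_copy_field() are not shown
in this message, so tlv_copy_field() below is a hypothetical single helper
rather than the driver's API. It assumes the IPMI FRU type/length byte
layout (length in the low six bits, 0xC1 as the end-of-fields marker).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRU_END_OF_FIELDS 0xC1u  /* type/length byte that ends an area */
#define FRU_LEN_MASK      0x3Fu  /* low six bits hold the field length */

/* Hypothetical helper: copy the field whose type/length byte sits at *pos
 * into dst (NUL-terminated, truncated to dst_size) and advance *pos past
 * it. Returns false, leaving *pos untouched, at the end marker or when
 * the field would run past the len bytes actually read. */
static bool tlv_copy_field(const uint8_t *buf, size_t len, size_t *pos,
                           char *dst, size_t dst_size)
{
    size_t flen, n;

    if (dst_size == 0 || *pos >= len || buf[*pos] == FRU_END_OF_FIELDS)
        return false;

    flen = buf[*pos] & FRU_LEN_MASK;
    if (flen > len - *pos - 1)   /* field body would cross the data end */
        return false;

    n = flen < dst_size - 1 ? flen : dst_size - 1;
    memcpy(dst, &buf[*pos + 1], n);
    dst[n] = '\0';
    *pos += 1 + flen;
    return true;
}
```

A caller would invoke the helper once per expected field and stop at the
first false return, so a truncated buffer ends the walk instead of being
read past its end.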

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amd/ras: make UNIRAS CPER debugfs header legacy-compatible
Xiang Liu [Fri, 29 May 2026 14:10:09 +0000 (22:10 +0800)] 
drm/amd/ras: make UNIRAS CPER debugfs header legacy-compatible

The UNIRAS CPER debugfs path returned a zeroed 12-byte prefix and used
file offset directly as the CPER record index. Legacy CPER ring readers
expect the prefix to contain three 32-bit ring pointers followed
immediately by CPER payload data.

Build the same header shape for UNIRAS reads by reporting a zero read
pointer and matching write pointers for the returned payload size. Keep
an internal record cursor behind the debugfs offset so follow-up reads
continue from the correct CPER record while first reads still expose the
legacy prefix.

Signed-off-by: Xiang Liu <xiang.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amd/ras: Remove redundant error log
Stanley.Yang [Tue, 26 May 2026 04:06:08 +0000 (12:06 +0800)] 
drm/amd/ras: Remove redundant error log

amdgpu_ras_inject_error() currently prints an extra "ras inject block %u
failed" message. Remove the redundant log.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
3 weeks ago drm/amd/ras: snapshot remote cmd header to fix double-fetch
Stanley.Yang [Tue, 12 May 2026 07:44:28 +0000 (15:44 +0800)] 
drm/amd/ras: snapshot remote cmd header to fix double-fetch

The response header lives in PF-controlled shared memory. Copy it
into a local struct once, then read cmd_res and output_size from the
snapshot so the PF cannot flip cmd_res or grow output_size between
checks.
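
The snapshot pattern can be illustrated with a short userspace sketch.
Only the field names cmd_res and output_size come from this message; the
struct layout, read_response(), the max_output limit and the convention
that cmd_res == 0 means success are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical response header layout. */
struct resp_hdr {
    uint32_t cmd_res;      /* assumed: 0 means success */
    uint32_t output_size;  /* payload size claimed by the other side */
};

/* Copy the header out of shared memory exactly once, then validate and
 * use only the local copy. Returns 0 and stores the payload size in
 * *out_len, or -1 if the snapshot fails a check. */
static int read_response(const volatile void *shared, size_t max_output,
                         size_t *out_len)
{
    struct resp_hdr snap;
    const volatile uint8_t *src = shared;

    /* Single fetch: every later check reads snap, never shared memory. */
    for (size_t i = 0; i < sizeof(snap); i++)
        ((uint8_t *)&snap)[i] = src[i];

    if (snap.cmd_res != 0)
        return -1;
    if (snap.output_size > max_output)
        return -1;

    *out_len = snap.output_size;
    return 0;
}
```

Because the size that is checked and the size that is later used are the
same local value, a writer that modifies the shared header after the copy
cannot invalidate the bounds check.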

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>