git.ipfire.org Git - thirdparty/kernel/linux.git/commit
drm/amdgpu: fix recursive ww_mutex acquire in amdgpu_devcoredump_format
author Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Fri, 29 May 2026 06:47:39 +0000 (11:47 +0500)
committer Alex Deucher <alexander.deucher@amd.com>
Wed, 17 Jun 2026 22:23:43 +0000 (18:23 -0400)
commit 7152b248dc3c8d5fa8629e99ed5655dd41b51562
tree 1175d0cadcd175fd154560d4b222cde165f43562
parent 9920249a5288e7cbec222cd52996bbd9aac7ec9e
drm/amdgpu: fix recursive ww_mutex acquire in amdgpu_devcoredump_format

When dumping IB contents from a hung job, amdgpu_devcoredump_format()
acquired the VM root PD's reservation via amdgpu_vm_lock_by_pasid() and
then, for each IB, called amdgpu_bo_reserve() on the BO backing the IB.
Both reservations are reservation_ww_class_mutex objects and neither
used a ww_acquire_ctx, which trips lockdep:

  WARNING: possible recursive locking detected
  --------------------------------------------
  kworker/u128:0 is trying to acquire lock:
  ffff88838b16e1f0 (reservation_ww_class_mutex){+.+.}-{4:4},
    at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]

  but task is already holding lock:
  ffff8882f82681f0 (reservation_ww_class_mutex){+.+.}-{4:4},
    at: amdgpu_devcoredump_format+0x1594/0x23f0 [amdgpu]

   Possible unsafe locking scenario:
         CPU0
         ----
    lock(reservation_ww_class_mutex);
    lock(reservation_ww_class_mutex);

   *** DEADLOCK ***
   May be due to missing lock nesting notation

  Workqueue: events_unbound amdgpu_devcoredump_deferred_work [amdgpu]
  Call Trace:
   __ww_mutex_lock.constprop.0
   ww_mutex_lock
   amdgpu_bo_reserve
   amdgpu_devcoredump_format+0x1594 [amdgpu]
   amdgpu_devcoredump_deferred_work+0xea [amdgpu]

The two reservations are on different BOs in the captured trace, so the
splat is a lockdep-correctness warning, not an observed deadlock. It
becomes a real self-deadlock whenever the IB BO shares its dma_resv with
the root PD (the always-valid case, see amdgpu_vm_is_bo_always_valid()):
amdgpu_bo_reserve(abo) re-acquires the same ww_mutex without a ticket
and blocks forever. With amdgpu.gpu_recovery=0 the timeout handler
refires every ~2 s and each invocation produces this splat, drowning the
kernel ring buffer.

Now that amdgpu_vm_lock_by_pasid() takes a drm_exec context, move the IB
dumping into a separate helper that locks the root PD and every IB BO
together in a single drm_exec ticket. DRM_EXEC_IGNORE_DUPLICATES handles
IB BOs that share a dma_resv (e.g. always-valid BOs, or two IBs backed
by the same BO). Every lock is now a top-level acquire under one
ww_acquire_ctx, so the recursive ww_mutex condition is gone, and the
per-IB amdgpu_bo_reserve()/amdgpu_bo_unref() dance -- including a BO
refcount leak on the amdgpu_bo_reserve() failure path -- is removed.
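
The difference between the old and new locking patterns can be
illustrated with a small userspace model. This is only a sketch and not
the kernel code from this patch: toy_resv, toy_ticket, toy_lock() and the
other toy_* names are hypothetical stand-ins for dma_resv, ww_acquire_ctx
and the ww_mutex/drm_exec calls. The model is single-threaded, so it
returns -EDEADLK where the real ticketless ww_mutex_lock() would block
forever, and it omits unlocking.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a ww_acquire_ctx. */
struct toy_ticket { int id; };

/* Hypothetical stand-in for a dma_resv (one ww_mutex). */
struct toy_resv {
	int locked;                      /* 1 while held */
	const struct toy_ticket *owner;  /* ticket holding it, NULL if none */
};

/*
 * Model of a ww_mutex acquire:
 *   0         lock taken
 *   -EALREADY the same ticket already holds this lock
 *   -EDEADLK  stand-in for "would block forever" in this
 *             single-threaded model (lock held, no matching ticket)
 */
static int toy_lock(struct toy_resv *r, const struct toy_ticket *t)
{
	if (r->locked) {
		if (t && r->owner == t)
			return -EALREADY;
		return -EDEADLK;
	}
	r->locked = 1;
	r->owner = t;
	return 0;
}

/* Model of DRM_EXEC_IGNORE_DUPLICATES: a duplicate counts as success. */
static int toy_lock_ignore_dup(struct toy_resv *r, const struct toy_ticket *t)
{
	int ret = toy_lock(r, t);

	return ret == -EALREADY ? 0 : ret;
}

/* Old pattern: root PD, then the IB BO, both without a ticket. */
static int old_pattern(struct toy_resv *root, struct toy_resv *ib)
{
	int ret = toy_lock(root, NULL);

	if (ret)
		return ret;
	return toy_lock(ib, NULL);
}

/* New pattern: root PD and every IB BO under one ticket. */
static int new_pattern(struct toy_resv *root, struct toy_resv **ibs,
		       size_t n, const struct toy_ticket *t)
{
	int ret = toy_lock_ignore_dup(root, t);
	size_t i;

	for (i = 0; !ret && i < n; i++)
		ret = toy_lock_ignore_dup(ibs[i], t);
	return ret;
}
```

In this model, old_pattern() succeeds when the two reservations differ
(the case in the captured trace, where only lockdep complains) and fails
when the IB BO shares its reservation with the root PD. new_pattern()
succeeds in both cases, including when two IBs are backed by the same BO.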

Fixes: 7b15fc2d1f1a ("drm/amdgpu: dump job ibs in the devcoredump")
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d6bf4242731219ee08ce54c365631e395486651e)
drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c