From: Vitaly Prosyak <vitaly.prosyak@amd.com>
Date: Tue, 14 Apr 2026 03:07:55 +0000 (-0400)
Subject: drm/amdgpu: fix NULL pointer dereference in amdgpu_devcoredump_format
X-Git-Tag: v7.1-rc1~24^2~3^2~58
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=d91db49e97382efe38b51f5943c1a7073c0f06cd;p=thirdparty%2Fkernel%2Flinux.git

drm/amdgpu: fix NULL pointer dereference in amdgpu_devcoredump_format

A race condition in the devcoredump code causes a NULL pointer
dereference in amdgpu_devcoredump_format() when multiple GPU resets
occur in quick succession.

The sequence of events:

1. First reset calls amdgpu_coredump(), creates coredump1, sets
   adev->coredump = coredump1, and queues the deferred work.
2. The deferred work begins executing (work_pending() returns false
   since the work is now running, not just queued).
3. A second reset calls amdgpu_coredump(). work_pending() returns
   false because the work is running, so amdgpu_coredump() proceeds:
   creates coredump2, overwrites adev->coredump = coredump2, and
   re-queues the deferred work with queue_work().
4. The first deferred work finishes and unconditionally sets
   adev->coredump = NULL, destroying the reference to coredump2.
5. The re-queued deferred work starts and reads
   adev->coredump = NULL. It then passes this NULL into
   amdgpu_devcoredump_format() which dereferences coredump->adev
   (offset 0 in the struct), triggering:

   KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
   RIP: 0010:amdgpu_devcoredump_format+0xa6/0x36b0 [amdgpu]

This was observed during the amd_deadlock IGT test where multiple
subtests trigger rapid ring resets. The dmesg log shows four
coredumps created within 120ms (at 102.377s, 104.424s, 104.492s,
and 104.497s), with the crash occurring 13ms after the last one.

Fix this with two changes:

- Replace work_pending() with work_busy() in amdgpu_coredump() to
  also reject new coredumps while the deferred work is executing,
  not just when it is queued. This closes the main race window.

- Add a defensive NULL check for adev->coredump at the start of
  amdgpu_devcoredump_deferred_work() to prevent the crash if the
  race still occurs (work_busy() is advisory, not a full barrier).

v2: Drop the job->pasid NULL guard -- that fix was independently
    submitted and merged as commit 4c1f0a162da5 ("drm/amdgpu: add
    job->pasid in check as amdgpu_job could be NULL") by Sunil
    Khatri, reviewed by Christian König.  Integrate with that
    patch as suggested by Christian.

Fixes: 4bbba79a7f1d ("drm/amdgpu: move devcoredump generation to a worker")
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
index 3d7aa6b09815..5e353a83ec0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dev_coredump.c
@@ -464,6 +464,9 @@ static void amdgpu_devcoredump_deferred_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(work, typeof(*adev), coredump_work);
 	struct amdgpu_coredump_info *coredump = adev->coredump;
 
+	if (!coredump)
+		goto end;
+
 	/* Do a one-time preparation of the coredump output because
 	 * repeatingly calling drm_coredump_printer is very slow.
 	 */
@@ -499,7 +502,7 @@ void amdgpu_coredump(struct amdgpu_device *adev, bool skip_vram_check,
 	int i, off, idx;
 
 	/* No need to generate a new coredump if there's one in progress already. */
-	if (work_pending(&adev->coredump_work))
+	if (work_busy(&adev->coredump_work))
 		return;
 
 	if (job && job->pasid)