git.ipfire.org Git - thirdparty/kernel/linux.git/commit
drm/amdgpu: Effective health check before reset
author Ce Sun <cesun102@amd.com>
Sat, 26 Jul 2025 12:16:24 +0000 (20:16 +0800)
committer Alex Deucher <alexander.deucher@amd.com>
Mon, 4 Aug 2025 18:27:49 +0000 (14:27 -0400)
commit da467352296f8e50c7ab7057ead44a1df1c81496
tree c83b1901ea355184b6126fa750499af74699eaec
parent 21c0ffa612c98bcc6dab5bd9d977a18d565ee28e

Move amdgpu_device_health_check into amdgpu_device_gpu_recover so that
the presence of the device can be checked before reset.

The reasons are:
1. During a DPC event, the device on which the DPC event occurs is not
   present on the bus.
2. When a DPC event and an ATHUB event occur simultaneously, the DPC
   thread holds the reset domain lock when it detects the error, and the
   GPU recover thread acquires the hive lock. The device is in both the
   amdgpu_ras_in_recovery and occurs_dpc states at once, so the GPU
   recover thread does not reach amdgpu_device_health_check. It waits
   for the reset domain lock held by the DPC thread, which has not
   released it. Meanwhile, the DPC callback slot_reset tries to obtain
   the hive lock, which is held by the GPU recover thread at that time.
   The result is a deadlock.
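
The lock ordering described above can be replayed in a small,
single-threaded model. This is only a sketch under stated assumptions,
not the driver code: `enum owner`, `struct model`, `acquire` and
`deadlocks` are hypothetical names invented for illustration, and the
exact point at which the real health check runs relative to the hive
lock is assumed rather than taken from the patch. The model only shows
that letting the GPU recover thread abort on an absent device breaks the
circular wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these exist in the amdgpu driver. */
enum owner { NOBODY, DPC_THREAD, RECOVER_THREAD };

struct model {
	enum owner reset_domain_lock;
	enum owner hive_lock;
	bool device_present;	/* false while the DPC event is in flight */
};

/* Take a free lock on behalf of a thread. */
static void acquire(enum owner *lock, enum owner who)
{
	assert(*lock == NOBODY);
	*lock = who;
}

/*
 * Replay the interleaving from the commit message.
 * Returns true if both threads end up waiting on each other.
 */
static bool deadlocks(bool health_check_before_reset)
{
	struct model m = { NOBODY, NOBODY, false };

	/* DPC thread detects the error and takes the reset domain lock. */
	acquire(&m.reset_domain_lock, DPC_THREAD);

	/* GPU recover thread takes the hive lock. */
	acquire(&m.hive_lock, RECOVER_THREAD);

	/* Assumed fix: an absent device aborts recovery and drops the lock. */
	if (health_check_before_reset && !m.device_present) {
		m.hive_lock = NOBODY;
		return false;
	}

	/* The recover thread now needs the reset domain lock ... */
	bool recover_blocked = m.reset_domain_lock != NOBODY;
	/* ... while slot_reset in the DPC thread needs the hive lock. */
	bool dpc_blocked = m.hive_lock != NOBODY;

	return recover_blocked && dpc_blocked;
}
```

Calling `deadlocks(false)` models the old behaviour, where the health
check is skipped and each thread waits on the lock the other holds.
Calling `deadlocks(true)` models the check running before the reset, so
the recover thread backs out and slot_reset can take the hive lock.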

Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c