The GPU reload test (S3 / mode1 reset / module reload) triggers a
WARN_ON in amdgpu_irq_put() on gfx10 when unloading amdgpu:
WARNING: CPU: 0 PID: 2314 at amd/amdgpu/amdgpu_irq.c:676 amdgpu_irq_put+0xc3/0xe0 [amdgpu]
Call Trace:
gfx_v10_0_hw_fini+0x41/0x150 [amdgpu]
amdgpu_ip_block_hw_fini+0x29/0xc0 [amdgpu]
amdgpu_device_fini_hw+0x315/0x610 [amdgpu]
amdgpu_driver_unload_kms+0x7c/0x90 [amdgpu]
amdgpu_pci_remove+0x51/0x90 [amdgpu]
amdgpu_device_ip_resume_phase2() skips IP blocks whose status.hw is
already set, but amdgpu_device_ip_suspend_phase2() never had the
matching guard, so a block can be suspended twice (e.g. a reset or
recovery issued while the device is already suspended). The second
suspend runs hw_fini again, which now releases the gfx fault IRQs
unconditionally, dropping a refcount that is already zero and tripping
the WARN_ON in amdgpu_irq_put().
The fault/EOP IRQ get/put were balanced through late_init/hw_fini
before, which masked the double-suspend; moving the get into hw_init
made the suspend/resume asymmetry visible as an IRQ refcount underflow.
Honor status.hw in ip_suspend_phase2() so suspend mirrors resume and a
block is only torn down once.
Fixes: 9117d8be850b ("drm/amdgpu/gfx: move fault and EOP IRQ get/put to hw_init/hw_fini")
Fixes: 482f0e538580 ("drm/amdgpu: fix double ucode load by PSP(v3)")
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit
f44f2af13c418969be358b15743f939d705de998)