From: Jiqian Chen
Date: Thu, 4 Jun 2026 10:30:23 +0000 (+0800)
Subject: drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2
X-Git-Tag: v7.2-rc1~10^2~1^2~3
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=85ed06d990ff73212b5a91a406671cabd962e521;p=thirdparty%2Fkernel%2Flinux.git

drm/amdgpu/gfx9: Fix Ring and IB test fail after mode2

For Renoir APUs with gfx9, some test scenarios with ring_reset disabled,
such as accessing an unmapped invalid address, can trigger a GPU job
timeout event. The driver then uses Mode2 reset to reset the GPU, but
after Mode2 the compute Ring test and IB test fail randomly.

This is because the HQDs of the MECs are always active before and after
Mode2. The MECs therefore use stale HQDs when they are unhalted before
the driver restores the MQDs, which leaves CPC and CPF stuck after Mode2
and causes the compute Ring and IB tests to fail.

So, add sequences to deactivate the HQDs of the MECs in the suspend IP
function of the reset process.

v2: Move all sequences into a new function gfx_v9_0_cp_mode2_clear_state (Ray Huang)
    Check for the Mode2 reset method in the if condition (Ray Huang)
v3: Move all sequences before Mode2 instead of after Mode2 (Timur Kristóf)
v4: Call amdgpu_gfx_rlc_enter/exit_safe_mode at the beginning and end of
    gfx_v9_0_deactivate_kcq_hqd (Alex Deucher)

Signed-off-by: Jiqian Chen
Reviewed-by: Huang Rui
Reviewed-by: Timur Kristóf
Reviewed-by: Alex Deucher
Signed-off-by: Alex Deucher
(cherry picked from commit c3988a7ad4799514447294f04f063b422e0551df)
Cc: stable@vger.kernel.org
---

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 47721d0c37812..81a759a987258 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -4071,6 +4071,41 @@ err_priv_inst:
 	return r;
 }
 
+static void gfx_v9_0_deactivate_kcq_hqd(struct amdgpu_device *adev)
+{
+	amdgpu_gfx_rlc_enter_safe_mode(adev, 0);
+	for (int i = 0; i < adev->gfx.num_compute_rings; i++) {
+		u32 tmp;
+		struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
+
+		mutex_lock(&adev->srbm_mutex);
+		soc15_grbm_select(adev, ring->me, ring->pipe, ring->queue, 0, 0);
+		tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
+		/* disable the queue if it's active */
+		if (tmp & CP_HQD_ACTIVE__ACTIVE_MASK) {
+			int j;
+
+			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 1);
+			for (j = 0; j < adev->usec_timeout; j++) {
+				tmp = RREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE);
+				if (!(tmp & CP_HQD_ACTIVE__ACTIVE_MASK))
+					break;
+				udelay(1);
+			}
+			if (j == AMDGPU_MAX_USEC_TIMEOUT) {
+				DRM_DEBUG("comp_%u_%u_%u dequeue request failed.\n",
+					  ring->me, ring->pipe, ring->queue);
+				/* Manual disable if dequeue request times out */
+				WREG32_SOC15(GC, 0, mmCP_HQD_ACTIVE, 0);
+			}
+			WREG32_SOC15(GC, 0, mmCP_HQD_DEQUEUE_REQUEST, 0);
+		}
+		soc15_grbm_select(adev, 0, 0, 0, 0, 0);
+		mutex_unlock(&adev->srbm_mutex);
+	}
+	amdgpu_gfx_rlc_exit_safe_mode(adev, 0);
+}
+
 static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
 {
 	struct amdgpu_device *adev = ip_block->adev;
@@ -4095,6 +4130,10 @@ static int gfx_v9_0_hw_fini(struct amdgpu_ip_block *ip_block)
 		return 0;
 	}
 
+	if ((adev->flags & AMD_IS_APU) && amdgpu_in_reset(adev) &&
+	    amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_MODE2)
+		gfx_v9_0_deactivate_kcq_hqd(adev);
+
 	/* Use deinitialize sequence from CAIL when unbinding device from driver,
 	 * otherwise KIQ is hanging when binding back
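
The deactivation sequence described in the commit message (request a dequeue, poll until the queue reports inactive, force it inactive on timeout, then clear the request) can be sketched outside the kernel as below. This is only an illustration under stated assumptions, not the driver code: struct mock_hqd, mock_read_active(), mock_deactivate_hqd(), HQD_ACTIVE_MASK and the polls_to_drain field are all hypothetical stand-ins for the real registers and helpers, and the timeout check simply compares against the bound passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define HQD_ACTIVE_MASK 0x1u /* hypothetical bit layout for this sketch */

/* Hypothetical stand-in for the state of one hardware queue descriptor. */
struct mock_hqd {
	uint32_t active;             /* bit 0 is set while the queue is active */
	uint32_t dequeue_request;    /* set to 1 to ask the queue to drain */
	unsigned int polls_to_drain; /* reads until the mock drains; 0 = never */
};

/* Mock register read: once a dequeue is requested, the queue goes
 * inactive after polls_to_drain reads (or never, if that field is 0). */
static uint32_t mock_read_active(struct mock_hqd *q)
{
	if (q->dequeue_request && q->polls_to_drain > 0 &&
	    --q->polls_to_drain == 0)
		q->active &= ~HQD_ACTIVE_MASK;
	return q->active;
}

enum deactivate_result { HQD_WAS_IDLE, HQD_DRAINED, HQD_FORCED };

/* Request a dequeue, poll up to 'timeout' times, and force the queue
 * inactive if it never drains. The request is always cleared at the end. */
static enum deactivate_result mock_deactivate_hqd(struct mock_hqd *q,
						  unsigned int timeout)
{
	enum deactivate_result res = HQD_DRAINED;
	unsigned int j;

	if (!(mock_read_active(q) & HQD_ACTIVE_MASK))
		return HQD_WAS_IDLE;

	q->dequeue_request = 1;
	for (j = 0; j < timeout; j++) {
		if (!(mock_read_active(q) & HQD_ACTIVE_MASK))
			break;
		/* a real driver would wait between polls, e.g. udelay(1) */
	}
	if (j == timeout) {
		/* manual disable when the dequeue request timed out */
		q->active = 0;
		res = HQD_FORCED;
	}
	q->dequeue_request = 0;
	return res;
}
```

With a mock queue that drains after three polls, mock_deactivate_hqd() with a timeout of 10 reports HQD_DRAINED. A queue that never drains is forced inactive and reports HQD_FORCED. In both cases the queue ends up inactive and the dequeue request is cleared, which is the state the patch aims for before Mode2 reset.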