]> git.ipfire.org Git - thirdparty/kernel/linux.git/commit
drm/xe: fix job timeout recovery for unstarted jobs and kernel queues
authorRodrigo Vivi <rodrigo.vivi@intel.com>
Wed, 10 Jun 2026 15:25:49 +0000 (11:25 -0400)
committerMatthew Brost <matthew.brost@intel.com>
Thu, 11 Jun 2026 13:39:43 +0000 (06:39 -0700)
commit347ccc0453fca2c669e8dc8a72000e76ca4adf10
tree092779b1e82f7b088e379c4179af4bd7c7831156
parentba36786b21d19082e696eda85bfcd49e7071944a
drm/xe: fix job timeout recovery for unstarted jobs and kernel queues

A job that GuC never scheduled (never started) indicates a GuC
scheduling failure; previously such jobs were silently errored out
instead of triggering a GT reset to recover. Trigger a GT reset and
resubmit them, but only when the queue was not already killed or banned:
an unstarted job on an already banned queue is the ban working as
intended and must neither clear the ban nor kick off a reset, otherwise
a banned userspace queue could be resurrected and spam GT resets.

Kernel queues are always recovered this way and wedge the device once
recovery attempts are exhausted, since kernel work must not silently
fail. A started job that times out on a userspace VM bind queue stays
banned rather than being reset and retried.

The queue is banned early in the timeout handler to signal the G2H
scheduling-done handler so it wakes the disable-scheduling waiter;
without it the waiter sleeps the full 5s timeout. When a reset is
warranted the ban is cleared before rearming so that
guc_exec_queue_start() can resubmit jobs after the GT reset - a
still-banned queue would block resubmission and cause an infinite TDR
loop. The already-banned case is gated out before this point via
skip_timeout_check, so it is unaffected.

v2: (Himal) Do it for any queue type, not just kernel/migration
v3: - (Sashiko and Sanjay): don't clear the ban / GT reset for already
      killed/banned queues on unstarted-job timeout
    - Update commit message
    - (Matt) Add Fixes tag

Fixes: fe05cee4d953 ("drm/xe: Don't short circuit TDR on jobs not started")
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Assisted-by: GitHub-Copilot:claude-sonnet-4.6
Assisted-by: GitHub-Copilot:claude-opus-4.8
Tested-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Sanjay Yadav <sanjay.kumar.yadav@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Link: https://patch.msgid.link/20260610152548.404575-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
(cherry picked from commit b1107d085e7e8ed15ba6f80c102528a9c8a6cb0e)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
drivers/gpu/drm/xe/xe_guc_submit.c