drm/sched: Allow drivers to skip the reset and keep on running
When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job just took unusually long (longer than the timeout) but is
still running, and there is, thus, no reason to reset the hardware. This
can occur in two scenarios:
1. The job is taking longer than the timeout, but the driver determined
through a GPU-specific mechanism that the hardware is still making
progress. Hence, the driver would like the scheduler to skip the
timeout and treat the job as still pending from then onward. This
happens in v3d, Etnaviv, and Xe.
2. Timeout has fired before the free-job worker. Consequently, the
scheduler calls `sched->ops->timedout_job()` for a job that isn't
timed out.
These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
though `sched->ops->free_job()` - leading to a memory leak.
To solve these problems, create a new `drm_gpu_sched_stat`, called
DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip the reset. The
new status will indicate that the job must be reinserted into
`sched->pending_list`, and the hardware / driver will still complete that
job.
Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-2-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>