drm/sched: Allow drivers to skip the reset and keep on running
author Maíra Canal <mcanal@igalia.com>
Mon, 14 Jul 2025 22:07:03 +0000 (19:07 -0300)
committer Maíra Canal <mcanal@igalia.com>
Tue, 15 Jul 2025 11:27:07 +0000 (08:27 -0300)
commit 0b1217bfdfddf664c15954d1d51ee18ed88a2ccf
tree 5d69d265ed7ef607f967e712c709161e63c82832
parent 0a5dc1b67ef5c7e851b57764a2aab8cc4341a7b7

When the DRM scheduler times out, it's possible that the GPU isn't hung;
instead, a job just took unusually long (longer than the timeout) but is
still running, and there is, thus, no reason to reset the hardware. This
can occur in two scenarios:

  1. The job is taking longer than the timeout, but the driver determined
     through a GPU-specific mechanism that the hardware is still making
     progress. Hence, the driver would like the scheduler to skip the
     timeout and treat the job as still pending from then onward. This
     happens in v3d, Etnaviv, and Xe.
  2. The timeout fired before the free-job worker ran. Consequently, the
     scheduler calls `sched->ops->timedout_job()` for a job that isn't
     timed out.

These two scenarios are problematic because the job was removed from the
`sched->pending_list` before calling `sched->ops->timedout_job()`, which
means that when the job finishes, it won't be freed by the scheduler
through `sched->ops->free_job()`, leading to a memory leak.

To solve these problems, create a new `drm_gpu_sched_stat`, called
`DRM_GPU_SCHED_STAT_NO_HANG`, which allows a driver to skip the reset. The
new status will indicate that the job must be reinserted into
`sched->pending_list`, and the hardware / driver will still complete that
job.
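
The flow described above can be sketched as a small userspace model. This
is not the kernel code: every `mock_` name is hypothetical, the singly
linked list only stands in for `sched->pending_list`, and reinserting at
the head of the list is an assumption of this sketch. It shows only the
idea that a job removed before the timeout callback is put back when the
callback reports that the hardware did not hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler status values. */
enum mock_sched_stat {
	MOCK_STAT_RESET,   /* the driver treated the timeout as a real hang */
	MOCK_STAT_NO_HANG, /* models DRM_GPU_SCHED_STAT_NO_HANG */
};

struct mock_job {
	struct mock_job *next;
	bool still_progressing; /* result of a GPU-specific progress check */
};

struct mock_sched {
	struct mock_job *pending; /* models sched->pending_list */
	enum mock_sched_stat (*timedout_job)(struct mock_job *job);
};

/* Hypothetical driver callback: skip the reset if the job still progresses. */
enum mock_sched_stat mock_driver_timedout(struct mock_job *job)
{
	return job->still_progressing ? MOCK_STAT_NO_HANG : MOCK_STAT_RESET;
}

/* Models the timeout path: remove the first job, ask the driver, and
 * reinsert the job if the driver reports that there was no hang. */
void mock_timeout_handler(struct mock_sched *sched)
{
	struct mock_job *job;

	assert(sched != NULL && sched->timedout_job != NULL);

	job = sched->pending;
	if (job == NULL)
		return;

	/* The job leaves the pending list before the callback runs. */
	sched->pending = job->next;
	job->next = NULL;

	if (sched->timedout_job(job) == MOCK_STAT_NO_HANG) {
		/* Put the job back so it can still be freed on completion. */
		job->next = sched->pending;
		sched->pending = job;
	}
}

/* Returns true if the job is currently on the pending list. */
bool mock_is_pending(const struct mock_sched *sched, const struct mock_job *job)
{
	const struct mock_job *it;

	for (it = sched->pending; it != NULL; it = it->next)
		if (it == job)
			return true;
	return false;
}
```

In this model, a job whose callback returns `MOCK_STAT_NO_HANG` is back on
the pending list after `mock_timeout_handler()` returns, while a job whose
callback returns `MOCK_STAT_RESET` stays off it.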

Reviewed-by: Philipp Stanner <phasta@kernel.org>
Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-2-5c5ba4f55039@igalia.com
Signed-off-by: Maíra Canal <mcanal@igalia.com>
drivers/gpu/drm/scheduler/sched_main.c
include/drm/gpu_scheduler.h