There are two sites that nest rq lock inside scx_sched_lock:
- scx_bypass() takes scx_sched_lock then rq lock per CPU to propagate
per-cpu bypass flags and re-enqueue tasks.
- sysrq_handle_sched_ext_dump() takes scx_sched_lock to iterate all
scheds; scx_dump_state() then takes each CPU's rq lock for the dump.
scx_claim_exit(), meanwhile, takes scx_sched_lock to propagate exits to
descendants. It can be reached from scx_tick(), BPF kfuncs, and many
other paths with the rq lock already held, creating the reverse
ordering:

  rq lock -> scx_sched_lock vs. scx_sched_lock -> rq lock
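For illustration, the two orderings side by side - a minimal sketch,
with the bypass loop body reduced to a comment and the rq-locked path
abbreviated:

  /* scx_bypass(): scx_sched_lock -> rq lock */
  raw_spin_lock(&scx_sched_lock);
  for_each_possible_cpu(cpu) {
          struct rq *rq = cpu_rq(cpu);

          raw_spin_rq_lock(rq);
          /* propagate per-cpu bypass flags, re-enqueue tasks */
          raw_spin_rq_unlock(rq);
  }
  raw_spin_unlock(&scx_sched_lock);

  /* scx_tick() and other rq-locked paths: rq lock -> scx_sched_lock */
  raw_spin_rq_lock(rq);
  ...
  scx_claim_exit(sch, ...);     /* takes scx_sched_lock under rq lock */

One CPU holding scx_sched_lock while waiting on an rq lock, and another
holding that rq lock while waiting on scx_sched_lock, is the classic
AB-BA deadlock.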
Fix it by flipping scx_bypass() to take the rq lock first, and by
dropping scx_sched_lock from sysrq_handle_sched_ext_dump(): scx_sched_all
is already RCU-traversable, and scx_dump_lock now prevents dumping a
dead sched. This establishes the consistent ordering rq lock ->
scx_sched_lock.
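With that, the dump path becomes a plain RCU list walk, roughly as
below; whether the read side uses an explicit guard(rcu)() as sketched
or relies on protection from the calling context is an assumption here:

  /* sysrq_handle_sched_ext_dump(): no scx_sched_lock, no rq lock */
  guard(rcu)();
  list_for_each_entry_rcu(sch, &scx_sched_all, all)
          scx_dump_state(sch, &ei, 0, false);

scx_dump_state() still takes each CPU's rq lock internally, but no
longer nested inside scx_sched_lock, leaving only the rq lock ->
scx_sched_lock order.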
Reported-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Link: https://lore.kernel.org/r/20260309163025.2240221-1-yphbchou0911@gmail.com
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *p, *n;

-	raw_spin_lock(&scx_sched_lock);
 	raw_spin_rq_lock(rq);
+	raw_spin_lock(&scx_sched_lock);

 	scx_for_each_descendant_pre(pos, sch) {
 		struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
...
 	struct scx_exit_info ei = { .kind = SCX_EXIT_NONE, .reason = "SysRq-D" };
 	struct scx_sched *sch;

-	guard(raw_spinlock_irqsave)(&scx_sched_lock);
-
 	list_for_each_entry_rcu(sch, &scx_sched_all, all)
 		scx_dump_state(sch, &ei, 0, false);
 }