sched_ext: Fix ops->priv clobber on concurrent attach/detach
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:
$ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop
...
BUG: kernel NULL pointer dereference, address:
0000000000000400
RIP: 0010:scx_claim_exit+0x3b/0x120
Call Trace:
<TASK>
bpf_scx_unreg+0x45/0xb0
bpf_struct_ops_map_link_dealloc+0x39/0x50
bpf_link_release+0x18/0x20
__fput+0x10b/0x2e0
__x64_sys_close+0x47/0xa0
The underlying race (diagnosed by Tejun Heo) is a stomp of @ops->priv,
not a missing NULL check:
T2 unreg(K) T1 reg(K)
----------- ---------
sch = ops->priv = sch_b800
scx_disable; flush_disable_work
[scx_root_disable: scx_root=NULL,
mutex_unlock, state=DISABLED]
mutex_lock; state ok
scx_alloc_and_add_sched:
ops->priv = sch_a800
scx_root = sch_a800; init=0
state=ENABLED; mutex_unlock
[flush returns]
RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800
kobject_put(sch_b800)
T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops->priv and dereferences it in scx_claim_exit().
Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.
Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>