scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
- A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
- by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
- ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
- local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
- Additionally, inserting directly from ``ops.select_cpu()`` will cause the
- ``ops.enqueue()`` callback to be skipped.
-
Note that the scheduler core will ignore an invalid CPU selection, for
example, if it's outside the allowed cpumask of the task.
+ A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+ by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
+
+ If the task is inserted into ``SCX_DSQ_LOCAL`` from
+ ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
+ is returned from ``ops.select_cpu()``. Additionally, inserting directly
+ from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
+ be skipped.
+
+ Any other attempt to store a task in BPF-internal data structures from
+ ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
+ invoked. This is discouraged, as it can introduce racy behavior or
+ inconsistent state.
+
2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
can make one of the following decisions:
* Queue the task on the BPF side.
+ **Task State Tracking and ops.dequeue() Semantics**
+
+ A task is in the "BPF scheduler's custody" when the BPF scheduler is
+ responsible for managing its lifecycle. A task enters custody when it is
+ dispatched to a user DSQ or stored in the BPF scheduler's internal data
+ structures. Custody is entered only from ``ops.enqueue()`` for those
+ operations. The only exception is dispatching to a user DSQ from
+ ``ops.select_cpu()``: although the task is not yet technically in BPF
+ scheduler custody at that point, the dispatch has the same semantic
+ effect as dispatching from ``ops.enqueue()`` for custody-related
+ purposes.
+
+ Once ``ops.enqueue()`` is called, the task may or may not enter custody
+ depending on what the scheduler does:
+
+ * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+ ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
+ is done with the task - it either goes straight to a CPU's local run
+ queue or to the global DSQ as a fallback. The task never enters (or
+ exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+ * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+ BPF scheduler's custody. When the task later leaves BPF custody
+ (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+ sleep/property changes), ``ops.dequeue()`` will be called exactly
+ once.
+
+ * **Stored in BPF data structures** (e.g., internal BPF queues): the
+ task is in BPF custody. ``ops.dequeue()`` will be called when it
+ leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
+ on property change / sleep).
+
+ When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
+ The dequeue can happen for different reasons, distinguished by flags:
+
+ 1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+ terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+ execution), ``ops.dequeue()`` is triggered without any special flags.
+
+ 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+ core scheduling picks a task for execution while it's still in BPF
+ custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+ 3. **Scheduling property change**: when a task property changes (via
+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+ priority changes, CPU migrations, etc.) while the task is still in
+ BPF custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+ **Important**: Once a task has left BPF custody (e.g., after being
+ dispatched to a terminal DSQ), property changes will not trigger
+ ``ops.dequeue()``, since the task is no longer managed by the BPF
+ scheduler.
+
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked which can use the following two
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
+
+ ops.dequeue(); /* Exiting BPF scheduler */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task->scx.slice > 0 && task is runnable)
/* scx_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */
SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
}
+/*
+ * Return true if @p is moving due to an internal SCX migration, false
+ * otherwise.
+ */
+static inline bool task_scx_migrating(struct task_struct *p)
+{
+ /*
+ * We only need to check sticky_cpu: it is set to the destination
+ * CPU in move_remote_task_to_local_dsq() before deactivate_task()
+ * and cleared when the task is enqueued on the destination, so it
+ * is only non-negative during an internal SCX migration.
+ */
+ return p->scx.sticky_cpu >= 0;
+}
+
+/*
+ * Call ops.dequeue() if the task is in BPF custody and not migrating.
+ * Clears %SCX_TASK_IN_CUSTODY when the callback is invoked.
+ */
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+ struct task_struct *p, u64 deq_flags)
+{
+ if (!(p->scx.flags & SCX_TASK_IN_CUSTODY) || task_scx_migrating(p))
+ return;
+
+ if (SCX_HAS_OP(sch, dequeue))
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
+
+ p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+}
+
static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p,
u64 enq_flags)
{
struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
bool preempt = false;
+ call_task_dequeue(scx_root, rq, p, 0);
+
/*
* If @rq is in balance, the CPU is already vacant and looking for the
* next task to run. No need to preempt or trigger resched after moving
p->scx.ddsp_dsq_id = SCX_DSQ_INVALID;
p->scx.ddsp_enq_flags = 0;
+ /*
+ * Update custody and call ops.dequeue() before clearing ops_state:
+ * once ops_state is cleared, waiters in ops_dequeue() can proceed
+ * and dequeue_task_scx() will RMW p->scx.flags. If we clear
+ * ops_state first, both sides would modify p->scx.flags
+ * concurrently in a non-atomic way.
+ */
+ if (is_local) {
+ local_dsq_post_enq(dsq, p, enq_flags);
+ } else {
+ /*
+ * Task on global/bypass DSQ: leave custody, task on
+ * non-terminal DSQ: enter custody.
+ */
+ if (dsq->id == SCX_DSQ_GLOBAL || dsq->id == SCX_DSQ_BYPASS)
+ call_task_dequeue(sch, rq, p, 0);
+ else
+ p->scx.flags |= SCX_TASK_IN_CUSTODY;
+
+ raw_spin_unlock(&dsq->lock);
+ }
+
/*
* We're transitioning out of QUEUEING or DISPATCHING. store_release to
* match waiters' load_acquire.
*/
if (enq_flags & SCX_ENQ_CLEAR_OPSS)
atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
-
- if (is_local)
- local_dsq_post_enq(dsq, p, enq_flags);
- else
- raw_spin_unlock(&dsq->lock);
}
static void task_unlink_from_dsq(struct task_struct *p,
if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
goto direct;
+ /*
+ * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY
+ * so ops.dequeue() is called when it leaves custody.
+ */
+ p->scx.flags |= SCX_TASK_IN_CUSTODY;
+
/*
* If not directly dispatched, QUEUEING isn't clear yet and dispatch or
* dequeue may be waiting. The store_release matches their load_acquire.
{
struct scx_sched *sch = scx_root;
unsigned long opss;
+ u64 op_deq_flags = deq_flags;
+
+ /*
+ * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a property
+ * change (not sleep or core-sched pick).
+ */
+ if (!(op_deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+ op_deq_flags |= SCX_DEQ_SCHED_CHANGE;
/* dequeue is always temporary, don't reset runnable_at */
clr_task_runnable(p, false);
*/
BUG();
case SCX_OPSS_QUEUED:
- if (SCX_HAS_OP(sch, dequeue))
- SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
- p, deq_flags);
-
+ /* A queued task must always be in BPF scheduler's custody */
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
break;
BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
break;
}
+
+ /*
+ * Call ops.dequeue() if the task is still in BPF custody.
+ *
+ * The code that clears ops_state to %SCX_OPSS_NONE does not always
+ * clear %SCX_TASK_IN_CUSTODY: in dispatch_to_local_dsq(), when
+ * we're moving a task that was in %SCX_OPSS_DISPATCHING to a
+ * remote CPU's local DSQ, we only set ops_state to %SCX_OPSS_NONE
+ * so that a concurrent dequeue can proceed, but we clear
+ * %SCX_TASK_IN_CUSTODY only when we later enqueue or move the
+ * task. So we can see NONE + IN_CUSTODY here and we must handle
+ * it. Similarly, after waiting on %SCX_OPSS_DISPATCHING we see
+ * NONE but the task may still have %SCX_TASK_IN_CUSTODY set until
+ * it is enqueued on the destination.
+ */
+ call_task_dequeue(sch, rq, p, op_deq_flags);
}
static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
lockdep_assert_rq_held(rq);
+ /*
+ * Verify the task is not in BPF scheduler's custody. If flag
+ * transitions are consistent, the flag should always be clear
+ * here.
+ */
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+
/*
* Set the weight before calling ops.enable() so that the scheduler
* doesn't see a stale value if they inspect the task struct.
if (SCX_HAS_OP(sch, disable))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
scx_set_task_state(p, SCX_TASK_READY);
+
+ /*
+ * Verify the task is not in BPF scheduler's custody. If flag
+ * transitions are consistent, the flag should always be clear
+ * here.
+ */
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
}
static void scx_exit_task(struct task_struct *p)
* it hasn't been dispatched yet. Dequeue from the BPF side.
*/
SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+ /*
+ * The task is being dequeued due to a property change (e.g.,
+ * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+ * etc.).
+ */
+ SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
};
enum scx_pick_idle_cpu_flags {
#define HAVE_SCX_CPU_PREEMPT_UNKNOWN
#define HAVE_SCX_DEQ_SLEEP
#define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
#define HAVE_SCX_DSQ_FLAG_BUILTIN
#define HAVE_SCX_DSQ_FLAG_LOCAL_ON
#define HAVE_SCX_DSQ_INVALID
const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+ SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
} while (0)