From: Linus Torvalds
Date: Wed, 20 Nov 2024 18:08:00 +0000 (-0800)
Subject: Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj...
X-Git-Tag: v6.13-rc1~164
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=8f7c8b88bda4988f44e595a760438febf51c92c8;p=thirdparty%2Fkernel%2Flinux.git

Merge tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext updates from Tejun Heo:

 - Improve the default select_cpu() implementation making it topology
   aware and handle WAKE_SYNC better.

 - set_arg_maybe_null() was used to inform the verifier which ops args
   could be NULL in a rather hackish way. Use the new __nullable CFI
   stub tags instead.

 - On Sapphire Rapids multi-socket systems, a BPF scheduler, by
   hammering on the same queue across sockets, could live-lock the
   system to the point where the system couldn't make reasonable
   forward progress.

   This could lead to soft-lockup triggered resets or stalling out
   bypass mode switch and thus BPF scheduler ejection for tens of
   minutes if not hours. After trying a number of mitigations, the
   following set worked reliably:

     - Injecting artificial cpu_relax() loops in two places while
       sched_ext is trying to turn on the bypass mode.

     - Triggering scheduler ejection when soft-lockup detection is
       imminent (a quarter of threshold left).

   While not the prettiest, the impact both in terms of code
   complexity and overhead is minimal.

 - A common complaint on the API is the overuse of the word "dispatch"
   and the confusion around "consume". This is due to how the dispatch
   queues became more generic over time. Rename the affected kfuncs
   for clarity.

   Thanks to BPF's compatibility features, this change can be made in
   a way that's both forward and backward compatible. The
   compatibility code will be dropped in a few releases.

 - Other misc changes

* tag 'sched_ext-for-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (21 commits)
  sched_ext: Replace scx_next_task_picked() with switch_class() in comment
  sched_ext: Rename scx_bpf_dispatch[_vtime]_from_dsq*() -> scx_bpf_dsq_move[_vtime]*()
  sched_ext: Rename scx_bpf_consume() to scx_bpf_dsq_move_to_local()
  sched_ext: Rename scx_bpf_dispatch[_vtime]() to scx_bpf_dsq_insert[_vtime]()
  sched_ext: scx_bpf_dispatch_from_dsq_set_*() are allowed from unlocked context
  sched_ext: add a missing rcu_read_lock/unlock pair at scx_select_cpu_dfl()
  sched_ext: Clarify sched_ext_ops table for userland scheduler
  sched_ext: Enable the ops breather and eject BPF scheduler on softlockup
  sched_ext: Avoid live-locking bypass mode switching
  sched_ext: Fix incorrect use of bitwise AND
  sched_ext: Do not enable LLC/NUMA optimizations when domains overlap
  sched_ext: Introduce NUMA awareness to the default idle selection policy
  sched_ext: Replace set_arg_maybe_null() with __nullable CFI stub tags
  sched_ext: Rename CFI stubs to names that are recognized by BPF
  sched_ext: Introduce LLC awareness to the default idle selection policy
  sched_ext: Clarify ops.select_cpu() for single-CPU tasks
  sched_ext: improve WAKE_SYNC behavior for default idle CPU selection
  sched_ext: Use btf_ids to resolve task_struct
  sched/ext: Use tg_cgroup() to elieminate duplicate code
  sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
  ...
---

8f7c8b88bda4988f44e595a760438febf51c92c8
diff --cc kernel/sched/ext.c
index ecb88c5285447,3c4a94e4258f0..7fff1d0454770
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@@ -2642,10 -2759,10 +2759,10 @@@ static int balance_one(struct rq *rq, s
  		 * If the previous sched_class for the current CPU was not SCX,
  		 * notify the BPF scheduler that it again has control of the
  		 * core. This callback complements ->cpu_release(), which is
- 		 * emitted in scx_next_task_picked().
+ 		 * emitted in switch_class().
  		 */
  		if (SCX_HAS_OP(cpu_acquire))
 -			SCX_CALL_OP(0, cpu_acquire, cpu_of(rq), NULL);
 +			SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL);
  		rq->scx.cpu_released = false;
  	}
  
@@@ -4277,9 -4623,52 +4636,52 @@@ bool task_should_scx(int policy
  		return false;
  	if (READ_ONCE(scx_switching_all))
  		return true;
 -	return p->policy == SCHED_EXT;
 +	return policy == SCHED_EXT;
  }
  
+ /**
+  * scx_softlockup - sched_ext softlockup handler
+  *
+  * On some multi-socket setups (e.g. 2x Intel 8480c), the BPF scheduler can
+  * live-lock the system by making many CPUs target the same DSQ to the point
+  * where soft-lockup detection triggers. This function is called from
+  * soft-lockup watchdog when the triggering point is close and tries to unjam
+  * the system by enabling the breather and aborting the BPF scheduler.
+  */
+ void scx_softlockup(u32 dur_s)
+ {
+ 	switch (scx_ops_enable_state()) {
+ 	case SCX_OPS_ENABLING:
+ 	case SCX_OPS_ENABLED:
+ 		break;
+ 	default:
+ 		return;
+ 	}
+ 
+ 	/* allow only one instance, cleared at the end of scx_ops_bypass() */
+ 	if (test_and_set_bit(0, &scx_in_softlockup))
+ 		return;
+ 
+ 	printk_deferred(KERN_ERR "sched_ext: Soft lockup - CPU%d stuck for %us, disabling \"%s\"\n",
+ 			smp_processor_id(), dur_s, scx_ops.name);
+ 
+ 	/*
+ 	 * Some CPUs may be trapped in the dispatch paths. Enable breather
+ 	 * immediately; otherwise, we might even be able to get to
+ 	 * scx_ops_bypass().
+ 	 */
+ 	atomic_inc(&scx_ops_breather_depth);
+ 
+ 	scx_ops_error("soft lockup - CPU#%d stuck for %us",
+ 		      smp_processor_id(), dur_s);
+ }
+ 
+ static void scx_clear_softlockup(void)
+ {
+ 	if (test_and_clear_bit(0, &scx_in_softlockup))
+ 		atomic_dec(&scx_ops_breather_depth);
+ }
+ 
  /**
   * scx_ops_bypass - [Un]bypass scx_ops and guarantee forward progress
   *