From: Rik van Riel
Date: Tue, 26 May 2026 19:43:29 +0000 (-0700)
Subject: sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=3b7be8e7fa698359616c3276e005f08c3b6070e4;p=thirdparty%2Flinux.git

sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit

update_tg_load_avg() is called once per leaf cfs_rq from the
__update_blocked_fair() walk that runs inside the NOHZ idle-balance
softirq, and again from update_load_avg() with UPDATE_TG. Its first
operation after the trivial early-outs is unconditionally:

    now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
    if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
            return;

Jakub ran into a system where nohz_idle_balance() was taking 75% of
a CPU (which is handling network traffic and doing many irq_exit_cpu
calls), with 35% of that CPU spent in update_load_avg, and 17% of the
CPU in sched_clock_cpu(), reading the TSC.

In a quick synthetic test, it looks like this patch reduces the CPU
use of sched_balance_update_blocked_averages by about 20%.

Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead. This
eliminates the rdtsc, and uses a fairly fresh timestamp, because all
callers of update_tg_load_avg() and clear_tg_load_avg() hold rq->lock
and have called update_rq_clock(rq) within microseconds:

  caller                                pre-state
  __update_blocked_fair                 encloser did update_rq_clock(rq)
  update_load_avg's three UPDATE_TG     under rq->lock after
    sites                                 enqueue/dequeue/update_curr
  attach_/detach_entity_cfs_rq          preceded by update_load_avg(...)
  clear_tg_load_avg via offline path    rq_clock_start_loop_update(rq)
                                          upfront

so rq->clock is fresh at every call.

Since cfs_rqs are per-CPU per-task_group,
cfs_rq->last_update_tg_load_avg is always compared against the same
rq's clock; no cross-rq drift.

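
As a rough illustration of the argument above, the rate-limit can be
modelled in userspace. This is only a sketch, not kernel code: every
model_* name is hypothetical, the hardware clock is passed in as a plain
number, and the model leaves out the load-delta threshold and all
locking. It shows several leaf cfs_rqs on one rq sharing a single clock
refresh, with each 1 ms check reduced to a load of the cached value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for struct rq: only the cached clock. */
struct model_rq {
	uint64_t clock;
};

/* Hypothetical stand-in for struct cfs_rq. */
struct model_cfs_rq {
	struct model_rq *rq;
	uint64_t last_update_tg_load_avg;
};

/* Models update_rq_clock(): the one place that pays for a clock read. */
static void model_update_rq_clock(struct model_rq *rq, uint64_t hw_now)
{
	assert(hw_now >= rq->clock);	/* the model assumes a monotonic clock */
	rq->clock = hw_now;
}

/* Models rq_clock(): a plain load of the cached timestamp. */
static uint64_t model_rq_clock(const struct model_rq *rq)
{
	return rq->clock;
}

/* Returns true when the (omitted) tg->load_avg update would go ahead. */
static bool model_update_tg_load_avg(struct model_cfs_rq *cfs_rq)
{
	uint64_t now = model_rq_clock(cfs_rq->rq);

	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return false;

	cfs_rq->last_update_tg_load_avg = now;
	return true;
}

/* Four clock refreshes, three leaf cfs_rqs checked after each one. */
static unsigned int model_demo(void)
{
	static const uint64_t hw_times[] = { 500000, 1500000, 2000000, 2500000 };
	struct model_rq rq = { .clock = 0 };
	struct model_cfs_rq leaves[3] = {
		{ .rq = &rq }, { .rq = &rq }, { .rq = &rq },
	};
	unsigned int updates = 0;

	for (size_t t = 0; t < sizeof(hw_times) / sizeof(hw_times[0]); t++) {
		model_update_rq_clock(&rq, hw_times[t]);
		for (size_t i = 0; i < 3; i++)
			updates += model_update_tg_load_avg(&leaves[i]);
	}
	return updates;
}
```

In this model, model_demo() performs 4 clock refreshes for 12 rate-limit
checks. Each leaf passes the check at 1.5 ms and again at 2.5 ms, so 6
updates go ahead in total.
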
Signed-off-by: Rik van Riel
Assisted-by: Claude (Anthropic)
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Vincent Guittot
Link: https://patch.msgid.link/20260527110250.6a91718d@fangorn
---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62a2dcb0d03e6..b5819c4899f1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4962,7 +4962,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
 	 */
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	now = rq_clock(rq_of(cfs_rq));
 	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
 		return;
 
@@ -4985,7 +4985,7 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
-	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	now = rq_clock(rq_of(cfs_rq));
 	delta = 0 - cfs_rq->tg_load_avg_contrib;
 	atomic_long_add(delta, &cfs_rq->tg->load_avg);
 	cfs_rq->tg_load_avg_contrib = 0;