   1 From de53fd7aedb100f03e5d2231cfce0e4993282425 Mon Sep 17 00:00:00 2001
   2 From: Dave Chiluk <chiluk+linux@indeed.com>
   3 Date: Tue, 23 Jul 2019 11:44:26 -0500
   4 Subject: sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices
   5
   6 From: Dave Chiluk <chiluk+linux@indeed.com>
   7
   8 commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
   9
  10 It has been observed that highly-threaded, non-cpu-bound applications
  11 running under cpu.cfs_quota_us constraints can hit a high percentage of
  12 periods throttled while simultaneously not consuming the allocated
  13 amount of quota. This use case is typical of user-interactive non-cpu
  14 bound applications, such as those running in kubernetes or mesos when
  15 run on multiple cpu cores.
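
The symptom described above can be quantified from the group's bandwidth
statistics. The following is a minimal sketch, not part of this patch: it
assumes the text of the cgroup's cpu.stat file has already been read into a
string, and that it carries the nr_periods and nr_throttled fields. The
helper name throttled_pct() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given the text of a cpu.stat file, return the
 * percentage of periods in which the group was throttled, or -1.0 if the
 * fields are missing or no period has elapsed yet.
 */
static double throttled_pct(const char *stat_text)
{
	unsigned long long periods = 0, throttled = 0;
	int have_periods = 0, have_throttled = 0;
	const char *line = stat_text;

	assert(stat_text != NULL);

	/* walk the buffer one line at a time */
	while (line && *line) {
		if (sscanf(line, "nr_periods %llu", &periods) == 1)
			have_periods = 1;
		else if (sscanf(line, "nr_throttled %llu", &throttled) == 1)
			have_throttled = 1;

		line = strchr(line, '\n');
		if (line)
			line++;
	}

	if (!have_periods || !have_throttled || periods == 0)
		return -1.0;

	return 100.0 * (double)throttled / (double)periods;
}
```

A high value from this helper, combined with CPU usage well below
cpu.cfs_quota_us per period, would match the behaviour reported here.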
  16
  17 This has been root-caused to cpu-local run queues being allocated per-cpu
  18 bandwidth slices and then not fully using those slices within the period,
  19 at which point the slice and quota expire. This expiration of unused
  20 slice results in applications not being able to utilize the quota
  21 they have been allocated.
  22
  23 The non-expiration of per-cpu slices was recently fixed by
  24 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
  25 condition")'. Prior to that it appears that this had been broken since
  26 at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
  27 cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
  28 added the following conditional which resulted in slices never being
  29 expired.
  30
  31 if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
  32         /* extend local deadline, drift is bounded above by 2 ticks */
  33         cfs_rq->runtime_expires += TICK_NSEC;
  34
  35 Because expiration was broken for nearly 5 years, was only recently fixed,
  36 and the fix is now being noticed by many users running kubernetes
  37 (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
  38 that the mechanisms around expiring runtime should be removed
  39 altogether.
  40
  41 This allows quota already allocated to per-cpu run-queues to live longer
  42 than the period boundary, so threads on runqueues that do not
  43 use much CPU can continue to use their remaining slice over a longer
  44 period of time than cpu.cfs_period_us. This helps prevent the
  45 above condition of hitting throttling while also not fully utilizing
  46 the cpu quota.
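
The effect of letting cpu-local runtime outlive the period can be illustrated
with a toy model. This is only a sketch under stated assumptions and is not
the kernel's accounting code: NR_CPUS, simulate() and struct sim_result are
hypothetical names, each CPU runs one thread that wants work_us of CPU time
per period, and grant_us stands in for the runtime a CPU is left holding
after handing slack back to the global pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 20	/* hypothetical machine size for this sketch */

struct sim_result {
	uint64_t used_us;		/* CPU time actually consumed */
	unsigned int nr_throttled;	/* periods in which some CPU ran dry */
};

static struct sim_result simulate(unsigned int nr_periods, uint64_t quota_us,
				  uint64_t grant_us, uint64_t work_us,
				  bool expire_local)
{
	uint64_t local[NR_CPUS] = { 0 };
	struct sim_result res = { 0, 0 };

	assert(grant_us > 0);

	for (unsigned int p = 0; p < nr_periods; p++) {
		uint64_t pool = quota_us;	/* global pool refilled per period */
		bool throttled = false;

		for (int cpu = 0; cpu < NR_CPUS; cpu++) {
			/* pull runtime from the global pool until we can run */
			while (local[cpu] < work_us && pool > 0) {
				uint64_t amount = grant_us < pool ? grant_us : pool;

				pool -= amount;
				local[cpu] += amount;
			}

			uint64_t run = local[cpu] < work_us ? local[cpu] : work_us;

			local[cpu] -= run;
			res.used_us += run;
			if (run < work_us)
				throttled = true;
		}

		/* old behaviour: unused cpu-local runtime dies with the period */
		if (expire_local)
			for (int cpu = 0; cpu < NR_CPUS; cpu++)
				local[cpu] = 0;

		if (throttled)
			res.nr_throttled++;
	}

	return res;
}
```

With a 10000us quota, 1000us grants and 400us of work per CPU over 10 periods,
the expiring variant consumes 40000us and is throttled in all 10 periods, even
though demand (8000us per period) is below quota. The non-expiring variant
consumes 76000us and is throttled only in the first period.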
  47
  48 This theoretically allows a machine to use slightly more than its
  49 allotted quota in some periods. This overflow would be bounded by the
  50 remaining quota left on each per-cpu runqueue. This is typically no
  51 more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
  52 change nothing, as they should theoretically fully utilize all of their
  53 quota in each period. For user-interactive tasks as described above this
  54 provides a much better user/application experience as their cpu
  55 utilization will more closely match the amount they requested when they
  56 hit throttling. This means that cpu limits no longer strictly apply per
  57 period for non-cpu bound applications, but that they are still accurate
  58 over longer timeframes.
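
The bound described above can be worked through numerically. The sketch below
is illustrative arithmetic only, using hypothetical helper names, and assumes
the typical case stated above in which each cpu holds no more than
min_cfs_rq_runtime (1ms) of leftover runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on runtime carried across a period boundary, in microseconds. */
static uint64_t max_carryover_us(unsigned int nr_cpus,
				 uint64_t min_cfs_rq_runtime_us)
{
	return (uint64_t)nr_cpus * min_cfs_rq_runtime_us;
}

/*
 * Upper bound on average usage per period over a window of nr_periods,
 * given that at most carryover_us was left over from before the window.
 */
static uint64_t avg_usage_bound_us(uint64_t quota_us, unsigned int nr_periods,
				   uint64_t carryover_us)
{
	assert(nr_periods > 0);

	return quota_us + carryover_us / nr_periods;
}
```

For an 80 cpu machine with 1ms of leftover per cpu, max_carryover_us(80, 1000)
gives 80000us. With a 10000us quota, the per-period average bound is 90000us
over a single period but 10800us over 100 periods, which is how the limit
stays accurate over longer timeframes.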
  59
  60 This greatly improves performance of high-thread-count, non-cpu bound
  61 applications with low cfs_quota_us allocation on high-core-count
  62 machines. In the case of an artificial testcase (10ms/100ms of quota on an
  63 80 CPU machine), this commit resulted in almost 30x performance
  64 improvement, while still maintaining correct cpu quota restrictions.
  65 That testcase is available at https://github.com/indeedeng/fibtest.
  66
  67 Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
  68 Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
  69 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  70 Reviewed-by: Phil Auld <pauld@redhat.com>
  71 Reviewed-by: Ben Segall <bsegall@google.com>
  72 Cc: Ingo Molnar <mingo@redhat.com>
  73 Cc: John Hammond <jhammond@indeed.com>
  74 Cc: Jonathan Corbet <corbet@lwn.net>
  75 Cc: Kyle Anderson <kwa@yelp.com>
  76 Cc: Gabriel Munos <gmunoz@netflix.com>
  77 Cc: Peter Oskolkov <posk@posk.io>
  78 Cc: Cong Wang <xiyou.wangcong@gmail.com>
  79 Cc: Brendan Gregg <bgregg@netflix.com>
  80 Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
  81 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  82
  83 ---
  84  Documentation/scheduler/sched-bwc.rst |   74 +++++++++++++++++++++++++++-------
  85  kernel/sched/fair.c                   |   72 +++------------------------------
  86  kernel/sched/sched.h                  |    4 -
  87  3 files changed, 67 insertions(+), 83 deletions(-)
  88
  89 --- a/Documentation/scheduler/sched-bwc.rst
  90 +++ b/Documentation/scheduler/sched-bwc.rst
  91 @@ -9,15 +9,16 @@ CFS bandwidth control is a CONFIG_FAIR_G
  92  specification of the maximum CPU bandwidth available to a group or hierarchy.
  93
  94  The bandwidth allowed for a group is specified using a quota and period. Within
  95 -each given "period" (microseconds), a group is allowed to consume only up to
  96 -"quota" microseconds of CPU time.  When the CPU bandwidth consumption of a
  97 -group exceeds this limit (for that period), the tasks belonging to its
  98 -hierarchy will be throttled and are not allowed to run again until the next
  99 -period.
 100 -
 101 -A group's unused runtime is globally tracked, being refreshed with quota units
 102 -above at each period boundary.  As threads consume this bandwidth it is
 103 -transferred to cpu-local "silos" on a demand basis.  The amount transferred
 104 +each given "period" (microseconds), a task group is allocated up to "quota"
 105 +microseconds of CPU time. That quota is assigned to per-cpu run queues in
 106 +slices as threads in the cgroup become runnable. Once all quota has been
 107 +assigned any additional requests for quota will result in those threads being
 108 +throttled. Throttled threads will not be able to run again until the next
 109 +period when the quota is replenished.
 110 +
 111 +A group's unassigned quota is globally tracked, being refreshed back to
 112 +cfs_quota units at each period boundary. As threads consume this bandwidth it
 113 +is transferred to cpu-local "silos" on a demand basis. The amount transferred
 114  within each of these updates is tunable and described as the "slice".
 115
 116  Management
 117 @@ -35,12 +36,12 @@ The default values are::
 118
 119  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
 120  bandwidth restriction in place, such a group is described as an unconstrained
 121 -bandwidth group.  This represents the traditional work-conserving behavior for
 122 +bandwidth group. This represents the traditional work-conserving behavior for
 123  CFS.
 124
 125  Writing any (valid) positive value(s) will enact the specified bandwidth limit.
 126 -The minimum quota allowed for the quota or period is 1ms.  There is also an
 127 -upper bound on the period length of 1s.  Additional restrictions exist when
 128 +The minimum quota allowed for the quota or period is 1ms. There is also an
 129 +upper bound on the period length of 1s. Additional restrictions exist when
 130  bandwidth limits are used in a hierarchical fashion, these are explained in
 131  more detail below.
 132
 133 @@ -53,8 +54,8 @@ unthrottled if it is in a constrained st
 134  System wide settings
 135  --------------------
 136  For efficiency run-time is transferred between the global pool and CPU local
 137 -"silos" in a batch fashion.  This greatly reduces global accounting pressure
 138 -on large systems.  The amount transferred each time such an update is required
 139 +"silos" in a batch fashion. This greatly reduces global accounting pressure
 140 +on large systems. The amount transferred each time such an update is required
 141  is described as the "slice".
 142
 143  This is tunable via procfs::
 144 @@ -97,6 +98,51 @@ There are two ways in which a group may
 145  In case b) above, even though the child may have runtime remaining it will not
 146  be allowed to until the parent's runtime is refreshed.
 147
 148 +CFS Bandwidth Quota Caveats
 149 +---------------------------
 150 +Once a slice is assigned to a cpu it does not expire.  However all but 1ms of
 151 +the slice may be returned to the global pool if all threads on that cpu become
 152 +unrunnable. This is configured at compile time by the min_cfs_rq_runtime
 153 +variable. This is a performance tweak that helps prevent added contention on
 154 +the global lock.
 155 +
 156 +The fact that cpu-local slices do not expire results in some interesting corner
 157 +cases that should be understood.
 158 +
 159 +For cgroup cpu constrained applications that are cpu limited this is a
 160 +relatively moot point because they will naturally consume the entirety of their
 161 +quota as well as the entirety of each cpu-local slice in each period. As a
 162 +result it is expected that nr_periods roughly equal nr_throttled, and that
 163 +cpuacct.usage will increase roughly equal to cfs_quota_us in each period.
 164 +
 165 +For highly-threaded, non-cpu bound applications this non-expiration nuance
 166 +allows applications to briefly burst past their quota limits by the amount of
 167 +unused slice on each cpu that the task group is running on (typically at most
 168 +1ms per cpu or as defined by min_cfs_rq_runtime).  This slight burst only
 169 +applies if quota had been assigned to a cpu and then not fully used or returned
 170 +in previous periods. This burst amount will not be transferred between cores.
 171 +As a result, this mechanism still strictly limits the task group to quota
 172 +average usage, albeit over a longer time window than a single period.  This
 173 +also limits the burst ability to no more than 1ms per cpu.  This provides
 174 +better, more predictable user experience for highly threaded applications with
 175 +small quota limits on high core count machines. It also eliminates the
 176 +propensity to throttle these applications while simultaneously using less than
 177 +quota amounts of cpu. Another way to say this, is that by allowing the unused
 178 +portion of a slice to remain valid across periods we have decreased the
 179 +possibility of wastefully expiring quota on cpu-local silos that don't need a
 180 +full slice's amount of cpu time.
 181 +
 182 +The interaction between cpu-bound and non-cpu-bound-interactive applications
 183 +should also be considered, especially when single core usage hits 100%. If you
 184 +gave each of these applications half of a cpu-core and they both got scheduled
 185 +on the same CPU it is theoretically possible that the non-cpu bound application
 186 +will use up to 1ms additional quota in some periods, thereby preventing the
 187 +cpu-bound application from fully using its quota by that same amount. In these
 188 +instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
 189 +decide which application is chosen to run, as they will both be runnable and
 190 +have remaining quota. This runtime discrepancy will be made up in the following
 191 +periods when the interactive application idles.
 192 +
 193  Examples
 194  --------
 195  1. Limit a group to 1 CPU worth of runtime::
 196 --- a/kernel/sched/fair.c
 197 +++ b/kernel/sched/fair.c
 198 @@ -4370,8 +4370,6 @@ void __refill_cfs_bandwidth_runtime(stru
 199
 200         now = sched_clock_cpu(smp_processor_id());
 201         cfs_b->runtime = cfs_b->quota;
 202 -       cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
 203 -       cfs_b->expires_seq++;
 204  }
 205
 206  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 207 @@ -4393,8 +4391,7 @@ static int assign_cfs_rq_runtime(struct
 208  {
 209         struct task_group *tg = cfs_rq->tg;
 210         struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
 211 -       u64 amount = 0, min_amount, expires;
 212 -       int expires_seq;
 213 +       u64 amount = 0, min_amount;
 214
 215         /* note: this is a positive sum as runtime_remaining <= 0 */
 216         min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
 217 @@ -4411,61 +4408,17 @@ static int assign_cfs_rq_runtime(struct
 218                         cfs_b->idle = 0;
 219                 }
 220         }
 221 -       expires_seq = cfs_b->expires_seq;
 222 -       expires = cfs_b->runtime_expires;
 223         raw_spin_unlock(&cfs_b->lock);
 224
 225         cfs_rq->runtime_remaining += amount;
 226 -       /*
 227 -        * we may have advanced our local expiration to account for allowed
 228 -        * spread between our sched_clock and the one on which runtime was
 229 -        * issued.
 230 -        */
 231 -       if (cfs_rq->expires_seq != expires_seq) {
 232 -               cfs_rq->expires_seq = expires_seq;
 233 -               cfs_rq->runtime_expires = expires;
 234 -       }
 235
 236         return cfs_rq->runtime_remaining > 0;
 237  }
 238
 239 -/*
 240 - * Note: This depends on the synchronization provided by sched_clock and the
 241 - * fact that rq->clock snapshots this value.
 242 - */
 243 -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 244 -{
 245 -       struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 246 -
 247 -       /* if the deadline is ahead of our clock, nothing to do */
 248 -       if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
 249 -               return;
 250 -
 251 -       if (cfs_rq->runtime_remaining < 0)
 252 -               return;
 253 -
 254 -       /*
 255 -        * If the local deadline has passed we have to consider the
 256 -        * possibility that our sched_clock is 'fast' and the global deadline
 257 -        * has not truly expired.
 258 -        *
 259 -        * Fortunately we can check determine whether this the case by checking
 260 -        * whether the global deadline(cfs_b->expires_seq) has advanced.
 261 -        */
 262 -       if (cfs_rq->expires_seq == cfs_b->expires_seq) {
 263 -               /* extend local deadline, drift is bounded above by 2 ticks */
 264 -               cfs_rq->runtime_expires += TICK_NSEC;
 265 -       } else {
 266 -               /* global deadline is ahead, expiration has passed */
 267 -               cfs_rq->runtime_remaining = 0;
 268 -       }
 269 -}
 270 -
 271  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 272  {
 273         /* dock delta_exec before expiring quota (as it could span periods) */
 274         cfs_rq->runtime_remaining -= delta_exec;
 275 -       expire_cfs_rq_runtime(cfs_rq);
 276
 277         if (likely(cfs_rq->runtime_remaining > 0))
 278                 return;
 279 @@ -4658,8 +4611,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf
 280                 resched_curr(rq);
 281  }
 282
 283 -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 284 -               u64 remaining, u64 expires)
 285 +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
 286  {
 287         struct cfs_rq *cfs_rq;
 288         u64 runtime;
 289 @@ -4684,7 +4636,6 @@ static u64 distribute_cfs_runtime(struct
 290                 remaining -= runtime;
 291
 292                 cfs_rq->runtime_remaining += runtime;
 293 -               cfs_rq->runtime_expires = expires;
 294
 295                 /* we check whether we're throttled above */
 296                 if (cfs_rq->runtime_remaining > 0)
 297 @@ -4709,7 +4660,7 @@ next:
 298   */
 299  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 300  {
 301 -       u64 runtime, runtime_expires;
 302 +       u64 runtime;
 303         int throttled;
 304
 305         /* no need to continue the timer with no bandwidth constraint */
 306 @@ -4737,8 +4688,6 @@ static int do_sched_cfs_period_timer(str
 307         /* account preceding periods in which throttling occurred */
 308         cfs_b->nr_throttled += overrun;
 309
 310 -       runtime_expires = cfs_b->runtime_expires;
 311 -
 312         /*
 313          * This check is repeated as we are holding onto the new bandwidth while
 314          * we unthrottle. This can potentially race with an unthrottled group
 315 @@ -4751,8 +4700,7 @@ static int do_sched_cfs_period_timer(str
 316                 cfs_b->distribute_running = 1;
 317                 raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 318                 /* we can't nest cfs_b->lock while distributing bandwidth */
 319 -               runtime = distribute_cfs_runtime(cfs_b, runtime,
 320 -                                                runtime_expires);
 321 +               runtime = distribute_cfs_runtime(cfs_b, runtime);
 322                 raw_spin_lock_irqsave(&cfs_b->lock, flags);
 323
 324                 cfs_b->distribute_running = 0;
 325 @@ -4834,8 +4782,7 @@ static void __return_cfs_rq_runtime(stru
 326                 return;
 327
 328         raw_spin_lock(&cfs_b->lock);
 329 -       if (cfs_b->quota != RUNTIME_INF &&
 330 -           cfs_rq->runtime_expires == cfs_b->runtime_expires) {
 331 +       if (cfs_b->quota != RUNTIME_INF) {
 332                 cfs_b->runtime += slack_runtime;
 333
 334                 /* we are under rq->lock, defer unthrottling using a timer */
 335 @@ -4868,7 +4815,6 @@ static void do_sched_cfs_slack_timer(str
 336  {
 337         u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
 338         unsigned long flags;
 339 -       u64 expires;
 340
 341         /* confirm we're still not at a refresh boundary */
 342         raw_spin_lock_irqsave(&cfs_b->lock, flags);
 343 @@ -4886,7 +4832,6 @@ static void do_sched_cfs_slack_timer(str
 344         if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 345                 runtime = cfs_b->runtime;
 346
 347 -       expires = cfs_b->runtime_expires;
 348         if (runtime)
 349                 cfs_b->distribute_running = 1;
 350
 351 @@ -4895,11 +4840,10 @@ static void do_sched_cfs_slack_timer(str
 352         if (!runtime)
 353                 return;
 354
 355 -       runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 356 +       runtime = distribute_cfs_runtime(cfs_b, runtime);
 357
 358         raw_spin_lock_irqsave(&cfs_b->lock, flags);
 359 -       if (expires == cfs_b->runtime_expires)
 360 -               lsub_positive(&cfs_b->runtime, runtime);
 361 +       lsub_positive(&cfs_b->runtime, runtime);
 362         cfs_b->distribute_running = 0;
 363         raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 364  }
 365 @@ -5064,8 +5008,6 @@ void start_cfs_bandwidth(struct cfs_band
 366
 367         cfs_b->period_active = 1;
 368         overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
 369 -       cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
 370 -       cfs_b->expires_seq++;
 371         hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
 372  }
 373
 374 --- a/kernel/sched/sched.h
 375 +++ b/kernel/sched/sched.h
 376 @@ -335,8 +335,6 @@ struct cfs_bandwidth {
 377         u64                     quota;
 378         u64                     runtime;
 379         s64                     hierarchical_quota;
 380 -       u64                     runtime_expires;
 381 -       int                     expires_seq;
 382
 383         u8                      idle;
 384         u8                      period_active;
 385 @@ -556,8 +554,6 @@ struct cfs_rq {
 386
 387  #ifdef CONFIG_CFS_BANDWIDTH
 388         int                     runtime_enabled;
 389 -       int                     expires_seq;
 390 -       u64                     runtime_expires;
 391         s64                     runtime_remaining;
 392
 393         u64                     throttled_clock;