From: Greg Kroah-Hartman
Date: Tue, 28 Nov 2017 08:30:05 +0000 (+0100)
Subject: 4.4-stable patches
X-Git-Tag: v3.18.85~17
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=13ea052e6e391aa08d5ab02b9d7c8067feb20d21;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
      sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
---

diff --git a/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch b/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
new file mode 100644
index 00000000000..fe502244eab
--- /dev/null
+++ b/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
@@ -0,0 +1,487 @@
+From 4bdced5c9a2922521e325896a7bbbf0132c94e56 Mon Sep 17 00:00:00 2001
+From: "Steven Rostedt (Red Hat)"
+Date: Fri, 6 Oct 2017 14:05:04 -0400
+Subject: sched/rt: Simplify the IPI based RT balancing logic
+
+From: Steven Rostedt (Red Hat)
+
+commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
+
+When a CPU lowers its priority (schedules out a high priority task for a
+lower priority one), a check is made to see if any other CPU has overloaded
+RT tasks (more than one). It checks the rto_mask to determine this and, if so,
+it will request to pull one of those tasks to itself if the non-running RT
+task is of higher priority than the new priority of the next task to run on
+the current CPU.
+
+When we deal with a large number of CPUs, the original pull logic suffered
+from large lock contention on a single CPU run queue, which caused a huge
+latency across all CPUs. This happened when only one CPU had overloaded
+RT tasks while a bunch of other CPUs were lowering their priority. To
+solve this issue, commit:
+
+  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
+
+changed the way to request a pull. Instead of grabbing the lock of the
+overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
+
+Although the IPI logic worked very well in removing the large latency build
+up, it still could suffer from a large number of IPIs being sent to a single
+CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
+when I tested this on a 120 CPU box, with a stress test that had lots of
+RT tasks scheduling on all CPUs, it actually triggered the hard lockup
+detector! One CPU had so many IPIs sent to it, and due to the restart
+mechanism that is triggered when the source run queue has a priority status
+change, the CPU spent minutes! processing the IPIs.
+
+Thinking about this further, I realized there's no reason for each run queue
+to send its own IPI. As all CPUs with overloaded tasks must be scanned
+regardless of whether one or many CPUs are lowering their priority (there is
+currently no way to find the CPU with the highest priority task that can
+schedule to one of these CPUs), there really only needs to be one IPI
+being sent around at a time.
+
+This greatly simplifies the code!
+
+The new approach is to have each root domain have its own irq work, as the
+rto_mask is per root domain. The root domain has the following fields
+attached to it:
+
+  rto_push_work  - the irq work to process each CPU set in rto_mask
+  rto_lock       - the lock to protect some of the other rto fields
+  rto_loop_start - an atomic that keeps contention down on rto_lock.
+                   The first CPU scheduling in a lower priority task
+                   is the one to kick off the process.
+  rto_loop_next  - an atomic that gets incremented for each CPU that
+                   schedules in a lower priority task.
+  rto_loop       - a variable protected by rto_lock that is used to
+                   compare against rto_loop_next
+  rto_cpu        - the CPU to send the next IPI to, also protected by
+                   the rto_lock
+
+When a CPU schedules in a lower priority task and wants to make sure
+overloaded CPUs know about it, it increments rto_loop_next. Then it
+atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
+then it is done, as another CPU is kicking off the IPI loop. If the old
+value is "0", then it will take the rto_lock to synchronize with a possible
+IPI being sent around to the overloaded CPUs.
+
+If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
+IPI being sent around, or one is about to finish. Then rto_cpu is set to the
+first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
+in rto_mask, then there's nothing to be done.
+
+When the CPU receives the IPI, it will first try to push any RT tasks that are
+queued on the CPU but can't run because a higher priority RT task is
+currently running on that CPU.
+
+Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
+finds one, it simply sends an IPI to that CPU and the process continues.
+
+If there are no more CPUs in the rto_mask, then rto_loop is compared with
+rto_loop_next. If they match, everything is done and the process is over. If
+they do not match, then a CPU scheduled in a lower priority task while the IPI
+was being passed around, and the process needs to start again. The first CPU
+in rto_mask is sent the IPI.
+
+This change removes the duplication of work in the IPI logic, and greatly
+lowers the latency caused by the IPIs. This removed the lockup happening on
+the 120 CPU machine. It also simplifies the code tremendously. What else
+could anyone ask for?
+
+Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
+supplying me with the rto_start_trylock() and rto_start_unlock() helper
+functions.
+
+Signed-off-by: Steven Rostedt (VMware)
+Signed-off-by: Peter Zijlstra (Intel)
+Cc: Clark Williams
+Cc: Daniel Bristot de Oliveira
+Cc: John Kacur
+Cc: Linus Torvalds
+Cc: Mike Galbraith
+Cc: Peter Zijlstra
+Cc: Scott Wood
+Cc: Thomas Gleixner
+Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
+Signed-off-by: Ingo Molnar
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ kernel/sched/core.c | 6 +
+ kernel/sched/rt.c | 235 ++++++++++++++++++++++++---------------------
+ kernel/sched/sched.h | 24 +++--
+ 3 files changed, 138 insertions(+), 127 deletions(-)
+
+--- a/kernel/sched/core.c
++++ b/kernel/sched/core.c
+@@ -5907,6 +5907,12 @@ static int init_rootdomain(struct root_d
+ if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+ goto free_dlo_mask;
+
++#ifdef HAVE_RT_PUSH_IPI
++ rd->rto_cpu = -1;
++ raw_spin_lock_init(&rd->rto_lock);
++ init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
++#endif
++
+ init_dl_bw(&rd->dl_bw);
+ if (cpudl_init(&rd->cpudl) != 0)
+ goto free_dlo_mask;
+--- a/kernel/sched/rt.c
++++ b/kernel/sched/rt.c
+@@ -64,10 +64,6 @@ static void start_rt_bandwidth(struct rt
+ raw_spin_unlock(&rt_b->rt_runtime_lock);
+ }
+
+-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
+-static void push_irq_work_func(struct irq_work *work);
+-#endif
+-
+ void init_rt_rq(struct rt_rq *rt_rq)
+ {
+ struct rt_prio_array *array;
+@@ -87,13 +83,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
+ rt_rq->rt_nr_migratory = 0;
+ rt_rq->overloaded = 0;
+ plist_head_init(&rt_rq->pushable_tasks);
+-
+-#ifdef HAVE_RT_PUSH_IPI
+- rt_rq->push_flags = 0;
+- rt_rq->push_cpu = nr_cpu_ids;
+- raw_spin_lock_init(&rt_rq->push_lock);
+- init_irq_work(&rt_rq->push_work, push_irq_work_func);
+-#endif
+ #endif /* CONFIG_SMP */
+ /* We start is dequeued state, because no RT tasks are queued */
+ rt_rq->rt_queued = 0;
+@@ -1802,160 +1791,166 @@ static void push_rt_tasks(struct rq *rq)
+ }
+
+ #ifdef HAVE_RT_PUSH_IPI
++
+ /*
+- * The search for the next cpu always starts at rq->cpu and ends
+- * when we reach rq->cpu again. It will never return rq->cpu.
+- * This returns the next cpu to check, or nr_cpu_ids if the loop
+- * is complete.
++ * When a high priority task schedules out from a CPU and a lower priority
++ * task is scheduled in, a check is made to see if there's any RT tasks
++ * on other CPUs that are waiting to run because a higher priority RT task
++ * is currently running on its CPU. In this case, the CPU with multiple RT
++ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
++ * up that may be able to run one of its non-running queued RT tasks.
++ *
++ * All CPUs with overloaded RT tasks need to be notified as there is currently
++ * no way to know which of these CPUs have the highest priority task waiting
++ * to run. Instead of trying to take a spinlock on each of these CPUs,
++ * which has shown to cause large latency when done on machines with many
++ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
++ * RT tasks waiting to run.
++ *
++ * Just sending an IPI to each of the CPUs is also an issue, as on large
++ * count CPU machines, this can cause an IPI storm on a CPU, especially
++ * if its the only CPU with multiple RT tasks queued, and a large number
++ * of CPUs scheduling a lower priority task at the same time.
++ *
++ * Each root domain has its own irq work function that can iterate over
++ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
++ * tasks must be checked if there's one or many CPUs that are lowering
++ * their priority, there's a single irq work iterator that will try to
++ * push off RT tasks that are waiting to run.
++ *
++ * When a CPU schedules a lower priority task, it will kick off the
++ * irq work iterator that will jump to each CPU with overloaded RT tasks.
++ * As it only takes the first CPU that schedules a lower priority task
++ * to start the process, the rto_start variable is incremented and if
++ * the atomic result is one, then that CPU will try to take the rto_lock.
++ * This prevents high contention on the lock as the process handles all
++ * CPUs scheduling lower priority tasks.
++ *
++ * All CPUs that are scheduling a lower priority task will increment the
++ * rt_loop_next variable. This will make sure that the irq work iterator
++ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
++ * priority task, even if the iterator is in the middle of a scan. Incrementing
++ * the rt_loop_next will cause the iterator to perform another scan.
+ *
+- * rq->rt.push_cpu holds the last cpu returned by this function,
+- * or if this is the first instance, it must hold rq->cpu.
+ */
+ static int rto_next_cpu(struct rq *rq)
+ {
+- int prev_cpu = rq->rt.push_cpu;
++ struct root_domain *rd = rq->rd;
++ int next;
+ int cpu;
+
+- cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+-
+ /*
+- * If the previous cpu is less than the rq's CPU, then it already
+- * passed the end of the mask, and has started from the beginning.
+- * We end if the next CPU is greater or equal to rq's CPU.
++ * When starting the IPI RT pushing, the rto_cpu is set to -1,
++ * rt_next_cpu() will simply return the first CPU found in
++ * the rto_mask.
++ *
++ * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
++ * will return the next CPU found in the rto_mask.
++ *
++ * If there are no more CPUs left in the rto_mask, then a check is made
++ * against rto_loop and rto_loop_next. rto_loop is only updated with
++ * the rto_lock held, but any CPU may increment the rto_loop_next
++ * without any locking.
+ */
+- if (prev_cpu < rq->cpu) {
+- if (cpu >= rq->cpu)
+- return nr_cpu_ids;
++ for (;;) {
+
+- } else if (cpu >= nr_cpu_ids) {
+- /*
+- * We passed the end of the mask, start at the beginning.
+- * If the result is greater or equal to the rq's CPU, then
+- * the loop is finished.
+- */
+- cpu = cpumask_first(rq->rd->rto_mask);
+- if (cpu >= rq->cpu)
+- return nr_cpu_ids;
+- }
+- rq->rt.push_cpu = cpu;
++ /* When rto_cpu is -1 this acts like cpumask_first() */
++ cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+
+- /* Return cpu to let the caller know if the loop is finished or not */
+- return cpu;
+-}
++ rd->rto_cpu = cpu;
+
+-static int find_next_push_cpu(struct rq *rq)
+-{
+- struct rq *next_rq;
+- int cpu;
++ if (cpu < nr_cpu_ids)
++ return cpu;
+
+- while (1) {
+- cpu = rto_next_cpu(rq);
+- if (cpu >= nr_cpu_ids)
+- break;
+- next_rq = cpu_rq(cpu);
++ rd->rto_cpu = -1;
++
++ /*
++ * ACQUIRE ensures we see the @rto_mask changes
++ * made prior to the @next value observed.
++ *
++ * Matches WMB in rt_set_overload().
++ */
++ next = atomic_read_acquire(&rd->rto_loop_next);
+
+- /* Make sure the next rq can push to this rq */
+- if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
++ if (rd->rto_loop == next)
+ break;
++
++ rd->rto_loop = next;
+ }
+
+- return cpu;
++ return -1;
+ }
+
+-#define RT_PUSH_IPI_EXECUTING 1
+-#define RT_PUSH_IPI_RESTART 2
++static inline bool rto_start_trylock(atomic_t *v)
++{
++ return !atomic_cmpxchg_acquire(v, 0, 1);
++}
+
+-static void tell_cpu_to_push(struct rq *rq)
++static inline void rto_start_unlock(atomic_t *v)
+ {
+- int cpu;
++ atomic_set_release(v, 0);
++}
+
+- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+- raw_spin_lock(&rq->rt.push_lock);
+- /* Make sure it's still executing */
+- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+- /*
+- * Tell the IPI to restart the loop as things have
+- * changed since it started.
+- */
+- rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+- raw_spin_unlock(&rq->rt.push_lock);
+- return;
+- }
+- raw_spin_unlock(&rq->rt.push_lock);
+- }
++static void tell_cpu_to_push(struct rq *rq)
++{
++ int cpu = -1;
+
+- /* When here, there's no IPI going around */
++ /* Keep the loop going if the IPI is currently active */
++ atomic_inc(&rq->rd->rto_loop_next);
+
+- rq->rt.push_cpu = rq->cpu;
+- cpu = find_next_push_cpu(rq);
+- if (cpu >= nr_cpu_ids)
++ /* Only one CPU can initiate a loop at a time */
++ if (!rto_start_trylock(&rq->rd->rto_loop_start))
+ return;
+
+- rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
++ raw_spin_lock(&rq->rd->rto_lock);
++
++ /*
++ * The rto_cpu is updated under the lock, if it has a valid cpu
++ * then the IPI is still running and will continue due to the
++ * update to loop_next, and nothing needs to be done here.
++ * Otherwise it is finishing up and an ipi needs to be sent.
++ */
++ if (rq->rd->rto_cpu < 0)
++ cpu = rto_next_cpu(rq);
++
++ raw_spin_unlock(&rq->rd->rto_lock);
+
+- irq_work_queue_on(&rq->rt.push_work, cpu);
++ rto_start_unlock(&rq->rd->rto_loop_start);
++
++ if (cpu >= 0)
++ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+ }
+
+ /* Called from hardirq context */
+-static void try_to_push_tasks(void *arg)
++void rto_push_irq_work_func(struct irq_work *work)
+ {
+- struct rt_rq *rt_rq = arg;
+- struct rq *rq, *src_rq;
+- int this_cpu;
++ struct rq *rq;
+ int cpu;
+
+- this_cpu = rt_rq->push_cpu;
++ rq = this_rq();
+
+- /* Paranoid check */
+- BUG_ON(this_cpu != smp_processor_id());
+-
+- rq = cpu_rq(this_cpu);
+- src_rq = rq_of_rt_rq(rt_rq);
+-
+-again:
++ /*
++ * We do not need to grab the lock to check for has_pushable_tasks.
++ * When it gets updated, a check is made if a push is possible.
++ */
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+- push_rt_task(rq);
++ push_rt_tasks(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+- /* Pass the IPI to the next rt overloaded queue */
+- raw_spin_lock(&rt_rq->push_lock);
+- /*
+- * If the source queue changed since the IPI went out,
+- * we need to restart the search from that CPU again.
+- */
+- if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+- rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+- rt_rq->push_cpu = src_rq->cpu;
+- }
++ raw_spin_lock(&rq->rd->rto_lock);
+
+- cpu = find_next_push_cpu(src_rq);
++ /* Pass the IPI to the next rt overloaded queue */
++ cpu = rto_next_cpu(rq);
+
+- if (cpu >= nr_cpu_ids)
+- rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+- raw_spin_unlock(&rt_rq->push_lock);
++ raw_spin_unlock(&rq->rd->rto_lock);
+
+- if (cpu >= nr_cpu_ids)
++ if (cpu < 0)
+ return;
+
+- /*
+- * It is possible that a restart caused this CPU to be
+- * chosen again. Don't bother with an IPI, just see if we
+- * have more to push.
+- */
+- if (unlikely(cpu == rq->cpu))
+- goto again;
+-
+ /* Try the next RT overloaded CPU */
+- irq_work_queue_on(&rt_rq->push_work, cpu);
+-}
+-
+-static void push_irq_work_func(struct irq_work *work)
+-{
+- struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+-
+- try_to_push_tasks(rt_rq);
++ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+ }
+ #endif /* HAVE_RT_PUSH_IPI */
+
+--- a/kernel/sched/sched.h
++++ b/kernel/sched/sched.h
+@@ -429,7 +429,7 @@ static inline int rt_bandwidth_enabled(v
+ }
+
+ /* RT IPI pull logic requires IRQ_WORK */
+-#ifdef CONFIG_IRQ_WORK
++#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
+ # define HAVE_RT_PUSH_IPI
+ #endif
+
+@@ -450,12 +450,6 @@ struct rt_rq {
+ unsigned long rt_nr_total;
+ int overloaded;
+ struct plist_head pushable_tasks;
+-#ifdef HAVE_RT_PUSH_IPI
+- int push_flags;
+- int push_cpu;
+- struct irq_work push_work;
+- raw_spinlock_t push_lock;
+-#endif
+ #endif /* CONFIG_SMP */
+ int rt_queued;
+
+@@ -537,6 +531,19 @@ struct root_domain {
+ struct dl_bw dl_bw;
+ struct cpudl cpudl;
+
++#ifdef HAVE_RT_PUSH_IPI
++ /*
++ * For IPI pull requests, loop across the rto_mask.
++ */
++ struct irq_work rto_push_work;
++ raw_spinlock_t rto_lock;
++ /* These are only updated and read within rto_lock */
++ int rto_loop;
++ int rto_cpu;
++ /* These atomics are updated outside of a lock */
++ atomic_t rto_loop_next;
++ atomic_t rto_loop_start;
++#endif
+ /*
+ * The "RT overload" flag: it gets set if a CPU has more than
+ * one runnable RT task.
+@@ -547,6 +554,9 @@ struct root_domain {
+
+ extern struct root_domain def_root_domain;
+
++#ifdef HAVE_RT_PUSH_IPI
++extern void rto_push_irq_work_func(struct irq_work *work);
++#endif
+ #endif /* CONFIG_SMP */
+
+ /*
diff --git a/queue-4.4/series b/queue-4.4/series
index 1ae1620969a..c419fbce2ca 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -63,3 +63,4 @@ media-don-t-do-dma-on-stack-for-firmware-upload-in-the-as102-driver.patch
 media-rc-check-for-integer-overflow.patch
 cx231xx-cards-fix-null-deref-on-missing-association-descriptor.patch
 media-v4l2-ctrl-fix-flags-field-on-control-events.patch
+sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
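
For reference, the loop-start handshake described in the commit message (rto_loop_start,
rto_loop_next and rto_loop) can be sketched outside the kernel with C11 atomics. This is
only a minimal illustration under those assumptions: the names mirror the new root_domain
fields, but the stdatomic types, the printf output and the simplified main() are not part
of the patch or of the kernel API.

/*
 * Standalone sketch (not kernel code) of the handshake described above.
 * One CPU wins rto_loop_start and becomes the single IPI iterator; every
 * CPU that lowers its priority bumps rto_loop_next so the iterator knows
 * it must do another full pass over the overloaded CPUs.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int rto_loop_start;   /* 0 = no iterator currently running */
static atomic_int rto_loop_next;    /* bumped by every CPU lowering its prio */
static int rto_loop;                /* last rto_loop_next value the iterator saw */

/* Only the first caller wins; losers return false and back off immediately. */
static bool rto_start_trylock(atomic_int *v)
{
	int expected = 0;
	return atomic_compare_exchange_strong(v, &expected, 1);
}

static void rto_start_unlock(atomic_int *v)
{
	atomic_store(v, 0);
}

/* Called by a CPU that just scheduled in a lower priority task. */
static void tell_cpu_to_push(void)
{
	/* Make sure a running iterator performs at least one more pass. */
	atomic_fetch_add(&rto_loop_next, 1);

	/* Only one CPU kicks off the IPI loop at a time. */
	if (!rto_start_trylock(&rto_loop_start))
		return;

	printf("starting the IPI loop\n");
	/* ... pick the first CPU in rto_mask and queue the irq work ... */

	rto_start_unlock(&rto_loop_start);
}

/* The iterator decides whether another full scan of rto_mask is needed. */
static bool need_another_pass(void)
{
	int next = atomic_load(&rto_loop_next);

	if (rto_loop == next)
		return false;   /* nothing changed while we were scanning */

	rto_loop = next;        /* record the new generation and rescan */
	return true;
}

int main(void)
{
	tell_cpu_to_push();
	printf("another pass needed: %s\n", need_another_pass() ? "yes" : "no");
	return 0;
}

The point of the cmpxchg-based try-lock is that losing CPUs return immediately instead of
contending, while the rto_loop/rto_loop_next generation check guarantees the single
iterator rescans whenever some CPU lowered its priority in the middle of a pass.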