From: Greg Kroah-Hartman
Date: Tue, 28 Nov 2017 08:30:05 +0000 (+0100)
Subject: 4.4-stable patches
X-Git-Tag: v3.18.85~17
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=13ea052e6e391aa08d5ab02b9d7c8067feb20d21;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
      sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
---

diff --git a/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch b/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
new file mode 100644
index 00000000000..fe502244eab
--- /dev/null
+++ b/queue-4.4/sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
@@ -0,0 +1,487 @@
+From 4bdced5c9a2922521e325896a7bbbf0132c94e56 Mon Sep 17 00:00:00 2001
+From: "Steven Rostedt (Red Hat)"
+Date: Fri, 6 Oct 2017 14:05:04 -0400
+Subject: sched/rt: Simplify the IPI based RT balancing logic
+
+From: Steven Rostedt (Red Hat)
+
+commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.
+
+When a CPU lowers its priority (schedules out a high priority task for a
+lower priority one), a check is made to see if any other CPU has overloaded
+RT tasks (more than one). It checks the rto_mask to determine this and, if so,
+it will request to pull one of those tasks to itself if the non-running RT
+task is of higher priority than the new priority of the next task to run on
+the current CPU.
+
+When we deal with a large number of CPUs, the original pull logic suffered
+from large lock contention on a single CPU run queue, which caused a huge
+latency across all CPUs. This happened when only one CPU had overloaded
+RT tasks while a bunch of other CPUs were lowering their priority. To
+solve this issue, commit:
+
+  b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
+
+changed the way to request a pull. Instead of grabbing the lock of the
+overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
+
+Although the IPI logic worked very well in removing the large latency build
+up, it still could suffer from a large number of IPIs being sent to a single
+CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
+when I tested this on a 120 CPU box, with a stress test that had lots of
+RT tasks scheduling on all CPUs, it actually triggered the hard lockup
+detector! One CPU had so many IPIs sent to it, and due to the restart
+mechanism that is triggered when the source run queue has a priority status
+change, the CPU spent minutes! processing the IPIs.
+
+Thinking about this further, I realized there's no reason for each run queue
+to send its own IPI. As all CPUs with overloaded tasks must be scanned
+regardless of whether one or many CPUs are lowering their priority (there is
+currently no way to find the CPU with the highest priority task that can
+schedule to one of these CPUs), there really only needs to be one IPI
+being sent around at a time.
+
+This greatly simplifies the code!
+
+The new approach is to have each root domain have its own irq work, as the
+rto_mask is per root domain. The root domain has the following fields
+attached to it:
+
+  rto_push_work  - the irq work to process each CPU set in rto_mask
+  rto_lock       - the lock to protect some of the other rto fields
+  rto_loop_start - an atomic that keeps contention down on rto_lock.
+                   The first CPU scheduling in a lower priority task
+                   is the one to kick off the process.
+  rto_loop_next  - an atomic that gets incremented for each CPU that
+                   schedules in a lower priority task.
+  rto_loop       - a variable protected by rto_lock that is used to
+                   compare against rto_loop_next
+  rto_cpu        - the CPU to send the next IPI to, also protected by
+                   the rto_lock
+
+When a CPU schedules in a lower priority task and wants to make sure
+overloaded CPUs know about it, it increments rto_loop_next. Then it
+atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
+then it is done, as another CPU is kicking off the IPI loop. If the old
+value is "0", then it will take the rto_lock to synchronize with a possible
+IPI being sent around to the overloaded CPUs.
+
+If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
+IPI being sent around, or one is about to finish. Then rto_cpu is set to the
+first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs set
+in rto_mask, then there's nothing to be done.
+
+When the CPU receives the IPI, it will first try to push any RT tasks that are
+queued on the CPU but can't run because a higher priority RT task is
+currently running on that CPU.
+
+Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
+finds one, it simply sends an IPI to that CPU and the process continues.
+
+If there are no more CPUs in the rto_mask, then rto_loop is compared with
+rto_loop_next. If they match, everything is done and the process is over. If
+they do not match, then a CPU scheduled in a lower priority task while the IPI
+was being passed around, and the process needs to start again. The first CPU
+in rto_mask is sent the IPI.
+
+This change removes the duplication of work in the IPI logic, and greatly
+lowers the latency caused by the IPIs. This removed the lockup happening on
+the 120 CPU machine. It also simplifies the code tremendously. What else
+could anyone ask for?
+
+Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
+supplying me with the rto_start_trylock() and rto_start_unlock() helper
+functions.
+
+Signed-off-by: Steven Rostedt (VMware)
+Signed-off-by: Peter Zijlstra (Intel)
+Cc: Clark Williams
+Cc: Daniel Bristot de Oliveira
+Cc: John Kacur
+Cc: Linus Torvalds
+Cc: Mike Galbraith
+Cc: Peter Zijlstra
+Cc: Scott Wood
+Cc: Thomas Gleixner
+Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
+Signed-off-by: Ingo Molnar
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ kernel/sched/core.c | 6 +
+ kernel/sched/rt.c | 235 ++++++++++++++++++++++++---------------------
+ kernel/sched/sched.h | 24 +++--
+ 3 files changed, 138 insertions(+), 127 deletions(-)
+
+--- a/kernel/sched/core.c
++++ b/kernel/sched/core.c
+@@ -5907,6 +5907,12 @@ static int init_rootdomain(struct root_d
+ if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
+ goto free_dlo_mask;
+
++#ifdef HAVE_RT_PUSH_IPI
++ rd->rto_cpu = -1;
++ raw_spin_lock_init(&rd->rto_lock);
++ init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);
++#endif
++
+ init_dl_bw(&rd->dl_bw);
+ if (cpudl_init(&rd->cpudl) != 0)
+ goto free_dlo_mask;
+--- a/kernel/sched/rt.c
++++ b/kernel/sched/rt.c
+@@ -64,10 +64,6 @@ static void start_rt_bandwidth(struct rt
+ raw_spin_unlock(&rt_b->rt_runtime_lock);
+ }
+
+-#if defined(CONFIG_SMP) && defined(HAVE_RT_PUSH_IPI)
+-static void push_irq_work_func(struct irq_work *work);
+-#endif
+-
+ void init_rt_rq(struct rt_rq *rt_rq)
+ {
+ struct rt_prio_array *array;
+@@ -87,13 +83,6 @@ void init_rt_rq(struct rt_rq *rt_rq)
+ rt_rq->rt_nr_migratory = 0;
+ rt_rq->overloaded = 0;
+ plist_head_init(&rt_rq->pushable_tasks);
+-
+-#ifdef HAVE_RT_PUSH_IPI
+- rt_rq->push_flags = 0;
+- rt_rq->push_cpu = nr_cpu_ids;
+- raw_spin_lock_init(&rt_rq->push_lock);
+- init_irq_work(&rt_rq->push_work, push_irq_work_func);
+-#endif
+ #endif /* CONFIG_SMP */
+ /* We start is dequeued state, because no RT tasks are queued */
+ rt_rq->rt_queued = 0;
+@@ -1802,160 +1791,166 @@ static void push_rt_tasks(struct rq *rq)
+ }
+
+ #ifdef HAVE_RT_PUSH_IPI
++
+ /*
+- * The search for the next cpu always starts at rq->cpu and ends
+- * when we reach rq->cpu again. It will never return rq->cpu.
+- * This returns the next cpu to check, or nr_cpu_ids if the loop
+- * is complete.
++ * When a high priority task schedules out from a CPU and a lower priority
++ * task is scheduled in, a check is made to see if there's any RT tasks
++ * on other CPUs that are waiting to run because a higher priority RT task
++ * is currently running on its CPU. In this case, the CPU with multiple RT
++ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
++ * up that may be able to run one of its non-running queued RT tasks.
++ *
++ * All CPUs with overloaded RT tasks need to be notified as there is currently
++ * no way to know which of these CPUs have the highest priority task waiting
++ * to run. Instead of trying to take a spinlock on each of these CPUs,
++ * which has shown to cause large latency when done on machines with many
++ * CPUs, sending an IPI to the CPUs to have them push off the overloaded
++ * RT tasks waiting to run.
++ *
++ * Just sending an IPI to each of the CPUs is also an issue, as on large
++ * count CPU machines, this can cause an IPI storm on a CPU, especially
++ * if its the only CPU with multiple RT tasks queued, and a large number
++ * of CPUs scheduling a lower priority task at the same time.
++ *
++ * Each root domain has its own irq work function that can iterate over
++ * all CPUs with RT overloaded tasks. Since all CPUs with overloaded RT
++ * tasks must be checked if there's one or many CPUs that are lowering
++ * their priority, there's a single irq work iterator that will try to
++ * push off RT tasks that are waiting to run.
++ *
++ * When a CPU schedules a lower priority task, it will kick off the
++ * irq work iterator that will jump to each CPU with overloaded RT tasks.
++ * As it only takes the first CPU that schedules a lower priority task
++ * to start the process, the rto_start variable is incremented and if
++ * the atomic result is one, then that CPU will try to take the rto_lock.
++ * This prevents high contention on the lock as the process handles all
++ * CPUs scheduling lower priority tasks.
++ *
++ * All CPUs that are scheduling a lower priority task will increment the
++ * rt_loop_next variable. This will make sure that the irq work iterator
++ * checks all RT overloaded CPUs whenever a CPU schedules a new lower
++ * priority task, even if the iterator is in the middle of a scan. Incrementing
++ * the rt_loop_next will cause the iterator to perform another scan.
+ *
+- * rq->rt.push_cpu holds the last cpu returned by this function,
+- * or if this is the first instance, it must hold rq->cpu.
+ */
+ static int rto_next_cpu(struct rq *rq)
+ {
+- int prev_cpu = rq->rt.push_cpu;
++ struct root_domain *rd = rq->rd;
++ int next;
+ int cpu;
+
+- cpu = cpumask_next(prev_cpu, rq->rd->rto_mask);
+-
+ /*
+- * If the previous cpu is less than the rq's CPU, then it already
+- * passed the end of the mask, and has started from the beginning.
+- * We end if the next CPU is greater or equal to rq's CPU.
++ * When starting the IPI RT pushing, the rto_cpu is set to -1,
++ * rt_next_cpu() will simply return the first CPU found in
++ * the rto_mask.
++ *
++ * If rto_next_cpu() is called with rto_cpu is a valid cpu, it
++ * will return the next CPU found in the rto_mask.
++ *
++ * If there are no more CPUs left in the rto_mask, then a check is made
++ * against rto_loop and rto_loop_next. rto_loop is only updated with
++ * the rto_lock held, but any CPU may increment the rto_loop_next
++ * without any locking.
+ */
+- if (prev_cpu < rq->cpu) {
+- if (cpu >= rq->cpu)
+- return nr_cpu_ids;
++ for (;;) {
+
+- } else if (cpu >= nr_cpu_ids) {
+- /*
+- * We passed the end of the mask, start at the beginning.
+- * If the result is greater or equal to the rq's CPU, then
+- * the loop is finished.
+- */
+- cpu = cpumask_first(rq->rd->rto_mask);
+- if (cpu >= rq->cpu)
+- return nr_cpu_ids;
+- }
+- rq->rt.push_cpu = cpu;
++ /* When rto_cpu is -1 this acts like cpumask_first() */
++ cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+
+- /* Return cpu to let the caller know if the loop is finished or not */
+- return cpu;
+-}
++ rd->rto_cpu = cpu;
+
+-static int find_next_push_cpu(struct rq *rq)
+-{
+- struct rq *next_rq;
+- int cpu;
++ if (cpu < nr_cpu_ids)
++ return cpu;
+
+- while (1) {
+- cpu = rto_next_cpu(rq);
+- if (cpu >= nr_cpu_ids)
+- break;
+- next_rq = cpu_rq(cpu);
++ rd->rto_cpu = -1;
++
++ /*
++ * ACQUIRE ensures we see the @rto_mask changes
++ * made prior to the @next value observed.
++ *
++ * Matches WMB in rt_set_overload().
++ */
++ next = atomic_read_acquire(&rd->rto_loop_next);
+
+- /* Make sure the next rq can push to this rq */
+- if (next_rq->rt.highest_prio.next < rq->rt.highest_prio.curr)
++ if (rd->rto_loop == next)
+ break;
++
++ rd->rto_loop = next;
+ }
+
+- return cpu;
++ return -1;
+ }
+
+-#define RT_PUSH_IPI_EXECUTING 1
+-#define RT_PUSH_IPI_RESTART 2
++static inline bool rto_start_trylock(atomic_t *v)
++{
++ return !atomic_cmpxchg_acquire(v, 0, 1);
++}
+
+-static void tell_cpu_to_push(struct rq *rq)
++static inline void rto_start_unlock(atomic_t *v)
+ {
+- int cpu;
++ atomic_set_release(v, 0);
++}
+
+- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+- raw_spin_lock(&rq->rt.push_lock);
+- /* Make sure it's still executing */
+- if (rq->rt.push_flags & RT_PUSH_IPI_EXECUTING) {
+- /*
+- * Tell the IPI to restart the loop as things have
+- * changed since it started.
+- */
+- rq->rt.push_flags |= RT_PUSH_IPI_RESTART;
+- raw_spin_unlock(&rq->rt.push_lock);
+- return;
+- }
+- raw_spin_unlock(&rq->rt.push_lock);
+- }
++static void tell_cpu_to_push(struct rq *rq)
++{
++ int cpu = -1;
+
+- /* When here, there's no IPI going around */
++ /* Keep the loop going if the IPI is currently active */
++ atomic_inc(&rq->rd->rto_loop_next);
+
+- rq->rt.push_cpu = rq->cpu;
+- cpu = find_next_push_cpu(rq);
+- if (cpu >= nr_cpu_ids)
++ /* Only one CPU can initiate a loop at a time */
++ if (!rto_start_trylock(&rq->rd->rto_loop_start))
+ return;
+
+- rq->rt.push_flags = RT_PUSH_IPI_EXECUTING;
++ raw_spin_lock(&rq->rd->rto_lock);
++
++ /*
++ * The rto_cpu is updated under the lock, if it has a valid cpu
++ * then the IPI is still running and will continue due to the
++ * update to loop_next, and nothing needs to be done here.
++ * Otherwise it is finishing up and an ipi needs to be sent.
++ */
++ if (rq->rd->rto_cpu < 0)
++ cpu = rto_next_cpu(rq);
++
++ raw_spin_unlock(&rq->rd->rto_lock);
+
+- irq_work_queue_on(&rq->rt.push_work, cpu);
++ rto_start_unlock(&rq->rd->rto_loop_start);
++
++ if (cpu >= 0)
++ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+ }
+
+ /* Called from hardirq context */
+-static void try_to_push_tasks(void *arg)
++void rto_push_irq_work_func(struct irq_work *work)
+ {
+- struct rt_rq *rt_rq = arg;
+- struct rq *rq, *src_rq;
+- int this_cpu;
++ struct rq *rq;
+ int cpu;
+
+- this_cpu = rt_rq->push_cpu;
++ rq = this_rq();
+
+- /* Paranoid check */
+- BUG_ON(this_cpu != smp_processor_id());
+-
+- rq = cpu_rq(this_cpu);
+- src_rq = rq_of_rt_rq(rt_rq);
+-
+-again:
++ /*
++ * We do not need to grab the lock to check for has_pushable_tasks.
++ * When it gets updated, a check is made if a push is possible.
++ */
+ if (has_pushable_tasks(rq)) {
+ raw_spin_lock(&rq->lock);
+- push_rt_task(rq);
++ push_rt_tasks(rq);
+ raw_spin_unlock(&rq->lock);
+ }
+
+- /* Pass the IPI to the next rt overloaded queue */
+- raw_spin_lock(&rt_rq->push_lock);
+- /*
+- * If the source queue changed since the IPI went out,
+- * we need to restart the search from that CPU again.
+- */
+- if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
+- rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
+- rt_rq->push_cpu = src_rq->cpu;
+- }
++ raw_spin_lock(&rq->rd->rto_lock);
+
+- cpu = find_next_push_cpu(src_rq);
++ /* Pass the IPI to the next rt overloaded queue */
++ cpu = rto_next_cpu(rq);
+
+- if (cpu >= nr_cpu_ids)
+- rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
+- raw_spin_unlock(&rt_rq->push_lock);
++ raw_spin_unlock(&rq->rd->rto_lock);
+
+- if (cpu >= nr_cpu_ids)
++ if (cpu < 0)
+ return;
+
+- /*
+- * It is possible that a restart caused this CPU to be
+- * chosen again. Don't bother with an IPI, just see if we
+- * have more to push.
+- */
+- if (unlikely(cpu == rq->cpu))
+- goto again;
+-
+ /* Try the next RT overloaded CPU */
+- irq_work_queue_on(&rt_rq->push_work, cpu);
+-}
+-
+-static void push_irq_work_func(struct irq_work *work)
+-{
+- struct rt_rq *rt_rq = container_of(work, struct rt_rq, push_work);
+-
+- try_to_push_tasks(rt_rq);
++ irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+ }
+ #endif /* HAVE_RT_PUSH_IPI */
+
+--- a/kernel/sched/sched.h
++++ b/kernel/sched/sched.h
+@@ -429,7 +429,7 @@ static inline int rt_bandwidth_enabled(v
+ }
+
+ /* RT IPI pull logic requires IRQ_WORK */
+-#ifdef CONFIG_IRQ_WORK
++#if defined(CONFIG_IRQ_WORK) && defined(CONFIG_SMP)
+ # define HAVE_RT_PUSH_IPI
+ #endif
+
+@@ -450,12 +450,6 @@ struct rt_rq {
+ unsigned long rt_nr_total;
+ int overloaded;
+ struct plist_head pushable_tasks;
+-#ifdef HAVE_RT_PUSH_IPI
+- int push_flags;
+- int push_cpu;
+- struct irq_work push_work;
+- raw_spinlock_t push_lock;
+-#endif
+ #endif /* CONFIG_SMP */
+ int rt_queued;
+
+@@ -537,6 +531,19 @@ struct root_domain {
+ struct dl_bw dl_bw;
+ struct cpudl cpudl;
+
++#ifdef HAVE_RT_PUSH_IPI
++ /*
++ * For IPI pull requests, loop across the rto_mask.
++ */
++ struct irq_work rto_push_work;
++ raw_spinlock_t rto_lock;
++ /* These are only updated and read within rto_lock */
++ int rto_loop;
++ int rto_cpu;
++ /* These atomics are updated outside of a lock */
++ atomic_t rto_loop_next;
++ atomic_t rto_loop_start;
++#endif
+ /*
+ * The "RT overload" flag: it gets set if a CPU has more than
+ * one runnable RT task.
+@@ -547,6 +554,9 @@ struct root_domain {
+
+ extern struct root_domain def_root_domain;
+
++#ifdef HAVE_RT_PUSH_IPI
++extern void rto_push_irq_work_func(struct irq_work *work);
++#endif
+ #endif /* CONFIG_SMP */
+
+ /*
diff --git a/queue-4.4/series b/queue-4.4/series
index 1ae1620969a..c419fbce2ca 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -63,3 +63,4 @@ media-don-t-do-dma-on-stack-for-firmware-upload-in-the-as102-driver.patch
 media-rc-check-for-integer-overflow.patch
 cx231xx-cards-fix-null-deref-on-missing-association-descriptor.patch
 media-v4l2-ctrl-fix-flags-field-on-control-events.patch
+sched-rt-simplify-the-ipi-based-rt-balancing-logic.patch
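
For reference, the loop-start handshake described in the commit message (rto_loop_start,
rto_loop_next and rto_loop) can be sketched outside the kernel with C11 atomics. This is
only a minimal illustration under those assumptions: the names mirror the new root_domain
fields, but the stdatomic types, the printf output and the simplified main() are not part
of the patch or of the kernel API.

/*
 * Standalone sketch (not kernel code) of the handshake described above.
 * One CPU wins rto_loop_start and becomes the single IPI iterator; every
 * CPU that lowers its priority bumps rto_loop_next so the iterator knows
 * it must do another full pass over the overloaded CPUs.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int rto_loop_start;   /* 0 = no iterator currently running */
static atomic_int rto_loop_next;    /* bumped by every CPU lowering its prio */
static int rto_loop;                /* last rto_loop_next value the iterator saw */

/* Only the first caller wins; losers return false and back off immediately. */
static bool rto_start_trylock(atomic_int *v)
{
	int expected = 0;
	return atomic_compare_exchange_strong(v, &expected, 1);
}

static void rto_start_unlock(atomic_int *v)
{
	atomic_store(v, 0);
}

/* Called by a CPU that just scheduled in a lower priority task. */
static void tell_cpu_to_push(void)
{
	/* Make sure a running iterator performs at least one more pass. */
	atomic_fetch_add(&rto_loop_next, 1);

	/* Only one CPU kicks off the IPI loop at a time. */
	if (!rto_start_trylock(&rto_loop_start))
		return;

	printf("starting the IPI loop\n");
	/* ... pick the first CPU in rto_mask and queue the irq work ... */

	rto_start_unlock(&rto_loop_start);
}

/* The iterator decides whether another full scan of rto_mask is needed. */
static bool need_another_pass(void)
{
	int next = atomic_load(&rto_loop_next);

	if (rto_loop == next)
		return false;   /* nothing changed while we were scanning */

	rto_loop = next;        /* record the new generation and rescan */
	return true;
}

int main(void)
{
	tell_cpu_to_push();
	printf("another pass needed: %s\n", need_another_pass() ? "yes" : "no");
	return 0;
}

The point of the cmpxchg-based try-lock is that losing CPUs return immediately instead of
contending, while the rto_loop/rto_loop_next generation check guarantees the single
iterator rescans whenever some CPU lowered its priority in the middle of a pass.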