git.ipfire.org Git - thirdparty/kernel/stable-queue.git/commitdiff
4.4-stable patches
author Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mon, 29 Jul 2019 18:02:07 +0000 (20:02 +0200)
committer Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mon, 29 Jul 2019 18:02:07 +0000 (20:02 +0200)
added patches:
access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
mm-vmstat-make-quiet_vmstat-lighter.patch
vmstat-remove-bug_on-from-vmstat_update.patch

queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch [new file with mode: 0644]
queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch [new file with mode: 0644]
queue-4.4/series
queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch [new file with mode: 0644]

diff --git a/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch b/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
new file mode 100644 (file)
index 0000000..c699e75
--- /dev/null
+++ b/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
@@ -0,0 +1,174 @@
+From d7852fbd0f0423937fa287a598bfde188bb68c22 Mon Sep 17 00:00:00 2001
+From: Linus Torvalds <torvalds@linux-foundation.org>
+Date: Thu, 11 Jul 2019 09:54:40 -0700
+Subject: access: avoid the RCU grace period for the temporary subjective credentials
+
+From: Linus Torvalds <torvalds@linux-foundation.org>
+
+commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
+
+It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
+work because it installs a temporary credential that gets allocated and
+freed for each system call.
+
+The allocation and freeing overhead is mostly benign, but because
+credentials can be accessed under the RCU read lock, the freeing
+involves an RCU grace period.
+
+Which is not a huge deal normally, but if you have a lot of access()
+calls, this causes a fair amount of secondary damage: instead of having
+a nice alloc/free pattern that hits in hot per-CPU slab caches, you have
+all those delayed frees, and on big machines with hundreds of cores,
+the RCU overhead can end up being enormous.
+
+But it turns out that all of this is entirely unnecessary.  Exactly
+because access() only installs the credential as the thread-local
+subjective credential, the temporary cred pointer doesn't actually need
+to be RCU free'd at all.  Once we're done using it, we can just free it
+synchronously and avoid all the RCU overhead.
+
+So add a 'non_rcu' flag to 'struct cred', which can be set by users that
+know they only use it in non-RCU context (there are other potential
+users for this).  We can make it a union with the rcu freeing list head
+that we need for the RCU case, so this doesn't need any extra storage.
+
+Note that this also makes 'get_current_cred()' clear the new non_rcu
+flag, in case we have filesystems that take a long-term reference to the
+cred and then expect the RCU delayed freeing afterwards.  It's not
+entirely clear that this is required, but it makes for clear semantics:
+the subjective cred remains non-RCU as long as you only access it
+synchronously using the thread-local accessors, but you _can_ use it as
+a generic cred if you want to.
+
+It is possible that we should just remove the whole RCU markings for
+->cred entirely.  Only ->real_cred is really supposed to be accessed
+through RCU, and the long-term cred copies that nfs uses might want to
+explicitly re-enable RCU freeing if required, rather than have
+get_current_cred() do it implicitly.
+
+But this is a "minimal semantic changes" change for the immediate
+problem.
+
+Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
+Acked-by: Eric Dumazet <edumazet@google.com>
+Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
+Cc: Oleg Nesterov <oleg@redhat.com>
+Cc: Jan Glauber <jglauber@marvell.com>
+Cc: Jiri Kosina <jikos@kernel.org>
+Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
+Cc: Greg KH <greg@kroah.com>
+Cc: Kees Cook <keescook@chromium.org>
+Cc: David Howells <dhowells@redhat.com>
+Cc: Miklos Szeredi <miklos@szeredi.hu>
+Cc: Al Viro <viro@zeniv.linux.org.uk>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ fs/open.c            |   19 +++++++++++++++++++
+ include/linux/cred.h |    7 ++++++-
+ kernel/cred.c        |   21 +++++++++++++++++++--
+ 3 files changed, 44 insertions(+), 3 deletions(-)
+
+--- a/fs/open.c
++++ b/fs/open.c
+@@ -363,6 +363,25 @@ SYSCALL_DEFINE3(faccessat, int, dfd, con
+                               override_cred->cap_permitted;
+       }
++      /*
++       * The new set of credentials can *only* be used in
++       * task-synchronous circumstances, and does not need
++       * RCU freeing, unless somebody then takes a separate
++       * reference to it.
++       *
++       * NOTE! This is _only_ true because this credential
++       * is used purely for override_creds() that installs
++       * it as the subjective cred. Other threads will be
++       * accessing ->real_cred, not the subjective cred.
++       *
++       * If somebody _does_ make a copy of this (using the
++       * 'get_current_cred()' function), that will clear the
++       * non_rcu field, because now that other user may be
++       * expecting RCU freeing. But normal thread-synchronous
++       * cred accesses will keep things non-RCU.
++       */
++      override_cred->non_rcu = 1;
++
+       old_cred = override_creds(override_cred);
+ retry:
+       res = user_path_at(dfd, filename, lookup_flags, &path);
+--- a/include/linux/cred.h
++++ b/include/linux/cred.h
+@@ -153,7 +153,11 @@ struct cred {
+       struct user_struct *user;       /* real user ID subscription */
+       struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+       struct group_info *group_info;  /* supplementary groups for euid/fsgid */
+-      struct rcu_head rcu;            /* RCU deletion hook */
++      /* RCU deletion */
++      union {
++              int non_rcu;                    /* Can we skip RCU deletion? */
++              struct rcu_head rcu;            /* RCU deletion hook */
++      };
+ };
+ extern void __put_cred(struct cred *);
+@@ -251,6 +255,7 @@ static inline const struct cred *get_cre
+ {
+       struct cred *nonconst_cred = (struct cred *) cred;
+       validate_creds(cred);
++      nonconst_cred->non_rcu = 0;
+       return get_new_cred(nonconst_cred);
+ }
+--- a/kernel/cred.c
++++ b/kernel/cred.c
+@@ -146,7 +146,10 @@ void __put_cred(struct cred *cred)
+       BUG_ON(cred == current->cred);
+       BUG_ON(cred == current->real_cred);
+-      call_rcu(&cred->rcu, put_cred_rcu);
++      if (cred->non_rcu)
++              put_cred_rcu(&cred->rcu);
++      else
++              call_rcu(&cred->rcu, put_cred_rcu);
+ }
+ EXPORT_SYMBOL(__put_cred);
+@@ -257,6 +260,7 @@ struct cred *prepare_creds(void)
+       old = task->cred;
+       memcpy(new, old, sizeof(struct cred));
++      new->non_rcu = 0;
+       atomic_set(&new->usage, 1);
+       set_cred_subscribers(new, 0);
+       get_group_info(new->group_info);
+@@ -536,7 +540,19 @@ const struct cred *override_creds(const
+       validate_creds(old);
+       validate_creds(new);
+-      get_cred(new);
++
++      /*
++       * NOTE! This uses 'get_new_cred()' rather than 'get_cred()'.
++       *
++       * That means that we do not clear the 'non_rcu' flag, since
++       * we are only installing the cred into the thread-synchronous
++       * '->cred' pointer, not the '->real_cred' pointer that is
++       * visible to other threads under RCU.
++       *
++       * Also note that we did validate_creds() manually, not depending
++       * on the validation in 'get_cred()'.
++       */
++      get_new_cred((struct cred *)new);
+       alter_cred_subscribers(new, 1);
+       rcu_assign_pointer(current->cred, new);
+       alter_cred_subscribers(old, -1);
+@@ -619,6 +635,7 @@ struct cred *prepare_kernel_cred(struct
+       validate_creds(old);
+       *new = *old;
++      new->non_rcu = 0;
+       atomic_set(&new->usage, 1);
+       set_cred_subscribers(new, 0);
+       get_uid(new->user);
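
A minimal userspace sketch (not kernel code; fake_rcu_head, fake_call_rcu
and the miniature put_cred() below are illustrative stand-ins, not real
kernel APIs) of the two ideas in the patch above: the non_rcu flag shares
storage with the RCU callback head, so it costs no extra bytes, and the
final put can branch between synchronous and deferred freeing.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_rcu_head {
	struct fake_rcu_head *next;
	void (*func)(struct fake_rcu_head *);
};

struct cred {
	int usage;
	union {
		int non_rcu;              /* can we skip deferred freeing? */
		struct fake_rcu_head rcu; /* deferred-freeing hook */
	};
};

/* Stand-in for call_rcu(): a real kernel defers this past a grace period. */
static void fake_call_rcu(struct fake_rcu_head *head,
			  void (*func)(struct fake_rcu_head *))
{
	func(head);
}

static void put_cred_rcu(struct fake_rcu_head *head)
{
	/* container_of() in miniature: rcu lives inside the union */
	struct cred *cred = (struct cred *)((char *)head -
					    offsetof(struct cred, rcu));
	free(cred);
}

static void put_cred(struct cred *cred)
{
	if (--cred->usage > 0)
		return;
	if (cred->non_rcu)
		put_cred_rcu(&cred->rcu);        /* synchronous, no grace period */
	else
		fake_call_rcu(&cred->rcu, put_cred_rcu);
}

int main(void)
{
	struct cred *c = calloc(1, sizeof(*c));

	c->usage = 1;
	c->non_rcu = 1;        /* subjective, thread-local use only */
	/* The union means the flag needs no storage of its own: */
	printf("sizeof(struct cred) = %zu\n", sizeof(struct cred));
	put_cred(c);
	return 0;
}
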
diff --git a/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch b/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch
new file mode 100644 (file)
index 0000000..d0c1af5
--- /dev/null
+++ b/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch
@@ -0,0 +1,175 @@
+From f01f17d3705bb6081c9e5728078f64067982be36 Mon Sep 17 00:00:00 2001
+From: Michal Hocko <mhocko@suse.com>
+Date: Fri, 5 Feb 2016 15:36:24 -0800
+Subject: mm, vmstat: make quiet_vmstat lighter
+
+From: Michal Hocko <mhocko@suse.com>
+
+commit f01f17d3705bb6081c9e5728078f64067982be36 upstream.
+
+Mike has reported a considerable overhead of refresh_cpu_vm_stats from
+the idle entry during pipe test:
+
+    12.89%  [kernel]       [k] refresh_cpu_vm_stats.isra.12
+     4.75%  [kernel]       [k] __schedule
+     4.70%  [kernel]       [k] mutex_unlock
+     3.14%  [kernel]       [k] __switch_to
+
+This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
+deferrable again and shut down on idle") which has placed quiet_vmstat
+into cpu_idle_loop.  The main reason here seems to be that the idle
+entry has to iterate over all zones and perform atomic operations for
+each vmstat entry even though there might be no per-cpu diffs.  This is
+a pointless overhead for _each_ idle entry.
+
+Make sure that quiet_vmstat is as light as possible.
+
+First of all, it doesn't make any sense to do any local sync if the
+current cpu is already set in cpu_stat_off, because vmstat_update puts
+itself there only if there is nothing to do.
+
+Then we can check need_update which should be a cheap way to check for
+potential per-cpu diffs and only then do refresh_cpu_vm_stats.
+
+The original patch also did cancel_delayed_work, which we are not doing
+here.  There are two reasons for that.  Firstly, cancel_delayed_work from
+the idle context will blow up on RT kernels (reported by Mike):
+
+  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
+  Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
+  Call Trace:
+    dump_stack+0x49/0x67
+    ___might_sleep+0xf5/0x180
+    rt_spin_lock+0x20/0x50
+    try_to_grab_pending+0x69/0x240
+    cancel_delayed_work+0x26/0xe0
+    quiet_vmstat+0x75/0xa0
+    cpu_idle_loop+0x38/0x3e0
+    cpu_startup_entry+0x13/0x20
+    start_secondary+0x114/0x140
+
+And secondly, even on !RT kernels it might add some non-trivial overhead
+which is not necessary.  Even if the vmstat worker wakes up and preempts
+idle, it will most likely be a single-shot noop, because the stats
+were already synced, and so it would end up in cpu_stat_off anyway.
+We just need to teach both vmstat_shepherd and vmstat_update to stop
+scheduling the worker if there is nothing to do.
+
+[mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
+Signed-off-by: Michal Hocko <mhocko@suse.com>
+Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
+Acked-by: Christoph Lameter <cl@linux.com>
+Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Daniel Wagner <wagi@monom.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/vmstat.c |   68 ++++++++++++++++++++++++++++++++++++++++--------------------
+ 1 file changed, 46 insertions(+), 22 deletions(-)
+
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -1395,10 +1395,15 @@ static void vmstat_update(struct work_st
+                * Counters were updated so we expect more updates
+                * to occur in the future. Keep on running the
+                * update worker thread.
++               * If we were marked on cpu_stat_off clear the flag
++               * so that vmstat_shepherd doesn't schedule us again.
+                */
+-              queue_delayed_work_on(smp_processor_id(), vmstat_wq,
+-                      this_cpu_ptr(&vmstat_work),
+-                      round_jiffies_relative(sysctl_stat_interval));
++              if (!cpumask_test_and_clear_cpu(smp_processor_id(),
++                                              cpu_stat_off)) {
++                      queue_delayed_work_on(smp_processor_id(), vmstat_wq,
++                              this_cpu_ptr(&vmstat_work),
++                              round_jiffies_relative(sysctl_stat_interval));
++              }
+       } else {
+               /*
+                * We did not update any counters so the app may be in
+@@ -1416,18 +1421,6 @@ static void vmstat_update(struct work_st
+  * until the diffs stay at zero. The function is used by NOHZ and can only be
+  * invoked when tick processing is not active.
+  */
+-void quiet_vmstat(void)
+-{
+-      if (system_state != SYSTEM_RUNNING)
+-              return;
+-
+-      do {
+-              if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+-                      cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+-
+-      } while (refresh_cpu_vm_stats(false));
+-}
+-
+ /*
+  * Check if the diffs for a certain cpu indicate that
+  * an update is needed.
+@@ -1451,6 +1444,30 @@ static bool need_update(int cpu)
+       return false;
+ }
++void quiet_vmstat(void)
++{
++      if (system_state != SYSTEM_RUNNING)
++              return;
++
++      /*
++       * If we are already in hands of the shepherd then there
++       * is nothing for us to do here.
++       */
++      if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
++              return;
++
++      if (!need_update(smp_processor_id()))
++              return;
++
++      /*
++       * Just refresh the counters and do not care about the pending
++       * delayed vmstat_update. It doesn't fire often enough to matter,
++       * and cancelling it would be too expensive from this path.
++       * vmstat_shepherd will take care of that for us.
++       */
++      refresh_cpu_vm_stats(false);
++}
++
+ /*
+  * Shepherd worker thread that checks the
+@@ -1468,18 +1485,25 @@ static void vmstat_shepherd(struct work_
+       get_online_cpus();
+       /* Check processors whose vmstat worker threads have been disabled */
+-      for_each_cpu(cpu, cpu_stat_off)
+-              if (need_update(cpu) &&
+-                      cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
+-
+-                      queue_delayed_work_on(cpu, vmstat_wq,
+-                              &per_cpu(vmstat_work, cpu), 0);
++      for_each_cpu(cpu, cpu_stat_off) {
++              struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
++              if (need_update(cpu)) {
++                      if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
++                              queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
++              } else {
++                      /*
++                       * Cancel the work if quiet_vmstat has put this
++                       * cpu on cpu_stat_off, because the work item
++                       * might still be scheduled.
++                       */
++                      cancel_delayed_work(dw);
++              }
++      }
+       put_online_cpus();
+       schedule_delayed_work(&shepherd,
+               round_jiffies_relative(sysctl_stat_interval));
+-
+ }
+ static void __init start_shepherd_timer(void)
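
A userspace sketch of the ordering this patch gives quiet_vmstat(): mark
yourself for the shepherd first, bail out cheaply if the bit was already
set, and only pay for a counter fold when need_update() reports pending
diffs.  Plain single-threaded ints stand in for the per-cpu differentials
and the cpu_stat_off mask; stat_off, pcp_diff and refresh_stats() are
made-up names.

#include <stdbool.h>
#include <stdio.h>

static bool stat_off;     /* models this cpu's bit in cpu_stat_off */
static int pcp_diff;      /* models the per-cpu vmstat differentials */

static bool need_update(void)
{
	return pcp_diff != 0;
}

static void refresh_stats(void)
{
	printf("folding %d into the global counters\n", pcp_diff);
	pcp_diff = 0;
}

static void quiet_vmstat(void)
{
	/* Already in the shepherd's hands: nothing to do. */
	if (stat_off)
		return;
	stat_off = true;

	/* Cheap check before the expensive per-zone walk. */
	if (!need_update())
		return;

	/*
	 * Refresh, but leave any pending delayed work alone; cancelling
	 * it from the idle path is exactly what blew up on RT.
	 */
	refresh_stats();
}

int main(void)
{
	pcp_diff = 3;
	quiet_vmstat();   /* pays for one refresh */
	quiet_vmstat();   /* already off: returns immediately */
	return 0;
}
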
diff --git a/queue-4.4/series b/queue-4.4/series
index b19dc2a5ed3538c2efd63eb3e137633369cf14a2..4800bc0e429dc4f40a4132fc9c3d0f19040bbf17 100644 (file)
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -144,3 +144,6 @@ hpet-fix-division-by-zero-in-hpet_time_div.patch
 alsa-line6-fix-wrong-altsetting-for-line6_podhd500_1.patch
 alsa-hda-add-a-conexant-codec-entry-to-let-mute-led-work.patch
 powerpc-tm-fix-oops-on-sigreturn-on-systems-without-tm.patch
+access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
+vmstat-remove-bug_on-from-vmstat_update.patch
+mm-vmstat-make-quiet_vmstat-lighter.patch
diff --git a/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch b/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch
new file mode 100644 (file)
index 0000000..d03ea1e
--- /dev/null
+++ b/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch
@@ -0,0 +1,61 @@
+From 587198ba5206cdf0d30855f7361af950a4172cd6 Mon Sep 17 00:00:00 2001
+From: Christoph Lameter <cl@linux.com>
+Date: Fri, 22 Jan 2016 10:46:14 -0600
+Subject: vmstat: Remove BUG_ON from vmstat_update
+
+From: Christoph Lameter <cl@linux.com>
+
+commit 587198ba5206cdf0d30855f7361af950a4172cd6 upstream.
+
+If we detect that there is nothing to do, just set the flag and do not
+check whether it was already set before.  Races really do not matter.  If
+the flag is set by any code, the shepherd will start dealing with the
+situation and re-enable the vmstat workers again when necessary.
+
+Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
+and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
+a particular cpu to be handled by vmstat_shepherd.  This might trigger a
+VM_BUG_ON in vmstat_update because the work item might have been
+sleeping during the idle period and see the cpu_stat_off updated after
+the wake up.  The VM_BUG_ON is therefore misleading and no longer
+appropriate.  Moreover it doesn't really provide any protection from real
+bugs because vmstat_shepherd will simply reschedule the vmstat_work
+anytime it sees a particular cpu set, or vmstat_update would do the same
+from the worker context directly.  Even if the two raced, the result
+wouldn't be incorrect, as the counter update is fully idempotent.
+
+Reported-by: Sasha Levin <sasha.levin@oracle.com>
+Signed-off-by: Christoph Lameter <cl@linux.com>
+Acked-by: Michal Hocko <mhocko@suse.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
+Cc: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Daniel Wagner <wagi@monom.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/vmstat.c |   12 +-----------
+ 1 file changed, 1 insertion(+), 11 deletions(-)
+
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -1407,17 +1407,7 @@ static void vmstat_update(struct work_st
+                * Defer the checking for differentials to the
+                * shepherd thread on a different processor.
+                */
+-              int r;
+-              /*
+-               * Shepherd work thread does not race since it never
+-               * changes the bit if its zero but the cpu
+-               * online / off line code may race if
+-               * worker threads are still allowed during
+-               * shutdown / startup.
+-               */
+-              r = cpumask_test_and_set_cpu(smp_processor_id(),
+-                      cpu_stat_off);
+-              VM_BUG_ON(r);
++              cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
+       }
+ }
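
A tiny model (plain bitmask, made-up names) of the idempotence argument
above: cpumask_set_cpu() on an already-set bit changes nothing, so
whichever of quiet_vmstat() and vmstat_update() marks this cpu first, the
end state is the same and there is nothing for a BUG_ON to catch.

#include <assert.h>
#include <stdio.h>

static unsigned long stat_off_mask;	/* stand-in for the cpu_stat_off cpumask */

static void mark_cpu_off(int cpu)
{
	stat_off_mask |= 1UL << cpu;	/* idempotent: harmless if already set */
}

int main(void)
{
	unsigned long once, twice;

	/* quiet_vmstat() got here first, then vmstat_update() set it again. */
	mark_cpu_off(1);
	mark_cpu_off(1);
	twice = stat_off_mask;

	/* Only vmstat_update() set it. */
	stat_off_mask = 0;
	mark_cpu_off(1);
	once = stat_off_mask;

	assert(once == twice);		/* same end state either way */
	printf("mask = %#lx\n", once);
	return 0;
}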