From: Greg Kroah-Hartman
Date: Mon, 29 Jul 2019 18:02:07 +0000 (+0200)
Subject: 4.4-stable patches
X-Git-Tag: v5.2.5~10
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=2b323cd1cb441175124971c3a43fdfa04d401e38;p=thirdparty%2Fkernel%2Fstable-queue.git

4.4-stable patches

added patches:
access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
mm-vmstat-make-quiet_vmstat-lighter.patch
vmstat-remove-bug_on-from-vmstat_update.patch
---

diff --git a/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch b/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
new file mode 100644
index 00000000000..c699e75968f
--- /dev/null
+++ b/queue-4.4/access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
@@ -0,0 +1,174 @@
+From d7852fbd0f0423937fa287a598bfde188bb68c22 Mon Sep 17 00:00:00 2001
+From: Linus Torvalds
+Date: Thu, 11 Jul 2019 09:54:40 -0700
+Subject: access: avoid the RCU grace period for the temporary subjective credentials
+
+From: Linus Torvalds
+
+commit d7852fbd0f0423937fa287a598bfde188bb68c22 upstream.
+
+It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
+work because it installs a temporary credential that gets allocated and
+freed for each system call.
+
+The allocation and freeing overhead is mostly benign, but because
+credentials can be accessed under the RCU read lock, the freeing
+involves an RCU grace period.
+
+Which is not a huge deal normally, but if you have a lot of access()
+calls, this causes a fair amount of secondary damage: instead of having a
+nice alloc/free pattern that hits in hot per-CPU slab caches, you have
+all those delayed free's, and on big machines with hundreds of cores,
+the RCU overhead can end up being enormous.
+
+But it turns out that all of this is entirely unnecessary. Exactly
+because access() only installs the credential as the thread-local
+subjective credential, the temporary cred pointer doesn't actually need
+to be RCU free'd at all. Once we're done using it, we can just free it
+synchronously and avoid all the RCU overhead.
+
+So add a 'non_rcu' flag to 'struct cred', which can be set by users that
+know they only use it in non-RCU context (there are other potential
+users for this). We can make it a union with the rcu freeing list head
+that we need for the RCU case, so this doesn't need any extra storage.
+
+Note that this also makes 'get_current_cred()' clear the new non_rcu
+flag, in case we have filesystems that take a long-term reference to the
+cred and then expect the RCU delayed freeing afterwards. It's not
+entirely clear that this is required, but it makes for clear semantics:
+the subjective cred remains non-RCU as long as you only access it
+synchronously using the thread-local accessors, but you _can_ use it as
+a generic cred if you want to.
+
+It is possible that we should just remove the whole RCU markings for
+->cred entirely. Only ->real_cred is really supposed to be accessed
+through RCU, and the long-term cred copies that nfs uses might want to
+explicitly re-enable RCU freeing if required, rather than have
+get_current_cred() do it implicitly.
+
+But this is a "minimal semantic changes" change for the immediate
+problem.
+
+Acked-by: Peter Zijlstra (Intel)
+Acked-by: Eric Dumazet
+Acked-by: Paul E. McKenney
+Cc: Oleg Nesterov
+Cc: Jan Glauber
+Cc: Jiri Kosina
+Cc: Jayachandran Chandrasekharan Nair
+Cc: Greg KH
+Cc: Kees Cook
+Cc: David Howells
+Cc: Miklos Szeredi
+Cc: Al Viro
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ fs/open.c | 19 +++++++++++++++++++
+ include/linux/cred.h | 7 ++++++-
+ kernel/cred.c | 21 +++++++++++++++++++--
+ 3 files changed, 44 insertions(+), 3 deletions(-)
+
+--- a/fs/open.c
++++ b/fs/open.c
+@@ -363,6 +363,25 @@ SYSCALL_DEFINE3(faccessat, int, dfd, con
+ override_cred->cap_permitted;
+ }
+
++ /*
++ * The new set of credentials can *only* be used in
++ * task-synchronous circumstances, and does not need
++ * RCU freeing, unless somebody then takes a separate
++ * reference to it.
++ *
++ * NOTE! This is _only_ true because this credential
++ * is used purely for override_creds() that installs
++ * it as the subjective cred. Other threads will be
++ * accessing ->real_cred, not the subjective cred.
++ *
++ * If somebody _does_ make a copy of this (using the
++ * 'get_current_cred()' function), that will clear the
++ * non_rcu field, because now that other user may be
++ * expecting RCU freeing. But normal thread-synchronous
++ * cred accesses will keep things non-RCU.
++ */
++ override_cred->non_rcu = 1;
++
+ old_cred = override_creds(override_cred);
+ retry:
+ res = user_path_at(dfd, filename, lookup_flags, &path);
+--- a/include/linux/cred.h
++++ b/include/linux/cred.h
+@@ -153,7 +153,11 @@ struct cred {
+ struct user_struct *user; /* real user ID subscription */
+ struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
+ struct group_info *group_info; /* supplementary groups for euid/fsgid */
+- struct rcu_head rcu; /* RCU deletion hook */
++ /* RCU deletion */
++ union {
++ int non_rcu; /* Can we skip RCU deletion? */
++ struct rcu_head rcu; /* RCU deletion hook */
++ };
+ };
+
+ extern void __put_cred(struct cred *);
+@@ -251,6 +255,7 @@ static inline const struct cred *get_cre
+ {
+ struct cred *nonconst_cred = (struct cred *) cred;
+ validate_creds(cred);
++ nonconst_cred->non_rcu = 0;
+ return get_new_cred(nonconst_cred);
+ }
+
+--- a/kernel/cred.c
++++ b/kernel/cred.c
+@@ -146,7 +146,10 @@ void __put_cred(struct cred *cred)
+ BUG_ON(cred == current->cred);
+ BUG_ON(cred == current->real_cred);
+
+- call_rcu(&cred->rcu, put_cred_rcu);
++ if (cred->non_rcu)
++ put_cred_rcu(&cred->rcu);
++ else
++ call_rcu(&cred->rcu, put_cred_rcu);
+ }
+ EXPORT_SYMBOL(__put_cred);
+
+@@ -257,6 +260,7 @@ struct cred *prepare_creds(void)
+ old = task->cred;
+ memcpy(new, old, sizeof(struct cred));
+
++ new->non_rcu = 0;
+ atomic_set(&new->usage, 1);
+ set_cred_subscribers(new, 0);
+ get_group_info(new->group_info);
+@@ -536,7 +540,19 @@ const struct cred *override_creds(const
+
+ validate_creds(old);
+ validate_creds(new);
+- get_cred(new);
++
++ /*
++ * NOTE! This uses 'get_new_cred()' rather than 'get_cred()'.
++ *
++ * That means that we do not clear the 'non_rcu' flag, since
++ * we are only installing the cred into the thread-synchronous
++ * '->cred' pointer, not the '->real_cred' pointer that is
++ * visible to other threads under RCU.
++ *
++ * Also note that we did validate_creds() manually, not depending
++ * on the validation in 'get_cred()'.
++ */
++ get_new_cred((struct cred *)new);
+ alter_cred_subscribers(new, 1);
+ rcu_assign_pointer(current->cred, new);
+ alter_cred_subscribers(old, -1);
+@@ -619,6 +635,7 @@ struct cred *prepare_kernel_cred(struct
+ validate_creds(old);
+
+ *new = *old;
++ new->non_rcu = 0;
+ atomic_set(&new->usage, 1);
+ set_cred_subscribers(new, 0);
+ get_uid(new->user);
diff --git a/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch b/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch
new file mode 100644
index 00000000000..d0c1af5896a
--- /dev/null
+++ b/queue-4.4/mm-vmstat-make-quiet_vmstat-lighter.patch
@@ -0,0 +1,175 @@
+From f01f17d3705bb6081c9e5728078f64067982be36 Mon Sep 17 00:00:00 2001
+From: Michal Hocko
+Date: Fri, 5 Feb 2016 15:36:24 -0800
+Subject: mm, vmstat: make quiet_vmstat lighter
+
+From: Michal Hocko
+
+commit f01f17d3705bb6081c9e5728078f64067982be36 upstream.
+
+Mike has reported a considerable overhead of refresh_cpu_vm_stats from
+the idle entry during pipe test:
+
+ 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
+ 4.75% [kernel] [k] __schedule
+ 4.70% [kernel] [k] mutex_unlock
+ 3.14% [kernel] [k] __switch_to
+
+This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
+deferrable again and shut down on idle") which has placed quiet_vmstat
+into cpu_idle_loop. The main reason here seems to be that the idle
+entry has to get over all zones and perform atomic operations for each
+vmstat entry even though there might be no per cpu diffs. This is a
+pointless overhead for _each_ idle entry.
+
+Make sure that quiet_vmstat is as light as possible.
+
+First of all it doesn't make any sense to do any local sync if the
+current cpu is already set in cpu_stat_off because vmstat_update puts
+itself there only if there is nothing to do.
+
+Then we can check need_update which should be a cheap way to check for
+potential per-cpu diffs and only then do refresh_cpu_vm_stats.
+
+The original patch also did cancel_delayed_work which we are not doing
+here. There are two reasons for that. Firstly cancel_delayed_work from
+idle context will blow up on RT kernels (reported by Mike):
+
+ CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
+ Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
+ Call Trace:
+ dump_stack+0x49/0x67
+ ___might_sleep+0xf5/0x180
+ rt_spin_lock+0x20/0x50
+ try_to_grab_pending+0x69/0x240
+ cancel_delayed_work+0x26/0xe0
+ quiet_vmstat+0x75/0xa0
+ cpu_idle_loop+0x38/0x3e0
+ cpu_startup_entry+0x13/0x20
+ start_secondary+0x114/0x140
+
+And secondly, even on !RT kernels it might add some non trivial overhead
+which is not necessary. Even if the vmstat worker wakes up and preempts
+idle then it will be most likely a single shot noop because the stats
+were already synced and so it would end up on the cpu_stat_off anyway.
+We just need to teach both vmstat_shepherd and vmstat_update to stop
+scheduling the worker if there is nothing to do.
+
+[mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
+Signed-off-by: Michal Hocko
+Reported-by: Mike Galbraith
+Acked-by: Christoph Lameter
+Signed-off-by: Mike Galbraith
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Daniel Wagner
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ mm/vmstat.c | 68 ++++++++++++++++++++++++++++++++++++++++--------------------
+ 1 file changed, 46 insertions(+), 22 deletions(-)
+
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -1395,10 +1395,15 @@ static void vmstat_update(struct work_st
+ * Counters were updated so we expect more updates
+ * to occur in the future. Keep on running the
+ * update worker thread.
++ * If we were marked on cpu_stat_off clear the flag
++ * so that vmstat_shepherd doesn't schedule us again.
+ */
+- queue_delayed_work_on(smp_processor_id(), vmstat_wq,
+- this_cpu_ptr(&vmstat_work),
+- round_jiffies_relative(sysctl_stat_interval));
++ if (!cpumask_test_and_clear_cpu(smp_processor_id(),
++ cpu_stat_off)) {
++ queue_delayed_work_on(smp_processor_id(), vmstat_wq,
++ this_cpu_ptr(&vmstat_work),
++ round_jiffies_relative(sysctl_stat_interval));
++ }
+ } else {
+ /*
+ * We did not update any counters so the app may be in
+@@ -1416,18 +1421,6 @@ static void vmstat_update(struct work_st
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+-void quiet_vmstat(void)
+-{
+- if (system_state != SYSTEM_RUNNING)
+- return;
+-
+- do {
+- if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+- cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+-
+- } while (refresh_cpu_vm_stats(false));
+-}
+-
+ /*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+@@ -1451,6 +1444,30 @@ static bool need_update(int cpu)
+ return false;
+ }
+
++void quiet_vmstat(void)
++{
++ if (system_state != SYSTEM_RUNNING)
++ return;
++
++ /*
++ * If we are already in hands of the shepherd then there
++ * is nothing for us to do here.
++ */
++ if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
++ return;
++
++ if (!need_update(smp_processor_id()))
++ return;
++
++ /*
++ * Just refresh counters and do not care about the pending delayed
++ * vmstat_update. It doesn't fire that often to matter and canceling
++ * it would be too expensive from this path.
++ * vmstat_shepherd will take care about that for us.
++ */
++ refresh_cpu_vm_stats(false);
++}
++
+
+ /*
+ * Shepherd worker thread that checks the
+@@ -1468,18 +1485,25 @@ static void vmstat_shepherd(struct work_
+
+ get_online_cpus();
+ /* Check processors whose vmstat worker threads have been disabled */
+- for_each_cpu(cpu, cpu_stat_off)
+- if (need_update(cpu) &&
+- cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
+-
+- queue_delayed_work_on(cpu, vmstat_wq,
+- &per_cpu(vmstat_work, cpu), 0);
++ for_each_cpu(cpu, cpu_stat_off) {
++ struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+
++ if (need_update(cpu)) {
++ if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
++ queue_delayed_work_on(cpu, vmstat_wq, dw, 0);
++ } else {
++ /*
++ * Cancel the work if quiet_vmstat has put this
++ * cpu on cpu_stat_off because the work item might
++ * be still scheduled
++ */
++ cancel_delayed_work(dw);
++ }
++ }
+ put_online_cpus();
+
+ schedule_delayed_work(&shepherd,
+ round_jiffies_relative(sysctl_stat_interval));
+-
+ }
+
+ static void __init start_shepherd_timer(void)
diff --git a/queue-4.4/series b/queue-4.4/series
index b19dc2a5ed3..4800bc0e429 100644
--- a/queue-4.4/series
+++ b/queue-4.4/series
@@ -144,3 +144,6 @@ hpet-fix-division-by-zero-in-hpet_time_div.patch
 alsa-line6-fix-wrong-altsetting-for-line6_podhd500_1.patch
 alsa-hda-add-a-conexant-codec-entry-to-let-mute-led-work.patch
 powerpc-tm-fix-oops-on-sigreturn-on-systems-without-tm.patch
+access-avoid-the-rcu-grace-period-for-the-temporary-subjective-credentials.patch
+vmstat-remove-bug_on-from-vmstat_update.patch
+mm-vmstat-make-quiet_vmstat-lighter.patch
diff --git a/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch b/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch
new file mode 100644
index 00000000000..d03ea1e06f3
--- /dev/null
+++ b/queue-4.4/vmstat-remove-bug_on-from-vmstat_update.patch
@@ -0,0 +1,61 @@
+From 587198ba5206cdf0d30855f7361af950a4172cd6 Mon Sep 17 00:00:00 2001
+From: Christoph Lameter
+Date: Fri, 22 Jan 2016 10:46:14 -0600
+Subject: vmstat: Remove BUG_ON from vmstat_update
+
+From: Christoph Lameter
+
+commit 587198ba5206cdf0d30855f7361af950a4172cd6 upstream.
+
+If we detect that there is nothing to do just set the flag and do not
+check if it was already set before. Races really do not matter. If the
+flag is set by any code then the shepherd will start dealing with the
+situation and reenable the vmstat workers when necessary again.
+
+Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
+and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
+a particular cpu to be handled by vmstat_shepherd. This might trigger a
+VM_BUG_ON in vmstat_update because the work item might have been
+sleeping during the idle period and see the cpu_stat_off updated after
+the wake up. The VM_BUG_ON is therefore misleading and no longer
+appropriate. Moreover it doesn't really provide any protection from real
+bugs because vmstat_shepherd will simply reschedule the vmstat_work
+anytime it sees a particular cpu set or vmstat_update would do the same
+from the worker context directly. Even when the two would race the
+result wouldn't be incorrect as the counters update is fully idempotent.
+
+Reported-by: Sasha Levin
+Signed-off-by: Christoph Lameter
+Acked-by: Michal Hocko
+Cc: Johannes Weiner
+Cc: Tetsuo Handa
+Cc: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Daniel Wagner
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ mm/vmstat.c | 12 +-----------
+ 1 file changed, 1 insertion(+), 11 deletions(-)
+
+--- a/mm/vmstat.c
++++ b/mm/vmstat.c
+@@ -1407,17 +1407,7 @@ static void vmstat_update(struct work_st
+ * Defer the checking for differentials to the
+ * shepherd thread on a different processor.
+ */
+- int r;
+- /*
+- * Shepherd work thread does not race since it never
+- * changes the bit if its zero but the cpu
+- * online / off line code may race if
+- * worker threads are still allowed during
+- * shutdown / startup.
+- */
+- r = cpumask_test_and_set_cpu(smp_processor_id(),
+- cpu_stat_off);
+- VM_BUG_ON(r);
++ cpumask_set_cpu(smp_processor_id(), cpu_stat_off);
+ }
+ }