From: Greg Kroah-Hartman
Date: Wed, 25 Jul 2012 22:20:59 +0000 (-0700)
Subject: 3.0-stable patches
X-Git-Tag: v3.4.7~4
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=5b51816b2d15089e4c2e5a67397ae6cf4f684fbc;p=thirdparty%2Fkernel%2Fstable-queue.git

3.0-stable patches

added patches:
        cpuset-mm-reduce-large-amounts-of-memory-barrier-related-damage-v3.patch
---

diff --git a/queue-3.0/cpuset-mm-reduce-large-amounts-of-memory-barrier-related-damage-v3.patch b/queue-3.0/cpuset-mm-reduce-large-amounts-of-memory-barrier-related-damage-v3.patch
new file mode 100644
index 00000000000..039e9e6c5e9
--- /dev/null
+++ b/queue-3.0/cpuset-mm-reduce-large-amounts-of-memory-barrier-related-damage-v3.patch
@@ -0,0 +1,620 @@
+From cc9a6c8776615f9c194ccf0b63a0aa5628235545 Mon Sep 17 00:00:00 2001
+From: Mel Gorman
+Date: Wed, 21 Mar 2012 16:34:11 -0700
+Subject: cpuset: mm: reduce large amounts of memory barrier related damage v3
+
+From: Mel Gorman
+
+commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.
+
+Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
+        expensive and severely impacted page allocator performance. This
+        is part of a series of patches that reduce page allocator overhead.
+
+Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
+changing cpuset's mems") wins a super prize for the largest number of
+memory barriers entered into fast paths for one commit.
+
+[get|put]_mems_allowed is incredibly heavy with pairs of full memory
+barriers inserted into a number of hot paths. This was detected while
+investigating a large page allocator slowdown introduced some time
+after 2.6.32. The largest portion of this overhead was shown by
+oprofile to be at an mfence introduced by this commit into the page
+allocator hot path.
+
+For extra style points, the commit introduced the use of yield() in an
+implementation of what looks like a spinning mutex.
+
+This patch replaces the full memory barriers on both read and write
+sides with a sequence counter with just read barriers on the fast path
+side. This is much cheaper on some architectures, including x86. The
+main bulk of the patch is the retry logic if the nodemask changes in a
+manner that can cause a false failure.
+
+While updating the nodemask, a check is made to see if a false failure
+is a risk. If it is, the sequence number gets bumped and parallel
+allocators will briefly stall while the nodemask update takes place.
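+
+To make the scheme concrete, the two sides pair up roughly as below.
+This is an illustrative sketch only, not part of the patch itself;
+alloc_one_page() is a hypothetical stand-in for any allocation whose
+outcome depends on mems_allowed:
+
+	/* read side (allocator): retry if the nodemask changed under us */
+	do {
+		cpuset_mems_cookie = get_mems_allowed();
+		page = alloc_one_page(gfp);
+	} while (!put_mems_allowed(cpuset_mems_cookie) && !page);
+
+	/* write side (nodemask update), only when a false failure is a risk */
+	write_seqcount_begin(&tsk->mems_allowed_seq);
+	tsk->mems_allowed = *newmems;
+	write_seqcount_end(&tsk->mems_allowed_seq);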
+
+In a page fault test microbenchmark, oprofile samples from
+__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
+actual results were
+
+                                       3.3.0-rc3          3.3.0-rc3
+                                     rc3-vanilla     nobarrier-v2r1
+Clients 1 UserTime                 0.07 (  0.00%)    0.08 (-14.19%)
+Clients 2 UserTime                 0.07 (  0.00%)    0.07 (  2.72%)
+Clients 4 UserTime                 0.08 (  0.00%)    0.07 (  3.29%)
+Clients 1 SysTime                  0.70 (  0.00%)    0.65 (  6.65%)
+Clients 2 SysTime                  0.85 (  0.00%)    0.82 (  3.65%)
+Clients 4 SysTime                  1.41 (  0.00%)    1.41 (  0.32%)
+Clients 1 WallTime                 0.77 (  0.00%)    0.74 (  4.19%)
+Clients 2 WallTime                 0.47 (  0.00%)    0.45 (  3.73%)
+Clients 4 WallTime                 0.38 (  0.00%)    0.37 (  1.58%)
+Clients 1 Flt/sec/cpu         497620.28 (  0.00%)  520294.53 (  4.56%)
+Clients 2 Flt/sec/cpu         414639.05 (  0.00%)  429882.01 (  3.68%)
+Clients 4 Flt/sec/cpu         257959.16 (  0.00%)  258761.48 (  0.31%)
+Clients 1 Flt/sec             495161.39 (  0.00%)  517292.87 (  4.47%)
+Clients 2 Flt/sec             820325.95 (  0.00%)  850289.77 (  3.65%)
+Clients 4 Flt/sec            1020068.93 (  0.00%) 1022674.06 (  0.26%)
+MMTests Statistics: duration
+Sys Time Running Test (seconds)              135.68     132.17
+User+Sys Time Running Test (seconds)          164.2     160.13
+Total Elapsed Time (seconds)                 123.46     120.87
+
+The overall improvement is small but the System CPU time is much
+improved and roughly in correlation to what oprofile reported (these
+performance figures are without profiling so skew is expected). The
+actual number of page faults is noticeably improved.
+
+For benchmarks like kernel builds, the overall benefit is marginal but
+the system CPU time is slightly reduced.
+
+To test the actual bug the commit fixed I opened two terminals. The
+first ran within a cpuset and continually ran a small program that
+faulted 100M of anonymous data. In a second window, the nodemask of the
+cpuset was continually randomised in a loop.
+
+Without the commit, the program would fail every so often (usually
+within 10 seconds) and obviously with the commit everything worked fine.
+With this patch applied, it also worked fine so the fix should be
+functionally equivalent.
+
+Signed-off-by: Mel Gorman
+Cc: Miao Xie
+Cc: David Rientjes
+Cc: Peter Zijlstra
+Cc: Christoph Lameter
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Mel Gorman
+Signed-off-by: Greg Kroah-Hartman
+
+
+---
+ include/linux/cpuset.h    |   49 ++++++++++++++++++----------------------------
+ include/linux/init_task.h |    8 +++++++
+ include/linux/sched.h     |    2 -
+ kernel/cpuset.c           |   43 +++++++---------------------------------
+ kernel/fork.c             |    3 ++
+ mm/filemap.c              |   11 ++++++----
+ mm/hugetlb.c              |   15 ++++++++++----
+ mm/mempolicy.c            |   28 +++++++++++++++++++-------
+ mm/page_alloc.c           |   33 +++++++++++++++++++++---------
+ mm/slab.c                 |   13 +++++++-----
+ mm/slub.c                 |   35 +++++++++++++++++++++-----------
+ mm/vmscan.c               |    2 -
+ 12 files changed, 133 insertions(+), 109 deletions(-)
+
+--- a/include/linux/cpuset.h
++++ b/include/linux/cpuset.h
+@@ -89,36 +89,25 @@ extern void rebuild_sched_domains(void);
+ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
+ 
+ /*
+- * reading current mems_allowed and mempolicy in the fastpath must protected
+- * by get_mems_allowed()
++ * get_mems_allowed is required when making decisions involving mems_allowed
++ * such as during page allocation. mems_allowed can be updated in parallel
++ * and depending on the new value an operation can fail potentially causing
++ * process failure. A retry loop with get_mems_allowed and put_mems_allowed
++ * prevents these artificial failures.
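++ *
++ * Typical use is a retry loop (illustrative sketch only, not part of
++ * the original commit; do_alloc() is a hypothetical placeholder for
++ * any operation whose outcome depends on mems_allowed):
++ *
++ *	do {
++ *		seq = get_mems_allowed();
++ *		page = do_alloc();
++ *	} while (!put_mems_allowed(seq) && !page);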
+ */ +-static inline void get_mems_allowed(void) ++static inline unsigned int get_mems_allowed(void) + { +- current->mems_allowed_change_disable++; ++ return read_seqcount_begin(¤t->mems_allowed_seq); ++} + +- /* +- * ensure that reading mems_allowed and mempolicy happens after the +- * update of ->mems_allowed_change_disable. +- * +- * the write-side task finds ->mems_allowed_change_disable is not 0, +- * and knows the read-side task is reading mems_allowed or mempolicy, +- * so it will clear old bits lazily. +- */ +- smp_mb(); +-} +- +-static inline void put_mems_allowed(void) +-{ +- /* +- * ensure that reading mems_allowed and mempolicy before reducing +- * mems_allowed_change_disable. +- * +- * the write-side task will know that the read-side task is still +- * reading mems_allowed or mempolicy, don't clears old bits in the +- * nodemask. +- */ +- smp_mb(); +- --ACCESS_ONCE(current->mems_allowed_change_disable); ++/* ++ * If this returns false, the operation that took place after get_mems_allowed ++ * may have failed. It is up to the caller to retry the operation if ++ * appropriate. ++ */ ++static inline bool put_mems_allowed(unsigned int seq) ++{ ++ return !read_seqcount_retry(¤t->mems_allowed_seq, seq); + } + + static inline void set_mems_allowed(nodemask_t nodemask) +@@ -234,12 +223,14 @@ static inline void set_mems_allowed(node + { + } + +-static inline void get_mems_allowed(void) ++static inline unsigned int get_mems_allowed(void) + { ++ return 0; + } + +-static inline void put_mems_allowed(void) ++static inline bool put_mems_allowed(unsigned int seq) + { ++ return true; + } + + #endif /* !CONFIG_CPUSETS */ +--- a/include/linux/init_task.h ++++ b/include/linux/init_task.h +@@ -30,6 +30,13 @@ extern struct fs_struct init_fs; + #define INIT_THREADGROUP_FORK_LOCK(sig) + #endif + ++#ifdef CONFIG_CPUSETS ++#define INIT_CPUSET_SEQ \ ++ .mems_allowed_seq = SEQCNT_ZERO, ++#else ++#define INIT_CPUSET_SEQ ++#endif ++ + #define INIT_SIGNALS(sig) { \ + .nr_threads = 1, \ + .wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\ +@@ -193,6 +200,7 @@ extern struct cred init_cred; + INIT_FTRACE_GRAPH \ + INIT_TRACE_RECURSION \ + INIT_TASK_RCU_PREEMPT(tsk) \ ++ INIT_CPUSET_SEQ \ + } + + +--- a/include/linux/sched.h ++++ b/include/linux/sched.h +@@ -1484,7 +1484,7 @@ struct task_struct { + #endif + #ifdef CONFIG_CPUSETS + nodemask_t mems_allowed; /* Protected by alloc_lock */ +- int mems_allowed_change_disable; ++ seqcount_t mems_allowed_seq; /* Seqence no to catch updates */ + int cpuset_mem_spread_rotor; + int cpuset_slab_spread_rotor; + #endif +--- a/kernel/cpuset.c ++++ b/kernel/cpuset.c +@@ -964,7 +964,6 @@ static void cpuset_change_task_nodemask( + { + bool need_loop; + +-repeat: + /* + * Allow tasks that have access to memory reserves because they have + * been OOM killed to get memory anywhere. +@@ -983,45 +982,19 @@ repeat: + */ + need_loop = task_has_mempolicy(tsk) || + !nodes_intersects(*newmems, tsk->mems_allowed); +- nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); +- mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); + +- /* +- * ensure checking ->mems_allowed_change_disable after setting all new +- * allowed nodes. +- * +- * the read-side task can see an nodemask with new allowed nodes and +- * old allowed nodes. and if it allocates page when cpuset clears newly +- * disallowed ones continuous, it can see the new allowed bits. 
+- * +- * And if setting all new allowed nodes is after the checking, setting +- * all new allowed nodes and clearing newly disallowed ones will be done +- * continuous, and the read-side task may find no node to alloc page. +- */ +- smp_mb(); +- +- /* +- * Allocation of memory is very fast, we needn't sleep when waiting +- * for the read-side. +- */ +- while (need_loop && ACCESS_ONCE(tsk->mems_allowed_change_disable)) { +- task_unlock(tsk); +- if (!task_curr(tsk)) +- yield(); +- goto repeat; +- } ++ if (need_loop) ++ write_seqcount_begin(&tsk->mems_allowed_seq); + +- /* +- * ensure checking ->mems_allowed_change_disable before clearing all new +- * disallowed nodes. +- * +- * if clearing newly disallowed bits before the checking, the read-side +- * task may find no node to alloc page. +- */ +- smp_mb(); ++ nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems); ++ mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1); + + mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2); + tsk->mems_allowed = *newmems; ++ ++ if (need_loop) ++ write_seqcount_end(&tsk->mems_allowed_seq); ++ + task_unlock(tsk); + } + +--- a/kernel/fork.c ++++ b/kernel/fork.c +@@ -985,6 +985,9 @@ static int copy_signal(unsigned long clo + #ifdef CONFIG_CGROUPS + init_rwsem(&sig->threadgroup_fork_lock); + #endif ++#ifdef CONFIG_CPUSETS ++ seqcount_init(&tsk->mems_allowed_seq); ++#endif + + sig->oom_adj = current->signal->oom_adj; + sig->oom_score_adj = current->signal->oom_score_adj; +--- a/mm/filemap.c ++++ b/mm/filemap.c +@@ -516,10 +516,13 @@ struct page *__page_cache_alloc(gfp_t gf + struct page *page; + + if (cpuset_do_page_mem_spread()) { +- get_mems_allowed(); +- n = cpuset_mem_spread_node(); +- page = alloc_pages_exact_node(n, gfp, 0); +- put_mems_allowed(); ++ unsigned int cpuset_mems_cookie; ++ do { ++ cpuset_mems_cookie = get_mems_allowed(); ++ n = cpuset_mem_spread_node(); ++ page = alloc_pages_exact_node(n, gfp, 0); ++ } while (!put_mems_allowed(cpuset_mems_cookie) && !page); ++ + return page; + } + return alloc_pages(gfp, 0); +--- a/mm/hugetlb.c ++++ b/mm/hugetlb.c +@@ -454,14 +454,16 @@ static struct page *dequeue_huge_page_vm + struct vm_area_struct *vma, + unsigned long address, int avoid_reserve) + { +- struct page *page = NULL; ++ struct page *page; + struct mempolicy *mpol; + nodemask_t *nodemask; + struct zonelist *zonelist; + struct zone *zone; + struct zoneref *z; ++ unsigned int cpuset_mems_cookie; + +- get_mems_allowed(); ++retry_cpuset: ++ cpuset_mems_cookie = get_mems_allowed(); + zonelist = huge_zonelist(vma, address, + htlb_alloc_mask, &mpol, &nodemask); + /* +@@ -488,10 +490,15 @@ static struct page *dequeue_huge_page_vm + } + } + } +-err: ++ + mpol_cond_put(mpol); +- put_mems_allowed(); ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; + return page; ++ ++err: ++ mpol_cond_put(mpol); ++ return NULL; + } + + static void update_and_free_page(struct hstate *h, struct page *page) +--- a/mm/mempolicy.c ++++ b/mm/mempolicy.c +@@ -1810,18 +1810,24 @@ struct page * + alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, + unsigned long addr, int node) + { +- struct mempolicy *pol = get_vma_policy(current, vma, addr); ++ struct mempolicy *pol; + struct zonelist *zl; + struct page *page; ++ unsigned int cpuset_mems_cookie; ++ ++retry_cpuset: ++ pol = get_vma_policy(current, vma, addr); ++ cpuset_mems_cookie = get_mems_allowed(); + +- get_mems_allowed(); + if (unlikely(pol->mode == MPOL_INTERLEAVE)) { + unsigned nid; + + nid = interleave_nid(pol, vma, addr, 
PAGE_SHIFT + order); + mpol_cond_put(pol); + page = alloc_page_interleave(gfp, order, nid); +- put_mems_allowed(); ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; ++ + return page; + } + zl = policy_zonelist(gfp, pol, node); +@@ -1832,7 +1838,8 @@ alloc_pages_vma(gfp_t gfp, int order, st + struct page *page = __alloc_pages_nodemask(gfp, order, + zl, policy_nodemask(gfp, pol)); + __mpol_put(pol); +- put_mems_allowed(); ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; + return page; + } + /* +@@ -1840,7 +1847,8 @@ alloc_pages_vma(gfp_t gfp, int order, st + */ + page = __alloc_pages_nodemask(gfp, order, zl, + policy_nodemask(gfp, pol)); +- put_mems_allowed(); ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; + return page; + } + +@@ -1867,11 +1875,14 @@ struct page *alloc_pages_current(gfp_t g + { + struct mempolicy *pol = current->mempolicy; + struct page *page; ++ unsigned int cpuset_mems_cookie; + + if (!pol || in_interrupt() || (gfp & __GFP_THISNODE)) + pol = &default_policy; + +- get_mems_allowed(); ++retry_cpuset: ++ cpuset_mems_cookie = get_mems_allowed(); ++ + /* + * No reference counting needed for current->mempolicy + * nor system default_policy +@@ -1882,7 +1893,10 @@ struct page *alloc_pages_current(gfp_t g + page = __alloc_pages_nodemask(gfp, order, + policy_zonelist(gfp, pol, numa_node_id()), + policy_nodemask(gfp, pol)); +- put_mems_allowed(); ++ ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; ++ + return page; + } + EXPORT_SYMBOL(alloc_pages_current); +--- a/mm/page_alloc.c ++++ b/mm/page_alloc.c +@@ -2293,8 +2293,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u + { + enum zone_type high_zoneidx = gfp_zone(gfp_mask); + struct zone *preferred_zone; +- struct page *page; ++ struct page *page = NULL; + int migratetype = allocflags_to_migratetype(gfp_mask); ++ unsigned int cpuset_mems_cookie; + + gfp_mask &= gfp_allowed_mask; + +@@ -2313,15 +2314,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u + if (unlikely(!zonelist->_zonerefs->zone)) + return NULL; + +- get_mems_allowed(); ++retry_cpuset: ++ cpuset_mems_cookie = get_mems_allowed(); ++ + /* The preferred zone is used for statistics later */ + first_zones_zonelist(zonelist, high_zoneidx, + nodemask ? : &cpuset_current_mems_allowed, + &preferred_zone); +- if (!preferred_zone) { +- put_mems_allowed(); +- return NULL; +- } ++ if (!preferred_zone) ++ goto out; + + /* First allocation attempt */ + page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, +@@ -2331,9 +2332,19 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u + page = __alloc_pages_slowpath(gfp_mask, order, + zonelist, high_zoneidx, nodemask, + preferred_zone, migratetype); +- put_mems_allowed(); + + trace_mm_page_alloc(page, order, gfp_mask, migratetype); ++ ++out: ++ /* ++ * When updating a task's mems_allowed, it is possible to race with ++ * parallel threads in such a way that an allocation can fail while ++ * the mask is being updated. If a page allocation is about to fail, ++ * check if the cpuset changed during allocation and if so, retry. 
++ */ ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page)) ++ goto retry_cpuset; ++ + return page; + } + EXPORT_SYMBOL(__alloc_pages_nodemask); +@@ -2557,13 +2568,15 @@ void si_meminfo_node(struct sysinfo *val + bool skip_free_areas_node(unsigned int flags, int nid) + { + bool ret = false; ++ unsigned int cpuset_mems_cookie; + + if (!(flags & SHOW_MEM_FILTER_NODES)) + goto out; + +- get_mems_allowed(); +- ret = !node_isset(nid, cpuset_current_mems_allowed); +- put_mems_allowed(); ++ do { ++ cpuset_mems_cookie = get_mems_allowed(); ++ ret = !node_isset(nid, cpuset_current_mems_allowed); ++ } while (!put_mems_allowed(cpuset_mems_cookie)); + out: + return ret; + } +--- a/mm/slab.c ++++ b/mm/slab.c +@@ -3218,12 +3218,10 @@ static void *alternate_node_alloc(struct + if (in_interrupt() || (flags & __GFP_THISNODE)) + return NULL; + nid_alloc = nid_here = numa_mem_id(); +- get_mems_allowed(); + if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD)) + nid_alloc = cpuset_slab_spread_node(); + else if (current->mempolicy) + nid_alloc = slab_node(current->mempolicy); +- put_mems_allowed(); + if (nid_alloc != nid_here) + return ____cache_alloc_node(cachep, flags, nid_alloc); + return NULL; +@@ -3246,14 +3244,17 @@ static void *fallback_alloc(struct kmem_ + enum zone_type high_zoneidx = gfp_zone(flags); + void *obj = NULL; + int nid; ++ unsigned int cpuset_mems_cookie; + + if (flags & __GFP_THISNODE) + return NULL; + +- get_mems_allowed(); +- zonelist = node_zonelist(slab_node(current->mempolicy), flags); + local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK); + ++retry_cpuset: ++ cpuset_mems_cookie = get_mems_allowed(); ++ zonelist = node_zonelist(slab_node(current->mempolicy), flags); ++ + retry: + /* + * Look through allowed nodes for objects available +@@ -3306,7 +3307,9 @@ retry: + } + } + } +- put_mems_allowed(); ++ ++ if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !obj)) ++ goto retry_cpuset; + return obj; + } + +--- a/mm/slub.c ++++ b/mm/slub.c +@@ -1457,6 +1457,7 @@ static struct page *get_any_partial(stru + struct zone *zone; + enum zone_type high_zoneidx = gfp_zone(flags); + struct page *page; ++ unsigned int cpuset_mems_cookie; + + /* + * The defrag ratio allows a configuration of the tradeoffs between +@@ -1480,22 +1481,32 @@ static struct page *get_any_partial(stru + get_cycles() % 1024 > s->remote_node_defrag_ratio) + return NULL; + +- get_mems_allowed(); +- zonelist = node_zonelist(slab_node(current->mempolicy), flags); +- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { +- struct kmem_cache_node *n; ++ do { ++ cpuset_mems_cookie = get_mems_allowed(); ++ zonelist = node_zonelist(slab_node(current->mempolicy), flags); ++ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { ++ struct kmem_cache_node *n; + +- n = get_node(s, zone_to_nid(zone)); ++ n = get_node(s, zone_to_nid(zone)); + +- if (n && cpuset_zone_allowed_hardwall(zone, flags) && +- n->nr_partial > s->min_partial) { +- page = get_partial_node(n); +- if (page) { +- put_mems_allowed(); +- return page; ++ if (n && cpuset_zone_allowed_hardwall(zone, flags) && ++ n->nr_partial > s->min_partial) { ++ page = get_partial_node(n); ++ if (page) { ++ /* ++ * Return the object even if ++ * put_mems_allowed indicated that ++ * the cpuset mems_allowed was ++ * updated in parallel. It's a ++ * harmless race between the alloc ++ * and the cpuset update. 
++ */ ++ put_mems_allowed(cpuset_mems_cookie); ++ return page; ++ } + } + } +- } ++ } while (!put_mems_allowed(cpuset_mems_cookie)); + put_mems_allowed(); + #endif + return NULL; +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -2251,7 +2251,6 @@ static unsigned long do_try_to_free_page + unsigned long writeback_threshold; + bool aborted_reclaim; + +- get_mems_allowed(); + delayacct_freepages_start(); + + if (scanning_global_lru(sc)) +@@ -2314,7 +2313,6 @@ static unsigned long do_try_to_free_page + + out: + delayacct_freepages_end(); +- put_mems_allowed(); + + if (sc->nr_reclaimed) + return sc->nr_reclaimed; diff --git a/queue-3.0/series b/queue-3.0/series index 78565a490aa..9809f7617d4 100644 --- a/queue-3.0/series +++ b/queue-3.0/series @@ -36,3 +36,4 @@ mm-test-pageswapbacked-in-lumpy-reclaim.patch mm-vmscan-convert-global-reclaim-to-per-memcg-lru-lists.patch cpusets-avoid-looping-when-storing-to-mems_allowed-if-one-node-remains-set.patch cpusets-stall-when-updating-mems_allowed-for-mempolicy-or-disjoint-nodemask.patch +cpuset-mm-reduce-large-amounts-of-memory-barrier-related-damage-v3.patch