From 252c5f94d944487e9f50ece7942b0fbf659c5c31 Mon Sep 17 00:00:00 2001
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Date: Mon, 21 Sep 2009 17:03:40 -0700
Subject: mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust()

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

commit 252c5f94d944487e9f50ece7942b0fbf659c5c31 upstream.

We noticed very erratic behavior [throughput] with the AIM7 shared
workload running on recent distro [SLES11] and mainline kernels on an
8-socket, 32-core, 256GB x86_64 platform.  On the SLES11 kernel
[2.6.27.19+] with Barcelona processors, as we increased the load [10s of
thousands of tasks], the throughput would vary between two "plateaus"--one
at ~65K jobs per minute and one at ~130K jpm.  The simple patch below
causes the results to smooth out at the ~130K plateau.

But wait, there's more:

We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
This could be the result of the larger number of cpus on the larger
platform--a scalability issue--or it could be the result of the larger
number of interconnect "hops" between some nodes in this platform and how
the tasks for a given load end up distributed over the nodes' cpus and
memories--a stochastic NUMA effect.

The variability in the results is less pronounced [on the same platform]
with Shanghai processors and with mainline kernels.  With 31-rc6 on
Shanghai processors and 288 file systems on 288 fibre attached storage
volumes, the curves [jpm vs load] are both quite flat, with the patched
kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
jpm] than the unpatched kernel.

Profiling indicated that the "slow" runs were incurring high[er]
contention on an anon_vma lock in vma_adjust(), apparently called from the
sbrk() system call.

The patch:

A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
anon_vma lock when we're only adjusting the end of a vma, as is the case
for brk().  The comment questions whether it's worthwhile to optimize for
this case.  Apparently, on the newer, larger x86_64 platforms, with
interesting NUMA topologies, it is worthwhile--especially considering
that the patch [if correct!] is quite simple.

We can detect this condition--no overlap with the next vma--by noting a
NULL "importer".  The anon_vma pointer will also be NULL in this case, so
simply avoid loading vma->anon_vma to avoid the lock.

However, we DO need to take the anon_vma lock when we're inserting a vma
['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
need to check for 'insert' as well.  And Hugh points out that we should
also take it when adjusting vm_start (so that rmap.c can rely upon
vma_address() while it holds the anon_vma lock).

akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
might be -stable material even though this isn't a regression: "this
issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
severe on large machine (4*8*2 cpu)"

[hugh.dickins@tiscali.co.uk: test vma start too]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Eric Whitney <eric.whitney@hp.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 mm/mmap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -575,9 +575,9 @@ again:			remove_next = 1 + (end > next->
 
 	/*
 	 * When changing only vma->vm_end, we don't really need
-	 * anon_vma lock: but is that case worth optimizing out?
+	 * anon_vma lock.
 	 */
-	if (vma->anon_vma)
+	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
 		spin_lock(&anon_vma->lock);