From 252c5f94d944487e9f50ece7942b0fbf659c5c31 Mon Sep 17 00:00:00 2001
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Date: Mon, 21 Sep 2009 17:03:40 -0700
Subject: mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust()

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

commit 252c5f94d944487e9f50ece7942b0fbf659c5c31 upstream.

We noticed very erratic behavior [throughput] with the AIM7 shared
workload running on recent distro [SLES11] and mainline kernels on an
8-socket, 32-core, 256GB x86_64 platform.  On the SLES11 kernel
[2.6.27.19+] with Barcelona processors, as we increased the load [10s of
thousands of tasks], the throughput would vary between two "plateaus"--one
at ~65K jobs per minute and one at ~130K jpm.  The simple patch below
causes the results to smooth out at the ~130K plateau.

But wait, there's more:

We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
This could be the result of the larger number of cpus on the larger
platform--a scalability issue--or it could be the result of the larger
number of interconnect "hops" between some nodes in this platform and how
the tasks for a given load end up distributed over the nodes' cpus and
memories--a stochastic NUMA effect.

The variability in the results is less pronounced [on the same platform]
with Shanghai processors and with mainline kernels.  With 31-rc6 on
Shanghai processors and 288 file systems on 288 fibre attached storage
volumes, the curves [jpm vs load] are both quite flat, with the patched
kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
jpm] than the unpatched kernel.

Profiling indicated that the "slow" runs were incurring high[er]
contention on an anon_vma lock in vma_adjust(), apparently called from the
sbrk() system call.

The patch:

A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
anon_vma lock when we're only adjusting the end of a vma, as is the case
for brk().  The comment questions whether it's worthwhile to optimize for
this case.  Apparently, on the newer, larger x86_64 platforms, with
interesting NUMA topologies, it is worthwhile--especially considering
that the patch [if correct!] is quite simple.

We can detect this condition--no overlap with the next vma--by noting a
NULL "importer".  The anon_vma pointer will also be NULL in this case, so
simply avoid loading vma->anon_vma to avoid the lock.

However, we DO need to take the anon_vma lock when we're inserting a vma
['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
need to check for 'insert' as well.  And Hugh points out that we should
also take it when adjusting vm_start (so that rmap.c can rely upon
vma_address() while it holds the anon_vma lock).

akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
might be -stable material even though this isn't a regression: "this
issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
severe on large machine (4*8*2 cpu)"

[hugh.dickins@tiscali.co.uk: test vma start too]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Eric Whitney <eric.whitney@hp.com>
Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 mm/mmap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -575,9 +575,9 @@ again:			remove_next = 1 + (end > next->
 
 	/*
 	 * When changing only vma->vm_end, we don't really need
-	 * anon_vma lock: but is that case worth optimizing out?
+	 * anon_vma lock.
 	 */
-	if (vma->anon_vma)
+	if (vma->anon_vma && (insert || importer || start != vma->vm_start))
 		anon_vma = vma->anon_vma;
 	if (anon_vma) {
 		spin_lock(&anon_vma->lock);