dc13a30d GKH
From 5a28fc94c9143db766d1ba5480cae82d856ad080 Mon Sep 17 00:00:00 2001
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Fri, 19 Apr 2019 12:47:47 -0700
Subject: x86/mpx, mm/core: Fix recursive munmap() corruption

From: Dave Hansen <dave.hansen@linux.intel.com>

commit 5a28fc94c9143db766d1ba5480cae82d856ad080 upstream.

This is a bit of a mess, to put it mildly. But, it's a bug
that only seems to have shown up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory. When
memory is unmapped, there is also a need to unmap the MPX
bounds tables. Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state. It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_munmap() has a chance
to do anything meaningful. Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap(). No functional
change.

I compile-tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'. Moving
arch_unmap() makes this happen earlier in __do_munmap(). But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace. I cannot
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before. For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use. But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

	ptr = mmap();
	// use ptr
	ret = munmap(ptr);
	if (ret)
		// oh, there was an error, I'll
		// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
   freeing page tables (or queueing them to be freed).
5. Fix up some of the accounting (like fput()) and actually
   free the VMA itself.

This specific ordering was actually introduced by:

	dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window. The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list(). arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

	[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
	[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap(). It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this. For one, arch_unmap()
can free VMAs. But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here. First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue. I also tried having the MPX code return a flag if it
modified the rbtree, which would force __do_munmap() to restart
with a fresh VMA walk. That spiralled out of control in
complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following kernel bugzilla entry:

	https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

	dd2283f2605 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe
the fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener <rguenther@suse.de>
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-um@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@vger.kernel.org
Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/powerpc/include/asm/mmu_context.h   |    1 -
 arch/um/include/asm/mmu_context.h        |    1 -
 arch/unicore32/include/asm/mmu_context.h |    1 -
 arch/x86/include/asm/mmu_context.h       |    6 +++---
 arch/x86/include/asm/mpx.h               |   15 ++++++++-------
 arch/x86/mm/mpx.c                        |   10 ++++++----
 include/asm-generic/mm_hooks.h           |    1 -
 mm/mmap.c                                |   15 ++++++++-------
 8 files changed, 25 insertions(+), 25 deletions(-)

--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -237,7 +237,6 @@ extern void arch_exit_mmap(struct mm_str
 #endif
 
 static inline void arch_unmap(struct mm_struct *mm,
-                              struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
 {
         if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -22,7 +22,6 @@ static inline int arch_dup_mmap(struct m
 }
 extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
-                              struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
 {
 }
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -88,7 +88,6 @@ static inline int arch_dup_mmap(struct m
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
-                              struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
 {
 }
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(str
         mpx_mm_init(mm);
 }
 
-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
-                              unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+                              unsigned long end)
 {
         /*
          * mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_
          * consistently wrong.
          */
         if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
-                mpx_notify_unmap(mm, vma, start, end);
+                mpx_notify_unmap(mm, start, end);
 }
 
 /*
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -64,12 +64,15 @@ struct mpx_fault_info {
 };
 
 #ifdef CONFIG_X86_INTEL_MPX
-int mpx_fault_info(struct mpx_fault_info *info, struct pt_regs *regs);
-int mpx_handle_bd_fault(void);
+
+extern int mpx_fault_info(struct mpx_fault_info *info, struct pt_regs *regs);
+extern int mpx_handle_bd_fault(void);
+
 static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
 {
         return (mm->context.bd_addr != MPX_INVALID_BOUNDS_DIR);
 }
+
 static inline void mpx_mm_init(struct mm_struct *mm)
 {
         /*
@@ -78,11 +81,10 @@ static inline void mpx_mm_init(struct mm
          */
         mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
 }
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
-                      unsigned long start, unsigned long end);
 
-unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
-                                      unsigned long flags);
+extern void mpx_notify_unmap(struct mm_struct *mm, unsigned long start, unsigned long end);
+extern unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len, unsigned long flags);
+
 #else
 static inline int mpx_fault_info(struct mpx_fault_info *info, struct pt_regs *regs)
 {
@@ -100,7 +102,6 @@ static inline void mpx_mm_init(struct mm
 {
 }
 static inline void mpx_notify_unmap(struct mm_struct *mm,
-                                    struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end)
 {
 }
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_st
  * the virtual address region start...end have already been split if
  * necessary, and the 'vma' is the first vma in this range (start -> end).
  */
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
-                      unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+                      unsigned long end)
 {
+        struct vm_area_struct *vma;
         int ret;
 
         /*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *
          * which should not occur normally. Being strict about it here
          * helps ensure that we do not have an exploitable stack overflow.
          */
-        do {
+        vma = find_vma(mm, start);
+        while (vma && vma->vm_start < end) {
                 if (vma->vm_flags & VM_MPX)
                         return;
                 vma = vma->vm_next;
-        } while (vma && vma->vm_start < end);
+        }
 
         ret = mpx_unmap_tables(mm, start, end);
         if (ret)
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
-                              struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
 {
 }
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2736,9 +2736,17 @@ int __do_munmap(struct mm_struct *mm, un
                 return -EINVAL;
 
         len = PAGE_ALIGN(len);
+        end = start + len;
         if (len == 0)
                 return -EINVAL;
 
+        /*
+         * arch_unmap() might do unmaps itself.  It must be called
+         * and finish any rbtree manipulation before this code
+         * runs and also starts to manipulate the rbtree.
+         */
+        arch_unmap(mm, start, end);
+
         /* Find the first overlapping VMA */
         vma = find_vma(mm, start);
         if (!vma)
@@ -2747,7 +2755,6 @@ int __do_munmap(struct mm_struct *mm, un
         /* we have start < vma->vm_end  */
 
         /* if it doesn't overlap, we have nothing.. */
-        end = start + len;
         if (vma->vm_start >= end)
                 return 0;
 
@@ -2817,12 +2824,6 @@ int __do_munmap(struct mm_struct *mm, un
         /* Detach vmas from rbtree */
         detach_vmas_to_be_unmapped(mm, vma, prev, end);
 
-        /*
-         * mpx unmap needs to be called with mmap_sem held for write.
-         * It is safe to call it before unmap_region().
-         */
-        arch_unmap(mm, vma, start, end);
-
         if (downgrade)
                 downgrade_write(&mm->mmap_sem);
 