From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Mon, 13 Aug 2012 18:14:07 +0000 (-0700)
Subject: 3.5-stable patches
X-Git-Tag: v3.5.2~10
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=a8d780e599969c221747d436bb78d0a9d14d491e;p=thirdparty%2Fkernel%2Fstable-queue.git

3.5-stable patches

added patches:
	mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch
---

diff --git a/queue-3.5/mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch b/queue-3.5/mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch
new file mode 100644
index 00000000000..3782cb8785e
--- /dev/null
+++ b/queue-3.5/mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch
@@ -0,0 +1,284 @@
+From d833352a4338dc31295ed832a30c9ccff5c7a183 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Tue, 31 Jul 2012 16:46:20 -0700
+Subject: mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit d833352a4338dc31295ed832a30c9ccff5c7a183 upstream.
+
+If a process creates a large hugetlbfs mapping that is eligible for page
+table sharing and forks heavily, with some children faulting and others
+destroying the mapping, then it is possible for page tables to
+get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
+output a message to the kernel log.  The final teardown will trigger a
+BUG_ON in mm/filemap.c.
+
+This was reproduced in 3.4 but is known to have existed for a long time
+and goes back at least as far as 2.6.37.  It was probably introduced
+in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
+look like this:
+
+[  ..........] Lots of bad pmd messages followed by this
+[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
+[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
+[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
+[  127.186778] ------------[ cut here ]------------
+[  127.186781] kernel BUG at mm/filemap.c:134!
+[  127.186782] invalid opcode: 0000 [#1] SMP
+[  127.186783] CPU 7
+[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
+[  127.186801]
+[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
+[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
+[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
+[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
+[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
+[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
+[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
+[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
+[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
+[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
+[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
+[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
+[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
+[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
+[  127.186821] Stack:
+[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
+[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
+[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
+[  127.186827] Call Trace:
+[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
+[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
+[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
+[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
+[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
+[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
+[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
+[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
+[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
+[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
+[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
+[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
+[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
+[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
+[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
+[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
+[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
+[  127.186870]  RSP <ffff8804144b5c08>
+[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
+
+The bug is a race and not always easy to reproduce.  To reproduce it I was
+doing the following on a single socket I7-based machine with 16G of RAM.
+
+$ hugeadm --pool-pages-max DEFAULT:13G
+$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
+$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
+$ for i in `seq 1 9000`; do ./hugetlbfs-test; done
+
+On my particular machine, it usually triggers within 10 minutes but
+enabling debug options can change the timing such that it never hits.
+Once the bug is triggered, the machine is in trouble and needs to be
+rebooted.  The machine will respond but processes accessing /proc, such as
+"ps aux", will hang due to the BUG_ON.  shutdown will also hang and needs a
+hard reset or a sysrq-b.
+
+The basic problem is a race between page table sharing and teardown.  For
+the most part page table sharing depends on i_mmap_mutex.  In some cases,
+it is also taking the mm->page_table_lock for the PTE updates but with
+shared page tables, it is the i_mmap_mutex that is more important.
+
+Unfortunately it also appears to be insufficient.  Consider the following
+situation:
+
+Process A					Process B
+---------					---------
+hugetlb_fault					shmdt
+  						LockWrite(mmap_sem)
+    						  do_munmap
+						    unmap_region
+						      unmap_vmas
+						        unmap_single_vma
+						          unmap_hugepage_range
+      						            Lock(i_mmap_mutex)
+							    Lock(mm->page_table_lock)
+							    huge_pmd_unshare/unmap tables <--- (1)
+							    Unlock(mm->page_table_lock)
+      						            Unlock(i_mmap_mutex)
+  huge_pte_alloc				      ...
+    Lock(i_mmap_mutex)				      ...
+    vma_prio_walk, find svma, spte		      ...
+    Lock(mm->page_table_lock)			      ...
+    share spte					      ...
+    Unlock(mm->page_table_lock)			      ...
+    Unlock(i_mmap_mutex)			      ...
+  hugetlb_no_page									  <--- (2)
+						      free_pgtables
+						        unlink_file_vma
+							hugetlb_free_pgd_range
+						    remove_vma_list
+
+In this scenario, it is possible for Process A to share page tables with
+Process B that is trying to tear them down.  The i_mmap_mutex on its own
+does not prevent Process A walking Process B's page tables.  At (1) above,
+the page tables are not shared yet so Process B unmaps the PMDs.  Process
+A sets up page table sharing and at (2) faults a new entry.  Process B
+then trips up on it in free_pgtables.
+
+This patch fixes the problem by adding a new function
+__unmap_hugepage_range_final that is only called when the VMA is about to
+be destroyed.  This function clears VM_MAYSHARE during
+unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
+ineligible for sharing and avoids the race.  Superficially this looks like
+it would then be vulnerable to truncate and madvise issues but hugetlbfs
+has its own truncate handlers so does not use unmap_mapping_range() and
+does not support madvise(DONTNEED).
+
+This should be treated as a -stable candidate if it is merged.
+
+Test program is as follows. The test case was mostly written by Michal
+Hocko with a few minor changes to reproduce this bug.
+
+==== CUT HERE ====
+
+static size_t huge_page_size = (2UL << 20);
+static size_t nr_huge_page_A = 512;
+static size_t nr_huge_page_B = 5632;
+
+unsigned int get_random(unsigned int max)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	srandom(tv.tv_usec);
+	return random() % max;
+}
+
+static void play(void *addr, size_t size)
+{
+	unsigned char *start = addr,
+		      *end = start + size,
+		      *a;
+	start += get_random(size/2);
+
+	/* we could iterate on huge pages but let's give it more time. */
+	for (a = start; a < end; a += 4096)
+		*a = 0;
+}
+
+int main(int argc, char **argv)
+{
+	key_t key = IPC_PRIVATE;
+	size_t sizeA = nr_huge_page_A * huge_page_size;
+	size_t sizeB = nr_huge_page_B * huge_page_size;
+	int shmidA, shmidB;
+	void *addrA = NULL, *addrB = NULL;
+	int nr_children = 300, n = 0;
+
+	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
+		perror("shmget");
+		return 1;
+	}
+
+	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
+		perror("shmat");
+		return 1;
+	}
+	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
+		perror("shmget");
+		return 1;
+	}
+
+	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
+		perror("shmat");
+		return 1;
+	}
+
+fork_child:
+	switch(fork()) {
+		case 0:
+			switch (n%3) {
+			case 0:
+				play(addrA, sizeA);
+				break;
+			case 1:
+				play(addrB, sizeB);
+				break;
+			case 2:
+				break;
+			}
+			break;
+		case -1:
+			perror("fork");
+			break;
+		default:
+			if (++n < nr_children)
+				goto fork_child;
+			play(addrA, sizeA);
+			break;
+	}
+	shmdt(addrA);
+	shmdt(addrB);
+	do {
+		wait(NULL);
+	} while (--n > 0);
+	shmctl(shmidA, IPC_RMID, NULL);
+	shmctl(shmidB, IPC_RMID, NULL);
+	return 0;
+}
+
+[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
+Signed-off-by: Hugh Dickins <hughd@google.com>
+Reviewed-by: Michal Hocko <mhocko@suse.cz>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+
+---
+ mm/hugetlb.c |   25 +++++++++++++++++++++++--
+ 1 file changed, 23 insertions(+), 2 deletions(-)
+
+--- a/mm/hugetlb.c
++++ b/mm/hugetlb.c
+@@ -2393,6 +2393,22 @@ void unmap_hugepage_range(struct vm_area
+ {
+ 	mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ 	__unmap_hugepage_range(vma, start, end, ref_page);
++	/*
++	 * Clear this flag so that x86's huge_pmd_share page_table_shareable
++	 * test will fail on a vma being torn down, and not grab a page table
++	 * on its way out.  We're lucky that the flag has such an appropriate
++	 * name, and can in fact be safely cleared here. We could clear it
++	 * before the __unmap_hugepage_range above, but all that's necessary
++	 * is to clear it before releasing the i_mmap_mutex below.
++	 *
++	 * This works because in the contexts this is called, the VMA is
++	 * going to be destroyed. It is not vulnerable to madvise(DONTNEED)
++	 * because madvise is not supported on hugetlbfs. The same applies
++	 * for direct IO. unmap_hugepage_range() is only being called just
++	 * before free_pgtables() so clearing VM_MAYSHARE will not cause
++	 * surprises later.
++	 */
++	vma->vm_flags &= ~VM_MAYSHARE;
+ 	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
+ 
+@@ -2959,9 +2975,14 @@ void hugetlb_change_protection(struct vm
+ 		}
+ 	}
+ 	spin_unlock(&mm->page_table_lock);
+-	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+-
++	/*
++	 * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
++	 * may have cleared our pud entry and done put_page on the page table:
++	 * once we release i_mmap_mutex, another task can do the final put_page
++	 * and that page table be reused and filled with junk.
++	 */
+ 	flush_tlb_range(vma, start, end);
++	mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ }
+ 
+ int hugetlb_reserve_pages(struct inode *inode,
diff --git a/queue-3.5/series b/queue-3.5/series
index fb2e20bf76b..525b9c505b5 100644
--- a/queue-3.5/series
+++ b/queue-3.5/series
@@ -61,3 +61,4 @@ random-mix-in-architectural-randomness-in-extract_buf.patch
 hid-multitouch-add-support-for-novatek-touchscreen.patch
 hid-add-support-for-cypress-barcode-scanner-04b4-ed81.patch
 hid-add-asus-aio-keyboard-model-ak1d.patch
+mm-hugetlbfs-close-race-during-teardown-of-hugetlbfs-shared-page-tables.patch