From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Sun, 13 Aug 2017 15:56:39 +0000 (-0700)
Subject: 4.9-stable patches
X-Git-Tag: v3.18.66~13
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=787c4eb963921422b9a2b135589cf5eccc29a1fe;p=thirdparty%2Fkernel%2Fstable-queue.git

4.9-stable patches

added patches:
	futex-remove-unnecessary-warning-from-get_futex_key.patch
	mm-fix-list-corruptions-on-shmem-shrinklist.patch
	mm-ratelimit-pfns-busy-info-message.patch
---

diff --git a/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch b/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch
new file mode 100644
index 00000000000..70263da20b0
--- /dev/null
+++ b/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch
@@ -0,0 +1,123 @@
+From 48fb6f4db940e92cfb16cd878cddd59ea6120d06 Mon Sep 17 00:00:00 2001
+From: Mel Gorman <mgorman@suse.de>
+Date: Wed, 9 Aug 2017 08:27:11 +0100
+Subject: futex: Remove unnecessary warning from get_futex_key
+
+From: Mel Gorman <mgorman@suse.de>
+
+commit 48fb6f4db940e92cfb16cd878cddd59ea6120d06 upstream.
+
+Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in
+get_futex_key()") removed an unnecessary lock_page() with the
+side-effect that page->mapping needed to be treated very carefully.
+
+Two defensive warnings were added in case any assumption was missed and
+the first warning assumed a correct application would not alter a
+mapping backing a futex key.  Since merging, it has not triggered for
+any unexpected case but Mark Rutland reported the following bug
+triggering due to the first warning.
+
+  kernel BUG at kernel/futex.c:679!
+  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
+  Modules linked in:
+  CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
+  Hardware name: linux,dummy-virt (DT)
+  task: ffff80001e271780 task.stack: ffff000010908000
+  PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
+  LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
+  pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
+
+The fact that it's a bug instead of a warning was due to an unrelated
+arm64 problem, but the warning itself triggered because the underlying
+mapping changed.
+
+This is an application issue but from a kernel perspective it's a
+recoverable situation and the warning is unnecessary so this patch
+removes the warning.  The warning may potentially be triggered with the
+following test program from Mark although it may be necessary to adjust
+NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
+system.
+
+    #include <linux/futex.h>
+    #include <pthread.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <sys/mman.h>
+    #include <sys/syscall.h>
+    #include <sys/time.h>
+    #include <unistd.h>
+
+    #define NR_FUTEX_THREADS 16
+    pthread_t threads[NR_FUTEX_THREADS];
+
+    void *mem;
+
+    #define MEM_PROT  (PROT_READ | PROT_WRITE)
+    #define MEM_SIZE  65536
+
+    static int futex_wrapper(int *uaddr, int op, int val,
+                             const struct timespec *timeout,
+                             int *uaddr2, int val3)
+    {
+        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
+    }
+
+    void *poll_futex(void *unused)
+    {
+        for (;;) {
+            futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
+        }
+    }
+
+    int main(int argc, char *argv[])
+    {
+        int i;
+
+        mem = mmap(NULL, MEM_SIZE, MEM_PROT,
+               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+        printf("Mapping @ %p\n", mem);
+
+        printf("Creating futex threads...\n");
+
+        for (i = 0; i < NR_FUTEX_THREADS; i++)
+            pthread_create(&threads[i], NULL, poll_futex, NULL);
+
+        printf("Flipping mapping...\n");
+        for (;;) {
+            mmap(mem, MEM_SIZE, MEM_PROT,
+                 MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+        }
+
+        return 0;
+    }
+
+Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
+Signed-off-by: Mel Gorman <mgorman@suse.de>
+Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ kernel/futex.c |    5 +++--
+ 1 file changed, 3 insertions(+), 2 deletions(-)
+
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -668,13 +668,14 @@ again:
+ 		 * this reference was taken by ihold under the page lock
+ 		 * pinning the inode in place so i_lock was unnecessary. The
+ 		 * only way for this check to fail is if the inode was
+-		 * truncated in parallel so warn for now if this happens.
++		 * truncated in parallel which is almost certainly an
++		 * application bug. In such a case, just retry.
+ 		 *
+ 		 * We are not calling into get_futex_key_refs() in file-backed
+ 		 * cases, therefore a successful atomic_inc return below will
+ 		 * guarantee that get_futex_key() will still imply smp_mb(); (B).
+ 		 */
+-		if (WARN_ON_ONCE(!atomic_inc_not_zero(&inode->i_count))) {
++		if (!atomic_inc_not_zero(&inode->i_count)) {
+ 			rcu_read_unlock();
+ 			put_page(page);
+ 
diff --git a/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch b/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch
new file mode 100644
index 00000000000..a3245e163ea
--- /dev/null
+++ b/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch
@@ -0,0 +1,109 @@
+From d041353dc98a6339182cd6f628b4c8f111278cb3 Mon Sep 17 00:00:00 2001
+From: Cong Wang <xiyou.wangcong@gmail.com>
+Date: Thu, 10 Aug 2017 15:24:24 -0700
+Subject: mm: fix list corruptions on shmem shrinklist
+
+From: Cong Wang <xiyou.wangcong@gmail.com>
+
+commit d041353dc98a6339182cd6f628b4c8f111278cb3 upstream.
+
+We saw many list corruption warnings on shmem shrinklist:
+
+  WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
+  list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
+  Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
+  CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
+  Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
+  Call Trace:
+    dump_stack+0x4d/0x66
+    __warn+0xcb/0xf0
+    warn_slowpath_fmt+0x4f/0x60
+    __list_del_entry+0x9e/0xc0
+    shmem_unused_huge_shrink+0xfa/0x2e0
+    shmem_unused_huge_scan+0x20/0x30
+    super_cache_scan+0x193/0x1a0
+    shrink_slab.part.41+0x1e3/0x3f0
+    shrink_slab+0x29/0x30
+    shrink_node+0xf9/0x2f0
+    kswapd+0x2d8/0x6c0
+    kthread+0xd7/0xf0
+    ret_from_fork+0x22/0x30
+
+  WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
+  list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
+  Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
+  CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
+  Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
+  Call Trace:
+    dump_stack+0x4d/0x66
+    __warn+0xcb/0xf0
+    warn_slowpath_fmt+0x4f/0x60
+    __list_add+0x89/0xb0
+    shmem_setattr+0x204/0x230
+    notify_change+0x2ef/0x440
+    do_truncate+0x5d/0x90
+    path_openat+0x331/0x1190
+    do_filp_open+0x7e/0xe0
+    do_sys_open+0x123/0x200
+    SyS_open+0x1e/0x20
+    do_syscall_64+0x61/0x170
+    entry_SYSCALL64_slow_path+0x25/0x25
+
+The problem is that shmem_unused_huge_shrink() moves entries from the
+global sbinfo->shrinklist to its local lists and then releases the
+spinlock.  However, a parallel shmem_setattr() could access one of these
+entries directly and add it back to the global shrinklist if it is
+removed, with the spinlock held.
+
+The logic itself looks solid since an entry could be either in a local
+list or the global list, otherwise it is removed from one of them by
+list_del_init().  So the race is probably that one CPU is in the
+middle of INIT_LIST_HEAD() while the other CPU calls list_empty(),
+which returns true too early, and the following list_add_tail() then
+sees a corrupted entry.
+
+list_empty_careful() is designed to fix this situation.
+
+[akpm@linux-foundation.org: add comments]
+Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
+Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
+Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
+Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
+Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
+Cc: Hugh Dickins <hughd@google.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/shmem.c |   12 ++++++++++--
+ 1 file changed, 10 insertions(+), 2 deletions(-)
+
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -1007,7 +1007,11 @@ static int shmem_setattr(struct dentry *
+ 			 */
+ 			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
+ 				spin_lock(&sbinfo->shrinklist_lock);
+-				if (list_empty(&info->shrinklist)) {
++				/*
++				 * _careful to defend against unlocked access to
++				 * ->shrink_list in shmem_unused_huge_shrink()
++				 */
++				if (list_empty_careful(&info->shrinklist)) {
+ 					list_add_tail(&info->shrinklist,
+ 							&sbinfo->shrinklist);
+ 					sbinfo->shrinklist_len++;
+@@ -1774,7 +1778,11 @@ alloc_nohuge:		page = shmem_alloc_and_ac
+ 			 * to shrink under memory pressure.
+ 			 */
+ 			spin_lock(&sbinfo->shrinklist_lock);
+-			if (list_empty(&info->shrinklist)) {
++			/*
++			 * _careful to defend against unlocked access to
++			 * ->shrink_list in shmem_unused_huge_shrink()
++			 */
++			if (list_empty_careful(&info->shrinklist)) {
+ 				list_add_tail(&info->shrinklist,
+ 						&sbinfo->shrinklist);
+ 				sbinfo->shrinklist_len++;
diff --git a/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch b/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch
new file mode 100644
index 00000000000..c52800a82cd
--- /dev/null
+++ b/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch
@@ -0,0 +1,79 @@
+From 75dddef32514f7aa58930bde6a1263253bc3d4ba Mon Sep 17 00:00:00 2001
+From: Jonathan Toppins <jtoppins@redhat.com>
+Date: Thu, 10 Aug 2017 15:23:35 -0700
+Subject: mm: ratelimit PFNs busy info message
+
+From: Jonathan Toppins <jtoppins@redhat.com>
+
+commit 75dddef32514f7aa58930bde6a1263253bc3d4ba upstream.
+
+The RDMA subsystem can generate several thousand of these messages per
+second eventually leading to a kernel crash.  Ratelimit these messages
+to prevent this crash.
+
+Doug said:
+ "I've been carrying a version of this for several kernel versions. I
+  don't remember when they started, but we have one (and only one) class
+  of machines: Dell PE R730xd, that generate these errors. When it
+  happens, without a rate limit, we get rcu timeouts and kernel oopses.
+  With the rate limit, we just get a lot of annoying kernel messages but
+  the machine continues on, recovers, and eventually the memory
+  operations all succeed"
+
+And:
+ "> Well... why are all these EBUSY's occurring? It sounds inefficient
+  > (at least) but if it is expected, normal and unavoidable then
+  > perhaps we should just remove that message altogether?
+
+  I don't have an answer to that question. To be honest, I haven't
+  looked real hard. We never had this at all, then it started out of the
+  blue, but only on our Dell 730xd machines (and it hits all of them),
+  but no other classes or brands of machines. And we have our 730xd
+  machines loaded up with different brands and models of cards (for
+  instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
+  ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
+  meant it wasn't tied to any particular brand/model of RDMA hardware.
+  To me, it always smelled of a hardware oddity specific to maybe the
+  CPUs or mainboard chipsets in these machines, so given that I'm not an
+  mm expert anyway, I never chased it down.
+
+  A few other relevant details: it showed up somewhere around 4.8/4.9 or
+  thereabouts. It never happened before, but the printk has been there
+  since the 3.18 days, so possibly the test to trigger this message was
+  changed, or something else in the allocator changed such that the
+  situation started happening on these machines?
+
+  And, like I said, it is specific to our 730xd machines (but they are
+  all identical, so that could mean it's something like their specific
+  ram configuration is causing the allocator to hit this on these
+  machine but not on other machines in the cluster, I don't want to say
+  it's necessarily the model of chipset or CPU, there are other bits of
+  identicalness between these machines)"
+
+Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
+Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
+Reviewed-by: Doug Ledford <dledford@redhat.com>
+Tested-by: Doug Ledford <dledford@redhat.com>
+Cc: Michal Hocko <mhocko@suse.com>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: Mel Gorman <mgorman@techsingularity.net>
+Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+---
+ mm/page_alloc.c |    2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7335,7 +7335,7 @@ int alloc_contig_range(unsigned long sta
+ 
+ 	/* Make sure the range is really isolated. */
+ 	if (test_pages_isolated(outer_start, end, false)) {
+-		pr_info("%s: [%lx, %lx) PFNs busy\n",
++		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
+ 			__func__, outer_start, end);
+ 		ret = -EBUSY;
+ 		goto done;