From: Greg Kroah-Hartman
Date: Sun, 13 Aug 2017 15:56:39 +0000 (-0700)
Subject: 4.9-stable patches
X-Git-Tag: v3.18.66~13
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=787c4eb963921422b9a2b135589cf5eccc29a1fe;p=thirdparty%2Fkernel%2Fstable-queue.git

4.9-stable patches

added patches:
	futex-remove-unnecessary-warning-from-get_futex_key.patch
	mm-fix-list-corruptions-on-shmem-shrinklist.patch
	mm-ratelimit-pfns-busy-info-message.patch
---

diff --git a/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch b/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch
new file mode 100644
index 00000000000..70263da20b0
--- /dev/null
+++ b/queue-4.9/futex-remove-unnecessary-warning-from-get_futex_key.patch
@@ -0,0 +1,123 @@
+From 48fb6f4db940e92cfb16cd878cddd59ea6120d06 Mon Sep 17 00:00:00 2001
+From: Mel Gorman
+Date: Wed, 9 Aug 2017 08:27:11 +0100
+Subject: futex: Remove unnecessary warning from get_futex_key
+
+From: Mel Gorman
+
+commit 48fb6f4db940e92cfb16cd878cddd59ea6120d06 upstream.
+
+Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in
+get_futex_key()") removed an unnecessary lock_page() with the
+side-effect that page->mapping needed to be treated very carefully.
+
+Two defensive warnings were added in case any assumption was missed and
+the first warning assumed a correct application would not alter a
+mapping backing a futex key. Since merging, it has not triggered for
+any unexpected case but Mark Rutland reported the following bug
+triggering due to the first warning.
+
+  kernel BUG at kernel/futex.c:679!
+  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
+  Modules linked in:
+  CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
+  Hardware name: linux,dummy-virt (DT)
+  task: ffff80001e271780 task.stack: ffff000010908000
+  PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
+  LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
+  pc : [] lr : [] pstate: 80000145
+
+The fact that it's a bug instead of a warning was due to an unrelated
+arm64 problem, but the warning itself triggered because the underlying
+mapping changed.
+
+This is an application issue but from a kernel perspective it's a
+recoverable situation and the warning is unnecessary so this patch
+removes the warning. The warning may potentially be triggered with the
+following test program from Mark although it may be necessary to adjust
+NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
+system.
+
+    #include
+    #include
+    #include
+    #include
+    #include
+    #include
+    #include
+    #include
+
+    #define NR_FUTEX_THREADS 16
+    pthread_t threads[NR_FUTEX_THREADS];
+
+    void *mem;
+
+    #define MEM_PROT  (PROT_READ | PROT_WRITE)
+    #define MEM_SIZE  65536
+
+    static int futex_wrapper(int *uaddr, int op, int val,
+                             const struct timespec *timeout,
+                             int *uaddr2, int val3)
+    {
+	syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
+    }
+
+    void *poll_futex(void *unused)
+    {
+	for (;;) {
+		futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
+	}
+    }
+
+    int main(int argc, char *argv[])
+    {
+	int i;
+
+	mem = mmap(NULL, MEM_SIZE, MEM_PROT,
+		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+	printf("Mapping @ %p\n", mem);
+
+	printf("Creating futex threads...\n");
+
+	for (i = 0; i < NR_FUTEX_THREADS; i++)
+		pthread_create(&threads[i], NULL, poll_futex, NULL);
+
+	printf("Flipping mapping...\n");
+	for (;;) {
+		mmap(mem, MEM_SIZE, MEM_PROT,
+		     MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+	}
+
+	return 0;
+    }
+
+Reported-and-tested-by: Mark Rutland
+Signed-off-by: Mel Gorman
+Acked-by: Peter Zijlstra (Intel)
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ kernel/futex.c |    5 +++--
+ 1 file changed, 3 insertions(+), 2 deletions(-)
+
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -668,13 +668,14 @@ again:
+ 		 * this reference was taken by ihold under the page lock
+ 		 * pinning the inode in place so i_lock was unnecessary. The
+ 		 * only way for this check to fail is if the inode was
+-		 * truncated in parallel so warn for now if this happens.
++		 * truncated in parallel which is almost certainly an
++		 * application bug. In such a case, just retry.
+ 		 *
+ 		 * We are not calling into get_futex_key_refs() in file-backed
+ 		 * cases, therefore a successful atomic_inc return below will
+ 		 * guarantee that get_futex_key() will still imply smp_mb(); (B).
+ 		 */
+-		if (WARN_ON_ONCE(!atomic_inc_not_zero(&inode->i_count))) {
++		if (!atomic_inc_not_zero(&inode->i_count)) {
+ 			rcu_read_unlock();
+ 			put_page(page);
+ 
diff --git a/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch b/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch
new file mode 100644
index 00000000000..a3245e163ea
--- /dev/null
+++ b/queue-4.9/mm-fix-list-corruptions-on-shmem-shrinklist.patch
@@ -0,0 +1,109 @@
+From d041353dc98a6339182cd6f628b4c8f111278cb3 Mon Sep 17 00:00:00 2001
+From: Cong Wang
+Date: Thu, 10 Aug 2017 15:24:24 -0700
+Subject: mm: fix list corruptions on shmem shrinklist
+
+From: Cong Wang
+
+commit d041353dc98a6339182cd6f628b4c8f111278cb3 upstream.
+
+We saw many list corruption warnings on shmem shrinklist:
+
+  WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
+  list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
+  Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
+  CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
+  Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
+  Call Trace:
+    dump_stack+0x4d/0x66
+    __warn+0xcb/0xf0
+    warn_slowpath_fmt+0x4f/0x60
+    __list_del_entry+0x9e/0xc0
+    shmem_unused_huge_shrink+0xfa/0x2e0
+    shmem_unused_huge_scan+0x20/0x30
+    super_cache_scan+0x193/0x1a0
+    shrink_slab.part.41+0x1e3/0x3f0
+    shrink_slab+0x29/0x30
+    shrink_node+0xf9/0x2f0
+    kswapd+0x2d8/0x6c0
+    kthread+0xd7/0xf0
+    ret_from_fork+0x22/0x30
+
+  WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
+  list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
+  Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
+  CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1
+  Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
+  Call Trace:
+    dump_stack+0x4d/0x66
+    __warn+0xcb/0xf0
+    warn_slowpath_fmt+0x4f/0x60
+    __list_add+0x89/0xb0
+    shmem_setattr+0x204/0x230
+    notify_change+0x2ef/0x440
+    do_truncate+0x5d/0x90
+    path_openat+0x331/0x1190
+    do_filp_open+0x7e/0xe0
+    do_sys_open+0x123/0x200
+    SyS_open+0x1e/0x20
+    do_syscall_64+0x61/0x170
+    entry_SYSCALL64_slow_path+0x25/0x25
+
+The problem is that shmem_unused_huge_shrink() moves entries from the
+global sbinfo->shrinklist to its local lists and then releases the
+spinlock. However, a parallel shmem_setattr() could access one of these
+entries directly and add it back to the global shrinklist if it is
+removed, with the spinlock held.
+
+The logic itself looks solid since an entry could be either in a local
+list or the global list, otherwise it is removed from one of them by
+list_del_init(). So probably the race condition is that, one CPU is in
+the middle of INIT_LIST_HEAD() but the other CPU calls list_empty()
+which returns true too early then the following list_add_tail() sees a
+corrupted entry.
+
+list_empty_careful() is designed to fix this situation.
+
+[akpm@linux-foundation.org: add comments]
+Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
+Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
+Signed-off-by: Cong Wang
+Acked-by: Linus Torvalds
+Acked-by: Kirill A. Shutemov
+Cc: Hugh Dickins
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ mm/shmem.c |   12 ++++++++++--
+ 1 file changed, 10 insertions(+), 2 deletions(-)
+
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -1007,7 +1007,11 @@ static int shmem_setattr(struct dentry *
+ 			 */
+ 			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) {
+ 				spin_lock(&sbinfo->shrinklist_lock);
+-				if (list_empty(&info->shrinklist)) {
++				/*
++				 * _careful to defend against unlocked access to
++				 * ->shrink_list in shmem_unused_huge_shrink()
++				 */
++				if (list_empty_careful(&info->shrinklist)) {
+ 					list_add_tail(&info->shrinklist,
+ 							&sbinfo->shrinklist);
+ 					sbinfo->shrinklist_len++;
+@@ -1774,7 +1778,11 @@ alloc_nohuge:		page = shmem_alloc_and_ac
+ 			 * to shrink under memory pressure.
+ 			 */
+ 			spin_lock(&sbinfo->shrinklist_lock);
+-			if (list_empty(&info->shrinklist)) {
++			/*
++			 * _careful to defend against unlocked access to
++			 * ->shrink_list in shmem_unused_huge_shrink()
++			 */
++			if (list_empty_careful(&info->shrinklist)) {
+ 				list_add_tail(&info->shrinklist,
+ 						&sbinfo->shrinklist);
+ 				sbinfo->shrinklist_len++;
diff --git a/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch b/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch
new file mode 100644
index 00000000000..c52800a82cd
--- /dev/null
+++ b/queue-4.9/mm-ratelimit-pfns-busy-info-message.patch
@@ -0,0 +1,79 @@
+From 75dddef32514f7aa58930bde6a1263253bc3d4ba Mon Sep 17 00:00:00 2001
+From: Jonathan Toppins
+Date: Thu, 10 Aug 2017 15:23:35 -0700
+Subject: mm: ratelimit PFNs busy info message
+
+From: Jonathan Toppins
+
+commit 75dddef32514f7aa58930bde6a1263253bc3d4ba upstream.
+
+The RDMA subsystem can generate several thousand of these messages per
+second eventually leading to a kernel crash. Ratelimit these messages
+to prevent this crash.
+
+Doug said:
+ "I've been carrying a version of this for several kernel versions. I
+  don't remember when they started, but we have one (and only one) class
+  of machines: Dell PE R730xd, that generate these errors. When it
+  happens, without a rate limit, we get rcu timeouts and kernel oopses.
+  With the rate limit, we just get a lot of annoying kernel messages but
+  the machine continues on, recovers, and eventually the memory
+  operations all succeed"
+
+And:
+ "> Well... why are all these EBUSY's occurring? It sounds inefficient
+  > (at least) but if it is expected, normal and unavoidable then
+  > perhaps we should just remove that message altogether?
+
+  I don't have an answer to that question. To be honest, I haven't
+  looked real hard. We never had this at all, then it started out of the
+  blue, but only on our Dell 730xd machines (and it hits all of them),
+  but no other classes or brands of machines. And we have our 730xd
+  machines loaded up with different brands and models of cards (for
+  instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
+  ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
+  meant it wasn't tied to any particular brand/model of RDMA hardware.
+  To me, it always smelled of a hardware oddity specific to maybe the
+  CPUs or mainboard chipsets in these machines, so given that I'm not an
+  mm expert anyway, I never chased it down.
+
+  A few other relevant details: it showed up somewhere around 4.8/4.9 or
+  thereabouts. It never happened before, but the prinkt has been there
+  since the 3.18 days, so possibly the test to trigger this message was
+  changed, or something else in the allocator changed such that the
+  situation started happening on these machines?
+
+  And, like I said, it is specific to our 730xd machines (but they are
+  all identical, so that could mean it's something like their specific
+  ram configuration is causing the allocator to hit this on these
+  machine but not on other machines in the cluster, I don't want to say
+  it's necessarily the model of chipset or CPU, there are other bits of
+  identicalness between these machines)"
+
+Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
+Signed-off-by: Jonathan Toppins
+Reviewed-by: Doug Ledford
+Tested-by: Doug Ledford
+Cc: Michal Hocko
+Cc: Vlastimil Babka
+Cc: Mel Gorman
+Cc: Hillf Danton
+Signed-off-by: Andrew Morton
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+---
+ mm/page_alloc.c |    2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -7335,7 +7335,7 @@ int alloc_contig_range(unsigned long sta
+ 
+ 	/* Make sure the range is really isolated. */
+ 	if (test_pages_isolated(outer_start, end, false)) {
+-		pr_info("%s: [%lx, %lx) PFNs busy\n",
++		pr_info_ratelimited("%s: [%lx, %lx) PFNs busy\n",
+ 			__func__, outer_start, end);
+ 		ret = -EBUSY;
+ 		goto done;