From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Fri, 14 Aug 2015 02:34:44 +0000 (-0700)
Subject: 4.1-stable patches
X-Git-Tag: v3.10.87~17
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=2bcb443b5cac1a61add74986d98f1a102b1d92a6;p=thirdparty%2Fkernel%2Fstable-queue.git

4.1-stable patches

added patches:
	mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
---

diff --git a/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch b/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
new file mode 100644
index 00000000000..f76e63d88f2
--- /dev/null
+++ b/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
@@ -0,0 +1,140 @@
+From ecf5fc6e9654cd7a268c782a523f072b2f1959f9 Mon Sep 17 00:00:00 2001
+From: Michal Hocko <mhocko@suse.cz>
+Date: Tue, 4 Aug 2015 14:36:58 -0700
+Subject: mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
+
+From: Michal Hocko <mhocko@suse.cz>
+
+commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.
+
+Nikolay has reported a hang in which memcg reclaim got stuck with the
+following backtrace:
+
+PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
+  #0 __schedule at ffffffff815ab152
+  #1 schedule at ffffffff815ab76e
+  #2 schedule_timeout at ffffffff815ae5e5
+  #3 io_schedule_timeout at ffffffff815aad6a
+  #4 bit_wait_io at ffffffff815abfc6
+  #5 __wait_on_bit at ffffffff815abda5
+  #6 wait_on_page_bit at ffffffff8111fd4f
+  #7 shrink_page_list at ffffffff81135445
+  #8 shrink_inactive_list at ffffffff81135845
+  #9 shrink_lruvec at ffffffff81135ead
+ #10 shrink_zone at ffffffff811360c3
+ #11 shrink_zones at ffffffff81136eff
+ #12 do_try_to_free_pages at ffffffff8113712f
+ #13 try_to_free_mem_cgroup_pages at ffffffff811372be
+ #14 try_charge at ffffffff81189423
+ #15 mem_cgroup_try_charge at ffffffff8118c6f5
+ #16 __add_to_page_cache_locked at ffffffff8112137d
+ #17 add_to_page_cache_lru at ffffffff81121618
+ #18 pagecache_get_page at ffffffff8112170b
+ #19 grow_dev_page at ffffffff811c8297
+ #20 __getblk_slow at ffffffff811c91d6
+ #21 __getblk_gfp at ffffffff811c92c1
+ #22 ext4_ext_grow_indepth at ffffffff8124565c
+ #23 ext4_ext_create_new_leaf at ffffffff81246ca8
+ #24 ext4_ext_insert_extent at ffffffff81246f09
+ #25 ext4_ext_map_blocks at ffffffff8124a848
+ #26 ext4_map_blocks at ffffffff8121a5b7
+ #27 mpage_map_one_extent at ffffffff8121b1fa
+ #28 mpage_map_and_submit_extent at ffffffff8121f07b
+ #29 ext4_writepages at ffffffff8121f6d5
+ #30 do_writepages at ffffffff8112c490
+ #31 __filemap_fdatawrite_range at ffffffff81120199
+ #32 filemap_flush at ffffffff8112041c
+ #33 ext4_alloc_da_blocks at ffffffff81219da1
+ #34 ext4_rename at ffffffff81229b91
+ #35 ext4_rename2 at ffffffff81229e32
+ #36 vfs_rename at ffffffff811a08a5
+ #37 SYSC_renameat2 at ffffffff811a3ffc
+ #38 sys_renameat2 at ffffffff811a408e
+ #39 sys_rename at ffffffff8119e51e
+ #40 system_call_fastpath at ffffffff815afa89
+
+Dave Chinner has correctly pointed out that this is a deadlock in the
+reclaim code because ext4 doesn't submit pages which are marked
+PG_writeback right away.
+
+The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
+with too many dirty pages") and it was applied only when may_enter_fs
+was specified.  The code was changed by c3b94f44fcb0 ("memcg:
+further prevent OOM with too many dirty pages"), which removed the
+__GFP_FS restriction on the reasoning that we do not get into the fs
+code.  But this is apparently not sufficient, because the fs doesn't
+necessarily submit pages marked PG_writeback for IO right away.
+
+ext4_bio_write_page calls io_submit_add_bh, but that doesn't necessarily
+submit the bio.  Instead it tries to map more pages into the bio, and
+mpage_map_one_extent might trigger a memcg charge, which might end up
+waiting on a page that is marked PG_writeback but hasn't been submitted
+yet, so we would end up waiting for something that never finishes.
+
+Fix this issue by replacing the __GFP_IO check with a may_enter_fs check
+(for case 2) before we go to wait on the writeback.  The page fault path,
+which is the only path that triggers the memcg oom killer since 3.12,
+shouldn't require GFP_NOFS, so we shouldn't reintroduce the premature OOM
+killer issue which was originally addressed by the heuristic.
+
+As per David Chinner, xfs has been doing a similar thing since 2.6.15
+already, so ext4 is not the only affected filesystem.  Moreover he notes:
+
+: For example: IO completion might require unwritten extent conversion
+: which executes filesystem transactions and GFP_NOFS allocations. The
+: writeback flag on the pages can not be cleared until unwritten
+: extent conversion completes. Hence memory reclaim cannot wait on
+: page writeback to complete in GFP_NOFS context because it is not
+: safe to do so, memcg reclaim or otherwise.
+
+Cc: stable@vger.kernel.org # 3.9+
+[tytso@mit.edu: corrected the control flow]
+Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
+Reported-by: Nikolay Borisov <kernel@kyup.com>
+Signed-off-by: Michal Hocko <mhocko@suse.cz>
+Signed-off-by: Hugh Dickins <hughd@google.com>
+Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+
+
+---
+ mm/vmscan.c |   14 +++++---------
+ 1 file changed, 5 insertions(+), 9 deletions(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -937,21 +937,17 @@ static unsigned long shrink_page_list(st
+ 		 *
+ 		 * 2) Global reclaim encounters a page, memcg encounters a
+ 		 *    page that is not marked for immediate reclaim or
+-		 *    the caller does not have __GFP_IO. In this case mark
++		 *    the caller does not have __GFP_FS (or __GFP_IO if it's
++		 *    simply going to swap, not to fs). In this case mark
+ 		 *    the page for immediate reclaim and continue scanning.
+ 		 *
+-		 *    __GFP_IO is checked  because a loop driver thread might
++		 *    Require may_enter_fs because we would wait on fs, which
++		 *    may not have submitted IO yet. And the loop driver might
+ 		 *    enter reclaim, and deadlock if it waits on a page for
+ 		 *    which it is needed to do the write (loop masks off
+ 		 *    __GFP_IO|__GFP_FS for this reason); but more thought
+ 		 *    would probably show more reasons.
+ 		 *
+-		 *    Don't require __GFP_FS, since we're not going into the
+-		 *    FS, just waiting on its writeback completion. Worryingly,
+-		 *    ext4 gfs2 and xfs allocate pages with
+-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+-		 *    may_enter_fs here is liable to OOM on them.
+-		 *
+ 		 * 3) memcg encounters a page that is not already marked
+ 		 *    PageReclaim. memcg does not have any dirty pages
+ 		 *    throttling so we could easily OOM just because too many
+@@ -968,7 +964,7 @@ static unsigned long shrink_page_list(st
+ 
+ 			/* Case 2 above */
+ 			} else if (global_reclaim(sc) ||
+-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
++			    !PageReclaim(page) || !may_enter_fs) {
+ 				/*
+ 				 * This is slightly racy - end_page_writeback()
+ 				 * might have just cleared PageReclaim, then
diff --git a/queue-4.1/series b/queue-4.1/series
index cdd3edcbef6..d6d7f78831b 100644
--- a/queue-4.1/series
+++ b/queue-4.1/series
@@ -76,3 +76,4 @@ usb-qcserial-add-support-for-dell-wireless-5809e-4g-modem.patch
 mtd-nand-fix-nand_use_bounce_buffer-flag-conflict.patch
 input-alps-only-dell-laptops-have-separate-button-bits-for-v2-dualpoint-sticks.patch
 thermal-exynos-disable-the-regulator-on-probe-failure.patch
+mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch