From: Greg Kroah-Hartman
Date: Fri, 14 Aug 2015 02:34:44 +0000 (-0700)
Subject: 4.1-stable patches
X-Git-Tag: v3.10.87~17
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=2bcb443b5cac1a61add74986d98f1a102b1d92a6;p=thirdparty%2Fkernel%2Fstable-queue.git

4.1-stable patches

added patches:
	mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
---

diff --git a/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch b/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
new file mode 100644
index 00000000000..f76e63d88f2
--- /dev/null
+++ b/queue-4.1/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
@@ -0,0 +1,140 @@
+From ecf5fc6e9654cd7a268c782a523f072b2f1959f9 Mon Sep 17 00:00:00 2001
+From: Michal Hocko
+Date: Tue, 4 Aug 2015 14:36:58 -0700
+Subject: mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
+
+From: Michal Hocko
+
+commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.
+
+Nikolay has reported a hang when a memcg reclaim got stuck with the
+following backtrace:
+
+PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
+  #0 __schedule at ffffffff815ab152
+  #1 schedule at ffffffff815ab76e
+  #2 schedule_timeout at ffffffff815ae5e5
+  #3 io_schedule_timeout at ffffffff815aad6a
+  #4 bit_wait_io at ffffffff815abfc6
+  #5 __wait_on_bit at ffffffff815abda5
+  #6 wait_on_page_bit at ffffffff8111fd4f
+  #7 shrink_page_list at ffffffff81135445
+  #8 shrink_inactive_list at ffffffff81135845
+  #9 shrink_lruvec at ffffffff81135ead
+ #10 shrink_zone at ffffffff811360c3
+ #11 shrink_zones at ffffffff81136eff
+ #12 do_try_to_free_pages at ffffffff8113712f
+ #13 try_to_free_mem_cgroup_pages at ffffffff811372be
+ #14 try_charge at ffffffff81189423
+ #15 mem_cgroup_try_charge at ffffffff8118c6f5
+ #16 __add_to_page_cache_locked at ffffffff8112137d
+ #17 add_to_page_cache_lru at ffffffff81121618
+ #18 pagecache_get_page at ffffffff8112170b
+ #19 grow_dev_page at ffffffff811c8297
+ #20 __getblk_slow at ffffffff811c91d6
+ #21 __getblk_gfp at ffffffff811c92c1
+ #22 ext4_ext_grow_indepth at ffffffff8124565c
+ #23 ext4_ext_create_new_leaf at ffffffff81246ca8
+ #24 ext4_ext_insert_extent at ffffffff81246f09
+ #25 ext4_ext_map_blocks at ffffffff8124a848
+ #26 ext4_map_blocks at ffffffff8121a5b7
+ #27 mpage_map_one_extent at ffffffff8121b1fa
+ #28 mpage_map_and_submit_extent at ffffffff8121f07b
+ #29 ext4_writepages at ffffffff8121f6d5
+ #30 do_writepages at ffffffff8112c490
+ #31 __filemap_fdatawrite_range at ffffffff81120199
+ #32 filemap_flush at ffffffff8112041c
+ #33 ext4_alloc_da_blocks at ffffffff81219da1
+ #34 ext4_rename at ffffffff81229b91
+ #35 ext4_rename2 at ffffffff81229e32
+ #36 vfs_rename at ffffffff811a08a5
+ #37 SYSC_renameat2 at ffffffff811a3ffc
+ #38 sys_renameat2 at ffffffff811a408e
+ #39 sys_rename at ffffffff8119e51e
+ #40 system_call_fastpath at ffffffff815afa89
+
+Dave Chinner has properly pointed out that this is a deadlock in the
+reclaim code because ext4 doesn't submit pages which are marked by
+PG_writeback right away.
+
+The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
+with too many dirty pages") and it was applied only when may_enter_fs
+was specified. The code has been changed by c3b94f44fcb0 ("memcg:
+further prevent OOM with too many dirty pages") which has removed the
+__GFP_FS restriction with a reasoning that we do not get into the fs
+code. But this is not sufficient apparently because the fs doesn't
+necessarily submit pages marked PG_writeback for IO right away.
+
+ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
+submit the bio. Instead it tries to map more pages into the bio and
+mpage_map_one_extent might trigger memcg charge which might end up
+waiting on a page which is marked PG_writeback but hasn't been submitted
+yet so we would end up waiting for something that never finishes.
+
+Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
+before we go to wait on the writeback. The page fault path, which is
+the only path that triggers memcg oom killer since 3.12, shouldn't
+require GFP_NOFS and so we shouldn't reintroduce the premature OOM
+killer issue which was originally addressed by the heuristic.
+
+As per David Chinner the xfs is doing similar thing since 2.6.15 already
+so ext4 is not the only affected filesystem. Moreover he notes:
+
+: For example: IO completion might require unwritten extent conversion
+: which executes filesystem transactions and GFP_NOFS allocations. The
+: writeback flag on the pages can not be cleared until unwritten
+: extent conversion completes. Hence memory reclaim cannot wait on
+: page writeback to complete in GFP_NOFS context because it is not
+: safe to do so, memcg reclaim or otherwise.
+
+Cc: stable@vger.kernel.org # 3.9+
+[tytso@mit.edu: corrected the control flow]
+Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
+Reported-by: Nikolay Borisov
+Signed-off-by: Michal Hocko
+Signed-off-by: Hugh Dickins
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+
+---
+ mm/vmscan.c | 14 +++++---------
+ 1 file changed, 5 insertions(+), 9 deletions(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -937,21 +937,17 @@ static unsigned long shrink_page_list(st
+ 		 *
+ 		 * 2) Global reclaim encounters a page, memcg encounters a
+ 		 *    page that is not marked for immediate reclaim or
+-		 *    the caller does not have __GFP_IO. In this case mark
++		 *    the caller does not have __GFP_FS (or __GFP_IO if it's
++		 *    simply going to swap, not to fs). In this case mark
+ 		 *    the page for immediate reclaim and continue scanning.
+ 		 *
+-		 *    __GFP_IO is checked because a loop driver thread might
++		 *    Require may_enter_fs because we would wait on fs, which
++		 *    may not have submitted IO yet. And the loop driver might
+ 		 *    enter reclaim, and deadlock if it waits on a page for
+ 		 *    which it is needed to do the write (loop masks off
+ 		 *    __GFP_IO|__GFP_FS for this reason); but more thought
+ 		 *    would probably show more reasons.
+ 		 *
+-		 *    Don't require __GFP_FS, since we're not going into the
+-		 *    FS, just waiting on its writeback completion. Worryingly,
+-		 *    ext4 gfs2 and xfs allocate pages with
+-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+-		 *    may_enter_fs here is liable to OOM on them.
+-		 *
+ 		 * 3) memcg encounters a page that is not already marked
+ 		 *    PageReclaim. memcg does not have any dirty pages
+ 		 *    throttling so we could easily OOM just because too many
+@@ -968,7 +964,7 @@ static unsigned long shrink_page_list(st
+ 
+ 			/* Case 2 above */
+ 			} else if (global_reclaim(sc) ||
+-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
++			    !PageReclaim(page) || !may_enter_fs) {
+ 				/*
+ 				 * This is slightly racy - end_page_writeback()
+ 				 * might have just cleared PageReclaim, then
diff --git a/queue-4.1/series b/queue-4.1/series
index cdd3edcbef6..d6d7f78831b 100644
--- a/queue-4.1/series
+++ b/queue-4.1/series
@@ -76,3 +76,4 @@ usb-qcserial-add-support-for-dell-wireless-5809e-4g-modem.patch
 mtd-nand-fix-nand_use_bounce_buffer-flag-conflict.patch
 input-alps-only-dell-laptops-have-separate-button-bits-for-v2-dualpoint-sticks.patch
 thermal-exynos-disable-the-regulator-on-probe-failure.patch
+mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
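
Illustrative note, not part of the patch: the effect of the one-line change to the "Case 2" condition can be sketched in user-space C. This is only a model under stated assumptions. The `sketch_` names, the `struct sketch_page` type, and the flag values are hypothetical stand-ins, not kernel definitions, and case 1 (kswapd) is left out. The `may_enter_fs` expression is modelled on the wording of the updated comment in the patch (__GFP_FS, or __GFP_IO when the page is only going to swap).

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's real values. */
#define SKETCH_GFP_IO 0x1u
#define SKETCH_GFP_FS 0x2u

/* Hypothetical stand-in for the page state the check looks at. */
struct sketch_page {
	bool reclaim;    /* models PageReclaim(page) */
	bool swap_cache; /* models a page that only goes to swap, not to fs */
};

/* Models may_enter_fs: __GFP_FS, or __GFP_IO for a swap-only page. */
static bool sketch_may_enter_fs(unsigned int gfp_mask,
				const struct sketch_page *page)
{
	return (gfp_mask & SKETCH_GFP_FS) ||
	       (page->swap_cache && (gfp_mask & SKETCH_GFP_IO));
}

/*
 * Old "Case 2" test: returns true when reclaim marks the page and moves
 * on, false when it falls through to case 3 and waits on writeback.
 */
static bool sketch_skip_wait_old(bool global_reclaim, unsigned int gfp_mask,
				 const struct sketch_page *page)
{
	return global_reclaim || !page->reclaim ||
	       !(gfp_mask & SKETCH_GFP_IO);
}

/* New "Case 2" test: the __GFP_IO check is replaced by may_enter_fs. */
static bool sketch_skip_wait_new(bool global_reclaim, unsigned int gfp_mask,
				 const struct sketch_page *page)
{
	return global_reclaim || !page->reclaim ||
	       !sketch_may_enter_fs(gfp_mask, page);
}
```

In this model, a memcg reclaim with a GFP_NOFS-like mask (IO allowed, FS not) that meets a file page already marked PageReclaim falls through to the wait under the old test and is skipped under the new one, which matches the hang described in the commit message.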