From: Greg Kroah-Hartman
Date: Fri, 14 Aug 2015 17:33:09 +0000 (-0700)
Subject: 3.14-stable patches
X-Git-Tag: v3.10.87~2
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=d580888da7f9c31c0500325e26a2b1f0efea226b;p=thirdparty%2Fkernel%2Fstable-queue.git

3.14-stable patches

added patches:
	mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
---

diff --git a/queue-3.14/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch b/queue-3.14/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
new file mode 100644
index 00000000000..e986a1a10cc
--- /dev/null
+++ b/queue-3.14/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
@@ -0,0 +1,139 @@
+From ecf5fc6e9654cd7a268c782a523f072b2f1959f9 Mon Sep 17 00:00:00 2001
+From: Michal Hocko
+Date: Tue, 4 Aug 2015 14:36:58 -0700
+Subject: mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
+
+From: Michal Hocko
+
+commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.
+
+Nikolay has reported a hang when a memcg reclaim got stuck with the
+following backtrace:
+
+PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
+  #0 __schedule at ffffffff815ab152
+  #1 schedule at ffffffff815ab76e
+  #2 schedule_timeout at ffffffff815ae5e5
+  #3 io_schedule_timeout at ffffffff815aad6a
+  #4 bit_wait_io at ffffffff815abfc6
+  #5 __wait_on_bit at ffffffff815abda5
+  #6 wait_on_page_bit at ffffffff8111fd4f
+  #7 shrink_page_list at ffffffff81135445
+  #8 shrink_inactive_list at ffffffff81135845
+  #9 shrink_lruvec at ffffffff81135ead
+ #10 shrink_zone at ffffffff811360c3
+ #11 shrink_zones at ffffffff81136eff
+ #12 do_try_to_free_pages at ffffffff8113712f
+ #13 try_to_free_mem_cgroup_pages at ffffffff811372be
+ #14 try_charge at ffffffff81189423
+ #15 mem_cgroup_try_charge at ffffffff8118c6f5
+ #16 __add_to_page_cache_locked at ffffffff8112137d
+ #17 add_to_page_cache_lru at ffffffff81121618
+ #18 pagecache_get_page at ffffffff8112170b
+ #19 grow_dev_page at ffffffff811c8297
+ #20 __getblk_slow at ffffffff811c91d6
+ #21 __getblk_gfp at ffffffff811c92c1
+ #22 ext4_ext_grow_indepth at ffffffff8124565c
+ #23 ext4_ext_create_new_leaf at ffffffff81246ca8
+ #24 ext4_ext_insert_extent at ffffffff81246f09
+ #25 ext4_ext_map_blocks at ffffffff8124a848
+ #26 ext4_map_blocks at ffffffff8121a5b7
+ #27 mpage_map_one_extent at ffffffff8121b1fa
+ #28 mpage_map_and_submit_extent at ffffffff8121f07b
+ #29 ext4_writepages at ffffffff8121f6d5
+ #30 do_writepages at ffffffff8112c490
+ #31 __filemap_fdatawrite_range at ffffffff81120199
+ #32 filemap_flush at ffffffff8112041c
+ #33 ext4_alloc_da_blocks at ffffffff81219da1
+ #34 ext4_rename at ffffffff81229b91
+ #35 ext4_rename2 at ffffffff81229e32
+ #36 vfs_rename at ffffffff811a08a5
+ #37 SYSC_renameat2 at ffffffff811a3ffc
+ #38 sys_renameat2 at ffffffff811a408e
+ #39 sys_rename at ffffffff8119e51e
+ #40 system_call_fastpath at ffffffff815afa89
+
+Dave Chinner has properly pointed out that this is a deadlock in the
+reclaim code because ext4 doesn't submit pages which are marked by
+PG_writeback right away.
+
+The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
+with too many dirty pages") and it was applied only when may_enter_fs
+was specified. The code has been changed by c3b94f44fcb0 ("memcg:
+further prevent OOM with too many dirty pages") which has removed the
+__GFP_FS restriction with a reasoning that we do not get into the fs
+code. But this is not sufficient apparently because the fs doesn't
+necessarily submit pages marked PG_writeback for IO right away.
+
+ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
+submit the bio. Instead it tries to map more pages into the bio and
+mpage_map_one_extent might trigger memcg charge which might end up
+waiting on a page which is marked PG_writeback but hasn't been submitted
+yet so we would end up waiting for something that never finishes.
+
+Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
+before we go to wait on the writeback. The page fault path, which is
+the only path that triggers memcg oom killer since 3.12, shouldn't
+require GFP_NOFS and so we shouldn't reintroduce the premature OOM
+killer issue which was originally addressed by the heuristic.
+
+As per David Chinner the xfs is doing similar thing since 2.6.15 already
+so ext4 is not the only affected filesystem. Moreover he notes:
+
+: For example: IO completion might require unwritten extent conversion
+: which executes filesystem transactions and GFP_NOFS allocations. The
+: writeback flag on the pages can not be cleared until unwritten
+: extent conversion completes. Hence memory reclaim cannot wait on
+: page writeback to complete in GFP_NOFS context because it is not
+: safe to do so, memcg reclaim or otherwise.
+
+[tytso@mit.edu: corrected the control flow]
+Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
+Reported-by: Nikolay Borisov
+Signed-off-by: Michal Hocko
+Signed-off-by: Hugh Dickins
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+
+---
+ mm/vmscan.c |   14 +++++---------
+ 1 file changed, 5 insertions(+), 9 deletions(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -871,21 +871,17 @@ static unsigned long shrink_page_list(st
+ 		 *
+ 		 * 2) Global reclaim encounters a page, memcg encounters a
+ 		 *    page that is not marked for immediate reclaim or
+-		 *    the caller does not have __GFP_IO. In this case mark
++		 *    the caller does not have __GFP_FS (or __GFP_IO if it's
++		 *    simply going to swap, not to fs). In this case mark
+ 		 *    the page for immediate reclaim and continue scanning.
+ 		 *
+-		 *    __GFP_IO is checked because a loop driver thread might
++		 *    Require may_enter_fs because we would wait on fs, which
++		 *    may not have submitted IO yet. And the loop driver might
+ 		 *    enter reclaim, and deadlock if it waits on a page for
+ 		 *    which it is needed to do the write (loop masks off
+ 		 *    __GFP_IO|__GFP_FS for this reason); but more thought
+ 		 *    would probably show more reasons.
+ 		 *
+-		 *    Don't require __GFP_FS, since we're not going into the
+-		 *    FS, just waiting on its writeback completion. Worryingly,
+-		 *    ext4 gfs2 and xfs allocate pages with
+-		 *    grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
+-		 *    may_enter_fs here is liable to OOM on them.
+-		 *
+ 		 * 3) memcg encounters a page that is not already marked
+ 		 *    PageReclaim. memcg does not have any dirty pages
+ 		 *    throttling so we could easily OOM just because too many
+@@ -902,7 +898,7 @@ static unsigned long shrink_page_list(st
+
+ 			/* Case 2 above */
+ 			} else if (global_reclaim(sc) ||
+-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
++			    !PageReclaim(page) || !may_enter_fs) {
+ 				/*
+ 				 * This is slightly racy - end_page_writeback()
+ 				 * might have just cleared PageReclaim, then
diff --git a/queue-3.14/series b/queue-3.14/series
index d5df545b0a4..edce60fa26b 100644
--- a/queue-3.14/series
+++ b/queue-3.14/series
@@ -41,3 +41,4 @@ dcache-don-t-need-rcu-in-shrink_dentry_list.patch
 kvm-x86-fix-kvm_apic_has_events-to-check-for-null-pointer.patch
 path_openat-fix-double-fput.patch
 md-bitmap-return-an-error-when-bitmap-superblock-is-corrupt.patch
+mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
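
Note on the mm/vmscan.c hunk: the effect of swapping the __GFP_IO test for
may_enter_fs in the "Case 2" condition can be modeled in userspace. The
following is a minimal sketch, not kernel code. It assumes may_enter_fs is
computed as the updated comment describes (__GFP_FS, or __GFP_IO for a page
that is only going to swap), and it ignores case 1 (the kswapd check). The
MODEL_* flag values, struct model_page, and all function names are
hypothetical and exist only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this model; the real __GFP_* values differ. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/* Hypothetical stand-in for the two page state bits the condition reads. */
struct model_page {
	bool reclaim;    /* models PageReclaim(page) */
	bool swap_cache; /* models PageSwapCache(page) */
};

/* Models may_enter_fs: __GFP_FS, or __GFP_IO when the page only goes to swap. */
static bool model_may_enter_fs(unsigned int gfp_mask,
			       const struct model_page *page)
{
	return (gfp_mask & MODEL_GFP_FS) ||
	       (page->swap_cache && (gfp_mask & MODEL_GFP_IO));
}

/* Case 2 test before the patch: true means "mark and skip, do not wait". */
static bool model_skip_wait_old(bool global_reclaim, unsigned int gfp_mask,
				const struct model_page *page)
{
	return global_reclaim || !page->reclaim ||
	       !(gfp_mask & MODEL_GFP_IO);
}

/* Case 2 test after the patch: the __GFP_IO test becomes !may_enter_fs. */
static bool model_skip_wait_new(bool global_reclaim, unsigned int gfp_mask,
				const struct model_page *page)
{
	return global_reclaim || !page->reclaim ||
	       !model_may_enter_fs(gfp_mask, page);
}
```

In this model, a memcg reclaim (global_reclaim false) that hits a file page
with reclaim set while holding only MODEL_GFP_IO (a GFP_NOFS-like mask) falls
through to the wait under the old test and is skipped under the new one. That
matches the reported backtrace, where the allocation came from inside ext4
writeback.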