From: Greg Kroah-Hartman
Date: Fri, 14 Aug 2015 17:33:05 +0000 (-0700)
Subject: 3.10-stable patches
X-Git-Tag: v3.10.87~3
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=87a8661aa53361b21854858f30774c6f66e030c8;p=thirdparty%2Fkernel%2Fstable-queue.git

3.10-stable patches

added patches:
	mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
---

diff --git a/queue-3.10/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch b/queue-3.10/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
new file mode 100644
index 00000000000..9b1a9cabded
--- /dev/null
+++ b/queue-3.10/mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
@@ -0,0 +1,113 @@
+From ecf5fc6e9654cd7a268c782a523f072b2f1959f9 Mon Sep 17 00:00:00 2001
+From: Michal Hocko
+Date: Tue, 4 Aug 2015 14:36:58 -0700
+Subject: mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
+
+From: Michal Hocko
+
+commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.
+
+Nikolay has reported a hang when a memcg reclaim got stuck with the
+following backtrace:
+
+PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
+ #0 __schedule at ffffffff815ab152
+ #1 schedule at ffffffff815ab76e
+ #2 schedule_timeout at ffffffff815ae5e5
+ #3 io_schedule_timeout at ffffffff815aad6a
+ #4 bit_wait_io at ffffffff815abfc6
+ #5 __wait_on_bit at ffffffff815abda5
+ #6 wait_on_page_bit at ffffffff8111fd4f
+ #7 shrink_page_list at ffffffff81135445
+ #8 shrink_inactive_list at ffffffff81135845
+ #9 shrink_lruvec at ffffffff81135ead
+ #10 shrink_zone at ffffffff811360c3
+ #11 shrink_zones at ffffffff81136eff
+ #12 do_try_to_free_pages at ffffffff8113712f
+ #13 try_to_free_mem_cgroup_pages at ffffffff811372be
+ #14 try_charge at ffffffff81189423
+ #15 mem_cgroup_try_charge at ffffffff8118c6f5
+ #16 __add_to_page_cache_locked at ffffffff8112137d
+ #17 add_to_page_cache_lru at ffffffff81121618
+ #18 pagecache_get_page at ffffffff8112170b
+ #19 grow_dev_page at ffffffff811c8297
+ #20 __getblk_slow at ffffffff811c91d6
+ #21 __getblk_gfp at ffffffff811c92c1
+ #22 ext4_ext_grow_indepth at ffffffff8124565c
+ #23 ext4_ext_create_new_leaf at ffffffff81246ca8
+ #24 ext4_ext_insert_extent at ffffffff81246f09
+ #25 ext4_ext_map_blocks at ffffffff8124a848
+ #26 ext4_map_blocks at ffffffff8121a5b7
+ #27 mpage_map_one_extent at ffffffff8121b1fa
+ #28 mpage_map_and_submit_extent at ffffffff8121f07b
+ #29 ext4_writepages at ffffffff8121f6d5
+ #30 do_writepages at ffffffff8112c490
+ #31 __filemap_fdatawrite_range at ffffffff81120199
+ #32 filemap_flush at ffffffff8112041c
+ #33 ext4_alloc_da_blocks at ffffffff81219da1
+ #34 ext4_rename at ffffffff81229b91
+ #35 ext4_rename2 at ffffffff81229e32
+ #36 vfs_rename at ffffffff811a08a5
+ #37 SYSC_renameat2 at ffffffff811a3ffc
+ #38 sys_renameat2 at ffffffff811a408e
+ #39 sys_rename at ffffffff8119e51e
+ #40 system_call_fastpath at ffffffff815afa89
+
+Dave Chinner has properly pointed out that this is a deadlock in the
+reclaim code because ext4 doesn't submit pages which are marked by
+PG_writeback right away.
+
+The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
+with too many dirty pages") and it was applied only when may_enter_fs
+was specified. The code has been changed by c3b94f44fcb0 ("memcg:
+further prevent OOM with too many dirty pages") which has removed the
+__GFP_FS restriction with a reasoning that we do not get into the fs
+code. But this is not sufficient apparently because the fs doesn't
+necessarily submit pages marked PG_writeback for IO right away.
+
+ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
+submit the bio. Instead it tries to map more pages into the bio and
+mpage_map_one_extent might trigger memcg charge which might end up
+waiting on a page which is marked PG_writeback but hasn't been submitted
+yet so we would end up waiting for something that never finishes.
+
+Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
+before we go to wait on the writeback. The page fault path, which is
+the only path that triggers memcg oom killer since 3.12, shouldn't
+require GFP_NOFS and so we shouldn't reintroduce the premature OOM
+killer issue which was originally addressed by the heuristic.
+
+As per David Chinner the xfs is doing similar thing since 2.6.15 already
+so ext4 is not the only affected filesystem. Moreover he notes:
+
+: For example: IO completion might require unwritten extent conversion
+: which executes filesystem transactions and GFP_NOFS allocations. The
+: writeback flag on the pages can not be cleared until unwritten
+: extent conversion completes. Hence memory reclaim cannot wait on
+: page writeback to complete in GFP_NOFS context because it is not
+: safe to do so, memcg reclaim or otherwise.
+
+[tytso@mit.edu: corrected the control flow]
+Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
+Reported-by: Nikolay Borisov
+Signed-off-by: Michal Hocko
+Signed-off-by: Hugh Dickins
+Signed-off-by: Linus Torvalds
+Signed-off-by: Greg Kroah-Hartman
+
+
+---
+ mm/vmscan.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -743,7 +743,7 @@ static unsigned long shrink_page_list(st
+ 			 * testing may_enter_fs here is liable to OOM on them.
+ 			 */
+ 			if (global_reclaim(sc) ||
+-			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
++			    !PageReclaim(page) || !may_enter_fs) {
+ 				/*
+ 				 * This is slightly racy - end_page_writeback()
+ 				 * might have just cleared PageReclaim, then
diff --git a/queue-3.10/series b/queue-3.10/series
index 43c71f48691..8f7c7987ed3 100644
--- a/queue-3.10/series
+++ b/queue-3.10/series
@@ -32,3 +32,4 @@ signal-fix-information-leak-in-copy_siginfo_to_user.patch
 signal-fix-information-leak-in-copy_siginfo_from_user32.patch
 kvm-x86-fix-kvm_apic_has_events-to-check-for-null-pointer.patch
 md-bitmap-return-an-error-when-bitmap-superblock-is-corrupt.patch
+mm-vmscan-do-not-wait-for-page-writeback-for-gfp_nofs-allocations.patch
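
The one-line change in mm/vmscan.c can be modelled outside the kernel to see
which allocations stop waiting on writeback. The following is a minimal
userspace sketch, not kernel code. Every sketch_* name and SKETCH_* constant
is hypothetical, and the definition of may_enter_fs is an assumption taken
from how shrink_page_list() is understood to compute it in this kernel
series; the hunk above does not show it. The sketch also assumes that a
GFP_NOFS mask carries the IO bit but not the FS bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's __GFP_IO and __GFP_FS bits.
 * The numeric values are illustrative, not the kernel's. */
#define SKETCH_GFP_IO 0x1u
#define SKETCH_GFP_FS 0x2u

/* Hypothetical model of the page state the check looks at. */
struct sketch_page {
	bool reclaim;    /* stands in for PageReclaim(page) */
	bool swap_cache; /* stands in for PageSwapCache(page) */
};

/* Assumed definition of may_enter_fs; it is not part of the hunk. */
static bool sketch_may_enter_fs(unsigned int gfp_mask,
				const struct sketch_page *page)
{
	return (gfp_mask & SKETCH_GFP_FS) ||
	       (page->swap_cache && (gfp_mask & SKETCH_GFP_IO));
}

/*
 * Returns true when reclaim would block waiting for a page that is
 * already under writeback. "patched" selects the condition from the
 * "+" line of the hunk; otherwise the "-" line is used.
 */
static bool sketch_waits_on_writeback(bool global_reclaim,
				      unsigned int gfp_mask,
				      const struct sketch_page *page,
				      bool patched)
{
	bool allowed = patched ? sketch_may_enter_fs(gfp_mask, page)
			       : (gfp_mask & SKETCH_GFP_IO) != 0;

	if (global_reclaim || !page->reclaim || !allowed)
		return false; /* the page is skipped instead of waited on */
	return true;
}

/* Usage example: a GFP_NOFS-like mask sets the IO bit but not the FS bit. */
static void sketch_demo(void)
{
	const struct sketch_page pg = { .reclaim = true, .swap_cache = false };

	/* Old condition: memcg reclaim waits, which is the reported hang. */
	assert(sketch_waits_on_writeback(false, SKETCH_GFP_IO, &pg, false));
	/* New condition: the page is skipped in the same context. */
	assert(!sketch_waits_on_writeback(false, SKETCH_GFP_IO, &pg, true));
}
```

Under these assumptions, only the GFP_NOFS-like case changes behaviour. A
mask with both bits set still waits under memcg reclaim, and global reclaim
never waits in either version.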