Commit | Line | Data |
---|---|---|
87a8661a GKH |
1 | From ecf5fc6e9654cd7a268c782a523f072b2f1959f9 Mon Sep 17 00:00:00 2001 |
2 | From: Michal Hocko <mhocko@suse.cz> | |
3 | Date: Tue, 4 Aug 2015 14:36:58 -0700 | |
4 | Subject: mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations | |
5 | ||
6 | From: Michal Hocko <mhocko@suse.cz> | |
7 | ||
8 | commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream. | |
9 | ||
10 | Nikolay has reported a hang when a memcg reclaim got stuck with the | |
11 | following backtrace: | |
12 | ||
13 | PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync" | |
14 | #0 __schedule at ffffffff815ab152 | |
15 | #1 schedule at ffffffff815ab76e | |
16 | #2 schedule_timeout at ffffffff815ae5e5 | |
17 | #3 io_schedule_timeout at ffffffff815aad6a | |
18 | #4 bit_wait_io at ffffffff815abfc6 | |
19 | #5 __wait_on_bit at ffffffff815abda5 | |
20 | #6 wait_on_page_bit at ffffffff8111fd4f | |
21 | #7 shrink_page_list at ffffffff81135445 | |
22 | #8 shrink_inactive_list at ffffffff81135845 | |
23 | #9 shrink_lruvec at ffffffff81135ead | |
24 | #10 shrink_zone at ffffffff811360c3 | |
25 | #11 shrink_zones at ffffffff81136eff | |
26 | #12 do_try_to_free_pages at ffffffff8113712f | |
27 | #13 try_to_free_mem_cgroup_pages at ffffffff811372be | |
28 | #14 try_charge at ffffffff81189423 | |
29 | #15 mem_cgroup_try_charge at ffffffff8118c6f5 | |
30 | #16 __add_to_page_cache_locked at ffffffff8112137d | |
31 | #17 add_to_page_cache_lru at ffffffff81121618 | |
32 | #18 pagecache_get_page at ffffffff8112170b | |
33 | #19 grow_dev_page at ffffffff811c8297 | |
34 | #20 __getblk_slow at ffffffff811c91d6 | |
35 | #21 __getblk_gfp at ffffffff811c92c1 | |
36 | #22 ext4_ext_grow_indepth at ffffffff8124565c | |
37 | #23 ext4_ext_create_new_leaf at ffffffff81246ca8 | |
38 | #24 ext4_ext_insert_extent at ffffffff81246f09 | |
39 | #25 ext4_ext_map_blocks at ffffffff8124a848 | |
40 | #26 ext4_map_blocks at ffffffff8121a5b7 | |
41 | #27 mpage_map_one_extent at ffffffff8121b1fa | |
42 | #28 mpage_map_and_submit_extent at ffffffff8121f07b | |
43 | #29 ext4_writepages at ffffffff8121f6d5 | |
44 | #30 do_writepages at ffffffff8112c490 | |
45 | #31 __filemap_fdatawrite_range at ffffffff81120199 | |
46 | #32 filemap_flush at ffffffff8112041c | |
47 | #33 ext4_alloc_da_blocks at ffffffff81219da1 | |
48 | #34 ext4_rename at ffffffff81229b91 | |
49 | #35 ext4_rename2 at ffffffff81229e32 | |
50 | #36 vfs_rename at ffffffff811a08a5 | |
51 | #37 SYSC_renameat2 at ffffffff811a3ffc | |
52 | #38 sys_renameat2 at ffffffff811a408e | |
53 | #39 sys_rename at ffffffff8119e51e | |
54 | #40 system_call_fastpath at ffffffff815afa89 | |
55 | ||
56 | Dave Chinner has properly pointed out that this is a deadlock in the | |
57 | reclaim code because ext4 doesn't submit pages which are marked by | |
58 | PG_writeback right away. | |
59 | ||
60 | The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM | |
61 | with too many dirty pages") and it was applied only when may_enter_fs | |
62 | was specified. The code has been changed by c3b94f44fcb0 ("memcg: | |
63 | further prevent OOM with too many dirty pages") which has removed the | |
64 | __GFP_FS restriction with a reasoning that we do not get into the fs | |
65 | code. But this is not sufficient apparently because the fs doesn't | |
66 | necessarily submit pages marked PG_writeback for IO right away. | |
67 | ||
68 | ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily | |
69 | submit the bio. Instead it tries to map more pages into the bio and | |
70 | mpage_map_one_extent might trigger memcg charge which might end up | |
71 | waiting on a page which is marked PG_writeback but hasn't been submitted | |
72 | yet so we would end up waiting for something that never finishes. | |
73 | ||
74 | Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2) | |
75 | before we go to wait on the writeback. The page fault path, which is | |
76 | the only path that triggers memcg oom killer since 3.12, shouldn't | |
77 | require GFP_NOFS and so we shouldn't reintroduce the premature OOM | |
78 | killer issue which was originally addressed by the heuristic. | |
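
As an illustration only (not part of the patch), the change to the case 2 condition can be modelled in plain C. The struct and function names below are hypothetical stand-ins for the kernel expressions named in the comments. The sketch assumes that a GFP_NOFS allocation carries __GFP_IO but leaves may_enter_fs false for an ordinary page cache page.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the inputs to the check in
 * shrink_page_list(). None of these names exist in the kernel.
 */
struct reclaim_ctx {
	bool global_reclaim;	/* stands in for global_reclaim(sc) */
	bool page_reclaim;	/* stands in for PageReclaim(page) */
	bool gfp_io;		/* stands in for sc->gfp_mask & __GFP_IO */
	bool may_enter_fs;	/* stands in for may_enter_fs */
};

/* Condition before the patch: true means "do not wait on writeback". */
bool skips_wait_old(const struct reclaim_ctx *c)
{
	return c->global_reclaim || !c->page_reclaim || !c->gfp_io;
}

/* Condition after the patch: true means "do not wait on writeback". */
bool skips_wait_new(const struct reclaim_ctx *c)
{
	return c->global_reclaim || !c->page_reclaim || !c->may_enter_fs;
}
```

For a memcg reclaim entered from a GFP_NOFS allocation on a page that already has PageReclaim set (gfp_io true, may_enter_fs false under the assumption above), skips_wait_old() returns false, so reclaim would wait on the page, while skips_wait_new() returns true, so the wait is skipped.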
79 | ||
80 | As per David Chinner the xfs is doing similar thing since 2.6.15 already | |
81 | so ext4 is not the only affected filesystem. Moreover he notes: | |
82 | ||
83 | : For example: IO completion might require unwritten extent conversion | |
84 | : which executes filesystem transactions and GFP_NOFS allocations. The | |
85 | : writeback flag on the pages can not be cleared until unwritten | |
86 | : extent conversion completes. Hence memory reclaim cannot wait on | |
87 | : page writeback to complete in GFP_NOFS context because it is not | |
88 | : safe to do so, memcg reclaim or otherwise. | |
89 | ||
90 | [tytso@mit.edu: corrected the control flow] | |
91 | Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages") | |
92 | Reported-by: Nikolay Borisov <kernel@kyup.com> | |
93 | Signed-off-by: Michal Hocko <mhocko@suse.cz> | |
94 | Signed-off-by: Hugh Dickins <hughd@google.com> | |
95 | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> | |
96 | Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> | |
97 | ||
98 | ||
99 | --- | |
5dd4eba1 GKH |
100 | mm/vmscan.c | 11 +++-------- |
101 | 1 file changed, 3 insertions(+), 8 deletions(-) | |
87a8661a GKH |
102 | |
103 | --- a/mm/vmscan.c | |
104 | +++ b/mm/vmscan.c | |
5dd4eba1 GKH |
105 | @@ -730,20 +730,15 @@ static unsigned long shrink_page_list(st |
106 | * could easily OOM just because too many pages are in | |
107 | * writeback and there is nothing else to reclaim. | |
108 | * | |
109 | - * Check __GFP_IO, certainly because a loop driver | |
110 | + * Require may_enter_fs to wait on writeback, because | |
111 | + * fs may not have submitted IO yet. And a loop driver | |
112 | * thread might enter reclaim, and deadlock if it waits | |
113 | * on a page for which it is needed to do the write | |
114 | * (loop masks off __GFP_IO|__GFP_FS for this reason); | |
115 | * but more thought would probably show more reasons. | |
116 | - * | |
117 | - * Don't require __GFP_FS, since we're not going into | |
118 | - * the FS, just waiting on its writeback completion. | |
119 | - * Worryingly, ext4 gfs2 and xfs allocate pages with | |
120 | - * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so | |
121 | - * testing may_enter_fs here is liable to OOM on them. | |
87a8661a GKH |
122 | */ |
123 | if (global_reclaim(sc) || | |
124 | - !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) { | |
125 | + !PageReclaim(page) || !may_enter_fs) { | |
126 | /* | |
127 | * This is slightly racy - end_page_writeback() | |
128 | * might have just cleared PageReclaim, then |