From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 8 Jan 2019 15:23:07 -0800
Subject: mm, memcg: fix reclaim deadlock with writeback

From: Michal Hocko <mhocko@suse.com>

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
ext4 writeback:

task1:
wait_on_page_bit+0x82/0xa0
shrink_page_list+0x907/0x960
shrink_inactive_list+0x2c7/0x680
shrink_node_memcg+0x404/0x830
shrink_node+0xd8/0x300
do_try_to_free_pages+0x10d/0x330
try_to_free_mem_cgroup_pages+0xd5/0x1b0
try_charge+0x14d/0x720
memcg_kmem_charge_memcg+0x3c/0xa0
memcg_kmem_charge+0x7e/0xd0
__alloc_pages_nodemask+0x178/0x260
alloc_pages_current+0x95/0x140
pte_alloc_one+0x17/0x40
__pte_alloc+0x1e/0x110
alloc_set_pte+0x5fe/0xc20
do_fault+0x103/0x970
handle_mm_fault+0x61e/0xd10
__do_page_fault+0x252/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30

task2:
__lock_page+0x86/0xa0
mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
ext4_writepages+0x479/0xd60
do_writepages+0x1e/0x30
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x272/0x600
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x268/0x300
wb_workfn+0xb4/0x390
process_one_work+0x189/0x420
worker_thread+0x4e/0x4b0
kthread+0xe6/0x100
ret_from_fork+0x41/0x50

He adds
"task1 is waiting for the PageWriteback bit of the page that task2 has
collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
LOCKED bit the page which tasks1 has locked"

More precisely, task1 is handling a page fault and has a page locked
while it charges a new page table to a memcg. That in turn hits the
memory limit and triggers reclaim; the memcg reclaim for the legacy
controller waits on writeback, but that is never going to finish
because the writeback itself is waiting for the page locked in the #PF
path. So this is essentially an ABBA deadlock:

                                lock_page(A)
                                SetPageWriteback(A)
                                unlock_page(A)
lock_page(B)
                                lock_page(B)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(A)
                                SetPageWriteback(B)
                                unlock_page(B)

                                # flush A, B to clear the writeback

This accumulation of more pages to flush is used by several filesystems
to generate more optimal IO patterns.

Waiting for the writeback in the legacy memcg controller is a
workaround for premature OOM killer invocations because there is no
dirty IO throttling available for the controller. There is no easy way
around that, unfortunately. Therefore fix this specific issue by
pre-allocating the page table outside of the page lock. We already have
handy infrastructure for that, so simply reuse the fault-around pattern,
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page, but they should be really rare. I am not
aware of a better solution, unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/memory.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,29 @@ static vm_fault_t __do_fault(struct vm_f
         struct vm_area_struct *vma = vmf->vma;
         vm_fault_t ret;
 
+        /*
+         * Preallocate pte before we take page_lock because this might lead to
+         * deadlocks for memcg reclaim which waits for pages under writeback:
+         *                              lock_page(A)
+         *                              SetPageWriteback(A)
+         *                              unlock_page(A)
+         * lock_page(B)
+         *                              lock_page(B)
+         * pte_alloc_pne
+         *   shrink_page_list
+         *     wait_on_page_writeback(A)
+         *                              SetPageWriteback(B)
+         *                              unlock_page(B)
+         *                              # flush A, B to clear the writeback
+         */
+        if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
+                                                  vmf->address);
+                if (!vmf->prealloc_pte)
+                        return VM_FAULT_OOM;
+                smp_wmb(); /* See comment in __pte_alloc() */
+        }
+
         ret = vma->vm_ops->fault(vmf);
         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
                             VM_FAULT_DONE_COW)))
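
The ordering the patch enforces follows a general rule: perform any allocation
that can recurse into reclaim before taking the page lock, so reclaim can never
end up waiting on writeback that in turn needs the lock already held. What
follows is a minimal userspace C sketch of that ordering only; the names
(page_lock, pte_table, fault_path_*) are hypothetical stand-ins for the kernel
objects, and malloc merely marks where the blocking allocation sits relative to
the lock. It is an illustration, not kernel code.

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins: page_lock plays the role of the page lock the
 * flusher also needs, pte_table the page table that must be allocated. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
struct pte_table { long slots[512]; };

/* Deadlock-prone shape (what the fault path effectively did before the patch):
 * the allocation happens with page_lock held, so if it has to reclaim and wait
 * on writeback, and the writeback path needs page_lock, both sides hang. */
static int fault_path_alloc_under_lock(void)
{
        pthread_mutex_lock(&page_lock);
        struct pte_table *pt = malloc(sizeof(*pt)); /* blocking work under the lock */
        if (!pt) {
                pthread_mutex_unlock(&page_lock);
                return -1;
        }
        /* ... install the table ... */
        pthread_mutex_unlock(&page_lock);
        free(pt);
        return 0;
}

/* Shape of the fix: preallocate first, then take the lock; nothing that can
 * block on the flusher runs while the lock is held. */
static int fault_path_prealloc_then_lock(void)
{
        struct pte_table *pt = malloc(sizeof(*pt)); /* no lock held here */
        if (!pt)
                return -1;
        pthread_mutex_lock(&page_lock);
        /* ... install the table ... */
        pthread_mutex_unlock(&page_lock);
        free(pt);
        return 0;
}

int main(void)
{
        return fault_path_alloc_under_lock() || fault_path_prealloc_then_lock();
}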