From 63f3655f950186752236bb88a22f8252c11ce394 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 8 Jan 2019 15:23:07 -0800
Subject: mm, memcg: fix reclaim deadlock with writeback

From: Michal Hocko <mhocko@suse.com>

commit 63f3655f950186752236bb88a22f8252c11ce394 upstream.

Liu Bo has experienced a deadlock between memcg (legacy) reclaim and
ext4 writeback:

task1:
wait_on_page_bit+0x82/0xa0
shrink_page_list+0x907/0x960
shrink_inactive_list+0x2c7/0x680
shrink_node_memcg+0x404/0x830
shrink_node+0xd8/0x300
do_try_to_free_pages+0x10d/0x330
try_to_free_mem_cgroup_pages+0xd5/0x1b0
try_charge+0x14d/0x720
memcg_kmem_charge_memcg+0x3c/0xa0
memcg_kmem_charge+0x7e/0xd0
__alloc_pages_nodemask+0x178/0x260
alloc_pages_current+0x95/0x140
pte_alloc_one+0x17/0x40
__pte_alloc+0x1e/0x110
alloc_set_pte+0x5fe/0xc20
do_fault+0x103/0x970
handle_mm_fault+0x61e/0xd10
__do_page_fault+0x252/0x4d0
do_page_fault+0x30/0x80
page_fault+0x28/0x30

task2:
__lock_page+0x86/0xa0
mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
ext4_writepages+0x479/0xd60
do_writepages+0x1e/0x30
__writeback_single_inode+0x45/0x320
writeback_sb_inodes+0x272/0x600
__writeback_inodes_wb+0x92/0xc0
wb_writeback+0x268/0x300
wb_workfn+0xb4/0x390
process_one_work+0x189/0x420
worker_thread+0x4e/0x4b0
kthread+0xe6/0x100
ret_from_fork+0x41/0x50

He adds
"task1 is waiting for the PageWriteback bit of the page that task2 has
collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
LOCKED bit the page which tasks1 has locked"

More precisely, task1 is handling a page fault and has a page locked
while it charges a new page table to a memcg. That in turn hits the
memory limit and triggers reclaim; the memcg reclaim for the legacy
controller waits on writeback, but that is never going to finish
because the writeback itself is waiting for the page locked in the #PF
path. So this is essentially an ABBA deadlock:

                                lock_page(A)
                                SetPageWriteback(A)
                                unlock_page(A)
lock_page(B)
                                lock_page(B)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(A)
                                SetPageWriteback(B)
                                unlock_page(B)

                                # flush A, B to clear the writeback

This accumulation of more pages to flush is used by several filesystems
to generate more optimal IO patterns.

Waiting for the writeback in the legacy memcg controller is a
workaround for premature OOM killer invocations because there is no
dirty IO throttling available for the controller. There is no easy way
around that, unfortunately. Therefore fix this specific issue by
pre-allocating the page table outside of the page lock. We already have
handy infrastructure for that, so simply reuse the fault-around pattern,
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a locked fs page, but they should be really rare. I am not
aware of a better solution, unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/memory.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2993,6 +2993,29 @@ static vm_fault_t __do_fault(struct vm_f
         struct vm_area_struct *vma = vmf->vma;
         vm_fault_t ret;
 
+        /*
+         * Preallocate pte before we take page_lock because this might lead to
+         * deadlocks for memcg reclaim which waits for pages under writeback:
+         *                              lock_page(A)
+         *                              SetPageWriteback(A)
+         *                              unlock_page(A)
+         * lock_page(B)
+         *                              lock_page(B)
+         * pte_alloc_pne
+         *   shrink_page_list
+         *     wait_on_page_writeback(A)
+         *                              SetPageWriteback(B)
+         *                              unlock_page(B)
+         *                              # flush A, B to clear the writeback
+         */
+        if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
+                vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
+                                                  vmf->address);
+                if (!vmf->prealloc_pte)
+                        return VM_FAULT_OOM;
+                smp_wmb(); /* See comment in __pte_alloc() */
+        }
+
         ret = vma->vm_ops->fault(vmf);
         if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
                             VM_FAULT_DONE_COW)))
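
The ordering the patch enforces follows a general rule: perform any allocation
that can recurse into reclaim before taking the page lock, so reclaim can never
end up waiting on writeback that in turn needs the lock already held. What
follows is a minimal userspace C sketch of that ordering only; the names
(page_lock, pte_table, fault_path_*) are hypothetical stand-ins for the kernel
objects, and malloc merely marks where the blocking allocation sits relative to
the lock. It is an illustration, not kernel code.

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins: page_lock plays the role of the page lock the
 * flusher also needs, pte_table the page table that must be allocated. */
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
struct pte_table { long slots[512]; };

/* Deadlock-prone shape (what the fault path effectively did before the patch):
 * the allocation happens with page_lock held, so if it has to reclaim and wait
 * on writeback, and the writeback path needs page_lock, both sides hang. */
static int fault_path_alloc_under_lock(void)
{
        pthread_mutex_lock(&page_lock);
        struct pte_table *pt = malloc(sizeof(*pt)); /* blocking work under the lock */
        if (!pt) {
                pthread_mutex_unlock(&page_lock);
                return -1;
        }
        /* ... install the table ... */
        pthread_mutex_unlock(&page_lock);
        free(pt);
        return 0;
}

/* Shape of the fix: preallocate first, then take the lock; nothing that can
 * block on the flusher runs while the lock is held. */
static int fault_path_prealloc_then_lock(void)
{
        struct pte_table *pt = malloc(sizeof(*pt)); /* no lock held here */
        if (!pt)
                return -1;
        pthread_mutex_lock(&page_lock);
        /* ... install the table ... */
        pthread_mutex_unlock(&page_lock);
        free(pt);
        return 0;
}

int main(void)
{
        return fault_path_alloc_under_lock() || fault_path_prealloc_then_lock();
}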