From 8b272b3cbbb50a6a8e62d8a15affd473a788e184 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mgorman@techsingularity.net>
Date: Thu, 5 Mar 2020 22:28:26 -0800
Subject: mm, numa: fix bad pmd by atomically checking for pmd_trans_huge when marking page tables prot_numa

From: Mel Gorman <mgorman@techsingularity.net>

commit 8b272b3cbbb50a6a8e62d8a15affd473a788e184 upstream.
: A user reported a bug against a distribution kernel, hit while running a
: proprietary workload described as "memory intensive that is not swapping";
: the bug is expected to apply to mainline kernels as well. The workload
: reads, writes and modifies ranges of memory and checks the contents. They
: reported that within a few hours a bad PMD would be reported, followed
: by memory corruption where expected data was all zeros. A partial
: report of the bad PMD looked like
:
: [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
: [ 5195.341184] ------------[ cut here ]------------
: [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
: [ 5195.410033] Call Trace:
: [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
: [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
: [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
: [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
: [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
: [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
: [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
:
: Decoding revealed that the PMD was a valid prot_numa PMD and that the bad
: PMD report was a false detection. The bug does not trigger if automatic
: NUMA balancing or transparent huge pages is disabled.
:
: The bug is due to a race in change_pmd_range between the pmd_trans_huge
: and pmd_none_or_clear_bad checks, done without any locks held. During the
: pmd_trans_huge check, a parallel protection update under lock can have
: cleared the PMD and filled it with a prot_numa entry between the transhuge
: check and the pmd_none_or_clear_bad check.
:
: While this could be fixed with heavy locking, it's only necessary to make
: a copy of the PMD on the stack during change_pmd_range and avoid races. A
: new helper is created for this as the check is quite subtle and the
: existing similar helper is not suitable. This passed 154 hours of
: testing (the bug usually triggers between 20 minutes and 24 hours) without
: detecting bad PMDs or corruption. A basic test of an autonuma-intensive
: workload showed no significant change in behaviour.

Although Mel withdrew the patch in the face of an LKML comment
(https://lkml.org/lkml/2017/4/10/922), the aforementioned race window is
still open, and we have reports of the Linpack test reporting bad residuals
after the bad PMD warning is observed. In addition, bad rss-counter and
non-zero pgtables assertions are triggered on mm teardown for the task
hitting the bad PMD.

  host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)

  host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
  host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race
window is expected to be applicable to mainline kernels, as well.

[akpm@linux-foundation.org: fix comment typo, per Rafael]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -148,6 +148,31 @@ static unsigned long change_pte_range(st
 	return pages;
 }
 
+/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -164,8 +189,17 @@ static inline unsigned long change_pmd_r
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible for a parallel update to occur
+		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
+		 * check, leading to a false positive and clearing.
+		 * Hence, it's necessary to atomically read the PMD value
+		 * for all the checks.
+		 */
+		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */