From 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 25 Jul 2011 17:12:32 -0700
Subject: mm/futex: fix futex writes on archs with SW tracking of dirty & young

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream.

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go into a loop of trying the atomic access, failing, trying gup
to "fix it up", getting success from gup, going back to the atomic
access, failing again because dirty wasn't fixed, etc...

So I think you essentially hang in the kernel.
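
For illustration, the loop in question looks roughly like this (a
simplified sketch of the futex_wake_op() / fault_in_user_writeable()
interaction, not the exact kernel code):

	retry:
		pagefault_disable();
		/* atomic write to the futex word; the PTE is clean, so on a
		 * SW-dirty arch the hardware view of it is read-only -> -EFAULT */
		ret = futex_atomic_op_inuser(op, uaddr2);
		pagefault_enable();

		if (ret == -EFAULT) {
			/* gup() sees a present, writable mapping and returns
			 * success without calling handle_mm_fault(), so the
			 * PTE dirty bit is never set ... */
			if (fault_in_user_writeable(uaddr2))
				return -EFAULT;
			goto retry;	/* ... and the next attempt faults again */
		}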

The scenario is probably rare-ish because the affected architectures are
embedded and tend not to swap much (if at all), so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.

On archs that use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only, and a page without young is effectively
inaccessible in the PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.
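
On such an architecture the fault handler typically does something like
this for a spurious write fault (again a hedged, made-up sketch, not any
specific arch's code):

	/* If the fault turns out to be spurious because the PTE is already
	 * valid and writable (write protection was relaxed earlier with only
	 * a local TLB flush), drop the stale local TLB entry and retry. */
	if (pte_present(*ptep) && pte_write(*ptep)) {
		local_flush_tlb_page(vma, address);
		return;		/* retry the faulting access */
	}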

The futex code assumes that if the "in_atomic()" access returns -EFAULT,
it can "fix it up" by calling get_user_pages(), which would then be
equivalent to taking the fault.

However that isn't the case. get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state. It will eventually
update those bits ... in the struct page, but not in the PTE.
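
That behaviour comes from the follow_page() fast path used by gup; in
simplified (not verbatim) form it is essentially:

	/* Simplified sketch of the follow_page() logic, not the verbatim
	 * mm/memory.c code: a present PTE that passes pte_write() satisfies
	 * a write gup without ever going through handle_mm_fault(), so the
	 * SW dirty/young bits in the PTE are left alone. */
	pte = *ptep;
	if (!pte_present(pte))
		goto no_page;			/* leads to handle_mm_fault() */
	if ((flags & FOLL_WRITE) && !pte_write(pte))
		goto no_page;			/* leads to handle_mm_fault() */
	page = vm_normal_page(vma, address, pte);
	if (flags & FOLL_TOUCH)
		mark_page_accessed(page);	/* updates the struct page only */
	return page;				/* PTE dirty/young untouched */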

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job. The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reported-by: Shan Hai <haishan.bai@gmail.com>
Tested-by: Shan Hai <haishan.bai@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Darren Hart <darren.hart@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 include/linux/mm.h |    2 +
 kernel/futex.c     |    4 +--
 mm/memory.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 61 insertions(+), 3 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,6 +985,8 @@ int get_user_pages(struct task_struct *t
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -355,8 +355,8 @@ static int fault_in_user_writeable(u32 _
 	int ret;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages(current, mm, (unsigned long)uaddr,
-			     1, 1, 0, NULL, NULL);
+	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
+			       FAULT_FLAG_WRITE);
 	up_read(&mm->mmap_sem);
 
 	return ret < 0 ? ret : 0;
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1816,7 +1816,63 @@ next_page:
 }
 EXPORT_SYMBOL(__get_user_pages);
 
-/**
+/*
+ * fixup_user_fault() - manually resolve a user page fault
+ * @tsk:	the task_struct to use for page fault accounting, or
+ *		NULL if faults are not to be recorded.
+ * @mm:		mm_struct of target mm
+ * @address:	user address
+ * @fault_flags:	flags to pass down to handle_mm_fault()
+ *
+ * This is meant to be called in the specific scenario where for locking reasons
+ * we try to access user memory in atomic context (within a pagefault_disable()
+ * section), this returns -EFAULT, and we want to resolve the user fault before
+ * trying again.
+ *
+ * Typically this is meant to be used by the futex code.
+ *
+ * The main difference with get_user_pages() is that this function will
+ * unconditionally call handle_mm_fault() which will in turn perform all the
+ * necessary SW fixup of the dirty and young bits in the PTE, while
+ * get_user_pages() only guarantees to update these in the struct page.
+ *
+ * This is important for some architectures where those bits also gate the
+ * access permission to the page because they are maintained in software. On
+ * such architectures, gup() will not be enough to make a subsequent access
+ * succeed.
+ *
+ * This should be called with the mmap_sem held for read.
+ */
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long address, unsigned int fault_flags)
+{
+	struct vm_area_struct *vma;
+	int ret;
+
+	vma = find_extend_vma(mm, address);
+	if (!vma || address < vma->vm_start)
+		return -EFAULT;
+
+	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	if (ret & VM_FAULT_ERROR) {
+		if (ret & VM_FAULT_OOM)
+			return -ENOMEM;
+		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+			return -EHWPOISON;
+		if (ret & VM_FAULT_SIGBUS)
+			return -EFAULT;
+		BUG();
+	}
+	if (tsk) {
+		if (ret & VM_FAULT_MAJOR)
+			tsk->maj_flt++;
+		else
+			tsk->min_flt++;
+	}
+	return 0;
+}
+
+/*
  * get_user_pages() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
  *		NULL if faults are not to be recorded.