From: Greg Kroah-Hartman
Date: Mon, 5 Aug 2019 12:25:11 +0000 (+0200)
Subject: 4.9-stable patches
X-Git-Tag: v4.4.188~3
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=3234241e0242c7596bba04c1987805ee84e6957e;p=thirdparty%2Fkernel%2Fstable-queue.git

4.9-stable patches

added patches:
	x86-mm-gup-prevent-get_page-race-with-munmap-in-paravirt-guest.patch
---

diff --git a/queue-4.9/series b/queue-4.9/series
index 0541ac83a02..9151e965ebe 100644
--- a/queue-4.9/series
+++ b/queue-4.9/series
@@ -39,3 +39,4 @@ eeprom-at24-make-spd-world-readable-again.patch
 backport-minimal-compiler_attributes.h-to-support-gcc-9.patch
 include-linux-module.h-copy-__init-__exit-attrs-to-init-cleanup_module.patch
 objtool-support-gcc-9-cold-subfunction-naming-scheme.patch
+x86-mm-gup-prevent-get_page-race-with-munmap-in-paravirt-guest.patch
diff --git a/queue-4.9/x86-mm-gup-prevent-get_page-race-with-munmap-in-paravirt-guest.patch b/queue-4.9/x86-mm-gup-prevent-get_page-race-with-munmap-in-paravirt-guest.patch
new file mode 100644
index 00000000000..6c1c0cddf55
--- /dev/null
+++ b/queue-4.9/x86-mm-gup-prevent-get_page-race-with-munmap-in-paravirt-guest.patch
@@ -0,0 +1,115 @@
+From vbabka@suse.cz Mon Aug 5 13:56:29 2019
+From: Vlastimil Babka
+Date: Fri, 2 Aug 2019 18:06:14 +0200
+Subject: x86, mm, gup: prevent get_page() race with munmap in paravirt guest
+To: stable@vger.kernel.org
+Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org, Jann Horn, Ben Hutchings, xen-devel@lists.xenproject.org, Oscar Salvador, Vlastimil Babka, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Juergen Gross, "Kirill A. Shutemov", Vitaly Kuznetsov, Linus Torvalds, Borislav Petkov, Dave Hansen, Andy Lutomirski
+Message-ID: <20190802160614.8089-1-vbabka@suse.cz>
+
+From: Vlastimil Babka
+
+The x86 version of get_user_pages_fast() relies on disabled interrupts to
+synchronize gup_pte_range() between gup_get_pte(ptep); and get_page() against
+a parallel munmap. The munmap side nulls the pte, then flushes TLBs, then
+releases the page. As the TLB flush is done synchronously via IPI, disabling
+interrupts blocks the page release, and get_page(), which assumes an existing
+reference on the page, is thus safe.
+However, when the TLB flush is done by a hypercall, e.g. in a Xen PV guest,
+disabled interrupts provide no such blocking, and get_page() can succeed on a
+page that was already freed or even reused.
+
+We have recently seen this happen with our 4.4 and 4.12 based kernels, with
+userspace (java) that exits a thread, where mm_release() performs a futex_wake()
+on tsk->clear_child_tid, and another thread in parallel unmaps the page where
+tsk->clear_child_tid points to. The spurious get_page() succeeds, but futex code
+immediately releases the page again, while it's already on a freelist. Symptoms
+include a bad page state warning, general protection faults accessing a poisoned
+list prev/next pointer in the freelist, or free page pcplists of two cpus joined
+together in a single list. Oscar has also reproduced this scenario, with a
+patch inserting delays before the get_page() to make the race window larger.
+
+Fix this by removing the dependency on TLB flush interrupts the same way as the
+generic get_user_pages_fast() code does: use page_cache_add_speculative() and
+revalidate the PTE contents after pinning the page. Mainline is safe since
+4.13, where the x86 gup code was removed in favor of the common code. Accessing
+the page table itself safely also relies on disabled interrupts and TLB flush
+IPIs that don't happen with hypercalls, which was acknowledged in commit
+9e52fc2b50de ("x86/mm: Enable RCU based page table freeing
+(CONFIG_HAVE_RCU_TABLE_FREE=y)"). That commit, together with its follow-ups,
+should also be backported for full safety, although our reproducer didn't hit
+a problem without that backport.
+
+Reproduced-by: Oscar Salvador
+Signed-off-by: Vlastimil Babka
+Cc: Thomas Gleixner
+Cc: Ingo Molnar
+Cc: Peter Zijlstra
+Cc: Juergen Gross
+Cc: Kirill A. Shutemov
+Cc: Vitaly Kuznetsov
+Cc: Linus Torvalds
+Cc: Borislav Petkov
+Cc: Dave Hansen
+Cc: Andy Lutomirski
+Signed-off-by: Greg Kroah-Hartman
+---
+
+---
+ arch/x86/mm/gup.c |   32 ++++++++++++++++++++++++++++++--
+ 1 file changed, 30 insertions(+), 2 deletions(-)
+
+--- a/arch/x86/mm/gup.c
++++ b/arch/x86/mm/gup.c
+@@ -98,6 +98,20 @@ static inline int pte_allows_gup(unsigne
+ }
+ 
+ /*
++ * Return the compound head page with ref appropriately incremented,
++ * or NULL if that failed.
++ */
++static inline struct page *try_get_compound_head(struct page *page, int refs)
++{
++	struct page *head = compound_head(page);
++	if (WARN_ON_ONCE(page_ref_count(head) < 0))
++		return NULL;
++	if (unlikely(!page_cache_add_speculative(head, refs)))
++		return NULL;
++	return head;
++}
++
++/*
+  * The performance critical leaf functions are made noinline otherwise gcc
+  * inlines everything into a single function which results in too much
+  * register pressure.
+@@ -112,7 +126,7 @@ static noinline int gup_pte_range(pmd_t
+ 	ptep = pte_offset_map(&pmd, addr);
+ 	do {
+ 		pte_t pte = gup_get_pte(ptep);
+-		struct page *page;
++		struct page *head, *page;
+ 
+ 		/* Similar to the PMD case, NUMA hinting must take slow path */
+ 		if (pte_protnone(pte)) {
+@@ -138,7 +152,21 @@ static noinline int gup_pte_range(pmd_t
+ 		}
+ 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+ 		page = pte_page(pte);
+-		get_page(page);
++
++		head = try_get_compound_head(page, 1);
++		if (!head) {
++			put_dev_pagemap(pgmap);
++			pte_unmap(ptep);
++			return 0;
++		}
++
++		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
++			put_page(head);
++			put_dev_pagemap(pgmap);
++			pte_unmap(ptep);
++			return 0;
++		}
++
+ 		put_dev_pagemap(pgmap);
+ 		SetPageReferenced(page);
+ 		pages[*nr] = page;
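
As a reading aid, the pinning pattern that the hunks above introduce into
gup_pte_range() boils down to the sketch below. It is not part of the patch:
the helper name gup_pin_pte_page() is made up for illustration, and the real
hunks additionally drop the dev_pagemap reference and unmap the PTE on the
failure paths; only try_get_compound_head(), pte_page(), pte_val() and
put_page() are taken from the code above.

/*
 * Simplified sketch of the pattern applied above: take a speculative
 * reference on the compound head instead of a plain get_page(), then
 * re-read the PTE to catch a parallel munmap whose TLB flush (a
 * hypercall under Xen PV) is not blocked by the caller's disabled
 * interrupts.
 */
static struct page *gup_pin_pte_page(pte_t pte, pte_t *ptep)
{
	struct page *page = pte_page(pte);
	struct page *head = try_get_compound_head(page, 1);

	if (!head)
		return NULL;	/* page is being freed, fall back to the slow path */

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_page(head);	/* PTE changed under us: undo the pin and bail */
		return NULL;
	}

	return head;		/* reference was taken while the PTE was still valid */
}

On success the caller proceeds exactly as the unmodified tail of the hunk
does: SetPageReferenced(page) and storing the page into the pages[] array.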