From: Greg Kroah-Hartman Date: Thu, 17 Apr 2025 12:21:26 +0000 (+0200) Subject: 6.6-stable patches X-Git-Tag: v6.12.24~60 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=cc2661432b949225952f1e6d679cbbe8aa29aa59;p=thirdparty%2Fkernel%2Fstable-queue.git 6.6-stable patches added patches: iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch mm-fix-lazy-mmu-docs-and-usage.patch mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch sparc-mm-disable-preemption-in-lazy-mmu-mode.patch --- diff --git a/queue-6.6/iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch b/queue-6.6/iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch new file mode 100644 index 0000000000..7c06c54a63 --- /dev/null +++ b/queue-6.6/iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch @@ -0,0 +1,37 @@ +From a05df03a88bc1088be8e9d958f208d6484691e43 Mon Sep 17 00:00:00 2001 +From: Nicolin Chen +Date: Thu, 27 Feb 2025 12:07:29 -0800 +Subject: iommufd: Fix uninitialized rc in iommufd_access_rw() + +From: Nicolin Chen + +commit a05df03a88bc1088be8e9d958f208d6484691e43 upstream. + +Reported by smatch: +drivers/iommu/iommufd/device.c:1392 iommufd_access_rw() error: uninitialized symbol 'rc'. + +Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel access") +Link: https://patch.msgid.link/r/20250227200729.85030-1-nicolinc@nvidia.com +Cc: stable@vger.kernel.org +Reported-by: kernel test robot +Reported-by: Dan Carpenter +Closes: https://lore.kernel.org/r/202502271339.a2nWr9UA-lkp@intel.com/ +[nicolinc: can't find an original report but only in "old smatch warnings"] +Signed-off-by: Nicolin Chen +Signed-off-by: Jason Gunthorpe +Signed-off-by: Greg Kroah-Hartman +--- + drivers/iommu/iommufd/device.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/drivers/iommu/iommufd/device.c ++++ b/drivers/iommu/iommufd/device.c +@@ -1076,7 +1076,7 @@ int iommufd_access_rw(struct iommufd_acc + struct io_pagetable *iopt; + struct iopt_area *area; + unsigned long last_iova; +- int rc; ++ int rc = -EINVAL; + + if (!length) + return -EINVAL; diff --git a/queue-6.6/mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch b/queue-6.6/mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch new file mode 100644 index 0000000000..d96a4cbc95 --- /dev/null +++ b/queue-6.6/mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch @@ -0,0 +1,62 @@ +From c0ebbb3841e07c4493e6fe351698806b09a87a37 Mon Sep 17 00:00:00 2001 +From: Mathieu Desnoyers +Date: Wed, 12 Mar 2025 10:10:13 -0400 +Subject: mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock + +From: Mathieu Desnoyers + +commit c0ebbb3841e07c4493e6fe351698806b09a87a37 upstream. + +The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node +reclaim for struct pglist_data using a single bit. + +It is "locked" with a test_and_set_bit (similarly to a try lock) which +provides full ordering with respect to loads and stores done within +__node_reclaim(). 
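+
+For reference, a condensed sketch of the surrounding code in node_reclaim()
+(not verbatim, but the control flow and names follow mm/vmscan.c):
+
+	/* "Lock": try-lock style, full ordering on success. */
+	if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+		return NODE_RECLAIM_NOSCAN;
+
+	/* Reclaim updates node/zone state with plain stores. */
+	ret = __node_reclaim(pgdat, gfp_mask, order);
+
+	/* "Unlock": this is the clear_bit() discussed below. */
+	clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);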
+ +It is "unlocked" with clear_bit(), which does not provide any ordering +with respect to loads and stores done before clearing the bit. + +The lack of clear_bit() memory ordering with respect to stores within +__node_reclaim() can cause a subsequent CPU to fail to observe stores from +a prior node reclaim. This is not an issue in practice on TSO (e.g. +x86), but it is an issue on weakly-ordered architectures (e.g. arm64). + +Fix this by using clear_bit_unlock rather than clear_bit to clear +PGDAT_RECLAIM_LOCKED with a release memory ordering semantic. + +This provides stronger memory ordering (release rather than relaxed). + +Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com +Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim") +Signed-off-by: Mathieu Desnoyers +Cc: Lorenzo Stoakes +Cc: Matthew Wilcox +Cc: Alan Stern +Cc: Andrea Parri +Cc: Will Deacon +Cc: Peter Zijlstra +Cc: Boqun Feng +Cc: Nicholas Piggin +Cc: David Howells +Cc: Jade Alglave +Cc: Luc Maranget +Cc: "Paul E. McKenney" +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/vmscan.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -8115,7 +8115,7 @@ int node_reclaim(struct pglist_data *pgd + return NODE_RECLAIM_NOSCAN; + + ret = __node_reclaim(pgdat, gfp_mask, order); +- clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags); ++ clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags); + + if (!ret) + count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED); diff --git a/queue-6.6/mm-fix-lazy-mmu-docs-and-usage.patch b/queue-6.6/mm-fix-lazy-mmu-docs-and-usage.patch new file mode 100644 index 0000000000..f6a43d9ece --- /dev/null +++ b/queue-6.6/mm-fix-lazy-mmu-docs-and-usage.patch @@ -0,0 +1,148 @@ +From 691ee97e1a9de0cdb3efb893c1f180e3f4a35e32 Mon Sep 17 00:00:00 2001 +From: Ryan Roberts +Date: Mon, 3 Mar 2025 14:15:35 +0000 +Subject: mm: fix lazy mmu docs and usage + +From: Ryan Roberts + +commit 691ee97e1a9de0cdb3efb893c1f180e3f4a35e32 upstream. + +Patch series "Fix lazy mmu mode", v2. + +I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As +part of that, I will extend lazy mmu mode to cover kernel mappings in +vmalloc table walkers. While lazy mmu mode is already used for kernel +mappings in a few places, this will extend it's use significantly. + +Having reviewed the existing lazy mmu implementations in powerpc, sparc +and x86, it looks like there are a bunch of bugs, some of which may be +more likely to trigger once I extend the use of lazy mmu. So this series +attempts to clarify the requirements and fix all the bugs in advance of +that series. See patch #1 commit log for all the details. + + +This patch (of 5): + +The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() is +a bit of a mess (to put it politely). There are a number of issues +related to nesting of lazy mmu regions and confusion over whether the +task, when in a lazy mmu region, is preemptible or not. Fix all the +issues relating to the core-mm. Follow up commits will fix the +arch-specific implementations. 3 arches implement lazy mmu; powerpc, +sparc and x86. + +When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit +6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was +expected that lazy mmu regions would never nest and that the appropriate +page table lock(s) would be held while in the region, thus ensuring the +region is non-preemptible. 
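+
+In other words, the originally expected shape of a region was roughly the
+following (an illustrative sketch, not any one call site; 'entry' stands in
+for whatever PTE value is being installed):
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	do {
+		set_pte_at(mm, addr, pte, entry);	/* may be batched */
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();	/* flush any batched updates */
+	pte_unmap_unlock(pte - 1, ptl);
+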
Additionally lazy mmu regions were only used +during manipulation of user mappings. + +Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy +updates") started invoking the lazy mmu mode in apply_to_pte_range(), +which is used for both user and kernel mappings. For kernel mappings the +region is no longer protected by any lock so there is no longer any +guarantee about non-preemptibility. Additionally, for RT configs, the +holding the PTL only implies no CPU migration, it doesn't prevent +preemption. + +Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added +arch_[enter|leave]_lazy_mmu_mode() to the default implementation of +set_ptes(), used by x86. So after this commit, lazy mmu regions can be +nested. Additionally commit 1a10a44dfc1d ("sparc64: implement the new +page table range API") and commit 9fee28baa601 ("powerpc: implement the +new page table range API") did the same for the sparc and powerpc +set_ptes() overrides. + +powerpc couldn't deal with preemption so avoids it in commit b9ef323ea168 +("powerpc/64s: Disable preemption in hash lazy mmu mode"), which +explicitly disables preemption for the whole region in its implementation. +x86 can support preemption (or at least it could until it tried to add +support nesting; more on this below). Sparc looks to be totally broken in +the face of preemption, as far as I can tell. + +powerpc can't deal with nesting, so avoids it in commit 47b8def9358c +("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"), +which removes the lazy mmu calls from its implementation of set_ptes(). +x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow +nesting of same lazy mode") but as far as I can tell, this breaks its +support for preemption. + +In short, it's all a mess; the semantics for +arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result +the implementations all have different expectations, sticking plasters and +bugs. + +arm64 is aiming to start using these hooks, so let's clean everything up +before adding an arm64 implementation. Update the documentation to state +that lazy mmu regions can never be nested, must not be called in interrupt +context and preemption may or may not be enabled for the duration of the +region. And fix the generic implementation of set_ptes() to avoid +nesting. + +arch-specific fixes to conform to the new spec will proceed this one. + +These issues were spotted by code review and I have no evidence of issues +being reported in the wild. + +Link: https://lkml.kernel.org/r/20250303141542.3371656-1-ryan.roberts@arm.com +Link: https://lkml.kernel.org/r/20250303141542.3371656-2-ryan.roberts@arm.com +Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") +Signed-off-by: Ryan Roberts +Acked-by: David Hildenbrand +Acked-by: Juergen Gross +Cc: Andreas Larsson +Cc: Borislav Betkov +Cc: Boris Ostrovsky +Cc: Catalin Marinas +Cc: Dave Hansen +Cc: David S. Miller +Cc: "H. 
Peter Anvin" +Cc: Ingo Molnar +Cc: Juegren Gross +Cc: Matthew Wilcow (Oracle) +Cc: Thomas Gleinxer +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + include/linux/pgtable.h | 14 ++++++++------ + 1 file changed, 8 insertions(+), 6 deletions(-) + +--- a/include/linux/pgtable.h ++++ b/include/linux/pgtable.h +@@ -194,10 +194,14 @@ static inline int pmd_young(pmd_t pmd) + * hazard could result in the direct mode hypervisor case, since the actual + * write to the page tables may not yet have taken place, so reads though + * a raw PTE pointer after it has been modified are not guaranteed to be +- * up to date. This mode can only be entered and left under the protection of +- * the page table locks for all page tables which may be modified. In the UP +- * case, this is required so that preemption is disabled, and in the SMP case, +- * it must synchronize the delayed page table writes properly on other CPUs. ++ * up to date. ++ * ++ * In the general case, no lock is guaranteed to be held between entry and exit ++ * of the lazy mode. So the implementation must assume preemption may be enabled ++ * and cpu migration is possible; it must take steps to be robust against this. ++ * (In practice, for user PTE updates, the appropriate page table lock(s) are ++ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted ++ * and the mode cannot be used in interrupt context. + */ + #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE + #define arch_enter_lazy_mmu_mode() do {} while (0) +@@ -233,7 +237,6 @@ static inline void set_ptes(struct mm_st + { + page_table_check_ptes_set(mm, ptep, pte, nr); + +- arch_enter_lazy_mmu_mode(); + for (;;) { + set_pte(ptep, pte); + if (--nr == 0) +@@ -241,7 +244,6 @@ static inline void set_ptes(struct mm_st + ptep++; + pte = pte_next_pfn(pte); + } +- arch_leave_lazy_mmu_mode(); + } + #endif + #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1) diff --git a/queue-6.6/mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch b/queue-6.6/mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch new file mode 100644 index 0000000000..f0ab3ca91c --- /dev/null +++ b/queue-6.6/mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch @@ -0,0 +1,41 @@ +From 1ca77ff1837249701053a7fcbdedabc41f4ae67c Mon Sep 17 00:00:00 2001 +From: Marc Herbert +Date: Wed, 19 Mar 2025 06:00:30 +0000 +Subject: mm/hugetlb: move hugetlb_sysctl_init() to the __init section + +From: Marc Herbert + +commit 1ca77ff1837249701053a7fcbdedabc41f4ae67c upstream. + +hugetlb_sysctl_init() is only invoked once by an __init function and is +merely a wrapper around another __init function so there is not reason to +keep it. 
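+
+The result is the usual "__init helper called only from __init code"
+pattern, sketched below (the caller shown is illustrative; the point is
+simply that every caller is itself __init, so the wrapper can live in
+.init.text as well):
+
+	static void __init hugetlb_sysctl_init(void)
+	{
+		register_sysctl_init("vm", hugetlb_table);
+	}
+
+	static int __init hugetlb_init(void)
+	{
+		/* ... */
+		hugetlb_sysctl_init();
+		return 0;
+	}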
+ +Fixes the following warning when toning down some GCC inline options: + + WARNING: modpost: vmlinux: section mismatch in reference: + hugetlb_sysctl_init+0x1b (section: .text) -> + __register_sysctl_init (section: .init.text) + +Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com +Signed-off-by: Marc Herbert +Reviewed-by: Anshuman Khandual +Reviewed-by: Muchun Song +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/hugetlb.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/mm/hugetlb.c ++++ b/mm/hugetlb.c +@@ -4695,7 +4695,7 @@ static struct ctl_table hugetlb_table[] + { } + }; + +-static void hugetlb_sysctl_init(void) ++static void __init hugetlb_sysctl_init(void) + { + register_sysctl_init("vm", hugetlb_table); + } diff --git a/queue-6.6/mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch b/queue-6.6/mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch new file mode 100644 index 0000000000..6b40cf135c --- /dev/null +++ b/queue-6.6/mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch @@ -0,0 +1,137 @@ +From aaf99ac2ceb7c974f758a635723eeaf48596388e Mon Sep 17 00:00:00 2001 +From: Shuai Xue +Date: Wed, 12 Mar 2025 19:28:51 +0800 +Subject: mm/hwpoison: do not send SIGBUS to processes with recovered clean pages + +From: Shuai Xue + +commit aaf99ac2ceb7c974f758a635723eeaf48596388e upstream. + +When an uncorrected memory error is consumed there is a race between the +CMCI from the memory controller reporting an uncorrected error with a UCNA +signature, and the core reporting and SRAR signature machine check when +the data is about to be consumed. + +- Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] + +Prior to Icelake memory controllers reported patrol scrub events that +detected a previously unseen uncorrected error in memory by signaling a +broadcast machine check with an SRAO (Software Recoverable Action +Optional) signature in the machine check bank. This was overkill because +it's not an urgent problem that no core is on the verge of consuming that +bad data. It's also found that multi SRAO UCE may cause nested MCE +interrupts and finally become an IERR. + +Hence, Intel downgrades the machine check bank signature of patrol scrub +from SRAO to UCNA (Uncorrected, No Action required), and signal changed to +#CMCI. Just to add to the confusion, Linux does take an action (in +uc_decode_notifier()) to try to offline the page despite the UC*NA* +signature name. + +- Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] + +Having decided that CMCI/UCNA is the best action for patrol scrub errors, +the memory controller uses it for reads too. But the memory controller is +executing asynchronously from the core, and can't tell the difference +between a "real" read and a speculative read. So it will do CMCI/UCNA if +an error is found in any read. + +Thus: + +1) Core is clever and thinks address A is needed soon, issues a speculative read. +2) Core finds it is going to use address A soon after sending the read request +3) The CMCI from the memory controller is in a race with MCE from the core + that will soon try to retire the load from address A. + +Quite often (because speculation has got better) the CMCI from the memory +controller is delivered before the core is committed to the instruction +reading address A, so the interrupt is taken, and Linux offlines the page +(marking it as poison). 
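+
+Jumping ahead for orientation, the end result in kill_accessing_process()
+is the following (condensed from the hunk at the bottom of this patch):
+
+	/*
+	 * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
+	 * succeeds or fails, then kill the process with SIGBUS.
+	 * ret = 0 when poison page is a clean page and it's dropped, no
+	 * SIGBUS is needed.
+	 */
+	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
+			      (void *)&priv);
+	if (ret == 1 && priv.tk.addr)
+		kill_proc(&priv.tk, pfn, flags);
+	mmap_read_unlock(p->mm);
+
+	return ret > 0 ? -EHWPOISON : 0;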
+ +- Why user process is killed for instr case + +Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported +"not recovered"") tries to fix noise message "Memory error not recovered" +and skips duplicate SIGBUSs due to the race. But it also introduced a bug +that kill_accessing_process() return -EHWPOISON for instr case, as result, +kill_me_maybe() send a SIGBUS to user process. + +If the CMCI wins that race, the page is marked poisoned when +uc_decode_notifier() calls memory_failure(). For dirty pages, +memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, +converting the PTE to a hwpoison entry. As a result, +kill_accessing_process(): + +- call walk_page_range() and return 1 regardless of whether + try_to_unmap() succeeds or fails, +- call kill_proc() to make sure a SIGBUS is sent +- return -EHWPOISON to indicate that SIGBUS is already sent to the + process and kill_me_maybe() doesn't have to send it again. + +However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the +PTE unchanged and not converted to a hwpoison entry. Conversely, for +clean pages where PTE entries are not marked as hwpoison, +kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send +a SIGBUS. + +Console log looks like this: + + Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects + Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered + Memory failure: 0x827ca68: already hardware poisoned + mce: Memory error not recovered + +To fix it, return 0 for "corrupted page was clean", preventing an +unnecessary SIGBUS to user process. + +[1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.com/T/#mba94f1305b3009dd340ce4114d3221fe810d1871 +Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com +Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") +Signed-off-by: Shuai Xue +Tested-by: Tony Luck +Acked-by: Miaohe Lin +Cc: Baolin Wang +Cc: Borislav Betkov +Cc: Catalin Marinas +Cc: Dave Hansen +Cc: "H. Peter Anvin" +Cc: Ingo Molnar +Cc: Jane Chu +Cc: Jarkko Sakkinen +Cc: Jonathan Cameron +Cc: Josh Poimboeuf +Cc: Naoya Horiguchi +Cc: Peter Zijlstra +Cc: Ruidong Tian +Cc: Thomas Gleinxer +Cc: Yazen Ghannam +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/memory-failure.c | 11 ++++++++--- + 1 file changed, 8 insertions(+), 3 deletions(-) + +--- a/mm/memory-failure.c ++++ b/mm/memory-failure.c +@@ -869,12 +869,17 @@ static int kill_accessing_process(struct + mmap_read_lock(p->mm); + ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops, + (void *)&priv); ++ /* ++ * ret = 1 when CMCI wins, regardless of whether try_to_unmap() ++ * succeeds or fails, then kill the process with SIGBUS. ++ * ret = 0 when poison page is a clean page and it's dropped, no ++ * SIGBUS is needed. ++ */ + if (ret == 1 && priv.tk.addr) + kill_proc(&priv.tk, pfn, flags); +- else +- ret = 0; + mmap_read_unlock(p->mm); +- return ret > 0 ? -EHWPOISON : -EFAULT; ++ ++ return ret > 0 ? 
-EHWPOISON : 0; + } + + static const char *action_name[] = { diff --git a/queue-6.6/mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch b/queue-6.6/mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch new file mode 100644 index 0000000000..8952870b2c --- /dev/null +++ b/queue-6.6/mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch @@ -0,0 +1,103 @@ +From 442b1eca223b4860cc85ef970ae602d125aec5a4 Mon Sep 17 00:00:00 2001 +From: Jane Chu +Date: Mon, 24 Feb 2025 14:14:45 -0700 +Subject: mm: make page_mapped_in_vma() hugetlb walk aware + +From: Jane Chu + +commit 442b1eca223b4860cc85ef970ae602d125aec5a4 upstream. + +When a process consumes a UE in a page, the memory failure handler +attempts to collect information for a potential SIGBUS. If the page is an +anonymous page, page_mapped_in_vma(page, vma) is invoked in order to + + 1. retrieve the vaddr from the process' address space, + + 2. verify that the vaddr is indeed mapped to the poisoned page, + where 'page' is the precise small page with UE. + +It's been observed that when injecting poison to a non-head subpage of an +anonymous hugetlb page, no SIGBUS shows up, while injecting to the head +page produces a SIGBUS. The cause is that, though hugetlb_walk() returns +a valid pmd entry (on x86), but check_pte() detects mismatch between the +head page per the pmd and the input subpage. Thus the vaddr is considered +not mapped to the subpage and the process is not collected for SIGBUS +purpose. This is the calling stack: + + collect_procs_anon + page_mapped_in_vma + page_vma_mapped_walk + hugetlb_walk + huge_pte_lock + check_pte + +check_pte() header says that it +"check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte" +but practically works only if pvmw->pfn is the head page pfn at pvmw->pte. +Hindsight acknowledging that some pvmw->pte could point to a hugepage of +some sort such that it makes sense to make check_pte() work for hugepage. + +Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com +Signed-off-by: Jane Chu +Cc: Hugh Dickins +Cc: Kirill A. Shuemov +Cc: linmiaohe +Cc: Matthew Wilcow (Oracle) +Cc: Peter Xu +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/page_vma_mapped.c | 13 +++++++++---- + 1 file changed, 9 insertions(+), 4 deletions(-) + +--- a/mm/page_vma_mapped.c ++++ b/mm/page_vma_mapped.c +@@ -77,6 +77,7 @@ static bool map_pte(struct page_vma_mapp + * mapped at the @pvmw->pte + * @pvmw: page_vma_mapped_walk struct, includes a pair pte and pfn range + * for checking ++ * @pte_nr: the number of small pages described by @pvmw->pte. + * + * page_vma_mapped_walk() found a place where pfn range is *potentially* + * mapped. check_pte() has to validate this. +@@ -93,7 +94,7 @@ static bool map_pte(struct page_vma_mapp + * Otherwise, return false. + * + */ +-static bool check_pte(struct page_vma_mapped_walk *pvmw) ++static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr) + { + unsigned long pfn; + pte_t ptent = ptep_get(pvmw->pte); +@@ -126,7 +127,11 @@ static bool check_pte(struct page_vma_ma + pfn = pte_pfn(ptent); + } + +- return (pfn - pvmw->pfn) < pvmw->nr_pages; ++ if ((pfn + pte_nr - 1) < pvmw->pfn) ++ return false; ++ if (pfn > (pvmw->pfn + pvmw->nr_pages - 1)) ++ return false; ++ return true; + } + + /* Returns true if the two ranges overlap. Careful to not overflow. 
*/ +@@ -201,7 +206,7 @@ bool page_vma_mapped_walk(struct page_vm + return false; + + pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte); +- if (!check_pte(pvmw)) ++ if (!check_pte(pvmw, pages_per_huge_page(hstate))) + return not_found(pvmw); + return true; + } +@@ -283,7 +288,7 @@ restart: + goto next_pte; + } + this_pte: +- if (check_pte(pvmw)) ++ if (check_pte(pvmw, 1)) + return true; + next_pte: + do { diff --git a/queue-6.6/mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch b/queue-6.6/mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch new file mode 100644 index 0000000000..799f98db1e --- /dev/null +++ b/queue-6.6/mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch @@ -0,0 +1,137 @@ +From 937582ee8e8d227c30ec147629a0179131feaa80 Mon Sep 17 00:00:00 2001 +From: Lorenzo Stoakes +Date: Mon, 10 Mar 2025 20:50:34 +0000 +Subject: mm/mremap: correctly handle partial mremap() of VMA starting at 0 + +From: Lorenzo Stoakes + +commit 937582ee8e8d227c30ec147629a0179131feaa80 upstream. + +Patch series "refactor mremap and fix bug", v3. + +The existing mremap() logic has grown organically over a very long period +of time, resulting in code that is in many parts, very difficult to follow +and full of subtleties and sources of confusion. + +In addition, it is difficult to thread state through the operation +correctly, as function arguments have expanded, some parameters are +expected to be temporarily altered during the operation, others are +intended to remain static and some can be overridden. + +This series completely refactors the mremap implementation, sensibly +separating functions, adding comments to explain the more subtle aspects +of the implementation and making use of small structs to thread state +through everything. + +The reason for doing so is to lay the groundwork for planned future +changes to the mremap logic, changes which require the ability to easily +pass around state. + +Additionally, it would be unhelpful to add yet more logic to code that is +already difficult to follow without first refactoring it like this. + +The first patch in this series additionally fixes a bug when a VMA with +start address zero is partially remapped. + +Tested on real hardware under heavy workload and all self tests are +passing. + + +This patch (of 3): + +Consider the case of a partial mremap() (that results in a VMA split) of +an accountable VMA (i.e. which has the VM_ACCOUNT flag set) whose start +address is zero, with the MREMAP_MAYMOVE flag specified and a scenario +where a move does in fact occur: + + addr end + | | + v v + |-------------| + | vma | + |-------------| + 0 + +This move is affected by unmapping the range [addr, end). In order to +prevent an incorrect decrement of accounted memory which has already been +determined, the mremap() code in move_vma() clears VM_ACCOUNT from the VMA +prior to doing so, before reestablishing it in each of the VMAs +post-split: + + addr end + | | + v v + |---| |---| + | A | | B | + |---| |---| + +Commit 6b73cff239e5 ("mm: change munmap splitting order and move_vma()") +changed this logic such as to determine whether there is a need to do so +by establishing account_start and account_end and, in the instance where +such an operation is required, assigning them to vma->vm_start and +vma->vm_end. + +Later the code checks if the operation is required for 'A' referenced +above thusly: + + if (account_start) { + ... 
+ } + +However, if the VMA described above has vma->vm_start == 0, which is now +assigned to account_start, this branch will not be executed. + +As a result, the VMA 'A' above will remain stripped of its VM_ACCOUNT +flag, incorrectly. + +The fix is to simply convert these variables to booleans and set them as +required. + +Link: https://lkml.kernel.org/r/cover.1741639347.git.lorenzo.stoakes@oracle.com +Link: https://lkml.kernel.org/r/dc55cb6db25d97c3d9e460de4986a323fa959676.1741639347.git.lorenzo.stoakes@oracle.com +Fixes: 6b73cff239e5 ("mm: change munmap splitting order and move_vma()") +Signed-off-by: Lorenzo Stoakes +Reviewed-by: Harry Yoo +Reviewed-by: Liam R. Howlett +Reviewed-by: Vlastimil Babka +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/mremap.c | 10 +++++----- + 1 file changed, 5 insertions(+), 5 deletions(-) + +--- a/mm/mremap.c ++++ b/mm/mremap.c +@@ -599,8 +599,8 @@ static unsigned long move_vma(struct vm_ + unsigned long vm_flags = vma->vm_flags; + unsigned long new_pgoff; + unsigned long moved_len; +- unsigned long account_start = 0; +- unsigned long account_end = 0; ++ bool account_start = false; ++ bool account_end = false; + unsigned long hiwater_vm; + int err = 0; + bool need_rmap_locks; +@@ -684,9 +684,9 @@ static unsigned long move_vma(struct vm_ + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) { + vm_flags_clear(vma, VM_ACCOUNT); + if (vma->vm_start < old_addr) +- account_start = vma->vm_start; ++ account_start = true; + if (vma->vm_end > old_addr + old_len) +- account_end = vma->vm_end; ++ account_end = true; + } + + /* +@@ -726,7 +726,7 @@ static unsigned long move_vma(struct vm_ + /* OOM: unable to split vma, just get accounts right */ + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) + vm_acct_memory(old_len >> PAGE_SHIFT); +- account_start = account_end = 0; ++ account_start = account_end = false; + } + + if (vm_flags & VM_LOCKED) { diff --git a/queue-6.6/mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch b/queue-6.6/mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch new file mode 100644 index 0000000000..86ec213416 --- /dev/null +++ b/queue-6.6/mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch @@ -0,0 +1,66 @@ +From bc3fe6805cf09a25a086573a17d40e525208c5d8 Mon Sep 17 00:00:00 2001 +From: David Hildenbrand +Date: Mon, 10 Feb 2025 20:37:44 +0100 +Subject: mm/rmap: reject hugetlb folios in folio_make_device_exclusive() + +From: David Hildenbrand + +commit bc3fe6805cf09a25a086573a17d40e525208c5d8 upstream. + +Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP, +let's add a safety net in case FOLL_SPLIT_PMD usage would ever be +reworked. + +In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the +generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have +returned a page. In particular, hugetlb folios that are not PMD-sized +would never have been prone to FOLL_SPLIT_PMD. + +hugetlb folios can be anonymous, and page_make_device_exclusive_one() is +not really prepared for handling them at all. So let's spell that out. 
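+
+Concretely, the check in folio_make_device_exclusive() becomes the
+following (shown with expanded comments; the hunk below is the minimal
+version):
+
+	/*
+	 * Restrict to anonymous folios for now to avoid potential writeback
+	 * issues.  Also reject hugetlb folios: they can be anonymous, but
+	 * page_make_device_exclusive_one() is not prepared to handle them.
+	 */
+	if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
+		return false;
+
+	rmap_walk(folio, &rwc);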
+ +Link: https://lkml.kernel.org/r/20250210193801.781278-3-david@redhat.com +Fixes: b756a3b5e7ea ("mm: device exclusive memory access") +Signed-off-by: David Hildenbrand +Reviewed-by: Alistair Popple +Tested-by: Alistair Popple +Cc: Alex Shi +Cc: Danilo Krummrich +Cc: Dave Airlie +Cc: Jann Horn +Cc: Jason Gunthorpe +Cc: Jerome Glisse +Cc: John Hubbard +Cc: Jonathan Corbet +Cc: Karol Herbst +Cc: Liam Howlett +Cc: Lorenzo Stoakes +Cc: Lyude +Cc: "Masami Hiramatsu (Google)" +Cc: Oleg Nesterov +Cc: Pasha Tatashin +Cc: Peter Xu +Cc: Peter Zijlstra (Intel) +Cc: SeongJae Park +Cc: Simona Vetter +Cc: Vlastimil Babka +Cc: Yanteng Si +Cc: Barry Song +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + mm/rmap.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/mm/rmap.c ++++ b/mm/rmap.c +@@ -2296,7 +2296,7 @@ static bool folio_make_device_exclusive( + * Restrict to anonymous folios for now to avoid potential writeback + * issues. + */ +- if (!folio_test_anon(folio)) ++ if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) + return false; + + rmap_walk(folio, &rwc); diff --git a/queue-6.6/mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch b/queue-6.6/mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch new file mode 100644 index 0000000000..3657d887eb --- /dev/null +++ b/queue-6.6/mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch @@ -0,0 +1,120 @@ +From fe4cdc2c4e248f48de23bc778870fd71e772a274 Mon Sep 17 00:00:00 2001 +From: Peter Xu +Date: Wed, 12 Mar 2025 10:51:31 -0400 +Subject: mm/userfaultfd: fix release hang over concurrent GUP + +From: Peter Xu + +commit fe4cdc2c4e248f48de23bc778870fd71e772a274 upstream. + +This patch should fix a possible userfaultfd release() hang during +concurrent GUP. + +This problem was initially reported by Dimitris Siakavaras in July 2023 +[1] in a firecracker use case. Firecracker has a separate process +handling page faults remotely, and when the process releases the +userfaultfd it can race with a concurrent GUP from KVM trying to fault in +a guest page during the secondary MMU page fault process. + +A similar problem was reported recently again by Jinjiang Tu in March 2025 +[2], even though the race happened this time with a mlockall() operation, +which does GUP in a similar fashion. + +In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the +uffd without triggering SIGBUS") was trying to fix this issue. AFAIU, +that fixes well the fault paths but may not work yet for GUP. In GUP, the +issue is NOPAGE will be almost treated the same as "page fault resolved" +in faultin_page(), then the GUP will follow page again, seeing page +missing, and it'll keep going into a live lock situation as reported. + +This change makes core mm return RETRY instead of NOPAGE for both the GUP +and fault paths, proactively releasing the mmap read lock. This should +guarantee the other release thread make progress on taking the write lock +and avoid the live lock even for GUP. + +When at it, rearrange the comments to make sure it's uptodate. 
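+
+For orientation, the new check boils down to the following (condensed from
+the hunk below; by this point in handle_userfault() 'ret' has already been
+set to VM_FAULT_RETRY):
+
+	if (unlikely(READ_ONCE(ctx->released))) {
+		/*
+		 * Return VM_FAULT_RETRY with the mmap/per-VMA lock released
+		 * proactively, so the thread concurrently releasing the
+		 * userfaultfd can take the write lock and make progress.
+		 * NOPAGE would make GUP spin; SIGBUS would force the
+		 * non-cooperative manager to always UFFDIO_UNREGISTER first.
+		 */
+		release_fault_lock(vmf);
+		goto out;
+	}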
+ +[1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr +[2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com + +Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com +Signed-off-by: Peter Xu +Cc: Andrea Arcangeli +Cc: Mike Rapoport (IBM) +Cc: Axel Rasmussen +Cc: Jinjiang Tu +Cc: Dimitris Siakavaras +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + fs/userfaultfd.c | 51 +++++++++++++++++++++++++-------------------------- + 1 file changed, 25 insertions(+), 26 deletions(-) + +--- a/fs/userfaultfd.c ++++ b/fs/userfaultfd.c +@@ -452,32 +452,6 @@ vm_fault_t handle_userfault(struct vm_fa + goto out; + + /* +- * If it's already released don't get it. This avoids to loop +- * in __get_user_pages if userfaultfd_release waits on the +- * caller of handle_userfault to release the mmap_lock. +- */ +- if (unlikely(READ_ONCE(ctx->released))) { +- /* +- * Don't return VM_FAULT_SIGBUS in this case, so a non +- * cooperative manager can close the uffd after the +- * last UFFDIO_COPY, without risking to trigger an +- * involuntary SIGBUS if the process was starting the +- * userfaultfd while the userfaultfd was still armed +- * (but after the last UFFDIO_COPY). If the uffd +- * wasn't already closed when the userfault reached +- * this point, that would normally be solved by +- * userfaultfd_must_wait returning 'false'. +- * +- * If we were to return VM_FAULT_SIGBUS here, the non +- * cooperative manager would be instead forced to +- * always call UFFDIO_UNREGISTER before it can safely +- * close the uffd. +- */ +- ret = VM_FAULT_NOPAGE; +- goto out; +- } +- +- /* + * Check that we can return VM_FAULT_RETRY. + * + * NOTE: it should become possible to return VM_FAULT_RETRY +@@ -513,6 +487,31 @@ vm_fault_t handle_userfault(struct vm_fa + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) + goto out; + ++ if (unlikely(READ_ONCE(ctx->released))) { ++ /* ++ * If a concurrent release is detected, do not return ++ * VM_FAULT_SIGBUS or VM_FAULT_NOPAGE, but instead always ++ * return VM_FAULT_RETRY with lock released proactively. ++ * ++ * If we were to return VM_FAULT_SIGBUS here, the non ++ * cooperative manager would be instead forced to ++ * always call UFFDIO_UNREGISTER before it can safely ++ * close the uffd, to avoid involuntary SIGBUS triggered. ++ * ++ * If we were to return VM_FAULT_NOPAGE, it would work for ++ * the fault path, in which the lock will be released ++ * later. However for GUP, faultin_page() does nothing ++ * special on NOPAGE, so GUP would spin retrying without ++ * releasing the mmap read lock, causing possible livelock. ++ * ++ * Here only VM_FAULT_RETRY would make sure the mmap lock ++ * be released immediately, so that the thread concurrently ++ * releasing the userfault would always make progress. 
++ */ ++ release_fault_lock(vmf); ++ goto out; ++ } ++ + /* take the reference before dropping the mmap_lock */ + userfaultfd_ctx_get(ctx); + diff --git a/queue-6.6/sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch b/queue-6.6/sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch new file mode 100644 index 0000000000..034a0795ab --- /dev/null +++ b/queue-6.6/sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch @@ -0,0 +1,159 @@ +From f1a69a940de58b16e8249dff26f74c8cc59b32be Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?Ricardo=20Ca=C3=B1uelo=20Navarro?= +Date: Fri, 4 Apr 2025 16:53:21 +0200 +Subject: sctp: detect and prevent references to a freed transport in sendmsg +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +From: Ricardo Cañuelo Navarro + +commit f1a69a940de58b16e8249dff26f74c8cc59b32be upstream. + +sctp_sendmsg() re-uses associations and transports when possible by +doing a lookup based on the socket endpoint and the message destination +address, and then sctp_sendmsg_to_asoc() sets the selected transport in +all the message chunks to be sent. + +There's a possible race condition if another thread triggers the removal +of that selected transport, for instance, by explicitly unbinding an +address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have +been set up and before the message is sent. This can happen if the send +buffer is full, during the period when the sender thread temporarily +releases the socket lock in sctp_wait_for_sndbuf(). + +This causes the access to the transport data in +sctp_outq_select_transport(), when the association outqueue is flushed, +to result in a use-after-free read. + +This change avoids this scenario by having sctp_transport_free() signal +the freeing of the transport, tagging it as "dead". In order to do this, +the patch restores the "dead" bit in struct sctp_transport, which was +removed in +commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport"). + +Then, in the scenario where the sender thread has released the socket +lock in sctp_wait_for_sndbuf(), the bit is checked again after +re-acquiring the socket lock to detect the deletion. This is done while +holding a reference to the transport to prevent it from being freed in +the process. + +If the transport was deleted while the socket lock was relinquished, +sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the +send. + +The bug was found by a private syzbot instance (see the error report [1] +and the C reproducer that triggers it [2]). 
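+
+For orientation, the detection boils down to the following (condensed from
+the hunks below; the wait-loop details and error paths are trimmed):
+
+	/* sctp_transport_free(): flag the transport before tearing it down. */
+	transport->dead = 1;
+
+	/* sctp_wait_for_sndbuf(): pin the transport across the wait, then
+	 * re-check the flag each time the socket lock is re-acquired. */
+	if (transport)
+		sctp_transport_hold(transport);
+	sctp_association_hold(asoc);
+
+	for (;;) {
+		/* ... prepare to wait on asoc->wait ... */
+		if (asoc->base.dead)
+			goto do_dead;
+		if (!*timeo_p || (transport && transport->dead))
+			goto do_nonblock;	/* -EAGAIN: userspace retries */
+		/* ... release the socket lock, sleep, re-acquire it ... */
+	}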
+ +Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt [1] +Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c [2] +Cc: stable@vger.kernel.org +Fixes: df132eff4638 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer") +Suggested-by: Xin Long +Signed-off-by: Ricardo Cañuelo Navarro +Acked-by: Xin Long +Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com +Signed-off-by: Paolo Abeni +Signed-off-by: Greg Kroah-Hartman +--- + include/net/sctp/structs.h | 3 ++- + net/sctp/socket.c | 22 ++++++++++++++-------- + net/sctp/transport.c | 2 ++ + 3 files changed, 18 insertions(+), 9 deletions(-) + +--- a/include/net/sctp/structs.h ++++ b/include/net/sctp/structs.h +@@ -778,6 +778,7 @@ struct sctp_transport { + + /* Reference counting. */ + refcount_t refcnt; ++ __u32 dead:1, + /* RTO-Pending : A flag used to track if one of the DATA + * chunks sent to this address is currently being + * used to compute a RTT. If this flag is 0, +@@ -787,7 +788,7 @@ struct sctp_transport { + * calculation completes (i.e. the DATA chunk + * is SACK'd) clear this flag. + */ +- __u32 rto_pending:1, ++ rto_pending:1, + + /* + * hb_sent : a flag that signals that we have a pending +--- a/net/sctp/socket.c ++++ b/net/sctp/socket.c +@@ -71,8 +71,9 @@ + /* Forward declarations for internal helper functions. */ + static bool sctp_writeable(const struct sock *sk); + static void sctp_wfree(struct sk_buff *skb); +-static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p, +- size_t msg_len); ++static int sctp_wait_for_sndbuf(struct sctp_association *asoc, ++ struct sctp_transport *transport, ++ long *timeo_p, size_t msg_len); + static int sctp_wait_for_packet(struct sock *sk, int *err, long *timeo_p); + static int sctp_wait_for_connect(struct sctp_association *, long *timeo_p); + static int sctp_wait_for_accept(struct sock *sk, long timeo); +@@ -1827,7 +1828,7 @@ static int sctp_sendmsg_to_asoc(struct s + + if (sctp_wspace(asoc) <= 0 || !sk_wmem_schedule(sk, msg_len)) { + timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); +- err = sctp_wait_for_sndbuf(asoc, &timeo, msg_len); ++ err = sctp_wait_for_sndbuf(asoc, transport, &timeo, msg_len); + if (err) + goto err; + if (unlikely(sinfo->sinfo_stream >= asoc->stream.outcnt)) { +@@ -9208,8 +9209,9 @@ void sctp_sock_rfree(struct sk_buff *skb + + + /* Helper function to wait for space in the sndbuf. */ +-static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p, +- size_t msg_len) ++static int sctp_wait_for_sndbuf(struct sctp_association *asoc, ++ struct sctp_transport *transport, ++ long *timeo_p, size_t msg_len) + { + struct sock *sk = asoc->base.sk; + long current_timeo = *timeo_p; +@@ -9219,7 +9221,9 @@ static int sctp_wait_for_sndbuf(struct s + pr_debug("%s: asoc:%p, timeo:%ld, msg_len:%zu\n", __func__, asoc, + *timeo_p, msg_len); + +- /* Increment the association's refcnt. */ ++ /* Increment the transport and association's refcnt. */ ++ if (transport) ++ sctp_transport_hold(transport); + sctp_association_hold(asoc); + + /* Wait on the association specific sndbuf space. 
*/ +@@ -9228,7 +9232,7 @@ static int sctp_wait_for_sndbuf(struct s + TASK_INTERRUPTIBLE); + if (asoc->base.dead) + goto do_dead; +- if (!*timeo_p) ++ if ((!*timeo_p) || (transport && transport->dead)) + goto do_nonblock; + if (sk->sk_err || asoc->state >= SCTP_STATE_SHUTDOWN_PENDING) + goto do_error; +@@ -9253,7 +9257,9 @@ static int sctp_wait_for_sndbuf(struct s + out: + finish_wait(&asoc->wait, &wait); + +- /* Release the association's refcnt. */ ++ /* Release the transport and association's refcnt. */ ++ if (transport) ++ sctp_transport_put(transport); + sctp_association_put(asoc); + + return err; +--- a/net/sctp/transport.c ++++ b/net/sctp/transport.c +@@ -117,6 +117,8 @@ fail: + */ + void sctp_transport_free(struct sctp_transport *transport) + { ++ transport->dead = 1; ++ + /* Try to delete the heartbeat timer. */ + if (del_timer(&transport->hb_timer)) + sctp_transport_put(transport); diff --git a/queue-6.6/series b/queue-6.6/series index 99ad51f7c3..ee329da601 100644 --- a/queue-6.6/series +++ b/queue-6.6/series @@ -183,3 +183,15 @@ backlight-led_bl-hold-led_access-lock-when-calling-led_sysfs_disable.patch btrfs-fix-non-empty-delayed-iputs-list-on-unmount-due-to-compressed-write-workers.patch btrfs-zoned-fix-zone-activation-with-missing-devices.patch btrfs-zoned-fix-zone-finishing-with-missing-devices.patch +iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch +sparc-mm-disable-preemption-in-lazy-mmu-mode.patch +sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch +mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch +mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch +mm-fix-lazy-mmu-docs-and-usage.patch +mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch +mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch +mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch +mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch +mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch +sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch diff --git a/queue-6.6/sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch b/queue-6.6/sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch new file mode 100644 index 0000000000..503ccfb06a --- /dev/null +++ b/queue-6.6/sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch @@ -0,0 +1,70 @@ +From eb61ad14c459b54f71f76331ca35d12fa3eb8f98 Mon Sep 17 00:00:00 2001 +From: Ryan Roberts +Date: Mon, 3 Mar 2025 14:15:38 +0000 +Subject: sparc/mm: avoid calling arch_enter/leave_lazy_mmu() in set_ptes + +From: Ryan Roberts + +commit eb61ad14c459b54f71f76331ca35d12fa3eb8f98 upstream. + +With commit 1a10a44dfc1d ("sparc64: implement the new page table range +API") set_ptes was added to the sparc architecture. The implementation +included calling arch_enter/leave_lazy_mmu() calls. + +The patch removes the usage of arch_enter/leave_lazy_mmu() since this +implies nesting of lazy mmu regions which is not supported. Without this +fix, lazy mmu mode is effectively disabled because we exit the mode after +the first set_ptes: + +remap_pte_range() + -> arch_enter_lazy_mmu() + -> set_ptes() + -> arch_enter_lazy_mmu() + -> arch_leave_lazy_mmu() + -> arch_leave_lazy_mmu() + +Powerpc suffered the same problem and fixed it in a corresponding way with +commit 47b8def9358c ("powerpc/mm: Avoid calling +arch_enter/leave_lazy_mmu() in set_ptes"). 
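+
+The reason the nested leave is harmful on sparc can be seen from the
+pre-existing implementation in arch/sparc/mm/tlb.c, sketched here: the
+inner arch_leave_lazy_mmu_mode() flushes and clears the per-cpu 'active'
+flag, so the remainder of the outer region runs unbatched.
+
+	void arch_leave_lazy_mmu_mode(void)
+	{
+		struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+
+		if (tb->tlb_nr)
+			flush_tlb_pending();
+		tb->active = 0;	/* outer region loses batching from here */
+	}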
+ +Link: https://lkml.kernel.org/r/20250303141542.3371656-5-ryan.roberts@arm.com +Fixes: 1a10a44dfc1d ("sparc64: implement the new page table range API") +Signed-off-by: Ryan Roberts +Acked-by: David Hildenbrand +Acked-by: Andreas Larsson +Acked-by: Juergen Gross +Cc: Borislav Betkov +Cc: Boris Ostrovsky +Cc: Catalin Marinas +Cc: Dave Hansen +Cc: David S. Miller +Cc: "H. Peter Anvin" +Cc: Ingo Molnar +Cc: Juegren Gross +Cc: Matthew Wilcow (Oracle) +Cc: Thomas Gleinxer +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + arch/sparc/include/asm/pgtable_64.h | 2 -- + 1 file changed, 2 deletions(-) + +--- a/arch/sparc/include/asm/pgtable_64.h ++++ b/arch/sparc/include/asm/pgtable_64.h +@@ -931,7 +931,6 @@ static inline void __set_pte_at(struct m + static inline void set_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr) + { +- arch_enter_lazy_mmu_mode(); + for (;;) { + __set_pte_at(mm, addr, ptep, pte, 0); + if (--nr == 0) +@@ -940,7 +939,6 @@ static inline void set_ptes(struct mm_st + pte_val(pte) += PAGE_SIZE; + addr += PAGE_SIZE; + } +- arch_leave_lazy_mmu_mode(); + } + #define set_ptes set_ptes + diff --git a/queue-6.6/sparc-mm-disable-preemption-in-lazy-mmu-mode.patch b/queue-6.6/sparc-mm-disable-preemption-in-lazy-mmu-mode.patch new file mode 100644 index 0000000000..ade1fb51da --- /dev/null +++ b/queue-6.6/sparc-mm-disable-preemption-in-lazy-mmu-mode.patch @@ -0,0 +1,70 @@ +From a1d416bf9faf4f4871cb5a943614a07f80a7d70f Mon Sep 17 00:00:00 2001 +From: Ryan Roberts +Date: Mon, 3 Mar 2025 14:15:37 +0000 +Subject: sparc/mm: disable preemption in lazy mmu mode + +From: Ryan Roberts + +commit a1d416bf9faf4f4871cb5a943614a07f80a7d70f upstream. + +Since commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy +updates") it's been possible for arch_[enter|leave]_lazy_mmu_mode() to be +called without holding a page table lock (for the kernel mappings case), +and therefore it is possible that preemption may occur while in the lazy +mmu mode. The Sparc lazy mmu implementation is not robust to preemption +since it stores the lazy mode state in a per-cpu structure and does not +attempt to manage that state on task switch. + +Powerpc had the same issue and fixed it by explicitly disabling preemption +in arch_enter_lazy_mmu_mode() and re-enabling in +arch_leave_lazy_mmu_mode(). See commit b9ef323ea168 ("powerpc/64s: +Disable preemption in hash lazy mmu mode"). + +Given Sparc's lazy mmu mode is based on powerpc's, let's fix it in the +same way here. + +Link: https://lkml.kernel.org/r/20250303141542.3371656-4-ryan.roberts@arm.com +Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates") +Signed-off-by: Ryan Roberts +Acked-by: David Hildenbrand +Acked-by: Andreas Larsson +Acked-by: Juergen Gross +Cc: Borislav Betkov +Cc: Boris Ostrovsky +Cc: Catalin Marinas +Cc: Dave Hansen +Cc: David S. Miller +Cc: "H. 
Peter Anvin" +Cc: Ingo Molnar +Cc: Juegren Gross +Cc: Matthew Wilcow (Oracle) +Cc: Thomas Gleinxer +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + arch/sparc/mm/tlb.c | 5 ++++- + 1 file changed, 4 insertions(+), 1 deletion(-) + +--- a/arch/sparc/mm/tlb.c ++++ b/arch/sparc/mm/tlb.c +@@ -52,8 +52,10 @@ out: + + void arch_enter_lazy_mmu_mode(void) + { +- struct tlb_batch *tb = this_cpu_ptr(&tlb_batch); ++ struct tlb_batch *tb; + ++ preempt_disable(); ++ tb = this_cpu_ptr(&tlb_batch); + tb->active = 1; + } + +@@ -64,6 +66,7 @@ void arch_leave_lazy_mmu_mode(void) + if (tb->tlb_nr) + flush_tlb_pending(); + tb->active = 0; ++ preempt_enable(); + } + + static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,