--- /dev/null
+From a05df03a88bc1088be8e9d958f208d6484691e43 Mon Sep 17 00:00:00 2001
+From: Nicolin Chen <nicolinc@nvidia.com>
+Date: Thu, 27 Feb 2025 12:07:29 -0800
+Subject: iommufd: Fix uninitialized rc in iommufd_access_rw()
+
+From: Nicolin Chen <nicolinc@nvidia.com>
+
+commit a05df03a88bc1088be8e9d958f208d6484691e43 upstream.
+
+Reported by smatch:
+drivers/iommu/iommufd/device.c:1392 iommufd_access_rw() error: uninitialized symbol 'rc'.
+
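+A minimal sketch of the hazard (illustrative only; the loop and helper names
+below are hypothetical, not the exact iommufd code): if the body of the
+area-walking loop never executes, rc is returned without ever being assigned:
+
+    int rc;                         /* no initializer before this fix */
+    for_each_covered_area(...)      /* may execute zero times          */
+        rc = copy_chunk(...);
+    return rc;                      /* garbage when no area was found  */
+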
+Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel access")
+Link: https://patch.msgid.link/r/20250227200729.85030-1-nicolinc@nvidia.com
+Cc: stable@vger.kernel.org
+Reported-by: kernel test robot <lkp@intel.com>
+Reported-by: Dan Carpenter <error27@gmail.com>
+Closes: https://lore.kernel.org/r/202502271339.a2nWr9UA-lkp@intel.com/
+[nicolinc: can't find an original report but only in "old smatch warnings"]
+Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
+Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ drivers/iommu/iommufd/device.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/drivers/iommu/iommufd/device.c
++++ b/drivers/iommu/iommufd/device.c
+@@ -1076,7 +1076,7 @@ int iommufd_access_rw(struct iommufd_acc
+ struct io_pagetable *iopt;
+ struct iopt_area *area;
+ unsigned long last_iova;
+- int rc;
++ int rc = -EINVAL;
+
+ if (!length)
+ return -EINVAL;
--- /dev/null
+From c0ebbb3841e07c4493e6fe351698806b09a87a37 Mon Sep 17 00:00:00 2001
+From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+Date: Wed, 12 Mar 2025 10:10:13 -0400
+Subject: mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlock
+
+From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+
+commit c0ebbb3841e07c4493e6fe351698806b09a87a37 upstream.
+
+The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node
+reclaim for struct pglist_data using a single bit.
+
+It is "locked" with a test_and_set_bit (similarly to a try lock) which
+provides full ordering with respect to loads and stores done within
+__node_reclaim().
+
+It is "unlocked" with clear_bit(), which does not provide any ordering
+with respect to loads and stores done before clearing the bit.
+
+The lack of clear_bit() memory ordering with respect to stores within
+__node_reclaim() can cause a subsequent CPU to fail to observe stores from
+a prior node reclaim. This is not an issue in practice on TSO (e.g.
+x86), but it is an issue on weakly-ordered architectures (e.g. arm64).
+
+Fix this by using clear_bit_unlock() rather than clear_bit() to clear
+PGDAT_RECLAIM_LOCKED with release memory ordering semantics.
+
+This provides stronger memory ordering (release rather than relaxed).
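+
+As a minimal sketch of the ordering requirement (illustrative only, not the
+vmscan code itself): CPU0 performs stores inside __node_reclaim() and then
+releases the bit, and CPU1 must observe those stores once it wins the
+try-lock. The 'data' and 'flags' variables below are hypothetical:
+
+    /* CPU0: reclaim, then unlock with release semantics */
+    WRITE_ONCE(data, 1);                             /* store done during reclaim      */
+    clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &flags);  /* orders the store before clear  */
+
+    /* CPU1: try-lock with full ordering, then read */
+    if (!test_and_set_bit(PGDAT_RECLAIM_LOCKED, &flags))
+        r = READ_ONCE(data);                         /* guaranteed to observe the store */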
+
+Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com
+Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim")
+Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
+Cc: Matthew Wilcox <willy@infradead.org>
+Cc: Alan Stern <stern@rowland.harvard.edu>
+Cc: Andrea Parri <parri.andrea@gmail.com>
+Cc: Will Deacon <will@kernel.org>
+Cc: Peter Zijlstra <peterz@infradead.org>
+Cc: Boqun Feng <boqun.feng@gmail.com>
+Cc: Nicholas Piggin <npiggin@gmail.com>
+Cc: David Howells <dhowells@redhat.com>
+Cc: Jade Alglave <j.alglave@ucl.ac.uk>
+Cc: Luc Maranget <luc.maranget@inria.fr>
+Cc: "Paul E. McKenney" <paulmck@kernel.org>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/vmscan.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -8115,7 +8115,7 @@ int node_reclaim(struct pglist_data *pgd
+ return NODE_RECLAIM_NOSCAN;
+
+ ret = __node_reclaim(pgdat, gfp_mask, order);
+- clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
++ clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+
+ if (!ret)
+ count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
--- /dev/null
+From 691ee97e1a9de0cdb3efb893c1f180e3f4a35e32 Mon Sep 17 00:00:00 2001
+From: Ryan Roberts <ryan.roberts@arm.com>
+Date: Mon, 3 Mar 2025 14:15:35 +0000
+Subject: mm: fix lazy mmu docs and usage
+
+From: Ryan Roberts <ryan.roberts@arm.com>
+
+commit 691ee97e1a9de0cdb3efb893c1f180e3f4a35e32 upstream.
+
+Patch series "Fix lazy mmu mode", v2.
+
+I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As
+part of that, I will extend lazy mmu mode to cover kernel mappings in
+vmalloc table walkers. While lazy mmu mode is already used for kernel
+mappings in a few places, this will extend its use significantly.
+
+Having reviewed the existing lazy mmu implementations in powerpc, sparc
+and x86, it looks like there are a bunch of bugs, some of which may be
+more likely to trigger once I extend the use of lazy mmu. So this series
+attempts to clarify the requirements and fix all the bugs in advance of
+that series. See patch #1 commit log for all the details.
+
+
+This patch (of 5):
+
+The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode() are
+a bit of a mess (to put it politely). There are a number of issues
+related to nesting of lazy mmu regions and confusion over whether the
+task, when in a lazy mmu region, is preemptible or not. Fix all the
+issues relating to the core mm. Follow-up commits will fix the
+arch-specific implementations. Three arches implement lazy mmu: powerpc,
+sparc and x86.
+
+When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit
+6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was
+expected that lazy mmu regions would never nest and that the appropriate
+page table lock(s) would be held while in the region, thus ensuring the
+region is non-preemptible. Additionally lazy mmu regions were only used
+during manipulation of user mappings.
+
+Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy
+updates") started invoking the lazy mmu mode in apply_to_pte_range(),
+which is used for both user and kernel mappings. For kernel mappings the
+region is no longer protected by any lock so there is no longer any
+guarantee about non-preemptibility. Additionally, for RT configs,
+holding the PTL only implies no CPU migration; it doesn't prevent
+preemption.
+
+Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added
+arch_[enter|leave]_lazy_mmu_mode() to the default implementation of
+set_ptes(), used by x86. So after this commit, lazy mmu regions can be
+nested. Additionally commit 1a10a44dfc1d ("sparc64: implement the new
+page table range API") and commit 9fee28baa601 ("powerpc: implement the
+new page table range API") did the same for the sparc and powerpc
+set_ptes() overrides.
+
+powerpc couldn't deal with preemption so avoids it in commit b9ef323ea168
+("powerpc/64s: Disable preemption in hash lazy mmu mode"), which
+explicitly disables preemption for the whole region in its implementation.
+x86 can support preemption (or at least it could until it tried to add
+support for nesting; more on this below). Sparc looks to be totally broken in
+the face of preemption, as far as I can tell.
+
+powerpc can't deal with nesting, so avoids it in commit 47b8def9358c
+("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"),
+which removes the lazy mmu calls from its implementation of set_ptes().
+x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow
+nesting of same lazy mode") but as far as I can tell, this breaks its
+support for preemption.
+
+In short, it's all a mess; the semantics for
+arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a result
+the implementations all have different expectations, sticking plasters and
+bugs.
+
+arm64 is aiming to start using these hooks, so let's clean everything up
+before adding an arm64 implementation. Update the documentation to state
+that lazy mmu regions can never be nested, must not be entered from interrupt
+context, and that preemption may or may not be enabled for the duration of the
+region. Also fix the generic implementation of set_ptes() to avoid
+nesting.
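+
+As a rough sketch of the updated contract (illustrative, not code from this
+patch): a caller owns a single, flat lazy mmu region, must not assume
+preemption is disabled inside it, and set_ptes() no longer opens a nested
+region of its own:
+
+    arch_enter_lazy_mmu_mode();          /* no nesting, never from interrupt context */
+    set_ptes(mm, addr, ptep, pte, nr);   /* PTE updates may be batched/deferred       */
+    arch_leave_lazy_mmu_mode();          /* flushes any deferred updates              */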
+
+Arch-specific fixes to conform to the new spec will follow this one.
+
+These issues were spotted by code review and I have no evidence of issues
+being reported in the wild.
+
+Link: https://lkml.kernel.org/r/20250303141542.3371656-1-ryan.roberts@arm.com
+Link: https://lkml.kernel.org/r/20250303141542.3371656-2-ryan.roberts@arm.com
+Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()")
+Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
+Acked-by: David Hildenbrand <david@redhat.com>
+Acked-by: Juergen Gross <jgross@suse.com>
+Cc: Andreas Larsson <andreas@gaisler.com>
+Cc: Borislav Betkov <bp@alien8.de>
+Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: David S. Miller <davem@davemloft.net>
+Cc: "H. Peter Anvin" <hpa@zytor.com>
+Cc: Ingo Molnar <mingo@redhat.com>
+Cc: Juegren Gross <jgross@suse.com>
+Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
+Cc: Thomas Gleinxer <tglx@linutronix.de>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ include/linux/pgtable.h | 14 ++++++++------
+ 1 file changed, 8 insertions(+), 6 deletions(-)
+
+--- a/include/linux/pgtable.h
++++ b/include/linux/pgtable.h
+@@ -194,10 +194,14 @@ static inline int pmd_young(pmd_t pmd)
+ * hazard could result in the direct mode hypervisor case, since the actual
+ * write to the page tables may not yet have taken place, so reads though
+ * a raw PTE pointer after it has been modified are not guaranteed to be
+- * up to date. This mode can only be entered and left under the protection of
+- * the page table locks for all page tables which may be modified. In the UP
+- * case, this is required so that preemption is disabled, and in the SMP case,
+- * it must synchronize the delayed page table writes properly on other CPUs.
++ * up to date.
++ *
++ * In the general case, no lock is guaranteed to be held between entry and exit
++ * of the lazy mode. So the implementation must assume preemption may be enabled
++ * and cpu migration is possible; it must take steps to be robust against this.
++ * (In practice, for user PTE updates, the appropriate page table lock(s) are
++ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
++ * and the mode cannot be used in interrupt context.
+ */
+ #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+ #define arch_enter_lazy_mmu_mode() do {} while (0)
+@@ -233,7 +237,6 @@ static inline void set_ptes(struct mm_st
+ {
+ page_table_check_ptes_set(mm, ptep, pte, nr);
+
+- arch_enter_lazy_mmu_mode();
+ for (;;) {
+ set_pte(ptep, pte);
+ if (--nr == 0)
+@@ -241,7 +244,6 @@ static inline void set_ptes(struct mm_st
+ ptep++;
+ pte = pte_next_pfn(pte);
+ }
+- arch_leave_lazy_mmu_mode();
+ }
+ #endif
+ #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
--- /dev/null
+From 1ca77ff1837249701053a7fcbdedabc41f4ae67c Mon Sep 17 00:00:00 2001
+From: Marc Herbert <Marc.Herbert@linux.intel.com>
+Date: Wed, 19 Mar 2025 06:00:30 +0000
+Subject: mm/hugetlb: move hugetlb_sysctl_init() to the __init section
+
+From: Marc Herbert <Marc.Herbert@linux.intel.com>
+
+commit 1ca77ff1837249701053a7fcbdedabc41f4ae67c upstream.
+
+hugetlb_sysctl_init() is only invoked once, by an __init function, and is
+merely a wrapper around another __init function, so there is no reason to
+keep it out of the __init section.
+
+This fixes the following warning when toning down some GCC inlining options:
+
+ WARNING: modpost: vmlinux: section mismatch in reference:
+ hugetlb_sysctl_init+0x1b (section: .text) ->
+ __register_sysctl_init (section: .init.text)
+
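+The general rule being applied (a hedged sketch; the 'foo' names are
+illustrative, not from this patch): a function that is only called from
+__init code and only calls __init code should itself be __init, so the
+reference stays init-to-init and the text is discarded after boot:
+
+    static void __init foo_sysctl_init(void)   /* lives in .init.text */
+    {
+        /* __init callee, so no modpost section mismatch */
+        register_sysctl_init("vm", foo_table);
+    }
+
+    static int __init foo_init(void)
+    {
+        foo_sysctl_init();                      /* init-to-init reference is fine */
+        return 0;
+    }
+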
+Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com
+Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com>
+Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
+Reviewed-by: Muchun Song <muchun.song@linux.dev>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/hugetlb.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/hugetlb.c
++++ b/mm/hugetlb.c
+@@ -4695,7 +4695,7 @@ static struct ctl_table hugetlb_table[]
+ { }
+ };
+
+-static void hugetlb_sysctl_init(void)
++static void __init hugetlb_sysctl_init(void)
+ {
+ register_sysctl_init("vm", hugetlb_table);
+ }
--- /dev/null
+From aaf99ac2ceb7c974f758a635723eeaf48596388e Mon Sep 17 00:00:00 2001
+From: Shuai Xue <xueshuai@linux.alibaba.com>
+Date: Wed, 12 Mar 2025 19:28:51 +0800
+Subject: mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
+
+From: Shuai Xue <xueshuai@linux.alibaba.com>
+
+commit aaf99ac2ceb7c974f758a635723eeaf48596388e upstream.
+
+When an uncorrected memory error is consumed there is a race between the
+CMCI from the memory controller reporting an uncorrected error with a UCNA
+signature, and the core reporting an SRAR signature machine check when
+the data is about to be consumed.
+
+- Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1]
+
+Prior to Icelake memory controllers reported patrol scrub events that
+detected a previously unseen uncorrected error in memory by signaling a
+broadcast machine check with an SRAO (Software Recoverable Action
+Optional) signature in the machine check bank. This was overkill because
+it's not an urgent problem that no core is on the verge of consuming that
+bad data. It's also found that multi SRAO UCE may cause nested MCE
+interrupts and finally become an IERR.
+
+Hence, Intel downgraded the machine check bank signature of patrol scrub
+from SRAO to UCNA (Uncorrected, No Action required), and changed the signal
+to #CMCI. Just to add to the confusion, Linux does take an action (in
+uc_decode_notifier()) to try to offline the page despite the UC*NA*
+signature name.
+
+- Background: why #CMCI and #MCE race when poison is being consumed on Intel platforms [1]
+
+Having decided that CMCI/UCNA is the best action for patrol scrub errors,
+the memory controller uses it for reads too. But the memory controller is
+executing asynchronously from the core, and can't tell the difference
+between a "real" read and a speculative read. So it will do CMCI/UCNA if
+an error is found in any read.
+
+Thus:
+
+1) Core is clever and thinks address A is needed soon, issues a speculative read.
+2) Core finds it is going to use address A soon after sending the read request
+3) The CMCI from the memory controller is in a race with MCE from the core
+ that will soon try to retire the load from address A.
+
+Quite often (because speculation has got better) the CMCI from the memory
+controller is delivered before the core is committed to the instruction
+reading address A, so the interrupt is taken, and Linux offlines the page
+(marking it as poison).
+
+- Why user process is killed for instr case
+
+Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported
+"not recovered"") tries to fix noise message "Memory error not recovered"
+and skips duplicate SIGBUSs due to the race. But it also introduced a bug
+that kill_accessing_process() return -EHWPOISON for instr case, as result,
+kill_me_maybe() send a SIGBUS to user process.
+
+If the CMCI wins that race, the page is marked poisoned when
+uc_decode_notifier() calls memory_failure(). For dirty pages,
+memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
+converting the PTE to a hwpoison entry. As a result,
+kill_accessing_process():
+
+- calls walk_page_range(), which returns 1 regardless of whether
+  try_to_unmap() succeeds or fails,
+- calls kill_proc() to make sure a SIGBUS is sent, and
+- returns -EHWPOISON to indicate that a SIGBUS has already been sent to the
+  process and kill_me_maybe() doesn't have to send it again.
+
+However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
+PTE unchanged and not converted to a hwpoison entry. Consequently, for
+clean pages whose PTE entries are not marked as hwpoison,
+kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send
+a SIGBUS.
+
+Console log looks like this:
+
+ Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
+ Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
+ Memory failure: 0x827ca68: already hardware poisoned
+ mce: Memory error not recovered
+
+To fix it, return 0 for "corrupted page was clean", preventing an
+unnecessary SIGBUS to the user process.
+
+[1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.com/T/#mba94f1305b3009dd340ce4114d3221fe810d1871
+Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com
+Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
+Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
+Tested-by: Tony Luck <tony.luck@intel.com>
+Acked-by: Miaohe Lin <linmiaohe@huawei.com>
+Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
+Cc: Borislav Betkov <bp@alien8.de>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: "H. Peter Anvin" <hpa@zytor.com>
+Cc: Ingo Molnar <mingo@redhat.com>
+Cc: Jane Chu <jane.chu@oracle.com>
+Cc: Jarkko Sakkinen <jarkko@kernel.org>
+Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
+Cc: Josh Poimboeuf <jpoimboe@kernel.org>
+Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
+Cc: Peter Zijlstra <peterz@infradead.org>
+Cc: Ruidong Tian <tianruidong@linux.alibaba.com>
+Cc: Thomas Gleinxer <tglx@linutronix.de>
+Cc: Yazen Ghannam <yazen.ghannam@amd.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/memory-failure.c | 11 ++++++++---
+ 1 file changed, 8 insertions(+), 3 deletions(-)
+
+--- a/mm/memory-failure.c
++++ b/mm/memory-failure.c
+@@ -869,12 +869,17 @@ static int kill_accessing_process(struct
+ mmap_read_lock(p->mm);
+ ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
+ (void *)&priv);
++ /*
++ * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
++ * succeeds or fails, then kill the process with SIGBUS.
++ * ret = 0 when poison page is a clean page and it's dropped, no
++ * SIGBUS is needed.
++ */
+ if (ret == 1 && priv.tk.addr)
+ kill_proc(&priv.tk, pfn, flags);
+- else
+- ret = 0;
+ mmap_read_unlock(p->mm);
+- return ret > 0 ? -EHWPOISON : -EFAULT;
++
++ return ret > 0 ? -EHWPOISON : 0;
+ }
+
+ static const char *action_name[] = {
--- /dev/null
+From 442b1eca223b4860cc85ef970ae602d125aec5a4 Mon Sep 17 00:00:00 2001
+From: Jane Chu <jane.chu@oracle.com>
+Date: Mon, 24 Feb 2025 14:14:45 -0700
+Subject: mm: make page_mapped_in_vma() hugetlb walk aware
+
+From: Jane Chu <jane.chu@oracle.com>
+
+commit 442b1eca223b4860cc85ef970ae602d125aec5a4 upstream.
+
+When a process consumes a UE in a page, the memory failure handler
+attempts to collect information for a potential SIGBUS. If the page is an
+anonymous page, page_mapped_in_vma(page, vma) is invoked in order to
+
+ 1. retrieve the vaddr from the process' address space,
+
+ 2. verify that the vaddr is indeed mapped to the poisoned page,
+ where 'page' is the precise small page with UE.
+
+It's been observed that when injecting poison to a non-head subpage of an
+anonymous hugetlb page, no SIGBUS shows up, while injecting to the head
+page produces a SIGBUS. The cause is that, although hugetlb_walk() returns
+a valid pmd entry (on x86), check_pte() detects a mismatch between the
+head page per the pmd and the input subpage. Thus the vaddr is considered
+not mapped to the subpage and the process is not collected for SIGBUS
+purposes. This is the calling stack:
+
+ collect_procs_anon
+ page_mapped_in_vma
+ page_vma_mapped_walk
+ hugetlb_walk
+ huge_pte_lock
+ check_pte
+
+check_pte()'s header says that it will
+"check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte",
+but in practice it only works if pvmw->pfn is the pfn of the head page at
+pvmw->pte. In hindsight, some pvmw->pte entries can point to a hugepage of
+some sort, so it makes sense to make check_pte() work for hugepages.
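+
+An illustrative restatement of the overlap test the fix introduces, where
+pte_nr is the number of small pages mapped by the pte entry (1 for a normal
+pte, pages_per_huge_page() for hugetlb):
+
+    /* does [pfn, pfn + pte_nr) intersect [pvmw->pfn, pvmw->pfn + pvmw->nr_pages)? */
+    if (pfn + pte_nr - 1 < pvmw->pfn)            /* mapping ends before the target  */
+        return false;
+    if (pfn > pvmw->pfn + pvmw->nr_pages - 1)    /* mapping starts after the target */
+        return false;
+    return true;                                 /* the ranges overlap              */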
+
+Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com
+Signed-off-by: Jane Chu <jane.chu@oracle.com>
+Cc: Hugh Dickins <hughd@google.com>
+Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
+Cc: linmiaohe <linmiaohe@huawei.com>
+Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
+Cc: Peter Xu <peterx@redhat.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/page_vma_mapped.c | 13 +++++++++----
+ 1 file changed, 9 insertions(+), 4 deletions(-)
+
+--- a/mm/page_vma_mapped.c
++++ b/mm/page_vma_mapped.c
+@@ -77,6 +77,7 @@ static bool map_pte(struct page_vma_mapp
+ * mapped at the @pvmw->pte
+ * @pvmw: page_vma_mapped_walk struct, includes a pair pte and pfn range
+ * for checking
++ * @pte_nr: the number of small pages described by @pvmw->pte.
+ *
+ * page_vma_mapped_walk() found a place where pfn range is *potentially*
+ * mapped. check_pte() has to validate this.
+@@ -93,7 +94,7 @@ static bool map_pte(struct page_vma_mapp
+ * Otherwise, return false.
+ *
+ */
+-static bool check_pte(struct page_vma_mapped_walk *pvmw)
++static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
+ {
+ unsigned long pfn;
+ pte_t ptent = ptep_get(pvmw->pte);
+@@ -126,7 +127,11 @@ static bool check_pte(struct page_vma_ma
+ pfn = pte_pfn(ptent);
+ }
+
+- return (pfn - pvmw->pfn) < pvmw->nr_pages;
++ if ((pfn + pte_nr - 1) < pvmw->pfn)
++ return false;
++ if (pfn > (pvmw->pfn + pvmw->nr_pages - 1))
++ return false;
++ return true;
+ }
+
+ /* Returns true if the two ranges overlap. Careful to not overflow. */
+@@ -201,7 +206,7 @@ bool page_vma_mapped_walk(struct page_vm
+ return false;
+
+ pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
+- if (!check_pte(pvmw))
++ if (!check_pte(pvmw, pages_per_huge_page(hstate)))
+ return not_found(pvmw);
+ return true;
+ }
+@@ -283,7 +288,7 @@ restart:
+ goto next_pte;
+ }
+ this_pte:
+- if (check_pte(pvmw))
++ if (check_pte(pvmw, 1))
+ return true;
+ next_pte:
+ do {
--- /dev/null
+From 937582ee8e8d227c30ec147629a0179131feaa80 Mon Sep 17 00:00:00 2001
+From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
+Date: Mon, 10 Mar 2025 20:50:34 +0000
+Subject: mm/mremap: correctly handle partial mremap() of VMA starting at 0
+
+From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
+
+commit 937582ee8e8d227c30ec147629a0179131feaa80 upstream.
+
+Patch series "refactor mremap and fix bug", v3.
+
+The existing mremap() logic has grown organically over a very long period
+of time, resulting in code that is in many parts, very difficult to follow
+and full of subtleties and sources of confusion.
+
+In addition, it is difficult to thread state through the operation
+correctly, as function arguments have expanded, some parameters are
+expected to be temporarily altered during the operation, others are
+intended to remain static and some can be overridden.
+
+This series completely refactors the mremap implementation, sensibly
+separating functions, adding comments to explain the more subtle aspects
+of the implementation and making use of small structs to thread state
+through everything.
+
+The reason for doing so is to lay the groundwork for planned future
+changes to the mremap logic, changes which require the ability to easily
+pass around state.
+
+Additionally, it would be unhelpful to add yet more logic to code that is
+already difficult to follow without first refactoring it like this.
+
+The first patch in this series additionally fixes a bug when a VMA with
+start address zero is partially remapped.
+
+Tested on real hardware under heavy workload and all self tests are
+passing.
+
+
+This patch (of 3):
+
+Consider the case of a partial mremap() (that results in a VMA split) of
+an accountable VMA (i.e. which has the VM_ACCOUNT flag set) whose start
+address is zero, with the MREMAP_MAYMOVE flag specified and a scenario
+where a move does in fact occur:
+
+ addr end
+ | |
+ v v
+ |-------------|
+ | vma |
+ |-------------|
+ 0
+
+This move is effected by unmapping the range [addr, end). In order to
+prevent an incorrect decrement of accounted memory which has already been
+determined, the mremap() code in move_vma() clears VM_ACCOUNT from the VMA
+prior to doing so, before reestablishing it in each of the VMAs
+post-split:
+
+ addr end
+ | |
+ v v
+ |---| |---|
+ | A | | B |
+ |---| |---|
+
+Commit 6b73cff239e5 ("mm: change munmap splitting order and move_vma()")
+changed this logic to determine whether there is a need to do so by
+establishing account_start and account_end and, in the instance where
+such an operation is required, setting them to vma->vm_start and
+vma->vm_end respectively.
+
+Later the code checks if the operation is required for 'A' referenced
+above as follows:
+
+ if (account_start) {
+ ...
+ }
+
+However, if the VMA described above has vma->vm_start == 0, which is now
+assigned to account_start, this branch will not be executed.
+
+As a result, the VMA 'A' above will remain stripped of its VM_ACCOUNT
+flag, incorrectly.
+
+The fix is to simply convert these variables to booleans and set them as
+required.
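+
+A minimal sketch of the pitfall (illustrative, not the full move_vma() code;
+fixup_accounting() is a hypothetical stand-in for the restore step): using
+the address itself as the flag makes a VMA starting at address zero
+indistinguishable from "no accounting fix-up needed":
+
+    unsigned long account_start = 0;
+    if (vma->vm_start < old_addr)
+        account_start = vma->vm_start;   /* stays 0 when vm_start == 0       */
+    ...
+    if (account_start)                   /* never taken for that VMA, so     */
+        fixup_accounting(vma);           /* VM_ACCOUNT is never re-set       */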
+
+Link: https://lkml.kernel.org/r/cover.1741639347.git.lorenzo.stoakes@oracle.com
+Link: https://lkml.kernel.org/r/dc55cb6db25d97c3d9e460de4986a323fa959676.1741639347.git.lorenzo.stoakes@oracle.com
+Fixes: 6b73cff239e5 ("mm: change munmap splitting order and move_vma()")
+Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
+Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
+Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
+Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/mremap.c | 10 +++++-----
+ 1 file changed, 5 insertions(+), 5 deletions(-)
+
+--- a/mm/mremap.c
++++ b/mm/mremap.c
+@@ -599,8 +599,8 @@ static unsigned long move_vma(struct vm_
+ unsigned long vm_flags = vma->vm_flags;
+ unsigned long new_pgoff;
+ unsigned long moved_len;
+- unsigned long account_start = 0;
+- unsigned long account_end = 0;
++ bool account_start = false;
++ bool account_end = false;
+ unsigned long hiwater_vm;
+ int err = 0;
+ bool need_rmap_locks;
+@@ -684,9 +684,9 @@ static unsigned long move_vma(struct vm_
+ if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
+ vm_flags_clear(vma, VM_ACCOUNT);
+ if (vma->vm_start < old_addr)
+- account_start = vma->vm_start;
++ account_start = true;
+ if (vma->vm_end > old_addr + old_len)
+- account_end = vma->vm_end;
++ account_end = true;
+ }
+
+ /*
+@@ -726,7 +726,7 @@ static unsigned long move_vma(struct vm_
+ /* OOM: unable to split vma, just get accounts right */
+ if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
+ vm_acct_memory(old_len >> PAGE_SHIFT);
+- account_start = account_end = 0;
++ account_start = account_end = false;
+ }
+
+ if (vm_flags & VM_LOCKED) {
--- /dev/null
+From bc3fe6805cf09a25a086573a17d40e525208c5d8 Mon Sep 17 00:00:00 2001
+From: David Hildenbrand <david@redhat.com>
+Date: Mon, 10 Feb 2025 20:37:44 +0100
+Subject: mm/rmap: reject hugetlb folios in folio_make_device_exclusive()
+
+From: David Hildenbrand <david@redhat.com>
+
+commit bc3fe6805cf09a25a086573a17d40e525208c5d8 upstream.
+
+Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP,
+let's add a safety net in case FOLL_SPLIT_PMD usage would ever be
+reworked.
+
+In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the
+generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have
+returned a page. In particular, hugetlb folios that are not PMD-sized
+would never have been prone to FOLL_SPLIT_PMD.
+
+hugetlb folios can be anonymous, and page_make_device_exclusive_one() is
+not really prepared for handling them at all. So let's spell that out.
+
+Link: https://lkml.kernel.org/r/20250210193801.781278-3-david@redhat.com
+Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
+Signed-off-by: David Hildenbrand <david@redhat.com>
+Reviewed-by: Alistair Popple <apopple@nvidia.com>
+Tested-by: Alistair Popple <apopple@nvidia.com>
+Cc: Alex Shi <alexs@kernel.org>
+Cc: Danilo Krummrich <dakr@kernel.org>
+Cc: Dave Airlie <airlied@gmail.com>
+Cc: Jann Horn <jannh@google.com>
+Cc: Jason Gunthorpe <jgg@nvidia.com>
+Cc: Jerome Glisse <jglisse@redhat.com>
+Cc: John Hubbard <jhubbard@nvidia.com>
+Cc: Jonathan Corbet <corbet@lwn.net>
+Cc: Karol Herbst <kherbst@redhat.com>
+Cc: Liam Howlett <liam.howlett@oracle.com>
+Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
+Cc: Lyude <lyude@redhat.com>
+Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
+Cc: Oleg Nesterov <oleg@redhat.com>
+Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
+Cc: Peter Xu <peterx@redhat.com>
+Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
+Cc: SeongJae Park <sj@kernel.org>
+Cc: Simona Vetter <simona.vetter@ffwll.ch>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: Yanteng Si <si.yanteng@linux.dev>
+Cc: Barry Song <v-songbaohua@oppo.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ mm/rmap.c | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/mm/rmap.c
++++ b/mm/rmap.c
+@@ -2296,7 +2296,7 @@ static bool folio_make_device_exclusive(
+ * Restrict to anonymous folios for now to avoid potential writeback
+ * issues.
+ */
+- if (!folio_test_anon(folio))
++ if (!folio_test_anon(folio) || folio_test_hugetlb(folio))
+ return false;
+
+ rmap_walk(folio, &rwc);
--- /dev/null
+From fe4cdc2c4e248f48de23bc778870fd71e772a274 Mon Sep 17 00:00:00 2001
+From: Peter Xu <peterx@redhat.com>
+Date: Wed, 12 Mar 2025 10:51:31 -0400
+Subject: mm/userfaultfd: fix release hang over concurrent GUP
+
+From: Peter Xu <peterx@redhat.com>
+
+commit fe4cdc2c4e248f48de23bc778870fd71e772a274 upstream.
+
+This patch should fix a possible userfaultfd release() hang during
+concurrent GUP.
+
+This problem was initially reported by Dimitris Siakavaras in July 2023
+[1] in a firecracker use case. Firecracker has a separate process
+handling page faults remotely, and when the process releases the
+userfaultfd it can race with a concurrent GUP from KVM trying to fault in
+a guest page during the secondary MMU page fault process.
+
+A similar problem was reported recently again by Jinjiang Tu in March 2025
+[2], even though the race happened this time with a mlockall() operation,
+which does GUP in a similar fashion.
+
+In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the
+uffd without triggering SIGBUS") was trying to fix this issue. AFAIU,
+that fixes the fault paths well but may not work yet for GUP. In GUP, the
+issue is that NOPAGE is treated almost the same as "page fault resolved"
+in faultin_page(); GUP will then follow the page again, see that the page
+is still missing, and keep retrying, ending in a livelock as reported.
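+
+A rough sketch of the GUP livelock (heavily simplified from the
+__get_user_pages()/faultin_page() behaviour; not literal code):
+
+    while (!(page = follow_page(...))) {    /* the page never materializes        */
+        handle_mm_fault(...);               /* uffd returns VM_FAULT_NOPAGE       */
+        /* NOPAGE looks like "fault resolved", so GUP just loops here;            */
+        /* VM_FAULT_RETRY instead drops mmap_lock, letting release() take the     */
+        /* write lock and make progress.                                          */
+    }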
+
+This change makes core mm return RETRY instead of NOPAGE for both the GUP
+and fault paths, proactively releasing the mmap read lock. This should
+guarantee that the release thread makes progress on taking the write lock
+and avoids the livelock even for GUP.
+
+While at it, rearrange the comments to make sure they are up to date.
+
+[1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr
+[2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com
+
+Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com
+Signed-off-by: Peter Xu <peterx@redhat.com>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Cc: Mike Rapoport (IBM) <rppt@kernel.org>
+Cc: Axel Rasmussen <axelrasmussen@google.com>
+Cc: Jinjiang Tu <tujinjiang@huawei.com>
+Cc: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ fs/userfaultfd.c | 51 +++++++++++++++++++++++++--------------------------
+ 1 file changed, 25 insertions(+), 26 deletions(-)
+
+--- a/fs/userfaultfd.c
++++ b/fs/userfaultfd.c
+@@ -452,32 +452,6 @@ vm_fault_t handle_userfault(struct vm_fa
+ goto out;
+
+ /*
+- * If it's already released don't get it. This avoids to loop
+- * in __get_user_pages if userfaultfd_release waits on the
+- * caller of handle_userfault to release the mmap_lock.
+- */
+- if (unlikely(READ_ONCE(ctx->released))) {
+- /*
+- * Don't return VM_FAULT_SIGBUS in this case, so a non
+- * cooperative manager can close the uffd after the
+- * last UFFDIO_COPY, without risking to trigger an
+- * involuntary SIGBUS if the process was starting the
+- * userfaultfd while the userfaultfd was still armed
+- * (but after the last UFFDIO_COPY). If the uffd
+- * wasn't already closed when the userfault reached
+- * this point, that would normally be solved by
+- * userfaultfd_must_wait returning 'false'.
+- *
+- * If we were to return VM_FAULT_SIGBUS here, the non
+- * cooperative manager would be instead forced to
+- * always call UFFDIO_UNREGISTER before it can safely
+- * close the uffd.
+- */
+- ret = VM_FAULT_NOPAGE;
+- goto out;
+- }
+-
+- /*
+ * Check that we can return VM_FAULT_RETRY.
+ *
+ * NOTE: it should become possible to return VM_FAULT_RETRY
+@@ -513,6 +487,31 @@ vm_fault_t handle_userfault(struct vm_fa
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+ goto out;
+
++ if (unlikely(READ_ONCE(ctx->released))) {
++ /*
++ * If a concurrent release is detected, do not return
++ * VM_FAULT_SIGBUS or VM_FAULT_NOPAGE, but instead always
++ * return VM_FAULT_RETRY with lock released proactively.
++ *
++ * If we were to return VM_FAULT_SIGBUS here, the non
++ * cooperative manager would be instead forced to
++ * always call UFFDIO_UNREGISTER before it can safely
++ * close the uffd, to avoid involuntary SIGBUS triggered.
++ *
++ * If we were to return VM_FAULT_NOPAGE, it would work for
++ * the fault path, in which the lock will be released
++ * later. However for GUP, faultin_page() does nothing
++ * special on NOPAGE, so GUP would spin retrying without
++ * releasing the mmap read lock, causing possible livelock.
++ *
++ * Here only VM_FAULT_RETRY would make sure the mmap lock
++ * be released immediately, so that the thread concurrently
++ * releasing the userfault would always make progress.
++ */
++ release_fault_lock(vmf);
++ goto out;
++ }
++
+ /* take the reference before dropping the mmap_lock */
+ userfaultfd_ctx_get(ctx);
+
--- /dev/null
+From f1a69a940de58b16e8249dff26f74c8cc59b32be Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Ricardo=20Ca=C3=B1uelo=20Navarro?= <rcn@igalia.com>
+Date: Fri, 4 Apr 2025 16:53:21 +0200
+Subject: sctp: detect and prevent references to a freed transport in sendmsg
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+From: Ricardo Cañuelo Navarro <rcn@igalia.com>
+
+commit f1a69a940de58b16e8249dff26f74c8cc59b32be upstream.
+
+sctp_sendmsg() re-uses associations and transports when possible by
+doing a lookup based on the socket endpoint and the message destination
+address, and then sctp_sendmsg_to_asoc() sets the selected transport in
+all the message chunks to be sent.
+
+There's a possible race condition if another thread triggers the removal
+of that selected transport, for instance, by explicitly unbinding an
+address with setsockopt(SCTP_SOCKOPT_BINDX_REM), after the chunks have
+been set up and before the message is sent. This can happen if the send
+buffer is full, during the period when the sender thread temporarily
+releases the socket lock in sctp_wait_for_sndbuf().
+
+This causes the access to the transport data in
+sctp_outq_select_transport(), when the association outqueue is flushed,
+to result in a use-after-free read.
+
+This change avoids this scenario by having sctp_transport_free() signal
+the freeing of the transport, tagging it as "dead". In order to do this,
+the patch restores the "dead" bit in struct sctp_transport, which was
+removed in
+commit 47faa1e4c50e ("sctp: remove the dead field of sctp_transport").
+
+Then, in the scenario where the sender thread has released the socket
+lock in sctp_wait_for_sndbuf(), the bit is checked again after
+re-acquiring the socket lock to detect the deletion. This is done while
+holding a reference to the transport to prevent it from being freed in
+the process.
+
+If the transport was deleted while the socket lock was relinquished,
+sctp_sendmsg_to_asoc() will return -EAGAIN to let userspace retry the
+send.
+
+The bug was found by a private syzbot instance (see the error report [1]
+and the C reproducer that triggers it [2]).
+
+Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport.txt [1]
+Link: https://people.igalia.com/rcn/kernel_logs/20250402__KASAN_slab-use-after-free_Read_in_sctp_outq_select_transport__repro.c [2]
+Cc: stable@vger.kernel.org
+Fixes: df132eff4638 ("sctp: clear the transport of some out_chunk_list chunks in sctp_assoc_rm_peer")
+Suggested-by: Xin Long <lucien.xin@gmail.com>
+Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com>
+Acked-by: Xin Long <lucien.xin@gmail.com>
+Link: https://patch.msgid.link/20250404-kasan_slab-use-after-free_read_in_sctp_outq_select_transport__20250404-v1-1-5ce4a0b78ef2@igalia.com
+Signed-off-by: Paolo Abeni <pabeni@redhat.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ include/net/sctp/structs.h | 3 ++-
+ net/sctp/socket.c | 22 ++++++++++++++--------
+ net/sctp/transport.c | 2 ++
+ 3 files changed, 18 insertions(+), 9 deletions(-)
+
+--- a/include/net/sctp/structs.h
++++ b/include/net/sctp/structs.h
+@@ -778,6 +778,7 @@ struct sctp_transport {
+
+ /* Reference counting. */
+ refcount_t refcnt;
++ __u32 dead:1,
+ /* RTO-Pending : A flag used to track if one of the DATA
+ * chunks sent to this address is currently being
+ * used to compute a RTT. If this flag is 0,
+@@ -787,7 +788,7 @@ struct sctp_transport {
+ * calculation completes (i.e. the DATA chunk
+ * is SACK'd) clear this flag.
+ */
+- __u32 rto_pending:1,
++ rto_pending:1,
+
+ /*
+ * hb_sent : a flag that signals that we have a pending
+--- a/net/sctp/socket.c
++++ b/net/sctp/socket.c
+@@ -71,8 +71,9 @@
+ /* Forward declarations for internal helper functions. */
+ static bool sctp_writeable(const struct sock *sk);
+ static void sctp_wfree(struct sk_buff *skb);
+-static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
+- size_t msg_len);
++static int sctp_wait_for_sndbuf(struct sctp_association *asoc,
++ struct sctp_transport *transport,
++ long *timeo_p, size_t msg_len);
+ static int sctp_wait_for_packet(struct sock *sk, int *err, long *timeo_p);
+ static int sctp_wait_for_connect(struct sctp_association *, long *timeo_p);
+ static int sctp_wait_for_accept(struct sock *sk, long timeo);
+@@ -1827,7 +1828,7 @@ static int sctp_sendmsg_to_asoc(struct s
+
+ if (sctp_wspace(asoc) <= 0 || !sk_wmem_schedule(sk, msg_len)) {
+ timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+- err = sctp_wait_for_sndbuf(asoc, &timeo, msg_len);
++ err = sctp_wait_for_sndbuf(asoc, transport, &timeo, msg_len);
+ if (err)
+ goto err;
+ if (unlikely(sinfo->sinfo_stream >= asoc->stream.outcnt)) {
+@@ -9208,8 +9209,9 @@ void sctp_sock_rfree(struct sk_buff *skb
+
+
+ /* Helper function to wait for space in the sndbuf. */
+-static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
+- size_t msg_len)
++static int sctp_wait_for_sndbuf(struct sctp_association *asoc,
++ struct sctp_transport *transport,
++ long *timeo_p, size_t msg_len)
+ {
+ struct sock *sk = asoc->base.sk;
+ long current_timeo = *timeo_p;
+@@ -9219,7 +9221,9 @@ static int sctp_wait_for_sndbuf(struct s
+ pr_debug("%s: asoc:%p, timeo:%ld, msg_len:%zu\n", __func__, asoc,
+ *timeo_p, msg_len);
+
+- /* Increment the association's refcnt. */
++ /* Increment the transport and association's refcnt. */
++ if (transport)
++ sctp_transport_hold(transport);
+ sctp_association_hold(asoc);
+
+ /* Wait on the association specific sndbuf space. */
+@@ -9228,7 +9232,7 @@ static int sctp_wait_for_sndbuf(struct s
+ TASK_INTERRUPTIBLE);
+ if (asoc->base.dead)
+ goto do_dead;
+- if (!*timeo_p)
++ if ((!*timeo_p) || (transport && transport->dead))
+ goto do_nonblock;
+ if (sk->sk_err || asoc->state >= SCTP_STATE_SHUTDOWN_PENDING)
+ goto do_error;
+@@ -9253,7 +9257,9 @@ static int sctp_wait_for_sndbuf(struct s
+ out:
+ finish_wait(&asoc->wait, &wait);
+
+- /* Release the association's refcnt. */
++ /* Release the transport and association's refcnt. */
++ if (transport)
++ sctp_transport_put(transport);
+ sctp_association_put(asoc);
+
+ return err;
+--- a/net/sctp/transport.c
++++ b/net/sctp/transport.c
+@@ -117,6 +117,8 @@ fail:
+ */
+ void sctp_transport_free(struct sctp_transport *transport)
+ {
++ transport->dead = 1;
++
+ /* Try to delete the heartbeat timer. */
+ if (del_timer(&transport->hb_timer))
+ sctp_transport_put(transport);
btrfs-fix-non-empty-delayed-iputs-list-on-unmount-due-to-compressed-write-workers.patch
btrfs-zoned-fix-zone-activation-with-missing-devices.patch
btrfs-zoned-fix-zone-finishing-with-missing-devices.patch
+iommufd-fix-uninitialized-rc-in-iommufd_access_rw.patch
+sparc-mm-disable-preemption-in-lazy-mmu-mode.patch
+sparc-mm-avoid-calling-arch_enter-leave_lazy_mmu-in-set_ptes.patch
+mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch
+mm-make-page_mapped_in_vma-hugetlb-walk-aware.patch
+mm-fix-lazy-mmu-docs-and-usage.patch
+mm-mremap-correctly-handle-partial-mremap-of-vma-starting-at-0.patch
+mm-add-missing-release-barrier-on-pgdat_reclaim_locked-unlock.patch
+mm-userfaultfd-fix-release-hang-over-concurrent-gup.patch
+mm-hwpoison-do-not-send-sigbus-to-processes-with-recovered-clean-pages.patch
+mm-hugetlb-move-hugetlb_sysctl_init-to-the-__init-section.patch
+sctp-detect-and-prevent-references-to-a-freed-transport-in-sendmsg.patch
--- /dev/null
+From eb61ad14c459b54f71f76331ca35d12fa3eb8f98 Mon Sep 17 00:00:00 2001
+From: Ryan Roberts <ryan.roberts@arm.com>
+Date: Mon, 3 Mar 2025 14:15:38 +0000
+Subject: sparc/mm: avoid calling arch_enter/leave_lazy_mmu() in set_ptes
+
+From: Ryan Roberts <ryan.roberts@arm.com>
+
+commit eb61ad14c459b54f71f76331ca35d12fa3eb8f98 upstream.
+
+With commit 1a10a44dfc1d ("sparc64: implement the new page table range
+API") set_ptes was added to the sparc architecture. The implementation
+included calls to arch_enter/leave_lazy_mmu().
+
+This patch removes the calls to arch_enter/leave_lazy_mmu() since they
+imply nesting of lazy mmu regions, which is not supported. Without this
+fix, lazy mmu mode is effectively disabled because we exit the mode after
+the first set_ptes:
+
+remap_pte_range()
+ -> arch_enter_lazy_mmu()
+ -> set_ptes()
+ -> arch_enter_lazy_mmu()
+ -> arch_leave_lazy_mmu()
+ -> arch_leave_lazy_mmu()
+
+Powerpc suffered the same problem and fixed it in a corresponding way with
+commit 47b8def9358c ("powerpc/mm: Avoid calling
+arch_enter/leave_lazy_mmu() in set_ptes").
+
+Link: https://lkml.kernel.org/r/20250303141542.3371656-5-ryan.roberts@arm.com
+Fixes: 1a10a44dfc1d ("sparc64: implement the new page table range API")
+Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
+Acked-by: David Hildenbrand <david@redhat.com>
+Acked-by: Andreas Larsson <andreas@gaisler.com>
+Acked-by: Juergen Gross <jgross@suse.com>
+Cc: Borislav Betkov <bp@alien8.de>
+Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: David S. Miller <davem@davemloft.net>
+Cc: "H. Peter Anvin" <hpa@zytor.com>
+Cc: Ingo Molnar <mingo@redhat.com>
+Cc: Juegren Gross <jgross@suse.com>
+Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
+Cc: Thomas Gleinxer <tglx@linutronix.de>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/sparc/include/asm/pgtable_64.h | 2 --
+ 1 file changed, 2 deletions(-)
+
+--- a/arch/sparc/include/asm/pgtable_64.h
++++ b/arch/sparc/include/asm/pgtable_64.h
+@@ -931,7 +931,6 @@ static inline void __set_pte_at(struct m
+ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+ {
+- arch_enter_lazy_mmu_mode();
+ for (;;) {
+ __set_pte_at(mm, addr, ptep, pte, 0);
+ if (--nr == 0)
+@@ -940,7 +939,6 @@ static inline void set_ptes(struct mm_st
+ pte_val(pte) += PAGE_SIZE;
+ addr += PAGE_SIZE;
+ }
+- arch_leave_lazy_mmu_mode();
+ }
+ #define set_ptes set_ptes
+
--- /dev/null
+From a1d416bf9faf4f4871cb5a943614a07f80a7d70f Mon Sep 17 00:00:00 2001
+From: Ryan Roberts <ryan.roberts@arm.com>
+Date: Mon, 3 Mar 2025 14:15:37 +0000
+Subject: sparc/mm: disable preemption in lazy mmu mode
+
+From: Ryan Roberts <ryan.roberts@arm.com>
+
+commit a1d416bf9faf4f4871cb5a943614a07f80a7d70f upstream.
+
+Since commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy
+updates") it's been possible for arch_[enter|leave]_lazy_mmu_mode() to be
+called without holding a page table lock (for the kernel mappings case),
+and therefore it is possible that preemption may occur while in the lazy
+mmu mode. The Sparc lazy mmu implementation is not robust to preemption
+since it stores the lazy mode state in a per-cpu structure and does not
+attempt to manage that state on task switch.
+
+Powerpc had the same issue and fixed it by explicitly disabling preemption
+in arch_enter_lazy_mmu_mode() and re-enabling in
+arch_leave_lazy_mmu_mode(). See commit b9ef323ea168 ("powerpc/64s:
+Disable preemption in hash lazy mmu mode").
+
+Given Sparc's lazy mmu mode is based on powerpc's, let's fix it in the
+same way here.
+
+Link: https://lkml.kernel.org/r/20250303141542.3371656-4-ryan.roberts@arm.com
+Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
+Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
+Acked-by: David Hildenbrand <david@redhat.com>
+Acked-by: Andreas Larsson <andreas@gaisler.com>
+Acked-by: Juergen Gross <jgross@suse.com>
+Cc: Borislav Betkov <bp@alien8.de>
+Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: David S. Miller <davem@davemloft.net>
+Cc: "H. Peter Anvin" <hpa@zytor.com>
+Cc: Ingo Molnar <mingo@redhat.com>
+Cc: Juegren Gross <jgross@suse.com>
+Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
+Cc: Thomas Gleinxer <tglx@linutronix.de>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/sparc/mm/tlb.c | 5 ++++-
+ 1 file changed, 4 insertions(+), 1 deletion(-)
+
+--- a/arch/sparc/mm/tlb.c
++++ b/arch/sparc/mm/tlb.c
+@@ -52,8 +52,10 @@ out:
+
+ void arch_enter_lazy_mmu_mode(void)
+ {
+- struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
++ struct tlb_batch *tb;
+
++ preempt_disable();
++ tb = this_cpu_ptr(&tlb_batch);
+ tb->active = 1;
+ }
+
+@@ -64,6 +66,7 @@ void arch_leave_lazy_mmu_mode(void)
+ if (tb->tlb_nr)
+ flush_tlb_pending();
+ tb->active = 0;
++ preempt_enable();
+ }
+
+ static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,