From: Greg Kroah-Hartman
Date: Fri, 5 Apr 2024 09:35:50 +0000 (+0200)
Subject: 5.15-stable patches
X-Git-Tag: v5.15.154~103
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=c8a76974963b74ad0615d06737f67d0a2b88adc8;p=thirdparty%2Fkernel%2Fstable-queue.git

5.15-stable patches

added patches:
kvm-x86-bail-to-userspace-if-emulation-of-atomic-user-access-faults.patch
kvm-x86-mark-target-gfn-of-emulated-atomic-instruction-as-dirty.patch
mm-vmscan-prevent-infinite-loop-for-costly-gfp_noio-__gfp_retry_mayfail-allocations.patch
thermal-devfreq_cooling-fix-perf-state-when-calculate-dfc-res_util.patch
---

diff --git a/queue-5.15/kvm-x86-bail-to-userspace-if-emulation-of-atomic-user-access-faults.patch b/queue-5.15/kvm-x86-bail-to-userspace-if-emulation-of-atomic-user-access-faults.patch
new file mode 100644
index 00000000000..e6419bbfd95
--- /dev/null
+++ b/queue-5.15/kvm-x86-bail-to-userspace-if-emulation-of-atomic-user-access-faults.patch
@@ -0,0 +1,33 @@
+From 5d6c7de6446e9ab3fb41d6f7d82770e50998f3de Mon Sep 17 00:00:00 2001
+From: Sean Christopherson
+Date: Wed, 2 Feb 2022 00:49:45 +0000
+Subject: KVM: x86: Bail to userspace if emulation of atomic user access faults
+
+From: Sean Christopherson
+
+commit 5d6c7de6446e9ab3fb41d6f7d82770e50998f3de upstream.
+
+Exit to userspace when emulating an atomic guest access if the CMPXCHG on
+the userspace address faults.  Emulating the access as a write and thus
+likely treating it as emulated MMIO is wrong, as KVM has already
+confirmed there is a valid, writable memslot.
+
+Signed-off-by: Sean Christopherson
+Message-Id: <20220202004945.2540433-6-seanjc@google.com>
+Signed-off-by: Paolo Bonzini
+Signed-off-by: Greg Kroah-Hartman
+---
+ arch/x86/kvm/x86.c |    2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/arch/x86/kvm/x86.c
++++ b/arch/x86/kvm/x86.c
+@@ -7108,7 +7108,7 @@ static int emulator_cmpxchg_emulated(str
+ 	}
+
+ 	if (r < 0)
+-		goto emul_write;
++		return X86EMUL_UNHANDLEABLE;
+ 	if (r)
+ 		return X86EMUL_CMPXCHG_FAILED;
+
diff --git a/queue-5.15/kvm-x86-mark-target-gfn-of-emulated-atomic-instruction-as-dirty.patch b/queue-5.15/kvm-x86-mark-target-gfn-of-emulated-atomic-instruction-as-dirty.patch
new file mode 100644
index 00000000000..1130f1018dc
--- /dev/null
+++ b/queue-5.15/kvm-x86-mark-target-gfn-of-emulated-atomic-instruction-as-dirty.patch
@@ -0,0 +1,62 @@
+From 910c57dfa4d113aae6571c2a8b9ae8c430975902 Mon Sep 17 00:00:00 2001
+From: Sean Christopherson
+Date: Wed, 14 Feb 2024 17:00:03 -0800
+Subject: KVM: x86: Mark target gfn of emulated atomic instruction as dirty
+
+From: Sean Christopherson
+
+commit 910c57dfa4d113aae6571c2a8b9ae8c430975902 upstream.
+
+When emulating an atomic access on behalf of the guest, mark the target
+gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault.  This
+fixes a bug where KVM effectively corrupts guest memory during live
+migration by writing to guest memory without informing userspace that the
+page is dirty.
+
+Marking the page dirty got unintentionally dropped when KVM's emulated
+CMPXCHG was converted to do a user access.  Before that, KVM explicitly
+mapped the guest page into kernel memory, and marked the page dirty during
+the unmap phase.
+
+Mark the page dirty even if the CMPXCHG fails, as the old data is written
+back on failure, i.e. the page is still written.  The value written is
+guaranteed to be the same because the operation is atomic, but KVM's ABI
+is that all writes are dirty logged regardless of the value written.  And
+more importantly, that's what KVM did before the buggy commit.
+
+Huge kudos to the folks on the Cc list (and many others), who did all the
+actual work of triaging and debugging.
+
+Fixes: 1c2361f667f3 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
+Cc: stable@vger.kernel.org
+Cc: David Matlack
+Cc: Pasha Tatashin
+Cc: Michael Krebs
+base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64
+Reviewed-by: Jim Mattson
+Link: https://lore.kernel.org/r/20240215010004.1456078-2-seanjc@google.com
+Signed-off-by: Sean Christopherson
+Signed-off-by: Greg Kroah-Hartman
+---
+ arch/x86/kvm/x86.c |   10 ++++++++++
+ 1 file changed, 10 insertions(+)
+
+--- a/arch/x86/kvm/x86.c
++++ b/arch/x86/kvm/x86.c
+@@ -7109,6 +7109,16 @@ static int emulator_cmpxchg_emulated(str
+
+ 	if (r < 0)
+ 		return X86EMUL_UNHANDLEABLE;
++
++	/*
++	 * Mark the page dirty _before_ checking whether or not the CMPXCHG was
++	 * successful, as the old value is written back on failure.  Note, for
++	 * live migration, this is unnecessarily conservative as CMPXCHG writes
++	 * back the original value and the access is atomic, but KVM's ABI is
++	 * that all writes are dirty logged, regardless of the value written.
++	 */
++	kvm_vcpu_mark_page_dirty(vcpu, gpa_to_gfn(gpa));
++
+ 	if (r)
+ 		return X86EMUL_CMPXCHG_FAILED;
+
diff --git a/queue-5.15/mm-vmscan-prevent-infinite-loop-for-costly-gfp_noio-__gfp_retry_mayfail-allocations.patch b/queue-5.15/mm-vmscan-prevent-infinite-loop-for-costly-gfp_noio-__gfp_retry_mayfail-allocations.patch
new file mode 100644
index 00000000000..f83a9eebba8
--- /dev/null
+++ b/queue-5.15/mm-vmscan-prevent-infinite-loop-for-costly-gfp_noio-__gfp_retry_mayfail-allocations.patch
@@ -0,0 +1,188 @@
+From 803de9000f334b771afacb6ff3e78622916668b0 Mon Sep 17 00:00:00 2001
+From: Vlastimil Babka
+Date: Wed, 21 Feb 2024 12:43:58 +0100
+Subject: mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
+
+From: Vlastimil Babka
+
+commit 803de9000f334b771afacb6ff3e78622916668b0 upstream.
+
+Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
+__GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO.  Such combination
+can happen in a suspend/resume context where a GFP_KERNEL allocation can
+have __GFP_IO masked out via gfp_allowed_mask.
+
+Quoting Sven:
+
+1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
+   with __GFP_RETRY_MAYFAIL set.
+
+2. page alloc's __alloc_pages_slowpath tries to get a page from the
+   freelist.  This fails because there is nothing free of that costly
+   order.
+
+3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
+   which bails out because a zone is ready to be compacted; it pretends
+   to have made a single page of progress.
+
+4. page alloc tries to compact, but this always bails out early because
+   __GFP_IO is not set (it's not passed by the snd allocator, and even
+   if it were, we are suspending so the __GFP_IO flag would be cleared
+   anyway).
+
+5. page alloc believes reclaim progress was made (because of the
+   pretense in item 3) and so it checks whether it should retry
+   compaction.  The compaction retry logic thinks it should try again,
+   because:
+    a) reclaim is needed because of the early bail-out in item 4
+    b) a zonelist is suitable for compaction
+
+6. goto 2.  indefinite stall.
+
+(end quote)
+
+The immediate root cause is confusing the COMPACT_SKIPPED returned from
+__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
+indicating a lack of order-0 pages, and in step 5 evaluating that in
+should_compact_retry() as a reason to retry, before incrementing and
+limiting the number of retries.  There are however other places that
+wrongly assume that compaction can happen while we lack __GFP_IO.
+
+To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
+evaluation and switch the open-coded test in try_to_compact_pages() to use
+it.
+
+Also use the new helper in:
+- compaction_ready(), which will make reclaim not bail out in step 3, so
+  there's at least one attempt to actually reclaim, even if chances are
+  small for a costly order
+- in_reclaim_compaction() which will make should_continue_reclaim()
+  return false and we don't over-reclaim unnecessarily
+- in __alloc_pages_slowpath() to set a local variable can_compact,
+  which is then used to avoid retrying reclaim/compaction for costly
+  allocations (step 5) if we can't compact and also to skip the early
+  compaction attempt that we do in some cases
+
+Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
+Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
+Signed-off-by: Vlastimil Babka
+Reported-by: Sven van Ashbrook
+Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/
+Tested-by: Karthikeyan Ramasubramanian
+Cc: Brian Geffon
+Cc: Curtis Malainey
+Cc: Jaroslav Kysela
+Cc: Mel Gorman
+Cc: Michal Hocko
+Cc: Takashi Iwai
+Cc:
+Signed-off-by: Andrew Morton
+Signed-off-by: Vlastimil Babka
+Signed-off-by: Greg Kroah-Hartman
+---
+ include/linux/gfp.h |    9 +++++++++
+ mm/compaction.c     |    7 +------
+ mm/page_alloc.c     |   10 ++++++----
+ mm/vmscan.c         |    5 ++++-
+ 4 files changed, 20 insertions(+), 11 deletions(-)
+
+--- a/include/linux/gfp.h
++++ b/include/linux/gfp.h
+@@ -660,6 +660,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_ma
+ extern void pm_restrict_gfp_mask(void);
+ extern void pm_restore_gfp_mask(void);
+
++/*
++ * Check if the gfp flags allow compaction - GFP_NOIO is a really
++ * tricky context because the migration might require IO.
++ */
++static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
++{
++	return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO);
++}
++
+ extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
+
+ #ifdef CONFIG_PM_SLEEP
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -2582,16 +2582,11 @@ enum compact_result try_to_compact_pages
+ 		unsigned int alloc_flags, const struct alloc_context *ac,
+ 		enum compact_priority prio, struct page **capture)
+ {
+-	int may_perform_io = gfp_mask & __GFP_IO;
+ 	struct zoneref *z;
+ 	struct zone *zone;
+ 	enum compact_result rc = COMPACT_SKIPPED;
+
+-	/*
+-	 * Check if the GFP flags allow compaction - GFP_NOIO is really
+-	 * tricky context because the migration might require IO
+-	 */
+-	if (!may_perform_io)
++	if (!gfp_compaction_allowed(gfp_mask))
+ 		return COMPACT_SKIPPED;
+
+ 	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
+--- a/mm/page_alloc.c
++++ b/mm/page_alloc.c
+@@ -4903,6 +4903,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
+ 						struct alloc_context *ac)
+ {
+ 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
++	bool can_compact = gfp_compaction_allowed(gfp_mask);
+ 	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
+ 	struct page *page = NULL;
+ 	unsigned int alloc_flags;
+@@ -4968,7 +4969,7 @@ restart:
+ 	 * Don't try this for allocations that are allowed to ignore
+ 	 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
+ 	 */
+-	if (can_direct_reclaim &&
++	if (can_direct_reclaim && can_compact &&
+ 			(costly_order ||
+ 			   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
+ 			&& !gfp_pfmemalloc_allowed(gfp_mask)) {
+@@ -5065,9 +5066,10 @@ retry:
+
+ 	/*
+ 	 * Do not retry costly high order allocations unless they are
+-	 * __GFP_RETRY_MAYFAIL
++	 * __GFP_RETRY_MAYFAIL and we can compact
+ 	 */
+-	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
++	if (costly_order && (!can_compact ||
++			     !(gfp_mask & __GFP_RETRY_MAYFAIL)))
+ 		goto nopage;
+
+ 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+@@ -5080,7 +5082,7 @@ retry:
+ 	 * implementation of the compaction depends on the sufficient amount
+ 	 * of free memory (see __compaction_suitable)
+ 	 */
+-	if (did_some_progress > 0 &&
++	if (did_some_progress > 0 && can_compact &&
+ 			should_compact_retry(ac, order, alloc_flags,
+ 				compact_result, &compact_priority,
+ 				&compaction_retries))
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -2834,7 +2834,7 @@ static void shrink_lruvec(struct lruvec
+ /* Use reclaim/compaction for costly allocs or under memory pressure */
+ static bool in_reclaim_compaction(struct scan_control *sc)
+ {
+-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
++	if (gfp_compaction_allowed(sc->gfp_mask) && sc->order &&
+ 	    (sc->order > PAGE_ALLOC_COSTLY_ORDER ||
+ 	     sc->priority < DEF_PRIORITY - 2))
+ 		return true;
+@@ -3167,6 +3167,9 @@ static inline bool compaction_ready(stru
+ 	unsigned long watermark;
+ 	enum compact_result suitable;
+
++	if (!gfp_compaction_allowed(sc->gfp_mask))
++		return false;
++
+ 	suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
+ 	if (suitable == COMPACT_SUCCESS)
+ 		/* Allocation should succeed already.  Don't reclaim. */
diff --git a/queue-5.15/series b/queue-5.15/series
index a76e92a471f..89ba4088b1b 100644
--- a/queue-5.15/series
+++ b/queue-5.15/series
@@ -629,3 +629,7 @@ net-rds-fix-possible-cp-null-dereference.patch
 locking-rwsem-disable-preemption-while-trying-for-rwsem-lock.patch
 io_uring-ensure-0-is-returned-on-file-registration-success.patch
 revert-x86-mm-ident_map-use-gbpages-only-where-full-gb-page-should-be-mapped.patch
+mm-vmscan-prevent-infinite-loop-for-costly-gfp_noio-__gfp_retry_mayfail-allocations.patch
+thermal-devfreq_cooling-fix-perf-state-when-calculate-dfc-res_util.patch
+kvm-x86-bail-to-userspace-if-emulation-of-atomic-user-access-faults.patch
+kvm-x86-mark-target-gfn-of-emulated-atomic-instruction-as-dirty.patch
diff --git a/queue-5.15/thermal-devfreq_cooling-fix-perf-state-when-calculate-dfc-res_util.patch b/queue-5.15/thermal-devfreq_cooling-fix-perf-state-when-calculate-dfc-res_util.patch
new file mode 100644
index 00000000000..37fbf1fe306
--- /dev/null
+++ b/queue-5.15/thermal-devfreq_cooling-fix-perf-state-when-calculate-dfc-res_util.patch
@@ -0,0 +1,42 @@
+From a26de34b3c77ae3a969654d94be49e433c947e3b Mon Sep 17 00:00:00 2001
+From: Ye Zhang
+Date: Thu, 21 Mar 2024 18:21:00 +0800
+Subject: thermal: devfreq_cooling: Fix perf state when calculate dfc res_util
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+From: Ye Zhang
+
+commit a26de34b3c77ae3a969654d94be49e433c947e3b upstream.
+
+The issue occurs when the devfreq cooling device uses the EM power model
+and the get_real_power() callback is provided by the driver.
+
+The EM power table is sorted in ascending order and can't be indexed by
+cooling device state directly, so convert the cooling state to a
+performance state via dfc->max_state - dfc->capped_state.
+
+Fixes: 615510fe13bd ("thermal: devfreq_cooling: remove old power model and use EM")
+Cc: 5.11+ # 5.11+
+Signed-off-by: Ye Zhang
+Reviewed-by: Dhruva Gole
+Reviewed-by: Lukasz Luba
+Signed-off-by: Rafael J. Wysocki
+Signed-off-by: Lukasz Luba
+Signed-off-by: Greg Kroah-Hartman
+---
+ drivers/thermal/devfreq_cooling.c |    2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+--- a/drivers/thermal/devfreq_cooling.c
++++ b/drivers/thermal/devfreq_cooling.c
+@@ -199,7 +199,7 @@ static int devfreq_cooling_get_requested
+
+ 	res = dfc->power_ops->get_real_power(df, power, freq, voltage);
+ 	if (!res) {
+-		state = dfc->capped_state;
++		state = dfc->max_state - dfc->capped_state;
+ 		dfc->res_util = dfc->em_pd->table[state].power;
+ 		dfc->res_util *= SCALE_ERROR_MITIGATION;
+