mm-vmalloc-separate-put-pages-and-flush-vm-flags.patch
mm-thp-fix-madv_remove-deadlock-on-shmem-thp.patch
mm-filemap-add-missing-mem_cgroup_uncharge-to-__add_to_page_cache_locked.patch
+x86-build-disable-cet-instrumentation-in-the-kernel.patch
+x86-debug-fix-dr6-handling.patch
+x86-debug-prevent-data-breakpoints-on-__per_cpu_offset.patch
+x86-debug-prevent-data-breakpoints-on-cpu_dr7.patch
+x86-apic-add-extra-serialization-for-non-serializing-msrs.patch
--- /dev/null
+From 25a068b8e9a4eb193d755d58efcb3c98928636e0 Mon Sep 17 00:00:00 2001
+From: Dave Hansen <dave.hansen@linux.intel.com>
+Date: Thu, 5 Mar 2020 09:47:08 -0800
+Subject: x86/apic: Add extra serialization for non-serializing MSRs
+
+From: Dave Hansen <dave.hansen@linux.intel.com>
+
+commit 25a068b8e9a4eb193d755d58efcb3c98928636e0 upstream.
+
+Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain
+MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for
+MFENCE; LFENCE.
+
+Short summary: we have special MSRs that have weaker ordering than all
+the rest. Add fencing consistent with current SDM recommendations.
+
+This is not known to cause any issues in practice, only in theory.
+
+Longer story below:
+
+The reason the kernel uses a different semantic is that the SDM changed
+(roughly in late 2017). The SDM changed because folks at Intel were
+auditing all of the recommended fences in the SDM and realized that the
+x2apic fences were insufficient.
+
+Why was the plain MFENCE judged insufficient?
+
+WRMSR itself is normally a serializing instruction. No fences are needed
+because the instruction itself serializes everything.
+
+But, there are explicit exceptions for this serializing behavior written
+into the WRMSR instruction documentation for two classes of MSRs:
+IA32_TSC_DEADLINE and the X2APIC MSRs.
+
+Back to x2apic: WRMSR is *not* serializing in this specific case.
+But why is MFENCE insufficient? MFENCE makes writes visible, but
+only affects load/store instructions. WRMSR is unfortunately not a
+load/store instruction and is unaffected by MFENCE. This means that a
+non-serializing WRMSR could be reordered by the CPU to execute before
+the writes made visible by the MFENCE have even occurred in the first
+place.
+
+This means that an x2apic IPI could theoretically be triggered before
+there is any (visible) data to process.
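+
+A sketch of the hazard (ipi_work, payload and cfg are made-up names
+for illustration):
+
+	this_cpu_write(ipi_work, payload);	/* plain store to memory    */
+	asm volatile("mfence" ::: "memory");	/* orders loads/stores only */
+	native_x2apic_icr_write(cfg, 0);	/* WRMSR: not a load/store, */
+						/* may execute before the   */
+						/* store above is visible   */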
+
+Does this affect anything in practice? I honestly don't know. It seems
+quite possible that by the time an interrupt gets to consume the (not
+yet) MFENCE'd data, it has become visible, mostly by accident.
+
+To be safe, add the SDM-recommended fences for all x2apic WRMSRs.
+
+This also leaves open the question of the _other_ weakly-ordered WRMSR:
+MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as
+the x2APIC MSRs, it seems substantially less likely to be a problem in
+practice. While writes to the in-memory Local Vector Table (LVT) might
+theoretically be reordered with respect to a weakly-ordered WRMSR like
+TSC_DEADLINE, the SDM has this to say:
+
+ In x2APIC mode, the WRMSR instruction is used to write to the LVT
+ entry. The processor ensures the ordering of this write and any
+ subsequent WRMSR to the deadline; no fencing is required.
+
+But, that might still leave xAPIC exposed. The safest thing to do for
+now is to add the extra, recommended LFENCE.
+
+ [ bp: Massage commit message, fix typos, drop accidentally added
+ newline to tools/arch/x86/include/asm/barrier.h. ]
+
+Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
+Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
+Signed-off-by: Borislav Petkov <bp@suse.de>
+Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
+Acked-by: Thomas Gleixner <tglx@linutronix.de>
+Cc: <stable@vger.kernel.org>
+Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/x86/include/asm/apic.h | 10 ----------
+ arch/x86/include/asm/barrier.h | 18 ++++++++++++++++++
+ arch/x86/kernel/apic/apic.c | 4 ++++
+ arch/x86/kernel/apic/x2apic_cluster.c | 6 ++++--
+ arch/x86/kernel/apic/x2apic_phys.c | 9 ++++++---
+ 5 files changed, 32 insertions(+), 15 deletions(-)
+
+--- a/arch/x86/include/asm/apic.h
++++ b/arch/x86/include/asm/apic.h
+@@ -197,16 +197,6 @@ static inline bool apic_needs_pit(void)
+ #endif /* !CONFIG_X86_LOCAL_APIC */
+
+ #ifdef CONFIG_X86_X2APIC
+-/*
+- * Make previous memory operations globally visible before
+- * sending the IPI through x2apic wrmsr. We need a serializing instruction or
+- * mfence for this.
+- */
+-static inline void x2apic_wrmsr_fence(void)
+-{
+- asm volatile("mfence" : : : "memory");
+-}
+-
+ static inline void native_apic_msr_write(u32 reg, u32 v)
+ {
+ if (reg == APIC_DFR || reg == APIC_ID || reg == APIC_LDR ||
+--- a/arch/x86/include/asm/barrier.h
++++ b/arch/x86/include/asm/barrier.h
+@@ -84,4 +84,22 @@ do { \
+
+ #include <asm-generic/barrier.h>
+
++/*
++ * Make previous memory operations globally visible before
++ * a WRMSR.
++ *
++ * MFENCE makes writes visible, but only affects load/store
++ * instructions. WRMSR is unfortunately not a load/store
++ * instruction and is unaffected by MFENCE. The LFENCE ensures
++ * that the WRMSR is not reordered.
++ *
++ * Most WRMSRs are full serializing instructions themselves and
++ * do not require this barrier. This is only required for the
++ * IA32_TSC_DEADLINE and X2APIC MSRs.
++ */
++static inline void weak_wrmsr_fence(void)
++{
++ asm volatile("mfence; lfence" : : : "memory");
++}
++
+ #endif /* _ASM_X86_BARRIER_H */
+--- a/arch/x86/kernel/apic/apic.c
++++ b/arch/x86/kernel/apic/apic.c
+@@ -41,6 +41,7 @@
+ #include <asm/perf_event.h>
+ #include <asm/x86_init.h>
+ #include <linux/atomic.h>
++#include <asm/barrier.h>
+ #include <asm/mpspec.h>
+ #include <asm/i8259.h>
+ #include <asm/proto.h>
+@@ -472,6 +473,9 @@ static int lapic_next_deadline(unsigned
+ {
+ u64 tsc;
+
++	/* This MSR is special and needs a special fence: */
++ weak_wrmsr_fence();
++
+ tsc = rdtsc();
+ wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+ return 0;
+--- a/arch/x86/kernel/apic/x2apic_cluster.c
++++ b/arch/x86/kernel/apic/x2apic_cluster.c
+@@ -29,7 +29,8 @@ static void x2apic_send_IPI(int cpu, int
+ {
+ u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
+
+- x2apic_wrmsr_fence();
++ /* x2apic MSRs are special and need a special fence: */
++ weak_wrmsr_fence();
+ __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+ }
+
+@@ -41,7 +42,8 @@ __x2apic_send_IPI_mask(const struct cpum
+ unsigned long flags;
+ u32 dest;
+
+- x2apic_wrmsr_fence();
++ /* x2apic MSRs are special and need a special fence: */
++ weak_wrmsr_fence();
+ local_irq_save(flags);
+
+ tmpmsk = this_cpu_cpumask_var_ptr(ipi_mask);
+--- a/arch/x86/kernel/apic/x2apic_phys.c
++++ b/arch/x86/kernel/apic/x2apic_phys.c
+@@ -43,7 +43,8 @@ static void x2apic_send_IPI(int cpu, int
+ {
+ u32 dest = per_cpu(x86_cpu_to_apicid, cpu);
+
+- x2apic_wrmsr_fence();
++ /* x2apic MSRs are special and need a special fence: */
++ weak_wrmsr_fence();
+ __x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
+ }
+
+@@ -54,7 +55,8 @@ __x2apic_send_IPI_mask(const struct cpum
+ unsigned long this_cpu;
+ unsigned long flags;
+
+- x2apic_wrmsr_fence();
++ /* x2apic MSRs are special and need a special fence: */
++ weak_wrmsr_fence();
+
+ local_irq_save(flags);
+
+@@ -125,7 +127,8 @@ void __x2apic_send_IPI_shorthand(int vec
+ {
+ unsigned long cfg = __prepare_ICR(which, vector, 0);
+
+- x2apic_wrmsr_fence();
++ /* x2apic MSRs are special and need a special fence: */
++ weak_wrmsr_fence();
+ native_x2apic_icr_write(cfg, 0);
+ }
+
--- /dev/null
+From 20bf2b378729c4a0366a53e2018a0b70ace94bcd Mon Sep 17 00:00:00 2001
+From: Josh Poimboeuf <jpoimboe@redhat.com>
+Date: Thu, 28 Jan 2021 15:52:19 -0600
+Subject: x86/build: Disable CET instrumentation in the kernel
+
+From: Josh Poimboeuf <jpoimboe@redhat.com>
+
+commit 20bf2b378729c4a0366a53e2018a0b70ace94bcd upstream.
+
+With retpolines disabled, some configurations of GCC, specifically
+GCC versions 9 and 10 on Ubuntu, will add Intel CET instrumentation
+to the kernel by default. That breaks certain tracing scenarios by
+adding a superfluous ENDBR64 instruction before the fentry call, for
+functions which can be called indirectly.
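+
+With the instrumentation enabled, a traced function's prologue looks
+roughly like this (illustrative disassembly, not part of this commit):
+
+	<some_function>:
+	  endbr64		# emitted by -fcf-protection
+	  call __fentry__	# no longer at function entry+0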
+
+CET instrumentation isn't currently necessary in the kernel, as CET is
+only supported in user space. Disable it unconditionally and move it
+into the x86 Makefile, as CET/CFI... enablement should be a per-arch
+decision anyway.
+
+ [ bp: Massage and extend commit message. ]
+
+Fixes: 29be86d7f9cb ("kbuild: add -fcf-protection=none when using retpoline flags")
+Reported-by: Nikolay Borisov <nborisov@suse.com>
+Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
+Signed-off-by: Borislav Petkov <bp@suse.de>
+Reviewed-by: Nikolay Borisov <nborisov@suse.com>
+Tested-by: Nikolay Borisov <nborisov@suse.com>
+Cc: <stable@vger.kernel.org>
+Cc: Seth Forshee <seth.forshee@canonical.com>
+Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
+Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ Makefile | 6 ------
+ arch/x86/Makefile | 3 +++
+ 2 files changed, 3 insertions(+), 6 deletions(-)
+
+--- a/Makefile
++++ b/Makefile
+@@ -950,12 +950,6 @@ KBUILD_CFLAGS += $(call cc-option,-Wer
+ # change __FILE__ to the relative path from the srctree
+ KBUILD_CPPFLAGS += $(call cc-option,-fmacro-prefix-map=$(srctree)/=)
+
+-# ensure -fcf-protection is disabled when using retpoline as it is
+-# incompatible with -mindirect-branch=thunk-extern
+-ifdef CONFIG_RETPOLINE
+-KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
+-endif
+-
+ # include additional Makefiles when needed
+ include-y := scripts/Makefile.extrawarn
+ include-$(CONFIG_KASAN) += scripts/Makefile.kasan
+--- a/arch/x86/Makefile
++++ b/arch/x86/Makefile
+@@ -127,6 +127,9 @@ else
+
+ KBUILD_CFLAGS += -mno-red-zone
+ KBUILD_CFLAGS += -mcmodel=kernel
++
++ # Intel CET isn't enabled in the kernel
++ KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
+ endif
+
+ ifdef CONFIG_X86_X32
--- /dev/null
+From 9ad22e165994ccb64d85b68499eaef97342c175b Mon Sep 17 00:00:00 2001
+From: Peter Zijlstra <peterz@infradead.org>
+Date: Thu, 28 Jan 2021 22:16:27 +0100
+Subject: x86/debug: Fix DR6 handling
+
+From: Peter Zijlstra <peterz@infradead.org>
+
+commit 9ad22e165994ccb64d85b68499eaef97342c175b upstream.
+
+Tom reported that one of the GDB test-cases failed, and Boris bisected
+it to commit:
+
+ d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
+
+The debugging session led us to commit:
+
+ 6c0aca288e72 ("x86: Ignore trap bits on single step exceptions")
+
+It turns out that TF and data breakpoints are both traps and will be
+merged, while instruction breakpoints are faults and will not be merged.
+This means 6c0aca288e72 is wrong: only TF and instruction breakpoints
+need to be excluded, while TF and data breakpoints can be merged.
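+
+For example, with the bit names from arch/x86/include/asm/debugreg.h,
+a data watchpoint hitting in DR0 during a single step raises one
+merged exception:
+
+	dr6 == (DR_TRAP0 | DR_STEP)	/* two traps, one #DB */
+
+whereas DR_STEP set together with an instruction breakpoint bit must
+not be consumed in the same exception.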
+
+ [ bp: Massage commit message. ]
+
+Fixes: d53d9bc0cf78 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6")
+Fixes: 6c0aca288e72 ("x86: Ignore trap bits on single step exceptions")
+Reported-by: Tom de Vries <tdevries@suse.de>
+Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
+Signed-off-by: Borislav Petkov <bp@suse.de>
+Cc: <stable@vger.kernel.org>
+Link: https://lkml.kernel.org/r/YBMAbQGACujjfz%2Bi@hirez.programming.kicks-ass.net
+Link: https://lkml.kernel.org/r/20210128211627.GB4348@worktop.programming.kicks-ass.net
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/x86/kernel/hw_breakpoint.c | 39 ++++++++++++++++++---------------------
+ 1 file changed, 18 insertions(+), 21 deletions(-)
+
+--- a/arch/x86/kernel/hw_breakpoint.c
++++ b/arch/x86/kernel/hw_breakpoint.c
+@@ -491,15 +491,12 @@ static int hw_breakpoint_handler(struct
+ struct perf_event *bp;
+ unsigned long *dr6_p;
+ unsigned long dr6;
++ bool bpx;
+
+ /* The DR6 value is pointed by args->err */
+ dr6_p = (unsigned long *)ERR_PTR(args->err);
+ dr6 = *dr6_p;
+
+- /* If it's a single step, TRAP bits are random */
+- if (dr6 & DR_STEP)
+- return NOTIFY_DONE;
+-
+ /* Do an early return if no trap bits are set in DR6 */
+ if ((dr6 & DR_TRAP_BITS) == 0)
+ return NOTIFY_DONE;
+@@ -509,28 +506,29 @@ static int hw_breakpoint_handler(struct
+ if (likely(!(dr6 & (DR_TRAP0 << i))))
+ continue;
+
++ bp = this_cpu_read(bp_per_reg[i]);
++ if (!bp)
++ continue;
++
++ bpx = bp->hw.info.type == X86_BREAKPOINT_EXECUTE;
++
+ /*
+- * The counter may be concurrently released but that can only
+- * occur from a call_rcu() path. We can then safely fetch
+- * the breakpoint, use its callback, touch its counter
+- * while we are in an rcu_read_lock() path.
++	 * TF and data breakpoints are traps and can be merged, whereas
++ * instruction breakpoints are faults and will be raised
++ * separately.
++ *
++ * However DR6 can indicate both TF and instruction
++ * breakpoints. In that case take TF as that has precedence and
++ * delay the instruction breakpoint for the next exception.
+ */
+- rcu_read_lock();
++ if (bpx && (dr6 & DR_STEP))
++ continue;
+
+- bp = this_cpu_read(bp_per_reg[i]);
+ /*
+ * Reset the 'i'th TRAP bit in dr6 to denote completion of
+ * exception handling
+ */
+ (*dr6_p) &= ~(DR_TRAP0 << i);
+- /*
+- * bp can be NULL due to lazy debug register switching
+- * or due to concurrent perf counter removing.
+- */
+- if (!bp) {
+- rcu_read_unlock();
+- break;
+- }
+
+ perf_bp_event(bp, args->regs);
+
+@@ -538,11 +536,10 @@ static int hw_breakpoint_handler(struct
+ * Set up resume flag to avoid breakpoint recursion when
+ * returning back to origin.
+ */
+- if (bp->hw.info.type == X86_BREAKPOINT_EXECUTE)
++ if (bpx)
+ args->regs->flags |= X86_EFLAGS_RF;
+-
+- rcu_read_unlock();
+ }
++
+ /*
+ * Further processing in do_debug() is needed for a) user-space
+ * breakpoints (to generate signals) and b) when the system has
--- /dev/null
+From c4bed4b96918ff1d062ee81fdae4d207da4fa9b0 Mon Sep 17 00:00:00 2001
+From: Lai Jiangshan <laijs@linux.alibaba.com>
+Date: Thu, 4 Feb 2021 23:27:06 +0800
+Subject: x86/debug: Prevent data breakpoints on __per_cpu_offset
+
+From: Lai Jiangshan <laijs@linux.alibaba.com>
+
+commit c4bed4b96918ff1d062ee81fdae4d207da4fa9b0 upstream.
+
+When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value
+via __per_cpu_offset or pcpu_unit_offsets.
+
+When a data breakpoint is set on __per_cpu_offset[cpu] (read-write
+operation), the specific CPU will be stuck in an infinite #DB loop.
+
+RCU will try to send an NMI to the specific CPU, but that does not work
+either, since NMI entry also relies on paranoid_entry(), which makes
+the situation undebuggable.
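+
+The resulting loop, sketched:
+
+	#DB -> paranoid_entry() -> read __per_cpu_offset[cpu]
+	    -> data breakpoint fires -> #DB -> paranoid_entry() -> ...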
+
+Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
+Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
+Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
+Cc: stable@vger.kernel.org
+Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/x86/kernel/hw_breakpoint.c | 14 ++++++++++++++
+ 1 file changed, 14 insertions(+)
+
+--- a/arch/x86/kernel/hw_breakpoint.c
++++ b/arch/x86/kernel/hw_breakpoint.c
+@@ -269,6 +269,20 @@ static inline bool within_cpu_entry(unsi
+ CPU_ENTRY_AREA_TOTAL_SIZE))
+ return true;
+
++ /*
++ * When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU
++ * GSBASE value via __per_cpu_offset or pcpu_unit_offsets.
++ */
++#ifdef CONFIG_SMP
++ if (within_area(addr, end, (unsigned long)__per_cpu_offset,
++ sizeof(unsigned long) * nr_cpu_ids))
++ return true;
++#else
++ if (within_area(addr, end, (unsigned long)&pcpu_unit_offsets,
++ sizeof(pcpu_unit_offsets)))
++ return true;
++#endif
++
+ for_each_possible_cpu(cpu) {
+ /* The original rw GDT is being used after load_direct_gdt() */
+ if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
--- /dev/null
+From 3943abf2dbfae9ea4d2da05c1db569a0603f76da Mon Sep 17 00:00:00 2001
+From: Lai Jiangshan <laijs@linux.alibaba.com>
+Date: Thu, 4 Feb 2021 23:27:07 +0800
+Subject: x86/debug: Prevent data breakpoints on cpu_dr7
+
+From: Lai Jiangshan <laijs@linux.alibaba.com>
+
+commit 3943abf2dbfae9ea4d2da05c1db569a0603f76da upstream.
+
+local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and
+disables breakpoints to prevent recursion.
+
+When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the
+per-cpu variable cpu_dr7 to check whether a breakpoint is active or not
+before it accesses DR7.
+
+A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion.
+
+Disallow data breakpoints on cpu_dr7 to prevent that.
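+
+A sketch of the problematic read (abridged from the current
+local_db_save(); details elided):
+
+	static __always_inline unsigned long local_db_save(void)
+	{
+		unsigned long dr7;
+
+		/* This per-CPU read can itself hit the data breakpoint: */
+		if (static_cpu_has(X86_FEATURE_HYPERVISOR) &&
+		    !this_cpu_read(cpu_dr7))
+			return 0;
+
+		get_debugreg(dr7, 7);
+		/* ... disable breakpoints, return the saved value ... */
+		return dr7;
+	}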
+
+Fixes: 84b6a3491567a ("x86/entry: Optimize local_db_save() for virt")
+Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
+Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
+Cc: stable@vger.kernel.org
+Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ arch/x86/kernel/hw_breakpoint.c | 8 ++++++++
+ 1 file changed, 8 insertions(+)
+
+--- a/arch/x86/kernel/hw_breakpoint.c
++++ b/arch/x86/kernel/hw_breakpoint.c
+@@ -307,6 +307,14 @@ static inline bool within_cpu_entry(unsi
+ (unsigned long)&per_cpu(cpu_tlbstate, cpu),
+ sizeof(struct tlb_state)))
+ return true;
++
++ /*
++	 * When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save()
++	 * reads the per-CPU cpu_dr7 before clearing the DR7 register.
++ */
++ if (within_area(addr, end, (unsigned long)&per_cpu(cpu_dr7, cpu),
++ sizeof(cpu_dr7)))
++ return true;
+ }
+
+ return false;