From: Greg Kroah-Hartman
Date: Sat, 6 Sep 2025 19:06:10 +0000 (+0200)
Subject: 5.15-stable patches
X-Git-Tag: v5.4.299~50
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=ff602c14c819fe188fe719f8776265761bdb8ad6;p=thirdparty%2Fkernel%2Fstable-queue.git

5.15-stable patches

added patches:
	mm-move-page-table-sync-declarations-to-linux-pgtable.h.patch
	x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch
---

diff --git a/queue-5.15/mm-move-page-table-sync-declarations-to-linux-pgtable.h.patch b/queue-5.15/mm-move-page-table-sync-declarations-to-linux-pgtable.h.patch
new file mode 100644
index 0000000000..6a50fe4bbc
--- /dev/null
+++ b/queue-5.15/mm-move-page-table-sync-declarations-to-linux-pgtable.h.patch
@@ -0,0 +1,216 @@
+From 7cc183f2e67d19b03ee5c13a6664b8c6cc37ff9d Mon Sep 17 00:00:00 2001
+From: Harry Yoo
+Date: Mon, 18 Aug 2025 11:02:04 +0900
+Subject: mm: move page table sync declarations to linux/pgtable.h
+
+From: Harry Yoo
+
+commit 7cc183f2e67d19b03ee5c13a6664b8c6cc37ff9d upstream.
+
+During our internal testing, we started observing intermittent boot
+failures when the machine uses 4-level paging and has a large amount of
+persistent memory:
+
+  BUG: unable to handle page fault for address: ffffe70000000034
+  #PF: supervisor write access in kernel mode
+  #PF: error_code(0x0002) - not-present page
+  PGD 0 P4D 0
+  Oops: 0002 [#1] SMP NOPTI
+  RIP: 0010:__init_single_page+0x9/0x6d
+  Call Trace:
+   <TASK>
+   __init_zone_device_page+0x17/0x5d
+   memmap_init_zone_device+0x154/0x1bb
+   pagemap_range+0x2e0/0x40f
+   memremap_pages+0x10b/0x2f0
+   devm_memremap_pages+0x1e/0x60
+   dev_dax_probe+0xce/0x2ec [device_dax]
+   dax_bus_probe+0x6d/0xc9
+   [... snip ...]
+   </TASK>
+
+It turns out that the kernel panics while initializing vmemmap (struct
+page array) when the vmemmap region spans two PGD entries, because the new
+PGD entry is only installed in init_mm.pgd, but not in the page tables of
+other tasks.
+
+And looking at __populate_section_memmap():
+	if (vmemmap_can_optimize(altmap, pgmap))
+		// does not sync top level page tables
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		// sync top level page tables in x86
+		r = vmemmap_populate(start, end, nid, altmap);
+
+In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c
+synchronizes the top level page table (see commit 9b861528a801 ("x86-64,
+mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so
+that all tasks in the system can see the new vmemmap area.
+
+However, when vmemmap_can_optimize() returns true, the optimized path
+skips synchronization of top-level page tables. This is because
+vmemmap_populate_compound_pages() is implemented in core MM code, which
+does not handle synchronization of the top-level page tables. Instead,
+the core MM has historically relied on each architecture to perform this
+synchronization manually.
+
+We're not the first party to encounter a crash caused by not-sync'd top
+level page tables: earlier this year, Gwan-gyeong Mun attempted to address
+the issue [1] [2] after hitting a kernel panic when x86 code accessed the
+vmemmap area before the corresponding top-level entries were synced. At
+that time, the issue was believed to be triggered only when struct page
+was enlarged for debugging purposes, and the patch did not get further
+updates.
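+
+For illustration, the "synchronization" above boils down to copying the
+new top-level entry from init_mm into every other page table in the
+system. A simplified sketch of the 4-level case (illustrative only, not
+part of this patch; the real version in arch/x86/mm/init_64.c takes
+pgd_lock and operates on the p4d entries folded under the pgd):
+
+	static void sync_global_pgds_l4_sketch(unsigned long start,
+					       unsigned long end)
+	{
+		unsigned long addr;
+
+		for (addr = start; addr <= end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
+			pgd_t *pgd_ref = pgd_offset_k(addr);	/* init_mm's entry */
+			struct page *page;
+
+			if (pgd_none(*pgd_ref))
+				continue;
+
+			/* Copy the new entry into every other PGD in the system. */
+			list_for_each_entry(page, &pgd_list, lru) {
+				pgd_t *pgd = (pgd_t *)page_address(page) + pgd_index(addr);
+
+				if (pgd_none(*pgd))
+					set_pgd(pgd, *pgd_ref);
+			}
+		}
+	}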
+
+It turns out that the current approach of relying on each arch to handle
+the page table sync manually is fragile because 1) it's easy to forget to
+sync the top level page table, and 2) it's also easy to overlook that the
+kernel should not access the vmemmap and direct mapping areas before the
+sync.
+
+# The solution: Make page table sync code more robust and harder to miss
+
+To address this, Dave Hansen suggested [3] [4] introducing
+{pgd,p4d}_populate_kernel() for updating the kernel portion of the page
+tables, allowing each architecture to explicitly perform synchronization
+when installing top-level entries. With this approach, we no longer need
+to worry about missing the sync step, reducing the risk of future
+regressions.
+
+The new interface reuses the existing ARCH_PAGE_TABLE_SYNC_MASK,
+PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facility used by
+vmalloc and ioremap to synchronize page tables.
+
+pgd_populate_kernel() looks like this:
+
+	static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd,
+					       p4d_t *p4d)
+	{
+		pgd_populate(&init_mm, pgd, p4d);
+		if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED)
+			arch_sync_kernel_mappings(addr, addr);
+	}
+
+It is worth noting that vmalloc() and apply_to_range() carefully
+synchronize page tables by calling p*d_alloc_track() and
+arch_sync_kernel_mappings(), and thus they are not affected by this patch
+series.
+
+This series was hugely inspired by Dave Hansen's suggestion and hence
+adds Suggested-by: Dave Hansen.
+
+Cc stable because the lack of this series opens the door to intermittent
+boot failures.
+
+
+This patch (of 3):
+
+Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to
+linux/pgtable.h so that they can be used outside of vmalloc and ioremap.
+
+Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com
+Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com
+Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1]
+Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2]
+Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3]
+Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4]
+Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
+Signed-off-by: Harry Yoo
+Acked-by: Kiryl Shutsemau
+Reviewed-by: Mike Rapoport (Microsoft)
+Reviewed-by: "Uladzislau Rezki (Sony)"
+Reviewed-by: Lorenzo Stoakes
+Acked-by: David Hildenbrand
+Cc: Alexander Potapenko
+Cc: Alistair Popple
+Cc: Andrey Konovalov
+Cc: Andrey Ryabinin
+Cc: Andy Lutomirski
+Cc: "Aneesh Kumar K.V"
+Cc: Anshuman Khandual
+Cc: Ard Biesheuvel
+Cc: Arnd Bergmann
+Cc: bibo mao
+Cc: Borislav Petkov
+Cc: Christoph Lameter (Ampere)
+Cc: Dennis Zhou
+Cc: Dev Jain
+Cc: Dmitriy Vyukov
+Cc: Gwan-gyeong Mun
+Cc: Ingo Molnar
+Cc: Jane Chu
+Cc: Joao Martins
+Cc: Joerg Roedel
+Cc: John Hubbard
+Cc: Kevin Brodsky
+Cc: Liam Howlett
+Cc: Michal Hocko
+Cc: Oscar Salvador
+Cc: Peter Xu
+Cc: Peter Zijlstra
+Cc: Qi Zheng
+Cc: Ryan Roberts
+Cc: Suren Baghdasaryan
+Cc: Tejun Heo
+Cc: Thomas Gleixner
+Cc: Thomas Huth
+Cc: Vincenzo Frascino
+Cc: Vlastimil Babka
+Cc: Dave Hansen
+Cc:
+Signed-off-by: Andrew Morton
+Signed-off-by: Greg Kroah-Hartman
+---
+ include/linux/pgtable.h |   16 ++++++++++++++++
+ include/linux/vmalloc.h |   16 ----------------
+ 2 files changed, 16 insertions(+), 16 deletions(-)
+
+--- a/include/linux/pgtable.h
++++ b/include/linux/pgtable.h
+@@ -1380,6 +1380,22 
@@ static inline int pmd_protnone(pmd_t pmd + } + #endif /* CONFIG_NUMA_BALANCING */ + ++/* ++ * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values ++ * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() ++ * needs to be called. ++ */ ++#ifndef ARCH_PAGE_TABLE_SYNC_MASK ++#define ARCH_PAGE_TABLE_SYNC_MASK 0 ++#endif ++ ++/* ++ * There is no default implementation for arch_sync_kernel_mappings(). It is ++ * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK ++ * is 0. ++ */ ++void arch_sync_kernel_mappings(unsigned long start, unsigned long end); ++ + #endif /* CONFIG_MMU */ + + #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +--- a/include/linux/vmalloc.h ++++ b/include/linux/vmalloc.h +@@ -180,22 +180,6 @@ extern int remap_vmalloc_range(struct vm + unsigned long pgoff); + + /* +- * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values +- * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() +- * needs to be called. +- */ +-#ifndef ARCH_PAGE_TABLE_SYNC_MASK +-#define ARCH_PAGE_TABLE_SYNC_MASK 0 +-#endif +- +-/* +- * There is no default implementation for arch_sync_kernel_mappings(). It is +- * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK +- * is 0. +- */ +-void arch_sync_kernel_mappings(unsigned long start, unsigned long end); +- +-/* + * Lowlevel-APIs (not for driver use!) + */ + diff --git a/queue-5.15/series b/queue-5.15/series index 770eb58276..33a2f14ce4 100644 --- a/queue-5.15/series +++ b/queue-5.15/series @@ -28,3 +28,5 @@ net-phy-mscc-fix-memory-leak-when-using-one-step-tim.patch phy-mscc-stop-taking-ts_lock-for-tx_queue-and-use-it.patch alsa-usb-audio-add-mute-tlv-for-playback-volumes-on-some-devices.patch pcmcia-fix-a-null-pointer-dereference-in-__iodyn_find_io_region.patch +x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch +mm-move-page-table-sync-declarations-to-linux-pgtable.h.patch diff --git a/queue-5.15/x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch b/queue-5.15/x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch new file mode 100644 index 0000000000..742cc5a3ae --- /dev/null +++ b/queue-5.15/x86-mm-64-define-arch_page_table_sync_mask-and-arch_sync_kernel_mappings.patch @@ -0,0 +1,153 @@ +From 6659d027998083fbb6d42a165b0c90dc2e8ba989 Mon Sep 17 00:00:00 2001 +From: Harry Yoo +Date: Mon, 18 Aug 2025 11:02:06 +0900 +Subject: x86/mm/64: define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() + +From: Harry Yoo + +commit 6659d027998083fbb6d42a165b0c90dc2e8ba989 upstream. + +Define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to ensure +page tables are properly synchronized when calling p*d_populate_kernel(). + +For 5-level paging, synchronization is performed via +pgd_populate_kernel(). In 4-level paging, pgd_populate() is a no-op, so +synchronization is instead performed at the P4D level via +p4d_populate_kernel(). 
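+
+By analogy with the pgd_populate_kernel() helper quoted in the previous
+patch, the P4D-level helper can be expected to look roughly like this
+(sketch only; the helper itself is introduced by a separate patch in the
+series, not by this one):
+
+	static inline void p4d_populate_kernel(unsigned long addr, p4d_t *p4d,
+					       pud_t *pud)
+	{
+		p4d_populate(&init_mm, p4d, pud);
+		if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED)
+			arch_sync_kernel_mappings(addr, addr);
+	}
+
+On 4-level x86, ARCH_PAGE_TABLE_SYNC_MASK is PGTBL_P4D_MODIFIED, so this
+is the point where the synchronization defined below gets triggered.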
+
+This fixes intermittent boot failures on systems using 4-level paging and
+a large amount of persistent memory:
+
+  BUG: unable to handle page fault for address: ffffe70000000034
+  #PF: supervisor write access in kernel mode
+  #PF: error_code(0x0002) - not-present page
+  PGD 0 P4D 0
+  Oops: 0002 [#1] SMP NOPTI
+  RIP: 0010:__init_single_page+0x9/0x6d
+  Call Trace:
+   <TASK>
+   __init_zone_device_page+0x17/0x5d
+   memmap_init_zone_device+0x154/0x1bb
+   pagemap_range+0x2e0/0x40f
+   memremap_pages+0x10b/0x2f0
+   devm_memremap_pages+0x1e/0x60
+   dev_dax_probe+0xce/0x2ec [device_dax]
+   dax_bus_probe+0x6d/0xc9
+   [... snip ...]
+   </TASK>
+
+It also fixes a crash in vmemmap_set_pmd() caused by accessing vmemmap
+before sync_global_pgds() [1]:
+
+  BUG: unable to handle page fault for address: ffffeb3ff1200000
+  #PF: supervisor write access in kernel mode
+  #PF: error_code(0x0002) - not-present page
+  PGD 0 P4D 0
+  Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
+  Tainted: [W]=WARN
+  RIP: 0010:vmemmap_set_pmd+0xff/0x230
+   <TASK>
+   vmemmap_populate_hugepages+0x176/0x180
+   vmemmap_populate+0x34/0x80
+   __populate_section_memmap+0x41/0x90
+   sparse_add_section+0x121/0x3e0
+   __add_pages+0xba/0x150
+   add_pages+0x1d/0x70
+   memremap_pages+0x3dc/0x810
+   devm_memremap_pages+0x1c/0x60
+   xe_devm_add+0x8b/0x100 [xe]
+   xe_tile_init_noalloc+0x6a/0x70 [xe]
+   xe_device_probe+0x48c/0x740 [xe]
+   [... snip ...]
+
+Link: https://lkml.kernel.org/r/20250818020206.4517-4-harry.yoo@oracle.com
+Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
+Signed-off-by: Harry Yoo
+Closes: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [1]
+Suggested-by: Dave Hansen
+Acked-by: Kiryl Shutsemau
+Reviewed-by: Mike Rapoport (Microsoft)
+Reviewed-by: Lorenzo Stoakes
+Acked-by: David Hildenbrand
+Cc: Alexander Potapenko
+Cc: Alistair Popple
+Cc: Andrey Konovalov
+Cc: Andrey Ryabinin
+Cc: Andy Lutomirski
+Cc: "Aneesh Kumar K.V"
+Cc: Anshuman Khandual
+Cc: Ard Biesheuvel
+Cc: Arnd Bergmann
+Cc: bibo mao
+Cc: Borislav Petkov
+Cc: Christoph Lameter (Ampere)
+Cc: Dennis Zhou
+Cc: Dev Jain
+Cc: Dmitriy Vyukov
+Cc: Ingo Molnar
+Cc: Jane Chu
+Cc: Joao Martins
+Cc: Joerg Roedel
+Cc: John Hubbard
+Cc: Kevin Brodsky
+Cc: Liam Howlett
+Cc: Michal Hocko
+Cc: Oscar Salvador
+Cc: Peter Xu
+Cc: Peter Zijlstra
+Cc: Qi Zheng
+Cc: Ryan Roberts
+Cc: Suren Baghdasaryan
+Cc: Tejun Heo
+Cc: Thomas Gleixner
+Cc: Thomas Huth
+Cc: "Uladzislau Rezki (Sony)"
+Cc: Vincenzo Frascino
+Cc: Vlastimil Babka
+Cc:
+Signed-off-by: Andrew Morton
+Signed-off-by: Greg Kroah-Hartman
+---
+ arch/x86/include/asm/pgtable_64_types.h |    3 +++
+ arch/x86/mm/init_64.c                   |   18 ++++++++++++++++++
+ 2 files changed, 21 insertions(+)
+
+--- a/arch/x86/include/asm/pgtable_64_types.h
++++ b/arch/x86/include/asm/pgtable_64_types.h
+@@ -40,6 +40,9 @@ static inline bool pgtable_l5_enabled(vo
+ #define pgtable_l5_enabled() 0
+ #endif /* CONFIG_X86_5LEVEL */
+ 
++#define ARCH_PAGE_TABLE_SYNC_MASK \
++	(pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
++
+ extern unsigned int pgdir_shift;
+ extern unsigned int ptrs_per_p4d;
+ 
+--- a/arch/x86/mm/init_64.c
++++ b/arch/x86/mm/init_64.c
+@@ -219,6 +219,24 @@ static void sync_global_pgds(unsigned lo
+ }
+ 
+ /*
++ * Make kernel mappings visible in all page tables in the system.
++ * This is necessary except when the init task populates kernel mappings
++ * during the boot process. In that case, all processes originating from
++ * the init task copy the kernel mappings, so there is no issue.
++ * Otherwise, missing synchronization could lead to kernel crashes due
++ * to missing page table entries for certain kernel mappings.
++ *
++ * Synchronization is performed at the top level, which is the PGD in
++ * 5-level paging systems. In 4-level paging systems, however,
++ * pgd_populate() is a no-op, so synchronization is done at the P4D level.
++ * sync_global_pgds() handles this difference between paging levels.
++ */
++void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
++{
++	sync_global_pgds(start, end);
++}
++
++/*
+  * NOTE: This function is marked __ref because it calls __init function
+  * (alloc_bootmem_pages). It's safe to do it ONLY when after_bootmem == 0.
+  */
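+
+For completeness, an illustrative caller-side sketch (based on the
+upstream series, not on code in this backport: the conversion of core-MM
+callers such as vmemmap_p4d_populate() in mm/sparse-vmemmap.c to the new
+helper happens in a separate patch):
+
+	p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr,
+					       int node)
+	{
+		p4d_t *p4d = p4d_offset(pgd, addr);
+
+		if (p4d_none(*p4d)) {
+			void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+
+			if (!p)
+				return NULL;
+			/* Was p4d_populate(&init_mm, p4d, p); the _kernel
+			 * variant additionally triggers
+			 * arch_sync_kernel_mappings() via the mask above. */
+			p4d_populate_kernel(addr, p4d, p);
+		}
+		return p4d;
+	}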