From: Sean Christopherson
Date: Fri, 5 Jun 2026 17:46:10 +0000 (-0700)
Subject: KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated writes
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=69397c92de77525f70aa43cf3a47256cef409382;p=thirdparty%2Flinux.git

KVM: x86/mmu: Recursively zap orphaned nested TDP shadow pages on emulated writes

Recursively zap orphaned nested TDP shadow pages when emulating a guest
write to a shadowed page table, regardless of whether or not the
associated (parent) shadow page will be zapped, e.g. due to detected
write-flooding. This plugs a hole where KVM fails to reclaim defunct,
unsync shadow pages for select L1 hypervisor patterns.

Commit 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when
zapping last/only parent") modified KVM to recursively zap synchronized
shadow pages (KVM already recursively zaps unsync children) when a child
is orphaned. But the fix effectively only applied the logic to
kvm_mmu_page_unlink_children(), i.e. only performs the recursive zap when
KVM is already zapping a parent SP and processing its children.

If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will
leak upwards of 4 shadow pages per GiB of L2 guest memory. Over hundreds
or thousands of L2 boots, if the VM is "lucky" enough to escape
write-flooding detection, i.e. not trigger reclaim of the orphaned shadow
pages by dumb luck, then it's possible to end up with tens or even
hundreds of thousands of unsync shadow pages and associated rmap entries.
Polluting the hash table and rmap entries with a horde of stale entries
can eventually degrade L2 guest boot time by an order of magnitude,
especially if there is any antagonistic activity in the host, i.e.
anything that will contend for mmu_lock and/or needs to walk rmaps.

With "top"-down zapping, where "top" is 1GiB or above, then L0 KVM is
effectively limited to leaking 4 shadow pages per 256 GiB of memory, as
KVM's write flooding detection will kick in on the third write to an L1
TDP PUD, and thus recursively zap the entire 256 GiB range of the parent
PGD. I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs
when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs
before dropping everything.

E.g. hacking tracing into L0 KVM's kvm_mmu_track_write(), the top-down
zapping of L1's TDP MMU for an L2 with 16GiB of memory leads to:

  gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0
  gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0

Contrast that with a bottom-up zap, which effectively allows all 2MiB
SPTEs in L1 to leak their children.

  gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2
  gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2
  gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dc87e8, old = 8000000126997d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1
  gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0
  gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0

Note, in the shadow MMU, "level" describes the level a shadow page
"points" at, not the level of its associated SPTE. I.e. when
write-flooding of 1GiB PUD entries is detected, KVM recursively zaps
shadow pages covering 256GiB worth of memory.
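
As an aid for reading the "flood" column in the traces above, here is a
minimal standalone C sketch of the counting behavior this changelog
describes: a per-shadow-page counter is bumped on each emulated write, and
the page is treated as flooded on the third write. This is not KVM's code.
struct toy_shadow_page, TOY_FLOOD_THRESHOLD, toy_detect_write_flooding() and
toy_pte_index() are hypothetical names, and reading the traced "flood" value
as the count before the current write is an interpretation of the traces,
not something stated in the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: flooding is declared on the third write, per the changelog. */
#define TOY_FLOOD_THRESHOLD 3

/* Hypothetical stand-in for a shadow page; only the counter is modeled. */
struct toy_shadow_page {
	unsigned int write_flooding_count;
};

/*
 * Count one emulated write to the guest page table shadowed by @sp.
 * Returns true when the write should be treated as flooding, i.e. when
 * the caller would zap the shadow page instead of updating it in place.
 */
static bool toy_detect_write_flooding(struct toy_shadow_page *sp)
{
	assert(sp);

	sp->write_flooding_count++;
	return sp->write_flooding_count >= TOY_FLOOD_THRESHOLD;
}

/*
 * Index of the written entry within its table, assuming 8-byte entries
 * in a 4KiB page table.
 */
static unsigned int toy_pte_index(uint64_t gpa)
{
	return (unsigned int)((gpa & 0xfffu) / 8);
}
```

E.g. three consecutive calls to toy_detect_write_flooding() on a
zero-initialized page return false, false, true, which lines up with the
0, 1, 2 progression in the traces, and toy_pte_index(0x10741b008) yields 1,
i.e. the second entry in that table.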

And as shown above, KVM's write-flooding detection operates at all levels,
so a single PMD (in L1) can effectively only leak two unsync children
(4KiB shadow pages) before it gets recursively zapped. As a result, for
the top-down zap, L0 KVM will leak at most 4 unsync shadow pages per
256GiB of L2 memory.

The top-down zap also makes it more likely that L1 will self-heal (to some
extent), as any shadow pages that are "rediscovered" by future runs of L2
can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
shadow pages over and over.

Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
large branch of the paging tree that L1 is only temporarily removing. In
practice, the usage patterns of hypervisors are highly unlikely to trigger
false positives. E.g. temporarily changing paging protections is typically
done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is
updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
memory from L2, then L0 KVM's write-flooding detection will kick in, and
the children would be zapped anyways.

Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent")
Cc: Yosry Ahmed
Cc: Jim Mattson
Cc: James Houghton
Reviewed-by: Jim Mattson
Reviewed-by: Yosry Ahmed
Link: https://patch.msgid.link/20260605174611.2222504-2-seanjc@google.com
Signed-off-by: Sean Christopherson
---

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f8aa7eda661e..6e881c40f823 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6357,7 +6357,7 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,

 		while (npte--) {
 			entry = *spte;
-			mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
+			mmu_page_zap_pte(vcpu->kvm, sp, spte, &invalid_list);
 			if (gentry && sp->role.level != PG_LEVEL_4K)
 				++vcpu->kvm->stat.mmu_pde_zapped;
 			if (is_shadow_present_pte(entry))