Recursively zap orphaned nested TDP shadow pages when emulating a guest
write to a shadowed page table, regardless of whether or not the associated
(parent) shadow page will be zapped, e.g. due to detected write-flooding.
This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages
for select L1 hypervisor patterns.

Commit 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when
zapping last/only parent") modified KVM
to recursively zap synchronized shadow pages (KVM already recursively zaps
unsync children) when a child is orphaned. But the fix effectively only
applied the logic to kvm_mmu_page_unlink_children(), i.e. only performs the
recursive zap when KVM is already zapping a parent SP and processing its
children.
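For illustration only, the intended behavior can be modeled in a few lines
of standalone C. This is a toy sketch, not KVM code: struct model_sp,
model_drop_parent() and model_emulated_clear() are hypothetical names, and
the 4-entry tables stand in for real 512-entry page tables. It shows an
emulated write that clears a parent entry dropping the child's last parent
reference and zapping the whole orphaned subtree, even though the parent
shadow page itself survives.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ENTRIES 4	/* toy size; real page tables have 512 entries */

/* Hypothetical stand-in for a shadow page, not KVM's struct kvm_mmu_page. */
struct model_sp {
	int nr_parents;	/* number of parent entries pointing here */
	int zapped;	/* set once the page has been reclaimed */
	struct model_sp *child[MODEL_ENTRIES];
};

/*
 * Drop one parent reference.  If the page is now orphaned, zap it and
 * recurse into its children.  Returns the number of pages zapped.
 */
static int model_drop_parent(struct model_sp *sp)
{
	int i, nr_zapped;

	assert(sp->nr_parents > 0);
	if (--sp->nr_parents > 0)
		return 0;	/* still reachable via another parent */

	sp->zapped = 1;
	nr_zapped = 1;
	for (i = 0; i < MODEL_ENTRIES; i++) {
		if (sp->child[i]) {
			nr_zapped += model_drop_parent(sp->child[i]);
			sp->child[i] = NULL;
		}
	}
	return nr_zapped;
}

/*
 * Model an emulated guest write that clears entry @idx of @parent.  The
 * parent is NOT being zapped; only the (possibly orphaned) child is.
 */
static int model_emulated_clear(struct model_sp *parent, int idx)
{
	struct model_sp *child = parent->child[idx];

	if (!child)
		return 0;

	parent->child[idx] = NULL;
	return model_drop_parent(child);
}
```

In this model, a child that is still referenced by another parent is left
alone; only pages whose last parent reference goes away are reclaimed.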

If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak
upwards of 4 shadow pages per GiB of L2 guest memory. Over hundreds or
thousands of L2 boots, if the VM is "lucky" enough to escape write-flooding
detection, i.e. not trigger reclaim of the orphaned shadow pages by dumb
luck, then it's possible to end up with tens or even hundreds of thousands
of unsync shadow pages and associated rmap entries.

Polluting the hash table and rmap entries with a horde of stale entries
can eventually degrade L2 guest boot time by an order of magnitude,
especially if there is any antagonistic activity in the host, i.e. anything
that contends for mmu_lock and/or needs to walk rmaps.

With "top"-down zapping, where "top" is 1GiB or above, L0 KVM is
effectively limited to leaking 4 shadow pages per 256GiB of memory, as
KVM's write-flooding detection will kick in on the third write to an L1
TDP PUD, and thus recursively zap the entire 256GiB range of the parent
PGD. I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs
when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs
before dropping everything. E.g. hacking tracing into L0 KVM's
kvm_mmu_track_write(), the top-down zapping of L1's TDP MMU for an L2 with
16GiB of memory leads to:

  gpa = 107407000, old = 800000010741bd07, new = 8000000000000000, level = 3, flood = 0
  gpa = 10741b000, old = 8000000112fb2d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 10741b008, old = 800000012509cd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 10741b010, old = 80000001114b9d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407008, old = 8000000112fb5d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 112fb5298, old = 8000000106f43d07, new = 80000000000001a0, level = 2, flood = 0
  gpa = 112fb52a0, old = 8000000106f4dd07, new = 80000000000001a0, level = 2, flood = 1
  gpa = 112fb5ea0, old = 8000000120490d07, new = 80000000000001a0, level = 2, flood = 2
  gpa = 107407010, old = 8000000106df2d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 107410000, old = 8000000107408d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 107408000, old = 8000000107407d07, new = 80000000000001a0, level = 4, flood = 0

Contrast that with a bottom-up zap, which effectively allows all 2MiB SPTEs
in L1 to leak their children.

  gpa = 167939000, old = 800000011c8f4d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 167939020, old = 8000000104407d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 167939028, old = 800000011ed20d07, new = 8000000000000000, level = 2, flood = 2
  gpa = 118c70bb0, old = 8000000167ab9d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118c70bb8, old = 8000000163913d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118c70de8, old = 800000011cc9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 160be7fb0, old = 800000011d322d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 160be7fb8, old = 8000000126b1bd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 1634ab000, old = 800000010e984d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1634ab008, old = 800000016879fd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1634ab010, old = 800000016879ed07, new = 8000000000000000, level = 2, flood = 2
  gpa = 11e3f1e48, old = 8000000168a33d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11e3f1e50, old = 80000001664dcd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1167eacb8, old = 8000000166544d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1167eacc0, old = 800000015c16bd07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1689e89b8, old = 800000015f296d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1689e89c0, old = 8000000167ca8d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 107b35eb8, old = 8000000161e71d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 107b35ec0, old = 8000000118cf3d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118cf2d48, old = 8000000118cf1d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118cf2d50, old = 8000000118cf0d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dcb770, old = 8000000118dcad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dcb778, old = 8000000118dc9d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 118dc87e8, old = 8000000126997d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 118dc87f0, old = 8000000126996d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 126995148, old = 8000000126994d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 126995150, old = 8000000103477d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 1034764c8, old = 8000000103475d07, new = 8000000000000000, level = 2, flood = 0
  gpa = 1034764d0, old = 8000000103474d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea4b788, old = 800000010ea4ad07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea4b790, old = 800000010ea49d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 10ea48928, old = 800000011a5bfd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 10ea48930, old = 800000011a5bed07, new = 8000000000000000, level = 2, flood = 1
  gpa = 11a5bd0d8, old = 800000011a5bcd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 11a5bd0e0, old = 800000011d323d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b40, old = 800000011fe0bd07, new = 8000000000000000, level = 2, flood = 0
  gpa = 122ce2b48, old = 800000010e985d07, new = 8000000000000000, level = 2, flood = 1
  gpa = 122ce2b50, old = 8000000161c9dd07, new = 8000000000000000, level = 2, flood = 2
  gpa = 16864c000, old = 8000000167939d07, new = 8000000000000000, level = 3, flood = 0
  gpa = 16864c008, old = 8000000118c70d07, new = 8000000000000000, level = 3, flood = 1
  gpa = 16864c010, old = 80000001688a6d07, new = 8000000000000000, level = 3, flood = 2
  gpa = 11c8f7000, old = 80000001608a7d07, new = 8000000000000000, level = 5, flood = 0
  gpa = 1608a7000, old = 800000016864cd07, new = 80000000000001a0, level = 4, flood = 0

Note, in the shadow MMU, "level" describes the level a shadow page "points"
at, not the level of its associated SPTE. I.e. when write-flooding of 1GiB
PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB
worth of memory. And as shown above, KVM's write-flooding detection
operates at all levels, so a single PMD (in L1) can effectively only leak
two unsync children (4KiB shadow pages) before it gets recursively zapped.
As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow
pages per 256GiB of L2 memory.
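As a rough aid for reading the traces, the per-shadow-page flood counter
can be modeled as below. This is a standalone sketch, not KVM's
implementation: struct model_flood_sp, model_track_write() and
MODEL_FLOOD_THRESHOLD are hypothetical names, the threshold of three is
taken from the "third write" behavior described above, and the model
assumes the traced "flood" value is the count before the current write is
accounted (which would explain why flood = 2 is the last record seen for
a given page).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed threshold: detection fires on the third tracked write. */
#define MODEL_FLOOD_THRESHOLD 3

/* Hypothetical stand-in for the per-shadow-page flood state. */
struct model_flood_sp {
	int write_flooding_count;
};

/*
 * Account one emulated write to the guest page table shadowed by @sp.
 * Returns true when the page should be treated as write-flooded, i.e.
 * when the caller would zap it instead of continuing to shadow it.
 */
static bool model_track_write(struct model_flood_sp *sp)
{
	assert(sp != NULL);
	return ++sp->write_flooding_count >= MODEL_FLOOD_THRESHOLD;
}
```

With this model, the first two writes to a page are tolerated and the
third one trips the detection, matching the flood = 0, 1, 2 sequences in
the traces.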

The top-down zap also makes it more likely that L1 will self-heal (to some
extent), as any shadow pages that are "rediscovered" by future runs of L2
can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
shadow pages over and over.

Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
large branch of the paging tree that L1 is only temporarily removing. In
practice, the usage patterns of hypervisors are highly unlikely to trigger
false positives. E.g. temporarily changing paging protections is typically
done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is
updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
memory from L2, then L0 KVM's write-flooding detection will kick in, and
the children would be zapped anyways.

Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent")
Cc: Yosry Ahmed <yosry@kernel.org>
Cc: Jim Mattson <jmattson@google.com>
Cc: James Houghton <jthoughton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260605174611.2222504-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>