From: Paolo Bonzini <pbonzini@redhat.com>
Date: Mon, 4 May 2026 08:44:52 +0000 (-0400)
Subject: Merge branch 'kvm-mbec' into HEAD
X-Git-Url: http://git.ipfire.org/gitweb/?a=commitdiff_plain;h=2be108307eae241359bb32ee259ba0b5378156aa;p=thirdparty%2Fkernel%2Flinux.git

Merge branch 'kvm-mbec' into HEAD

This topic branch introduces support for two related features that
Hyper-V uses in its implementation of Virtual Secure Mode; these are
Intel Mode-Based Execute Control and AMD Guest Mode Execution Trap.

Both MBEC and GMET allow more granular control over execute permissions,
with different levels of separation between supervisor and user mode.
MBEC provides support for separate supervisor and user-mode bits in the
PTEs; GMET instead lacks supervisor-mode only execution (with NX=0,
"both" is represented by U=0 and user-mode only by U=1).  GMET was
clearly inspired by SMEP, though with some differences and annoyances.
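
As a rough illustration of the execute semantics described above, the two
schemes can be modeled as follows.  This is only a sketch: the struct and
the sketch_* function names are hypothetical and are not part of the
series, and "user_mode" simply stands for whichever user/supervisor
indication the hardware applies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the execute-related bits of a leaf entry. */
struct sketch_pte {
	bool xs;	/* MBEC: supervisor-mode execute allowed */
	bool xu;	/* MBEC: user-mode execute allowed */
	bool nx;	/* GMET: no-execute bit */
	bool user;	/* GMET: U bit */
};

/* MBEC: two independent bits, so supervisor-only execution is possible. */
static bool sketch_mbec_can_exec(struct sketch_pte pte, bool user_mode)
{
	return user_mode ? pte.xu : pte.xs;
}

/* GMET: with NX=0, U=0 means "both" and U=1 means user-mode only. */
static bool sketch_gmet_can_exec(struct sketch_pte pte, bool user_mode)
{
	if (pte.nx)
		return false;
	return user_mode || !pte.user;
}

/* Usage example: supervisor-only exists for MBEC but not for GMET. */
static void sketch_check_exec_semantics(void)
{
	struct sketch_pte sup_only = { .xs = true };
	struct sketch_pte gmet_user = { .nx = false, .user = true };

	assert(sketch_mbec_can_exec(sup_only, false));
	assert(!sketch_mbec_can_exec(sup_only, true));
	assert(sketch_gmet_can_exec(gmet_user, true));
	assert(!sketch_gmet_can_exec(gmet_user, false));
}
```

In this model there is no GMET encoding that permits supervisor-mode
execution while denying user-mode execution, which is the asymmetry noted
above.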

The implementation starts from two changes to core MMU code, both
of which help make the actual feature almost trivial to implement:

- first, I'm cleaning up the implementation of nVMX exec-only, by
  properly adding read permissions to the ACC_* constants and to the
  permission bitmask machinery.  Jon also had to add a fourth ACC_*
  bit, but used it only in the special case of nested MBEC; here,
  instead, ACC_READ_MASK is the norm, which simplifies testing
  a lot and removes gratuitous complexity.

- second, I'm enforcing that KVM runs with MBEC/GMET enabled even in
  non-nested mode, if it wants to provide the feature to nested
  hypervisors.  This makes the creation of SPTEs look exactly the
  same for L1 and L2 guests, despite only the latter using MBEC/GMET
  fully; the difference lies only in the input access permissions.

This strategy adds only a limited amount of complexity to the core,
while providing almost entirely seamless support for nested
hypervisors.

Later patches have to use slightly different meanings for ACC_* in Intel
and AMD.  On the Intel side, some work is needed in order to split
shadow_x_mask and ACC_EXEC_MASK in two; now that there is an actual
ACC_READ_MASK to be used for exec-only pages, ACC_USER_MASK is unused
and can be reused as ACC_USER_EXEC_MASK.  However, unlike the older
ACC_USER_MASK hack, these differences are backed by concrete concepts
of the page table format, and there is always a 1:1 mapping from ACC_*
bits to PT_*_MASK or shadow_*_mask:

                            Intel                 AMD
     --------------------   -------------------   -------------------
     ACC_READ_MASK          PT_PRESENT_MASK       PT_PRESENT_MASK
     ACC_WRITE_MASK         PT_WRITABLE_MASK      PT_WRITABLE_MASK
     ACC_EXEC_MASK          shadow_xs_mask        shadow_nx_mask
     ACC_USER_MASK          ---                   shadow_user_mask
     ACC_USER_EXEC_MASK     shadow_xu_mask        ---

On Intel, ACC_EXEC_MASK is used for kernel-mode execution and is tied to
shadow_xs_mask (when MBEC is disabled, ACC_USER_EXEC_MASK and the XU bit
are computed but ineffective).  update_permission_bitmask() precomputes
all the necessary conditions.  On the AMD side, the U bit maps to
ACC_USER_MASK but nNPT adjusts the permission bitmask to ignore it for
reads and writes when GMET is active.  Although the changes are smaller
in scale than for MBEC, some are still needed in order to use GMET
for L1 guests, because the page tables have to be created with U=0.
This means that the root page has role.access != ACC_ALL and its
permissions have to be propagated down.
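
The 1:1 mapping in the Intel column of the table can be sketched as below.
This is a minimal illustration under stated assumptions, not the code in
the series: the SKETCH_ACC_* values, struct sketch_masks and
sketch_intel_spte_bits() are hypothetical names, and the mask values used
in the example are assumed bit positions chosen for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ACC_* encodings; the real values are defined by the series. */
#define SKETCH_ACC_READ		(1u << 0)
#define SKETCH_ACC_WRITE	(1u << 1)
#define SKETCH_ACC_EXEC		(1u << 2)
#define SKETCH_ACC_USER_EXEC	(1u << 3)

/* Stand-ins for PT_PRESENT_MASK, PT_WRITABLE_MASK and shadow_x{s,u}_mask. */
struct sketch_masks {
	uint64_t present;
	uint64_t writable;
	uint64_t xs;
	uint64_t xu;
};

/* Each access bit selects exactly one mask, as in the table. */
static uint64_t sketch_intel_spte_bits(unsigned int access,
				       const struct sketch_masks *m)
{
	uint64_t spte = 0;

	if (access & SKETCH_ACC_READ)
		spte |= m->present;
	if (access & SKETCH_ACC_WRITE)
		spte |= m->writable;
	if (access & SKETCH_ACC_EXEC)
		spte |= m->xs;
	if (access & SKETCH_ACC_USER_EXEC)
		spte |= m->xu;	/* still computed when MBEC is disabled */

	return spte;
}

/* Usage example with assumed bit positions (0, 1, 2 and 10). */
static void sketch_check_intel_mapping(void)
{
	const struct sketch_masks m = {
		.present = 1ull << 0, .writable = 1ull << 1,
		.xs = 1ull << 2, .xu = 1ull << 10,
	};

	/* Supervisor-executable, readable page. */
	assert(sketch_intel_spte_bits(SKETCH_ACC_READ | SKETCH_ACC_EXEC,
				      &m) == 0x5);
	/* User-executable only, readable page. */
	assert(sketch_intel_spte_bits(SKETCH_ACC_READ | SKETCH_ACC_USER_EXEC,
				      &m) == 0x401);
}
```

Because every access bit corresponds to a single mask, the L1 and L2
cases in this sketch differ only in the access value passed in.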

Note that with MBEC the user/supervisor distinction depends on the U
bit of the page tables rather than the CPL.  Processors provide this
information to the hypervisor through the "advanced EPT violation
vmexit info" feature, which is a requirement for KVM to use MBEC,
and kvm-intel.ko passes it to the MMU in PFERR_USER_MASK (unlike
kvm-amd.ko which computes it from the CPL).  This needs a small change
to pass the effective XWU permissions of the page tables down to
translate_nested_gpa().

The former "smep_andnot_wp" bit of cpu_role.base, now named "cr4_smep",
is repurposed for nested TDP to indicate that MBEC/GMET is on.  The minor
pessimization for shadow page tables (toggling CR4.SMEP now always forces
building a separate version of the shadow page tables, even though that's
technically unnecessary if CR4.WP=1) is not really worth fretting about;
in practice, guests are not going to flip CR4.SMEP in a way that would
prevent efficient reuse of shadow page tables.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---

2be108307eae241359bb32ee259ba0b5378156aa
diff --cc arch/x86/kvm/hyperv.c
index 4438ecac9a89b,f35fae3a7b3dd..015c6947b462e
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@@ -2040,8 -2040,10 +2040,10 @@@ static u64 kvm_hv_flush_tlb(struct kvm_
  	 * flush).  Translate the address here so the memory can be uniformly
  	 * read with kvm_read_guest().
  	 */
 -	if (!hc->fast && is_guest_mode(vcpu)) {
 +	if (!hc->fast && mmu_is_nested(vcpu)) {
- 		hc->ingpa = translate_nested_gpa(vcpu, hc->ingpa, 0, NULL);
+ 		hc->ingpa = kvm_x86_ops.nested_ops->translate_nested_gpa(
+ 					vcpu, hc->ingpa,
+ 					PFERR_GUEST_FINAL_MASK, NULL, 0);
  		if (unlikely(hc->ingpa == INVALID_GPA))
  			return HV_STATUS_INVALID_HYPERCALL_INPUT;
  	}