git.ipfire.org Git - thirdparty/linux.git/log
6 weeks ago KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
Sean Christopherson [Fri, 19 Sep 2025 22:32:50 +0000 (15:32 -0700)] 
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors

Add {HV,VC,SX}_VECTOR definitions for AMD's Hypervisor Injection Exception,
VMM Communication Exception, and SVM Security Exception vectors, along with
human friendly formatting for trace_kvm_inj_exception().

Note, KVM is all but guaranteed to never observe or inject #SX, and #HV is
also likely to go unused.  Add the architectural collateral mostly for
completeness, and on the off chance that hardware goes off the rails.

Link: https://lore.kernel.org/r/20250919223258.1604852-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Define Control Protection Exception (#CP) vector
Sean Christopherson [Fri, 19 Sep 2025 22:32:49 +0000 (15:32 -0700)] 
KVM: x86: Define Control Protection Exception (#CP) vector

Add a CP_VECTOR definition for CET's Control Protection Exception (#CP),
along with human friendly formatting for trace_kvm_inj_exception().

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Add human friendly formatting for #XM, and #VE
Sean Christopherson [Fri, 19 Sep 2025 22:32:48 +0000 (15:32 -0700)] 
KVM: x86: Add human friendly formatting for #XM, and #VE

Add XM_VECTOR and VE_VECTOR pretty-printing for
trace_kvm_inj_exception().

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
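
The three commits above attach readable names to exception vectors in the
injection tracepoint.  A minimal sketch of the idea follows, using the
architectural vector numbers.  The helper name exception_vector_name() is
hypothetical and is not the kernel's tracepoint code.

```c
#include <stddef.h>

/* Architectural x86 exception vector numbers for the names added above. */
enum {
	XM_VECTOR = 19,	/* SIMD Floating-Point Exception */
	VE_VECTOR = 20,	/* Virtualization Exception */
	CP_VECTOR = 21,	/* Control Protection Exception */
	HV_VECTOR = 28,	/* Hypervisor Injection Exception */
	VC_VECTOR = 29,	/* VMM Communication Exception */
	SX_VECTOR = 30,	/* SVM Security Exception */
};

/* Hypothetical helper: map a vector to a printable name. */
static const char *exception_vector_name(unsigned int vector)
{
	switch (vector) {
	case XM_VECTOR: return "#XM";
	case VE_VECTOR: return "#VE";
	case CP_VECTOR: return "#CP";
	case HV_VECTOR: return "#HV";
	case VC_VECTOR: return "#VC";
	case SX_VECTOR: return "#SX";
	default:        return NULL;	/* not covered by this sketch */
	}
}
```

A caller would fall back to printing the raw vector number when the helper
returns NULL.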
6 weeks ago KVM: SVM: Enable shadow stack virtualization for SVM
John Allen [Fri, 19 Sep 2025 22:32:47 +0000 (15:32 -0700)] 
KVM: SVM: Enable shadow stack virtualization for SVM

Remove the explicit clearing of shadow stack CPU capabilities.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
Sean Christopherson [Fri, 19 Sep 2025 22:32:46 +0000 (15:32 -0700)] 
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid

Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT.  Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.

Treat the incoming XSS change as an emulated write, i.e. validate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking.  Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage.  In either case, KVM can't fix the broken guest.

Explicitly allow access to XSS at all times, as KVM needs to ensure its
copy of XSS stays up-to-date.  E.g. KVM supports migration of SEV-ES guests
and so needs to allow the host to save/restore XSS, otherwise a guest
that *knows* its XSS hasn't changed could get stale/bad CPUID emulation if
the guest doesn't provide XSS in the GHCB on every exit.  This creates a
hypothetical problem where a guest could request emulation of RDMSR or
WRMSR on XSS, but arguably that's not even a problem, e.g. it would be
entirely reasonable for a guest to request "emulation" as a way to inform
the hypervisor that its XSS value has been modified.

Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.

Suggested-by: John Allen <john.allen@amd.com>
base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
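
The "treat it as an emulated write, ignore bad values" policy described
above can be sketched as follows.  All names here (struct vcpu_xss_state,
emulate_xss_write(), sync_xss_from_ghcb()) are hypothetical stand-ins and
not KVM's actual functions; the sketch only shows the validate-then-ignore
flow and the CPUID side effect.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU tracking, for illustration only. */
struct vcpu_xss_state {
	uint64_t supported_xss;	/* XSS bits the vCPU model allows */
	uint64_t xss;		/* hypervisor's copy of the guest value */
	bool cpuid_dirty;	/* dynamic CPUID bits need recomputing */
};

/* Emulated write: reject unsupported bits, apply side effects on change. */
static int emulate_xss_write(struct vcpu_xss_state *v, uint64_t data)
{
	if (data & ~v->supported_xss)
		return -1;
	if (v->xss != data) {
		v->xss = data;
		v->cpuid_dirty = true;
	}
	return 0;
}

/* On an exit, consume XSS only if the guest marked it valid in the GHCB. */
static void sync_xss_from_ghcb(struct vcpu_xss_state *v, bool xss_valid,
			       uint64_t ghcb_xss)
{
	if (!xss_valid)
		return;
	/* Bad values are ignored; the old tracked value is kept. */
	(void)emulate_xss_write(v, ghcb_xss);
}
```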
6 weeks ago KVM: SVM: Pass through shadow stack MSRs as appropriate
John Allen [Fri, 19 Sep 2025 22:32:45 +0000 (15:32 -0700)] 
KVM: SVM: Pass through shadow stack MSRs as appropriate

Pass through XSAVE managed CET MSRs on SVM when KVM supports shadow
stack. These cannot be intercepted without also intercepting XSAVE, which
would likely cause unacceptable performance overhead.
MSR_IA32_INT_SSP_TAB is not managed by XSAVE, so it is intercepted.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: SVM: Update dump_vmcb with shadow stack save area additions
John Allen [Fri, 19 Sep 2025 22:32:44 +0000 (15:32 -0700)] 
KVM: SVM: Update dump_vmcb with shadow stack save area additions

Add shadow stack VMCB fields to dump_vmcb. PL0_SSP, PL1_SSP, PL2_SSP,
PL3_SSP, and U_CET are part of the SEV-ES save area and are encrypted,
but can be decrypted and dumped if the guest policy allows debugging.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
Sean Christopherson [Fri, 19 Sep 2025 22:32:43 +0000 (15:32 -0700)] 
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02

Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and
SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state
simply copies the entire save area).  SVM doesn't provide a way to
disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide
nested support before advertising SHSTK to userspace.

Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: SVM: Emulate reads and writes to shadow stack MSRs
John Allen [Fri, 19 Sep 2025 22:32:42 +0000 (15:32 -0700)] 
KVM: SVM: Emulate reads and writes to shadow stack MSRs

Emulate shadow stack MSR access by reading and writing to the
corresponding fields in the VMCB.

Signed-off-by: John Allen <john.allen@amd.com>
[sean: mark VMCB_CET dirty/clean as appropriate]
Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: nVMX: Advertise new VM-Entry/Exit control bits for CET state
Chao Gao [Fri, 19 Sep 2025 22:32:41 +0000 (15:32 -0700)] 
KVM: nVMX: Advertise new VM-Entry/Exit control bits for CET state

Advertise the LOAD_CET_STATE VM-Entry/Exit control bits in the nested VMX
MSRs, as all nested support for CET virtualization, including consistency
checks, is in place.

Advertise support if and only if KVM supports at least one of IBT or SHSTK.
While it's userspace's responsibility to provide a consistent CPU model to
the guest, that doesn't mean KVM should set userspace up to fail.

Note, the existing {CLEAR,LOAD}_BNDCFGS behavior predates
KVM_X86_QUIRK_STUFF_FEATURE_MSRS, i.e. KVM "solved" the inconsistent CPU
model problem by overwriting the VMX MSRs provided by userspace.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-35-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: nVMX: Add consistency checks for CET states
Chao Gao [Fri, 19 Sep 2025 22:32:40 +0000 (15:32 -0700)] 
KVM: nVMX: Add consistency checks for CET states

Introduce consistency checks for CET states during nested VM-entry.

A VMCS contains both guest and host CET states, each comprising the
IA32_S_CET MSR, SSP, and IA32_INTERRUPT_SSP_TABLE_ADDR MSR. Various
checks are applied to CET states during VM-entry as documented in SDM
Vol3 Chapter "VM ENTRIES". Implement all these checks during nested
VM-entry to emulate the architectural behavior.

In summary, there are three kinds of checks on guest/host CET states
during VM-entry:

A. Checks applied to both guest states and host states:

 * The IA32_S_CET field must not set any reserved bits; bits 10 (SUPPRESS)
   and 11 (TRACKER) cannot both be set.
 * SSP should not have bits 1:0 set.
 * The IA32_INTERRUPT_SSP_TABLE_ADDR field must be canonical.

B. Checks applied to host states only:

 * IA32_S_CET MSR and SSP must be canonical if the CPU enters 64-bit mode
   after VM-exit. Otherwise, IA32_S_CET and SSP must have their higher 32
   bits cleared.

C. Checks applied to guest states only:

 * IA32_S_CET MSR and SSP are not required to be canonical (canonical meaning
   bits 63:N-1 are identical, where N is the CPU's maximum linear-address
   width). However, bits 63:N of SSP must be identical.

Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-34-seanjc@google.com
[sean: have common helper return 0/-EINVAL, not true/false]
Signed-off-by: Sean Christopherson <seanjc@google.com>
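
The group A checks listed above (applied to both guest and host state) can
be sketched as a single helper that returns 0 or -EINVAL, matching the
return convention noted in the trailers.  This is an illustration, not the
nVMX code: the function and macro names are hypothetical, and the reserved
bit mask (bits 9:6 of IA32_S_CET) is an assumption to be checked against
the SDM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CET_SUPPRESS	(1ULL << 10)
#define CET_TRACKER	(1ULL << 11)
/* Assumption: bits 9:6 of IA32_S_CET are reserved. */
#define CET_RESERVED	(0xFULL << 6)

/* Canonical means bits 63:N-1 are all zeros or all ones (N = va_bits). */
static bool is_canonical(uint64_t addr, unsigned int va_bits)
{
	uint64_t upper = addr >> (va_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (va_bits - 1);

	return upper == 0 || upper == all_ones;
}

/* Checks common to guest and host CET state; returns 0 or -EINVAL. */
static int check_cet_state_common(uint64_t s_cet, uint64_t ssp,
				  uint64_t ssp_tbl, unsigned int va_bits)
{
	if (s_cet & CET_RESERVED)
		return -EINVAL;
	if ((s_cet & CET_SUPPRESS) && (s_cet & CET_TRACKER))
		return -EINVAL;
	if (ssp & 0x3)			/* SSP bits 1:0 must be clear */
		return -EINVAL;
	if (!is_canonical(ssp_tbl, va_bits))
		return -EINVAL;
	return 0;
}
```

The host-only and guest-only checks (groups B and C) would be layered on
top of this helper by the respective callers.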
6 weeks ago KVM: nVMX: Add consistency checks for CR0.WP and CR4.CET
Chao Gao [Fri, 19 Sep 2025 22:32:39 +0000 (15:32 -0700)] 
KVM: nVMX: Add consistency checks for CR0.WP and CR4.CET

Add consistency checks for CR4.CET and CR0.WP in guest-state or host-state
area in the VMCS12. This ensures that configurations with CR4.CET set and
CR0.WP not set result in VM-entry failure, aligning with architectural
behavior.

Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
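
The rule above is small enough to state directly in code.  The sketch
below uses the architectural bit positions (CR0.WP is bit 16, CR4.CET is
bit 23); the function name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define X86_CR0_WP	(1ULL << 16)
#define X86_CR4_CET	(1ULL << 23)

/* CR4.CET=1 with CR0.WP=0 is an invalid combination. */
static int check_cr0_wp_cr4_cet(uint64_t cr0, uint64_t cr4)
{
	if ((cr4 & X86_CR4_CET) && !(cr0 & X86_CR0_WP))
		return -EINVAL;
	return 0;
}
```

The same helper can be applied to both the guest-state and host-state
CR0/CR4 pairs of the VMCS12.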
6 weeks ago KVM: nVMX: Prepare for enabling CET support for nested guest
Yang Weijiang [Fri, 19 Sep 2025 22:32:38 +0000 (15:32 -0700)] 
KVM: nVMX: Prepare for enabling CET support for nested guest

Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
to enable CET for nested VM.

vmcs12 and vmcs02 need to be synced when L2 exits to L1 or when L1 wants
to resume L2, so that each can observe the correct CET states.

Please note that consistency checks regarding CET state during VM-Entry
will be added later to prevent this patch from becoming too large.
Advertising the new CET VM_ENTRY/EXIT control bits is also deferred
until after the consistency checks are added.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: nVMX: Virtualize NO_HW_ERROR_CODE_CC for L1 event injection to L2
Yang Weijiang [Fri, 19 Sep 2025 22:32:37 +0000 (15:32 -0700)] 
KVM: nVMX: Virtualize NO_HW_ERROR_CODE_CC for L1 event injection to L2

Per the SDM (Vol. 3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check performed before injecting events into a
nested guest.  Only enforce the check when the guest is in real mode, the
exception is not a hardware exception, or the platform doesn't enumerate
bit 56 in VMX_BASIC; in all other cases, ignore the check to make the logic
consistent with the SDM.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
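
One plausible reading of the rule above, written as a standalone check.
This is a sketch under stated assumptions, not the nVMX implementation:
the function and parameter names are hypothetical, and the caller is
assumed to already know whether the vector architecturally carries an
error code.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * prot_mode:             guest is in protected mode (not real mode)
 * hw_exception:          injected event is a hardware exception
 * no_hw_errcode_cc:      VMX_BASIC bit 56 is enumerated
 * has_error_code:        L1 asked for an error code to be delivered
 * vector_has_error_code: the vector architecturally has an error code
 */
static int check_injected_error_code(bool prot_mode, bool hw_exception,
				     bool no_hw_errcode_cc,
				     bool has_error_code,
				     bool vector_has_error_code)
{
	/* No error code in real mode or for non-hardware-exception events. */
	if (!prot_mode || !hw_exception)
		return has_error_code ? -EINVAL : 0;

	/* Without bit 56, the error code must match what the vector implies. */
	if (!no_hw_errcode_cc && has_error_code != vector_has_error_code)
		return -EINVAL;

	/* With bit 56, either choice is allowed regardless of vector. */
	return 0;
}
```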
6 weeks ago KVM: VMX: Configure nested capabilities after CPU capabilities
Sean Christopherson [Fri, 19 Sep 2025 22:32:36 +0000 (15:32 -0700)] 
KVM: VMX: Configure nested capabilities after CPU capabilities

Swap the order between configuring nested VMX capabilities and base CPU
capabilities, so that nested VMX support can be conditioned on core KVM
support, e.g. to allow conditioning support for LOAD_CET_STATE on the
presence of IBT or SHSTK.  Because the sanity checks on nested VMX config
performed by vmx_check_processor_compat() run _after_ vmx_hardware_setup(),
any use of kvm_cpu_cap_has() when configuring nested VMX support will lead
to failures in vmx_check_processor_compat().

While swapping the order of two (or more) configuration flows can lead to
a game of whack-a-mole, in this case nested support inarguably should be
done after base support.  KVM should never condition base support on nested
support, because nested support is fully optional, while obviously it's
desirable to condition nested support on base support.  And there's zero
evidence the current ordering was intentional, e.g. commit 66a6950f9995
("KVM: x86: Introduce kvm_cpu_caps to replace runtime CPUID masking")
likely placed the call to kvm_set_cpu_caps() after nested setup because it
looked pretty.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Enable CET virtualization for VMX and advertise to userspace
Yang Weijiang [Fri, 19 Sep 2025 22:32:35 +0000 (15:32 -0700)] 
KVM: x86: Enable CET virtualization for VMX and advertise to userspace

Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the
CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to
userspace.  Explicitly clear IBT and SHSTK on SVM, as additional work is
needed to enable CET on SVM, e.g. to context switch S_CET and other state.

Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
KVM does not support emulating CET, as running without Unrestricted Guest
can result in KVM emulating large swaths of guest code.  While it's highly
unlikely any guest will trigger emulation while also utilizing IBT or
SHSTK, there's zero reason to allow CET without Unrestricted Guest as that
combination should only be possible when explicitly disabling
unrestricted_guest for testing purposes.

Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces
the presence of an Error Code based on exception vector, as attempting to
inject a #CP with an Error Code (#CP architecturally has an Error Code)
will fail due to the #CP vector historically not having an Error Code.

Clear S_CET and SSP-related VMCS fields on "reset" to emulate the
architectural behavior of CET MSRs and SSP being reset to 0 after RESET,
power-up and INIT.  Note, KVM already clears guest CET state that is managed
via XSTATE in kvm_xstate_reset().

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: move some bits to separate patches, massage changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
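
The two VMX-specific preconditions described above (Unrestricted Guest and
VMX_BASIC bit 56) can be sketched as a single predicate.  The function
name and the idea of passing the raw VMX_BASIC value are assumptions made
for illustration; only the bit position comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* VMX_BASIC bit 56: error code delivery is not tied to the vector. */
#define VMX_BASIC_NO_HW_ERROR_CODE_CC	(1ULL << 56)

/* Hypothetical predicate: may CET be exposed on this VMX host? */
static bool vmx_cet_allowed(bool unrestricted_guest, uint64_t vmx_basic)
{
	/* Without Unrestricted Guest, large swaths of code may be emulated. */
	if (!unrestricted_guest)
		return false;

	/* Without bit 56, injecting #CP with an error code would fail. */
	if (!(vmx_basic & VMX_BASIC_NO_HW_ERROR_CODE_CC))
		return false;

	return true;
}
```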
6 weeks ago KVM: x86: Disable support for IBT and SHSTK if allow_smaller_maxphyaddr is true
Sean Christopherson [Fri, 19 Sep 2025 22:32:34 +0000 (15:32 -0700)] 
KVM: x86: Disable support for IBT and SHSTK if allow_smaller_maxphyaddr is true

Make IBT and SHSTK virtualization mutually exclusive with "officially"
supporting setups with guest.MAXPHYADDR < host.MAXPHYADDR, i.e. if the
allow_smaller_maxphyaddr module param is set.  Running a guest with a
smaller MAXPHYADDR requires intercepting #PF, and can also trigger
emulation of arbitrary instructions.  Intercepting and reacting to #PFs
doesn't play nice with SHSTK, as KVM's MMU hasn't been taught to handle
Shadow Stack accesses, and emulating arbitrary instructions doesn't play
nice with IBT or SHSTK, as KVM's emulator doesn't handle the various side
effects, e.g. doesn't enforce end-branch markers or model Shadow Stack
updates.

Note, hiding IBT and SHSTK based solely on allow_smaller_maxphyaddr is
overkill, as allow_smaller_maxphyaddr is only problematic if the guest is
actually configured to have a smaller MAXPHYADDR.  However, KVM's ABI
doesn't provide a way to express that IBT and SHSTK may break if enabled
in conjunction with guest.MAXPHYADDR < host.MAXPHYADDR.  I.e. the
alternative is to do nothing in KVM and instead update documentation and
hope KVM users are thorough readers.  Go with the conservative-but-correct
approach; worst case scenario, this restriction can be dropped if there's
a strong use case for enabling CET on hosts with allow_smaller_maxphyaddr.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Initialize allow_smaller_maxphyaddr earlier in setup
Sean Christopherson [Mon, 22 Sep 2025 18:47:43 +0000 (11:47 -0700)] 
KVM: x86: Initialize allow_smaller_maxphyaddr earlier in setup

Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM
knows whether or not TDP will be utilized.  To avoid having to teach KVM's
emulator all about CET, KVM's upcoming CET virtualization support will be
mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK
and IBT if allow_smaller_maxphyaddr is enabled.

In general, allow_smaller_maxphyaddr should be initialized as soon as
possible since it's globally visible while its only input is whether or
not EPT/NPT is enabled.  I.e. there's effectively zero risk of setting
allow_smaller_maxphyaddr too early, and substantial risk of setting it
too late.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Disable support for Shadow Stacks if TDP is disabled
Sean Christopherson [Fri, 19 Sep 2025 22:32:33 +0000 (15:32 -0700)] 
KVM: x86: Disable support for Shadow Stacks if TDP is disabled

Make TDP a hard requirement for Shadow Stacks, as there are no plans to
add Shadow Stack support to the Shadow MMU.  E.g. KVM hasn't been taught
to understand the magic Writable=0,Dirty=1 combination that is required
for Shadow Stack accesses, and so enabling Shadow Stacks when using
shadow paging will put the guest into an infinite #PF loop (KVM thinks the
shadow page tables have a valid mapping, hardware says otherwise).

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
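
Taken together, the MMU-related restrictions in the commits above amount
to two capability filters: no TDP means no SHSTK, and
allow_smaller_maxphyaddr means neither IBT nor SHSTK.  A minimal sketch,
with a hypothetical struct and function name standing in for KVM's CPU
capability handling:

```c
#include <stdbool.h>

/* Hypothetical capability flags, for illustration only. */
struct cet_caps {
	bool ibt;
	bool shstk;
};

static void apply_mmu_restrictions(struct cet_caps *caps, bool tdp_enabled,
				   bool allow_smaller_maxphyaddr)
{
	/* The shadow MMU doesn't understand Shadow Stack mappings. */
	if (!tdp_enabled)
		caps->shstk = false;

	/* #PF interception and emulation don't mix with IBT or SHSTK. */
	if (allow_smaller_maxphyaddr) {
		caps->ibt = false;
		caps->shstk = false;
	}
}
```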
6 weeks ago KVM: x86: Add XSS support for CET_KERNEL and CET_USER
Yang Weijiang [Fri, 19 Sep 2025 22:32:32 +0000 (15:32 -0700)] 
KVM: x86: Add XSS support for CET_KERNEL and CET_USER

Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT
*or* SHSTK is supported.  Like CR4.CET, XFEATURE support for IBT and SHSTK
is bundled together under the CET umbrella, and thus prone to
virtualization holes if KVM or the guest supports only one of IBT or SHSTK,
but hardware supports both.  However, again like CR4.CET, such
virtualization holes are benign from the host's perspective so long as KVM
takes care to always honor the "or" logic.

Require CET_KERNEL and CET_USER to come as a pair, and refuse to support
IBT or SHSTK if one (or both) features is missing, as the (host) kernel
expects them to come as a pair, i.e. may get confused and corrupt state if
only one of CET_KERNEL or CET_USER is supported.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
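
The "or" logic and the pairing requirement above can be sketched as
follows.  The XSAVE feature bit positions (CET_USER is bit 11, CET_KERNEL
is bit 12) are architectural; the struct and function names are
hypothetical and do not mirror KVM's setup code.

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_CET_USER		(1ULL << 11)
#define XFEATURE_MASK_CET_KERNEL	(1ULL << 12)
#define XFEATURE_MASK_CET_ALL \
	(XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

/* Hypothetical capability state, for illustration only. */
struct cet_xss_caps {
	bool ibt;
	bool shstk;
	uint64_t supported_xss;
};

static void setup_cet_xss(struct cet_xss_caps *caps, uint64_t host_xss)
{
	/* CET_USER and CET_KERNEL must come as a pair, else drop CET. */
	if ((host_xss & XFEATURE_MASK_CET_ALL) != XFEATURE_MASK_CET_ALL) {
		caps->ibt = false;
		caps->shstk = false;
	}

	/* Either feature is enough to expose both XSS bits. */
	if (caps->ibt || caps->shstk)
		caps->supported_xss |= XFEATURE_MASK_CET_ALL;
	else
		caps->supported_xss &= ~XFEATURE_MASK_CET_ALL;
}
```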
6 weeks ago KVM: nVMX: Always forward XSAVES/XRSTORS exits from L2 to L1
Sean Christopherson [Fri, 19 Sep 2025 22:32:31 +0000 (15:32 -0700)] 
KVM: nVMX: Always forward XSAVES/XRSTORS exits from L2 to L1

Unconditionally forward XSAVES/XRSTORS VM-Exits from L2 to L1, as KVM
doesn't utilize the XSS-bitmap (KVM relies on controlling the XSS value
in hardware to prevent unauthorized access to XSAVES state).  KVM always
loads vmcs02 with vmcs12's bitmap, and so any exit _must_ be due to
vmcs12's XSS-bitmap.

Drop the comment about XSS never being non-zero in anticipation of
enabling CET_KERNEL and CET_USER support.

Opportunistically WARN if XSAVES is not enabled for L2, as the CPU is
supposed to generate #UD before checking the XSS-bitmap.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Allow setting CR4.CET if IBT or SHSTK is supported
Yang Weijiang [Fri, 19 Sep 2025 22:32:30 +0000 (15:32 -0700)] 
KVM: x86: Allow setting CR4.CET if IBT or SHSTK is supported

Drop X86_CR4_CET from CR4_RESERVED_BITS and instead mark CET as reserved
if and only if IBT *and* SHSTK are unsupported, i.e. allow CR4.CET to be
set if IBT or SHSTK is supported.  This creates a virtualization hole if
the CPU supports both IBT and SHSTK, but the kernel or vCPU model only
supports one of the features.  However, it's entirely legal for a CPU to
have only one of IBT or SHSTK, i.e. the hole is a flaw in the architecture,
not in KVM.

More importantly, so long as KVM is careful to initialize and context
switch both IBT and SHSTK state (when supported in hardware) if either
feature is exposed to the guest, a misbehaving guest can only harm itself.
E.g. VMX initializes host CET VMCS fields based solely on hardware
capabilities.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
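
The reserved-bit rule above reduces to a one-line computation: CR4.CET
(bit 23) is reserved only when neither feature is supported.  The function
names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_CET	(1ULL << 23)

/* CR4.CET is reserved if and only if IBT *and* SHSTK are unsupported. */
static uint64_t cr4_cet_reserved_bits(bool has_ibt, bool has_shstk)
{
	return (!has_ibt && !has_shstk) ? X86_CR4_CET : 0;
}

/* A CR4 value is acceptable, with respect to CET, if no reserved bit is set. */
static bool cr4_cet_valid(uint64_t cr4, bool has_ibt, bool has_shstk)
{
	return !(cr4 & cr4_cet_reserved_bits(has_ibt, has_shstk));
}
```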
6 weeks ago KVM: x86/mmu: Pretty print PK, SS, and SGX flags in MMU tracepoints
Sean Christopherson [Fri, 19 Sep 2025 22:32:29 +0000 (15:32 -0700)] 
KVM: x86/mmu: Pretty print PK, SS, and SGX flags in MMU tracepoints

Add PK (Protection Keys), SS (Shadow Stacks), and SGX (Software Guard
Extensions) to the set of #PF error flags handled via
kvm_mmu_trace_pferr_flags.  While KVM doesn't expect PK or SS #PFs in
particular, pretty printing their names instead of the raw hex value saves
the user from having to go spelunking in the SDM to figure out what's
going on.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
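
To illustrate what pretty printing buys, the sketch below decodes a #PF
error code into flag names.  The bit positions are architectural (PK is
bit 5, SS is bit 6, SGX is bit 15); the short names and the
format_pferr() helper are illustrative and are not the kernel's
tracepoint macro.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const struct {
	uint64_t mask;
	const char *name;
} pferr_flags[] = {
	{ 1ULL << 0,  "P"    },	/* present */
	{ 1ULL << 1,  "W"    },	/* write */
	{ 1ULL << 2,  "U"    },	/* user */
	{ 1ULL << 3,  "RSVD" },	/* reserved bit set */
	{ 1ULL << 4,  "F"    },	/* instruction fetch */
	{ 1ULL << 5,  "PK"   },	/* protection keys */
	{ 1ULL << 6,  "SS"   },	/* shadow stack access */
	{ 1ULL << 15, "SGX"  },	/* SGX access control */
};

/* Append s to buf without overflowing a buffer of len bytes. */
static void append(char *buf, size_t len, const char *s)
{
	size_t used = strlen(buf);

	if (used + 1 < len)
		strncat(buf, s, len - used - 1);
}

/* Hypothetical helper: write "A|B|C" style flag names into buf. */
static void format_pferr(uint64_t err, char *buf, size_t len)
{
	size_t i;

	if (len == 0)
		return;
	buf[0] = '\0';
	for (i = 0; i < sizeof(pferr_flags) / sizeof(pferr_flags[0]); i++) {
		if (!(err & pferr_flags[i].mask))
			continue;
		if (buf[0])
			append(buf, len, "|");
		append(buf, len, pferr_flags[i].name);
	}
}
```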
6 weeks ago KVM: x86/mmu: WARN on attempt to check permissions for Shadow Stack #PF
Sean Christopherson [Fri, 19 Sep 2025 22:32:28 +0000 (15:32 -0700)] 
KVM: x86/mmu: WARN on attempt to check permissions for Shadow Stack #PF

Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to
check permissions for a Shadow Stack access as KVM hasn't been taught to
understand the magic Writable=0,Dirty=1 combination that is required for
Shadow Stack accesses, and likely will never learn.  There are no plans to
support Shadow Stacks with the Shadow MMU, and the emulator rejects all
instructions that affect Shadow Stacks, i.e. it should be impossible for
KVM to observe a #PF due to a shadow stack access.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Emulate SSP[63:32]!=0 #GP(0) for FAR JMP to 32-bit mode
Sean Christopherson [Fri, 19 Sep 2025 22:32:27 +0000 (15:32 -0700)] 
KVM: x86: Emulate SSP[63:32]!=0 #GP(0) for FAR JMP to 32-bit mode

Emulate the Shadow Stack restriction that the current SSP must be a 32-bit
value on a FAR JMP from 64-bit mode to compatibility mode.  From the SDM's
pseudocode for FAR JMP:

  IF ShadowStackEnabled(CPL)
    IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
      (* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
      IF (SSP & 0xFFFFFFFF00000000 != 0); THEN
        #GP(0);
      FI;
    FI;
  FI;

Note, only the current CPL needs to be considered, as FAR JMP can't be
used for inter-privilege level transfers, and KVM rejects emulation of all
other far branch instructions when Shadow Stacks are enabled.

To give the emulator access to GUEST_SSP, special case handling
MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a
host access (KVM doesn't allow guest accesses to internal "MSRs").  The
->get_msr() API is only used for implicit accesses from the emulator, i.e.
is only used with hardcoded MSR indices, and so any access to
MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the
guest via RDMSR.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
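
The SDM pseudocode quoted above translates fairly directly into C.  The
sketch below is a standalone illustration with a hypothetical function
name; it takes the relevant state as plain arguments rather than reading
it from an emulator context.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a FAR JMP should raise #GP(0) because the target is
 * legacy or compatibility mode and SSP does not fit in the low 4GB.
 */
static bool far_jmp_ssp_faults(bool shstk_enabled, bool efer_lma,
			       bool dest_cs_l, uint64_t ssp)
{
	if (!shstk_enabled)
		return false;

	/* Target is 64-bit mode: no restriction on SSP. */
	if (efer_lma && dest_cs_l)
		return false;

	return (ssp >> 32) != 0;
}
```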
6 weeks ago KVM: x86: Don't emulate task switches when IBT or SHSTK is enabled
Sean Christopherson [Fri, 19 Sep 2025 22:32:26 +0000 (15:32 -0700)] 
KVM: x86: Don't emulate task switches when IBT or SHSTK is enabled

Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers
task switch emulation with Indirect Branch Tracking or Shadow Stacks
enabled, as attempting to do the right thing would require non-trivial
effort and complexity, KVM doesn't support emulating CET generally, and
it's extremely unlikely that any guest will do task switches while also
utilizing CET.  Defer taking on the complexity until someone cares enough
to put in the time and effort to add support.

Per the SDM:

  If shadow stack is enabled, then the SSP of the task is located at the
  4 bytes at offset 104 in the 32-bit TSS and is used by the processor to
  establish the SSP when a task switch occurs from a task associated with
  this TSS. Note that the processor does not write the SSP of the task
  initiating the task switch to the TSS of that task, and instead the SSP
  of the previous task is pushed onto the shadow stack of the new task.

Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new
privilege level is updated.  To keep things simple, check both S_CET and
U_CET (again, anyone that wants more precise checking can have the honor
of implementing support).

Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
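
The "check both S_CET and U_CET" simplification above can be sketched as
a predicate that decides whether task switch emulation should be refused.
The enable bits are architectural (SH_STK_EN is bit 0 and ENDBR_EN is
bit 2 of the CET MSRs); the function name and the explicit CR4.CET gate
are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CET_SHSTK_EN	(1ULL << 0)
#define CET_ENDBR_EN	(1ULL << 2)

/* Hypothetical predicate: refuse to emulate a task switch? */
static bool task_switch_blocked_by_cet(bool cr4_cet, uint64_t s_cet,
				       uint64_t u_cet)
{
	/* Assumption: with CR4.CET clear, neither feature is active. */
	if (!cr4_cet)
		return false;

	/* Check supervisor and user state together, regardless of CPL. */
	return ((s_cet | u_cet) & (CET_SHSTK_EN | CET_ENDBR_EN)) != 0;
}
```

When the predicate is true, the caller would exit to userspace with an
emulation error instead of attempting the task switch.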
6 weeks ago KVM: x86: Don't emulate instructions affected by CET features
Sean Christopherson [Fri, 19 Sep 2025 22:32:25 +0000 (15:32 -0700)] 
KVM: x86: Don't emulate instructions affected by CET features

Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts).  Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.

To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect.  Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow.  Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).

For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions.  Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.

Note!  On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode.  On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.

Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0.  Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated).  A future commit will add the #GP(0)
check for the specific FAR JMP scenario.

Note #3, task switches also modify SSP and so need to be rejected.  That
too will be addressed in a future commit.

Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
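
The detection scheme above can be approximated with a small model.  This
is one reading of the description, not the emulator's code: the flag,
enum, and function names are hypothetical, and real decode tables carry
far more operand types than the four modeled here.

```c
#include <stdbool.h>

/* Hypothetical decode flags, loosely modeled on the description above. */
enum {
	INSN_IS_BRANCH    = 1 << 0,
	INSN_NEAR_BRANCH  = 1 << 1,
	INSN_SHADOW_STACK = 1 << 2,	/* directly affects the current SSP */
};

enum insn_src { SRC_NONE, SRC_IMM, SRC_REG, SRC_MEM };

/* The ret-like far branches that are checked explicitly. */
enum insn_kind { INSN_OTHER, INSN_IRET, INSN_RET_FAR, INSN_RET_FAR_IMM,
		 INSN_SYSEXIT };

struct insn {
	unsigned int flags;
	enum insn_src src;
	enum insn_kind kind;
};

static bool insn_affects_ibt(const struct insn *i)
{
	if (!(i->flags & INSN_IS_BRANCH))
		return false;

	/* Far JMPs and CALLs are effectively indirect; far RETs are not. */
	if (!(i->flags & INSN_NEAR_BRANCH))
		return i->kind == INSN_OTHER;

	/* Near branches are indirect only with a register/memory source. */
	return i->src == SRC_REG || i->src == SRC_MEM;
}

/* Reject emulation rather than risk subverting CET protections. */
static bool must_reject_emulation(const struct insn *i, bool ibt_enabled,
				  bool shstk_enabled)
{
	if (shstk_enabled && (i->flags & INSN_SHADOW_STACK))
		return true;
	return ibt_enabled && insn_affects_ibt(i);
}
```

Per the note on far transfers, a caller would pass ibt_enabled and
shstk_enabled as true when the feature is enabled for User or Supervisor
mode, not only for the current privilege level.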
6 weeks ago KVM: VMX: Set host constant supervisor states to VMCS fields
Yang Weijiang [Fri, 19 Sep 2025 22:32:24 +0000 (15:32 -0700)] 
KVM: VMX: Set host constant supervisor states to VMCS fields

Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields
explicitly.  Kernel IBT is supported and the setting in MSR_IA32_S_CET is
static after boot (the exception is the BIOS call case, but a vCPU thread
never crosses it), so KVM doesn't need to refresh the HOST_S_CET field
before every VM-Enter/VM-Exit sequence.

Host supervisor shadow stack is not enabled now and SSP is not accessible
to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS
fields to 0. When shadow stack is enabled for CPL3, SSP is reloaded from
PL3_SSP before the CPU exits to userspace. See SDM Vol 2A/B Chapter 3/4 for
SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.

Prevent KVM module loading if host supervisor shadow stack SHSTK_EN is set
in MSR_IA32_S_CET, as KVM cannot coexist with it correctly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: snapshot host S_CET if SHSTK *or* IBT is supported]
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: VMX: Set up interception for CET MSRs
Yang Weijiang [Fri, 19 Sep 2025 22:32:23 +0000 (15:32 -0700)] 
KVM: VMX: Set up interception for CET MSRs

Disable interception for CET MSRs that can be accessed via XSAVES/XRSTORS,
and exist according to CPUID, as accesses through XSTATE aren't subject
to MSR interception checks, i.e. can't be intercepted without intercepting
and emulating XSAVES/XRSTORS, and KVM doesn't support emulating
XSAVE/XRSTOR instructions.

Don't condition interception on the guest actually having XSAVES as there
is no benefit to intercepting the accesses (when the MSRs exist).  The
MSRs in question are either context switched by the CPU on VM-Enter/VM-Exit
or by KVM via XSAVES/XRSTORS (KVM requires XSAVES to virtualize SHSTK),
i.e. KVM is going to load guest values into hardware irrespective of guest
XSAVES support.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Save and reload SSP to/from SMRAM
Yang Weijiang [Fri, 19 Sep 2025 22:32:22 +0000 (15:32 -0700)] 
KVM: x86: Save and reload SSP to/from SMRAM

Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates the
architectural hardware behavior when the guest enters/leaves SMM, i.e. it
saves registers to SMRAM on entry to SMM and reloads them on exit from SMM.
Per the SDM, SSP is one such register on 64-bit architectures, so add
support for SSP.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: VMX: Emulate read and write to CET MSRs
Yang Weijiang [Fri, 19 Sep 2025 22:32:21 +0000 (15:32 -0700)] 
KVM: VMX: Emulate read and write to CET MSRs

Add an emulation interface for CET MSR access. The emulation code is split
into a common part and a vendor-specific part. The former does common
checks for the MSRs, e.g. accessibility and data validity, then passes the
operation to either XSAVE-managed MSRs via the helpers or CET VMCS fields.

SSP can only be read via RDSSP. Writing even requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop call to kvm_set_xstate_msr() for S_CET, consolidate code]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Enable guest SSP read/write interface with new uAPIs
Yang Weijiang [Fri, 19 Sep 2025 22:32:20 +0000 (15:32 -0700)] 
KVM: x86: Enable guest SSP read/write interface with new uAPIs

Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace
save and restore the guest's Shadow Stack Pointer (SSP).  On both Intel
and AMD, SSP is a hardware register that can only be accessed by software
via dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware
to context switch SSP at entry/exit).  As a result, SSP doesn't fit in
any of KVM's existing interfaces for saving/restoring state.

Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes
to SSP follow that of several other Shadow Stack MSRs, e.g. the PLx_SSP
MSRs.  Use a translation layer to hide the KVM-internal MSR index so that
the arbitrary index doesn't become ABI, e.g. so that KVM can rework its
implementation as needed, so long as the ONE_REG ABI is maintained.

Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack
support to avoid running afoul of ignore_msrs, which unfortunately applies
to host-initiated accesses (which is a discussion for another day).  I.e.
ensure consistent behavior for KVM-defined registers irrespective of
ignore_msrs.

Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: VMX: Introduce CET VMCS fields and control bits
Yang Weijiang [Fri, 19 Sep 2025 22:32:19 +0000 (15:32 -0700)] 
KVM: VMX: Introduce CET VMCS fields and control bits

Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/Call/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides
two sub-features (SHSTK and IBT) to defend against ROP/COP/JOP style
control-flow subversion attacks.

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.

Indirect Branch Tracking (IBT):
  IBT introduces an instruction (ENDBRANCH) to mark valid target addresses
  of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
  and the next instruction is _not_ an ENDBRANCH, the processor generates
  a #CP. The instruction behaves as a NOP on platforms that have no CET.
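
To make the CALL/RET behavior described for SHSTK concrete, here is a toy
model in C. It is only a sketch under simplifying assumptions: struct toy_cpu,
toy_call() and toy_ret() are hypothetical, stacks are fixed-size arrays, and
a #CP is modeled as a false return value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_DEPTH 16

/* Toy CPU with a data stack and a separate shadow stack. */
struct toy_cpu {
	uint64_t data_stack[STACK_DEPTH];
	uint64_t shadow_stack[STACK_DEPTH];
	size_t sp;	/* number of entries on the data stack */
	size_t ssp;	/* number of entries on the shadow stack */
};

/* CALL pushes the return address on both stacks. */
static bool toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	if (cpu->sp >= STACK_DEPTH || cpu->ssp >= STACK_DEPTH)
		return false;

	cpu->data_stack[cpu->sp++] = ret_addr;
	cpu->shadow_stack[cpu->ssp++] = ret_addr;
	return true;
}

/* RET pops both stacks and compares; false models a #CP (or underflow). */
static bool toy_ret(struct toy_cpu *cpu, uint64_t *ret_addr)
{
	uint64_t data, shadow;

	if (!cpu->sp || !cpu->ssp)
		return false;

	data = cpu->data_stack[--cpu->sp];
	shadow = cpu->shadow_stack[--cpu->ssp];
	if (data != shadow)
		return false;

	*ret_addr = data;
	return true;
}
```

Overwriting a return address on the data stack (as a ROP payload would)
makes the next toy_ret() fail, because the shadow copy no longer matches.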

Several new CET MSRs are defined to support CET:
  MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.

  MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.

  MSR_IA32_INT_SSP_TAB: Linear address of SHSTK pointer table, whose entry
  is indexed by IST of interrupt gate desc.

Two XSAVES state bits are introduced for CET:
  IA32_XSS:[bit 11]: Control saving/restoring user mode CET states
  IA32_XSS:[bit 12]: Control saving/restoring supervisor mode CET states.

Six VMCS fields are introduced for CET:
  {HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
  {HOST,GUEST}_SSP: Stores current active SSP.
  {HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.

On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
VMCS fields at VM-Exit:
  HOST_S_CET
  HOST_SSP
  HOST_INTR_SSP_TABLE

If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
VMCS fields at VM-Entry:
  GUEST_S_CET
  GUEST_SSP
  GUEST_INTR_SSP_TABLE

Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Report KVM supported CET MSRs as to-be-saved
Yang Weijiang [Fri, 19 Sep 2025 22:32:18 +0000 (15:32 -0700)] 
KVM: x86: Report KVM supported CET MSRs as to-be-saved

Add CET MSRs to the list of MSRs reported to userspace if the feature,
i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Add fault checks for guest CR4.CET setting
Yang Weijiang [Fri, 19 Sep 2025 22:32:17 +0000 (15:32 -0700)] 
KVM: x86: Add fault checks for guest CR4.CET setting

Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.
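
The invariant above boils down to "CR4.CET implies CR0.WP".  A minimal
sketch of that consistency check, assuming a hypothetical helper name
(cr0_cr4_cet_valid()) and using the architectural bit positions CR0.WP = bit
16 and CR4.CET = bit 23:

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_WP	(1ULL << 16)
#define X86_CR4_CET	(1ULL << 23)

/*
 * Return true if the proposed CR0/CR4 pair is legal with respect to CET,
 * i.e. CR4.CET may be set only while CR0.WP is set.  The same check covers
 * both directions: setting CR4.CET and clearing CR0.WP.
 */
static bool cr0_cr4_cet_valid(uint64_t cr0, uint64_t cr4)
{
	return !(cr4 & X86_CR4_CET) || (cr0 & X86_CR0_WP);
}
```

A caller emulating a write to either register would evaluate the check with
the new value of the written register and the current value of the other,
and inject a fault if it returns false.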

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
Sean Christopherson [Fri, 19 Sep 2025 22:32:16 +0000 (15:32 -0700)] 
KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such MSRs.

If MSRs supported in kvm_caps.supported_xss are passed through to the guest,
the guest MSRs are swapped with the host's before the vCPU exits to
userspace and after it reenters the kernel before the next VM-entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.

The two helpers exist to make it explicit that accessing XSAVE-managed
MSRs requires special checks and handling to guarantee the correctness of
reads and writes to those MSRs.

Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop S_CET, add big comment, move accessors to x86.c]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Initialize kvm_caps.supported_xss
Yang Weijiang [Fri, 19 Sep 2025 22:32:15 +0000 (15:32 -0700)] 
KVM: x86: Initialize kvm_caps.supported_xss

Set the initial kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if
XSAVES is supported. host_xss contains the host-supported xstate feature
bits for thread FPU context switch, and KVM_SUPPORTED_XSS includes all
KVM-enabled XSS feature bits. The resulting value represents the supervisor
xstates that are available to the guest and are backed by the host FPU
framework for swapping {guest,host} XSAVE-managed registers/MSRs.
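
The computation is a plain intersection.  A minimal sketch follows; the
function name and the contents of the KVM-side mask are assumptions made for
illustration only (the mask here holds the two CET bits, IA32_XSS[11] and
IA32_XSS[12], whereas KVM's real KVM_SUPPORTED_XSS may differ at any given
commit).

```c
#include <stdbool.h>
#include <stdint.h>

#define XSS_CET_USER	(1ULL << 11)	/* user mode CET state */
#define XSS_CET_KERNEL	(1ULL << 12)	/* supervisor mode CET state */

/* Assumed KVM-side mask, for illustration only. */
#define SKETCH_KVM_SUPPORTED_XSS	(XSS_CET_USER | XSS_CET_KERNEL)

/*
 * A supervisor xstate feature is usable by guests only if the host FPU
 * framework manages it (host_xss) *and* KVM knows how to virtualize it.
 */
static uint64_t compute_supported_xss(bool has_xsaves, uint64_t host_xss)
{
	if (!has_xsaves)
		return 0;

	return host_xss & SKETCH_KVM_SUPPORTED_XSS;
}
```

For example, a host that enables XSS[8] and XSS[11] yields only XSS[11],
because bit 8 is not in the KVM-side mask of this sketch.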

[sean: relocate and enhance comment about PT / XSS[8] ]

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
Yang Weijiang [Fri, 19 Sep 2025 22:32:14 +0000 (15:32 -0700)] 
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the currently required xstate
size after a modification of the XSS MSR.  CPUID.(EAX=0DH,ECX=1).EBX
reports the required storage size of all enabled xstate features in
(XCR0 | IA32_XSS). The guest can use the CPUID value to allocate a
sufficiently large XSAVE buffer.

Note, KVM does not yet support any XSS based features, i.e. supported_xss
is guaranteed to be zero at this time.

Opportunistically skip CPUID updates if XSS value doesn't change.

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Check XSS validity against guest CPUIDs
Chao Gao [Fri, 19 Sep 2025 22:32:13 +0000 (15:32 -0700)] 
KVM: x86: Check XSS validity against guest CPUIDs

Maintain per-guest valid XSS bits and check XSS validity against them
rather than against KVM capabilities. This is to prevent bits that are
supported by KVM but not supported for a guest from being set.

Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses
if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since
KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the
host_initiated check.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Report XSS as to-be-saved if there are supported features
Sean Christopherson [Fri, 19 Sep 2025 22:32:12 +0000 (15:32 -0700)] 
KVM: x86: Report XSS as to-be-saved if there are supported features

Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS based feature.

Before the CET virtualization series, the guest's IA32_XSS is guaranteed
to be 0, i.e. XSAVES/XRSTORS are executed in non-root mode with XSS == 0,
which has the same effect as XSAVE/XRSTOR.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support
Yang Weijiang [Fri, 19 Sep 2025 22:32:11 +0000 (15:32 -0700)] 
KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support

Enable KVM_{G,S}ET_ONE_REG uAPIs so that userspace can access MSRs and
other non-MSR registers through them, along with support for
KVM_GET_REG_LIST to enumerate support for KVM-defined registers.

This is in preparation for allowing userspace to read/write the guest SSP
register, which is needed for the upcoming CET virtualization support.

Currently, two types of registers are supported: KVM_X86_REG_TYPE_MSR and
KVM_X86_REG_TYPE_KVM. All MSRs are in the former type; the latter type is
added for registers that lack existing KVM uAPIs to access them. The "KVM"
in the name is intended to be vague to give KVM flexibility to include
other potential registers.  More precise names like "SYNTHETIC" and
"SYNTHETIC_MSR" were considered, but were deemed too confusing (e.g. can
be conflated with synthetic guest-visible MSRs) and may put KVM into a
corner (e.g. if KVM wants to change how a KVM-defined register is modeled
internally).

Enumerate only KVM-defined registers in KVM_GET_REG_LIST to avoid
duplicating KVM_GET_MSR_INDEX_LIST, and so that KVM can return _only_
registers that are fully supported (KVM_GET_REG_LIST is vCPU-scoped, i.e.
can be precise, whereas KVM_GET_MSR_INDEX_LIST is system-scoped).

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/all/20240219074733.122080-18-weijiang.yang@intel.com
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: x86: Merge 'selftests' into 'cet' to pick up ex_str()
Sean Christopherson [Tue, 23 Sep 2025 16:00:18 +0000 (09:00 -0700)] 
KVM: x86: Merge 'selftests' into 'cet' to pick up ex_str()

Merge the queue of KVM selftests changes for 6.18 to pick up the ex_str()
helper so that it can be used to pretty print expected versus actual
exceptions in a new MSR selftest.  CET virtualization will add support for
several MSRs with non-trivial semantics, along with new uAPI for accessing
the guest's Shadow Stack Pointer (SSP) from userspace.

6 weeks ago KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies
Sean Christopherson [Tue, 23 Sep 2025 15:59:49 +0000 (08:59 -0700)] 
KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies

Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB
helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for
SEV-ES+ guests.

6 weeks ago KVM: SEV: Validate XCR0 provided by guest in GHCB
Sean Christopherson [Fri, 19 Sep 2025 22:32:10 +0000 (15:32 -0700)] 
KVM: SEV: Validate XCR0 provided by guest in GHCB

Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's
software model in order to validate the new XCR0 against KVM's view of
the supported XCR0.  Allowing garbage is thankfully mostly benign, as
kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected
state, xstate_required_size() will simply provide garbage back to the
guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS
will only harm the guest (setting XCR0 will fail).

However, allowing the guest to put junk into a field that KVM assumes is
valid is a CVE waiting to happen.  And as a bonus, using the proper API
eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty.

Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage.  In either case, KVM can't fix the broken guest.

Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits
if XCR0 isn't actually changing (relative to KVM's previous snapshot).
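
The "validate, ignore bad values, skip no-op updates" flow can be sketched
as follows.  This is not __kvm_set_xcr() itself: xcr0_is_valid() and
sync_xcr0_from_guest() are hypothetical helpers that model only the
architectural XCR0 consistency rules plus a check against a supported mask.

```c
#include <stdbool.h>
#include <stdint.h>

#define XF_FP		(1ULL << 0)
#define XF_SSE		(1ULL << 1)
#define XF_YMM		(1ULL << 2)
#define XF_BNDREGS	(1ULL << 3)
#define XF_BNDCSR	(1ULL << 4)
#define XF_AVX512	(7ULL << 5)	/* opmask, ZMM_Hi256, Hi16_ZMM */

static bool xcr0_is_valid(uint64_t xcr0, uint64_t supported)
{
	if (!(xcr0 & XF_FP))				/* x87 is mandatory */
		return false;
	if ((xcr0 & XF_YMM) && !(xcr0 & XF_SSE))	/* AVX requires SSE */
		return false;
	if (!(xcr0 & XF_BNDREGS) != !(xcr0 & XF_BNDCSR))/* MPX bits pair up */
		return false;
	if (xcr0 & XF_AVX512) {
		if (!(xcr0 & XF_YMM))			/* AVX-512 requires AVX */
			return false;
		if ((xcr0 & XF_AVX512) != XF_AVX512)	/* all three or none */
			return false;
	}
	return !(xcr0 & ~supported);
}

/* Returns true only if the software model actually changed. */
static bool sync_xcr0_from_guest(uint64_t *cur, uint64_t val, uint64_t supported)
{
	if (val == *cur || !xcr0_is_valid(val, supported))
		return false;	/* unchanged or garbage: leave the model alone */

	*cur = val;
	return true;		/* caller would mark dynamic CPUID bits dirty */
}
```

Garbage from the guest never reaches the software model, and an unchanged
value does not trigger any recomputation.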

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: SEV: Read save fields from GHCB exactly once
Sean Christopherson [Fri, 19 Sep 2025 22:32:09 +0000 (15:32 -0700)] 
KVM: SEV: Read save fields from GHCB exactly once

Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific
GHCB get() utility to help guard against TOCTOU bugs.  Using READ_ONCE()
doesn't completely prevent such bugs, e.g. doesn't prevent KVM from
redoing get() after checking the initial value, but at least addresses
all potential TOCTOU issues in the current KVM code base.

To prevent unintentional use of the generic helpers, take only @svm for
the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly
passing it in.
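
For readers unfamiliar with the pattern, here is a minimal userspace sketch
of "read once, then validate and use the snapshot".  READ_ONCE_U64() is a
stand-in for the kernel's READ_ONCE(), and struct shared_page, MAX_LEN and
get_checked_len() are hypothetical; none of this is the actual GHCB code.

```c
#include <stdint.h>

/* Userspace stand-in for READ_ONCE(): force exactly one load. */
#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))

#define MAX_LEN	64

/* Hypothetical page that an untrusted party can modify at any time. */
struct shared_page {
	uint64_t len;
};

/* Return a validated length, or 0 if the provided value is out of range. */
static uint64_t get_checked_len(struct shared_page *page)
{
	/* Take a single snapshot of the untrusted field. */
	uint64_t len = READ_ONCE_U64(page->len);

	if (len > MAX_LEN)
		return 0;

	/* Use the snapshot; re-reading page->len here would reopen the race. */
	return len;
}
```

Without the single snapshot, the compiler is free to load page->len twice,
so the value that was checked might differ from the value that is used.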

Opportunistically reduce the indentation of the macro-defined helpers and
clean up the alignment.

Fixes: 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code()
Sean Christopherson [Fri, 19 Sep 2025 22:32:08 +0000 (15:32 -0700)] 
KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code()

Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() to make
it clear that KVM is getting the cached value, not reading directly from
the guest-controlled GHCB.  More importantly, vacating
kvm_ghcb_get_sw_exit_code() will allow adding a KVM-specific macro-built
kvm_ghcb_get_##field() helper to read values from the GHCB.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Add ex_str() to print human friendly name of exception vectors
Sean Christopherson [Fri, 19 Sep 2025 22:32:51 +0000 (15:32 -0700)] 
KVM: selftests: Add ex_str() to print human friendly name of exception vectors

Steal exception_mnemonic() from KVM-Unit-Tests as ex_str() (to keep line
lengths reasonable) and use it in assert messages that currently print the
raw vector number.

Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago selftests/kvm: remove stale TODO in xapic_state_test
Sukrut Heroorkar [Mon, 8 Sep 2025 21:05:46 +0000 (23:05 +0200)] 
selftests/kvm: remove stale TODO in xapic_state_test

The TODO about using the number of vCPUs instead of vcpu.id + 1
was already addressed by commit 376bc1b458c9 ("KVM: selftests: Don't
assume vcpu->id is '0' in xAPIC state test"). The comment is now
stale and can be removed.

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Link: https://lore.kernel.org/r/20250908210547.12748-1-hsukrut3@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount
dongsheng [Fri, 19 Sep 2025 21:46:48 +0000 (14:46 -0700)] 
KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount

Add a PMU errata framework and use it to relax precise event counts on
Atom platforms that overcount "Instruction Retired" and "Branch Instruction
Retired" events, as the overcount issues on VM-Exit/VM-Entry are impossible
to prevent from userspace, e.g. the test can't prevent host IRQs.

Set up errata during early initialization and automatically sync the mask
to VMs so that tests can check for errata without having to manually
manage host=>guest variables.

For Intel Atom CPUs, the PMU events "Instruction Retired" or
"Branch Instruction Retired" may be overcounted for certain
instructions, like FAR CALL/JMP, RETF, IRET, VMENTRY/VMEXIT/VMPTRLD
and complex SGX/SMX/CSTATE instructions/flows.

The detailed information can be found in the errata (section SRF7):
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/sierra-forest/xeon-6700-series-processor-with-e-cores-specification-update/errata-details/

For Atom platforms up to and including Sierra Forest, both events,
"Instruction Retired" and "Branch Instruction Retired", are overcounted
on these instructions, but for Clearwater Forest only the "Instruction
Retired" event is overcounted.

Signed-off-by: dongsheng <dongsheng.x.zhang@intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Validate more arch-events in pmu_counters_test
Dapeng Mi [Fri, 19 Sep 2025 21:46:47 +0000 (14:46 -0700)] 
KVM: selftests: Validate more arch-events in pmu_counters_test

Add support for 5 new architectural events (4 topdown level 1 metrics
events and LBR inserts event) that will first show up in Intel's
Clearwater Forest CPUs.  Detailed info about the new events can be found
in SDM section 21.2.7 "Pre-defined Architectural Performance Events".

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
[sean: drop "unavailable_mask" changes]
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Reduce number of "unavailable PMU events" combos tested
Sean Christopherson [Fri, 19 Sep 2025 21:46:46 +0000 (14:46 -0700)] 
KVM: selftests: Reduce number of "unavailable PMU events" combos tested

Reduce the number of combinations of unavailable PMU event masks that are
tested by the PMU counters test.  In reality, testing every possible
combination isn't all that interesting, and certainly not worth the tens
of seconds (or worse, minutes) of runtime.  Fully testing the N^2 space
will be especially problematic in the near future, as 5! new arch events
are on their way.

Use alternating bit patterns (and 0 and -1u) in the hopes that _if_ there
is ever a KVM bug, it's not something horribly convoluted that shows up
only with a super specific pattern/value.
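
A minimal sketch of what such a reduced mask list could look like.  The
exact values used by the selftest are an assumption here (0, -1u and the two
classic alternating patterns), and nr_masks_hiding_event() is a hypothetical
helper that merely shows each event bit is exercised in both states.

```c
#include <stdint.h>

/* Assumed mask list: all clear, all set, and two alternating patterns. */
static const uint32_t unavailable_masks[] = {
	0u,
	-1u,
	0x55555555u,
	0xaaaaaaaau,
};

#define NR_UNAVAILABLE_MASKS \
	(sizeof(unavailable_masks) / sizeof(unavailable_masks[0]))

/* Count how many of the masks mark architectural event @bit unavailable. */
static unsigned int nr_masks_hiding_event(unsigned int bit)
{
	unsigned int i, nr = 0;

	for (i = 0; i < NR_UNAVAILABLE_MASKS; i++)
		nr += (unavailable_masks[i] >> bit) & 1;

	return nr;
}
```

With these four masks, every event bit is reported as unavailable in exactly
two masks and as available in the other two, regardless of how many
architectural events exist.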

Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Track unavailable_mask for PMU events as 32-bit value
Sean Christopherson [Fri, 19 Sep 2025 21:46:45 +0000 (14:46 -0700)] 
KVM: selftests: Track unavailable_mask for PMU events as 32-bit value

Track the mask of "unavailable" PMU events as a 32-bit value.  While bits
31:9 are currently reserved, silently truncating those bits is unnecessary
and asking for missed coverage.  To avoid running afoul of the sanity check
in vcpu_set_cpuid_property(), explicitly adjust the mask based on the
non-reserved bits as reported by KVM's supported CPUID.
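
The adjustment amounts to masking off reserved bits before handing the
value to the CPUID setter.  A minimal sketch, assuming the number of
non-reserved event bits is already known (nr_events) and using a
hypothetical helper name; how the selftest derives the valid bits from
KVM's supported CPUID is not shown.

```c
#include <stdint.h>

/*
 * Keep only the bits that correspond to enumerated events, e.g. bits 8:0
 * when nine architectural events are defined and bits 31:9 are reserved.
 */
static uint32_t clamp_unavailable_mask(uint32_t mask, unsigned int nr_events)
{
	/* Shifting a 32-bit value by 32 is undefined, handle it explicitly. */
	uint32_t valid = nr_events >= 32 ? UINT32_MAX : (1u << nr_events) - 1;

	return mask & valid;
}
```

Callers can then pass full 32-bit patterns such as -1u without tripping a
sanity check on reserved bits.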

Opportunistically update the "all ones" testcase to pass -1u instead of
0xff.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago KVM: selftests: Add timing_info bit support in vmx_pmu_caps_test
Dapeng Mi [Fri, 19 Sep 2025 21:46:44 +0000 (14:46 -0700)] 
KVM: selftests: Add timing_info bit support in vmx_pmu_caps_test

A new bit, PERF_CAPABILITIES[17], called "PEBS_TIMING_INFO", is added to
indicate whether PEBS supports recording timing information in a new
"Retired Latency" field.

KVM only allows userspace to set PEBS capabilities that are consistent
with the host; otherwise the PERF_CAPABILITIES write fails.  Add
pebs_timing_info to "immutable_caps" so that host-inconsistent PEBS
configurations are blocked and result in errors.

Opportunistically drop the anythread_deprecated bit.  It isn't and likely
never was a PERF_CAPABILITIES flag, the test's definition snuck in when
the union was copy+pasted from the kernel's definition.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
[sean: call out anythread_deprecated change]
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks agoKVM: x86: Fix hypercalls docs section number order
Bagas Sanjaya [Tue, 9 Sep 2025 00:39:52 +0000 (07:39 +0700)] 
KVM: x86: Fix hypercalls docs section number order

Commit 4180bf1b655a79 ("KVM: X86: Implement "send IPI" hypercall")
documents the KVM_HC_SEND_IPI hypercall, but its section number duplicates
that of KVM_HC_CLOCK_PAIRING (both are numbered 6).  Fix the numbering so
that KVM_HC_SEND_IPI is section 7.

Fixes: 4180bf1b655a ("KVM: X86: Implement "send IPI" hypercall")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20250909003952.10314-1-bagasdotme@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks agoKVM: x86: Don't treat ENTER and LEAVE as branches, because they aren't
Sean Christopherson [Fri, 19 Sep 2025 00:46:39 +0000 (17:46 -0700)] 
KVM: x86: Don't treat ENTER and LEAVE as branches, because they aren't

Remove the IsBranch flag from ENTER and LEAVE in KVM's emulator, as ENTER
and LEAVE are stack operations, not branches.  Add forced emulation of
said instructions to the PMU counters test to prove that KVM diverges from
hardware, and to guard against regressions.

Opportunistically add a missing "1 MOV" to the selftest comment regarding
the number of instructions per loop, which commit 7803339fa929 ("KVM:
selftests: Use data load to trigger LLC references/misses in Intel PMU")
forgot to add.

Fixes: 018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919004639.1360453-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Restrict GLOBAL_{CTRL,STATUS}, fixed PMCs, and PEBS to PMU v2+
Sean Christopherson [Wed, 6 Aug 2025 19:56:53 +0000 (12:56 -0700)] 
KVM: x86/pmu: Restrict GLOBAL_{CTRL,STATUS}, fixed PMCs, and PEBS to PMU v2+

Restrict support for GLOBAL_CTRL, GLOBAL_STATUS, fixed PMCs, and PEBS to
v2 or later vPMUs.  The SDM explicitly states that GLOBAL_{CTRL,STATUS} and
fixed counters were introduced with PMU v2, and PEBS has hard dependencies
on fixed counters and the bitmap MSR layouts established by PMU v2.

Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86
Sean Christopherson [Wed, 6 Aug 2025 19:56:52 +0000 (12:56 -0700)] 
KVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86

Move all initialization of all_valid_pmc_idx to common code, as the logic
is 100% common to Intel and AMD, and KVM heavily relies on Intel and AMD
having the same semantics.  E.g. the fact that AMD doesn't support fixed
counters doesn't allow KVM to use all_valid_pmc_idx[63:32] for other
purposes.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Use BIT_ULL() instead of open coded equivalents
Dapeng Mi [Wed, 6 Aug 2025 19:56:51 +0000 (12:56 -0700)] 
KVM: x86/pmu: Use BIT_ULL() instead of open coded equivalents

Replace a variety of "1ull << N" and "(u64)1 << N" snippets with BIT_ULL()
in the PMU code.
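
As a rough illustration of the equivalence, the sketch below defines a local
stand-in for BIT_ULL() so that it builds outside the kernel; the helper name
single_bit_mask() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's BIT_ULL(), defined here for the sketch. */
#define BIT_ULL(nr) (1ULL << (nr))

/* Build a mask with a single bit set; valid for bit positions 0..63. */
static uint64_t single_bit_mask(unsigned int nr)
{
	assert(nr < 64);
	return BIT_ULL(nr);
}
```

Both open-coded forms, "1ull << N" and "(u64)1 << N", yield the same 64-bit
value as BIT_ULL(N); the macro only makes the intent easier to read.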

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
[sean: split to separate patch, write changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls
Dapeng Mi [Wed, 6 Aug 2025 19:56:48 +0000 (12:56 -0700)] 
KVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls

Expand the VMCS controls builder macros to generate helpers to change a
bit to the desired value, and use the new helpers when toggling APICv
related controls.
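
A "change bit" helper of this kind might look like the sketch below.  The
function names and the plain u32 control word are hypothetical stand-ins;
the real helpers are generated by the VMCS controls builder macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for per-control set/clear helpers. */
static void ctrl_setbit(uint32_t *ctrl, uint32_t bit)
{
	*ctrl |= bit;
}

static void ctrl_clearbit(uint32_t *ctrl, uint32_t bit)
{
	*ctrl &= ~bit;
}

/* Hypothetical "change" helper: force the bit to the requested state. */
static void ctrl_changebit(uint32_t *ctrl, uint32_t bit, bool set)
{
	assert(ctrl);

	if (set)
		ctrl_setbit(ctrl, bit);
	else
		ctrl_clearbit(ctrl, bit);
}
```

Callers that previously open-coded an if/else around set and clear can pass
the desired state directly.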

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: rewrite changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updates
Sean Christopherson [Wed, 6 Aug 2025 19:56:47 +0000 (12:56 -0700)] 
KVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updates

Defer recalculating MSR and instruction intercepts after a CPUID update
via RECALC_INTERCEPTS to converge on RECALC_INTERCEPTS as the "official"
mechanism for triggering recalcs.  As a bonus, because KVM does a "recalc"
during vCPU creation, and every functional VMM sets CPUID at least once,
for all intents and purposes this saves at least one recalc.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS
Sean Christopherson [Wed, 6 Aug 2025 19:56:46 +0000 (12:56 -0700)] 
KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS

Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS
request, and expand the responsibilities of vendor code to recalculate all
intercepts that vary based on userspace input, e.g. instruction intercepts
that are tied to guest CPUID.

Providing a generic recalc request will allow the upcoming mediated PMU
support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are
set by userspace, without having to make multiple calls to/from PMU code.
As a bonus, using a request will effectively coalesce recalcs, e.g. will
reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create,
set CPUID, set PERF_CAPABILITIES (Intel only), set filter).

The downside is that MSR filter changes that are done in isolation will do
a small amount of unnecessary work, but that's already a relatively slow
path, and the cost of recalculating instruction intercepts is negligible.
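
The coalescing effect can be sketched with a toy pending-request flag.  All
names below are hypothetical and the model is deliberately simplified; it
only shows why several triggers before the next guest entry result in a
single recalc.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical miniature of a per-vCPU pending-request flag. */
struct toy_vcpu {
	bool recalc_pending;	/* set by any number of triggers */
	int nr_recalcs;		/* how many times the recalc actually ran */
};

/* Triggers only mark the request; they do no work themselves. */
static void toy_request_recalc(struct toy_vcpu *vcpu)
{
	assert(vcpu);
	vcpu->recalc_pending = true;
}

/* Called once before "entering the guest": service the request at most once. */
static void toy_service_requests(struct toy_vcpu *vcpu)
{
	assert(vcpu);

	if (!vcpu->recalc_pending)
		return;

	vcpu->recalc_pending = false;
	vcpu->nr_recalcs++;
}
```

Four triggers (vCPU create, set CPUID, set PERF_CAPABILITIES, set filter)
followed by one service pass run the recalc once in this model.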

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
Dapeng Mi [Wed, 6 Aug 2025 19:56:45 +0000 (12:56 -0700)] 
KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header

Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h and rename them with
PERF_CAP prefix to keep consistent with other perf capabilities macros.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
Dapeng Mi [Wed, 6 Aug 2025 19:56:44 +0000 (12:56 -0700)] 
KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers

Rename the two helpers vmx_vmentry/vmexit_ctrl() to
vmx_get_initial_vmentry/vmexit_ctrl() to represent their real meaning.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities
Sean Christopherson [Wed, 6 Aug 2025 19:56:39 +0000 (12:56 -0700)] 
KVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities

Take a snapshot of the unadulterated PMU capabilities provided by perf so
that KVM can compare guest vPMU capabilities against hardware capabilities
when determining whether or not to intercept PMU MSRs (and RDPMC).

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRs
Sean Christopherson [Wed, 6 Aug 2025 19:56:37 +0000 (12:56 -0700)] 
KVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRs

Gate access to PMC MSRs based on pmu->version, not on kvm->arch.enable_pmu,
to more accurately reflect KVM's behavior.  This is a glorified nop, as
pmu->version and pmu->nr_arch_gp_counters can only be non-zero if
amd_pmu_refresh() is reached, kvm_pmu_refresh() invokes amd_pmu_refresh()
if and only if kvm->arch.enable_pmu is true, and amd_pmu_refresh() forces
pmu->version to be 1 or 2.

I.e. the following holds true:

  !pmu->nr_arch_gp_counters || kvm->arch.enable_pmu == (pmu->version > 0)

and so the only way for amd_pmu_get_pmc() to return a non-NULL value is if
both kvm->arch.enable_pmu and pmu->version evaluate to true.

No real functional change intended.

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: VMX: Setup canonical VMCS config prior to kvm_x86_vendor_init()
Sean Christopherson [Wed, 6 Aug 2025 19:56:36 +0000 (12:56 -0700)] 
KVM: VMX: Setup canonical VMCS config prior to kvm_x86_vendor_init()

Setup the golden VMCS config during vmx_init(), before the call to
kvm_x86_vendor_init(), instead of waiting until the callback to do
hardware setup.  setup_vmcs_config() only touches VMX state, i.e. doesn't
poke anything in kvm.ko, and has no runtime dependencies beyond
hv_init_evmcs().

Setting the VMCS config early on will allow referencing VMCS and VMX
capabilities at any point during setup, e.g. to check for PERF_GLOBAL_CTRL
save/load support during mediated PMU initialization.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoDocumentation: KVM: Call out that KVM strictly follows the 8254 PIT spec
Jiaming Zhang [Fri, 5 Sep 2025 17:47:36 +0000 (01:47 +0800)] 
Documentation: KVM: Call out that KVM strictly follows the 8254 PIT spec

Explicitly document that the behavior of KVM_SET_PIT2 strictly conforms
to the Intel 8254 PIT hardware specification, specifically that a write of
'0' adheres to the spec's definition that a programmed count of '0' is
converted to the maximum possible value (2^16).  E.g. an unaware userspace
might attempt to validate that KVM_GET_PIT2 returns the exact state set
via KVM_SET_PIT2, and be surprised when the returned count is 65536, not 0.

Add a reference to the Intel 8254 PIT datasheet that will hopefully stay
fresh for some time (the internet isn't exactly brimming with copies of
the 8254 datasheet).
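
The count conversion described above can be sketched as follows.  The helper
name is hypothetical, and the sketch only models the 16-bit binary count
mentioned in the changelog.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: a programmed count of 0 is treated as the maximum
 * count, 2^16 = 65536; every other value is used as-is.
 */
static uint32_t pit_effective_count(uint16_t programmed)
{
	uint32_t count = programmed ? programmed : 65536u;

	/* The effective count is never zero. */
	assert(count >= 1 && count <= 65536u);
	return count;
}
```

A userspace check that compares the value read back against the value
written would therefore need to special-case a written count of 0.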

Link: https://lore.kernel.org/all/CANypQFbEySjKOFLqtFFf2vrEe=NBr7XJfbkjQhqXuZGg7Rpoxw@mail.gmail.com
Signed-off-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/r/20250905174736.260694-1-r772577952@gmail.com
[sean: add context Link, drop local APIC change, massage changelog accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86: hyper-v: Use guard() instead of mutex_lock() to simplify code
Liao Yuanhong [Mon, 1 Sep 2025 13:16:04 +0000 (21:16 +0800)] 
KVM: x86: hyper-v: Use guard() instead of mutex_lock() to simplify code

Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when setting up the TSC page for a Hyper-V guest.
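
The simplification can be approximated in userspace with the GCC/Clang
cleanup attribute.  The toy lock, the TOY_GUARD() macro, and toy_setup() are
hypothetical and exist only for this sketch; they are not the kernel's
guard() implementation.

```c
#include <assert.h>

/* Toy lock so the sketch has no threading dependency; not a real mutex. */
struct toy_lock {
	int held;
};

static void toy_lock_acquire(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
}

static void toy_lock_release(struct toy_lock *lock)
{
	lock->held = 0;
}

/* Cleanup callback: receives a pointer to the guard variable leaving scope. */
static void toy_guard_release(struct toy_lock **guard)
{
	toy_lock_release(*guard);
}

/* Hypothetical guard macro built on the GCC/Clang cleanup attribute. */
#define TOY_GUARD(lock)							\
	struct toy_lock *toy_guard_var					\
		__attribute__((cleanup(toy_guard_release))) = (lock);	\
	toy_lock_acquire(toy_guard_var)

/* Every return path drops the lock without an explicit unlock or goto. */
static int toy_setup(struct toy_lock *lock, int fail_early)
{
	TOY_GUARD(lock);

	if (fail_early)
		return -1;

	return 0;
}
```

Because the release runs automatically when the guard variable goes out of
scope, the early-error return needs no dedicated unlock label.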

No functional change intended.

Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131604.646415-1-liaoyuanhong@vivo.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86: Use guard() instead of mutex_lock() to simplify code
Liao Yuanhong [Mon, 1 Sep 2025 13:18:21 +0000 (21:18 +0800)] 
KVM: x86: Use guard() instead of mutex_lock() to simplify code

Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when allocating the APIC access page.

No functional change intended.

Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131822.647802-1-liaoyuanhong@vivo.com
[sean: add blank line to isolate guard(), tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: x86/pmu: Correct typo "_COUTNERS" to "_COUNTERS"
Dapeng Mi [Fri, 18 Jul 2025 00:19:01 +0000 (08:19 +0800)] 
KVM: x86/pmu: Correct typo "_COUTNERS" to "_COUNTERS"

Fix typos. "_COUTNERS" -> "_COUNTERS".

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20250718001905.196989-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks agoKVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs
Sagi Shahar [Wed, 27 Aug 2025 01:17:26 +0000 (18:17 -0700)] 
KVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs

Reject KVM_CREATE_IRQCHIP if the VM type has protected EOIs, i.e. if KVM
can't intercept EOI and thus can't faithfully emulate level-triggered
interrupts that are routed through the I/O APIC.  For TDX VMs, the
TDX-Module owns the VMX EOI-bitmap and configures all IRQ vectors to have
the CPU accelerate EOIs, i.e. doesn't allow KVM to intercept any EOIs.

KVM already requires a split irqchip[1], but does so during vCPU creation,
which is both too late to allow userspace to fall back to a split irqchip
and a less-than-stellar experience for userspace since an -EINVAL on
KVM_CREATE_VCPU is far harder to debug/triage than failure exactly on
KVM_CREATE_IRQCHIP.  And of course, allowing an action that ultimately
fails is arguably a bug regardless of the impact on userspace.

Link: https://lore.kernel.org/lkml/20250222014757.897978-11-binbin.wu@linux.intel.com
Link: https://lore.kernel.org/lkml/aK3vZ5HuKKeFuuM4@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250827011726.2451115-1-sagis@google.com
[sean: massage shortlog+changelog, relocate setting has_protected_eoi]
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: nSVM: Replace kzalloc() + copy_from_user() with memdup_user()
Thorsten Blum [Wed, 3 Sep 2025 00:29:50 +0000 (02:29 +0200)] 
KVM: nSVM: Replace kzalloc() + copy_from_user() with memdup_user()

Replace kzalloc() followed by copy_from_user() with memdup_user() to
improve and simplify svm_set_nested_state().

Return early if an error occurs instead of trying to allocate memory for
'save' when memory allocation for 'ctl' already failed.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250903002951.118912-1-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: selftests: Add support for DIV and IDIV in the fastops test
Sean Christopherson [Tue, 9 Sep 2025 20:28:35 +0000 (13:28 -0700)] 
KVM: selftests: Add support for DIV and IDIV in the fastops test

Extend the fastops test coverage to DIV and IDIV, specifically to provide
coverage for #DE (divide error) exceptions, as #DE is the only exception
that can occur in KVM's fastops path, i.e. that requires exception fixup.
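
For reference, the conditions under which the 8-bit form of DIV raises #DE
can be written as a small predicate.  The function name is hypothetical and
the sketch covers only DIV r/m8; it is not taken from the selftest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate for the 8-bit form of DIV (AX / r/m8): the
 * instruction raises #DE when the divisor is zero or when the quotient
 * does not fit in the 8-bit destination (AL).
 */
static bool div8_raises_de(uint16_t ax, uint8_t divisor)
{
	if (divisor == 0)
		return true;

	return (ax / divisor) > UINT8_MAX;
}
```

Both the divide-by-zero case and the quotient-overflow case need exception
fixup when the instruction is executed on the guest's behalf.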

Link: https://lore.kernel.org/r/20250909202835.333554-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: selftests: Dedup the gnarly constraints of the fastops tests (more macros!)
Sean Christopherson [Tue, 9 Sep 2025 20:28:34 +0000 (13:28 -0700)] 
KVM: selftests: Dedup the gnarly constraints of the fastops tests (more macros!)

Add a fastop() macro along with macros to define its required constraints,
and use the macros to dedup the innermost guts of the fastop testcases.

No functional change intended.

Link: https://lore.kernel.org/r/20250909202835.333554-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: selftests: Add coverage for 'b' (byte) sized fastops emulation
Sean Christopherson [Tue, 9 Sep 2025 20:28:33 +0000 (13:28 -0700)] 
KVM: selftests: Add coverage for 'b' (byte) sized fastops emulation

Extend the fastops test to cover instructions that operate on 8-bit data.
Support for 8-bit instructions was omitted from the original commit purely
due to complications with BT not having a r/m8 variant.  To keep the
RFLAGS.CF behavior deterministic and not heavily biased to '0' or '1',
continue using BT, but cast and load the to-be-tested value into a
dedicated 32-bit constraint.

Supporting 8-bit operations will allow using guest_test_fastops() as-is to
provide full coverage for DIV and IDIV.  For divide operations, covering
all operand sizes _is_ interesting, because KVM needs to provide exception
fixup for each size (failure to handle a #DE could panic the host).

Link: https://lore.kernel.org/all/aIF7ZhWZxlkcpm4y@google.com
Link: https://lore.kernel.org/r/20250909202835.333554-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: selftests: Add support for #DE exception fixup
Sean Christopherson [Tue, 9 Sep 2025 20:28:32 +0000 (13:28 -0700)] 
KVM: selftests: Add support for #DE exception fixup

Add support for handling #DE (divide error) exceptions in KVM selftests
so that the fastops test can verify KVM correctly handles #DE when
emulating DIV or IDIV on behalf of the guest.  Morph #DE to 0xff (i.e.
to -1) as a mostly-arbitrary vector to indicate #DE, so that '0' (the
real #DE vector) can still be used to indicate "no exception".

Link: https://lore.kernel.org/r/20250909202835.333554-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: x86: Move vector_hashing into lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:09 +0000 (14:42 -0700)] 
KVM: x86: Move vector_hashing into lapic.c

Move the vector_hashing module param into lapic.c now that all usage is
contained within the local APIC emulation code.

Opportunistically drop the accessor and append "_enabled" to the variable
to help capture that it's a boolean module param.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: x86: Make "lowest priority" helpers local to lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:08 +0000 (14:42 -0700)] 
KVM: x86: Make "lowest priority" helpers local to lapic.c

Make various helpers for resolving lowest priority IRQs local to lapic.c
now that kvm_irq_delivery_to_apic() lives in lapic.c as well.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: x86: Move kvm_irq_delivery_to_apic() from irq.c to lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:07 +0000 (14:42 -0700)] 
KVM: x86: Move kvm_irq_delivery_to_apic() from irq.c to lapic.c

Move kvm_irq_delivery_to_apic() to lapic.c as it is specific to local APIC
emulation.  This will allow burying more local APIC code in lapic.c, e.g.
the various "lowest priority" helpers.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: SEV: Save the SEV policy if and only if LAUNCH_START succeeds
Sean Christopherson [Thu, 21 Aug 2025 21:38:41 +0000 (14:38 -0700)] 
KVM: SEV: Save the SEV policy if and only if LAUNCH_START succeeds

Wait until LAUNCH_START fully succeeds to set a VM's SEV/SNP policy so
that KVM doesn't keep a potentially stale policy.  In practice, the issue
is benign as the policy is only used to detect if the VMSA can be
decrypted, and the VMSA only needs to be decrypted if LAUNCH_UPDATE and
thus LAUNCH_START succeeded.

Fixes: 962e2b6152ef ("KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabled")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250821213841.3462339-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks agoKVM: selftests: Fix typo in hyperv cpuid test message
Alok Tiwari [Sun, 24 Aug 2025 18:16:40 +0000 (11:16 -0700)] 
KVM: selftests: Fix typo in hyperv cpuid test message

Fix a typo in hyperv_cpuid.c test assertion log:
replace "our of supported range" -> "out of supported range".

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://lore.kernel.org/r/20250824181642.629297-1-alok.a.tiwari@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SVM: Enable Secure TSC for SNP guests
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:33 +0000 (16:48 -0700)] 
KVM: SVM: Enable Secure TSC for SNP guests

Add support for Secure TSC, allowing userspace to configure the Secure TSC
feature for SNP guests. Use the SNP specification's desired TSC frequency
parameter during the SNP_LAUNCH_START command to set the mean TSC
frequency in kHz for Secure TSC enabled guests.

Always use kvm->arch.default_tsc_khz as the TSC frequency that is
passed to SNP guests in the SNP_LAUNCH_START command.  The default value
is the host TSC frequency.  Userspace can optionally change the TSC
frequency via the KVM_SET_TSC_KHZ ioctl before calling the
SNP_LAUNCH_START ioctl.

Introduce the read-only MSR GUEST_TSC_FREQ (0xc0010134) that returns the
guest's effective frequency in MHz when Secure TSC is enabled for SNP
guests. Disable interception of this MSR when Secure TSC is enabled. Note
that the GUEST_TSC_FREQ MSR is accessible only to the guest and not from the
hypervisor context.

Co-developed-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com>
Signed-off-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
[sean: contain Secure TSC to sev.c]
Link: https://lore.kernel.org/r/20250819234833.3080255-9-seanjc@google.com
[sean: return -EINVAL if TSC frequency is '0']
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()
Sean Christopherson [Tue, 19 Aug 2025 23:48:32 +0000 (16:48 -0700)] 
KVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()

Fold the remaining line of sev_es_vcpu_reset() into sev_vcpu_create() as
there's no need for a dedicated RESET hook just to init a mutex, and the
mutex should be initialized as early as possible anyways.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SEV: Set RESET GHCB MSR value during sev_es_init_vmcb()
Sean Christopherson [Tue, 19 Aug 2025 23:48:31 +0000 (16:48 -0700)] 
KVM: SEV: Set RESET GHCB MSR value during sev_es_init_vmcb()

Set the RESET value for the GHCB "MSR" during sev_es_init_vmcb() instead
of sev_es_vcpu_reset() to allow for dropping sev_es_vcpu_reset() entirely.

Note, the call to sev_init_vmcb() from sev_migrate_from() also kinda sorta
emulates a RESET, but sev_migrate_from() immediately overwrites ghcb_gpa
with the source's current value, so whether or not stuffing the GHCB
version is correct/desirable is moot.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SEV: Move init of SNP guest state into sev_init_vmcb()
Sean Christopherson [Tue, 19 Aug 2025 23:48:30 +0000 (16:48 -0700)] 
KVM: SEV: Move init of SNP guest state into sev_init_vmcb()

Move the initialization of SNP guest state from svm_vcpu_reset() into
sev_init_vmcb() to reduce the number of paths that deal with INIT/RESET
for SEV+ vCPUs from 4+ to 1.  Plumb in @init_event as necessary.

Opportunistically check for an SNP guest outside of
sev_snp_init_protected_guest_state() so that sev_init_vmcb() is consistent
with respect to checking for SEV-ES+ and SNP+ guests.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helper
Sean Christopherson [Tue, 19 Aug 2025 23:48:29 +0000 (16:48 -0700)] 
KVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helper

Add a dedicated sev_vcpu_create() helper to allocate the VMSA page for
SEV-ES+ vCPUs, and to allow for consolidating a variety of related SEV+
code in the near future.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agox86/cpufeatures: Add SNP Secure TSC
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:28 +0000 (16:48 -0700)] 
x86/cpufeatures: Add SNP Secure TSC

The Secure TSC feature for SEV-SNP allows guests to securely use the RDTSC
and RDTSCP instructions, ensuring that the parameters used cannot be
altered by the hypervisor once the guest is launched. For more details,
refer to the AMD64 APM Vol 2, Section "Secure TSC".

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Vaishali Thakkar <vaishali.thakkar@suse.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SEV: Enforce minimum GHCB version requirement for SEV-SNP guests
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:27 +0000 (16:48 -0700)] 
KVM: SEV: Enforce minimum GHCB version requirement for SEV-SNP guests

Require a minimum GHCB version of 2 when starting SEV-SNP guests through
KVM_SEV_INIT2. When a VMM attempts to start an SEV-SNP guest with an
incompatible GHCB version (less than 2), reject the request early rather
than allowing the guest kernel to start with an incorrect protocol version
and fail later with GHCB_SNP_UNSUPPORTED guest termination.

Not enforcing the minimum version typically causes the guest to request
termination with GHCB_SNP_UNSUPPORTED error code:

  kvm_amd: SEV-ES guest requested termination: 0x0:0x2
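
The early rejection might be sketched as below.  The helper name and its
parameters are hypothetical; the minimum of 2 for SNP guests comes from the
changelog, while returning -EINVAL is an assumption about the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validation helper: reject an SNP guest whose requested GHCB
 * protocol version is below 2; other guest types are left untouched here.
 */
static int check_ghcb_version(bool is_snp, uint16_t ghcb_version)
{
	if (is_snp && ghcb_version < 2)
		return -EINVAL;	/* assumed error code */

	return 0;
}
```

Failing at initialization time gives the VMM an immediate error instead of a
guest that terminates later during boot.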

Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version")
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SEV: Drop GHCB_VERSION_DEFAULT and open code it
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:26 +0000 (16:48 -0700)] 
KVM: SEV: Drop GHCB_VERSION_DEFAULT and open code it

Remove the GHCB_VERSION_DEFAULT macro and open code it with '2'. The macro
is used conditionally and is not a true default. KVM's ABI does not
advertise/enumerate the default GHCB version. Any future change to this
macro would silently alter the ABI and potentially break existing
deployments that rely on the current behavior.

Additionally, move the GHCB version assignment earlier in the code flow and
update the comment to clarify that KVM_SEV_INIT2 defaults to version 2,
while KVM_SEV_INIT forces version 1.

No functional change intended.

Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Zero XSTATE components on INIT by iterating over supported features
Chao Gao [Tue, 12 Aug 2025 02:55:13 +0000 (19:55 -0700)] 
KVM: x86: Zero XSTATE components on INIT by iterating over supported features

Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., CET's xstate-managed MSRs.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-6-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Manually clear MPX state only on INIT
Sean Christopherson [Tue, 12 Aug 2025 02:55:12 +0000 (19:55 -0700)] 
KVM: x86: Manually clear MPX state only on INIT

Don't manually clear/zero MPX state on RESET, as the guest FPU state is
zero allocated and KVM only does RESET during vCPU creation, i.e. the
relevant state is guaranteed to be all zeroes.

Opportunistically move the relevant code into a helper in anticipation of
adding support for CET shadow stacks, which also has state that is zeroed
on INIT.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-5-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Add kvm_msr_{read,write}() helpers
Yang Weijiang [Tue, 12 Aug 2025 02:55:11 +0000 (19:55 -0700)] 
KVM: x86: Add kvm_msr_{read,write}() helpers

Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers that are used when KVM
needs to get/set an MSR value to emulate CPU behavior, i.e.
host_initiated == %true in the helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-4-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Use double-underscore read/write MSR helpers as appropriate
Sean Christopherson [Tue, 12 Aug 2025 02:55:10 +0000 (19:55 -0700)] 
KVM: x86: Use double-underscore read/write MSR helpers as appropriate

Use the double-underscore helpers for emulating MSR reads and writes in
the no-underscore versions to better capture the relationship between the
two sets of APIs (the double-underscore versions don't honor userspace MSR
filters).

No functional change intended.
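
The layering described above can be sketched as follows, assuming a toy
"filter" that denies a single MSR index.  The names are hypothetical, and
toy_emulate_msr_read_nofilter() merely plays the role of the
double-underscore helper (a leading double underscore is reserved in
portable C, hence the suffix).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_OK       0
#define TOY_FILTERED 2	/* hypothetical "denied by userspace filter" result */

struct toy_vcpu {
	uint32_t denied_msr;	/* one denied index stands in for a whole filter */
	uint64_t msr_val;	/* every MSR reads back this value in the toy */
};

/* Plays the role of the double-underscore helper: no filter check. */
static int toy_emulate_msr_read_nofilter(const struct toy_vcpu *vcpu,
					 uint32_t index, uint64_t *data)
{
	(void)index;
	assert(vcpu && data);
	*data = vcpu->msr_val;
	return TOY_OK;
}

/* Plays the role of the no-underscore helper: filter first, then delegate. */
static int toy_emulate_msr_read(const struct toy_vcpu *vcpu,
				uint32_t index, uint64_t *data)
{
	if (index == vcpu->denied_msr)
		return TOY_FILTERED;
	return toy_emulate_msr_read_nofilter(vcpu, index, data);
}
```

Having the filtered variant call the unfiltered one makes it explicit that
the only difference between the two is the filter check.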

Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses
Yang Weijiang [Tue, 12 Aug 2025 02:55:09 +0000 (19:55 -0700)] 
KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses

Rename

  kvm_{g,s}et_msr_with_filter()
  kvm_{g,s}et_msr()

to

  kvm_emulate_msr_{read,write}()
  __kvm_emulate_msr_{read,write}()

to make it more obvious that KVM uses these helpers to emulate guest
behaviors, i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: x86: Advertise support for the immediate form of MSR instructions
Xin Li [Tue, 5 Aug 2025 20:22:24 +0000 (13:22 -0700)] 
KVM: x86: Advertise support for the immediate form of MSR instructions

Advertise support for the immediate form of MSR instructions to userspace
if the instructions are supported by the underlying CPU, and KVM is using
VMX, i.e. is running on an Intel-compatible CPU.

For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over-
report support if AMD-compatible CPUs ever implement the immediate forms,
as SVM will likely require explicit enablement in KVM.
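
The advertising policy above boils down to a small predicate.  A minimal
sketch, with hypothetical names (toy_vendor, toy_advertise_msr_imm()) that
do not correspond to KVM's actual CPUID plumbing:

```c
#include <assert.h>
#include <stdbool.h>

enum toy_vendor { TOY_VMX, TOY_SVM };

/*
 * Report the feature to userspace only when the host CPU has it and KVM is
 * using VMX; on SVM the bit is cleared regardless of what the host reports.
 */
static bool toy_advertise_msr_imm(enum toy_vendor vendor, bool host_has_msr_imm)
{
	assert(vendor == TOY_VMX || vendor == TOY_SVM);
	if (vendor == TOY_SVM)
		return false;
	return host_has_msr_imm;
}
```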

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: VMX: Support the immediate form of WRMSRNS in the VM-Exit fastpath
Xin Li [Tue, 5 Aug 2025 20:22:23 +0000 (13:22 -0700)] 
KVM: VMX: Support the immediate form of WRMSRNS in the VM-Exit fastpath

Add support for handling "WRMSRNS with an immediate" VM-Exits in KVM's
fastpath.  On Intel, all writes to the x2APIC ICR and to the TSC Deadline
MSR are non-serializing, i.e. it's highly likely guest kernels will switch
to using WRMSRNS when possible.  And in general, any MSR written via
WRMSRNS is probably worth handling in the fastpath, as the entire point of
WRMSRNS is to shave cycles in hot paths.
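
To illustrate the idea, the sketch below routes both the register form and
the immediate form of a write through one common fastpath check.  It assumes
the fastpath handles only the two MSRs named above (x2APIC ICR at index
0x830 and the TSC Deadline MSR at 0x6e0); the function and enum names are
hypothetical and the real handlers do considerably more work.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MSR_X2APIC_ICR	0x830u	/* x2APIC ICR */
#define TOY_MSR_TSC_DEADLINE	0x6e0u	/* IA32_TSC_DEADLINE */

enum toy_path { TOY_FASTPATH, TOY_SLOWPATH };

/* Common check: is this MSR write cheap enough to finish in the fastpath? */
static enum toy_path toy_fastpath_wrmsr_common(uint32_t msr)
{
	switch (msr) {
	case TOY_MSR_X2APIC_ICR:
	case TOY_MSR_TSC_DEADLINE:
		return TOY_FASTPATH;
	default:
		return TOY_SLOWPATH;
	}
}

/* Register form: the MSR index is taken from the guest's ECX. */
static enum toy_path toy_fastpath_wrmsr(uint64_t guest_rcx)
{
	return toy_fastpath_wrmsr_common((uint32_t)guest_rcx);
}

/* Immediate form: the MSR index is supplied by the VM-Exit information. */
static enum toy_path toy_fastpath_wrmsr_imm(uint32_t msr_from_exit_info)
{
	return toy_fastpath_wrmsr_common(msr_from_exit_info);
}
```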

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: rewrite changelog, split rename to separate patch]
Link: https://lore.kernel.org/r/20250805202224.1475590-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel
Xin Li [Tue, 5 Aug 2025 20:22:22 +0000 (13:22 -0700)] 
KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel

Add support for the immediate forms of RDMSR and WRMSRNS (currently
Intel-only).  The immediate variants are only valid in 64-bit mode, and
use a single general purpose register for the data (the register is also
encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR).

The immediate variants are primarily motivated by performance, not code
size: by having the MSR index in an immediate, it is available *much*
earlier in the CPU pipeline, which allows hardware much more leeway about
how a particular MSR is handled.

Intel VMX support for the immediate forms of MSR accesses communicates
exit information to the host as follows:

  1) The immediate form of RDMSR uses VM-Exit Reason 84.

  2) The immediate form of WRMSRNS uses VM-Exit Reason 85.

  3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is
     set to the MSR index that triggered the VM-Exit.

  4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to
     the register encoding used by the immediate form of the instruction,
     i.e. the destination register for RDMSR, and the source for WRMSRNS.

  5) The VM-Exit Instruction Length field records the size of the
     immediate form of the MSR instruction.
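
The five points above can be sketched as a small decoder.  The struct
layouts and names (toy_exit, toy_msr_imm, toy_decode_msr_imm()) are
hypothetical and only mirror the fields listed in this changelog; actual
code would read these values from the VMCS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_EXIT_REASON_RDMSR_IMM	84u
#define TOY_EXIT_REASON_WRMSRNS_IMM	85u

/* Raw VM-Exit fields, as they might be read from the VMCS. */
struct toy_exit {
	uint32_t reason;
	uint64_t qualification;
	uint32_t instr_info;
	uint32_t instr_len;
};

/* Decoded view of an immediate-form MSR access. */
struct toy_msr_imm {
	bool     is_write;	/* true for WRMSRNS, false for RDMSR */
	uint32_t msr;		/* MSR index from the Exit Qualification */
	unsigned reg;		/* RDMSR destination or WRMSRNS source */
	uint32_t len;		/* instruction length, for skipping it */
};

/* Returns false if the exit is not an immediate-form MSR access. */
static bool toy_decode_msr_imm(const struct toy_exit *e, struct toy_msr_imm *out)
{
	assert(e && out);
	if (e->reason != TOY_EXIT_REASON_RDMSR_IMM &&
	    e->reason != TOY_EXIT_REASON_WRMSRNS_IMM)
		return false;

	out->is_write = e->reason == TOY_EXIT_REASON_WRMSRNS_IMM;
	out->msr = (uint32_t)e->qualification;
	out->reg = (e->instr_info >> 3) & 0xf;	/* bits 3..6, four bits */
	out->len = e->instr_len;
	return true;
}
```

Four bits are enough to name any of the sixteen general purpose registers
available in 64-bit mode.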

To deal with userspace RDMSR exits, stash the destination register in a
new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc.
Alternatively, the register could be saved in kvm_run.msr or re-retrieved
from the VMCS, but the former would require sanitizing the value to ensure
userspace doesn't clobber the value to an out-of-bounds index, and the
latter would require a new one-off kvm_x86_ops hook.
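
A minimal sketch of the stash approach, using a hypothetical rdmsr_imm_reg
field and hypothetical helper names.  Because the field lives in
kernel-private vCPU state in this model, userspace cannot modify it between
the exit and the completion, so the completion path can trust the index.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_GPRS 16

struct toy_vcpu {
	uint64_t regs[TOY_NR_GPRS];
	int rdmsr_imm_reg;	/* hypothetical stash, never exposed to userspace */
};

/* Before exiting to userspace: remember which register RDMSR targets. */
static void toy_prepare_user_rdmsr(struct toy_vcpu *vcpu, int reg)
{
	assert(reg >= 0 && reg < TOY_NR_GPRS);
	vcpu->rdmsr_imm_reg = reg;
}

/* After userspace supplies the MSR value: write it to the stashed register. */
static void toy_complete_user_rdmsr(struct toy_vcpu *vcpu, uint64_t data)
{
	vcpu->regs[vcpu->rdmsr_imm_reg] = data;
}
```

Validating the index once, when it is stashed, is what avoids the
sanitization that storing it in a userspace-visible structure would need.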

Don't bother adding support for the instructions in KVM's emulator, as the
only way for RDMSR/WRMSR to be encountered is if KVM is emulating large
swaths of code due to invalid guest state, and a vCPU cannot have invalid
guest state while in 64-bit mode.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: minor tweaks, massage and expand changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()
Xin Li [Tue, 5 Aug 2025 20:22:21 +0000 (13:22 -0700)] 
KVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()

Rename the WRMSR fastpath API to drop "irqoff", as that information is
redundant (the fastpath always runs with IRQs disabled), and to prepare
for adding a fastpath for the immediate variant of WRMSRNS.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago KVM: x86: Rename local "ecx" variables to "msr" and "pmc" as appropriate
Sean Christopherson [Tue, 5 Aug 2025 20:22:20 +0000 (13:22 -0700)] 
KVM: x86: Rename local "ecx" variables to "msr" and "pmc" as appropriate

Rename "ecx" variables in {RD,WR}MSR and RDPMC helpers to "msr" and "pmc"
respectively, in anticipation of adding support for the immediate variants
of RDMSR and WRMSRNS, and to better document what the variables hold
(versus where the data originated).

No functional change intended.

Link: https://lore.kernel.org/r/20250805202224.1475590-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago x86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions
Xin Li [Tue, 5 Aug 2025 20:22:19 +0000 (13:22 -0700)] 
x86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions

The immediate forms of MSR access instructions are primarily motivated
by performance, not code size: by having the MSR number in an immediate,
it is available *much* earlier in the pipeline, which allows the
hardware much more leeway about how a particular MSR is handled.

Use a scattered CPU feature bit for MSR immediate form instructions.

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250805202224.1475590-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>