6 weeks ago  KVM: x86: Enable guest SSP read/write interface with new uAPIs
Yang Weijiang [Fri, 19 Sep 2025 22:32:20 +0000 (15:32 -0700)] 
KVM: x86: Enable guest SSP read/write interface with new uAPIs

Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace
save and restore the guest's Shadow Stack Pointer (SSP).  On both Intel
and AMD, SSP is a hardware register that can only be accessed by software
via dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware
to context switch SSP at entry/exit).  As a result, SSP doesn't fit in
any of KVM's existing interfaces for saving/restoring state.

Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes
to SSP follow that of several other Shadow Stack MSRs, e.g. the PLx_SSP
MSRs.  Use a translation layer to hide the KVM-internal MSR index so that
the arbitrary index doesn't become ABI, e.g. so that KVM can rework its
implementation as needed, so long as the ONE_REG ABI is maintained.

Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack
support to avoid running afoul of ignore_msrs, which unfortunately applies
to host-initiated accesses (which is a discussion for another day).  I.e.
ensure consistent behavior for KVM-defined registers irrespective of
ignore_msrs.
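
A minimal userspace sketch of the new interface; the exact register-ID
constant (KVM_REG_GUEST_SSP) comes from this series' uAPI headers and is
assumed here:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Read the guest's SSP via the KVM-defined ONE_REG id. */
  static int read_guest_ssp(int vcpu_fd, uint64_t *ssp)
  {
          struct kvm_one_reg reg = {
                  .id   = KVM_REG_GUEST_SSP,      /* new in this series */
                  .addr = (uint64_t)ssp,
          };

          return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
  }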

Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: VMX: Introduce CET VMCS fields and control bits
Yang Weijiang [Fri, 19 Sep 2025 22:32:19 +0000 (15:32 -0700)] 
KVM: VMX: Introduce CET VMCS fields and control bits

Control-flow Enforcement Technology (CET) is a CPU feature that defends
against Return/Call/Jump-Oriented Programming (ROP/COP/JOP) style
control-flow subversion attacks. It provides two sub-features: Shadow
Stack (SHSTK) and Indirect Branch Tracking (IBT).

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.

Indirect Branch Tracking (IBT):
  IBT introduces a new instruction (ENDBRANCH) to mark valid target
  addresses of indirect branches (CALL, JMP, etc.). If an indirect branch
  is executed and the next instruction is _not_ an ENDBRANCH, the
  processor generates a #CP. The instruction behaves as a NOP on
  platforms without CET.

Several new CET MSRs are defined to support CET:
  MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.

  MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.

  MSR_IA32_INT_SSP_TAB: Linear address of the SHSTK pointer table, whose
  entries are indexed by the IST field of an interrupt gate descriptor.

Two XSAVES state bits are introduced for CET:
  IA32_XSS[bit 11]: Controls saving/restoring user mode CET state.
  IA32_XSS[bit 12]: Controls saving/restoring supervisor mode CET state.

Six VMCS fields are introduced for CET:
  {HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
  {HOST,GUEST}_SSP: Stores current active SSP.
  {HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.

On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
VMCS fields at VM-Exit:
  HOST_S_CET
  HOST_SSP
  HOST_INTR_SSP_TABLE

If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
VMCS fields at VM-Entry:
  GUEST_S_CET
  GUEST_SSP
  GUEST_INTR_SSP_TABLE
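
For reference, a sketch of the new encodings as described in the SDM; the
names mirror the changelog and the exact values/names may differ slightly
from the final asm/vmx.h definitions:

  #define GUEST_S_CET                     0x6828
  #define GUEST_SSP                       0x682a
  #define GUEST_INTR_SSP_TABLE            0x682c
  #define HOST_S_CET                      0x6c18
  #define HOST_SSP                        0x6c1a
  #define HOST_INTR_SSP_TABLE             0x6c1c

  #define VM_ENTRY_LOAD_CET_STATE         0x00100000      /* bit 20 */
  #define VM_EXIT_LOAD_CET_STATE          0x10000000      /* bit 28 */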

Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Report KVM supported CET MSRs as to-be-saved
Yang Weijiang [Fri, 19 Sep 2025 22:32:18 +0000 (15:32 -0700)] 
KVM: x86: Report KVM supported CET MSRs as to-be-saved

Add CET MSRs to the list of MSRs reported to userspace if the feature,
i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Add fault checks for guest CR4.CET setting
Yang Weijiang [Fri, 19 Sep 2025 22:32:17 +0000 (15:32 -0700)] 
KVM: x86: Add fault checks for guest CR4.CET setting

Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.
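
A minimal sketch of the two interdependent checks, assuming KVM's
standard CR-setting flow (the real patch hooks KVM's existing CR0/CR4
validity helpers):

  /* In the CR4 path: CR4.CET=1 requires CR0.WP=1. */
  if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
          return 1;

  /* In the CR0 path: CR0.WP can't be cleared while CR4.CET=1. */
  if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
          return 1;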

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Load guest FPU state when accessing XSAVE-managed MSRs
Sean Christopherson [Fri, 19 Sep 2025 22:32:16 +0000 (15:32 -0700)] 
KVM: x86: Load guest FPU state when accessing XSAVE-managed MSRs

Load the guest's FPU state if userspace is accessing MSRs whose values
are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
to facilitate access to such MSRs.

If MSRs supported in kvm_caps.supported_xss are passed through to the
guest, the guest's MSR values are swapped with the host's before the vCPU
exits to userspace, and swapped back after it reenters the kernel, before
the next VM-entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.

The two helpers are placed here to make explicit that accessing
XSAVE-managed MSRs requires special checks and handling to guarantee the
correctness of reads and writes to those MSRs.
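
A sketch of the helper shape implied by the changelog (per the
[sean: ...] note below, the actual accessors live in x86.c):

  /* Caller must have loaded guest FPU state (kvm_load_guest_fpu()),
   * as RDMSR/WRMSR operate on the currently resident XSAVE state. */
  static void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
                                 struct msr_data *msr_info)
  {
          rdmsrl(msr_info->index, msr_info->data);
  }

  static void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
                                 struct msr_data *msr_info)
  {
          wrmsrl(msr_info->index, msr_info->data);
  }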

Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: drop S_CET, add big comment, move accessors to x86.c]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Initialize kvm_caps.supported_xss
Yang Weijiang [Fri, 19 Sep 2025 22:32:15 +0000 (15:32 -0700)] 
KVM: x86: Initialize kvm_caps.supported_xss

Set the initial kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS)
if XSAVES is supported. host_xss contains the host-supported xstate
feature bits used for thread FPU context switching, while
KVM_SUPPORTED_XSS includes all XSS feature bits enabled by KVM. The
resulting value represents the supervisor xstates that are available to
the guest and are backed by the host FPU framework for swapping
{guest,host} XSAVE-managed registers/MSRs.
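
The initialization is essentially a one-liner (sketch, assuming host_xss
is snapshotted from MSR_IA32_XSS):

  if (boot_cpu_has(X86_FEATURE_XSAVES)) {
          rdmsrl(MSR_IA32_XSS, host_xss);
          kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
  }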

[sean: relocate and enhance comment about PT / XSS[8] ]

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
Yang Weijiang [Fri, 19 Sep 2025 22:32:14 +0000 (15:32 -0700)] 
KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate
size whenever the XSS MSR is modified. CPUID.(EAX=0DH,ECX=1).EBX reports
the required storage size of all enabled xstate features in
(XCR0 | IA32_XSS). The guest can use this CPUID value to allocate a
sufficiently sized XSAVE buffer.

Note, KVM does not yet support any XSS based features, i.e. supported_xss
is guaranteed to be zero at this time.

Opportunistically skip CPUID updates if XSS value doesn't change.
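
A sketch of the resulting kvm_set_msr_common() case; the dirty-flag name
follows current KVM conventions and validity checking of the incoming
value is elided (see the guest-CPUID patch adjacent to this one):

  case MSR_IA32_XSS:
          if (vcpu->arch.ia32_xss == data)
                  break;          /* no change => skip the CPUID update */
          vcpu->arch.ia32_xss = data;
          vcpu->arch.cpuid_dynamic_bits_dirty = true;
          break;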

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Check XSS validity against guest CPUIDs
Chao Gao [Fri, 19 Sep 2025 22:32:13 +0000 (15:32 -0700)] 
KVM: x86: Check XSS validity against guest CPUIDs

Maintain per-guest valid XSS bits and check XSS validity against them
rather than against KVM capabilities. This is to prevent bits that are
supported by KVM but not supported for a guest from being set.

Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses
if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since
KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the
host_initiated check.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Report XSS as to-be-saved if there are supported features
Sean Christopherson [Fri, 19 Sep 2025 22:32:12 +0000 (15:32 -0700)] 
KVM: x86: Report XSS as to-be-saved if there are supported features

Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS based feature.

Before the CET virtualization series, the guest's MSR_IA32_XSS is
guaranteed to be 0, i.e. XSAVES/XRSTORS executes in non-root mode with
XSS == 0, which is equivalent to XSAVE/XRSTOR.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support
Yang Weijiang [Fri, 19 Sep 2025 22:32:11 +0000 (15:32 -0700)] 
KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support

Enable KVM_{G,S}ET_ONE_REG uAPIs so that userspace can access MSRs and
other non-MSR registers through them, along with support for
KVM_GET_REG_LIST to enumerate support for KVM-defined registers.

This is in preparation for allowing userspace to read/write the guest SSP
register, which is needed for the upcoming CET virtualization support.

Currently, two types of registers are supported: KVM_X86_REG_TYPE_MSR and
KVM_X86_REG_TYPE_KVM. All MSRs are in the former type; the latter type is
added for registers that lack existing KVM uAPIs to access them. The "KVM"
in the name is intended to be vague to give KVM flexibility to include
other potential registers.  More precise names like "SYNTHETIC" and
"SYNTHETIC_MSR" were considered, but were deemed too confusing (e.g. can
be conflated with synthetic guest-visible MSRs) and may put KVM into a
corner (e.g. if KVM wants to change how a KVM-defined register is modeled
internally).

Enumerate only KVM-defined registers in KVM_GET_REG_LIST to avoid
duplicating KVM_GET_MSR_INDEX_LIST, and so that KVM can return _only_
registers that are fully supported (KVM_GET_REG_LIST is vCPU-scoped, i.e.
can be precise, whereas KVM_GET_MSR_INDEX_LIST is system-scoped).
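
Userspace discovery follows the established ONE_REG pattern used by other
architectures (sketch):

  struct kvm_reg_list hdr = { .n = 0 }, *list;

  /* First call fails with E2BIG but fills in the register count... */
  ioctl(vcpu_fd, KVM_GET_REG_LIST, &hdr);

  /* ...second call fills in the KVM-defined register ids. */
  list = calloc(1, sizeof(*list) + hdr.n * sizeof(__u64));
  list->n = hdr.n;
  ioctl(vcpu_fd, KVM_GET_REG_LIST, list);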

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/all/20240219074733.122080-18-weijiang.yang@intel.com
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Merge 'selftests' into 'cet' to pick up ex_str()
Sean Christopherson [Tue, 23 Sep 2025 16:00:18 +0000 (09:00 -0700)] 
KVM: x86: Merge 'selftests' into 'cet' to pick up ex_str()

Merge the queue of KVM selftests changes for 6.18 to pick up the ex_str()
helper so that it can be used to pretty print expected versus actual
exceptions in a new MSR selftest.  CET virtualization will add support for
several MSRs with non-trivial semantics, along with new uAPI for accessing
the guest's Shadow Stack Pointer (SSP) from userspace.

6 weeks ago  KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies
Sean Christopherson [Tue, 23 Sep 2025 15:59:49 +0000 (08:59 -0700)] 
KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies

Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB
helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for
SEV-ES+ guests.

6 weeks ago  KVM: SEV: Validate XCR0 provided by guest in GHCB
Sean Christopherson [Fri, 19 Sep 2025 22:32:10 +0000 (15:32 -0700)] 
KVM: SEV: Validate XCR0 provided by guest in GHCB

Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's
software model in order to validate the new XCR0 against KVM's view of
the supported XCR0.  Allowing garbage is thankfully mostly benign, as
kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected
state, xstate_required_size() will simply provide garbage back to the
guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS
will only harm the guest (setting XCR0 will fail).

However, allowing the guest to put junk into a field that KVM assumes is
valid is a CVE waiting to happen.  And as a bonus, using the proper API
eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty.

Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage.  In either case, KVM can't fix the broken guest.

Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits
if XCR0 isn't actually changing (relative to KVM's previous snapshot).
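
The resulting flow is roughly (sketch; accessor names per the GHCB
helper patches earlier in this series):

  if (kvm_ghcb_xcr0_is_valid(svm)) {
          /* Failure is deliberately ignored; KVM can't fix a guest
           * that provides garbage, see above. */
          __kvm_set_xcr(vcpu, XCR_XFEATURE_ENABLED_MASK,
                        kvm_ghcb_get_xcr0(svm));
  }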

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: SEV: Read save fields from GHCB exactly once
Sean Christopherson [Fri, 19 Sep 2025 22:32:09 +0000 (15:32 -0700)] 
KVM: SEV: Read save fields from GHCB exactly once

Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific
GHCB get() utility to help guard against TOCTOU bugs.  Using READ_ONCE()
doesn't completely prevent such bugs, e.g. doesn't prevent KVM from
redoing get() after checking the initial value, but at least addresses
all potential TOCTOU issues in the current KVM code base.

To prevent unintentional use of the generic helpers, take only @svm for
the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly
passing it in.

Opportunistically reduce the indentation of the macro-defined helpers and
clean up the alignment.
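
The KVM-specific get() utility is macro-built along these lines (sketch;
the macro name is illustrative):

  #define DEFINE_KVM_GHCB_GET(field)                                  \
  static inline u64 kvm_ghcb_get_##field(struct vcpu_svm *svm)        \
  {                                                                   \
          return READ_ONCE(svm->sev_es.ghcb->save.field);             \
  }

  DEFINE_KVM_GHCB_GET(sw_exit_code)
  DEFINE_KVM_GHCB_GET(xcr0)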

Fixes: 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code()
Sean Christopherson [Fri, 19 Sep 2025 22:32:08 +0000 (15:32 -0700)] 
KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code()

Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() to make
it clear that KVM is getting the cached value, not reading directly from
the guest-controlled GHCB.  More importantly, vacating
kvm_ghcb_get_sw_exit_code() will allow adding a KVM-specific macro-built
kvm_ghcb_get_##field() helper to read values from the GHCB.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Add ex_str() to print human friendly name of exception vectors
Sean Christopherson [Fri, 19 Sep 2025 22:32:51 +0000 (15:32 -0700)] 
KVM: selftests: Add ex_str() to print human friendly name of exception vectors

Steal exception_mnemonic() from KVM-Unit-Tests as ex_str() (to keep line
lengths reasonable) and use it in assert messages that currently print the
raw vector number.
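
Typical usage in an assert (illustrative):

  __GUEST_ASSERT(vector == GP_VECTOR,
                 "Expected #GP, got %s", ex_str(vector));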

Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  selftests/kvm: remove stale TODO in xapic_state_test
Sukrut Heroorkar [Mon, 8 Sep 2025 21:05:46 +0000 (23:05 +0200)] 
selftests/kvm: remove stale TODO in xapic_state_test

The TODO about using the number of vCPUs instead of vcpu.id + 1
was already addressed by commit 376bc1b458c9 ("KVM: selftests: Don't
assume vcpu->id is '0' in xAPIC state test"). The comment is now
stale and can be removed.

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Link: https://lore.kernel.org/r/20250908210547.12748-1-hsukrut3@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount
dongsheng [Fri, 19 Sep 2025 21:46:48 +0000 (14:46 -0700)] 
KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount

Add a PMU errata framework and use it to relax precise event counts on
Atom platforms that overcount "Instruction Retired" and "Branch Instruction
Retired" events, as the overcount issues on VM-Exit/VM-Entry are impossible
to prevent from userspace, e.g. the test can't prevent host IRQs.

Set up errata during early initialization and automatically sync the mask
to VMs so that tests can check for errata without having to manually
manage host=>guest variables.

For Intel Atom CPUs, the PMU events "Instruction Retired" or
"Branch Instruction Retired" may be overcounted for some certain
instructions, like FAR CALL/JMP, RETF, IRET, VMENTRY/VMEXIT/VMPTRLD
and complex SGX/SMX/CSTATE instructions/flows.

The detailed information can be found in the errata (section SRF7):
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/sierra-forest/xeon-6700-series-processor-with-e-cores-specification-update/errata-details/

For Atom platforms up to and including Sierra Forest, both the
"Instruction Retired" and "Branch Instruction Retired" events are
overcounted on these instructions, whereas on Clearwater Forest only the
"Instruction Retired" event is overcounted.

Signed-off-by: dongsheng <dongsheng.x.zhang@intel.com>
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Validate more arch-events in pmu_counters_test
Dapeng Mi [Fri, 19 Sep 2025 21:46:47 +0000 (14:46 -0700)] 
KVM: selftests: Validate more arch-events in pmu_counters_test

Add support for 5 new architectural events (4 topdown level 1 metrics
events and LBR inserts event) that will first show up in Intel's
Clearwater Forest CPUs.  Detailed info about the new events can be found
in SDM section 21.2.7 "Pre-defined Architectural Performance Events".

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
[sean: drop "unavailable_mask" changes]
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Reduce number of "unavailable PMU events" combos tested
Sean Christopherson [Fri, 19 Sep 2025 21:46:46 +0000 (14:46 -0700)] 
KVM: selftests: Reduce number of "unavailable PMU events" combos tested

Reduce the number of combinations of unavailable PMU events masks that are
testing by the PMU counters test.  In reality, testing every possible
combination isn't all that interesting, and certainly not worth the tens
of seconds (or worse, minutes) of runtime.  Fully testing the N^2 space
will be especially problematic in the near future, as 5! new arch events
are on their way.

Use alternating bit patterns (and 0 and -1u) in the hopes that _if_ there
is ever a KVM bug, it's not something horribly convoluted that shows up
only with a super specific pattern/value.

Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Track unavailable_mask for PMU events as 32-bit value
Sean Christopherson [Fri, 19 Sep 2025 21:46:45 +0000 (14:46 -0700)] 
KVM: selftests: Track unavailable_mask for PMU events as 32-bit value

Track the mask of "unavailable" PMU events as a 32-bit value.  While bits
31:9 are currently reserved, silently truncating those bits is unnecessary
and asking for missed coverage.  To avoid running afoul of the sanity check
in vcpu_set_cpuid_property(), explicitly adjust the mask based on the
non-reserved bits as reported by KVM's supported CPUID.

Opportunistically update the "all ones" testcase to pass -1u instead of
0xff.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: selftests: Add timing_info bit support in vmx_pmu_caps_test
Dapeng Mi [Fri, 19 Sep 2025 21:46:44 +0000 (14:46 -0700)] 
KVM: selftests: Add timing_info bit support in vmx_pmu_caps_test

A new bit, PERF_CAPABILITIES[17] ("PEBS_TIMING_INFO"), is added to
indicate whether PEBS supports recording timing information in a new
"Retire Latency" field.

Since KVM requires that userspace set only host-consistent PEBS
capabilities (otherwise the PERF_CAPABILITIES write fails), add
pebs_timing_info to the "immutable_caps" so that host-inconsistent PEBS
configurations are blocked and flagged as errors.

Opportunistically drop the anythread_deprecated bit.  It isn't, and
likely never was, a PERF_CAPABILITIES flag; the test's definition snuck
in when the union was copy+pasted from the kernel's definition.
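
In the test's PERF_CAPABILITIES union, the new bit sits at position 17
(simplified sketch; the surrounding fields are collapsed here):

  union perf_capabilities {
          struct {
                  u64     bits_0_16:17;           /* LBR_FMT, PEBS, ... */
                  u64     pebs_timing_info:1;     /* bit 17 */
                  u64     bits_18_63:46;
          };
          u64     capabilities;
  };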

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
[sean: call out anythread_deprecated change]
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250919214648.1585683-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Fix hypercalls docs section number order
Bagas Sanjaya [Tue, 9 Sep 2025 00:39:52 +0000 (07:39 +0700)] 
KVM: x86: Fix hypercalls docs section number order

Commit 4180bf1b655a79 ("KVM: X86: Implement "send IPI" hypercall")
documents the KVM_HC_SEND_IPI hypercall, yet its section number
duplicates that of KVM_HC_CLOCK_PAIRING (both are numbered 6). Fix the
numbering so that the former becomes 7.

Fixes: 4180bf1b655a ("KVM: X86: Implement "send IPI" hypercall")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20250909003952.10314-1-bagasdotme@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
6 weeks ago  KVM: x86: Don't treat ENTER and LEAVE as branches, because they aren't
Sean Christopherson [Fri, 19 Sep 2025 00:46:39 +0000 (17:46 -0700)] 
KVM: x86: Don't treat ENTER and LEAVE as branches, because they aren't

Remove the IsBranch flag from ENTER and LEAVE in KVM's emulator, as ENTER
and LEAVE are stack operations, not branches.  Add forced emulation of
said instructions to the PMU counters test to prove that KVM diverges from
hardware, and to guard against regressions.

Opportunistically add a missing "1 MOV" to the selftest comment regarding
the number of instructions per loop, which commit 7803339fa929 ("KVM:
selftests: Use data load to trigger LLC references/misses in Intel PMU")
forgot to add.
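
The selftest forces KVM's emulator (rather than hardware) to execute the
instructions via the KVM_FEP prefix; an illustrative sketch of the guest
code:

  /* The ENTER/LEAVE pair leaves RBP intact; KVM_FEP forces emulation. */
  asm volatile(KVM_FEP "enter $0, $0\n\t"
               KVM_FEP "leave\n\t"
               : : : "memory");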

Fixes: 018d70ffcfec ("KVM: x86: Update vPMCs when retiring branch instructions")
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919004639.1360453-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Restrict GLOBAL_{CTRL,STATUS}, fixed PMCs, and PEBS to PMU v2+
Sean Christopherson [Wed, 6 Aug 2025 19:56:53 +0000 (12:56 -0700)] 
KVM: x86/pmu: Restrict GLOBAL_{CTRL,STATUS}, fixed PMCs, and PEBS to PMU v2+

Restrict support for GLOBAL_CTRL, GLOBAL_STATUS, fixed PMCs, and PEBS to
v2 or later vPMUs.  The SDM explicitly states that GLOBAL_{CTRL,STATUS} and
fixed counters were introduced with PMU v2, and PEBS has hard dependencies
on fixed counters and the bitmap MSR layouts established by PMU v2.

Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86
Sean Christopherson [Wed, 6 Aug 2025 19:56:52 +0000 (12:56 -0700)] 
KVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86

Move all initialization of all_valid_pmc_idx to common code, as the logic
is 100% common to Intel and AMD, and KVM heavily relies on Intel and AMD
having the same semantics.  E.g. the fact that AMD doesn't support fixed
counters doesn't allow KVM to use all_valid_pmc_idx[63:32] for other
purposes.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Use BIT_ULL() instead of open coded equivalents
Dapeng Mi [Wed, 6 Aug 2025 19:56:51 +0000 (12:56 -0700)] 
KVM: x86/pmu: Use BIT_ULL() instead of open coded equivalents

Replace a variety of "1ull << N" and "(u64)1 << N" snippets with BIT_ULL()
in the PMU code.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
[sean: split to separate patch, write changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls
Dapeng Mi [Wed, 6 Aug 2025 19:56:48 +0000 (12:56 -0700)] 
KVM: VMX: Add helpers to toggle/change a bit in VMCS execution controls

Expand the VMCS controls builder macros to generate helpers to change a
bit to the desired value, and use the new helpers when toggling APICv
related controls.

No functional change intended.
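
The generated helper boils down to (sketch, following the existing
BUILD_CONTROLS_SHADOW() setbit/clearbit pattern):

  static inline void secondary_exec_controls_changebit(struct vcpu_vmx *vmx,
                                                       u32 bit, bool set)
  {
          if (set)
                  secondary_exec_controls_setbit(vmx, bit);
          else
                  secondary_exec_controls_clearbit(vmx, bit);
  }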

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: rewrite changelog]
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updates
Sean Christopherson [Wed, 6 Aug 2025 19:56:47 +0000 (12:56 -0700)] 
KVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updates

Defer recalculating MSR and instruction intercepts after a CPUID update
via RECALC_INTERCEPTS to converge on RECALC_INTERCEPTS as the "official"
mechanism for triggering recalcs.  As a bonus, because KVM does a "recalc"
during vCPU creation, and every functional VMM sets CPUID at least once,
for all intents and purposes this saves at least one recalc.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS
Sean Christopherson [Wed, 6 Aug 2025 19:56:46 +0000 (12:56 -0700)] 
KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS

Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS
request, and expand the responsibilities of vendor code to recalculate all
intercepts that vary based on userspace input, e.g. instruction intercepts
that are tied to guest CPUID.

Providing a generic recalc request will allow the upcoming mediated PMU
support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are
set by userspace, without having to make multiple calls to/from PMU code.
As a bonus, using a request will effectively coalesce recalcs, e.g. will
reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create,
set CPUID, set PERF_CAPABILITIES (Intel only), set filter).

The downside is that MSR filter changes that are done in isolation will do
a small amount of unnecessary work, but that's already a relatively slow
path, and the cost of recalculating instruction intercepts is negligible.
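
The request-based plumbing looks roughly like this (sketch;
kvm_x86_call() is KVM's static-call wrapper into vendor code):

  /* Anywhere userspace input changes what needs to be intercepted: */
  kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu);

  /* Coalesced and serviced once, prior to entering the guest: */
  if (kvm_check_request(KVM_REQ_RECALC_INTERCEPTS, vcpu))
          kvm_x86_call(recalc_intercepts)(vcpu);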

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header
Dapeng Mi [Wed, 6 Aug 2025 19:56:45 +0000 (12:56 -0700)] 
KVM: x86/pmu: Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h header

Move PMU_CAP_{FW_WRITES,LBR_FMT} into msr-index.h and rename them with
PERF_CAP prefix to keep consistent with other perf capabilities macros.

No functional change intended.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers
Dapeng Mi [Wed, 6 Aug 2025 19:56:44 +0000 (12:56 -0700)] 
KVM: x86: Rename vmx_vmentry/vmexit_ctrl() helpers

Rename the two helpers vmx_vmentry/vmexit_ctrl() to
vmx_get_initial_vmentry/vmexit_ctrl() to represent their real meaning.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities
Sean Christopherson [Wed, 6 Aug 2025 19:56:39 +0000 (12:56 -0700)] 
KVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities

Take a snapshot of the unadulterated PMU capabilities provided by perf so
that KVM can compare guest vPMU capabilities against hardware capabilities
when determining whether or not to intercept PMU MSRs (and RDPMC).

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRs
Sean Christopherson [Wed, 6 Aug 2025 19:56:37 +0000 (12:56 -0700)] 
KVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRs

Gate access to PMC MSRs based on pmu->version, not on kvm->arch.enable_pmu,
to more accurately reflect KVM's behavior.  This is a glorified nop, as
pmu->version and pmu->nr_arch_gp_counters can only be non-zero if
amd_pmu_refresh() is reached, kvm_pmu_refresh() invokes amd_pmu_refresh()
if and only if kvm->arch.enable_pmu is true, and amd_pmu_refresh() forces
pmu->version to be 1 or 2.

I.e. the following holds true:

  !pmu->nr_arch_gp_counters || kvm->arch.enable_pmu == (pmu->version > 0)

and so the only way for amd_pmu_get_pmc() to return a non-NULL value is if
both kvm->arch.enable_pmu and pmu->version evaluate to true.

No real functional change intended.

Reviewed-by: Sandipan Das <sandipan.das@amd.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: VMX: Setup canonical VMCS config prior to kvm_x86_vendor_init()
Sean Christopherson [Wed, 6 Aug 2025 19:56:36 +0000 (12:56 -0700)] 
KVM: VMX: Setup canonical VMCS config prior to kvm_x86_vendor_init()

Set up the golden VMCS config during vmx_init(), before the call to
kvm_x86_vendor_init(), instead of waiting until the callback to do
hardware setup.  setup_vmcs_config() only touches VMX state, i.e. doesn't
poke anything in kvm.ko, and has no runtime dependencies beyond
hv_init_evmcs().

Setting the VMCS config early on will allow referencing VMCS and VMX
capabilities at any point during setup, e.g. to check for PERF_GLOBAL_CTRL
save/load support during mediated PMU initialization.

Tested-by: Xudong Hao <xudong.hao@intel.com>
Link: https://lore.kernel.org/r/20250806195706.1650976-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  Documentation: KVM: Call out that KVM strictly follows the 8254 PIT spec
Jiaming Zhang [Fri, 5 Sep 2025 17:47:36 +0000 (01:47 +0800)] 
Documentation: KVM: Call out that KVM strictly follows the 8254 PIT spec

Explicitly document that the behavior of KVM_SET_PIT2 strictly conforms
to the Intel 8254 PIT hardware specification, specifically that a write of
'0' adheres to the spec's definition that a programmed count of '0' is
converted to the maximum possible value (2^16).  E.g. an unaware userspace
might attempt to validate that KVM_GET_PIT2 returns the exact state set
via KVM_SET_PIT2, and be surprised when the returned count is 65536, not 0.

Add a reference to the Intel 8254 PIT datasheet that will hopefully stay
fresh for some time (the internet isn't exactly brimming with copies of
the 8254 datasheet).

Link: https://lore.kernel.org/all/CANypQFbEySjKOFLqtFFf2vrEe=NBr7XJfbkjQhqXuZGg7Rpoxw@mail.gmail.com
Signed-off-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/r/20250905174736.260694-1-r772577952@gmail.com
[sean: add context Link, drop local APIC change, massage changelog accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86: hyper-v: Use guard() instead of mutex_lock() to simplify code
Liao Yuanhong [Mon, 1 Sep 2025 13:16:04 +0000 (21:16 +0800)] 
KVM: x86: hyper-v: Use guard() instead of mutex_lock() to simplify code

Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when setting up the TSC page for a Hyper-V guest.

No functional change intended.
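
guard() comes from <linux/cleanup.h>: it takes the lock immediately and
releases it automatically when the variable goes out of scope, covering
every early-return path (sketch, assuming the Hyper-V lock):

  guard(mutex)(&hv->hv_lock);     /* dropped automatically at scope exit */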

Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131604.646415-1-liaoyuanhong@vivo.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86: Use guard() instead of mutex_lock() to simplify code
Liao Yuanhong [Mon, 1 Sep 2025 13:18:21 +0000 (21:18 +0800)] 
KVM: x86: Use guard() instead of mutex_lock() to simplify code

Use guard(mutex) instead of mutex_lock/mutex_unlock pair to simplify the
error handling when allocating the APIC access page.

No functional change intended.

Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com>
Link: https://lore.kernel.org/r/20250901131822.647802-1-liaoyuanhong@vivo.com
[sean: add blank link to isolate guard(), tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: x86/pmu: Correct typo "_COUTNERS" to "_COUNTERS"
Dapeng Mi [Fri, 18 Jul 2025 00:19:01 +0000 (08:19 +0800)] 
KVM: x86/pmu: Correct typo "_COUTNERS" to "_COUNTERS"

Fix typos. "_COUTNERS" -> "_COUNTERS".

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20250718001905.196989-2-dapeng1.mi@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7 weeks ago  KVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs
Sagi Shahar [Wed, 27 Aug 2025 01:17:26 +0000 (18:17 -0700)] 
KVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs

Reject KVM_CREATE_IRQCHIP if the VM type has protected EOIs, i.e. if KVM
can't intercept EOI and thus can't faithfully emulate level-triggered
interrupts that are routed through the I/O APIC.  For TDX VMs, the
TDX-Module owns the VMX EOI-bitmap and configures all IRQ vectors to have
the CPU accelerate EOIs, i.e. doesn't allow KVM to intercept any EOIs.

KVM already requires a split irqchip[1], but does so during vCPU creation,
which is both too late to allow userspace to fall back to a split irqchip
and a less-than-stellar experience for userspace, since an -EINVAL on
KVM_CREATE_VCPU is far harder to debug/triage than failure exactly on
KVM_CREATE_IRQCHIP.  And of course, allowing an action that ultimately
fails is arguably a bug regardless of the impact on userspace.

Link: https://lore.kernel.org/lkml/20250222014757.897978-11-binbin.wu@linux.intel.com
Link: https://lore.kernel.org/lkml/aK3vZ5HuKKeFuuM4@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250827011726.2451115-1-sagis@google.com
[sean: massage shortlog+changelog, relocate setting has_protected_eoi]
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: nSVM: Replace kzalloc() + copy_from_user() with memdup_user()
Thorsten Blum [Wed, 3 Sep 2025 00:29:50 +0000 (02:29 +0200)] 
KVM: nSVM: Replace kzalloc() + copy_from_user() with memdup_user()

Replace kzalloc() followed by copy_from_user() with memdup_user() to
improve and simplify svm_set_nested_state().

Return early if an error occurs instead of trying to allocate memory for
'save' when memory allocation for 'ctl' already failed.
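
The transformation, roughly (sketch; 'user_ctl' stands in for the real
user pointer):

  ctl = memdup_user(user_ctl, sizeof(*ctl));
  if (IS_ERR(ctl))
          return PTR_ERR(ctl);    /* return early, don't allocate 'save' */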

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250903002951.118912-1-thorsten.blum@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: selftests: Add support for DIV and IDIV in the fastops test
Sean Christopherson [Tue, 9 Sep 2025 20:28:35 +0000 (13:28 -0700)] 
KVM: selftests: Add support for DIV and IDIV in the fastops test

Extend the fastops test coverage to DIV and IDIV, specifically to provide
coverage for #DE (divide error) exceptions, as #DE is the only exception
that can occur in KVM's fastops path, i.e. that requires exception fixup.

Link: https://lore.kernel.org/r/20250909202835.333554-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: selftests: Dedup the gnarly constraints of the fastops tests (more macros!)
Sean Christopherson [Tue, 9 Sep 2025 20:28:34 +0000 (13:28 -0700)] 
KVM: selftests: Dedup the gnarly constraints of the fastops tests (more macros!)

Add a fastop() macro along with macros to define its required constraints,
and use the macros to dedup the innermost guts of the fastop testcases.

No functional change intended.

Link: https://lore.kernel.org/r/20250909202835.333554-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: selftests: Add coverage for 'b' (byte) sized fastops emulation
Sean Christopherson [Tue, 9 Sep 2025 20:28:33 +0000 (13:28 -0700)] 
KVM: selftests: Add coverage for 'b' (byte) sized fastops emulation

Extend the fastops test to cover instructions that operate on 8-bit data.
Support for 8-bit instructions was omitted from the original commit purely
due to complications with BT not having a r/m8 variant.  To keep the
RFLAGS.CF behavior deterministic and not heavily biased to '0' or '1',
continue using BT, but cast and load the to-be-tested value into a
dedicated 32-bit constraint.

Supporting 8-bit operations will allow using guest_test_fastops() as-is to
provide full coverage for DIV and IDIV.  For divide operations, covering
all operand sizes _is_ interesting, because KVM needs to provide
exception fixup for each size (failure to handle a #DE could panic the
host).

Link: https://lore.kernel.org/all/aIF7ZhWZxlkcpm4y@google.com
Link: https://lore.kernel.org/r/20250909202835.333554-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: selftests: Add support for #DE exception fixup
Sean Christopherson [Tue, 9 Sep 2025 20:28:32 +0000 (13:28 -0700)] 
KVM: selftests: Add support for #DE exception fixup

Add support for handling #DE (divide error) exceptions in KVM selftests
so that the fastops test can verify KVM correctly handles #DE when
emulating DIV or IDIV on behalf of the guest.  Morph #DE to 0xff (i.e.
to -1) as a mostly-arbitrary vector to indicate #DE, so that '0' (the
real #DE vector) can still be used to indicate "no exception".
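
A sketch of how a test can consume this, using the existing
kvm_asm_safe() plumbing (operand setup is illustrative):

  uint8_t vector = kvm_asm_safe("div %[divisor]",
                                "a"(1), "d"(0), [divisor] "r" (0UL));

  /* 0xff == morphed #DE; '0' still means "no exception". */
  GUEST_ASSERT(vector == 0xff);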

Link: https://lore.kernel.org/r/20250909202835.333554-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: x86: Move vector_hashing into lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:09 +0000 (14:42 -0700)] 
KVM: x86: Move vector_hashing into lapic.c

Move the vector_hashing module param into lapic.c now that all usage is
contained within the local APIC emulation code.

Opportunistically drop the accessor and append "_enabled" to the variable
to help capture that it's a boolean module param.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: x86: Make "lowest priority" helpers local to lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:08 +0000 (14:42 -0700)] 
KVM: x86: Make "lowest priority" helpers local to lapic.c

Make various helpers for resolving lowest priority IRQs local to lapic.c
now that kvm_irq_delivery_to_apic() lives in lapic.c as well.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: x86: Move kvm_irq_delivery_to_apic() from irq.c to lapic.c
Sean Christopherson [Thu, 21 Aug 2025 21:42:07 +0000 (14:42 -0700)] 
KVM: x86: Move kvm_irq_delivery_to_apic() from irq.c to lapic.c

Move kvm_irq_delivery_to_apic() to lapic.c as it is specific to local APIC
emulation.  This will allow burying more local APIC code in lapic.c, e.g.
the various "lowest priority" helpers.

No functional change intended.

Link: https://lore.kernel.org/r/20250821214209.3463350-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: SEV: Save the SEV policy if and only if LAUNCH_START succeeds
Sean Christopherson [Thu, 21 Aug 2025 21:38:41 +0000 (14:38 -0700)] 
KVM: SEV: Save the SEV policy if and only if LAUNCH_START succeeds

Wait until LAUNCH_START fully succeeds to set a VM's SEV/SNP policy so
that KVM doesn't keep a potentially stale policy.  In practice, the issue
is benign as the policy is only used to detect if the VMSA can be
decrypted, and the VMSA only needs to be decrypted if LAUNCH_UPDATE and
thus LAUNCH_START succeeded.

Fixes: 962e2b6152ef ("KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabled")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Kim Phillips <kim.phillips@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250821213841.3462339-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
8 weeks ago  KVM: selftests: Fix typo in hyperv cpuid test message
Alok Tiwari [Sun, 24 Aug 2025 18:16:40 +0000 (11:16 -0700)] 
KVM: selftests: Fix typo in hyperv cpuid test message

Fix a typo in hyperv_cpuid.c test assertion log:
replace "our of supported range" -> "out of supported range".

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://lore.kernel.org/r/20250824181642.629297-1-alok.a.tiwari@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SVM: Enable Secure TSC for SNP guests
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:33 +0000 (16:48 -0700)] 
KVM: SVM: Enable Secure TSC for SNP guests

Add support for Secure TSC, allowing userspace to configure the Secure TSC
feature for SNP guests. Use the SNP specification's desired TSC frequency
parameter during the SNP_LAUNCH_START command to set the mean TSC
frequency in KHz for Secure TSC enabled guests.

Always use kvm->arch.arch.default_tsc_khz as the TSC frequency that is
passed to SNP guests in the SNP_LAUNCH_START command.  The default value
is the host TSC frequency.  The userspace can optionally change the TSC
frequency via the KVM_SET_TSC_KHZ ioctl before calling the
SNP_LAUNCH_START ioctl.
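
The userspace flow, sketched (the VM-scoped KVM_SET_TSC_KHZ requires
KVM_CAP_VM_TSC_CONTROL; 'launch_start_cmd' is illustrative):

  /* Optionally override the default (host) TSC frequency, in kHz... */
  ioctl(vm_fd, KVM_SET_TSC_KHZ, 2000000);         /* 2 GHz */

  /* ...before SNP_LAUNCH_START snapshots it for the guest. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &launch_start_cmd);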

Introduce the read-only MSR GUEST_TSC_FREQ (0xc0010134), which returns the
guest's effective TSC frequency in MHz when Secure TSC is enabled for SNP
guests. Disable interception of this MSR when Secure TSC is enabled. Note
that the GUEST_TSC_FREQ MSR is accessible only to the guest, not from the
hypervisor context.

Co-developed-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com>
Signed-off-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
[sean: contain Secure TSC to sev.c]
Link: https://lore.kernel.org/r/20250819234833.3080255-9-seanjc@google.com
[sean: return -EINVAL if TSC frequency is '0']
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()
Sean Christopherson [Tue, 19 Aug 2025 23:48:32 +0000 (16:48 -0700)] 
KVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()

Fold the remaining line of sev_es_vcpu_reset() into sev_vcpu_create() as
there's no need for a dedicated RESET hook just to init a mutex, and the
mutex should be initialized as early as possible anyways.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SEV: Set RESET GHCB MSR value during sev_es_init_vmcb()
Sean Christopherson [Tue, 19 Aug 2025 23:48:31 +0000 (16:48 -0700)] 
KVM: SEV: Set RESET GHCB MSR value during sev_es_init_vmcb()

Set the RESET value for the GHCB "MSR" during sev_es_init_vmcb() instead
of sev_es_vcpu_reset() to allow for dropping sev_es_vcpu_reset() entirely.

Note, the call to sev_init_vmcb() from sev_migrate_from() also kinda sorta
emulates a RESET, but sev_migrate_from() immediately overwrites ghcb_gpa
with the source's current value, so whether or not stuffing the GHCB
version is correct/desirable is moot.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SEV: Move init of SNP guest state into sev_init_vmcb()
Sean Christopherson [Tue, 19 Aug 2025 23:48:30 +0000 (16:48 -0700)] 
KVM: SEV: Move init of SNP guest state into sev_init_vmcb()

Move the initialization of SNP guest state from svm_vcpu_reset() into
sev_init_vmcb() to reduce the number of paths that deal with INIT/RESET
for SEV+ vCPUs from 4+ to 1.  Plumb in @init_event as necessary.

Opportunistically check for an SNP guest outside of
sev_snp_init_protected_guest_state() so that sev_init_vmcb() is consistent
with respect to checking for SEV-ES+ and SNP+ guests.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helper
Sean Christopherson [Tue, 19 Aug 2025 23:48:29 +0000 (16:48 -0700)] 
KVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helper

Add a dedicated sev_vcpu_create() helper to allocate the VMSA page for
SEV-ES+ vCPUs, and to allow for consolidating a variety of related SEV+
code in the near future.

No functional change intended.

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  x86/cpufeatures: Add SNP Secure TSC
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:28 +0000 (16:48 -0700)] 
x86/cpufeatures: Add SNP Secure TSC

The Secure TSC feature for SEV-SNP allows guests to securely use the RDTSC
and RDTSCP instructions, ensuring that the parameters used cannot be
altered by the hypervisor once the guest is launched. For more details,
refer to the AMD64 APM Vol 2, Section "Secure TSC".

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Vaishali Thakkar <vaishali.thakkar@suse.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SEV: Enforce minimum GHCB version requirement for SEV-SNP guests
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:27 +0000 (16:48 -0700)] 
KVM: SEV: Enforce minimum GHCB version requirement for SEV-SNP guests

Require a minimum GHCB version of 2 when starting SEV-SNP guests through
KVM_SEV_INIT2. When a VMM attempts to start an SEV-SNP guest with an
incompatible GHCB version (less than 2), reject the request early rather
than allowing the guest kernel to start with an incorrect protocol version
and fail later with GHCB_SNP_UNSUPPORTED guest termination.

Not enforcing the minimum version typically causes the guest to request
termination with GHCB_SNP_UNSUPPORTED error code:

  kvm_amd: SEV-ES guest requested termination: 0x0:0x2
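
The rejection is a simple early check (sketch; exact placement in the
SEV init flow per this patch):

  if (sev_snp_guest(kvm) && sev->ghcb_version < 2)
          return -EINVAL;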

Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version")
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: SEV: Drop GHCB_VERSION_DEFAULT and open code it
Nikunj A Dadhania [Tue, 19 Aug 2025 23:48:26 +0000 (16:48 -0700)] 
KVM: SEV: Drop GHCB_VERSION_DEFAULT and open code it

Remove the GHCB_VERSION_DEFAULT macro and open code it as '2'. The macro
is used conditionally and is not a true default, and the KVM ABI does not
advertise/enumerate a default GHCB version. Any future change to the
macro would silently alter the ABI and potentially break existing
deployments that rely on the current behavior.

Additionally, move the GHCB version assignment earlier in the code flow and
update the comment to clarify that KVM_SEV_INIT2 defaults to version 2,
while KVM_SEV_INIT forces version 1.

No functional change intended.

Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Link: https://lore.kernel.org/r/20250819234833.3080255-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: x86: Zero XSTATE components on INIT by iterating over supported features
Chao Gao [Tue, 12 Aug 2025 02:55:13 +0000 (19:55 -0700)] 
KVM: x86: Zero XSTATE components on INIT by iterating over supported features

Tweak the code a bit to facilitate resetting more xstate components in
the future, e.g., CET's xstate-managed MSRs.

No functional change intended.
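
The reworked reset iterates over a feature mask instead of open coding
each component (sketch; today's mask covers MPX, with CET to follow):

  u64 xfeatures_mask = XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR;
  int i;

  /* Zero each component that architecturally resets on INIT. */
  for_each_set_bit(i, (unsigned long *)&xfeatures_mask, XFEATURE_MAX)
          fpstate_clear_xstate_component(fpstate, i);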

Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-6-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months ago  KVM: x86: Manually clear MPX state only on INIT
Sean Christopherson [Tue, 12 Aug 2025 02:55:12 +0000 (19:55 -0700)] 
KVM: x86: Manually clear MPX state only on INIT

Don't manually clear/zero MPX state on RESET, as the guest FPU state is
zero allocated and KVM only does RESET during vCPU creation, i.e. the
relevant state is guaranteed to be all zeroes.

Opportunistically move the relevant code into a helper in anticipation of
adding support for CET shadow stacks, which also has state that is zeroed
on INIT.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-5-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Add kvm_msr_{read,write}() helpers
Yang Weijiang [Tue, 12 Aug 2025 02:55:11 +0000 (19:55 -0700)] 
KVM: x86: Add kvm_msr_{read,write}() helpers

Wrap __kvm_{get,set}_msr() in two new helpers for KVM-internal usage and
use the helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, used when KVM needs to
get/set an MSR value while emulating CPU behavior, i.e. with
host_initiated == %true in the helpers.
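
A sketch of the two wrappers (signatures assumed); the essential point is
that host_initiated is hard-coded to true:

  int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
  {
          return __kvm_get_msr(vcpu, index, data, true);
  }

  int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
  {
          return __kvm_set_msr(vcpu, index, data, true);
  }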

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-4-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Use double-underscore read/write MSR helpers as appropriate
Sean Christopherson [Tue, 12 Aug 2025 02:55:10 +0000 (19:55 -0700)] 
KVM: x86: Use double-underscore read/write MSR helpers as appropriate

Use the double-underscore helpers for emulating MSR reads and writes in
the no-underscore versions to better capture the relationship between the
two sets of APIs (the double-underscore versions don't honor userspace MSR
filters).

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses
Yang Weijiang [Tue, 12 Aug 2025 02:55:09 +0000 (19:55 -0700)] 
KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses

Rename
kvm_{g,s}et_msr_with_filter()
kvm_{g,s}et_msr()
to
kvm_emulate_msr_{read,write}
__kvm_emulate_msr_{read,write}

to make it more obvious that KVM uses these helpers to emulate guest
behaviors, i.e., host_initiated == false in these helpers.

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Advertise support for the immediate form of MSR instructions
Xin Li [Tue, 5 Aug 2025 20:22:24 +0000 (13:22 -0700)] 
KVM: x86: Advertise support for the immediate form of MSR instructions

Advertise support for the immediate form of MSR instructions to userspace
if the instructions are supported by the underlying CPU, and KVM is using
VMX, i.e. is running on an Intel-compatible CPU.

For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over-
report support if AMD-compatible CPUs ever implement the immediate forms,
as SVM will likely require explicit enablement in KVM.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: massage changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: VMX: Support the immediate form of WRMSRNS in the VM-Exit fastpath
Xin Li [Tue, 5 Aug 2025 20:22:23 +0000 (13:22 -0700)] 
KVM: VMX: Support the immediate form of WRMSRNS in the VM-Exit fastpath

Add support for handling "WRMSRNS with an immediate" VM-Exits in KVM's
fastpath.  On Intel, all writes to the x2APIC ICR and to the TSC Deadline
MSR are non-serializing, i.e. it's highly likely guest kernels will switch
to using WRMSRNS when possible.  And in general, any MSR written via
WRMSRNS is probably worth handling in the fastpath, as the entire point of
WRMSRNS is to shave cycles in hot paths.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: rewrite changelog, split rename to separate patch]
Link: https://lore.kernel.org/r/20250805202224.1475590-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel
Xin Li [Tue, 5 Aug 2025 20:22:22 +0000 (13:22 -0700)] 
KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel

Add support for the immediate forms of RDMSR and WRMSRNS (currently
Intel-only).  The immediate variants are only valid in 64-bit mode, and
use a single general purpose register for the data (the register is also
encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR).

The immediate variants are primarily motivated by performance, not code
size: by having the MSR index in an immediate, it is available *much*
earlier in the CPU pipeline, which allows hardware much more leeway about
how a particular MSR is handled.

Intel VMX support for the immediate forms of MSR accesses communicates
exit information to the host as follows:

  1) The immediate form of RDMSR uses VM-Exit Reason 84.

  2) The immediate form of WRMSRNS uses VM-Exit Reason 85.

  3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is
     set to the MSR index that triggered the VM-Exit.

  4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to
     the register encoding used by the immediate form of the instruction,
     i.e. the destination register for RDMSR, and the source for WRMSRNS.

  5) The VM-Exit Instruction Length field records the size of the
     immediate form of the MSR instruction.
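
Per fields 3) and 4) above, a decode sketch using KVM's VMX accessors
(the surrounding handler shape is assumed):

  u32 msr  = vmx_get_exit_qual(vcpu);             /* 3): MSR index */
  u32 info = vmcs_read32(VMX_INSTRUCTION_INFO);
  int reg  = (info >> 3) & 0xf;                   /* 4): bits 3:6 = GPR */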

To deal with userspace RDMSR exits, stash the destination register in a
new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc.
Alternatively, the register could be saved in kvm_run.msr or re-retrieved
from the VMCS, but the former would require sanitizing the value to ensure
userspace doesn't clobber the value to an out-of-bounds index, and the
latter would require a new one-off kvm_x86_ops hook.

Don't bother adding support for the instructions in KVM's emulator, as the
only way for RDMSR/WRMSR to be encountered is if KVM is emulating large
swaths of code due to invalid guest state, and a vCPU cannot have invalid
guest state while in 64-bit mode.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: minor tweaks, massage and expand changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()
Xin Li [Tue, 5 Aug 2025 20:22:21 +0000 (13:22 -0700)] 
KVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()

Rename the WRMSR fastpath API to drop "irqoff", as that information is
redundant (the fastpath always runs with IRQs disabled), and to prepare
for adding a fastpath for the immediate variant of WRMSRNS.

No functional change intended.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
[sean: split to separate patch, write changelog]
Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Rename local "ecx" variables to "msr" and "pmc" as appropriate
Sean Christopherson [Tue, 5 Aug 2025 20:22:20 +0000 (13:22 -0700)] 
KVM: x86: Rename local "ecx" variables to "msr" and "pmc" as appropriate

Rename "ecx" variables in {RD,WR}MSR and RDPMC helpers to "msr" and "pmc"
respectively, in anticipation of adding support for the immediate variants
of RDMSR and WRMSRNS, and to better document what the variables hold
(versus where the data originated).

No functional change intended.

Link: https://lore.kernel.org/r/20250805202224.1475590-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agox86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions
Xin Li [Tue, 5 Aug 2025 20:22:19 +0000 (13:22 -0700)] 
x86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions

The immediate form of MSR access instructions are primarily motivated
by performance, not code size: by having the MSR number in an immediate,
it is available *much* earlier in the pipeline, which allows the
hardware much more leeway about how a particular MSR is handled.

Use a scattered CPU feature bit for MSR immediate form instructions.
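
For illustration, a scattered feature is a table entry mapping the
synthetic bit to a CPUID leaf/register/bit in
arch/x86/kernel/cpu/scattered.c; the leaf, register, and bit below are
placeholders, not taken from this log:

  /* { feature, CPUID register, bit, leaf, sub-leaf } -- values assumed. */
  { X86_FEATURE_MSR_IMM, CPUID_ECX, 5, 0x00000007, 1 },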

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250805202224.1475590-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Add a fastpath handler for INVD
Sean Christopherson [Tue, 5 Aug 2025 19:05:26 +0000 (12:05 -0700)] 
KVM: x86: Add a fastpath handler for INVD

Add a fastpath handler for INVD so that the common fastpath logic can be
trivially tested on both Intel and AMD.  Under KVM, INVD is always:
(a) intercepted, (b) available to the guest, and (c) emulated as a nop,
with no side effects.  Combined with INVD not having any inputs or outputs,
i.e. no register constraints, INVD is the perfect instruction for
exercising KVM's fastpath as it can be inserted into practically any
guest-side code stream.
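
Given that INVD is emulated as a nop, a plausible shape for the handler
(a sketch, not the verbatim implementation):

  static fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu)
  {
          /* INVD has no side effects under KVM; just skip it. */
          if (!kvm_skip_emulated_instruction(vcpu))
                  return EXIT_FASTPATH_NONE;

          return EXIT_FASTPATH_REENTER_GUEST;
  }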

Link: https://lore.kernel.org/r/20250805190526.1453366-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Push acquisition of SRCU in fastpath into kvm_pmu_trigger_event()
Sean Christopherson [Tue, 5 Aug 2025 19:05:25 +0000 (12:05 -0700)] 
KVM: x86: Push acquisition of SRCU in fastpath into kvm_pmu_trigger_event()

Acquire SRCU in the VM-Exit fastpath if and only if KVM needs to check the
PMU event filter, to further trim the amount of code that is executed with
SRCU protection in the fastpath.  Counter-intuitively, holding SRCU can do
more harm than good due to masking potential bugs, and introducing a new
SRCU-protected asset to code reachable via kvm_skip_emulated_instruction()
would be quite notable, i.e. definitely worth auditing.

E.g. the primary user of kvm->srcu is KVM's memslots, accessing memslots
all but guarantees guest memory may be accessed, accessing guest memory
can fault, and page faults might sleep, which isn't allowed while IRQs are
disabled.  Not acquiring SRCU means the (hypothetical) illegal sleep would
be flagged when running with PROVE_RCU=y, even if DEBUG_ATOMIC_SLEEP=n.

Note, performance is NOT a motivating factor, as SRCU lock/unlock only
adds ~15 cycles of latency to fastpath VM-Exits.  I.e. overhead isn't a
concern _if_ SRCU protection needs to be extended beyond PMU events, e.g.
to honor userspace MSR filters.
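
Conceptually, the SRCU critical section shrinks to just the filter check;
a minimal sketch with a hypothetical kvm_pmu_emulate_events() helper:

  if (!bitmap_empty(pmcs, X86_PMC_IDX_MAX)) {
          int idx = srcu_read_lock(&vcpu->kvm->srcu);

          /* The PMU event filter is protected by kvm->srcu. */
          kvm_pmu_emulate_events(vcpu, pmcs);
          srcu_read_unlock(&vcpu->kvm->srcu, idx);
  }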

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Rename check_pmu_event_filter() to pmc_is_event_allowed()
Sean Christopherson [Tue, 5 Aug 2025 19:05:24 +0000 (12:05 -0700)] 
KVM: x86/pmu: Rename check_pmu_event_filter() to pmc_is_event_allowed()

Rename check_pmu_event_filter() to make its polarity more obvious, and to
connect the dots to is_gp_event_allowed() and is_fixed_event_allowed().

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Drop redundant check on PMC being locally enabled for emulation
Sean Christopherson [Tue, 5 Aug 2025 19:05:23 +0000 (12:05 -0700)] 
KVM: x86/pmu: Drop redundant check on PMC being locally enabled for emulation

Drop the check on a PMC being locally enabled when triggering emulated
events, as the bitmap of passed-in PMCs only contains locally enabled PMCs.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Drop redundant check on PMC being globally enabled for emulation
Sean Christopherson [Tue, 5 Aug 2025 19:05:22 +0000 (12:05 -0700)] 
KVM: x86/pmu: Drop redundant check on PMC being globally enabled for emulation

When triggering PMC events in response to emulation, drop the redundant
checks on a PMC being globally and locally enabled, as the passed in bitmap
contains only PMCs that are locally enabled (and counting the right event),
and the local copy of the bitmap has already been masked with global_ctrl.

No true functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Open code pmc_event_is_allowed() in its callers
Sean Christopherson [Tue, 5 Aug 2025 19:05:21 +0000 (12:05 -0700)] 
KVM: x86/pmu: Open code pmc_event_is_allowed() in its callers

Open code pmc_event_is_allowed() in its callers, as kvm_pmu_trigger_event()
only needs to check the event filter (both global and local enables are
consulted outside of the loop).

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Rename pmc_speculative_in_use() to pmc_is_locally_enabled()
Sean Christopherson [Tue, 5 Aug 2025 19:05:20 +0000 (12:05 -0700)] 
KVM: x86/pmu: Rename pmc_speculative_in_use() to pmc_is_locally_enabled()

Rename pmc_speculative_in_use() to pmc_is_locally_enabled() to better
capture what it actually tracks, and to show its relationship to
pmc_is_globally_enabled().  While neither AMD nor Intel refer to event
selectors or the fixed counter control MSR as "local", it's the obvious
name to pair with "global".

As for "speculative", there's absolutely nothing speculative about the
checks.  E.g. for PMUs without PERF_GLOBAL_CTRL, from the guest's
perspective, the counters are "in use" without any qualifications.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Calculate set of to-be-emulated PMCs at time of WRMSRs
Sean Christopherson [Tue, 5 Aug 2025 19:05:19 +0000 (12:05 -0700)] 
KVM: x86/pmu: Calculate set of to-be-emulated PMCs at time of WRMSRs

Calculate and track PMCs that are counting instructions/branches retired
when the PMC's event selector (or fixed counter control) is modified,
instead of evaluating the event selector on-demand.  Immediately recalc a
PMC's configuration on writes to avoid false negatives/positives when
KVM skips an emulated WRMSR, which is guaranteed to occur before the
main run loop processes KVM_REQ_PMU.

Out of an abundance of caution, and because it's relatively cheap, recalc
reprogrammed PMCs in kvm_pmu_handle_event() as well.  Recalculating in
response to KVM_REQ_PMU _should_ be unnecessary, but for now be paranoid
to avoid introducing easily-avoidable bugs in edge cases.  The code can be
removed in the future if necessary, e.g. in the unlikely event that the
overhead of recalculating to-be-emulated PMCs is noticeable.

Note!  Deliberately don't check the PMU event filters, as doing so could
result in KVM consuming stale information.

Tracking which PMCs are counting branches/instructions will allow grabbing
SRCU in the fastpath VM-Exit handlers if and only if a PMC event might be
triggered (to consult the event filters), and will also allow the upcoming
mediated PMU to do the right thing with respect to counting instructions
(the mediated PMU won't be able to update PMCs in the VM-Exit fastpath).
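
A sketch of the recalculation; 'pmc_counting_instructions' and
pmc_counts_insn_retired() are hypothetical names for the tracked bitmap
and the eventsel match:

  static void pmc_recalc_emulated_counters(struct kvm_pmc *pmc)
  {
          struct kvm_pmu *pmu = pmc_to_pmu(pmc);

          /* Deliberately ignore the PMU event filters (see above). */
          if (pmc_is_locally_enabled(pmc) && pmc_counts_insn_retired(pmc))
                  set_bit(pmc->idx, pmu->pmc_counting_instructions);
          else
                  clear_bit(pmc->idx, pmu->pmc_counting_instructions);
  }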

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Add wrappers for counting emulated instructions/branches
Sean Christopherson [Tue, 5 Aug 2025 19:05:18 +0000 (12:05 -0700)] 
KVM: x86/pmu: Add wrappers for counting emulated instructions/branches

Add wrappers for triggering instruction retired and branch retired PMU
events in anticipation of reworking the internal mechanisms to track
which PMCs need to be evaluated, e.g. to avoid having to walk and check
every PMC.

Opportunistically bury "struct kvm_pmu_emulated_event_selectors" in pmu.c.

No functional change intended.
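
The wrappers are presumably thin; a sketch, with the exact eventsel
plumbing assumed:

  void kvm_pmu_instruction_retired(struct kvm_vcpu *vcpu)
  {
          kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.INSTRUCTIONS_RETIRED);
  }

  void kvm_pmu_branch_retired(struct kvm_vcpu *vcpu)
  {
          kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED);
  }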

Link: https://lore.kernel.org/r/20250805190526.1453366-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86/pmu: Move kvm_init_pmu_capability() to pmu.c
Sean Christopherson [Tue, 5 Aug 2025 19:05:17 +0000 (12:05 -0700)] 
KVM: x86/pmu: Move kvm_init_pmu_capability() to pmu.c

Move kvm_init_pmu_capability() to pmu.c so that future changes can access
variables that have no business being visible outside of pmu.c.
kvm_init_pmu_capability() is called once per module load; there is zero
reason it needs to be inlined.

No functional change intended.

Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Fold WRMSR fastpath helpers into the main handler
Sean Christopherson [Tue, 5 Aug 2025 19:05:16 +0000 (12:05 -0700)] 
KVM: x86: Fold WRMSR fastpath helpers into the main handler

Fold the per-MSR WRMSR fastpath helpers into the main handler now that the
IPI path in particular is relatively tiny.  In addition to eliminating a
decent amount of boilerplate, this removes the ugly -errno/1/0 => bool
conversion (which is "necessitated" by kvm_x2apic_icr_write_fast()).

Opportunistically drop the comment about IPIs, as the purpose of the
fastpath is hopefully self-evident, and _if_ it needs more documentation,
the documentation (and rules!) should be placed in a more central location.

No functional change intended.
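
The folded result plausibly becomes a single switch; a sketch, with the
return-value convention of kvm_x2apic_icr_write_fast() assumed (nonzero
== can't be handled fast):

  fastpath_t ret = EXIT_FASTPATH_NONE;

  switch (msr) {
  case X2APIC_MSR(APIC_ICR):
          if (!kvm_x2apic_icr_write_fast(vcpu->arch.apic, data))
                  ret = EXIT_FASTPATH_REENTER_GUEST;
          break;
  case MSR_IA32_TSC_DEADLINE:
          kvm_set_lapic_tscdeadline_msr(vcpu, data);
          ret = EXIT_FASTPATH_REENTER_GUEST;
          break;
  default:
          break;
  }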

Link: https://lore.kernel.org/r/20250805190526.1453366-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Unconditionally grab data from EDX:EAX in WRMSR fastpath
Sean Christopherson [Tue, 5 Aug 2025 19:05:15 +0000 (12:05 -0700)] 
KVM: x86: Unconditionally grab data from EDX:EAX in WRMSR fastpath

Always grab EDX:EAX in the WRMSR fastpath to deduplicate and simplify the
case statements, and to prepare for handling immediate variants of WRMSRNS
in the fastpath (the data register is explicitly provided in that case).
There's no harm in reading the registers, as their values are always
available, i.e. don't require VMREADs (or similarly slow operations).

No real functional change intended.
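
I.e. the handler can unconditionally do the following up front, since
GPRs are always readable without VMREADs:

  u64 data = kvm_read_edx_eax(vcpu);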

Cc: Xin Li <xin@zytor.com>
Link: https://lore.kernel.org/r/20250805190526.1453366-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Acquire SRCU in WRMSR fastpath iff instruction needs to be skipped
Sean Christopherson [Tue, 5 Aug 2025 19:05:14 +0000 (12:05 -0700)] 
KVM: x86: Acquire SRCU in WRMSR fastpath iff instruction needs to be skipped

Acquire SRCU in the WRMSR fastpath if and only if an instruction needs to
be skipped, i.e. only if the fastpath succeeds.  The reasoning in commit
3f2739bd1e0b ("KVM: x86: Acquire SRCU read lock when handling fastpath MSR
writes") about "avoid having to play whack-a-mole" seems sound, but in
hindsight unconditionally acquiring SRCU does more harm than good.

While acquiring/releasing SRCU isn't slow per se, the things that are
_protected_ by kvm->srcu are generally safe to access only in the "slow"
VM-Exit path.  E.g. accessing memslots in generic helpers is never safe,
because accessing guest memory with IRQs disabled is unsafe (except
when kvm_vcpu_read_guest_atomic() is used, but that API should never be
used in emulation helpers).

In other words, playing whack-a-mole is actually desirable in this case,
because every access to an asset protected by kvm->srcu warrants further
scrutiny.

Link: https://lore.kernel.org/r/20250805190526.1453366-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Unconditionally handle MSR_IA32_TSC_DEADLINE in fastpath exits
Sean Christopherson [Tue, 5 Aug 2025 19:05:13 +0000 (12:05 -0700)] 
KVM: x86: Unconditionally handle MSR_IA32_TSC_DEADLINE in fastpath exits

Drop the fastpath VM-Exit requirement that KVM can use the hypervisor
timer to emulate the APIC timer in TSC deadline mode.  I.e. unconditionally
handle MSR_IA32_TSC_DEADLINE WRMSRs in the fastpath.  Restricting the
fastpath to *maybe* using the VMX preemption timer is ineffective and
unnecessary.

If the requested deadline can't be programmed into the VMX preemption
timer, KVM will fall back to hrtimers, i.e. the restriction is ineffective
as far as preventing any kind of worst case scenario.

But guarding against a worst case scenario is completely unnecessary as
the "slow" path, start_sw_tscdeadline() => hrtimer_start(), explicitly
disables IRQs.  In fact, the worst case scenario is when KVM thinks it
can use the VMX preemption timer, as KVM will eat the overhead of calling
into vmx_set_hv_timer() and falling back to hrtimers.

Opportunistically limit kvm_can_use_hv_timer() to lapic.c as the fastpath
code was the only external user.

Stating the obvious, this allows handling MSR_IA32_TSC_DEADLINE writes in
the fastpath on AMD CPUs.

Link: https://lore.kernel.org/r/20250805190526.1453366-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Drop semi-arbitrary restrictions on IPI type in fastpath
Sean Christopherson [Tue, 5 Aug 2025 19:05:12 +0000 (12:05 -0700)] 
KVM: x86: Drop semi-arbitrary restrictions on IPI type in fastpath

Drop the restrictions on fastpath IPIs only working for fixed IRQs with a
physical destination now that the fastpath is explicitly limited to "fast"
delivery.  Limiting delivery to a single physical APIC ID guarantees only
one vCPU will receive the event, but that isn't necessarily "fast", e.g. if
the targeted vCPU is the last of 4096 vCPUs.  And logical destination mode
or shorthand (to self) can also be fast, e.g. if only a few vCPUs are
being targeted.  Lastly, there's nothing inherently slow about delivering
an NMI, INIT, SIPI, SMI, etc., i.e. there's no reason to artificially
limit fastpath delivery to fixed vector IRQs.

Link: https://lore.kernel.org/r/20250805190526.1453366-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Only allow "fast" IPIs in fastpath WRMSR(X2APIC_ICR) handler
Sean Christopherson [Tue, 5 Aug 2025 19:05:11 +0000 (12:05 -0700)] 
KVM: x86: Only allow "fast" IPIs in fastpath WRMSR(X2APIC_ICR) handler

Explicitly restrict fastpath ICR writes to IPIs that are "fast", i.e. can
be delivered without having to walk all vCPUs, and that target at most 16
vCPUs.  Artificially restricting ICR writes to physical mode guarantees
at most one vCPU will receive an IPI (because x2APIC IDs are read-only),
but that delivery might not be "fast".  E.g. even if the vCPU exists, KVM
might have to iterate over 4096 vCPUs to find the right one.

Limiting delivery to fast IPIs aligns the WRMSR fastpath with
kvm_arch_set_irq_inatomic() (which also runs with IRQs disabled), and will
allow dropping the semi-arbitrary restrictions on delivery mode and type.

Link: https://lore.kernel.org/r/20250805190526.1453366-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Add kvm_icr_to_lapic_irq() helper to allow for fastpath IPIs
Sean Christopherson [Tue, 5 Aug 2025 19:05:10 +0000 (12:05 -0700)] 
KVM: x86: Add kvm_icr_to_lapic_irq() helper to allow for fastpath IPIs

Extract the code for converting an ICR message into a kvm_lapic_irq
structure into a local helper so that a fast-only IPI path can share the
conversion logic.

No functional change intended.

Link: https://lore.kernel.org/r/20250805190526.1453366-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agokvm: x86: simplify kvm_vector_to_index()
Yury Norov [Sun, 20 Jul 2025 01:58:45 +0000 (21:58 -0400)] 
kvm: x86: simplify kvm_vector_to_index()

Use find_nth_bit() and make the function almost a one-liner.
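
Presumably the body reduces to a single find_nth_bit() call; a sketch
against the helper's existing signature:

  static int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
                                 const unsigned long *bitmap, u32 bitmap_size)
  {
          /* Index of the Nth set bit, where N = vector % dest_vcpus. */
          return find_nth_bit(bitmap, bitmap_size, vector % dest_vcpus);
  }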

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: allow CPUID 0xC000_0000 to proceed on Zhaoxin CPUs
Ewan Hai [Mon, 18 Aug 2025 08:30:34 +0000 (04:30 -0400)] 
KVM: x86: allow CPUID 0xC000_0000 to proceed on Zhaoxin CPUs

Bypass the Centaur-only filter for the CPUID signature leaf so that
processing continues when the CPU vendor is Zhaoxin.

Signed-off-by: Ewan Hai <ewanhai-oc@zhaoxin.com>
Link: https://lore.kernel.org/r/20250818083034.93935-1-ewanhai-oc@zhaoxin.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoarch/x86/kvm/ioapic: Remove license boilerplate with bad FSF address
Thomas Huth [Mon, 28 Jul 2025 15:28:43 +0000 (17:28 +0200)] 
arch/x86/kvm/ioapic: Remove license boilerplate with bad FSF address

The Free Software Foundation does not reside in "59 Temple Place"
anymore, so we should not mention that address in the source code here.
But instead of updating the address to their current location, let's
rather drop the license boilerplate text here and use a proper SPDX
license identifier instead. The text talks about the "GNU *Lesser*
General Public License" and "any later version", so LGPL-2.1+ is the
right choice here.
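
The entire boilerplate then collapses to a single first-line tag:

  // SPDX-License-Identifier: LGPL-2.1+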

Signed-off-by: Thomas Huth <thuth@redhat.com>
Link: https://lore.kernel.org/r/20250728152843.310260-1-thuth@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SVM: Skip fastpath emulation on VM-Exit if next RIP isn't valid
Sean Christopherson [Tue, 5 Aug 2025 19:05:09 +0000 (12:05 -0700)] 
KVM: SVM: Skip fastpath emulation on VM-Exit if next RIP isn't valid

Skip the WRMSR and HLT fastpaths in SVM's VM-Exit handler if the next RIP
isn't valid, e.g. because KVM is running with nrips=false.  SVM must
decode and emulate to skip the instruction if the CPU doesn't provide the
next RIP, and getting the instruction bytes to decode requires reading
guest memory.  Reading guest memory through the emulator can fault, i.e.
can sleep, which is disallowed since the fastpath handlers run with IRQs
disabled.

 BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:106
 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 32611, name: qemu
 preempt_count: 1, expected: 0
 INFO: lockdep is turned off.
 irq event stamp: 30580
 hardirqs last  enabled at (30579): [<ffffffffc08b2527>] vcpu_run+0x1787/0x1db0 [kvm]
 hardirqs last disabled at (30580): [<ffffffffb4f62e32>] __schedule+0x1e2/0xed0
 softirqs last  enabled at (30570): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
 softirqs last disabled at (30568): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210
 CPU: 298 UID: 0 PID: 32611 Comm: qemu Tainted: G     U              6.16.0-smp--e6c618b51cfe-sleep #782 NONE
 Tainted: [U]=USER
 Hardware name: Google Astoria-Turin/astoria, BIOS 0.20241223.2-0 01/17/2025
 Call Trace:
  <TASK>
  dump_stack_lvl+0x7d/0xb0
  __might_resched+0x271/0x290
  __might_fault+0x28/0x80
  kvm_vcpu_read_guest_page+0x8d/0xc0 [kvm]
  kvm_fetch_guest_virt+0x92/0xc0 [kvm]
  __do_insn_fetch_bytes+0xf3/0x1e0 [kvm]
  x86_decode_insn+0xd1/0x1010 [kvm]
  x86_emulate_instruction+0x105/0x810 [kvm]
  __svm_skip_emulated_instruction+0xc4/0x140 [kvm_amd]
  handle_fastpath_invd+0xc4/0x1a0 [kvm]
  vcpu_run+0x11a1/0x1db0 [kvm]
  kvm_arch_vcpu_ioctl_run+0x5cc/0x730 [kvm]
  kvm_vcpu_ioctl+0x578/0x6a0 [kvm]
  __se_sys_ioctl+0x6d/0xb0
  do_syscall_64+0x8a/0x2c0
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
 RIP: 0033:0x7f479d57a94b
  </TASK>

Note, this is essentially a reapply of commit 5c30e8101e8d ("KVM: SVM:
Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"), but with
different justification (KVM now grabs SRCU when skipping the instruction
for other reasons).
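
A sketch of the guard, using SVM's existing nrips module param and
next_rip field (the placement within the handler is assumed):

  /* Can't skip the instruction without decoding it, i.e. without
   * reading guest memory, which may fault; stay out of the fastpath. */
  if (!nrips || !svm->vmcb->control.next_rip)
          return EXIT_FASTPATH_NONE;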

Fixes: b439eb8ab578 ("Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250805190526.1453366-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: SVM: Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2
Sean Christopherson [Fri, 11 Jul 2025 17:27:46 +0000 (10:27 -0700)] 
KVM: SVM: Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2

Emulate PERF_CNTR_GLOBAL_STATUS_SET when PerfMonV2 is enumerated to the
guest, as the MSR is supposed to exist in all AMD v2 PMUs.
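
Given the "SET" semantics (writing '1' sets the corresponding bit in
GLOBAL_STATUS, '0' is ignored), the emulation is plausibly a one-line
switch arm; the MSR name and reserved-bit handling are assumptions:

  case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET:     /* name assumed */
          /* Bits written as '1' are OR'd into GLOBAL_STATUS. */
          pmu->global_status |= data;
          break;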

Fixes: 4a2771895ca6 ("KVM: x86/svm/pmu: Add AMD PerfMonV2 support")
Cc: stable@vger.kernel.org
Cc: Sandipan Das <sandipan.das@amd.com>
Link: https://lore.kernel.org/r/20250711172746.1579423-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: x86: Don't (re)check L1 intercepts when completing userspace I/O
Sean Christopherson [Tue, 15 Jul 2025 19:06:38 +0000 (12:06 -0700)] 
KVM: x86: Don't (re)check L1 intercepts when completing userspace I/O

When completing emulation of an instruction that generated a userspace exit
for I/O, don't recheck L1 intercepts, as KVM has already finished that
phase of instruction execution, i.e. has already committed to allowing L2
to perform I/O.  If L1 (or host userspace) modifies the I/O permission
bitmaps during the exit to userspace, KVM will treat the access as being
intercepted despite already having emulated the I/O access.

Pivot on EMULTYPE_NO_DECODE to detect that KVM is completing emulation.
Of the three users of EMULTYPE_NO_DECODE, only complete_emulated_io() (the
intended "recipient") can reach the code in question.  gp_interception()'s
use is mutually exclusive with is_guest_mode(), and
complete_emulated_insn_gp() unconditionally pairs EMULTYPE_NO_DECODE with
EMULTYPE_SKIP.
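
A hedged sketch of the pivot (helper name hypothetical):

  /* Completing prior emulation (EMULTYPE_NO_DECODE) means KVM already
   * committed to the access; only check intercepts on a fresh decode. */
  if (!(emulation_type & EMULTYPE_NO_DECODE) && is_guest_mode(vcpu))
          check_l1_intercepts(ctxt);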

The bad behavior was detected by a syzkaller program that toggles port I/O
interception during the userspace I/O exit, ultimately resulting in a WARN
on vcpu->arch.pio.count being non-zero due to KVM not completing emulation
of the I/O instruction.

  WARNING: CPU: 23 PID: 1083 at arch/x86/kvm/x86.c:8039 emulator_pio_in_out+0x154/0x170 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 23 UID: 1000 PID: 1083 Comm: repro Not tainted 6.16.0-rc5-c1610d2d66b1-next-vm #74 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:emulator_pio_in_out+0x154/0x170 [kvm]
  PKRU: 55555554
  Call Trace:
   <TASK>
   kvm_fast_pio+0xd6/0x1d0 [kvm]
   vmx_handle_exit+0x149/0x610 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xda8/0x1ac0 [kvm]
   kvm_vcpu_ioctl+0x244/0x8c0 [kvm]
   __x64_sys_ioctl+0x8a/0xd0
   do_syscall_64+0x5d/0xc60
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Reported-by: syzbot+cc2032ba16cc2018ca25@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68790db4.a00a0220.3af5df.0020.GAE@google.com
Fixes: 8a76d7f25f8f ("KVM: x86: Add x86 callback for intercept check")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250715190638.1899116-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: selftests: fix minor typo in cpumodel_subfuncs
Gopi Krishna Menon [Wed, 13 Aug 2025 15:47:51 +0000 (21:17 +0530)] 
KVM: selftests: fix minor typo in cpumodel_subfuncs

Specifically, fix the spelling of "available" in the main function.

Signed-off-by: Gopi Krishna Menon <krishnagopi487@gmail.com>
Link: https://lore.kernel.org/r/20250813154751.5725-1-krishnagopi487@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: selftests: Move Intel and AMD module param helpers to x86/processor.h
Sean Christopherson [Wed, 6 Aug 2025 22:51:59 +0000 (15:51 -0700)] 
KVM: selftests: Move Intel and AMD module param helpers to x86/processor.h

Move the x86-specific helpers for getting kvm_{amd,intel} module params to
x86 where they belong.  Expose the module-agnostic helpers globally; there
is nothing secret about the logic.

Link: https://lore.kernel.org/r/20250806225159.1687326-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoKVM: selftests: Fix signedness issue with vCPU mmap size check
James Houghton [Fri, 11 Jul 2025 00:17:42 +0000 (00:17 +0000)] 
KVM: selftests: Fix signedness issue with vCPU mmap size check

Check that the return value of KVM_GET_VCPU_MMAP_SIZE is non-negative
before comparing with sizeof(kvm_run). If KVM_GET_VCPU_MMAP_SIZE fails,
it will return -1, and `-1 > sizeof(kvm_run)` is true, so the ASSERT
passes.

There are no other locations in tools/testing/selftests/kvm that make
the same mistake.
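
A sketch of the corrected pattern, using selftest-style asserts:

  int ret = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, NULL);

  TEST_ASSERT(ret >= 0, "KVM_GET_VCPU_MMAP_SIZE failed: %d", ret);
  TEST_ASSERT((size_t)ret >= sizeof(struct kvm_run),
              "KVM_GET_VCPU_MMAP_SIZE (%d) is too small", ret);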

Signed-off-by: James Houghton <jthoughton@google.com>
Link: https://lore.kernel.org/r/20250711001742.1965347-1-jthoughton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2 months agoLinux 6.17-rc2 v6.17-rc2
Linus Torvalds [Sun, 17 Aug 2025 22:22:10 +0000 (15:22 -0700)] 
Linux 6.17-rc2

2 months agoMerge tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 17 Aug 2025 13:53:15 +0000 (06:53 -0700)] 
Merge tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Remove a transitional asm/cpuid.h header which was added only as a
   fallback during cpuid helpers reorg

 - Initialize reserved fields in the SVSM page validation calls
   structure to zero in order to allow for future structure extensions

 - Have the sev-guest driver's buffers used in encryption operations be
   in linear mapping space as the encryption operation can be offloaded
   to an accelerator

 - Have a write to a read-only MSR in an AMD SNP guest trap to the
   hypervisor, as is usually done. This makes the guest user
   experience better by simply raising a #GP instead of terminating said
   guest

 - Do not output AVX512 elapsed time for kernel threads because the data
   is wrong, and fix a NULL pointer dereference in the process

 - Adjust the SRSO mitigation selection to the new attack vectors

* tag 'x86_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpuid: Remove transitional <asm/cpuid.h> header
  x86/sev: Ensure SVSM reserved fields in a page validation entry are initialized to zero
  virt: sev-guest: Satisfy linear mapping requirement in get_derived_key()
  x86/sev: Improve handling of writes to intercepted TSC MSRs
  x86/fpu: Fix NULL dereference in avx512_status()
  x86/bugs: Select best SRSO mitigation

2 months agoMerge tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 17 Aug 2025 12:57:47 +0000 (05:57 -0700)] 
Merge tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Borislav Petkov:

 - Make sure sanity checks down in the mutex lock path happen on the
   correct type of task so that they don't trigger falsely

 - Use the write unsafe user access pairs when writing a futex value to
   prevent an error on PowerPC which does user read and write accesses
   differently

* tag 'locking_urgent_for_v6.17_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking: Fix __clear_task_blocked_on() warning from __ww_mutex_wound() path
  futex: Use user_write_access_begin/_end() in futex_put_value()

2 months agoMerge tag 'rust-fixes-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda...
Linus Torvalds [Sat, 16 Aug 2025 22:49:24 +0000 (15:49 -0700)] 
Merge tag 'rust-fixes-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux

Pull rust fixes from Miguel Ojeda:

 - Work around a 'rustdoc' target modifiers bug in Rust >= 1.88.0. It will
   be fixed in Rust 1.90.0 (expected 2025-09-18).

 - Clean 'rustdoc' output before running it to avoid confusing the tool
   when files from previous versions remain.

* tag 'rust-fixes-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  rust: kbuild: clean output before running `rustdoc`
  rust: workaround `rustdoc` target modifiers bug

2 months agoMerge tag 'ata-ata-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata...
Linus Torvalds [Sat, 16 Aug 2025 13:59:13 +0000 (06:59 -0700)] 
Merge tag 'ata-ata-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fixes from Damien Le Moal:

 - Fix a regression affecting old IDE/PATA device scan, introduced by
   the recent link power management cleanups & fixes. The regression
   prevented devices from being properly detected (me)

 - Fix command duration limits (CDL) feature control: attempting to
   enable the feature while NCQ commands are being executed resulted in
   a silent failure to enable CDL when needed (Igor)

* tag 'ata-ata-6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-scsi: Fix CDL control
  ata: libata-eh: Fix link state check for IDE/PATA ports