git.ipfire.org Git - thirdparty/kernel/linux.git/log

KVM: VMX: Handle event vectoring error in check_emulate_instruction()

Move handling of emulation during event vectoring, which KVM doesn't
support, into VMX's check_emulate_instruction(), so that KVM detects
all unsupported emulation, not just cached emulated MMIO (EPT misconfig).
E.g. on emulated MMIO that isn't cached (EPT Violation) or occurs with
legacy shadow paging (#PF).

Rejecting emulation on other sources of emulation also fixes a largely
theoretical flaw (thanks to the "unprotect and retry" logic), where KVM
could incorrectly inject a #DF:

  1. CPU executes an instruction and hits a #GP
  2. While vectoring the #GP, a shadow #PF occurs
  3. On the #PF VM-Exit, KVM re-injects #GP
  4. KVM emulates because of the write-protected page
  5. KVM "successfully" emulates and also detects the #GP
  6. KVM synthesizes a #GP, and since #GP has already been injected,
     incorrectly escalates to a #DF.

Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to the write protection
violation.

Note, handle_ept_misconfig() checks vmx_check_emulate_instruction() before
attempting emulation of any kind.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-5-iorlov@amazon.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Try to unprotect and retry on unhandleable emulation failure

If emulation is "rejected" by check_emulate_instruction(), try to
unprotect and retry instruction execution before reporting the error to
userspace. Currently, check_emulate_instruction() never signals failure
when "unprotect and retry" is possible, but that will change in the
future as both VMX and SVM will reject emulation due to coincident
exception vectoring. E.g. if there is a write to a shadowed page table
when vectoring an event, then unprotecting the gfn and retrying the
instruction will allow the guest to make forward progress in most cases,
i.e. will allow the vCPU to keep running instead of returning an error to
userspace.

This ensures that the subsequent patches won't make KVM exit to
userspace when handling an intercepted #PF during vectoring without
checking whether unprotect and retry is possible.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-4-iorlov@amazon.com
[sean: massage changelog to clarify this is a nop for the current code]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add emulation status for unhandleable exception vectoring

Add emulation status for unhandleable vectoring, i.e. when KVM can't
emulate an instruction because emulation was triggered on an exit that
occurred while the CPU was vectoring an event. Such a situation can
occur if guest sets the IDT descriptor base to point to MMIO region,
and triggers an exception after that.

Exit to userspace with event delivery error when KVM can't emulate
an instruction when vectoring an event.

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-3-iorlov@amazon.com
[sean: massage changelog and X86EMUL_UNHANDLEABLE_VECTORING comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add function for vectoring error generation

Extract VMX code for unhandleable VM-Exit during vectoring into
vendor-agnostic function so that boiler-plate code can be shared by SVM.

To avoid unnecessarily complexity in the helper, unconditionally report a
GPA to userspace instead of having a conditional entry. For exits that
don't report a GPA, i.e. everything except EPT Misconfig, simply report
KVM's "invalid GPA".

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
Link: https://lore.kernel.org/r/20241217181458.68690-2-iorlov@amazon.com
[sean: clarify that the INVALID_GPA logic is new]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Use only local variables (no bitmask) to init kvm_cpu_caps

Refactor the kvm_cpu_cap_init() macro magic to collect supported features
in a local variable instead of passing them to the macro as a "mask". As
pointed out by Maxim, relying on macros to "return" a value and set local
variables is surprising, as the bitwise-OR logic suggests the macros are
pure, i.e. have no side effects.

Ideally, the feature initializers would have zero side effects, e.g. would
take local variables as params, but there isn't a sane way to do so
without either sacrificing the various compile-time assertions (basically
a non-starter), or passing at least one variable, e.g. a struct, to each
macro usage (adds a lot of noise and boilerplate code).

Opportunistically force callers to emit a trailing comma by intentionally
omitting a semicolon after invoking the feature initializers. Forcing a
trailing comma isotales futures changes to a single line, i.e. doesn't
cause churn for unrelated features/lines when adding/removing/modifying a
feature.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-58-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Explicitly track feature flags that are enabled at runtime

Add one last (hopefully) CPUID feature macro, RUNTIME_F(), and use it
to track features that KVM supports, but that are only set at runtime
(in response to other state), and aren't advertised to userspace via
KVM_GET_SUPPORTED_CPUID.

Currently, RUNTIME_F() is mostly just documentation, but tracking all
KVM-supported features will allow for asserting, at build time, take),
that all features that are set, cleared, *or* checked by KVM are known to
kvm_set_cpu_caps().

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-57-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Explicitly track feature flags that require vendor enabling

Add another CPUID feature macro, VENDOR_F(), and use it to track features
that KVM supports, but that need additional vendor support and so are
conditionally enabled in vendor code.

Currently, VENDOR_F() is mostly just documentation, but tracking all
KVM-supported features will allow for asserting, at build time, take),
that all features that are set, cleared, *or* checked by KVM are known to
kvm_set_cpu_caps().

To fudge around a macro collision on 32-bit kernels, #undef DS to be able
to get at X86_FEATURE_DS.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-56-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Rename "SF" macro to "SCATTERED_F"

Now that each feature flag is on its own line, i.e. brevity isn't a major
concern, drop the "SF" acronym and use the (almost) full name, SCATTERED_F.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-55-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Pull CPUID capabilities from boot_cpu_data only as needed

Don't memcpy() all of boot_cpu_data.x86_capability, and instead explicitly
fill each kvm_cpu_cap_init leaf during kvm_cpu_cap_init(). While clever,
copying all kernel capabilities risks over-reporting KVM capabilities,
e.g. if KVM added support in __do_cpuid_func(), but neglected to init the
supported set of capabilities.

Note, explicitly grabbing leafs deliberately keeps Linux-defined leafs as
0! KVM should never advertise Linux-defined leafs; any relevant features
that are "real", but scattered, must be gathered in their correct hardware-
defined leaf.

Link: https://lore.kernel.org/r/20241128013424.4096668-54-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add a macro for features that are synthesized into boot_cpu_data

Add yet another CPUID macro, this time for features that the host kernel
synthesizes into boot_cpu_data, i.e. that the kernel force sets even in
situations where the feature isn't reported by CPUID. Thanks to the
macro shenanigans of kvm_cpu_cap_init(), such features can now be handled
in the core CPUID framework, i.e. don't need to be handled out-of-band and
thus without as many guardrails.

Adding a dedicated macro also helps document what's going on, e.g. the
calls to kvm_cpu_cap_check_and_set() are very confusing unless the reader
knows exactly how kvm_cpu_cap_init() generates kvm_cpu_caps (and even
then, it's far from obvious).

Link: https://lore.kernel.org/r/20241128013424.4096668-53-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop superfluous host XSAVE check when adjusting guest XSAVES caps

Drop the manual boot_cpu_has() checks on XSAVE when adjusting the guest's
XSAVES capabilities now that guest cpu_caps incorporates KVM's support.
The guest's cpu_caps are initialized from kvm_cpu_caps, which are in turn
initialized from boot_cpu_data, i.e. checking guest_cpu_cap_has() also
checks host/KVM capabilities (which is the entire point of cpu_caps).

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Replace (almost) all guest CPUID feature queries with cpu_caps

Switch all queries (except XSAVES) of guest features from guest CPUID to
guest capabilities, i.e. replace all calls to guest_cpuid_has() with calls
to guest_cpu_cap_has().

Keep guest_cpuid_has() around for XSAVES, but subsume its helper
guest_cpuid_get_register() and add a compile-time assertion to prevent
using guest_cpuid_has() for any other feature. Add yet another comment
for XSAVE to explain why KVM is allowed to query its raw guest CPUID.

Opportunistically drop the unused guest_cpuid_clear(), as there should be
no circumstance in which KVM needs to _clear_ a guest CPUID feature now
that everything is tracked via cpu_caps. E.g. KVM may need to _change_
a feature to emulate dynamic CPUID flags, but KVM should never need to
clear a feature in guest CPUID to prevent it from being used by the guest.

Delete the last remnants of the governed features framework, as the lone
holdout was vmx_adjust_secondary_exec_control()'s divergent behavior for
governed vs. ungoverned features.

Note, replacing guest_cpuid_has() checks with guest_cpu_cap_has() when
computing reserved CR4 bits is a nop when viewed as a whole, as KVM's
capabilities are already incorporated into the calculation, i.e. if a
feature is present in guest CPUID but unsupported by KVM, its CR4 bit
was already being marked as reserved, checking guest_cpu_cap_has() simply
double-stamps that it's a reserved bit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-51-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Shuffle code to prepare for dropping guest_cpuid_has()

Move the implementations of guest_has_{spec_ctrl,pred_cmd}_msr() down
below guest_cpu_cap_has() so that their use of guest_cpuid_has() can be
replaced with calls to guest_cpu_cap_has().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-50-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Update guest cpu_caps at runtime for dynamic CPUID-based features

When updating guest CPUID entries to emulate runtime behavior, e.g. when
the guest enables a CR4-based feature that is tied to a CPUID flag, also
update the vCPU's cpu_caps accordingly. This will allow replacing all
usage of guest_cpuid_has() with guest_cpu_cap_has().

Note, this relies on kvm_set_cpuid() taking a snapshot of cpu_caps before
invoking kvm_update_cpuid_runtime(), i.e. when KVM is updating CPUID
entries that *may* become the vCPU's CPUID, so that unwinding to the old
cpu_caps is possible if userspace tries to set bogus CPUID information.

Note #2, none of the features in question use guest_cpu_cap_has() at this
time, i.e. aside from settings bits in cpu_caps, this is a glorified nop.

Cc: Yang Weijiang <weijiang.yang@intel.com>
Cc: Robert Hoo <robert.hoo.linux@gmail.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-49-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Update OS{XSAVE,PKE} bits in guest CPUID irrespective of host support

When making runtime CPUID updates, change OSXSAVE and OSPKE even if their
respective base features (XSAVE, PKU) are not supported by the host. KVM
already incorporates host support in the vCPU's effective reserved CR4 bits.
I.e. OSXSAVE and OSPKE can be set if and only if the host supports them.

And conversely, since KVM's ABI is that KVM owns the dynamic OS feature
flags, clearing them when they obviously aren't supported and thus can't
be enabled is arguably a fix.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-48-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop unnecessary check that cpuid_entry2_find() returns right leaf

Drop an unnecessary check that kvm_find_cpuid_entry_index(), i.e.
cpuid_entry2_find(), returns the correct leaf when getting CPUID.0x7.0x0
to update X86_FEATURE_OSPKE. cpuid_entry2_find() never returns an entry
for the wrong function. And not that it matters, but cpuid_entry2_find()
will always return a precise match for CPUID.0x7.0x0 since the index is
significant.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-47-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Avoid double CPUID lookup when updating MWAIT at runtime

Move the handling of X86_FEATURE_MWAIT during CPUID runtime updates to
utilize the lookup done for other CPUID.0x1 features.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-46-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Initialize guest cpu_caps based on KVM support

Constrain all guest cpu_caps based on KVM support instead of constraining
only the few features that KVM _currently_ needs to verify are actually
supported by KVM. The intent of cpu_caps is to track what the guest is
actually capable of using, not the raw, unfiltered CPUID values that the
guest sees.

I.e. KVM should always consult it's only support when making decisions
based on guest CPUID, and the only reason KVM has historically made the
checks opt-in was due to lack of centralized tracking.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-45-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Treat MONTIOR/MWAIT as a "partially emulated" feature

Enumerate MWAIT in cpuid_func_emulated(), but only if the caller wants to
include "partially emulated" features, i.e. features that KVM kinda sorta
emulates, but with major caveats.  This will allow initializing the guest
cpu_caps based on the set of features that KVM virtualizes and/or emulates,
without needing to handle things like MONITOR/MWAIT as one-off exceptions.

Adding one-off handling for individual features is quite painful,
especially when considering future hardening.  It's very doable to verify,
at compile time, that every CPUID-based feature that KVM queries when
emulating guest behavior is actually known to KVM, e.g. to prevent KVM
bugs where KVM emulates some feature but fails to advertise support to
userspace.  In other words, any features that are special cased, i.e. not
handled generically in the CPUID framework, would also need to be special
cased for any hardening efforts that build on said framework.

Link: https://lore.kernel.org/r/20241128013424.4096668-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Extract code for generating per-entry emulated CPUID information

Extract the meat of __do_cpuid_func_emulated() into a separate helper,
cpuid_func_emulated(), so that cpuid_func_emulated() can be used with a
single CPUID entry. This will allow marking emulated features as fully
supported in the guest cpu_caps without needing to hardcode the set of
emulated features in multiple locations.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Initialize guest cpu_caps based on guest CPUID

Initialize a vCPU's capabilities based on the guest CPUID provided by
userspace instead of simply zeroing the entire array. This is the first
step toward using cpu_caps to query *all* CPUID-based guest capabilities,
i.e. will allow converting all usage of guest_cpuid_has() to
guest_cpu_cap_has().

Zeroing the array was the logical choice when using cpu_caps was opt-in,
e.g. "unsupported" was generally a safer default, and the whole point of
governed features is that KVM would need to check host and guest support,
i.e. making everything unsupported by default didn't require more code.

But requiring KVM to manually "enable" every CPUID-based feature in
cpu_caps would require an absurd amount of boilerplate code.

Follow existing CPUID/kvm_cpu_caps nomenclature where possible, e.g. for
the change() and clear() APIs. Replace check_and_set() with constrain()
to try and capture that KVM is constraining userspace's desired guest
feature set based on KVM's capabilities.

This is intended to be gigantic nop, i.e. should not have any impact on
guest or KVM functionality.

This is also an intermediate step; a future commit will also incorporate
KVM support into the vCPU's cpu_caps before converting guest_cpuid_has()
to guest_cpu_cap_has().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Replace guts of "governed" features with comprehensive cpu_caps

Replace the internals of the governed features framework with a more
comprehensive "guest CPU capabilities" implementation, i.e. with a guest
version of kvm_cpu_caps. Keep the skeleton of governed features around
for now as vmx_adjust_sec_exec_control() relies on detecting governed
features to do the right thing for XSAVES, and switching all guest feature
queries to guest_cpu_cap_has() requires subtle and non-trivial changes,
i.e. is best done as a standalone change.

Tracking *all* guest capabilities that KVM cares will allow excising the
poorly named "governed features" framework, and effectively optimizes all
KVM queries of guest capabilities, i.e. doesn't require making a
subjective decision as to whether or not a feature is worth "governing",
and doesn't require adding the code to do so.

The cost of tracking all features is currently 92 bytes per vCPU on 64-bit
kernels: 100 bytes for cpu_caps versus 8 bytes for governed_features.
That cost is well worth paying even if the only benefit was eliminating
the "governed features" terminology. And practically speaking, the real
cost is zero unless those 92 bytes pushes the size of vcpu_vmx or vcpu_svm
into a new order-N allocation, and if that happens there are better ways
to reduce the footprint of kvm_vcpu_arch, e.g. making the PMU and/or MTRR
state separate allocations.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Rename "governed features" helpers to use "guest_cpu_cap"

As the first step toward replacing KVM's so-called "governed features"
framework with a more comprehensive, less poorly named implementation,
replace the "kvm_governed_feature" function prefix with "guest_cpu_cap"
and rename guest_can_use() to guest_cpu_cap_has().

The "guest_cpu_cap" naming scheme mirrors that of "kvm_cpu_cap", and
provides a more clear distinction between guest capabilities, which are
KVM controlled (heh, or one might say "governed"), and guest CPUID, which
with few exceptions is fully userspace controlled.

Opportunistically rewrite the comment about XSS passthrough for SEV-ES
guests to avoid referencing so many functions, as such comments are prone
to becoming stale (case in point...).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Advertise HYPERVISOR in KVM_GET_SUPPORTED_CPUID

Unconditionally advertise "support" for the HYPERVISOR feature in CPUID,
as the flag simply communicates to the guest that's it's running under a
hypervisor.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID

Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID,
as KVM always emulates deadline mode, *if* the VM has an in-kernel local
APIC. The odds of a VMM emulating the local APIC in userspace, not
emulating the TSC deadline timer, _and_ reflecting
KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of
over-advertising and breaking any setups, is extremely low.

KVM has _unconditionally_ advertised X2APIC via CPUID since commit
0d1de2d901f4 ("KVM: Always report x2apic as supported feature"), and it
is completely impossible for userspace to emulate X2APIC as KVM doesn't
support forwarding the MSR accesses to userspace. I.e. KVM has relied on
userspace VMMs to not misreport local APIC capabilities for nearly 13
years.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Remove all direct usage of cpuid_entry2_find()

Convert all use of cpuid_entry2_find() to kvm_find_cpuid_entry{,index}()
now that cpuid_entry2_find() operates on the vCPU state, i.e. now that
there is no need to use cpuid_entry2_find() directly in order to pass in
non-vCPU state.

To help prevent unwanted usage of cpuid_entry2_find(), #undef
KVM_CPUID_INDEX_NOT_SIGNIFICANT, i.e. force KVM to use
kvm_find_cpuid_entry().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move kvm_find_cpuid_entry{,_index}() up near cpuid_entry2_find()

Move kvm_find_cpuid_entry{,_index}() "up" in cpuid.c so that they are
colocated with cpuid_entry2_find(), e.g. to make it easier to see the
effective guts of the helpers without having to bounce around cpuid.c.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Always operate on kvm_vcpu data in cpuid_entry2_find()

Now that KVM sets vcpu->arch.cpuid_{entries,nent} before processing the
incoming CPUID entries during KVM_SET_CPUID{,2}, drop the @entries and
@nent params from cpuid_entry2_find() and unconditionally operate on the
vCPU state.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-35-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Remove unnecessary caching of KVM's PV CPUID base

Now that KVM only searches for KVM's PV CPUID base when userspace sets
guest CPUID, drop the cache and simply do the search every time.

Practically speaking, this is a nop except for situations where userspace
sets CPUID _after_ running the vCPU, which is anything but a hot path,
e.g. QEMU does so only when hotplugging a vCPU. And on the flip side,
caching guest CPUID information, especially information that is used to
query/modify _other_ CPUID state, is inherently dangerous as it's all too
easy to use stale information, i.e. KVM should only cache CPUID state when
the performance and/or programming benefits justify it.

Link: https://lore.kernel.org/r/20241128013424.4096668-34-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Clear PV_UNHALT for !HLT-exiting only when userspace sets CPUID

Now that KVM disallows disabling HLT-exiting after vCPUs have been created,
i.e. now that it's impossible for kvm_hlt_in_guest() to change while vCPUs
are running, apply KVM's PV_UNHALT quirk only when userspace is setting
guest CPUID.

Opportunistically rename the helper to make it clear that KVM's behavior
is a quirk that should never have been added. KVM's documentation
explicitly states that userspace should not advertise PV_UNHALT if
HLT-exiting is disabled, but for unknown reasons, commit caa057a2cad6
("KVM: X86: Provide a capability to disable HLT intercepts") didn't stop
at documenting the requirement and also massaged the incoming guest CPUID.

Unfortunately, it's quite likely that userspace has come to rely on KVM's
behavior, i.e. the code can't simply be deleted. The only reason KVM
doesn't have an "official" quirk is that there is no known use case where
disabling the quirk would make sense, i.e. letting userspace disable the
quirk would further increase KVM's burden without any benefit.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Swap incoming guest CPUID into vCPU before massaging in KVM_SET_CPUID2

When handling KVM_SET_CPUID{,2}, swap the old and new CPUID arrays and
lengths before processing the new CPUID, and simply undo the swap if
setting the new CPUID fails for whatever reason.

To keep the diff reasonable, continue passing the entry array and length
to most helpers, and defer the more complete cleanup to future commits.

For any sane VMM, setting "bad" CPUID state is not a hot path (or even
something that is surviable), and setting guest CPUID before it's known
good will allow removing all of KVM's infrastructure for processing CPUID
entries directly (as opposed to operating on vcpu->arch.cpuid_entries).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add a macro to init CPUID features that KVM emulates in software

Now that kvm_cpu_cap_init() is a macro with its own scope, add EMUL_F() to
OR-in features that KVM emulates in software, i.e. that don't depend on
the feature being available in hardware. The contained scope
of kvm_cpu_cap_init() allows using a local variable to track the set of
emulated leaves, which in addition to avoiding confusing and/or
unnecessary variables, helps prevent misuse of EMUL_F().

Link: https://lore.kernel.org/r/20241128013424.4096668-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add a macro to init CPUID features that ignore host kernel support

Add a macro for use in kvm_set_cpu_caps() to automagically initialize
features that KVM wants to support based solely on the CPU's capabilities,
e.g. KVM advertises LA57 support if it's available in hardware, even if
the host kernel isn't utilizing 57-bit virtual addresses.

Track a features that are passed through to userspace (from hardware) in
a local variable, and simply OR them in *after* adjusting the capabilities
that came from boot_cpu_data.

Note, eliminating the open-coded call to cpuid_ecx() also fixes a largely
benign bug where KVM could incorrectly report LA57 support on Intel CPUs
whose max supported CPUID is less than 7, i.e. if the max supported leaf
(<7) happened to have bit 16 set. In practice, barring a funky virtual
machine setup, the bug is benign as all known CPUs that support VMX also
support leaf 7.

Link: https://lore.kernel.org/r/20241128013424.4096668-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Harden CPU capabilities processing against out-of-scope features

Add compile-time assertions to verify that usage of F() and friends in
kvm_set_cpu_caps() is scoped to the correct CPUID word, e.g. to detect
bugs where KVM passes a feature bit from word X into word y.

Add a one-off assertion in the aliased feature macro to ensure that only
word 0x8000_0001.EDX aliased the features defined for 0x1.EDX.

To do so, convert kvm_cpu_cap_init() to a macro and have it define a
local variable to track which CPUID word is being initialized that is
then used to validate usage of F() (all of the inputs are compile-time
constants and thus can be fed into BUILD_BUG_ON()).

Redefine KVM_VALIDATE_CPU_CAP_USAGE after kvm_set_cpu_caps() to be a nop
so that F() can be used in other flows that aren't as easily hardened,
e.g. __do_cpuid_func_emulated() and __do_cpuid_func().

Invoke KVM_VALIDATE_CPU_CAP_USAGE() in SF() and X86_64_F() to ensure the
validation occurs, e.g. if the usage of F() is completely compiled out
(which shouldn't happen for boot_cpu_has(), but could happen in the future,
e.g. if KVM were to use cpu_feature_enabled()).

Link: https://lore.kernel.org/r/20241128013424.4096668-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: #undef SPEC_CTRL_SSBD in cpuid.c to avoid macro collisions

Undefine SPEC_CTRL_SSBD, which is #defined by msr-index.h to represent the
enable flag in MSR_IA32_SPEC_CTRL, to avoid issues with the macro being
unpacked into its raw value when passed to KVM's F() macro. This will
allow using multiple layers of macros in F() and friends, e.g. to harden
against incorrect usage of F().

No functional change intended (cpuid.c doesn't consume SPEC_CTRL_SSBD).

Link: https://lore.kernel.org/r/20241128013424.4096668-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Handle kernel- and KVM-defined CPUID words in a single helper

Merge kvm_cpu_cap_init() and kvm_cpu_cap_init_kvm_defined() into a single
helper. The only advantage of separating the two was to make it somewhat
obvious that KVM directly initializes the KVM-defined words, whereas using
a common helper will allow for hardening both kernel- and KVM-defined
CPUID words without needing copy+paste.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add a macro to precisely handle aliased 0x1.EDX CPUID features

Add a macro to precisely handle CPUID features that AMD duplicated from
CPUID.0x1.EDX into CPUID.0x8000_0001.EDX. This will allow adding an
assert that all features passed to kvm_cpu_cap_init() match the word being
processed, e.g. to prevent passing a feature from CPUID 0x7 to CPUID 0x1.

Because the kernel simply reuses the X86_FEATURE_* definitions from
CPUID.0x1.EDX, KVM's use of the aliased features would result in false
positives from such an assert.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Add a macro to init CPUID features that are 64-bit only

Add a macro to mask-in feature flags that are supported only on 64-bit
kernels/KVM. In addition to reducing overall #ifdeffery, using a macro
will allow hardening the kvm_cpu_cap initialization sequences to assert
that the features being advertised are indeed included in the word being
initialized. And arguably using *F() macros through is more readable.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init()

Rename kvm_cpu_cap_mask() to kvm_cpu_cap_init() in anticipation of merging
it with kvm_cpu_cap_init_kvm_defined(), and in anticipation of _setting_
bits in the helper (a future commit will play macro games to set emulated
feature flags via kvm_cpu_cap_init()).

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Unpack F() CPUID feature flag macros to one flag per line of code

Refactor kvm_set_cpu_caps() to express each supported (or not) feature
flag on a separate line, modulo a handful of cases where KVM does not, and
likely will not, support a sequence of flags.  This will allow adding
fancier macros with longer, more descriptive names without resulting in
absurd line lengths and/or weird code.  Isolating each flag also makes it
far easier to review changes, reduces code conflicts, and generally makes
it easier to resolve conflicts.  Lastly, it allows co-locating comments
for notable flags, e.g. MONITOR, precisely with the relevant flag.

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Account for max supported CPUID leaf when getting raw host CPUID

Explicitly zero out the feature word in kvm_cpu_caps if the word's
associated CPUID function is greater than the max leaf supported by the
CPU.  For such unsupported functions, Intel CPUs return the output from
the last supported leaf, not all zeros.

Practically speaking, this is likely a benign bug, as KVM uses the raw
host CPUID to mask the kernel's computed capabilities, and the kernel does
perform max leaf checks when populating boot_cpu_data.  The only way KVM's
goof could be problematic is if the kernel force-set a feature in a leaf
that is completely unsupported, _and_ the max supported leaf happened to
return a value with '1' the same bit position.  Which is theoretically
possible, but extremely unlikely.  And even if that did happen, it's
entirely possible that KVM would still provide the correct functionality;
the kernel did set the capability after all.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Do reverse CPUID sanity checks in __feature_leaf()

Do the compile-time sanity checks on reverse_cpuid in __feature_leaf() so
that higher level APIs don't need to "manually" perform the sanity checks.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Don't update PV features caches when enabling enforcement capability

Revert the chunk of commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") that forced a PV features cache refresh
during KVM_CAP_ENFORCE_PV_FEATURE_CPUID, as whatever ioctl() ordering
issue it alleged to have fixed never existed upstream, and likely never
existed in any kernel.

At the time of the commit, there was a tangentially related ioctl()
ordering issue, as toggling KVM_X86_DISABLE_EXITS_HLT after KVM_SET_CPUID2
would have resulted in KVM potentially leaving KVM_FEATURE_PV_UNHALT set.
But (a) that bug affected the entire guest CPUID, not just the cache, (b)
commit 01b4f510b9f4 didn't address that bug, it only refreshed the cache
(with the bad CPUID), and (c) setting KVM_X86_DISABLE_EXITS_HLT after vCPU
creation is completely broken as KVM configures HLT-exiting only during
vCPU creation, which is why KVM_CAP_X86_DISABLE_EXITS is now disallowed if
vCPUs have been created.

Another tangentially related bug was KVM's failure to clear the cache when
handling KVM_SET_CPUID2, but again commit 01b4f510b9f4 did nothing to fix
that bug.

The most plausible explanation for the what commit 01b4f510b9f4 was trying
to fix is a bug that existed in Google's internal kernel that was the
source of commit 01b4f510b9f4. At the time, Google's internal kernel had
not yet picked up commit 0d3b2ba16ba68 ("KVM: X86: Go on updating other
CPUID leaves when leaf 1 is absent"), i.e. KVM would not initialize the
PV features cache if KVM_SET_CPUID2 was called without a CPUID.0x1 entry.

Of course, no sane real world VMM would omit CPUID.0x1, including the KVM
selftest added by commit ac4a4d6de22e ("selftests: kvm: test enforcement
of paravirtual cpuid features"). And the test didn't actually try to
verify multiple orderings, nor did the selftest enter the guest without
doing KVM_SET_CPUID2, so who knows what motivated the change.

Regardless of why commit 01b4f510b9f4 ("kvm: x86: ensure pv_cpuid.features
is initialized when enabling cap") was added, refreshing the cache during
KVM_CAP_ENFORCE_PV_FEATURE_CPUID isn't necessary.

Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Zero out PV features cache when the CPUID leaf is not present

Clear KVM's PV feature cache prior when processing a new guest CPUID so
that KVM doesn't keep a stale cache entry if userspace does KVM_SET_CPUID2
multiple times, once with a PV features entry, and a second time without.

Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Update x86's KVM PV test to match KVM's disabling exits behavior

Rework x86's KVM PV features test to align with KVM's new, fixed behavior
of not allowing userspace to disable HLT-exiting after vCPUs have been
created. Rework the core testcase to disable HLT-exiting before creating
a vCPU, and opportunistically modify keep the paired VM+vCPU creation to
verify that KVM rejects KVM_CAP_X86_DISABLE_EXITS as expected.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Fix a bad TEST_REQUIRE() in x86's KVM PV test

Actually check for KVM support for disabling HLT-exiting instead of
effectively checking that KVM_CAP_X86_DISABLE_EXITS is #defined to a
non-zero value, and convert the TEST_REQUIRE() to a simple return so
that only the sub-test is skipped if HLT-exiting is mandatory.

The goof has likely gone unnoticed because all x86 CPUs support disabling
HLT-exiting, only systems with the opt-in mitigate_smt_rsb KVM module
param disallow HLT-exiting.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop the now unused KVM_X86_DISABLE_VALID_EXITS

Drop the KVM_X86_DISABLE_VALID_EXITS definition, as it is misleading, and
unused in KVM *because* it is misleading. The set of exits that can be
disabled is dynamic, i.e. userspace (and KVM) must check KVM's actual
capabilities.

Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Reject disabling of MWAIT/HLT interception when not allowed

Reject KVM_CAP_X86_DISABLE_EXITS if userspace attempts to disable MWAIT or
HLT exits and KVM previously reported (via KVM_CHECK_EXTENSION) that
disabling the exit(s) is not allowed. E.g. because MWAIT isn't supported
or the CPU doesn't have an always-running APIC timer, or because KVM is
configured to mitigate cross-thread vulnerabilities.

Cc: Kechen Lu <kechenl@nvidia.com>
Fixes: 4d5422cea3b6 ("KVM: X86: Provide a capability to disable MWAIT intercepts")
Fixes: 6f0f2d5ef895 ("KVM: x86: Mitigate the cross-thread return address predictions bug")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation

Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation. vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop now-redundant MAXPHYADDR and GPA rsvd bits from vCPU creation

Drop the manual initialization of maxphyaddr and reserved_gpa_bits during
vCPU creation now that kvm_arch_vcpu_create() unconditionally invokes
kvm_vcpu_after_set_cpuid(), which handles all such CPUID caching.

None of the helpers between the existing code in kvm_arch_vcpu_create()
and the call to kvm_vcpu_after_set_cpuid() consume maxphyaddr or
reserved_gpa_bits (though auditing vmx_vcpu_create() and svm_vcpu_create()
isn't exactly easy).

Link: https://lore.kernel.org/r/20241128013424.4096668-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86/pmu: Drop now-redundant refresh() during init()

Drop the manual kvm_pmu_refresh() from kvm_pmu_init() now that
kvm_arch_vcpu_create() performs the refresh via kvm_vcpu_after_set_cpuid().

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Move __kvm_is_valid_cr4() definition to x86.h

Let vendor code inline __kvm_is_valid_cr4() now x86.c's cr4_reserved_bits
no longer exists, as keeping cr4_reserved_bits local to x86.c was the only
reason for "hiding" the definition of __kvm_is_valid_cr4().

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Verify KVM stuffs runtime CPUID OS bits on CR4 writes

Extend x86's set sregs test to verify that KVM sets/clears OSXSAVE and
OSKPKE according to CR4.XSAVE and CR4.PKE respectively. For performance
reasons, KVM is responsible for emulating the architectural behavior of
the OS CPUID bits tracking CR4.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Refresh vCPU CPUID cache in __vcpu_get_cpuid_entry()

Refresh selftests' CPUID cache in the vCPU structure when querying a CPUID
entry so that tests don't consume stale data when KVM modifies CPUID as a
side effect to a completely unrelated change. E.g. KVM adjusts OSXSAVE in
response to CR4.OSXSAVE changes.

Unnecessarily invoking KVM_GET_CPUID is suboptimal, but vcpu->cpuid exists
to simplify selftests development, not for performance reasons. And,
unfortunately, trying to handle the side effects in tests or other flows
is unpleasant, e.g. selftests could manually refresh if KVM_SET_SREGS is
successful, but that would still leave a gap with respect to guest CR4
changes.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Assert that vcpu->cpuid is non-NULL when getting CPUID entries

Add a sanity check in __vcpu_get_cpuid_entry() to provide a friendlier
error than a segfault when a test developer tries to use a vCPU CPUID
helper on a barebones vCPU.

Link: https://lore.kernel.org/r/20241128013424.4096668-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Update x86's set_sregs_test to match KVM's CPUID enforcement

Rework x86's set sregs test to verify that KVM enforces CPUID vs. CR4
features even if userspace hasn't explicitly set guest CPUID. KVM used to
allow userspace to set any KVM-supported CR4 value prior to KVM_SET_CPUID2,
and the test verified that behavior.

However, the testcase was written purely to verify KVM's existing behavior,
i.e. was NOT written to match the needs of real world VMMs.

Opportunistically verify that KVM continues to reject unsupported features
after KVM_SET_CPUID2 (using KVM_GET_SUPPORTED_CPUID).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Account for KVM-reserved CR4 bits when passing through CR4 on VMX

Drop x86.c's local pre-computed cr4_reserved bits and instead fold KVM's
reserved bits into the guest's reserved bits.  This fixes a bug where VMX's
set_cr4_guest_host_mask() fails to account for KVM-reserved bits when
deciding which bits can be passed through to the guest.  In most cases,
letting the guest directly write reserved CR4 bits is ok, i.e. attempting
to set the bit(s) will still #GP, but not if a feature is available in
hardware but explicitly disabled by the host, e.g. if FSGSBASE support is
disabled via "nofsgsbase".

Note, the extra overhead of computing host reserved bits every time
userspace sets guest CPUID is negligible.  The feature bits that are
queried are packed nicely into a handful of words, and so checking and
setting each reserved bit costs in the neighborhood of ~5 cycles, i.e. the
total cost will be in the noise even if the number of checked CR4 bits
doubles over the next few years.  In other words, x86 will run out of CR4
bits long before the overhead becomes problematic.

Note #2, __cr4_reserved_bits() starts from CR4_RESERVED_BITS, which is
why the existing __kvm_cpu_cap_has() processing doesn't explicitly OR in
CR4_RESERVED_BITS (and why the new code doesn't do so either).

Fixes: 2ed41aa631fc ("KVM: VMX: Intercept guest reserved CR4 bits to inject #GP fault")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Explicitly do runtime CPUID updates "after" initial setup

Explicitly perform runtime CPUID adjustments as part of the "after set
CPUID" flow to guard against bugs where KVM consumes stale vCPU/CPUID
state during kvm_update_cpuid_runtime(). E.g. see commit 4736d85f0d18
("KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT").

Whacking each mole individually is not sustainable or robust, e.g. while
the aforemention commit fixed KVM's PV features, the same issue lurks for
Xen and Hyper-V features, Xen and Hyper-V simply don't have any runtime
features (though spoiler alert, neither should KVM).

Updating runtime features in the "full" path will also simplify adding a
snapshot of the guest's capabilities, i.e. of caching the intersection of
guest CPUID and kvm_cpu_caps (modulo a few edge cases).

Link: https://lore.kernel.org/r/20241128013424.4096668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Do all post-set CPUID processing during vCPU creation

During vCPU creation, process KVM's default, empty CPUID as if userspace
set an empty CPUID to ensure consistent and correct behavior with respect
to guest CPUID.  E.g. if userspace never sets guest CPUID, KVM will never
configure cr4_guest_rsvd_bits, and thus create divergent, incorrect, guest-
visible behavior due to letting the guest set any KVM-supported CR4 bits
despite the features not being allowed per guest CPUID.

Note!  This changes KVM's ABI, as lack of full CPUID processing allowed
userspace to stuff garbage vCPU state, e.g. userspace could set CR4 to a
guest-unsupported value via KVM_SET_SREGS.  But it's extremely unlikely
that this is a breaking change, as KVM already has many flows that require
userspace to set guest CPUID before loading vCPU state.  E.g. multiple MSR
flows consult guest CPUID on host writes, and KVM_SET_SREGS itself already
relies on guest CPUID being up-to-date, as KVM's validity check on CR3
consumes CPUID.0x7.1 (for LAM) and CPUID.0x80000008 (for MAXPHYADDR).

Furthermore, the plan is to commit to enforcing guest CPUID for userspace
writes to MSRs, at which point bypassing sregs CPUID checks is even more
nonsensical.

Link: https://lore.kernel.org/r/20241128013424.4096668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Limit use of F() and SF() to kvm_cpu_cap_{mask,init_kvm_defined}()

Define and undefine the F() and SF() macros precisely around
kvm_set_cpu_caps() to make it all but impossible to use the macros outside
of kvm_cpu_cap_{mask,init_kvm_defined}(). Currently, F() is a simple
passthrough, but SF() is actively dangerous as it checks that the scattered
feature is supported by the host kernel.

And usage outside of the aforementioned helpers will run afoul of future
changes to harden KVM's CPUID management.

Opportunistically switch to feature_bit() when stuffing LA57 based on raw
hardware support.

No functional change intended.

Link: https://lore.kernel.org/r/20241128013424.4096668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Use feature_bit() to clear CONSTANT_TSC when emulating CPUID

When clearing CONSTANT_TSC during CPUID emulation due to a Hyper-V quirk,
use feature_bit() instead of SF() to ensure the bit is actually cleared.
SF() evaluates to zero if the _host_ doesn't support the feature. I.e.
KVM could keep the bit set if userspace advertised CONSTANT_TSC despite
it not being supported in hardware.

Note, translating from a scattered feature to a the hardware version is
done by __feature_translate(), not SF(). The sole purpose of SF() is to
check kernel support for the scattered feature, *before* translation.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Override ARCH for x86_64 instead of using ARCH_DIR

Now that KVM selftests uses the kernel's canonical arch paths, directly
override ARCH to 'x86' when targeting x86_64 instead of defining ARCH_DIR
to redirect to appropriate paths. ARCH_DIR was originally added to deal
with KVM selftests using the target triple ARCH for directories, e.g.
s390x and aarch64; keeping it around just to deal with the one-off alias
from x86_64=>x86 is unnecessary and confusing.

Note, even when selftests are built from the top-level Makefile, ARCH is
scoped to KVM's makefiles, i.e. overriding ARCH won't trip up some other
selftests that (somehow) expects x86_64 and can't work with x86.

Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use canonical $(ARCH) paths for KVM selftests directories

Use the kernel's canonical $(ARCH) paths instead of the raw target triple
for KVM selftests directories. KVM selftests are quite nearly the only
place in the entire kernel that using the target triple for directories,
tools/testing/selftests/drivers/s390x being the lone holdout.

Using the kernel's preferred nomenclature eliminates the minor, but
annoying, friction of having to translate to KVM's selftests directories,
e.g. for pattern matching, opening files, running selftests, etc.

Opportunsitically delete file comments that reference the full path of the
file, as they are obviously prone to becoming stale, and serve no known
purpose.

Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Provide empty 'all' and 'clean' targets for unsupported ARCHs

Provide empty targets for KVM selftests if the target architecture is
unsupported to make it obvious which architectures are supported, and so
that various side effects don't fail and/or do weird things, e.g. as is,
"mkdir -p $(sort $(dir $(TEST_GEN_PROGS)))" fails due to a missing operand,
and conversely, "$(shell mkdir -p $(sort $(OUTPUT)/$(ARCH_DIR) ..." will
create an empty, useless directory for the unsupported architecture.

Move the guts of the Makefile to Makefile.kvm so that it's easier to see
that the if-statement effectively guards all of KVM selftests.

Reported-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Verify KVM correctly handles mprotect(PROT_READ)

Add two phases to mmu_stress_test to verify that KVM correctly handles
guest memory that was writable, and then made read-only in the primary MMU,
and then made writable again.

Add bonus coverage for x86 and arm64 to verify that all of guest memory was
marked read-only. Making forward progress (without making memory writable)
requires arch specific code to skip over the faulting instruction, but the
test can at least verify each vCPU's starting page was made read-only for
other architectures.

Link: https://lore.kernel.org/r/20241128005547.4077116-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add a read-only mprotect() phase to mmu_stress_test

Add a third phase of mmu_stress_test to verify that mprotect()ing guest
memory to make it read-only doesn't cause explosions, e.g. to verify KVM
correctly handles the resulting mmu_notifier invalidations.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Precisely limit the number of guest loops in mmu_stress_test

Run the exact number of guest loops required in mmu_stress_test instead
of looping indefinitely in anticipation of adding more stages that run
different code (e.g. reads instead of writes).

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use vcpu_arch_put_guest() in mmu_stress_test

Use vcpu_arch_put_guest() to write memory from the guest in
mmu_stress_test as an easy way to provide a bit of extra coverage.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Enable mmu_stress_test on arm64

Enable the mmu_stress_test on arm64. The intent was to enable the test
across all architectures when it was first added, but a few goofs made it
unrunnable on !x86. Now that those goofs are fixed, at least for arm64,
enable the test.

Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: sefltests: Explicitly include ucall_common.h in mmu_stress_test.c

Explicitly include ucall_common.h in the MMU stress test, as unlike arm64
and x86-64, RISC-V doesn't include ucall_common.h in its processor.h, i.e.
this will allow enabling the test on RISC-V.

Reported-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Compute number of extra pages needed in mmu_stress_test

Create mmu_stress_tests's VM with the correct number of extra pages needed
to map all of memory in the guest. The bug hasn't been noticed before as
the test currently runs only on x86, which maps guest memory with 1GiB
pages, i.e. doesn't need much memory in the guest for page tables.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Only muck with SREGS on x86 in mmu_stress_test

Try to get/set SREGS in mmu_stress_test only when running on x86, as the
ioctls are supported only by x86 and PPC, and the latter doesn't yet
support KVM selftests.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Rename max_guest_memory_test to mmu_stress_test

Rename max_guest_memory_test to mmu_stress_test so that the name isn't
horribly misleading when future changes extend the test to verify things
like mprotect() interactions, and because the test is useful even when its
configured to populate far less than the maximum amount of guest memory.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Check for a potential unhandled exception iff KVM_RUN succeeded

Don't check for an unhandled exception if KVM_RUN failed, e.g. if it
returned errno=EFAULT, as reporting unhandled exceptions is done via a
ucall, i.e. requires KVM_RUN to exit cleanly. Theoretically, checking
for a ucall on a failed KVM_RUN could get a false positive, e.g. if there
were stale data in vcpu->run from a previous exit.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Assert that vcpu_{g,s}et_reg() won't truncate

Assert that the register being read/written by vcpu_{g,s}et_reg() is no
larger than a uint64_t, i.e. that a selftest isn't unintentionally
truncating the value being read/written.

Ideally, the assert would be done at compile-time, but that would limit
the checks to hardcoded accesses and/or require fancier compile-time
assertion infrastructure to filter out dynamic usage.

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Return a value from vcpu_get_reg() instead of using an out-param

Return a uint64_t from vcpu_get_reg() instead of having the caller provide
a pointer to storage, as none of the vcpu_get_reg() usage in KVM selftests
accesses a register larger than 64 bits, and vcpu_set_reg() only accepts a
64-bit value. If a use case comes along that needs to get a register that
is larger than 64 bits, then a utility can be added to assert success and
take a void pointer, but until then, forcing an out param yields ugly code
and prevents feeding the output of vcpu_get_reg() into vcpu_set_reg().

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: Move KVM_REG_SIZE() definition to common uAPI header

Define KVM_REG_SIZE() in the common kvm.h header, and delete the arm64 and
RISC-V versions. As evidenced by the surrounding definitions, all aspects
of the register size encoding are generic, i.e. RISC-V should have moved
arm64's definition to common code instead of copy+pasting.

Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Link: https://lore.kernel.org/r/20241128005547.4077116-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

Merge tag 'kvm-riscv-fixes-6.13-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.13, take #1

- Replace csr_write() with csr_set() for HVIEN PMU overflow bit

KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init

Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime.  The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values.  The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

  SKX 11650
  ICX 22350
  EMR 28850

Using cached values, the latency drops to:

  SKX 6850
  ICX 9000
  EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR.  The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).

SKX:
  CPUID.0xD.2  = 348 cycles
  CPUID.0xD.3  = 400 cycles
  CPUID.0xD.4  = 276 cycles
  CPUID.0xD.5  = 236 cycles
  <other sub-leaves are similar>

EMR:
  CPUID.0xD.2  = 1138 cycles
  CPUID.0xD.3  = 1362 cycles
  CPUID.0xD.4  = 1068 cycles
  CPUID.0xD.5  = 910 cycles
  CPUID.0xD.6  = 914 cycles
  CPUID.0xD.7  = 1350 cycles
  CPUID.0xD.8  = 734 cycles
  CPUID.0xD.9  = 766 cycles
  CPUID.0xD.10 = 732 cycles
  CPUID.0xD.11 = 718 cycles
  CPUID.0xD.12 = 734 cycles
  CPUID.0xD.13 = 1700 cycles
  CPUID.0xD.14 = 1126 cycles
  CPUID.0xD.15 = 898 cycles
  CPUID.0xD.16 = 716 cycles
  CPUID.0xD.17 = 748 cycles
  CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD.  E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2.  That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge tag 'kvmarm-fixes-6.13-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

- Align nested page table walker with the intended memory attribute
   combining rules of the architecture

- Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

- Don't leak references on LPIs when insertion into the translation
   cache fails

Linux 6.13-rc2

Merge tag 'kbuild-fixes-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- Fix a section mismatch warning in modpost

- Fix Debian package build error with the O= option

* tag 'kbuild-fixes-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: deb-pkg: fix build error with O=
modpost: Add .irqentry.text to OTHER_SECTIONS

Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

- Fix a /proc/interrupts formatting regression

- Have the BCM2836 interrupt controller enter power management states
   properly

- Other fixlets

* tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing
  genirq/proc: Add missing space separator back
  irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v3: Fix irq_complete_ack() comment

Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Borislav Petkov:

- Handle the case where clocksources with small counter width can,
   in conjunction with overly long idle sleeps, falsely trigger the
   negative motion detection of clocksources

* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Make negative motion detection more robust

Merge tag 'x86_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

- Have the Automatic IBRS setting check on AMD does not falsely fire in
   the guest when it has been set already on the host

- Make sure cacheinfo structures memory is allocated to address a boot
   NULL ptr dereference on Intel Meteor Lake which has different numbers
   of subleafs in its CPUID(4) leaf

- Take care of the GDT restoring on the kexec path too, as expected by
   the kernel

- Make sure SMP is not disabled when IO-APIC is disabled on the kernel
   cmdline

- Add a PGD flag _PAGE_NOPTISHADOW to instruct machinery not to
   propagate changes to the kernelmode page tables, to the user portion,
   in PTI

- Mark Intel Lunar Lake as affected by an issue where MONITOR wakeups
   can get lost and thus user-visible delays happen

- Make sure PKRU is properly restored with XRSTOR on AMD after a PRKU
   write of 0 (WRPKRU) which will mark PKRU in its init state and thus
   lose the actual buffer

* tag 'x86_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: WARN when setting EFER.AUTOIBRS if and only if the WRMSR fails
  x86/cacheinfo: Delete global num_cache_leaves
  cacheinfo: Allocate memory during CPU hotplug if not done from the primary CPU
  x86/kexec: Restore GDT on return from ::preserve_context kexec
  x86/cpu/topology: Remove limit of CPUs due to disabled IO/APIC
  x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables
  x86/cpu: Add Lunar Lake to list of CPUs with a broken MONITOR implementation
  x86/pkeys: Ensure updated PKRU value is XRSTOR'd
  x86/pkeys: Change caller of update_pkru_in_sigframe()

Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"24 hotfixes.  17 are cc:stable.  15 are MM and 9 are non-MM.

  The usual bunch of singletons - please see the relevant changelogs for
  details"

* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  iio: magnetometer: yas530: use signed integer type for clamp limits
  sched/numa: fix memory leak due to the overwritten vma->numab_state
  mm/damon: fix order of arguments in damos_before_apply tracepoint
  lib: stackinit: hide never-taken branch from compiler
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
  scatterlist: fix incorrect func name in kernel-doc
  mm: correct typo in MMAP_STATE() macro
  mm: respect mmap hint address when aligning for THP
  mm: memcg: declare do_memsw_account inline
  mm/codetag: swap tags when migrate pages
  ocfs2: update seq_file index in ocfs2_dlm_seq_next
  stackdepot: fix stack_depot_save_flags() in NMI context
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()
  mm: fix vrealloc()'s KASAN poisoning logic
  Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
  selftests/damon: add _damon_sysfs.py to TEST_FILES
  selftest: hugetlb_dio: fix test naming
  ocfs2: free inode when ocfs2_get_init_inode() fails
  nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
  ...

kbuild: deb-pkg: fix build error with O=

Since commit 13b25489b6f8 ("kbuild: change working directory to external
module directory with M="), the Debian package build fails if a relative
path is specified with the O= option.

  $ make O=build bindeb-pkg
    [ snip ]
  dpkg-deb: building package 'linux-image-6.13.0-rc1' in '../linux-image-6.13.0-rc1_6.13.0-rc1-6_amd64.deb'.
  Rebuilding host programs with x86_64-linux-gnu-gcc...
  make[6]: Entering directory '/home/masahiro/linux/build'
  /home/masahiro/linux/Makefile:190: *** specified kernel directory "build" does not exist.  Stop.

This occurs because the sub_make_done flag is cleared, even though the
working directory is already in the output directory.

Passing KBUILD_OUTPUT=. resolves the issue.

Fixes: 13b25489b6f8 ("kbuild: change working directory to external module directory with M=")
Reported-by: Charlie Jenkins <charlie@rivosinc.com>
Closes: https://lore.kernel.org/all/Z1DnP-GJcfseyrM3@ghost/
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

modpost: Add .irqentry.text to OTHER_SECTIONS

The compiler can fully inline the actual handler function of an interrupt
entry into the .irqentry.text entry point. If such a function contains an
access which has an exception table entry, modpost complains about a
section mismatch:

  WARNING: vmlinux.o(__ex_table+0x447c): Section mismatch in reference ...

  The relocation at __ex_table+0x447c references section ".irqentry.text"
  which is not in the list of authorized sections.

Add .irqentry.text to OTHER_SECTIONS to cure the issue.

Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # needed for linux-5.4-y
Link: https://lore.kernel.org/all/20241128111844.GE10431@google.com/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

Merge tag '6.13-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- DFS fix (for race with tree disconnect and dfs cache worker)

- Four fixes for SMB3.1.1 posix extensions:
      - improve special file support e.g. to Samba, retrieving the file
        type earlier
      - reduce roundtrips (e.g. on ls -l, in some cases)

* tag '6.13-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: fix potential race in cifs_put_tcon()
  smb3.1.1: fix posix mounts to older servers
  fs/smb/client: cifs_prime_dcache() for SMB3 POSIX reparse points
  fs/smb/client: Implement new SMB3 POSIX type
  fs/smb/client: avoid querying SMB2_OP_QUERY_WSL_EA for SMB3 POSIX

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Large number of small fixes, all in drivers"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (32 commits)
  scsi: scsi_debug: Fix hrtimer support for ndelay
  scsi: storvsc: Do not flag MAINTENANCE_IN return of SRB_STATUS_DATA_OVERRUN as an error
  scsi: ufs: core: Add missing post notify for power mode change
  scsi: sg: Fix slab-use-after-free read in sg_release()
  scsi: ufs: core: sysfs: Prevent div by zero
  scsi: qla2xxx: Update version to 10.02.09.400-k
  scsi: qla2xxx: Supported speed displayed incorrectly for VPorts
  scsi: qla2xxx: Fix NVMe and NPIV connect issue
  scsi: qla2xxx: Remove check req_sg_cnt should be equal to rsp_sg_cnt
  scsi: qla2xxx: Fix use after free on unload
  scsi: qla2xxx: Fix abort in bsg timeout
  scsi: mpi3mr: Update driver version to 8.12.0.3.50
  scsi: mpi3mr: Handling of fault code for insufficient power
  scsi: mpi3mr: Start controller indexing from 0
  scsi: mpi3mr: Fix corrupt config pages PHY state is switched in sysfs
  scsi: mpi3mr: Synchronize access to ioctl data buffer
  scsi: mpt3sas: Update driver version to 51.100.00.00
  scsi: mpt3sas: Diag-Reset when Doorbell-In-Use bit is set during driver load time
  scsi: ufs: pltfrm: Dellocate HBA during ufshcd_pltfrm_remove()
  scsi: ufs: pltfrm: Drop PM runtime reference count after ufshcd_remove()
  ...

Merge tag 'block-6.13-20241207' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- NVMe pull request via Keith:
      - Target fix using incorrect zero buffer (Nilay)
      - Device specifc deallocate quirk fixes (Christoph, Keith)
      - Fabrics fix for handling max command target bugs (Maurizio)
      - Cocci fix usage for kzalloc (Yu-Chen)
      - DMA size fix for host memory buffer feature (Christoph)
      - Fabrics queue cleanup fixes (Chunguang)

- CPU hotplug ordering fixes

- Add missing MODULE_DESCRIPTION for rnull

- bcache error value fix

- virtio-blk queue freeze fix

* tag 'block-6.13-20241207' of git://git.kernel.dk/linux:
  blk-mq: move cpuhp callback registering out of q->sysfs_lock
  blk-mq: register cpuhp callback after hctx is added to xarray table
  virtio-blk: don't keep queue frozen during system suspend
  nvme-tcp: simplify nvme_tcp_teardown_io_queues()
  nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()
  nvme-rdma: unquiesce admin_q before destroy it
  nvme-tcp: fix the memleak while create new ctrl failed
  nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary
  nvmet: replace kmalloc + memset with kzalloc for data allocation
  nvme-fabrics: handle zero MAXCMD without closing the connection
  bcache: revert replacing IS_ERR_OR_NULL with IS_ERR again
  nvme-pci: remove two deallocate zeroes quirks
  block: rnull: add missing MODULE_DESCRIPTION
  nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported
  nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()

Merge tag 'io_uring-6.13-20241207' of git://git.kernel.dk/linux

Pull io_uring fix from Jens Axboe:
"A single fix for a parameter type which affects 32-bit"

* tag 'io_uring-6.13-20241207' of git://git.kernel.dk/linux:
io_uring: Change res2 parameter type in io_uring_cmd_done

Merge tag 'ubifs-for-linus-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull jffs2 fix from Richard Weinberger:

- Fixup rtime compressor bounds checking

* tag 'ubifs-for-linus-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
jffs2: Fix rtime decompressor

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann::

- Fix several issues for BPF LPM trie map which were found by syzbot
   and during addition of new test cases (Hou Tao)

- Fix a missing process_iter_arg register type check in the BPF
   verifier (Kumar Kartikeya Dwivedi, Tao Lyu)

- Fix several correctness gaps in the BPF verifier when interacting
   with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi,
   Eduard Zingerman, Tao Lyu)

- Fix OOB BPF map writes when deleting elements for the case of xsk map
   as well as devmap (Maciej Fijalkowski)

- Fix xsk sockets to always clear DMA mapping information when
   unmapping the pool (Larysa Zaremba)

- Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after
   sent bytes have been finalized (Zijian Zhang)

- Fix BPF sockmap with vsocks which was missing a queue check in poll
   and sockmap cleanup on close (Michal Luczaj)

- Fix tools infra to override makefile ARCH variable if defined but
   empty, which addresses cross-building tools. (Björn Töpel)

- Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols
   (Thomas Weißschuh)

- Fix a NULL pointer dereference in bpftool (Amir Mohammadi)

- Fix BPF selftests to check for CONFIG_PREEMPTION instead of
   CONFIG_PREEMPT (Sebastian Andrzej Siewior)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
  selftests/bpf: Add more test cases for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  bpf: Use raw_spinlock_t for LPM trie
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Remove unnecessary check when updating LPM trie
  selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
  selftests/bpf: Add test for reading from STACK_INVALID slots
  selftests/bpf: Introduce __caps_unpriv annotation for tests
  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
  samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
  bpf: Zero index arg error string for dynptr and iter
  selftests/bpf: Add tests for iter arg check
  bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
  tools: Override makefile ARCH variable if defined, but empty
  selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
  ...

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:
"Nothing major, some left-overs from the recent merging window (MTE,
  coco) and some newly found issues like the ptrace() ones.

   - MTE/hugetlbfs:

      - Set VM_MTE_ALLOWED in the arch code and remove it from the core
        code for hugetlbfs mappings

      - Fix copy_highpage() warning when the source is a huge page but
        not MTE tagged, taking the wrong small page path

   - drivers/virt/coco:

      - Add the pKVM and Arm CCA drivers under the arm64 maintainership

      - Fix the pkvm driver to fall back to ioremap() (and warn) if the
        MMIO_GUARD hypercall fails

      - Keep the Arm CCA driver default 'n' rather than 'm'

   - A series of fixes for the arm64 ptrace() implementation,
     potentially leading to the kernel consuming uninitialised stack
     variables when PTRACE_SETREGSET is invoked with a length of 0

   - Fix zone_dma_limit calculation when RAM starts below 4GB and
     ZONE_DMA is capped to this limit

   - Fix early boot warning with CONFIG_DEBUG_VIRTUAL=y triggered by a
     call to page_to_phys() (from patch_map()) which checks pfn_valid()
     before vmemmap has been set up

   - Do not clobber bits 15:8 of the ASID used for TTBR1_EL1 and TLBI
     ops when the kernel assumes 8-bit ASIDs but running under a
     hypervisor on a system that implements 16-bit ASIDs (found running
     Linux under Parallels on Apple M4)

   - ACPI/IORT: Add PMCG platform information for HiSilicon HIP09A as it
     is using the same SMMU PMCG as HIP09 and suffers from the same
     errata

   - Add GCS to cpucap_is_possible(), missed in the recent merge"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: ptrace: fix partial SETREGSET for NT_ARM_GCS
  arm64: ptrace: fix partial SETREGSET for NT_ARM_POE
  arm64: ptrace: fix partial SETREGSET for NT_ARM_FPMR
  arm64: ptrace: fix partial SETREGSET for NT_ARM_TAGGED_ADDR_CTRL
  arm64: cpufeature: Add GCS to cpucap_is_possible()
  coco: virt: arm64: Do not enable cca guest driver by default
  arm64: mte: Fix copy_highpage() warning on hugetlb folios
  arm64: Ensure bits ASID[15:8] are masked out when the kernel uses 8-bit ASIDs
  ACPI/IORT: Add PMCG platform information for HiSilicon HIP09A
  MAINTAINERS: Add CCA and pKVM CoCO guest support to the ARM64 entry
  drivers/virt: pkvm: Don't fail ioremap() call if MMIO_GUARD fails
  arm64: patching: avoid early page_to_phys()
  arm64: mm: Fix zone_dma_limit calculation
  arm64: mte: set VM_MTE_ALLOWED for hugetlbfs at correct place

Merge tag 'fixes-2024-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock fixes from Mike Rapoport:
"Restore check for node validity in arch_numa.

  The rework of NUMA initialization in arch_numa dropped a check that
  refused to accept configurations with invalid node IDs.

  Restore that check to ensure that when firmware passes invalid nodes,
  such configuration is rejected and kernel gracefully falls back to
  dummy NUMA"

* tag 'fixes-2024-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  arch_numa: Restore nid checks before registering a memblock with a node
  memblock: allow zero threshold in validate_numa_converage()

Merge tag 'drm-fixes-2024-12-06' of https://gitlab.freedesktop.org/drm/kernel

Pull more drm fixes from Simona Vetter:
"Due to mailing list unreliability we missed the amdgpu pull, hence
  part two with that now included:

   - amdgu: mostly display fixes + jpeg vcn 1.0, sriov, dcn4.0 resume
     fixes

   - amdkfd fixes"

* tag 'drm-fixes-2024-12-06' of https://gitlab.freedesktop.org/drm/kernel:
  drm/amdgpu: rework resume handling for display (v2)
  drm/amd/pm: fix and simplify workload handling
  Revert "drm/amd/pm: correct the workload setting"
  drm/amdgpu: fix sriov reinit late orders
  drm/amdgpu: Fix ISP hw init issue
  drm/amd/display: Add hblank borrowing support
  drm/amd/display: Limit VTotal range to max hw cap minus fp
  drm/amd/display: Correct prefetch calculation
  drm/amd/display: Add option to retrieve detile buffer size
  drm/amd/display: Add a left edge pixel if in YCbCr422 or YCbCr420 and odm
  drm/amdkfd: hard-code cacheline for gc943,gc944
  drm/amdkfd: add MEC version that supports no PCIe atomics for GFX12
  drm/amd/display: Fix programming backlight on OLED panels
  drm/amd: Sanity check the ACPI EDID
  drm/amdgpu/hdp7.0: do a posting read when flushing HDP
  drm/amdgpu/hdp6.0: do a posting read when flushing HDP
  drm/amdgpu/hdp5.2: do a posting read when flushing HDP
  drm/amdgpu/hdp5.0: do a posting read when flushing HDP
  drm/amdgpu/hdp4.0: do a posting read when flushing HDP
  drm/amdgpu/jpeg1.0: fix idle work handler

Merge tag 'amd-drm-fixes-6.13-2024-12-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.13-2024-12-04:

amdgpu:
- Jpeg work handler fix for VCN 1.0
- HDP flush fixes
- ACPI EDID sanity check
- OLED panel backlight fix
- DC YCbCr fix
- DC Detile buffer size debugging
- DC prefetch calculation fix
- DC VTotal handling fix
- DC HBlank fix
- ISP fix
- SR-IOV fix
- Workload profile fixes
- DCN 4.0.1 resume fix

amdkfd:
- GC 12.x fix
- GC 9.4.x fix

Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241206190452.2571042-1-alexander.deucher@amd.com

Merge tag 'drm-fixes-2024-12-07' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Pretty quiet week which is probably expected after US holidays, the
  dma-fence and displayport MST message handling fixes make up the bulk
  of this, along with a couple of minor xe and other driver fixes.

  dma-fence:
   - Fix reference leak on fence-merge failure path
   - Simplify fence merging with kernel's sort()
   - Fix dma_fence_array_signaled() to ensure forward progress

  dp_mst:
   - Fix MST sideband message body length check
   - Fix a bunch of locking/state handling with DP MST msgs

  sti:
   - Add __iomem for mixer_dbg_mxn()'s parameter

  xe:
   - Missing init value and 64-bit write-order check
   - Fix a memory allocation issue causing lockdep violation

  v3d:
   - Performance counter fix"

* tag 'drm-fixes-2024-12-07' of https://gitlab.freedesktop.org/drm/kernel:
  drm/v3d: Enable Performance Counters before clearing them
  drm/dp_mst: Use reset_msg_rx_state() instead of open coding it
  drm/dp_mst: Reset message rx state after OOM in drm_dp_mst_handle_up_req()
  drm/dp_mst: Ensure mst_primary pointer is valid in drm_dp_mst_handle_up_req()
  drm/dp_mst: Fix down request message timeout handling
  drm/dp_mst: Simplify error path in drm_dp_mst_handle_down_rep()
  drm/dp_mst: Verify request type in the corresponding down message reply
  drm/dp_mst: Fix resetting msg rx state after topology removal
  drm/xe: Move the coredump registration to the worker thread
  drm/xe/guc: Fix missing init value and add register order check
  drm/sti: Add __iomem for mixer_dbg_mxn's parameter
  drm/dp_mst: Fix MST sideband message body length check
  dma-buf: fix dma_fence_array_signaled v4
  dma-fence: Use kernel's sort for merging fences
  dma-fence: Fix reference leak on fence merge failure path

Merge tag 'sound-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes that have been gathered in the week.

   - Fix the missing XRUN handling in USB-audio low latency mode

   - Fix regression by the previous USB-audio hadening change

   - Clean up old SH sound driver to use the standard helpers

   - A few further fixes for MIDI 2.0 UMP handling

   - Various HD-audio and USB-audio quirks

   - Fix jack handling at PM on ASoC Intel AVS

   - Misc small fixes for ASoC SOF and Mediatek"

* tag 'sound-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek: Fix spelling mistake "Firelfy" -> "Firefly"
  ASoC: mediatek: mt8188-mt6359: Remove hardcoded dmic codec
  ALSA: hda/realtek: fix micmute LEDs don't work on HP Laptops
  ALSA: usb-audio: Add extra PID for RME Digiface USB
  ALSA: usb-audio: Fix a DMA to stack memory bug
  ASoC: SOF: ipc3-topology: fix resource leaks in sof_ipc3_widget_setup_comp_dai()
  ALSA: hda/realtek: Add support for Samsung Galaxy Book3 360 (NP730QFG)
  ASoC: Intel: avs: da7219: Remove suspend_pre() and resume_post()
  ALSA: hda/tas2781: Fix error code tas2781_read_acpi()
  ALSA: hda/realtek: Enable mute and micmute LED on HP ProBook 430 G8
  ALSA: usb-audio: add mixer mapping for Corsair HS80
  ALSA: ump: Shut up truncated string warning
  ALSA: sh: Use standard helper for buffer accesses
  ALSA: usb-audio: Notify xrun for low-latency mode
  ALSA: hda/conexant: fix Z60MR100 startup pop issue
  ALSA: ump: Update legacy substream names upon FB info update
  ALSA: ump: Indicate the inactive group in legacy substream names
  ALSA: ump: Don't open legacy substream for an inactive group
  ALSA: seq: ump: Fix seq port updates per FB info notify