Chao Gao [Wed, 20 May 2026 22:28:59 +0000 (15:28 -0700)]
coco/tdx-host: Don't expose P-SEAMLDR information on CPUs with erratum
TDX-capable CPUs clobber the current VMCS on P-SEAMLDR calls. Clearing
the current VMCS behind KVM's back breaks KVM.
Future CPUs will fix this by preserving the current VMCS across
P-SEAMLDR calls. A future specification update will describe the
VMCS-clearing behavior as an erratum and to state that it does not
occur when IA32_VMX_BASIC[60] is set.
Add a CPU bug bit and refuse to expose P-SEAMLDR information on
affected CPUs.
Use a CPU bug bit to stay consistent with X86_BUG_TDX_PW_MCE. As a
bonus, the bug bit is visible to userspace, which allows userspace to
determine why these sysfs files are not exposed, and it can also be
checked by other kernel components in the future if needed.
== Alternatives ==
Two workarounds were considered but both were rejected:
1. Save/restore the current VMCS around P-SEAMLDR calls. This produces ugly
assembly code [1] and doesn't play well with #MCE or #NMI if they
need to use the current VMCS.
2. Move KVM's VMCS tracking logic to the TDX core code, which would break
the boundary between KVM and the TDX core code [2].
Chao Gao [Wed, 20 May 2026 22:28:57 +0000 (15:28 -0700)]
coco/tdx-host: Expose P-SEAMLDR information via sysfs
TDX module updates require userspace to select the appropriate module
to load. Expose necessary information to facilitate this decision. Two
values are needed:
- P-SEAMLDR version: for compatibility checks between TDX module and
P-SEAMLDR
- num_remaining_updates: indicates how many updates can be performed
Expose them as tdx-host device attributes visible only when updates
are supported.
Note that the underlying P-SEAMLDR attributes are available regardless
of update support; this only restricts their visibility to userspace.
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260520133909.409394-11-chao.gao@intel.com
Chao Gao [Wed, 20 May 2026 22:28:56 +0000 (15:28 -0700)]
x86/virt/seamldr: Add a helper to retrieve P-SEAMLDR information
P-SEAMLDR reports its state via SEAMLDR.INFO, including its version and
the number of remaining runtime updates.
This information is useful for userspace. For example, userspace can
use the P-SEAMLDR version to determine whether a candidate TDX module
is compatible with the running loader, and can use the remaining
update count to determine whether another runtime update is still
possible.
Add a helper to retrieve P-SEAMLDR information in preparation for
exposing P-SEAMLDR version and other necessary information to userspace.
Export the new kAPI for use by the "tdx_host" device.
Note that there are two distinct P-SEAMLDR APIs with similar names:
"SEAMLDR.INFO" is metadata about the loader. It's metadata for the
update process.
"SEAMLDR.SEAMINFO" is metadata about SEAM mode. It is for the module
init process, not for the update process.
Use SEAMLDR.INFO here.
For details, see "Intel Trust Domain Extensions - SEAM Loader (SEAMLDR)
Interface Specification".
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260520133909.409394-10-chao.gao@intel.com
Chao Gao [Wed, 20 May 2026 22:28:55 +0000 (15:28 -0700)]
x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs
The TDX architecture uses the "SEAMCALL" instruction to communicate with
SEAM mode software. Right now, the only SEAM mode software that the kernel
communicates with is the TDX module. But, there is actually another
component that runs in SEAM mode but it is separate from the TDX module:
the persistent SEAM loader or "P-SEAMLDR". Right now, the only component
that communicates with it is the BIOS which loads the TDX module itself at
boot. But, to support updating the TDX module, the kernel now needs to be
able to talk to it.
P-SEAMLDR SEAMCALLs differ from TDX module SEAMCALLs in areas such as
concurrency requirements.
Add a P-SEAMLDR wrapper to handle these differences and prepare for
implementing concrete functions.
Use seamcall_prerr() (not '_ret') because current P-SEAMLDR calls do not
use any output registers other than RAX.
Note: Despite the similar name, the NP-SEAMLDR ("Non-Persistent")
(ACM) invoked exclusively by the BIOS at boot rather than a component
running in SEAM mode. The kernel cannot call it at runtime. It exposes
no SEAMCALL interface.
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://cdrdv2.intel.com/v1/dl/getContent/733582 Link: https://patch.msgid.link/20260520133909.409394-9-chao.gao@intel.com
Chao Gao [Wed, 20 May 2026 22:28:53 +0000 (15:28 -0700)]
coco/tdx-host: Expose TDX module version
For TDX module updates, userspace needs to select compatible update
versions based on the current module version.
For example, the 1.5.x series runs on Sapphire Rapids but not Granite
Rapids, which needs 2.0.x. Updates are also constrained by version
distance, so a 1.5.6 module might permit updates to 1.5.7 but not to
1.5.20.
Start the process of punting the version selection logic to userspace.
Expose the TDX module version in the new faux device.
Define TDX_VERSION_FMT macro for the TDX version format since it will be
used multiple times. Also convert an existing print statement to use it.
== Background ==
For posterity, here's what other firmware mechanisms do:
1. AMD SEV leverages an existing PCI device for the PSP to expose
metadata. TDX uses a faux device as it doesn't have PCI device
in its architecture.
2. Microcode uses per-CPU virtual devices to report microcode revisions
because CPUs can have different revisions. But, there is only a
single TDX module, so exposing the TDX module version through a global
TDX faux device is appropriate
3. ARM's CCA implementation isn't in-tree yet, but will likely follow a
similar faux device approach, though it's unclear whether they need
to expose firmware version information
[ dhansen: trim changelog ]
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/2025073035-bulginess-rematch-b92e@gregkh/ Link: https://patch.msgid.link/20260520133909.409394-8-chao.gao@intel.com
Chao Gao [Wed, 20 May 2026 22:28:52 +0000 (15:28 -0700)]
coco/tdx-host: Introduce a "tdx_host" device
TDX depends on a platform firmware module that runs on the CPU.
Unlike other CoCo architectures, TDX has no hardware "device"
running the show, just a blob on the CPU.
Create a virtual device to anchor interactions with this platform
firmware. This lets later code:
- expose metadata: TDX module version, seamldr version, to userspace
as device attributes
- implement firmware uploader APIs (which are tied to a device) to
support TDX module runtime updates
Use a faux device because the TDX module is singular within the system
and has no platform resources. Using a faux device eliminates the need
to create a stub bus.
The call to tdx_get_sysinfo() ensures that the TDX module is ready to
provide services.
Note that AMD has a PCI device for the PSP for SEV and ARM CCA will
likely have a faux device [1].
Thanks to Dan and Yilun for all the help on this one.
[ dhansen: trim changelog ]
Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/all/2025073035-bulginess-rematch-b92e@gregkh/ Link: https://patch.msgid.link/20260520133909.409394-7-chao.gao@intel.com
Kai Huang [Wed, 20 May 2026 22:28:51 +0000 (15:28 -0700)]
x86/virt/tdx: Move low level SEAMCALL helpers out of <asm/tdx.h>
TDX host core code implements three seamcall*() helpers to make SEAMCALLs
to the TDX module. Currently, they are implemented in <asm/tdx.h> and
are exposed to other kernel code which includes <asm/tdx.h>.
However, other than the TDX host core, seamcall*() are not expected to
be used by other kernel code directly. For instance, for all SEAMCALLs
that are used by KVM, the TDX host core exports a wrapper function for
each of them.
Move seamcall*() and related code out of <asm/tdx.h> and make them only
visible to TDX host core.
Since TDX host core tdx.c is already very heavy, don't put low level
seamcall*() code there but to a new dedicated "seamcall_internal.h". Also,
currently tdx.c has seamcall_prerr*() helpers which additionally print
error message when calling seamcall*() fails. Move them to
"seamcall_internal.h" as well. In such way all low level SEAMCALL helpers
are in a dedicated place, which is much more readable.
Copy the copyright notice from the original files and consolidate the
date ranges to:
Copyright (C) 2021-2023 Intel Corporation
Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Vishal Annapurve <vannapurve@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260520133909.409394-6-chao.gao@intel.com
Chao Gao [Wed, 20 May 2026 22:28:49 +0000 (15:28 -0700)]
x86/virt/tdx: Move TDX_FEATURES0 bits to asm/tdx.h
Future changes will add support for new TDX features exposed as
TDX_FEATURES0 bits. The presence of these features will need to be
checked outside of arch/x86/virt. The feature query helpers and
the TDX_FEATURES0 defines they reference will need to live in the
widely accessible asm/tdx.h header. Move the existing TDX_FEATURES0 to
asm/tdx.h so that they can all be kept together.
Opportunistically switch to BIT_ULL() since TDX_FEATURES0 is 64-bit.
Chao Gao [Wed, 20 May 2026 22:28:48 +0000 (15:28 -0700)]
x86/virt/tdx: Consolidate TDX global initialization states
The kernel uses several global flags to guard one-time TDX initialization
flows and prevent them from being repeated.
When the TDX module is updated, all of those states must be reset so that
the module can be initialized again. Today those states are kept as
separate global variables, which makes the reset path awkward and easy to
miss when a new state is added.
Group the states into a single structure so they can be reset together, for
example with memset(), and so a newly added state won't be missed.
Drop the __ro_after_init annotation from tdx_module_initialized because
the other two states do not have it. And with TDX module update support,
all the states need to be writable at runtime.
Chao Gao [Wed, 20 May 2026 22:28:47 +0000 (15:28 -0700)]
x86/virt/tdx: Move TDX global initialization states to file scope
TDX module global initialization is executed only once. The first call
caches both the result and the "done" state, and later callers reuse the
saved result. A lock protects that cached states.
Those states and the lock are currently kept as function-local statics
because they are used only by try_init_module_global().
TDX module updates need to reset the cached states so TDX global
initialization can be run again after an update. That will add another
access site in the same file.
Move the cached states to file scope so it is accessible outside
try_init_module_global(), and move the lock along with the states it
protects.
Chao Gao [Wed, 20 May 2026 22:28:46 +0000 (15:28 -0700)]
x86/virt/tdx: Clarify try_init_module_global() result caching
TDX module global initialization is executed only once. The first call
caches both the return code and the "done" state in static function
variables. Later callers read the variables. A lock protects the
saved state and serializes callers.
These variables will soon be moved to a global structure. Prepare for
that by treating the variables as a unit. Assign them together and
limit accesses to while the lock is held.
Paolo Bonzini [Wed, 3 Jun 2026 15:00:06 +0000 (17:00 +0200)]
Merge branch 'kvm-ghcb-for-7.2' into HEAD
Merge the final part of the GHCB 7.2 fixes at
https://lore.kernel.org/kvm/20260529183549.1104619-1-pbonzini@redhat.com/.
Patches 1-17 have already been included in Linux 7.1; these are minor
cleanups, and fixes for behaviors that are suboptimal or contradicting
the specification.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Remove sometimes-used function-scoped "ret" from #VMGEXIT handler
Now that only two case-statements actually need a local "ret" variable,
refactor sev_handle_vmgexit() to have all flows return directly when
possible, and bury "ret" as "r" in the two paths that need to propagate a
return value from a helper.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-25-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Turn sev_es_validate_vmgexit() into a dedicated predicate
Now that sev_es_validate_vmgexit() is only responsible for checking that
all required GHCB fields are marked valid, turn it into a predicate whose
name reflects exactly that.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-24-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Handle unknown #VMGEXIT reasons in sev_handle_vmgexit()
Handle unknown #VMGEXIT reasons in sev_handle_vmgexit(), not in
sev_es_validate_vmgexit(). This makes it _much_ more obvious that KVM
simply funnels "legacy" exits to the standard SVM interception handlers,
and is the final preparatory change needed to reduce the scope of
sev_es_validate_vmgexit().
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-23-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Return INVALID_INPUT, not MISSING_INPUT, for bad GUEST_REQUEST input(s)
Return INVALID_INPUT, not MISSING_INPUT, if the guest provides an unaligned
address for a GUEST_REQUEST, and/or attempts to use the same page for the
source and destination. The inputs are obviously invalid, not missing.
Opportunistically move the checks out of sev_es_validate_vmgexit(), to
continue the march towards reducing the scope of the helper, and to help
guide future changes into correctly handling bad input.
Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Fixes: 74458e4859d8 ("KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-22-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Return INVALID_EVENT for SNP-only #VMGEXIT from non-SNP guest
Signal INVALID_EVENT, not MISSING_INPUT, if a non-SNP guest attempts to
invoke an SNP-only #VMGEXIT. Opportunistically move the checks out of
sev_es_validate_vmgexit() to continue the march towards making said helper
a predicate whose sole purpose is to verify the guest has marked required
GHCB fields as valid.
Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event") Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT") Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Fixes: 74458e4859d8 ("KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-21-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Move GHCB "usage" check out of sev_es_validate_vmgexit()
Move the check to verify the guest's requested GHCB out of
sev_es_validate_vmgexit() as the first step towards making said helper a
predicate whose sole purpose is to verify the guest has marked required
GHCB fields as valid.
Using a single "validate" helper sounds good on paper, but in practice it's
difficult to verify that KVM is performing the necessary sanity checks (the
usage of state is far removed from the relevant checks), makes it difficult
to understand that "legacy" exits are simply routed to KVM's existing exit
handlers, and most importantly, has directly contributed to a number of
bugs as adding case-statements to the validation subtly removes them from
the default path that rejects unknown exit codes with INVALID_EVENT.
Deliberately extract the usage code check first so as to preserve the order
of KVM's checks, even though future code extraction will technically fix
bugs.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-20-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Don't terminate SNP VMs on #VMGEXIT without a registered GHCB
If the guest attempts a non-MSR #VMGEXIT without the registered GHCB,
return a GHCB_HV_RESP_MALFORMED_INPUT+GHCB_ERR_NOT_REGISTERED error to the
guest instead of exiting KVM_RUN with -EINVAL (and in likelihood killing
the VM). KVM has already mapped the requested GHCB, i.e. can cleanly
report an error, and so exiting with -EINVAL is completely unjustified.
Fixes: 0c76b1d08280 ("KVM: SEV: Add support to handle GHCB GPA register VMGEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-19-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eliot Courtney [Wed, 3 Jun 2026 07:30:25 +0000 (16:30 +0900)]
gpu: nova-core: add non-sec2 unload path
For non-sec2 it is only required to wait for GSP falcon to halt. This is
because GSP does the main work of unloading on GPUs not using sec2.
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
[ jhubbard: use Result instead of Result<()> in the UnloadBundle impl ] Signed-off-by: John Hubbard <jhubbard@nvidia.com> Link: https://patch.msgid.link/20260603-b4-blackwell-v13-8-d9f3a06939e0@nvidia.com Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
On Hopper and Blackwell, FSP boots GSP with hardware lockdown enabled.
After FSP Chain of Trust completes, the driver must poll for lockdown
release before proceeding with GSP initialization. Add the register
bit and helper functions needed for this polling.
KVM: SEV: Unmap and unpin the GHCB as needed on vCPU free
Unmap and unpin the GHCB as needed when freeing a vCPU. If the VM is
destroyed after mapping+pinning the GHCB on #VMGEXIT, without re-running
the vCPU, KVM will effectively leak the GHCB and any mappings created for
the GHCB.
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Cc: stable@vger.kernel.org Tested-by: Michael Roth <michael.roth@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-18-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()
Relocate sev_free_vcpu() down in sev.c so that it's definition comes after
sev_es_unmap_ghcb(). This will allow sharing unmap functionality between
the two functions without needing a forward declaration (or weird placement
of the common code).
No functional change intended.
Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-16-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying
When marking a page dirty, complain about not having a running/loaded vCPU
if and only if the VM is still alive, i.e. its refcount is non-zero. This
will allow fixing a memory leak for x86 SEV-ES guests without hitting what
is effectively a false positive on the WARN.
For some SEV-ES VM-Exits, KVM keeps a writable mapping of a guest page
across an exit to userspace, and typically unmaps the page on the next
KVM_RUN. But if userspace never calls KVM_RUN after such an exit, then KVM
needs to unmap the page when the vCPU is destroyed, which in turn triggers
the WARN about not having a running vCPU.
Alternatively, SEV-ES could temporarily load the vCPU to suppress the WARN,
as is done in nested_vmx_free_vcpu() (but for completely unrelated reasons;
suppressing WARN from nested_put_vmcs12_pages() is pure happenstance). But
loading a vCPU during destruction is gross (ideally nVMX code would be
cleaned up), risks complicating the SEV-ES code (KVM would need to ensure
the temporarily load()+put() only runs when the vCPU isn't already loaded),
and is ultimately pointless.
The motivation for the WARN is to guard against KVM dirtying guest memory
without pushing the corresponding GFN to the active vCPU's dirty ring, e.g.
to ensure userspace doesn't miss a dirty page. But for the VM's refcount
to reach zero, there can't be _any_ userspace mappings to the dirty ring,
as mapping the dirty ring requires doing mmap() on the vCPU FD. I.e. if
userspace had a valid mapping for the dirty ring, then the vCPU file and
thus the owning VM would still be alive. And so since userspace can't
possibly reach the dirty ring, whether or not KVM technically "misses" a
push to the dirty ring is irrelevant.
Reported-by: Michael Roth <michael.roth@amd.com> Cc: stable@vger.kernel.org Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-15-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT
Rework Page State Change (PSC) handling to read the guest-provided start
and end indices exactly once, at the beginning of the request. Re-reading
the indices is "fine", _if_ the guest is well-behaved. KVM _should_ be
safe against concurrent guest modification of the indices, but there is
zero reason to introduce unnecessary risk.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-14-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata
Add a "psc" struct to vcpu_sev_es_state to avoid having to prefix all of
the fields with "psc_".
Take advantage of the code churn to opportunistically rename local
variables to "guest_psc" to make it more obvious that the buffer is guest
data, and more importantly, guest accessible!
Opportunistically rename inflight => batch_size as well, because there can
really only be one operation in-flight (per-vCPU), i.e. "inflight" _looks_
like a boolean, but in actuality is an integer tracking how many pages are
being handled by the current operation.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Make it more obvious when KVM is writing back the current PSC index
Increment the guest-visible "cur_entry" index outside of the for-loop
when processing Page State Change entries, and add a comment to make it
more obvious which code is operating on trusted data, and which code is
touching guest-accessible data.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guan-Chun Wu [Sun, 31 May 2026 08:00:19 +0000 (16:00 +0800)]
ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.
Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.
Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):
Guan-Chun Wu [Sun, 31 May 2026 08:00:18 +0000 (16:00 +0800)]
ext4: add Kunit coverage for directory hash computation
Introduce Kunit tests for fs/ext4/hash.c to verify ext4fs_dirhash()
across the legacy, half-MD4, and TEA hash variants.
The tests cover empty, seeded hashing, and non-ASCII name handling.
They also verify error paths, including invalid hash versions and
SipHash without a configured key, and check that the signed and
unsigned hash variants differ on non-ASCII input as expected.
When CONFIG_UNICODE is enabled, the tests further verify casefolded-name
hashing and the fallback behavior for invalid input.
Li RongQing [Wed, 3 Jun 2026 12:37:08 +0000 (20:37 +0800)]
dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device
In debug_dma_sync_sg_for_device(), when iterating over a scatterlist,
the debug entry population mistakenly uses the head of the scatterlist
'sg' to fetch the physical address via sg_phys(), instead of using the
current iterator variable 's'.
This causes dma-debug to track the physical address of the very first
scatterlist entry for all subsequent entries in the list.
Fix this by passing the correct loop iterator 's' to sg_phys()
Li Chen [Fri, 15 May 2026 09:18:27 +0000 (17:18 +0800)]
ext4: fast commit: export snapshot stats in fc_info
Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.
Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.
Li Chen [Fri, 15 May 2026 09:18:26 +0000 (17:18 +0800)]
ext4: fast commit: add lock_updates tracepoint
Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.
Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.
Li Chen [Fri, 15 May 2026 09:18:25 +0000 (17:18 +0800)]
ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.
The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.
The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.
Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.
Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.
Li Chen [Fri, 15 May 2026 09:18:24 +0000 (17:18 +0800)]
ext4: fast commit: avoid self-deadlock in inode snapshotting
ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.
ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.
Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.
Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.
Li Chen [Fri, 15 May 2026 09:18:23 +0000 (17:18 +0800)]
ext4: fast commit: avoid waiting for FC_COMMITTING
ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.
Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.
When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.
Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).
Li Chen [Fri, 15 May 2026 09:18:22 +0000 (17:18 +0800)]
ext4: lockdep: handle i_data_sem subclassing for special inodes
Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.
lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.
Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.
Li Chen [Fri, 15 May 2026 09:18:21 +0000 (17:18 +0800)]
ext4: fast commit: snapshot inode state before writing log
Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.
Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.
Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.
Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.
Junrui Luo [Wed, 13 May 2026 09:28:40 +0000 (17:28 +0800)]
jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
jbd2_journal_initialize_fast_commit() validates journal capacity by
checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
j_last the subtraction wraps to a large value, bypassing the bounds
check.
The resulting underflow corrupts j_last, j_fc_first, and j_free,
leading to journal abort.
Fix by checking num_fc_blks against j_last before the subtraction,
returning -EFSCORRUPTED.
Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization") Reported-by: Yuhao Jiang <danisjiang@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Junrui Luo <moonafterrain@outlook.com> Fixes: e029c5f27987 ("ext4: make num of fast commit blocks configurable") Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Fixes: e029c5f279872 ("ext4: make num of fast commit blocks configurable") Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Li Chen [Wed, 13 May 2026 08:58:17 +0000 (16:58 +0800)]
ext4: fix fast commit wait/wake bit mapping on 64-bit
On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
and ext4_test_inode_state() applies the corresponding +32 offset.
The fast-commit wait and wake paths open-coded the wait key with the raw
EXT4_STATE_* value. Add small helpers for the state wait word and bit,
and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
key follows the same mapping as the state helpers.
Fixes: 857d32f26181 ("ext4: rework fast commit commit path") Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Baokun Li <libaokun@linux.alibaba.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260513085818.552432-1-me@linux.beauty Signed-off-by: Theodore Ts'o <tytso@mit.edu>
However, h_transaction may legitimately be NULL for an aborted handle.
The is_handle_aborted() helper in include/linux/jbd2.h explicitly
treats !h_transaction as one of the aborted states:
if (handle->h_aborted || !handle->h_transaction)
return 1;
Every other entry point in fs/jbd2/transaction.c
(jbd2_journal_get_{write,undo,create}_access, jbd2_journal_extend,
jbd2_journal_restart, jbd2_journal_stop, etc.) guards against this
with an is_handle_aborted() check before any dereference of
h_transaction. jbd2_journal_dirty_metadata() was missing this guard.
This is reachable from ocfs2's xattr code. ocfs2_xa_set() intentionally
falls through to ocfs2_xa_journal_dirty() even after
ocfs2_xa_prepare_entry() fails, on the assumption that the buffer
needs to be journaled to record any partial modifications (see the
comment above the out_dirty label in fs/ocfs2/xattr.c). If the failure
was caused by the journal being aborted -- e.g. an underlying I/O
error during a sub-operation such as __ocfs2_remove_xattr_range() --
the handle's h_transaction has been cleared by the abort path, and
the unconditional deref in jbd2_journal_dirty_metadata() becomes a
NULL deref.
Reproduced by syzbot with a crafted ocfs2 image where I/O against the
loop device backing the mount is sabotaged via LOOP_SET_STATUS64
between two setxattr() calls, causing the second setxattr (which
truncates an external xattr value) to abort the journal mid-flight:
Oops: general protection fault, probably for non-canonical
address 0xdffffc0000000000
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: jbd2_journal_dirty_metadata+0x4a/0xd30 fs/jbd2/transaction.c:1520
Call Trace:
ocfs2_journal_dirty+0x130/0x700 fs/ocfs2/journal.c:831
ocfs2_xa_journal_dirty fs/ocfs2/xattr.c:1483 [inline]
ocfs2_xa_set+0x15e3/0x2ec0 fs/ocfs2/xattr.c:2294
ocfs2_xattr_block_set+0x3e0/0x33c0 fs/ocfs2/xattr.c:3016
__ocfs2_xattr_set_handle+0x6b3/0xf50 fs/ocfs2/xattr.c:3418
ocfs2_xattr_set+0xf3f/0x13e0 fs/ocfs2/xattr.c:3681
__vfs_setxattr+0x43c/0x480 fs/xattr.c:218
...
Fix by adding the standard is_handle_aborted() guard at the top of
jbd2_journal_dirty_metadata() and returning -EROFS, matching the
pattern used by every other entry point in this file.
ocfs2_journal_dirty() already handles a non-zero return from
jbd2_journal_dirty_metadata() correctly.
Reported-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=98f651460e558a21baae Tested-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20260507050605.50081-1-kartikey406@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avinash Bhatt [Sun, 31 May 2026 10:53:08 +0000 (13:53 +0300)]
wifi: iwlwifi: mld: add KUnit tests for link grading
Add tests for the link grading algorithm covering per-bandwidth
grading tables, channel load calculation, 6 GHz RSSI adjustments
including duplicated beacon and PSD/EIRP compensation, and
puncturing penalty.
Shahar Tzarfati [Sun, 31 May 2026 10:53:06 +0000 (13:53 +0300)]
wifi: iwlwifi: mld: drop TLC config cmd v4/v5 compat code
FW core102 bumped TLC_MNG_CONFIG_CMD_API_S from version 5 to
version 6. The v4 and v5 compatibility paths in
iwl_mld_send_tlc_cmd() are no longer reachable on any supported
firmware.
Miri Korenblit [Sun, 31 May 2026 10:53:05 +0000 (13:53 +0300)]
wifi: iwlwifi: mvm: remove __must_check annotation from command sending
We don't acually need to always check the return value. For example, if
we send a command to remove an object - we can assume success
(if it fails it is probably because the fw is dead, and then it doesn't
have the object anyway).
Miri Korenblit [Sun, 31 May 2026 10:53:04 +0000 (13:53 +0300)]
wifi: iwlwifi: trans: export the maximum supported hcmd size
Export the maximum allowed host command payload size to the op-modes.
Note that this information was available to the op-modes also before
this change, this just adds a clear macro.
Shahar Tzarfati [Sun, 31 May 2026 10:53:03 +0000 (13:53 +0300)]
wifi: iwlwifi: stop supporting core101
BZ, DR and SC no longer need to accept core101 firmware.
Raise the minimum supported firmware core from 101 to 102 so
these families only match supported core102 and newer images.
Shahar Tzarfati [Sun, 31 May 2026 10:53:02 +0000 (13:53 +0300)]
wifi: iwlwifi: remove orphaned DC2DC config enum
FW core102 removed both DC2DC_CONFIG_CMD_API_S and
DC2DC_CONFIG_CMD_RSP_API_S. The only driver-side artifact is
enum iwl_dc2dc_config_id in fw/api/config.h, which has no
callers in any .c file across all driver paths (mld/mvm/xvt).
Ever since the TFD queue size is no longer limited to 256 entries,
this code has been wrong, and might erroneously not detect a move
if it was by a multiple of 256. Not a big deal, but fix it while
I see it.
Our binding handling for P2P-Device can run into the following
scenario, as observed by our testing:
- a station interface is connected on some channel
- the P2P-Device does a remain-on-channel (ROC) on that channel
- the ROC ends, and the P2P-Device is removed from the binding,
but the phy_ctxt pointer is left around as a PHY cache so we
don't need to recalibrate to the channel again and again in
case it's not shared
- a binding update by the station interface, even a removal,
will re-add the P2P-Device to the binding
- the P2P-Device is removed, which removes the PHY context, but
it's still in the binding so the firmware crashes
Since the P2P device is removed from the binding and only re-
added by unrelated code, but we want to keep the phy_ctxt around
as a cache for future ROC usage, fix it by adding a boolean that
indicates whether or not the P2P-Device should be added to the
binding, and handle that in the binding iterator. That way, the
station interface cannot re-add the P2P-Device to the binding
when that isn't active.
Add KUnit tests to verify RSSI adjustment for 6 GHz duplicated
beacons across different operational bandwidths and validate
detection of the duplicated beacon bit.
Johannes Berg [Wed, 27 May 2026 20:05:09 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o netdetect
Clearly, from a user perspective, it must be valid to configure
WoWLAN and then suspend while not connected to a network. Since
mac80211 doesn't distinguish these cases and simply calls the
driver to suspend whenever WoWLAN is configured, the driver has
to cleanly handle the case where it's called for WoWLAN, it's
not connected but there's also no netdetect configured.
Remove the WARN_ON() and keep returning 1 to disconnect and
then suspend.
Shahar Tzarfati [Wed, 27 May 2026 20:05:08 +0000 (23:05 +0300)]
wifi: iwlwifi: cfg: Revert "wifi: iwlwifi: cfg: move the MODULE_FIRMWARE to the per-rf file"
IWL_BZ_UCODE_CORE_MAX is undefined in cfg/rf-fm.c, this
causes __stringify(core) to turn it into the literal
token text, so MODULE_FIRMWARE entries are generated as
"iwlwifi...-cIWL_BZ_UCODE_CORE_MAX.ucode",
instead of the actual number.
wifi: iwlwifi: mld: set fast-balance scan for active EMLSR
While associated to MLD AP with active EMLSR, set all scan
operations as fast-balance scans. The only exception is when a
fragmented scan is planned (high traffic or low latency), in
which case the fragmented scan is preserved.
Miri Korenblit [Wed, 27 May 2026 20:05:05 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: always allow mimo in NAN
The mimo field of the sta command is badly named. It really carries the
initial SMPS value as it is in the association request of the client
station (when we are the AP).
In NAN we don't have this information, just mark SMPS as disabled.
Given that we now use v3 rates with FW index throughout,
_to_hwrate() is confusing, since the hardware still uses
the PLCP value, the driver just doesn't see that now (as
it talks to firmware, not hardware.)
Rename this to iwl_mvm_rate_idx_to_fw_idx() to more
clearly indicate what it's doing.
Moriya Itzchaki [Wed, 27 May 2026 20:05:02 +0000 (23:05 +0300)]
wifi: iwlwifi: fix STEP_URM register address for SC devices
The CNVI_PMU_STEP_FLOW register address differs between device families.
For SC and newer devices, the register is at 0xA2D688,
while for BZ devices it's at 0xA2D588.
Miri Korenblit [Wed, 27 May 2026 20:05:01 +0000 (23:05 +0300)]
wifi: iwlwifi: mld: fix smatch warning
We dereference the mld_sta pointer before checking for NULL.
But we do check the sta pointer, and sta != NULL means mld_sta != NULL,
so there is no real issue.
Fix it anyway to silence the warning.
Miri Korenblit [Wed, 27 May 2026 20:04:59 +0000 (23:04 +0300)]
wifi: iwlwifi: remove stale comment
iwl_pcie_set_hw_ready still returns the return value of iwl_poll_bits,
but the latter one no longer returns the time elapsed until success, now it
returns either success or failure.
Remove the comment entirely.
Johannes Berg [Wed, 27 May 2026 20:04:58 +0000 (23:04 +0300)]
wifi: iwlwifi: fw: cut down NIC wakeups during dump
Currently, the dump code attempts to dump any number of
memories and register banks, as defined by the firmware.
Especially when the device is failing, this can lead to
excessive time spent attempting to acquire NIC access
over and over again.
Improve the code to only attempt to acquire NIC access
once or twice, but using the new memory dump functions
that may drop the spinlock etc. Mark all dump regions
that require NIC access, and skip them if we couldn't
obtain that.
In order to avoid CPU latency due to the increased time
holding the spinlock (and possibly disabling softirqs),
drop locks and call cond_resched() after each section
(if holding NIC access) but don't release HW NIC access.
AX231 is a device that is based on AX211 that doesn't support 6E and
its bandwidth is limited to 80 MHz.
Just reuse the radio config from AX203 which has the exact same
characteristics.
It has a specific subdevice ID to allow the driver to differentiate
between AX211 and AX231.
Heiko Carstens [Tue, 26 May 2026 05:57:02 +0000 (07:57 +0200)]
s390/percpu: Provide arch_this_cpu_write() implementation
Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.
Get rid of this by providing an own variant which makes use of the new
percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k (defconfig).
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:57:01 +0000 (07:57 +0200)]
s390/percpu: Provide arch_this_cpu_read() implementation
Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses preempt_disable() /
preempt_enable() pair and READ_ONCE().
Get rid of the preempt_disable() / preempt_enable() pairs by providing an
own variant which makes use of the new percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:57:00 +0000 (07:57 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.
There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and also removes only preempt_schedule_notrace() function call.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:56:59 +0000 (07:56 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.
With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Heiko Carstens [Tue, 26 May 2026 05:56:58 +0000 (07:56 +0200)]
s390/percpu: Use new percpu code section for arch_this_cpu_add()
Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.
With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.
With:
DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }
Heiko Carstens [Tue, 26 May 2026 05:56:56 +0000 (07:56 +0200)]
s390/percpu: Infrastructure for more efficient this_cpu operations
With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.
In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.
E.g. this simple C code sequence
DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }
Even though the above example is more or less the worst case, since the
branch to preempt_schedule_notrace() requires a stackframe, which
otherwise wouldn't be necessary, there is also the conditional jnhe branch
instruction.
Get rid of the conditional branch with the following code sequence:
The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:
- The first mviy instruction writes the register number, which contains
the percpu address variable to lowcore. This also indicates that a
percpu code section is executed.
- The first instruction following the mviy instruction must be the ag
instruction which adds the percpu offset to the percpu address register.
- Afterwards the atomic percpu operation follows.
- Then a second mviy instruction writes a zero to lowcore, which indicates
the end of the percpu code section.
- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.
- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction) the content of the percpu address register is adjusted so
it points to percpu variable of the new cpu.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/zcrypt: Replace get_zeroed_page() with kzalloc()
zcrypt_rng_device_add() allocates a buffer for the software random
number generator data cache.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/trng: Replace __get_free_page() with kmalloc()
trng_read() allocates a temporary staging buffer for CPACF TRNG
random data before copying it to userspace.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
s390/qeth: Replace get_zeroed_page() with kzalloc()
qeth_get_trap_id() allocates a temporary buffer for STSI system
information queries used to build trap identification strings.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/hvc_iucv: Replace get_zeroed_page() with kzalloc()
hvc_iucv_alloc() allocates a send staging buffer for accumulating
outbound terminal characters before they are copied into a separate
IUCV message buffer for transmission to the hypervisor. The staging
buffer itself is never passed to any IUCV function.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/dasd: Replace get_zeroed_page() with kzalloc()
DASD driver uses get_zeroed_page() to allocate pages for the Extended Error
Reporting software ring buffer and for a scratch buffer for formatting
sense dump diagnostic text.
These buffers can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
s390/con3270: Replace __get_free_page() with kmalloc()
con3270_alloc_view() allocates a staging buffer used to assemble
3270 datastream content before it is copied into channel program
requests.
This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.
kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.
Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.
For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.
Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().
John Hubbard [Wed, 3 Jun 2026 07:30:22 +0000 (16:30 +0900)]
gpu: nova-core: Hopper/Blackwell: select FSP Chain of Trust version
The FSP Chain of Trust handshake is versioned: Hopper speaks version 1
and Blackwell speaks version 2. Provide the version through the FSP HAL
so the boot message carries the value FSP expects, and so chipsets that
do not use FSP need not express a version at all.
FSP exchanges are request/response: the driver sends an MCTP/NVDM
message and must match the reply against the request before acting on
it. Add the synchronous send-and-wait path that validates the response
transport and message headers and confirms the reply corresponds to the
request that was sent.
John Hubbard [Wed, 3 Jun 2026 07:30:20 +0000 (16:30 +0900)]
gpu: nova-core: add MCTP/NVDM protocol types for firmware communication
Add the MCTP (Management Component Transport Protocol) and NVDM (NVIDIA
Data Model) wire-format types used for communication between the kernel
driver and GPU firmware processors.
This includes typed MCTP transport headers, NVDM message headers, and
NVDM message type identifiers. Both the FSP boot path and the upcoming
GSP RPC message queue share this protocol layer.
FSP communication uses a pair of non-circular queues in the FSP
falcon's EMEM, one for messages from the driver to FSP and one for
replies, with the driver polling for response data. Add the queue
registers and the low-level helpers used by the higher-level FSP
message layer.
Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Eliot Courtney <ecourtney@nvidia.com> Link: https://patch.msgid.link/20260603-b4-blackwell-v13-2-d9f3a06939e0@nvidia.com
[acourbot: align register fields names with OpenRM.]
[acourbot: represent registers as arrays of 8 instances, as per OpenRM.] Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Add external memory (EMEM) read/write operations to the GPU's FSP falcon
engine. These operations use Falcon PIO (Programmed I/O) to communicate
with the FSP through indirect memory access.
Jinyu Tang [Sun, 17 May 2026 15:34:27 +0000 (23:34 +0800)]
KVM: riscv: Fast-path dirty logging write faults
With dirty logging enabled, guest writes often fault on an existing 4K
G-stage leaf that was write-protected only for dirty tracking. The slow
path still performs the full fault handling flow and takes mmu_lock for
write, even though the page-table shape does not change.
x86 handles the analogous case in its fast page fault path by atomically
making a writable SPTE writable again when the fault is only a
write-protection fault. Add the same style of fast path for RISC-V. If a
write fault hits an existing 4K leaf in a writable dirty-log memslot,
mark the page dirty and atomically set the PTE writable and dirty under
the read side of mmu_lock.
The dirty bitmap is updated before the PTE becomes writable again. The
PTE D bit is also set so systems that trap on a clear D bit do not fall
back to the slow path for a writable but clean PTE.
When a fault hits an existing G-stage leaf with the same PFN, KVM only
needs to update the PTE permissions. This path will be used by read-side
fault handling, so it must not overwrite a concurrent PTE update.
Use the cmpxchg helper when relaxing permissions on an existing leaf,
following the same concurrency model used by x86 for atomic SPTE
permission updates. Retry if another CPU changed the PTE first, and use
cpu_relax() while spinning.
Jinyu Tang [Sun, 17 May 2026 15:34:25 +0000 (23:34 +0800)]
KVM: riscv: Add a G-stage PTE cmpxchg helper
Permission-only G-stage PTE updates can run in parallel once they are
moved to the read side of mmu_lock. Plain set_pte() is not enough for
that case because another CPU may update the same PTE first.
x86 handles the same class of SPTE races with cmpxchg-based updates in
its fast page fault and TDP MMU paths. Add a small RISC-V helper for
atomic G-stage PTE updates. The helper reports contention to the caller
and flushes the target range only when the PTE value actually changes.
Jinyu Tang [Sun, 17 May 2026 15:34:24 +0000 (23:34 +0800)]
KVM: riscv: Use an rwlock for mmu_lock
RISC-V KVM currently uses a spinlock for mmu_lock. That serializes all
G-stage MMU operations, including permission-only updates that do not
allocate or free page-table pages.
Use KVM's rwlock form of mmu_lock, as x86 and arm64 already do. Keep the
existing map, unmap and teardown paths on the write side. This prepares
RISC-V for read-side handling of G-stage permission updates.