git.ipfire.org Git - thirdparty/linux.git/log
2 weeks ago coco/tdx-host: Don't expose P-SEAMLDR information on CPUs with erratum
Chao Gao [Wed, 20 May 2026 22:28:59 +0000 (15:28 -0700)] 
coco/tdx-host: Don't expose P-SEAMLDR information on CPUs with erratum

TDX-capable CPUs clobber the current VMCS on P-SEAMLDR calls. Clearing
the current VMCS behind KVM's back breaks KVM.

Future CPUs will fix this by preserving the current VMCS across
P-SEAMLDR calls. A future specification update will describe the
VMCS-clearing behavior as an erratum and state that it does not
occur when IA32_VMX_BASIC[60] is set.

Add a CPU bug bit and refuse to expose P-SEAMLDR information on
affected CPUs.

Use a CPU bug bit to stay consistent with X86_BUG_TDX_PW_MCE. As a
bonus, the bug bit is visible to userspace, which allows userspace to
determine why these sysfs files are not exposed, and it can also be
checked by other kernel components in the future if needed.
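
The erratum check described above can be sketched in plain C. This is a minimal illustration that takes the raw IA32_VMX_BASIC value as an argument; the macro and function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 60 of IA32_VMX_BASIC, per the changelog. The macro name is hypothetical. */
#define EXAMPLE_VMX_BASIC_VMCS_PRESERVED (1ULL << 60)

/*
 * Return true when P-SEAMLDR calls are expected to clobber the current
 * VMCS, i.e. when the CPU bug bit would be set. The caller passes the raw
 * MSR value, so this sketch needs no privileged access.
 */
static bool example_cpu_has_seamldr_erratum(uint64_t vmx_basic)
{
	return !(vmx_basic & EXAMPLE_VMX_BASIC_VMCS_PRESERVED);
}

/* Hypothetical gate: only expose P-SEAMLDR information on unaffected CPUs. */
static bool example_seamldr_info_allowed(uint64_t vmx_basic)
{
	return !example_cpu_has_seamldr_erratum(vmx_basic);
}
```

A caller would read the MSR once at boot and cache the result, which matches the idea of a CPU bug bit that other components can test later.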

== Alternatives ==
Two workarounds were considered but both were rejected:

1. Save/restore the current VMCS around P-SEAMLDR calls. This produces ugly
   assembly code [1] and doesn't play well with #MCE or #NMI if they
   need to use the current VMCS.

2. Move KVM's VMCS tracking logic to the TDX core code, which would break
   the boundary between KVM and the TDX core code [2].

[ dhansen: comment and changelog munging. Add seamldr_call() bug check. ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/kvm/fedb3192-e68c-423c-93b2-a4dc2f964148@intel.com/
Link: https://lore.kernel.org/kvm/aYIXFmT-676oN6j0@google.com/
Link: https://patch.msgid.link/20260520133909.409394-12-chao.gao@intel.com
2 weeks ago coco/tdx-host: Expose P-SEAMLDR information via sysfs
Chao Gao [Wed, 20 May 2026 22:28:57 +0000 (15:28 -0700)] 
coco/tdx-host: Expose P-SEAMLDR information via sysfs

TDX module updates require userspace to select the appropriate module
to load. Expose necessary information to facilitate this decision. Two
values are needed:

- P-SEAMLDR version: for compatibility checks between TDX module and
  P-SEAMLDR
- num_remaining_updates: indicates how many updates can be performed

Expose them as tdx-host device attributes visible only when updates
are supported.

Note that the underlying P-SEAMLDR attributes are available regardless
of update support; this only restricts their visibility to userspace.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-11-chao.gao@intel.com
2 weeks ago x86/virt/seamldr: Add a helper to retrieve P-SEAMLDR information
Chao Gao [Wed, 20 May 2026 22:28:56 +0000 (15:28 -0700)] 
x86/virt/seamldr: Add a helper to retrieve P-SEAMLDR information

P-SEAMLDR reports its state via SEAMLDR.INFO, including its version and
the number of remaining runtime updates.

This information is useful for userspace. For example, userspace can
use the P-SEAMLDR version to determine whether a candidate TDX module
is compatible with the running loader, and can use the remaining
update count to determine whether another runtime update is still
possible.

Add a helper to retrieve P-SEAMLDR information in preparation for
exposing P-SEAMLDR version and other necessary information to userspace.
Export the new kAPI for use by the "tdx_host" device.

Note that there are two distinct P-SEAMLDR APIs with similar names:

  "SEAMLDR.INFO" is metadata about the loader. It's metadata for the
  update process.

  "SEAMLDR.SEAMINFO" is metadata about SEAM mode. It is for the module
  init process, not for the update process.

Use SEAMLDR.INFO here.

For details, see "Intel Trust Domain Extensions - SEAM Loader (SEAMLDR)
Interface Specification".

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-10-chao.gao@intel.com
2 weeks ago x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs
Chao Gao [Wed, 20 May 2026 22:28:55 +0000 (15:28 -0700)] 
x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs

The TDX architecture uses the "SEAMCALL" instruction to communicate with
SEAM mode software. Right now, the only SEAM mode software that the kernel
communicates with is the TDX module. But there is another component that
runs in SEAM mode and is separate from the TDX module: the persistent SEAM
loader, or "P-SEAMLDR". Right now, the only component that communicates
with it is the BIOS, which loads the TDX module itself at boot. But to
support updating the TDX module, the kernel now needs to be able to talk
to it.

P-SEAMLDR SEAMCALLs differ from TDX module SEAMCALLs in areas such as
concurrency requirements.

Add a P-SEAMLDR wrapper to handle these differences and prepare for
implementing concrete functions.

Use seamcall_prerr() (not '_ret') because current P-SEAMLDR calls do not
use any output registers other than RAX.

Note: Despite the similar name, the NP-SEAMLDR ("Non-Persistent") is an
authenticated code module (ACM) invoked exclusively by the BIOS at boot
rather than a component running in SEAM mode. The kernel cannot call it
at runtime, and it exposes no SEAMCALL interface.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://cdrdv2.intel.com/v1/dl/getContent/733582
Link: https://patch.msgid.link/20260520133909.409394-9-chao.gao@intel.com
2 weeks ago coco/tdx-host: Expose TDX module version
Chao Gao [Wed, 20 May 2026 22:28:53 +0000 (15:28 -0700)] 
coco/tdx-host: Expose TDX module version

For TDX module updates, userspace needs to select compatible update
versions based on the current module version.

For example, the 1.5.x series runs on Sapphire Rapids but not Granite
Rapids, which needs 2.0.x. Updates are also constrained by version
distance, so a 1.5.6 module might permit updates to 1.5.7 but not to
1.5.20.
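
As a rough illustration of the userspace side, the sketch below parses a dotted version string and checks whether two versions belong to the same series. The three-field layout, the struct, and the function names are assumptions made for this example. The changelog does not spell out the version-distance rule, so the sketch leaves it out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical representation of a "major.minor.update" version. */
struct example_tdx_version {
	unsigned int major;
	unsigned int minor;
	unsigned int update;
};

/* Parse "1.5.6" (a trailing newline, as in a sysfs read, is tolerated). */
static bool example_parse_version(const char *s, struct example_tdx_version *v)
{
	return sscanf(s, "%u.%u.%u", &v->major, &v->minor, &v->update) == 3;
}

/* Two versions are in the same series when major and minor both match. */
static bool example_same_series(const struct example_tdx_version *a,
				const struct example_tdx_version *b)
{
	return a->major == b->major && a->minor == b->minor;
}
```

A real update tool would apply the platform and distance constraints on top of this series check, using whatever rules the module vendor documents.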

Start the process of punting the version selection logic to userspace.
Expose the TDX module version in the new faux device.

Define TDX_VERSION_FMT macro for the TDX version format since it will be
used multiple times. Also convert an existing print statement to use it.

== Background ==

For posterity, here's what other firmware mechanisms do:

1. AMD SEV leverages an existing PCI device for the PSP to expose
   metadata. TDX uses a faux device as it doesn't have a PCI device
   in its architecture.

2. Microcode uses per-CPU virtual devices to report microcode revisions
   because CPUs can have different revisions. But, there is only a
   single TDX module, so exposing the TDX module version through a global
   TDX faux device is appropriate.

3. ARM's CCA implementation isn't in-tree yet, but will likely follow a
   similar faux device approach, though it's unclear whether they need
   to expose firmware version information.

[ dhansen: trim changelog ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/2025073035-bulginess-rematch-b92e@gregkh/
Link: https://patch.msgid.link/20260520133909.409394-8-chao.gao@intel.com
2 weeks ago coco/tdx-host: Introduce a "tdx_host" device
Chao Gao [Wed, 20 May 2026 22:28:52 +0000 (15:28 -0700)] 
coco/tdx-host: Introduce a "tdx_host" device

TDX depends on a platform firmware module that runs on the CPU.
Unlike other CoCo architectures, TDX has no hardware "device"
running the show, just a blob on the CPU.

Create a virtual device to anchor interactions with this platform
firmware. This lets later code:

 - expose metadata (TDX module version, seamldr version) to userspace
   as device attributes

 - implement firmware uploader APIs (which are tied to a device) to
   support TDX module runtime updates

Use a faux device because the TDX module is singular within the system
and has no platform resources. Using a faux device eliminates the need
to create a stub bus.

The call to tdx_get_sysinfo() ensures that the TDX module is ready to
provide services.

Note that AMD has a PCI device for the PSP for SEV and ARM CCA will
likely have a faux device [1].

Thanks to Dan and Yilun for all the help on this one.

[ dhansen: trim changelog ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/all/2025073035-bulginess-rematch-b92e@gregkh/
Link: https://patch.msgid.link/20260520133909.409394-7-chao.gao@intel.com
2 weeks ago x86/virt/tdx: Move low level SEAMCALL helpers out of <asm/tdx.h>
Kai Huang [Wed, 20 May 2026 22:28:51 +0000 (15:28 -0700)] 
x86/virt/tdx: Move low level SEAMCALL helpers out of <asm/tdx.h>

TDX host core code implements three seamcall*() helpers to make SEAMCALLs
to the TDX module.  Currently, they are implemented in <asm/tdx.h> and
are exposed to other kernel code which includes <asm/tdx.h>.

However, other than the TDX host core, seamcall*() are not expected to
be used by other kernel code directly.  For instance, for all SEAMCALLs
that are used by KVM, the TDX host core exports a wrapper function for
each of them.

Move seamcall*() and related code out of <asm/tdx.h> and make them only
visible to TDX host core.

Since the TDX host core tdx.c is already very heavy, don't put the low level
seamcall*() code there; put it in a new dedicated "seamcall_internal.h".  Also,
tdx.c currently has seamcall_prerr*() helpers which additionally print an
error message when a seamcall*() call fails.  Move them to
"seamcall_internal.h" as well.  That way, all low level SEAMCALL helpers
are in a dedicated place, which is much more readable.

Copy the copyright notice from the original files and consolidate the
date ranges to:

Copyright (C) 2021-2023 Intel Corporation

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Vishal Annapurve <vannapurve@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-6-chao.gao@intel.com
2 weeks ago x86/virt/tdx: Move TDX_FEATURES0 bits to asm/tdx.h
Chao Gao [Wed, 20 May 2026 22:28:49 +0000 (15:28 -0700)] 
x86/virt/tdx: Move TDX_FEATURES0 bits to asm/tdx.h

Future changes will add support for new TDX features exposed as
TDX_FEATURES0 bits. The presence of these features will need to be
checked outside of arch/x86/virt. The feature query helpers and
the TDX_FEATURES0 defines they reference will need to live in the
widely accessible asm/tdx.h header. Move the existing TDX_FEATURES0 bits to
asm/tdx.h so that they can all be kept together.

Opportunistically switch to BIT_ULL() since TDX_FEATURES0 is 64-bit.
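
To illustrate why a 64-bit helper matters for a 64-bit feature word, here is a small standalone sketch. The BIT_ULL() definition mirrors the usual kernel form, while the feature bit positions and the query helper are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Always a 64-bit mask, regardless of the width of 'long' on the target. */
#define BIT_ULL(nr) (1ULL << (nr))

/* Hypothetical feature bits; the positions are chosen only for illustration. */
#define EXAMPLE_FEATURE_LOW	BIT_ULL(3)
#define EXAMPLE_FEATURE_HIGH	BIT_ULL(40)

/* Query helper over a 64-bit features word. */
static bool example_features0_has(uint64_t features0, uint64_t bit)
{
	return (features0 & bit) != 0;
}
```

A plain `1 << 40` would shift a 32-bit int past its width, which is undefined behavior in C. Building every bit from a 64-bit constant avoids that for bits above 31.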

No functional change intended.

[ dhansen: grammar fixups ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/kvm/20260427152854.101171-17-chao.gao@intel.com/
Link: https://lore.kernel.org/kvm/20251121005125.417831-16-rick.p.edgecombe@intel.com/
Link: https://patch.msgid.link/20260520133909.409394-5-chao.gao@intel.com
2 weeks ago x86/virt/tdx: Consolidate TDX global initialization states
Chao Gao [Wed, 20 May 2026 22:28:48 +0000 (15:28 -0700)] 
x86/virt/tdx: Consolidate TDX global initialization states

The kernel uses several global flags to guard one-time TDX initialization
flows and prevent them from being repeated.

When the TDX module is updated, all of those states must be reset so that
the module can be initialized again. Today those states are kept as
separate global variables, which makes the reset path awkward and easy to
miss when a new state is added.

Group the states into a single structure so they can be reset together, for
example with memset(), and so a newly added state won't be missed.
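
The grouping idea can be sketched as follows. The structure and field names are hypothetical (only the memset() reset comes from the changelog), and locking is omitted to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the one-time initialization states. */
struct example_tdx_init_state {
	bool global_init_done;	/* global init was attempted */
	int  global_init_ret;	/* cached result of that attempt */
	bool module_initialized;/* full module init completed */
};

static struct example_tdx_init_state example_state;

/*
 * Reset every state at once. A field added to the structure later is
 * covered automatically, which is the point of the consolidation.
 */
static void example_reset_init_state(void)
{
	memset(&example_state, 0, sizeof(example_state));
}
```

With separate globals, each new flag would need its own line in the reset path. With one structure, the reset path does not change when a state is added.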

Drop the __ro_after_init annotation from tdx_module_initialized because
the other two states do not have it. And with TDX module update support,
all the states need to be writable at runtime.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-4-chao.gao@intel.com
2 weeks ago x86/virt/tdx: Move TDX global initialization states to file scope
Chao Gao [Wed, 20 May 2026 22:28:47 +0000 (15:28 -0700)] 
x86/virt/tdx: Move TDX global initialization states to file scope

TDX module global initialization is executed only once. The first call
caches both the result and the "done" state, and later callers reuse the
saved result. A lock protects those cached states.

Those states and the lock are currently kept as function-local statics
because they are used only by try_init_module_global().

TDX module updates need to reset the cached states so TDX global
initialization can be run again after an update. That will add another
access site in the same file.

Move the cached states to file scope so they are accessible outside
try_init_module_global(), and move the lock along with the states it
protects.

No functional change intended.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-3-chao.gao@intel.com
2 weeks ago x86/virt/tdx: Clarify try_init_module_global() result caching
Chao Gao [Wed, 20 May 2026 22:28:46 +0000 (15:28 -0700)] 
x86/virt/tdx: Clarify try_init_module_global() result caching

TDX module global initialization is executed only once. The first call
caches both the return code and the "done" state in static function
variables.  Later callers read the variables. A lock protects the
saved state and serializes callers.

These variables will soon be moved to a global structure. Prepare for
that by treating the variables as a unit. Assign them together and
only access them while the lock is held.
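
A userspace sketch of this run-once-and-cache pattern is shown below, with a pthread mutex standing in for the kernel lock. All names are hypothetical, and the "expensive" step is a stub that only counts its invocations.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t example_init_lock = PTHREAD_MUTEX_INITIALIZER;
static bool example_init_done;	/* has the one-time step run? */
static int example_init_ret;	/* cached result of the one-time step */
static int example_init_calls;	/* how many times the step actually ran */

/* Stand-in for the real one-time initialization. */
static int example_do_global_init(void)
{
	example_init_calls++;
	return 0;
}

/*
 * Run the init step at most once. The "done" flag and the cached result
 * are assigned together and only touched while the lock is held.
 */
static int example_try_init_once(void)
{
	int ret;

	pthread_mutex_lock(&example_init_lock);
	if (!example_init_done) {
		example_init_ret = example_do_global_init();
		example_init_done = true;
	}
	ret = example_init_ret;
	pthread_mutex_unlock(&example_init_lock);

	return ret;
}
```

Copying the cached result into a local before unlocking keeps every access to the shared pair inside the critical section.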

[ dhansen: mostly rewrite changelog ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260520133909.409394-2-chao.gao@intel.com
2 weeks ago Merge branch 'kvm-ghcb-for-7.2' into HEAD
Paolo Bonzini [Wed, 3 Jun 2026 15:00:06 +0000 (17:00 +0200)] 
Merge branch 'kvm-ghcb-for-7.2' into HEAD

Merge the final part of the GHCB 7.2 fixes at
https://lore.kernel.org/kvm/20260529183549.1104619-1-pbonzini@redhat.com/.

Patches 1-17 have already been included in Linux 7.1; these are minor
cleanups, and fixes for behaviors that are suboptimal or contradict
the specification.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Remove sometimes-used function-scoped "ret" from #VMGEXIT handler
Sean Christopherson [Fri, 29 May 2026 18:35:49 +0000 (20:35 +0200)] 
KVM: SEV: Remove sometimes-used function-scoped "ret" from #VMGEXIT handler

Now that only two case-statements actually need a local "ret" variable,
refactor sev_handle_vmgexit() to have all flows return directly when
possible, and bury "ret" as "r" in the two paths that need to propagate a
return value from a helper.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-25-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Turn sev_es_validate_vmgexit() into a dedicated predicate
Sean Christopherson [Fri, 29 May 2026 18:35:48 +0000 (20:35 +0200)] 
KVM: SEV: Turn sev_es_validate_vmgexit() into a dedicated predicate

Now that sev_es_validate_vmgexit() is only responsible for checking that
all required GHCB fields are marked valid, turn it into a predicate whose
name reflects exactly that.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-24-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Handle unknown #VMGEXIT reasons in sev_handle_vmgexit()
Sean Christopherson [Fri, 29 May 2026 18:35:47 +0000 (20:35 +0200)] 
KVM: SEV: Handle unknown #VMGEXIT reasons in sev_handle_vmgexit()

Handle unknown #VMGEXIT reasons in sev_handle_vmgexit(), not in
sev_es_validate_vmgexit().  This makes it _much_ more obvious that KVM
simply funnels "legacy" exits to the standard SVM interception handlers,
and is the final preparatory change needed to reduce the scope of
sev_es_validate_vmgexit().
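
The shape of such a handler can be sketched as a single switch whose default arm rejects unknown exit codes. The exit codes, result values, and the explicit listing of "legacy" exits are all assumptions made for this illustration and do not reflect KVM's actual tables.

```c
#include <assert.h>

/* Hypothetical exit codes; real values come from the SVM and GHCB specs. */
enum example_exit_code {
	EXAMPLE_EXIT_CPUID = 1,		/* "legacy" exit */
	EXAMPLE_EXIT_MSR,		/* "legacy" exit */
	EXAMPLE_EXIT_PSC,		/* GHCB-specific exit */
	EXAMPLE_EXIT_GUEST_REQUEST,	/* GHCB-specific exit */
};

enum example_result {
	EXAMPLE_ROUTED_TO_LEGACY_HANDLER,
	EXAMPLE_HANDLED_BY_GHCB_CODE,
	EXAMPLE_ERR_INVALID_EVENT,
};

/* One dispatch point: anything not listed falls through to the error. */
static enum example_result example_handle_vmgexit(unsigned int exit_code)
{
	switch (exit_code) {
	case EXAMPLE_EXIT_PSC:
	case EXAMPLE_EXIT_GUEST_REQUEST:
		return EXAMPLE_HANDLED_BY_GHCB_CODE;
	case EXAMPLE_EXIT_CPUID:
	case EXAMPLE_EXIT_MSR:
		return EXAMPLE_ROUTED_TO_LEGACY_HANDLER;
	default:
		return EXAMPLE_ERR_INVALID_EVENT;
	}
}
```

Keeping the rejection in the same switch that dispatches the exits makes it harder for a newly added case to bypass the unknown-code check.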

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-23-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Return INVALID_INPUT, not MISSING_INPUT, for bad GUEST_REQUEST input(s)
Sean Christopherson [Fri, 29 May 2026 18:35:46 +0000 (20:35 +0200)] 
KVM: SEV: Return INVALID_INPUT, not MISSING_INPUT, for bad GUEST_REQUEST input(s)

Return INVALID_INPUT, not MISSING_INPUT, if the guest provides an unaligned
address for a GUEST_REQUEST, and/or attempts to use the same page for the
source and destination.  The inputs are obviously invalid, not missing.
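
The two checks named above can be sketched like this. The error enum, the 4 KiB page size, and the function name are assumptions for illustration and are not KVM's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096ULL	/* assumed page size */

/* Hypothetical error codes mirroring the names in the changelog. */
enum example_ghcb_err {
	EXAMPLE_OK = 0,
	EXAMPLE_MISSING_INPUT,
	EXAMPLE_INVALID_INPUT,
};

/*
 * Both addresses were supplied, so any problem with them is "invalid",
 * not "missing": reject unaligned addresses and a shared page.
 */
static enum example_ghcb_err example_check_guest_request(uint64_t req_gpa,
							 uint64_t resp_gpa)
{
	if ((req_gpa % EXAMPLE_PAGE_SIZE) || (resp_gpa % EXAMPLE_PAGE_SIZE))
		return EXAMPLE_INVALID_INPUT;

	if (req_gpa == resp_gpa)
		return EXAMPLE_INVALID_INPUT;

	return EXAMPLE_OK;
}
```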

Opportunistically move the checks out of sev_es_validate_vmgexit(), to
continue the march towards reducing the scope of the helper, and to help
guide future changes into correctly handling bad input.

Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event")
Fixes: 74458e4859d8 ("KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-22-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Return INVALID_EVENT for SNP-only #VMGEXIT from non-SNP guest
Sean Christopherson [Fri, 29 May 2026 18:35:45 +0000 (20:35 +0200)] 
KVM: SEV: Return INVALID_EVENT for SNP-only #VMGEXIT from non-SNP guest

Signal INVALID_EVENT, not MISSING_INPUT, if a non-SNP guest attempts to
invoke an SNP-only #VMGEXIT.  Opportunistically move the checks out of
sev_es_validate_vmgexit() to continue the march towards making said helper
a predicate whose sole purpose is to verify the guest has marked required
GHCB fields as valid.

Fixes: e366f92ea99e ("KVM: SEV: Support SEV-SNP AP Creation NAE event")
Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT")
Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event")
Fixes: 74458e4859d8 ("KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event")
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-21-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Move GHCB "usage" check out of sev_es_validate_vmgexit()
Sean Christopherson [Fri, 29 May 2026 18:35:44 +0000 (20:35 +0200)] 
KVM: SEV: Move GHCB "usage" check out of sev_es_validate_vmgexit()

Move the check to verify the guest's requested GHCB usage out of
sev_es_validate_vmgexit() as the first step towards making said helper a
predicate whose sole purpose is to verify the guest has marked required
GHCB fields as valid.

Using a single "validate" helper sounds good on paper, but in practice it's
difficult to verify that KVM is performing the necessary sanity checks (the
usage of state is far removed from the relevant checks), makes it difficult
to understand that "legacy" exits are simply routed to KVM's existing exit
handlers, and most importantly, has directly contributed to a number of
bugs as adding case-statements to the validation subtly removes them from
the default path that rejects unknown exit codes with INVALID_EVENT.

Deliberately extract the usage code check first so as to preserve the order
of KVM's checks, even though future code extraction will technically fix
bugs.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-20-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Don't terminate SNP VMs on #VMGEXIT without a registered GHCB
Sean Christopherson [Fri, 29 May 2026 18:35:43 +0000 (20:35 +0200)] 
KVM: SEV: Don't terminate SNP VMs on #VMGEXIT without a registered GHCB

If the guest attempts a non-MSR #VMGEXIT without the registered GHCB,
return a GHCB_HV_RESP_MALFORMED_INPUT+GHCB_ERR_NOT_REGISTERED error to the
guest instead of exiting KVM_RUN with -EINVAL (and in all likelihood killing
the VM).  KVM has already mapped the requested GHCB, i.e. can cleanly
report an error, and so exiting with -EINVAL is completely unjustified.

Fixes: 0c76b1d08280 ("KVM: SEV: Add support to handle GHCB GPA register VMGEXIT")
Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-19-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago gpu: nova-core: gsp: enable FSP boot path
Alexandre Courbot [Wed, 3 Jun 2026 07:30:26 +0000 (16:30 +0900)] 
gpu: nova-core: gsp: enable FSP boot path

Now that all the elements are in place, enable the FSP boot path so
Hopper and Blackwell can boot.

Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-9-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: add non-sec2 unload path
Eliot Courtney [Wed, 3 Jun 2026 07:30:25 +0000 (16:30 +0900)] 
gpu: nova-core: add non-sec2 unload path

On GPUs that do not use sec2, unloading only requires waiting for the GSP
falcon to halt, because GSP does the main work of unloading on those GPUs.

Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
[ jhubbard: use Result instead of Result<()> in the UnloadBundle impl ]
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-8-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: add GSP lockdown release polling
John Hubbard [Wed, 3 Jun 2026 07:30:24 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: add GSP lockdown release polling

On Hopper and Blackwell, FSP boots GSP with hardware lockdown enabled.
After FSP Chain of Trust completes, the driver must poll for lockdown
release before proceeding with GSP initialization. Add the register
bit and helper functions needed for this polling.
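
The polling step can be sketched generically. This example is in C rather than the driver's Rust, and the register bit, the callback type, and the fake register are all hypothetical stand-ins. A real driver would also wait between reads and bound the loop by time, not by a simple read count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_LOCKDOWN_BIT (1u << 0)	/* hypothetical bit position */

typedef uint32_t (*example_read_reg_fn)(void *ctx);

/* Poll until the lockdown bit clears or the attempt budget runs out. */
static bool example_poll_lockdown_released(example_read_reg_fn read_reg,
					   void *ctx, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (!(read_reg(ctx) & EXAMPLE_LOCKDOWN_BIT))
			return true;
	}
	return false;
}

/* Fake register: reports lockdown for the next *ctx reads, then releases. */
static uint32_t example_fake_reg(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return 0;
	(*remaining)--;
	return EXAMPLE_LOCKDOWN_BIT;
}
```

Passing the register read as a callback keeps the loop testable without hardware, which is the only reason the fake register exists here.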

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-7-d9f3a06939e0@nvidia.com
[acourbot: fix `lockdown_released` logic and add explanatory comments.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: add FSP Chain of Trust boot
John Hubbard [Wed, 3 Jun 2026 07:30:23 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: add FSP Chain of Trust boot

Build and send the Chain of Trust message to FSP, bundling the
DMA-coherent boot parameters that FSP reads at boot time.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-6-d9f3a06939e0@nvidia.com
[acourbot: rename `frts_offset` to `frts_vidmem_offset`.]
[acourbot: add note about frts_sysmem_* CoT members.]
Co-developed-by: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago Merge tag 'kvm-s390-master-7.1-3' of https://git.kernel.org/pub/scm/linux/kernel...
Paolo Bonzini [Wed, 3 Jun 2026 14:46:31 +0000 (16:46 +0200)] 
Merge tag 'kvm-s390-master-7.1-3' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: More gmap and vsie fixes

2 weeks ago KVM: SEV: Unmap and unpin the GHCB as needed on vCPU free
Sean Christopherson [Fri, 29 May 2026 18:35:42 +0000 (20:35 +0200)] 
KVM: SEV: Unmap and unpin the GHCB as needed on vCPU free

Unmap and unpin the GHCB as needed when freeing a vCPU.  If the VM is
destroyed after mapping+pinning the GHCB on #VMGEXIT, without re-running
the vCPU, KVM will effectively leak the GHCB and any mappings created for
the GHCB.

Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Tested-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-18-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Decouple the need to sync the GHCB SA from the need to free the SA
Sean Christopherson [Fri, 29 May 2026 18:35:41 +0000 (20:35 +0200)] 
KVM: SEV: Decouple the need to sync the GHCB SA from the need to free the SA

Decouple synchronizing the GHCB SA from freeing/unpinning the SA, so that
the free/unpin path can be reused when freeing a vCPU.

Opportunistically add a WARN to harden KVM against stomping over (and thus
leaking) an already-allocated scratch area.

Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-17-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()
Sean Christopherson [Fri, 29 May 2026 18:35:40 +0000 (20:35 +0200)] 
KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()

Relocate sev_free_vcpu() down in sev.c so that its definition comes after
sev_es_unmap_ghcb().  This will allow sharing unmap functionality between
the two functions without needing a forward declaration (or weird placement
of the common code).

No functional change intended.

Cc: stable@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-16-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying
Sean Christopherson [Fri, 29 May 2026 18:35:39 +0000 (20:35 +0200)] 
KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying

When marking a page dirty, complain about not having a running/loaded vCPU
if and only if the VM is still alive, i.e. its refcount is non-zero.  This
will allow fixing a memory leak for x86 SEV-ES guests without hitting what
is effectively a false positive on the WARN.
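
The condition described above reduces to a small predicate. The function and parameter names below are hypothetical; the actual check lives in KVM's dirty-page tracking code and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Complain about a missing running/loaded vCPU only while the VM is still
 * alive, i.e. while its refcount is non-zero. Once the refcount has hit
 * zero, a missing vCPU is expected during teardown.
 */
static bool example_should_warn_no_vcpu(bool have_running_vcpu,
					unsigned int vm_refcount)
{
	return !have_running_vcpu && vm_refcount != 0;
}
```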

For some SEV-ES VM-Exits, KVM keeps a writable mapping of a guest page
across an exit to userspace, and typically unmaps the page on the next
KVM_RUN.  But if userspace never calls KVM_RUN after such an exit, then KVM
needs to unmap the page when the vCPU is destroyed, which in turn triggers
the WARN about not having a running vCPU.

Alternatively, SEV-ES could temporarily load the vCPU to suppress the WARN,
as is done in nested_vmx_free_vcpu() (but for completely unrelated reasons;
suppressing WARN from nested_put_vmcs12_pages() is pure happenstance).  But
loading a vCPU during destruction is gross (ideally nVMX code would be
cleaned up), risks complicating the SEV-ES code (KVM would need to ensure
the temporary load()+put() only runs when the vCPU isn't already loaded),
and is ultimately pointless.

The motivation for the WARN is to guard against KVM dirtying guest memory
without pushing the corresponding GFN to the active vCPU's dirty ring, e.g.
to ensure userspace doesn't miss a dirty page.  But for the VM's refcount
to reach zero, there can't be _any_ userspace mappings to the dirty ring,
as mapping the dirty ring requires doing mmap() on the vCPU FD.  I.e. if
userspace had a valid mapping for the dirty ring, then the vCPU file and
thus the owning VM would still be alive.  And so since userspace can't
possibly reach the dirty ring, whether or not KVM technically "misses" a
push to the dirty ring is irrelevant.

Reported-by: Michael Roth <michael.roth@amd.com>
Cc: stable@vger.kernel.org
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-15-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT
Sean Christopherson [Fri, 29 May 2026 18:35:38 +0000 (20:35 +0200)] 
KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT

Rework Page State Change (PSC) handling to read the guest-provided start
and end indices exactly once, at the beginning of the request.  Re-reading
the indices is "fine", _if_ the guest is well-behaved.  KVM _should_ be
safe against concurrent guest modification of the indices, but there is
zero reason to introduce unnecessary risk.
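
A minimal sketch of the read-once pattern described above, not the actual
KVM code. The struct layout and the names guest_psc_hdr, psc_snapshot and
psc_snapshot_indices() are assumptions made for illustration; only the
idea (copy each guest-writable index into host-private storage a single
time, then validate and use the copy) comes from the changelog.

```c
#include <stdint.h>

/* Hypothetical stand-in for a guest-writable request header. */
struct guest_psc_hdr {
	volatile uint16_t cur_entry;
	volatile uint16_t end_entry;
};

/* Host-private copy of the indices; the guest cannot modify it. */
struct psc_snapshot {
	uint16_t start;
	uint16_t end;
};

/*
 * Read each guest-provided index exactly once, then validate the private
 * copies.  Returns 0 on success, -1 if the range is invalid.
 */
static int psc_snapshot_indices(const struct guest_psc_hdr *hdr,
				uint16_t max_entries,
				struct psc_snapshot *snap)
{
	uint16_t start = hdr->cur_entry;	/* single read */
	uint16_t end = hdr->end_entry;		/* single read */

	if (start > end || end >= max_entries)
		return -1;

	snap->start = start;
	snap->end = end;
	return 0;
}
```

Later processing would consult only the snapshot, so a guest that rewrites
the header mid-request cannot change the range the host has already
validated.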

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-14-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata
Sean Christopherson [Fri, 29 May 2026 18:35:37 +0000 (20:35 +0200)] 
KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata

Add a "psc" struct to vcpu_sev_es_state to avoid having to prefix all of
the fields with "psc_".

Take advantage of the code churn to opportunistically rename local
variables to "guest_psc" to make it more obvious that the buffer is guest
data, and more importantly, guest accessible!

Opportunistically rename inflight => batch_size as well, because there can
really only be one operation in-flight (per-vCPU), i.e. "inflight" _looks_
like a boolean, but in actuality is an integer tracking how many pages are
being handled by the current operation.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago KVM: SEV: Make it more obvious when KVM is writing back the current PSC index
Sean Christopherson [Fri, 29 May 2026 18:35:36 +0000 (20:35 +0200)] 
KVM: SEV: Make it more obvious when KVM is writing back the current PSC index

Increment the guest-visible "cur_entry" index outside of the for-loop
when processing Page State Change entries, and add a comment to make it
more obvious which code is operating on trusted data, and which code is
touching guest-accessible data.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260501202250.2115252-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260529183549.1104619-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 weeks ago ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers
Guan-Chun Wu [Sun, 31 May 2026 08:00:19 +0000 (16:00 +0800)] 
ext4: improve str2hashbuf by processing 4-byte chunks and removing function pointers

The original byte-by-byte implementation with modulo checks is less
efficient. Refactor str2hashbuf_unsigned() and str2hashbuf_signed()
to process input in explicit 4-byte chunks instead of using a
modulus-based loop to emit words byte by byte.

Additionally, the use of function pointers for selecting the appropriate
str2hashbuf implementation has been removed. Instead, the functions are
directly invoked based on the hash type, eliminating the overhead of
dynamic function calls.

Performance test (x86_64, Intel Core i7-10700 @ 2.90GHz, average over 10000
runs, using kernel module for testing):

    len | orig_s | new_s | orig_u | new_u
    ----+--------+-------+--------+-------
      1 |   70   |   71  |   63   |   63
      8 |   68   |   64  |   64   |   62
     32 |   75   |   70  |   75   |   63
     64 |   96   |   71  |  100   |   68
    255 |  192   |  108  |  187   |   84

This change improves performance, especially for larger input sizes.
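
The difference between the two loop shapes can be sketched as follows.
This is a simplified illustration, not the ext4 implementation: the
function names pack_bytewise() and pack_chunked() are hypothetical, and
the padding and signed/unsigned handling of the real str2hashbuf helpers
are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time packing with a modulo check on every iteration. */
static size_t pack_bytewise(const unsigned char *s, size_t len, uint32_t *out)
{
	uint32_t val = 0;
	size_t i, n = 0;

	for (i = 0; i < len; i++) {
		val = (val << 8) | s[i];
		if ((i % 4) == 3) {
			out[n++] = val;
			val = 0;
		}
	}
	if (len % 4)
		out[n++] = val;		/* partial trailing word */
	return n;
}

/* Same words, but built from explicit 4-byte chunks. */
static size_t pack_chunked(const unsigned char *s, size_t len, uint32_t *out)
{
	uint32_t val = 0;
	size_t n = 0;

	while (len >= 4) {
		out[n++] = ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16) |
			   ((uint32_t)s[2] << 8) | (uint32_t)s[3];
		s += 4;
		len -= 4;
	}
	if (len) {
		while (len--)
			val = (val << 8) | *s++;
		out[n++] = val;		/* partial trailing word */
	}
	return n;
}
```

Both functions emit identical words for any input length; the chunked
version simply drops the per-byte modulo test from the main loop.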

Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
Link: https://patch.msgid.link/20260531080019.3794809-3-409411716@gms.tku.edu.tw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: add Kunit coverage for directory hash computation
Guan-Chun Wu [Sun, 31 May 2026 08:00:18 +0000 (16:00 +0800)] 
ext4: add Kunit coverage for directory hash computation

Introduce Kunit tests for fs/ext4/hash.c to verify ext4fs_dirhash()
across the legacy, half-MD4, and TEA hash variants.

The tests cover empty names, seeded hashing, and non-ASCII name handling.
They also verify error paths, including invalid hash versions and
SipHash without a configured key, and check that the signed and
unsigned hash variants differ on non-ASCII input as expected.

When CONFIG_UNICODE is enabled, the tests further verify casefolded-name
hashing and the fallback behavior for invalid input.

Co-developed-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Chen Hao Yu <edward062254@gmail.com>
Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw>
Link: https://patch.msgid.link/20260531080019.3794809-2-409411716@gms.tku.edu.tw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device
Li RongQing [Wed, 3 Jun 2026 12:37:08 +0000 (20:37 +0800)] 
dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device

In debug_dma_sync_sg_for_device(), when iterating over a scatterlist,
the debug entry population mistakenly uses the head of the scatterlist
'sg' to fetch the physical address via sg_phys(), instead of using the
current iterator variable 's'.

This causes dma-debug to track the physical address of the very first
scatterlist entry for all subsequent entries in the list.

Fix this by passing the correct loop iterator 's' to sg_phys().
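
The bug class is easy to reproduce outside the kernel. The sketch below
uses a hypothetical struct fake_sg and a plain array in place of a real
scatterlist, so it only illustrates the head-versus-iterator mix-up, not
the dma-debug code itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatterlist entry. */
struct fake_sg {
	uint64_t phys;
};

/* Buggy pattern: every iteration reads from the list head 'sg'. */
static void record_phys_buggy(const struct fake_sg *sg, size_t nents,
			      uint64_t *out)
{
	const struct fake_sg *s;
	size_t i;

	for (i = 0, s = sg; i < nents; i++, s++)
		out[i] = sg->phys;
}

/* Fixed pattern: each iteration reads from the current entry 's'. */
static void record_phys_fixed(const struct fake_sg *sg, size_t nents,
			      uint64_t *out)
{
	const struct fake_sg *s;
	size_t i;

	for (i = 0, s = sg; i < nents; i++, s++)
		out[i] = s->phys;
}
```

With three entries at distinct addresses, the buggy variant records the
first address three times, while the fixed variant records each entry's
own address.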

Fixes: 9d4f645a1fd49ee ("dma-debug: store a phys_addr_t in struct dma_debug_entry")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260603123708.1665-1-lirongqing@baidu.com
2 weeks ago ext4: fast commit: export snapshot stats in fc_info
Li Chen [Fri, 15 May 2026 09:18:27 +0000 (17:18 +0800)] 
ext4: fast commit: export snapshot stats in fc_info

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4/<sb_id>/fc_info to report the number of snapshotted
inodes and ranges, snapshot failure reasons, and the average/max time
spent with journal updates locked.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-8-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fast commit: add lock_updates tracepoint
Li Chen [Fri, 15 May 2026 09:18:26 +0000 (17:18 +0800)] 
ext4: fast commit: add lock_updates tracepoint

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable snap_err
field for tooling.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260515091829.194810-7-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots
Li Chen [Fri, 15 May 2026 09:18:25 +0000 (17:18 +0800)] 
ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots

Commit-time snapshots run under jbd2_journal_lock_updates(), so the work
done there must stay bounded.

The snapshot path still used ext4_map_blocks() to build data ranges. This
can take i_data_sem and pulls the mapping code into the snapshot logic.
Build inode data range snapshots from the extent status tree instead.

The extent status tree is a cache, not an authoritative source. If the
needed information is missing or unstable (e.g. delayed allocation), treat
the transaction as fast commit ineligible and fall back to full commit.

Also cap the number of inodes and ranges snapshotted per fast commit and
allocate range records from a dedicated slab cache. The inode pointer
array is allocated outside the updates-locked window.

Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted
dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and
python3 500x {creat + fsync(dir)} without lockdep splats or errors.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-6-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fast commit: avoid self-deadlock in inode snapshotting
Li Chen [Fri, 15 May 2026 09:18:24 +0000 (17:18 +0800)] 
ext4: fast commit: avoid self-deadlock in inode snapshotting

ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building
commit-time snapshots. With ext4_fc_del() waiting for
EXT4_STATE_FC_COMMITTING, iput() can trigger
ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting
for the fast commit to finish.

ext4_fc_del() also has to re-check EXT4_STATE_FC_COMMITTING after
waiting on EXT4_STATE_FC_FLUSHING_DATA. The commit thread clears
FLUSHING_DATA before it sets COMMITTING, so a waiter woken from the
flush wait must not delete the inode based on an old COMMITTING
check.

Avoid taking extra references. Collect inode pointers under s_fc_lock and
rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup()
clears the bit.

Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced
from the dentry update queue, and wake up waiters when ext4_fc_cleanup()
clears the bit.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-5-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fast commit: avoid waiting for FC_COMMITTING
Li Chen [Fri, 15 May 2026 09:18:23 +0000 (17:18 +0800)] 
ext4: fast commit: avoid waiting for FC_COMMITTING

ext4_fc_track_inode() can be called while holding i_data_sem (e.g.
fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an
ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING ->
wait(i_data_sem) in the commit task.

Now that fast commit snapshots inode state at commit time, updates during
log writing do not need to block. Drop the wait and lockdep assertion in
ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an
inode cannot be removed while the commit thread is still using it.

When an inode is modified during a fast commit, mark it with
EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit.
This is needed because jbd2_fc_end_commit() invokes the cleanup callback
with tid == 0, so tid-based requeue logic would requeue every inode.

Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same
transaction. nblks is the number of journal blocks written for that fast
commit. Before this change, the second fsync still wrote almost the same
fast commit log (nblks 10->9), because tid == 0 in jbd2_fc_end_commit()
caused the tid-based requeue logic to keep all inodes queued. After this
change, only inodes modified during the commit are requeued, and the
second fsync wrote a nearly empty fast commit (nblks 10->1).

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-4-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: lockdep: handle i_data_sem subclassing for special inodes
Li Chen [Fri, 15 May 2026 09:18:22 +0000 (17:18 +0800)] 
ext4: lockdep: handle i_data_sem subclassing for special inodes

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can take a
data inode i_data_sem and then s_fc_lock, which makes lockdep report a
circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode i_data_sem.
The journal inode is not tracked by fast commit and no FC waiters ever
depend on it, so this is not a real ABBA deadlock. Assign the journal inode
a dedicated i_data_sem lockdep subclass to avoid the false positive.

Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode may
inherit an old subclass (journal/quota/ea) and trigger lockdep warnings.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-3-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fast commit: snapshot inode state before writing log
Li Chen [Fri, 15 May 2026 09:18:21 +0000 (17:18 +0800)] 
ext4: fast commit: snapshot inode state before writing log

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal updates
are locked and existing handles are drained. Store the snapshot behind
ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one pointer.
The snapshot contains a copy of the on-disk inode plus the data range
records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering I/O
there by using ext4_get_inode_loc_noio() and falling back to full commit
if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.

Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Link: https://patch.msgid.link/20260515091829.194810-2-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()
Junrui Luo [Wed, 13 May 2026 09:28:40 +0000 (17:28 +0800)] 
jbd2: fix integer underflow in jbd2_journal_initialize_fast_commit()

jbd2_journal_initialize_fast_commit() validates journal capacity by
checking (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS).
Both j_last and num_fc_blks are unsigned, so when num_fc_blks exceeds
j_last the subtraction wraps to a large value, bypassing the bounds
check.

The resulting underflow corrupts j_last, j_fc_first, and j_free,
leading to journal abort.

Fix by checking num_fc_blks against j_last before the subtraction,
returning -EFSCORRUPTED.
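
The wraparound can be shown in isolation. In this sketch the function
names and the MIN_JOURNAL_BLOCKS value are assumptions for illustration,
and 32-bit unsigned types stand in for the journal fields; it mirrors the
shape of the check described above rather than the jbd2 source.

```c
#include <stdint.h>

#define MIN_JOURNAL_BLOCKS 1024u	/* illustrative value, an assumption */

/* Pre-fix shape: the subtraction wraps when num_fc_blks > last. */
static int capacity_ok_buggy(uint32_t last, uint32_t num_fc_blks)
{
	return !(last - num_fc_blks < MIN_JOURNAL_BLOCKS);
}

/* Post-fix shape: reject the oversized count before subtracting. */
static int capacity_ok_fixed(uint32_t last, uint32_t num_fc_blks)
{
	if (num_fc_blks > last)
		return 0;
	return !(last - num_fc_blks < MIN_JOURNAL_BLOCKS);
}
```

With last = 2000 and num_fc_blks = 3000, the unsigned difference wraps to
a value close to 2^32, so the buggy check accepts the configuration while
the fixed check rejects it.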

Fixes: 6866d7b3f2bb ("ext4 / jbd2: add fast commit initialization")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Fixes: e029c5f27987 ("ext4: make num of fast commit blocks configurable")
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Fixes: e029c5f279872 ("ext4: make num of fast commit blocks configurable")
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/SYBPR01MB7881663C927DE9D7BBF4D1DFAF062@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago ext4: fix fast commit wait/wake bit mapping on 64-bit
Li Chen [Wed, 13 May 2026 08:58:17 +0000 (16:58 +0800)] 
ext4: fix fast commit wait/wake bit mapping on 64-bit

On 64-bit, ext4 dynamic inode states live in the upper half of i_flags,
and ext4_test_inode_state() applies the corresponding +32 offset.

The fast-commit wait and wake paths open-coded the wait key with the raw
EXT4_STATE_* value. Add small helpers for the state wait word and bit,
and use them for the FC_COMMITTING and FC_FLUSHING_DATA waits so the wait
key follows the same mapping as the state helpers.
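
A rough sketch of why the raw state number is the wrong wait key on
64-bit. All names here (sketch_state_bit() and friends) are hypothetical,
and the flags word is a plain unsigned long rather than the real inode
field; the only detail taken from the changelog is the +32 offset applied
on 64-bit.

```c
#include <limits.h>

/* Bits in an unsigned long on the build target. */
#define SKETCH_BITS_PER_LONG ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Hypothetical helper: map a dynamic state number to its bit position in
 * the flags word.  On 64-bit the states sit in the upper half, so the
 * bit number is offset by 32; this sketch applies no offset otherwise.
 */
static int sketch_state_bit(int state)
{
	return SKETCH_BITS_PER_LONG == 64 ? state + 32 : state;
}

/* Set and test go through the same mapping, so they always agree. */
static void sketch_set_state(unsigned long *flags, int state)
{
	*flags |= 1UL << sketch_state_bit(state);
}

static int sketch_test_state(const unsigned long *flags, int state)
{
	return (int)((*flags >> sketch_state_bit(state)) & 1UL);
}

/* A wait key built from the raw state number may name a different bit. */
static int sketch_raw_key_matches(int state)
{
	return state == sketch_state_bit(state);
}
```

On a 64-bit build, state 3 lands on bit 35, so a waiter keyed on bit 3
and a waker keyed on bit 35 would never meet; routing both through one
helper removes that mismatch.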

Fixes: 857d32f26181 ("ext4: rework fast commit commit path")
Reported-by: Sashiko AI review <sashiko-bot@kernel.org>
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260513085818.552432-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago jbd2: check for aborted handle in jbd2_journal_dirty_metadata()
Deepanshu Kartikey [Thu, 7 May 2026 05:06:05 +0000 (10:36 +0530)] 
jbd2: check for aborted handle in jbd2_journal_dirty_metadata()

jbd2_journal_dirty_metadata() unconditionally dereferences
handle->h_transaction at function entry to obtain the journal pointer:

    transaction_t *transaction = handle->h_transaction;
    journal_t *journal = transaction->t_journal;

However, h_transaction may legitimately be NULL for an aborted handle.
The is_handle_aborted() helper in include/linux/jbd2.h explicitly
treats !h_transaction as one of the aborted states:

    if (handle->h_aborted || !handle->h_transaction)
            return 1;

Every other entry point in fs/jbd2/transaction.c
(jbd2_journal_get_{write,undo,create}_access, jbd2_journal_extend,
jbd2_journal_restart, jbd2_journal_stop, etc.) guards against this
with an is_handle_aborted() check before any dereference of
h_transaction. jbd2_journal_dirty_metadata() was missing this guard.

This is reachable from ocfs2's xattr code. ocfs2_xa_set() intentionally
falls through to ocfs2_xa_journal_dirty() even after
ocfs2_xa_prepare_entry() fails, on the assumption that the buffer
needs to be journaled to record any partial modifications (see the
comment above the out_dirty label in fs/ocfs2/xattr.c). If the failure
was caused by the journal being aborted -- e.g. an underlying I/O
error during a sub-operation such as __ocfs2_remove_xattr_range() --
the handle's h_transaction has been cleared by the abort path, and
the unconditional deref in jbd2_journal_dirty_metadata() becomes a
NULL deref.

Reproduced by syzbot with a crafted ocfs2 image where I/O against the
loop device backing the mount is sabotaged via LOOP_SET_STATUS64
between two setxattr() calls, causing the second setxattr (which
truncates an external xattr value) to abort the journal mid-flight:

  Oops: general protection fault, probably for non-canonical
        address 0xdffffc0000000000
  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  RIP: jbd2_journal_dirty_metadata+0x4a/0xd30 fs/jbd2/transaction.c:1520
  Call Trace:
   ocfs2_journal_dirty+0x130/0x700 fs/ocfs2/journal.c:831
   ocfs2_xa_journal_dirty fs/ocfs2/xattr.c:1483 [inline]
   ocfs2_xa_set+0x15e3/0x2ec0 fs/ocfs2/xattr.c:2294
   ocfs2_xattr_block_set+0x3e0/0x33c0 fs/ocfs2/xattr.c:3016
   __ocfs2_xattr_set_handle+0x6b3/0xf50 fs/ocfs2/xattr.c:3418
   ocfs2_xattr_set+0xf3f/0x13e0 fs/ocfs2/xattr.c:3681
   __vfs_setxattr+0x43c/0x480 fs/xattr.c:218
   ...

Fix by adding the standard is_handle_aborted() guard at the top of
jbd2_journal_dirty_metadata() and returning -EROFS, matching the
pattern used by every other entry point in this file.
ocfs2_journal_dirty() already handles a non-zero return from
jbd2_journal_dirty_metadata() correctly.
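
The guard-before-dereference pattern can be sketched with mock types. The
fake_* structs and functions below are hypothetical stand-ins, and
fake_is_handle_aborted() mirrors only the portion of the helper quoted
above; this is an illustration of the ordering, not the jbd2 code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the journal, transaction and handle. */
struct fake_journal {
	int id;
};

struct fake_transaction {
	struct fake_journal *t_journal;
};

struct fake_handle {
	struct fake_transaction *h_transaction;
	int h_aborted;
};

/* A handle with no transaction counts as aborted. */
static int fake_is_handle_aborted(const struct fake_handle *handle)
{
	if (handle->h_aborted || !handle->h_transaction)
		return 1;
	return 0;
}

/* Check for an aborted handle before touching h_transaction. */
static int fake_dirty_metadata(struct fake_handle *handle)
{
	struct fake_journal *journal;

	if (fake_is_handle_aborted(handle))
		return -EROFS;

	/* Safe: h_transaction is known to be non-NULL here. */
	journal = handle->h_transaction->t_journal;
	return journal ? 0 : -EINVAL;
}
```

Callers that already handle a non-zero return need no changes; they see
-EROFS instead of a NULL dereference.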

Reported-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=98f651460e558a21baae
Tested-by: syzbot+98f651460e558a21baae@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20260507050605.50081-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2 weeks ago wifi: iwlwifi: bump maximum core version for BZ/SC/DR to 106
Emmanuel Grumbach [Sun, 31 May 2026 10:53:09 +0000 (13:53 +0300)] 
wifi: iwlwifi: bump maximum core version for BZ/SC/DR to 106

Start supporting Core 106 FW on these devices.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20260531135036.4ec96e57a17b.I1eea0a221656b2f03839964734d9a3624530b964@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: add KUnit tests for link grading
Avinash Bhatt [Sun, 31 May 2026 10:53:08 +0000 (13:53 +0300)] 
wifi: iwlwifi: mld: add KUnit tests for link grading

Add tests for the link grading algorithm covering per-bandwidth
grading tables, channel load calculation, 6 GHz RSSI adjustments
including duplicated beacon and PSD/EIRP compensation, and
puncturing penalty.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260531135036.a4251e5665a0.I811b35680115e7de0ffd75b6b7a1c91ad361c97c@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: add KUnit tests for PSD/EIRP RSSI adjustment
Avinash Bhatt [Sun, 31 May 2026 10:53:07 +0000 (13:53 +0300)] 
wifi: iwlwifi: mld: add KUnit tests for PSD/EIRP RSSI adjustment

Add tests for PSD/EIRP RSSI adjustment which compensates measurements
when APs use PSD-based power scaling with bandwidth.

Tests cover all power types, bandwidths, and limiting scenarios.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260531135036.a18b8d0acd62.I68dfcc17359ab8a5abdc84e1e21db4ad1671af41@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: drop TLC config cmd v4/v5 compat code
Shahar Tzarfati [Sun, 31 May 2026 10:53:06 +0000 (13:53 +0300)] 
wifi: iwlwifi: mld: drop TLC config cmd v4/v5 compat code

FW core102 bumped TLC_MNG_CONFIG_CMD_API_S from version 5 to
version 6. The v4 and v5 compatibility paths in
iwl_mld_send_tlc_cmd() are no longer reachable on any supported
firmware.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.c0e2dbfd0569.I44f8eb4d985bb9590b65b77e9a3dd157e4bd5e79@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mvm: remove __must_check annotation from command sending
Miri Korenblit [Sun, 31 May 2026 10:53:05 +0000 (13:53 +0300)] 
wifi: iwlwifi: mvm: remove __must_check annotation from command sending

We don't actually need to always check the return value. For example, if
we send a command to remove an object - we can assume success
(if it fails it is probably because the fw is dead, and then it doesn't
have the object anyway).

Remove the annotations.

Link: https://patch.msgid.link/20260531135036.434473c7b29a.I455e0c3f93c25635df708da7d3216c183dbdbbbb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: trans: export the maximum supported hcmd size
Miri Korenblit [Sun, 31 May 2026 10:53:04 +0000 (13:53 +0300)] 
wifi: iwlwifi: trans: export the maximum supported hcmd size

Export the maximum allowed host command payload size to the op-modes.
Note that this information was available to the op-modes also before
this change, this just adds a clear macro.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260531135036.2e6b15bcaf50.I027e150e5f25ef2431ab4e212175dc00ca5e8abd@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: stop supporting core101
Shahar Tzarfati [Sun, 31 May 2026 10:53:03 +0000 (13:53 +0300)] 
wifi: iwlwifi: stop supporting core101

BZ, DR and SC no longer need to accept core101 firmware.
Raise the minimum supported firmware core from 101 to 102 so
these families only match supported core102 and newer images.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.4ece89be11a9.If00f9c7e011ec75219d28a38ca2077a926afc70e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: remove orphaned DC2DC config enum
Shahar Tzarfati [Sun, 31 May 2026 10:53:02 +0000 (13:53 +0300)] 
wifi: iwlwifi: remove orphaned DC2DC config enum

FW core102 removed both DC2DC_CONFIG_CMD_API_S and
DC2DC_CONFIG_CMD_RSP_API_S. The only driver-side artifact is
enum iwl_dc2dc_config_id in fw/api/config.h, which has no
callers in any .c file across all driver paths (mld/mvm/xvt).

Remove the dead definition.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260531135036.487ceed62714.I13cf8cc214c68899379112e8e52f0cd38dc7b6f8@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: fix a typo
Miri Korenblit [Sun, 31 May 2026 10:53:01 +0000 (13:53 +0300)] 
wifi: iwlwifi: fix a typo

We use 512 A-MSDUs in an A-MPDU, not 612. Fix the typo.

Link: https://patch.msgid.link/20260531135036.62a394741a04.I2fd9e1d5dc4d467426c9061df2796ff8ba0129d4@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: pcie: fix write pointer move detection
Johannes Berg [Sun, 31 May 2026 10:53:00 +0000 (13:53 +0300)] 
wifi: iwlwifi: pcie: fix write pointer move detection

Ever since the TFD queue size is no longer limited to 256 entries,
this code has been wrong, and might erroneously not detect a move
if it was by a multiple of 256. Not a big deal, but fix it while
I see it.
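
One way such a miss can arise is sketched below. This is an assumption
about the general bug class, not a reconstruction of the driver code: the
function names are hypothetical, and the sketch supposes the comparison
was effectively done on the low 8 bits of the pointer and that the queue
size is a power of two.

```c
/* Compare only the low 8 bits: blind to moves by a multiple of 256. */
static int wr_ptr_moved_8bit(unsigned int old_ptr, unsigned int new_ptr)
{
	return (old_ptr & 0xffu) != (new_ptr & 0xffu);
}

/* Compare modulo the real queue size (assumed to be a power of two). */
static int wr_ptr_moved_full(unsigned int old_ptr, unsigned int new_ptr,
			     unsigned int queue_size)
{
	unsigned int mask = queue_size - 1;

	return (old_ptr & mask) != (new_ptr & mask);
}
```

With a 1024-entry queue, a move from index 10 to index 266 is invisible
to the 8-bit comparison but is detected once the full index is compared.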

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260531135036.87ffbeab298e.I4fae41383b6756bccbed250985e0521b68a40d0c@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: Require HT support for NAN
Ilan Peer [Wed, 27 May 2026 20:05:12 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: Require HT support for NAN

NAN cannot be supported if HT is not supported, so check that
HT is supported before declaring that NAN is supported.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Link: https://patch.msgid.link/20260527230313.6274b222e849.If215f00f0cdb5eefb2507f8d0fb5734a65ce945f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mvm: fix P2P-Device binding handling
Johannes Berg [Wed, 27 May 2026 20:05:11 +0000 (23:05 +0300)] 
wifi: iwlwifi: mvm: fix P2P-Device binding handling

Our binding handling for P2P-Device can run into the following
scenario, as observed by our testing:

 - a station interface is connected on some channel
 - the P2P-Device does a remain-on-channel (ROC) on that channel
 - the ROC ends, and the P2P-Device is removed from the binding,
   but the phy_ctxt pointer is left around as a PHY cache so we
   don't need to recalibrate to the channel again and again in
   case it's not shared
 - a binding update by the station interface, even a removal,
   will re-add the P2P-Device to the binding
 - the P2P-Device is removed, which removes the PHY context, but
   it's still in the binding so the firmware crashes

Since the P2P device is removed from the binding and only re-
added by unrelated code, but we want to keep the phy_ctxt around
as a cache for future ROC usage, fix it by adding a boolean that
indicates whether or not the P2P-Device should be added to the
binding, and handle that in the binding iterator. That way, the
station interface cannot re-add the P2P-Device to the binding
when that isn't active.

Assisted-by: Github Copilot:claude-opus-4-6
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.07f94335ae06.I384238b0859343c4a9a9dda20682be1aad89cc9d@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: add KUnit tests for duplicated beacon RSSI adjustment
Avinash Bhatt [Wed, 27 May 2026 20:05:10 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: add KUnit tests for duplicated beacon RSSI adjustment

Add KUnit tests to verify RSSI adjustment for 6 GHz duplicated
beacons across different operational bandwidths and validate
detection of the duplicated beacon bit.

Signed-off-by: Avinash Bhatt <avinash.bhatt@intel.com>
Link: https://patch.msgid.link/20260527230313.a3500c44f5e8.Icba6ee1158e9f563a91b482b8cdd3f51ddace468@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o netdetect
Johannes Berg [Wed, 27 May 2026 20:05:09 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o netdetect

Clearly, from a user perspective, it must be valid to configure
WoWLAN and then suspend while not connected to a network. Since
mac80211 doesn't distinguish these cases and simply calls the
driver to suspend whenever WoWLAN is configured, the driver has
to cleanly handle the case where it's called for WoWLAN, it's
not connected but there's also no netdetect configured.

Remove the WARN_ON() and keep returning 1 to disconnect and
then suspend.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://patch.msgid.link/20260527230313.19720967372b.Iff30814510a26f9f609f98eeea3111c50c1afb31@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: cfg: Revert "wifi: iwlwifi: cfg: move the MODULE_FIRMWARE to the per...
Shahar Tzarfati [Wed, 27 May 2026 20:05:08 +0000 (23:05 +0300)] 
wifi: iwlwifi: cfg: Revert "wifi: iwlwifi: cfg: move the MODULE_FIRMWARE to the per-rf file"

IWL_BZ_UCODE_CORE_MAX is undefined in cfg/rf-fm.c, which
causes __stringify(core) to turn it into the literal
token text, so MODULE_FIRMWARE entries are generated as
"iwlwifi...-cIWL_BZ_UCODE_CORE_MAX.ucode",
instead of the actual number.
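
The preprocessor behavior behind this is standard C and can be shown with
a two-level stringify macro in the style of the kernel's __stringify().
The macro and string names below are made up for the example, and 106 is
just a sample value.

```c
/* Two-level stringify: the outer macro expands its argument first. */
#define SKETCH_STRINGIFY_1(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_1(x)

#define CORE_MAX_DEFINED 106
/* CORE_MAX_UNDEFINED is deliberately not defined anywhere. */

/* Defined macro: expanded to its value, then stringified. */
static const char fw_good[] =
	"fw-c" SKETCH_STRINGIFY(CORE_MAX_DEFINED) ".ucode";

/* Undefined macro: nothing to expand, so the token name is stringified. */
static const char fw_bad[] =
	"fw-c" SKETCH_STRINGIFY(CORE_MAX_UNDEFINED) ".ucode";
```

The compiler emits no diagnostic for the second case, since an undefined
identifier is a perfectly valid token to stringify, which is why the
wrong firmware names went unnoticed at build time.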

This reverts the commit below.

Signed-off-by: Shahar Tzarfati <shahar.tzarfati@intel.com>
Link: https://patch.msgid.link/20260527230313.a10bc3359dca.I446a1340c635f07aff3efaba5317635e010c156f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: set fast-balance scan for active EMLSR
Pagadala Yesu Anjaneyulu [Wed, 27 May 2026 20:05:07 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: set fast-balance scan for active EMLSR

While associated to MLD AP with active EMLSR, set all scan
operations as fast-balance scans. The only exception is when a
fragmented scan is planned (high traffic or low latency), in
which case the fragmented scan is preserved.

Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Link: https://patch.msgid.link/20260527230313.32d278842b0e.Ia3d73e4085eefc4d3921e93de4107b2d6a6f922e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: support FW TLV for NAN max channel switch time
Israel Kozitz [Wed, 27 May 2026 20:05:06 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: support FW TLV for NAN max channel switch time

Add a new FW TLV (IWL_UCODE_TLV_FW_NAN_MAX_CHAN_SWITCH_TIME) that
allows the firmware to specify the NAN maximum channel switch time
in microseconds.

When the TLV is present, use its value for the NAN device capability.
Otherwise, fall back to the default of 4 milliseconds.

Signed-off-by: Israel Kozitz <israel.kozitz@intel.com>
Link: https://patch.msgid.link/20260527230313.e8ae1a3adacd.I15b933407ca3974a65047b63b4f9b00bed3520fb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: always allow mimo in NAN
Miri Korenblit [Wed, 27 May 2026 20:05:05 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: always allow mimo in NAN

The mimo field of the sta command is badly named. It really carries the
initial SMPS value as it is in the association request of the client
station (when we are the AP).

In NAN we don't have this information, just mark SMPS as disabled.

Link: https://patch.msgid.link/20260527230313.abd136be474e.I9eb663d953b482236345ffbcb611f28facea83c1@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: move iwl_fw_rate_idx_to_plcp() to mvm
Johannes Berg [Wed, 27 May 2026 20:05:04 +0000 (23:05 +0300)] 
wifi: iwlwifi: move iwl_fw_rate_idx_to_plcp() to mvm

It's only needed by mvm, so there's no need to have it in
iwlwifi and export it, just move it to mvm itself.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.87769f13c7d7.I3875d768694b9484317a3253f479a2a2100244f4@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mvm: rename iwl_mvm_mac80211_idx_to_hwrate()
Johannes Berg [Wed, 27 May 2026 20:05:03 +0000 (23:05 +0300)] 
wifi: iwlwifi: mvm: rename iwl_mvm_mac80211_idx_to_hwrate()

Given that we now use v3 rates with FW index throughout,
_to_hwrate() is confusing, since the hardware still uses
the PLCP value, the driver just doesn't see that now (as
it talks to firmware, not hardware.)

Rename this to iwl_mvm_rate_idx_to_fw_idx() to more
clearly indicate what it's doing.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.a60c8aea5b6c.I6af48d5d9748e184eed9d3437d312291cab61d7f@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: fix STEP_URM register address for SC devices
Moriya Itzchaki [Wed, 27 May 2026 20:05:02 +0000 (23:05 +0300)] 
wifi: iwlwifi: fix STEP_URM register address for SC devices

The CNVI_PMU_STEP_FLOW register address differs between device families.
For SC and newer devices, the register is at 0xA2D688,
while for BZ devices it's at 0xA2D588.
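
A minimal sketch of the per-family selection, assuming a hypothetical
family enum in which larger values mean newer families. Only the two
addresses are taken from this changelog; the names are illustrative.

```c
#include <stdint.h>

/* Register addresses quoted in the changelog. */
#define STEP_FLOW_ADDR_BZ 0xA2D588u
#define STEP_FLOW_ADDR_SC 0xA2D688u

/* Hypothetical ordering: a larger value means a newer device family. */
enum sketch_dev_family {
	SKETCH_FAMILY_BZ = 0,
	SKETCH_FAMILY_SC = 1,
	SKETCH_FAMILY_NEWER = 2,
};

/* SC and newer families use the SC address, BZ keeps its own. */
static uint32_t step_flow_addr(enum sketch_dev_family family)
{
	return family >= SKETCH_FAMILY_SC ? STEP_FLOW_ADDR_SC
					  : STEP_FLOW_ADDR_BZ;
}
```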

Signed-off-by: Moriya Itzchaki <moriya.itzchaki@intel.com>
Link: https://patch.msgid.link/20260527230313.f0c115c4f74e.I3c66b2e39a97f754e853ac7e7dba8e433523619e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: mld: fix smatch warning
Miri Korenblit [Wed, 27 May 2026 20:05:01 +0000 (23:05 +0300)] 
wifi: iwlwifi: mld: fix smatch warning

We dereference the mld_sta pointer before checking for NULL.
But we do check the sta pointer, and sta != NULL means mld_sta != NULL,
so there is no real issue.
Fix it anyway to silence the warning.

Link: https://patch.msgid.link/20260527200512.506707-2-miriam.rachel.korenblit@intel.com
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: remove mvm prefix from marker command
Miri Korenblit [Wed, 27 May 2026 20:05:00 +0000 (23:05 +0300)] 
wifi: iwlwifi: remove mvm prefix from marker command

This command is sent in other opmodes as well. Remove the mvm prefix.

Link: https://patch.msgid.link/20260527230313.290e4d9db14a.Ia4edc64dacc8e298ab7817ab5c37843e92698b8d@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: remove stale comment
Miri Korenblit [Wed, 27 May 2026 20:04:59 +0000 (23:04 +0300)] 
wifi: iwlwifi: remove stale comment

iwl_pcie_set_hw_ready still returns the return value of iwl_poll_bits,
but the latter no longer returns the time elapsed until success; it now
returns either success or failure.
Remove the comment entirely.

Link: https://patch.msgid.link/20260527230313.ae42da7924ec.I1a92266621dc0033afa80f022d4c45e91674fedb@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: fw: cut down NIC wakeups during dump
Johannes Berg [Wed, 27 May 2026 20:04:58 +0000 (23:04 +0300)] 
wifi: iwlwifi: fw: cut down NIC wakeups during dump

Currently, the dump code attempts to dump any number of
memories and register banks, as defined by the firmware.
Especially when the device is failing, this can lead to
excessive time spent attempting to acquire NIC access
over and over again.

Improve the code to only attempt to acquire NIC access
once or twice, but using the new memory dump functions
that may drop the spinlock etc. Mark all dump regions
that require NIC access, and skip them if we couldn't
obtain that.

In order to avoid CPU latency due to the increased time
holding the spinlock (and possibly disabling softirqs),
drop locks and call cond_resched() after each section
(if holding NIC access) but don't release HW NIC access.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260527230313.bec886142cc8.I41f2eaf2403b38147504d5dab0a7414de2699adc@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago wifi: iwlwifi: add support for AX231
Emmanuel Grumbach [Tue, 12 May 2026 05:22:57 +0000 (08:22 +0300)] 
wifi: iwlwifi: add support for AX231

AX231 is a device that is based on AX211 that doesn't support 6E and
its bandwidth is limited to 80 MHz.
Just reuse the radio config from AX203 which has the exact same
characteristics.
It has a specific subdevice ID to allow the driver to differentiate
between AX211 and AX231.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Reviewed-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20260512082114.0685ed313987.Ibcfa24e196ac778405d2843f0984b66ca167704e@changeid
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
2 weeks ago ASoC: amd: remove unused machine
Mark Brown [Wed, 3 Jun 2026 13:50:02 +0000 (14:50 +0100)] 
ASoC: amd: remove unused machine

Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> says:

This patch-set removes unused machine

Link: https://patch.msgid.link/877bogce4k.wl-kuninori.morimoto.gx@renesas.com
2 weeks ago ASoC: amd: ps-mach: remove unused machine
Kuninori Morimoto [Wed, 3 Jun 2026 06:50:09 +0000 (06:50 +0000)] 
ASoC: amd: ps-mach: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/8733z4ce3i.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2 weeks ago ASoC: amd: acp6x-mach: remove unused machine
Kuninori Morimoto [Wed, 3 Jun 2026 06:50:05 +0000 (06:50 +0000)] 
ASoC: amd: acp6x-mach: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/874ijkce3m.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2 weeks ago ASoC: amd: acp3x-rn: remove unused machine
Kuninori Morimoto [Wed, 3 Jun 2026 06:50:00 +0000 (06:50 +0000)] 
ASoC: amd: acp3x-rn: remove unused machine

Not used, remove it.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Reviewed-by: Vijendar Mukunda <Vijendar.Mukunda@amd.com>
Link: https://patch.msgid.link/875x40ce3s.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2 weeks ago s390/percpu: Provide arch_this_cpu_write() implementation
Heiko Carstens [Tue, 26 May 2026 05:57:02 +0000 (07:57 +0200)] 
s390/percpu: Provide arch_this_cpu_write() implementation

Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.

Get rid of this by providing a dedicated variant which makes use of the
new percpu code section infrastructure.

With this the text size of the kernel image is reduced by ~1k (defconfig).

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Provide arch_this_cpu_read() implementation
Heiko Carstens [Tue, 26 May 2026 05:57:01 +0000 (07:57 +0200)] 
s390/percpu: Provide arch_this_cpu_read() implementation

Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses a preempt_disable() /
preempt_enable() pair and READ_ONCE().

Get rid of the preempt_disable() / preempt_enable() pairs by providing a
dedicated variant which makes use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
Heiko Carstens [Tue, 26 May 2026 05:57:00 +0000 (07:57 +0200)] 
s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()

Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.

There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and removes only one preempt_schedule_notrace() function call.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
Heiko Carstens [Tue, 26 May 2026 05:56:59 +0000 (07:56 +0200)] 
s390/percpu: Use new percpu code section for arch_this_cpu_add_return()

Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.

With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Use new percpu code section for arch_this_cpu_add()
Heiko Carstens [Tue, 26 May 2026 05:56:58 +0000 (07:56 +0200)] 
s390/percpu: Use new percpu code section for arch_this_cpu_add()

Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.

With:

DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }

Old arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   eb 01 03 a8 00 6a       asi     936,1
  cc:   c4 18 00 00 00 00       lgrl    %r1,cc <bar+0xc>
                        ce: R_390_GOTENT        foo+0x2
  d2:   e3 10 03 b8 00 08       ag      %r1,952
  d8:   eb 22 10 00 00 e8       laag    %r2,%r2,0(%r1)
  de:   eb ff 03 a8 00 6e       alsi    936,-1
  e4:   a7 a4 00 05             jhe     ee <bar+0x2e>
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2
  ee:   c0 f4 00 00 00 00       jg      ee <bar+0x2e>
                        f0: R_390_PLT32DBL      preempt_schedule_notrace+0x2

New arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   c4 38 00 00 00 00       lgrl    %r3,c6 <bar+0x6>
                        c8: R_390_GOTENT        foo+0x2
  cc:   b9 04 00 43             lgr     %r4,%r3
  d0:   eb 00 43 c0 00 52       mviy    960(%r0),4
  d6:   e3 40 03 b8 00 08       ag      %r4,952
  dc:   eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  e2:   eb 00 03 c0 00 52       mviy    960,0
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2

Note that the conditional function call is removed.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Add missing do { } while (0) constructs
Heiko Carstens [Tue, 26 May 2026 05:56:57 +0000 (07:56 +0200)] 
s390/percpu: Add missing do { } while (0) constructs

Add missing do { } while (0) constructs in order to avoid potential
build failures.
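
For illustration, a generic example of why a multi-statement macro
wants the do { } while (0) wrapper. ADD_TWICE() and add_twice_if() are
hypothetical and are not taken from the s390 percpu code.

```c
/*
 * With bare braces, "#define ADD_TWICE(var, val) { ...; }" followed by
 * the caller's ';' would end the if statement early, so the 'else'
 * below would no longer parse. The do { } while (0) wrapper turns the
 * macro into a single statement that consumes the trailing ';'.
 */
#define ADD_TWICE(var, val)		\
	do {				\
		(var) += (val);		\
		(var) += (val);		\
	} while (0)

/* Uses the macro as the unbraced body of an if/else. */
static long add_twice_if(int cond, long start, long val)
{
	long v = start;

	if (cond)
		ADD_TWICE(v, val);
	else
		v = -1;

	return v;
}
```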

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca%40linux.ibm.com
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/percpu: Infrastructure for more efficient this_cpu operations
Heiko Carstens [Tue, 26 May 2026 05:56:56 +0000 (07:56 +0200)] 
s390/percpu: Infrastructure for more efficient this_cpu operations

With the intended removal of PREEMPT_NONE, this_cpu operations which are
based on atomic instructions and guarded with preempt_disable() /
preempt_enable() pairs become more expensive: the preempt_disable() /
preempt_enable() pairs are no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

E.g. this simple C code sequence

DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }

generates this code:

  11a976:       eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
  11a97c:       b9 04 00 ef             lgr     %r14,%r15
  11a980:       b9 04 00 b2             lgr     %r11,%r2
  11a984:       e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
  11a98a:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
  11a990:       eb 01 03 a8 00 6a       asi     936,1            <- __preempt_count_add(1)
  11a996:       c0 10 00 d2 ac b5       larl    %r1,1b70300      <- address of percpu var
  11a9a0:       e3 10 23 b8 00 08       ag      %r1,952          <- add percpu offset
  11a9a6:       eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1) <- atomic op
  11a9ac:       eb ff 03 a8 00 6e       alsi    936,-1           <- __preempt_count_dec_and_test()
  11a9b2:       a7 54 00 05             jnhe    11a9bc <bar+0x4c>
  11a9b6:       c0 e5 00 76 d1 bd       brasl   %r14,ff4d30 <preempt_schedule_notrace>
  11a9bc:       b9 e8 b0 2a             agrk    %r2,%r10,%r11
  11a9c0:       eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
  11a9c6:       07 fe                   br      %r14

Even though the above example is more or less the worst case, since the
branch to preempt_schedule_notrace() requires a stackframe, which
otherwise wouldn't be necessary, there is also the conditional jnhe branch
instruction.

Get rid of the conditional branch with the following code sequence:

  11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
  11a8ec:       b9 04 00 43             lgr     %r4,%r3
  11a8f0:       eb 00 43 c0 00 52       mviy    960,4
  11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
  11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  11a902:       eb 00 03 c0 00 52       mviy    960,0
  11a908:       b9 08 00 25             agr     %r2,%r5
  11a90c:       07 fe                   br      %r14

The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the address of the percpu variable to lowcore. This also
  indicates that a percpu code section is executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked if a percpu code section
  was executed (saved register number not zero), and if the process was
  migrated to a different cpu. If the percpu offset was already added to
  the percpu address register (instruction address does _not_ point to the
  ag instruction) the content of the percpu address register is adjusted so
  it points to percpu variable of the new cpu.
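
The entry and return steps listed above can be modeled in userspace
roughly like this. All names (sim_lowcore, sim_regs, at_ag_insn) are
hypothetical; the real code works on lowcore and pt_regs and inspects
the instruction address instead of taking a flag. Anything not listed
above, such as how lowcore is handled again on return, is outside this
sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the lowcore field written by mviy. */
struct sim_lowcore {
	uint8_t percpu_register;	/* 0: not in a percpu code section */
};

/* Hypothetical stand-in for the exception frame (pt_regs). */
struct sim_regs {
	uint64_t gprs[16];
	uint8_t percpu_register;	/* copied from lowcore on entry */
	bool at_ag_insn;		/* interrupted right before the ag */
};

/* Interrupt entry: save the register number and clear lowcore. */
static void sim_irq_entry(struct sim_lowcore *lc, struct sim_regs *regs)
{
	regs->percpu_register = lc->percpu_register;
	lc->percpu_register = 0;
}

/* Return to the interrupted context, possibly on a different CPU. */
static void sim_irq_return(struct sim_regs *regs, uint64_t old_offset,
			   uint64_t new_offset)
{
	uint8_t reg = regs->percpu_register;

	if (reg == 0 || reg >= 16)
		return;		/* no percpu code section was interrupted */
	if (old_offset == new_offset)
		return;		/* not migrated */
	if (regs->at_ag_insn)
		return;		/* ag has not run yet, it adds the new offset */

	/* Offset of the old CPU was already added: rebase to the new CPU. */
	regs->gprs[reg] += new_offset - old_offset;
}
```

Register number 0 doubles as the "no section active" marker, which
appears to be why the examples above use %r4 for the percpu address.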

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/zcrypt: Replace get_zeroed_page() with kzalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:27 +0000 (17:08 +0300)] 
s390/zcrypt: Replace get_zeroed_page() with kzalloc()

zcrypt_rng_device_add() allocates a buffer for the software random
number generator data cache.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().
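
The shape of the conversion, sketched so that it builds in userspace.
The definitions of PAGE_SIZE, gfp_t, GFP_KERNEL, get_zeroed_page(),
free_page(), kzalloc() and kfree() below are calloc()/free() based
stand-ins, not the kernel implementations, and the cache_*() helpers
are hypothetical. The one-page size is an assumption as well.

```c
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins so that the sketch builds outside the kernel. */
#define PAGE_SIZE 4096UL
typedef unsigned int gfp_t;
#define GFP_KERNEL 0u

static unsigned long get_zeroed_page(gfp_t gfp)
{
	(void)gfp;
	return (unsigned long)(uintptr_t)calloc(1, PAGE_SIZE);
}

static void free_page(unsigned long addr)
{
	free((void *)(uintptr_t)addr);
}

static void *kzalloc(size_t size, gfp_t gfp)
{
	(void)gfp;
	return calloc(1, size);
}

static void kfree(const void *p)
{
	free((void *)p);
}

/* Before: page allocator API, casts in both directions. */
static unsigned char *cache_alloc_old(void)
{
	return (unsigned char *)get_zeroed_page(GFP_KERNEL);
}

static void cache_free_old(unsigned char *buf)
{
	free_page((unsigned long)buf);
}

/* After: slab API, no casts needed. */
static unsigned char *cache_alloc_new(void)
{
	return kzalloc(PAGE_SIZE, GFP_KERNEL);
}

static void cache_free_new(unsigned char *buf)
{
	kfree(buf);
}
```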

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/trng: Replace __get_free_page() with kmalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:26 +0000 (17:08 +0300)] 
s390/trng: Replace __get_free_page() with kmalloc()

trng_read() allocates a temporary staging buffer for CPACF TRNG
random data before copying it to userspace.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/qeth: Replace get_zeroed_page() with kzalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:25 +0000 (17:08 +0300)] 
s390/qeth: Replace get_zeroed_page() with kzalloc()

qeth_get_trap_id() allocates a temporary buffer for STSI system
information queries used to build trap identification strings.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/hvc_iucv: Replace get_zeroed_page() with kzalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:24 +0000 (17:08 +0300)] 
s390/hvc_iucv: Replace get_zeroed_page() with kzalloc()

hvc_iucv_alloc() allocates a send staging buffer for accumulating
outbound terminal characters before they are copied into a separate
IUCV message buffer for transmission to the hypervisor. The staging
buffer itself is never passed to any IUCV function.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/dasd: Replace get_zeroed_page() with kzalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:23 +0000 (17:08 +0300)] 
s390/dasd: Replace get_zeroed_page() with kzalloc()

DASD driver uses get_zeroed_page() to allocate pages for the Extended Error
Reporting software ring buffer and for a scratch buffer for formatting
sense dump diagnostic text.

These buffers can be allocated with kmalloc() as there's nothing special
about them to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of get_zeroed_page() with kzalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/con3270: Replace __get_free_page() with kmalloc()
Mike Rapoport (Microsoft) [Sun, 31 May 2026 14:08:22 +0000 (17:08 +0300)] 
s390/con3270: Replace __get_free_page() with kmalloc()

con3270_alloc_view() allocates a staging buffer used to assemble
3270 datastream content before it is copied into channel program
requests.

This buffer can be allocated with kmalloc() as there's nothing special
about it to go directly to the page allocator.

kmalloc() provides a better API that does not require ugly casts and
kfree() does not need to know the size of the freed object.

Performance difference between kmalloc() and __get_free_pages() is not
measurable as both allocators take an object/page from a per-CPU list for
fast path allocations.

For the slow path the performance is anyway determined by the amount of
reclaim involved rather than by what allocator is used.

Replace use of __get_free_page() with kmalloc() and free_page() with
kfree().

Link: https://lore.kernel.org/all/635405e4-9423-4a25-a6e7-e03c8ea0bcbe@redhat.com
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/fpu: Move GR_NUM / VX_NUM macros to separate header file
Heiko Carstens [Tue, 26 May 2026 13:09:52 +0000 (15:09 +0200)] 
s390/fpu: Move GR_NUM / VX_NUM macros to separate header file

Move GR_NUM / VX_NUM macros to separate insn-common-asm.h header file
so they can be reused for non-fpu insn constructs.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/fpu: Shorten GR_NUM / VX_NUM macros
Heiko Carstens [Tue, 26 May 2026 13:09:51 +0000 (15:09 +0200)] 
s390/fpu: Shorten GR_NUM / VX_NUM macros

Use the ".irp" directive to get rid of all the repeated ".ifc" usages
in fpu-insn-asm.h.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago s390/ap/zcrypt: Rearrange fields within AP and zcrypt structs
Harald Freudenberger [Mon, 27 Apr 2026 16:09:39 +0000 (18:09 +0200)] 
s390/ap/zcrypt: Rearrange fields within AP and zcrypt structs

Rearrange some fields within AP and zcrypt structs to reduce
memory consumption and unused holes with the help of pahole
analysis of the code.
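
As a generic illustration of the kind of hole pahole reports, consider
the following two layouts. The structs are hypothetical and are not the
actual AP or zcrypt structs.

```c
#include <stddef.h>

/* Two small flags separated by a long: padding after each flag. */
struct with_holes {
	char flag_a;	/* followed by a hole up to the alignment of counter */
	long counter;
	char flag_b;	/* followed by tail padding */
};

/* Same members, largest first: the flags share one padded slot. */
struct reordered {
	long counter;
	char flag_a;
	char flag_b;
};

/* Number of bytes saved by the reordering on the current ABI. */
static size_t layout_saving(void)
{
	return sizeof(struct with_holes) - sizeof(struct reordered);
}
```

On a typical LP64 ABI, where long is 8 bytes with 8 byte alignment, the
first layout takes 24 bytes and the second 16.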

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: select FSP Chain of Trust version
John Hubbard [Wed, 3 Jun 2026 07:30:22 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: select FSP Chain of Trust version

The FSP Chain of Trust handshake is versioned: Hopper speaks version 1
and Blackwell speaks version 2. Provide the version through the FSP HAL
so the boot message carries the value FSP expects, and so chipsets that
do not use FSP need not express a version at all.
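
The version selection can be sketched as follows. This is written in C
for illustration only and is not how the driver expresses it; the enum
and function names are hypothetical. Only the mapping Hopper to 1 and
Blackwell to 2 comes from this changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical architecture identifiers for the sketch. */
enum sketch_arch {
	SKETCH_ARCH_OTHER,	/* chipsets that do not use FSP */
	SKETCH_ARCH_HOPPER,
	SKETCH_ARCH_BLACKWELL,
};

/*
 * Store the Chain of Trust version for FSP based architectures and
 * return true. Return false, leaving *version untouched, for chipsets
 * without FSP, so those never have to express a version.
 */
static bool fsp_cot_version(enum sketch_arch arch, uint8_t *version)
{
	switch (arch) {
	case SKETCH_ARCH_HOPPER:
		*version = 1;
		return true;
	case SKETCH_ARCH_BLACKWELL:
		*version = 2;
		return true;
	default:
		return false;
	}
}
```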

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-5-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: add FSP send/receive messaging
John Hubbard [Wed, 3 Jun 2026 07:30:21 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: add FSP send/receive messaging

FSP exchanges are request/response: the driver sends an MCTP/NVDM
message and must match the reply against the request before acting on
it. Add the synchronous send-and-wait path that validates the response
transport and message headers and confirms the reply corresponds to the
request that was sent.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-4-d9f3a06939e0@nvidia.com
[acourbot: make `MessageToFsp` private.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: add MCTP/NVDM protocol types for firmware communication
John Hubbard [Wed, 3 Jun 2026 07:30:20 +0000 (16:30 +0900)] 
gpu: nova-core: add MCTP/NVDM protocol types for firmware communication

Add the MCTP (Management Component Transport Protocol) and NVDM (NVIDIA
Data Model) wire-format types used for communication between the kernel
driver and GPU firmware processors.

This includes typed MCTP transport headers, NVDM message headers, and
NVDM message type identifiers. Both the FSP boot path and the upcoming
GSP RPC message queue share this protocol layer.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-3-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: add FSP message infrastructure
John Hubbard [Wed, 3 Jun 2026 07:30:19 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: add FSP message infrastructure

FSP communication uses a pair of non-circular queues in the FSP
falcon's EMEM, one for messages from the driver to FSP and one for
replies, with the driver polling for response data. Add the queue
registers and the low-level helpers used by the higher-level FSP
message layer.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-2-d9f3a06939e0@nvidia.com
[acourbot: align register fields names with OpenRM.]
[acourbot: represent registers as arrays of 8 instances, as per OpenRM.]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago gpu: nova-core: Hopper/Blackwell: add FSP falcon EMEM operations
John Hubbard [Wed, 3 Jun 2026 07:30:18 +0000 (16:30 +0900)] 
gpu: nova-core: Hopper/Blackwell: add FSP falcon EMEM operations

Add external memory (EMEM) read/write operations to the GPU's FSP falcon
engine. These operations use Falcon PIO (Programmed I/O) to communicate
with the FSP through indirect memory access.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260603-b4-blackwell-v13-1-d9f3a06939e0@nvidia.com
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
2 weeks ago selftests: livepatch: set LC_ALL=C to fix locale-dependent test failure
Qiang Ma [Wed, 27 May 2026 09:59:29 +0000 (17:59 +0800)] 
selftests: livepatch: set LC_ALL=C to fix locale-dependent test failure

When executing the command
"make -C tools/testing/selftests TARGETS=livepatch run_tests",
the following error message was reported.

TEST: livepatch interaction with ftrace_enabled sysctl ... not ok
...
livepatch: sysctl: setting key "kernel.ftrace_enabled": Device or resource busy
livepatch: sysctl: setting key "kernel.ftrace_enabled": 设备或资源忙
...
ERROR: livepatch kselftest(s) failed
not ok 5 selftests: livepatch: test-ftrace.sh # exit=1

To fix it, set LC_ALL=C.

Signed-off-by: Qiang Ma <maqianga@uniontech.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20260527095929.1504032-1-maqianga@uniontech.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
2 weeks ago KVM: riscv: Fast-path dirty logging write faults
Jinyu Tang [Sun, 17 May 2026 15:34:27 +0000 (23:34 +0800)] 
KVM: riscv: Fast-path dirty logging write faults

With dirty logging enabled, guest writes often fault on an existing 4K
G-stage leaf that was write-protected only for dirty tracking. The slow
path still performs the full fault handling flow and takes mmu_lock for
write, even though the page-table shape does not change.

x86 handles the analogous case in its fast page fault path by atomically
making a write-protected SPTE writable again when the fault is only a
write-protection fault. Add the same style of fast path for RISC-V. If a
write fault hits an existing 4K leaf in a writable dirty-log memslot,
mark the page dirty and atomically set the PTE writable and dirty under
the read side of mmu_lock.

The dirty bitmap is updated before the PTE becomes writable again. The
PTE D bit is also set so systems that trap on a clear D bit do not fall
back to the slow path for a writable but clean PTE.
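
A userspace model of the ordering described above, using C11 atomics.
The memslot struct, the 64 page bitmap and the function name are
hypothetical simplifications; mmu_lock and the page-table walk are not
modeled. The bit positions follow the RISC-V PTE layout (V bit 0, W bit
2, D bit 7).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PTE_V (UINT64_C(1) << 0)	/* valid */
#define SKETCH_PTE_W (UINT64_C(1) << 2)	/* writable */
#define SKETCH_PTE_D (UINT64_C(1) << 7)	/* dirty */

/* Hypothetical memslot with room for 64 pages of dirty tracking. */
struct sketch_memslot {
	bool writable;
	bool dirty_logging;
	_Atomic uint64_t dirty_bitmap;
};

/*
 * Returns true if the write fault was handled on the fast path. A
 * false return means the caller has to take the slow path (or retry),
 * for example because another CPU changed the PTE first.
 */
static bool sketch_fast_write_fault(struct sketch_memslot *slot,
				    _Atomic uint64_t *ptep,
				    unsigned int page_idx)
{
	uint64_t old = atomic_load(ptep);

	if (!slot->writable || !slot->dirty_logging || page_idx >= 64)
		return false;
	if (!(old & SKETCH_PTE_V))
		return false;

	/* Mark the page dirty before the PTE becomes writable again. */
	atomic_fetch_or(&slot->dirty_bitmap, UINT64_C(1) << page_idx);

	/* Set W and D in one atomic step, only if the PTE is unchanged. */
	return atomic_compare_exchange_strong(ptep, &old,
					      old | SKETCH_PTE_W | SKETCH_PTE_D);
}
```

Updating the bitmap first means a failed compare-and-exchange can at
worst leave a page marked dirty that was never written, and never a
written page that was missed.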

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-6-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2 weeks ago KVM: riscv: Update G-stage PTE permissions atomically
Jinyu Tang [Sun, 17 May 2026 15:34:26 +0000 (23:34 +0800)] 
KVM: riscv: Update G-stage PTE permissions atomically

When a fault hits an existing G-stage leaf with the same PFN, KVM only
needs to update the PTE permissions. This path will be used by read-side
fault handling, so it must not overwrite a concurrent PTE update.

Use the cmpxchg helper when relaxing permissions on an existing leaf,
following the same concurrency model used by x86 for atomic SPTE
permission updates. Retry if another CPU changed the PTE first, and use
cpu_relax() while spinning.
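
The retry loop can be modeled in userspace with C11 atomics as shown
below. The function and macro names are hypothetical, sketch_relax() is
an empty stand-in for cpu_relax(), and TLB flushing is left out. The
PFN is assumed to start at bit 10, with the permission and status bits
below it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PFN_SHIFT 10
#define SKETCH_PERM_MASK ((UINT64_C(1) << SKETCH_PFN_SHIFT) - 1)

/* Empty stand-in for cpu_relax(). */
static inline void sketch_relax(void)
{
}

/*
 * OR additional permission bits into *ptep as long as the leaf still
 * maps the expected PFN. Returns false if the PFN no longer matches.
 */
static bool sketch_relax_perms(_Atomic uint64_t *ptep, uint64_t pfn,
			       uint64_t add_perms)
{
	uint64_t old = atomic_load(ptep);

	for (;;) {
		uint64_t updated;

		if ((old >> SKETCH_PFN_SHIFT) != pfn)
			return false;

		updated = old | (add_perms & SKETCH_PERM_MASK);
		if (updated == old)
			return true;	/* nothing to change */

		/* On failure 'old' is reloaded with the current PTE value. */
		if (atomic_compare_exchange_weak(ptep, &old, updated))
			return true;

		sketch_relax();
	}
}
```

Because the new value is always derived from the freshly observed one,
bits set concurrently by another CPU are kept instead of overwritten.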

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-5-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2 weeks ago KVM: riscv: Add a G-stage PTE cmpxchg helper
Jinyu Tang [Sun, 17 May 2026 15:34:25 +0000 (23:34 +0800)] 
KVM: riscv: Add a G-stage PTE cmpxchg helper

Permission-only G-stage PTE updates can run in parallel once they are
moved to the read side of mmu_lock. Plain set_pte() is not enough for
that case because another CPU may update the same PTE first.

x86 handles the same class of SPTE races with cmpxchg-based updates in
its fast page fault and TDP MMU paths. Add a small RISC-V helper for
atomic G-stage PTE updates. The helper reports contention to the caller
and flushes the target range only when the PTE value actually changes.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-4-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>
2 weeks ago KVM: riscv: Use an rwlock for mmu_lock
Jinyu Tang [Sun, 17 May 2026 15:34:24 +0000 (23:34 +0800)] 
KVM: riscv: Use an rwlock for mmu_lock

RISC-V KVM currently uses a spinlock for mmu_lock. That serializes all
G-stage MMU operations, including permission-only updates that do not
allocate or free page-table pages.

Use KVM's rwlock form of mmu_lock, as x86 and arm64 already do. Keep the
existing map, unmap and teardown paths on the write side. This prepares
RISC-V for read-side handling of G-stage permission updates.

Signed-off-by: Jinyu Tang <tjytimi@163.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260517153427.94889-3-tjytimi@163.com
Signed-off-by: Anup Patel <anup@brainfault.org>