Yosry Ahmed [Mon, 16 Mar 2026 20:27:29 +0000 (20:27 +0000)]
KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation
Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL.
But it could also fail with -EFAULT if creating a host mapping failed.
Inject a #GP in all cases, no reason to treat failure modes differently.
Similar to commit 01ddcdc55e09 ("KVM: nSVM: Always inject a #GP if
mapping VMCB12 fails on nested VMRUN"), treat all failures equally.
Yosry Ahmed [Mon, 16 Mar 2026 20:27:28 +0000 (20:27 +0000)]
KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction from L2, it checks the
legality of RAX, and injects a #GP if RAX is illegal, or otherwise
synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes
precedence over both the RAX check and the intercept. Call
nested_svm_check_permissions() first to cover both.
Note that if #GP is intercepted on an SVM instruction in L1, the intercept
handlers of VMRUN/VMLOAD/VMSAVE already perform these checks.
Note #2, if KVM does not intercept #GP, the check for EFER.SVME is not
done in the correct order, because KVM handles it by intercepting the
instructions when EFER.SVME=0 and injecting #UD. However, a #GP
injected by hardware would happen before the instruction intercept,
leading to #GP taking precedence over #UD from the guest's perspective.
Opportunistically add a FIXME for this.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
When #GP is intercepted by KVM, the #GP interception handler checks
whether the GPA in RAX is legal and reinjects the #GP accordingly.
Otherwise, it calls into the appropriate interception handler for
VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX.
However, the intercept handlers need to do the RAX check, because if the
guest has a smaller MAXPHYADDR, RAX could be legal from the hardware
perspective (i.e. CPU does not inject #GP), but not from the vCPU's
perspective. Note that with allow_smaller_maxphyaddr, neither NPT nor VLS
can be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can
always be checked against the vCPU's MAXPHYADDR.
Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as
the CPU does not check RAX before the interception. Read RAX using
kvm_register_read() to avoid a false negative on page_address_valid() on
32-bit due to garbage in the higher bits.
Keep the check in the #GP intercept handler in the nested case where
a #VMEXIT is synthesized into L1, as the RAX check is still needed there
and takes precedence over the intercept.
Opportunistically add a FIXME about the #VMEXIT being synthesized into
L1, as it needs to be conditional.
Yosry Ahmed [Mon, 16 Mar 2026 20:27:26 +0000 (20:27 +0000)]
KVM: SVM: Properly check RAX on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if
the instruction was executed with a misaligned RAX. However, a #GP
should also be reinjected if RAX contains an illegal GPA. According to
the APM, one of the #GP conditions is:
rAX referenced a physical address above the maximum
supported physical address.
Replace the PAGE_MASK check with page_address_valid(), which checks both
page-alignment as well as the legality of the GPA based on the vCPU's
MAXPHYADDR. Use kvm_register_read() to read RAX so that bits 63:32 are
dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive
when checking the validity of the address.
Note that this is currently only a problem if KVM is running an L2 guest
and ends up synthesizing a #VMEXIT to L1, as the RAX check takes
precedence over the intercept. Otherwise, if KVM emulates the
instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP
anyway. However, following patches will change the failure behavior of
kvm_vcpu_map(), so make sure the #GP interception handler does this
appropriately.
Opportunistically drop a teaser FIXME about the SVM instructions
handling on #GP belonging in the emulator.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Fixes: d1cba6c92237 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org
[sean: massage wording with respect to kvm_register_read()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
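The effect of reading RAX through kvm_register_read() in the commit above can be sketched in user-space C. This is only a rough model: read_gpr() and its is_64bit_mode parameter are hypothetical stand-ins, not the kernel helper or its signature.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a mode-aware register read: outside 64-bit mode
 * only the low 32 bits of a GPR are architecturally meaningful, so any
 * stale data in bits 63:32 is dropped before the value is validated.
 */
static uint64_t read_gpr(uint64_t raw, bool is_64bit_mode)
{
	return is_64bit_mode ? raw : (uint32_t)raw;
}
```

With stale upper bits, e.g. raw = 0xdeadbeef00001000, the 32-bit read yields 0x1000, so a later alignment and MAXPHYADDR check sees only the address the guest actually supplied.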
Yosry Ahmed [Mon, 16 Mar 2026 20:27:25 +0000 (20:27 +0000)]
KVM: SVM: Refactor SVM instruction handling on #GP intercept
Instead of returning an opcode from svm_instr_opcode() and then passing
it to emulate_svm_instr(), which uses it to find the corresponding exit
code and intercept handler, return the exit code directly from
svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code().
emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the
intercept handler, so open-code it in gp_interception(), and use
svm_invoke_exit_handler() to call the intercept handler based on
the exit code. This allows for dropping the SVM_INSTR_* enum, and the
const array mapping its values to exit codes and intercept handlers.
In gp_interception(), handle SVM instructions first with an early return,
and invert is_guest_mode() checks, un-indenting the rest of the code.
Yosry Ahmed [Mon, 16 Mar 2026 20:27:24 +0000 (20:27 +0000)]
KVM: SVM: Properly check RAX in the emulator for SVM instructions
Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the
physical address in RAX is not supported. check_svme_pa() hardcodes this
to checking that bits 63-48 are not set. This is incorrect on HW
supporting 52 bits of physical address space. Additionally, the emulator
does not check if the address is not aligned, which should also result
in #GP.
Use page_address_valid() which properly checks alignment and the address
legality based on the guest's MAXPHYADDR. Plumb it through
x86_emulate_ops, similar to is_canonical_addr(), to avoid directly
accessing the vCPU object in emulator code.
Fixes: 01de8b09e606 ("KVM: SVM: Add intercept checks for SVM instructions")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
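The difference between a hardcoded "bits 63-48 clear" test and a check against the guest's MAXPHYADDR, as described in the commit above, can be illustrated with a small sketch. The function name, the SKETCH_PAGE_SIZE constant, and passing MAXPHYADDR as a plain integer are assumptions made for this example; the kernel's page_address_valid() takes a vCPU and is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4 KiB pages */

/*
 * Hypothetical model of the check: the GPA must be page aligned and must
 * not have any bit set at or above the guest's MAXPHYADDR.
 */
static bool gpa_is_page_aligned_and_legal(uint64_t gpa, unsigned int maxphyaddr)
{
	uint64_t rsvd_bits;

	/* Avoid an undefined 64-bit shift if every address bit is legal. */
	if (maxphyaddr >= 64)
		rsvd_bits = 0;
	else
		rsvd_bits = ~((1ULL << maxphyaddr) - 1);

	return !(gpa & (SKETCH_PAGE_SIZE - 1)) && !(gpa & rsvd_bits);
}
```

For a guest with MAXPHYADDR = 52, an address with bit 48 set passes this check, whereas a fixed test on bits 63-48 would reject it. An unaligned address such as 0x1001 fails regardless of MAXPHYADDR.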
KVM: x86: Suppress WARNs on nested_run_pending after userspace exit
To end an ongoing game of whack-a-mole between KVM and syzkaller, WARN on
illegally cancelling a pending nested VM-Enter if and only if userspace
has NOT gained control of the vCPU since the nested run was initiated. As
proven time and time again by syzkaller, userspace can clobber vCPU state
so as to force a VM-Exit that violates KVM's architectural modelling of
VMRUN/VMLAUNCH/VMRESUME.
To detect that userspace has gained control, while minimizing the risk of
operating on stale data, convert nested_run_pending from a pure boolean to
a tri-state of sorts, where '0' is still "not pending", '1' is "pending",
and '2' is "pending but untrusted". Then on KVM_RUN, if the flag is in
the "trusted pending" state, move it to "untrusted pending".
Note, moving the state to "untrusted" even if KVM_RUN is ultimately
rejected is a-ok, because for the "untrusted" state to matter, KVM must
get past kvm_x86_vcpu_pre_run() at some point for the vCPU.
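The tri-state described above can be modelled roughly as follows. The enum and function names are hypothetical; the changelog only specifies the values 0, 1, and 2 and the transition performed on KVM_RUN.

```c
#include <stdbool.h>

/* Hypothetical names for the three values named in the changelog. */
enum nested_run_state {
	NESTED_RUN_NOT_PENDING = 0,
	NESTED_RUN_PENDING = 1,			/* "trusted" pending */
	NESTED_RUN_PENDING_UNTRUSTED = 2,	/* userspace had control */
};

/* Model of KVM_RUN entry: demote a trusted pending run to untrusted. */
static enum nested_run_state on_kvm_run(enum nested_run_state s)
{
	return s == NESTED_RUN_PENDING ? NESTED_RUN_PENDING_UNTRUSTED : s;
}

/*
 * Cancelling a pending nested run only warrants a WARN if userspace has
 * not gained control since the run was initiated.
 */
static bool cancel_should_warn(enum nested_run_state s)
{
	return s == NESTED_RUN_PENDING;
}
```

In this model, a cancellation observed before any return to userspace warns, while the same cancellation after a KVM_RUN round trip does not, because userspace may have clobbered the vCPU state in between.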
Yosry Ahmed [Thu, 12 Mar 2026 23:48:22 +0000 (16:48 -0700)]
KVM: x86: Move nested_run_pending to kvm_vcpu_arch
Move the nested_run_pending field, present in both svm_nested_state and
nested_vmx, to the common kvm_vcpu_arch. This allows common code to
use it without plumbing it through per-vendor helpers.
nested_run_pending remains zero-initialized, as the entire kvm_vcpu
struct is, and all further accesses are done through vcpu->arch instead
of svm->nested or vmx->nested.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
[sean: expand the comment in the field declaration]
Link: https://patch.msgid.link/20260312234823.3120658-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Fri, 6 Mar 2026 21:08:56 +0000 (21:08 +0000)]
KVM: nSVM: Simplify error handling of nested_svm_copy_vmcb12_to_cache()
nested_svm_vmrun() currently stores the return value of
nested_svm_copy_vmcb12_to_cache() in a local variable 'err', separate
from the generally used 'ret' variable. This is done to have a single
call to kvm_skip_emulated_instruction(), such that we can store the
return value of kvm_skip_emulated_instruction() in 'ret', and then
re-check the return value of nested_svm_copy_vmcb12_to_cache() in 'err'.
The code is unnecessarily confusing. Instead, call
kvm_skip_emulated_instruction() in the failure path of
nested_svm_copy_vmcb12_to_cache() if the return value is not -EFAULT,
and drop 'err'.
KVM: SVM: Add a helper to get LBR field pointer to dedup MSR accesses
Add a helper to get a pointer to the corresponding VMCB field given an LBR
MSR index, and use it to dedup the handling in svm_{g,s}et_msr().
No functional change intended.
Suggested-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260310220414.2569208-1-seanjc@google.com
[sean: use KVM_BUG_ON() instead of BUILD_BUG(), clang ain't smart enough]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Mon, 9 Feb 2026 19:51:42 +0000 (19:51 +0000)]
KVM: selftests: Add a test for L2 clearing EFER.SVME without intercept
Add a test that verifies KVM's newly introduced behavior of synthesizing
a triple fault in L1 if L2 clears EFER.SVME without an L1 interception
(which is architecturally undefined).
Yosry Ahmed [Mon, 9 Feb 2026 19:51:41 +0000 (19:51 +0000)]
KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2
KVM tracks when EFER.SVME is set and cleared to initialize and tear down
nested state. However, it doesn't differentiate if EFER.SVME is getting
toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept
the EFER write, KVM exits guest mode and tears down nested state while
L2 is running, executing L1 without injecting a proper #VMEXIT.
According to the APM:
The effect of turning off EFER.SVME while a guest is running is
undefined; therefore, the VMM should always prevent guests from
writing EFER.
Since the behavior is architecturally undefined, KVM gets to choose what
to do. Inject a triple fault into L1 as a more graceful option than
running L1 with corrupted state.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:19 +0000 (00:34 +0000)]
KVM: nSVM: Only copy SVM_MISC_ENABLE_NP from VMCB01's misc_ctl
The 'misc_ctl' field in VMCB02 is taken as-is from VMCB01. However, the
only bit that needs to be copied is SVM_MISC_ENABLE_NP, as all other known
bits in misc_ctl are related to SEV guests, and KVM doesn't support
nested virtualization for SEV guests.
Only copy SVM_MISC_ENABLE_NP to harden against future bugs if/when other
bits are set for L1 but should not be set for L2.
Opportunistically add a comment explaining why SVM_MISC_ENABLE_NP is
taken from VMCB01 and not VMCB02.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:18 +0000 (00:34 +0000)]
KVM: nSVM: Sanitize INT/EVENTINJ fields when copying from vmcb12
Make sure all fields used from vmcb12 in creating the vmcb02 are
sanitized, such that no unhandled or reserved bits end up in the vmcb02.
The following control fields are read from vmcb12 and have bits that are
either reserved or not handled/advertised by KVM: tlb_ctl, int_ctl,
int_state, int_vector, event_inj, misc_ctl, and misc_ctl2.
The following fields do not require any extra sanitizing:
- tlb_ctl: already being sanitized.
- int_ctl: bits from vmcb12 are copied bit-by-bit as needed.
- misc_ctl: only used in consistency checks (particularly NP_ENABLE).
- misc_ctl2: bits from vmcb12 are copied bit-by-bit as needed.
For the remaining fields (int_vector, int_state, and event_inj), make
sure only defined bits are copied from L1's vmcb12 into KVM's cache by
defining appropriate masks where needed.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:17 +0000 (00:34 +0000)]
KVM: nSVM: Sanitize TLB_CONTROL field when copying from vmcb12
The APM defines possible values for TLB_CONTROL as 0, 1, 3, and 7 -- all
of which are always allowed for KVM guests as KVM always supports
X86_FEATURE_FLUSHBYASID. Only copy bits 0 to 2 from vmcb12's
TLB_CONTROL, such that no unhandled or reserved bits end up in vmcb02.
Note that TLB_CONTROL in vmcb12 is currently ignored by KVM, as it nukes
the TLB on nested transitions anyway (see
nested_svm_transition_tlb_flush()). However, such sanitization will be
needed once the TODOs there are addressed, and it's minimal churn to add
it now.
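As a minimal sketch of the sanitization described above, keeping only bits 0 to 2 amounts to a single mask. The mask and function names are hypothetical, and the 8-bit field width is an assumption of this example.

```c
#include <stdint.h>

/* Hypothetical name: bits 2:0 cover the APM-defined values 0, 1, 3, and 7. */
#define SKETCH_TLB_CONTROL_MASK 0x7u

/* Drop every bit outside bits 2:0 before the value reaches the cache. */
static uint8_t sanitize_tlb_ctl(uint8_t vmcb12_tlb_ctl)
{
	return vmcb12_tlb_ctl & SKETCH_TLB_CONTROL_MASK;
}
```

Note that the mask only strips undefined high bits; it does not by itself reject values within bits 2:0 that the APM leaves undefined.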
Yosry Ahmed [Tue, 3 Mar 2026 00:34:16 +0000 (00:34 +0000)]
KVM: nSVM: Use PAGE_MASK to drop lower bits of bitmap GPAs from vmcb12
Use PAGE_MASK to drop the lower bits from IOPM_BASE_PA and MSRPM_BASE_PA
while copying them instead of dropping the bits afterward with a
hardcoded mask.
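For readers unfamiliar with the idiom, a PAGE_MASK-style mask can be sketched in user-space C like this. The SKETCH_* constants assume 4 KiB pages and the function name is hypothetical; the kernel's own PAGE_MASK definition is not reproduced here.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Clear the in-page offset bits of a bitmap base address while copying it. */
static uint64_t bitmap_base_from_vmcb12(uint64_t base_pa)
{
	return base_pa & SKETCH_PAGE_MASK;
}
```

Deriving the mask from the page size keeps the intent visible at the copy site, which a bare hexadecimal constant applied later does not.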
Yosry Ahmed [Tue, 3 Mar 2026 00:34:15 +0000 (00:34 +0000)]
KVM: nSVM: Restrict mapping vmcb12 on nested VMRUN
All accesses to the vmcb12 in the guest memory on nested VMRUN are
limited to nested_svm_vmrun() copying vmcb12 fields and writing them on
failed consistency checks. However, vmcb12 remains mapped throughout
nested_svm_vmrun(). Mapping and unmapping around usages is possible,
but it becomes easy-ish to introduce bugs where 'vmcb12' is used after
being unmapped.
Move reading the vmcb12, copying to cache, and consistency checks from
nested_svm_vmrun() into a new helper, nested_svm_copy_vmcb12_to_cache()
to limit the scope of the mapping.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:14 +0000 (00:34 +0000)]
KVM: nSVM: Cache all used fields from VMCB12
Currently, most fields used from VMCB12 are cached in
svm->nested.{ctl/save}. This is mainly to avoid TOC-TOU bugs. However,
for the save area, only the fields used in the consistency checks (i.e.
nested_vmcb_check_save()) were being cached. Other fields are read
directly from guest memory in nested_vmcb02_prepare_save().
While probably benign, this still makes it possible for TOC-TOU bugs to
happen. For example, RAX, RSP, and RIP are read twice, once to store in
VMCB02, and once to store in vcpu->arch.regs. It is possible for the
guest to modify the value between both reads, potentially causing nasty
bugs.
Harden against such bugs by caching everything in svm->nested.save.
Cache all the needed fields, and keep all accesses to the VMCB12
strictly in nested_svm_vmrun() for caching and early error injection.
Following changes will further limit the access to the VMCB12 in the
nested VMRUN path.
Introduce vmcb12_is_dirty() to use with the cached control fields
instead of vmcb_is_dirty(), similar to vmcb12_is_intercept().
Opportunistically order the copies in __nested_copy_vmcb_save_to_cache()
by the order in which the fields are defined in struct vmcb_save_area.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:13 +0000 (00:34 +0000)]
KVM: SVM: Rename vmcb->virt_ext to vmcb->misc_ctl2
'virt' is confusing in the VMCB because it is relative and ambiguous.
The 'virt_ext' field includes bits for LBR virtualization and
VMSAVE/VMLOAD virtualization, so it's just another miscellaneous control
field. Name it as such.
While at it, move the definitions of the bits below those for
'misc_ctl' and rename them for consistency.
KVM: SVM: Rename vmcb->nested_ctl to vmcb->misc_ctl
The 'nested_ctl' field is misnamed. Although the first bit is for nested
paging, the other defined bits are for SEV/SEV-ES. Other bits in the
same field according to the APM (but not defined by KVM) include "Guest
Mode Execution Trap", "Enable INVLPGB/TLBSYNC", and other control bits
unrelated to 'nested'.
There is nothing common among these bits, so just name the field
misc_ctl. Also rename the flags accordingly.
KVM: nSVM: Capture svm->nested.ctl as vmcb12_ctrl when preparing vmcb02
Grab svm->nested.ctl as vmcb12_ctrl when preparing the vmcb02 controls to
make it more obvious that much of the data is coming from vmcb12 (or
rather, a snapshot of vmcb12 at the time of L1's VMRUN).
Opportunistically reorder the variable definitions to create a pretty
reverse fir tree.
KVM: nSVM: Move vmcb_ctrl_area_cached.bus_lock_rip to svm_nested_state
Move "bus_lock_rip" from "vmcb_ctrl_area_cached" to "svm_nested_state" as
"last_bus_lock_rip" to more accurately reflect what it tracks, and because
it is NOT a cached vmcb12 control field. The misplaced field isn't all
that apparent in the current code base, as KVM uses "svm->nested.ctl"
broadly, but the bad placement becomes glaringly obvious if
"svm->nested.ctl" is captured as a local "vmcb12_ctrl" variable.
KVM: nSVM: Use intuitive local variables in nested_vmcb02_recalc_intercepts()
Now that nested_vmcb02_recalc_intercepts() is explicitly scoped to deal
with *only* recalculating vmcb02 intercepts, rename its local variables to
use more intuitive names. The current "c", "h", and "g" local variables,
for the current VMCB, vmcb01, and (cached) vmcb12 respectively, are short
and sweet, but don't do much to help unfamiliar readers understand what
the code is doing.
Use vmcb02/vmcb01/vmcb12_ctrl in lieu of c/h/g to make it clear
the function is updating intercepts in vmcb02 based on the intercepts in
vmcb01 and (cached) vmcb12.
Opportunistically change the existing WARN_ON to a WARN_ON_ONCE so that a
KVM bug doesn't unintentionally DoS the host.
KVM: nSVM: Directly (re)calc vmcb02 intercepts from nested_vmcb02_prepare_control()
Now that nested_vmcb02_recalc_intercepts() provides guardrails against it
being incorrectly called without vmcb02 active, invoke it directly from
nested_vmcb02_prepare_control() instead of bouncing through
svm_mark_intercepts_dirty(), which unnecessarily marks vmcb01 as dirty.
Yosry Ahmed [Wed, 18 Feb 2026 23:09:53 +0000 (15:09 -0800)]
KVM: nSVM: WARN and abort vmcb02 intercepts recalc if vmcb02 isn't active
WARN and bail early from nested_vmcb02_recalc_intercepts() if vmcb02 isn't
the active/current VMCB, as recalculating intercepts for vmcb01 using logic
intended for merging vmcb12 and vmcb01 intercepts can yield unexpected and
unwanted results.
In addition to hardening against general bugs, this will provide additional
safeguards "if" nested_vmcb02_recalc_intercepts() is invoked directly from
nested_vmcb02_prepare_control().
KVM: SVM: Separate recalc_intercepts() into nested vs. non-nested parts
Extract the non-nested aspects of recalc_intercepts() into a separate
helper, svm_mark_intercepts_dirty(), to make it clear that the call isn't
*just* recalculating (vmcb02's) intercepts, and to not bury non-nested
code in nested.c.
As suggested by Yosry, opportunistically prepend "nested_vmcb02_" to
recalc_intercepts() so that it's obvious the function specifically deals
with recomputing intercepts for L2.
Kevin Cheng [Wed, 4 Mar 2026 00:30:10 +0000 (16:30 -0800)]
KVM: SVM: Recalc instructions intercepts when EFER.SVME is toggled
The AMD APM states that VMRUN, VMLOAD, VMSAVE, CLGI, VMMCALL, and
INVLPGA instructions should generate a #UD when EFER.SVME is cleared.
Currently, when VMLOAD, VMSAVE, or CLGI are executed in L1 with
EFER.SVME cleared, no #UD is generated in certain cases. This is because
the intercepts for these instructions are cleared based on whether or
not vls or vgif is enabled. The #UD fails to be generated when the
intercepts are absent.
Fix the missing #UD generation by ensuring that all relevant
instructions have intercepts set when EFER.SVME is disabled.
VMMCALL is special because KVM's ABI is that VMCALL/VMMCALL are always
supported for L1 and never fault.
Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: isolate Intel CPU "compatibility" in EFER.SVME=1 path]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260304003010.1108257-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Kevin Cheng [Wed, 4 Mar 2026 00:30:09 +0000 (16:30 -0800)]
KVM: SVM: Move STGI and CLGI intercept handling
Move STGI/CLGI intercept handling to svm_recalc_instruction_intercepts()
in preparation for making the function EFER.SVME-aware. This will allow
configuring STGI/CLGI intercepts along with other intercepts for other SVM
instructions when EFER.SVME is toggled (KVM needs to intercept SVM
instructions when EFER.SVME=0 to inject #UD).
When clearing the STGI intercept in particular, request KVM_REQ_EVENT if
there is at least one pending GIF-controlled event. This avoids breaking
NMI/SMI window tracking, as enable_{nmi,smi}_window() sets INTERCEPT_STGI
to detect when NMIs become unblocked. KVM_REQ_EVENT forces
kvm_check_and_inject_events() to re-evaluate pending events and re-enable
the intercept if needed.
Extract the pending GIF event check into a helper function
svm_has_pending_gif_event() to deduplicate the logic between
svm_recalc_instruction_intercepts() and svm_set_gif().
Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: keep vgif handling out of the "Intel CPU model" path]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260304003010.1108257-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: nSVM: Always intercept VMMCALL when L2 is active
Always intercept VMMCALL now that KVM properly synthesizes a #UD as
appropriate, i.e. when L1 doesn't want to intercept VMMCALL, to avoid
putting L2 into an infinite #UD loop if KVM_X86_QUIRK_FIX_HYPERCALL_INSN
is enabled.
By letting L2 execute VMMCALL natively and thus #UD, for all intents and
purposes KVM morphs the VMMCALL intercept into a #UD intercept (KVM always
intercepts #UD). When the hypercall quirk is enabled, KVM "emulates"
VMMCALL in response to the #UD by trying to fixup the opcode to the "right"
vendor, then restarts the guest, without skipping the VMMCALL. As a
result, the guest sees an endless stream of #UDs since it's already
executing the correct vendor hypercall instruction, i.e. the emulator
doesn't anticipate that the #UD could be due to lack of interception, as
opposed to a truly undefined opcode.
Fixes: 0d945bd93511 ("KVM: SVM: Don't allow nested guest to VMMCALL into host")
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260304002223.1105129-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Kevin Cheng [Wed, 4 Mar 2026 00:22:22 +0000 (16:22 -0800)]
KVM: nSVM: Raise #UD if unhandled VMMCALL isn't intercepted by L1
Explicitly synthesize a #UD for VMMCALL if L2 is active, L1 does NOT want
to intercept VMMCALL, nested_svm_l2_tlb_flush_enabled() is true, and the
hypercall is something other than one of the supported Hyper-V hypercalls.
When all of the above conditions are met, KVM will intercept VMMCALL but
never forward it to L1, i.e. will let L2 make hypercalls as if it were L1.
The TLFS says a whole lot of nothing about this scenario, so go with the
architectural behavior, which says that VMMCALL #UDs if it's not
intercepted.
Opportunistically do a 2-for-1 stub trade by stub-ifying the new API
instead of the helpers it uses. The last remaining "single" stub will
soon be dropped as well.
Suggested-by: Sean Christopherson <seanjc@google.com>
Fixes: 3f4a812edf5c ("KVM: nSVM: hyper-v: Enable L2 TLB flush")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kevin Cheng <chengkev@google.com>
Link: https://patch.msgid.link/20260228033328.2285047-5-chengkev@google.com
[sean: rewrite changelog and comment, tag for stable, remove defunct stubs]
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260304002223.1105129-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: SVM: Explicitly mark vmcb01 dirty after modifying VMCB intercepts
When reacting to an intercept update, explicitly mark vmcb01's intercepts
dirty, as KVM always initially operates on vmcb01, and nested_svm_vmexit()
isn't guaranteed to mark VMCB_INTERCEPTS as dirty. I.e. if L2 is active,
KVM will modify the intercepts for L1, but might not mark them as dirty
before the next VMRUN of L1.
Fixes: 116a0a23676e ("KVM: SVM: Add clean-bit for intercetps, tsc-offset and pause filter count")
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218230958.2877682-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:34:11 +0000 (00:34 +0000)]
KVM: nSVM: Add missing consistency check for EVENTINJ
According to the APM Volume #2, 15.20 (24593—Rev. 3.42—March 2024):
VMRUN exits with VMEXIT_INVALID error code if either:
• Reserved values of TYPE have been specified, or
• TYPE = 3 (exception) has been specified with a vector that does not
correspond to an exception (this includes vector 2, which is an NMI,
not an exception).
Add the missing consistency checks to KVM. For the second point, inject
VMEXIT_INVALID if the vector is anything but the vectors defined by the
APM for exceptions. Reserved vectors are also considered invalid, which
matches the HW behavior. Vector 9 (i.e. #CSO) is considered invalid
because it is reserved on modern CPUs, and according to LLMs no CPUs
exist supporting SVM and producing #CSOs.
Defined exceptions could be different between virtual CPUs as new CPUs
define new vectors. In a best effort to dynamically define the valid
vectors, make all currently defined vectors as valid except those
obviously tied to a CPU feature: SHSTK -> #CP and SEV-ES -> #VC. As new
vectors are defined, they can similarly be tied to corresponding CPU
features.
Invalid vectors on specific (e.g. old) CPUs that are missed by KVM
should be rejected by HW anyway.
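The two EVENTINJ checks quoted above can be sketched as follows. This is an illustration only: all SKETCH_* names and both functions are hypothetical, the feature flags are passed as plain booleans, the check is assumed to apply only when the valid bit is set, and the vector list is a simplified subset that may not match the table KVM actually uses (for example, vectors 28 and 30 are left out).

```c
#include <stdbool.h>
#include <stdint.h>

/* EVENTINJ layout used by this sketch: vector 7:0, type 10:8, valid 31. */
#define SKETCH_EVTINJ_VEC_MASK   0xffu
#define SKETCH_EVTINJ_TYPE_SHIFT 8
#define SKETCH_EVTINJ_TYPE_MASK  (0x7u << SKETCH_EVTINJ_TYPE_SHIFT)
#define SKETCH_EVTINJ_VALID      (1u << 31)

enum {
	SKETCH_TYPE_INTR = 0,
	SKETCH_TYPE_NMI = 2,
	SKETCH_TYPE_EXCEPTION = 3,
	SKETCH_TYPE_SOFT = 4,
};

/* Simplified, assumed list of exception vectors; not KVM's actual table. */
static bool vector_is_exception(uint8_t vec, bool has_shstk, bool has_sev_es)
{
	switch (vec) {
	case 0: case 1: case 3: case 4: case 5: case 6: case 7: case 8:
	case 10: case 11: case 12: case 13: case 14:
	case 16: case 17: case 18: case 19:
		return true;
	case 21:	/* #CP, tied to SHSTK */
		return has_shstk;
	case 29:	/* #VC, tied to SEV-ES */
		return has_sev_es;
	default:	/* includes 2 (NMI), 9 (#CSO), and reserved vectors */
		return false;
	}
}

static bool event_inj_is_consistent(uint32_t event_inj, bool has_shstk,
				    bool has_sev_es)
{
	uint32_t type = (event_inj & SKETCH_EVTINJ_TYPE_MASK) >>
			SKETCH_EVTINJ_TYPE_SHIFT;
	uint8_t vec = event_inj & SKETCH_EVTINJ_VEC_MASK;

	/* Assumption: nothing to check if no event is being injected. */
	if (!(event_inj & SKETCH_EVTINJ_VALID))
		return true;

	switch (type) {
	case SKETCH_TYPE_INTR:
	case SKETCH_TYPE_NMI:
	case SKETCH_TYPE_SOFT:
		return true;
	case SKETCH_TYPE_EXCEPTION:
		return vector_is_exception(vec, has_shstk, has_sev_es);
	default:	/* reserved TYPE encodings */
		return false;
	}
}
```

Under these assumptions, injecting TYPE = 3 with vector 14 (#PF) is accepted, while TYPE = 3 with vector 2 or 9, or any reserved TYPE encoding, would be reported as VMEXIT_INVALID.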
Yosry Ahmed [Tue, 3 Mar 2026 00:34:10 +0000 (00:34 +0000)]
KVM: nSVM: Add missing consistency check for EFER, CR0, CR4, and CS
According to the APM Volume #2, 15.5, Canonicalization and Consistency
Checks (24593—Rev. 3.42—March 2024), the following condition (among
others) results in a #VMEXIT with VMEXIT_INVALID (aka SVM_EXIT_ERR):
EFER.LME, CR0.PG, CR4.PAE, CS.L, and CS.D are all non-zero.
In the list of consistency checks done when EFER.LME and CR0.PG are set,
add a check that CS.L and CS.D are not both set, after the existing
check that CR4.PAE is set.
This is functionally a nop because the nested VMRUN results in
SVM_EXIT_ERR in HW, which is forwarded to L1, but KVM makes all
consistency checks before a VMRUN is actually attempted.
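The ordering of the long-mode checks described above can be sketched as below. The function name and the SKETCH_* constants are hypothetical, CS.L and CS.D are passed as booleans rather than decoded from the segment attributes, and other checks in the same group (such as CR0.PE) are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EFER_LME (1ULL << 8)
#define SKETCH_CR0_PG   (1ULL << 31)
#define SKETCH_CR4_PAE  (1ULL << 5)

/*
 * Hypothetical model: when EFER.LME and CR0.PG are both set, CR4.PAE must
 * be set, and CS.L and CS.D must not both be set.
 */
static bool long_mode_state_is_consistent(uint64_t efer, uint64_t cr0,
					  uint64_t cr4, bool cs_l, bool cs_d)
{
	if ((efer & SKETCH_EFER_LME) && (cr0 & SKETCH_CR0_PG)) {
		if (!(cr4 & SKETCH_CR4_PAE))
			return false;
		if (cs_l && cs_d)
			return false;
	}

	return true;
}
```

Outside long mode (EFER.LME or CR0.PG clear), the CS.L/CS.D combination is not examined by this sketch at all, mirroring the placement of the new check inside the existing EFER.LME && CR0.PG block.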
Yosry Ahmed [Tue, 3 Mar 2026 00:34:09 +0000 (00:34 +0000)]
KVM: nSVM: Add missing consistency check for nCR3 validity
From the APM Volume #2, 15.25.4 (24593—Rev. 3.42—March 2024):
When VMRUN is executed with nested paging enabled (NP_ENABLE = 1), the
following conditions are considered illegal state combinations, in
addition to those mentioned in “Canonicalization and Consistency Checks”:
• Any MBZ bit of nCR3 is set.
• Any G_PAT.PA field has an unsupported type encoding or any
reserved field in G_PAT has a nonzero value.
Add the consistency check for nCR3 being a legal GPA with no MBZ bits
set. Note, the G_PAT.PA check is being handled separately[*].
Yosry Ahmed [Tue, 3 Mar 2026 00:34:08 +0000 (00:34 +0000)]
KVM: nSVM: Drop the non-architectural consistency check for NP_ENABLE
KVM currently fails a nested VMRUN and injects VMEXIT_INVALID (aka
SVM_EXIT_ERR) if L1 sets NP_ENABLE and the host does not support NPTs.
At first glance, it seems like the check should actually be for
guest_cpu_cap_has(X86_FEATURE_NPT) instead, as it is possible for the
host to support NPTs but the guest CPUID to not advertise it.
However, the consistency check is not architectural to begin with. The
APM does not mention VMEXIT_INVALID if NP_ENABLE is set on a processor
that does not have X86_FEATURE_NPT. Hence, NP_ENABLE should be ignored
if X86_FEATURE_NPT is not available for L1, so sanitize it when copying
from the VMCB12 to KVM's cache.
Apart from the consistency check, NP_ENABLE in VMCB12 is currently
ignored because the bit is actually copied from VMCB01 to VMCB02, not
from VMCB12.
Fixes: 4b16184c1cca ("KVM: SVM: Initialize Nested Nested MMU context on VMRUN")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-15-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:34:06 +0000 (00:34 +0000)]
KVM: nSVM: Clear tracking of L1->L2 NMI and soft IRQ on nested #VMEXIT
KVM clears tracking of L1->L2 injected NMIs (i.e. nmi_l1_to_l2) and soft
IRQs (i.e. soft_int_injected) on a synthesized #VMEXIT(INVALID) due to
failed VMRUN. However, they are not explicitly cleared in other
synthesized #VMEXITs.
soft_int_injected is always cleared after the first VMRUN of L2 when
completing interrupts, as any re-injection is then tracked by KVM
(instead of purely in vmcb02).
nmi_l1_to_l2 is not cleared after the first VMRUN if NMI injection
failed, as KVM still needs to keep track that the NMI originated from L1
to avoid blocking NMIs for L1. It is only cleared when the NMI injection
succeeds.
KVM could synthesize a #VMEXIT to L1 before successfully injecting the
NMI into L2 (e.g. due to a #NPF on L2's NMI handler in L1's NPTs). In
this case, nmi_l1_to_l2 will remain true, and KVM may not correctly mask
NMIs and intercept IRET when injecting an NMI into L1.
Clear both nmi_l1_to_l2 and soft_int_injected in nested_svm_vmexit(), i.e.
for all #VMEXITs except those that occur due to failed consistency checks,
as those happen before nmi_l1_to_l2 or soft_int_injected are set.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:05 +0000 (00:34 +0000)]
KVM: nSVM: Clear EVENTINJ fields in vmcb12 on nested #VMEXIT
According to the APM, from the reference of the VMRUN instruction:
Upon #VMEXIT, the processor performs the following actions in order to
return to the host execution context:
...
clear EVENTINJ field in VMCB
KVM already syncs EVENTINJ fields from vmcb02 to cached vmcb12 on every
L2->L0 #VMEXIT. Since these fields are zeroed by the CPU on #VMEXIT, they
will mostly be zeroed in vmcb12 on nested #VMEXIT by nested_svm_vmexit().
However, this is not the case when:
1. Consistency checks fail, as nested_svm_vmexit() is not called.
2. Entering guest mode fails before L2 runs (e.g. due to failed load of
CR3).
(2) was broken by commit 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB
controls updated by the processor on every vmexit"), as prior to that
nested_svm_vmexit() always zeroed EVENTINJ fields.
Explicitly clear the fields in all nested #VMEXIT code paths.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Fixes: 2d8a42be0e2b ("KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexit")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260303003421.2185681-12-yosry@kernel.org
[sean: massage changelog formatting]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:34:04 +0000 (00:34 +0000)]
KVM: nSVM: Clear GIF on nested #VMEXIT(INVALID)
According to the APM, GIF is set to 0 on any #VMEXIT, including
an #VMEXIT(INVALID) due to failed consistency checks. Clear GIF on
consistency check failures.
Yosry Ahmed [Tue, 3 Mar 2026 00:34:03 +0000 (00:34 +0000)]
KVM: nSVM: Triple fault if restore host CR3 fails on nested #VMEXIT
If loading L1's CR3 fails on a nested #VMEXIT, nested_svm_vmexit()
returns an error code that is ignored by most callers, and continues to
run L1 with corrupted state. A sane recovery is not possible in this
case, and HW behavior is to cause a shutdown. Inject a triple fault
instead, and do not return early from nested_svm_vmexit(). Continue
cleaning up the vCPU state (e.g. clear pending exceptions), to handle
the failure as gracefully as possible.
From the APM:
Upon #VMEXIT, the processor performs the following actions in order to
return to the host execution context:
...
if (illegal host state loaded, or exception while loading host state)
shutdown
else
execute first host instruction following the VMRUN
Remove the return value of nested_svm_vmexit(), which is mostly
unchecked anyway.
Fixes: d82aaef9c88a ("KVM: nSVM: use nested_svm_load_cr3() on guest->host switch") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-10-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:34:02 +0000 (00:34 +0000)]
KVM: nSVM: Triple fault if mapping VMCB12 fails on nested #VMEXIT
KVM currently injects a #GP and hopes for the best if mapping VMCB12
fails on nested #VMEXIT, and only if the failure mode is -EINVAL.
Mapping the VMCB12 could also fail if creating host mappings fails.
After the #GP is injected, nested_svm_vmexit() bails early, without
cleaning up (e.g. KVM_REQ_GET_NESTED_STATE_PAGES is set, is_guest_mode()
is true, etc).
Instead of optionally injecting a #GP, triple fault the guest if mapping
VMCB12 fails since KVM cannot make a sane recovery. The APM states that
a #VMEXIT will triple fault if host state is illegal or an exception
occurs while loading host state, so the behavior is not entirely made
up.
Do not return early from nested_svm_vmexit(), continue cleaning up the
vCPU state (e.g. switch back to vmcb01), to handle the failure as
gracefully as possible.
Fixes: cf74a78b229d ("KVM: SVM: Add VMEXIT handler and intercepts") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-9-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:34:00 +0000 (00:34 +0000)]
KVM: nSVM: Refactor checking LBRV enablement in vmcb12 into a helper
Refactor the vCPU cap and vmcb12 flag checks into a helper. The
unlikely() annotation is dropped, it's unlikely (huh) to make a
difference and the CPU will probably predict it better on its own.
Yosry Ahmed [Tue, 3 Mar 2026 00:33:59 +0000 (00:33 +0000)]
KVM: nSVM: Always inject a #GP if mapping VMCB12 fails on nested VMRUN
nested_svm_vmrun() currently only injects a #GP if kvm_vcpu_map() fails
with -EINVAL. But it could also fail with -EFAULT if creating a host
mapping failed. Inject a #GP in all cases, no reason to treat failure
modes differently.
Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-6-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
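
As an illustration only, the change in policy can be sketched with two
toy predicates. The function names are hypothetical and are not KVM code;
map_ret stands in for the return value of kvm_vcpu_map(), and what the
old path did for errors other than -EINVAL is deliberately not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Old policy: only -EINVAL from the mapping helper results in a #GP. */
static bool toy_injects_gp_old(int map_ret)
{
	return map_ret == -EINVAL;
}

/* New policy: any mapping failure (non-zero return) results in a #GP. */
static bool toy_injects_gp_new(int map_ret)
{
	return map_ret != 0;
}

/* Usage: the two policies differ only for failures other than -EINVAL. */
static void toy_gp_policy_example(void)
{
	assert(toy_injects_gp_old(-EINVAL) && toy_injects_gp_new(-EINVAL));
	assert(!toy_injects_gp_old(-EFAULT) && toy_injects_gp_new(-EFAULT));
	assert(!toy_injects_gp_old(0) && !toy_injects_gp_new(0));
}
```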
Yosry Ahmed [Tue, 3 Mar 2026 00:33:57 +0000 (00:33 +0000)]
KVM: SVM: Add missing save/restore handling of LBR MSRs
MSR_IA32_DEBUGCTLMSR and LBR MSRs are currently not enumerated by
KVM_GET_MSR_INDEX_LIST, and LBR MSRs cannot be set with KVM_SET_MSRS. So
save/restore is completely broken.
Fix it by adding the MSRs to msrs_to_save_base, and allowing writes to
LBR MSRs from userspace only (as they are read-only MSRs) if LBR
virtualization is enabled. Additionally, to correctly restore L1's LBRs
while L2 is running, make sure the LBRs are copied from the captured
VMCB01 save area in svm_copy_vmrun_state().
Note, for VMX, this also fixes a flaw where MSR_IA32_DEBUGCTLMSR isn't
reported as an MSR to save/restore.
Note #2, over-reporting MSR_IA32_LASTxxx on Intel is ok, as KVM already
handles unsupported reads and writes thanks to commit b5e2fec0ebc3 ("KVM:
Ignore DEBUGCTL MSRs with no effect") (kvm_do_msr_access() will morph the
unsupported userspace write into a nop).
Fixes: 24e09cbf480a ("KVM: SVM: enable LBR virtualization") Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260303003421.2185681-4-yosry@kernel.org
[sean: guard with lbrv checks, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 3 Mar 2026 00:33:56 +0000 (00:33 +0000)]
KVM: SVM: Switch svm_copy_lbrs() to a macro
In preparation for using svm_copy_lbrs() with 'struct vmcb_save_area'
without a containing 'struct vmcb', and later even 'struct
vmcb_save_area_cached', make it a macro.
Macros are generally not preferred compared to functions, mainly due to
type-safety. However, in this case it seems like having a simple macro
copying a few fields is better than copy-pasting the same 5 lines of
code in different places.
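
A minimal sketch of the idea, assuming two toy structures that share the
LBR member names. The structures, the macro name toy_copy_lbrs() and the
helper below are illustrative stand-ins, not the kernel definitions. The
point is that a macro works on any pair of objects with matching members,
which a typed function cannot do without duplication.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a full save area (assumption: real one is much larger). */
struct toy_save_area {
	uint64_t dbgctl, br_from, br_to, last_excp_from, last_excp_to;
	uint64_t rip;
};

/* Toy stand-in for a cached subset that has the same LBR member names. */
struct toy_save_area_cached {
	uint64_t dbgctl, br_from, br_to, last_excp_from, last_excp_to;
};

/* Copy the five LBR members between any two objects that have them. */
#define toy_copy_lbrs(to, from)					\
do {								\
	(to)->dbgctl		= (from)->dbgctl;		\
	(to)->br_from		= (from)->br_from;		\
	(to)->br_to		= (from)->br_to;		\
	(to)->last_excp_from	= (from)->last_excp_from;	\
	(to)->last_excp_to	= (from)->last_excp_to;		\
} while (0)

/* Round-trip br_from through the cached type and back. */
static uint64_t toy_roundtrip_br_from(uint64_t value)
{
	struct toy_save_area src = { .br_from = value, .rip = 0x1000 };
	struct toy_save_area_cached cache = { 0 };
	struct toy_save_area dst = { .rip = 0x2000 };

	toy_copy_lbrs(&cache, &src);	/* full -> cached */
	toy_copy_lbrs(&dst, &cache);	/* cached -> full */
	assert(dst.rip == 0x2000);	/* non-LBR members are untouched */
	return dst.br_from;
}
```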
Yosry Ahmed [Tue, 3 Mar 2026 00:33:55 +0000 (00:33 +0000)]
KVM: nSVM: Avoid clearing VMCB_LBR in vmcb12
svm_copy_lbrs() always marks VMCB_LBR dirty in the destination VMCB.
However, nested_svm_vmexit() uses it to copy LBRs to vmcb12, and
clearing clean bits in vmcb12 is not architecturally defined.
Move vmcb_mark_dirty() to callers and drop it for vmcb12.
This also facilitates incoming refactoring that does not pass the entire
VMCB to svm_copy_lbrs().
KVM: nSVM: Delay setting soft IRQ RIP tracking fields until vCPU run
In the save+restore path, when restoring nested state, the values of RIP
and CS base passed into nested_vmcb02_prepare_control() are mostly
incorrect. They are both pulled from the vmcb02. For CS base, the value
is only correct if system regs are restored before nested state. The
value of RIP is whatever the vCPU had in vmcb02 before restoring nested
state (zero on a freshly created vCPU).
Instead, take a similar approach to NextRIP, and delay initializing the
RIP tracking fields until shortly before the vCPU is run, to make sure
the most up-to-date values of RIP and CS base are used regardless of
KVM_SET_SREGS, KVM_SET_REGS, and KVM_SET_NESTED_STATE's relative
ordering.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-8-yosry@kernel.org
[sean: deal with the svm_cancel_injection() madness] Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Wed, 25 Feb 2026 00:59:48 +0000 (00:59 +0000)]
KVM: nSVM: Delay stuffing L2's current RIP into NextRIP until vCPU run
For guests with NRIPS disabled, L1 does not provide NextRIP when running
an L2 with an injected soft interrupt, instead it advances L2's RIP
before running it. KVM uses L2's current RIP as the NextRIP in vmcb02 to
emulate a CPU without NRIPS.
However, in svm_set_nested_state(), the value used for L2's current RIP
comes from vmcb02, which is just whatever the vCPU had in vmcb02 before
restoring nested state (zero on a freshly created vCPU). Passing the
cached RIP value instead (i.e. kvm_rip_read()) would only fix the issue
if registers are restored before nested state.
Instead, split the logic of setting NextRIP in vmcb02. Handle the
'normal' case of initializing vmcb02's NextRIP using NextRIP from vmcb12
(or KVM_GET_NESTED_STATE's payload) in nested_vmcb02_prepare_control().
Delay the special case of stuffing L2's current RIP into vmcb02's
NextRIP until shortly before the vCPU is run, to make sure the most
up-to-date value of RIP is used regardless of KVM_SET_REGS and
KVM_SET_NESTED_STATE's relative ordering.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-7-yosry@kernel.org
[sean: use new helper, svm_fixup_nested_rips()] Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Wed, 25 Feb 2026 00:59:47 +0000 (00:59 +0000)]
KVM: nSVM: Always use NextRIP as vmcb02's NextRIP after first L2 VMRUN
For guests with NRIPS disabled, L1 does not provide NextRIP when running
an L2 with an injected soft interrupt, instead it advances the current RIP
before running it. KVM uses the current RIP as the NextRIP in vmcb02 to
emulate a CPU without NRIPS.
However, after L2 runs the first time, NextRIP will be updated by the CPU
and/or KVM, and the current RIP is no longer the correct value to use in
vmcb02. Hence, after save/restore, use the current RIP if and only if a
nested run is pending, otherwise use NextRIP. Give soft_int_next_rip the
same treatment, as it's the same logic, just for a narrower use case.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-6-yosry@kernel.org
[sean: give soft_int_next_rip the same treatment] Signed-off-by: Sean Christopherson <seanjc@google.com>
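
The selection rule described above can be sketched as a pure function.
This is a simplified model under stated assumptions, not the KVM
implementation: the function and parameter names are hypothetical, and
the NRIPS-enabled case is assumed to always take NextRIP from vmcb12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the value to place in vmcb02's NextRIP.
 * Without NRIPS, L1 advanced RIP itself, so the current RIP is used,
 * but only while the nested run is still pending. Once L2 has run,
 * NextRIP is the value to use.
 */
static uint64_t toy_pick_vmcb02_next_rip(bool guest_has_nrips,
					 bool nested_run_pending,
					 uint64_t current_rip,
					 uint64_t vmcb12_next_rip)
{
	if (!guest_has_nrips && nested_run_pending)
		return current_rip;

	return vmcb12_next_rip;
}

/* Usage: the three interesting combinations. */
static void toy_next_rip_example(void)
{
	assert(toy_pick_vmcb02_next_rip(false, true, 0x100, 0x0) == 0x100);
	assert(toy_pick_vmcb02_next_rip(false, false, 0x100, 0x104) == 0x104);
	assert(toy_pick_vmcb02_next_rip(true, true, 0x100, 0x104) == 0x104);
}
```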
Yosry Ahmed [Wed, 25 Feb 2026 00:59:46 +0000 (00:59 +0000)]
KVM: selftests: Extend state_test to check next_rip
Similar to vGIF, extend state_test to make sure that next_rip is saved
correctly in nested state. GUEST_SYNC() in L2 causes IO emulation by
KVM, which advances the RIP to the value of next_rip. Hence, if next_rip
is saved correctly, its value should match the saved RIP value.
Yosry Ahmed [Wed, 25 Feb 2026 00:59:45 +0000 (00:59 +0000)]
KVM: selftests: Extend state_test to check vGIF
V_GIF_MASK is one of the fields written by the CPU after VMRUN, and
sync'd by KVM from vmcb02 to cached vmcb12 after running L2. Part of the
reason is to make sure V_GIF_MASK is saved/restored correctly, as the
cached vmcb12 is the payload of nested state.
Verify that V_GIF_MASK is saved/restored correctly in state_test by
enabling vGIF in vmcb12, toggling GIF in L2 at different GUEST_SYNC()
points, and verifying that V_GIF_MASK is correctly propagated to the
nested state.
Yosry Ahmed [Wed, 25 Feb 2026 00:59:44 +0000 (00:59 +0000)]
KVM: nSVM: Sync interrupt shadow to cached vmcb12 after VMRUN of L2
After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
fields written by the CPU from vmcb02 to the cached vmcb12. This is
because the cached vmcb12 is used as the authoritative copy of some of
the controls, and is the payload when saving/restoring nested state.
int_state is also written by the CPU, specifically bit 0 (i.e.
SVM_INTERRUPT_SHADOW_MASK) for nested VMs, but it is not sync'd to
cached vmcb12. This does not cause a problem if KVM_SET_NESTED_STATE
precedes KVM_SET_VCPU_EVENTS in the restore path, as an interrupt shadow
would be correctly restored to vmcb02 (KVM_SET_VCPU_EVENTS overwrites
what KVM_SET_NESTED_STATE restored in int_state).
However, if KVM_SET_VCPU_EVENTS precedes KVM_SET_NESTED_STATE, an
interrupt shadow would be restored into vmcb01 instead of vmcb02. This
would mostly be benign for L1 (delays an interrupt), but not for L2. For
L2, the vCPU could hang (e.g. if a wakeup interrupt is delivered before
a HLT that should have been in an interrupt shadow).
Sync int_state to the cached vmcb12 in nested_sync_control_from_vmcb02()
to avoid this problem. With that, KVM_SET_NESTED_STATE restores the
correct interrupt shadow state, and if KVM_SET_VCPU_EVENTS follows it
would overwrite it with the same value.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-3-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
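
The ordering dependency can be sketched with a toy model of the restore
path. Everything here is a hypothetical simplification: the structure,
the function name and the single-bit int_state are assumptions made for
illustration, and the model assumes L2 was saved with an interrupt shadow
active.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vmcb {
	bool int_shadow;	/* models bit 0 of int_state */
};

/*
 * Returns whether L2 ends up with the interrupt shadow after restore.
 * payload_synced: the saved nested state carries the shadow bit.
 * events_first:   KVM_SET_VCPU_EVENTS runs before KVM_SET_NESTED_STATE.
 */
static bool toy_l2_shadow_after_restore(bool payload_synced, bool events_first)
{
	struct toy_vmcb vmcb01 = { false }, vmcb02 = { false };
	struct toy_vmcb *active = &vmcb01;

	if (events_first) {
		active->int_shadow = true;		/* lands in vmcb01 */
		active = &vmcb02;			/* nested state restore */
		active->int_shadow = payload_synced;	/* from cached vmcb12 */
	} else {
		active = &vmcb02;			/* nested state restore */
		active->int_shadow = payload_synced;
		active->int_shadow = true;		/* events overwrite it */
	}

	return vmcb02.int_shadow;
}
```

In this model the shadow is lost only when the payload is not synced and
KVM_SET_VCPU_EVENTS runs first, which is the case the fix targets.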
Yosry Ahmed [Wed, 25 Feb 2026 00:59:43 +0000 (00:59 +0000)]
KVM: nSVM: Sync NextRIP to cached vmcb12 after VMRUN of L2
After VMRUN in guest mode, nested_sync_control_from_vmcb02() syncs
fields written by the CPU from vmcb02 to the cached vmcb12. This is
because the cached vmcb12 is used as the authoritative copy of some of
the controls, and is the payload when saving/restoring nested state.
NextRIP is also written by the CPU (in some cases) after VMRUN, but is
not sync'd to the cached vmcb12. As a result, it is corrupted after
save/restore (replaced by the original value written by L1 on nested
VMRUN). This could cause problems for both KVM (e.g. when injecting a
soft IRQ) and L1 (e.g. when using NextRIP to advance RIP after emulating
an instruction).
Fix this by sync'ing NextRIP to the cache after VMRUN of L2, but only
after completing interrupts (not in nested_sync_control_from_vmcb02()),
as KVM may update NextRIP (e.g. when re-injecting a soft IRQ).
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260225005950.3739782-2-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 24 Feb 2026 22:50:17 +0000 (22:50 +0000)]
KVM: nSVM: Ensure AVIC is inhibited when restoring a vCPU to guest mode
On nested VMRUN, KVM ensures AVIC is inhibited by requesting
KVM_REQ_APICV_UPDATE, triggering a check of inhibit reasons, finding
APICV_INHIBIT_REASON_NESTED, and disabling AVIC.
However, when KVM_SET_NESTED_STATE is performed on a vCPU not in guest
mode with AVIC enabled, KVM_REQ_APICV_UPDATE is not requested, and AVIC
is not inhibited.
Request KVM_REQ_APICV_UPDATE in the KVM_SET_NESTED_STATE path if AVIC is
active, similar to the nested VMRUN path.
Fixes: f44509f849fe ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running") Cc: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260224225017.3303870-1-yosry@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Tue, 10 Feb 2026 01:08:06 +0000 (01:08 +0000)]
KVM: nSVM: Mark all of vmcb02 dirty when restoring nested state
When restoring a vCPU in guest mode, any state restored before
KVM_SET_NESTED_STATE (e.g. KVM_SET_SREGS) will mark the corresponding
dirty bits in vmcb01, as it is the active VMCB before switching to
vmcb02 in svm_set_nested_state().
Hence, mark all fields in vmcb02 dirty in svm_set_nested_state() to
capture any previously restored fields.
Fixes: cc440cdad5b7 ("KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE") CC: stable@vger.kernel.org Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://patch.msgid.link/20260210010806.3204289-1-yosry.ahmed@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
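
A toy model of why the clean bits in vmcb02 go stale. The bit values, the
structure and the helper names are assumptions for illustration and do
not match the kernel definitions. A set bit means "clean", i.e. the CPU
may reuse its cached copy of that state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
	TOY_CLEAN_CR	= 1u << 0,
	TOY_CLEAN_SEG	= 1u << 1,
	TOY_CLEAN_ALL	= 3u,
};

struct toy_vmcb {
	uint32_t clean;
};

static void toy_mark_dirty(struct toy_vmcb *vmcb, uint32_t bit)
{
	vmcb->clean &= ~bit;
}

static void toy_mark_all_dirty(struct toy_vmcb *vmcb)
{
	vmcb->clean = 0;
}

/* Returns vmcb02's clean bits after a restore that sets SREGS first. */
static uint32_t toy_restore_clean_bits(bool mark_all_dirty)
{
	struct toy_vmcb vmcb01 = { TOY_CLEAN_ALL }, vmcb02 = { TOY_CLEAN_ALL };
	struct toy_vmcb *active = &vmcb01;

	/* KVM_SET_SREGS: dirties the active VMCB, which is still vmcb01. */
	toy_mark_dirty(active, TOY_CLEAN_SEG);
	assert(vmcb01.clean == TOY_CLEAN_CR);

	/* KVM_SET_NESTED_STATE: switch to vmcb02. */
	active = &vmcb02;
	if (mark_all_dirty)
		toy_mark_all_dirty(active);

	return active->clean;
}
```

Without the fix the model leaves every bit in vmcb02 marked clean even
though segment state changed; with it, everything is reloaded.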
KVM: x86: Defer non-architectural deliver of exception payload to userspace read
When attempting to play nice with userspace that hasn't enabled
KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the
payload until userspace actually reads relevant vCPU state, and more
importantly, force delivery of the payload in *all* paths where userspace
saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS.
Ignoring userspace save/restore for the moment, delivering the payload
before the exception is injected is wrong regardless of whether L1 or L2
is running. To make matters even more confusing, the flaw *currently*
being papered over by the !is_guest_mode() check isn't even the same bug
that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF
delivery") was trying to avoid.
At the time of commit da998b46d244, KVM didn't correctly handle exception
intercepts, as KVM would wait until VM-Entry into L2 was imminent to check
if the queued exception should morph to a nested VM-Exit. I.e. KVM would
deliver the payload to L2 and then synthesize a VM-Exit into L1. But the
payload was only the most blatant issue, e.g. waiting to check exception
intercepts would also lead to KVM incorrectly escalating a
should-be-intercepted #PF into a #DF.
That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM: x86:
Morph pending exceptions to pending VM-Exits at queue time"), but in the
interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on
KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on
the !is_guest_mode() check.
While not recorded in the changelog, the motivation for deferring the
!exception_payload_enabled delivery was to fix a flaw where a synthesized
MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6.
On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which
is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the
payload from the queue/pending exception. I.e. prematurely delivering the
payload would cause the pending #DB to not be recorded in the VMCS, and of
course, clobber L2's DR6 as seen by L1.
Jumping back to save+restore, the quirked behavior of forcing delivery of
the payload only works if userspace does KVM_GET_VCPU_EVENTS *before*
CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS.
E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the
CR2 saved by userspace won't contain the payload for the exception saved by
KVM_GET_VCPU_EVENTS.
Deliberately deliver the payload in the store_regs() path, as it's the
least awful option even though userspace may not be doing save+restore.
Because if userspace _is_ doing save+restore, it could elide KVM_GET_SREGS
knowing that SREGS were already saved when the vCPU exited.
Yosry Ahmed [Tue, 3 Feb 2026 20:10:10 +0000 (20:10 +0000)]
KVM: nSVM: Use vcpu->arch.cr2 when updating vmcb12 on nested #VMEXIT
KVM currently uses the value of CR2 from vmcb02 to update vmcb12 on
nested #VMEXIT. This value is incorrect in some cases, causing L1 to run
L2 with a corrupted CR2. This could lead to segfaults or data corruption
if L2 is in the middle of handling a #PF and reads a corrupted CR2. Use
the correct value in vcpu->arch.cr2 instead.
The value in vcpu->arch.cr2 is sync'd to vmcb02 shortly before a VMRUN
of L2, and sync'd back to vcpu->arch.cr2 shortly after. The values are
only out-of-sync in two cases: after save+restore, and after a #PF is
injected into L2. In either case, if a #VMEXIT to L1 is synthesized
before L2 runs, using the value in vmcb02 would be incorrect.
After save+restore, the value of CR2 is restored by KVM_SET_SREGS into
vcpu->arch.cr2. It is not reflected in vmcb02 until a VMRUN of L2. Before
that, it holds whatever was in vmcb02 before restore, which would be
zero on a new vCPU that never ran nested. If a #VMEXIT to L1 is
synthesized before L2 ever runs, using vcpu->arch.cr2 to update vmcb12
is the right thing to do.
The #PF injection case is more nuanced. Although the APM is a bit
unclear about when CR2 is written during a #PF, the SDM is more clear:
Processors update CR2 whenever a page fault is detected. If a
second page fault occurs while an earlier page fault is being
delivered, the faulting linear address of the second fault will
overwrite the contents of CR2 (replacing the previous address).
These updates to CR2 occur even if the page fault results in a
double fault or occurs during the delivery of a double fault.
KVM injecting the exception surely counts as the #PF being "detected".
More importantly, when an exception is injected into L2 at the time of a
synthesized #VMEXIT, KVM updates exit_int_info in vmcb12 accordingly,
such that an L1 hypervisor can re-inject the exception. If CR2 is not
written at that point, the L1 hypervisor has no way of correctly
re-injecting the #PF. Hence, if a #VMEXIT to L1 is synthesized after
the #PF is injected into L2 but before it actually runs, using
vcpu->arch.cr2 to update vmcb12 is also the right thing to do.
Note that KVM does _not_ update vcpu->arch.cr2 when a #PF is pending for
L2, only when it is injected. The distinction is important, because only
injected (but not intercepted) exceptions are propagated to L1 through
exit_int_info. It would be incorrect to update CR2 in vmcb12 for a
pending #PF, as L1 would perceive an updated CR2 value with no #PF.
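
The save+restore case can be sketched with a toy model. The structure
and function names are hypothetical, and the model only covers a #VMEXIT
synthesized before L2 runs, so the sync from vcpu->arch.cr2 to vmcb02
that normally happens before VMRUN never takes place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_vcpu {
	uint64_t arch_cr2;	/* models vcpu->arch.cr2 */
	uint64_t vmcb02_cr2;	/* models the CR2 field in vmcb02 */
	uint64_t vmcb12_cr2;	/* what L1 sees after the nested #VMEXIT */
};

/* Nested #VMEXIT: choose the source used to update vmcb12's CR2. */
static void toy_nested_vmexit(struct toy_vcpu *vcpu, bool use_arch_cr2)
{
	vcpu->vmcb12_cr2 = use_arch_cr2 ? vcpu->arch_cr2 : vcpu->vmcb02_cr2;
}

/* Returns the CR2 value L1 observes after restore + early #VMEXIT. */
static uint64_t toy_cr2_seen_by_l1(bool use_arch_cr2, uint64_t restored_cr2)
{
	struct toy_vcpu vcpu = { 0 };	/* fresh vCPU: vmcb02 CR2 is zero */

	vcpu.arch_cr2 = restored_cr2;	/* KVM_SET_SREGS */
	toy_nested_vmexit(&vcpu, use_arch_cr2);	/* before L2 ever runs */

	return vcpu.vmcb12_cr2;
}
```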
Linus Torvalds [Sun, 1 Mar 2026 23:34:47 +0000 (15:34 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
Linus Torvalds [Sun, 1 Mar 2026 21:32:32 +0000 (13:32 -0800)]
Merge tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects fix from Thomas Gleixner:
"A single fix for debugobjects.
The deferred page initialization prevents debug objects from
allocating slab pages until the initialization is complete. That
causes depletion of the pool and disabling of debugobjects.
The reason is that debugobjects uses __GFP_HIGH for allocations as it
might be invoked from arbitrary contexts. When PREEMPT_COUNT is
disabled there is no way to know whether the context is safe to set
__GFP_KSWAPD_RECLAIM.
This worked until v6.18. Since then allocations w/o a reclaim flag
cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(),
which returns early when deferred page initialization has not yet
completed.
Work around that when PREEMPT_COUNT is enabled as the preempt counter
allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when
the context is preemtible. When PREEMPT_COUNT is disabled the context
is unknown and the reclaim bit can't be set because the caller might
hold locks which might deadlock in the allocator.
That makes debugobjects depend on PREEMPT_COUNT ||
!DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but
keeps it functional for most cases"
* tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
debugobject: Make it work with deferred page initialization - again
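
The flag selection described in the pull message can be sketched as
follows. This is an illustration under stated assumptions: the TOY_ flag
values are made up and do not match the kernel's GFP bits, and the two
boolean parameters stand in for the PREEMPT_COUNT config option and a
runtime preemptibility check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
enum {
	TOY_GFP_HIGH		= 1u << 0,
	TOY_GFP_KSWAPD_RECLAIM	= 1u << 1,
};

/*
 * The reclaim bit is added only when the preempt counter exists and
 * reports a preemptible context. Otherwise the context is unknown and
 * the caller might hold locks that could deadlock in the allocator.
 */
static unsigned int toy_debugobjects_gfp(bool preempt_count_enabled,
					 bool preemptible)
{
	unsigned int gfp = TOY_GFP_HIGH;

	if (preempt_count_enabled && preemptible)
		gfp |= TOY_GFP_KSWAPD_RECLAIM;

	return gfp;
}

/* Usage: only one combination gets the reclaim bit. */
static void toy_gfp_example(void)
{
	assert(toy_debugobjects_gfp(true, true) ==
	       (TOY_GFP_HIGH | TOY_GFP_KSWAPD_RECLAIM));
	assert(toy_debugobjects_gfp(true, false) == TOY_GFP_HIGH);
	assert(toy_debugobjects_gfp(false, true) == TOY_GFP_HIGH);
}
```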
Linus Torvalds [Sun, 1 Mar 2026 21:16:35 +0000 (13:16 -0800)]
Merge tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Fix speculative safety in fred_extint()
- Fix __WARN_printf() trap in early_fixup_exception()
- Fix clang-build boot bug for unusual alignments, triggered by
CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y
- Replace the final few __ASSEMBLY__ stragglers that snuck in lately
into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again)
* tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__
x86/cfi: Fix CFI rewrite for odd alignments
x86/bug: Handle __WARN_printf() trap in early_fixup_exception()
x86/fred: Correct speculative safety in fred_extint()
Linus Torvalds [Sun, 1 Mar 2026 20:15:58 +0000 (12:15 -0800)]
Merge tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for
the common HZ=100, 250 or 1000 cases. Only use a function call for odd
HZ values like HZ=300 that generate more code.
The function call overhead showed up in performance tests of the TCP
code"
* tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
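
A worked sketch of why the common HZ values are cheap to inline. When HZ
divides 1000 evenly, the conversion is a single multiply by a constant;
HZ=300 does not divide evenly and needs more arithmetic. The function
names are hypothetical, HZ is passed as a parameter here instead of being
a build-time constant, and the exact condition used by the patch is not
shown in this log.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MSEC_PER_SEC 1000UL

/* True when HZ divides 1000 evenly, so one multiply is enough. */
static bool toy_hz_is_simple(unsigned long hz)
{
	return hz != 0 && hz <= TOY_MSEC_PER_SEC &&
	       (TOY_MSEC_PER_SEC % hz) == 0;
}

/* Conversion for the simple case only: msecs = (1000 / HZ) * jiffies. */
static unsigned long toy_jiffies_to_msecs_simple(unsigned long j,
						 unsigned long hz)
{
	assert(toy_hz_is_simple(hz));
	return (TOY_MSEC_PER_SEC / hz) * j;
}
```

For example, with HZ=250 each jiffy is 4 ms, so 5 jiffies convert to
20 ms; with HZ=100 each jiffy is 10 ms.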
Linus Torvalds [Sun, 1 Mar 2026 19:09:24 +0000 (11:09 -0800)]
Merge tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix zero_vruntime tracking when there's a single task running
- Fix slice protection logic
- Fix the ->vprot logic for reniced tasks
- Fix lag clamping in mixed slice workloads
- Fix objtool uaccess warning (and bug) in the
!CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining,
which triggers with older compilers
- Fix a comment in the rseq registration rseq_size bound check code
- Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes
differently, which special size we now reached naturally and want to
avoid. The visible ugliness of the new reserved field will be avoided
the next time the RSEQ area is extended.
* tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: slice ext: Ensure rseq feature size differs from original rseq size
rseq: Clarify rseq registration rseq_size bound check comment
sched/core: Fix wakeup_preempt's next_class tracking
rseq: Mark rseq_arm_slice_extension_timer() __always_inline
sched/fair: Fix lag clamp
sched/eevdf: Update se->vprot in reweight_entity()
sched/fair: Only set slice protection at pick time
sched/fair: Fix zero_vruntime tracking
Linus Torvalds [Sun, 1 Mar 2026 19:07:20 +0000 (11:07 -0800)]
Merge tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events fixes from Ingo Molnar:
- Fix lock ordering bug found by lockdep in perf_event_wakeup()
- Fix uncore counter enumeration on Granite Rapids and Sierra Forest
- Fix perf_mmap() refcount bug found by Syzkaller
- Fix __perf_event_overflow() vs perf_remove_from_context() race
* tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix __perf_event_overflow() vs perf_remove_from_context() race
perf/core: Fix refcount bug and potential UAF in perf_mmap
perf/x86/intel/uncore: Add per-scheduler IMC CAS count events
perf/core: Fix invalid wait context in ctx_sched_in()
Linus Torvalds [Sun, 1 Mar 2026 19:00:43 +0000 (11:00 -0800)]
Merge tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Now that LLVM 22 has been released officially, require a release
version to use the new CONFIG_WARN_CONTEXT_ANALYSIS feature.
In particular this avoids the widely used Android clang 22.0.1
pre-release build which is known to be broken for this usecase"
* tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lib/Kconfig.debug: Require a release version of LLVM 22 for context analysis
Linus Torvalds [Sun, 1 Mar 2026 18:58:16 +0000 (10:58 -0800)]
Merge tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irqchip driver fixes from Ingo Molnar:
- Fix frozen interrupt bug in the sifive-plic driver
- Limit per-device MSI interrupts on uncommon gic-v3-its hardware
variants
- Address Sparse warning by constifying a variable in the MMP driver
- Revert broken commit and also fix an error check in the ls-extirq
driver
* tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ls-extirq: Fix devm_of_iomap() error check
Revert "irqchip/ls-extirq: Use for_each_of_imap_item iterator"
irqchip/mmp: Make icu_irq_chip variable static const
irqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS supports
irqchip/sifive-plic: Fix frozen interrupt due to affinity setting
Linus Torvalds [Sun, 1 Mar 2026 17:59:29 +0000 (09:59 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All changes in drivers (well technically SES is enclosure services,
but its change is minor). The biggest is the write combining change in
lpfc followed by the additional NULL checks in mpi3mr"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix shift out of bounds when MAXQ=32
scsi: ufs: core: Move link recovery for hibern8 exit failure to wl_resume
scsi: ufs: core: Fix possible NULL pointer dereference in ufshcd_add_command_trace()
scsi: snic: MAINTAINERS: Update snic maintainers
scsi: snic: Remove unused linkstatus
scsi: pm8001: Fix use-after-free in pm8001_queue_command()
scsi: mpi3mr: Add NULL checks when resetting request and reply queues
scsi: ufs: core: Reset urgent_bkops_lvl to allow runtime PM power mode
scsi: ses: Fix devices attaching to different hosts
scsi: ufs: core: Fix RPMB region size detection for UFS 2.2
scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT
scsi: lpfc: Properly set WC for DPP mapping
Linus Torvalds [Sun, 1 Mar 2026 03:54:28 +0000 (19:54 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad
Tabba)
- Fix invariant violation for single value tnums in the verifier
(Harishankar Vishwanathan, Paul Chaignon)
- Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai)
- Fix race in devmap and cpumap on PREEMPT_RT (Jiayuan Chen)
- Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri
Olsa)
- Fix race in freeing special fields in BPF maps to prevent memory
leaks (Kumar Kartikeya Dwivedi)
- Fix OOB read in dmabuf_collector (T.J. Mercier)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits)
selftests/bpf: Avoid simplification of crafted bounds test
selftests/bpf: Test refinement of single-value tnum
bpf: Improve bounds when tnum has a single possible value
bpf: Introduce tnum_step to step through tnum's members
bpf: Fix race in devmap on PREEMPT_RT
bpf: Fix race in cpumap on PREEMPT_RT
selftests/bpf: Add tests for special fields races
bpf: Retire rcu_trace_implies_rcu_gp() from local storage
bpf: Delay freeing fields in local storage
bpf: Lose const-ness of map in map_check_btf()
bpf: Register dtor for freeing special fields
selftests/bpf: Fix OOB read in dmabuf_collector
selftests/bpf: Fix a memory leak in xdp_flowtable test
bpf: Fix stack-out-of-bounds write in devmap
bpf: Fix kprobe_multi cookies access in show_fdinfo callback
bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing
selftests/bpf: Don't override SIGSEGV handler with ASAN
selftests/bpf: Check BPFTOOL env var in detect_bpftool_path()
selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN
selftests/bpf: Fix array bounds warning in jit_disasm_helpers
...
Linus Torvalds [Sun, 1 Mar 2026 03:35:30 +0000 (19:35 -0800)]
Merge tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Do not register imx_clk_scu_driver in imx8qxp_clk_probe(); besides
fixing two other issues, this avoids a deadlock in combination with
commit dc23806a7c47 ("driver core: enforce device_lock for
driver_match_device()")
- Move secondary node lookup from device_get_next_child_node() to
fwnode_get_next_child_node(); this avoids issues when users switch
from the device API to the fwnode API
- Export io_define_{read,write}!() to avoid unused import warnings when
CONFIG_PCI=n
* tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
clk: scu/imx8qxp: do not register driver in probe()
rust: io: macro_export io_define_read!() and io_define_write!()
device property: Allow secondary lookup in fwnode_get_next_child_node()
Linus Torvalds [Sat, 28 Feb 2026 18:45:56 +0000 (10:45 -0800)]
Merge tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Two multichannel fixes
- Locking fix for superblock flags
- Fix to remove debug message that could log password
- Cleanup fix for setting credentials
* tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Use snprintf in cifs_set_cifscreds
smb: client: Don't log plaintext credentials in cifs_set_cifscreds
smb: client: fix broken multichannel with krb5+signing
smb: client: use atomic_t for mnt_cifs_flags
smb: client: fix cifs_pick_channel when channels are equally loaded
Takashi Sakamoto [Sat, 28 Feb 2026 02:56:03 +0000 (11:56 +0900)]
firewire: ohci: initialize page array to use alloc_pages_bulk() correctly
alloc_pages_bulk() skips filling entries of the page array that already
hold a value. However, the 1394 OHCI PCI driver passes the page array
without initializing it, which can cause an invalid state at PFN
validation in vmap().
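The effect can be modelled in userspace. The sketch below is not the
driver code: bulk_fill() and demo() are hypothetical names, and
bulk_fill() only mimics the behaviour described above (slots that
already hold a value are skipped). It returns the number of slots it
actually populated, so a page array holding stale values ends up with
none of its entries replaced.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 4

/*
 * Hypothetical stand-in for a bulk allocator: populate only the slots
 * that are still NULL and return how many slots were populated here.
 */
static size_t bulk_fill(void **pages, size_t n, char *backing)
{
	size_t i, filled = 0;

	for (i = 0; i < n; i++) {
		if (pages[i])
			continue;	/* slot already has a value: skipped */
		pages[i] = &backing[i];
		filled++;
	}
	return filled;
}

/* Run the model with or without zeroing the array first. */
static size_t demo(int zero_first)
{
	static char backing[NPAGES];
	void *pages[NPAGES];
	size_t i;

	/* Simulate stale contents with a recognizable bogus pointer. */
	for (i = 0; i < NPAGES; i++)
		pages[i] = (void *)(uintptr_t)0xdead;

	if (zero_first)
		memset(pages, 0, sizeof(pages));

	return bulk_fill(pages, NPAGES, backing);
}
```

With the array zeroed first, every slot is populated; without it, the
bogus pointers survive and would later be handed on as if they were
valid pages.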
Fixes: f2ae92780ab9 ("firewire: ohci: split page allocation from dma mapping") Reported-by: John Ogness <john.ogness@linutronix.de> Reported-and-tested-by: Harald Arnesen <linux@skogtun.org> Reported-and-tested-by: David Gow <david@davidgow.net> Closes: https://lore.kernel.org/lkml/87tsv1vig5.fsf@jogness.linutronix.de/ Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 28 Feb 2026 17:21:18 +0000 (09:21 -0800)]
Merge tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"One fix for the stm32 driver which got broken for DMA chaining cases,
plus a removal of some straggling bindings for the Baikal SoC which has
been pulled out of the kernel"
* tag 'spi-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: stm32: fix missing pointer assignment in case of dma chaining
spi: dt-bindings: snps,dw-abp-ssi: Remove unused bindings
Linus Torvalds [Sat, 28 Feb 2026 17:18:02 +0000 (09:18 -0800)]
Merge tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A small pile of fixes, none of which are super major - the code fixes
are improved error handling and fixing a leak of a device node.
We also have a typo fix and an improvement to make the binding example
for mt6359 more directly usable"
* tag 'regulator-fix-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Kconfig: fix a typo
regulator: bq257xx: Fix device node reference leak in bq257xx_reg_dt_parse_gpio()
regulator: fp9931: Fix PM runtime reference leak in fp9931_hwmon_read()
regulator: tps65185: check devm_kzalloc() result in probe
regulator: dt-bindings: mt6359: make regulator names unique
Linus Torvalds [Sat, 28 Feb 2026 17:01:33 +0000 (09:01 -0800)]
Merge tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix guest pfault init to pass a physical address to DIAG 0x258,
restoring pfault interrupts and avoiding vCPU stalls during host
page-in
- Fix kexec/kdump hangs with stack protector by marking
s390_reset_system() __no_stack_protector; set_prefix(0) switches
lowcore and the canary no longer matches
- Fix idle/vtime cputime accounting (idle-exit ordering, vtimer
double-forwarding) and small cleanups
* tag 's390-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pfault: Fix virtual vs physical address confusion
s390/kexec: Disable stack protector in s390_reset_system()
s390/idle: Remove psw_idle() prototype
s390/vtime: Use lockdep_assert_irqs_disabled() instead of BUG_ON()
s390/vtime: Use __this_cpu_read() / get rid of READ_ONCE()
s390/irq/idle: Remove psw bits early
s390/idle: Inline update_timer_idle()
s390/idle: Slightly optimize idle time accounting
s390/idle: Add comment for non obvious code
s390/vtime: Fix virtual timer forwarding
s390/idle: Fix cpu idle exit cpu time accounting
====================
Fix invariant violation for single-value tnums
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. As far as
I know this is the first case of invariant violation found in a real
program (i.e., not by a fuzzer). The following extract from verifier
logs shows what's happening:
More details are given in the second patch, but in short, the verifier
should be able to detect that the false branch of instruction 237 is
never true. After instruction 236, the u64 range and the tnum overlap
in a single value, 0xf00.
The long-term solution to invariant violation is likely to rely on the
refinement + invariant violation check to detect dead branches, as
started by Eduard. To fix the current issue, we need something with
less refactoring that we can backport to affected kernels.
The solution implemented in the second patch is to improve the bounds
refinement to avoid this case. It relies on a new tnum helper,
tnum_step, first sent as an RFC in [2]. The last two patches extend and
update the selftests.
Link: https://github.com/cilium/cilium/issues/44216 [1]
Link: https://lore.kernel.org/bpf/20251107192328.2190680-2-harishankar.vishwanathan@gmail.com/ [2]
Changes in v3:
- Fix commit description error spotted by AI bot.
- Simplify constants in first two tests (Eduard).
- Rework comment on third test (Eduard).
- Add two new negative test cases (Eduard).
- Rebased.
Changes in v2:
- Add guard suggested by Hari in tnum_step, to avoid undefined
behavior spotted by AI code review.
- Add explanation diagrams in code as suggested by Eduard.
- Rework conditions for readability as suggested by Eduard.
- Updated reference to SMT formula.
- Rebased.
====================
Paul Chaignon [Fri, 27 Feb 2026 21:42:45 +0000 (22:42 +0100)]
selftests/bpf: Avoid simplification of crafted bounds test
The reg_bounds_crafted tests validate the verifier's range analysis
logic. They focus on the actual ranges and thus ignore the tnum. As a
consequence, they carry the assumption that the tested cases can be
reproduced in userspace without using the tnum information.
Unfortunately, the previous change to the refinement logic breaks that
assumption for one test case:
The tested bytecode is shown below. Without our previous improvement, on
the false branch of the condition, R7 is only known to have u64 range
[0xfffffffe; 0x100000000]. With our improvement, and using the tnum
information, we can deduce that R7 equals 0x100000000.
R7's tnum is (0; 0x1ffffffff). On the false branch, regs_refine_cond_op
refines R7's u32 range to [0; 0x7fffffff]. Then, __reg32_deduce_bounds
refines the s32 range to 0 using u32 and finally also sets u32=0.
From this, __reg_bound_offset improves the tnum to (0; 0x100000000).
Finally, our previous patch uses this new tnum to deduce that it only
intersects with u64=[0xfffffffe; 0x100000000] in a single value:
0x100000000.
Because the verifier uses the tnum to reach this constant value, the
selftest is unable to reproduce it by only simulating ranges. The
solution implemented in this patch is to change the test case such that
there is more than one overlap value between u64 and the tnum. The max.
u64 value is thus changed from 0x100000000 to 0x300000000.
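As a rough illustration of "how many values do the u64 range and the
tnum share", the sketch below enumerates the members of a tnum in
ascending order with the usual subset-of-mask trick and counts those
inside [umin; umax]. struct tnum here is a local definition with the
(value; mask) layout used in this message, and tnum_count_in_range() is
a hypothetical helper, not a verifier function. Enumeration is
exponential in the number of unknown bits, so it only suits small
examples.

```c
#include <stdint.h>

struct tnum {
	uint64_t value;	/* known bits */
	uint64_t mask;	/* unknown bits */
};

/*
 * Count the members of t that lie in [umin, umax], stopping once
 * `limit` members have been found.
 */
static unsigned int tnum_count_in_range(struct tnum t, uint64_t umin,
					uint64_t umax, unsigned int limit)
{
	uint64_t sub = 0;
	unsigned int n = 0;

	do {
		uint64_t member = t.value | sub;

		if (member >= umin && member <= umax && ++n == limit)
			break;
		/* Next subset of mask in ascending numeric order. */
		sub = (sub - t.mask) & t.mask;
	} while (sub != 0);

	return n;
}
```

For the tnum (0; 0x100000000) and u64=[0xfffffffe; 0x100000000] from
the text, the only shared value is 0x100000000, so the count is one.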
Paul Chaignon [Fri, 27 Feb 2026 21:36:30 +0000 (22:36 +0100)]
selftests/bpf: Test refinement of single-value tnum
This patch introduces selftests to cover the new bounds refinement
logic introduced in the previous patch. Without the previous patch,
the first two tests fail because of the invariant violation they
trigger. The last test fails because the R10 access is not detected as
dead code. In addition, all three tests fail because of R0 having a
non-constant value in the verifier logs.
In addition, the last two cases are covering the negative cases: when we
shouldn't refine the bounds because the u64 and tnum overlap in at least
two values.
Paul Chaignon [Fri, 27 Feb 2026 21:35:02 +0000 (22:35 +0100)]
bpf: Improve bounds when tnum has a single possible value
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:
We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.
With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough path is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.
This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.
This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:
1. The u64 range and the tnum only overlap in umin.
u64: ---[xxxxxx]-----
tnum: --xx----------x-
2. The u64 range and the tnum only overlap in the maximum value
represented by the tnum, called tmax.
u64: ---[xxxxxx]-----
tnum: xx-----x--------
3. The u64 range and the tnum only overlap in between umin (excluded)
and umax.
u64: ---[xxxxxx]-----
tnum: xx----x-------x-
To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is less than or equal to umax,
and calling tnum_step a second time gives us a value past umax.
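The case analysis above can be sketched in userspace as follows. This
is an illustration under stated assumptions, not the verifier code:
tnum_step_ref() is a linear-search stand-in for tnum_step(),
single_overlap() is a hypothetical name, and the function assumes
t.value <= umin <= umax <= tmax. The umin == tmax shortcut and the
"tmax <= umax" check in case (2) are guards added for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct tnum {
	uint64_t value;	/* known bits */
	uint64_t mask;	/* unknown bits */
};

static bool tnum_contains(struct tnum t, uint64_t x)
{
	return (x & ~t.mask) == t.value;
}

/* Smallest member of t greater than z, by linear search (z < tmax). */
static uint64_t tnum_step_ref(struct tnum t, uint64_t z)
{
	uint64_t x = z;

	assert(z < (t.value | t.mask));
	do {
		x++;
	} while (!tnum_contains(t, x));
	return x;
}

/* True (and *out set) if [umin, umax] and t share exactly one value. */
static bool single_overlap(struct tnum t, uint64_t umin, uint64_t umax,
			   uint64_t *out)
{
	uint64_t tmax = t.value | t.mask;
	uint64_t next;

	assert(t.value <= umin && umin <= umax && umax <= tmax);

	if (umin == tmax) {		/* nothing to step to */
		*out = umin;
		return true;
	}

	next = tnum_step_ref(t, umin);

	if (tnum_contains(t, umin)) {	/* case 1: only umin */
		if (next > umax) {
			*out = umin;
			return true;
		}
		return false;
	}
	if (next == tmax) {		/* case 2: only tmax */
		if (tmax > umax)
			return false;
		*out = tmax;
		return true;
	}
	/* case 3: a single member strictly inside the range */
	if (next <= umax && tnum_step_ref(t, next) > umax) {
		*out = next;
		return true;
	}
	return false;
}
```

With the tnum (0xe00; 0x100) and the range [0xe01; 0xf00] from the
verifier log discussed above, this lands in case (2) and yields 0xf00.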
This change implements these three cases. With it, the above bytecode
looks as follows:
In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.
bpf: Introduce tnum_step to step through tnum's members
This commit introduces tnum_step(), a function that, given a tnum t and
a number z, returns the smallest member of t larger than z. The number z
must be greater than or equal to the smallest member of t and less than
the largest member of t.
The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.
There are only two possible cases:
(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.
(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.
Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in t, because the
unknown positions of t are matched to z in j. This position had to be a
1 in z and known 0 in t.
Let k be the position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit). Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.
From our example, considering only positions of significance greater
than k:
t = xx..x
z = 10..1
+ 1
-----
11..0
This is the exact combination of 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,
(Case 1.2) Now consider the case where j == z.
Matching the unknown bits of t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t are candidates for being
set/unset. We can do this similarly to Case 1.1, by adding 1 to the bits
extracted from the masked bit positions of z. Essentially, this case is
equivalent to Case 1.1, with k = 0.
t = 1x1x0xxx
z = .0.1.100
+ 1
---------
.0.1.101
This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get
(Case 2) Finally, consider the case where j > z.
Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.
Because of the 0-to-1 flip at position k, a member of t can become
greater than z if its bits in positions greater than k are themselves >=
the corresponding bits of z. To make that member *minimally* greater
than z, the bits in positions greater than k must be exactly equal to
z's. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.
In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:
This concludes the computation for a result > z that is a member of t.
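The "add 1 to the bits at the unknown positions" step used when j == z
can be written with plain integer arithmetic. The sketch below is one
way to express it, not necessarily how the commit implements it;
tnum_match() and tnum_next_same() are hypothetical names and struct
tnum is a local (value; mask) definition. Setting every known position
to 1 before the addition lets the carry ripple across the known bits,
so only the unknown bits act as a counter.

```c
#include <stdint.h>

struct tnum {
	uint64_t value;	/* known bits */
	uint64_t mask;	/* unknown bits */
};

/* j: keep t's known bits and copy z's bits into the unknown positions. */
static uint64_t tnum_match(struct tnum t, uint64_t z)
{
	return t.value | (z & t.mask);
}

/*
 * Next member of t after j when j == z. Assumes z is not already the
 * largest member of t; otherwise the unknown bits wrap around to 0.
 */
static uint64_t tnum_next_same(struct tnum t, uint64_t z)
{
	uint64_t bumped = ((z | ~t.mask) + 1) & t.mask;

	return t.value | bumped;
}
```

For t = 1x1x0xxx (value 0xa0, mask 0x57) and z = 10110100 (0xb4), the
unknown bits .0.1.100 become .0.1.101 and the result is 10110101
(0xb5).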
The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t and z and the output res:
1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res
We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned "unsat", indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.
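For small inputs, the three properties of the specification can also be
checked by exhaustive search. The sketch below is a userspace
illustration only and is unrelated to the SMT encoding referenced
above; is_tnum_step() is a hypothetical name and struct tnum is a local
(value; mask) definition. The loop walks every value strictly between z
and res, so it is only practical when that gap is small.

```c
#include <stdbool.h>
#include <stdint.h>

struct tnum {
	uint64_t value;	/* known bits */
	uint64_t mask;	/* unknown bits */
};

static bool tnum_contains(struct tnum t, uint64_t x)
{
	return (x & ~t.mask) == t.value;
}

/* Does res satisfy properties 1-3 for inputs t and z? */
static bool is_tnum_step(struct tnum t, uint64_t z, uint64_t res)
{
	uint64_t x;

	if (!tnum_contains(t, res))	/* 1. res is a member of t */
		return false;
	if (res <= z)			/* 2. res is strictly greater than z */
		return false;
	for (x = z + 1; x < res; x++)	/* 3. no member of t in between */
		if (tnum_contains(t, x))
			return false;
	return true;
}
```

With t = 1x1x0xxx (value 0xa0, mask 0x57) and z = 0xb4, the check
accepts 0xb5 and rejects 0xb6, which is a member of t but not the
smallest one above z.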
In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].
====================
bpf: Fix per-CPU bulk queue races on PREEMPT_RT
On PREEMPT_RT kernels, local_bh_disable() only calls migrate_disable()
(when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption. This means CFS scheduling can preempt a task inside the
per-CPU bulk queue (bq) operations in cpumap and devmap, allowing
another task on the same CPU to concurrently access the same bq,
leading to use-after-free, list corruption, and kernel panics.
Patch 1 fixes the cpumap race in bq_flush_to_queue(), originally
reported by syzbot [1].
Patch 2 fixes the same class of race in devmap's bq_xmit_all(),
identified by code inspection after Sebastian Andrzej Siewior pointed
out that devmap has the same per-CPU bulk queue pattern [2].
Both patches use local_lock_nested_bh() to serialize access to the
per-CPU bq. On non-RT this is a pure lockdep annotation with no
overhead; on PREEMPT_RT it provides a per-CPU sleeping lock.
To reproduce the devmap race, insert an mdelay(100) in bq_xmit_all()
after "cnt = bq->count" and before the actual transmit loop. Then pin
two threads to the same CPU, each running BPF_PROG_TEST_RUN with an XDP
program that redirects to a DEVMAP entry (e.g. a veth pair). CFS
timeslicing during the mdelay window causes interleaving. Without the
fix, KASAN reports null-ptr-deref due to operating on freed frames:
BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449
v3 -> v4: https://lore.kernel.org/all/20260213034018.284146-1-jiayuan.chen@linux.dev/
- Move panic trace to cover letter. (Sebastian Andrzej Siewior)
- Add Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> to both patches
from cover letter.
v2 -> v3: https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/
- Fix commit message: remove incorrect "spin_lock() becomes rt_mutex"
claim, the per-CPU bq has no spin_lock at all. (Sebastian Andrzej Siewior)
- Fix commit message: accurately describe local_lock_nested_bh()
behavior instead of referencing local_lock(). (Sebastian Andrzej Siewior)
- Remove incomplete discussion of snapshot alternative.
(Sebastian Andrzej Siewior)
- Remove panic trace from commit message. (Sebastian Andrzej Siewior)
- Add patch 2/2 for devmap, same race pattern. (Sebastian Andrzej Siewior)
v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
local_lock()/local_unlock(), since these paths already run under
local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
softirq-BKL lock on PREEMPT_RT") which is the actual commit that
makes the race possible. (Sebastian Andrzej Siewior)
====================
Jiayuan Chen [Wed, 25 Feb 2026 12:14:56 +0000 (20:14 +0800)]
bpf: Fix race in devmap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
bq->xdp_prog after bq_xmit_all(). If preempted between
bq_xmit_all() return and bq->dev_rx = NULL, a preempting
bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
the flush_list, and enqueues a frame. When __dev_flush() resumes,
it clears dev_rx and removes bq from the flush_list, orphaning the
newly enqueued frame.
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
both tasks can call __list_del_clearprev() on the same flush_node,
the second dereferences the prev pointer already set to NULL.
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect)
---------------------- --------------------------------
__dev_flush(flush_list)
bq_xmit_all(bq)
cnt = bq->count /* e.g. 16 */
/* start iterating bq->q[] */
<-- CFS preempts Task A -->
bq_enqueue(dev, xdpf)
bq->count == DEV_MAP_BULK_SIZE
bq_xmit_all(bq, 0)
cnt = bq->count /* same 16! */
ndo_xdp_xmit(bq->q[])
/* frames freed by driver */
bq->count = 0
<-- Task A resumes -->
ndo_xdp_xmit(bq->q[])
/* use-after-free: frames already freed! */
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
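The stale-snapshot problem in race 1 can be replayed deterministically
in userspace. The sketch below is a toy model, not kernel code: it uses
no real locking API, struct bulk_queue, xmit() and run_race() are
hypothetical names, and "serialize" simply orders the two tasks the way
a per-CPU lock would. It reports the highest number of times any single
frame was transmitted.

```c
#include <stdbool.h>

#define BULK 4

struct bulk_queue {
	int count;
	int frames[BULK];	/* frame ids */
};

/* Transmit the first cnt frames and reset the queue. */
static void xmit(struct bulk_queue *bq, int cnt, int *sent)
{
	int i;

	for (i = 0; i < cnt; i++)
		sent[bq->frames[i]]++;
	bq->count = 0;
}

static int run_race(bool serialize)
{
	struct bulk_queue bq = { .count = BULK, .frames = { 0, 1, 2, 3 } };
	int sent[BULK] = { 0 };
	int cnt_a, i, worst = 0;

	cnt_a = bq.count;		/* Task A: cnt = bq->count */

	if (!serialize)			/* Task B preempts A and flushes */
		xmit(&bq, bq.count, sent);

	xmit(&bq, cnt_a, sent);		/* Task A resumes with its snapshot */

	if (serialize)			/* Task B runs only after A is done */
		xmit(&bq, bq.count, sent);

	for (i = 0; i < BULK; i++)
		if (sent[i] > worst)
			worst = sent[i];
	return worst;
}
```

Without serialization each frame is handed to the driver twice, which
corresponds to the use-after-free above; with it, Task B observes an
empty queue and every frame is transmitted once.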
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiayuan Chen [Wed, 25 Feb 2026 12:14:55 +0000 (20:14 +0800)]
bpf: Fix race in cpumap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
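The double __list_del_clearprev() failure can be modelled with a tiny
circular list. This is a userspace sketch with hypothetical names
(struct node, list_init(), list_add_node(), del_clearprev()); it mirrors
the behaviour described above, where deletion unlinks the node and sets
its prev pointer to NULL, but returns false at the point where the
kernel would dereference the NULL prev pointer.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
	struct node *prev;
};

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_node(struct node *head, struct node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n and clear its prev pointer; refuse a second deletion. */
static bool del_clearprev(struct node *n)
{
	if (!n->prev)
		return false;	/* the real code would oops here */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = NULL;
	return true;
}
```

The first deletion succeeds and leaves the list empty; a second
deletion of the same node, as done by the resumed Task A, hits the
NULL prev pointer.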
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/ Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Close race in freeing special fields and map value
There exists a race across various map types where the freeing of
special fields (tw, timer, wq, kptr, etc.) can be done eagerly when a
logical delete operation is done on a map value, such that the program
which continues to have access to such a map value can recreate the
fields and cause them to leak.
The set contains fixes for this case. It is a continuation of Mykyta's
previous attempt in [0], but applies to all fields. A test is included
which reproduces the bug reliably in absence of the fixes.
Local Storage Benchmarks
------------------------
Evaluation Setup: Benchmarked on a dual-socket Intel Xeon Gold 6348 (Ice
Lake) @ 2.60GHz (56 cores / 112 threads), with the CPU governor set to
performance. Bench was pinned to a single NUMA node throughout the test.
Benchmark comes from [1] using the following command:
./bench -p 1 local-storage-create --storage-type <socket,task> --batch-size <16,32,64>
Before the test, 10 runs of all cases ([socket|task] x 3 batch sizes x 7
iterations per batch size) are done to warm up and prime the machine.
Then, 3 runs of all cases are done (with and without the patch, across
reboots).
For each comparison, we have 21 samples, i.e. per batch size (e.g.
socket 16) of a given local storage, we have 3 runs x 7 iterations.
The statistics (mean, median, stddev) and t-test are computed for each
scenario (local storage and batch size pair) individually (21 samples
for either case). All values are for local storage creations in thousand
creations / sec (k/s).
The cases for socket are within the range of noise, and improvements in task
local storage are due to high variance (CV ~4%-6% across batch sizes). The only
statistically significant case worth mentioning is socket with batch size 64
with p-value from t-test < 0.05, but the absolute difference is small (~2k/s).
TL;DR there doesn't appear to be any significant regression or improvement.
* Add Paul's Reviewed-by.
* Fix use-after-free in accessing bpf_mem_alloc embedded in map. (syzbot CI)
* Add benchmark numbers for local storage.
* Add extra test case for per-cpu hashmap coverage with up to 16 refcount leaks.
* Target bpf tree.
====================
Add a couple of tests to ensure that the refcount drops to zero when we
exercise the race where creation of a special field follows the logical
bpf_obj_free_fields done when deleting an element. Prior to the previous
changes, the fields would be freed eagerly, then be repopulated and end
up leaking, so the reference count would not drop correctly. Running
this test on a kernel without the fixes will cause a hang in
delete_module, since the module reference stays active due to the leaked
kptr not dropping it. After the fixes, the tests succeed as expected.
Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.
Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.
While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.
BPF hash map may now use the map_check_btf() callback to decide whether
to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can
opt out of const-ness using mutable, we must lose the const qualifier on
the callback such that we can avoid the ugly cast. Make the change and
adjust all existing users, and lose the comment in hashtab.c.
There is a race window where BPF hash map elements can leak special
fields if the program with access to the map value recreates these
special fields between the check_and_free_fields done on the map value
and its eventual return to the memory allocator.
Several ways were explored prior to this patch, most notably [0] tried
to use a poison value to reject attempts to recreate special fields for
map values that have been logically deleted but still accessible to BPF
programs (either while sitting in the free list or when reused). While
this approach works well for task work, timers, wq, etc., it is harder
to apply the idea to kptrs, which have a similar race and failure mode.
Instead, we change bpf_mem_alloc to allow registering destructor for
allocated elements, such that when they are returned to the allocator,
any special fields created while they were accessible to programs in the
mean time will be freed. If these values get reused, we do not free the
fields again before handing the element back. The special fields thus
may remain initialized while the map value sits in a free list.
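The destructor idea can be sketched with a toy free-list pool. This is
a userspace illustration only, not bpf_mem_alloc: struct pool,
pool_put(), pool_get(), pool_drain() and elem_dtor() are hypothetical
names, and a plain int stands in for a special field. The destructor
runs only when an element leaves the pool for good, not when it is
parked on the free list or reused.

```c
#include <stddef.h>

struct elem {
	struct elem *next;
	int special;	/* nonzero: a special field is initialized */
};

struct pool {
	struct elem *free_list;
	void (*dtor)(struct elem *e);	/* optional destructor */
};

static int dtor_calls;

/* Example destructor: release the special field if one exists. */
static void elem_dtor(struct elem *e)
{
	if (e->special) {
		e->special = 0;
		dtor_calls++;
	}
}

/* Park an element on the free list; fields are left untouched. */
static void pool_put(struct pool *p, struct elem *e)
{
	e->next = p->free_list;
	p->free_list = e;
}

/* Reuse an element from the free list, again without freeing fields. */
static struct elem *pool_get(struct pool *p)
{
	struct elem *e = p->free_list;

	if (e)
		p->free_list = e->next;
	return e;
}

/* Return all elements to the system, running the destructor on each. */
static int pool_drain(struct pool *p)
{
	struct elem *e;
	int n = 0;

	while ((e = pool_get(p)) != NULL) {
		if (p->dtor)
			p->dtor(e);
		n++;
	}
	return n;
}
```

A field recreated while the element sits on the free list survives
reuse and is released exactly once, when the pool is drained.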
When bpf_mem_alloc is retired in the future, a similar concept can be
introduced to kmalloc_nolock-backed kmem_cache, paired with the existing
idea of a constructor.
Note that the destructor registration happens in map_check_btf, after
the BTF record is populated and (at that point) available for inspection
and duplication. Duplication is necessary since the freeing of embedded
bpf_mem_alloc can be decoupled from actual map lifetime due to logic
introduced to reduce the cost of rcu_barrier()s in mem alloc free path in 9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.").
As such, once all callbacks are done, we must also free the duplicated
record. To remove dependency on the bpf_map itself, also stash the key
size of the map to obtain value from htab_elem long after the map is
gone.
Linus Torvalds [Fri, 27 Feb 2026 21:40:30 +0000 (13:40 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The diffstat is dominated by changes to our TLB invalidation errata
handling and the introduction of a new GCS selftest to catch one of
the issues that is fixed here relating to PROT_NONE mappings.
- Fix cpufreq warning due to attempting a cross-call with interrupts
masked when reading local AMU counters
- Fix DEBUG_PREEMPT warning from the delay loop when it tries to
access per-cpu errata workaround state for the virtual counter
- Re-jig and optimise our TLB invalidation errata workarounds in
preparation for more hardware brokenness
- Fix GCS mappings to interact properly with PROT_NONE and to avoid
corrupting the pte on CPUs with FEAT_LPA2
- Fix ioremap_prot() to extract only the memory attributes from the
user pte and ignore all the other 'prot' bits"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: topology: Fix false warning in counters_read_on_cpu() for same-CPU reads
arm64: Fix sampling the "stable" virtual counter in preemptible section
arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
arm64: tlb: Allow XZR argument to TLBI ops
kselftest: arm64: Check access to GCS after mprotect(PROT_NONE)
arm64: gcs: Honour mprotect(PROT_NONE) on shadow stack mappings
arm64: gcs: Do not set PTE_SHARED on GCS mappings if FEAT_LPA2 is enabled
arm64: io: Extract user memory type in ioremap_prot()
arm64: io: Rename ioremap_prot() to __ioremap_prot()
Linus Torvalds [Fri, 27 Feb 2026 21:32:52 +0000 (13:32 -0800)]
Merge tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Update MAINTAINERS email address (Shawn Guo)
- Refresh cached Endpoint driver MSI Message Address to fix a v7.0
regression when kernel changes the address after firmware has
configured it (Niklas Cassel)
- Flush Endpoint MSI-X writes so they complete before the outbound ATU
entry is unmapped (Niklas Cassel)
- Correct the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value, which broke VMM use
of PCI capabilities (Bjorn Helgaas)
* tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value
PCI: dwc: ep: Flush MSI-X write before unmapping its ATU entry
PCI: dwc: ep: Refresh MSI Message Address cache on change
MAINTAINERS: Update Shawn Guo's address for HiSilicon PCIe controller driver
Linus Torvalds [Fri, 27 Feb 2026 18:52:57 +0000 (10:52 -0800)]
Merge tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Fix incorrect usages of decoder flags
- Validate payload size before accessing contents
- Fix race condition when creating nvdimm objects
- Fix deadlock on attach failure
* tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Test CXL_DECODER_F_NORMALIZED_ADDRESSING as a bitmask
cxl: Test CXL_DECODER_F_LOCK as a bitmask
cxl/mbox: validate payload size before accessing contents in cxl_payload_from_user_allowed()
cxl: Fix race of nvdimm_bus object when creating nvdimm objects
cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko
cxl/port: Hold port host lock during dport adding.
cxl/port: Introduce port_to_host() helper
cxl/memdev: fix deadlock in cxl_memdev_autoremove() on attach failure