Maxim Levitsky [Fri, 12 Jun 2026 15:00:38 +0000 (11:00 -0400)]
KVM: selftests: access_tracking_perf_test: bump number of NUMA nodes to 32
It's rare to find a system that has more than 4 sockets,
but a system can have more than 4 NUMA nodes if each socket
exposes its chiplets as separate NUMA nodes.
In particular, our CI caught a failure in this test on a system with
two sockets, each containing an 'AMD EPYC 7601 32-Core Processor'.
Bump the limit to 32, just in case.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-ID: <20260612150038.1277394-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 12 Jun 2026 08:51:42 +0000 (10:51 +0200)]
Merge tag 'kvmarm-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 7.2
* New features:
- None. Zilch. Nada. Que dalle.
* Fixes and other improvements:
- Significant cleanup of the vgic-v5 PPI support which was merged in
7.1. This makes the code more maintainable, and squashes a couple
of bugs in the meantime.
- Set of fixes for the handling of the MMU in an NV context,
particularly VNCR-triggered faults. S1POE support is fixed
as well.
- Large set of pKVM fixes, mostly addressing recurring issues
around hypervisor tracking of donated pages in obscure cases
where the donation could fail and leave things in a bizarre
state.
- Fixes for the so-called "lazy vgic init", which resulted in
sleeping operations in non-preemptible sections. This turned
out to be far more invasive than initially expected...
- Reduce the overhead of L1/L2 context switch by not touching
the FP registers.
- Fix the way non-implemented page sizes are dealt with when
a guest insists on using them for S2 translation.
- The usual set of low-impact fixes and cleanups all over the map.
Paolo Bonzini [Fri, 12 Jun 2026 08:47:24 +0000 (10:47 +0200)]
Merge branch 'kvm-single-pdptrs' into HEAD
The non-MMU changes/preliminary cleanups from the "split kvm_mmu in
three" series[1]. The final outcome is to have a single copy of the
PDPTRs (in vcpu->arch) instead of two (in root_mmu and nested_mmu).
Paolo Bonzini [Sat, 30 May 2026 16:55:45 +0000 (12:55 -0400)]
KVM: x86/mmu: move pdptrs out of the MMU
PDPTRs are part of the CPU state. A bit unconventionally, they are
reached via vcpu->arch.walk_mmu instead of being stored in vcpu->arch
directly. That is nice in principle---it would allow TDP shadow paging
to have its own PDPTRs---but it is not necessary, because EPT has no
PDPTRs and NPT does not cache them.
Since kvm_pdptr_read does not otherwise need the MMU, drop the pdptrs
from the MMU altogether. There is however something to be careful
about, in that PDPTRs are now not stored separately in root_mmu and
nested_mmu for L1 and L2 guests. In practice this was already not
an issue:
- for EPT the VMCS0x has to keep them up to date; and for the purpose
of emulation they are always loaded from the VMCS on vmentry/vmexit,
thanks to the clearing of dirty and available register bitmaps in
vmx_switch_vmcs()
- for NPT, VCPU_EXREG_PDPTR is similarly cleared for nNPT, which does
not cache the PDPTRs; while for non-nNPT the PDPTRs are loaded
together with the load of CR3.
Note that page table PDPTRs are not affected, since they are stored
in pae_root.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 30 May 2026 16:55:44 +0000 (12:55 -0400)]
KVM: x86: check that kvm_handle_invpcid is only invoked with shadow paging
This is true for both Intel and AMD. On Intel, "enable INVPCID" is
set unconditionally if supported, but the vmexit is triggered by the
"INVLPG exiting" control which is disabled by enable_ept. On AMD, KVM
can intercept INVPCID if NPT is enabled but only in order to inject #UD
in the guest.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 30 May 2026 16:55:43 +0000 (12:55 -0400)]
KVM: nSVM: invalidate cached PDPTRs across nested NPT transitions
When L2 runs under nested NPT and uses PAE paging, KVM's cached PDPTRs
in mmu->pdptrs[] can hold stale or wrong values after nested
transitions and across migration restore, because both
nested_svm_load_cr3() and svm_get_nested_state_pages() only refresh
PDPTRs on the !nested_npt path.
The user-visible bug is on migration restore of an L2 running with nested
NPT and 32-bit PAE paging, if userspace uses KVM_SET_SREGS rather than
KVM_SET_SREGS2. In that case, load_pdptrs() leaves VCPU_EXREG_PDPTR
marked as available, and kvm_pdptr_read() will use a stale translation
that used L1 GPAs instead of L2 nGPAs. svm_get_nested_state_pages()
runs on first KVM_RUN but skips the refresh because nested_npt_enabled()
is true. The CPU itself reads L2's PDPTRs correctly from memory via
L1's NPT, but KVM-side walking of guest PAE page tables uses the bogus
cached values.
Unlike Intel's GUEST_PDPTR0..3 fields in the VMCS, SVM has no
VMCB-cached PDPTR state: the in-memory PDPTEs at the current CR3 are
the only source of truth, and svm_cache_reg(VCPU_EXREG_PDPTR) simply
reloads them from memory via load_pdptrs(). Clearing the avail
bit (and the dirty bit because !avail/dirty is invalid) to force
a reload when PDPTRs are needed fixes the bug.
Do the same for nested_svm_load_cr3()'s nested_npt branch, so that
the invariant "PDPTRs need reloading" is handled similarly for both
immediate and deferred loading.
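The avail/dirty rule described above can be illustrated with a small standalone model. This is only a sketch: `struct reg_cache`, `REG_PDPTR`, `cache_invalidate()` and `cache_read()` are hypothetical names invented for the illustration, not KVM's actual register-cache helpers, and the "backing" field merely stands in for the in-memory PDPTEs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register index, standing in for VCPU_EXREG_PDPTR. */
enum { REG_PDPTR = 0 };

struct reg_cache {
	uint32_t avail;    /* bit set: cached value is valid */
	uint32_t dirty;    /* bit set: cached value must be written back */
	uint64_t value;    /* the single cached register in this model */
	uint64_t backing;  /* "memory": the only source of truth */
	unsigned reloads;  /* number of reloads from backing */
};

/* Invariant from the commit message: dirty without avail is invalid. */
static bool cache_is_consistent(const struct reg_cache *c)
{
	return (c->dirty & ~c->avail) == 0;
}

/* Drop the cached value; clear both bits so the invariant still holds. */
static void cache_invalidate(struct reg_cache *c, int reg)
{
	c->avail &= ~(1u << reg);
	c->dirty &= ~(1u << reg);
}

/* Read through the cache, reloading from backing if not available. */
static uint64_t cache_read(struct reg_cache *c, int reg)
{
	if (!(c->avail & (1u << reg))) {
		c->value = c->backing;
		c->avail |= 1u << reg;
		c->reloads++;
	}
	return c->value;
}
```

In this model, a read after `cache_invalidate()` always goes back to the backing store, which is the behaviour the fix relies on for both the immediate and the deferred path.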
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 30 May 2026 16:55:42 +0000 (12:55 -0400)]
KVM: nVMX: remove unnecessary code in prepare_vmcs02_rare
The early vmwrite of the PDPTRs in prepare_vmcs02_rare() is redundant, because
every write it does will be performed by prepare_vmcs02() if it is actually
needed.
In any case where the emulator or the processor needs the PDPTR, either
is_pae_paging() is true on vmentry, or a write of CR0, CR4 or EFER will
cause a vmexit to L0. The next vmentry will refresh the PDPTRs in the
vmcs02 from vmcs12.
In fact, in the original version[1] of what ended up being commit c7554efc8335 ("KVM: nVMX: Copy PDPTRs to/from vmcs12 only when
necessary"), the writes in what is now prepare_vmcs02_rare() were removed.
When the mega-collection of optimizations was posted[2], the removal of
that code got dropped as a rebase goof, so reinstate it.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sat, 30 May 2026 16:55:41 +0000 (12:55 -0400)]
KVM: x86: remove nested_mmu from mmu_is_nested()
nested_mmu is always stored into vcpu->arch.walk_mmu at the same time as
guest_mmu is stored into vcpu->arch.mmu. But nested_mmu is not even
a proper MMU, it is only used for page walking; plus the fact that
walk_mmu has to be switched at all is just an implementation detail.
In the end what matters here is whether the guest is using nested
page tables; vmx/nested.c and svm/nested.c check it to see if they
are in nEPT or nNPT context respectively. So switch to checking
root_mmu vs. guest_mmu, which is a more cogent test.
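As a rough illustration of the "root_mmu vs. guest_mmu" test, the check reduces to a pointer comparison. The structures below are heavily reduced, hypothetical stand-ins, and `mmu_is_nested_sketch()` is an assumed shape of the helper rather than the code from the patch.

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-ins for the KVM structures. */
struct mmu { int unused; };

struct vcpu_arch {
	struct mmu root_mmu;   /* active while L1 (or a non-nested guest) runs */
	struct mmu guest_mmu;  /* active while L2 runs under nested TDP */
	struct mmu *mmu;       /* currently active MMU */
};

/* "Nested" here means the active MMU is guest_mmu rather than root_mmu. */
static bool mmu_is_nested_sketch(const struct vcpu_arch *arch)
{
	return arch->mmu == &arch->guest_mmu;
}
```

The point of the sketch is that the answer no longer depends on which MMU is used for page walking, only on which MMU is active.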
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260511150648.685374-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260530165545.25599-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marc Zyngier [Fri, 12 Jun 2026 08:29:34 +0000 (09:29 +0100)]
Merge branch kvm-arm64/nv-mmu-7.2 into kvmarm-master/next
* kvm-arm64/nv-mmu-7.2:
: .
: Assorted collection of fixes for NV MMU bugs
:
: - Correctly plug AT S1E1A handling in the emulation backend
:
: - Make CPTR_EL2.E0POE depend on FEAT_S1POE
:
: - Drop the reference on the page if the VNCR translation
: races with an MMU notifier
:
: - Correctly synthesise an SEA if a page table walk fails due
: to a guest error
:
: - Fully invalidate the VNCR TLB and fixmap when translating
: for a new VNCR
:
: - Restart S1 walk when the S2 walk fails due to a race condition
:
: - Correctly return -EAGAIN when a S1 walk fails
:
: - Fix block mapping validity check in stage-1 walker for 64kB pages
:
: - Fix potential NULL dereference when performing an EL2 TLBI targeting
: the VNCR page
:
: - Hold kvm->mmu_lock while initialising the vncr_tlb pointer
: .
KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlb
KVM: arm64: nv: Avoid dereferencing NULL VNCR pseudo-TLB
KVM: arm64: Fix block mapping validity check in stage-1 walker
KVM: arm64: nv: Restart stage-1 walk if stage-2 desc update fails
KVM: arm64: Restart instruction upon race in __kvm_at_s12()
KVM: arm64: nv: Inject SEA TTW when desc update can't write to GPA
KVM: arm64: nv: Fully update VNCR fixmap state in kvm_translate_vncr()
KVM: arm64: Don't leak PFN when kvm_translate_vncr() races MMU notifier
arm64: cpufeature: Expose ID_AA64ISAR2_EL1.ATS1A to KVM
KVM: arm64: Wire AT S1E1A in the system instruction handling table
KVM: arm64: Key CPTR_EL2.E0POE propagation on FEAT_S1POE
Marc Zyngier [Fri, 12 Jun 2026 08:29:31 +0000 (09:29 +0100)]
Merge branch kvm-arm64/misc-7.2 into kvmarm-master/next
* kvm-arm64/misc-7.2:
: .
: - Check for a valid vcpu pointer upon deactivating traps when handling
: a HYP panic in VHE mode
:
: - Make the __deactivate_fgt() macro use its arguments instead of the
: surrounding context
:
: - Don't bother with initialising TPIDR_EL2 in the hyp stubs, as this
: is already taken care of in more obvious places
:
: - Drop the unused kvm_arch pointer passed to __load_stage2()
:
: - Return -EOPNOTSUPP when a hypercall fails for some reason, instead of
: returning whatever was in the result structure
:
: - Make the ITS ABI selection helpers return void, which avoids wondering
: about the nature of the return code (always 0)
: .
KVM: arm64: vgic-its: Make ABI commit helpers return void
KVM: arm64: Set a Linux errno on SMCCC error in kvm_call_hyp_nvhe()
KVM: arm64: Remove @arch from __load_stage2()
KVM: arm64: Don't populate TPIDR_EL2 in finalise_el2()
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
Paolo Bonzini [Fri, 12 Jun 2026 08:12:22 +0000 (10:12 +0200)]
Merge tag 'kvm-x86-selftests-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM selftests changes for 7.2
- Randomize the dirty log test's delay when reaping the bitmap on the first
pass, as always waiting only 1ms hid a KVM RISC-V bug as the test reaped the
bitmap before KVM could build up enough state to hit the bug.
Paolo Bonzini [Fri, 12 Jun 2026 08:11:59 +0000 (10:11 +0200)]
Merge tag 'kvm-x86-mmu-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 7.2
- Use the kernel's "enum pg_level" in the TDX APIs instead of the TDX-Module's
level definitions (which are 0-based).
- Rework the TDX memory APIs to not require/assume that guest memory is
backed by "struct page" (in preparation for guest_memfd hugepage support).
- Overhaul the TDP MMU => S-EPT code to move as much S-EPT specific logic as
possible into the TDX code, and to funnel (almost) all S-EPT updates into
a single chokepoint. The motivation is largely to prepare for upcoming
Dynamic PAMT support, but the cleanups are nice to have on their own.
- Plug a hole in the shadow MMU where KVM fails to recursively zap nested TDP
shadow when L1 is tearing its TDP page tables from the bottom up, as KVM's
TDP MMU now does.
Paolo Bonzini [Fri, 12 Jun 2026 08:11:09 +0000 (10:11 +0200)]
Merge tag 'kvm-x86-misc-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM misc x86 changes for 7.2
- Handle EXIT_FASTPATH_EXIT_USERSPACE in vendor code to ensure vendor code
gets a chance to handle things like reaping the PML buffer.
- Ensure KVM's copy of CR0 and CR3 are up-to-date on SVM prior to invoking
fastpath handlers.
- Update KVM's view of PV async enabling if and only if the MSR write fully
succeeds.
- Fix a variety of issues where the emulator doesn't honor guest-debug state,
and clean up related code along the way.
- Synthesize EPT Violation and #NPF "error code" bits when injecting faults
into L1 that didn't originate in hardware (in which case the VMCS/VMCB
doesn't hold relevant information).
- Add support for virtualizing (well, emulating) AMD's flavor of CPL>0 CPUID
faulting.
- Clean up the GPR APIs so that KVM's use of "raw" is consistent, and fix a
variety of minor bugs along the way.
- Fix an OOB memory access due to not checking the VP ID when handling a
Hyper-V PV TLB flush for L2.
- Fix a bug in the mediated PMU's handling of fixed counters that allowed the
guest to bypass the PMU event filter.
- Allow userspace to return EAGAIN when handling SNP and TDX hypercalls, so
that KVM can forward a "retry" status code to the guest, and reserve all
unused error codes for future usage.
Paolo Bonzini [Fri, 12 Jun 2026 08:08:52 +0000 (10:08 +0200)]
Merge tag 'kvm-x86-gmem-7.2' of https://github.com/kvm-x86/linux into HEAD
KVM guest_memfd changes for 7.2
- Return -EEXIST instead of -EINVAL if userspace attempts to bind a gmem
range to multiple memslots, and fix the test that was supposed to ensure
KVM returns -EEXIST.
- Treat memslot binding offsets and sizes as unsigned values to fix a bug
where KVM interprets a large "offset + size" as a negative value and allows
a nonsensical offset.
- Use the inode number instead of the page offset for the NUMA interleaving
index to fix a bug where the effective index would jump by two for
consecutive pages (the caller also adds in the page offset).
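The signed-versus-unsigned pitfall behind the memslot binding fix can be sketched in isolation. The functions below are hypothetical and do not reproduce the guest_memfd code; they only show how a large "offset + size" viewed as a signed value can slip past a bounds check, and one overflow-free way to write the check with unsigned values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [offset, offset + size) lies within a file of file_size
 * bytes. Written without computing offset + size, so it cannot wrap.
 */
static bool range_in_file(uint64_t offset, uint64_t size, uint64_t file_size)
{
	return size <= file_size && offset <= file_size - size;
}

/* The buggy shape: a sum that turns negative when viewed as signed. */
static bool range_in_file_signed(int64_t offset, int64_t size,
				 int64_t file_size)
{
	/*
	 * The sum is computed in unsigned arithmetic to avoid undefined
	 * behaviour; the conversion back to int64_t is implementation-defined
	 * and wraps to a negative value on common compilers.
	 */
	int64_t end = (int64_t)((uint64_t)offset + (uint64_t)size);

	return end <= file_size;
}
```

With an offset near INT64_MAX, the signed variant accepts the range because the end appears negative, while the unsigned variant rejects it.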
Marc Zyngier [Fri, 12 Jun 2026 08:08:31 +0000 (09:08 +0100)]
Merge branch kvm-arm64/vgic-v5-PPI-fixes into kvmarm-master/next
* kvm-arm64/vgic-v5-PPI-fixes:
: .
: Substantial cleanup of the vgic-v5 PPI support. From the original
: cover letter:
:
: "With the GICv5 PPi support merged in, it has become obvious that a few
: things could be improved, both from the correctness and maintainability
: angles."
: .
KVM: arm64: Fix arch timer interrupts for GICv3-on-GICv5 guests
irqchip/gic-v5: Immediately exec priority drop following activate
Documentation: KVM: Clarify that PMU_V3_IRQ IntID requirements for GICv5
Documentation: KVM: Fix typos in VGICv5 documentation
KVM: arm64: selftests: Improve error handling for GICv5 PPI selftest
KVM: arm64: selftests: Cleanup unused vars in GICv5 PPI selftest
KVM: arm64: selftests: Add missing GIC CDEN to no-vgic-v5 selftest
KVM: arm64: vgic-v5: Atomically assign bits to PPI DVI bitmap
KVM: arm64: vgic-v5: Add missing trap handing for NV triage
KVM: arm64: vgic-v5: Limit support to 64 PPIs
KVM: arm64: vgic: Rationalise per-CPU irq accessor
KVM: arm64: vgic-v5: Drop defensive checks from vgic_v5_ppi_queue_irq_unlock()
KVM: arm64: vgic: Consolidate vgic_allocate_private_irqs_locked()
KVM: arm64: vgic: Constify struct irq_ops usage
KVM: arm64: vgic-v5: Drop pointless ARM64_HAS_GICV5_CPUIF check
KVM: arm64: vgic-v5: Remove use of __assign_bit() with a constant
KVM: arm64: vgic-v5: Move PPI caps into kvm_vgic_global_state
KVM: arm64: vgic-v5: Add for_each_visible_v5_ppi() iterator
Marc Zyngier [Fri, 12 Jun 2026 08:08:25 +0000 (09:08 +0100)]
Merge branch kvm-arm64/pkvm-fixes-7.2 into kvmarm-master/next
* kvm-arm64/pkvm-fixes-7.2:
: .
: Assorted pKVM fixes for 7.2:
:
: - Ensure that the vcpu memcache is filled in a number of cases (donate,
: share, selftest)
:
: - Fix vmemmap page order handling by resetting it when initialising the
: memory pool
:
: - Don't leak page references on failed memory donation
:
: - Add sanity-check for refcounted pages when donating/sharing pages
:
: - Clear __hyp_running_vcpu on state flush
:
: - Check LR upper bound against a trusted value
:
: - Assorted fixes for the host-side tracking of the pages shared with
: EL2 as a result of some Sashiko testing from Fuad
:
: - Correctly forward HCR_EL2.VSE from host to guest, so that protected
: guests can see SErrors
: .
KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
KVM: arm64: Free hyp-share tracking node when share hypercall fails
KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
KVM: arm64: Fix __pkvm_init_vm error path
KVM: arm64: Reset page order in pKVM hyp_pool
Marc Zyngier [Fri, 12 Jun 2026 08:04:24 +0000 (09:04 +0100)]
Merge branch kvm-arm64/nv-granule-sizes into kvmarm-master/next
* kvm-arm64/nv-granule-sizes:
: .
: Tidying up of the behaviour when the selected page size is not
: implemented, courtesy of Wei-Lin Chang. From the initial cover
: letter:
:
: "This small series fixes the granule size selection for software stage-1
: and stage-2 walks. Previously we treat the guest's TCR/VTCR.TGx as-is
: and use the encoded granule size for the walks. However this is
: incorrect if the granule sizes are not advertised in the guest's
: ID_AA64MMFR0_EL1.TGRAN*. The architecture specifies that when an
: unsupported size is programed in TGx, it must be treated as an
: implemented size. Fix this by choosing an available one while
: prioritizing PAGE_SIZE."
: .
KVM: arm64: Fallback to a supported value for unsupported guest TGx
KVM: arm64: nv: Use literal granule size in TLBI range calculation
KVM: arm64: Factor out TG0/1 decoding of VTCR and TCR
KVM: arm64: nv: Rename vtcr_to_walk_info() to setup_s2_walk()
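The fallback policy quoted from the cover letter ("choosing an available one while prioritizing PAGE_SIZE") could look roughly like the following. The bitmask encoding, the `GRAN_*` names and `pick_granule()` are assumptions made for this sketch, and the order tried after the PAGE_SIZE granule is a guess, not something the series states.

```c
/* Granule sizes as a bitmask; names and encoding are hypothetical. */
#define GRAN_4K  (1u << 0)
#define GRAN_16K (1u << 1)
#define GRAN_64K (1u << 2)

/*
 * Pick the granule to walk with: the requested one if the guest's ID
 * registers advertise it, else the host PAGE_SIZE granule if advertised,
 * else any advertised granule. Returns 0 if none is advertised.
 */
static unsigned int pick_granule(unsigned int requested,
				 unsigned int supported,
				 unsigned int host_page)
{
	static const unsigned int order[] = { GRAN_4K, GRAN_16K, GRAN_64K };
	unsigned int i;

	if (requested & supported)
		return requested;
	if (host_page & supported)
		return host_page;
	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++)
		if (order[i] & supported)
			return order[i];
	return 0;
}
```

For example, a guest that programs 16kB while only 4kB and 64kB are advertised would be walked with the host's page size if that is one of the two.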
Marc Zyngier [Fri, 12 Jun 2026 08:03:57 +0000 (09:03 +0100)]
Merge branch kvm-arm64/nv-fp-elision into kvmarm-master/next
* kvm-arm64/nv-fp-elision:
: .
: Significantly reduce the overhead of the context switch between L1 and
: L2 guests by eliding the save/restore of the FP/SIMD/SVE registers, as
: this state is shared between the two guests, and therefore can be left
: live.
: .
KVM: arm64: nv: Don't save/restore FP register during a nested ERET or exception
KVM: arm64: nv: Track L2 to L1 exception emulation
Marc Zyngier [Fri, 12 Jun 2026 08:03:24 +0000 (09:03 +0100)]
Merge branch kvm-arm64/no-lazy-vgic-init into kvmarm-master/next
* kvm-arm64/no-lazy-vgic-init:
: .
: Fix an ugly situation where the vgic lazy init could happen in
: non-preemptible contexts such as vcpu reset, resulting in lockdep
: splats.
:
: This requires revamping the way in-kernel emulation of devices
: (timers, PMU) presents its interrupts to the vgic, and making
: sure there is no need to init the vgic on the back of that.
: .
KVM: arm64: vgic-v2: Don't init the vgic on in-kernel interrupt injection
KVM: arm64: vgic-v2: Force vgic init on injection outside the run loop
KVM: arm64: pmu: Kill the PMU interrupt level cache
KVM: arm64: timer: Kill the per-timer irq level cache
KVM: arm64: Simplify userspace notification of interrupt state
KVM: arm64: timer: Repaint kvm_timer_{should,irq_can}_fire() to kvm_timer_{pending,enabled}()
Jackie Liu [Thu, 4 Jun 2026 07:51:47 +0000 (15:51 +0800)]
KVM: arm64: vgic-its: Make ABI commit helpers return void
The return values of vgic_its_set_abi() and vgic_its_commit_v0() are always
0 and do not carry useful error information. Simplify by changing them to
void.
Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Oliver Upton <oupton@kernel.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://patch.msgid.link/20260604075147.53299-1-liu.yun@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
Claudio Imbrenda [Thu, 11 Jun 2026 10:48:49 +0000 (12:48 +0200)]
KVM: s390: vsie: Add missing radix_tree_preload() in _gaccess_shadow_fault()
Add missing radix_tree_preload() in _gaccess_shadow_fault() to
guarantee forward progress. The core of _gaccess_shadow_fault() has
been split into ___gaccess_shadow_fault() in order to simplify locking.
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
CC: stable@vger.kernel.org # 7.1
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260611104850.110313-5-imbrenda@linux.ibm.com>
Claudio Imbrenda [Thu, 11 Jun 2026 10:48:47 +0000 (12:48 +0200)]
KVM: s390: Fix unlikely race in try_get_locked_pte()
Fix an unlikely race in try_get_locked_pte(), which could have happened
if puds or pmds get unmapped between the p?dp_get() and p?d_offset()
functions.
Linus Torvalds [Wed, 10 Jun 2026 18:53:55 +0000 (11:53 -0700)]
Merge tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address some remaining fallout after introducing dynamic EPP
support in the amd-pstate driver during the current development cycle:
- Restore allowing writing EPP of 0 when in performance mode in the
amd-pstate driver which was unnecessarily disallowed by one of the
recent updates (Mario Limonciello)
- Remove stale documentation of the epp_cached field in struct
amd_cpudata that has been dropped recently (Zhan Xusheng)"
* tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Fix setting EPP in performance mode
cpufreq/amd-pstate: drop stale @epp_cached kdoc
Linus Torvalds [Wed, 10 Jun 2026 14:18:32 +0000 (07:18 -0700)]
Merge tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Fix the implementation of the CFI branch landing pad control prctl()s
to return -EINVAL if unknown control bits are set, rather than
silently ignoring the request; and add a kselftest for this case
- Fix unaligned access performance testing to happen earlier in boot,
which fixes a performance regression in the lib/checksum code
- Fix a binfmt_elf warning when dumping core (due to missing
.core_note_name for CFI registers)
* tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: cfi: reject unknown flags in PR_SET_CFI
riscv: Fix fast_unaligned_access_speed_key not getting initialized
riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
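The PR_SET_CFI fix listed above follows a common pattern: reject a request outright when it carries bits outside the known set, instead of silently ignoring it. A minimal sketch, assuming made-up control bits (`CTL_ENABLE`, `CTL_LOCK`) and a hypothetical `set_ctl()` helper rather than the real prctl() flags:

```c
#include <errno.h>

/* Hypothetical control bits; the real PR_SET_CFI flags are not shown here. */
#define CTL_ENABLE  (1UL << 0)
#define CTL_LOCK    (1UL << 1)
#define CTL_KNOWN   (CTL_ENABLE | CTL_LOCK)

/* Reject any request carrying bits outside the known set. */
static int set_ctl(unsigned long *state, unsigned long flags)
{
	if (flags & ~CTL_KNOWN)
		return -EINVAL;
	*state = flags;
	return 0;
}
```

Failing before any state is touched keeps unknown bits available for future extensions, since no existing caller can be relying on them being accepted.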
Jann Horn [Fri, 5 Jun 2026 20:27:33 +0000 (22:27 +0200)]
namespace: restrict OPEN_TREE_NAMESPACE/FSMOUNT_NAMESPACE to directories
open_tree(..., OPEN_TREE_NAMESPACE) and
fsmount(..., FSMOUNT_NAMESPACE, ...) currently work on non-directories,
like regular files. That's bad for two reasons:
- It ends up mounting a regular file over the inherited namespace root,
which is a directory; mounting a non-directory over a directory is
normally explicitly forbidden, see for example do_move_mount()
- It causes setns() on the new namespace to set the cwd to a regular
file, which the rest of VFS does not expect
Fix it by restricting create_new_namespace() (which is used by both of
these flags) to directories.
Leave the behavior for OPEN_TREE_CLONE as-is, that seems unproblematic.
Fixes: 9b8a0ba68246 ("mount: add OPEN_TREE_NAMESPACE")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marc Zyngier [Mon, 8 Jun 2026 08:11:08 +0000 (09:11 +0100)]
KVM: arm64: nv: Hold kvm->mmu_lock while initialising vcpu->arch.vncr_tlb
Sashiko reports that there is a race between initialising vncr_tlb
and making use of it, as we don't hold the mmu_lock at this point.
Additionally, it identifies a memory leak, should userspace repeatedly
invoke the KVM_RUN ioctl after a failure of kvm_arch_vcpu_run_pid_change(),
as we assign vncr_tlb blindly on first run, irrespective of prior
allocations.
Slap the two bugs in one go by taking the kvm->mmu_lock on assigning
vncr_tlb, preventing the race for good, and by checking that vncr_tlb
is indeed NULL prior to allocation.
VNCR TLB invalidation occurs from MMU notifiers or TLBI instructions,
and either can race against a vcpu not being onlined yet (no pseudo-TLB
allocated). Similarly, the TLB might be invalid, and the invalidation
should be skipped in this case.
Both kvm_invalidate_vncr_ipa() and kvm_invalidate_vncr_va() are
expected to perform the same checks, except that the latter doesn't
check for the allocation and blindly dereferences the pointer.
Solve this by introducing a new iterator built on top of the usual
kvm_for_each_vcpu() that checks for both of the above conditions,
and convert the two users to it.
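The idea of a filtering iterator that skips vCPUs without a usable pseudo-TLB can be sketched outside the kernel. Everything below is hypothetical: the reduced structures, `vcpu_has_valid_tlb()` and the `for_each_vcpu_with_valid_tlb()` macro are illustrative names, and the real iterator is built on kvm_for_each_vcpu() rather than on a plain array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the real structures. */
struct vncr_tlb { bool valid; };
struct vcpu { struct vncr_tlb *vncr_tlb; };

/* A vCPU is worth visiting only if its pseudo-TLB exists and is valid. */
static bool vcpu_has_valid_tlb(const struct vcpu *v)
{
	return v->vncr_tlb && v->vncr_tlb->valid;
}

/* Iterate over vcpus[0..n), skipping entries that fail the check above. */
#define for_each_vcpu_with_valid_tlb(i, v, vcpus, n)			\
	for ((i) = 0; (i) < (n) && ((v) = &(vcpus)[(i)], true); (i)++)	\
		if (!vcpu_has_valid_tlb(v)) { } else

/* Invalidate every valid pseudo-TLB; returns how many were invalidated. */
static size_t invalidate_all(struct vcpu *vcpus, size_t n)
{
	struct vcpu *v;
	size_t i, count = 0;

	for_each_vcpu_with_valid_tlb(i, v, vcpus, n) {
		v->vncr_tlb->valid = false;
		count++;
	}
	return count;
}
```

Putting both checks in the iterator means no caller can forget the NULL test, which is the failure mode described above.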
Linus Torvalds [Wed, 10 Jun 2026 00:20:00 +0000 (17:20 -0700)]
Merge tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bound memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitors are not initialized relying on potentially corrupted
state like the monitoring flag, that is not reset by all monitors
type and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix suffix strip in dot2k
tools/rv: Fix cleanup after failed trace setup
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix substring match bug in monitor name search
tools/rv: Ensure monitor name and desc are NUL-terminated
rv: Use 0 to check preemption enabled in opid
rv: Prevent task migration while handling per-CPU events
rv: Ensure synchronous cleanup for HA monitors
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Fix monitor start ordering and memory ordering for monitoring flag
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Reset per-task DA monitors before releasing the slot
rv: Fix __user specifier usage in extract_params()
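The release-acquire pairing mentioned for the monitoring flag can be shown with C11 atomics. This is a userspace sketch under stated assumptions: `struct monitor`, `monitor_start()` and `monitor_read_state()` are hypothetical, and the kernel would use its own primitives (such as smp_store_release() and smp_load_acquire()) rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical monitor: state must be initialised before the flag is seen. */
struct monitor {
	int curr_state;
	atomic_bool monitoring;
};

/* Writer: initialise state, then publish with release semantics. */
static void monitor_start(struct monitor *m, int initial_state)
{
	m->curr_state = initial_state;
	atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store, so a reader that
 * observes monitoring == true also observes the initialised curr_state.
 */
static bool monitor_read_state(struct monitor *m, int *state)
{
	if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
		return false;
	*state = m->curr_state;
	return true;
}
```

Without the pairing, a handler on another CPU could see the flag set while still reading stale state left over from a previous user of the storage.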
Linus Torvalds [Wed, 10 Jun 2026 00:05:19 +0000 (17:05 -0700)]
Merge tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull RTLA fix from Steven Rostedt:
- Fix multi-character short option parsing
Fix regression in parsing of multiple-character short options
(eg -p100 /= -p 100/, -un /= -u -n/) caused by getopt_long()
internal state corruption after a refactoring.
* tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix parsing of multi-character short options
Linus Torvalds [Tue, 9 Jun 2026 15:24:25 +0000 (08:24 -0700)]
Merge tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 9 are for MM. 8 are cc:stable and the remaining 3 address
post-7.1 issues or aren't considered suitable for backporting.
There's a two-patch series "mm/damon/{reclaim,lru_sort}: handle ctx
allocation failures" from SeongJae Park which fixes a couple of DAMON
-ENOMEM bloopers. The rest are singletons - please see the individual
changelogs for details"
* tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mincore: handle non-swap entries before !CONFIG_SWAP guard
arm64: mm: call pagetable dtor when freeing hot-removed page tables
mm/list_lru: drain before clearing xarray entry on reparent
mm/huge_memory: use correct flags for device private PMD entry
mm/damon/lru_sort: handle ctx allocation failure
mm/damon/reclaim: handle ctx allocation failure
zram: fix use-after-free in zram_bvec_write_partial()
MAINTAINERS: update Baoquan He's email address
tools headers UAPI: sync linux/taskstats.h for procacct.c
mm/cma_sysfs: skip inactive CMA areas in sysfs
ipc/shm: serialize orphan cleanup with shm_nattch updates
Linus Torvalds [Tue, 9 Jun 2026 15:19:48 +0000 (08:19 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Several significant bug fixes of pre-existing issues:
- Missing validation on ucap fd types passed from userspace
- Missing validation of HW DMA space vs userspace expected sizes in
EFA queue setup
- DMA corruption when using DMA block sizes >= 4G when setting up MRs
in all drivers
- Missing validation of CPU IDs when setting up dma handles
- Missing validation of IB_MR_REREG_ACCESS when changing writability
of a MR
- Missing validation of received message/packet size in ISER and SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/srp: bound SRP_RSP sense copy by the received length
IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN
RDMA: During rereg_mr ensure that REREG_ACCESS is compatible
RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc
RDMA/umem: Fix truncation for block sizes >= 4G
RDMA/efa: Validate SQ ring size against max LLQ size
RDMA/core: Validate the passed in fops for ib_get_ucaps()
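The class of bug behind "Fix truncation for block sizes >= 4G" is easy to reproduce in isolation. The two helpers below are hypothetical and unrelated to the actual umem code; they only contrast a block size stored in a 32-bit variable with one kept 64 bits wide.

```c
#include <stdint.h>

/* Buggy shape: a 32-bit variable cannot hold a block size of 4 GiB or more. */
static uint64_t block_size_truncated(unsigned int shift)
{
	uint32_t size = (uint32_t)(UINT64_C(1) << shift);

	return size;
}

/* Fixed shape: keep the computation and the result 64 bits wide. */
static uint64_t block_size_full(unsigned int shift)
{
	return UINT64_C(1) << shift;
}
```

For a shift of 32, the truncated variant yields zero, which is the kind of value that can lead to corrupted DMA descriptors when it is later used as a size or mask.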
Recursively zap orphaned nested TDP shadow pages when emulating a guest
write to a shadowed page table, regardless of whether or not the associated
(parent) shadow page will be zapped, e.g. due to detected write-flooding.
This plugs a hole where KVM fails to reclaim defunct, unsync shadow pages
for select L1 hypervisor patterns. Commit 2de4085cccea ("KVM: x86/MMU:
Recursively zap nested TDP SPs when zapping last/only parent") modified KVM
to recursively zap synchronized shadow pages (KVM already recursively zaps
unsync children) when a child is orphaned. But the fix effectively only
applied the logic to kvm_mmu_page_unlink_children(), i.e. only performs the
recursive zap when KVM is already zapping a parent SP and processing its
children.
If L1 zaps SPTEs bottom-up (4KiB => 2MiB => ...), as KVM's TDP MMU does
with CONFIG_KVM_PROVE_MMU=n since commit 8ca983631f3c ("KVM: x86/mmu: Zap
invalidated TDP MMU roots at 4KiB granularity"), then KVM (as L0) will leak
upwards of 4 shadow pages per GiB of L2 guest memory. Over hundreds or
thousands of L2 boots, if the VM is "lucky" enough to escape write-flooding
detection, i.e. not trigger reclaim of the orphaned shadow pages by dumb
luck, then it's possible to end up with tens or even hundreds of thousands
of unsync shadow pages and associated rmap entries.
Polluting the hash table and rmap entries with a horde of stale entries
can eventually degrade L2 guest boot time by an order of magnitude,
especially if there is any antagonistic activity in the host, i.e. anything
that will contend for mmu_lock and/or needs to walk rmaps.
With "top"-down zapping, where "top" is 1GiB or above, then L0 KVM is
effectively limited to leaking 4 shadow pages per 256 GiB of memory, as
KVM's write flooding detection will kick in on the third write to an L1
TDP PUD, and thus recursively zap the entire 256 GiB range of the parent
PGD. I.e. even though L1 KVM still recursively zaps 2MiB => 4KiB SPTEs
when zapping each 1GiB SPTE, KVM only gets through two of the 1GiB SPTEs
before dropping everything. E.g. hacking tracing into L0 KVM's
kvm_mmu_track_write(), the top-down zapping of L1's TDP MMU for an L2 with
16GiB of memory leads to:
Note, in the shadow MMU, "level" describes the level a shadow page "points"
at, not the level of its associated SPTE. I.e. when write-flooding of 1GiB
PUD entries is detected, KVM recursively zaps shadow pages covering 256GiB
worth of memory. And as shown above, KVM's write-flooding detection
operates at all levels, so a single PMD (in L1) can effectively only leak
two unsync children (4KiB shadow pages) before it gets recursively zapped.
As a result, for the top-down zap, L0 KVM will leak at most 4 unsync shadow
pages per 256GiB of L2 memory.
The top-down zap also makes it more likely that L1 will self-heal (to some
extent), as any shadow pages that are "rediscovered" by future runs of L2
can get reclaimed by a recursive zap, whereas bottom-up zapping orphans
shadow pages over and over.
Note, in theory, there is some risk of over-zapping, e.g. due to zapping a
large branch of the paging tree that L1 is only temporarily removing. In
practice, the usage patterns of hypervisors are highly unlikely to trigger
false positives. E.g. temporarily changing paging protections is typically
done at the leaf, not on a non-leaf entry. And if the L1 hypervisor is
updating large swaths of PTEs, e.g. to (temporarily?) remove chunks of
memory from L2, then L0 KVM's write-flooding detection will kick in, and
the children would be zapped anyways.
Fixes: 2de4085cccea ("KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent") Cc: Yosry Ahmed <yosry@kernel.org> Cc: Jim Mattson <jmattson@google.com> Cc: James Houghton <jthoughton@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260605174611.2222504-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
RDMA/srp: bound SRP_RSP sense copy by the received length
srp_process_rsp() copies sense data from rsp->data + resp_data_len,
where resp_data_len is the full 32-bit value supplied by the SRP target
and is never checked against the number of bytes actually received
(wc->byte_len). The copy length is bounded to SCSI_SENSE_BUFFERSIZE, so
at most 96 bytes are copied, but the source offset is not bounded.
A malicious or compromised SRP target on the InfiniBand/RoCE fabric that
the initiator has logged into can return an SRP_RSP with
SRP_RSP_FLAG_SNSVALID set and a large resp_data_len. The receive buffer
is allocated at the target-chosen max_ti_iu_len, so the source of the
sense copy lands past the bytes actually received; with resp_data_len
near 0xFFFFFFFF it is gigabytes past the buffer and the read faults.
Copy the sense data only if it has not been truncated, that is, only if
the response header, the response data, and the sense region fit within
the bytes actually received; otherwise drop the sense and log. The
in-tree iSER and NVMe-RDMA receive paths already bound their parse by
wc->byte_len; this brings ib_srp into line with them.
Fixes: aef9ec39c47f ("IB: Add SCSI RDMA Protocol (SRP) initiator") Link: https://patch.msgid.link/r/20260602220457.2542840-1-michael.bommarito@gmail.com Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
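The bound described in the SRP commit above can be sketched in userspace C. This is an illustration only, not the ib_srp code: `struct toy_rsp`, `toy_copy_sense()` and `SENSE_BUF_SIZE` are hypothetical names, the layout is simplified, and checking the clamped copy length (rather than the full advertised sense length) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENSE_BUF_SIZE 96u	/* mirrors the 96-byte cap mentioned in the text */

/* Hypothetical, simplified response layout; not the real struct srp_rsp. */
struct toy_rsp {
	uint32_t resp_data_len;		/* target-supplied, untrusted */
	uint32_t sense_data_len;	/* target-supplied, untrusted */
	uint8_t  data[];		/* response data, then sense data */
};

/*
 * Copy sense data only if the header, the response data and the sense bytes
 * being copied all lie within the bytes actually received. Returns the number
 * of bytes copied, or 0 if the response is truncated and the sense is dropped.
 */
static size_t toy_copy_sense(const struct toy_rsp *rsp, size_t byte_len,
			     uint8_t *sense_out)
{
	uint64_t need;
	size_t copy_len;

	if (byte_len < sizeof(*rsp))
		return 0;

	copy_len = rsp->sense_data_len;
	if (copy_len > SENSE_BUF_SIZE)
		copy_len = SENSE_BUF_SIZE;

	/* 64-bit sum, so a resp_data_len near 0xFFFFFFFF cannot wrap. */
	need = (uint64_t)sizeof(*rsp) + rsp->resp_data_len + copy_len;
	if (need > byte_len)
		return 0;

	memcpy(sense_out, rsp->data + rsp->resp_data_len, copy_len);
	return copy_len;
}
```

The point of the sketch is that the source offset, not just the copy length, is validated against the received byte count before any read happens.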
IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN
In drivers/infiniband/ulp/isert/ib_isert.c, isert_login_recv_done()
computes the login request payload length as wc->byte_len minus
ISER_HEADERS_LEN with no lower bound, and login_req_len is a signed int.
A remote iSER initiator can post a login Send work request carrying
fewer than ISER_HEADERS_LEN (76) bytes, so the subtraction underflows
and login_req_len becomes negative.
isert_rx_login_req() then reads that negative length back into a signed
int, takes size = min(rx_buflen, MAX_KEY_VALUE_PAIRS), and because the
min() is signed it keeps the negative value; the value is then passed as
the memcpy() length and sign-extended to a multi-gigabyte size_t. The
copy into the 8192-byte login->req_buf runs far out of bounds and
faults, crashing the target node. The login phase precedes iSCSI
authentication, so no credentials are required to reach this path.
Reject any login PDU shorter than ISER_HEADERS_LEN before the
subtraction, mirroring the existing early return on a failed work
completion, so login_req_len can never go negative. The upper bound was
already safe: a posted login buffer cannot deliver more than
ISER_RX_PAYLOAD_SIZE, so the difference stays at or below
MAX_KEY_VALUE_PAIRS and the existing min() clamps it; only the missing
lower bound needs to be added.
Fixes: b8d26b3be8b3 ("iser-target: Add iSCSI Extensions for RDMA (iSER) target driver") Link: https://patch.msgid.link/r/20260602194642.2273217-1-michael.bommarito@gmail.com Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
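The underflow in the iSER login path above is easy to reproduce in isolation. The following is a userspace sketch under stated assumptions: all `toy_*` names are hypothetical, `TOY_HEADERS_LEN` uses the 76-byte figure from the text, and `TOY_MAX_PAYLOAD` stands in for the 8192-byte buffer; it is not the ib_isert code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HEADERS_LEN 76u	/* ISER_HEADERS_LEN, per the text */
#define TOY_MAX_PAYLOAD 8192	/* size of the login request buffer, per the text */

/* Unchecked computation: the payload length goes negative for short PDUs. */
static int toy_login_len_unchecked(uint32_t byte_len)
{
	return (int)byte_len - (int)TOY_HEADERS_LEN;
}

/* Signed min(): a negative length survives the clamp unchanged. */
static int toy_min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Checked variant: reject short PDUs before subtracting. Returns 0 and
 * stores the clamped payload length on success, -1 if the PDU is too short.
 */
static int toy_login_len_checked(uint32_t byte_len, size_t *len_out)
{
	uint32_t len;

	if (byte_len < TOY_HEADERS_LEN)
		return -1;

	len = byte_len - TOY_HEADERS_LEN;
	if (len > TOY_MAX_PAYLOAD)
		len = TOY_MAX_PAYLOAD;
	*len_out = len;
	return 0;
}
```

Casting the unchecked, clamped result to `size_t` (as a `memcpy()` length would be) yields a value far larger than the buffer, which is the failure mode the commit describes; the checked variant never produces a negative intermediate.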
Jason Gunthorpe [Thu, 4 Jun 2026 18:03:13 +0000 (15:03 -0300)]
RDMA: During rereg_mr ensure that REREG_ACCESS is compatible
If IB_MR_REREG_ACCESS changes from RO to RW then the umem has to be
re-evaluated to ensure it is properly pinned as RW. Since the umem is
hidden inside each driver's mr struct add a ib_umem_check_rereg() function
that each driver has to call before processing IB_MR_REREG_ACCESS.
mlx4 has to retain its duplicate ib_access_writable check because it
implements IB_MR_REREG_ACCESS | IB_MR_REREG_TRANS by changing both items
in place sequentially while the MR is live, so it will continue to not
support this combination.
Carlos López [Wed, 3 Jun 2026 11:45:04 +0000 (13:45 +0200)]
Documentation: KVM: Synchronize x86 VM types
KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf
("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since 161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the
documentation to reflect this fact.
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported") Signed-off-by: Carlos López <clopez@suse.de> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260603114504.814647-2-clopez@suse.de
[sean: use one tab instead of two] Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: selftests: Add regression test for mediated PMU fixed counter filter bug
Add a regression test where KVM would inadvertently ignore PMU event
filters on writes that change _some_ bits in FIXED_CTR_CTRL, but not the
enable bits for PMCs that are denied to the guest.
KVM: x86/pmu: Use hardware value when reprogramming for FIXED_CTR_CTRL changes
When (conditionally) reprogramming fixed counters, use the hardware value
of FIXED_CTR_CTRL to detect changes, not the guest's original value. For
guests with a mediated PMU, overwriting fixed_ctr_ctrl_hw at the start of
reprogramming without actually reacting to changes in fixed_ctr_ctrl_hw can
lead to KVM ignoring PMU event filters.
E.g. if the guest attempts to enable a fixed PMC that is disallowed, and
then toggles a different PMC in a subsequent WRMSR, KVM will update
pmu->fixed_ctr_ctrl_hw and reprogram the PMC that is changing, but not the
others that are now effectively enabled in pmu->fixed_ctr_ctrl_hw.
Note, the perf-based PMU is unaffected, as it doesn't use fixed_ctr_ctrl_hw
(which is also why keying off fixed_ctr_ctrl_hw works for both PMUs).
Note #2, fixed_ctr_ctrl_hw won't mess up pmc_in_use either, because the
latter isn't used by the mediated PMU. Its purpose is solely to release
perf events that are no longer being actively used, and the mediated PMU
obviously doesn't create perf events.
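The FIXED_CTR_CTRL scenario above can be modelled with a toy state machine. This is a heavily simplified sketch, not KVM's PMU code: `struct toy_pmu`, `toy_write_ctrl()` and the `denied_mask` filter representation are all hypothetical, and "reprogramming" is reduced to clearing a denied counter's 4-bit field in the hardware value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_FIXED	3
#define TOY_FIELD_BITS	4	/* FIXED_CTR_CTRL has one 4-bit field per counter */

struct toy_pmu {
	uint64_t ctrl;		/* value the guest wrote */
	uint64_t ctrl_hw;	/* value that would be loaded into hardware */
	uint8_t  denied_mask;	/* bit i set: the filter denies fixed counter i */
};

static uint64_t toy_field(uint64_t v, int i)
{
	return (v >> (i * TOY_FIELD_BITS)) & 0xf;
}

/* "Reprogram" counter i: a denied counter is kept disabled in hardware. */
static void toy_reprogram(struct toy_pmu *pmu, int i)
{
	if (pmu->denied_mask & (1u << i))
		pmu->ctrl_hw &= ~(0xfULL << (i * TOY_FIELD_BITS));
}

/*
 * Handle a guest write. use_hw selects which old value is diffed against the
 * new one: the guest's previous value (old behaviour in this model) or the
 * previous hardware value (the behaviour the commit describes).
 */
static void toy_write_ctrl(struct toy_pmu *pmu, uint64_t data, bool use_hw)
{
	uint64_t old = use_hw ? pmu->ctrl_hw : pmu->ctrl;
	int i;

	pmu->ctrl = data;
	pmu->ctrl_hw = data;	/* overwritten up front, as the text describes */

	for (i = 0; i < TOY_NR_FIXED; i++) {
		if (toy_field(old, i) != toy_field(data, i))
			toy_reprogram(pmu, i);
	}
}
```

In this model, writing 0x003 and then 0x033 with counter 0 denied leaves counter 0 enabled in `ctrl_hw` when diffing against the guest value, and disabled when diffing against the hardware value.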
Hyunwoo Kim [Sat, 6 Jun 2026 14:44:52 +0000 (23:44 +0900)]
KVM: x86: hyper-v: Bound the bank index when querying sparse banks
When checking if a VP ID is included in a sparse bank set, explicitly check
that the ID can actually be contained in a sparse bank (the TLFS allows for
a maximum of 64 banks of 64 vCPUs each). When handling a paravirtual TLB
flush for L2, the VP ID is copied verbatim from the enlightened VMCS,
without any bounds check, i.e. isn't guaranteed to be under the limit of
4096.
Failure to check the bounds of the VP ID leads to an out-of-bounds read
when testing the sparse bank, and super strictly speaking could lead to KVM
performing an unnecessary TLB flush for an L2 vCPU.
==================================================================
BUG: KASAN: use-after-free in hv_is_vp_in_sparse_set+0x85/0x100 [kvm]
Read of size 8 at addr ffff88811ba5f598 by task hyperv_evmcs/2802
Opportunistically add a compile time assertion to ensure the maximum number
of sparse banks exactly matches the number of possible bits in the passed
in mask.
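A minimal userspace sketch of the bounded lookup follows. The names are hypothetical and the packed layout (one mask bit per bank, with only the present banks stored, in ascending order) is an assumption about the sparse-set format rather than a copy of KVM's helper.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BANKS		64u	/* at most 64 banks ... */
#define TOY_VPS_PER_BANK	64u	/* ... of 64 vCPUs each, per the text */

/* Compile-time check: the mask has exactly one bit per possible bank. */
_Static_assert(sizeof(uint64_t) * CHAR_BIT == TOY_MAX_BANKS,
	       "one mask bit per sparse bank");

/*
 * Return true if vp_id is in the sparse set. Bit b of valid_bank_mask says
 * whether bank b is present; banks[] holds only the present banks, packed.
 */
static bool toy_vp_in_sparse_set(uint32_t vp_id, uint64_t valid_bank_mask,
				 const uint64_t *banks)
{
	uint32_t bank = vp_id / TOY_VPS_PER_BANK;
	uint32_t bit = vp_id % TOY_VPS_PER_BANK;
	unsigned int idx = 0, b;

	if (bank >= TOY_MAX_BANKS)	/* the explicit bound on the bank index */
		return false;
	if (!(valid_bank_mask & (1ULL << bank)))
		return false;

	for (b = 0; b < bank; b++)	/* count present banks below this one */
		idx += (valid_bank_mask >> b) & 1;

	return (banks[idx] & (1ULL << bit)) != 0;
}
```

Without the first check, a VP ID of 4096 or more would produce a shift count of 64 or greater and an index past the end of the bank array.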
KVM: guest_memfd: fix NUMA interleave index double-counting
kvm_gmem_get_policy() sets the interleave index (the output param that's
typically named "ilx") to the full page offset (vm_pgoff + vma offset).
But get_vma_policy() adds the page offset on top of the interleave index,
and so the offset is counted twice. This causes NUMA interleaving to skip
nodes: for order-0 pages the effective index jumps by 2 for each
consecutive page.
The vm_op.get_policy() implementation should return only a per-file bias in
the interleave index (like shmem_get_policy does with inode->i_ino),
letting get_vma_policy() add the page-offset component.
Fix by setting the output interleave index to the inode number (a la shmem)
instead of the full page offset, as the index is intended to be a constant,
semi-random value for a given file, e.g. so that interleaving doesn't start
at the same node for every file, and so that allocations are round-robined
across nodes based on the page offset (the selected node would bounce/skip
around if the index isn't constant).
Found by Sashiko (sashiko.dev) AI code review.
Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy") Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Shivank Garg <shivankg@amd.com> Tested-by: Shivank Garg <shivankg@amd.com> Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()") Link: https://patch.msgid.link/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com
[sean: use reverse fir-tree, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
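The double-counting in the guest_memfd commit above can be shown with a few lines of arithmetic. This is a toy model for order-0 pages only: `toy_interleave_node()` stands in for the caller that adds the page offset to the returned index, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the caller: it adds the page offset to the bias it is given. */
static unsigned int toy_interleave_node(uint64_t bias, uint64_t pgoff,
					unsigned int nr_nodes)
{
	return (unsigned int)((bias + pgoff) % nr_nodes);
}

/* Buggy bias: the full page offset, so the offset ends up counted twice. */
static uint64_t toy_bias_buggy(uint64_t ino, uint64_t pgoff)
{
	(void)ino;
	return pgoff;
}

/* Fixed bias: a per-file constant (the inode number, as shmem does). */
static uint64_t toy_bias_fixed(uint64_t ino, uint64_t pgoff)
{
	(void)pgoff;
	return ino;
}
```

With four nodes, the buggy bias maps page offsets 0, 1, 2, 3 to nodes 0, 2, 0, 2, so half of the nodes are never used. With a constant bias of 7 (an arbitrary inode number), the same offsets map to nodes 3, 0, 1, 2.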
Linus Torvalds [Mon, 8 Jun 2026 14:31:41 +0000 (07:31 -0700)]
Merge tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- MSHV driver fixes from various people (Anirudh Rayabharam, Can Peng,
Dexuan Cui, Michael Kelley, Jork Loeser, Wei Liu)
- Hyper-V user space tools fixes (Thorsten Blum)
- Allow VMBus to be unloaded after frame buffer is flushed (Michael
Kelley)
* tag 'hyperv-fixes-signed-20260607' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: support 1G hugepages by passing them as 2M-aligned chunks
Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
mshv: use kmalloc_array in mshv_root_scheduler_init
mshv: Add conditional VMBus dependency
hyperv: Clean up and fix the guest ID comment in hvgdk.h
drm/hyperv: During panic do VMBus unload after frame buffer is flushed
Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
mshv: unmap debugfs stats pages on kexec
mshv: clean up SynIC state on kexec for L1VH
mshv: limit SynIC management to MSHV-owned resources
hv: utils: replace deprecated strcpy with strscpy in kvp_register
hv: utils: handle and propagate errors in kvp_register
mshv: add a missing padding field
Merge tag 'amd-pstate-v7.1-2026-06-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Pull amd-pstate fixes for 7.1 (2026-06-02) from Mario Limonciello:
"* Fix a kdoc issue
* Fix an issue setting performance state in EPP mode introduced earlier in
the cycle from new 7.1 content"
* tag 'amd-pstate-v7.1-2026-06-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
cpufreq/amd-pstate: Fix setting EPP in performance mode
cpufreq/amd-pstate: drop stale @epp_cached kdoc
Linus Torvalds [Sun, 7 Jun 2026 20:12:29 +0000 (13:12 -0700)]
Merge tag 'x86-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- Add more AMD Zen6 models (Pratik Vishwakarma)
- Avoid confusing bootup message by the Intel resctrl enumeration
code when running on certain AMD systems (Tony Luck)
* tag 'x86-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Only check Intel systems for SNC
x86/CPU/AMD: Add more Zen6 models
Linus Torvalds [Sun, 7 Jun 2026 20:02:02 +0000 (13:02 -0700)]
Merge tag 'timers-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Fix the arch_inlined_clockevent_set_next_coupled() prototype in the
!CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST case (Naveen Kumar Chaudhary)
- Fix an off-by-1 bug in the sys_settimeofday() usecs validation code
(Naveen Kumar Chaudhary)
- Mark vdso_k_*_data pointers as __ro_after_init (Thomas Weißschuh)
- Fix livelock race in tmigr_handle_remote_up() (Amit Matityahu)
* tag 'timers-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix livelock in tmigr_handle_remote_up()
vdso/datastore: Mark vdso_k_*_data pointers as __ro_after_init
time: Fix off-by-one in settimeofday() usec validation
clockevents: Fix duplicate type specifier in stub function parameter
Linus Torvalds [Sun, 7 Jun 2026 19:54:37 +0000 (12:54 -0700)]
Merge tag 'sched-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq fix from Ingo Molnar:
- Fix uninitialized stack variable in rseq_exit_user_update() (Qing
Wang)
* tag 'sched-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rseq: Fix using an uninitialized stack variable in rseq_exit_user_update()
Linus Torvalds [Sun, 7 Jun 2026 19:43:21 +0000 (12:43 -0700)]
Merge tag 'locking-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
- Fix a NULL pointer dereference bug in the FUTEX_CMP_REQUEUE_PI
code (Ji'an Zhou)
- Fix a NULL pointer dereference bug in the rtmutex code (Davidlohr
Bueso)
* tag 'locking-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Skip remove_waiter() when waiter is not enqueued
futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock
Wei-Lin Chang [Fri, 5 Jun 2026 18:52:55 +0000 (19:52 +0100)]
KVM: arm64: Fix block mapping validity check in stage-1 walker
For the 64K granule size, FEAT_LPA determines whether a level 1 mapping
is allowed. Using the result of has_52bit_pa() is too restrictive, as it
also checks the selected output address size in TCR.(I)PS. Fix it by
only checking FEAT_LPA.
KVM: arm64: Set a Linux errno on SMCCC error in kvm_call_hyp_nvhe()
If kvm_call_hyp_nvhe() fails with an SMCCC error code, we WARN().
However, the returned value isn't initialized and the caller might get
garbage or 0 which is likely to be interpreted as success.
Set a default -EOPNOTSUPP error value, ensuring all callers get the
message when hypercalls fail.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260603110312.2909844-1-vdonnefort@google.com
[maz: changed error value to -EOPNOTSUPP as suggested by Will,
tidied up change log] Signed-off-by: Marc Zyngier <maz@kernel.org>
tabba@google.com [Fri, 29 May 2026 12:17:55 +0000 (13:17 +0100)]
KVM: arm64: Roll back partial shares on kvm_share_hyp() failure
kvm_share_hyp() shares a range one page at a time. If share_pfn_hyp()
fails partway through, the pages already shared by this call are left
shared, while the caller treats the whole range as failed and never
unshares them.
Unshare those pages before returning the error. If an unshare itself
fails the page is leaked: it stays shared with the hypervisor and is
no longer reusable for pKVM, but no isolation guarantee is broken, so
WARN and continue. Not expected in practice.
Fixes: a83e2191b7f1 ("KVM: arm64: pkvm: Refcount the pages shared with EL2") Suggested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260529121755.2923500-4-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
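The rollback pattern from the kvm_share_hyp() commit above, sketched in userspace C. Everything here is hypothetical: a boolean array stands in for the hypervisor's share state (the real code is refcounted), `toy_fail_at` injects a failure, and a failed unshare is simply ignored where the kernel would WARN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

static bool toy_shared[TOY_NR_PAGES];	/* stand-in for hypervisor share state */
static int toy_fail_at = -1;		/* page whose share fails, or -1 for none */

static int toy_share_page(size_t pfn)
{
	if ((int)pfn == toy_fail_at)
		return -1;
	toy_shared[pfn] = true;
	return 0;
}

static int toy_unshare_page(size_t pfn)
{
	toy_shared[pfn] = false;
	return 0;
}

/* Share [start, end); on failure, unshare what this call already shared. */
static int toy_share_range(size_t start, size_t end)
{
	size_t cur;
	int ret;

	for (cur = start; cur < end; cur++) {
		ret = toy_share_page(cur);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	/* cur is the page that failed; walk back over the pages before it. */
	while (cur-- > start)
		(void)toy_unshare_page(cur);	/* the kernel would WARN on failure */
	return ret;
}
```

The caller still sees the original error, but no page from the failed call is left shared behind its back.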
tabba@google.com [Fri, 29 May 2026 12:17:54 +0000 (13:17 +0100)]
KVM: arm64: Avoid host/hyp share desync on unshare hypercall failure
unshare_pfn_hyp() erases the tracking node from hyp_shared_pfns
and frees it before invoking __pkvm_host_unshare_hyp. If the
hypercall fails (e.g. EL2 refcount still held, or page-state
mismatch), the host loses its record while EL2 still holds the
share, breaking later share/unshare attempts on the same pfn.
Invoke the hypercall first; erase and free only on success.
Document at the kvm_unshare_hyp() call site that the WARN_ON() is
left non-fatal: a failed unshare leaks the page (it stays shared
with the hypervisor) but breaks no isolation guarantee.
Fixes: 52b28657ebd7 ("KVM: arm64: pkvm: Unshare guest structs during teardown") Reported-by: Sashiko (local):gemini-3.1-pro Suggested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260529121755.2923500-3-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
tabba@google.com [Fri, 29 May 2026 12:17:53 +0000 (13:17 +0100)]
KVM: arm64: Free hyp-share tracking node when share hypercall fails
share_pfn_hyp() inserts a tracking node into hyp_shared_pfns and
then invokes __pkvm_host_share_hyp. If the hypercall rejects the
share (page-state mismatch at EL2), the node stays in the tree
with refcount 1: a phantom share that leaks the allocation and
that a later unshare will trust.
Erase the node and free it on hypercall failure.
Fixes: a83e2191b7f1 ("KVM: arm64: pkvm: Refcount the pages shared with EL2") Reported-by: Sashiko (local):gemini-3.1-pro Suggested-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260529121755.2923500-2-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
tabba@google.com [Sun, 31 May 2026 15:45:48 +0000 (16:45 +0100)]
KVM: arm64: Flush HCR_EL2.VSE to deliver SErrors to pKVM guests
With pKVM enabled, the host injects a virtual SError by setting
HCR_EL2.VSE on its vCPU copy, but flush_hyp_vcpu() only flows TWI/TWE
into the hyp vCPU that runs, so VSE never reaches it and a deferred
(masked) SError is never delivered. VSE is a host-owned injection
control, not a trap-configuration bit, so restricting the host's
trap-register values should not have dropped it.
Flow it on entry; sync_hyp_vcpu() already copies hcr_el2 back, so
delivery is reflected to the host. This makes it consistent with
the existing forwarding of VSESR_EL2, which qualifies the SError.
Fixes: b56680de9c648 ("KVM: arm64: Initialize trap register values in hyp in pKVM") Reported-by: Sashiko (local):gemini-3.1-pro Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20260531154548.1505799-1-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Hyunwoo Kim [Sat, 6 Jun 2026 17:56:11 +0000 (02:56 +0900)]
KVM: arm64: Bound used_lrs when flushing the pKVM hyp vCPU
flush_hyp_vcpu() copies the host vGIC state into the hyp's private vCPU
on every run. The vGIC list register save and restore use used_lrs as
their loop bound and expect it to stay within the number of implemented
list registers. While this is generally the case, flush_hyp_vcpu()
copies vgic_v3 verbatim and does not enforce this, so a value provided
by the host is used at EL2 to index vgic_lr[] and access ICH_LR<n>_EL2
(host -> EL2).
Fix by clamping used_lrs to the number of implemented list registers
after the copy, as the trusted path already does in
vgic_flush_lr_state(). The number of implemented list registers is
constant after init, so it is replicated once from
kvm_vgic_global_state.nr_lr into hyp_gicv3_nr_lr rather than read on
every entry.
Cc: stable@vger.kernel.org Fixes: be66e67f1750 ("KVM: arm64: Use the pKVM hyp vCPU structure in handle___kvm_vcpu_run()") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260606175614.83273-3-imv4bel@gmail.com Signed-off-by: Marc Zyngier <maz@kernel.org>
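The clamp-after-copy idea from the used_lrs commit above, as a small userspace sketch. The struct, the `toy_nr_lr` value of 4 and the consumer are hypothetical stand-ins, not the pKVM code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_LRS 16

struct toy_vgic {
	unsigned int used_lrs;		/* loop bound, supplied by the host */
	uint64_t lr[TOY_MAX_LRS];
};

/* Set once at init from trusted state; treated as constant afterwards. */
static unsigned int toy_nr_lr = 4;

/* Copy untrusted host state, then clamp the loop bound before it is used. */
static void toy_flush_vgic(struct toy_vgic *hyp, const struct toy_vgic *host)
{
	memcpy(hyp, host, sizeof(*hyp));
	if (hyp->used_lrs > toy_nr_lr)
		hyp->used_lrs = toy_nr_lr;
}

/* Consumer that trusts used_lrs as its loop bound. */
static uint64_t toy_sum_lrs(const struct toy_vgic *hyp)
{
	uint64_t sum = 0;
	unsigned int i;

	for (i = 0; i < hyp->used_lrs; i++)
		sum += hyp->lr[i];
	return sum;
}
```

Clamping at the copy site means every later consumer of the private copy can keep using `used_lrs` as a bound without re-validating it.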
Hyunwoo Kim [Sat, 6 Jun 2026 17:56:10 +0000 (02:56 +0900)]
KVM: arm64: Clear __hyp_running_vcpu when flushing the pKVM hyp vCPU
flush_hyp_vcpu() copies the host vCPU context into the hyp's private
vCPU on every run. ctxt_to_vcpu() expects a guest context to have a
NULL __hyp_running_vcpu, which is only ever set on the host context, so
that it resolves the vCPU via container_of(). While this is generally
the case, flush_hyp_vcpu() copies the context verbatim and does not
enforce this, so a value provided by the host is dereferenced at EL2
(host -> EL2).
Fix by clearing __hyp_running_vcpu after the copy.
Cc: stable@vger.kernel.org Fixes: be66e67f1750 ("KVM: arm64: Use the pKVM hyp vCPU structure in handle___kvm_vcpu_run()") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260606175614.83273-2-imv4bel@gmail.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Richard Patel [Mon, 18 May 2026 18:39:18 +0000 (18:39 +0000)]
riscv: cfi: reject unknown flags in PR_SET_CFI
prctl(PR_SET_CFI,PR_CFI_BRANCH_LANDING_PADS) silently ignored
unknown control values. Only PR_CFI_{ENABLE,DISABLE,LOCK} should
be permitted.
This changes the behavior of the uABI (fails previously accepted bits
with EINVAL).
Fixes: 08ee1559052b ("prctl: cfi: change the branch landing pad prctl()s to be more descriptive") Signed-off-by: Richard Patel <ripatel@wii.dev> Link: https://patch.msgid.link/20260518183918.322545-1-ripatel@wii.dev
[pjw@kernel.org: change the patch description to note that although this is a uABI change, it does not break the uABI] Signed-off-by: Paul Walmsley <pjw@kernel.org>
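Rejecting unknown flag bits, as the PR_SET_CFI commit above describes, usually comes down to a single mask test. The sketch below uses made-up bit values (`TOY_CFI_*`); the real PR_CFI_* constants are defined in the uapi headers and may differ.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit values, for illustration only. */
#define TOY_CFI_ENABLE	(1UL << 0)
#define TOY_CFI_DISABLE	(1UL << 1)
#define TOY_CFI_LOCK	(1UL << 2)
#define TOY_CFI_KNOWN	(TOY_CFI_ENABLE | TOY_CFI_DISABLE | TOY_CFI_LOCK)

/* Return 0 if every requested bit is known, -EINVAL otherwise. */
static int toy_check_cfi_flags(unsigned long flags)
{
	if (flags & ~TOY_CFI_KNOWN)
		return -EINVAL;
	return 0;
}
```

Failing unknown bits up front keeps them available for future extensions, since no existing caller can come to depend on them being silently ignored.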
Nam Cao [Tue, 7 Apr 2026 12:06:39 +0000 (14:06 +0200)]
riscv: Fix fast_unaligned_access_speed_key not getting initialized
The static key fast_unaligned_access_speed_key is supposed to be
initialized after check_unaligned_access_all_cpus() has been completed.
However, check_unaligned_access_all_cpus() has been moved to late_initcall
while setting fast_unaligned_access_speed_key still happens at
arch_initcall_sync, thus the static key does not get properly initialized.
fast_unaligned_access_speed_key can still be initialized in CPU hotplug
events, but that cannot be relied on.
Move fast_unaligned_access_speed_key's initialization into
check_unaligned_access_all_cpus() to fix this issue. This also prevents
someone from moving one initcall while forgetting the other in the future.
Fixes: 6455c6c11827 ("riscv: Clean up & optimize unaligned scalar access probe") Reported-by: Michael Neuling <mikey@neuling.org> Closes: https://lore.kernel.org/linux-riscv/CAEjGV6y0=bSLp_wrS0uHFj1S2TCRtz4GKzaU5O-L1VV-EL7Nnw@mail.gmail.com/ Signed-off-by: Nam Cao <namcao@linutronix.de> Link: https://patch.msgid.link/20260407120639.4006031-1-namcao@linutronix.de Signed-off-by: Paul Walmsley <pjw@kernel.org>
Andreas Schwab [Thu, 21 May 2026 22:34:30 +0000 (00:34 +0200)]
riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
Fixes a warning while dumping core:
[54983.546369][ C7] WARNING: [!note_name] fs/binfmt_elf.c:1771 at elf_core_dump+0x910/0xf68, CPU#7: abort01/31982
Fixes: 2af7c9cf021c ("riscv/ptrace: expose riscv CFI status and state via ptrace and in core files") Signed-off-by: Andreas Schwab <schwab@suse.de> Link: https://patch.msgid.link/87y0hcxuh5.fsf@igel.home Signed-off-by: Paul Walmsley <pjw@kernel.org>
After commit 0652a3daa787 ("tracing: Fix CFI violation in probestub
being called by tprobes"), there are many build errors when building
ARCH=arm multi_v7_defconfig + CONFIG_CFI=y like:
In file included from drivers/base/devres.c:17:
In file included from drivers/base/trace.h:16:
In file included from include/linux/tracepoint.h:23:
include/linux/cfi.h:44:6: error: call to undeclared function 'get_kernel_nofault'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
44 | if (get_kernel_nofault(hash, func - cfi_get_offset()))
| ^
1 error generated.
get_kernel_nofault() is called in the generic version of
cfi_get_func_hash() but nothing ensures uaccess.h is always included for
a proper expansion and prototype. Include uaccess.h in cfi.h to clear
up the errors.
Cc: stable@vger.kernel.org Fixes: 0652a3daa787 ("tracing: Fix CFI violation in probestub being called by tprobes") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Input: atkbd - skip deactivate for HONOR BCC-N's internal keyboard
After commit 9cf6e24c9fbf17e52de9fff07f12be7565ea6d61 ("Input: atkbd -
do not skip atkbd_deactivate() when skipping ATKBD_CMD_GETID"), HONOR
BCC-N, aka HONOR MagicBook 14 2026's internal keyboard stops
working. Adding the atkbd_deactivate_fixup quirk fixes it.
Linus Torvalds [Sat, 6 Jun 2026 16:49:16 +0000 (09:49 -0700)]
Merge tag 'sound-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"It's getting calmer, but we still came up with a handful of small
fixes, including two core fixes. All look sane and safe.
Core:
- Fix wait queue list corruption in snd_pcm_drain() on linked streams
- Fix UMP event stack overread in seq dummy driver
USB-audio:
- Add quirk for AB13X USB Audio
- Fix the regression with sticky mixer volumes in 7.1-rc
ASoC:
- Fix 32-slot TDM breakage on Freescale SAI
- Various DMI quirks for AMD ACP"
* tag 'sound-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: seq: dummy: fix UMP event stack overread
ALSA: usb-audio: Add iface reset and delay quirk for AB13X USB Audio
ALSA: PCM: Fix wait queue list corruption in snd_pcm_drain() on linked streams
ASoC: amd: acp70: add standalone RT721 SoundWire machine
ASoC: amd: yc: Add MSI Raider A18 HX A9WJG to quirk table
ASoC: fsl_sai: Fix 32 slots TDM broken by integer shift UB in xMR write
ASoC: amd: yc: Enable internal mic on MSI Bravo 17 C7VF
ASoC: amd: acp: Add DMI quirk for Lenovo Yoga Pro 7 15ASH11
ALSA: usb-audio: Set the value of potential sticky mixers to maximum
Linus Torvalds [Sat, 6 Jun 2026 16:44:42 +0000 (09:44 -0700)]
Merge tag 'rust-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Fix 'rustc-option' (the Makefile one) when cross-compiling that
leads to build or boot failures in certain configs
- Work around a Rust compiler bug (already fixed for Rust 1.98.0)
that leads to boot failures in certain configs due to missing
'uwtable' LLVM module flags
- Support a Rust compiler change (starting with Rust 1.98.0) in the
unstable target specification JSON files
- Forbid Rust + arm + KASAN configs, which do not build
'kernel' crate:
- Fix NOMMU build by adding a missing helper"
* tag 'rust-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: x86: support Rust >= 1.98.0 target spec
rust: arm64: set uwtable llvm module flag for CONFIG_UNWIND_TABLES
rust: helpers: add is_vmalloc_addr wrapper for NOMMU builds
rust: kasan/kbuild: fix rustc-option when cross-compiling
ARM: Do not select HAVE_RUST when KASAN is enabled
Linus Torvalds [Sat, 6 Jun 2026 14:28:59 +0000 (07:28 -0700)]
Merge tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix error handling in ovl_cache_get()
- Tighten access checks for exited tasks in pidfd_getfd()
- Fix selftests leak in __wait_for_test()
- Limit FUSE_NOTIFY_RETRIEVE to uptodate folios
- Reject fuse_notify() pagecache ops on directories
- Clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
- Fix failure to unlock in nfsd4_create_file()
- Fix pointer arithmetic in qnx6 directory iteration
- Fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
- Avoid potential null folio->mapping deref during iomap error
reporting
* tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: avoid potential null folio->mapping deref during error reporting
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
fs/qnx6: fix pointer arithmetic in directory iteration
VFS: fix possible failure to unlock in nfsd4_create_file()
signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
fuse: reject fuse_notify() pagecache ops on directories
fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate folios
selftests: harness: fix pidfd leak in __wait_for_test
pidfd: refuse access to tasks that have started exiting harder
ovl: keep err zero after successful ovl_cache_get()
Linus Torvalds [Sat, 6 Jun 2026 01:02:23 +0000 (18:02 -0700)]
Merge tag 'drm-fixes-2026-06-06' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes, not contributing to things settling down
unfortunately. Lots of driver fixes for various bounds checks, leaks
and UAF type things, i915/xe probably the most sane, amdgpu has a mix
of fixes all over, then ethosu has lots of small fixes.
The problem of fixing things in private has really hit us with the
change handle ioctl, and "Sima was right" and we should have disabled
the ioctl, since it was only introduced a couple of kernels ago and
failed to upstream its tests in time.
The patch here fixes the problems Sima identified, but disables the
ioctl as well, with a list of known problems in it and a request for
proper tests to be written and upstreamed. It's a niche user ioctl
designed for CRIU with AMD ROCm, so I think it's fine to just disable
it.
Maybe this week will settle down.
core:
- disable the gem change handle ioctl for security reasons (plan to
fix it on list later with proper test coverage)
dumb-buffer:
- remove strict limits on buffer geometry
amdkfd:
- UAF race fix
- Fix a potential NULL pointer dereference
- GC 11 buffer overflow fix for SDMA
xe:
- Revert removing support for unpublished NVL-S GuC
- Suspend fixes related to multi-queue
i915:
- Fix color blob reference handling in intel_plane_state
- Revert "drm/i915/backlight: Remove try_vesa_interface"
ethosu:
- reject unsupported NPU_OP_RESIZE
- fix index of IFM region
- fix weight index
- fix overflows in DMA-size calculations
- reject DMA commands with uninitialized length
- fix OOB write in ethosu_gem_cmdstream_copy_and_validate
imx:
- fix kernel-doc warnings
ivpu:
- add overflow checks in firmware handling and get_info_ioctl
v3d:
- wait for pending L2T flush before cleaning caches
- fix leak of vaddr
- skip CSD when it has zeroed workgroups
- fix ref counting in performance monitoring"
* tag 'drm-fixes-2026-06-06' of https://gitlab.freedesktop.org/drm/kernel: (50 commits)
drm/gem: Try to fix change_handle ioctl, attempt 4
Revert "drm/i915/backlight: Remove try_vesa_interface"
accel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()
accel/ethosu: reject DMA commands with uninitialized length
accel/ethosu: fix arithmetic issues in dma_length()
accel/ethosu: fix wrong weight index in NPU_SET_SCALE1_LENGTH on U85
accel/ethosu: reject NPU_OP_RESIZE commands from userspace
accel/ethosu: fix IFM region index out-of-bounds in command stream parser
drm/v3d: Fix global performance monitor reference counting
drm/xe/multi_queue: skip submit when primary queue is suspended
drm/xe: Clear pending_disable before signaling suspend fence
Revert "drm/xe: Skip exec queue schedule toggle if queue is idle during suspend"
drm/amd/pm: smu_v14_0_0: use SoftMin for gfxclk in set_soft_freq_limited_range
drm/amdgpu: Fix incorrect VRAM GART mappings on non-4K page size systems
drm/amdgpu/userq: move wptr_obj cleanup in mqd_destroy
drm/amdgpu: improve the userq seq BO free bit lookup
drm/amdgpu/userq: remove the vital queue unmap logging
drm/amdkfd: Fix buffer overflow in SDMA queue checkpoint/restore on GFX11
drm/amdkfd: fix NULL dereference in get_queue_ids()
drm/amdgpu: set noretry=1 as default for GFX 10.1.x (Navi10/12/14)
...
Simona Vetter [Thu, 4 Jun 2026 19:44:37 +0000 (21:44 +0200)]
drm/gem: Try to fix change_handle ioctl, attempt 4
[airlied: just added some comments on how to reenable]
On-list because the cat is out of the bag and we're clearly not good
enough to figure this out in private. The story thus far:
5e28b7b94408 ("drm: Set old handle to NULL before prime swap in
change_handle") tried to fix a race condition between the gem_close and
gem_change_handle ioctls, but got a few things wrong:
- There's a confusion with the local variable handle, which is actually
the new handle, and so the two-stage trick was actually applied to the
wrong idr slot. 7164d78559b0 ("drm/gem: fix race between
change_handle and handle_delete") tried to fix that by adding yet
another code block, but forgot to add the error handling. Which meant
we now have two paths, both kinda wrong.
- dc366607c41c ("drm: Replace old pointer to new idr") tried to apply
another fix, but inconsistently, again because of the handle confusion
- this would be the right fix (kinda, somewhat, it's a mess) if we'd
do the two-stage approach for the new handle. Except that wasn't the
intent of the original fix.
We also didn't have an igt merged for the original ioctl, which is a big
no-go. This was attempted to address off-list in the original bugfix,
and amd QA people claimed the bug was fixed now. Very clearly that's not
the case. Here's my attempt to sort this out:
- Rename the local variable to new_handle, the old aliasing with
args->handle is just too dangerously confusing.
- Merge the gem obj lookup with the two-stage idr_replace so that we
avoid getting ourselves confused there.
- This means we don't have a surplus temporary reference anymore, only
the one inherited from the idr. A concurrent gem_close on the new_handle
could steal that. Fix that with the same two-stage approach
create_tail uses. This is a bit overkill as documented in the comment,
but I also don't trust my ability to understand this all correctly, so
go with the established pattern we have from other ioctls instead for
maximum paranoia.
- Adjust error paths. I've tried to make the error and success paths
common, because they are identical except for which handle is removed
and on which we call idr_replace to (re)install the object again. But
that made things messier to read, so I've left it at the more verbose
version, which unfortunately hides the symmetry in the entire code
flow a bit.
- While at it, also replace the 7 space indent with 1 tab.
And finally, because I flat out don't trust my abilities here at all
anymore:
- Disable the ioctl until we have the igt situation and everything else
sorted out on-list and with full consensus.
v2:
Sashiko noticed that I didn't handle the error path for idr_replace
correctly, it must be checked with IS_ERR_OR_NULL like in
gem_handle_delete. So yeah, definitely should just copy the existing paths
1:1 because this is endless amounts of tricky.
Also add the Fixes: line for the original ioctl, I forgot that too.
Reported-by: DARKNAVY (@DarkNavyOrg) <vr@darknavy.com> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch> Fixes: dc366607c41c ("drm: Replace old pointer to new idr") Cc: syzbot+d7c9eed171647e421013@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Cc: Edward Adam Davis <eadavis@qq.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Fixes: 5e28b7b94408 ("drm: Set old handle to NULL before prime swap in change_handle") Cc: David Francis <David.Francis@amd.com> Cc: Puttimet Thammasaeng <pwn8official@gmail.com> Cc: Christian Koenig <Christian.Koenig@amd.com> Fixes: 7164d78559b0 ("drm/gem: fix race between change_handle and handle_delete") Cc: Zhenghang Xiao <kipreyyy@gmail.com> Fixes: 5e28b7b94408 ("drm: Set old handle to NULL before prime swap in change_handle") Reviewed-by: David Francis <David.Francis@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Link: https://patch.msgid.link/20260604194437.1725314-1-simona.vetter@ffwll.ch
Dave Airlie [Fri, 5 Jun 2026 22:37:21 +0000 (08:37 +1000)]
Merge tag 'drm-misc-fixes-2026-06-05' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Short summary of fixes pull:
dumb-buffer:
- remove strict limits on buffer geometry
ethosu:
- reject unsupported NPU_OP_RESIZE
- fix index of IFM region
- fix weight index
- fix overflows in DMA-size calculations
- reject DMA commands with uninitialized length
- fix OOB write in ethosu_gem_cmdstream_copy_and_validate
imx:
- fix kernel-doc warnings
ivpu:
- add overflow checks in firmware handling and get_info_ioctl
v3d:
- wait for pending L2T flush before cleaning caches
- fix leak of vaddr
- skip CSD when it has zeroed workgroups
- fix ref counting in performance monitoring
Linus Torvalds [Fri, 5 Jun 2026 20:52:15 +0000 (13:52 -0700)]
Merge tag 'io_uring-7.1-20260605' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fix from Jens Axboe:
"A single fix for a missing flag mask when multishot is used with
an incrementally consumed buffer ring, potentially leading to
application confusion because of lack of IORING_CQE_F_BUF_MORE
consistency"
* tag 'io_uring-7.1-20260605' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
Linus Torvalds [Fri, 5 Jun 2026 17:38:45 +0000 (10:38 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Correctly drop the ITS translation cache reference when it actually
gets invalidated
- Take the SRCU lock for SW page table walks
- Restore POR_EL0 access to host EL0, avoiding POR_EL0 becoming
inaccessible from EL0 after running a guest
- Reassign nested_mmus array behind mmu_lock, ensuring that vcpu init
and MMU notifiers are mutually exclusive
- Correctly handle FEAT_XNX at stage-2
s390:
- More fixes for the new page table management and nested
virtualization
x86:
- More fixes for GHCB issues:
- Read start/end indices of page size change requests exactly once
per vmexit
- Unmap and unpin the GHCB as needed on vCPU free"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits)
KVM: arm64: Correctly identify executable PTEs at stage-2
KVM: arm64: nv: Fix handling of XN[0] when !FEAT_XNX
KVM: arm64: Reassign nested_mmus array behind mmu_lock
KVM: arm64: Restore POR_EL0 access to host EL0
KVM: arm64: Take the SRCU lock for page table walks in fault injection and AT emulation
KVM: arm64: vgic-its: Drop the translation cache reference only for the erased entry
KVM: SEV: Unmap and unpin the GHCB as needed on vCPU free
KVM: SEV: Decouple the need to sync the GHCB SA from the need to free the SA
KVM: SEV: Move sev_free_vcpu() down below sev_es_unmap_ghcb()
KVM: Don't WARN if memory is dirtied without a vCPU when the VM is dying
KVM: SEV: Read start/end indices of PSC requests exactly once per #VMGEXIT
KVM: SEV: Add an anonymous "psc" struct to track current PSC metadata
KVM: SEV: Make it more obvious when KVM is writing back the current PSC index
KVM: s390: Remove ptep_zap_softleaf_entry()
KVM: s390: Fix possible reference leak in fault-in code
KVM: s390: Prevent memslots outside the ASCE range
KVM: s390: Lock pte when making page secure
KVM: s390: Fix fault-in code
KVM: s390: vsie: Fix rmap handling in _do_shadow_crste()
KVM: s390: Fix guest / virtual address confusion in _essa_clear_cbrl()
...
Linus Torvalds [Fri, 5 Jun 2026 17:33:32 +0000 (10:33 -0700)]
Merge tag 'probes-fixes-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing/probes fix from Masami Hiramatsu:
"Fix the eprobe event parser to point error position correctly"
* tag 'probes-fixes-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Point the error offset correctly for eprobe argument error
Zhou Yuhang [Wed, 20 May 2026 07:08:00 +0000 (15:08 +0800)]
kconfig: Fix repeated include selftest expectation
The err_repeated_inc test was added with an expected stderr fixture
that does not match the diagnostic printed by kconfig.
Running "make testconfig" currently fails in that test even though the
parser reports the duplicated include correctly:
[stderr]
Kconfig.inc1:4: error: repeated inclusion of Kconfig.inc3
Kconfig.inc2:3: note: location of first inclusion of Kconfig.inc3
The fixture expects "Repeated" and "Location" with capital letters, but
the diagnostic emitted by scripts/kconfig/util.c uses lowercase words.
Update the fixture to match the real message.
Fixes: 102d712ded3e ("kconfig: Error out on duplicated kconfig inclusion") Signed-off-by: Zhou Yuhang <zhouyuhang@kylinos.cn> Tested-by: Nicolas Schier <nsc@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20260520070800.2265479-1-zhouyuhang1010@163.com Signed-off-by: Nicolas Schier <nsc@kernel.org>
Linus Torvalds [Fri, 5 Jun 2026 15:34:32 +0000 (08:34 -0700)]
Merge tag 'xfs-fixes-7.1-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"A collection of fixes mostly for the RT device, including a small
refactor that has no functional change"
* tag 'xfs-fixes-7.1-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Remove mention of PageWriteback
xfs: abort mount if xfs_fs_reserve_ag_blocks fails
xfs: factor rtgroup geom write pointer reporting into a helper
xfs: drop the RTG reference later in xfs_ioc_rtgroup_geometry
xfs: fix rtgroup cleanup in CoW fork repair
xfs: fix error returns in CoW fork repair
xfs: fix overlapping extents returned for pNFS LAYOUTGET
xfs: fix use of uninitialized imap in xfs_fs_map_blocks error path
xfs: handle racing deletions in xfs_zone_gc_iter_irec
Linus Torvalds [Fri, 5 Jun 2026 15:28:10 +0000 (08:28 -0700)]
Merge tag 'erofs-for-7.1-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix a UAF of sbi->sync_decompress when compressed I/Os
race with unmount
- Fix a regression introduced this development cycle that
incorrectly rejects multiple-algorithm images
* tag 'erofs-for-7.1-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix EFSCORRUPTED on multi-algorithm images in z_erofs_map_sanity_check()
erofs: fix use-after-free on sbi->sync_decompress
Linus Torvalds [Fri, 5 Jun 2026 15:23:02 +0000 (08:23 -0700)]
Merge tag 'v7.1-rc7-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix use after free in SMB2_CANCEL
- Fix race in ksmbd_reopen_durable_fd
- Fix oplock and lease break potential NULL-deref
* tag 'v7.1-rc7-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix use-after-free of a deferred file_lock on double SMB2_CANCEL
ksmbd: fix durable reconnect double-bind race in ksmbd_reopen_durable_fd
ksmbd: fix NULL-deref of opinfo->conn in oplock/lease break notifiers
Oliver Upton [Tue, 2 Jun 2026 16:59:01 +0000 (09:59 -0700)]
KVM: arm64: Correctly identify executable PTEs at stage-2
KVM invalidates the I-cache before installing an executable PTE on
implementations without DIC. Unfortunately, support for FEAT_XNX
broke this check as KVM_PTE_LEAF_ATTR_HI_S2_XN was expanded to a
bitfield.
Fix it by reusing kvm_pgtable_stage2_pte_prot() and testing the abstract
permission bits instead.
Fixes: 2608563b466b ("KVM: arm64: Add support for FEAT_XNX stage-2 permissions") Reported-by: Sashiko (gemini/gemini-3.1-pro-preview) Signed-off-by: Oliver Upton <oupton@kernel.org> Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com> Link: https://patch.msgid.link/20260602165901.52800-3-oupton@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
Oliver Upton [Tue, 2 Jun 2026 16:59:00 +0000 (09:59 -0700)]
KVM: arm64: nv: Fix handling of XN[0] when !FEAT_XNX
XN has already been extracted from its bitfield position so using
FIELD_PREP() on the mask that clears XN[0] is completely broken, having
the effect of unconditionally granting execute permissions...
Fix the obvious mistake by manipulating the right bit.
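A minimal sketch of the class of mistake described above, not the actual KVM code. It assumes a two-bit XN field at bit 53 and a buggy expression of the shape "extracted value AND FIELD_PREP(mask, 0b10)"; the TOY_* macros and both helpers are hypothetical stand-ins. Because the prepared constant sits at bit 53 and the extracted value occupies bits 1:0, the AND always yields zero, and an all-zero XN field means execution is permitted.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration: XN[1:0] in bits 54:53 of the descriptor. */
#define TOY_XN_SHIFT 53
#define TOY_XN_MASK  (UINT64_C(3) << TOY_XN_SHIFT)

/* Simplified stand-ins for the kernel's FIELD_GET()/FIELD_PREP(). */
#define TOY_FIELD_GET(reg)  (((reg) & TOY_XN_MASK) >> TOY_XN_SHIFT)
#define TOY_FIELD_PREP(val) (((uint64_t)(val) << TOY_XN_SHIFT) & TOY_XN_MASK)

/*
 * Hypothetical buggy shape: xn is already extracted (0..3), but the mask
 * is still in register position, so the result is always 0.
 */
static uint64_t toy_clear_xn0_buggy(uint64_t xn)
{
	assert(xn <= 0x3);
	return xn & TOY_FIELD_PREP(0x2);
}

/* Fixed shape: operate on the extracted value directly, keeping only XN[1]. */
static uint64_t toy_clear_xn0_fixed(uint64_t xn)
{
	assert(xn <= 0x3);
	return xn & 0x2;
}
```

With xn = 0b10 (no execution), the buggy helper returns 0 while the fixed one returns 0b10.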
Cc: stable@vger.kernel.org Fixes: d93febe2ed2e ("KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2") Reviewed-by: Wei-Lin Chang <weilin.chang@arm.com> Signed-off-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/20260602165901.52800-2-oupton@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
Clément Léger [Thu, 4 Jun 2026 16:07:13 +0000 (09:07 -0700)]
io_uring/net: inherit IORING_CQE_F_BUF_MORE across bundle recv retries
When a bundle recv retries inside io_recv_finish(), the merge logic ORs
the saved cflags from the previous iteration with the cflags returned by
the new iteration:
cflags = req->cqe.flags | (cflags & CQE_F_MASK);
Bits listed in CQE_F_MASK are inherited from the new iteration, and all
other bits (notably IORING_CQE_F_BUFFER and the buffer ID) come from the
saved cflags. Before this change CQE_F_MASK covered only
IORING_CQE_F_SOCK_NONEMPTY and IORING_CQE_F_MORE.
When using provided buffer rings in incremental mode (IOU_PBUF_RING_INC)
together with bundle recv, io_kbuf_inc_commit() can leave the head ring
entry partially consumed. __io_put_kbufs() then sets
IORING_CQE_F_BUF_MORE on the returned cflags so userspace knows the
buffer ID will be reused for subsequent completions.
Because IORING_CQE_F_BUF_MORE was not in CQE_F_MASK, the merge above
silently dropped it whenever the final retry iteration partially
consumed the buffer, and the subsequent req->cqe.flags = cflags &
~CQE_F_MASK save would have left a stale IORING_CQE_F_BUF_MORE in the
carried-over cflags had one been present. Userspace would then
wrongfully advance its ring head past an entry the kernel still uses.
Add IORING_CQE_F_BUF_MORE to CQE_F_MASK so it is both inherited from the
new iteration into the user-visible CQE and stripped from the saved
cflags between iterations.
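The merge and save steps described above can be modelled in userspace as follows. This is a sketch only: the bit values are illustrative placeholders, and merge_cflags(), save_cflags(), MASK_BEFORE and MASK_AFTER are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values; the real ones come from the io_uring UAPI header. */
#define CQE_F_BUFFER        (1u << 0)
#define CQE_F_MORE          (1u << 1)
#define CQE_F_SOCK_NONEMPTY (1u << 2)
#define CQE_F_BUF_MORE      (1u << 4)

#define MASK_BEFORE (CQE_F_SOCK_NONEMPTY | CQE_F_MORE)
#define MASK_AFTER  (MASK_BEFORE | CQE_F_BUF_MORE)

/*
 * Merge on retry: masked bits come from the new iteration, everything
 * else from the saved flags. The saved flags are expected to have been
 * stripped of masked bits by save_cflags() on the previous iteration.
 */
static uint32_t merge_cflags(uint32_t saved, uint32_t fresh, uint32_t mask)
{
	assert((saved & mask) == 0);
	return saved | (fresh & mask);
}

/* Save between iterations: strip the bits that must not be carried over. */
static uint32_t save_cflags(uint32_t cflags, uint32_t mask)
{
	return cflags & ~mask;
}
```

With MASK_BEFORE, a fresh CQE_F_BUF_MORE is lost in merge_cflags() and a stale one survives save_cflags(); with MASK_AFTER both cases behave as the fix intends.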
KVM: arm64: Reassign nested_mmus array behind mmu_lock
kvm->arch.nested_mmus[] is walked under kvm->mmu_lock, including from the
MMU notifier path (kvm_unmap_gfn_range() -> kvm_nested_s2_unmap()), which
can run at any time. kvm_vcpu_init_nested() reallocates the array and frees
the old buffer while holding only kvm->arch.config_lock, so such a walker
can reference the freed array.
Allocate the new array outside of mmu_lock, as the allocation can sleep.
Under the lock, copy the existing entries, fix up the back pointers and
reassign the array. Free the old buffer after dropping the lock, as
kvfree() can sleep as well.
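The allocate, swap under the lock, then free ordering can be sketched in plain C. Everything here is a hypothetical userspace model: struct toy_lock only records whether it is held, the self pointer stands in for the back pointers mentioned above, and growers are assumed to be serialized by some other lock (config_lock in the text).

```c
#include <assert.h>
#include <stdlib.h>

/* Toy lock that only tracks its state; a stand-in for mmu_lock. */
struct toy_lock { int held; };

static void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = 1; }
static void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = 0; }

/* Hypothetical entry whose pointer must always refer to its current slot. */
struct toy_mmu {
	struct toy_mmu *self;
	int id;
};

struct toy_vm {
	struct toy_lock mmu_lock;
	struct toy_mmu *mmus;
	size_t nr;
};

static int toy_grow_mmus(struct toy_vm *vm, size_t new_nr)
{
	struct toy_mmu *fresh, *old;
	size_t i;

	if (new_nr <= vm->nr)
		return 0;

	/* Allocation may sleep, so it happens before taking the lock. */
	fresh = calloc(new_nr, sizeof(*fresh));
	if (!fresh)
		return -1;

	toy_lock_acquire(&vm->mmu_lock);
	old = vm->mmus;
	for (i = 0; i < vm->nr; i++)
		fresh[i] = old[i];		/* copy existing entries */
	for (i = 0; i < new_nr; i++)
		fresh[i].self = &fresh[i];	/* fix up back pointers */
	vm->mmus = fresh;			/* walkers now see the new array */
	vm->nr = new_nr;
	toy_lock_release(&vm->mmu_lock);

	/* Freeing may sleep as well, so it happens after dropping the lock. */
	free(old);
	return 0;
}
```

A walker that takes the same lock sees either the old array or the new one, never a freed buffer.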
Fixes: 4f128f8e1aaac ("KVM: arm64: nv: Support multiple nested Stage-2 mmu structures") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiKIVVeIr1aAB1yp@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
Joey Gouly [Thu, 4 Jun 2026 10:54:34 +0000 (11:54 +0100)]
KVM: arm64: Restore POR_EL0 access to host EL0
CPTR_EL2.E0POE was being cleared in __deactivate_cptr_traps_vhe(), which meant
that any accesses to POR_EL0 from host EL0 would trap and be reported to
userspace as an Illegal instruction. This would happen after running any VM,
regardless of whether it used POE.
Revert "drm/i915/backlight: Remove try_vesa_interface"
Removing the try_vesa_interface gate caused a backlight regression on
panels whose VBT correctly reports INTEL_BACKLIGHT_DISPLAY_DDI and whose
PWM path is the actual backlight control, but whose DPCD optimistically
advertises DP_EDP_BACKLIGHT_AUX_ENABLE_CAP / _BRIGHTNESS_AUX_SET_CAP.
After the commit such panels silently bind to the VESA AUX backlight
funcs; AUX writes complete but the panel ignores them, leaving
brightness stuck (no-op backlight). Observed on at least KBL and TGL
eDP setups.
Hyunwoo Kim [Wed, 3 Jun 2026 12:09:33 +0000 (21:09 +0900)]
KVM: arm64: Take the SRCU lock for page table walks in fault injection and AT emulation
walk_s1() and kvm_walk_nested_s2() expect to be called while holding
kvm->srcu to guard against memslot changes. While this is generally
the case, __kvm_at_s12() and __kvm_find_s1_desc_level() call into the
respective walkers without taking kvm->srcu.
Fix by acquiring kvm->srcu prior to the table walk in both instances.
Cc: stable@vger.kernel.org Fixes: 50f77dc87f13 ("KVM: arm64: Populate level on S1PTW SEA injection") Fixes: be04cebf3e78 ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}") Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/aiAZfdeyanIvP8SD@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org>
Hyunwoo Kim [Mon, 1 Jun 2026 14:53:26 +0000 (23:53 +0900)]
KVM: arm64: vgic-its: Drop the translation cache reference only for the erased entry
vgic_its_invalidate_cache() walks the per-ITS translation cache with
xa_for_each() and drops the cache's reference on each entry with
vgic_put_irq(). It puts the iterated pointer, though, rather than the
value returned by xa_erase().
The function is called from contexts that do not exclude one another: the
ITS command handlers hold its_lock, the GITS_CTLR write path holds
cmd_lock, and the path that clears EnableLPIs in a redistributor's
GICR_CTLR holds neither. Two or more of them can drain the same cache
concurrently, and if each one observes the same entry, erases it and then
puts it, the single reference the cache holds on that entry is dropped
more than once. The entry can then be freed while an ITE still maps it.
xa_erase() is atomic and returns the previous entry, so put only the entry
that this context actually removed. The cache reference is then dropped
exactly once per entry even when the invalidations run concurrently, and
the behavior is unchanged when only one context runs.
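The "put only what you erased" rule can be illustrated with a small userspace model. This is a sketch under assumptions: a fixed array of C11 atomic pointers stands in for the xarray, toy_erase() stands in for xa_erase(), and the non-atomic refcount is only suitable for this single-threaded demonstration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_irq { int refcount; };

#define TOY_CACHE_SLOTS 4

/* Hypothetical translation cache: each occupied slot holds one reference. */
struct toy_cache {
	_Atomic(struct toy_irq *) slot[TOY_CACHE_SLOTS];
};

static void toy_put_irq(struct toy_irq *irq)
{
	assert(irq->refcount > 0);	/* catches a double put */
	irq->refcount--;
}

/* Atomically empty the slot and return whatever was stored there. */
static struct toy_irq *toy_erase(struct toy_cache *c, size_t i)
{
	return atomic_exchange(&c->slot[i], (struct toy_irq *)NULL);
}

static void toy_invalidate(struct toy_cache *c)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		/* Only the context that actually removed the entry puts it. */
		struct toy_irq *erased = toy_erase(c, i);

		if (erased)
			toy_put_irq(erased);
	}
}
```

Running toy_invalidate() twice on the same cache drops the cache's reference once, because the second pass gets NULL back from the erase.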
Fixes: 8201d1028caa ("KVM: arm64: vgic-its: Maintain a translation cache per ITS") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Oliver Upton <oupton@kernel.org> Link: https://patch.msgid.link/ah2c5lu4JbUg7dj-@v4bel Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
Tony Luck [Fri, 5 Jun 2026 04:46:49 +0000 (21:46 -0700)]
x86/resctrl: Only check Intel systems for SNC
topology_num_nodes_per_package() reports values greater than one on certain
AMD systems resulting in resctrl's Intel model specific SNC detection
printing the confusing message:
"CoD enabled system? Resctrl not supported"
Add a check for Intel systems before looking at the topology.
Kyle Zeng [Fri, 5 Jun 2026 08:02:04 +0000 (01:02 -0700)]
ALSA: seq: dummy: fix UMP event stack overread
The dummy sequencer port forwards events by copying an incoming
struct snd_seq_event into a stack temporary, rewriting source and
destination, and dispatching the temporary to subscribers. That legacy
event storage is smaller than struct snd_seq_ump_event.
When a UMP event reaches the dummy client, the copy leaves the UMP flag
set but only provides legacy-sized stack storage. The subscriber
delivery path then uses snd_seq_event_packet_size() and copies a
UMP-sized packet from that stack object, reading past the end of the
temporary.
Use the existing union __snd_seq_event storage and copy the packet size
reported for the incoming event before rewriting the common routing
fields. This preserves the full UMP packet for UMP events while keeping
legacy event handling unchanged.
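A rough model of the fix, with made-up layouts: the toy_* structures, TOY_FLAG_UMP and the field sizes below are hypothetical and much simpler than the real ALSA sequencer types. The point is that the temporary is a union big enough for either event kind, and the copy length follows the incoming event's reported packet size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_FLAG_UMP 0x01

/* Both layouts share the same leading routing fields. */
struct toy_event {
	unsigned char flags, source, dest;
	unsigned char data[12];
};

struct toy_ump_event {
	unsigned char flags, source, dest;
	unsigned char ump[16];		/* larger payload than the legacy event */
};

union toy_event_storage {
	struct toy_event legacy;
	struct toy_ump_event ump;
};

static size_t toy_packet_size(const struct toy_event *ev)
{
	return (ev->flags & TOY_FLAG_UMP) ? sizeof(struct toy_ump_event)
					  : sizeof(struct toy_event);
}

/* Copy the whole packet first, then rewrite the common routing fields. */
static void toy_forward(union toy_event_storage *tmp,
			const struct toy_event *ev,
			unsigned char source, unsigned char dest)
{
	assert(toy_packet_size(ev) <= sizeof(*tmp));
	memcpy(tmp, ev, toy_packet_size(ev));
	tmp->legacy.source = source;
	tmp->legacy.dest = dest;
}
```

A subscriber that later reads toy_packet_size() bytes from the temporary stays within the union for both event kinds.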
Muhammad Bilal [Sat, 23 May 2026 19:08:43 +0000 (19:08 +0000)]
accel/ethosu: fix OOB write in ethosu_gem_cmdstream_copy_and_validate()
The command stream parsing loop increments the index variable a second
time when a 64-bit command word is encountered (bit 14 set), but does
not re-check the loop bound before writing the second word:
	for (i = 0; i < size / 4; i++) {
		bocmds[i] = cmds[0];
		if (cmd & 0x4000) {
			i++;
			bocmds[i] = cmds[1]; /* unchecked */
		}
	}
The buffer bocmds is backed by a DMA allocation of exactly size bytes
from drm_gem_dma_create(ddev, size), giving valid indices [0, size/4-1].
When i == size/4 - 1 on entry to an iteration and bit 14 of cmds[0] is
set, bocmds[size/4-1] is written in bounds, i is then incremented to
size/4, and bocmds[size/4] writes four bytes past the end of the
allocation.
Userspace controls both the buffer contents and the size argument via
the ioctl, making this a userspace-triggerable heap out-of-bounds write.
Fix by checking the incremented index against the buffer bound before
the second write and returning -EINVAL if the buffer is too small to
contain the extended command.
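A simplified sketch of the bounded loop, assuming a flat array of 32-bit words and a hypothetical helper name; the real driver's parsing and validation are not reproduced here. The incremented index is checked against the word count before the second word is written.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CMD_64BIT 0x4000u	/* bit 14: the command carries a second word */

/*
 * Copy a command stream of 'size' bytes into bocmds, which is assumed to
 * be exactly 'size' bytes long. Returns -EINVAL if a two-word command
 * starts in the last slot.
 */
static int toy_copy_cmdstream(uint32_t *bocmds, const uint32_t *cmds, size_t size)
{
	size_t n = size / 4;

	assert(bocmds && cmds);

	for (size_t i = 0; i < n; i++) {
		bocmds[i] = cmds[i];
		if (cmds[i] & TOY_CMD_64BIT) {
			i++;
			if (i >= n)	/* re-check the bound before the second write */
				return -EINVAL;
			bocmds[i] = cmds[i];
		}
	}
	return 0;
}
```

A stream whose final word has bit 14 set is rejected instead of writing four bytes past the end of the buffer.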