Oliver Upton [Mon, 1 Dec 2025 08:47:41 +0000 (00:47 -0800)]
Merge branch 'kvm-arm64/nv-xnx-haf' into kvmarm/next
* kvm-arm64/nv-xnx-haf: (22 commits)
: Support for FEAT_XNX and FEAT_HAF in nested
:
: Add support for a couple of MMU-related features that weren't
: implemented by KVM's software page table walk:
:
: - FEAT_XNX: Allows the hypervisor to describe execute permissions
: separately for EL0 and EL1
:
: - FEAT_HAF: Hardware update of the Access Flag, which in the context of
: nested means software walkers must also set the Access Flag.
:
: The series also adds some basic support for testing KVM's emulation of
: the AT instruction, including the implementation detail that AT sets the
: Access Flag in KVM.
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2
...
Oliver Upton [Mon, 1 Dec 2025 08:47:32 +0000 (00:47 -0800)]
Merge branch 'kvm-arm64/vgic-lr-overflow' into kvmarm/next
* kvm-arm64/vgic-lr-overflow: (50 commits)
: Support for VGIC LR overflows, courtesy of Marc Zyngier
:
: Address deficiencies in KVM's GIC emulation when a vCPU has more active
: IRQs than can be represented in the VGIC list registers. Sort the AP
: list to prioritize inactive and pending IRQs, potentially spilling
: active IRQs outside of the LRs.
:
: Handle deactivation of IRQs outside of the LRs for both EOImode=0/1,
: which involves special consideration for SPIs being deactivated from a
: different vCPU than the one that acked it.
KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE
KVM: arm64: selftests: vgic_irq: Add timer deactivation test
KVM: arm64: selftests: vgic_irq: Add Group-0 enable test
KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test
KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order
KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation
KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts
KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt
KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper
KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default
KVM: arm64: selftests: gic_v3: Add irq group setting helper
KVM: arm64: GICv2: Always trap GICV_DIR register
KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps
KVM: arm64: GICv2: Handle LR overflow when EOImode==0
KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En
KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive
KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation
KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR
KVM: arm64: GICv3: Handle in-LR deactivation when possible
KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation
...
Oliver Upton [Mon, 1 Dec 2025 08:47:20 +0000 (00:47 -0800)]
Merge branch 'kvm-arm64/sea-user' into kvmarm/next
* kvm-arm64/sea-user:
: Userspace handling of SEAs, courtesy of Jiaqi Yan
:
: Add support for processing external aborts in userspace in situations
: where the host has failed to do so, allowing the VMM to potentially
: reinject an external abort into the VM.
Documentation: kvm: new UAPI for handling SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
KVM: arm64: VM exit to userspace to handle SEA
Oliver Upton [Mon, 1 Dec 2025 08:47:12 +0000 (00:47 -0800)]
Merge branch 'kvm-arm64/misc' into kvmarm/next
* kvm-arm64/misc:
: Miscellaneous fixes/cleanups for KVM/arm64
:
: - Fix for need_resched warnings on non-preemptible kernels when
: tearing down a VM's stage-2
:
: - Improvements to KVM struct allocation, getting rid of pointless
: __GFP_HIGHMEM and switching to kvzalloc()
:
: - SYNC ITS configuration before injecting LPIs in vgic_lpi_stress
: selftest
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
KVM: arm64: Split kvm_pgtable_stage2_destroy()
KVM: arm64: Only drop references on empty tables in stage2_free_walker
KVM: selftests: SYNC after guest ITS setup in vgic_lpi_stress
KVM: selftests: Assert GICR_TYPER.Processor_Number matches selftest CPU number
KVM: arm64: Use kvzalloc() for kvm struct allocation
KVM: arm64: Drop useless __GFP_HIGHMEM from kvm struct allocation
Alexandru Elisei [Fri, 28 Nov 2025 10:09:46 +0000 (10:09 +0000)]
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
A guest can write 1 to TCR_ELx.HA, making the KVM software walker update
the access flag in a table descriptor even if FEAT_HAFDBS is not present.
Avoid this by making wi->ha depend on FEAT_HAFDBS being enabled in the VM,
similar to how the software walker treats FEAT_HPDS.
This is not needed for VTCR_EL2.HA, since a guest will always write to
the in-memory copy of the register, where the HA bit is masked (set to
0) by KVM if the VM doesn't have FEAT_HAFDBS.
Fixes: c59ca4b5b0c3 ("KVM: arm64: Implement HW access flag management in stage-1 SW PTW") Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Link: https://msgid.link/20251128100946.74210-5-alexandru.elisei@arm.com Signed-off-by: Oliver Upton <oupton@kernel.org>
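The gating described above can be sketched in plain C as follows. This is only an illustration: `struct walk_info`, `setup_walk()` and the `vm_has_hafdbs` parameter are hypothetical stand-ins rather than KVM's actual definitions, and the bit position used for HA assumes the TCR_EL1 layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: HA sits at bit 39, as in the TCR_EL1 layout. */
#define TCR_HA	(UINT64_C(1) << 39)

/* Hypothetical stand-in for the walker state; not KVM's real structure. */
struct walk_info {
	bool ha;	/* walker is allowed to update the Access flag */
};

/* Honour the guest's HA bit only if the VM was given FEAT_HAFDBS. */
static void setup_walk(struct walk_info *wi, uint64_t tcr, bool vm_has_hafdbs)
{
	assert(wi != NULL);
	wi->ha = vm_has_hafdbs && (tcr & TCR_HA);
}
```

With this shape, a guest that sets HA without having the feature simply gets a walker that never touches the Access flag.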
Alexandru Elisei [Fri, 28 Nov 2025 10:09:43 +0000 (10:09 +0000)]
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
Commit 2608563b466b ("KVM: arm64: Add support for FEAT_XNX stage-2
permissions") added the KVM_PGTABLE_PROT_{UX,PX} permissions to stage 2 and
to EL2 translation regimes, but left them undocumented. Let's fix that.
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
Clang warns (or errors with CONFIG_WERROR=y / W=e):
arch/arm64/kvm/hyp/pgtable.c:757:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
757 | }
| ^
With older versions of clang (15 and older) and GCC (at least the minimum
supported, 8.1), this is an unconditional hard error:
arch/arm64/kvm/hyp/pgtable.c: In function 'kvm_pgtable_stage2_pte_prot':
arch/arm64/kvm/hyp/pgtable.c:756:2: error: label at end of compound statement
default:
^~~~~~~
arch/arm64/kvm/hyp/pgtable.c:756:10: error: label at end of compound statement: expected statement
default:
^
;
Add a break statement to this default case to clear up the error/warning.
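For readers unfamiliar with the diagnostic, here is a minimal sketch of the pattern. The enum, the attribute bit and the function name are made up for illustration and are not taken from pgtable.c; the point is only that a label must be followed by a statement before the closing brace.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum prot_kind { PROT_NONE, PROT_EXEC };
#define ATTR_EXEC	0x1u

static void apply_prot(enum prot_kind kind, unsigned int *attr)
{
	assert(attr != NULL);

	switch (kind) {
	case PROT_EXEC:
		*attr |= ATTR_EXEC;
		break;
	default:
		break;	/* a bare "default:" before '}' is rejected pre-C23 */
	}
}
```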
Oliver Upton [Mon, 24 Nov 2025 23:54:09 +0000 (15:54 -0800)]
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
__lse_swap_desc() is compiled unconditionally, even if LSE is disabled
using the config option. Align with the spirit of the config option and
fix some build errors due to __LSE_PREAMBLE being undefined with the
application of some ifdeffery.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511250700.kAutzJFm-lkp@intel.com/ Link: https://msgid.link/20251124235409.1731253-1-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Oliver Upton [Mon, 24 Nov 2025 19:01:54 +0000 (11:01 -0800)]
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
Atomically update the Access flag at stage-1 when the guest has
configured the MMU to do so. Make the implementation choice (and liberal
interpretation of speculation) that any access type updates the Access
flag, including AT and CMO instructions.
Restart the entire walk by returning to the exception-generating
instruction in the case of a failed Access flag update.
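A rough userspace model of the update, using C11 atomics in place of the kernel's accessors. The helper name and the choice of -EAGAIN as the "restart the walk" error are assumptions made for this sketch; the Access flag position (bit 10) is the architectural one for VMSAv8-64 descriptors.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_AF	(UINT64_C(1) << 10)	/* Access flag in a descriptor */

/*
 * Set the Access flag only if the descriptor still holds the value the
 * walker read. Returns 0 on success, or -EAGAIN if the descriptor changed
 * underneath us and the whole walk has to be restarted.
 */
static int set_access_flag(_Atomic uint64_t *desc, uint64_t old)
{
	uint64_t updated = old | DESC_AF;

	assert(desc != NULL);

	if (old & DESC_AF)
		return 0;	/* already set, nothing to write */

	if (!atomic_compare_exchange_strong(desc, &old, updated))
		return -EAGAIN;

	return 0;
}
```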
Oliver Upton [Mon, 24 Nov 2025 19:01:53 +0000 (11:01 -0800)]
KVM: arm64: Propagate PTW errors up to AT emulation
KVM's software PTW will soon support 'hardware' updates to the access
flag. Similar to fault handling, races to update the descriptor will be
handled by restarting the instruction. Prepare for this by propagating
errors up to the AT emulation, only retiring the instruction if the walk
succeeds.
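The "only retire on success" rule can be pictured like this. `struct vcpu_model` and `finish_at_emulation()` are hypothetical, and `walk_ret` here models only the class of errors that call for a restart (such as a lost race on the descriptor), not faults that are reported to the guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the bit of vCPU state that matters here. */
struct vcpu_model {
	uint64_t pc;
};

/* walk_ret: 0 on success, negative errno if the walk must be retried. */
static int finish_at_emulation(struct vcpu_model *vcpu, int walk_ret)
{
	assert(vcpu != NULL);

	if (walk_ret)
		return walk_ret;	/* PC untouched: the guest re-executes AT */

	vcpu->pc += 4;	/* A64 instructions are 4 bytes wide */
	return 0;
}
```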
Oliver Upton [Mon, 24 Nov 2025 19:01:52 +0000 (11:01 -0800)]
KVM: arm64: Add helper for swapping guest descriptor
Implementing FEAT_HAFDBS in KVM's software PTWs requires the ability to
CAS a descriptor to update the in-memory value. Add an accessor to do
exactly that, coping with the fact that guest descriptors are in user
memory (duh).
While FEAT_LSE is required on any system that implements NV, KVM now uses
the stage-1 PTW for non-nested use cases, meaning an LL/SC implementation
is necessary as well.
Oliver Upton [Mon, 24 Nov 2025 19:01:50 +0000 (11:01 -0800)]
KVM: arm64: Handle endianness in read helper for emulated PTW
Implementing FEAT_HAFDBS means adding another descriptor accessor that
needs to deal with the guest-configured endianness. Prepare by moving
the endianness handling into the read accessor and out of the main body
of the S1/S2 PTWs.
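As an illustration of folding the endianness handling into the accessor, here is a host-independent sketch. `read_desc()` and its `big_endian` parameter are assumptions for this example; the real helper reads from guest memory rather than from a local buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Read the 64-bit descriptor at 'index' from a table and return it in
 * host order, interpreting the bytes according to the guest's endianness.
 */
static uint64_t read_desc(const void *table, size_t index, bool big_endian)
{
	unsigned char raw[8];
	uint64_t desc = 0;

	assert(table != NULL);
	memcpy(raw, (const unsigned char *)table + index * sizeof(raw), sizeof(raw));

	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? (7 - i) * 8 : i * 8;

		desc |= (uint64_t)raw[i] << shift;
	}

	return desc;
}
```

Callers then never see a byte-swapped value, which is what lets a second accessor (the descriptor swap) share the same convention.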
Oliver Upton [Mon, 24 Nov 2025 19:01:49 +0000 (11:01 -0800)]
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
The stage-2 table walker passes down the vCPU as a void pointer. That
might've made sense if the walker was generic, although at this point it
is clear this will only ever be used in the context of a vCPU.
Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Tested-by: Marc Zyngier <maz@kernel.org> Link: https://msgid.link/20251124190158.177318-8-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Oliver Upton [Mon, 24 Nov 2025 19:01:46 +0000 (11:01 -0800)]
KVM: arm64: Teach ptdump about FEAT_XNX permissions
Although KVM doesn't make direct use of the feature, guest hypervisors
can use FEAT_XNX which influences the permissions of the shadow stage-2.
Update ptdump to separately print the privileged and unprivileged
execute permissions.
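A small decoder shows what "separately" means for a stage-2 descriptor. The sketch assumes the FEAT_XNX encoding of XN[1:0] in bits [54:53] (0b00 both levels may execute, 0b01 EL0 only, 0b10 neither, 0b11 EL1 only); the struct and function names are hypothetical and this is not the ptdump code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define S2_XN_SHIFT	53
#define S2_XN_MASK	(UINT64_C(3) << S2_XN_SHIFT)

struct exec_perms {
	bool px;	/* privileged (EL1) execute allowed */
	bool ux;	/* unprivileged (EL0) execute allowed */
};

static struct exec_perms decode_s2_xn(uint64_t desc)
{
	struct exec_perms p = { false, false };

	switch ((desc & S2_XN_MASK) >> S2_XN_SHIFT) {
	case 0:
		p.px = true;
		p.ux = true;
		break;
	case 1:
		p.ux = true;
		break;
	case 3:
		p.px = true;
		break;
	default:
		break;	/* 0b10: no execution at either level */
	}

	return p;
}
```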
Marc Zyngier [Tue, 25 Nov 2025 16:01:44 +0000 (16:01 +0000)]
KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE
Suzuki notices that making the ICH_HCR_EL2_TDIR capability a system
one isn't a very good idea, should we end up with CPUs that have
asymmetric TDIR support (somewhat unlikely, but you never know what
level of stupidity vendors are up to). For this hypothetical setup,
making this an "EARLY_LOCAL_CPU_FEATURE" is a much better option.
This is actually consistent with what we already do with the GICv5
legacy interface, so flip the capability over.
Marc Zyngier [Thu, 20 Nov 2025 17:25:39 +0000 (17:25 +0000)]
KVM: arm64: selftests: vgic_irq: Add timer deactivation test
Add a new test case that triggers the HW deactivation emulation path
when trapping ICV_DIR_EL1. This is obviously tied to the way KVM
works now, but the test follows the expected architectural behaviour.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-50-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:38 +0000 (17:25 +0000)]
KVM: arm64: selftests: vgic_irq: Add Group-0 enable test
Add a new test case that injects a Group-0 interrupt together
with a bunch of Group-1 interrupts, Ack/EOI the G1 interrupts,
and only then enable G0, expecting to get the G0 interrupt.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-49-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:37 +0000 (17:25 +0000)]
KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test
Add a new test case that makes an interrupt pending on a vcpu,
activates it, does the priority drop, and then gets *another* vcpu
to do the deactivation.
Special care is taken not to trigger an exit in the process, so
that we are sure that the active interrupt is in an LR. Joy.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-48-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:36 +0000 (17:25 +0000)]
KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order
When EOImode==1, perform the deactivation in the order of activation,
just to make things a bit worse for KVM. Yes, I'm nasty.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-47-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation
Good news: our GIC emulation is not completely broken, and we can
activate as many interrupts as we want.
Bump the test to cover all the SGIs, all the allowed PPIs, and
31 SPIs. Yes, 31, because we have 31 available priorities, and the
test is not happy with having two interrupts with the same priority.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-46-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts
The PPI injection API is clear that you can't inject the timer PPIs
from userspace, since they are controlled by the timers themselves.
Add an exclusion list for this purpose.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-45-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:33 +0000 (17:25 +0000)]
KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt
The architecture is pretty clear that changing the configuration of
an enabled interrupt is not OK. It doesn't really matter here, but
doing the right thing is not more expensive.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-44-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper
No, 0 is not a spurious INTID. Never been, never was.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-43-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:31 +0000 (17:25 +0000)]
KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default
Make sure G0 is disabled at the point of initialising the GIC.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-42-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:30 +0000 (17:25 +0000)]
KVM: arm64: selftests: gic_v3: Add irq group setting helper
Being able to set the group of an interrupt is pretty useful.
Add such a helper.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-41-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:29 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Always trap GICV_DIR register
Since we can't decide to trap the DIR register on a per-vcpu basis,
always trap the second page of the GIC CPU interface. Yes, this is
costly. On the bright side, no sane SW should use EOImode==1 on
GICv2...
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-40-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:28 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps
Add the plumbing of GICv2 interrupt deactivation via GICV_DIR.
This requires adding a new device so that we can easily decode
the DIR address.
The deactivation itself is very similar to the GICv3 version.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-39-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:27 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Handle LR overflow when EOImode==0
Similarly to the GICv3 version, handle the EOIcount-driven deactivation
by walking the overflow list.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-38-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:26 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En
FEAT_NV2 is pretty terrible for anything that tries to enforce immediate
effects, and writing to ICH_HCR_EL2 in the hope of disabling a maintenance
interrupt is in vain. This only hits memory, and the guest hasn't cleared
anything -- the MI will fire.
For example, running the vgic_irq test under NV results in about 800
maintenance interrupts being actually handled by the L1 guest,
when none were expected.
As a cheap workaround, read back ICH_MISR_EL2 after writing 0 to
ICH_HCR_EL2. This is very cheap on real HW, and causes a trap to
the host in NV, giving it the opportunity to retire the pending MI.
With this, the above test runs to completion without any MI being
actually handled.
Yes, this is really poor...
Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-37-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:25 +0000 (17:25 +0000)]
KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive
Pretty much like the rest of the LR handling, deactivation of an
L2 interrupt gets reflected in the L1 LRs, and therefore must be
propagated into the L1 shadow state if the interrupt is HW-bound.
Instead of directly handling the active state (which looks a bit
off as it ignores locking and L1->L0 HW propagation), use the new
deactivation primitive to perform the deactivation and deal with
the required maintenance.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-36-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:24 +0000 (17:25 +0000)]
KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation
The current approach to nested GICv3 support is to not do anything
while L2 is running, wait for a transition from L2 to L1 to resync
LRs, VMCR and HCR, and only then evaluate the state to decide
whether to generate a maintenance interrupt.
This doesn't provide a good quality of emulation, and it would be
far preferable to find out early that we need to perform a switch.
Move the LRs/VMCR and HCR resync into vgic_v3_sync_nested(), so
that we have most of the state available. As we are turning the vgic
off at this stage to avoid a screaming host MI, add a new helper
vgic_v3_flush_nested() that switches the vgic on again. The MI can
then be directly injected as required.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-35-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:23 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR
CPUs lacking TDIR always trap ICV_DIR_EL1, no matter what, since
we have ICH_HCR_EL2.TC set permanently. For these CPUs, it is
useless to use a broadcast kick on SPI injection, as the sole
purpose of this is to set TDIR.
We can therefore skip this on these CPUs, which are challenged
enough not to be burdened by extra IPIs. As a consequence,
permanently set the TDIR bit in the shadow state to notify the
fast-path emulation code of the exit reason.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-34-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:22 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Handle in-LR deactivation when possible
Even when we have either an LR overflow or SPIs in flight, it is
extremely likely that the interrupt being deactivated is still in
the LRs, and that going all the way back to the generic trap
handling code is a waste of time.
Instead, try and deactivate in place when possible, and only if
this fails, perform a full exit.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-33-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
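The fast path can be modelled as a scan over the list registers. This sketch assumes the ICH_LR<n>_EL2 layout (vINTID in bits [31:0], the active state in bit 63) and ignores HW-backed interrupts entirely; `try_deactivate_in_lr()` is a hypothetical name, and a `false` return stands for "fall back to a full exit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LR_ACTIVE_BIT	(UINT64_C(1) << 63)
#define LR_VINTID_MASK	UINT64_C(0xffffffff)

/* Clear the active state in place if 'intid' is active in one of the LRs. */
static bool try_deactivate_in_lr(uint64_t *lrs, size_t nr_lrs, uint32_t intid)
{
	assert(lrs != NULL || nr_lrs == 0);

	for (size_t i = 0; i < nr_lrs; i++) {
		if ((lrs[i] & LR_VINTID_MASK) != intid)
			continue;
		if (!(lrs[i] & LR_ACTIVE_BIT))
			continue;

		lrs[i] &= ~LR_ACTIVE_BIT;
		return true;
	}

	return false;	/* not found in the LRs: take the slow path */
}
```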
Marc Zyngier [Thu, 20 Nov 2025 17:25:21 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation
SPIs are especially annoying, as they can be activated on a CPU and
deactivated on another. Which means that when an SPI is in flight
anywhere, all CPUs need to have their TDIR trap bit set.
This translates into broadcasting an IPI across all CPUs to make sure
they set their trap bit. The number of in-flight SPIs is kept in
an atomic variable so that CPUs can turn the trap bit off as soon
as possible.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-32-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
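The counting scheme can be sketched with a C11 atomic. All names below are hypothetical, and the idea that only the first in-flight SPI needs to trigger the broadcast is an assumption of this sketch rather than something the commit states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Number of SPIs currently in flight across the whole VM (model only). */
static atomic_int spis_in_flight;

/* Returns true if the caller is the one that should kick all CPUs. */
static bool spi_enter_flight(void)
{
	return atomic_fetch_add(&spis_in_flight, 1) == 0;
}

static void spi_leave_flight(void)
{
	int prev = atomic_fetch_sub(&spis_in_flight, 1);

	assert(prev > 0);	/* unbalanced leave would be a bug */
	(void)prev;
}

/* Each CPU consults this when deciding whether to set its trap bit. */
static bool need_tdir_trap(void)
{
	return atomic_load(&spis_in_flight) > 0;
}
```

Once the count drops back to zero, every CPU can clear its trap bit the next time it evaluates `need_tdir_trap()`.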
Marc Zyngier [Thu, 20 Nov 2025 17:25:20 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Set ICH_HCR_EL2.TDIR when interrupts overflow LR capacity
Now that we are ready to handle deactivation through ICV_DIR_EL1,
set the trap bit if we have active interrupts outside of the LRs.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-31-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:19 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Add GICv2 SGI handling to deactivation primitive
The GICv2 SGIs require additional handling for deactivation, as they
are effectively multiple interrupts muxed into one. Make sure we
check for the source CPU when deactivating.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-30-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:18 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Handle deactivation via ICV_DIR_EL1 traps
Deactivation via ICV_DIR_EL1 is both relatively straightforward
(we have the interrupt that needs deactivation) and really awkward.
The main issue is that the interrupt may either be in an LR on
another CPU, or outside of any LR.
In the former case, we process the deactivation as if it was
a write to GICD_CACTIVERn, which is already implemented as a big
hammer IPI'ing all vcpus. In the latter case, we just perform
a normal deactivation, similar to what we do for EOImode==0.
Another annoying aspect is that we need to tell the CPU owning
the interrupt that its ap_list needs laundering. We use a brand new
vcpu request to that effect.
Note that this doesn't address deactivation via the GICV MMIO view,
which will be taken care of in a later change.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:17 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Handle LR overflow when EOImode==0
Now that we can identify interrupts that have not made it into the LRs,
it becomes relatively easy to use EOIcount to walk the overflow list.
What is a bit odd is that we compute a fake LR for the original
state of the interrupt, clear the active bit, and feed into the existing
logic for processing. In a way, this is what would have happened if
the interrupt was in an LR.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-28-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
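One way to picture the EOIcount walk, with a flat array standing in for the ap_list. The structure, the `in_lr` flag and `drain_eoicount()` are hypothetical; the sketch assumes the list is already ordered so that the first active interrupt outside the LRs is the next one the guest has deactivated, and it skips the fake-LR step mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct irq_model {
	uint32_t intid;
	bool active;
	bool in_lr;	/* interrupt currently occupies a list register */
};

/*
 * Deactivate up to 'eoicount' active interrupts that live outside the
 * LRs, in list order. Returns how many were actually deactivated.
 */
static unsigned int drain_eoicount(struct irq_model *list, size_t n,
				   unsigned int eoicount)
{
	unsigned int done = 0;

	assert(list != NULL || n == 0);

	for (size_t i = 0; i < n && done < eoicount; i++) {
		if (list[i].in_lr || !list[i].active)
			continue;

		list[i].active = false;
		done++;
	}

	return done;
}
```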
Marc Zyngier [Thu, 20 Nov 2025 17:25:16 +0000 (17:25 +0000)]
KVM: arm64: Use MI to detect groups being enabled/disabled
Add the maintenance interrupt to force an exit when the guest
enables/disables individual groups, so that we can re-sort the
ap_list accordingly.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-27-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:15 +0000 (17:25 +0000)]
KVM: arm64: Move undeliverable interrupts to the end of ap_list
Interrupts in the ap_list that cannot be acted upon because they
are not enabled, or because their group is not enabled, shouldn't
make it into the LRs if we are space-constrained.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-26-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:14 +0000 (17:25 +0000)]
KVM: arm64: Invert ap_list sorting to push active interrupts out
Having established that pending interrupts should have priority
to be moved into the LRs over the active interrupts, implement this
in the ap_list sorting.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-25-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
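The resulting ordering can be illustrated with a comparator over a simplified interrupt record. Everything here is a sketch under assumptions: the structure is hypothetical, `enabled` stands in for both the per-interrupt and the group enable, and the tie-break on priority is an illustrative choice rather than a description of the real sort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct irq_model {
	uint32_t intid;
	uint8_t priority;	/* lower value means higher priority */
	bool pending;
	bool active;
	bool enabled;
};

/* 0: deliverable pending, 1: active, 2: cannot be acted upon right now. */
static int lr_rank(const struct irq_model *irq)
{
	if (irq->pending && irq->enabled)
		return 0;
	if (irq->active)
		return 1;
	return 2;
}

static int irq_cmp(const void *a, const void *b)
{
	const struct irq_model *x = a, *y = b;
	int rx = lr_rank(x), ry = lr_rank(y);

	if (rx != ry)
		return rx - ry;

	return (int)x->priority - (int)y->priority;
}

/* Front of the list gets the LRs; the tail is what spills out of them. */
static void sort_ap_list(struct irq_model *list, size_t n)
{
	assert(list != NULL || n == 0);

	if (n)
		qsort(list, n, sizeof(*list), irq_cmp);
}
```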
Marc Zyngier [Thu, 20 Nov 2025 17:25:13 +0000 (17:25 +0000)]
KVM: arm64: Make vgic_target_oracle() globally available
Make the internal crystal ball global, so that implementation-specific
code can use it.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-24-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:12 +0000 (17:25 +0000)]
KVM: arm64: Turn kvm_vgic_vcpu_enable() into kvm_vgic_vcpu_reset()
Now that we always reconfigure the vgic HCR register on entry,
the "enable" part of kvm_vgic_vcpu_enable() is pretty useless.
Removing the enable bits from these functions makes it plain that
they are just about computing the reset state. Just rename the
functions accordingly.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-23-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
We currently don't use the maintenance interrupt very much, apart
from EOI on level interrupts, and for LR underflow in limited cases.
However, as we are moving toward a setup where active interrupts
can live outside of the LRs, we need to use the MIs in a more
diverse set of cases.
Add a new helper that produces a digest of the ap_list, and use
that summary to set the various control bits as required.
This slightly changes the way v2 SGIs are handled, as they used to
count for more than one interrupt, but not anymore.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-22-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:10 +0000 (17:25 +0000)]
KVM: arm64: Eagerly save VMCR on exit
We currently save/restore the VMCR register in a pretty lazy way
(on load/put, consistently with what we do with the APRs).
However, we are going to need the group-enable bits that are backed
by VMCR on each entry (so that we can avoid injecting interrupts for
disabled groups).
Move the synchronisation from put to sync, which results in some minor
churn in the nVHE hypercalls to simplify things.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-21-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:09 +0000 (17:25 +0000)]
KVM: arm64: Compute vgic state irrespective of the number of interrupts
As we are going to rely on the [G]ICH_HCR{,_EL2} register to be
programmed with MI information at all times, slightly de-optimise
the flush/sync code to always be called. This is rather lightweight
when no interrupts are in flight.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-20-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:08 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Extract LR computing primitive
Split vgic_v2_populate_lr() into two helpers, so that we have another
primitive that computes the LR from a vgic_irq, but doesn't update
anything in the shadow structure.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-19-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:07 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Extract LR folding primitive
As we are going to need to handle deactivation for interrupts that
are not in the LRs, split vgic_v2_fold_lr_state() into a helper
that deals with a single interrupt, and the function that loops
over the used LRs.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-18-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:06 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Decouple GICH_HCR programming from LRs being loaded
Not programming GICH_HCR while no LRs are populated is a bit
of an issue, as we otherwise don't see any maintenance interrupt
when the guest interacts with the LRs.
Decouple the two and always program the control register, even when
we don't have to touch the LRs.
This is very similar to what we are already doing for GICv3.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-17-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:05 +0000 (17:25 +0000)]
KVM: arm64: GICv2: Preserve EOIcount on exit
EOIcount is how the virtual CPU interface signals that the guest
is deactivating interrupts outside of the LRs when EOImode==0.
We therefore need to preserve that information so that we can find
out what actually needs deactivating, just like we already do on
GICv3.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-16-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:04 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Extract LR computing primitive
Split vgic_v3_populate_lr() into two, so that we have another
primitive that computes the LR from a vgic_irq, but doesn't
update anything in the shadow structure.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-15-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:03 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Extract LR folding primitive
As we are going to need to handle deactivation for interrupts that
are not in the LRs, split vgic_v3_fold_lr_state() into a helper
that deals with a single interrupt, and the function that loops
over the used LRs.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-14-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:02 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Decouple ICH_HCR_EL2 programming from LRs
Not programming ICH_HCR_EL2 while no LRs are populated is a bit
of an issue, as we otherwise don't see any maintenance interrupt
when the guest interacts with the LRs.
Decouple the two and always program the control register, even when
we don't have to touch the LRs.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-13-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:25:01 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Preserve EOIcount on exit
EOIcount is how the virtual CPU interface signals that the guest
is deactivating interrupts outside of the LRs when EOImode==0.
We therefore need to preserve that information so that we can find
out what actually needs deactivating.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-12-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
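To make "preserving EOIcount" concrete, here is a minimal user-space sketch of pulling the counter out of a saved control register value. It assumes the EOIcount field sits in bits [31:27] of the register, as I understand the GIC architecture; the demo_* names are hypothetical and are not KVM's helpers.

```c
#include <stdint.h>

/* Assumption: EOIcount occupies bits [31:27] of the saved HCR value. */
#define DEMO_EOICOUNT_SHIFT	27
#define DEMO_EOICOUNT_MASK	(UINT64_C(0x1f) << DEMO_EOICOUNT_SHIFT)

/* Number of EOIs the guest issued for interrupts that were not in an LR. */
static unsigned int demo_eoicount(uint64_t saved_hcr)
{
	return (unsigned int)((saved_hcr & DEMO_EOICOUNT_MASK) >> DEMO_EOICOUNT_SHIFT);
}

/* Once the count has been consumed, start the next guest entry from zero. */
static uint64_t demo_clear_eoicount(uint64_t saved_hcr)
{
	return saved_hcr & ~DEMO_EOICOUNT_MASK;
}
```

The point is only that the value read back on exit must be kept around long enough for the deactivation code to look at it, rather than being overwritten before anyone reads the count.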
Marc Zyngier [Thu, 20 Nov 2025 17:25:00 +0000 (17:25 +0000)]
KVM: arm64: GICv3: Drop LPI active state when folding LRs
Despite LPIs not having an active state, *virtual* LPIs do have
one, which gets cleared on EOI. So far, so good.
However, this leads to a small problem: when an active LPI is not
in the LRs, EOImode==0, and the guest EOIs it, EOIcount
doesn't get bumped up. Which means that in these conditions, the
LPI would stay active forever.
Clearly, we can't have that. So if we spot an active LPI, we drop
that state. It's pretty pointless anyway, and only serves as a way
to trip SW over.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-11-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:59 +0000 (17:24 +0000)]
KVM: arm64: Add LR overflow handling documentation
Add a bit of documentation describing how we are dealing with LR
overflow. This is mostly a braindump of how things are expected
to work. For now anyway.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-10-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:58 +0000 (17:24 +0000)]
KVM: arm64: Add tracking of vgic_irq being present in a LR
We currently cannot identify whether an interrupt is queued into
a LR. It wasn't needed until now, but that's about to change.
Add yet another flag to track that state.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-9-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:57 +0000 (17:24 +0000)]
KVM: arm64: Repack struct vgic_irq fields
struct vgic_irq has grown over the years, in a rather bad way.
Repack it using bitfields for the individual flags, and move
things around a bit so that it is a bit smaller.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-8-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
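As a rough illustration of why repacking helps, the sketch below compares a layout with scattered bool flags against one that groups them as bitfields. Both structures and every field name are hypothetical; they are not the real struct vgic_irq layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "before": flags interleaved with wider fields, so each
 * bool costs a byte plus whatever padding the next field requires. */
struct demo_irq_unpacked {
	bool     pending;
	uint32_t intid;
	bool     active;
	bool     enabled;
	uint8_t  priority;
	bool     in_lr;
};

/* Hypothetical "after": wide fields first, flags sharing one storage unit. */
struct demo_irq_packed {
	uint32_t intid;
	uint8_t  priority;
	bool     pending:1;
	bool     active:1;
	bool     enabled:1;
	bool     in_lr:1;
};

/* Bitfield flags are still read and written like ordinary members. */
static void demo_set_in_lr(struct demo_irq_packed *irq, bool in_lr)
{
	irq->in_lr = in_lr;
}
```

The exact sizes depend on the ABI, but the packed variant should never come out larger than the unpacked one, and adding one more flag costs a bit rather than a byte.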
Marc Zyngier [Thu, 20 Nov 2025 17:24:56 +0000 (17:24 +0000)]
KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping
A long time ago, an unsuspecting architect forgot to add a trap
bit for ICV_DIR_EL1 in ICH_HCR_EL2. Which was unfortunate, but
what's a bit of spec between friends? Thankfully, this was fixed
in a later revision, and ARM "deprecates" the lack of trapping
ability.
Unfortunately, a few (billion) CPUs went out with that defect,
anything ARMv8.0 from ARM, give or take. And on these CPUs,
you can't trap DIR on its own, full stop.
As the next best thing, we can trap everything in the common group,
which is a tad expensive, but hey ho, that's what you get. You can
otherwise recycle the HW in the nearby bin.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-7-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:55 +0000 (17:24 +0000)]
KVM: arm64: vgic-v3: Fix GICv3 trapping in protected mode
As we are about to start trapping a bunch of extra things, augment
the pKVM trap description with all the registers trapped by ICH_HCR_EL2.TC,
making them legal instead of resulting in a UNDEF injection in the guest.
While we're at it, ensure that pKVM captures the vgic model so that it
can be checked by the emulation code.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-6-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:54 +0000 (17:24 +0000)]
KVM: arm64: Turn vgic-v3 errata traps into a patched-in constant
The trap bits are currently only set to manage CPU errata. However,
we are about to make use of them for purposes beyond beating broken
CPUs into submission.
For this purpose, turn these errata-driven bits into a patched-in
constant that is merged with the KVM-driven value at the point of
programming the ICH_HCR_EL2 register, rather than being directly
stored with the shadow value.
This allows the KVM code to distinguish between a trap being handled
for the purpose of an erratum workaround, or for KVM's own need.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-5-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
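The idea of keeping the errata-driven bits out of the shadow value can be sketched as follows. This is a user-space model under stated assumptions: a plain variable stands in for the patched-in constant, and the trap bit positions and demo_* names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative trap bits; positions are arbitrary, not the real register layout. */
#define DEMO_TRAP_COMMON	(UINT64_C(1) << 0)
#define DEMO_TRAP_GROUP1	(UINT64_C(1) << 1)

/* Stand-in for the patched-in constant: fixed once the CPU errata are known. */
static uint64_t demo_errata_traps;

struct demo_cpu_if {
	uint64_t shadow_hcr;	/* only the bits KVM itself asked for */
};

/* The two sources are merged only at the point of programming the register. */
static uint64_t demo_hcr_to_program(const struct demo_cpu_if *cpu_if)
{
	return cpu_if->shadow_hcr | demo_errata_traps;
}

/* Because the shadow stays clean, KVM can tell its own traps from errata ones. */
static bool demo_trap_wanted_by_kvm(const struct demo_cpu_if *cpu_if, uint64_t trap)
{
	return (cpu_if->shadow_hcr & trap) != 0;
}
```

With this split, a trap taken because of an erratum workaround and a trap taken because KVM requested it can be handled differently, even when both end up setting bits in the same hardware register.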
Marc Zyngier [Thu, 20 Nov 2025 17:24:53 +0000 (17:24 +0000)]
irqchip/apple-aic: Spit out ICH_MISR_EL2 value on spurious vGIC MI
It is all good and well to scream about spurious vGIC maintenance
interrupts. It would be even better to output the reason why, which
is already checked, but not printed out.
The unsuspecting kernel tinkerer thanks you.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-4-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:52 +0000 (17:24 +0000)]
irqchip/gic: Expose CPU interface VA to KVM
Future changes will require KVM to be able to perform deactivations
by writing to the physical CPU interface. Add the corresponding
VA to the kvm_info structure, and let KVM stash it.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-3-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Marc Zyngier [Thu, 20 Nov 2025 17:24:51 +0000 (17:24 +0000)]
irqchip/gic: Add missing GICH_HCR control bits
The GICH_HCR description is missing a bunch of control bits that
control the maintenance interrupt. Add them.
Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-2-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Oliver Upton [Mon, 24 Nov 2025 19:01:45 +0000 (11:01 -0800)]
KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2
Add support for FEAT_XNX to shadow stage-2 MMUs, being careful to only
evaluate XN[0] when the feature is actually exposed to the VM.
Restructure the layering of permissions in the fault handler to assume
pX and uX then restricting based on the guest's stage-2 afterwards.
Oliver Upton [Mon, 24 Nov 2025 19:01:44 +0000 (11:01 -0800)]
KVM: arm64: Add support for FEAT_XNX stage-2 permissions
FEAT_XNX adds support for encoding separate execute permissions for EL0
and EL1 at stage-2. Add support for this to the page table library,
hiding the unintuitive encoding scheme behind generic pX and uX
permission flags.
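To show what "hiding the unintuitive encoding" can look like, here is a small stand-alone sketch that maps generic pX/uX flags to a two-bit XN field and back. The encoding table reflects my reading of the architecture (0b00: both, 0b01: EL0 only, 0b10: neither, 0b11: EL1 only) and should be checked against the Arm ARM; the DEMO_* flags and function names are hypothetical and not the page table library's API.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PROT_PX	(1u << 0)	/* executable at EL1 */
#define DEMO_PROT_UX	(1u << 1)	/* executable at EL0 */

/* Generic flags -> XN[1:0], assuming FEAT_XNX is implemented. */
static uint8_t demo_prot_to_xn(unsigned int prot)
{
	switch (prot & (DEMO_PROT_PX | DEMO_PROT_UX)) {
	case DEMO_PROT_PX | DEMO_PROT_UX:
		return 0x0;
	case DEMO_PROT_UX:
		return 0x1;
	case DEMO_PROT_PX:
		return 0x3;
	default:
		return 0x2;
	}
}

/* XN[1:0] -> generic flags. */
static unsigned int demo_xn_to_prot(uint8_t xn, bool has_xnx)
{
	/* Without FEAT_XNX only XN[1] is meaningful, so ignore XN[0]. */
	if (!has_xnx)
		xn &= 0x2;

	switch (xn & 0x3) {
	case 0x0:
		return DEMO_PROT_PX | DEMO_PROT_UX;
	case 0x1:
		return DEMO_PROT_UX;
	case 0x3:
		return DEMO_PROT_PX;
	default:
		return 0;
	}
}
```

Callers then reason only in terms of pX and uX, and the has_xnx argument models evaluating XN[0] only when the feature is exposed to the VM.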
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:
The warning is seen mostly on host kernels that are configured
not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this,
instead of walking the entire page-table in one go, split it into
smaller ranges, by checking for cond_resched() between each range.
Since the path is executed during VM destruction, after the
page-table structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.
Split kvm_pgtable_stage2_destroy() into two:
- kvm_pgtable_stage2_destroy_range(), that performs the
page-table walk and frees the entries over a range of addresses.
- kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
This refactoring enables subsequent patches to free large page-tables
in chunks, calling cond_resched() between each chunk, to yield the
CPU as necessary.
Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot
take advantage of this (such as nVHE), will continue to function as is.
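The chunked teardown described above can be modelled in user space as below. This is only a sketch: demo_destroy_range() and demo_cond_resched() are hypothetical stand-ins for the range walker and for cond_resched(), and the chunk size is an arbitrary assumption.

```c
#include <stdint.h>

#define DEMO_CHUNK_SIZE	(UINT64_C(1) << 30)	/* arbitrary chunk size */

static uint64_t demo_bytes_destroyed;
static unsigned int demo_resched_points;

/* Stand-in for the walker that frees the entries covering [addr, addr + size). */
static void demo_destroy_range(uint64_t addr, uint64_t size)
{
	(void)addr;
	demo_bytes_destroyed += size;
}

/* Stand-in for cond_resched(): only records that a yield point was reached. */
static void demo_cond_resched(void)
{
	demo_resched_points++;
}

/* Walk [start, end) one chunk at a time, yielding between chunks. */
static void demo_destroy_in_chunks(uint64_t start, uint64_t end)
{
	uint64_t addr = start;

	while (addr < end) {
		/* Compare against the remaining size to avoid overflowing addr. */
		uint64_t next = (end - addr > DEMO_CHUNK_SIZE) ?
				addr + DEMO_CHUNK_SIZE : end;

		demo_destroy_range(addr, next - addr);
		addr = next;

		if (addr < end)
			demo_cond_resched();
	}
}
```

Once every range has been torn down, a separate step would free the top-level table, which is the role the message assigns to kvm_pgtable_stage2_destroy_pgd().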
Oliver Upton [Wed, 19 Nov 2025 22:11:50 +0000 (14:11 -0800)]
KVM: arm64: Only drop references on empty tables in stage2_free_walker
A subsequent change to the way KVM frees stage-2s will invoke the free
walker on sub-ranges of the VM's IPA space, meaning there's potential
for only partially visiting a table's PTEs.
Split the leaf and table visitors and only drop references on a table
when the page count reaches 1, implying there are no valid PTEs that
need to be visited. Invalidate the table PTE to avoid traversing the
stale reference.
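A toy model of the rule "only drop the table once nothing valid is left in it" might look like this. It assumes a table holds one reference for itself plus one per valid entry; the structure, its size, and the function names are all hypothetical.

```c
#include <stdbool.h>

#define DEMO_PTES	8

struct demo_table {
	unsigned int page_count;	/* 1 for the table + 1 per valid entry */
	bool valid[DEMO_PTES];
	bool linked;			/* parent's table PTE still points here */
};

static void demo_table_init(struct demo_table *t)
{
	for (unsigned int i = 0; i < DEMO_PTES; i++)
		t->valid[i] = true;
	t->page_count = 1 + DEMO_PTES;
	t->linked = true;
}

/* Free the leaves in [first, last); release the table only if it is now empty. */
static void demo_free_walk(struct demo_table *t, unsigned int first, unsigned int last)
{
	for (unsigned int i = first; i < last && i < DEMO_PTES; i++) {
		if (!t->valid[i])
			continue;
		t->valid[i] = false;
		t->page_count--;
	}

	/* A partial visit leaves the count above 1, so the table must stay. */
	if (t->linked && t->page_count == 1) {
		t->linked = false;	/* invalidate the table PTE first */
		t->page_count = 0;	/* then drop the table's own reference */
	}
}
```

Walking only half of the entries leaves the table linked; the second walk over the remaining half is what finally releases it.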
KVM: selftests: SYNC after guest ITS setup in vgic_lpi_stress
vgic_lpi_stress sends MAPTI and MAPC commands during guest GIC setup to
map interrupt events to ITT entries and collection IDs to
redistributors, respectively.
We have no guarantee that the ITS will finish handling these mapping
commands before the selftest calls KVM_SIGNAL_MSI to inject LPIs to the
guest. If LPIs are injected before ITS mapping completes, the ITS cannot
properly pass the interrupt on to the redistributor.
Fix by adding a SYNC command to the selftests ITS library, then calling
SYNC after ITS mapping to ensure mapping completes before signal_lpi()
writes to GITS_TRANSLATER.
Oliver Upton [Wed, 19 Nov 2025 09:38:22 +0000 (01:38 -0800)]
KVM: arm64: Use kvzalloc() for kvm struct allocation
Physically-allocated KVM structs aren't necessary when in VHE mode as
there's no need to share with the hyp's address space. Of course, there
can still be a performance benefit from physical allocations.
Use kvzalloc() for opportunistic physical allocations.
Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://msgid.link/20251119093822.2513142-3-oupton@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
Oliver Upton [Wed, 19 Nov 2025 09:38:21 +0000 (01:38 -0800)]
KVM: arm64: Drop useless __GFP_HIGHMEM from kvm struct allocation
A recent change on the receiving end of vmalloc() started warning about
unsupported GFP flags passed by the caller. Nathan reports that this
warning fires in kvm_arch_alloc_vm(), owing to the fact that KVM is
passing a meaningless __GFP_HIGHMEM.
Jiaqi Yan [Mon, 13 Oct 2025 18:59:03 +0000 (18:59 +0000)]
Documentation: kvm: new UAPI for handling SEA
Document the new userspace-visible features and APIs for handling
synchronous external abort (SEA):
- KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.
- KVM_EXIT_ARM_SEA: exit userspace gets when it needs to handle SEA
and what userspace gets while taking the SEA.
Jiaqi Yan [Mon, 13 Oct 2025 18:59:02 +0000 (18:59 +0000)]
KVM: selftests: Test for KVM_EXIT_ARM_SEA
Test how KVM handles guest SEA when APEI is unable to claim it, and
KVM_CAP_ARM_SEA_TO_USER is enabled.
The behavior is triggered by consuming a recoverable memory error (UER)
injected via EINJ. The test asserts two major things:
1. KVM returns to userspace with KVM_EXIT_ARM_SEA exit reason, and
has provided expected fault information, e.g. esr, flags, gva, gpa.
2. Userspace is able to handle KVM_EXIT_ARM_SEA by injecting SEA to
guest and KVM injects expected SEA into the VCPU.
Tested on a data center server running Siryn AmpereOne processor
that has RAS support.
Several things to notice before attempting to run this selftest:
- The test relies on EINJ support in both firmware and kernel to
inject UER. Otherwise the test will be skipped.
- The under-test platform's APEI should be unable to claim the SEA.
Otherwise the test will be skipped.
- Some platforms don't support notrigger in EINJ, which may cause
APEI and GHES to offline the memory before the guest can consume the
injected UER, making the test unable to trigger SEA.
Jiaqi Yan [Mon, 13 Oct 2025 18:59:01 +0000 (18:59 +0000)]
KVM: arm64: VM exit to userspace to handle SEA
When APEI fails to handle a stage-2 synchronous external abort (SEA),
today KVM injects an asynchronous SError to the VCPU then resumes it,
which usually results in an unpleasant guest kernel panic.
One major situation of guest SEA is when a vCPU consumes a recoverable
uncorrected memory error (UER). Although SError and guest kernel panic
effectively stop the propagation of corrupted memory, the guest may
re-use the corrupted memory if auto-rebooted; in the worst case, guest
boot may run into poisoned memory. So there is room to recover from
a UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via
KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
consumption or fault is not from guest kernel, blast radius can be
limited to the triggering thread in guest userspace, so VM can
keep running.
- Allow VMM to protect from future memory poison consumption by
unmapping the page from stage-2, or to interrupt the guest about the
poisoned page so the guest kernel can unmap it from its stage-1 page table.
- Allow VMM to track SEA events that VM customers care about, to restart
the VM when a certain number of distinct poison events have happened,
and to provide observability to customers in a log management UI.
Introduce a userspace-visible feature to enable the VMM to handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
when host APEI fails to claim a SEA, userspace can opt in to this new
capability to let KVM exit to userspace during SEA if it is not
owned by host.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
KVM fills kvm_run.arm_sea with as much information as possible about
the SEA, enabling VMM to emulate SEA to guest by itself.
- Sanitized ESR_EL2. The general rule is to keep only the bits
useful for userspace and relevant to guest memory.
- Flags indicating if faulting guest physical address is valid.
- Faulting guest physical and virtual addresses if valid.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://msgid.link/20251013185903.1372553-2-jiaqiyan@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>
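A VMM-side view of this exit could be sketched roughly as follows. The structure below is a hypothetical mirror of the information listed above (sanitized ESR, flags, GVA, GPA), not the real kvm_run.arm_sea layout; the flag name, the 4 KiB page size and the action enum are assumptions made for illustration, so consult the UAPI header for the actual definitions.

```c
#include <stdint.h>

/* Hypothetical mirror of the exit payload; not the real UAPI structure. */
struct demo_arm_sea {
	uint64_t flags;
	uint64_t esr;
	uint64_t gva;
	uint64_t gpa;
};

#define DEMO_SEA_FLAG_GPA_VALID	(UINT64_C(1) << 0)	/* assumed flag name */
#define DEMO_PAGE_SIZE		UINT64_C(4096)		/* assumed page size */

enum demo_sea_action {
	DEMO_SEA_INJECT_ONLY,		/* no usable GPA: just emulate the SEA */
	DEMO_SEA_UNMAP_THEN_INJECT,	/* isolate the poisoned page first */
};

/* Decide what the VMM does before injecting the SEA back into the guest. */
static enum demo_sea_action demo_plan_sea(const struct demo_arm_sea *sea,
					  uint64_t *poisoned_page)
{
	if (!(sea->flags & DEMO_SEA_FLAG_GPA_VALID))
		return DEMO_SEA_INJECT_ONLY;

	/* Round down to the page that contains the faulting address. */
	*poisoned_page = sea->gpa & ~(DEMO_PAGE_SIZE - 1);
	return DEMO_SEA_UNMAP_THEN_INJECT;
}
```

In a real VMM the injection step would go through the existing KVM_SET_VCPU_EVENTS API mentioned above, and the page address could also feed whatever event tracking the VMM keeps.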
Linus Torvalds [Sun, 26 Oct 2025 17:33:46 +0000 (10:33 -0700)]
Merge tag 'char-misc-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc/android driver fixes for 6.18-rc3 for
reported issues. Included in here are:
- rust binder fixes for reported issues
- mei device id addition
- mei driver fixes
- comedi bugfix
- most usb driver bugfixes
- fastrpc memory leak fix
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
most: usb: hdm_probe: Fix calling put_device() before device initialization
most: usb: Fix use-after-free in hdm_disconnect
binder: remove "invalid inc weak" check
mei: txe: fix initialization order
comedi: fix divide-by-zero in comedi_buf_munge()
mei: late_bind: Fix -Wincompatible-function-pointer-types-strict
misc: fastrpc: Fix dma_buf object leak in fastrpc_map_lookup
mei: me: add wildcat lake P DID
misc: amd-sbi: Clarify that this is a BMC driver
nvmem: rcar-efuse: add missing MODULE_DEVICE_TABLE
binder: Fix missing kernel-doc entries in binder.c
rust_binder: report freeze notification only when fully frozen
rust_binder: don't delete FreezeListener if there are pending duplicates
rust_binder: freeze_notif_done should resend if wrong state
rust_binder: remove warning about orphan mappings
rust_binder: clean `clippy::mem_replace_with_default` warning
Linus Torvalds [Sun, 26 Oct 2025 17:29:45 +0000 (10:29 -0700)]
Merge tag 'staging-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are some small staging driver fixes for the gpib subsystem to
resolve some reported issues. Included in here are:
- memory leak fixes
- error code fixes
- proper protocol fixes
All of these have been in linux-next for almost 2 weeks now with no
reported issues"
* tag 'staging-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: gpib: Fix device reference leak in fmh_gpib driver
staging: gpib: Return -EINTR on device clear
staging: gpib: Fix sending clear and trigger events
staging: gpib: Fix no EOI on 1 and 2 byte writes
Linus Torvalds [Sun, 26 Oct 2025 17:24:39 +0000 (10:24 -0700)]
Merge tag 'tty-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are some small tty and serial driver fixes for reported issues.
Included in here are:
- sh-sci serial driver fixes
- 8250_dw and _mtk driver fixes
- sc16is7xx driver bugfix
- new 8250_exar device ids added
All of these have been in linux-next this past week with no reported
issues"
* tag 'tty-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250_mtk: Enable baud clock and manage in runtime PM
serial: 8250_dw: handle reset control deassert error
dt-bindings: serial: sh-sci: Fix r8a78000 interrupts
serial: sc16is7xx: remove useless enable of enhanced features
serial: 8250_exar: add support for Advantech 2 port card with Device ID 0x0018
tty: serial: sh-sci: fix RSCI FIFO overrun handling
Linus Torvalds [Sun, 26 Oct 2025 16:57:18 +0000 (09:57 -0700)]
Merge tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Remove dead code leftovers after a recent mitigations cleanup which
fail a Clang build
- Make sure a Retbleed mitigation message is printed only when
necessary
- Correct the last Zen1 microcode revision for which Entrysign sha256
check is needed
- Fix a NULL ptr deref when mounting the resctrl fs on a system which
supports assignable counters but where L3 total and local bandwidth
monitoring has been disabled at boot
* tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Remove dead code which might prevent from building
x86/bugs: Qualify RETBLEED_INTEL_MSG
x86/microcode: Fix Entrysign revision check for Zen1/Naples
x86,fs/resctrl: Fix NULL pointer dereference with events force-disabled in mbm_event mode
Linus Torvalds [Sun, 26 Oct 2025 16:54:36 +0000 (09:54 -0700)]
Merge tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Restore the original buslock locking in a couple of places in the irq
core subsystem after a rework
* tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/manage: Add buslock back in to enable_irq()
genirq/manage: Add buslock back in to __disable_irq_nosync()
genirq/chip: Add buslock back in to irq_set_handler()
Linus Torvalds [Sun, 26 Oct 2025 16:44:36 +0000 (09:44 -0700)]
Merge tag 'objtool_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Fix x32 build due to wrong format specifier on that sub-arch
- Add one more Rust noreturn function to objtool's list
* tag 'objtool_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix failure when being compiled on x32 system
objtool/rust: add one more `noreturn` Rust function
Linus Torvalds [Sun, 26 Oct 2025 16:42:19 +0000 (09:42 -0700)]
Merge tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Make sure a CFS runqueue on a throttled hierarchy has its PELT clock
throttled otherwise task movement and manipulation would lead to
dangling cfs_rq references and an eventual crash
* tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Linus Torvalds [Sun, 26 Oct 2025 16:40:16 +0000 (09:40 -0700)]
Merge tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Do not create more than eight (max supported) AUX clocks sysfs
hierarchies
* tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix aux clocks sysfs initialization loop bound
Linus Torvalds [Sat, 25 Oct 2025 16:35:26 +0000 (09:35 -0700)]
Merge tag 'riscv-for-linus-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Close a race during boot between userspace vDSO usage and some
late-initialized vDSO data
- Improve performance on systems with non-CPU-cache-coherent
DMA-capable peripherals by enabling write combining on
pgprot_dmacoherent() allocations
- Add human-readable detail for RISC-V IPI tracing
- Provide more information to zsmalloc on 64-bit RISC-V to improve
allocation
- Silence useless boot messages about CPUs that have been disabled in
DT
- Resolve some compiler and smatch warnings and remove a redundant
macro
* tag 'riscv-for-linus-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: hwprobe: avoid uninitialized variable use in hwprobe_arch_id()
riscv: cpufeature: avoid uninitialized variable in has_thead_homogeneous_vlenb()
riscv: hwprobe: Fix stale vDSO data for late-initialized keys at boot
riscv: add a forward declaration for cpuinfo_op
RISC-V: Don't print details of CPUs disabled in DT
riscv: Remove the PER_CPU_OFFSET_SHIFT macro
riscv: mm: Define MAX_POSSIBLE_PHYSMEM_BITS for zsmalloc
riscv: Register IPI IRQs with unique names
ACPI: RIMT: Fix unused function warnings when CONFIG_IOMMU_API is disabled
RISC-V: Define pgprot_dmacoherent() for non-coherent devices
Linus Torvalds [Sat, 25 Oct 2025 16:31:13 +0000 (09:31 -0700)]
Merge tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"The main highlight here is a fix for a bug brought in by the removal
of the attr2 mount option, where some installations might actually have
'attr2' explicitly configured in fstab, preventing the system from booting
because the rootfs cannot be remounted as RW.
Besides that there are a couple of fixes to the zonefs implementation,
changing XFS_ONLINE_SCRUB_STATS to depend on DEBUG_FS (was select
before), and some other minor changes"
* tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix locking in xchk_nlinks_collect_dir
xfs: loudly complain about defunct mount options
xfs: always warn about deprecated mount options
xfs: don't set bt_nr_sectors to a negative number
xfs: don't use __GFP_NOFAIL in xfs_init_fs_context
xfs: cache open zone in inode->i_private
xfs: avoid busy loops in GCD
xfs: XFS_ONLINE_SCRUB_STATS should depend on DEBUG_FS
xfs: do not tightly pack-write large files
xfs: Improve CONFIG_XFS_RT Kconfig help