git.ipfire.org Git - thirdparty/kernel/linux.git/commit

KVM: x86: Defer non-architectural deliver of exception payload to userspace read

When attempting to play nice with userspace that hasn't enabled
KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the
payload until userspace actually reads relevant vCPU state, and more
importantly, force delivery of the payload in *all* paths where userspace
saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS.

Ignoring userspace save/restore for the moment, delivering the payload
before the exception is injected is wrong regardless of whether L1 or L2
is running.  To make matters even more confusing, the flaw *currently*
being papered over by the !is_guest_mode() check isn't even the same bug
that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF
delivery") was trying to avoid.

At the time of commit da998b46d244, KVM didn't correctly handle exception
intercepts, as KVM would wait until VM-Entry into L2 was imminent to check
if the queued exception should morph to a nested VM-Exit.  I.e. KVM would
deliver the payload to L2 and then synthesize a VM-Exit into L1.  But the
payload was only the most blatant issue, e.g. waiting to check exception
intercepts would also lead to KVM incorrectly escalating a
should-be-intercepted #PF into a #DF.

That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM: x86:
Morph pending exceptions to pending VM-Exits at queue time"), but in the
interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on
KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on
the !is_guest_mode() check.

While not recorded in the changelog, the motivation for deferring the
!exception_payload_enabled delivery was to fix a flaw where a synthesized
MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6.
On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which
is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the
payload from the queue/pending exception.  I.e. prematurely delivering the
payload would cause the pending #DB to not be recorded in the VMCS, and of
course, clobber L2's DR6 as seen by L1.

Jumping back to save+restore, the quirked behavior of forcing delivery of
the payload only works if userspace does KVM_GET_VCPU_EVENTS *before*
CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS.
E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the
CR2 saved by userspace won't contain the payload for the exception save by
KVM_GET_VCPU_EVENTS.

Deliberately deliver the payload in the store_regs() path, as it's the
least awful option even though userspace may not be doing save+restore.
Because if userspace _is_ doing save restore, it could elide KVM_GET_SREGS
knowing that SREGS were already saved when the vCPU exited.

Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@google.com
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Cc: stable@vger.kernel.org
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Tested-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20260218005438.2619063-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

author	Sean Christopherson <seanjc@google.com>
	Wed, 18 Feb 2026 00:54:38 +0000 (16:54 -0800)
committer	Sean Christopherson <seanjc@google.com>
	Mon, 2 Mar 2026 17:53:50 +0000 (09:53 -0800)
commit	d0ad1b05bbe6f8da159a4dfb6692b3b7ce30ccc8
tree	55f1d23fd1322285a5bd863ade9ce85365ebbb75	tree \| snapshot
parent	5c247d08bc81bbad4c662dcf5654137a2f8483ec	commit \| diff