git.ipfire.org Git - thirdparty/kernel/linux.git/log

io-wq: check that the predecessor is hashed in io_wq_remove_pending()

io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled
work was the tail of its hash bucket. When doing this, it checks whether
the preceding entry in acct->work_list has the same hash value, but
never checks that the predecessor is hashed at all. io_get_work_hash()
is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash
bits are never set for non-hashed work, so it returns 0. Thus, when a
hashed bucket-0 work is cancelled while a non-hashed work is its list
predecessor, the check spuriously passes and a pointer to the non-hashed
io_kiocb is stored in wq->hash_tail[0].

Because non-hashed work is dequeued via the fast path in
io_get_next_work(), which never touches hash_tail[], the stale pointer
is never cleared. Therefore, after the non-hashed io_kiocb completes and
is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The
io_wq is per-task (tctx->io_wq) and survives ring open/close, so the
dangling pointer persists for the lifetime of the task; the next hashed
bucket-0 enqueue dereferences it in io_wq_insert_work() and
wq_list_add_after() writes through freed memory.

Add the missing io_wq_is_hashed() check so a non-hashed predecessor
never inherits a hash_tail[] slot.
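
The tail fix-up described above can be modelled in user space roughly as
follows. This is a minimal sketch, not the kernel code: struct work, the
prev pointer, the constant names and values, and remove_pending() are
hypothetical stand-ins, and unlinking the work from the list is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed flag layout for this sketch: the hash value sits in the top bits. */
#define WORK_HASHED  2u
#define HASH_SHIFT   24
#define NR_BUCKETS   64   /* size callers are assumed to use for hash_tail[] */

struct work {
	unsigned int flags;
	struct work *prev;   /* hypothetical: predecessor in the pending list */
};

static bool work_is_hashed(const struct work *w)
{
	return w->flags & WORK_HASHED;
}

static unsigned int work_hash(const struct work *w)
{
	return w->flags >> HASH_SHIFT;   /* evaluates to 0 for non-hashed work */
}

/* Fix up hash_tail[] when a pending work item is cancelled. */
static void remove_pending(struct work *hash_tail[], struct work *w)
{
	struct work *prev = w->prev;
	unsigned int hash;

	if (!work_is_hashed(w))
		return;
	hash = work_hash(w);
	if (hash_tail[hash] != w)
		return;
	/* The added check: only a hashed predecessor may inherit the slot. */
	if (prev && work_is_hashed(prev) && work_hash(prev) == hash)
		hash_tail[hash] = prev;
	else
		hash_tail[hash] = NULL;
}
```

Without the work_is_hashed(prev) test, cancelling a bucket-0 tail whose
predecessor is non-hashed would leave hash_tail[0] pointing at that
predecessor, which is the stale pointer the commit describes.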

Cc: stable@vger.kernel.org
Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

thermal: hwmon: Use extra_groups for adding temperature attributes

Instead of passing NULL as the last argument to __hwmon_device_register()
in hwmon_device_register_for_thermal() and then adding each temperature
sysfs attribute to the hwmon device via device_create_file(), redefine
hwmon_device_register_for_thermal() to take an extra_groups argument
that will be passed to __hwmon_device_register(), define an attribute
group with a proper .is_visible() callback for the temperature
attributes and a related attribute groups pointer, and pass the latter
to hwmon_device_register_for_thermal().

This makes the code considerably more straightforward and closer to
what the other users of the hwmon subsystem do.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/8704209.T7Z3S40VBb@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: hwmon: Register a hwmon device for each thermal zone

The current code creates one hwmon device per thermal zone type and that
device is registered under the first thermal zone of the given type.

That turns out to be problematic when the thermal zone holding the
hwmon device is removed.

For example, say that there are two ACPI thermal zones on a system:

/sys/devices/virtual/thermal/thermal_zone0/
/sys/devices/virtual/thermal/thermal_zone1/

The current code registers a hwmon class device for thermal_zone0 only:

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/

because the type is "acpitz" for both of them, but it adds a sysfs
attribute that belongs to thermal_zone1 under it:

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp2_input

There is also

/sys/devices/virtual/thermal/thermal_zone0/hwmon0/temp1_input

which belongs to thermal_zone0.

When thermal_zone0 is removed, say because the ACPI thermal driver is
unbound from the underlying platform device, the removal code skips the
removal of hwmon0 because of the temp2_input attribute belonging to
thermal_zone1 which effectively prevents thermal_zone0 removal from
making progress.

To address this problem, rework the thermal hwmon code to register one
hwmon device for each thermal zone, but since user space utilities
produce confusing output in some cases when there are multiple hwmon
devices with the same name attribute value present under thermal zones
of the same type, append the thermal zone ID preceded by an underscore
character to the name of the hwmon device registered for that thermal
zone.

Link: https://lore.kernel.org/linux-pm/20260402021828.16556-1-liujia6264@gmail.com/
Fixes: f6b6b52ef7a5 ("thermal_hwmon: Pass the originating device down to hwmon_device_register_with_info")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/3070412.e9J7NaK4W3@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: hwmon: Fix critical temperature attribute removal

Since the return value of thermal_zone_crit_temp_valid() depends on
the behavior of the thermal zone .get_crit_temp() callback which
may change over time in theory, thermal_remove_hwmon_sysfs() may
attempt to remove a critical temperature attribute that has not
been created, passing a pointer to an uninitialized attribute
structure to device_remove_file().

To avoid that, set a flag in struct thermal_hwmon_temp after creating
a critical temperature attribute and use the value of that flag to
decide whether or not the attribute needs to be removed.

Fixes: e8db5d6736a7 ("thermal: hwmon: Make the check for critical temp valid consistent")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/2437056.ElGaqSPkdT@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/core: Use the thermal class pointer as init guard

The thermal class is now dynamically allocated and stored as a
pointer.

Use the thermal_class pointer itself to check whether the thermal
class has been created instead of keeping a separate
thermal_class_unavailable flag.

Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20260508180511.1306659-5-daniel.lezcano@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/core: Allocate the thermal class dynamically

Use class_create() instead of a statically allocated struct class.

This allows the thermal class to be managed through a dynamically
allocated class object and avoids keeping a static class instance
around.

Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Added __ro_after_init to thermal_class ]
[ rjw: Used temporary local var to store class_create() return value ]
Link: https://patch.msgid.link/20260508180511.1306659-4-daniel.lezcano@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/core: Add dedicated release callback for thermal zones

The thermal class release callback currently handles thermal zone
cleanup by checking the device name prefix.

Move the thermal zone cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the thermal zone lifetime handling local to the thermal
zone object.

Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20260508180511.1306659-3-daniel.lezcano@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/core: Add dedicated release callback for cooling devices

The thermal class release callback currently handles both thermal
zones and cooling devices by checking the device name prefix.

Move the cooling device cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the cooling device lifetime handling local to the
cooling device object.

Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20260508180511.1306659-2-daniel.lezcano@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

media: dt-bindings: mediatek: Constrain iommus

Lists should have fixed constraints, because bindings must be specific
with respect to the hardware. Add the missing constraints on the number
of iommus in Mediatek media devices and remove the redundant and obvious
description.

Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250821065900.17430-2-krzysztof.kozlowski@linaro.org
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

tools/sched_ext: scx_qmap: Fix qa arena placement

__arena is a pointer qualifier meaning "this pointer points to arena
memory". When used on a global variable declaration, it expands to
nothing in scx's build because __BPF_FEATURE_ADDR_SPACE_CAST is never
defined, leaving qa as a plain global in BSS. bpftool then generates
skel->bss->qa instead of the expected skel->arena->qa, causing:

scx_qmap.c: error: 'struct scx_qmap' has no member named 'arena'

__arena_global is the correct annotation for global variables that
reside in the arena. When __BPF_FEATURE_ADDR_SPACE_CAST is not defined
it expands to SEC(".addr_space.1"), placing qa in the arena ELF section.
When __BPF_FEATURE_ADDR_SPACE_CAST is defined it expands to
__attribute__((address_space(1))). In both cases bpftool generates the
typed skel->arena accessor.

Fixes: 60a59eaca71b ("sched_ext: scx_qmap: move globals and cpu_ctx into a BPF arena map")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

cgroup/cpuset: Return only actually allocated CPUs during partition invalidation

In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs
to return to the parent are computed as:

    adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);

where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().

Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
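
The difference between the two masks can be illustrated with plain
bitmasks. This is a simplified sketch under stated assumptions: cpus_t,
CPUS() and both helper functions are hypothetical, the numbers are taken
from b1's initial request (1-2) and grant (2) in the reproducer, and the
sketch does not try to reproduce the exact cpuset.cpus.effective output.

```c
#include <stdint.h>

/* One bit per CPU; a hypothetical stand-in for struct cpumask. */
typedef uint32_t cpus_t;

/* Mask with CPUs lo..hi (inclusive) set. */
#define CPUS(lo, hi) ((cpus_t)(((1u << ((hi) - (lo) + 1)) - 1u) << (lo)))

/* Old behaviour: return whatever the user requested for the partition. */
static cpus_t cpus_to_return_old(cpus_t user_xcpus, cpus_t parent_xcpus)
{
	return user_xcpus & parent_xcpus;
}

/* New behaviour: return only what the partition was actually granted. */
static cpus_t cpus_to_return_new(cpus_t effective_xcpus, cpus_t parent_xcpus)
{
	return effective_xcpus & parent_xcpus;
}
```

With a parent owning CPUs 0-3, a1 holding 0-1, and b1 requesting 1-2 but
granted only 2, the old computation hands back CPU 1 even though a1 still
owns it, while the new one hands back CPU 2 only.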

Reproducer (on a 4-CPU machine):

    cd /sys/fs/cgroup
    mkdir a1 b1

    # a1 becomes partition root with CPUs 0-1
    echo "0-1" > a1/cpuset.cpus
    echo "root" > a1/cpuset.cpus.partition

    # b1 becomes partition root with CPUs 1-2, but sibling exclusion
    # reduces its effective_xcpus to CPU 2 only
    echo "1-2" > b1/cpuset.cpus
    echo "root" > b1/cpuset.cpus.partition

    # b1 changes cpus_allowed to 0-1 -> partition invalidation
    echo "0-1" > b1/cpuset.cpus

    # Expected: CPUs 2-3  (only CPU 2 returned from b1)
    # Actual:   CPUs 1-3  (CPU 0-1 returned, overlapping with a1)
    cat cpuset.cpus.effective

dmesg will also show a WARNING from generate_sched_domains() reporting
overlapping partition root effective_cpus.

Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Signed-off-by: sunshaojie <sunshaojie@kylinos.cn>
Tested-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"arm64:

   - Add the pKVM side of the workaround for ARM's erratum 4193714,
     provided that the EL3 firmware does its part of the job. KVM will
     refuse to initialise otherwise

   - Correctly handle 52bit VAs for guest EL2 stage-1 translations when
     running under NV with E2H==0

   - Correctly deal with permission faults in guest_memfd memslots

   - Fix the steal-time selftest after the infrastructure was reworked

   - Make sure the host cannot pass a non-sensical clock update to the
     EL2 tracing infrastructure

   - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
     ability to run arm64 guests, which will inevitably lead to arm64
     code being directly used on s390

   - Make sure that EL2 is configured with both exception entry and exit
     being Context Synchronization Events

   - Handle the current vcpu being NULL on EL2 panic

   - Fix the selftest_vcpu memcache being empty at the point of donation
     or sharing

   - Check that the memcache has enough capacity before engaging on the
     share/donate path

   - Fix __deactivate_fgt() to use its parameter rather than a variable
     in the macro context

  s390:

   - Fix array overrun with large amounts of PCI devices

  x86:

   - Never use L0's PAUSE loop exiting while L2 is running, since it's
     unlikely that a nested guest will help solving the hypervisor's
     spinlock contention

   - Fix emulation of MOVNTDQA

   - Fix typo in Xen hypercall tracepoint

   - Add back an optimization that was left behind when recently fixing
     a bug

   - Add module parameter to disable CET, whose implementation seems to
     have issues. For now it remains enabled by default

  Generic:

   - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()

  Documentation:

   - Update stale links

  Selftests:

   - Fix guest_memfd_test with host page size > guest page size"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
  KVM: VMX: introduce module parameter to disable CET
  KVM: x86: Swap the dst and src operand for MOVNTDQA
  KVM: x86: use again the flush argument of __link_shadow_page()
  KVM: selftests: Ensure gmem file sizes are multiple of host page size
  Documentation: kvm: update links in the references section of AMD Memory Encryption
  KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
  KVM: x86: Fix Xen hypercall tracepoint argument assignment
  KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
  KVM: arm64: Pre-check vcpu memcache for host->guest donate
  KVM: arm64: Pre-check vcpu memcache for host->guest share
  KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
  KVM: arm64: Fix __deactivate_fgt macro parameter typo
  KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
  KVM: arm64: Make EL2 exception entry and exit context-synchronization events
  MAINTAINERS: Add Steffen as reviewer for KVM/arm64
  KVM: arm64: Remove potential UB on nvhe tracing clock update
  KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
  KVM: arm64: Handle permission faults with guest_memfd
  KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
  KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
  ...

selftests/cgroup: Fix error path leaks in test_percpu_basic

When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.

When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.
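
The two error paths can be sketched in a self-contained form as below.
This is an illustration only: name_indexed(), name_free(), child_create()
and the counters are hypothetical stand-ins for cg_name_indexed(),
free(), cg_create() and the real cgroup state, and run_test() is not the
selftest itself.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int live_names;            /* pathname strings not yet freed */
static int live_children;         /* child "cgroups" not yet destroyed */
static int fail_create_at = -1;   /* index at which child_create() fails */

static char *name_indexed(const char *parent, int i)
{
	size_t len = strlen(parent) + 32;
	char *name = malloc(len);

	if (!name)
		return NULL;
	snprintf(name, len, "%s/child_%d", parent, i);
	live_names++;
	return name;
}

static void name_free(char *name)
{
	if (name) {
		free(name);
		live_names--;
	}
}

static int child_create(int i)
{
	if (i == fail_create_at)
		return -1;
	live_children++;
	return 0;
}

static int run_test(const char *root, int nr_children)
{
	int ret = -1, i, created = 0;
	char *parent = name_indexed(root, 0);

	if (!parent)
		return -1;

	for (i = 0; i < nr_children; i++) {
		char *child = name_indexed(parent, i);

		if (!child)
			goto cleanup_children;   /* previously: return -1 */
		if (child_create(i)) {
			name_free(child);        /* previously missing */
			goto cleanup_children;
		}
		name_free(child);
		created++;
	}
	ret = 0;

cleanup_children:
	live_children -= created;   /* stands in for destroying each child */
	name_free(parent);
	return ret;
}
```

Both failure paths now pass through the same label, so the parent
pathname and any children created so far are released regardless of
where the loop stops.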

Fixes: 90631e1dea55 ("kselftests: cgroup: add perpcu memory accounting test")
Signed-off-by: Yu Miao <yumiao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>

RDMA/bnxt_re: zero shared page before exposing to userspace

bnxt_re_alloc_ucontext() allocates uctx->shpg via
__get_free_page(GFP_KERNEL). The buddy allocator does not zero pages
without __GFP_ZERO, so the page contains stale kernel data from
whatever object most recently freed it.

The page is then mapped into userspace via vm_insert_page() under
BNXT_RE_MMAP_SH_PAGE in bnxt_re_mmap(). The driver only ever writes
4 bytes (a u32 AVID) at offset BNXT_RE_AVID_OFFT (0x10) inside
bnxt_re_create_ah(); the remaining 4092 bytes of the page are exposed
to userspace unsanitised, leaking kernel memory contents.

Any user with access to /dev/infiniband/uverbsX on a host with a
bnxt_re device (typically rdma group membership) can read this data
via a single mmap() at pgoff 0 after IB_USER_VERBS_CMD_GET_CONTEXT.

Other shared pages in the same file already use get_zeroed_page()
correctly:

  drivers/infiniband/hw/bnxt_re/ib_verbs.c
      srq->uctx_srq_page = (void *)get_zeroed_page(GFP_KERNEL);
      cq->uctx_cq_page  = (void *)get_zeroed_page(GFP_KERNEL);

uctx->shpg is the only outlier. Bring it in line with the existing
convention by switching to get_zeroed_page().

Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Lord Ulf Henrik Holmberg <henrik.holmberg@defensify.se>
Link: https://patch.msgid.link/20260509084011.11971-1-pomzm67@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>

selftests/rdma: explicitly skip tests when required modules are missing

Currently, the rdma rxe selftests fail with an exit code of 1 when
required kernel modules are not present. This causes spurious failures
in environments where these modules might not be compiled or available.

Include the standard kselftest 'ktap_helpers.sh' and replace the
hardcoded error exits with '$KSFT_SKIP'. This ensures the tests are
properly marked as skipped rather than failed.

Fixes: e01027cab38a ("RDMA/rxe: Add testcase for net namespace rxe")
Signed-off-by: Yi Lai <yi1.lai@intel.com>
Link: https://patch.msgid.link/20260507125106.3114167-1-yi1.lai@intel.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()

Rework tdx_has_emulated_msr() to explicitly enumerate the x2APIC MSRs
that KVM can emulate, instead of trying to enumerate the MSRs that KVM
cannot emulate. Drop the inner switch and list the emulatable x2APIC
registers directly in the outer switch's "return true" block.

The old code had multiple bugs in the x2APIC range handling.
X2APIC_MSR(APIC_ISR + APIC_ISR_NR) was incorrect because APIC_ISR_NR is
0x8, not 0x80, so the X2APIC_MSR() shift lost the lower bits, collapsing
each range to a single MSR. IA32_X2APIC_SELF_IPI was also missing from
the non-emulatable list. Note, these bugs are relatively benign, as they
only affect a guest that is requesting "bogus" emulation.
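
The range collapse can be checked with a little arithmetic. The sketch
below copies the macro values as they appear in the x86 APIC header
(asm/apicdef.h); the two helper functions are hypothetical, and the
"intended" end of the range is an assumption about what a range covering
all eight ISR registers would look like, not code from the patch.

```c
#include <stdint.h>

#define APIC_BASE_MSR  0x800
#define APIC_ISR       0x100
#define APIC_ISR_NR    0x8    /* number of 32-bit ISR registers */
#define X2APIC_MSR(r)  (APIC_BASE_MSR + ((r) >> 4))

/* Upper bound of the ISR range as the old code computed it. */
static uint32_t isr_range_end_old(void)
{
	return X2APIC_MSR(APIC_ISR + APIC_ISR_NR);
}

/* Assumed intent: the registers are 0x10 apart, so step by 0x10 each. */
static uint32_t isr_range_end_intended(void)
{
	return X2APIC_MSR(APIC_ISR + 0x10 * (APIC_ISR_NR - 1));
}
```

Here 0x108 >> 4 is 0x10, the same as 0x100 >> 4, so the old upper bound
equals X2APIC_MSR(APIC_ISR) (0x810) and the range covers a single MSR
instead of reaching 0x817.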

KVM has no visibility into whether or not a guest has enabled #VE
reduction, which changes which MSRs the TDX-Module handles itself versus
triggering a #VE for the guest to make a TDVMCALL. So maintaining a list
of non-emulatable MSRs is fragile. Listing only the MSRs KVM can always
emulate sidesteps the problem.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: Dmytro Maluka <dmaluka@chromium.org>
Closes: https://lore.kernel.org/all/20260318190111.1041924-1-dmaluka@chromium.org
Fixes: dd50294f3e3c ("KVM: TDX: Implement callbacks for MSR operations")
Assisted-by: Claude:claude-opus-4-6
[based on a diff from Sean, but added missed LVTCMCI case, log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260410232654.3864196-1-rick.p.edgecombe@intel.com
[sean: call out the bugs are relatively benign, expand comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>

RDMA/nldev: Add mutual exclusion in nldev_dellink()

We must serialize calls to nldev_dellink() or risk a crash as syzbot
reported:

KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
Call Trace:
udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197
rxe_release_udp_tunnel drivers/infiniband/sw/rxe/rxe_net.c:294 [inline]
rxe_sock_put drivers/infiniband/sw/rxe/rxe_net.c:639 [inline]
rxe_net_del+0xfb/0x290 drivers/infiniband/sw/rxe/rxe_net.c:660
rxe_dellink+0x15/0x20 drivers/infiniband/sw/rxe/rxe.c:254

Fixes: a60e3f3d6fba ("RDMA/nldev: Add dellink function pointer")
Reported-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d8f76778263ab65c2b21
Tested-by: syzbot+d8f76778263ab65c2b21@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Link: https://patch.msgid.link/tencent_611BEB4B141B1A2526BAA3BBB2335F9E9108@qq.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>

KVM: x86: Make "external SPTE" ops that can fail RET0 static calls

Define kvm_x86_ops .link_external_spt(), .set_external_spte(), and
.free_external_spt() as RET0 static calls so that an unexpected call to
a default operation doesn't consume garbage.

Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Fixes: 94faba8999b9 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables")
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260129011517.3545883-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: TDX: Account all non-transient page allocations for per-TD structures

Account all non-transient allocations associated with a single TD (or its
vCPUs), as KVM's ABI is that allocations that are active for the lifetime
of a VM are accounted. Leave temporary allocations, i.e. allocations that
are freed within a single function/ioctl, unaccounted, to again align with
KVM's existing behavior, e.g. see commit dd103407ca31 ("KVM: X86: Remove
unnecessary GFP_KERNEL_ACCOUNT for temporary variables").

Fixes: 8d032b683c29 ("KVM: TDX: create/destroy VM structure")
Fixes: a50f673f25e0 ("KVM: TDX: Do TDX specific vcpu initialization")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260129011517.3545883-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

drm/gma500/oaktrail_lvds: fix i2c adapter leaks on init

The LVDS init code looks up an I2C adapter using i2c_get_adapter() and
tries to read the EDID before falling back to allocating and registering
its own adapter.

Make sure to drop the references taken by i2c_get_adapter() when falling
back to allocating an adapter as well as on late errors to allow the
looked up adapter to be deregistered.

Fixes: 1b082ccf5901 ("gma500: Add Oaktrail support")
Cc: stable@vger.kernel.org # 3.3
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Link: https://patch.msgid.link/20260508144446.59722-4-johan@kernel.org

drm/gma500/oaktrail_lvds: fix hang on init failure

The LVDS init code looks up an I2C adapter using i2c_get_adapter() and
tries to read the EDID before falling back to allocating and registering
its own adapter.

The error handling does not separate these cases so on a late init
failure it will try to deregister and free also an adapter that had
previously been registered. Since i2c_get_adapter() takes another
reference to the adapter, deregistration hangs indefinitely while
waiting for the reference to be released.

Fix this by only destroying adapters allocated during LVDS init on
errors.

Fixes: a57ebfc0b4da ("drm/gma500: Make oaktrail lvds use ddc adapter from drm_connector")
Cc: stable@vger.kernel.org # 6.0
Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Link: https://patch.msgid.link/20260508144446.59722-3-johan@kernel.org

drm/gma500/oaktrail_hdmi: fix i2c adapter leak on setup

Make sure to drop the reference taken to the I2C adapter (and its
module) when setting up HDMI to allow the adapter to be deregistered.

Fixes: 1b082ccf5901 ("gma500: Add Oaktrail support")
Cc: stable@vger.kernel.org # 3.3
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>
Link: https://patch.msgid.link/20260508144446.59722-2-johan@kernel.org

KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"

Pass a pointer to iter->old_spte, not simply its value, when setting an
external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
will be updated if the cmpxchg64 to freeze the mirror SPTE fails. The bug
is currently benign as TDX is mutually exclusive with all paths that do
"local" retry, e.g. clear_dirty_gfn_range() and wrprot_gfn_range().

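The value-versus-pointer distinction can be shown with C11 atomics, whose
compare-exchange also writes the observed value back on failure. This is
a user-space sketch only; freeze_by_value() and freeze_by_pointer() are
hypothetical names and do not correspond to functions in the TDP MMU.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Expected value passed BY VALUE: on failure the caller's copy stays stale. */
static bool freeze_by_value(_Atomic uint64_t *sptep, uint64_t old,
			    uint64_t frozen)
{
	return atomic_compare_exchange_strong(sptep, &old, frozen);
}

/* Expected value passed BY POINTER: on failure *old is refreshed. */
static bool freeze_by_pointer(_Atomic uint64_t *sptep, uint64_t *old,
			      uint64_t frozen)
{
	return atomic_compare_exchange_strong(sptep, old, frozen);
}
```

A caller that retries locally after a failed exchange only makes progress
with the pointer variant, because its notion of the old value is updated
to what is actually in memory.
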
Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables")
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260129011517.3545883-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level

Rework the TDX APIs to take the kernel's 1-based pg_level enum, not the
TDX-Module's 0-based level. The APIs are _kernel_ APIs, not TDX-Module
APIs, and the kernel (and KVM) uses "enum pg_level" literally everywhere.

Using "enum pg_level" eliminates ambiguity when looking at the APIs (it's
NOT clear that "int level" refers to the TDX-Module's level), and will
allow for using existing helpers like page_level_size() when support for
hugepages is added to the S-EPT APIs.
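
The relationship between the two numbering schemes can be sketched as
follows. The enum mirrors the first entries of the kernel's enum
pg_level; sketch_tdx_level() and sketch_page_level_size() are
hypothetical user-space helpers written for this illustration, with the
4K base page and 9-bit-per-level shifts assumed.

```c
enum pg_level {
	PG_LEVEL_NONE,   /* 0: not a valid mapping level */
	PG_LEVEL_4K,     /* 1 */
	PG_LEVEL_2M,     /* 2 */
	PG_LEVEL_1G,     /* 3 */
};

#define SKETCH_PAGE_SHIFT 12   /* 4K base pages */
#define SKETCH_PTE_SHIFT   9   /* 512 entries per page-table level */

/* Convert the kernel's 1-based level to a 0-based level. */
static int sketch_tdx_level(enum pg_level level)
{
	return (int)level - 1;
}

/* Size in bytes of a mapping at the given 1-based level. */
static unsigned long sketch_page_level_size(enum pg_level level)
{
	return 1UL << (SKETCH_PAGE_SHIFT + ((int)level - 1) * SKETCH_PTE_SHIFT);
}
```

Keeping the conversion in one place means callers only ever see enum
pg_level, and the 0-based value exists just at the boundary where it is
consumed.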

No functional change intended.

Cc: Kai Huang <kai.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260129011517.3545883-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: VFIO: update coherency only if file was deleted

When servicing a KVM_DEV_VFIO_FILE_DEL request, if a file is removed
from kv->file_list, kv->noncoherent needs to be updated, in case we
can revert to using coherent DMA. However, if we found no candidate
to remove, there is no need to re-scan the list, so do it only if a
matching file was found.

To simplify the control flow, use a mutex guard so that we can return
early from within the search loop if the matching file is found.
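
A user-space analogue of this pattern is sketched below using the
GCC/Clang cleanup attribute and a pthread mutex. Everything here is an
assumption made for illustration: MUTEX_GUARD(), struct file_list,
file_del() and the rescans counter are hypothetical and are not the KVM
VFIO code or the kernel's guard() implementation.

```c
#include <pthread.h>
#include <stddef.h>

/* Called automatically when the guard variable goes out of scope. */
static void unlock_cleanup(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Take the lock now, drop it on every exit from the enclosing scope. */
#define MUTEX_GUARD(m) \
	pthread_mutex_t *guard_ __attribute__((cleanup(unlock_cleanup))) = (m); \
	pthread_mutex_lock(guard_)

struct file_list {
	pthread_mutex_t lock;
	int files[8];
	size_t nr;
	int rescans;   /* counts the "update coherency" passes */
};

static int file_del(struct file_list *fl, int file)
{
	MUTEX_GUARD(&fl->lock);

	for (size_t i = 0; i < fl->nr; i++) {
		if (fl->files[i] != file)
			continue;
		fl->files[i] = fl->files[--fl->nr];   /* remove the entry */
		fl->rescans++;   /* only rescan when something was removed */
		return 0;        /* early return; the guard unlocks */
	}
	return -1;               /* nothing removed, so no rescan */
}
```

Because the unlock is tied to scope exit, the early return inside the
loop needs no goto and cannot leave the mutex held.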

Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20260313122040.1413091-7-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: VFIO: deduplicate file release logic

There are two call sites which destroy files in kv->file_list: the
function servicing KVM_DEV_VFIO_FILE_DEL, and the release of the whole
KVM VFIO device. The process involves several steps, so move all those
into a single function, removing duplicate code.

Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20260313122040.1413091-6-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: VFIO: use mutex guard in kvm_vfio_file_set_spapr_tce()

Use a mutex guard to hold a lock for the entirety of the function, which
removes the need for a goto (whose label even has a misleading name
since 8152f8201088 ("fdget(), more trivial conversions"))

Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20260313122040.1413091-5-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>

drm/xe/memirq: Enable GT_MI_USER_INTERRUPT only

We only expect and handle the GT_MI_USER_INTERRUPT from the
engines, so there is no point in enabling other interrupts, like
GT_CONTEXT_SWITCH_INTERRUPT, if we don't intend to handle them.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://patch.msgid.link/20260511172838.2299-3-michal.wajdeczko@intel.com

drm/xe/memirq: Update interrupt handler logic

To work around some corner-case hardware limitations, a new programming
note for the memory based interrupt handler suggests assuming that
some status bytes, like GT_MI_USER_INTERRUPT and GUC_INTR_GUC2HOST,
are always set. Update our interrupt handler to follow the new rules.

Bspec: 53672
Fixes: a6581ebe7685 ("drm/xe/vf: Introduce Memory Based Interrupts Handler")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
Link: https://patch.msgid.link/20260511172838.2299-2-michal.wajdeczko@intel.com

KVM: VFIO: clean up control flow in kvm_vfio_file_add()

The struct file that this function obtains with fget() is always passed
to fput() before returning, so use automatic cleanup via __free() to
avoid several jumps to the end of the function. Similarly, use a mutex
guard to completely remove the need to use gotos.

Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20260313122040.1413091-4-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: memslot_perf_test: make host wait timeout configurable

When memslot_perf_test is run on the QEMU RISC-V virt machine, the RW
subtest sometimes fails due to SIGALRM, indicating that the guest sync
did not finish within the expected duration of 10 seconds. Since the
current timeout value is itself a bump up from the original 2s, make
the host timeout value configurable via a new command line parameter.
The test can be invoked with the '-t' option to set a suitable timeout
value for the host.

Signed-off-by: Mayuresh Chitale <mayuresh.chitale@oss.qualcomm.com>
Link: https://patch.msgid.link/20260407144914.2621843-1-mayuresh.chitale@oss.qualcomm.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

lib/string_helpers: annotate struct strarray with __counted_by_ptr

Add the __counted_by_ptr() compiler attribute to 'array' to improve
bounds checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260415122542.370926-6-thorsten.blum@linux.dev
Signed-off-by: Kees Cook <kees@kernel.org>

lib/string_helpers: drop redundant allocation in kasprintf_strarray

kasprintf_strarray() returns an array of N strings and kfree_strarray()
also frees N entries. However, kasprintf_strarray() currently allocates
N+1 char pointers. Allocate exactly N pointers instead of N+1.

Also update the kernel-doc for @n.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260415122542.370926-4-thorsten.blum@linux.dev
Signed-off-by: Kees Cook <kees@kernel.org>

libbpf: Use strscpy() in kernel code for skel_map_create()

Linux has deprecated[1] strncpy(), and the use in skel_map_create()
is best replaced with strscpy(). Since we still need to build this
file in userspace, leave the strncpy() in place in that case. This
is the last use of strncpy() in the kernel.
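
For readers unfamiliar with the difference, the sketch below shows the
property that matters: the destination is always NUL-terminated and
truncation is reported. copy_terminated() is a hypothetical user-space
stand-in with strscpy()-like behaviour, not the kernel implementation; it
returns -1 on truncation where the kernel helper returns -E2BIG.

```c
#include <stddef.h>

/*
 * Copy src into dst (capacity size), always NUL-terminating dst.
 * Returns the number of characters copied, or -1 if src did not fit.
 */
static long copy_terminated(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;
	for (i = 0; i < size - 1 && src[i]; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] ? -1 : (long)i;
}
```

strncpy(), by contrast, leaves the destination unterminated when the
source is at least as long as the buffer, which is the main reason it is
deprecated in kernel code.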

Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20260513050806.do.620-kees@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

KVM: x86: Drop superfluous caching of KVM_ASYNC_PF_SEND_ALWAYS

Drop kvm_vcpu_arch.apf.send_always and instead use msr_en_val as the source
of truth to reduce the probability of operating on stale data. This fixes
flaws where KVM fails to update send_always when APF is explicitly
disabled by the guest or implicitly disabled by KVM on INIT. Absent other
bugs, the flaws are benign as KVM *shouldn't* consume send_always when PV
APF support is disabled.
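
The "single source of truth" idea can be sketched as below. The bit
values follow the KVM paravirt UAPI header (kvm_para.h); struct apf_state
and the two accessor functions are hypothetical simplifications of
vcpu->arch.apf and are not the helpers used by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_ASYNC_PF_ENABLED      (1u << 0)
#define KVM_ASYNC_PF_SEND_ALWAYS  (1u << 1)

/* Reduced model: only the MSR value is stored, nothing is cached. */
struct apf_state {
	uint64_t msr_en_val;
};

static bool apf_enabled(const struct apf_state *apf)
{
	return apf->msr_en_val & KVM_ASYNC_PF_ENABLED;
}

/* Derived on demand, so it can never disagree with msr_en_val. */
static bool apf_send_always(const struct apf_state *apf)
{
	return apf->msr_en_val & KVM_ASYNC_PF_SEND_ALWAYS;
}
```

When msr_en_val is cleared, for example on INIT, every derived property
changes with it, so there is no separate field left to forget.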

Simply delete the field, as there's zero benefit to maintaining a separate
"cache" of the state.

Opportunistically turn the enabled vs. disabled logic at the end of
kvm_pv_enable_async_pf() into an if-else instead of using an early return,
e.g. so that it's more obvious that both paths are "success" paths.
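
As a simplified illustration of using the MSR value as the single source
of truth, consider the userspace sketch below. The struct, the helper
names, and the bit positions are assumptions made for this example and
are not copied from KVM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout assumed for illustration only. */
#define APF_ENABLED      (1ull << 0)
#define APF_SEND_ALWAYS  (1ull << 1)

struct apf_state {
	uint64_t msr_en_val;	/* single source of truth */
};

static bool apf_enabled(const struct apf_state *apf)
{
	return apf->msr_en_val & APF_ENABLED;
}

/* Derived on demand, so it cannot go stale when the MSR is rewritten. */
static bool apf_send_always(const struct apf_state *apf)
{
	return apf_enabled(apf) && (apf->msr_en_val & APF_SEND_ALWAYS);
}
```

Gating the derived flag on the enable bit is a choice made in this
sketch; the point is only that no second copy of the state exists to
fall out of sync.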

Fixes: 6adba5274206 ("KVM: Let host know whether the guest can handle async PF in non-userspace context.")
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260406225359.1245490-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Drop superfluous caching of KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT

Drop kvm_vcpu_arch.apf.delivery_as_pf_vmexit and instead use msr_en_val as
the source of truth to reduce the probability of operating on stale data.
This fixes flaws where KVM fails to update delivery_as_pf_vmexit when APF
is explicitly disabled by the guest or implicitly disabled by KVM on INIT.
Absent other bugs, the flaws are benign as KVM *shouldn't* consume
delivery_as_pf_vmexit when PV APF support is disabled.

Simply delete the field, as there's zero benefit to maintaining a separate
"cache" of the state.

Fixes: 52a5c155cf79 ("KVM: async_pf: Let guest support delivery of async_pf from guest mode")
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20260406225359.1245490-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Don't leave APF half-enabled on bad APF data GPA

kvm_pv_enable_async_pf() updates vcpu->arch.apf.msr_en_val before
initializing the APF data gfn_to_hva cache. If userspace provides an
invalid GPA, kvm_gfn_to_hva_cache_init() fails, but msr_en_val stays
enabled and leaves APF state half-initialized.

Later APF paths can then try to use the empty cache and trigger
WARN_ON() in kvm_read_guest_offset_cached().

Determine the new APF enabled state from the incoming MSR value, do cache
initialization first on the enable path, and commit msr_en_val only after
successful initialization. Keep the disable path behavior unchanged.
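
The "initialize first, commit afterwards" ordering can be sketched in
userspace as follows. Every name here (apf_cache_init(), apf_write_msr(),
APF_GPA_LIMIT, the 0x3f flag mask) is a hypothetical stand-in and not
the KVM code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APF_ENABLED   (1ull << 0)	/* assumed enable bit */
#define APF_GPA_LIMIT (1ull << 30)	/* hypothetical end of guest memory */

struct apf_state {
	uint64_t msr_en_val;
	uint64_t cache_gpa;
	bool cache_valid;
};

/* Hypothetical stand-in for the cache init; fails on an out-of-range GPA. */
static int apf_cache_init(struct apf_state *apf, uint64_t gpa)
{
	if (gpa >= APF_GPA_LIMIT)
		return -1;
	apf->cache_gpa = gpa;
	apf->cache_valid = true;
	return 0;
}

/* Returns 0 on success, nonzero on failure (state is left untouched). */
static int apf_write_msr(struct apf_state *apf, uint64_t data)
{
	if (data & APF_ENABLED) {
		/* Enable path: validate first, commit only on success. */
		if (apf_cache_init(apf, data & ~0x3full))
			return 1;
		apf->msr_en_val = data;
	} else {
		/* Disable path: just record the new value. */
		apf->msr_en_val = data;
	}
	return 0;
}
```

With this ordering a rejected GPA leaves msr_en_val at its previous
value, so later paths never see "enabled" paired with an empty cache.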

Reported-by: syzbot+bc0e18379a290e5edfe4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bc0e18379a290e5edfe4
Fixes: 344d9588a9df ("KVM: Add PV MSR to enable asynchronous page faults delivery.")
Link: https://lore.kernel.org/r/aHfD3MczrDpzDX9O@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Ethan Yang <ethan.yang.kernel@gmail.com>
[sean: don't bother with a local "enable" variable]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260406225359.1245490-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Guard execinfo.h inclusion for non-glibc builds

The backtrace() function and execinfo.h are GNU extensions available
in glibc but not in non-glibc C libraries such as musl. Building KVM
selftests with musl-gcc fails with:

lib/assert.c:9:10: fatal error: execinfo.h: No such file or directory

Fix this by guarding the inclusion of execinfo.h and the stack dumping
logic under #ifdef __GLIBC__. For non-glibc builds, provide a local
stub for test_dump_stack().
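
The guard looks roughly like the sketch below. dump_stack_sketch() is a
hypothetical name used for this illustration; the selftests' real
test_dump_stack() does more work than this:

```c
#include <assert.h>
#include <stdio.h>	/* any libc header; glibc's headers define __GLIBC__ */

#ifdef __GLIBC__
#include <execinfo.h>

/* glibc: print raw frames via backtrace(). Returns the frame count. */
static int dump_stack_sketch(void)
{
	void *frames[32];
	int n = backtrace(frames, 32);

	backtrace_symbols_fd(frames, n, 2 /* stderr */);
	return n;
}
#else
/* Other C libraries (e.g. musl): no backtrace(), so provide a stub. */
static int dump_stack_sketch(void)
{
	fputs("stack dump unavailable without glibc\n", stderr);
	return 0;
}
#endif
```

Note that __GLIBC__ is only defined once a libc header has been
included, which is why the #ifdef comes after <stdio.h>.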

Suggested-by: Aqib Faruqui <aqibaf@amazon.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hisam Mehboob <hisamshar@gmail.com>
Link: https://patch.msgid.link/20260409153846.1502656-2-hisamshar@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

MAINTAINERS: add kernel hardening keyword __counted_by_ptr

In addition to __counted_by, __counted_by_le, and __counted_by_be, also
match the keyword __counted_by_ptr.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260414130926.312094-3-thorsten.blum@linux.dev
Signed-off-by: Kees Cook <kees@kernel.org>

firmware: qcom: scm: Allow QSEECOM on Surface Pro 12in

Add the Surface Pro 12in to the QSEECOM allowlist
so that the Qualcomm Secure Execution Environment
interface is available on this device.

Signed-off-by: Harrison Vanderbyl <harrison.vanderbyl@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/92171ad5e7851e6758dd205246b4289f32e12655.1778498477.git.harrison.vanderbyl@gmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

drm/i915/sdvo: use the i2c bus locking functions

Use i2c_lock_bus(), i2c_trylock_bus(), and i2c_unlock_bus() instead of
poking at i2c adapter's lock_ops directly.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260513080103.169402-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>

dt-bindings: arm: qcom: Add Microsoft Surface Pro 12in

Document the compatible string for the Microsoft Surface Pro
12-inch, 1st Edition with Snapdragon, based on the Qualcomm X1P42100
SoC.

Signed-off-by: Harrison Vanderbyl <harrison.vanderbyl@gmail.com>
Link: https://lore.kernel.org/r/627a1e2506fbed99e971250dbba64902af54232c.1778498477.git.harrison.vanderbyl@gmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

KVM: Fix kvm_vcpu_map[_readonly]() function prototypes

kvm_vcpu_map() and kvm_vcpu_map_readonly() should take a gfn instead of
a gpa. This appears to be a result of the original kvm_vcpu_map() being
declared with the wrong function prototype in kvm_host.h, even though
it was correct in the actual implementation in kvm_main.c.

No actual harm has been done yet as all of the call sites are correctly
passing in a gfn. Plus, both gfn_t and gpa_t are typedef'd to u64 so
this change shouldn't have any functional impact.
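
To see why the compiler never flagged the mismatch, consider this small
sketch. frame_base() is a hypothetical consumer and PAGE_SHIFT assumes
4 KiB pages:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;	/* guest frame number */
typedef uint64_t gpa_t;	/* guest physical address */

#define PAGE_SHIFT 12	/* assumes 4 KiB pages */

static gfn_t gpa_to_gfn(gpa_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* Hypothetical consumer that expects a frame number, not an address. */
static gpa_t frame_base(gfn_t gfn)
{
	return gfn << PAGE_SHIFT;
}
```

Both typedefs resolve to the same integer type, so passing a gpa where a
gfn is expected compiles silently; only the prototype documents which
one the function wants.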

Compile-tested on x86 and ppc, which are the current users of these
interfaces.

Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API")
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Peter Fang <peter.fang@intel.com>
Reviewed-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260408001137.3290444-2-peter.fang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Rate-limit global clock updates on vCPU load

Commit 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"")
dropped the rate limiting for KVM_REQ_GLOBAL_CLOCK_UPDATE.

As a result, kvm_arch_vcpu_load() can queue global clock update requests
every time a vCPU is scheduled when the master clock is disabled or when
the vCPU is loaded for the first time.

Restore the throttling with a per-VM ratelimit state and gate
KVM_REQ_GLOBAL_CLOCK_UPDATE through __ratelimit(), so frequent vCPU
scheduling does not generate a steady stream of redundant clock update
requests.
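
The throttling idea can be sketched in userspace with a simple
interval/burst limiter. struct ratelimit and ratelimit_allow() are
hypothetical and only loosely modeled on the kernel's ratelimit state;
they are not the kernel implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VM state, loosely modeled on interval/burst limiting. */
struct ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* requests allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int issued;	/* requests allowed so far in this window */
};

/* Returns true if the caller may proceed, false if it should be throttled. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	/* Open a new window once the current one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->issued = 0;
	}
	if (rl->issued >= rl->burst)
		return false;
	rl->issued++;
	return true;
}
```

A caller would queue the global clock update only when the limiter
returns true, so repeated vCPU loads inside one window collapse into a
single request.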

Fixes: 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"")
Signed-off-by: Lei Chen <lei.chen@smartx.com>
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Closes: https://lore.kernel.org/all/CAK8fFZ5gY8_Mw2A=iZVFNVKQNrXQzVsn-HTd+Me9K6ZfmdgA+Q@mail.gmail.com/
Link: https://patch.msgid.link/20260409142226.2581-1-lei.chen@smartx.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SVM: Fix page overflow in sev_dbg_crypt() for ENCRYPT path

In sev_dbg_crypt(), the per-iteration transfer length is bounded by
the source page offset (PAGE_SIZE - s_off) but not by the destination
page offset (PAGE_SIZE - d_off).  When d_off > s_off, the encrypt
path (__sev_dbg_encrypt_user) performs a read-modify-write using a
single-page intermediate buffer (dst_tpage):

  1. __sev_dbg_decrypt() expands the size to round_up(len + (d_off & 15), 16)
     before issuing the PSP command.  If len + (d_off & 15) > PAGE_SIZE,
     the PSP writes beyond the end of the 4096-byte dst_tpage allocation.

  2. The subsequent memcpy()/copy_from_user() into
     page_address(dst_tpage) + (d_off & 15) of 'len' bytes overflows
     by up to 15 bytes under the same condition.

Trigger example: with s_off = 0, d_off = 1, and debug.len = PAGE_SIZE, the
PSP is instructed to write round_up(4097, 16) = 4112 bytes to a 4096-byte
buffer.

Fix by also bounding len by (PAGE_SIZE - d_off), the same check that
sev_send_update_data() already performs for its single-page guest
region.
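
The arithmetic can be checked with a small userspace sketch. The helper
names are hypothetical and the functions model only the length
computation, not the PSP commands:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

static size_t round_up16(size_t x) { return (x + 15) & ~(size_t)15; }

/* Per-iteration length bounded by the source page only (pre-fix behavior). */
static size_t chunk_len_src_only(size_t s_off, size_t remaining)
{
	return min_sz(PAGE_SIZE - s_off, remaining);
}

/* Per-iteration length bounded by both pages (the fix described above). */
static size_t chunk_len_bounded(size_t s_off, size_t d_off, size_t remaining)
{
	return min_sz(chunk_len_src_only(s_off, remaining), PAGE_SIZE - d_off);
}

/* Bytes the intermediate page must hold for the read-modify-write. */
static size_t rmw_bytes(size_t d_off, size_t len)
{
	return round_up16(len + (d_off & 15));
}
```

For s_off = 0, d_off = 1 and a full page remaining, the source-only
bound yields 4112 bytes for the intermediate page, while the bounded
variant yields a 4095-byte chunk that rounds up to exactly 4096.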

==================================================================
BUG: KASAN: slab-use-after-free in sev_dbg_crypt+0x993/0xd10 [kvm_amd]
Write of size 4095 at addr ff110062293bb009 by task sev_dbg_test/228214

CPU: 96 UID: 0 PID: 228214 Comm: sev_dbg_test Tainted: G     U  W           7.0.0-smp--5ce9b0c48211-dbg #156 PREEMPTLAZY
Tainted: [U]=USER, [W]=WARN
Hardware name: Google Astoria/astoria, BIOS 0.20250817.1-0 08/25/2025
Call Trace:
  <TASK>
  dump_stack_lvl+0x54/0x70
  print_report+0xbc/0x260
  kasan_report+0xa2/0xd0
  kasan_check_range+0x25f/0x2c0
  __asan_memcpy+0x40/0x70
  sev_dbg_crypt+0x993/0xd10 [kvm_amd]
  sev_mem_enc_ioctl+0x33c/0x450 [kvm_amd]
  kvm_vm_ioctl+0x65d/0x6d0 [kvm]
  __se_sys_ioctl+0xb2/0x100
  do_syscall_64+0xe8/0x870
  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  </TASK>

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x7fe72b6a0 pfn:0x62293bb
memcg:ff11000112827d82
flags: 0x1400000000000000(node=1|zone=1)
raw: 1400000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 00000007fe72b6a0 0000000000000000 00000001ffffffff ff11000112827d82
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ff110062293bbf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ff110062293bbf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ff110062293bc000: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
                    ^
  ff110062293bc080: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
  ff110062293bc100: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint

Fixes: 24f41fb23a39 ("KVM: SVM: Add support for SEV DEBUG_DECRYPT command")
Fixes: 7d1594f5d94b ("KVM: SVM: Add support for SEV DEBUG_ENCRYPT command")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
[sean: add sample KASAN splat, Fixes, and stable@]
Link: https://patch.msgid.link/20260501203537.2120074-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Teach sev_*_test about revoking VM types

Instead of using CPUID, use the VM type bit to determine support, since
those now reflect the correct status of support by the kernel and firmware
configurations.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Don't advertise VM types that are disabled by firmware

As called out in a footnote for a recent SNP vulnerability[1], it is
possible for a specific flavor of SEV+ to be disabled by the firmware even
when the flavor is fully supported by the CPU and platform:

  Applying mitigation CVE-2025-48514 will result in disabling SEV-ES when
  SEV-SNP is enabled.

Restrict KVM's set of supported VM types based on the VM types that are
fully supported by firmware to avoid over-reporting what KVM can actually
support.  Like KVM's handling of ASID space exhaustion, don't modify KVM's
CPUID capabilities, as the CPU/platform still supports the underlying
technology and clearing e.g. SEV_ES while advertising SEV_SNP would confuse
KVM and userspace.

Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-3023.html
Link: https://lore.kernel.org/all/aZyLIWtffvEnmtYh@google.com
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
[sean: rewrite changelog to provide details on why/how this can happen]
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Don't advertise support for unusable VM types

Commit 0aa6b90ef9d7 ("KVM: SVM: Add support for allowing zero SEV ASIDs")
made it possible to render SEV VMs unusable by not allocating them any
ASIDs.

Commit 6c7c620585c6 ("KVM: SEV: Add SEV-SNP CipherTextHiding support") did
the same thing for SEV-ES.

Do not export KVM_X86_SEV(_ES)_VM as supported types in either of these
situations, so that userspace can use the supported types to determine what
is actually usable with the current kernel configuration.

Also move the buildup to a local variable so it is easier to add additional
masking in future patches.
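
A minimal sketch of the local-variable buildup, under stated
assumptions: the VM type numbers, struct asid_range, and
supported_vm_types() are illustrative placeholders and are not taken
from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* VM type numbers are illustrative placeholders, not taken from the patch. */
enum { VM_TYPE_SEV = 2, VM_TYPE_SEV_ES = 3 };

#define BIT(n) (1u << (n))

/* Hypothetical ASID range; empty when min > max. */
struct asid_range {
	unsigned int min, max;
};

static int asid_range_usable(struct asid_range r)
{
	return r.min <= r.max;
}

static uint32_t supported_vm_types(struct asid_range sev,
				   struct asid_range sev_es)
{
	uint32_t types = 0;	/* built up locally so more masking can be added */

	if (asid_range_usable(sev))
		types |= BIT(VM_TYPE_SEV);
	if (asid_range_usable(sev_es))
		types |= BIT(VM_TYPE_SEV_ES);
	return types;
}
```

A type whose ASID range is empty simply never makes it into the mask,
which is what lets userspace treat the mask as "what can actually be
created".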

Link: https://lore.kernel.org/all/aZyLIWtffvEnmtYh@google.com/
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
[sean: land code in sev_hardware_setup()]
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Consolidate logic for printing state of SEV{,-ES,-SNP} enabling

Add a helper to print enabled/unusable/disabled for SEV+ VM types in
anticipation of SNP also being subject to "unusable" logic.

No functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Set supported SEV+ VM types during sev_hardware_setup()

Set the supported SEV+ VM types during sev_hardware_setup() instead of
waiting until sev_set_cpu_caps(). This will allow using the set of *fully*
supported VM types to print the enabled/unusable/disabled messages.

For all intents and purposes, no functional change intended.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

crypto/ccp: export firmware supported vm types

In some configurations, the firmware does not support all VM types. The SEV
firmware has an entry in the TCB_VERSION structure referred to as the
Security Version Number in the SEV-SNP firmware specification and referred
to as the "SPL" in SEV firmware release notes. The SEV firmware release
notes say:

    On every SEV firmware release where a security mitigation has been
    added, the SNP SPL gets increased by 1. This is to let users know that
    it is important to update to this version.

The SEV firmware release that fixed CVE-2025-48514 by disabling SEV-ES
support on vulnerable platforms has this SVN increased to reflect the fix.
The SVN is platform-specific, as is the structure of TCB_VERSION.

Check CURRENT_TCB instead of REPORTED_TCB, since the firmware behaves with
the CURRENT_TCB SVN level and will reject SEV-ES VMs accordingly.

Parse the SVN, and mask off the SEV_ES supported VM type from the list of
supported types if it is above the per-platform threshold for the relevant
platforms.

Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

crypto/ccp: hoist kernel part of SNP_PLATFORM_STATUS

...to its own function. This way it can be used when the kernel needs
access to the platform status regardless of the INIT state of the firmware.

No functional change intended.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Link: https://patch.msgid.link/20260416232329.3408497-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: hyperv_tlb_flush: replace NOP loop with udelay()

Replace the open-coded NOP loop with udelay() which was added to KVM
selftests in commit 6b878cbb87bf ("KVM: selftests: Add guest udelay()
utility for x86"). The NOP loop is CPU speed dependent while udelay()
provides a deterministic delay regardless of host CPU frequency.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260422130307.1171808-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Fix typo in comment in hyperv_features.c

Fix a typo in a comment: 'vailable' -> 'available'.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260428083037.1926902-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: sync_regs_test: drop stale TODO comment

The TODO asked for a build-time check to guard against missing new sync
fields. Remove it, as code review is sufficient to catch such issues.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Link: https://patch.msgid.link/20260512161317.2580678-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SVM: Refresh vcpu->arch.cr{0,3} prior to invoking fastpath handler

Refresh KVM's copies of CR0 and CR3 from the VMCB prior to (potentially)
invoking a fastpath handler to ensure that KVM doesn't consume stale
state. While it's unlikely KVM will ever consume CR3 or CR0.{TS,MP} in
the fastpath, grabbing the values from the VMCB is inexpensive, i.e. the
risk of subtle bugs far outweighs the reward of deferring reads for a
small subset of VM-Exits.

Note, KVM doesn't currently consume CR3 or CR0.{TS,MP} in the fastpath,
as KVM requires next_rip to be valid (i.e. KVM doesn't read CR3 to decode
the instruction), CR0.MP is never consumed, and CR0.TS is only consumed by
the full emulator.

Reviewed-by: Nikunj A. Dadhania <nikunj@amd.com>
Link: https://patch.msgid.link/20260423162628.490962-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: x86: Ensure vendor's exit handler runs before fastpath userspace exits

Move the handling of fastpath userspace exits into vendor code to ensure
KVM runs vendor specific operations that need to run before userspace gains
control of the vCPU. E.g. for VMX (and soon to be for SVM as well), KVM
needs to flush the PML buffer prior to exiting to userspace, otherwise any
memory written by the final KVM_RUN might never be flagged as dirty.

Note, waiting to snapshot CR0 and CR3 until svm_handle_exit() is flawed in
general, as that risks consuming stale state in a fastpath handler. That
will be addressed in a future change.

Fixes: f7f39c50edb9 ("KVM: x86: Exit to userspace if fastpath triggers one on instruction skip")
Cc: stable@vger.kernel.org
Cc: Nikunj A. Dadhania <nikunj@amd.com>
Reviewed-by: Nikunj A. Dadhania <nikunj@amd.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20260423162628.490962-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

x86/virt: Silence RCU lockdep splat in emergency virt callback path

x86_virt_invoke_kvm_emergency_callback() reaches rcu_dereference()
through machine_crash_shutdown() with IRQs disabled but with RCU not
necessarily watching the crashing CPU, which triggers a suspicious
RCU usage splat on debug kernels (CONFIG_PROVE_RCU=y) during
panic/kdump:

  WARNING: suspicious RCU usage
  arch/x86/virt/hw.c:52 suspicious rcu_dereference_check() usage!

  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by tee/11119:
   #0: ffff8881fa32c440 (sb_writers#3){.+.+}-{0:0}, at: ksys_write

  Call Trace:
   <TASK>
   dump_stack_lvl+0x84/0xd0
   lockdep_rcu_suspicious.cold+0x37/0x8f
   x86_virt_invoke_kvm_emergency_callback+0x5f/0x70
   x86_svm_emergency_disable_virtualization_cpu+0x2a/0x30
   x86_virt_emergency_disable_virtualization_cpu+0x6b/0x90
   native_machine_crash_shutdown+0x72/0x170
   __crash_kexec+0x137/0x280
   panic+0xce/0xd0
   sysrq_handle_crash+0x1f/0x20
   __handle_sysrq.cold+0x192/0x335
   write_sysrq_trigger+0x8c/0xc0
   proc_reg_write+0x1c3/0x3c0
   vfs_write+0x1d0/0xf80
   ksys_write+0x116/0x250
   do_syscall_64+0x11c/0x1480
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   </TASK>

A truly correct fix is non-trivial: the RCU usage genuinely is wrong in
panic context (RCU may ignore the crashing CPU during synchronization),
and a concurrent KVM module unload could in principle race with the
callback read; see commit 2baa33a8ddd6 ("KVM: x86: Leave user-return
notifier registered on reboot/shutdown") which notes that nothing
prevents module unload during panic/reboot.

However, the alternatives are worse:

  - smp_store_release()/smp_load_acquire() handles ordering but not
    liveness; the kernel still needs to keep the module text alive
    while the callback is in flight.
  - Taking a lock in the panic path is risky — any lock could be held
    by a CPU that has already been NMI'd to a halt.

Use rcu_dereference_raw() to silence the splat and accept the
vanishingly small remaining race. Panic context inherently cannot
guarantee complete correctness; the goal here is to keep debug builds
quiet on the kdump path so the splat doesn't obscure the actual
kernel state being captured.

Reproducible on a debug kernel (CONFIG_PROVE_LOCKING=y, CONFIG_PROVE_RCU=y)
with kvm_amd or kvm_intel loaded by triggering kdump:

  echo c > /proc/sysrq-trigger

Suggested-by: Sean Christopherson <seanjc@google.com>
Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem")
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260504235435.90957-1-mikhail.v.gavrilov@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Include sys/mman.h *and* linux/mman.h, via kvm_syscalls.h

Include both linux/mman.h (the kernel provided version) and sys/mman.h (the
libc provided version) throughout KVM selftests, by way of kvm_syscalls.h
(which should have been including sys/mman.h anyways).  Pulling in the
kernel's version fixes compilation errors with the guest_memfd test on
older versions of libc due to a recent commit adding MADV_COLLAPSE testing.

  In file included from include/kvm_util.h:8,
                   from guest_memfd_test.c:21:
  guest_memfd_test.c: In function ‘test_collapse’:
  guest_memfd_test.c:219:47: error: ‘MADV_COLLAPSE’ undeclared (first use in this function); did you mean ‘MADV_COLD’?
      219 |         TEST_ASSERT_EQ(madvise(mem, pmd_size, MADV_COLLAPSE), -1);
          |                                               ^~~~~~~~~~~~~
    include/test_util.h:62:16: note: in definition of macro ‘TEST_ASSERT_EQ’
       62 |         typeof(a) __a = (a);                                            \
          |                ^
    guest_memfd_test.c:219:47: note: each undeclared identifier is reported only once for each function it appears in
      219 |         TEST_ASSERT_EQ(madvise(mem, pmd_size, MADV_COLLAPSE), -1);
          |                                               ^~~~~~~~~~~~~
    include/test_util.h:62:16: note: in definition of macro ‘TEST_ASSERT_EQ’
       62 |         typeof(a) __a = (a);                                            \
          |                ^

Route the includes through kvm_syscalls.h to try and avoid a future game
of whack-a-mole, i.e. so that future expansion of test coverage doesn't run
into the same problem.

To discourage use of sys/mman.h, opportunistically include the kernel's
version of mman.h in test_util.h as it only needs MAP_SHARED, i.e. only
needs the full set of kernel defs, not the libc syscall wrappers.

Fixes: 9830209b4ae8 ("KVM: selftests: Test MADV_COLLAPSE on guest_memfd")
Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Closes: https://lore.kernel.org/all/20260427204313.50741-1-rick.p.edgecombe@intel.com
Link: https://patch.msgid.link/20260428012503.1213654-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: Rename invalidate_begin to invalidate_start for consistency

Rename kvm_mmu_invalidate_begin() to kvm_mmu_invalidate_start() to
align with mmu_notifier_ops.invalidate_range_start(), which is the
callback that ultimately drives KVM's MMU invalidation.

While the naming within KVM itself is a close split between "_begin" and
"_start":

  $ git grep -E "invalidate(_range)?_begin" **/kvm* | wc -l
  12
  $ git grep -E "invalidate(_range)?_start" **/kvm* | wc -l
  21

All but two of the begin() uses are in KVM:

  $ git grep -E "invalidate(_range)?_begin" * | wc -l
  14

And those two holdouts are bugs in invalidate_range_start()'s comment,
i.e. will also be fixed sooner or later[*].  On the other hand, use of
_start() is pervasive throughout the kernel:

  $ git grep -E "invalidate(_range)?_start" * | wc -l
  117

Even if that weren't the case, conforming to the mmu_notifier_ops naming
is the right call since invalidate_range_start() is the external API that
KVM hooks into.

No functional change intended.

Link: https://lore.kernel.org/all/20260513163546.1176742-1-seanjc@google.com
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Link: https://patch.msgid.link/20260420154720.29012-4-itazur@amazon.com
[sean: massage changelog to provide more (accurate) numbers]
Signed-off-by: Sean Christopherson <seanjc@google.com>

soc: qcom: pd-mapper: Add support for Hawi SoC

Hawi uses the same protection domain layout as Kaanapali, so reuse the
kaanapali_domains table. Also add the missing adsp_ois_pd entry (OIS
protection domain, instance_id 74) to kaanapali_domains, which is
required by both Kaanapali and Hawi.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260506110226.2256249-1-mukesh.ojha@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

dt-bindings: soc: qcom,aoss-qmp: Document the Hawi AOSS side channel

Document the Always-on Subsystem side channel on Qualcomm Hawi SoC.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260427181609.3648384-1-mukesh.ojha@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

dt-bindings: soc: qcom: qcom,pmic-glink: Add Hawi compatible string

Hawi is a mobile platform that is compatible with the Kaanapali platform
with respect to pmic-glink support. Add the Hawi compatible string
with Kaanapali as a fallback.

Signed-off-by: Fenglin Wu <fenglin.wu@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260419-hawi-pmic-glink-v1-1-a26908c468fc@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

dt-bindings: firmware: qcom,scm: Document SCM on Hawi SoC

Document SCM compatible for the Qualcomm Hawi SoC.

Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20260401123825.589452-1-mukesh.ojha@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

soc: qcom: llcc-qcom: Capitalize LLCC/EDAC in comments and diagnostics

Capitalize occurrences of the acronym "LLCC" and "EDAC" in comments
and diagnostic text to improve consistency and readability.

Signed-off-by: Francisco Munoz Ruiz <francisco.ruiz@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260407-external_llcc_changes2set-v2-3-b5017ce2020b@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

soc: qcom: llcc-qcom: get SCT descriptors from fw-populated memory

Retrieve System Cache Table (SCT) descriptors from a shared memory
region populated by firmware.

SCT initialization and programming are performed entirely by firmware
outside of Linux. The LLCC driver only consumes the pre-initialized
descriptor data and does not configure SCT itself.

Support this mechanism for future SoCs that provide SCT programming
via firmware.

Signed-off-by: Francisco Munoz Ruiz <francisco.ruiz@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260407-external_llcc_changes2set-v2-2-b5017ce2020b@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

dt-bindings: cache: qcom,llcc: Document Hawi SoC

Add documentation for the Last Level Cache Controller (LLCC) bindings
to support Hawi SoC where the System Cache Table (SCT) is programmed
by firmware outside of Linux.

Introduce a property that specifies the base address of the shared
memory region from which the driver should read SCT descriptors
provided by firmware.

Signed-off-by: Francisco Munoz Ruiz <francisco.ruiz@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260407-external_llcc_changes2set-v2-1-b5017ce2020b@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>

dt-bindings: display: imx: add deprecated property 'port' and 'display-timings'

Add the deprecated properties 'port' and 'display-timings' for i.MX5 SoCs
(over 15 years old) to fix the CHECK_DTBS warnings below:
arm/boot/dts/nxp/imx/imx51-apf51dev.dtb: disp1 (fsl,imx-parallel-display): 'display-timings', 'port' do not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/display/imx/fsl,imx-parallel-display.yaml

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20260511220924.1905571-1-Frank.Li@nxp.com
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>

Merge branch 'kvm-apx-prepare' into HEAD

Clean up KVM's register tracking and storage, primarily to prepare for
APX support, which expands the maximum number of GPRs from 16 to 32.

KVM: x86: Use a proper bitmap for tracking available/dirty registers

Define regs_{avail,dirty} as bitmaps instead of U32s to harden against
overflow, and to allow for dynamically sizing the bitmaps when APX comes
along, which will add 16 more GPRs (R16-R31) and thus increase the total
number of registers beyond 32.

Open code writes in the "reset" APIs, as the writes are hot paths and
bitmap_write() is complete overkill for what KVM needs. Even better,
hardcoding writes to entry '0' in the array is a perfect excuse to assert
that the array contains exactly one entry, e.g. to effectively add a guard
against defining R16-R31 in 32-bit kernels.

For all intents and purposes, no functional change intended even though
using bitmap_fill() will mean "undefined" registers are no longer marked
available and dirty (KVM should never be querying those bits).
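
The combination of a bitmap declaration with an open-coded, asserted
write to entry 0 can be sketched in userspace as follows. NR_REGS,
struct vcpu_regs, and the helpers are hypothetical, and the reset is
assumed here to be a bitwise-AND:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_REGS          24	/* hypothetical register count */
#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct vcpu_regs {
	unsigned long avail[BITS_TO_LONGS(NR_REGS)];
};

static void regs_mark_avail(struct vcpu_regs *r, unsigned int reg)
{
	assert(reg < NR_REGS);
	r->avail[reg / BITS_PER_LONG] |= 1ul << (reg % BITS_PER_LONG);
}

static bool regs_is_avail(const struct vcpu_regs *r, unsigned int reg)
{
	assert(reg < NR_REGS);
	return r->avail[reg / BITS_PER_LONG] & (1ul << (reg % BITS_PER_LONG));
}

/* Open-coded reset: valid only while the bitmap fits in a single long. */
static void regs_reset_avail(struct vcpu_regs *r, unsigned long mask)
{
	static_assert(BITS_TO_LONGS(NR_REGS) == 1, "bitmap must be one long");
	r->avail[0] &= mask;
}
```

If NR_REGS ever grows past the width of one long, the static_assert
fails the build instead of letting the reset silently skip the upper
words.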

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Track available/dirty register masks as "unsigned long" values

Convert regs_{avail,dirty} and all related masks to "unsigned long" values
as an intermediate step towards declaring the fields as actual bitmaps, and
as a step toward supporting APX, which will push the total number of registers
beyond 32 on 64-bit kernels.

Opportunistically convert TDX's ULL bitmask to a UL to match everything
else (TDX is 64-bit only, so it's a nop in the end).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Add wrapper APIs to reset dirty/available register masks

Add wrappers for setting regs_{avail,dirty} in anticipation of turning the
fields into proper bitmaps, at which point direct writes won't work so
well.

Deliberately leave the initialization in kvm_arch_vcpu_create() as-is,
because the regs_avail logic in particular is special in that it's the one
and only place where KVM marks eagerly synchronized registers as available.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: nVMX: Do a bitwise-AND of regs_avail when switching active VMCS

When switching between vmcs01 and vmcs02, do a bitwise-AND of regs_avail
to effectively reset the mask for the new VMCS, purely to be consistent
with all other "full" writes of regs_avail. In practice, a straight write
versus a bitwise-AND will yield the same result, as kvm_arch_vcpu_create()
marks *all* registers available (and dirty), and KVM never marks registers
unavailable unless they're lazily loaded.

This will allow adding wrapper APIs to set regs_{avail,dirty} without
having to add special handling for a nVMX use case that doesn't exist in
practice.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Drop the "EX" part of "EXREG" to avoid collision with APX

Now that NR_VCPU_REGS is no longer a thing, and now that RIP is
effectively an EXREG, drop the "EX" (for "extended", or maybe "extra"?)
prefix from non-GPR registers to avoid a collision with APX (Advanced
Performance Extensions), which adds:

16 additional general-purpose registers (GPRs) R16–R31, also referred
to as Extended GPRs (EGPRs) in this document;

I.e. KVM's version of "extended" won't match with APX's definition.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

KVM: x86: Add dedicated storage for guest RIP

Add kvm_vcpu_arch.rip to track guest RIP instead of including it in the
generic regs[] array.  Decoupling RIP from regs[] will allow using a
*completely* arbitrary index for RIP, as opposed to the mostly-arbitrary
index that is currently used.  That in turn will allow using indices
16-31 to track R16-R31 that are coming with APX.

Note, although RIP can be used for addressing, it does NOT have an
architecturally defined index, and so can't be reached via flows like
get_vmx_mem_address() where KVM "blindly" reads a general purpose register
given the SIB information reported by hardware.  For RIP-relative
addressing, hardware reports the full "offset" in vmcs.EXIT_QUALIFICATION.

Note #2, keep the available/dirty tracking as RIP is context switched
through the VMCS, i.e. needs to be cached for VMX.

Opportunistically rename NR_VCPU_REGS to NR_VCPU_GENERAL_PURPOSE_REGS to
better capture what it tracks, and so that KVM can slot in R16-R31 without
running into weirdness where KVM's definition of "EXREG" doesn't line up
with APX's definition of "extended reg".

No functional change intended.

Cc: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20260409224236.2021562-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Merge branch 'bpf-support-stack-arguments-for-bpf-functions-and-kfuncs'

Yonghong Song says:

====================
bpf: Support stack arguments for BPF functions and kfuncs

Currently, bpf function calls and kfuncs are limited to 5 register-level
parameters. For function calls with more than 5 parameters, developers
can mark the function always-inline or pack the extra parameters into a
struct and pass a pointer to it, although both are somewhat inconvenient.
There is no workaround for kfuncs that need more than 5 parameters.

This patch set lifts the 5-argument limit by introducing stack-based
argument passing for BPF functions and kfunc's, coordinated with
compiler support in LLVM [1]. The compiler emits loads/stores through
a new bpf register r11 (BPF_REG_PARAMS), to pass arguments beyond
the 5th, keeping the stack arg area separate from the r10-based program
stack. The current maximum number of arguments is capped at
MAX_BPF_FUNC_ARGS (12), which is sufficient for the vast majority of
use cases.

All kfunc/bpf-function arguments are caller saved, including stack
arguments. For register arguments (r1-r5), the verifier already marks
them as clobbered after each call. For stack arguments, the verifier
invalidates all outgoing stack arg slots immediately after a call,
requiring the compiler to re-store them before any subsequent call.
This follows the native calling convention where all function
parameters are caller saved.

The x86_64 JIT translates r11-relative accesses to RBP-relative
native instructions. Each function's stack allocation is extended
by 'max_outgoing' bytes to hold the outgoing arg area below the
callee-saved registers. This makes implementation easier as the r10
can be reused for stack argument access. At both BPF-to-BPF and kfunc
calls, outgoing args are pushed onto the expected calling convention
locations directly. The incoming parameters can directly get the value
from caller.

Global subprogs and freplace progs with >5 args are not yet supported.
Only x86_64 and arm64 are supported for now. The same selftests are run
on both x86_64 and arm64. Please see each individual patch for details.

  [1] https://github.com/llvm/llvm-project/pull/189060

Changelogs:
  v3 -> v4:
    - v3: https://lore.kernel.org/bpf/20260511053301.1878610-1-yonghong.song@linux.dev/
    - Added no_stack_arg_load comparison in func_states_equal() to ensure
      correctness of pruning.
    - Shrink bpf_jmp_history_entry.flags to 4bit to match the number of flags.
    - Instead of passing bpf_subprog_info to JIT, use prog->aux->func_idx to
      find corresponding bpf_subprog_info from 'env'.
    - For patch 'bpf: Reject stack arguments if tail call reachable', use stack_arg_cnt
      instead of just incoming stack arg cnt.
    - Tighten invalidate_outgoing_stack_args() for kfunc/helper/bpf-to-bpf calls.
    - Disable private stack in verifier for x86_64 instead of in JIT.
  v2 -> v3:
    - v2: https://lore.kernel.org/bpf/20260507212942.1122000-1-yonghong.song@linux.dev/
    - In do_check_common() and for main prog, if btf does not match with actual
      parameter, the verification will continue and will ignore arg_cnt. Make
      arg_cnt=1 explicitly to prevent any incoming stack arguments.
    - Remove the loop which clears the current frame stack slot and sets the upper level frame
      stack slot. This is not needed unless there is a bug. Add a verifier_bug
      if the bug happens.
    - For liveness, avoid r11 based load/stores mixing with r10 based stack tracking.
      Also, print out stack arguments properly.
    - Pass bpf_subprog_info to the JIT so we can avoid copying bpf_subprog_info fields to
      bpf_prog_aux.
    - Fix the missed allocation free for test infra BTF fixup.
    - Remove selftest result for precision backtracking test since the result would
      change (two possible outputs).
  v1 -> v2:
    - v1: https://lore.kernel.org/bpf/20260424171433.2034470-1-yonghong.song@linux.dev/
    - Several refactorings (convert bpf_get_spilled_reg macro to static inline func,
      remove copy_register_state(), refactor jmp history, refactor record_call_access(), etc),
      suggested by Eduard.
    - Use incoming_stack_arg_cnt/stack_arg_cnt instead of incoming_stack_arg_depth/stack_arg_depth,
      suggested by Eduard.
    - Fix a stack arg pruning bug, from Eduard.
    - Fix a bug for precision marking and backtracking, basically callee needs to get the
      stack arg value from callers, with help from Eduard.
    - Set sub->arg_cnt earlier in btf_prepare_func_args(), this will avoid having
      incoming_stack_arg_cnt in bpf_subprog_info.
    - Do stack-arg liveness analysis together with r10 based liveness analysis,
      suggested by Eduard.
    - Fix a few tests to ensure that r11-based loads cannot be ahead of r11-based stores,
      and r11-based loads cannot be after kfunc/helper/bpf-function.
====================

Link: https://patch.msgid.link/20260513044949.2382019-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Enable stack argument tests for arm64

Now that arm64 supports stack arguments, enable the existing stack_arg,
stack_arg_kfunc and verifier_stack_arg tests for __TARGET_ARCH_arm64.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045204.2403441-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, arm64: Add JIT support for stack arguments

Implement stack argument passing for BPF-to-BPF and kfunc calls with
more than 5 parameters on arm64, following the AAPCS64 calling
convention.

BPF R1-R5 already map to x0-x4. With BPF_REG_0 moved to x8 by the
previous commit, x5-x7 are free for arguments 6-8. Arguments 9-12
spill onto the stack at [SP+0], [SP+8], ... and the callee reads
them from [FP+16], [FP+24], ... (above the saved FP/LR pair).

BPF convention uses fixed offsets from BPF_REG_PARAMS (r11): off=-8 is
always arg 6, off=-16 arg 7, etc. The verifier invalidates all outgoing
stack arg slots after each call, so the compiler must re-store before
every call. This means x5-x7 don't need to be saved on stack.
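
The argument placement described above can be sketched as a small lookup.
This is only an illustration of the mapping stated in this message, not the
JIT code itself; struct arg_loc, arm64_arg_loc() and bpf_outgoing_off() are
hypothetical names.

```c
#include <assert.h>

enum arg_loc_kind { ARG_IN_REG, ARG_ON_STACK };

struct arg_loc {
	enum arg_loc_kind kind;
	int xreg;        /* x register number, valid for ARG_IN_REG */
	int caller_off;  /* byte offset from SP in the caller, if on stack */
	int callee_off;  /* byte offset from FP in the callee, if on stack */
};

/* Native location of 1-based argument argno (1..12). */
static struct arg_loc arm64_arg_loc(int argno)
{
	struct arg_loc loc = { ARG_IN_REG, -1, -1, -1 };

	assert(argno >= 1 && argno <= 12);
	if (argno <= 8) {
		/* Args 1-5 come from BPF R1-R5, args 6-8 from r11 slots. */
		loc.xreg = argno - 1;
	} else {
		loc.kind = ARG_ON_STACK;
		loc.caller_off = (argno - 9) * 8;       /* [SP+0], [SP+8], ... */
		loc.callee_off = 16 + (argno - 9) * 8;  /* above saved FP/LR */
	}
	return loc;
}

/* BPF-side offset from r11 used to store outgoing argument argno (6..12). */
static int bpf_outgoing_off(int argno)
{
	assert(argno >= 6 && argno <= 12);
	return -8 * (argno - 5);
}
```

For instance, a store to r11-8 (arg 6) lands in x5, while a store to r11-32
(arg 9) lands at [SP+0] in the caller and is read from [FP+16] in the callee.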

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045158.2402494-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, arm64: Map BPF_REG_0 to x8 instead of x7

Move the BPF return value register from x7 to x8, freeing x7 for use
as an argument register. AAPCS64 designates x8 as the indirect result
location register; it is caller-saved and not used for argument
passing, making it a suitable home for BPF_REG_0.

This is a prerequisite for stack argument support, which needs x5-x7
to pass arguments 6-8 to native kfuncs following the AAPCS64 calling
convention.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045153.2402197-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add precision backtracking test for stack arguments

Add a test that verifies precision backtracking works correctly
across BPF-to-BPF calls when stack arguments are involved.

The test passes a size value as incoming stack arg (arg6) to a
subprog, which forwards it as the mem__sz parameter (outgoing arg7)
to bpf_kfunc_call_stack_arg_mem. The expected __msg annotations
verify that precision propagates from the kfunc's mem__sz argument
back through the subprog frame to the caller's outgoing stack arg
store.

A companion BTF file (btf__stack_arg_precision.c) provides named
parameter BTF for the __naked subprog via __btf_func_path.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045148.2400087-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add verifier tests for stack argument validation

Add inline-asm based verifier tests that exercise stack argument
validation logic directly.

Positive tests:
  - Subprog call with 6 args
  - Two sequential calls to different subprogs (6-arg and 7-arg)
  - Share a r11 store for both branches

Negative tests — verifier rejection:
  - Read from uninitialized incoming stack arg slot
  - Gap in outgoing slots: only r11-16 written, r11-8 missing
  - Write at r11-80, exceeding max 7 stack args
  - Missing store on one branch with a shared store
  - First call has proper stack arguments; the second call tries to
    inherit them, which does not work
  - r11 load ordering issue

Negative tests — pointer/ref tracking:
  - Pruning type mismatch: one branch stores PTR_TO_STACK, the
    other stores a scalar, callee dereferences — must not prune
  - Release invalidation: bpf_sk_release invalidates a socket
    pointer stored in a stack arg slot
  - Packet pointer invalidation: bpf_skb_pull_data invalidates
    a packet pointer stored in a stack arg slot
  - Null propagation: PTR_TO_MAP_VALUE_OR_NULL stored in stack
    arg slot, null branch attempts dereference via callee

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045143.2399278-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add BTF fixup for __naked subprog parameter names

When __naked subprogs are used in verifier tests, clang drops
parameter names from their BTF FUNC_PROTO entries. This prevents
the verifier from resolving stack argument slots by name.

Add a __btf_func_path(path) annotation that points to a separate
BTF file containing properly-named FUNC entries. The test_loader
matches FUNC entries by name, detects anonymous parameters, and
replaces the FUNC_PROTO with a new one that carries parameter
names from the custom file while preserving the original type IDs.

The custom BTF file also serves as btf_custom_path for kfunc
resolution when no separate btf_custom_path is specified.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045138.2398886-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for stack argument validation

Add negative tests for kfunc handling (rejecting a kfunc call that
passes a struct larger than 8 bytes as a stack argument) and for the
verifier (rejecting invalid uses of r11 for stack arguments).

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045132.2398371-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for BPF function stack arguments

Add selftests covering stack argument passing for both BPF-to-BPF
subprog calls and kfunc calls with more than 5 arguments. All tests
are guarded by __BPF_FEATURE_STACK_ARGUMENT and __TARGET_ARCH_x86.

BPF-to-BPF subprog call tests (stack_arg.c):
  - Scalar stack args
  - Pointer stack args
  - Mixed pointer/scalar stack args
  - Nested calls
  - Dynptr stack arg
  - Two callees with different stack arg counts
  - Async callback

Kfunc call tests (stack_arg_kfunc.c, with bpf_testmod kfuncs):
  - Scalar stack args
  - Pointer stack args
  - Mixed pointer/scalar stack args
  - Dynptr stack arg
  - Memory buffer + size pair
  - Iterator
  - Const string pointer
  - Timer pointer

Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045127.2397187-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf,x86: Implement JIT support for stack arguments

Add x86_64 JIT support for BPF functions and kfuncs with more than
5 arguments. The extra arguments are passed through a stack area
addressed by register r11 (BPF_REG_PARAMS) in BPF bytecode,
which the JIT translates to native code.

The JIT follows the x86-64 calling convention for both BPF-to-BPF
and kfunc calls:
  - Arg 6 is passed in the R9 register
  - Args 7+ are passed on the stack

Incoming arg 6 (BPF r11+8) is translated to a MOV from R9 rather
than a memory load. Incoming args 7+ (BPF r11+16, r11+24, ...) map
directly to [rbp + 16], [rbp + 24], ..., matching the x86-64 stack
layout after CALL + PUSH RBP, so no offset adjustment is needed.

tail_call_reachable is rejected by the verifier and priv_stack is
disabled by the JIT when stack args exist, so R9 is always
available. When BPF bytecode writes to the arg-6 stack slot
(offset -8), the JIT emits a MOV into R9 instead of a memory store.
Outgoing args 7+ are placed at [rsp] in a pre-allocated area below
callee-saved registers, using:
  native_off = outgoing_arg_base - outgoing_rsp - bpf_off - 16

The native x86_64 stack layout with stack arguments:

  high address
  +-------------------------+
  | incoming stack arg N    |  [rbp + 16 + (N-7)*8]  (from caller)
  | ...                     |
  | incoming stack arg 7    |  [rbp + 16]
  +-------------------------+
  | return address          |  [rbp + 8]
  | saved rbp               |  [rbp]
  +-------------------------+
  | BPF program stack       |  (round_up(stack_depth, 8) bytes)
  +-------------------------+
  | callee-saved regs       |  (r12, rbx, r13, r14, r15 as needed)
  +-------------------------+
  | outgoing arg M          |  [rsp + (M-7)*8]
  | ...                     |
  | outgoing arg 7          |  [rsp]
  +-------------------------+  rsp
  low address
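
The offsets in the layout above can be written down as simple arithmetic.
The helpers below are a hypothetical sketch (the function names are not from
the patch); they show why incoming args 7+ need no offset adjustment: the
BPF r11 offset and the native RBP offset are the same number.

```c
#include <assert.h>

/* BPF-side offset from r11 for incoming argument argno (6 or higher). */
static int bpf_incoming_off(int argno)
{
	assert(argno >= 6);
	return 8 * (argno - 5);
}

/* Native offset from RBP for incoming argument argno (7 or higher). */
static int x86_incoming_rbp_off(int argno)
{
	assert(argno >= 7);
	return 16 + (argno - 7) * 8;
}

/* Native offset from RSP for outgoing argument argno (7 or higher). */
static int x86_outgoing_rsp_off(int argno)
{
	assert(argno >= 7);
	return (argno - 7) * 8;
}
```

Arg 6 is the one exception: its BPF slot (r11+8 incoming, r11-8 outgoing)
has no memory location at all, since it travels in R9.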

Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045122.2393118-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Disable private stack for x86_64 if stack arguments used

Other architectures like arm64, riscv, etc. have enough registers,
so for them private stack can be used together with stack
arguments.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045114.2392291-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Reject stack arguments if tail call reachable

Tail calls are deprecated and will be replaced by indirect calls
in the future. Reject programs that combine tail calls with stack
arguments rather than adding complexity for a deprecated feature.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045109.2392108-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Support stack arguments for kfunc calls

Extend the stack argument mechanism to kfunc calls, allowing kfuncs
with more than 5 parameters to receive additional arguments via the
r11-based stack arg area.

For kfuncs, the caller is a BPF program and the callee is a kernel
function. The BPF program writes outgoing args at negative r11
offsets, following the same convention as BPF-to-BPF calls:

  Outgoing: r11 - 8 (arg6), ..., r11 - N*8 (last arg)

The following is an example:

  int foo(int a1, int a2, int a3, int a4, int a5, int a6, int a7) {
    ...
    kfunc1(a1, a2, a3, a4, a5, a6, a7, a8);
    ...
    kfunc2(a1, a2, a3, a4, a5, a6, a7, a8, a9);
    ...
  }

   Caller (foo), generated by llvm
   ===============================
   Incoming (positive offsets):
     r11+8:  [incoming arg 6]
     r11+16: [incoming arg 7]

   Outgoing for kfunc1 (negative offsets):
     r11-8:  [outgoing arg 6]
     r11-16: [outgoing arg 7]
     r11-24: [outgoing arg 8]

   Outgoing for kfunc2 (negative offsets):
     r11-8:  [outgoing arg 6]
     r11-16: [outgoing arg 7]
     r11-24: [outgoing arg 8]
     r11-32: [outgoing arg 9]

Later JIT will marshal outgoing arguments to the native calling
convention for kfunc1() and kfunc2().

For kfunc calls where stack args are used as constant or size
parameters, a mark_stack_arg_precision() helper is used to propagate
precision and do proper backtracking.

There are two places where meta->release_regno needs to keep
regno for later releasing the reference. In addition,
'cur_aux(env)->arg_prog = regno' keeps regno for later fixup. Since stack
arguments don't have a valid register number (regno is negative), these
three cases are rejected for now if the argument is on the stack.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045104.2391543-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Enable r11 based insns

BPF_REG_PARAMS (r11) is used for stack argument accesses, and
the following are the only insns in which r11 may appear:
    - load incoming stack arg
    - store register to outgoing stack arg
    - store immediate to outgoing stack arg

The detailed insn format can be found in is_stack_arg_ldx/st/stx()
helpers. After this patch, stack arg ldx/st/stx insns become valid
for kernel and these insns can be properly checked by verifier.

The LLVM compiler [1] implemented the above BPF_REG_PARAMS insns.

  [1] https://github.com/llvm/llvm-project/pull/189060

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045059.2391192-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Prepare architecture JIT support for stack arguments

Add bpf_jit_supports_stack_args() as a weak function defaulting to
false. Architectures that implement JIT support for stack arguments
override it to return true.

Reject BPF functions with more than 5 parameters at verification
time if the architecture does not support stack arguments.
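
A minimal userspace sketch of the weak-default pattern, assuming a GCC or
Clang toolchain. check_arg_cnt(), MAX_REG_ARGS and the -ENOTSUP error code
are hypothetical stand-ins for the verifier-side check, not the actual
kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_REG_ARGS 5  /* arguments passed in r1-r5 */

/*
 * Weak default: a JIT that implements stack arguments would provide a
 * strong definition of the same symbol returning true.
 */
__attribute__((weak)) bool bpf_jit_supports_stack_args(void)
{
	return false;
}

/* Hypothetical check: 0 if the argument count is acceptable. */
static int check_arg_cnt(int nargs)
{
	assert(nargs >= 0);
	if (nargs > MAX_REG_ARGS && !bpf_jit_supports_stack_args())
		return -ENOTSUP;
	return 0;
}
```

With only the weak default linked in, a 6-argument function is rejected; a
strong override in an architecture-specific object flips the result without
touching the caller.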

Acked-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045054.2390945-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Reject stack arguments in non-JITed programs

The interpreter does not understand the bpf register r11
(BPF_REG_PARAMS) used for stack arguments. So reject interpreter
usage if stack arguments are used either in the main program or
any subprogram.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045049.2390444-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Extend liveness analysis to track stack argument slots

BPF_REG_PARAMS (R11) is at index MAX_BPF_REG, which is beyond the
register tracking arrays in const_fold.c and liveness.c. Handle it
explicitly to avoid out-of-bounds accesses.

Extend the arg tracking dataflow to cover stack arg slots. Otherwise,
pointers passed through stack args are invisible to liveness, causing
the pointed-to stack slots to be incorrectly poisoned.

Extend the at_out tracking array to MAX_AT_TRACK_REGS (registers
plus stack arg slots) so that outgoing stack arg stores are tracked
alongside registers. Add a separate at_stack_arg_entry array in
compute_subprog_args(), passed to arg_track_xfer(), to restore
FP-derived values on incoming stack arg reads.

Extend record_call_access() to check stack arg slots for FP-derived
pointers at kfunc call sites, reusing the record_arg_access() helper
extracted in the previous patch. Pass stack arg state from caller to
callee in analyze_subprog() so that callees can track pointers received
through stack args, hence avoid poisoning.

Skip stack arg instructions in record_load_store_access(). Stack arg
STX uses dst_reg=BPF_REG_PARAMS (index 11), but at[11] is repurposed
to track the value stored in stack arg slot 0. Without the skip, if a
prior stack arg STX stored an FP-derived pointer (e.g., fp-64) into
slot 0, a subsequent stack arg STX would read that FP-derived value as
the base pointer and spuriously mark a regular stack slot (e.g., fp-72
from -64 + -8) as accessed in the liveness bitmap.

Extend arg_track_log() to log state transitions for outgoing stack arg
slots at indices MAX_BPF_REG through MAX_AT_TRACK_REGS-1. Without this,
changes to at_out[11..17] caused by stack arg store instructions are
silently omitted from BPF_LOG_LEVEL2 output. For example, when a
caller passes fp-64 through a stack argument:

  subprog#0:
   10: (bf) r6 = r10
   11: (07) r6 += -64
   12: (7b) *(u64 *)(r11 -8) = r6
sa0: none -> fp0-64
   13: (85) call pc+5

Without the fix, the "sa0: none -> fp0-64" transition at insn 12
would not appear.

Extend print_subprog_arg_access() to include stack arg slots in the
per-instruction FP-derived state dump. For example:

  subprog#0:
   12: (7b) *(u64 *)(r11 - 8) = r6  // r6=fp0-64
   13: (85) call pc+5              // r6=fp0-64 sa0=fp0-64

Without the fix, the "sa0=fp0-64" annotation at insn 13 would not
appear, making it harder to debug liveness analysis for programs
that pass FP-derived pointers through stack arguments.

Extend has_fp_args() to also check stack arg slots for FP-derived
pointers, so that callees receiving pointers only through stack args
are still recursively analyzed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045043.2389049-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Use arg_is_fp() in has_fp_args()

Replace "frame != ARG_NONE" with arg_is_fp() in has_fp_args().
The function's purpose is to check whether any argument is derived
from a frame pointer, which is exactly what arg_is_fp() tests
(frame >= 0 || frame == ARG_IMPRECISE). Using the dedicated
predicate is clearer and more consistent with the rest of the file.
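
For illustration, the predicate can be sketched as below. The numeric values
of ARG_NONE and ARG_IMPRECISE and the arg_track_sketch struct are
assumptions made for this example; only the condition itself is taken from
the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed sentinel values; the real definitions may differ. */
enum { ARG_NONE = -1, ARG_IMPRECISE = -2 };

struct arg_track_sketch {
	int frame;  /* >= 0: derived from that frame's FP, else a sentinel */
};

static bool arg_is_fp(const struct arg_track_sketch *at)
{
	return at->frame >= 0 || at->frame == ARG_IMPRECISE;
}

/* True if any of the n tracked arguments is FP-derived. */
static bool has_fp_args(const struct arg_track_sketch *args, size_t n)
{
	assert(args != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (arg_is_fp(&args[i]))
			return true;
	return false;
}
```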

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045035.2388671-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Refactor record_call_access() to extract per-arg logic

Extract the per-argument FP-derived pointer handling from
record_call_access() into a new record_arg_access() helper.

The existing loop body — checking arg_is_fp, querying stack access
bytes, and calling record_stack_access/record_imprecise — will be
reused for stack argument slots in the next patch. Factoring it out
now avoids duplicating the logic.

No functional change.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045030.2388067-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add precision marking and backtracking for stack argument slots

Extend the precision marking and backtracking infrastructure to
support stack argument slots (r11-based accesses). Without this,
precision demands for scalar values passed through stack arguments
are silently dropped, which could allow the verifier to incorrectly
prune states with different constant values in stack arg slots.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045025.2387526-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Refactor jmp history to use dedicated spi/frame fields

Move stack slot index (spi) and frame number out of the flags field
in bpf_jmp_history_entry into dedicated bitfields. This simplifies
the encoding and makes room for new flags.

Previously, spi and frame were packed into the lower 9 bits of the
12-bit flags field (3 bits frame + 6 bits spi), with INSN_F_STACK_ACCESS
at BIT(9) and INSN_F_DST/SRC_REG_STACK at BIT(10)/BIT(11).
But this has no room for an INSN_F_* flag for stack arguments.

To resolve this issue, bpf_jmp_history_entry field idx is narrowed to
20 bits (sufficient for insn indices up to 1M), and the freed bits hold
spi (6 bits) and frame (3 bits) as dedicated struct fields. The flags
enum is simplified accordingly:
  INSN_F_STACK_ACCESS  -> BIT(0)
  INSN_F_DST_REG_STACK -> BIT(1)
  INSN_F_SRC_REG_STACK -> BIT(2)
which allows more room for additional INSN_F_* flags.

bpf_push_jmp_history() now takes explicit spi and frame parameters
instead of encoding them into flags. The insn_stack_access_flags(),
insn_stack_access_spi(), and insn_stack_access_frameno() helpers are
removed.
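
A sketch of what such a layout could look like, assuming the bit widths
listed above. The struct name, field order, the 3-bit width of flags and the
make_entry() helper are hypothetical and chosen only so the example fits in
32 bits.

```c
#include <assert.h>

enum {
	INSN_F_STACK_ACCESS  = 1u << 0,
	INSN_F_DST_REG_STACK = 1u << 1,
	INSN_F_SRC_REG_STACK = 1u << 2,
};

/* Hypothetical layout: spi and frame are fields, not encoded in flags. */
struct jmp_history_entry_sketch {
	unsigned int idx   : 20;  /* insn index, up to 1M */
	unsigned int spi   : 6;   /* stack slot index */
	unsigned int frame : 3;   /* frame number */
	unsigned int flags : 3;   /* INSN_F_* bits (width assumed) */
};

/* Build an entry from explicit spi/frame values, with range checks. */
static struct jmp_history_entry_sketch
make_entry(unsigned int idx, unsigned int flags,
	   unsigned int spi, unsigned int frame)
{
	struct jmp_history_entry_sketch e;

	assert(idx < (1u << 20));
	assert(spi < (1u << 6));
	assert(frame < (1u << 3));
	assert(flags < (1u << 3));
	e.idx = idx;
	e.spi = spi;
	e.frame = frame;
	e.flags = flags;
	return e;
}
```

Readers of an entry then use e.spi and e.frame directly, so no mask/shift
helpers are needed to pull them out of flags.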

No functional change.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045020.2385962-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Support stack arguments for bpf functions

Currently BPF functions (subprogs) are limited to 5 register arguments.
With [1], the compiler can emit code that passes additional arguments
via a dedicated stack area through bpf register BPF_REG_PARAMS (r11),
introduced in an earlier patch ([2]).

The compiler uses positive r11 offsets for incoming (callee-side) args
and negative r11 offsets for outgoing (caller-side) args, following the
x86_64/arm64 calling convention direction. There is an 8-byte gap at
offset 0 separating two regions:
  Incoming (callee reads):   r11+8 (arg6), r11+16 (arg7), ...
  Outgoing (caller writes):  r11-8 (arg6), r11-16 (arg7), ...
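
The offset convention can be sketched as a pair of helpers. The function
names and MAX_BPF_REG_ARGS are hypothetical; MAX_BPF_FUNC_ARGS (12) is the
cap mentioned in the cover letter, which leaves at most 7 stack arguments.

```c
#include <assert.h>

#define MAX_BPF_FUNC_ARGS 12
#define MAX_BPF_REG_ARGS   5  /* hypothetical name for the r1-r5 count */

/* r11 offset of 1-based argument argno; positive if incoming. */
static int stack_arg_off(int argno, int incoming)
{
	assert(argno > MAX_BPF_REG_ARGS && argno <= MAX_BPF_FUNC_ARGS);
	return (incoming ? 8 : -8) * (argno - MAX_BPF_REG_ARGS);
}

/* Argument number for an r11 offset, or -1 if the offset is invalid. */
static int stack_arg_from_off(int off)
{
	int mag = off < 0 ? -off : off;

	if (mag == 0 || mag % 8 != 0)
		return -1;  /* offset 0 is the gap; slots are 8 bytes each */
	if (mag / 8 > MAX_BPF_FUNC_ARGS - MAX_BPF_REG_ARGS)
		return -1;  /* beyond the last supported stack argument */
	return MAX_BPF_REG_ARGS + mag / 8;
}
```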

The following is an example to show how stack arguments are saved
and transferred between caller and callee:

  int foo(int a1, int a2, int a3, int a4, int a5, int a6, int a7) {
    ...
    bar(a1, a2, a3, a4, a5, a6, a7, a8);
    ...
  }

  Caller (foo)                           Callee (bar)
  ============                           ============
  Incoming (positive offsets):           Incoming (positive offsets):

  r11+8:  [incoming arg 6]               r11+8:  [incoming arg 6] <-+
  r11+16: [incoming arg 7]               r11+16: [incoming arg 7] <-|+
                                         r11+24: [incoming arg 8] <-||+
  Outgoing (negative offsets):                                      |||
  r11-8:  [outgoing arg 6 to bar] -------->-------------------------+||
  r11-16: [outgoing arg 7 to bar] -------->--------------------------+|
  r11-24: [outgoing arg 8 to bar] -------->---------------------------+

If the bpf function has more than one call:

  int foo(int a1, int a2, int a3, int a4, int a5, int a6, int a7) {
    ...
    bar1(a1, a2, a3, a4, a5, a6, a7, a8);
    ...
    bar2(a1, a2, a3, a4, a5, a6, a7, a8, a9);
    ...
  }

  Caller (foo)                             Callee (bar2)
  ============                             ==============
  Incoming (positive offsets):             Incoming (positive offsets):

  r11+8:  [incoming arg 6]                 r11+8:  [incoming arg 6] <+
  r11+16: [incoming arg 7]                 r11+16: [incoming arg 7] <|+
                                           r11+24: [incoming arg 8] <||+
  Outgoing for bar2 (negative offsets):    r11+32: [incoming arg 9] <|||+
  r11-8:  [outgoing arg 6] ---->----------->-------------------------+|||
  r11-16: [outgoing arg 7] ---->----------->--------------------------+||
  r11-24: [outgoing arg 8] ---->----------->---------------------------+|
  r11-32: [outgoing arg 9] ---->----------->----------------------------+

The verifier tracks outgoing stack arguments in stack_arg_regs[] and
out_stack_arg_cnt in bpf_func_state, separately from the regular
r10 stack. The callee does not copy incoming args — it reads them
directly from the caller's outgoing slots at positive r11 offsets.
Similar to stacksafe(), introduce stack_arg_safe() to do pruning
check.

Outgoing stack arg slots are invalidated when the callee returns
(e.g. in prepare_func_exit), not at call time. This allows the callee to
read incoming args from the caller's outgoing slots during
verification. The following are a few examples.

Example 1:
  *(u64 *)(r11 - 8) = r6;
  *(u64 *)(r11 - 16) = r7;
  call bar1;                // arg6 = r6, arg7 = r7
  call bar2;                // expected with 2 stack arguments, failed

Example 2:
To fix Example 1, store the arguments again before the second call:
  *(u64 *)(r11 - 8) = r6;
  *(u64 *)(r11 - 16) = r7;
  call bar1;                // arg6 = r6, arg7 = r7
  *(u64 *)(r11 - 8) = r8;
  *(u64 *)(r11 - 16) = r9;
  call bar2;                // arg6 = r8, arg7 = r9

Example 3:
The compiler can hoist the shared stack arg stores above the branch:
  *(u64 *)(r11 - 16) = r7;
  if cond goto else;
    *(u64 *)(r11 - 8) = r8;
    call bar1;               // arg6 = r8, arg7 = r7
    goto end;
  else:
    *(u64 *)(r11 - 8) = r9;
    call bar2;               // arg6 = r9, arg7 = r7
  end:

Example 4:
Within a loop:
  loop:
    *(u64 *)(r11 - 8) = r6;  // arg6, stored again on each iteration
    call bar;                // arg6 = r6
    if ... goto loop;
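
The invalidate-on-return rule behind Examples 1 and 2 can be sketched
with a toy model. Everything here (struct out_args_model, store_arg(),
call_func(), MAX_STACK_ARGS) is hypothetical and only mimics the
behaviour described in the text; it is not the verifier implementation.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_STACK_ARGS 8	/* hypothetical cap */

struct out_args_model {
	bool valid[MAX_STACK_ARGS];	/* slot i corresponds to r11 - 8 * (i + 1) */
};

/* Model of *(u64 *)(r11 + off) = reg, with off = -8, -16, ... */
static bool store_arg(struct out_args_model *o, int off)
{
	if (off >= 0 || off % 8 != 0 || -off / 8 > MAX_STACK_ARGS)
		return false;
	o->valid[-off / 8 - 1] = true;
	return true;
}

/* Model of a call to a callee that expects nr_stack_args stack arguments. */
static bool call_func(struct out_args_model *o, int nr_stack_args)
{
	if (nr_stack_args < 0 || nr_stack_args > MAX_STACK_ARGS)
		return false;
	for (int i = 0; i < nr_stack_args; i++)
		if (!o->valid[i])
			return false;	/* a required slot was never (re)written */
	/* The callee has returned: all outgoing slots become invalid. */
	memset(o->valid, 0, sizeof(o->valid));
	return true;
}

/* Example 1: two calls share one set of stores, so the second call fails. */
static bool example1(void)
{
	struct out_args_model o = { { false } };

	store_arg(&o, -8);
	store_arg(&o, -16);
	if (!call_func(&o, 2))		/* bar1 */
		return false;
	return call_func(&o, 2);	/* bar2: slots are no longer valid */
}

/* Example 2: the stores are repeated before the second call. */
static bool example2(void)
{
	struct out_args_model o = { { false } };

	store_arg(&o, -8);
	store_arg(&o, -16);
	if (!call_func(&o, 2))		/* bar1 */
		return false;
	store_arg(&o, -8);
	store_arg(&o, -16);
	return call_func(&o, 2);	/* bar2 */
}
```

In this sketch example1() returns false and example2() returns true,
matching the outcomes described for Examples 1 and 2.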

A separate max_out_stack_arg_cnt field in bpf_subprog_info tracks
the deepest outgoing slot actually written. It is intended to
reject programs that write to slots beyond what any callee expects,
and the JIT needs this value as well.
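
One way to picture this bookkeeping is the sketch below. It assumes
that a store at r11 - 8*k counts as depth k and that a callee with
nargs arguments expects nargs - 5 stack arguments (the first five
travel in r1-r5). The struct and function names are hypothetical, and
the real check may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-subprog bookkeeping. */
struct subprog_model {
	int max_out_stack_arg_cnt;	/* deepest outgoing slot written so far */
};

/* Record a store to r11 + off, where off is -8, -16, ... */
static void note_outgoing_store(struct subprog_model *sp, int off)
{
	int depth = -off / 8;

	if (depth > sp->max_out_stack_arg_cnt)
		sp->max_out_stack_arg_cnt = depth;
}

/*
 * callee_stack_args[i] is the number of stack arguments the i-th callee
 * expects. Accept only if no store went deeper than the widest callee.
 */
static bool writes_within_callee_needs(const struct subprog_model *sp,
				       const int *callee_stack_args, size_t n)
{
	int widest = 0;

	for (size_t i = 0; i < n; i++)
		if (callee_stack_args[i] > widest)
			widest = callee_stack_args[i];
	return sp->max_out_stack_arg_cnt <= widest;
}
```

With the bar1/bar2 diagram above, bar2 takes 9 arguments, so the widest
callee expects 4 stack slots and a store to r11-32 is still in range.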

Similar to typical compiler-generated code, enforce the following
ordering rules:
  - all stack arg reads must come before any stack arg write
  - all stack arg reads must come before any call to a bpf function,
    kfunc or helper
This is needed because the JIT may emit 'mov' insns that use the same
register for reads and writes, and because a call to a bpf function,
kfunc or helper invalidates all arguments immediately after the call.
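
The two ordering rules can be illustrated on a straight-line event
sequence. This is a simplification: the verifier walks program paths
rather than a flat array, and enum stack_arg_event and ordering_ok()
are hypothetical names used only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

enum stack_arg_event {
	EV_READ,	/* load from r11 + N (incoming stack arg) */
	EV_WRITE,	/* store to r11 - N (outgoing stack arg) */
	EV_CALL,	/* call to a bpf function, kfunc or helper */
};

/* Return true if every read precedes the first write and the first call. */
static bool ordering_ok(const enum stack_arg_event *ev, size_t n)
{
	bool reads_closed = false;

	for (size_t i = 0; i < n; i++) {
		if (ev[i] == EV_READ) {
			if (reads_closed)
				return false;	/* read after a write or a call */
		} else {
			reads_closed = true;
		}
	}
	return true;
}
```

A sequence such as read, read, write, call is accepted, while write,
read or read, call, read is rejected.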

For callback functions with stack arguments, the kernel must set up
the parameter types (including stack parameters) properly so that the
callback function can retrieve this information for verification.

Global subprogs and freplace with >5 args are not yet supported.

  [1] https://github.com/llvm/llvm-project/pull/189060
  [2] https://lore.kernel.org/bpf/20260423033506.2542005-1-yonghong.song@linux.dev/

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045015.2385013-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Set sub->arg_cnt earlier in btf_prepare_func_args()

Move the "sub->arg_cnt = nargs" assignment to immediately after
nargs is computed from btf_type_vlen(), instead of at the end of
btf_prepare_func_args().

btf_prepare_func_args() can return -EINVAL early in several cases,
e.g. when a static function has some non-int/enum arguments.
Because -EINVAL from btf_prepare_func_args() does not immediately
fail verification, the verifier continues with arg_cnt still zero
after the early return. Later stack-argument load/store insns then
incorrectly assume that the function has no arguments.

Setting arg_cnt right after nargs is computed ensures it is available
regardless of which path btf_prepare_func_args() takes.
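
The effect of moving the assignment can be shown with a reduced model.
The struct and both functions below are hypothetical and only mirror
the control flow described above; they are not the kernel code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the subprog info that carries arg_cnt. */
struct subprog_args_model {
	int arg_cnt;
};

/* Old shape: arg_cnt is assigned at the end, so early returns skip it. */
static int prepare_args_late(struct subprog_args_model *sub, int nargs,
			     bool has_unsupported_arg)
{
	if (has_unsupported_arg)
		return -EINVAL;		/* arg_cnt stays at its old value */
	sub->arg_cnt = nargs;
	return 0;
}

/* New shape: arg_cnt is assigned as soon as nargs is known. */
static int prepare_args_early(struct subprog_args_model *sub, int nargs,
			      bool has_unsupported_arg)
{
	sub->arg_cnt = nargs;
	if (has_unsupported_arg)
		return -EINVAL;		/* arg_cnt is already correct */
	return 0;
}
```

Both variants return -EINVAL for an unsupported argument, but only the
second leaves arg_cnt equal to nargs for later consumers.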

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045010.2384635-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add helper functions for r11-based stack argument insns

Add three static inline helper functions, is_stack_arg_ldx(),
is_stack_arg_st() and is_stack_arg_stx(), that identify r11-based
(BPF_REG_PARAMS) instructions used for stack argument passing. These
helpers encapsulate the detailed encoding requirements (operand size,
register, offset alignment and sign) and hide raw BPF_REG_PARAMS usage
from the verifier, making call sites more readable and explicit.

A later patch ("bpf: Enable r11 based insns") will wire these helpers
into the verifier. Until then, check_and_resolve_insns() rejects any
insn that uses r11.
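
A rough sketch of what such predicates might check is shown below. The
opcode constants use the standard BPF encoding values, but
struct insn_model, MODEL_REG_PARAMS and the model_*() functions are
hypothetical, and the exact conditions (64-bit size, 8-byte alignment,
positive offsets for loads, negative offsets for stores) are
assumptions inferred from the description rather than the real helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Opcode fields, using the standard BPF encoding values. */
#define BPF_CLASS(code)	((code) & 0x07)
#define BPF_SIZE(code)	((code) & 0x18)
#define BPF_MODE(code)	((code) & 0xe0)
#define BPF_LDX		0x01
#define BPF_ST		0x02
#define BPF_STX		0x03
#define BPF_DW		0x18
#define BPF_MEM		0x60

#define MODEL_REG_PARAMS 11	/* assumption: r11, BPF_REG_PARAMS in the text */

/* Reduced instruction layout, enough for these checks. */
struct insn_model {
	uint8_t code;
	uint8_t dst_reg;
	uint8_t src_reg;
	int16_t off;
};

/* True for a 64-bit BPF_MEM access of the given instruction class. */
static bool is_dw_mem(uint8_t code, uint8_t cls)
{
	return BPF_CLASS(code) == cls && BPF_SIZE(code) == BPF_DW &&
	       BPF_MODE(code) == BPF_MEM;
}

/* Incoming arg read: rX = *(u64 *)(r11 + N), N > 0 and 8-byte aligned. */
static bool model_is_stack_arg_ldx(const struct insn_model *insn)
{
	return is_dw_mem(insn->code, BPF_LDX) &&
	       insn->src_reg == MODEL_REG_PARAMS &&
	       insn->off > 0 && insn->off % 8 == 0;
}

/* Outgoing arg store from a register: *(u64 *)(r11 - N) = rX. */
static bool model_is_stack_arg_stx(const struct insn_model *insn)
{
	return is_dw_mem(insn->code, BPF_STX) &&
	       insn->dst_reg == MODEL_REG_PARAMS &&
	       insn->off < 0 && insn->off % 8 == 0;
}

/* Outgoing arg store of an immediate: *(u64 *)(r11 - N) = imm. */
static bool model_is_stack_arg_st(const struct insn_model *insn)
{
	return is_dw_mem(insn->code, BPF_ST) &&
	       insn->dst_reg == MODEL_REG_PARAMS &&
	       insn->off < 0 && insn->off % 8 == 0;
}
```

Keeping the encoding checks in one place means a call site only has to
ask which kind of stack argument insn it is looking at.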

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045005.2383881-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Remove copy_register_state wrapper function

Remove the copy_register_state() helper which was just a plain struct
assignment wrapper and replace all call sites with direct struct
assignment. This simplifies the code in preparation for upcoming stack
argument support.

No functional change.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260513045000.2382933-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>