Link: https://lore.kernel.org/20260428124833.1903302-1-rppt@kernel.org Link: https://lore.kernel.org/20260428124833.1903302-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Baoquan He <baoquan.he@linux.dev> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Young <ruirui.yang@linux.dev> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Destructive tests should be invoked with -d command-line option, but this
won't work today since 'd' is missing in getopts command-line. This
commit fixes it.
Link: https://lore.kernel.org/214fd9e4-5398-4c26-859e-c982c2e277c3@redhat.com Fixes: f16ff3b692ad ("selftests/mm: run_vmtests.sh: add missing tests") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scripts/gdb: slab: update field names of struct kmem_cache
The commit 5ba6bc27b1f9 ("slab: decouple pointer to barn from
kmem_cache_node") reorganized the struct kmem_cache to factor out the
per-node fields to the new struct kmem_cache_per_node_ptrs. This causes
the gdb scripts for lx-slabinfo and lx-slabtrace fail as they still
reference the old structure.
Adjust the gdb scripts to match the current state of struct kmem_cache.
Link: https://lore.kernel.org/20260427142448.666117-3-illia@yshyn.com Fixes: 5ba6bc27b1f9 ("slab: decouple pointer to barn from kmem_cache_node") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Hao Li <hao.li@linux.dev> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Seongjun Hong <hsj0512@snu.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scripts/gdb: mm: cast untyped symbols in x86_page_ops
The symbols phys_base, _text, and _end, used in x86_page_ops are either
defined in assembly or implicitly by the linker. Thus, they lack type
information and cause a conversion error after gdb.parse_and_eval.
Explicitly cast these expressions to unsigned long.
Link: https://lore.kernel.org/20260427142448.666117-2-illia@yshyn.com Fixes: 55f8b4518d14 ("scripts/gdb: implement x86_page_ops in mm.py") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Vlastimil Babka <vbabka@suse.com> Cc: Hao Li <hao.li@linux.dev> Cc: Harry Yoo <harry@kernel.org> Cc: Seongjun Hong <hsj0512@snu.ac.kr> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_sysfs_memcg_path_to_id() breaks mem_cgroup_iter() loop without
calling mem_cgroup_iter_break(). This leaks the cgroup reference. Fix
the issue by calling mem_cgroup_iter_break() before the break.
mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
When check_stable_address_space() fails after the PMD spinlock has
been acquired via pmd_lock(), the code jumps directly to the abort
label, bypassing the spin_unlock() call in unlock_abort. This causes
the PMD spinlock to be permanently held, leading to a deadlock.
Change the goto target from abort to unlock_abort to ensure the
spinlock is always released on this error path.
Link: https://lore.kernel.org/20260425133537.17463-1-nueralspacetech@gmail.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot. Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.
The safety mechanism depends on the origin of the reset. For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event. In that case no sleep is possible as a device halt is reported
at the hardirq level.
A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6ffbee
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:
The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to. Therefore there is a
very remote issue only this is a sign of.
Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.
Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean). It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection. For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).
Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.
Fixes: 61414f5ec9834 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Kelley [Tue, 17 Feb 2026 18:23:35 +0000 (10:23 -0800)]
drm/hyperv: During panic do VMBus unload after frame buffer is flushed
In a VM, Linux panic information (reason for the panic, stack trace,
etc.) may be written to a serial console and/or a virtual frame buffer
for a graphics console. The latter may need to be flushed back to the
host hypervisor for display.
The current Hyper-V DRM driver for the frame buffer does the flushing
*after* the VMBus connection has been unloaded, such that panic messages
are not displayed on the graphics console. A user with a Hyper-V graphics
console is left with just a hung empty screen after a panic. The enhanced
control that DRM provides over the panic display in the graphics console
is similarly non-functional.
Commit 3671f3777758 ("drm/hyperv: Add support for drm_panic") added
the Hyper-V DRM driver support to flush the virtual frame buffer. It
provided necessary functionality but did not handle the sequencing
problem with VMBus unload.
Fix the full problem by using VMBus functions to suppress the VMBus
unload that is normally done by the VMBus driver in the panic path. Then
after the frame buffer has been flushed, do the VMBus unload so that a
kdump kernel can start cleanly. As expected, CONFIG_DRM_PANIC must be
selected for these changes to have effect. As a side benefit, the
enhanced features of the DRM panic path are also functional.
Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Michael Kelley [Tue, 17 Feb 2026 18:23:34 +0000 (10:23 -0800)]
Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
Currently, VMBus code initiates a VMBus unload in the panic path so
that if a kdump kernel is loaded, it can start fresh in setting up its
own VMBus connection. However, a driver for the VMBus virtual frame
buffer may need to flush dirty portions of the frame buffer back to
the Hyper-V host so that panic information is visible in the graphics
console. To support such flushing, provide exported functions for the
frame buffer driver to specify that the VMBus unload should not be
done by the VMBus driver, and to initiate the VMBus unload itself.
Together these allow a frame buffer driver to delay the VMBus unload
until after it has completed the flush.
Ideally, the VMBus driver could use its own panic-path callback to do
the unload after all frame buffer drivers have finished. But DRM frame
buffer drivers use the kmsg dump callback, and there are no callbacks
after that in the panic path. Hence this somewhat messy approach to
properly sequencing the frame buffer flush and the VMBus unload.
Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Pengpeng Hou [Thu, 7 May 2026 08:18:11 +0000 (16:18 +0800)]
drivers/of: validate status properties in reconfig state changes
Live-tree reconfiguration properties also carry raw values plus explicit
lengths. `of_reconfig_get_state_change()` currently treats `status`
property values as NUL-terminated strings and feeds them straight into
`strcmp()`.
Factor the `"okay"` / `"ok"` check out into a helper that first verifies
that the property contains a bounded C string within `prop->length`.
Malformed `status` updates should be treated as not enabling the node.
Pengpeng Hou [Thu, 7 May 2026 08:18:10 +0000 (16:18 +0800)]
drivers/of: validate live-tree string properties before string use
`populate_properties()` stores live-tree property values as raw byte
sequences plus a separate `length`. They are not globally guaranteed to
be NUL-terminated.
`of_prop_next_string()` iterates string-list properties by walking raw
bytes, `__of_node_is_type()` checks `device_type`,
`__of_device_is_status()` checks `status`, and
`of_alias_from_compatible()` reads the first `compatible` entry. These
paths must validate that the relevant string fits within the property
bounds before they hand it to C string helpers.
Validate these live-tree string properties within their declared bounds.
In particular, make `of_prop_next_string()` reject malformed entries
before returning them, keep the `device_type` check inside the existing
no-lock helper path, and add unit coverage for malformed first and
trailing string-list entries.
i2c: tegra: make tegra_i2c_mutex_unlock() return void
tegra_i2c_mutex_unlock() returning an error that overwrites the transfer
result causes silent loss of I2C transfer errors. If the transfer failed
but the unlock succeeded, the error was lost and the function incorrectly
reported success.
Rather than propagating the unlock error (which is not actionable by the
caller - the I2C message may have been sent regardless), convert the
function to return void and WARN on the unexpected condition. If the
unlock fails, subsequent lock attempts will fail anyway, making the error
visible on the next transfer.
i2c: tegra: fix pm_runtime leak on mutex_lock failure
If tegra_i2c_mutex_lock() fails, the function returns without calling
pm_runtime_put(), leaking the runtime PM reference acquired by the
preceding pm_runtime_get_sync(). This prevents the device from ever
entering runtime suspend.
Add the missing pm_runtime_put() before returning on lock failure.
driver core: platform: remove software node on release()
If we pass a software node to a newly created device using struct
platform_device_info, it will not be removed when the device is
released. This may happen when a module creating the device is removed
or on failure in platform_device_add().
When we try to reuse that software node in a subsequent call to
platform_device_register_full(), it will fail with -EBUSY.
Provide a wrapper around the existing platform_device_release() that
additionally calls device_remove_software_node() and use it to replace
the former if we end up adding a software node.
While at it: check all three possible situations in which two software
nodes for a single platform device can be created/assigned in
platform_device_register_full() and bail-out early.
KVM: SEV: Allocate only as many bytes as needed for temp crypt buffers
When using a temporary buffer to {de,en}crypt unaligned memory for debug,
allocate only the number of bytes that are needed instead of allocating an
entire page. The most common case for unaligned accesses will be reading
or writing less than 16 bytes.
KVM: SEV: Rewrite logic to {de,en}crypt memory for debug
Wholesale rewrite the guts of the debug {de,en}crypt flows, as the existing
code is broken, e.g. doesn't handle cases where the access isn't naturally
sized and aligned, and is so wildly flawed that attempting to salvage the
current code in an iterative fashion would be more risky than a rewrite.
E.g. when encrypting 9 bytes at offset 8, KVM needs to _decrypt_
destination[31:0] into a temporary buffer, buffer[31:0], then copy 9 bytes
from source[8:0] to buffer[16:8], then encrypt buffer[31:0] back into
destination[31:0]. The current code only ever copies 16 bytes, and
bizarrely uses a temporary buffer for the source as well.
To keep the code easier to read and maintain, send the unaligned cases
down dedicated "slow" paths instead of trying to mix and match the possible
combinations in one helper.
For now, preserve the basic approach of the current code, e.g. allocate an
entire page for the temporary buffer, to minimize unwanted changes in
functionality.
KVM: SEV: Add helper function to pin/unpin a single page
Add helpers to pin/unpin a single page, and use it in all flows that pin
exactly one page. None of the single-page users actually check that the
correct number of pages was pinned, which is functionally ok, but visually
jarring, especially in the decrypt/encrypt flow, which separately pins two
pages, but uses a single variable to track how pages were pinned each time.
Again, it's functionally ok since core mm guarantees exactly one page will
be pinned on success, but it's ugly.
Opportunistically use page_to_phys() instead of open coding an equivalent
via page_to_pfn().
Note, all users of the single-page helper pre-validate the address and
length, i.e. don't rely on the sanity check in sev_pin_memory().
KVM: SEV: Explicitly validate the dst buffer for debug operations
When encrypting/decrypting guest memory, explicitly check that the
destination is non-NULL and doesn't wrap instead of subtly relying on
sev_pin_memory() to perform the check. This will allow adding and using
a more focused single-page pinning helper.
KVM: selftests: Add a test to verify SEV {en,de}crypt debug ioctls
Add a selftest to verify KVM's handling of {de,en}crypt debug ioctls,
specifically focusing on edge cases around the chunk (16 bytes) and page
(4096) sizes, where KVM had multiple bugs. E.g. KVM would fail to handle
small sizes that aren't naturally aligned and sized, would buffer overflow
if the destination was unaligned but the source was not, etc.
Attempt to strike a balance between an exhaustive test and a reasonable
runtime. On a system with both SEV and SEV-ES support, the current runtime
is under 45 seconds. Which isn't great, but it's tolerable, and it's not
obvious which of the combinations are "better" than the others.
Linus Torvalds [Wed, 13 May 2026 22:00:40 +0000 (15:00 -0700)]
Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
- PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
the failing task rather than the irq_work kthread
- Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
fixes"
* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
sched_ext: Fix ops->priv clobber on concurrent attach/detach
selftests/sched_ext: Fix build error in dequeue selftest
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
sched_ext: Close sub-sched init race with post-init DEAD recheck
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
sched_ext: Drop unused scx_find_sub_sched() stub
sched_ext: Move scx_error() out of scx_link_sched()'s lock region
Dmitry Osipenko [Fri, 1 May 2026 00:00:43 +0000 (03:00 +0300)]
drm/virtio: Extend blob UAPI with deferred-mapping hinting
If userspace never maps GEM object, then BO wastes hostmem space
because VirtIO-GPU driver maps VRAM BO at the BO's creating time.
Make mappings on-demand by adding new RESOURCE_CREATE_BLOB IOCTL/UAPI
hinting flag telling that host mapping should be deferred until first
mapping is made when the flag is set by userspace.
Chaitanya Sabnis [Wed, 13 May 2026 12:37:57 +0000 (18:07 +0530)]
dt-bindings: i2c: convert davinci i2c to dt-schema
Convert the Texas Instruments DaVinci and Keystone I2C controller
bindings from legacy text format to modern dt-schema (YAML).
During the conversion, the `interrupts` property was made required
to match the strict requirement in the driver probe function. The
custom `ti,has-pfunc` and `power-domains` properties were also
properly defined to match SoC-specific hardware features.
Linus Torvalds [Wed, 13 May 2026 21:56:31 +0000 (14:56 -0700)]
Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
Linus Torvalds [Wed, 13 May 2026 21:49:13 +0000 (14:49 -0700)]
Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path
- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
set with no owner
* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
workqueue: Release PENDING in __queue_work() drain/destroy reject path
Mikko Perttunen [Tue, 21 Apr 2026 04:02:38 +0000 (13:02 +0900)]
drm/msm: Fix iommu_map_sgtable() return value check and avoid WARN
Commit "iommu: return full error code from iommu_map_sg[_atomic]()"
changed iommu_map_sgtable() to return an ssize_t and negative values
in error cases, rather than a size_t and a zero.
Store the return value in the appropriate type and in case of error,
return it rather than WARNing.
Fixes: ad8f36e4b6b1 ("iommu: return full error code from iommu_map_sg[_atomic]()") Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Patchwork: https://patchwork.freedesktop.org/patch/719685/
Message-ID: <20260421-iommu_map_sgtable-return-v1-3-fb484c07d2a1@nvidia.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Rob Clark [Sat, 11 Apr 2026 15:03:12 +0000 (08:03 -0700)]
drm/msm/a6xx: Restore sysprof_active
This got lost in the shuffle somehow when moving the vfunc table to
catalogue. Fixes inhibiting IFPC when userspace is collecting perfcntr
data.
Fixes: 491fadb2b818 ("drm/msm/adreno: Move adreno_gpu_func to catalogue") Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com> Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/717780/
Message-ID: <20260411150312.257937-1-robin.clark@oss.qualcomm.com>
drm/msm/adreno: fix userspace-triggered crash on a2xx-a4xx
Before a5xx Adreno driver will not try fetching UBWC params (because
those generations didn't support UBWC anyway), however it's still
possible to query UBWC-related params from the userspace, triggering
possible NULL pointer dereference. Check for UBWC config in
adreno_get_param() and return sane defaults if there is none.
Fixes: a452510aad53 ("drm/msm/adreno: Switch to the common UBWC config struct") Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/717778/
Message-ID: <20260411-adreno-fix-ubwc-v3-1-4983156f3f80@oss.qualcomm.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Jeremy Laratro [Tue, 12 May 2026 23:26:16 +0000 (08:26 +0900)]
ksmbd: fix null pointer dereference in compare_guid_key()
session_fd_check() walks the per-inode m_op_list during durable-handle
session teardown and sets op->conn = NULL for every opinfo whose conn
matched the closing session's connection. The matching opinfo, however,
stays linked in its per-ClientGuid lease_table_list entry's lb->lease_list
because destroy_lease_table() only runs on full TCP-connection teardown,
not on SESSION_LOGOFF.
If the same TCP connection then negotiates a fresh session with the
same ClientGuid (ClientGuid is bound to NEGOTIATE, not the session, and
is unchanged across LOGOFF + SETUP) and issues a SMB2 CREATE with a
lease context on a different inode, find_same_lease_key() walks
lb->lease_list, reaches the stale opinfo, and calls compare_guid_key(),
which unconditionally dereferences opinfo->conn->ClientGUID. The conn
pointer is NULL and the kernel panics.
Reproducer requires only a successful SMB2 SESSION_SETUP and a share
configured with 'durable handles = yes'. KASAN report on mainline 70390501d194:
general protection fault, probably for non-canonical address
0xdffffc0000000069: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000348-0x000000000000034f]
Workqueue: ksmbd-io handle_ksmbd_work
RIP: 0010:bcmp+0x5b/0x230
Call Trace:
compare_guid_key+0x4b/0xd0
find_same_lease_key+0x324/0x690
smb2_open+0x6aea/0x8e60
handle_ksmbd_work+0x796/0xee0
...
Faulting address 0x348 is the offset of ClientGUID within struct
ksmbd_conn, confirming opinfo->conn was NULL.
Read opinfo->conn once and bail out if it has been cleared by a
concurrent session_fd_check(). A half-detached opinfo cannot be the
owner of an active lease, so returning 0 is the correct match result.
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Jeremy Laratro [Tue, 12 May 2026 23:23:26 +0000 (08:23 +0900)]
ksmbd: fix null pointer dereference in proc_show_files()
When a SMB2 client opens a file with a durable v2 handle and then issues
SMB2 SESSION_LOGOFF, session_fd_check() clears fp->tcon = NULL on the
reconnectable file pointer but leaves the fp registered in global_ft.idr
until the durable scavenger fires (up to fp->durable_timeout seconds
later).
During that window any read of /proc/fs/ksmbd/files (mode 0400) panics
the kernel because proc_show_files() walks global_ft.idr and
unconditionally dereferences fp->tcon->id with no NULL guard.
Reproducer requires only a successful SMB2 SESSION_SETUP and a share
configured with 'durable handles = yes'. KASAN report on mainline 70390501d194:
general protection fault, probably for non-canonical address
0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:proc_show_files+0x118/0x740
Call Trace:
proc_show_files+0x118/0x740
seq_read_iter+0x4ef/0xe10
proc_reg_read_iter+0x1b7/0x280
...
Guard the dereference. A durable-disconnected fp legitimately has no
tcon; report its tree id as 0 rather than oopsing.
Fixes: b38f99c1217a ("ksmbd: add procfs interface for runtime monitoring and statistics") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Ferry Meng [Mon, 11 May 2026 13:18:16 +0000 (21:18 +0800)]
ksmbd: fix SID memory leak in set_posix_acl_entries_dacl() on overflow
Commit 299f962c0b02 ("ksmbd: use check_add_overflow() to prevent u16
DACL size overflow") added check_add_overflow() guards that break out
of the ACE-building loops in set_posix_acl_entries_dacl() when the
accumulated DACL size would wrap past 65535.
However, each iteration allocates a struct smb_sid via kmalloc_obj()
at the top of the loop and relies on the kfree(sid) call at the end
of the loop body (the 'pass_same_sid' label in the first loop, and
the explicit kfree at the tail of the second loop) to release it.
The newly introduced 'break' statements bypass those kfree() calls,
leaking the sid buffer every time an overflow is detected.
A malicious or malformed file with enough POSIX ACL entries to trip
the overflow check will leak one or more struct smb_sid allocations
on every request that touches the file's DACL, providing a trivial
kernel memory exhaustion vector.
Free sid before breaking out of the loops to plug the leak.
Fixes: 299f962c0b02 ("ksmbd: use check_add_overflow() to prevent u16 DACL size overflow") Cc: stable@vger.kernel.org Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Felix Gu [Fri, 23 Jan 2026 16:37:38 +0000 (00:37 +0800)]
drm/msm/adreno: Fix a reference leak in a6xx_gpu_init()
In a6xx_gpu_init(), node is obtained via of_parse_phandle().
While there was a manual of_node_put() at the end of the
common path, several early error returns would bypass this call,
resulting in a reference leak.
Fix this by using the __free(device_node) cleanup handler to
release the reference when the variable goes out of scope.
Fixes: 5a903a44a984 ("drm/msm/a6xx: Introduce GMU wrapper support") Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Patchwork: https://patchwork.freedesktop.org/patch/700661/
Message-ID: <20260124-a6xx_gpu-v1-1-fa0c8b2dcfb1@gmail.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Stefan Dösinger [Wed, 28 Jan 2026 19:13:17 +0000 (22:13 +0300)]
ARM: dts: zte: Add D-Link DWR-932M support
This adds base DT definition for zx297520v3 and one board that consumes it.
The stock kernel does not use the armv7 timer, but it seems to work
fine. The board has other board-specific timers that would need a driver
and I see no reason to bother with them since the arm standard timer
works.
The caveat is the non-standard GIC setup needed to handle the timer's
level-low PPI. This is the responsibility of the boot loader and
documented in Documentation/arch/arm/zte/zx297520v3.rst.
Reviewed-by: Linus Walleij <linusw@kernel.org> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Changes in
v8: Remove redundant label, use "arm,pl011" for uart0 and 2 too.
v6: Squash board + timer + uart patches into one
v5: Prepend the SoC name in the device specific DTS filename.
v4:
Declare all uarts
Remove the UART aliases for now. I can revisit this when I get my
hands on a board that exposes two UARTs.
Stefan Dösinger [Wed, 28 Jan 2026 19:12:44 +0000 (22:12 +0300)]
dt-bindings: arm: zte: Add D-Link DWR932M board based on zx297520v3 SoC
This adds a new binding file for ZTE, containing their zx297520v3 SoC
and one board (D-Link DWR-932M) based on it.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Changelog:
v6:
Removed extra boards, I'll add them when submitting their individual
DTS files. Rephrase the subject to add "zte" and remove the redundant
use of "binding".
Moved the devicetree bindings patch ahead of the implementation patches.
Moved the MAINTAINERS section from "ZX29" to "ARM/ZTE".
Commit dc220915ddb2 ("drm/msm: Fix GMEM_BASE for gen8") changed the
GMEM_BASE check from adreno_is_a650_family() & adreno_is_a740_family()
to family >= ADRENO_6XX_GEN4.
This inadvertently excluded A650 (ADRENO_6XX_GEN3), causing it to report
an incorrect GMEM_BASE which results in severe rendering corruption.
Update check to also include ADRENO_6XX_GEN3 to fix A650.
Fixes: dc220915ddb2 ("drm/msm: Fix GMEM_BASE for gen8") Signed-off-by: Alexander Koskovich <akoskovich@pm.me> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/711880/
Message-ID: <20260314-fix-gmem-base-a650-v1-1-3308f60cf74c@pm.me> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Stefan Dösinger [Tue, 27 Jan 2026 17:52:08 +0000 (20:52 +0300)]
ARM: zte: Add zx297520v3 platform support
This SoC is used in low end LTE-to-WiFi routers, for example some D-Link
DWR 932 revisions, ZTE K10, ZLT S10 4G, but also models that are branded
and sold by ISPs themselves. They are widespread in Africa, China,
Russia and Eastern Europe.
This SoC is a relative of the zx296702 and zx296718 that had some
upstream support until commit 89d4f98ae90d ("ARM: remove zte zx
platform").
Reviewed-by: Linus Walleij <linusw@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Patch changelog:
v8:
* Select ARM_PSCI_FW (Sashiko). This is an issue make defconfig pointed
out in the last patch in this series. The board does not have PSCI
firmware as far as I can tell, but the ARM_GIC_V3 option indirectly
assumes ARM_PSCI_FW is enabled.
* Include <linux/init.h> in the board file for __initdata (Sashiko),
removed other includes copypasted from another platform that aren't
needed. Let's see if Sashiko agrees.
* Add the SoC documentation to the documentation index (Sashiko)
* Add the SoC documentation to MAINTAINERS (Sashiko)
* Removed redundant if ARCH_ZTE (Sashiko)
* Point towards a sane (USB-Only) U-Boot and modify the example code for
booting from NAND to detect already fixed GIC setups.
Matt Evans [Mon, 11 May 2026 14:46:42 +0000 (07:46 -0700)]
vfio/pci: Make VFIO_PCI_OFFSET_TO_INDEX() return unsigned
VFIO_PCI_OFFSET_TO_INDEX() is used in several places with a signed
parameter (e.g. loff_t). Because it makes no sense for a BAR/resource
index to be negative, enforce this in the macro.
This fixes at least one current issue, where vfio_pci_ioeventfd() uses
this macro with an unvalidated signed loff_t returned into a signed
type, leading to a possible negative array access. This instance does
test against an out-of-bounds positive value, so treating the index as
unsigned fixes this issue.
Thomas Gleixner [Wed, 13 May 2026 12:59:29 +0000 (14:59 +0200)]
hrtimer: Fix the bogus return type of __hrtimer_start_range_ns()
__hrtimer_start_range_ns() has a bool return type, but returns actually
three different values, which are checked at the call site.
Make the return type int.
Fixes: bd5956166d20 ("hrtimer: Provide hrtimer_start_range_ns_user()") Reported-by: Dan Carpenter <error27@gmail.com Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Andrea Righi [Wed, 13 May 2026 11:24:38 +0000 (13:24 +0200)]
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.
Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:
In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.
Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.
Alex Williamson [Thu, 7 May 2026 14:35:46 +0000 (08:35 -0600)]
vfio/pci: fix dma-buf kref underflow after revoke
vfio_pci_dma_buf_move(revoked=true) and vfio_pci_dma_buf_cleanup()
ran the same drain sequence: set priv->revoked, invalidate mappings,
wait for fences, drop the registered kref, wait for completion.
When the VFIO device fd was closed after PCI_COMMAND_MEMORY had been
cleared, both ran in turn -- the second kref_put underflowed and the
subsequent wait_for_completion() blocked on a completion that the
first run had already consumed:
Collapse the duplication: vfio_pci_dma_buf_cleanup() now delegates
the drain to vfio_pci_dma_buf_move(true), which is idempotent for
already-revoked dma-bufs. cleanup retains only list removal and
the device registration drop; the dma_resv_lock that bracketed
those is dropped along with the in-line drain that required it,
memory_lock continues to protect them.
Re-arm the kref and the completion at the end of move()'s revoke
branch so post-revoke state matches post-creation (kref == 1,
completion ready). This keeps cleanup's call into move() a no-op
when revoke already ran, and replaces the explicit kref_init() that
the un-revoke branch used to perform for the un-revoke -> remap
path.
Fixes: 1a8a5227f229 ("vfio: Wait for dma-buf invalidation to complete") Reported-by: Joonas Kylmälä <joonas.kylmala@netum.fi> Closes: https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/ Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Reviewed-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260507143548.1018405-1-alex.williamson@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
block: pass a minsize argument to bio_iov_iter_bounce
When bouncing for block size > PAGE_SIZE file systems that require
file system block size alignment (e.g. zoned XFS), the bio needs to
be big enough to fit an entire block.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Henry Tseng [Wed, 13 May 2026 10:28:46 +0000 (18:28 +0800)]
cpufreq: intel_pstate: Use HYBRID_SCALING_FACTOR_ADL for Bartlett Lake
Bartlett Lake P-core only SKUs (e.g. Intel Core 9 273PE)
do not report X86_FEATURE_HYBRID_CPU and are not in
intel_hybrid_scaling_factor[]. In hwp_get_cpu_scaling(), the
non-hybrid fallback then applies core_get_scaling() (100000),
producing cpuinfo_max_freq values that exceed the documented Max
Turbo Frequency:
Per the Intel datasheet [1], the Intel Core 9 273PE specifies:
Performance-cores: 12
Efficient-cores: 0
Max Turbo Frequency: 5.7 GHz
Intel Thermal Velocity Boost Frequency: 5.7 GHz
Intel Turbo Boost Max Technology 3.0 Frequency: 5.6 GHz
Performance-core Max Turbo Frequency: 5.4 GHz
Performance-core Base Frequency: 2.3 GHz
Bartlett Lake P-cores are Raptor Cove cores, per
commit d466304c4322 ("x86/cpu: Add CPU model number for Bartlett
Lake CPUs with Raptor Cove cores"). The Alder Lake and Raptor Lake
P-core entries in intel_hybrid_scaling_factor[] use
HYBRID_SCALING_FACTOR_ADL (78741). The same factor applies to
Bartlett Lake.
Add Bartlett Lake to intel_hybrid_scaling_factor[] with
HYBRID_SCALING_FACTOR_ADL so HWP performance levels map to the
correct CPU frequencies matching the datasheet's Max Turbo Frequency.
cpufreq: intel_pstate: Use correct scaling factor on Raptor Lake-E
Raptor Lake-E has the same processor ID as Raptor Lake-S, so there is
an entry in intel_hybrid_scaling_factor[] for it. It does not contain
E-cores though and hybrid_get_cpu_type() returns 0 for its P-cores, so
they get the default "core" scaling factor. However, the original
Raptor Lake scaling factor for P-cores still needs to be used for
mapping the HWP performance levels of the P-cores in Raptor Lake-E to
frequency, as though they were part of a real hybrid system.
To address this, update hwp_get_cpu_scaling() to return
hybrid_scaling_factor, which is the P-core scaling factor
retrieved from intel_hybrid_scaling_factor[], for all CPUs
that are not enumerated as E-cores.
Documentation: intel_pstate: Fix description of asymmetric packing with SMT
Patchset [1], including commits
046a5a95c3b0 ("x86/sched/itmt: Give all SMT siblings of a core the same priority") 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags")
overhauled asym_packing handling in the scheduler on x86 hybrid
processors with SMT. It removed SD_ASYM_PACKING from the x86 SMT
scheduling domain and made all SMT siblings of a core share the same
priority. As a result, asym_packing operates only across physical
cores, spreading tasks among them and only using idle SMT siblings
once all physical cores are busy.
Ethan Tidmore [Fri, 8 May 2026 01:57:16 +0000 (20:57 -0500)]
memory: tegra: Fix possible null pointer dereference
The function tegra114_emc_find_timing() has the possibility of returning
null and it's return value 'timing' is dereferenced before it is
checked for null.
Place dereference after null pointer check.
Detected by Smatch:
drivers/memory/tegra/tegra114-emc.c:520 tegra114_emc_prepare_timing_change()
warn: variable dereferenced before check 'timing' (see line 515)
Nicholas Carlini [Mon, 11 May 2026 18:02:16 +0000 (18:02 +0000)]
io-wq: check that the predecessor is hashed in io_wq_remove_pending()
io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled
work was the tail of its hash bucket. When doing this, it checks whether
the preceding entry in acct->work_list has the same hash value, but
never checks that the predecessor is hashed at all. io_get_work_hash()
is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash
bits are never set for non-hashed work, so it returns 0. Thus, when a
hashed bucket-0 work is cancelled while a non-hashed work is its list
predecessor, the check spuriously passes and a pointer to the non-hashed
io_kiocb is stored in wq->hash_tail[0].
Because non-hashed work is dequeued via the fast path in
io_get_next_work(), which never touches hash_tail[], the stale pointer
is never cleared. Therefore, after the non-hashed io_kiocb completes and
is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The
io_wq is per-task (tctx->io_wq) and survives ring open/close, so the
dangling pointer persists for the lifetime of the task; the next hashed
bucket-0 enqueue dereferences it in io_wq_insert_work() and
wq_list_add_after() writes through freed memory.
Add the missing io_wq_is_hashed() check so a non-hashed predecessor
never inherits a hash_tail[] slot.
Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
thermal: hwmon: Use extra_groups for adding temperature attributes
Instead of passing NULL as the last argument to __hwmon_device_register()
in hwmon_device_register_for_thermal() and then adding each temperature
sysfs attribute to the hwmon device via device_create_file(), redefine
hwmon_device_register_for_thermal() to take an extra_groups argument
that will be passed to __hwmon_device_register(), define an attribute
group with a proper .is_visible() callback for the temperature
attributes and a related attribute groups pointer, and pass the latter
to hwmon_device_register_for_thermal().
This causes the code to be way more straightforward and closer to
what the other users of the hwmon subsystem do.
When thermal_zone0 is removed, say because the ACPI thermal driver is
unbound from the underlying platform device, the removal code skips the
removal of hwmon0 because of the temp2_input attribute belonging to
thermal_zone1 which effectively prevents thermal_zone0 removal from
making progress.
To address this problem, rework the thermal hwmon code to register one
hwmon device for each thermal zone, but since user space utilities
produce confusing output in some cases when there are multiple hwmon
devices with the same name attribute value present under thermal zones
of the same type, append the thermal zone ID preceded by an underline
character to the name of the hwmon device registered for that thermal
zone.
thermal: hwmon: Fix critical temperature attribute removal
Since the return value of thermal_zone_crit_temp_valid() depends on
the behavior of the thermal zone .get_crit_temp() callback which
may change over time in theory, thermal_remove_hwmon_sysfs() may
attempt to remove a critical temperature attribute that has not
been created, passing a pointer to an uninitialized attribute
structure to device_remove_file().
To avoid that, set a flag in struct thermal_hwmon_temp after creating
a critical temperature attribute and use the value of that flag to
decide whether or not the attribute needs to be removed.
Fixes: e8db5d6736a7 ("thermal: hwmon: Make the check for critical temp valid consistent") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2437056.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Lezcano [Fri, 8 May 2026 18:05:10 +0000 (20:05 +0200)]
thermal/core: Allocate the thermal class dynamically
Use class_create() instead of a statically allocated struct class.
This allows the thermal class to be managed through a dynamically
allocated class object and avoids keeping a static class instance
around.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Added __ro_after_init to thermal_class ]
[ rjw: Used temporary local var to store class_create() return value ] Link: https://patch.msgid.link/20260508180511.1306659-4-daniel.lezcano@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Lezcano [Fri, 8 May 2026 18:05:09 +0000 (20:05 +0200)]
thermal/core: Add dedicated release callback for thermal zones
The thermal class release callback currently handles thermal zone
cleanup by checking the device name prefix.
Move the thermal zone cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the thermal zone lifetime handling local to the thermal
zone object.
Daniel Lezcano [Fri, 8 May 2026 18:05:08 +0000 (20:05 +0200)]
thermal/core: Add dedicated release callback for cooling devices
The thermal class release callback currently handles both thermal
zones and cooling devices by checking the device name prefix.
Move the cooling device cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the cooling device lifetime handling local to the
cooling device object.
Lists should have fixed constraints, because binding must be specific in
respect to hardware. Add missing constraints to number of iommus in
Mediatek media devices and remove completely redundant and obvious
description.
Cheng-Yang Chou [Wed, 13 May 2026 08:17:11 +0000 (16:17 +0800)]
tools/sched_ext: scx_qmap: Fix qa arena placement
__arena is a pointer qualifier meaning "this pointer points to arena
memory". When used on a global variable declaration, it expands to
nothing in scx's build because __BPF_FEATURE_ADDR_SPACE_CAST is never
defined, leaving qa as a plain global in BSS. bpftool then generates
skel->bss->qa instead of the expected skel->arena->qa, causing:
scx_qmap.c: error: 'struct scx_qmap' has no member named 'arena'
__arena_global is the correct annotation for global variables that
reside in the arena. When __BPF_FEATURE_ADDR_SPACE_CAST is not defined
it expands to SEC(".addr_space.1"), placing qa in the arena ELF section.
When __BPF_FEATURE_ADDR_SPACE_CAST is defined it expands to
__attribute__((address_space(1))). In both cases bpftool generates the
typed skel->arena accessor.
Fixes: 60a59eaca71b ("sched_ext: scx_qmap: move globals and cpu_ctx into a BPF arena map") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().
Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
# b1 becomes partition root with CPUs 1-2, but sibling exclusion
# reduces its effective_xcpus to CPU 2 only
echo "1-2" > b1/cpuset.cpus
echo "root" > b1/cpuset.cpus.partition
Linus Torvalds [Wed, 13 May 2026 18:53:51 +0000 (11:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solving the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
Yu Miao [Wed, 13 May 2026 02:39:07 +0000 (10:39 +0800)]
selftests/cgroup: Fix error path leaks in test_percpu_basic
When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.
When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.
RDMA/bnxt_re: zero shared page before exposing to userspace
bnxt_re_alloc_ucontext() allocates uctx->shpg via
__get_free_page(GFP_KERNEL). The buddy allocator does not zero pages
without __GFP_ZERO, so the page contains stale kernel data from
whatever object most recently freed it.
The page is then mapped into userspace via vm_insert_page() under
BNXT_RE_MMAP_SH_PAGE in bnxt_re_mmap(). The driver only ever writes
4 bytes (a u32 AVID) at offset BNXT_RE_AVID_OFFT (0x10) inside
bnxt_re_create_ah(); the remaining 4092 bytes of the page are exposed
to userspace unsanitised, leaking kernel memory contents.
Any user with access to /dev/infiniband/uverbsX on a host with a
bnxt_re device (typically rdma group membership) can read this data
via a single mmap() at pgoff 0 after IB_USER_VERBS_CMD_GET_CONTEXT.
Other shared pages in the same file already use get_zeroed_page()
correctly:
Yi Lai [Thu, 7 May 2026 12:51:06 +0000 (20:51 +0800)]
selftests/rdma: explicitly skip tests when required modules are missing
Currently, the rdma rxe selftests fail with an exit code of 1 when
required kernel modules are not present. This causes spurious failures
in environments where these modules might not be compiled or available.
Include the standard kselftest 'ktap_helpers.sh' and replace the
hardcoded error exits with '$KSFT_SKIP'. This ensures the tests are
properly marked as skipped rather than failed.
Fixes: e01027cab38a ("RDMA/rxe: Add testcase for net namespace rxe") Signed-off-by: Yi Lai <yi1.lai@intel.com> Link: https://patch.msgid.link/20260507125106.3114167-1-yi1.lai@intel.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
Rick Edgecombe [Fri, 10 Apr 2026 23:26:54 +0000 (16:26 -0700)]
KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()
Rework tdx_has_emulated_msr() to explicitly enumerate the x2APIC MSRs
that KVM can emulate, instead of trying to enumerate the MSRs that KVM
cannot emulate. Drop the inner switch and list the emulatable x2APIC
registers directly in the outer switch's "return true" block.
The old code had multiple bugs in the x2APIC range handling.
X2APIC_MSR(APIC_ISR + APIC_ISR_NR) was incorrect because APIC_ISR_NR is
0x8, not 0x80, so the X2APIC_MSR() shift lost the lower bits, collapsing
each range to a single MSR. IA32_X2APIC_SELF_IPI was also missing from
the non-emulatable list. Note, these bugs are relatively benign, as they
only affect a guest that is requesting "bogus" emulation.
KVM has no visibility into whether or not a guest has enabled #VE
reduction, which changes which MSRs the TDX-Module handles itself versus
triggering a #VE for the guest to make a TDVMCALL. So maintaining a list
of non-emulatable MSRs is fragile. Listing only the MSRs KVM can always
emulate sidesteps the problem.
Suggested-by: Sean Christopherson <seanjc@google.com> Reported-by: Dmytro Maluka <dmaluka@chromium.org> Closes: https://lore.kernel.org/all/20260318190111.1041924-1-dmaluka@chromium.org Fixes: dd50294f3e3c ("KVM: TDX: Implement callbacks for MSR operations") Assisted-by: Claude:claude-opus-4-6
[based on a diff from Sean, but added missed LVTCMCI case, log] Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260410232654.3864196-1-rick.p.edgecombe@intel.com
[sean: call out the bugs are relatively benign, expand comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86: Make "external SPTE" ops that can fail RET0 static calls
Define kvm_x86_ops .link_external_spt(), .set_external_spte(), and
.free_external_spt() as RET0 static calls so that an unexpected call to a
a default operation doesn't consume garbage.
Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables") Fixes: 94faba8999b9 ("KVM: x86/tdp_mmu: Propagate tearing down mirror page tables") Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://patch.msgid.link/20260129011517.3545883-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: TDX: Account all non-transient page allocations for per-TD structures
Account all non-transient allocations associated with a single TD (or its
vCPUs), as KVM's ABI is that allocations that are active for the lifetime
of a VM are accounted. Leave temporary allocations, i.e. allocations that
are freed within a single function/ioctl, unaccounted, to again align with
KVM's existing behavior, e.g. see commit dd103407ca31 ("KVM: X86: Remove
unnecessary GFP_KERNEL_ACCOUNT for temporary variables").
Fixes: 8d032b683c29 ("KVM: TDX: create/destroy VM structure") Fixes: a50f673f25e0 ("KVM: TDX: Do TDX specific vcpu initialization") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://patch.msgid.link/20260129011517.3545883-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Johan Hovold [Fri, 8 May 2026 14:44:46 +0000 (16:44 +0200)]
drm/gma500/oaktrail_lvds: fix i2c adapter leaks on init
The LVDS init code looks up an I2C adapter using i2c_get_adapter() and
tries to read the EDID before falling back to allocating and registering
its own adapter.
Make sure to drop the references taken by i2c_get_adapter() when falling
back to allocating an adapter as well as on late errors to allow the
looked up adapter to be deregistered.
Johan Hovold [Fri, 8 May 2026 14:44:45 +0000 (16:44 +0200)]
drm/gma500/oaktrail_lvds: fix hang on init failure
The LVDS init code looks up an I2C adapter using i2c_get_adapter() and
tries to read the EDID before falling back to allocating and registering
its own adapter.
The error handling does not separate these cases so on a late init
failure it will try to deregister and free also an adapter that had
previously been registered. Since i2c_get_adapter() takes another
reference to the adapter, deregistration hangs indefinitely while
waiting for the reference to be released.
Fix this by only destroying adapters allocated during LVDS init on
errors.
Fixes: a57ebfc0b4da ("drm/gma500: Make oaktrail lvds use ddc adapter from drm_connector") Cc: stable@vger.kernel.org # 6.0 Cc: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com> Link: https://patch.msgid.link/20260508144446.59722-3-johan@kernel.org
KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
Pass a pointer to iter->old_spte, not simply its value, when setting an
external SPTE in __tdp_mmu_set_spte_atomic(), so that the iterator's value
will be updated if the cmpxchg64 to freeze the mirror SPTE fails. The bug
is currently benign as TDX is mutualy exclusive with all paths that do
"local" retry", e.g. clear_dirty_gfn_range() and wrprot_gfn_range().
Fixes: 77ac7079e66d ("KVM: x86/tdp_mmu: Propagate building mirror page tables") Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://patch.msgid.link/20260129011517.3545883-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level
Rework the TDX APIs to take the kernel's 1-based pg_level enum, not the
TDX-Module's 0-based level. The APIs are _kernel_ APIs, not TDX-Module
APIs, and the kernel (and KVM) uses "enum pg_level" literally everywhere.
Using "enum pg_level" eliminates ambiguity when looking at the APIs (it's
NOT clear that "int level" refers to the TDX-Module's level), and will
allow for using existing helpers like page_level_size() when support for
hugepages is added to the S-EPT APIs.
No functional change intended.
Cc: Kai Huang <kai.huang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerley Tng <ackerleytng@google.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://patch.msgid.link/20260129011517.3545883-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Carlos López [Fri, 13 Mar 2026 12:20:42 +0000 (13:20 +0100)]
KVM: VFIO: update coherency only if file was deleted
When servicing a KVM_DEV_VFIO_FILE_DEL request, if a file is removed
from kv->file_list, kv->noncoherent needs to be updated, in case we
can revert to using coherent DMA. However, if we found no candidate
to remove, there is no need to re-scan the list, so do it only if a
matching file was found.
To simplify the control flow, use a mutex guard so that we can return
early from within the search loop if the maching file is found.
Carlos López [Fri, 13 Mar 2026 12:20:41 +0000 (13:20 +0100)]
KVM: VFIO: deduplicate file release logic
There are two callsites which destroy files in kv->file_list: the
function servicing KVM_DEV_VFIO_FILE_DEL, and the relase of the whole
KVM VFIO device. The process involves several steps, so move all those
into a single function, removing duplicate code.
Carlos López [Fri, 13 Mar 2026 12:20:40 +0000 (13:20 +0100)]
KVM: VFIO: use mutex guard in kvm_vfio_file_set_spapr_tce()
Use a mutex guard to hold a lock for the entirety of the function, which
removes the need for a goto (whose label even has a misleading name
since 8152f8201088 ("fdget(), more trivial conversions"))
Michal Wajdeczko [Mon, 11 May 2026 17:28:38 +0000 (19:28 +0200)]
drm/xe/memirq: Enable GT_MI_USER_INTERRUPT only
We only expect and handle the GT_MI_USER_INTERRUPT from the
engines, there is no point in enabling other interrupts, like
GT_CONTEXT_SWITCH_INTERRUPT, if we don't intent to handle them.
Michal Wajdeczko [Mon, 11 May 2026 17:28:37 +0000 (19:28 +0200)]
drm/xe/memirq: Update interrupt handler logic
To workaround some corner case hardware limitations, new programming
note for the memory based interrupt handler suggests to assume that
some status bytes, like GT_MI_USER_INTERRUPT and GUC_INTR_GUC2HOST,
are always set. Update our interrupt handler to follow the new rules.
Bspec: 53672 Fixes: a6581ebe7685 ("drm/xe/vf: Introduce Memory Based Interrupts Handler") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> Link: https://patch.msgid.link/20260511172838.2299-2-michal.wajdeczko@intel.com
Carlos López [Fri, 13 Mar 2026 12:20:39 +0000 (13:20 +0100)]
KVM: VFIO: clean up control flow in kvm_vfio_file_add()
The struct file that this function fgets() is always passed to fput()
before returning, so use automatic cleanup via __free() to avoid several
jumps to the end of the function. Similarly, use a mutex guard to
completely remove the need to use gotos.
KVM: selftests: memslot_perf_test: make host wait timeout configurable
When memslot_perf_test is run on the Qemu Risc-V Virt machine,
sometimes the RW subtest fails due to sigalarm, indicating that the
guest sync did not finish within the expected duration of 10 seconds.
Since the current timeout value is itself a bump up from the original
2s, making the host timeout value configurable via a new command line
parameter. The test can be invoked with '-t' option to set a suitable
timeout value for the host.
lib/string_helpers: drop redundant allocation in kasprintf_strarray
kasprintf_strarray() returns an array of N strings and kfree_strarray()
also frees N entries. However, kasprintf_strarray() currently allocates
N+1 char pointers. Allocate exactly N pointers instead of N+1.
Kees Cook [Wed, 13 May 2026 05:08:07 +0000 (22:08 -0700)]
libbpf: Use strscpy() in kernel code for skel_map_create()
Linux has deprecated[1] strncpy(), and the use in skel_map_create()
is best replaced with strscpy(). Since we still need to build this
file in userspace, leave the strncpy() in place in that case. This
is the last use of strncpy() in the kernel.
KVM: x86: Drop superfluous caching of KVM_ASYNC_PF_SEND_ALWAYS
Drop kvm_vcpu_arch.apf.send_always and instead use msr_en_val as the source
of truth to reduce the probability of operating on stale data. This fixes
flaws where KVM fails to update send_always when APF is explicitly
disabled by the guest or implicitly disabled by KVM on INIT. Absent other
bugs, the flaws are benign as KVM *shouldn't* consume send_always when PV
APF support is disabled.
Simply delete the field, as there's zero benefit to maintaining a separate
"cache" of the state.
Opportunistically turn the enabled vs. disabled logic at the end of
kvm_pv_enable_async_pf() into an if-else instead of using an early return,
e.g. so that it's more obvious that both paths are "success" paths.
Fixes: 6adba5274206 ("KVM: Let host know whether the guest can handle async PF in non-userspace context.") Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260406225359.1245490-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86: Drop superfluous caching of KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT
Drop kvm_vcpu_arch.apf.delivery_as_pf_vmexit and instead use msr_en_val as
the source of truth to reduce the probability of operating on stale data.
This fixes flaws where KVM fails to update delivery_as_pf_vmexit when APF
is explicitly disabled by the guest or implicitly disabled by KVM on INIT.
Absent other bugs, the flaws are benign as KVM *shouldn't* consume
delivery_as_pf_vmexit when PV APF support is disabled.
Simply delete the field, as there's zero benefit to maintaining a separate
"cache" of the state.
Fixes: 52a5c155cf79 ("KVM: async_pf: Let guest support delivery of async_pf from guest mode") Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://patch.msgid.link/20260406225359.1245490-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Ethan Yang [Mon, 6 Apr 2026 22:53:56 +0000 (15:53 -0700)]
KVM: x86: Don't leave APF half-enabled on bad APF data GPA
kvm_pv_enable_async_pf() updates vcpu->arch.apf.msr_en_val before
initializing the APF data gfn_to_hva cache. If userspace provides an
invalid GPA, kvm_gfn_to_hva_cache_init() fails, but msr_en_val stays
enabled and leaves APF state half-initialized.
Later APF paths can then try to use the empty cache and trigger
WARN_ON() in kvm_read_guest_offset_cached().
Determine the new APF enabled state from the incoming MSR value, do cache
initialization first on the enable path, and commit msr_en_val only after
successful initialization. Keep the disable path behavior unchanged.
Reported-by: syzbot+bc0e18379a290e5edfe4@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bc0e18379a290e5edfe4 Fixes: 344d9588a9df ("KVM: Add PV MSR to enable asynchronous page faults delivery.") Link: https://lore.kernel.org/r/aHfD3MczrDpzDX9O@google.com Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Ethan Yang <ethan.yang.kernel@gmail.com>
[sean: don't bother with a local "enable" variable] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260406225359.1245490-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: selftests: Guard execinfo.h inclusion for non-glibc builds
The backtrace() function and execinfo.h are GNU extensions available
in glibc but not in non-glibc C libraries such as musl. Building KVM
selftests with musl-gcc fails with:
lib/assert.c:9:10: fatal error: execinfo.h: No such file or directory
Fix this by guarding the inclusion of execinfo.h and the stack dumping
logic under #ifdef __GLIBC__. For non-glibc builds, provide a local
stub for test_dump_stack().
Suggested-by: Aqib Faruqui <aqibaf@amazon.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Hisam Mehboob <hisamshar@gmail.com> Link: https://patch.msgid.link/20260409153846.1502656-2-hisamshar@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Peter Fang [Wed, 8 Apr 2026 00:11:28 +0000 (17:11 -0700)]
KVM: Fix kvm_vcpu_map[_readonly]() function prototypes
kvm_vcpu_map() and kvm_vcpu_map_readonly() should take a gfn instead of
a gpa. This appears to be a result of the original kvm_vcpu_map() being
declared with the wrong function prototype in kvm_host.h, even though
it was correct in the actual implementation in kvm_main.c.
No actual harm has been done yet as all of the call sites are correctly
passing in a gfn. Plus, both gfn_t and gpa_t are typedef'd to u64 so
this change shouldn't have any functional impact.
Compile-tested on x86 and ppc, which are the current users of these
interfaces.
Fixes: e45adf665a53 ("KVM: Introduce a new guest mapping API") Cc: KarimAllah Ahmed <karahmed@amazon.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Peter Fang <peter.fang@intel.com> Reviewed-by: Yosry Ahmed <yosry@kernel.org> Link: https://patch.msgid.link/20260408001137.3290444-2-peter.fang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Lei Chen [Thu, 9 Apr 2026 14:22:26 +0000 (22:22 +0800)]
KVM: x86: Rate-limit global clock updates on vCPU load
commit 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"")
dropped the rate limiting for KVM_REQ_GLOBAL_CLOCK_UPDATE.
As a result, kvm_arch_vcpu_load() can queue global clock update requests
every time a vCPU is scheduled when the master clock is disabled or when
the vCPU is loaded for the first time.
Restore the throttling with a per-VM ratelimit state and gate
KVM_REQ_GLOBAL_CLOCK_UPDATE through __ratelimit(), so frequent vCPU
scheduling does not generate a steady stream of redundant clock update
requests.
Fixes: 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"") Signed-off-by: Lei Chen <lei.chen@smartx.com> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/all/CAK8fFZ5gY8_Mw2A=iZVFNVKQNrXQzVsn-HTd+Me9K6ZfmdgA+Q@mail.gmail.com/ Link: https://patch.msgid.link/20260409142226.2581-1-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Ashutosh Desai [Fri, 1 May 2026 20:35:32 +0000 (13:35 -0700)]
KVM: SVM: Fix page overflow in sev_dbg_crypt() for ENCRYPT path
In sev_dbg_crypt(), the per-iteration transfer length is bounded by
the source page offset (PAGE_SIZE - s_off) but not by the destination
page offset (PAGE_SIZE - d_off). When d_off > s_off, the encrypt
path (__sev_dbg_encrypt_user) performs a read-modify-write using a
single-page intermediate buffer (dst_tpage):
1. __sev_dbg_decrypt() expands the size to round_up(len + (d_off & 15), 16)
before issuing the PSP command. If len + (d_off & 15) > PAGE_SIZE,
the PSP writes beyond the end of the 4096-byte dst_tpage allocation.
2. The subsequent memcpy()/copy_from_user() into
page_address(dst_tpage) + (d_off & 15) of 'len' bytes overflows
by up to 15 bytes under the same condition.
Trigger example: s_off = 0, d_off = 1, debug.len = PAGE_SIZE -
the PSP is instructed to write round_up(4097, 16) = 4112 bytes to
a 4096-byte buffer.
Fix by also bounding len by (PAGE_SIZE - d_off), the same check that
sev_send_update_data() already performs for its single-page guest
region.
==================================================================
BUG: KASAN: slab-use-after-free in sev_dbg_crypt+0x993/0xd10 [kvm_amd]
Write of size 4095 at addr ff110062293bb009 by task sev_dbg_test/228214
Memory state around the buggy address: ff110062293bbf00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff110062293bbf80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ff110062293bc000: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
^ ff110062293bc080: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ff110062293bc100: fa fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Fixes: 24f41fb23a39 ("KVM: SVM: Add support for SEV DEBUG_DECRYPT command") Fixes: 7d1594f5d94b ("KVM: SVM: Add support for SEV DEBUG_ENCRYPT command") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Desai <ashutoshdesai993@gmail.com>
[sean: add sample KASAN splat, Fixes, and stable@] Link: https://patch.msgid.link/20260501203537.2120074-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: selftests: Teach sev_*_test about revoking VM types
Instead of using CPUID, use the VM type bit to determine support, since
those now reflect the correct status of support by the kernel and firmware
configurations.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Tested-by: Tycho Andersen (AMD) <tycho@kernel.org> Link: https://patch.msgid.link/20260416232329.3408497-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: SEV: Don't advertise VM types that are disabled by firmware
As called out in a footnote for a recent SNP vulnerability[1], it is
possible for a specific flavor of SEV+ to be disabled by the firmware even
when the flavor is fully supported by the CPU and platform:
Applying mitigation CVE-2025-48514 will result in disabling SEV-ES when
SEV-SNP is enabled.
Restrict KVM's set of supported VM types based on the VM types that are
fully supported by firmware to avoid over-reporting what KVM can actually
support. Like KVM's handling of ASID space exhaustion, don't modify KVM's
CPUID capabilities, as the CPU/platform still supports the underlying
technology and clearing e.g. SEV_ES while advertising SEV_SNP would confuse
KVM and userspace.
KVM: SEV: Don't advertise support for unusable VM types
Commit 0aa6b90ef9d7 ("KVM: SVM: Add support for allowing zero SEV ASIDs")
made it possible to make it impossible to use SEV VMs by not allocating
them any ASIDs.
Commit 6c7c620585c6 ("KVM: SEV: Add SEV-SNP CipherTextHiding support") did
the same thing for SEV-ES.
Do not export KVM_X86_SEV(_ES)_VM as supported types if in either of these
situations, so that userspace can use them to determine what is actually
supported by the current kernel configuration.
Also move the buildup to a local variable so it is easier to add additional
masking in future patches.
KVM: SEV: Set supported SEV+ VM types during sev_hardware_setup()
Set the supported SEV+ VM types during sev_hardware_setup() instead of
waiting until sev_set_cpu_caps(). This will using the set of *fully*
supported VM types to print the enabled/unusable/disabled messaged.
For all intents and purposes, no functional change intended.
In some configurations, the firmware does not support all VM types. The SEV
firmware has an entry in the TCB_VERSION structure referred to as the
Security Version Number in the SEV-SNP firmware specification and referred
to as the "SPL" in SEV firmware release notes. The SEV firmware release
notes say:
On every SEV firmware release where a security mitigation has been
added, the SNP SPL gets increased by 1. This is to let users know that
it is important to update to this version.
The SEV firmware release that fixed CVE-2025-48514 by disabling SEV-ES
support on vulnerable platforms has this SVN increased to reflect the fix.
The SVN is platform-specific, as is the structure of TCB_VERSION.
Check CURRENT_TCB instead of REPORTED_TCB, since the firmware behaves with
the CURRENT_TCB SVN level and will reject SEV-ES VMs accordingly.
Parse the SVN, and mask off the SEV_ES supported VM type from the list of
supported types if it is above the per-platform threshold for the relevant
platforms.
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Tycho Andersen (AMD) <tycho@kernel.org> Link: https://patch.msgid.link/20260416232329.3408497-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>