Keith Busch [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)]
blk-mq: reinsert cached request to the list
A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.
Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable") Suggested-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://patch.msgid.link/20260526153531.2365935-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Ming [Wed, 20 May 2026 12:14:57 +0000 (20:14 +0800)]
cxl/test: Update mock dev array before calling platform_device_add()
CXL test environment hits the following error sometimes.
cxl_mem mem9: endpoint7 failed probe
All mock memdevs are platform firmware devices added by cxl_test module,
and cxl_test module also provides a platform device driver for them to
create a memdev device to CXL subsystem. cxl_test module uses
cxl_rcd/mem_single/mem arrays to store different types of mock memdevs.
CXL drivers calls registered mock functions for a mock memdev by
checking if a given memdev is in these arrays.
When cxl_test module adds these mock memdevs, it always calls
platform_device_add() before adding them to a suitable mock memdev
array. However, there is a small window where CXL drivers calls mock
function for a added memdev before it added to a mock memdev array. In
above case, cxl endpoint driver considers a added memdev was not a mock
memdev, then calling devm_cxl_endpoint_decoders_setup() for it rather
than mock_endpoint_decoders_setup().
An appropriate solution is that adding a new mock device to a mock
device array before calling platform_device_add() for it. It can
guarantee the new mock device is visible to CXL subsystem.
This patch introduces a new helped called cxl_mock_platform_device_add()
to handle the issue, and uses the function for all mock devices addition.
Fixes: 3a2b97b3210b ("cxl/test: Improve init-order fidelity relative to real-world systems") Signed-off-by: Li Ming <ming.li@zohomail.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20260520121457.234404-1-ming.li@zohomail.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Linus Torvalds [Tue, 26 May 2026 20:49:13 +0000 (13:49 -0700)]
Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
Petr Pavlu [Fri, 27 Mar 2026 07:59:03 +0000 (08:59 +0100)]
module, riscv: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].
Force sh_addr=0 for all riscv-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:02 +0000 (08:59 +0100)]
module, m68k: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].
Force sh_addr=0 for all m68k-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:01 +0000 (08:59 +0100)]
module, arm64: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm64-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:00 +0000 (08:59 +0100)]
module, arm: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm-specific module sections.
Linus Torvalds [Tue, 26 May 2026 20:37:26 +0000 (13:37 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fix from Shuah Khan:
"Fix a use-after-free in kunit debugfs when using kunit.filter when the
executor frees dynamically allocated resources after running boot-time
tests. This resulted in fatal hardware exception due to invalidation
of capability flags on the reclaimed memory on some architectures such
as CHERI RISC-V that support the feature, and silent memory corruption
on others.
The fix for this couples the lifetime of the filtered suite memory
allocation to the lifetime of the kunit subsystem and its associated
VFS nodes. Ownership of the boot-time suite_set is now transferred to
a global tracker ('kunit_boot_suites'), and the memory is cleanly
released in kunit_exit() during module teardown"
* tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: fix use-after-free in debugfs when using kunit.filter
Borislav Petkov [Wed, 13 May 2026 20:06:01 +0000 (22:06 +0200)]
x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest
Patch in Fixes: causes the usual:
unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
Call Trace:
early_init_intel
early_cpu_init
setup_arch
_printk
start_kernel
x86_64_start_reservations
x86_64_start_kernel
common_startup_64
because the kernel is booted in a guest.
In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.
The platform ID needs to be read as early as when microcode is loaded on
the BSP:
and by that time, CPUID leafs haven't been parsed yet.
The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.
Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure") Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
Mostafa Saleh [Tue, 26 May 2026 12:53:17 +0000 (12:53 +0000)]
irqchip/gic-v4: Don't advertise VLPIs if no ITS is probed
When accidentally setting “kvm-arm.vgic_v4_enable=1” on a system that has
no MSI controller device tree node and GICv4, it results a panic as
“gic_domain” is NULL and the kernel attempts to access it.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
Mem abort info:
ESR = 0x0000000096000006
Tvrtko Ursulin [Sat, 23 May 2026 10:34:18 +0000 (11:34 +0100)]
drm/xe: Assign queue name in time for drm_sched_init
Currently the queue name is only assigned after the drm scheduler instance
has been created. This loses information with all logging or debug
workqueue facilities so lets re-order things a bit so the name gets
assigned in time.
To be able to assign a GuC ID early we split the allocation into
reservation and publish phases.
First, with the submission state lock held, we reserve the ID in the GuC
ID manager, which serves as an authoritative source of truth. Then we can
drop the lock and reserve entries in the exec queue lookup XArray. This
can be lockless since the NULL entries are invisible both to the kernel
and userspace. Only after the queue has been fully created we replace the
reserved entries with the queue pointer, which can be done locklessly for
single width queues.
Kevin Cheng [Fri, 22 May 2026 23:26:57 +0000 (16:26 -0700)]
KVM: x86: Widen x86_exception's error_code to 64 bits
Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).
Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.
Piotr Zarycki [Sat, 23 May 2026 11:18:57 +0000 (13:18 +0200)]
KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET
Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.
KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap
In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).
Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.
KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs
Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.
KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0
When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed it so that negative values are correctly
detected. Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.
Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots") Reported-by: Bibo Mao <maobibo@loongson.cn> Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn Reviewed-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Zongyao Chen [Fri, 22 May 2026 17:21:50 +0000 (10:21 -0700)]
KVM: selftests: Test guest_memfd binding overlap without GPA overlap
The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot. KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.
Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.
Zongyao Chen [Fri, 22 May 2026 17:21:49 +0000 (10:21 -0700)]
KVM: guest_memfd: Return -EEXIST for overlapping bindings
KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path. That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.
Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.
Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST. Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.
Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST] Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Thomas Weißschuh [Mon, 25 May 2026 08:27:16 +0000 (10:27 +0200)]
selftests/nolibc: use mutable buffer for execve() argv string
The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.
Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible and with
ipq9574 as the fallback.
Aravind Anilraj [Sun, 29 Mar 2026 07:06:42 +0000 (03:06 -0400)]
thermal: intel: int340x: Check return value of ptc_create_groups()
proc_thermal_ptc_add() ignores the return value of ptc_create_groups()
causing the driver to silenty continue even if sysfs group creation
fails.
The thermal control interface would be unavailable with no indication
of failure.
Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.
Aravind Anilraj [Sun, 29 Mar 2026 07:06:41 +0000 (03:06 -0400)]
thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()
The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift amount pushes bits beyond 32, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.
Zhongqiu Han [Sun, 19 Apr 2026 13:26:54 +0000 (21:26 +0800)]
cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.
The race can be illustrated with two concurrent paths:
Path A (sysfs write, holds attr_set->update_lock):
Path B (work queue, wins the race between A1 and A2):
dbs_work_handler()
mutex_lock(&policy_dbs->update_mutex) /* acquired before A2 */
dbs_update()
ignore_nice = dbs_data->ignore_nice_load /* sees new value: 1 */
cur_nice = kcpustat_field(...)
idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
j_cdbs->prev_cpu_nice = cur_nice
mutex_unlock(&policy_dbs->update_mutex)
Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.
To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:
- gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
prev_cpu_idle, so both baselines share the same timestamp when
io_is_busy changes. This prevents an interval mismatch between
idle_time and nice_delta on the next dbs_update() when
ignore_nice_load is enabled.
- cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
so the baseline is always valid from the first dbs_update() call;
remove the ignore_nice guard and the now-unused ignore_nice variable.
Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor") Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor") Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Zhongqiu Han [Sun, 19 Apr 2026 13:26:53 +0000 (21:26 +0800)]
cpufreq: governor: Fix data races on per-CPU idle/nice baselines
gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().
Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:
Path A (sysfs write, holds attr_set->update_lock only):
Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.
Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.
Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex. A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex. cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable. Fix this by reading io_busy
inside the mutex.
The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().
Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.
Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor") Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor") Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tal Zussman [Mon, 25 May 2026 18:25:55 +0000 (14:25 -0400)]
block: remove blkdev_write_begin() and blkdev_write_end()
Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.
Yuho Choi [Mon, 25 May 2026 16:25:31 +0000 (12:25 -0400)]
mtip32xx: fix use-after-free on service thread failure
If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.
The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.
Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.
Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.
Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Denis Arefev [Thu, 21 May 2026 07:28:56 +0000 (10:28 +0300)]
block: Avoid mounting the bdev pseudo-filesystem in userspace
The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.
This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0. `inode->i_op = &empty_iops;`
Mateusz Nowicki [Sat, 23 May 2026 12:52:35 +0000 (12:52 +0000)]
block: switch numa_node to int in blk_mq_hw_ctx and init_request
numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.
Chao Shi [Fri, 22 May 2026 22:00:25 +0000 (18:00 -0400)]
block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.
syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.
Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead") Acked-by: Sungwoo Kim <iam@sung-woo.kim> Acked-by: Dave Tian <daveti@purdue.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Signed-off-by: Chao Shi <coshi036@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Aaron Tomlin [Mon, 25 May 2026 00:51:23 +0000 (20:51 -0400)]
blk-mq: add tracepoint block_rq_tag_wait
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.
This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:
Ackerley Tng [Fri, 22 May 2026 22:46:10 +0000 (15:46 -0700)]
KVM: SEV: Mark source page dirty when writing back CPUID data on failure
When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before its read by
userspace.
Ackerley Tng [Fri, 22 May 2026 22:46:09 +0000 (15:46 -0700)]
KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements
Per highmem.h, local kernel mappings must be unmapped in the reserve order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".
Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.
Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.
KVM: SEV: Pin source page for write when adding CPUID data for SNP guest
When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.
Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.
Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory") Cc: stable@vger.kernel.org Signed-off-by: Ackerley Tng <ackerleytng@google.com> Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@] Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark Brown [Tue, 26 May 2026 16:50:18 +0000 (17:50 +0100)]
ASoC: SOF: ipc4-topology: Support for multiple src output formats
Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:
SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.
In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.
Peter Ujfalusi [Tue, 26 May 2026 10:57:48 +0000 (13:57 +0300)]
ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output
The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.
Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 26 May 2026 10:57:47 +0000 (13:57 +0300)]
ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc
SRC and ASRC modules must have at least one input and on one output formats
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats are 0.
Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.
Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.
The kernel updates the stats counters in-place on every copy-fallback
event.
Clément Léger [Tue, 19 May 2026 11:44:33 +0000 (12:44 +0100)]
io_uring/zcrx: notify user on frag copy fallback
Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.
Pavel Begunkov [Tue, 19 May 2026 11:44:32 +0000 (12:44 +0100)]
io_uring/zcrx: notify user when out of buffers
There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.
It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.
When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.
The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
Pavel Begunkov [Tue, 19 May 2026 11:44:31 +0000 (12:44 +0100)]
io_uring/zcrx: add ctx pointer to zcrx
zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.
Bertie Tryner [Tue, 19 May 2026 11:44:30 +0000 (12:44 +0100)]
io_uring/zcrx: reorder fd allocation in zcrx_export()
Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.
While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.
Pavel Begunkov [Tue, 19 May 2026 11:44:29 +0000 (12:44 +0100)]
io_uring/zcrx: remove extra ifq close
By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.
Pavel Begunkov [Tue, 19 May 2026 11:44:27 +0000 (12:44 +0100)]
io_uring/zcrx: make scrubbing more reliable
Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.
Wentao Liang [Tue, 26 May 2026 10:21:24 +0000 (10:21 +0000)]
block: partitions: fix of_node refcount leak in of_partition()
of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:
1. The compatibility check at the top of the function returns 0
without releasing partitions_np when the node exists but is not
"fixed-partitions" compatible.
2. The function returns 1 at the end after successfully processing
all partitions without releasing partitions_np.
Fix both leaks by adding of_node_put(partitions_np) on each path.
Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF") Cc: stable@vger.kernel.org Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ethan Tidmore [Sat, 23 May 2026 21:15:22 +0000 (16:15 -0500)]
ASoC: cs35l56-shared-test: Fix possible null pointer dereference
The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.
Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)
Linus Torvalds [Tue, 26 May 2026 15:23:19 +0000 (08:23 -0700)]
Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
====================
ethtool: module: fix a handful of small bugs
I've been poking at the locking in ethtool and it appears
that the FW flashing is not currently taking the ops lock.
Existing drivers which implement module FW flashing seem
to have their own locking, so this series doesn't actually
add the ops lock (I'll add it in net-next). But a number
of other errors have been surfaced in the process.
====================
Jakub Kicinski [Fri, 22 May 2026 23:13:12 +0000 (16:13 -0700)]
ethtool: cmis: validate fw->size against start_cmd_payload_size
cmis_fw_update_start_download() copies start_cmd_payload_size bytes
from the firmware blob into the CDB LPL vendor_data[] payload without
validating that the FW has enough data.
Since the start_cmd_payload_size can only be ~120B an image too short
is most likely corrupted, so reject it.
Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:13:11 +0000 (16:13 -0700)]
ethtool: cmis: validate start_cmd_payload_size from module
The CMIS firmware update code reads start_cmd_payload_size from
the module's FW Management Features CDB reply and uses it directly
as the byte count for memcpy. The destination buffer is 112 bytes
(ETHTOOL_CMIS_CDB_LPL_MAX_PL_LENGTH - 8). So a malicious
module (or corrupted response) can cause a OOB write later on in
cmis_fw_update_start_download().
Let's error out. If modules that expect longer LPL writes actually
exist we should revisit.
struct cmis_cdb_start_fw_download_pl's definition has to move,
no change there.
Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:13:10 +0000 (16:13 -0700)]
ethtool: cmis: fix u16-to-u8 truncation of msleep_pre_rpl
ethtool_cmis_cdb_compose_args() accepts msleep_pre_rpl as u16 but stores
it into the u8 field ethtool_cmis_cdb_cmd_args::msleep_pre_rpl, silently
truncating values >= 256. Seven of the nine call sites pass 1000 ms
(it's the third argument from the end).
Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:13:09 +0000 (16:13 -0700)]
ethtool: cmis: require exact CDB reply length
Malicious SFP module could respond with rpl_len longer than
what cmis_cdb_process_reply() expected, leading to OOB writes.
Malicious HW is a bit theoretical but some modules may just
be buggy and/or the reads may occasionally get corrupted,
so let's protect the kernel.
The existing check protects from short replies. We need to
protect from long ones, too. All callers that pass a non-zero
rpl_exp_len cast the reply payload to a fixed-layout struct
and read fields at fixed offsets, with no version negotiation
or short-reply handling:
so let's assume that responses longer than expected do not
have to be handled gracefully here. Add a warning message
to make the debug easier in case my understanding is wrong...
Note that page_data->length (argument of kmalloc) comes from
last arg to ethtool_cmis_page_init() which is rpl_exp_len.
Note2 that AIs also like to point out overflows in args->req.payload
itself (which is a fixed-size 120 B buffer, on the stack),
but callers should be reading structs defined by the standard,
so protecting from requests for more data than max seem like
defensive programming.
Jakub Kicinski [Fri, 22 May 2026 23:13:08 +0000 (16:13 -0700)]
ethtool: module: fix cleanup if socket used for flashing multiple devices
When a single Netlink socket issues MODULE_FW_FLASH_ACT against multiple
devices, ethnl_sock_priv_set() overwrites sk_priv->dev on each call,
retaining only the last one. The socket priv is used on socket close,
to walk the global work list and mark the uncompleted flashing work
as "orphaned". Otherwise if another socket reuses the PID it will
unexpectedly receive the flashing notifications.
Don't record the device, record net pointer instead. The purpose of
the dev is to scope the work to a netns, anyway. If we store netns
the overrides are safe/a nop since all flashed devices must be in
the same netns as the socket.
Jakub Kicinski [Fri, 22 May 2026 23:13:07 +0000 (16:13 -0700)]
ethtool: module: check fw_flash_in_progress under rtnl_lock
ethnl_set_module_validate() inspects module_fw_flash_in_progress
but validate is meant for _input_ validation, not state validation.
rtnl_lock is not held, yet. Move the check into ethnl_set_module().
Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:13:06 +0000 (16:13 -0700)]
ethtool: module: avoid racy updates to dev->ethtool bitfield
When reviewing other changes Gemini points out that we currently
update module_fw_flash_in_progress without holding any locks.
Since module_fw_flash_in_progress is part of a bitfield this
is not great, updates to other fields may be lost.
We could use a bool and sprinkle some READ_ONCE/WRITE_ONCE here
but seems like the issue is rather than the work is an unusual
writer. The other writers already hold the right locks. So just
very briefly take these locks when the work completes.
Note that nothing ever cancels the FW update work, so there's
no concern with deadlocks vs cancel.
Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:13:05 +0000 (16:13 -0700)]
ethtool: module: avoid leaking a netdev ref on module flash errors
module_flash_fw_schedule() is missing undo for setting
the "in_progress" flag and taking the netdev reference.
Delay taking these, the device can't disappear while
we are holding rtnl_lock.
Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware") Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Link: https://patch.msgid.link/20260522231312.1710836-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 May 2026 23:06:47 +0000 (16:06 -0700)]
ethtool: rss: avoid device context leak on reply-build failure
We wait with filling the reply for new RSS context creation
until after the driver ->create_rxfh_context call. The driver
needs to fill some of the defaults in the context. The failure
of rss_fill_reply() is somewhat theoretical, but doesn't take
much effort to handle it properly. Call ->remove_rxfh_context().
If the driver's remove callback fails (some implementations like sfc
can return real command errors from firmware RPCs) - skip the xa_erase
and kfree, leaving the context in the xarray. This matches how
ethnl_rss_delete_doit() behaves.
Jakub Kicinski [Fri, 22 May 2026 23:06:46 +0000 (16:06 -0700)]
ethtool: rss: fix hkey leak when indir_size is 0
rss_get_data_alloc() allocates a single buffer that backs both the
indirection table and the hash key, but only assigned data->indir_table
when indir_size was nonzero. The expectation was that no driver
implements RSS without supporting indirection table but apparently
enic does just that (it's the only such in-tree driver).
enic has get_rxfh_key_size but no get_rxfh_indir_size.
data->indir_table stays as NULL, hkey gets set but rss_get_data_free()
kfree(data->indir_table) is a nop and the allocation leaks.
Always store the allocation base in data->indir_table so the free path
is unambiguous. No caller treats indir_table as a sentinel; everything
keys off indir_size.
Jakub Kicinski [Fri, 22 May 2026 23:06:45 +0000 (16:06 -0700)]
ethtool: rss: fix indir_table and hkey leak on get_rxfh failure
rss_prepare_get() allocates the indirection table and hash key buffer
via rss_get_data_alloc(), then calls ops->get_rxfh() to populate them.
If get_rxfh() fails, the function returns an error without freeing
the allocation.
rss_set_prep_indir() compares the new indirection table against the
current one to determine whether any update is needed. The memcmp
call passes data->indir_size as the length argument, but indir_size
is the number of u32 entries, not the byte count.
Jakub Kicinski [Fri, 22 May 2026 23:06:42 +0000 (16:06 -0700)]
ethtool: rss: avoid modifying the RSS context response
Gemini says that we're modifying the RSS_CREATE response skb.
I think it's right, the comment says that unicast() should
unshare the skb but I'm not entirely sure what I meant there.
netlink_trim() does a copy but only if skb is not well sized
(it's at least 2x larger than necessary for the payload).
KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host
A previous bug in __pkvm_init_vm error path showed that the hypervisor
could leak refcounted pages, (i.e. losing access to a page while its
refcount is still elevated). This poses a threat to the pKVM state
machine.
Address this by introducing a fail-safe in __pkvm_hyp_donate_host.
Transitions are not a hot path so added security is worth the extra
check.
In the unlikely case where insert_vm_table_entry fails, __pkvm_init_vm
release the memory donated by the host for the PGD, but as the stage-2
is still set-up the hypervisor keeps a refcount on those pages,
effectively leaking the references.
Fix the rollback with the newly added kvm_guest_destroy_stage2().
Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization") Reported-by: Sashiko <sashiko-bot@kernel.org> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260521143626.1005660-3-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
When a VM fails to initialise after its stage-2 hyp_pool has been
initialised, that stage-2 must be torn down entirely. This requires
resetting both the refcount and the order of its pages back to 0.
Currently, reclaim_pgtable_pages() implicitly resets the page order by
allocating the entire pool with order-0 granularity. However, in the VM
initialisation error path, the addresses of the donated memory (the PGD)
are already known, making it unnecessary to iterate over all pages in
the pool.
Since the vmemmap page order is a hyp_pool-specific field, leaving a
non-zero order on hyp_pool destruction is harmless until another pool
attempts to admit the page. Instead of resetting this field during
destruction, reset it during pool initialization in hyp_pool_init().
For 'external' pages, we can't trust the order either as they bypass
hyp_pool_init(). Since we never coalesce them, enforce order-0 to ensure
safe insertion into the pool.
This leaves no vmemmap order users outside of hyp_pool.
Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260521143626.1005660-2-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
Zhan Xusheng [Tue, 26 May 2026 02:21:31 +0000 (10:21 +0800)]
cpufreq/amd-pstate: drop stale @epp_cached kdoc
Commit 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP") removed
the epp_cached field from struct amd_cpudata in favour of always
reading from cppc_req_cached, but the kdoc above the struct still
documents @epp_cached.
Jon Hunter [Tue, 19 May 2026 08:47:06 +0000 (09:47 +0100)]
arm64: tegra: Fix address of Tegra264 main GPIO controller
The 64-bit address of the main GPIO controller on Tegra264 is
0x810c300000. The main GPIO controller was incorrectly added under the
bus@0 node instead of the bus@8100000000 node breaking the boot on
Tegra264. Fix this by moving to main GPIO controller node under
bus@8100000000.
Yuho Choi [Mon, 25 May 2026 02:47:09 +0000 (22:47 -0400)]
ARM: socfpga: Fix OF node refcount leak in SMP setup
socfpga_smp_prepare_cpus() looks up the Cortex-A9 SCU node with
of_find_compatible_node(), which returns a node reference that must be
released with of_node_put().
The function maps the SCU registers and then returns without dropping
that reference, leaking the node on both the success path and the
of_iomap() failure path.
Drop the reference once the mapping attempt is complete. The returned
MMIO mapping does not depend on keeping the device node reference held.
Fixes: 122694a0c712 ("ARM: socfpga: use of_iomap to map the SCU") Cc: stable@vger.kernel.org Signed-off-by: Yuho Choi <dbgh9129@gmail.com> Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Danilo Krummrich [Mon, 25 May 2026 01:23:22 +0000 (03:23 +0200)]
driver core: Guard deferred probe timeout extension with delayed_work_pending()
mod_delayed_work() unconditionally queues the work even when it wasn't
previously pending, which can fire the timeout prematurely or restart it
after it already fired. Add a delayed_work_pending() guard to restore
the originally intended semantics.
Premature firing calls fw_devlink_drivers_done() before all built-in
drivers have registered, causing fw_devlink to prematurely relax device
links for suppliers whose drivers haven't loaded yet.
Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work") Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20260525012340.3860581-2-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Both have the same prompt, depends, default, and help text. Since
arch/mips/mobileye/Kconfig is sourced from arch/mips/Kconfig, both
definitions are parsed and the symbol ends up with two prompts.
The symbol was first introduced in commit 101bd58fde10 ("MIPS: Add
support for Mobileye EyeQ5") directly in
arch/mips/Kconfig. Three months later, commit fbe0fae601b7 ("MIPS:
mobileye: Add EyeQ6H support") created the
arch/mips/mobileye/Kconfig sub-file to organize the growing Mobileye
platform code and added the MACH_EYEQ5/MACH_EYEQ6H choice along with
a copy of FIT_IMAGE_FDT_EPM5. However, the original definition in
arch/mips/Kconfig was not removed at that time, leaving a duplicate.
Remove the definition from arch/mips/Kconfig, keeping the one in
arch/mips/mobileye/Kconfig where it belongs alongside the related
MACH_EYEQ5 machine type definition that it depends on.
Qingfang Deng [Tue, 28 Apr 2026 08:25:40 +0000 (16:25 +0800)]
MIPS: ralink: reduce ARCH_DMA_MINALIGN
Currently, Ralink SoCs use the default ARCH_DMA_MINALIGN value of 128
bytes defined in mach-generic. This is excessive for these platforms
and leads to significant memory waste in kmalloc.
Override ARCH_DMA_MINALIGN to use L1_CACHE_BYTES, which is 16 bytes for
RT288X and 32 bytes for other Ralink SoCs.
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Costa Shulyupin [Fri, 15 May 2026 18:50:41 +0000 (21:50 +0300)]
include: Remove unused jz4740-battery.h
The last user was removed in commit aea12071d6fc
("power/supply: Drop obsolete JZ4740 driver") and replaced by
a self-contained IIO-based driver. No file includes this header.
Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Costa Shulyupin [Fri, 15 May 2026 18:42:23 +0000 (21:42 +0300)]
include: Remove unused jz4740-adc.h
The last user was the JZ4740 MFD ADC driver, removed in commit ff71266aa490 ("mfd: Drop obsolete JZ4740 driver") and replaced
by a self-contained IIO driver. No file includes or references
this header.
Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Rosen Penev [Thu, 7 May 2026 23:23:23 +0000 (16:23 -0700)]
mips: cps: Assemble jr.hb with an R2 ISA level
A MIPS allmodconfig built with LLVM can select CPU_MIPS32_R1 together
with MIPS_MT_SMP. In that configuration clang invokes the integrated
assembler with -march=mips32, and the MIPS MT path in cps-vec.S fails
to assemble two jr.hb instructions:
arch/mips/kernel/cps-vec.S:376:2: error: instruction requires
a CPU feature not currently enabled
arch/mips/kernel/cps-vec.S:490:4: error: instruction requires
a CPU feature not currently enabled
The earlier jr.hb in the same file is already assembled inside a .set
MIPS_ISA_LEVEL_RAW scope. The two failing sites are reached after
popping back to the file's base ISA level, so LLVM correctly rejects
them for an R1 target.
Wrap those jr.hb instructions in the same ISA-level push/pop used by
the working site. This keeps the MT code unchanged while making the
required R2 hazard-branch encoding explicit to the assembler.
Assisted-by: Codex:GPT-5.5 Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: DEC: Prevent initial console buffer from landing in XKPHYS
In 64-bit configurations calling the initial console output handler from
a kernel thread other than the initial one will result in a situation
where the stack has been placed in the XKPHYS 64-bit memory segment and
consequently so has been the buffer allocated there that is used as the
argument corresponding to the `%s' output conversion specifier for the
firmware's printf() entry point.
This 64-bit address will then be truncated by 32-bit firmware, resulting
in an attempt to access the wrong memory location, which in turn will
cause all kinds of unpredictable behaviour, such as a kernel crash:
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Fatal exception in interrupt
KN04 V2.1k (PC: 0xa0026768, SP: 0x806848e8)
>>
In this case the pointer in $4 was truncated from 0x980000000203bd00 to
0x000000000203bd00.
This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started.
Fix the issue by making the buffer static and initdata, and therefore
placed in the CKSEG0 32-bit compatibility segment, observing that the
console output handler is called with the console lock held, implying
no need for this code to be reentrant. Add an assertion to verify the
buffer actually has been placed in a compatibility segment.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # v2.6.12+ Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: DEC: Ensure 32-bit stack location for o32 prom_printf()
In 64-bit configurations calling any firmware entry points from a kernel
thread other than the initial one will result in a situation where the
stack has been placed in the XKPHYS 64-bit memory segment.
Consequently the stack pointer is no longer a 32-bit value and when the
32-bit firmware code called uses 32-bit ALU operations to manipulate the
stack pointer, the calculated result is incorrect (in fact in the 64-bit
MIPS ISA almost all 32-bit ALU operations will produce an unpredictable
result when executed on 64-bit data) and control goes astray.
This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap, or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started, and the kernel will hang at:
pid_max: default: 32768 minimum: 301
or somewhat later, but always before:
cblist_init_generic: Setting adjustable number of callback queues.
has been printed.
It seems that only the prom_printf() entry point is affected. Of all
the other entry points wired only rex_slot_address() and rex_gettcinfo()
are called from a kernel thread other than the initial one, specifically
kernel_init(), and they are leaf functions that do no business with the
stack, having worked with no issue ever since 64-bit support was added
for the platform back in 2002.
To address this issue then, arrange for the stack to be switched in the
o32 wrapper as required for prom_printf() only, by supplying call_o32()
with a pointer to a chunk of initdata space, which is placed in the
CKSEG0 32-bit compatibility segment, observing that prom_printf() is
only called from console output handler and therefore with the console
lock held, implying no need for this code to be reentrant.
Other firmware entry points may be called with interrupts enabled and no
lock held, and may therefore require that call_o32() be reentrant. They
trigger no issue at this point and "if it ain't broke, don't fix it," so
just leave them alone.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: stable@vger.kernel.org # v2.6.12+ Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: DEC: Remove IRQF_ONESHOT reference for IOASIC DMA error IRQs
There is no need for IOASIC DMA error interrupts to use the IRQF_ONESHOT
flag, because while they do need to have the source cleared only at the
conclusion of handling, the action handler supplied is either run in the
hardirq context with interrupts disabled at the CPU level or, where IRQ
threading has been forced, the primary handler has the IRQF_ONESHOT flag
implicitly added and therefore the original action handler, now run as
the thread handler and with interrupts enabled in the CPU, is executed
with the originating interrupt line masked. Therefore no interrupt will
retrigger regardless until the original request has been handled.
Link: https://lore.kernel.org/r/20260127135334.qUEaYP9G@linutronix.de/ Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Remove a bunch of compilation warnings for halt/reset handlers:
arch/mips/dec/reset.c:22:17: warning: no previous prototype for 'dec_machine_restart' [-Wmissing-prototypes]
22 | void __noreturn dec_machine_restart(char *command)
| ^~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:27:17: warning: no previous prototype for 'dec_machine_halt' [-Wmissing-prototypes]
27 | void __noreturn dec_machine_halt(void)
| ^~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:32:17: warning: no previous prototype for 'dec_machine_power_off' [-Wmissing-prototypes]
32 | void __noreturn dec_machine_power_off(void)
| ^~~~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:38:13: warning: no previous prototype for 'dec_intr_halt'
[-Wmissing-prototypes]
38 | irqreturn_t dec_intr_halt(int irq, void *dev_id)
| ^~~~~~~~~~~~~
(which get promoted to compilation errors with CONFIG_WERROR), by moving
the local prototypes from arch/mips/dec/setup.c to a dedicated header
for arch/mips/dec/reset.c to use as well. No functional change.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code, as a tail call. Remove the dec_irq_dispatch() stub
then and the indirection previously introduced with commit 187933f23679
("[MIPS] do_IRQ cleanup"), improving performance by reducing the number
of control flow changes and the overall instruction count, while fixing
a compiler's complaint about a missing prototype for said stub:
arch/mips/dec/setup.c:780:25: warning: no previous prototype for 'dec_irq_dispatch' [-Wmissing-prototypes]
780 | asmlinkage unsigned int dec_irq_dispatch(unsigned int irq)
| ^~~~~~~~~~~~~~~~
(which gets promoted to a compilation error with CONFIG_WERROR).
Fixes: 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section for MIPS") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: Make do_IRQ() available for assembly callers
As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code again, however its `asmlinkage' annotation has never
been brought back from the previous removal of the function with commit 187933f23679 ("[MIPS] do_IRQ cleanup").
Since calling the function directly from assembly code has a performance
advantage, add the annotation back so that the DEC platform can make use
of this again.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: Fix big-endian stack argument fetching in o32 wrapper
Fix an issue in call_o32() where the upper 32-bit half of incoming n64
stack arguments is fetched and used for outgoing o32 stack arguments on
big-endian platforms.
This code was adapted from arch/mips/dec/prom/call_o32.S which was meant
for a little-endian platform only and therefore using 32-bit loads from
64-bit stack slot locations holding incoming stack arguments resulted in
correct values being retrieved for data that is expected to be 32-bit.
This works on little-endian platforms where the lower 32-bit half of the
64-bit value is located at every 64-bit stack slot location. However on
big-endian platforms the lower 32-bit half is instead located at offset
4 from every 64-bit stack slot location.
So to fix the issue the offset of 4 would have to be used on big-endian
platforms only, or alternatively a 64-bit load from the 64-bit stack
slot location can be used across the board, as the subsequent 32-bit
store to the corresponding outgoing stack argument slot will correctly
truncate the value and cause no unpredictable result. We already take
advantage of this architectural feature for the incoming arguments held
in $a6 and $a7 registers, since the o32 wrapper does not know how many
incoming arguments there are and consequently propagates incoming data
which may not be 32-bit.
Since this code is generally supposed to be used with the stack located
in cached memory there is no extra overhead expected for 64-bit loads as
opposed to 32-bit ones, so pick this variant for code simplicity.
Fixes: 231a35d37293 ("[MIPS] RM: Collected changes") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
MIPS: RB532: attach the software node to its target GPIO controller
GPIOLIB wants to remove the software node's name matching against GPIO
controller's label that is going on behind the scenes in software node
lookup. To that end, we need to convert all existing users to using
software nodes actually attached to the GPIO devices they represent.
In order to use an attached software node with the GPIO controller on
rb532: convert the GPIO module into a real platform device, provide
platform device info for it in device.c and assign the software node
using its swnode field.
The software node will get inherited by the GPIO chip from the parent
platform device in devm_gpiochip_add_data() as we don't set the fwnode
using any other of the mechanisms.
Gordon Chen [Tue, 26 May 2026 07:29:06 +0000 (15:29 +0800)]
ALSA: usb-audio: add IFB_SILENCE_ON_EMPTY quirk for Behringer Flow 8
The Behringer Flow 8 (1397:050c) is an 8-channel USB mixer that
declares OUT EP 0x01 with implicit feedback from capture EP 0x81 via
its UAC2 endpoint companion descriptor. After 5-35 minutes of
continuous playback, the device occasionally returns a capture URB in
which every iso_frame_desc has a non-zero status (-EXDEV bursts,
visible as rate-limited "frame N active: -18" lines in dmesg from
pcm.c).
In that case snd_usb_handle_sync_urb() at endpoint.c counts bytes==0
and falls into the early "skip empty packets" return originally added
for M-Audio Fast Track Ultra. As a result the playback EP loses its
sole IFB-driven feeder and the OUT ring starves permanently: hw_ptr
stops advancing while substream state remains RUNNING. Only USB
re-enumeration recovers.
Three independent ftrace captures (taken at the moment of stall via a
userspace watchdog) consistently show:
- 60-70 capture URB completions in the 70ms window before the marker
- 0 retire_playback_urb / queue_pending_output_urbs /
snd_usb_endpoint_implicit_feedback_sink calls
- every usb_submit_urb in the window comes from
snd_complete_urb+0x64e (capture self-resubmit), none from the
queue_pending_output_urbs path
Add a new opt-in quirk QUIRK_FLAG_IFB_SILENCE_ON_EMPTY: when set, the
early return is skipped and we fall through to enqueue a packet_info
whose packet_size[i] are all 0 (the existing loop already maps
status!=0 packets to size 0). prepare_outbound_urb then emits a
silence packet, the OUT ring keeps moving, and the device rides
through the glitch.
The default behaviour (early return) is preserved for all existing
devices including M-Audio Fast Track Ultra. Only Flow 8 opts in here.
Thomas Gleixner [Sun, 17 May 2026 20:02:49 +0000 (22:02 +0200)]
genirq/proc: Speed up /proc/interrupts iteration
Reading /proc/interrupts iterates over the interrupt number space one by
one and looks up the descriptors one by one. That's just a waste of time.
When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and
cache the descriptor pointer efficiently for the sequence file operations.
Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and
leave the fs/proc/ variant for the legacy architectures which ignore generic
code.
This reduces the time wasted for looking up the next record significantly.