Alice Ryhl [Tue, 31 Mar 2026 10:57:49 +0000 (10:57 +0000)]
kbuild: rust: add AutoFDO support
This patch enables AutoFDO build support for Rust code within the Linux
kernel. This allows Rust code to be profiled and optimized based on the
profile.
The RUSTFLAGS variable was suffixed with *_AUTOFDO_CLANG to match the
naming of the config option, which is called CONFIG_AUTOFDO_CLANG.
This implementation has been verified in Android, first by inspecting
the object files and confirming that they look correct. After that,
it was verified as below:
1. Running the binderAddInts benchmark [1] with Rust Binder built as
rust_binder.ko module, using a Pixel 9 Pro.
2. Collecting a profile on a Pixel 10 Pro XL using the app-launch
benchmark, which starts different apps many times, on a device with
Rust Binder as a built-in kernel module. (C Binder was not present on
the device.)
3. Using the collected profile, run the binderAddInts benchmark again
with Rust Binder built both as a rust_binder.ko module, and as a
built-in kernel module.
4. In both cases, Rust Binder without AutoFDO was approximately 13%
slower than the AutoFDO optimized version. Built-in vs .ko did not
make a measurable performance difference.
All of the above was verified in conjunction with my helpers inlining
series [2], which confirmed that this worked correctly for helpers too
once [3] was fixed in the helpers inlining series.
Chaitanya Sabnis [Tue, 26 May 2026 10:22:40 +0000 (15:52 +0530)]
i2c: davinci: fix division by zero on missing clock-frequency
When the 'clock-frequency' property is missing from the device tree,
the driver falls back to DAVINCI_I2C_DEFAULT_BUS_FREQ. However, this
macro was defined in kHz (100), whereas the device tree property is
expected in Hz.
The probe function divided the fallback value by 1000, causing
integer truncation that resulted in dev->bus_freq = 0. This triggered
a deterministic division-by-zero kernel panic when calculating clock
dividers later in the probe sequence.
Fix this by redefining DAVINCI_I2C_DEFAULT_BUS_FREQ in Hz (100000)
to match the expected device tree property unit, allowing the existing
division logic to work correctly for both cases.
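The truncation can be reproduced in a few lines of userspace C. This is only a sketch: the two macros stand in for the old and new fallback values named above, and hz_to_khz() and clock_divider() are hypothetical helpers, not the driver's actual code or divider formula.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the fallback value before and after the fix. */
#define OLD_DEFAULT_BUS_FREQ 100u     /* kHz, as described above */
#define NEW_DEFAULT_BUS_FREQ 100000u  /* Hz, matching the DT property unit */

/* Mirrors the probe-time conversion from Hz to kHz (integer division). */
static uint32_t hz_to_khz(uint32_t freq_hz)
{
	return freq_hz / 1000;
}

/*
 * Illustrative divider calculation, not the driver's real formula.
 * Returns -1 instead of dividing when the bus frequency is zero.
 */
static int clock_divider(uint32_t input_khz, uint32_t bus_khz, uint32_t *div)
{
	if (bus_khz == 0)
		return -1;
	*div = input_khz / bus_khz;
	return 0;
}
```

With the old value, hz_to_khz() yields 0 and any later division by it would fault; with the new value it yields 100.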
Ricardo Robaina [Wed, 13 May 2026 21:47:59 +0000 (18:47 -0300)]
audit: fix removal of dangling executable rules
When an audited executable is deleted from the disk, its dentry
becomes negative. Any later attempt to delete the associated audit
rule will lead to audit_alloc_mark() encountering this negative
dentry and immediately aborting, returning -ENOENT.
This early abort prevents the subsystem from allocating the temporary
fsnotify mark needed to construct the search key, meaning the kernel
cannot find the existing rule in its own lists to delete it. This
leaves a dangling rule in memory, resulting in the following error
while attempting to delete the rule:
# ./audit-dupe-exe-deadlock.sh
No rules
Error deleting rule (No such file or directory)
There was an error while processing parameters
# auditctl -l
-a always,exit -S all -F exe=/tmp/file -F path=/tmp/file -F key=dr
# auditctl -D
Error deleting rule (No such file or directory)
There was an error while processing parameters
This patch fixes this issue by removing the d_really_is_negative()
check. By doing so, a dummy mark can be successfully generated for
the deleted path, which allows the audit subsystem to properly match
and flush the dangling rule.
Cc: stable@kernel.org
Fixes: 76a53de6f7ff ("VFS/audit: introduce kern_path_parent() for audit")
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Vishal Annapurve [Fri, 22 May 2026 15:15:34 +0000 (15:15 +0000)]
KVM: x86: Treat KVM's virtual PMU as disabled for TDX VMs
Introduce a "protected PMU" concept, and use it to disable KVM's virtual
PMU for TDX VMs, as the PMU state for TDX VMs is virtualized by the TDX
Module[1], i.e. _can't_ be emulated/virtualized by KVM, and KVM doesn't yet
support enabling/exposing PMU functionality for/to TDX VMs. For now,
simply treat the PMU as disabled, as it's not clear what all needs to be
changed, e.g. KVM needs to do at least:
1) Configure TD_PARAMS to allow guests to use performance monitoring.
2) Restrict the TD to a subset of the PEBS counters if supported.
3) Limit the TD to setting up only certain perfmon events using
basic/enhanced event filtering.
Explicitly disallow enabling the PMU via KVM_CAP_PMU_CAPABILITY for VMs
with a protected PMU to prevent userspace from circumventing KVM's
protections.
Jani Nikula [Wed, 13 May 2026 07:58:40 +0000 (10:58 +0300)]
drm/i915/display: stop passing i to for_each_pipe_crtc_modeset_{enable,disable}()
Refactor for_each_pipe_crtc_modeset_{enable,disable}() and their
underlying for_each_crtc_in_masks{,_reverse}() helpers to utilize
__UNIQUE_ID() to avoid having to pass the for loop variable to them.
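The general technique can be sketched outside the kernel as follows. This is an approximation under stated assumptions: UNIQUE_ID() imitates the kernel's __UNIQUE_ID() using the __COUNTER__ compiler extension, and for_each_set_bit_in_mask() is a hypothetical iterator, not one of the i915 macros.

```c
/* Userspace approximation of __UNIQUE_ID(); relies on __COUNTER__ (GCC/Clang). */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)
#define UNIQUE_ID(prefix) CONCAT(CONCAT(unique_, prefix), __COUNTER__)

/*
 * The inner macro receives the generated name as an already-expanded
 * argument, so every use of idx refers to the same hidden variable.
 */
#define for_each_set_bit_idx_(idx, bit, mask, nbits)			\
	for (unsigned int idx = 0; idx < (nbits); idx++)		\
		if (((bit) = idx), !(((mask) >> idx) & 1u)) {} else

/* Callers no longer pass a loop counter; it is generated here. */
#define for_each_set_bit_in_mask(bit, mask, nbits)			\
	for_each_set_bit_idx_(UNIQUE_ID(i), bit, mask, nbits)

/* Example user: sums the positions of the set bits in the low 8 bits. */
static unsigned int sum_set_bits(unsigned int mask)
{
	unsigned int bit, sum = 0;

	for_each_set_bit_in_mask(bit, mask, 8)
		sum += bit;
	return sum;
}
```

Because each expansion gets a fresh counter value, nested uses of the iterator do not need distinct caller-supplied loop variables.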
Jani Nikula [Wed, 13 May 2026 07:58:38 +0000 (10:58 +0300)]
drm/i915/display: pass struct intel_display to all for_each_intel_crtc*() macros
Now that the for_each_intel_crtc*() iterator macros primarily use
display->pipe_list for iteration, it's more convenient to pass struct
intel_display to them directly instead of struct drm_device. Make it so.
Jani Nikula [Wed, 13 May 2026 07:58:37 +0000 (10:58 +0300)]
drm/i915/display: always pass display->drm to for_each_intel_crtc*()
In preparation for always passing struct intel_display to
for_each_intel_crtc*() family of iterators, start off by unifying their
usage to always having struct intel_display *display around, and passing
display->drm to them.
Jani Nikula [Wed, 13 May 2026 07:58:36 +0000 (10:58 +0300)]
drm/i915/display: switch from drm_for_each_crtc() to for_each_intel_crtc()
intel_has_pending_fb_unpin() has the last direct user of
drm_for_each_crtc() in i915. Switch to for_each_intel_crtc() to ensure
pipe order iteration in all cases.
Jani Nikula [Mon, 25 May 2026 11:05:53 +0000 (14:05 +0300)]
drm/{i915,xe}: move xe_display_flush_cleanup_work() to i915 display
xe_display_flush_cleanup_work() is a bit of an oddball function in xe
display code. There shouldn't be anything this specific or xe
specific. While I'm not sure what the correct refactor for the function
should be, move it to shared display code for starters, next to the
eerily similar but slightly different intel_has_pending_fb_unpin() that
is only called from i915 core.
The main goal here is to unblock some refactors on
for_each_intel_crtc().
Kevin Cheng [Fri, 22 May 2026 23:27:01 +0000 (16:27 -0700)]
KVM: selftests: Add nested page fault injection test
Add a test that exercises nested page fault injection during L2
execution. L2 executes I/O string instructions (OUTSB/INSB) that access
memory restricted in L1's nested page tables (NPT/EPT), triggering a
nested page fault that L0 must inject to L1.
The test supports both AMD SVM (NPF) and Intel VMX (EPT violation) and
verifies that:
- The exit reason is an NPF/EPT violation
- The access type and permission bits are correct
- The faulting GPA is correct
Four test cases are implemented:
- Unmap the final data page (final translation fault, OUTSB read)
- Unmap a PT page (page walk fault, OUTSB read)
- Write-protect the final data page (protection violation, INSB write)
- Write-protect a PT page (protection violation on A/D update, OUTSB
read)
When injecting an EPT Violation into L2 in response to a fault detected
while emulating an L2 GVA access, synthesize the GVA_IS_VALID and
GVA_TRANSLATED bits using information provided by the walker, instead of
pulling the bits from vmcs02.EXIT_QUALIFICATION. The information in
vmcs02.EXIT_QUALIFICATION is valid/correct if and only if the fault being
injected into L1 is the direct result of an EPT Violation VM-Exit from L2.
E.g. if KVM is emulating an I/O instruction and the memory operand's
translation through L1's EPT fails, using vmcs02.EXIT_QUALIFICATION is
wrong as the semantics for EXIT_QUALIFICATION would be for an I/O exit,
not an EPT Violation exit.
Opportunistically clean up the formatting for creating the mask of bits
to pull from vmcs02.EXIT_QUALIFICATION.
Kevin Cheng [Fri, 22 May 2026 23:26:59 +0000 (16:26 -0700)]
KVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits
Fix KVM's generation of PFERR_GUEST_{PAGE,FINAL}_MASK bits when injecting a
Nested Page Fault into L1. Currently, KVM blindly stuffs GUEST_FINAL into
L1, which is blatantly wrong given that KVM obviously generates NPFs for
page table accesses.
There are two paths that trigger NPF injection: hardware NPF exits (from
L2) and emulation-triggered faults, i.e. when KVM detects a NPF as part of
emulating an L2 GVA access. For the hardware case, use the bits verbatim
from the VMCB, as KVM is simply forwarding a NPF to L1. For the emulation
case, propagate the GUEST_{PAGE,FINAL} bits from the access field (which
were recently added for MBEC+GMET support).
To differentiate between the two cases, add "hardware_nested_page_fault"
to "struct x86_exception", and set it when injecting a NPF in response to
an NPF exit from L2.
To help guard against future goofs, assert that exactly one of GUEST_PAGE
or GUEST_FINAL is set when injecting a NPF. Unlike VMX, there are no
(known) cases where hardware doesn't set either bit, and KVM should always
set one or the other when emulating a GVA access.
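The "exactly one bit" rule can be expressed as a small predicate. This is a sketch only: the bit positions are the ones given in the x86_exception widening commit in this log, and npf_guest_bits_valid() is a hypothetical name, not the assertion KVM actually uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as stated elsewhere in this log (bits 32 and 33). */
#define PFERR_GUEST_FINAL_MASK (1ULL << 32)
#define PFERR_GUEST_PAGE_MASK  (1ULL << 33)

/* True when exactly one of GUEST_PAGE / GUEST_FINAL is set. */
static bool npf_guest_bits_valid(uint64_t error_code)
{
	bool final = error_code & PFERR_GUEST_FINAL_MASK;
	bool page = error_code & PFERR_GUEST_PAGE_MASK;

	return final != page;
}
```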
nvme-multipath: enable PCI P2PDMA for multipath devices
NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when all underlying controllers support it.
Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk()
alongside the other features. nvme_update_ns_info_block() already
calls queue_limits_stack_bdev() to stack each path's limits onto the
head disk, which routes through blk_stack_limits(). The core now
clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does
not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL
are handled.
md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
the RAID device, preventing peer-to-peer DMA through the RAID layer even
when all underlying devices support it.
Enable BLK_FEAT_PCI_P2PDMA unconditionally in raid0, raid1 and raid10
personalities during queue limits setup. blk_stack_limits() clears it
automatically if any member device lacks support, consistent with how
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are handled in the block core.
Parity RAID personalities (raid4/5/6) are excluded because they require
CPU access to data pages for parity computation, which is incompatible
with P2P mappings.
Tested with RAID0/1/10 arrays containing multiple NVMe devices with
P2PDMA support, confirming that peer-to-peer transfers work correctly
through the RAID layer.
block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are cleared in blk_stack_limits()
when an underlying device does not support them. Apply the same
treatment to BLK_FEAT_PCI_P2PDMA: stacking drivers set it
unconditionally and rely on the core to clear it whenever a
non-supporting member device is stacked.
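A minimal model of this "set unconditionally, clear on stacking" scheme is shown below. The flag values, struct limits and stack_features() are hypothetical and chosen for illustration; the real BLK_FEAT_* constants and blk_stack_limits() live in the block layer.

```c
#include <stdint.h>

/* Hypothetical flag values, not the kernel's BLK_FEAT_* definitions. */
#define FEAT_NOWAIT     (1u << 0)
#define FEAT_POLL       (1u << 1)
#define FEAT_PCI_P2PDMA (1u << 2)

struct limits {
	uint32_t features;
};

/* Clear every "all members must support it" feature the bottom device lacks. */
static void stack_features(struct limits *top, const struct limits *bottom)
{
	const uint32_t need_all = FEAT_NOWAIT | FEAT_POLL | FEAT_PCI_P2PDMA;

	top->features &= ~(need_all & ~bottom->features);
}
```

The stacking driver starts with all three bits set on the top device and calls stack_features() once per member; a bit survives only if every member advertises it.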
KVM: x86: Tell ->inject_page_fault() whether or not a fault came from hardware
When injecting a page fault (including nested TDP faults into L1), tell the
injection routine whether or not the fault originated in hardware, i.e. if
KVM is effectively forwarding a fault it intercepted. For nested TDP fault
injection, KVM needs to grab PAGE_WALK vs. GUEST_FINAL information from the
VMCB/VMCS, _if_ the fault originated in hardware.
Note, simply checking whether or not the original exit was due to a #NPF or
EPT Violation isn't sufficient/correct, as the fault being synthesized for
L1 may or may not be the "same" fault that triggered a VM-Exit from L2.
E.g. if access to emulated MMIO in L2 hits a !PRESENT fault (EPT Violation
or #NPF), e.g. because MMIO caching is disabled or it's the first time the
GPA has been accessed by L2, then KVM will enter the emulator. If
emulating the MMIO instruction then hits a nested TDP fault, e.g. because
L2 was accessing MMIO with a MOVSQ (memory-to-memory move), or because L1
has since unmapped the code stream, then the TDP fault synthesized to L1
will not be the same fault that triggered the VM-Exit.
No functional change intended (nothing uses the new param, yet...).
Yan Zhao [Thu, 30 Apr 2026 01:50:01 +0000 (09:50 +0800)]
x86/tdx: Drop exported function tdx_quirk_reset_page()
KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
and thus always have struct page.
However, it's also reasonable for KVM to reset those TDX control pages via
tdx_quirk_reset_paddr() directly, eliminating the need to export two
parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
header file is also unnecessary.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260430015001.24242-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
x86/tdx: Use PFN directly for unmapping guest private memory
Remove struct page assumptions/constraints in APIs for unmapping guest
private memory and have them take physical address directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using
"struct page" for TDX guest memory is not a good fit anywhere near the KVM
MMU code [2].
Therefore, for unmapping guest private memory: export
tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).
Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
tdx_sept_remove_private_spte() justifies using PAGE_SIZE.
Do not convert tdx_reclaim_page() to use PFN as input since it currently
does not remove guest private memory.
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
exported to KVM only.
[Yan: Use kvm_pfn_t,exclude tdx_reclaim_page(),use tdx_quirk_reset_paddr()]
x86/tdx: Use PFN directly for mapping guest private memory
Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
mapping guest private memory and have them take PFN directly.
Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided [1].
KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tied to the
regular CPU MMUs and the kernel code around them. Using 'struct page' for
TDX guest memory is not a good fit anywhere near the KVM MMU code [2].
Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
only.
Keith Busch [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)]
blk-mq: reinsert cached request to the list
A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.
Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/20260526153531.2365935-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Li Ming [Wed, 20 May 2026 12:14:57 +0000 (20:14 +0800)]
cxl/test: Update mock dev array before calling platform_device_add()
The CXL test environment sometimes hits the following error:
cxl_mem mem9: endpoint7 failed probe
All mock memdevs are platform firmware devices added by the cxl_test module,
which also provides a platform device driver that creates a memdev device
in the CXL subsystem for each of them. The cxl_test module uses the
cxl_rcd/mem_single/mem arrays to store different types of mock memdevs.
CXL drivers call the registered mock functions for a mock memdev by
checking if a given memdev is in these arrays.
When the cxl_test module adds these mock memdevs, it always calls
platform_device_add() before adding them to a suitable mock memdev
array. However, there is a small window in which CXL drivers can call a
mock function for an added memdev before it has been added to a mock
memdev array. In the above case, the cxl endpoint driver concludes that
the added memdev is not a mock memdev and calls
devm_cxl_endpoint_decoders_setup() for it rather than
mock_endpoint_decoders_setup().
An appropriate solution is to add a new mock device to a mock device
array before calling platform_device_add() for it. This guarantees that
the new mock device is visible to the CXL subsystem.
This patch introduces a new helper called cxl_mock_platform_device_add()
to handle the issue, and uses it for all mock device additions.
Fixes: 3a2b97b3210b ("cxl/test: Improve init-order fidelity relative to real-world systems")
Signed-off-by: Li Ming <ming.li@zohomail.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260520121457.234404-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Linus Torvalds [Tue, 26 May 2026 20:49:13 +0000 (13:49 -0700)]
Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
Petr Pavlu [Fri, 27 Mar 2026 07:59:03 +0000 (08:59 +0100)]
module, riscv: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all riscv-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:02 +0000 (08:59 +0100)]
module, m68k: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all m68k-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:01 +0000 (08:59 +0100)]
module, arm64: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm64-specific module sections.
Petr Pavlu [Fri, 27 Mar 2026 07:59:00 +0000 (08:59 +0100)]
module, arm: force sh_addr=0 for arch-specific sections
When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].
Force sh_addr=0 for all arm-specific module sections.
Linus Torvalds [Tue, 26 May 2026 20:37:26 +0000 (13:37 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fix from Shuah Khan:
"Fix a use-after-free in kunit debugfs when using kunit.filter when the
executor frees dynamically allocated resources after running boot-time
tests. This resulted in a fatal hardware exception due to invalidation
of capability flags on the reclaimed memory on some architectures such
as CHERI RISC-V that support the feature, and silent memory corruption
on others.
The fix for this couples the lifetime of the filtered suite memory
allocation to the lifetime of the kunit subsystem and its associated
VFS nodes. Ownership of the boot-time suite_set is now transferred to
a global tracker ('kunit_boot_suites'), and the memory is cleanly
released in kunit_exit() during module teardown"
* tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: fix use-after-free in debugfs when using kunit.filter
Borislav Petkov [Wed, 13 May 2026 20:06:01 +0000 (22:06 +0200)]
x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest
Patch in Fixes: causes the usual:
unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
Call Trace:
early_init_intel
early_cpu_init
setup_arch
_printk
start_kernel
x86_64_start_reservations
x86_64_start_kernel
common_startup_64
because the kernel is booted in a guest.
In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.
The platform ID needs to be read as early as when microcode is loaded on
the BSP:
and by that time, CPUID leafs haven't been parsed yet.
The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.
Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
Mostafa Saleh [Tue, 26 May 2026 12:53:17 +0000 (12:53 +0000)]
irqchip/gic-v4: Don't advertise VLPIs if no ITS is probed
When accidentally setting “kvm-arm.vgic_v4_enable=1” on a system that has
no MSI controller device tree node and GICv4, it results in a panic as
“gic_domain” is NULL and the kernel attempts to access it.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
Mem abort info:
ESR = 0x0000000096000006
Tvrtko Ursulin [Sat, 23 May 2026 10:34:18 +0000 (11:34 +0100)]
drm/xe: Assign queue name in time for drm_sched_init
Currently the queue name is only assigned after the drm scheduler instance
has been created. This loses information with all logging or debug
workqueue facilities so let's re-order things a bit so the name gets
assigned in time.
To be able to assign a GuC ID early we split the allocation into
reservation and publish phases.
First, with the submission state lock held, we reserve the ID in the GuC
ID manager, which serves as an authoritative source of truth. Then we can
drop the lock and reserve entries in the exec queue lookup XArray. This
can be lockless since the NULL entries are invisible both to the kernel
and userspace. Only after the queue has been fully created do we replace the
reserved entries with the queue pointer, which can be done locklessly for
single width queues.
Kevin Cheng [Fri, 22 May 2026 23:26:57 +0000 (16:26 -0700)]
KVM: x86: Widen x86_exception's error_code to 64 bits
Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).
Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.
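The effect of the field width can be seen in a small userspace sketch. The two structs and the store helpers are hypothetical stand-ins, not struct x86_exception itself; only the bit-32 position of PFERR_GUEST_FINAL_MASK is taken from the text above.

```c
#include <stdint.h>

/* Bit position as stated above. */
#define PFERR_GUEST_FINAL_MASK (1ULL << 32)

/* Hypothetical stand-ins for the field before and after widening. */
struct exception_narrow {
	uint16_t error_code;
};

struct exception_wide {
	uint64_t error_code;
};

/* Stores into the 16-bit field; everything above bit 15 is dropped. */
static uint64_t store_narrow(uint64_t error_code)
{
	struct exception_narrow e = { .error_code = (uint16_t)error_code };

	return e.error_code;
}

/* Stores into the 64-bit field; the high information bits survive. */
static uint64_t store_wide(uint64_t error_code)
{
	struct exception_wide e = { .error_code = error_code };

	return e.error_code;
}
```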
Piotr Zarycki [Sat, 23 May 2026 11:18:57 +0000 (13:18 +0200)]
KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET
Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.
KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap
In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).
Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.
KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs
Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.
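One way such a macro could look is sketched below. This is an assumption for illustration: free_fd_sketch() is a hypothetical name, and the selftests' actual kvm_free_fd() definition may differ (for example in how it reports close() failures).

```c
#include <unistd.h>

/*
 * Illustrative only. Closes the descriptor if it is valid, then
 * poisons the variable with -1 so a later reuse fails loudly
 * instead of hitting an unrelated, recycled descriptor.
 */
#define free_fd_sketch(fd)		\
do {					\
	if ((fd) >= 0)			\
		close(fd);		\
	(fd) = -1;			\
} while (0)
```

Taking the variable itself, rather than its value, is what allows the macro to invalidate it; a plain function would need a pointer argument to do the same.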
KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0
When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed int so that negative values are correctly
detected. Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.
Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
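The signedness problem in the guest_memfd check can be sketched as follows. struct region_sketch and guest_memfd_is_open() are hypothetical; the only detail taken from the commit message is that the field is an unsigned 32-bit value. The cast relies on the usual two's-complement conversion performed by GCC and Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the uAPI struct, whose fd field is a __u32. */
struct region_sketch {
	uint32_t guest_memfd;
};

static bool guest_memfd_is_open(const struct region_sketch *r)
{
	/*
	 * Comparing r->guest_memfd >= 0 directly would always be true
	 * for an unsigned field, so cast to int before the check.
	 */
	return (int)r->guest_memfd >= 0;
}
```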
Zongyao Chen [Fri, 22 May 2026 17:21:50 +0000 (10:21 -0700)]
KVM: selftests: Test guest_memfd binding overlap without GPA overlap
The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot. KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.
Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.
Zongyao Chen [Fri, 22 May 2026 17:21:49 +0000 (10:21 -0700)]
KVM: guest_memfd: Return -EEXIST for overlapping bindings
KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path. That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.
Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.
Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST. Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.
Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST]
Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
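The distinction between the two errnos can be modelled with a simple range check. This is a sketch under assumptions: struct binding and bind_range_sketch() are hypothetical and much simpler than kvm_gmem_bind(), which tracks bindings differently.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an existing binding: [start, start + npages). */
struct binding {
	uint64_t start;
	uint64_t npages;
};

/*
 * Returns -EINVAL for malformed parameters, -EEXIST when the requested
 * range overlaps an existing binding, and 0 when the range is free.
 */
static int bind_range_sketch(const struct binding *existing, size_t count,
			     uint64_t start, uint64_t npages)
{
	size_t i;

	if (npages == 0 || start + npages < start)
		return -EINVAL;

	for (i = 0; i < count; i++) {
		uint64_t end = existing[i].start + existing[i].npages;

		if (start < end && existing[i].start < start + npages)
			return -EEXIST;
	}
	return 0;
}
```

With distinct return values, a caller can tell "this range is already taken" apart from "these parameters make no sense".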
Thomas Weißschuh [Mon, 25 May 2026 08:27:16 +0000 (10:27 +0200)]
selftests/nolibc: use mutable buffer for execve() argv string
The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.
Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible with
ipq9574 as the fallback.
Dan Carpenter [Mon, 25 May 2026 07:16:27 +0000 (10:16 +0300)]
iio: core: fix uninitialized data in debugfs
If *ppos is non-zero then simple_write_to_buffer() will not initialize
the start of buf[]. Non zero values for *ppos aren't going to work
anyway. Test for them at the start of the function and return -EINVAL.
Fixes: 6d5dd486c715 ("iio: core: make use of simple_write_to_buffer()")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Maxwell Doose <m32285159@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
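The problem and the guard can be modelled in userspace. This is a sketch only: write_to_buffer_model() imitates the offset handling of simple_write_to_buffer() with memcpy() in place of the user copy, and debugfs_write_sketch() is a hypothetical handler, not the IIO code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Userspace model of simple_write_to_buffer(): data lands at to + *ppos. */
static long write_to_buffer_model(char *to, size_t available, long *ppos,
				  const char *from, size_t count)
{
	long pos = *ppos;

	if (pos < 0)
		return -EINVAL;
	if ((size_t)pos >= available || count == 0)
		return 0;
	if (count > available - (size_t)pos)
		count = available - (size_t)pos;
	memcpy(to + pos, from, count);
	*ppos = pos + (long)count;
	return (long)count;
}

/* Hypothetical handler: reject non-zero offsets so buf[] is filled from index 0. */
static long debugfs_write_sketch(const char *from, size_t count, long *ppos,
				 char *out, size_t out_size)
{
	char buf[32];
	long ret;

	if (*ppos != 0)
		return -EINVAL;

	ret = write_to_buffer_model(buf, sizeof(buf) - 1, ppos, from, count);
	if (ret < 0)
		return ret;
	buf[ret] = '\0';

	/* Hand the now fully initialized string back to the caller. */
	snprintf(out, out_size, "%s", buf);
	return ret;
}
```

Without the *ppos check, a write at offset 3 would leave buf[0..2] holding stack garbage that the handler would then parse.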
Dan Carpenter [Mon, 25 May 2026 07:16:11 +0000 (10:16 +0300)]
iio: backend: fix uninitialized data in debugfs
If the *ppos value is non-zero then simple_write_to_buffer() will not
initialize the start of the buf[] buffer. Non-zero ppos values aren't
going to work at all. Check for that at the start of the function and
return -ENOSPC.
Fixes: cdf01e0809a4 ("iio: backend: add debugFs interface")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Dan Carpenter [Mon, 25 May 2026 07:15:46 +0000 (10:15 +0300)]
iio: dac: ad3552r-hs: fix uninitialized data in ad3552r_hs_write_data_source()
If the *ppos value is non-zero then the simple_write_to_buffer() function
won't initialize the start of the buf[] buffer. Non-zero values for
*ppos won't work here generally, so just test for them and return -ENOSPC
at the start of the function.
Fixes: b1c5d68ea66e ("iio: dac: ad3552r-hs: add support for internal ramp") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Angelo Dureghello <adureghello@baylibre.com> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Stepan Ionichev [Wed, 20 May 2026 19:09:24 +0000 (00:09 +0500)]
iio: adc: qcom-spmi-iadc: balance enable_irq_wake() on driver unbind
iadc_probe() calls enable_irq_wake() after a successful
devm_request_irq(), but the driver has no remove callback or
matching disable_irq_wake(), so the wake reference count on the
IRQ is leaked on module unload or driver unbind.
Check the IRQ request error first, then register a devm action
that calls disable_irq_wake() so the wake reference is released
in the same scope as the enable. While here, drop the inverted
"if (!ret) ... else return ret" in favour of the standard
"if (ret) return ret;" pattern.
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
iio: light: al3320a: read both ALS ADC registers again
al3320a_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting the latter.
Fixes: 1850e6ae7f91 ("iio: light: al3320a: Implement regmap support") Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
iio: light: al3010: read both ALS ADC registers again
al3010_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting the latter.
Fixes: 0e5e21e23dd6 ("iio: light: al3010: Implement regmap support") Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Stepan Ionichev [Sun, 17 May 2026 18:26:13 +0000 (23:26 +0500)]
iio: temperature: tmp006: use devm_iio_trigger_register
tmp006_probe() allocates the DRDY trigger with devm_iio_trigger_alloc()
but registers it with plain iio_trigger_register(). The driver has no
.remove() callback, so on module unload the trigger stays in the global
trigger list while its memory is freed by devm, leaving a dangling
entry.
Switch to devm_iio_trigger_register() so the registration is undone in
the same devm scope as the allocation.
Felix Gu [Mon, 27 Apr 2026 11:11:39 +0000 (19:11 +0800)]
iio: buffer: hw-consumer: free scan_mask on buffer release
The scan_mask lifetime changed in commit 9a2e1233d38c ("iio: buffer:
hw-consumer: remove redundant scan_mask flexible array").
Before that change, the scan mask storage was embedded in struct
hw_consumer_buffer, so iio_hw_buf_release() could free the whole
allocation with a single kfree(hw_buf).
That commit moved the scan mask to a separate bitmap_zalloc() allocation
stored in buffer.scan_mask, but left iio_hw_buf_release() unchanged.
Free the scan mask in iio_hw_buf_release() before freeing the buffer
wrapper.
Fixes: 9a2e1233d38c ("iio: buffer: hw-consumer: remove redundant scan_mask flexible array") Signed-off-by: Felix Gu <ustc.gu@gmail.com> Reviewed-by: Nuno Sá <nuno.sa@analog.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Aravind Anilraj [Sun, 29 Mar 2026 07:06:42 +0000 (03:06 -0400)]
thermal: intel: int340x: Check return value of ptc_create_groups()
proc_thermal_ptc_add() ignores the return value of ptc_create_groups()
causing the driver to silently continue even if sysfs group creation
fails.
The thermal control interface would be unavailable with no indication
of failure.
Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.
Aravind Anilraj [Sun, 29 Mar 2026 07:06:41 +0000 (03:06 -0400)]
thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()
The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift amount pushes bits beyond 32, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.
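The truncation described above can be reproduced in plain C. The following is a minimal sketch with hypothetical helper names; it does not reproduce the register layout used by ptc_mmio_write().

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: the shift is evaluated in 32 bits and only then widened. */
static uint64_t place_field_truncating(uint32_t value, unsigned int shift)
{
    assert(shift < 32);    /* shifting a 32-bit value by >= 32 is undefined */
    return value << shift;
}

/* Fixed form: widen first so bits shifted past bit 31 are kept. */
static uint64_t place_field_widened(uint32_t value, unsigned int shift)
{
    assert(shift < 64);
    return (uint64_t)value << shift;
}
```

With value 0xFF and a shift of 28, the first helper yields 0xF0000000 while the second yields 0xFF0000000, which is the difference the cast is meant to remove.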
Zhongqiu Han [Sun, 19 Apr 2026 13:26:54 +0000 (21:26 +0800)]
cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.
The race can be illustrated with two concurrent paths:
Path A (sysfs write, holds attr_set->update_lock):
Path B (work queue, wins the race between A1 and A2):
dbs_work_handler()
mutex_lock(&policy_dbs->update_mutex) /* acquired before A2 */
dbs_update()
ignore_nice = dbs_data->ignore_nice_load /* sees new value: 1 */
cur_nice = kcpustat_field(...)
idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
j_cdbs->prev_cpu_nice = cur_nice
mutex_unlock(&policy_dbs->update_mutex)
Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.
To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:
- gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
prev_cpu_idle, so both baselines share the same timestamp when
io_is_busy changes. This prevents an interval mismatch between
idle_time and nice_delta on the next dbs_update() when
ignore_nice_load is enabled.
- cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
so the baseline is always valid from the first dbs_update() call;
remove the ignore_nice guard and the now-unused ignore_nice variable.
Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor") Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor") Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
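The effect of always advancing the baseline can be shown with a small userspace model. This is a sketch under simplifying assumptions: struct cpu_sample and both helpers are hypothetical, the counter is a plain cumulative value, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cpu_sample {
    uint64_t prev_cpu_nice;    /* baseline taken at the previous sample */
};

/* Old behaviour: the baseline only advances while ignore_nice is set. */
static uint64_t nice_delta_conditional(struct cpu_sample *s,
                                       uint64_t cur_nice, bool ignore_nice)
{
    uint64_t delta = 0;

    assert(cur_nice >= s->prev_cpu_nice);    /* cumulative counter */
    if (ignore_nice) {
        delta = cur_nice - s->prev_cpu_nice;
        s->prev_cpu_nice = cur_nice;
    }
    return delta;
}

/* Fixed behaviour: always sample and advance, use the delta only if asked. */
static uint64_t nice_delta_always(struct cpu_sample *s,
                                  uint64_t cur_nice, bool ignore_nice)
{
    uint64_t delta;

    assert(cur_nice >= s->prev_cpu_nice);
    delta = cur_nice - s->prev_cpu_nice;
    s->prev_cpu_nice = cur_nice;
    return ignore_nice ? delta : 0;
}
```

If the counter reads 1000 while ignore_nice is off and 1010 on the first sample after it is switched on, the conditional variant reports a delta of 1010 (everything since the baseline was last set), whereas the always-tracking variant reports 10.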
Zhongqiu Han [Sun, 19 Apr 2026 13:26:53 +0000 (21:26 +0800)]
cpufreq: governor: Fix data races on per-CPU idle/nice baselines
gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().
Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:
Path A (sysfs write, holds attr_set->update_lock only):
Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.
Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.
Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex. A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex. cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable. Fix this by reading io_busy
inside the mutex.
The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().
Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.
Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor") Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor") Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tal Zussman [Mon, 25 May 2026 18:25:55 +0000 (14:25 -0400)]
block: remove blkdev_write_begin() and blkdev_write_end()
Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.
Yuho Choi [Mon, 25 May 2026 16:25:31 +0000 (12:25 -0400)]
mtip32xx: fix use-after-free on service thread failure
If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.
The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.
Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.
Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.
Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Denis Arefev [Thu, 21 May 2026 07:28:56 +0000 (10:28 +0300)]
block: Avoid mounting the bdev pseudo-filesystem in userspace
The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.
This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0. `inode->i_op = &empty_iops;`
Mateusz Nowicki [Sat, 23 May 2026 12:52:35 +0000 (12:52 +0000)]
block: switch numa_node to int in blk_mq_hw_ctx and init_request
numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.
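The signedness problem can be seen in a few lines of C. This is an illustrative sketch only: MODEL_NUMA_NO_NODE, MODEL_NR_POOLS and both helpers are hypothetical names, and falling back to pool 0 is an assumption made for the example, not what the nvme driver does.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_NUMA_NO_NODE (-1)    /* same value as the kernel's NUMA_NO_NODE */
#define MODEL_NR_POOLS     4       /* hypothetical per-node array size */

/* Store the sentinel in an unsigned field, as the old declaration did. */
static unsigned int store_unsigned(int node)
{
    return (unsigned int)node;
}

/* With a signed parameter the sentinel can be tested before indexing. */
static size_t pool_index(int numa_node)
{
    if (numa_node == MODEL_NUMA_NO_NODE)
        return 0;    /* assumed fallback for the example: first pool */
    assert(numa_node >= 0 && numa_node < MODEL_NR_POOLS);
    return (size_t)numa_node;
}
```

store_unsigned(MODEL_NUMA_NO_NODE) yields UINT_MAX, which is far past the end of any per-node array, while the signed variant still recognizes the sentinel.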
Chao Shi [Fri, 22 May 2026 22:00:25 +0000 (18:00 -0400)]
block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.
syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.
Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead") Acked-by: Sungwoo Kim <iam@sung-woo.kim> Acked-by: Dave Tian <daveti@purdue.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Signed-off-by: Chao Shi <coshi036@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Aaron Tomlin [Mon, 25 May 2026 00:51:23 +0000 (20:51 -0400)]
blk-mq: add tracepoint block_rq_tag_wait
In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.
Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.
This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.
This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:
Ackerley Tng [Fri, 22 May 2026 22:46:10 +0000 (15:46 -0700)]
KVM: SEV: Mark source page dirty when writing back CPUID data on failure
When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before it's read by
userspace.
Ackerley Tng [Fri, 22 May 2026 22:46:09 +0000 (15:46 -0700)]
KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements
Per highmem.h, local kernel mappings must be unmapped in the reverse order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".
Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.
Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.
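The ordering rule can be modelled as a small stack. This is a userspace sketch with hypothetical names; it only illustrates the LIFO constraint and is not how kmap_local_page() or kunmap_local() are implemented.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_MAPS 8

struct map_stack {
    const void *slot[MODEL_MAX_MAPS];
    int depth;
};

/* Record a new mapping on top of the stack. */
static void model_map(struct map_stack *st, const void *addr)
{
    assert(st->depth < MODEL_MAX_MAPS);
    st->slot[st->depth++] = addr;
}

/* Returns false if 'addr' is not the most recently created mapping. */
static bool model_unmap(struct map_stack *st, const void *addr)
{
    if (st->depth == 0 || st->slot[st->depth - 1] != addr)
        return false;
    st->depth--;
    return true;
}
```

Mapping a source and then a destination means the destination has to be unmapped first; in this model, trying to unmap the source first is rejected.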
KVM: SEV: Pin source page for write when adding CPUID data for SNP guest
When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.
Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.
Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory") Cc: stable@vger.kernel.org Signed-off-by: Ackerley Tng <ackerleytng@google.com> Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@] Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark Brown [Tue, 26 May 2026 16:50:18 +0000 (17:50 +0100)]
ASoC: SOF: ipc4-topology: Support for multiple src output formats
Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:
SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.
In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.
Peter Ujfalusi [Tue, 26 May 2026 10:57:48 +0000 (13:57 +0300)]
ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output
The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.
Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com> Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com> Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com Signed-off-by: Mark Brown <broonie@kernel.org>
Peter Ujfalusi [Tue, 26 May 2026 10:57:47 +0000 (13:57 +0300)]
ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc
SRC and ASRC modules must have at least one input and one output format
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats is 0.
Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.
Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.
The kernel updates the stats counters in-place on every copy-fallback
event.
Clément Léger [Tue, 19 May 2026 11:44:33 +0000 (12:44 +0100)]
io_uring/zcrx: notify user on frag copy fallback
Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.
Pavel Begunkov [Tue, 19 May 2026 11:44:32 +0000 (12:44 +0100)]
io_uring/zcrx: notify user when out of buffers
There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.
It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.
When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.
The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
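The post/acknowledge cycle described above can be sketched as a tiny state machine. Everything here is hypothetical: the struct, the helper names and the choice to start with all allowed events armed are assumptions for illustration, not the zcrx implementation or its uapi.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_notif {
    uint32_t allowed;    /* event types requested at registration */
    uint32_t armed;      /* event types that may currently be posted */
};

/* Assumption: every allowed event starts out armed. */
static void model_notif_init(struct model_notif *n, uint32_t allowed)
{
    n->allowed = allowed;
    n->armed = allowed;
}

/* Returns true if a notification would be posted for 'event'. */
static bool model_notif_post(struct model_notif *n, uint32_t event)
{
    assert(event != 0 && (event & (event - 1)) == 0);    /* exactly one bit */
    if (!(n->armed & event))
        return false;
    n->armed &= ~event;    /* stays silent until userspace re-arms it */
    return true;
}

/* Models the acknowledgement step that re-arms processed event types. */
static void model_notif_arm(struct model_notif *n, uint32_t events)
{
    n->armed |= events & n->allowed;
}
```

In this model a second event of the same type is dropped until the user re-arms it, and event types that were never allowed are never posted.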
Pavel Begunkov [Tue, 19 May 2026 11:44:31 +0000 (12:44 +0100)]
io_uring/zcrx: add ctx pointer to zcrx
zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.
Bertie Tryner [Tue, 19 May 2026 11:44:30 +0000 (12:44 +0100)]
io_uring/zcrx: reorder fd allocation in zcrx_export()
Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.
While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.
Pavel Begunkov [Tue, 19 May 2026 11:44:29 +0000 (12:44 +0100)]
io_uring/zcrx: remove extra ifq close
By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.
Pavel Begunkov [Tue, 19 May 2026 11:44:27 +0000 (12:44 +0100)]
io_uring/zcrx: make scrubbing more reliable
Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.
Wentao Liang [Tue, 26 May 2026 10:21:24 +0000 (10:21 +0000)]
block: partitions: fix of_node refcount leak in of_partition()
of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:
1. The compatibility check at the top of the function returns 0
without releasing partitions_np when the node exists but is not
"fixed-partitions" compatible.
2. The function returns 1 at the end after successfully processing
all partitions without releasing partitions_np.
Fix both leaks by adding of_node_put(partitions_np) on each path.
Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF") Cc: stable@vger.kernel.org Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
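The leak pattern and one way to avoid it can be modelled in userspace. This is a sketch with hypothetical types and names; it funnels both exits through a single label, which is a variation on the patch's approach of adding of_node_put() on each path.

```c
#include <assert.h>
#include <stddef.h>

struct model_node {
    int refcount;
    int compatible;    /* non-zero if the node matches the expected binding */
};

static struct model_node *model_node_get(struct model_node *np)
{
    if (np)
        np->refcount++;
    return np;
}

static void model_node_put(struct model_node *np)
{
    if (np) {
        assert(np->refcount > 0);
        np->refcount--;
    }
}

/* Returns 0 if the node is not applicable, 1 once it has been handled. */
static int model_parse_partitions(struct model_node *parent)
{
    struct model_node *np = model_node_get(parent);
    int ret = 0;

    if (!np)
        return 0;
    if (!np->compatible)
        goto out;    /* the early exit still drops the reference */

    /* ... the partitions would be parsed here ... */
    ret = 1;
out:
    model_node_put(np);
    return ret;
}
```

After either return value the reference count is back at its starting value, which is the property the fix restores.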
Ethan Tidmore [Sat, 23 May 2026 21:15:22 +0000 (16:15 -0500)]
ASoC: cs35l56-shared-test: Fix possible null pointer dereference
The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.
Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)
Miquel Raynal [Tue, 26 May 2026 14:56:43 +0000 (16:56 +0200)]
mtd: spi-nor: debugfs: Add a locked sectors map
In order to get a very clear view of the sectors being locked, besides
the `params` output giving the ranges, we may want to see a proper map
of the sectors and for each of them, their status. Depending on the use
case, this map may be easier to parse by humans and gives a more accurate
feeling of the situation. At least myself, for the few locking-related
developments I recently went through, I found it very useful to get a
clearer mental model of what was locked/unlocked.
The output is wrapped at 64 sectors, spaces every 16 sectors are
improving the readability, every line starts by the first sector
offset (hex) and number (decimal).
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
[pratyush@kernel.org: split the debugfs_create_file() into two lines] Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Linus Torvalds [Tue, 26 May 2026 15:23:19 +0000 (08:23 -0700)]
Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
Miquel Raynal [Tue, 26 May 2026 14:56:42 +0000 (16:56 +0200)]
mtd: spi-nor: debugfs: Add locking support
The ioctl output may be counter intuitive in some cases. Asking for a
"locked status" over a region that is only partially locked will return
"unlocked" whereas in practice maybe the biggest part is actually
locked.
Knowing what is the real software locking state through debugfs would be
very convenient for development/debugging purposes, hence this proposal
for adding an extra block at the end of the file: a "locked sectors"
array which lists every section, if it is locked or not, showing both
the address ranges and the sizes in numbers of "lock sectors" (which on
small density devices is typically different than erase blocks).
Here is an example of output, what is after the "sector map" is new.
Miquel Raynal [Tue, 26 May 2026 14:56:41 +0000 (16:56 +0200)]
mtd: spi-nor: Create a local SR cache
In order to be able to generate debugfs output without having to
actually reach the flash, create a SPI NOR local cache of the status
registers. What matters in our case are all the bits related to sector
locking. As such, in order to make it clear that this cache is not
intended to be used anywhere else, we zero the irrelevant bits.
The cache is initialized once during the early init, and then maintained
every time the write protection scheme is updated.
Miquel Raynal [Tue, 26 May 2026 14:56:40 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Cosmetic changes
As a final preparation step for the introduction of CMP support, make
a few more cosmetic changes to simplify the reading of the diff when
adding the CMP feature. In particular, define "min_prot_len" earlier as
it will be reused and move the definition of the "ret" variable at the
end of the stack just because it looks better.
Miquel Raynal [Tue, 26 May 2026 14:56:39 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Simplify checking the locked/unlocked range
In both the locking/unlocking steps, at the end we verify whether we do
not lock/unlock more than requested (in which case an error must be
returned).
While it is possible to do that with very simple mask comparisons, it
does not scale when adding extra locking features such as the CMP
possibility. In order to make these checks slightly easier to read and
more future proof, use existing helpers to read the (future) status
register, extract the covered range, and compare it with very usual
algebraic comparisons.
Miquel Raynal [Tue, 26 May 2026 14:56:38 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Create helpers for building the SR register
The status register contains 3 or 4 BP (Block Protect) bits, 0 or 1
TB (Top/Bottom) bit, soon 0 or 1 CMP (Complement) bit. The last BP bit
and the TB bit locations change between vendors. The whole logic of
building the content of the status register based on some input
conditions is used two times and soon will be used 4 times.
Miquel Raynal [Tue, 26 May 2026 14:56:36 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Rename a mask
"mask" is not very descriptive when we already manipulate two masks, and
soon will manipulate three. Rename it "bp_mask" to align with the
existing "tb_mask" and soon "cmp_mask".
Miquel Raynal [Tue, 26 May 2026 14:56:35 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Create a helper that writes SR, CR and checks
There are many helpers already to either read and/or write SR and/or CR,
as well as sometimes check the returned values. In order to be able to
switch from a 1-byte status register to a 2-byte status register while
keeping the same level of verification, let's introduce a new helper
that writes them both (atomically) and then reads them back (separated)
to compare the values.
In case 2-byte registers are not supported, we still have the usual
fallback available in the helper being exported to the rest of the core.
Miquel Raynal [Tue, 26 May 2026 14:56:34 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Use a pointer for SR instead of a single byte
At this stage, the Status Register is most often seen as a single
byte. This is subject to change when we will need to read the CMP bit
which is located in the Control Register (kind of secondary status
register). Both will need to be carried.
Change a few prototypes to carry a u8 pointer. This way it also makes it
very clear where we access the first register, and where we will access
the second.
Miquel Raynal [Tue, 26 May 2026 14:56:33 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Clarify a comment
The comment states that some power of two sizes are not supported. This
is very device dependent (based on the size), so tone the sentence down
a bit to make it more accurate.
Miquel Raynal [Tue, 26 May 2026 14:56:32 +0000 (16:56 +0200)]
mtd: spi-nor: swp: Explain the MEMLOCK ioctl implementation behaviour
Add more details about how these requests are actually handled in the
SPI NOR core. Their behaviour was not entirely clear to me at first, and
explaining them in plain English sounds the way to go.
Miquel Raynal [Tue, 26 May 2026 14:56:31 +0000 (16:56 +0200)]
mtd: spi-nor: debugfs: Enhance output
Align the number of dashes to the bigger column width (the title in this
case) to make the output more pleasant and aligned with what is done
in the "params" file output.