]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
3 weeks agokbuild: rust: add AutoFDO support
Alice Ryhl [Tue, 31 Mar 2026 10:57:49 +0000 (10:57 +0000)] 
kbuild: rust: add AutoFDO support

This patch enables AutoFDO build support for Rust code within the Linux
kernel. This allows Rust code to be profiled and optimized based on the
profile.

The RUSTFLAGS variable was suffixed with *_AUTOFDO_CLANG to match the
naming of the config option, which is called CONFIG_AUTOFDO_CLANG.

This implementation has been verified in Android, first by inspecting
the object files and confirming that they look correct. After that,
it was verified as below:

1. Running the binderAddInts benchmark [1] with Rust Binder built as
   rust_binder.ko module, using a Pixel 9 Pro.
2. Collecting a profile on a Pixel 10 Pro XL using the app-launch
   benchmark, which starts different apps many times, on a device with
   Rust Binder as a built-in kernel module. (C Binder was not present on
   the device.)
3. Using the collected profile, run the binderAddInts benchmark again
   with Rust Binder built both as a rust_binder.ko module, and as a
   built-in kernel module.
4. In both cases, Rust Binder without AutoFDO was approximately 13%
   slower than the AutoFDO optimized version. Built-in vs .ko did not
   make a measurable performance difference.

All of the above was verified in conjunction with my helpers inlining
series [2], which confirmed that this worked correctly for helpers too
once [3] was fixed in the helpers inlining series.

Link: https://android.googlesource.com/platform/system/extras/+/920f089/tests/binder/benchmarks/binderAddInts.cpp
Link: https://lore.kernel.org/r/20260203-inline-helpers-v2-0-beb8547a03c9@google.com
Link: https://lore.kernel.org/r/aasPsbMEsX6iGUl8@google.com
Reviewed-by: Rong Xu <xur@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Nicolas Schier <nsc@kernel.org>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260331-autofdo-v2-1-eb5c5964820d@google.com
[ Reworded for typos. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
3 weeks agoi2c: davinci: fix division by zero on missing clock-frequency
Chaitanya Sabnis [Tue, 26 May 2026 10:22:40 +0000 (15:52 +0530)] 
i2c: davinci: fix division by zero on missing clock-frequency

When the 'clock-frequency' property is missing from the device tree,
the driver falls back to DAVINCI_I2C_DEFAULT_BUS_FREQ. However, this
macro was defined in kHz (100), whereas the device tree property is
expected in Hz.

The probe function divided the fallback value by 1000, causing
integer truncation that resulted in dev->bus_freq = 0. This triggered
a deterministic division-by-zero kernel panic when calculating clock
dividers later in the probe sequence.

Fix this by redefining DAVINCI_I2C_DEFAULT_BUS_FREQ in Hz (100000)
to match the expected device tree property unit, allowing the existing
division logic to work correctly for both cases.

Fixes: b04ce6385979 ("i2c: davinci: kill platform data")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://lore.kernel.org/all/20260514044726.57297C2BCB7@smtp.kernel.org/
Signed-off-by: Chaitanya Sabnis <chaitanya.msabnis@gmail.com>
Cc: <stable@vger.kernel.org> # v6.14+
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260526102240.4949-1-chaitanya.msabnis@gmail.com
3 weeks agoaudit: fix removal of dangling executable rules
Ricardo Robaina [Wed, 13 May 2026 21:47:59 +0000 (18:47 -0300)] 
audit: fix removal of dangling executable rules

When an audited executable is deleted from the disk, its dentry
becomes negative. Any later attempt to delete the associated audit
rule will lead to audit_alloc_mark() encountering this negative
dentry and immediately aborting, returning -ENOENT.

This early abort prevents the subsystem from allocating the temporary
fsnotify mark needed to construct the search key, meaning the kernel
cannot find the existing rule in its own lists to delete it. This
leaves a dangling rule in memory, resulting in the following error
while attempting to delete the rule:

 # ./audit-dupe-exe-deadlock.sh
 No rules
 Error deleting rule (No such file or directory)
 There was an error while processing parameters

 # auditctl -l
 -a always,exit -S all -F exe=/tmp/file -F path=/tmp/file -F key=dr

 # auditctl -D
 Error deleting rule (No such file or directory)
 There was an error while processing parameters

This patch fixes this issue by removing the d_really_is_negative()
check. By doing so, a dummy mark can be successfully generated for
the deleted path, which allows the audit subsystem to properly match
and flush the dangling rule.

Cc: stable@kernel.org
Fixes: 76a53de6f7ff ("VFS/audit: introduce kern_path_parent() for audit")
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
3 weeks agoKVM: x86: Treat KVM's virtual PMU as disabled for TDX VMs
Vishal Annapurve [Fri, 22 May 2026 15:15:34 +0000 (15:15 +0000)] 
KVM: x86: Treat KVM's virtual PMU as disabled for TDX VMs

Introduce a "protected PMU" concept, and use it to disable KVM's virtual
PMU for TDX VMs, as the PMU state for TDX VMs is virtualized by the TDX
Module[1], i.e. _can't_ emulated/virtualized by KVM, and KVM doesn't yet
support enabling/exposing PMU functionality for/to TDX VMs.  For now,
simply treat the PMU as disabled, as it's not clear what all needs to be
changed, e.g. KVM needs to do at least:

 1) Configure TD_PARAMS to allow guests to use performance monitoring.
 2) Restrict the TD to a subset of the PEBS counters if supported.
 3) Limit the TD to setup a certain perfmon events using basic/enhanced
    event filtering.

Explicitly disallow enabling the PMU via KVM_CAP_PMU_CAPABILITY for VMs
with a protected PMU to prevent userspace from circumventing KVM's
protections.

Link: https://cdrdv2.intel.com/v1/dl/getContent/733575
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Link: https://patch.msgid.link/20260522151534.652522-1-vannapurve@google.com
[sean: massage shortlog and changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agodrm/i915/display: stop passing i to for_each_pipe_crtc_modeset_{enable, disable}()
Jani Nikula [Wed, 13 May 2026 07:58:40 +0000 (10:58 +0300)] 
drm/i915/display: stop passing i to for_each_pipe_crtc_modeset_{enable, disable}()

Refactor for_each_pipe_crtc_modeset_{enable,disable}() and their
underlying for_each_crtc_in_masks{,_reverse}() helpers to utilize
__UNIQUE_ID() to avoid having to pass the for loop variable to them.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/2270d4a10663bb55d5b16902b02798234f440517.1778659089.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agodrm/i915/display: stop passing i to for_each_*_intel_crtc_in_state() macros
Jani Nikula [Wed, 13 May 2026 07:58:39 +0000 (10:58 +0300)] 
drm/i915/display: stop passing i to for_each_*_intel_crtc_in_state() macros

None of the for_each_*_intel_crtc_in_state() macros or their users
actually need the CRTC index i variable anymore. Remove them.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/edb9dc76cb9cad50622a1f425abaf076d1509888.1778659089.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agodrm/i915/display: pass struct intel_display to all for_each_intel_crtc*() macros
Jani Nikula [Wed, 13 May 2026 07:58:38 +0000 (10:58 +0300)] 
drm/i915/display: pass struct intel_display to all for_each_intel_crtc*() macros

Now that the for_each_intel_crtc*() iterator macros primarily use
display->pipe_list for iteration, it's more convenient to pass struct
intel_display to them directly instead of struct drm_device. Make it so.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/90ec6b84d772a4842d4816efc10042ec4403e996.1778659089.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agodrm/i915/display: always pass display->drm to for_each_intel_crtc*()
Jani Nikula [Wed, 13 May 2026 07:58:37 +0000 (10:58 +0300)] 
drm/i915/display: always pass display->drm to for_each_intel_crtc*()

In preparation for always passing struct intel_display to
for_each_intel_crtc*() family of iterators, start off by unifying their
usage to always having struct intel_display *display around, and passing
display->drm to them.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/447a5b2309e213abb849601727d45b406d440c88.1778659089.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agodrm/i915/display: switch from drm_for_each_crtc() to for_each_intel_crtc()
Jani Nikula [Wed, 13 May 2026 07:58:36 +0000 (10:58 +0300)] 
drm/i915/display: switch from drm_for_each_crtc() to for_each_intel_crtc()

intel_has_pending_fb_unpin() has the last direct user of
drm_for_each_crtc() in i915. Switch to for_each_intel_crtc() to ensure
pipe order iteration in all cases.

Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/8ee4320cd15bc35a8b40676faae6db4b33eb50eb.1778659089.git.jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agodrm/{i915, xe}: move xe_display_flush_cleanup_work() to i915 display
Jani Nikula [Mon, 25 May 2026 11:05:53 +0000 (14:05 +0300)] 
drm/{i915, xe}: move xe_display_flush_cleanup_work() to i915 display

xe_display_flush_cleanup_work() is a bit of an oddball function in xe
display code. There shouldn't be anything this specific or xe
specific. While I'm not sure what the correct refactor for the function
should be, move it to shared display code for starters, next to the
eerily similar but slightly different intel_has_pending_fb_unpin() that
is only called from i915 core.

The main goal here is to unblock some refactors on
for_each_intel_crtc().

v2: Add FIXME comment (Ville)

Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260525110553.651208-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
3 weeks agoKVM: selftests: Add nested page fault injection test
Kevin Cheng [Fri, 22 May 2026 23:27:01 +0000 (16:27 -0700)] 
KVM: selftests: Add nested page fault injection test

Add a test that exercises nested page fault injection during L2
execution. L2 executes I/O string instructions (OUTSB/INSB) that access
memory restricted in L1's nested page tables (NPT/EPT), triggering a
nested page fault that L0 must inject to L1.

The test supports both AMD SVM (NPF) and Intel VMX (EPT violation) and
verifies that:
  - The exit reason is an NPF/EPT violation
  - The access type and permission bits are correct
  - The faulting GPA is correct

Three test cases are implemented:
  - Unmap the final data page (final translation fault, OUTSB read)
  - Unmap a PT page (page walk fault, OUTSB read)
  - Write-protect the final data page (protection violation, INSB write)
  - Write-protect a PT page (protection violation on A/D update, OUTSB
    read)

Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: name it nested_tdp_fault_test, consolidate asserts]
Link: https://patch.msgid.link/20260522232701.3671446-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: VMX: Synthesize nested EPT violation GVA_IS_VALID/GVA_TRANSLATED bits
Kevin Cheng [Fri, 22 May 2026 23:27:00 +0000 (16:27 -0700)] 
KVM: VMX: Synthesize nested EPT violation GVA_IS_VALID/GVA_TRANSLATED bits

When injecting an EPT Violation into L2 in response to a fault detected
while emulating an L2 GVA access, synthesize the GVA_IS_VALID and
GVA_TRANSLATED bits using information provided by the walker, instead of
pulling the bits from vmcs02.EXIT_QUALIFICATION.  The information in
vmcs02.EXIT_QUALIFICATION is valid/correct if and only if the fault being
injected into L1 is the direct result of an EPT Violation VM-Exit from L2.

E.g. if KVM is emulating an I/O instruction and the memory operand's
translation through L1's EPT fails, using vmcs02.EXIT_QUALIFICATION is
wrong as the semantics for EXIT_QUALIFICATION would be for an I/O exit,
not an EPT Violation exit.

Opportunistically clean up the formatting for creating the mask of bits
to pull from vmcs02.EXIT_QUALIFICATION.

Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: use plumbed in @access bits, massage changelog]
Link: https://patch.msgid.link/20260522232701.3671446-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits
Kevin Cheng [Fri, 22 May 2026 23:26:59 +0000 (16:26 -0700)] 
KVM: SVM: Fix nested NPF injection of PFERR_GUEST_{PAGE,FINAL}_MASK bits

Fix KVM's generation of PFERR_GUEST_{PAGE,FINAL}_MASK bits when injecting a
Nested Page Fault into L1.  Currently, KVM blindly stuffs GUEST_FINAL into
L1, which is blatantly wrong given that KVM obviously generates NPFs for
page table accesses.

There are two paths that trigger NPF injection: hardware NPF exits (from
L2) and emulation-triggered faults, i.e. when KVM detects a NPF as part of
emulating an L2 GVA access.  For the hardware case, use the bits verbatim
from the VMCB, as KVM is simply forwarding a NPF to L1.  For the emulation
case, propagate the GUEST_{PAGE,FINAL} bits from the access field (which
were recently added for MBEC+GMET support).

To differentiate between the two cases, add "hardware_nested_page_fault"
to "struct x86_exception", and set it when injecting a NPF in response to
an NPF exit from L2.

To help guard against future goofs, assert that exactly one of GUEST_PAGE
or GUEST_FINAL is set when injecting a NPF.  Unlike VMX, there are no
(known) cases where hardware doesn't set either bit, and KVM should always
set one or the other when emulating a GVA access.

Signed-off-by: Kevin Cheng <chengkev@google.com>
[sean: use plumbed in @access bits, massage changelog]
Link: https://patch.msgid.link/20260522232701.3671446-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agonvme-multipath: enable PCI P2PDMA for multipath devices
Kiran Kumar Modukuri [Wed, 13 May 2026 18:51:53 +0000 (11:51 -0700)] 
nvme-multipath: enable PCI P2PDMA for multipath devices

NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when all underlying controllers support it.

Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk()
alongside the other features.  nvme_update_ns_info_block() already
calls queue_limits_stack_bdev() to stack each path's limits onto the
head disk, which routes through blk_stack_limits().  The core now
clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does
not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL
are handled.

Tested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested=by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20260513185153.95552-4-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agomd: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device
Kiran Kumar Modukuri [Wed, 13 May 2026 18:51:52 +0000 (11:51 -0700)] 
md: propagate BLK_FEAT_PCI_P2PDMA from member devices to RAID device

MD RAID does not propagate BLK_FEAT_PCI_P2PDMA from member devices to
the RAID device, preventing peer-to-peer DMA through the RAID layer even
when all underlying devices support it.

Enable BLK_FEAT_PCI_P2PDMA unconditionally in raid0, raid1 and raid10
personalities during queue limits setup.  blk_stack_limits() clears it
automatically if any member device lacks support, consistent with how
BLK_FEAT_NOWAIT and BLK_FEAT_POLL are handled in the block core.

Parity RAID personalities (raid4/5/6) are excluded because they require
CPU access to data pages for parity computation, which is incompatible
with P2P mappings.

Tested with RAID0/1/10 arrays containing multiple NVMe devices with
P2PDMA support, confirming that peer-to-peer transfers work correctly
through the RAID layer.

Tested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Yu Kuai <yukuai@fygo.io>
Tested=by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20260513185153.95552-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblock: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices
Chaitanya Kulkarni [Wed, 13 May 2026 18:51:51 +0000 (11:51 -0700)] 
block: clear BLK_FEAT_PCI_P2PDMA in blk_stack_limits() for non-supporting devices

BLK_FEAT_NOWAIT and BLK_FEAT_POLL are cleared in blk_stack_limits()
when an underlying device does not support them.  Apply the same
treatment to BLK_FEAT_PCI_P2PDMA: stacking drivers set it
unconditionally and rely on the core to clear it whenever a
non-supporting member device is stacked.

Tested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested=by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20260513185153.95552-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoKVM: x86: Tell ->inject_page_fault() whether or a fault came from hardware
Sean Christopherson [Fri, 22 May 2026 23:26:58 +0000 (16:26 -0700)] 
KVM: x86: Tell ->inject_page_fault() whether or a fault came from hardware

When injecting a page fault (including nested TDP faults into L1), tell the
injection routine whether or not the fault originated in hardware, i.e. if
KVM is effectively forwarding a fault it intercept.  For nested TDP fault
injection, KVM needs to grab PAGE_WALK vs. GUEST_FINAL information from the
VMCB/VMCS, _if_ the fault originated in hardware.

Note, simply checking whether or not the original exit was due a #NPF or
EPT Violation isn't sufficient/correct, as the fault being synthesized for
L1 may or may not be the "same" fault that triggered a VM-Exit from L2.
E.g. if access to emulated MMIO in L2 hits a !PRESENT fault (EPT Violation
or #NPF), e.g. because MMIO caching is disabled or it's the first time the
GPA has been accessed by L2, then KVM will enter the emulator.  If
emulating the MMIO instruction then hits a nested TDP fault, e.g. because
L2 was accessing MMIO with a MOVSQ (memory-to-memory move), or because L1
has since unmapped the code stream, then the TDP fault synthesized to L1
will not be the same emulated fault the triggered the VM-Exit.

No functional change intended (nothing uses the new param, yet...).

Link: https://patch.msgid.link/20260522232701.3671446-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agox86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users
Yan Zhao [Thu, 30 Apr 2026 01:50:14 +0000 (09:50 +0800)] 
x86/virt/tdx: Move mk_keyed_paddr() to tdx.c due to no external users

Move mk_keyed_paddr() from tdx.h to tdx.c to avoid unnecessary header
inclusion and improve encapsulation since there are no users outside of
tdx.c.

No functional change intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260430015014.24261-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agox86/tdx: Drop exported function tdx_quirk_reset_page()
Yan Zhao [Thu, 30 Apr 2026 01:50:01 +0000 (09:50 +0800)] 
x86/tdx: Drop exported function tdx_quirk_reset_page()

KVM invokes tdx_quirk_reset_page() to reset TDX control pages (including
S-EPT pages, TDR page, etc.), as all those pages are allocated by KVM TDX
and thus always have struct page.

However, it's also reasonable for KVM to reset those TDX control pages via
tdx_quirk_reset_paddr() directly, eliminating the need to export two
parallel APIs. Keeping tdx_quirk_reset_page() as a one-line helper in the
header file is also unnecessary.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260430015001.24242-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agox86/tdx: Use PFN directly for unmapping guest private memory
Sean Christopherson [Thu, 30 Apr 2026 01:49:48 +0000 (09:49 +0800)] 
x86/tdx: Use PFN directly for unmapping guest private memory

Remove struct page assumptions/constraints in APIs for unmapping guest
private memory and have them take physical address directly.

Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided. [1].

KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tightly tied
to the regular CPU MMUs and the kernel code around them. Using
"struct page" for TDX guest memory is not a good fit anywhere near the KVM
MMU code [2].

Therefore, for unmapping guest private memory: export
tdx_quirk_reset_paddr() for direct KVM invocation, and convert the SEAMCALL
wrapper API tdh_phymem_page_wbinvd_hkid() to take PFN as input (thus
updating mk_keyed_paddr() and tdh_phymem_page_wbinvd_tdr()).

Intentionally have KVM pass PAGE_SIZE (rather than KVM_HPAGE_SIZE(level))
to tdx_quirk_reset_paddr() in tdx_sept_remove_private_spte() to avoid
mixing in huge page changes. The KVM_BUG_ON() check for !PG_LEVEL_4K in
tdx_sept_remove_private_spte() justifies using PAGE_SIZE.

Do not convert tdx_reclaim_page() to use PFN as input since it currently
does not remove guest private memory.

Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_phymem_page_wbinvd_hkid() and tdx_quirk_reset_paddr() are
exported to KVM only.

[Yan: Use kvm_pfn_t,exclude tdx_reclaim_page(),use tdx_quirk_reset_paddr()]

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com
Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260430014948.24226-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agox86/tdx: Use PFN directly for mapping guest private memory
Sean Christopherson [Thu, 30 Apr 2026 01:49:29 +0000 (09:49 +0800)] 
x86/tdx: Use PFN directly for mapping guest private memory

Remove struct page assumptions/constraints in the SEAMCALL wrapper APIs for
mapping guest private memory and have them take PFN directly.

Having core TDX make assumptions that guest private memory must be backed
by struct page (and/or folio) will create subtle dependencies on how
KVM/guest_memfd allocates/manages memory (e.g., whether it uses memory
allocated from core MM, if the memory is refcounted, or if the folio is
split) that are easily avoided. [1].

KVM's MMUs work with PFNs. This is very much an intentional design choice.
It ensures that the KVM MMUs remain flexible and are not too tied to the
regular CPU MMUs and the kernel code around them. Using 'struct page' for
TDX guest memory is not a good fit anywhere near the KVM MMU code [2].

Use "kvm_pfn_t pfn" for type safety. Using this KVM type is appropriate
since APIs tdh_mem_page_add() and tdh_mem_page_aug() are exported to KVM
only.

[ Yan: Replace "u64 pfn" with "kvm_pfn_t pfn" ]

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/aWgyhmTJphGQqO0Y@google.com
Link: https://lore.kernel.org/all/ac7V0g2q2hN3dU5u@google.com
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20260430014929.24210-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoARM: rockchip: keep reset control around
Heiko Stuebner [Thu, 21 May 2026 21:09:15 +0000 (23:09 +0200)] 
ARM: rockchip: keep reset control around

Do not put the reset control, retain exclusive control over it.
After turning on a CPU, the corresponding reset line must stay
deasserted.

This also avoids calling reset_control_put() before workqueues
are operational.

Fixes: 78ebbff6d1a0 ("reset: handle removing supplier before consumers")
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Tested-by: Steven Price <steven.price@arm.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20260521210915.2331176-1-heiko@sntech.de
3 weeks agoaudit: use 'unsigned int' instead of 'unsigned'
Ricardo Robaina [Thu, 14 May 2026 15:13:20 +0000 (12:13 -0300)] 
audit: use 'unsigned int' instead of 'unsigned'

Address checkpatch.pl warning below, across the audit subsystem:

  WARNING: Prefer 'unsigned int' to bare use of 'unsigned'

Minor cleanup, no functional changes.

Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
3 weeks agoblk-mq: reinsert cached request to the list
Keith Busch [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)] 
blk-mq: reinsert cached request to the list

A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.

Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/20260526153531.2365935-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agocxl/test: Update mock dev array before calling platform_device_add()
Li Ming [Wed, 20 May 2026 12:14:57 +0000 (20:14 +0800)] 
cxl/test: Update mock dev array before calling platform_device_add()

CXL test environment hits the following error sometimes.

 cxl_mem mem9: endpoint7 failed probe

All mock memdevs are platform firmware devices added by cxl_test module,
and cxl_test module also provides a platform device driver for them to
create a memdev device to CXL subsystem. cxl_test module uses
cxl_rcd/mem_single/mem arrays to store different types of mock memdevs.
CXL drivers calls registered mock functions for a mock memdev by
checking if a given memdev is in these arrays.

When cxl_test module adds these mock memdevs, it always calls
platform_device_add() before adding them to a suitable mock memdev
array. However, there is a small window where CXL drivers calls mock
function for a added memdev before it added to a mock memdev array. In
above case, cxl endpoint driver considers a added memdev was not a mock
memdev, then calling devm_cxl_endpoint_decoders_setup() for it rather
than mock_endpoint_decoders_setup().

An appropriate solution is that adding a new mock device to a mock
device array before calling platform_device_add() for it. It can
guarantee the new mock device is visible to CXL subsystem.

This patch introduces a new helped called cxl_mock_platform_device_add()
to handle the issue, and uses the function for all mock devices addition.

Fixes: 3a2b97b3210b ("cxl/test: Improve init-order fidelity relative to real-world systems")
Signed-off-by: Li Ming <ming.li@zohomail.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260520121457.234404-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
3 weeks agoMerge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Tue, 26 May 2026 20:49:13 +0000 (13:49 -0700)] 
Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Regressions:

   - Tighten bounds checking for sunrpc cache hash tables

   - Don't report key material in the ftrace log

  Stable fix:

   - Fix lockd's implementation of the NLM TEST procedure"

* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  lockd: fix TEST handling when not all permissions are available.
  NFSD: Report whether fh_key was actually updated
  sunrpc: prevent out-of-bounds read in __cache_seq_start()

3 weeks agomodule, riscv: force sh_addr=0 for arch-specific sections
Petr Pavlu [Fri, 27 Mar 2026 07:59:03 +0000 (08:59 +0100)] 
module, riscv: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].

Force sh_addr=0 for all riscv-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
3 weeks agomodule, m68k: force sh_addr=0 for arch-specific sections
Petr Pavlu [Fri, 27 Mar 2026 07:59:02 +0000 (08:59 +0100)] 
module, m68k: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].

Force sh_addr=0 for all m68k-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
3 weeks agomodule, arm64: force sh_addr=0 for arch-specific sections
Petr Pavlu [Fri, 27 Mar 2026 07:59:01 +0000 (08:59 +0100)] 
module, arm64: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].

Force sh_addr=0 for all arm64-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
3 weeks agomodule, arm: force sh_addr=0 for arch-specific sections
Petr Pavlu [Fri, 27 Mar 2026 07:59:00 +0000 (08:59 +0100)] 
module, arm: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
worse compressible, and may cause tools to misbehave [1].

Force sh_addr=0 for all arm-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
3 weeks agoMerge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Tue, 26 May 2026 20:37:26 +0000 (13:37 -0700)] 
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit fix from Shuah Khan:
 "Fix a use-after-free in kunit debugfs when using kunit.filter when the
  executor frees dynamically allocated resources after running boot-time
  tests. This resulted in fatal hardware exception due to invalidation
  of capability flags on the reclaimed memory on some architectures such
  as CHERI RISC-V that support the feature, and silent memory corruption
  on others.

  The fix for this couples the lifetime of the filtered suite memory
  allocation to the lifetime of the kunit subsystem and its associated
  VFS nodes. Ownership of the boot-time suite_set is now transferred to
  a global tracker ('kunit_boot_suites'), and the memory is cleanly
  released in kunit_exit() during module teardown"

* tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: fix use-after-free in debugfs when using kunit.filter

3 weeks agox86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest
Borislav Petkov [Wed, 13 May 2026 20:06:01 +0000 (22:06 +0200)] 
x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest

Patch in Fixes: causes the usual:

  unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
  Call Trace:
   early_init_intel
   early_cpu_init
   setup_arch
   _printk
   start_kernel
   x86_64_start_reservations
   x86_64_start_kernel
   common_startup_64

because the kernel is booted in a guest.

In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.

The platform ID needs to be read as early as when microcode is loaded on
the BSP:

  load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature

and by that time, CPUID leafs haven't been parsed yet.

The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.

Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
3 weeks agoirqchip/gic-v4: Don't advertise VLPIs if no ITS is probed
Mostafa Saleh [Tue, 26 May 2026 12:53:17 +0000 (12:53 +0000)] 
irqchip/gic-v4: Don't advertise VLPIs if no ITS is probed

When accidentally setting “kvm-arm.vgic_v4_enable=1” on a system that has
no MSI controller device tree node and GICv4, it results a panic as
“gic_domain” is NULL and the kernel attempts to access it.

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
    Mem abort info:
      ESR = 0x0000000096000006

    CPU: 1 UID: 0 PID: 295 Comm: lkvm-static Not tainted 7.1.0-rc4-ge3f15ad3970e #5 PREEMPT
    Hardware name: linux,dummy-virt (DT)
    pstate: 81402005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    pc : __irq_domain_instantiate+0x1d4/0x578
    lr : __irq_domain_instantiate+0x1cc/0x578

Set vLPI support to false at init time if the host has no ITS, so it
propagates properly to kvm_vgic_global_state.has_gicv4.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260526125317.3672297-1-smostafa@google.com
3 weeks agodrm/xe: Assign queue name in time for drm_sched_init
Tvrtko Ursulin [Sat, 23 May 2026 10:34:18 +0000 (11:34 +0100)] 
drm/xe: Assign queue name in time for drm_sched_init

Currently the queue name is only assigned after the drm scheduler instance
has been created. This loses information with all logging or debug
workqueue facilities so lets re-order things a bit so the name gets
assigned in time.

To be able to assign a GuC ID early we split the allocation into
reservation and publish phases.

First, with the submission state lock held, we reserve the ID in the GuC
ID manager, which serves as an authoritative source of truth. Then we can
drop the lock and reserve entries in the exec queue lookup XArray. This
can be lockless since the NULL entries are invisible both to the kernel
and userspace. Only after the queue has been fully created we replace the
reserved entries with the queue pointer, which can be done locklessly for
single width queues.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/20260523103418.61832-1-tvrtko.ursulin@igalia.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
3 weeks agoKVM: x86: Widen x86_exception's error_code to 64 bits
Kevin Cheng [Fri, 22 May 2026 23:26:57 +0000 (16:26 -0700)] 
KVM: x86: Widen x86_exception's error_code to 64 bits

Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).

Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.

Signed-off-by: Kevin Cheng <chengkev@google.com>
Link: https://patch.msgid.link/20260522232701.3671446-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET
Piotr Zarycki [Sat, 23 May 2026 11:18:57 +0000 (13:18 +0200)] 
KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET

Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260523111857.195396-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agocall_once:: Fix typo in comment for call_once()
Jiun Jeong [Fri, 1 May 2026 14:44:13 +0000 (23:44 +0900)] 
call_once:: Fix typo in comment for call_once()

Change "succesfully" to "successfully" in the kerneldoc
comment of call_once().

Signed-off-by: Jiun Jeong <jiun.jeong.cs@gmail.com>
Link: https://patch.msgid.link/20260501144413.49419-1-jiun.jeong.cs@gmail.com
[sean: don't scope to KVM, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap
Sean Christopherson [Fri, 22 May 2026 17:02:30 +0000 (10:02 -0700)] 
KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap

In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).

Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.

Reported-by: Wu Fei <wu.fei9@sanechips.com.cn>
Closes: https://lore.kernel.org/all/202605111130.64BBUXDN013040@mse-fl2.zte.com.cn
Cc: Wu Fei <atwufei@163.com>
Link: https://patch.msgid.link/20260522170230.3518669-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: Add and use kvm_free_fd() to harden against fd goofs
Sean Christopherson [Fri, 22 May 2026 17:15:35 +0000 (10:15 -0700)] 
KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs

Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.

Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Fuad Tabba <tabba@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0
Sean Christopherson [Fri, 22 May 2026 17:15:34 +0000 (10:15 -0700)] 
KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0

When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed it so that negative values are correctly
detected.  Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.

Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: Remove unnecessary "%s" formatting of a constant string
Sean Christopherson [Fri, 22 May 2026 17:21:51 +0000 (10:21 -0700)] 
KVM: selftests: Remove unnecessary "%s" formatting of a constant string

Drop superfluous %s formatting from assertions in the guest_memfd overlap
testcases, as the string being printed doesn't require runtime formatting.

No functional change intended.

Reported-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522172151.3530267-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: selftests: Test guest_memfd binding overlap without GPA overlap
Zongyao Chen [Fri, 22 May 2026 17:21:50 +0000 (10:21 -0700)] 
KVM: selftests: Test guest_memfd binding overlap without GPA overlap

The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot.  KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.

Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.

Fixes: 2feabb855df8 ("KVM: selftests: Expand set_memory_region_test to validate guest_memfd()")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: keep the existing GPA overlap testcases]
Link: https://patch.msgid.link/20260522172151.3530267-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: guest_memfd: Return -EEXIST for overlapping bindings
Zongyao Chen [Fri, 22 May 2026 17:21:49 +0000 (10:21 -0700)] 
KVM: guest_memfd: Return -EEXIST for overlapping bindings

KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path.  That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.

Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.

Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST.  Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.

Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST]
Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoselftests/nolibc: test against -Wwrite-strings
Thomas Weißschuh [Mon, 25 May 2026 08:27:17 +0000 (10:27 +0200)] 
selftests/nolibc: test against -Wwrite-strings

Users may use this warning when building their own applications.
Make sure that nolibc does not trigger any such warnings.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-3-ab5cc16c7b23@weissschuh.net
3 weeks agoselftests/nolibc: use mutable buffer for execve() argv string
Thomas Weißschuh [Mon, 25 May 2026 08:27:16 +0000 (10:27 +0200)] 
selftests/nolibc: use mutable buffer for execve() argv string

The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-2-ab5cc16c7b23@weissschuh.net
3 weeks agotools/nolibc: cast default values of program_invocation_name
Thomas Weißschuh [Mon, 25 May 2026 08:27:15 +0000 (10:27 +0200)] 
tools/nolibc: cast default values of program_invocation_name

With -Wwrite-strings the plain assignment triggers a warning as a
'const char *' is assigned to a 'char *', removing the const qualifier.

Casting the const away is fine, as there is no valid modification that
can be done to an empty string anyways.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-1-ab5cc16c7b23@weissschuh.net
3 weeks agospi: dt-bindings: spi-qpic-snand: Add ipq5210 compatible
Varadarajan Narayanan [Thu, 14 May 2026 06:45:31 +0000 (12:15 +0530)] 
spi: dt-bindings: spi-qpic-snand: Add ipq5210 compatible

Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible and with
ipq9574 as the fallback.

Signed-off-by: Varadarajan Narayanan <varadarajan.narayanan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260514-ipq5210-nand-v1-1-cbdd7492e826@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>
3 weeks agoiio: core: fix uninitialized data in debugfs
Dan Carpenter [Mon, 25 May 2026 07:16:27 +0000 (10:16 +0300)] 
iio: core: fix uninitialized data in debugfs

If *ppos is non-zero then simple_write_to_buffer() will not initialize
the start of buf[].  Non zero values for *ppos aren't going to work
anyway.  Test for them at the start of the function and return -EINVAL.

Fixes: 6d5dd486c715 ("iio: core: make use of simple_write_to_buffer()")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Maxwell Doose <m32285159@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: backend: fix uninitialized data in debugfs
Dan Carpenter [Mon, 25 May 2026 07:16:11 +0000 (10:16 +0300)] 
iio: backend: fix uninitialized data in debugfs

If the *ppos value is non-zero then simple_write_to_buffer() will not
initialize the start of the buf[] buffer.  Non-zero ppos values aren't
going to work at all.  Check for that at the start of the function and
return -ENOSPC.

Fixes: cdf01e0809a4 ("iio: backend: add debugFs interface")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: dac: ad3552r-hs: fix uninitialized data ni ad3552r_hs_write_data_source()
Dan Carpenter [Mon, 25 May 2026 07:15:46 +0000 (10:15 +0300)] 
iio: dac: ad3552r-hs: fix uninitialized data ni ad3552r_hs_write_data_source()

If the *ppos value is non-zero then the simple_write_to_buffer() function
won't initialized the start of the buf[] buffer.  Non-zero values for
*ppos won't work here generally, so just test for them and return -ENOSPC
at the start of the function.

Fixes: b1c5d68ea66e ("iio: dac: ad3552r-hs: add support for internal ramp")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Angelo Dureghello <adureghello@baylibre.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: adc: qcom-spmi-iadc: balance enable_irq_wake() on driver unbind
Stepan Ionichev [Wed, 20 May 2026 19:09:24 +0000 (00:09 +0500)] 
iio: adc: qcom-spmi-iadc: balance enable_irq_wake() on driver unbind

iadc_probe() calls enable_irq_wake() after a successful
devm_request_irq(), but the driver has no remove callback or
matching disable_irq_wake(), so the wake reference count on the
IRQ is leaked on module unload or driver unbind.

Check the IRQ request error first, then register a devm action
that calls disable_irq_wake() so the wake reference is released
in the same scope as the enable. While here, drop the inverted
"if (!ret) ... else return ret" in favour of the standard
"if (ret) return ret;" pattern.

Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: light: al3320a: read both ALS ADC registers again
Alexander A. Klimov [Sat, 23 May 2026 11:44:58 +0000 (13:44 +0200)] 
iio: light: al3320a: read both ALS ADC registers again

al3320a_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting latter.

Fixes: 1850e6ae7f91 ("iio: light: al3320a: Implement regmap support")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: light: al3010: read both ALS ADC registers again
Alexander A. Klimov [Sat, 23 May 2026 11:44:57 +0000 (13:44 +0200)] 
iio: light: al3010: read both ALS ADC registers again

al3010_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting latter.

Fixes: 0e5e21e23dd6 ("iio: light: al3010: Implement regmap support")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: temperature: tmp006: use devm_iio_trigger_register
Stepan Ionichev [Sun, 17 May 2026 18:26:13 +0000 (23:26 +0500)] 
iio: temperature: tmp006: use devm_iio_trigger_register

tmp006_probe() allocates the DRDY trigger with devm_iio_trigger_alloc()
but registers it with plain iio_trigger_register(). The driver has no
.remove() callback, so on module unload the trigger stays in the global
trigger list while its memory is freed by devm, leaving a dangling
entry.

Switch to devm_iio_trigger_register() so the registration is undone in
the same devm scope as the allocation.

Fixes: 91f75ccf9f03 ("iio: temperature: tmp006: add triggered buffer support")
Cc: stable@vger.kernel.org
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: buffer: hw-consumer: free scan_mask on buffer release
Felix Gu [Mon, 27 Apr 2026 11:11:39 +0000 (19:11 +0800)] 
iio: buffer: hw-consumer: free scan_mask on buffer release

The scan_mask lifetime changed in commit 9a2e1233d38c ("iio: buffer:
hw-consumer: remove redundant scan_mask flexible array").

Before that change, the scan mask storage was embedded in struct
hw_consumer_buffer, so iio_hw_buf_release() could free the whole
allocation with a single kfree(hw_buf).

That commit moved the scan mask to a separate bitmap_zalloc() allocation
stored in buffer.scan_mask, but left iio_hw_buf_release() unchanged.

Free the scan mask in iio_hw_buf_release() before freeing the buffer
wrapper.

Fixes: 9a2e1233d38c ("iio: buffer: hw-consumer: remove redundant scan_mask flexible array")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agoiio: adc: ad7768-1: Select GPIOLIB
Jonathan Cameron [Sun, 11 Jan 2026 16:55:28 +0000 (16:55 +0000)] 
iio: adc: ad7768-1: Select GPIOLIB

This driver provides a gpio chip a some related functions are not
stubbed if GPIOLIB is not built.

ad7768-1.c:420: undefined reference to `gpiochip_get_data'
ad7768-1.c:498: undefined reference to `devm_gpiochip_add_data_with_key'

Dual sign off reflects change of affiliation whilst this was under review.

Fixes: d569ae0f052e ("iio: adc: ad7768-1: Add GPIO controller support")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512271235.WwAmAbOa-lkp@intel.com/
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
3 weeks agothermal: intel: int340x: Check return value of ptc_create_groups()
Aravind Anilraj [Sun, 29 Mar 2026 07:06:42 +0000 (03:06 -0400)] 
thermal: intel: int340x: Check return value of ptc_create_groups()

proc_thermal_ptc_add() ignores the return value of ptc_create_groups()
causing the driver to silenty continue even if sysfs group creation
fails.

The thermal control interface would be unavailable with no indication
of failure.

Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-3-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
3 weeks agothermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()
Aravind Anilraj [Sun, 29 Mar 2026 07:06:41 +0000 (03:06 -0400)] 
thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()

The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift amount pushes bits beyond 32, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-2-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
3 weeks agocpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load
Zhongqiu Han [Sun, 19 Apr 2026 13:26:54 +0000 (21:26 +0800)] 
cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load

When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.

The race can be illustrated with two concurrent paths:

Path A (sysfs write, holds attr_set->update_lock):

governor_store()
  mutex_lock(&attr_set->update_lock)
  ignore_nice_load_store()
    dbs_data->ignore_nice_load = 1              /* (A1) */
    gov_update_cpu_data(dbs_data)
      mutex_lock(&policy_dbs->update_mutex)     /* (A2) */
        j_cdbs->prev_cpu_nice = kcpustat_field(...)
      mutex_unlock(&policy_dbs->update_mutex)
  mutex_unlock(&attr_set->update_lock)

Path B (work queue, wins the race between A1 and A2):

dbs_work_handler()
  mutex_lock(&policy_dbs->update_mutex)         /* acquired before A2 */
  dbs_update()
    ignore_nice = dbs_data->ignore_nice_load    /* sees new value: 1 */
    cur_nice = kcpustat_field(...)
    idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
    j_cdbs->prev_cpu_nice = cur_nice
  mutex_unlock(&policy_dbs->update_mutex)

Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.

To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:

  - gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
    prev_cpu_idle, so both baselines share the same timestamp when
    io_is_busy changes.  This prevents an interval mismatch between
    idle_time and nice_delta on the next dbs_update() when
    ignore_nice_load is enabled.
  - cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
    so the baseline is always valid from the first dbs_update() call;
    remove the ignore_nice guard and the now-unused ignore_nice variable.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
3 weeks agocpufreq: governor: Fix data races on per-CPU idle/nice baselines
Zhongqiu Han [Sun, 19 Apr 2026 13:26:53 +0000 (21:26 +0800)] 
cpufreq: governor: Fix data races on per-CPU idle/nice baselines

gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex.  A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex.  cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable.  Fix this by reading io_busy
inside the mutex.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
3 weeks agoblock: remove blkdev_write_begin() and blkdev_write_end()
Tal Zussman [Mon, 25 May 2026 18:25:55 +0000 (14:25 -0400)] 
block: remove blkdev_write_begin() and blkdev_write_end()

Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agomtip32xx: fix use-after-free on service thread failure
Yuho Choi [Mon, 25 May 2026 16:25:31 +0000 (12:25 -0400)] 
mtip32xx: fix use-after-free on service thread failure

If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.

The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.

Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.

Fixes: e8b58ef09e84 ("mtip32xx: fix device removal")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260525162531.1406677-1-dbgh9129@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblock: don't set BIO_QUIET for BLK_STS_AGAIN
Christoph Hellwig [Mon, 18 May 2026 06:33:30 +0000 (08:33 +0200)] 
block: don't set BIO_QUIET for BLK_STS_AGAIN

Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.

Remove the special handling for now.  If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agodirect-io: remove IOCB_NOWAIT support
Christoph Hellwig [Mon, 18 May 2026 06:33:29 +0000 (08:33 +0200)] 
direct-io: remove IOCB_NOWAIT support

None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry.  Remove this dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblock: Avoid mounting the bdev pseudo-filesystem in userspace
Denis Arefev [Thu, 21 May 2026 07:28:56 +0000 (10:28 +0300)] 
block: Avoid mounting the bdev pseudo-filesystem in userspace

The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.

This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0. `inode->i_op = &empty_iops;`

[1]

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 23380067 P4D 23380067 PUD 23381067 PMD 0
 Oops: 0010 [#1] PREEMPT SMP KASAN NOPTI
 CPU: 2 PID: 17125 Comm: syz-executor.0 Not tainted 6.1.155-syzkaller-00350-g84221fde2681 #0
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
 RIP: 0010:0x0

 Call Trace:
 <TASK>
 lookup_open.isra.0+0x700/0x1180 fs/namei.c:3460
 open_last_lookups fs/namei.c:3550 [inline]
 path_openat+0x953/0x2700 fs/namei.c:3780
 do_filp_open+0x1c5/0x410 fs/namei.c:3810
 do_sys_openat2+0x171/0x4d0 fs/open.c:1318
 do_sys_open fs/open.c:1334 [inline]
 __do_sys_openat fs/open.c:1350 [inline]
 __se_sys_openat fs/open.c:1345 [inline]
 __x64_sys_openat+0x13c/0x1f0 fs/open.c:1345
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/all/20131010004732.GJ13318@ZenIV.linux.org.uk/T/#
Cc: stable@vger.kernel.org
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260521072857.5078-1-arefev@swemel.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblock: switch numa_node to int in blk_mq_hw_ctx and init_request
Mateusz Nowicki [Sat, 23 May 2026 12:52:35 +0000 (12:52 +0000)] 
block: switch numa_node to int in blk_mq_hw_ctx and init_request

numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1).  Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].

Switch the field and the callback prototype to int and update all
in-tree init_request implementations.  No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.

Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblock: skip sync_blockdev() on surprise removal in bdev_mark_dead()
Chao Shi [Fri, 22 May 2026 22:00:25 +0000 (18:00 -0400)] 
block: skip sync_blockdev() on surprise removal in bdev_mark_dead()

bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.

syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.

Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead")
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoblk-mq: add tracepoint block_rq_tag_wait
Aaron Tomlin [Mon, 25 May 2026 00:51:23 +0000 (20:51 -0400)] 
blk-mq: add tracepoint block_rq_tag_wait

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260525005123.722277-1-atomlin@atomlin.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoKVM: SEV: Mark source page dirty when writing back CPUID data on failure
Ackerley Tng [Fri, 22 May 2026 22:46:10 +0000 (15:46 -0700)] 
KVM: SEV: Mark source page dirty when writing back CPUID data on failure

When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before its read by
userspace.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-5-3f196bfad5a1@google.com
[sean: use set_page_dirty(), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements
Ackerley Tng [Fri, 22 May 2026 22:46:09 +0000 (15:46 -0700)] 
KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements

Per highmem.h, local kernel mappings must be unmapped in the reserve order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".

Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.

Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-4-3f196bfad5a1@google.com
[sean: call out that the bug is benign]
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoKVM: SEV: Pin source page for write when adding CPUID data for SNP guest
Sean Christopherson [Fri, 22 May 2026 22:46:06 +0000 (15:46 -0700)] 
KVM: SEV: Pin source page for write when adding CPUID data for SNP guest

When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware.  Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.

Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.

Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Cc: stable@vger.kernel.org
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
3 weeks agoASoC: SOF: ipc4-topology: Support for multiple src output formats
Mark Brown [Tue, 26 May 2026 16:50:18 +0000 (17:50 +0100)] 
ASoC: SOF: ipc4-topology: Support for multiple src output formats

Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:

SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.

In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.

Link: https://patch.msgid.link/20260526105748.26149-1-peter.ujfalusi@linux.intel.com
3 weeks agoASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output
Peter Ujfalusi [Tue, 26 May 2026 10:57:48 +0000 (13:57 +0300)] 
ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output

The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.

Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
3 weeks agoASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc
Peter Ujfalusi [Tue, 26 May 2026 10:57:47 +0000 (13:57 +0300)] 
ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc

SRC and ASRC modules must have at least one input and on one output formats
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats are 0.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-2-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
3 weeks agoio_uring/zcrx: add shared-memory notification statistics
Clément Léger [Tue, 19 May 2026 11:44:34 +0000 (12:44 +0100)] 
io_uring/zcrx: add shared-memory notification statistics

Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.

Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.

The kernel updates the stats counters in-place on every copy-fallback
event.

Signed-off-by: Clément Léger <cleger@meta.com>
[pavel: rename io_uring_zcrx_notif_stats]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: notify user on frag copy fallback
Clément Léger [Tue, 19 May 2026 11:44:33 +0000 (12:44 +0100)] 
io_uring/zcrx: notify user on frag copy fallback

Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.

Signed-off-by: Clément Léger <cleger@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: notify user when out of buffers
Pavel Begunkov [Tue, 19 May 2026 11:44:32 +0000 (12:44 +0100)] 
io_uring/zcrx: notify user when out of buffers

There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: add ctx pointer to zcrx
Pavel Begunkov [Tue, 19 May 2026 11:44:31 +0000 (12:44 +0100)] 
io_uring/zcrx: add ctx pointer to zcrx

zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: reorder fd allocation in zcrx_export()
Bertie Tryner [Tue, 19 May 2026 11:44:30 +0000 (12:44 +0100)] 
io_uring/zcrx: reorder fd allocation in zcrx_export()

Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.

While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.

Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: remove extra ifq close
Pavel Begunkov [Tue, 19 May 2026 11:44:29 +0000 (12:44 +0100)] 
io_uring/zcrx: remove extra ifq close

By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: poison pointers on unregistration
Pavel Begunkov [Tue, 19 May 2026 11:44:28 +0000 (12:44 +0100)] 
io_uring/zcrx: poison pointers on unregistration

Nobody should be touching area and other pointers after zcrx
destruction, poison them instead of zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoio_uring/zcrx: make scrubbing more reliable
Pavel Begunkov [Tue, 19 May 2026 11:44:27 +0000 (12:44 +0100)] 
io_uring/zcrx: make scrubbing more reliable

Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agowifi: ath12k: fix error unwind on arch_init() failure in PCI probe
Ripan Deuri [Tue, 19 May 2026 19:28:15 +0000 (00:58 +0530)] 
wifi: ath12k: fix error unwind on arch_init() failure in PCI probe

When arch_init() fails in ath12k_pci_probe(), the code jumps to
err_pci_msi_free, leaking resources in teardown.

Redirect the failure path to err_free_irq so teardown matches the setup order.

Compile-tested only.

Fixes: 614c23e24ee8 ("wifi: ath12k: Support arch-specific DP device allocation")
Signed-off-by: Ripan Deuri <ripan.deuri@oss.qualcomm.com>
Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Link: https://patch.msgid.link/20260519192815.3911324-1-ripan.deuri@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
3 weeks agoblock: partitions: fix of_node refcount leak in of_partition()
Wentao Liang [Tue, 26 May 2026 10:21:24 +0000 (10:21 +0000)] 
block: partitions: fix of_node refcount leak in of_partition()

of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:

1. The compatibility check at the top of the function returns 0
   without releasing partitions_np when the node exists but is not
   "fixed-partitions" compatible.

2. The function returns 1 at the end after successfully processing
   all partitions without releasing partitions_np.

Fix both leaks by adding of_node_put(partitions_np) on each path.

Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoASoC: cs35l56-shared-test: Fix possible null pointer dereference
Ethan Tidmore [Sat, 23 May 2026 21:15:22 +0000 (16:15 -0500)] 
ASoC: cs35l56-shared-test: Fix possible null pointer dereference

The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.

Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)

Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260523211522.522616-1-ethantidmore06@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
3 weeks agomtd: spi-nor: debugfs: Add a locked sectors map
Miquel Raynal [Tue, 26 May 2026 14:56:43 +0000 (16:56 +0200)] 
mtd: spi-nor: debugfs: Add a locked sectors map

In order to get a very clear view of the sectors being locked, besides
the `params` output giving the ranges, we may want to see a proper map
of the sectors and for each of them, their status. Depending on the use
case, this map may be easier to parse by humans and gives a more acurate
feeling of the situation. At least myself, for the few locking-related
developments I recently went through, I found it very useful to get a
clearer mental model of what was locked/unlocked.

Here is an example of output:

$ cat /sys/kernel/debug/spi-nor/spi0.0/locked-sectors-map
Locked sectors map (x: locked, .: unlocked, unit: 64kiB)
 0x00000000 (#    0): ................ ................ ................ ................
 0x00400000 (#   64): ................ ................ ................ ................
 0x00800000 (#  128): ................ ................ ................ ................
 0x00c00000 (#  192): ................ ................ ................ ................
 0x01000000 (#  256): ................ ................ ................ ................
 0x01400000 (#  320): ................ ................ ................ ................
 0x01800000 (#  384): ................ ................ ................ ................
 0x01c00000 (#  448): ................ ................ ................ ................
 0x02000000 (#  512): ................ ................ ................ ................
 0x02400000 (#  576): ................ ................ ................ ................
 0x02800000 (#  640): ................ ................ ................ ................
 0x02c00000 (#  704): ................ ................ ................ ................
 0x03000000 (#  768): ................ ................ ................ ................
 0x03400000 (#  832): ................ ................ ................ ................
 0x03800000 (#  896): ................ ................ ................ ................
 0x03c00000 (#  960): ................ ................ ................ ..............xx

The output is wrapped at 64 sectors, spaces every 16 sectors are
improving the readability, every line starts by the first sector
offset (hex) and number (decimal).

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
[pratyush@kernel.org: split the debugfs_create_file() into two lines]
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agoMerge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Tue, 26 May 2026 15:23:19 +0000 (08:23 -0700)] 
Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
  post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off

3 weeks agomtd: spi-nor: debugfs: Add locking support
Miquel Raynal [Tue, 26 May 2026 14:56:42 +0000 (16:56 +0200)] 
mtd: spi-nor: debugfs: Add locking support

The ioctl output may be counter intuitive in some cases. Asking for a
"locked status" over a region that is only partially locked will return
"unlocked" whereas in practice maybe the biggest part is actually
locked.

Knowing what is the real software locking state through debugfs would be
very convenient for development/debugging purposes, hence this proposal
for adding an extra block at the end of the file: a "locked sectors"
array which lists every section, if it is locked or not, showing both
the address ranges and the sizes in numbers of "lock sectors" (which on
small density devices is typically different than erase blocks).

Here is an example of output, what is after the "sector map" is new.

$ cat /sys/kernel/debug/spi-nor/spi0.0/params
name (null)
id ef a0 20 00 00 00
size 64.0 MiB
write size 1
page size 256
address nbytes 4
flags HAS_SR_TB | 4B_OPCODES | HAS_4BAIT | HAS_LOCK | HAS_16BIT_SR | HAS_SR_TB_BIT6 | HAS_4BIT_BP | SOFT_RESET | NO_WP

opcodes
 read 0xec
  dummy cycles 6
 erase 0xdc
 program 0x34
 8D extension none

protocols
 read 1S-4S-4S
 write 1S-1S-4S
 register 1S-1S-1S

erase commands
 21 (4.00 KiB) [1]
 dc (64.0 KiB) [3]
 c7 (64.0 MiB)

sector map
 region (in hex)   | erase mask | overlaid
 ------------------+------------+---------
 00000000-03ffffff |     [   3] | no

locked sectors
 region (in hex)   | status   | #sectors
 ------------------+----------+---------
 00000000-03ffffff | unlocked | 1024

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: Create a local SR cache
Miquel Raynal [Tue, 26 May 2026 14:56:41 +0000 (16:56 +0200)] 
mtd: spi-nor: Create a local SR cache

In order to be able to generate debugfs output without having to
actually reach the flash, create a SPI NOR local cache of the status
registers. What matters in our case are all the bits related to sector
locking. As such, in order to make it clear that this cache is not
intended to be used anywhere else, we zero the irrelevant bits.

The cache is initialized once during the early init, and then maintained
every time the write protection scheme is updated.

Suggested-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Cosmetic changes
Miquel Raynal [Tue, 26 May 2026 14:56:40 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Cosmetic changes

As a final preparation step for the introduction of CMP support, make
a few more cosmetic changes to simplify the reading of the diff when
adding the CMP feature. In particular, define "min_prot_len" earlier as
it will be reused and move the definition of the "ret" variable at the
end of the stack just because it looks better.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Simplify checking the locked/unlocked range
Miquel Raynal [Tue, 26 May 2026 14:56:39 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Simplify checking the locked/unlocked range

In both the locking/unlocking steps, at the end we verify whether we do
not lock/unlock more than requested (in which case an error must be
returned).

While being possible to do that with very simple mask comparisons, it
does not scale when adding extra locking features such as the CMP
possibility. In order to make these checks slightly easier to read and
more future proof, use existing helpers to read the (future) status
register, extract the covered range, and compare it with very usual
algebric comparisons.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Create helpers for building the SR register
Miquel Raynal [Tue, 26 May 2026 14:56:38 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Create helpers for building the SR register

The status register contains 3 or 4 BP (Block Protect) bits, 0 or 1
TB (Top/Bottom) bit, soon 0 or 1 CMP (Complement) bit. The last BP bit
and the TB bit locations change between vendors. The whole logic of
buildling the content of the status register based on some input
conditions is used two times and soon will be used 4 times.

Create dedicated helpers for these steps.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Create a TB intermediate variable
Miquel Raynal [Tue, 26 May 2026 14:56:37 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Create a TB intermediate variable

Ease the future reuse of the tb (Top/Bottom) boolean by creating an
intermediate variable.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Rename a mask
Miquel Raynal [Tue, 26 May 2026 14:56:36 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Rename a mask

"mask" is not very descriptive when we already manipulate two masks, and
soon will manipulate three. Rename it "bp_mask" to align with the
existing "tb_mask" and soon "cmp_mask".

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Create a helper that writes SR, CR and checks
Miquel Raynal [Tue, 26 May 2026 14:56:35 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Create a helper that writes SR, CR and checks

There are many helpers already to either read and/or write SR and/or CR,
as well as sometimes check the returned values. In order to be able to
switch from a 1 byte status register to a 2 bytes status register while
keeping the same level of verification, let's introduce a new helper
that writes them both (atomically) and then reads them back (separated)
to compare the values.

In case 2 bytes registers are not supported, we still have the usual
fallback available in the helper being exported to the rest of the core.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Use a pointer for SR instead of a single byte
Miquel Raynal [Tue, 26 May 2026 14:56:34 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Use a pointer for SR instead of a single byte

At this stage, the Status Register is most often seen as a single
byte. This is subject to change when we will need to read the CMP bit
which is located in the Control Register (kind of secondary status
register). Both will need to be carried.

Change a few prototypes to carry a u8 pointer. This way it also makes it
very clear where we access the first register, and where we will access
the second.

There is no functional change.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Clarify a comment
Miquel Raynal [Tue, 26 May 2026 14:56:33 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Clarify a comment

The comment states that some power of two sizes are not supported. This
is very device dependent (based on the size), so modulate a bit the
sentence to make it more accurate.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: swp: Explain the MEMLOCK ioctl implementation behaviour
Miquel Raynal [Tue, 26 May 2026 14:56:32 +0000 (16:56 +0200)] 
mtd: spi-nor: swp: Explain the MEMLOCK ioctl implementation behaviour

Add more details about how these requests are actually handled in the
SPI NOR core. Their behaviour was not entirely clear to me at first, and
explaining them in plain English sounds the way to go.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: debugfs: Enhance output
Miquel Raynal [Tue, 26 May 2026 14:56:31 +0000 (16:56 +0200)] 
mtd: spi-nor: debugfs: Enhance output

Align the number of dashes to the bigger column width (the title in this
case) to make the output more pleasant and aligned with what is done
in the "params" file output.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
3 weeks agomtd: spi-nor: debugfs: Align variable access with the rest of the file
Miquel Raynal [Tue, 26 May 2026 14:56:30 +0000 (16:56 +0200)] 
mtd: spi-nor: debugfs: Align variable access with the rest of the file

The "params" variable is used everywhere else, align this particular
line of the file to use "params" directly rather than the "nor" pointer.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>