ntfs: fix mmap_prepare writable check for shared mappings
Linus pointed out that checking only VMA_WRITE_BIT is incorrect.
Private writable mappings (MAP_PRIVATE) set VM_WRITE but do not
write back to the filesystem. Also, mappings that can become
writable via mprotect() (VM_MAYWRITE) must be handled.
Use vma_desc_test_all(VMA_SHARED_BIT, VMA_MAYWRITE_BIT) instead,
which matches what other filesystems do.
Tiezhu Yang [Wed, 22 Apr 2026 07:45:34 +0000 (15:45 +0800)]
LoongArch: BPF: Support up to 12 function arguments for trampoline
Currently, LoongArch bpf trampoline supports up to 8 function arguments.
According to the statistics from commit 473e3150e30a ("bpf, x86: allow
function arguments up to 12 for TRACING"), there are over 200 functions
accept 9 to 12 arguments, so add 12 arguments support for trampoline.
With this patch, the following related testcases passed:
sudo ./test_progs -a tracing_struct/struct_many_args
sudo ./test_progs -a fentry_test/fentry_many_args
sudo ./test_progs -a fexit_test/fexit_many_args
Tiezhu Yang [Wed, 22 Apr 2026 07:45:34 +0000 (15:45 +0800)]
LoongArch: BPF: Support small struct arguments for trampoline
In the current BPF code, the struct argument size is at most 16 bytes,
enforced by the verifier. According to the Procedure Call Standard for
LoongArch, the struct argument size below 16 bytes are provided as part
of the 8 argument registers, that is to say, the struct argument may be
passed in a pair of registers if its size is more than 8 bytes and no
more than 16 bytes.
Extend the BPF trampoline JIT to support attachment to functions that
take small structures (up to 16 bytes) as argument, save and restore a
number of "argument registers" rather than a number of arguments.
With this patch, the following related testcases passed:
sudo ./test_progs -a tracing_struct/struct_args
sudo ./test_progs -a tracing_struct/union_args
Tiezhu Yang [Wed, 22 Apr 2026 07:45:34 +0000 (15:45 +0800)]
LoongArch: BPF: Open code and remove invoke_bpf_mod_ret()
invoke_bpf_mod_ret() is a small wrapper over invoke_bpf_prog(), it
should check the return value of invoke_bpf_prog() and then return
immediately if invoke_bpf_prog() failed, just open code and remove
it due to it is called only once.
Tiezhu Yang [Wed, 22 Apr 2026 07:45:34 +0000 (15:45 +0800)]
LoongArch: BPF: Support 8 and 16 bit read-modify-write instructions
The 8 and 16 bit read-modify-write instructions {amadd/amswap}.{b/h}
were newly added in the latest LoongArch Reference Manual, use them to
avoid the error of unknown opcode if possible.
Tiezhu Yang [Wed, 22 Apr 2026 07:45:34 +0000 (15:45 +0800)]
LoongArch: BPF: Add the default case in emit_atomic() and rename it
Like the other archs such as x86 and riscv, add the default case
in emit_atomic() to print an error message for the invalid opcode
and return -EINVAL, then make its return type as int.
While at it, given that all of the instructions in emit_atomic()
are only read-modify-write instructions, rename emit_atomic() to
emit_atomic_rmw() to make it clear, because there will be a new
function emit_atomic_ld_st() for load-acquire and store-release
instructions in the later patch.
Tiezhu Yang [Wed, 22 Apr 2026 07:45:13 +0000 (15:45 +0800)]
LoongArch: Define instruction formats for AM{SWAP/ADD}.{B/H} and DBAR
The 8 and 16 bit read-modify-write atomic instructions amadd.{b/h} and
amswap.{b/h} were newly added in the latest LoongArch Reference Manual,
define the instruction format and check whether support via CPUCFG.
Furthermore, define the instruction format for DBAR which will be used
to support BPF load-acquire and store-release instructions.
LoongArch: Add spectre boundry for syscall dispatch table
The LoongArch syscall number is directly controlled by userspace, but
does not have a array_index_nospec() boundry to prevent access past the
syscall function pointer tables.
Most LoongArch processors are vulnerable to Spectre-V1 Proof-of-Concept
(PoC). And the generic mechanism, __user pointer sanitization, can be
used as a mitigation. This means to use array_index_nospec() to prevent
out of boundry access in syscall and other critical paths.
Implement the arch-specific cpu_show_spectre_v1() to show CPU Spectre-V1
vulnerabilites correctly.
LoongArch: Make arch_irq_work_has_interrupt() true only if IPI HW exist
After commit 7c405fb3279b3924 ("rcu: Use an intermediate irq_work to
start process_srcu()"), Loongson-2K0300/2K0500 fail to boot. Because
IRQ_WORK need IPI but Loongson-2K0300/2K0500 don't have IPI HW.
So make arch_irq_work_has_interrupt() return true only if IPI HW exist.
Luo Qiu [Wed, 22 Apr 2026 07:45:12 +0000 (15:45 +0800)]
LoongArch: Use get_random_canary() for stack canary init
Like others, replace the custom stack canary initialization with the
get_random_canary() helper, following the pattern established in commit 622754e84b10 ("stackprotector: actually use get_random_canary()").
Signed-off-by: Luo Qiu <luoqiu@kylinsec.com.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Yuqian Yang [Wed, 22 Apr 2026 07:45:11 +0000 (15:45 +0800)]
LoongArch: Improve the logging of disabling KASLR
Whether KASLR is disabled is not handled in nokaslr() which is the early
param "nokaslr" setup function, but in kaslr_disabled(). However, the
logging was previously done in nokaslr() and lack detail. So we move the
logging to the right place and add more specific infomation about why it
is disabled.
Lisa Robinson [Wed, 22 Apr 2026 07:45:11 +0000 (15:45 +0800)]
LoongArch: Align FPU register state to 32 bytes
Move fpr to the beginning of struct loongarch_fpu so it is naturally
aligned to FPU_ALIGN (32 bytes), improving 256-bit SIMD (LASX) context
switch performance.
Also adjust process.c and fpu.S to work well with the new loongarch_fpu
layout.
Signed-off-by: Lisa Robinson <lisa@bytefly.space> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
LoongArch: Add HIGHMEM (PKMAP and FIX_KMAP) support
Add HIGHMEM (High Memory) support for LoongArch, mostly needed by 32BIT
kernel because the size of kernel virtual memory space is only 512MB and
the size of usable physical memory is only 256MB in this case.
HIGHMEM adds permanent kernel mapping (PKMAP) and fixed kernel mapping
(FIX_KMAP), which increase usable physical memory up to 2.25GB (2304MB).
We can just use the generic copy_user_highpage(), so remove the custom
version.
After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.
Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.
Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com> Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com> Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
x86/hyperv: move stimer cleanup to hv_machine_shutdown()
Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.
Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com> Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later call
cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().
Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.
Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection") Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com> Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Sangyun Kim [Sun, 19 Apr 2026 08:08:38 +0000 (17:08 +0900)]
pwm: atmel-tcb: Cache clock rates and mark chip as atomic
atmel_tcb_pwm_apply() holds tcbpwmc->lock as a spinlock via
guard(spinlock)() and then calls atmel_tcb_pwm_config(), which calls
clk_get_rate() twice. clk_get_rate() acquires clk_prepare_lock (a
mutex), so this is a sleep-in-atomic-context violation.
On CONFIG_DEBUG_ATOMIC_SLEEP kernels every pwm_apply_state() that
enables or reconfigures the PWM triggers a "BUG: sleeping function
called from invalid context" warning.
Acquire exclusive control over the clock rates with
clk_rate_exclusive_get() at probe time and cache the rates in struct
atmel_tcb_pwm_chip, then read the cached rates from
atmel_tcb_pwm_config(). This keeps the spinlock-based mutual exclusion
introduced in commit 37f7707077f5 ("pwm: atmel-tcb: Fix race condition
and convert to guards") and removes the sleeping calls from the atomic
section.
With no sleeping calls left in .apply() and the regmap-mmio bus already
running with fast_io=true, also mark the chip as atomic so consumers
can use pwm_apply_atomic() from atomic context.
Fixes: 37f7707077f5 ("pwm: atmel-tcb: Fix race condition and convert to guards") Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Link: https://patch.msgid.link/20260419080838.3192357-1-sangyun.kim@snu.ac.kr
[ukleinek: Ensure .clk is enabled before calling clk_get_rate on it.] Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org>
io_uring: take page references for NOMMU pbuf_ring mmaps
Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel
virtual address of the io_mapped_region's backing pages directly;
the user's VMA aliases the kernel allocation. io_uring_mmap() then
just returns 0 -- it takes no page references.
The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on
each inserted page. Those references are released when the VMA is torn
down (zap_pte_range -> put_page). io_free_region() -> release_pages()
drops the io_uring-side references, but the pages survive until munmap
drops the VMA-side references.
Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring ->
io_put_bl -> io_free_region -> release_pages drops the only references
and the pages return to the buddy allocator while the user's VMA still
has vm_start pointing into them. The user can then write into whatever
the allocator hands out next.
Mirror the MMU lifetime: take get_page references in io_uring_mmap() and
release them via vm_ops->close. NOMMU's delete_vma() calls vma_close()
which runs ->close on munmap.
This also incidentally addresses the duplicate-vm_start case: two mmaps
of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer.
With page refs taken per mmap, the second mmap takes its own refs and
the pages survive until both mmaps are closed. The nommu rb-tree BUG_ON
on duplicate vm_start is a separate mm/nommu.c concern (it should share
the existing region rather than BUG), but the page lifetime is now
correct.
Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Anthropic Assisted-by: gkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh
[axboe: get rid of region lookup, just iterate pages in vma] Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'probes-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
"fprobe bug fixes:
- Prevent re-registration
Add an earlier check to reject re-registering an already active
fprobe before its state is modified during the initialization phase
- Robustness in failure paths:
- Ensure fprobes are correctly removed from all internal tables
and properly RCU-freed during registration failure
- Make unregister_fprobe() proceed with unregistration even if
temporary memory allocation fails
- RCU safety in module unloading
Avoid a potential "sleep in RCU" warning by removing a kcalloc()
call in the module notifier path. This also tries to remove
fprobe_hash_node even if memory allocation fails.
- Type-aware unregistration
Fix a bug where unregistering an fprobe did not account for
different types (entry-only vs entry-exit) at the same address,
which previously left "junk" entries in the underlying
ftrace/fgraph ops
- Unregistration of empty ftrace_ops
Avoid unneeded performance overhead due to making registered
ftrace_ops empty - which means 'trace all functions'. This counts
remaining entries and unregister ftrace_ops when it becomes empty.
Two new selftests to check above fixes:
- Module Unloading Test:
Specifically verifies that fprobe events on a module are correctly
cleaned up and do not trigger 'trace-all' behavior when the module
is removed.
- Multiple Fprobe Events Test:
Ensure that having multiple fprobes on the same function correctly
manages the ftrace hash map during removal"
* tag 'probes-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
selftests/ftrace: Add a testcase for multiple fprobe events
selftests/ftrace: Add a testcase for fprobe events on module
tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
tracing/fprobe: Check the same type fprobe on table as the unregistered one
tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
tracing/fprobe: Remove fprobe from hash in failure path
tracing/fprobe: Unregister fprobe even if memory allocation fails
tracing/fprobe: Reject registration of a registered fprobe before init
fixed an issue where poll->events and req->apoll_events weren't
synchronized, but then when the commit referenced in Fixes got added,
it didn't ensure the same thing.
If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then
ensure it's done for both. Including a link to the original report
below, even though it's mostly nonsense. But it includes a reproducer
that does show that IORING_CQE_F_MORE is set in the previous CQE,
while no more CQEs will be generated for this request. Just ignore
anything that pretends this is security related in any way, it's just
the typical AI nonsense.
(u16)0 - 1 promotes to (int)-1 which converts to SIZE_MAX as size_t,
causing a massive out-of-bounds write.
Reject ahslength == 0 with ISCSI_REASON_PROTOCOL_ERROR before the kmalloc.
Also reject ahslength values that exceed the actual AHS buffer advertised.
Fixes: 8f1f7d297bce ("scsi: target: iscsi: Add support for extended CDB AHS") Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Link: https://patch.msgid.link/20260415040728.187680-1-carlos.bilbao@kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Wang Shuaiwei [Tue, 14 Apr 2026 03:37:18 +0000 (11:37 +0800)]
scsi: ufs: core: Fix bRefClkFreq write failure in HS-LSS mode
According to the UFS spec, the bRefClkFreq attribute can only be written
when both sub-links are in LS-MODE. However, in HS LSS mode with
resetmode = HS_MODE, if the UFS device's default bRefClkFreq value
differs from the host controller's dev_ref_clk_freq setting, the write
operation will fail.
To fix this issue, introduce ufshcd_get_op_mode() function to detect the
current link operational mode. Call ufshcd_set_dev_ref_clk() only when
both sub-links are in LS-MODE to ensure the attribute can be written
successfully.
Merge tag 'drm-next-2026-04-22' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm updates from Dave Airlie:
"This is a followup which is mostly next material with some fixes.
Alex pointed out I missed one of his AMD MRs from last week, so I
added that, then Jani sent the pipe reordering stuff, otherwise it's
just some minor i915 fixes and a dma-buf fix.
drm:
- Add support for AMD VSDB parsing to drm_edid
dma-buf:
- fix documentation formatting
i915:
- add support for reordered pipes to support joined pipes better
- Fix VESA backlight possible check condition
- Verify the correct plane DDB entry
amdgpu:
- Audio regression fix
- Use drm edid parser for AMD VSDB
- Misc cleanups
- VCE cs parse fixes
- VCN cs parse fixes
- RAS fixes
- Clean up and unify vram reservation handling
- GPU Partition updates
- system_wq cleanups
- Add CONFIG_GCOV_PROFILE_AMDGPU kconfig option
- SMU vram copy updates
- SMU 13/14/15 fixes
- UserQ fixes
- Replace pasid idr with an xarray
- Dither handling fix
- Enable amdgpu by default for CIK APUs
- Add IBs to devcoredump
amdkfd:
- system_wq cleanups
radeon:
- system_wq cleanups"
* tag 'drm-next-2026-04-22' of https://gitlab.freedesktop.org/drm/kernel: (62 commits)
drm/i915/display: change pipe allocation order for discrete platforms
drm/i915/wm: Verify the correct plane DDB entry
drm/i915/backlight: Fix VESA backlight possible check condition
drm/i915: Walk crtcs in pipe order
drm/i915/joiner: Make joiner "nomodeset" state copy independent of pipe order
dma-buf: fix htmldocs error for dma_buf_attach_revocable
drm/amdgpu: dump job ibs in the devcoredump
drm/amdgpu: store ib info for devcoredump
drm/amdgpu: extract amdgpu_vm_lock_by_pasid from amdgpu_vm_handle_fault
drm/amdgpu: Use amdgpu by default for CIK APUs too
drm/amd/display: Remove unused NUM_ELEMENTS macros
drm/amd/display: Replace inline NUM_ELEMENTS macro with ARRAY_SIZE
drm/amdgpu: save ring content before resetting the device
drm/amdgpu: make userq fence_drv drop explicit in queue destroy
drm/amdgpu: rework userq fence driver alloc/destroy
drm/amdgpu/userq: use dma_fence_wait_timeout without test for signalled
drm/amdgpu/userq: call dma_resv_wait_timeout without test for signalled
drm/amdgpu/userq: add the return code too in error condition
drm/amdgpu/userq: fence wait for max time in amdgpu_userq_wait_for_signal
drm/amd/display: Change dither policy for 10 bpc output back to dithering
...
selftests/ftrace: Add a testcase for fprobe events on module
Add a testcase for fprobe events on module, which unloads a kernel
module on which fprobe events are probing and ensure the ftrace
hash map is cleared correctly.
tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.
Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.
Alex Markuze [Tue, 10 Feb 2026 09:06:26 +0000 (09:06 +0000)]
ceph: add subvolume metrics collection and reporting
Add complete infrastructure for per-subvolume I/O metrics collection
and reporting to the MDS. This enables administrators to monitor I/O
patterns at the subvolume granularity, which is useful for multi-tenant
CephFS deployments.
This patch adds:
- CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- CEPH_SUBVOLUME_ID_NONE constant (0) for unknown/unset state
- Red-black tree based metrics tracker for efficient per-subvolume
aggregation with kmem_cache for entry allocations
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing CLIENT_METRICS message
- Recording of I/O operations from file read/write and writeback paths
- Debugfs interfaces for monitoring (metrics/subvolumes, metrics/metric_features)
Metrics tracked per subvolume include:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)
The metrics are periodically sent to the MDS as part of the existing
metrics reporting infrastructure when the MDS advertises support for
the SUBVOLUME_METRICS feature.
CEPH_SUBVOLUME_ID_NONE enforces subvolume_id immutability. Following
the FUSE client convention, 0 means unknown/unset. Once an inode has
a valid (non-zero) subvolume_id, it should not change during the
inode's lifetime.
Alex Markuze [Tue, 10 Feb 2026 09:06:25 +0000 (09:06 +0000)]
ceph: parse subvolume_id from InodeStat v9 and store in inode
Add support for parsing the subvolume_id field from InodeStat v9 and
storing it in the inode for later use by subvolume metrics tracking.
The subvolume_id identifies which CephFS subvolume an inode belongs to,
enabling per-subvolume I/O metrics collection and reporting.
This patch:
- Adds subvolume_id field to struct ceph_mds_reply_info_in
- Adds i_subvolume_id field to struct ceph_inode_info
- Parses subvolume_id from v9 InodeStat in parse_reply_info_in()
- Adds ceph_inode_set_subvolume() helper to propagate the ID to inodes
- Initializes i_subvolume_id in inode allocation and clears on destroy
Alex Markuze [Tue, 10 Feb 2026 09:06:24 +0000 (09:06 +0000)]
ceph: handle InodeStat v8 versioned field in reply parsing
Add forward-compatible handling for the new versioned field introduced
in InodeStat v8. This patch only skips the field without using it,
preparing for future protocol extensions.
The v8 encoding adds a versioned sub-structure that needs to be properly
decoded and skipped to maintain compatibility with newer MDS versions.
libceph: Fix slab-out-of-bounds access in auth message processing
If a (potentially corrupted) message of type CEPH_MSG_AUTH_REPLY
contains a positive value in its result field, it is treated as an
error code by ceph_handle_auth_reply() and returned to
handle_auth_reply(). Thereafter, an attempt is made to send the
preallocated message of type CEPH_MSG_AUTH, where the returned value is
interpreted as the size of the front segment to send. If the result
value in the message is greater than the size of the memory buffer
allocated for the front segment, an out-of-bounds access occurs, and
the content of the memory region beyond this buffer is sent out.
This patch fixes the issue by treating only negative values in the
result field as errors. Positive values are therefore treated as success
in the same way as a zero value. Additionally, a BUG_ON is added to
__send_prepared_auth_request() comparing the len parameter to
front_alloc_len to prevent sending the message if it exceeds the bounds
of the allocation and to make it easier to catch any logic flaws leading
to this.
rbd: fix null-ptr-deref when device_add_disk() fails
do_rbd_add() publishes the device with device_add() before calling
device_add_disk(). If device_add_disk() fails after device_add()
succeeds, the error path calls rbd_free_disk() directly and then later
falls through to rbd_dev_device_release(), which calls rbd_free_disk()
again. This double teardown can leave blk-mq cleanup operating on
invalid state and trigger a null-ptr-deref in
__blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set().
Fix this by following the normal remove ordering: call device_del()
before rbd_dev_device_release() when device_add_disk() fails after
device_add(). That keeps the teardown sequence consistent and avoids
re-entering disk cleanup through the wrong path.
The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available.
We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64
guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer
confines failslab injections to the __add_disk() range and injects
fail-nth while mapping an RBD image through
/sys/bus/rbd/add_single_major.
On the unpatched kernel, fail-nth=4 reliably triggered the fault:
Commit 41ebcc0907c5 ("crush: remove forcefeed functionality") from
May 7, 2012 (linux-next), leads to the following Smatch static
checker warning:
net/ceph/crush/mapper.c:1015 crush_do_rule()
warn: iterator 'j' not incremented
Before commit 41ebcc0907c5 ("crush: remove forcefeed functionality"),
we had this logic:
j = 0;
if (osize == 0 && force_pos >= 0) {
o[osize] = force_context[force_pos];
if (recurse_to_leaf)
c[osize] = force_context[0];
j++; /* <-- this was the only increment, now gone */
force_pos--;
}
/* then crush_choose_*(..., o+osize, j, ...) */
Now, the variable j is dead code — a variable that is set
and never meaningfully varied. This patch simply removes
the dead code.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Max Kellermann [Mon, 30 Mar 2026 08:43:19 +0000 (10:43 +0200)]
ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails
This MDS reconnect error path leaves s_cap_reconnect set.
send_mds_reconnect() sets the bit at the beginning of the reconnect,
but the first failing operation after that, ceph_pagelist_encode_32(),
can jump to `fail:` without clearing it.
__ceph_remove_cap() consults that flag to decide whether cap releases
should be queued. A reconnect-preparation failure therefore leaves the
session in reconnect mode from the cap-release path's point of view
and can strand release work until some later state transition repairs
it.
Max Kellermann [Fri, 27 Mar 2026 16:23:08 +0000 (17:23 +0100)]
ceph: only d_add() negative dentries when they are unhashed
Ceph can call d_add(dentry, NULL) on a negative dentry that is already
present in the primary dcache hash.
In the current VFS that is not safe. d_add() goes through __d_add()
to __d_rehash(), which unconditionally reinserts dentry->d_hash into
the hlist_bl bucket. If the dentry is already hashed, reinserting the
same node can corrupt the bucket, including creating a self-loop.
Once that happens, __d_lookup() can spin forever in the hlist_bl walk,
typically looping only on the d_name.hash mismatch check and
eventually triggering RCU stall reports like this one:
This is reachable with reused cached negative dentries. A Ceph lookup
or atomic_open can be handed a negative dentry that is already hashed,
and fs/ceph/dir.c then hits one of two paths that incorrectly assume
"negative" also means "unhashed":
- ceph_finish_lookup():
MDS reply is -ENOENT with no trace
-> d_add(dentry, NULL)
- ceph_lookup():
local ENOENT fast path for a complete directory with shared caps
-> d_add(dentry, NULL)
Both paths can therefore re-add an already-hashed negative dentry.
Ceph already uses the correct pattern elsewhere: ceph_fill_trace() only
calls d_add(dn, NULL) for a negative null-dentry reply when d_unhashed(dn)
is true.
Fix both fs/ceph/dir.c sites the same way: only call d_add() for a
negative dentry when it is actually unhashed. If the negative dentry
is already hashed, leave it in place and reuse it as-is.
This preserves the existing behavior for unhashed dentries while
avoiding d_hash list corruption for reused hashed negatives.
kexinsun [Mon, 23 Feb 2026 13:15:07 +0000 (21:15 +0800)]
libceph: update outdated comment in ceph_sock_write_space()
The function try_write() was renamed to ceph_con_v1_try_write()
in commit 566050e17e53 ("libceph: separate msgr1 protocol
implementation") and subsequently moved to net/ceph/messenger_v1.c
in commit 2f713615ddd9 ("libceph: move msgr1 protocol implementation
to its own file"). Update the comment in ceph_sock_write_space()
accordingly.
[ idryomov: account for msgr2 in the updated comment as well ]
Since the call to crypto_shash_setkey() was replaced with
hmac_sha256_preparekey() which doesn't allocate memory regardless of the
alignment of the input key, remove the session key alignment logic from
process_auth_done(). Also remove the inclusion of crypto/hash.h, which
is no longer needed since crypto_shash is no longer used.
Sam Edwards [Wed, 18 Mar 2026 02:37:33 +0000 (19:37 -0700)]
ceph: fix num_ops off-by-one when crypto allocation fails
move_dirty_folio_in_page_array() may fail if the file is encrypted, the
dirty folio is not the first in the batch, and it fails to allocate a
bounce buffer to hold the ciphertext. When that happens,
ceph_process_folio_batch() simply redirties the folio and flushes the
current batch -- it can retry that folio in a future batch.
However, if this failed folio is not contiguous with the last folio that
did make it into the batch, then ceph_process_folio_batch() has already
incremented `ceph_wbc->num_ops`; because it doesn't follow through and
add the discontiguous folio to the array, ceph_submit_write() -- which
expects that `ceph_wbc->num_ops` accurately reflects the number of
contiguous ranges (and therefore the required number of "write extent"
ops) in the writeback -- will panic the kernel:
BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
This issue can be reproduced on affected kernels by writing to
fscrypt-enabled CephFS file(s) with a 4KiB-written/4KiB-skipped/repeat
pattern (total filesize should not matter) and gradually increasing the
system's memory pressure until a bounce buffer allocation fails.
Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct
value when move_dirty_folio_in_page_array() fails, but the folio already
started counting a new (i.e. still-empty) extent.
The defect corrected by this patch has existed since 2022 (see first
`Fixes:`), but another bug blocked multi-folio encrypted writeback until
recently (see second `Fixes:`). The second commit made it into 6.18.16,
6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch
therefore fixes a regression (panic) introduced by cac190c7674f.
Cc: stable@vger.kernel.org Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages") Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files") Signed-off-by: Sam Edwards <CFSworks@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Raphael Zimmer [Wed, 18 Mar 2026 17:09:03 +0000 (18:09 +0100)]
libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
If a message of type CEPH_MSG_AUTH_REPLY contains a zero value for both
protocol and result, this is currently not treated as an error. In case
of ac->negotiating == true and ac->protocol > 0, this leads to setting
ac->protocol = 0 and ac->ops = NULL. Thereafter, the check for
ac->protocol != protocol returns false, and init_protocol() is not
called. Subsequently, ac->ops->handle_reply() is called, which leads to
a null pointer dereference, because ac->ops is still NULL.
This patch changes the check for ac->protocol != protocol to
!ac->protocol, as this also includes the case when the protocol was set
to zero in the message. This causes the message to be treated as
containing a bad auth protocol.
s390/bpf: Inline smp_processor_id and current_task
Inline these calls in bpf jit:
- bpf_get_smp_processor_id()
- bpf_get_current_task()
- bpf_get_current_task_btf()
s390 has a 8 KiB per-CPU prefix area in the CPU's
virtual address space, called the lowcore. It is a
struct that contains the cpu number and a pointer
to the current task. These are exactly the values
returned by the BPF helpers.
Emit a load from the lowcore instead of a helper
function call.
Dave Hansen [Tue, 21 Apr 2026 16:31:36 +0000 (09:31 -0700)]
x86/cpu: Disable FRED when PTI is forced on
FRED and PTI were never intended to work together. No FRED hardware is
vulnerable to Meltdown and all of it should have LASS anyway.
Nevertheless, if you boot a system with pti=on and fred=on, the kernel
tries to do what is asked of it and dies a horrible death on the first
attempt to run userspace (since it never switches to the user page
tables).
Disable FRED when PTI is forced on, and print a warning about it.
A quick brain dump about what a FRED+PTI implementation would look like
is below. I'm not sure it would make any sense to do it, but never say
never. All I know is that it's way too complicated to be worth it today.
<brain dump>
The SWITCH_TO_USER/KERNEL_CR3 bits are simple to fix (or at least we
have the assembly tools to do it already), as is sticking the FRED entry
text in .entry.text (it's not in there today).
The nasty part is the stacks. Today, the CPU pops into the kernel on
MSR_IA32_FRED_RSP0 which is normal old kernel memory and not mapped to
userspace. The hardware pushes gunk on to MSR_IA32_FRED_RSP0, which is
currently the task stacks. MSR_IA32_FRED_RSP0 would need to point
elsewhere, probably cpu_entry_stack(). Then, start playing games with
stacks on entry/exit, including copying gunk to and from the task stack.
While I'd *like* to have PTI everywhere, I'm not sure it's worth mucking
up the FRED code with PTI kludges. If a user wants fast entry/exit, they
use FRED. If you want PTI (and sekuritay), you certainly don't care
about fast entry and FRED isn't going to help you *all* that much, so
you can just stay with the IDT.
Plus, FRED hardware should have LASS which gives you a similar security
profile to PTI without the CR3 munging.
</brain dump>
Reported-by: Gayatri Kammela <Gayatri.Kammela@amd.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Cc:stable@vger.kernel.org Link: https://patch.msgid.link/20260421163136.E7C6788A@davehans-spike.ostc.intel.com
Merge tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, the changes primarily focus on resolving race
conditions, memory safety issues (UAF), and improving the robustness
of garbage collection (GC), and folio management.
Enhancements:
- add page-order information for large folio reads in iostat
- add defrag_blocks sysfs node
Bug fixes:
- fix uninitialized kobject put in f2fs_init_sysfs()
- disallow setting an extension to both cold and hot
- fix node_cnt race between extent node destroy and writeback
- preserve previous reserve_{blocks,node} value when remount
- freeze GC and discard threads quickly
- fix false alarm of lockdep on cp_global_sem lock
- fix data loss caused by incorrect use of nat_entry flag
- skip empty sections in f2fs_get_victim
- fix inline data not being written to disk in writeback path
- fix fsck inconsistency caused by FGGC of node block
- fix fsck inconsistency caused by incorrect nat_entry flag usage
- call f2fs_handle_critical_error() to set cp_error flag
- fix fiemap boundary handling when read extent cache is incomplete
- fix use-after-free of sbi in f2fs_compress_write_end_io()
- fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()
- fix incorrect file address mapping when inline inode is unwritten
- fix incomplete search range in f2fs_get_victim when f2fs_need_rand_seg is enabled
- avoid memory leak in f2fs_rename()"
* tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (35 commits)
f2fs: add page-order information for large folio reads in iostat
f2fs: do not support mmap write for large folio
f2fs: fix uninitialized kobject put in f2fs_init_sysfs()
f2fs: protect extension_list reading with sb_lock in f2fs_sbi_show()
f2fs: disallow setting an extension to both cold and hot
f2fs: fix node_cnt race between extent node destroy and writeback
f2fs: allow empty mount string for Opt_usr|grp|projjquota
f2fs: fix to preserve previous reserve_{blocks,node} value when remount
f2fs: invalidate block device page cache on umount
f2fs: fix to freeze GC and discard threads quickly
f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footer
f2fs: fix false alarm of lockdep on cp_global_sem lock
f2fs: fix data loss caused by incorrect use of nat_entry flag
f2fs: fix to skip empty sections in f2fs_get_victim
f2fs: fix inline data not being written to disk in writeback path
f2fs: fix fsck inconsistency caused by FGGC of node block
f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usage
f2fs: fix to do sanity check on dcc->discard_cmd_cnt conditionally
f2fs: refactor node footer flag setting related code
f2fs: refactor f2fs_move_node_folio function
...
Merge tag 'libnvdimm-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax updates from Ira Weiny:
"The series adds DAX support required for the upcoming fuse/famfs file
system.[1] The support here is required because famfs is backed by
devdax rather than pmem. This all lays the groundwork for using shared
memory as a file system"
Link: https://lore.kernel.org/all/0100019d43e5f632-f5862a3e-361c-4b54-a9a6-96c242a8f17a-000000@email.amazonses.com/
* tag 'libnvdimm-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax/fsdev: fix uninitialized kaddr in fsdev_dax_zero_page_range()
dax: export dax_dev_get()
dax: Add fs_dax_get() func to prepare dax for fs-dax usage
dax: Add dax_set_ops() for setting dax_operations at bind time
dax: Add dax_operations for use by fs-dax on fsdev dax
dax: Save the kva from memremap
dax: add fsdev.c driver for fs-dax on character dax
dax: Factor out dax_folio_reset_order() helper
dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c
Timur Kristóf [Mon, 20 Apr 2026 23:55:04 +0000 (01:55 +0200)]
drm/amd/display: Disable 10-bit truncation and dithering on DCE 6.x
DCE 6.x doesn't support 10-bit truncation and 10-bit dithering
because the following fields are 1-bit only:
FMT_TEMPORAL_DITHER_DEPTH
FMT_SPATIAL_DITHER_DEPTH
FMT_TRUNCATE_DEPTH
Programming these fields to "2" will program them as if the
dithering option was 6-bit, resulting in sub-par picture
quality and an ugly "color banding" effect.
Note that a recent commit changed the default 10-bit dithering
option to DITHER_OPTION_SPATIAL10 which improves the picture
quality because it happens to look better, but is still not
actually supported by DCE 6.x versions.
When the color depth is 10-bit or more, just disable
any kind of dithering options on DCE 6.x.
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5151 Fixes: 529cad0f945c ("drm/amd/display: Add function to set dither option") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6be8ced880dfe29ce38c2d5e74489822da5c250e)
Siwei He [Tue, 14 Apr 2026 18:46:54 +0000 (14:46 -0400)]
drm/amdgpu: OR init_pte_flags into invalid leaf PTE updates
Invalid leaf clears that only set AMDGPU_PTE_EXECUTABLE match the old
GMC9 fault-priority workaround but omit adev->gmc.init_pte_flags.
On GFX12 that includes AMDGPU_PTE_IS_PTE; without it, some cleared
PTEs can fault as no-retry and bypass the SVM/XNACK handler when a
VA is reused after a BO unmap.
Apply init_pte_flags in amdgpu_vm_pte_update_flags() alongside
EXECUTABLE so range-driven clears (e.g. amdgpu_vm_clear_freed) match
amdgpu_vm_pt_clear() for leaf templates.
Signed-off-by: Siwei He <siwei.he@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9d47b2c36b9a6c6b844c33cab407a5d7ad102234)
Merge tag 'pull-coda' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull coda dcache updates from Al Viro:
"Coda dcache-related cleanups and fixes"
* tag 'pull-coda' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coda_flag_children(): fix a UAF
sanitize coda_dentry_delete()
coda: is_bad_inode() is always false there
drm/amd: Adjust ASPM support quirk to cover more Intel hosts
Some of the same issues identified in commit c770ef19673fb
("drm/amd/amdgpu: disable ASPM in some situations") also affect
Tiger Lake systems with GFX11 connected over USB4. Widen the net
to also match these hosts.
Fixes: d9b3a066dfcd ("drm/amd: Exclude dGPUs in eGPU enclosures from DPM quirks") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5145 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a214d888485b9f35fe03882a92962e6d5697849)
Leo Li [Fri, 17 Apr 2026 17:54:30 +0000 (13:54 -0400)]
drm/amd/display: Undo accidental fix revert in amdgpu_dm_ism.c
[Why]
Pausing DPM power profiles during static screen caused a bunch of
audio/performance/clock issues that were addressed in this fix:
'commit 1412482b7143 ("Revert "drm/amd/display: pause the workload setting in dm"")'
This logic in function amdgpu_dm_crtc_vblank_control_worker() was moved
to amdgpu_dm_ism.c, but the fix was lost in the process.
[How]
Reapply the fix to amdgpu_dm_ism.c
Fixes: 754003486c3c ("drm/amd/display: Add Idle state manager(ISM)") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Leo Li <sunpeng.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bc621e91d6fc004cfae9148c5a91acad19ada3e4)
Timur Kristóf [Mon, 20 Apr 2026 23:55:04 +0000 (01:55 +0200)]
drm/amd/display: Disable 10-bit truncation and dithering on DCE 6.x
DCE 6.x doesn't support 10-bit truncation and 10-bit dithering
because the following fields are 1-bit only:
FMT_TEMPORAL_DITHER_DEPTH
FMT_SPATIAL_DITHER_DEPTH
FMT_TRUNCATE_DEPTH
Programming these fields to "2" will program them as if the
dithering option was 6-bit, resulting in sub-par picture
quality and an ugly "color banding" effect.
Note that a recent commit changed the default 10-bit dithering
option to DITHER_OPTION_SPATIAL10 which improves the picture
quality because it happens to look better, but is still not
actually supported by DCE 6.x versions.
When the color depth is 10-bit or more, just disable
any kind of dithering options on DCE 6.x.
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5151 Fixes: 529cad0f945c ("drm/amd/display: Add function to set dither option") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Siwei He [Tue, 14 Apr 2026 18:46:54 +0000 (14:46 -0400)]
drm/amdgpu: OR init_pte_flags into invalid leaf PTE updates
Invalid leaf clears that only set AMDGPU_PTE_EXECUTABLE match the old
GMC9 fault-priority workaround but omit adev->gmc.init_pte_flags.
On GFX12 that includes AMDGPU_PTE_IS_PTE; without it, some cleared
PTEs can fault as no-retry and bypass the SVM/XNACK handler when a
VA is reused after a BO unmap.
Apply init_pte_flags in amdgpu_vm_pte_update_flags() alongside
EXECUTABLE so range-driven clears (e.g. amdgpu_vm_clear_freed) match
amdgpu_vm_pt_clear() for leaf templates.
Signed-off-by: Siwei He <siwei.he@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm/amd: Adjust ASPM support quirk to cover more Intel hosts
Some of the same issues identified in commit c770ef19673fb
("drm/amd/amdgpu: disable ASPM in some situations") also affect
Tiger Lake systems with GFX11 connected over USB4. Widen the net
to also match these hosts.
Fixes: d9b3a066dfcd ("drm/amd: Exclude dGPUs in eGPU enclosures from DPM quirks") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5145 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Leo Li [Fri, 17 Apr 2026 17:54:30 +0000 (13:54 -0400)]
drm/amd/display: Undo accidental fix revert in amdgpu_dm_ism.c
[Why]
Pausing DPM power profiles during static screen caused a bunch of
audio/performance/clock issues that were addressed in this fix:
'commit 1412482b7143 ("Revert "drm/amd/display: pause the workload setting in dm"")'
This logic in function amdgpu_dm_crtc_vblank_control_worker() was moved
to amdgpu_dm_ism.c, but the fix was lost in the process.
[How]
Reapply the fix to amdgpu_dm_ism.c
Fixes: 754003486c3c ("drm/amd/display: Add Idle state manager(ISM)") Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Leo Li <sunpeng.li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull more crypto library updates from Eric Biggers:
"Crypto library fix and documentation update:
- Fix an integer underflow in the mpi library
- Improve the crypto library documentation"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: docs: Add rst documentation to Documentation/crypto/
docs: kdoc: Expand 'at_least' when creating parameter list
lib/crypto: mpi: Fix integer underflow in mpi_read_raw_from_sgl()
Pavel Begunkov [Tue, 21 Apr 2026 08:45:29 +0000 (09:45 +0100)]
io_uring/zcrx: warn on freelist violations
The freelist is appropriately sized to always be able to take a free
niov, but let's be more defensive and check the invariant with a
warning. That should help to catch any double-free issues.
io_uring/register: fix ring resizing with mixed/large SQEs/CQEs
The ring resizing only properly handles "normal" sized SQEs or CQEs, if
there are pending entries around a resize. This normally should not be
the case, but the code is supposed to handle this regardless.
For the mixed SQE/CQE cases, the current copying works fine as they
are indexed in the same way. Each half is just copied separately. But
for fixed large SQEs and CQEs, the iteration and copy need to take that
into account.
io_uring/futex: ensure partial wakes are appropriately dequeued
If a FUTEX_WAITV vectored operation is only partially woken, we
should call __futex_wake_mark() on the queue to account for that.
If not, then a later wakeup will wake the same entry, rather than
the next one in line.
Fixes: 8f350194d5cfd ("io_uring: add support for vectored futex waits") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently anything that requires kvmalloc_flex() for allocations will
not get re-cached, and hence the cache freeing path is correct in that
it always uses kfree() to free the allocated memory. But this seems a
bit fragile as it's something that could get mix should that situation
change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree
as the desctructor.
io_uring/rsrc: unify nospec indexing for direct descriptors
For file updates, the node reset isn't capping the value via
array_index_nospec() like the other paths do. Ensure it's all sane and
have the update path do the proper capping as well.
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'erofs-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Fix dirent nameoff handling to avoid out-of-bound reads
out of crafted images
- Fix two type truncation issues on 32-bit platforms
* tag 'erofs-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: unify lcn as u64 for 32-bit platforms
erofs: fix offset truncation when shifting pgoff on 32-bit platforms
erofs: fix the out-of-bounds nameoff handling for trailing dirents
Alex Williamson [Fri, 17 Apr 2026 20:27:58 +0000 (14:27 -0600)]
vfio/cdx: Consolidate MSI configured state onto cdx_irqs
struct vfio_cdx_device carries three fields that track whether MSI has
been configured: vdev->cdx_irqs (the allocated vector array), vdev->
msi_count (the array length), and vdev->config_msi (a boolean flag).
The three are set together when vfio_cdx_msi_enable() succeeds and
cleared together by vfio_cdx_msi_disable(). However, the error paths
in vfio_cdx_msi_enable() free the cdx_irqs allocation on failure
without resetting the pointer, leaving it stale and skewed from the
other two fields until the next enable call overwrites it.
Clear vdev->cdx_irqs to NULL alongside the kfree() in both error paths
so the pointer consistently reflects the configured state. With that
invariant restored and access to the MSI state serialized by
cdx_irqs_lock, vdev->config_msi is fully redundant with
(vdev->cdx_irqs != NULL). Drop the config_msi field and switch all
readers to test cdx_irqs directly.
Alex Williamson [Fri, 17 Apr 2026 20:27:57 +0000 (14:27 -0600)]
vfio/cdx: Serialize VFIO_DEVICE_SET_IRQS with a per-device mutex
vfio_cdx_set_msi_trigger() reads vdev->config_msi and operates on the
vdev->cdx_irqs array based on its value, but provides no serialization
against concurrent VFIO_DEVICE_SET_IRQS ioctls. Two callers can race
such that one observes config_msi as set while another clears it and
frees cdx_irqs via vfio_cdx_msi_disable(), resulting in a use-after-free
of the cdx_irqs array.
Add a cdx_irqs_lock mutex to struct vfio_cdx_device and acquire it in
vfio_cdx_set_msi_trigger(), which is the single chokepoint through
which all updates to config_msi, cdx_irqs, and msi_count flow, covering
both the ioctl path and the close-device cleanup path. This keeps the
test of config_msi atomic with the subsequent enable, disable, or
trigger operations.
Drop the pre-call !cdx_irqs test from vfio_cdx_irqs_cleanup() as part
of this change: the optimization it provided is redundant with the
!config_msi early-return inside vfio_cdx_msi_disable(), and leaving the
test in place would be an unsynchronized read of state the new lock is
meant to protect.
vfio/cdx: Fix NULL pointer dereference in interrupt trigger path
Add validation to ensure MSI is configured before accessing cdx_irqs
array in vfio_cdx_set_msi_trigger(). Without this check, userspace
can trigger a NULL pointer dereference by calling VFIO_DEVICE_SET_IRQS
with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE flags before
ever setting up interrupts via VFIO_IRQ_SET_DATA_EVENTFD.
The vfio_cdx_msi_enable() function allocates the cdx_irqs array and
sets config_msi to 1 only when called through the EVENTFD path. The
trigger loop (for DATA_BOOL/DATA_NONE) assumed this had already been
done, but there was no enforcement of this call ordering.
This matches the protection used in the PCI VFIO driver where
vfio_pci_set_msi_trigger() checks irq_is() before the trigger loop.
Fixes: 848e447e000c ("vfio/cdx: add interrupt support") Cc: stable@vger.kernel.org Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com> Acked-by: Nipun Gupta <nipun.gupta@amd.com> Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com> Link: https://lore.kernel.org/r/20260417202800.88287-2-alex.williamson@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Alex Williamson [Fri, 17 Apr 2026 15:28:12 +0000 (09:28 -0600)]
vfio: replace vfio->device_class with a const struct class
The class_create() call has been deprecated in favor of class_register()
as the driver core now allows for a struct class to be in read-only
memory. Replace vfio->device_class with a const struct class and drop
the class_create() call.
Compile tested with both CONFIG_VFIO_DEVICE_CDEV on and off (and
CONFIG_VFIO on); found no errors/warns in dmesg.
Alex Williamson [Tue, 14 Apr 2026 20:06:22 +0000 (14:06 -0600)]
vfio/virtio: Use guard() for bar_mutex in legacy I/O
Convert the bar_mutex acquisition in virtiovf_issue_legacy_rw_cmd()
to use guard(), eliminating the out label and goto-based error paths
in favor of direct returns.
Alex Williamson [Tue, 14 Apr 2026 20:06:21 +0000 (14:06 -0600)]
vfio/virtio: Use guard() for migf->lock where applicable
Convert migf->lock acquisitions in virtiovf_disable_fd() and
virtiovf_save_read() to use guard(). In virtiovf_save_read() this
eliminates the out_unlock label and multiple goto paths by allowing
direct returns, and removes the need for the done variable to double
as an error carrier.
Alex Williamson [Tue, 14 Apr 2026 20:06:20 +0000 (14:06 -0600)]
vfio/virtio: Use guard() for list_lock where applicable
Convert list_lock mutex acquisitions to use guard() and scoped_guard()
where the lock scope aligns with the function or block scope. This
simplifies virtiovf_get_data_buff_from_pos() by replacing goto-based
unwinding with direct returns.
Alex Williamson [Tue, 14 Apr 2026 20:06:19 +0000 (14:06 -0600)]
vfio/virtio: Convert list_lock from spinlock to mutex
The list_lock spinlock with IRQ disabling was copied from the mlx5
vfio-pci variant driver, where it is justified by a hardirq async
command completion callback that accesses the protected lists. The
virtio driver has no such interrupt context usage; all list_lock
acquisitions occur in process context via file read/write operations
or state transitions under state_mutex.
Convert list_lock to a mutex to be consistent with peer vfio-pci
variant drivers (hisilicon, pds, qat, xe) which all use mutexes for
equivalent migration data protection. This also fixes a mismatched
spin_lock()/spin_unlock_irq() pair in virtiovf_read_device_context_chunk()
that could incorrectly enable interrupts.
Reported-by: Jinhui Guo <guojinhui.liam@bytedance.com> Closes: https://lore.kernel.org/all/20260413073603.30538-1-guojinhui.liam@bytedance.com Fixes: 0bbc82e4ec79 ("vfio/virtio: Add support for the basic live migration functionality") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20260414200625.3601509-2-alex.williamson@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Matt Evans [Wed, 15 Apr 2026 18:17:52 +0000 (11:17 -0700)]
vfio/pci: Clean up DMABUFs before disabling function
On device shutdown, make vfio_pci_core_close_device() call
vfio_pci_dma_buf_cleanup() before the function is disabled via
vfio_pci_core_disable(). This ensures that all access via DMABUFs is
revoked before the function's BARs become inaccessible.
This fixes an issue where, if the function is disabled first, a tiny
window exists in which the function's MSE is cleared and yet BARs
could still be accessed via the DMABUF. The resources would also be
freed and up for grabs by a different driver.
Fixes: 5d74781ebc86c ("vfio/pci: Add dma-buf export support for MMIO regions") Signed-off-by: Matt Evans <mattev@meta.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20260415181752.1027604-1-mattev@meta.com Signed-off-by: Alex Williamson <alex@shazbot.org>
block: only restrict bio allocation gfp mask asked to block
If the caller is asking for a non-blocking allocation, we should not
further restrict the gfp mask, which just increases the likelihood
of failures.
Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath") Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20260415060813.807659-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Wrap pat.idx[] reads with xe_cache_pat_idx() so invalid PAT index use
is caught by xe_assert() in debug builds.
Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Link: https://patch.msgid.link/20260416045526.536497-4-x.wang@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Xin Wang [Thu, 16 Apr 2026 04:55:25 +0000 (21:55 -0700)]
drm/xe/pat: Default XE_CACHE_NONE_COMPRESSION to invalid
Initialize XE_CACHE_NONE_COMPRESSION PAT index to XE_PAT_INVALID_IDX by
default, same as XE_CACHE_WB_COMPRESSION. Platforms that support this
cache mode will override it in xe_pat_init_early(). This ensures that
accidental use on unsupported platforms can be detected.
A subsequent patch introduces a helper to assert on invalid PAT index
access at all call sites.
Suggested-by: Matthew Auld <matthew.auld@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Xin Wang <x.wang@intel.com> Link: https://patch.msgid.link/20260416045526.536497-3-x.wang@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Xin Wang [Thu, 16 Apr 2026 04:55:24 +0000 (21:55 -0700)]
drm/xe: Standardize pat_index to u16 type
Ensure all pat_index definitions consistently use u16 type across
the XE driver. This addresses two remaining instances where pat_index
was incorrectly typed:
- xe_vm_snapshot structure used int for pat_index field
- xe_device pat.idx array used u32 instead of u16
This cleanup improves type consistency and ensures proper alignment
with the PAT subsystem design.
Shuicheng Lin [Fri, 17 Apr 2026 16:33:08 +0000 (16:33 +0000)]
drm/xe/gsc: Fix BO leak on error in query_compatibility_version()
When xe_gsc_read_out_header() fails, query_compatibility_version()
returns directly instead of jumping to the out_bo label. This skips
the xe_bo_unpin_map_no_vm() call, leaving the BO pinned and mapped
with no remaining reference to free it.
Fix by using goto out_bo so the error path properly cleans up the BO,
consistent with the other error handling in the same function.
which conflicts with the cid-form qmap rework on for-7.2. Resolved
by applying the same silence-on-NULL semantics to the arena-backed
lookup_task_ctx() and qmap_select_cpu() on for-7.2.
ALSA: hda/realtek - Add mute LED support for HP Victus 15-fa2xxx
The mute LED on this laptop uses ALC245 but requires a quirk to work.
This patch enables the existing ALC245_FIXUP_HP_MUTE_LED_COEFBIT
quirk for the device.
Tested my Victus 15-fa2xxx (PCI SSID 103c:8dcd).
The LED behaviour works as intended.
tools/sched_ext: scx_qmap: Silence task_ctx lookup miss
scx_fork() dispatches ops.init_task to exactly one scheduler - the one
owning the forking task's cgroup. A task forked inside a sub-scheduler's
cgroup is init'd into the sub only; the root scheduler has no task_ctx
entry for it. When that task later appears as @prev in the root's
qmap_dispatch() (or flows through core-sched comparison via task_qdist),
the bpf_task_storage_get() legitimately misses.
qmap treated those misses as fatal via scx_bpf_error("task_ctx lookup
failed") and aborted the scheduler as soon as the first cross-sched
task hit the root. Drop the error in the sites where the miss is
legitimate: lookup_task_ctx() (helper; callers already check for NULL),
qmap_dispatch()'s @prev branch (bookkeeping-only), task_qdist()
(returns 0 which makes the comparison a no-op), and qmap_select_cpu()
(returns prev_cpu as a no-op fallback instead of -ESRCH). The existing
scx_error was a paranoid guard from the pre-sub-sched world where every
task was owned by the one and only scheduler.
v2: qmap_select_cpu() returns prev_cpu on NULL instead of -ESRCH, so
the root scheduler doesn't error on cross-sched tasks that pass
through it (Andrea Righi).
Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Cássio Gabriel [Tue, 21 Apr 2026 13:03:06 +0000 (10:03 -0300)]
ALSA: pcmtest: Fix resource leaks in module init error paths
pcmtest allocates its pattern buffers and creates its debugfs tree
before registering the platform device and driver, but mod_init()
does not release those resources when a later init step fails.
As a result, a debugfs directory creation failure leaks the pattern
buffers, while platform_device_register() and
platform_driver_register() failures leave both the pattern buffers
and the debugfs tree behind. The recent fix for failed device
registration only dropped the embedded device reference.
Add the missing cleanup for the debugfs tree and pattern buffers in
the remaining module init error paths.