git.ipfire.org Git - thirdparty/linux.git/log

ntfs: fix mmap_prepare writable check for shared mappings

Linus pointed out that checking only VMA_WRITE_BIT is incorrect.
Private writable mappings (MAP_PRIVATE) set VM_WRITE but do not
write back to the filesystem. Also, mappings that can become
writable via mprotect() (VM_MAYWRITE) must be handled.

Use vma_desc_test_all(VMA_SHARED_BIT, VMA_MAYWRITE_BIT) instead,
which matches what other filesystems do.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>

LoongArch: BPF: Support up to 12 function arguments for trampoline

Currently, LoongArch bpf trampoline supports up to 8 function arguments.
According to the statistics from commit 473e3150e30a ("bpf, x86: allow
function arguments up to 12 for TRACING"), there are over 200 functions
accept 9 to 12 arguments, so add 12 arguments support for trampoline.

With this patch, the following related testcases passed:

  sudo ./test_progs -a tracing_struct/struct_many_args
  sudo ./test_progs -a fentry_test/fentry_many_args
  sudo ./test_progs -a fexit_test/fexit_many_args

Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: BPF: Support small struct arguments for trampoline

In the current BPF code, the struct argument size is at most 16 bytes,
enforced by the verifier. According to the Procedure Call Standard for
LoongArch, the struct argument size below 16 bytes are provided as part
of the 8 argument registers, that is to say, the struct argument may be
passed in a pair of registers if its size is more than 8 bytes and no
more than 16 bytes.

Extend the BPF trampoline JIT to support attachment to functions that
take small structures (up to 16 bytes) as argument, save and restore a
number of "argument registers" rather than a number of arguments.

With this patch, the following related testcases passed:

sudo ./test_progs -a tracing_struct/struct_args
sudo ./test_progs -a tracing_struct/union_args

Link: https://github.com/loongson/la-abi-specs/blob/release/lapcs.adoc#structures
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: BPF: Open code and remove invoke_bpf_mod_ret()

invoke_bpf_mod_ret() is a small wrapper over invoke_bpf_prog(), it
should check the return value of invoke_bpf_prog() and then return
immediately if invoke_bpf_prog() failed, just open code and remove
it due to it is called only once.

Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: BPF: Support load-acquire and store-release instructions

Use the LoongArch common memory access instructions with the barrier
'dbar' to support the BPF load-acquire and store-release instructions.

With this patch, the following testcases passed on LoongArch if the
macro CAN_USE_LOAD_ACQ_STORE_REL is usable in bpf selftests:

  sudo ./test_progs -t verifier_load_acquire
  sudo ./test_progs -t verifier_store_release
  sudo ./test_progs -t verifier_precision/bpf_load_acquire
  sudo ./test_progs -t verifier_precision/bpf_store_release
  sudo ./test_progs -t compute_live_registers/atomic_load_acq_store_rel

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: BPF: Support 8 and 16 bit read-modify-write instructions

The 8 and 16 bit read-modify-write instructions {amadd/amswap}.{b/h}
were newly added in the latest LoongArch Reference Manual, use them to
avoid the error of unknown opcode if possible.

Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: BPF: Add the default case in emit_atomic() and rename it

Like the other archs such as x86 and riscv, add the default case
in emit_atomic() to print an error message for the invalid opcode
and return -EINVAL, then make its return type as int.

While at it, given that all of the instructions in emit_atomic()
are only read-modify-write instructions, rename emit_atomic() to
emit_atomic_rmw() to make it clear, because there will be a new
function emit_atomic_ld_st() for load-acquire and store-release
instructions in the later patch.

Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Define instruction formats for AM{SWAP/ADD}.{B/H} and DBAR

The 8 and 16 bit read-modify-write atomic instructions amadd.{b/h} and
amswap.{b/h} were newly added in the latest LoongArch Reference Manual,
define the instruction format and check whether support via CPUCFG.

Furthermore, define the instruction format for DBAR which will be used
to support BPF load-acquire and store-release instructions.

This is preparation for later patches.

Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Batch the icache maintenance for jump_label

Switch to the batched version of the jump label update functions so
instruction cache maintenance is deferred until the end of the update.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Add flush_icache_all()/local_flush_icache_all()

LoongArch maintains ICache/DCache coherency by hardware, so we just need
"ibar 0" to avoid instruction hazard here.

Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Add spectre boundry for syscall dispatch table

The LoongArch syscall number is directly controlled by userspace, but
does not have a array_index_nospec() boundry to prevent access past the
syscall function pointer tables.

Cc: stable@vger.kernel.org
Assisted-by: gkh_clanker_2000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Show CPU vulnerabilites correctly

Most LoongArch processors are vulnerable to Spectre-V1 Proof-of-Concept
(PoC). And the generic mechanism, __user pointer sanitization, can be
used as a mitigation. This means to use array_index_nospec() to prevent
out of boundry access in syscall and other critical paths.

Implement the arch-specific cpu_show_spectre_v1() to show CPU Spectre-V1
vulnerabilites correctly.

Cc: stable@vger.kernel.org
Link: https://cc-sw.com/chinese-loongarch-architecture-evaluation-part-3-of-3/
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Make arch_irq_work_has_interrupt() true only if IPI HW exist

After commit 7c405fb3279b3924 ("rcu: Use an intermediate irq_work to
start process_srcu()"), Loongson-2K0300/2K0500 fail to boot. Because
IRQ_WORK need IPI but Loongson-2K0300/2K0500 don't have IPI HW.

So make arch_irq_work_has_interrupt() return true only if IPI HW exist.

Cc: stable@vger.kernel.org
Reported-by: Binbin Zhou <zhoubinbin@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Use get_random_canary() for stack canary init

Like others, replace the custom stack canary initialization with the
get_random_canary() helper, following the pattern established in commit
622754e84b10 ("stackprotector: actually use get_random_canary()").

Signed-off-by: Luo Qiu <luoqiu@kylinsec.com.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Improve the logging of disabling KASLR

Whether KASLR is disabled is not handled in nokaslr() which is the early
param "nokaslr" setup function, but in kaslr_disabled(). However, the
logging was previously done in nokaslr() and lack detail. So we move the
logging to the right place and add more specific infomation about why it
is disabled.

Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Yuqian Yang <yangyuqian@uniontech.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Align FPU register state to 32 bytes

Move fpr to the beginning of struct loongarch_fpu so it is naturally
aligned to FPU_ALIGN (32 bytes), improving 256-bit SIMD (LASX) context
switch performance.

Also adjust process.c and fpu.S to work well with the new loongarch_fpu
layout.

Signed-off-by: Lisa Robinson <lisa@bytefly.space>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Handle CONFIG_32BIT in syscall_get_arch()

If CONFIG_32BIT is set, it should return AUDIT_ARCH_LOONGARCH32 instead
of AUDIT_ARCH_LOONGARCH64 in syscall_get_arch().

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Add HIGHMEM (PKMAP and FIX_KMAP) support

Add HIGHMEM (High Memory) support for LoongArch, mostly needed by 32BIT
kernel because the size of kernel virtual memory space is only 512MB and
the size of usable physical memory is only 256MB in this case.

HIGHMEM adds permanent kernel mapping (PKMAP) and fixed kernel mapping
(FIX_KMAP), which increase usable physical memory up to 2.25GB (2304MB).

We can just use the generic copy_user_highpage(), so remove the custom
version.

Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Adjust build infrastructure for 32BIT/64BIT

Adjust build infrastructure (Kconfig, Makefile and ld scripts) to let
us enable both 32BIT/64BIT kernel build.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

x86/hyperv: Skip LP/VP creation on kexec

After a kexec the logical processors and virtual processors already
exist in the hypervisor because they were created by the previous
kernel. Attempting to add them again causes either a BUG_ON or
corrupted VP state leading to MCEs in the new kernel.

Add hv_lp_exists() to probe whether an LP is already present by
calling HVCALL_GET_LOGICAL_PROCESSOR_RUN_TIME. When it succeeds the
LP exists and we skip the add-LP and create-VP loops entirely.

Also add hv_call_notify_all_processors_started() which informs the
hypervisor that all processors are online. This is required after
adding LPs (fresh boot) and is a no-op on kexec since we skip that
path.

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Co-developed-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburskii@gmail.com>
Co-developed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

x86/hyperv: move stimer cleanup to hv_machine_shutdown()

Move hv_stimer_global_cleanup() from vmbus's hv_kexec_handler() to
hv_machine_shutdown() in the platform code. This ensures stimer cleanup
happens before the vmbus unload, which is required for root partition
kexec to work correctly.

Co-developed-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam <anrayabh@linux.microsoft.com>
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

Drivers: hv: vmbus: fix hyperv_cpuhp_online variable shadowing

vmbus_alloc_synic_and_connect() declares a local 'int
hyperv_cpuhp_online' that shadows the file-scope global of the same
name. The cpuhp state returned by cpuhp_setup_state() is stored in
the local, leaving the global at 0 (CPUHP_OFFLINE). When
hv_kexec_handler() or hv_machine_shutdown() later call
cpuhp_remove_state(hyperv_cpuhp_online) they pass 0, which hits the
BUG_ON in __cpuhp_remove_state_cpuslocked().

Remove the local declaration so the cpuhp state is stored in the
file-scope global where hv_kexec_handler() and hv_machine_shutdown()
expect it.

Fixes: 2647c96649ba ("Drivers: hv: Support establishing the confidential VMBus connection")
Signed-off-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

mshv: Add tracepoint for GPA intercept handling

Provide visibility into GPA intercept operations for debugging and
performance analysis of Microsoft Hypervisor guest memory management.

Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>

pwm: atmel-tcb: Cache clock rates and mark chip as atomic

atmel_tcb_pwm_apply() holds tcbpwmc->lock as a spinlock via
guard(spinlock)() and then calls atmel_tcb_pwm_config(), which calls
clk_get_rate() twice. clk_get_rate() acquires clk_prepare_lock (a
mutex), so this is a sleep-in-atomic-context violation.

On CONFIG_DEBUG_ATOMIC_SLEEP kernels every pwm_apply_state() that
enables or reconfigures the PWM triggers a "BUG: sleeping function
called from invalid context" warning.

Acquire exclusive control over the clock rates with
clk_rate_exclusive_get() at probe time and cache the rates in struct
atmel_tcb_pwm_chip, then read the cached rates from
atmel_tcb_pwm_config(). This keeps the spinlock-based mutual exclusion
introduced in commit 37f7707077f5 ("pwm: atmel-tcb: Fix race condition
and convert to guards") and removes the sleeping calls from the atomic
section.

With no sleeping calls left in .apply() and the regmap-mmio bus already
running with fast_io=true, also mark the chip as atomic so consumers
can use pwm_apply_atomic() from atomic context.

Fixes: 37f7707077f5 ("pwm: atmel-tcb: Fix race condition and convert to guards")
Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr>
Link: https://patch.msgid.link/20260419080838.3192357-1-sangyun.kim@snu.ac.kr
[ukleinek: Ensure .clk is enabled before calling clk_get_rate on it.]
Signed-off-by: Uwe Kleine-König <ukleinek@kernel.org>

io_uring: take page references for NOMMU pbuf_ring mmaps

Under !CONFIG_MMU, io_uring_get_unmapped_area() returns the kernel
virtual address of the io_mapped_region's backing pages directly;
the user's VMA aliases the kernel allocation. io_uring_mmap() then
just returns 0 -- it takes no page references.

The CONFIG_MMU path uses vm_insert_pages(), which takes a reference on
each inserted page.  Those references are released when the VMA is torn
down (zap_pte_range -> put_page). io_free_region() -> release_pages()
drops the io_uring-side references, but the pages survive until munmap
drops the VMA-side references.

Under NOMMU there are no VMA-side references. io_unregister_pbuf_ring ->
io_put_bl -> io_free_region -> release_pages drops the only references
and the pages return to the buddy allocator while the user's VMA still
has vm_start pointing into them.  The user can then write into whatever
the allocator hands out next.

Mirror the MMU lifetime: take get_page references in io_uring_mmap() and
release them via vm_ops->close.  NOMMU's delete_vma() calls vma_close()
which runs ->close on munmap.

This also incidentally addresses the duplicate-vm_start case: two mmaps
of SQ_RING and CQ_RING resolve to the same ctx->ring_region pointer.
With page refs taken per mmap, the second mmap takes its own refs and
the pages survive until both mmaps are closed.  The nommu rb-tree BUG_ON
on duplicate vm_start is a separate mm/nommu.c concern (it should share
the existing region rather than BUG), but the page lifetime is now
correct.

Cc: Jens Axboe <axboe@kernel.dk>
Reported-by: Anthropic
Assisted-by: gkh_clanker_t1000
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042115-body-attention-d15b@gregkh
[axboe: get rid of region lookup, just iterate pages in vma]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'probes-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:
"fprobe bug fixes:

   - Prevent re-registration

     Add an earlier check to reject re-registering an already active
     fprobe before its state is modified during the initialization phase

   - Robustness in failure paths:
      - Ensure fprobes are correctly removed from all internal tables
        and properly RCU-freed during registration failure
      - Make unregister_fprobe() proceed with unregistration even if
        temporary memory allocation fails

   - RCU safety in module unloading

     Avoid a potential "sleep in RCU" warning by removing a kcalloc()
     call in the module notifier path. This also tries to remove
     fprobe_hash_node even if memory allocation fails.

   - Type-aware unregistration

     Fix a bug where unregistering an fprobe did not account for
     different types (entry-only vs entry-exit) at the same address,
     which previously left "junk" entries in the underlying
     ftrace/fgraph ops

   - Unregistration of empty ftrace_ops

     Avoid unneeded performance overhead due to making registered
     ftrace_ops empty - which means 'trace all functions'. This counts
     remaining entries and unregister ftrace_ops when it becomes empty.

  Two new selftests to check above fixes:

   - Module Unloading Test:

     Specifically verifies that fprobe events on a module are correctly
     cleaned up and do not trigger 'trace-all' behavior when the module
     is removed.

   - Multiple Fprobe Events Test:

     Ensure that having multiple fprobes on the same function correctly
     manages the ftrace hash map during removal"

* tag 'probes-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  selftests/ftrace: Add a testcase for multiple fprobe events
  selftests/ftrace: Add a testcase for fprobe events on module
  tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading
  tracing/fprobe: Check the same type fprobe on table as the unregistered one
  tracing/fprobe: Avoid kcalloc() in rcu_read_lock section
  tracing/fprobe: Remove fprobe from hash in failure path
  tracing/fprobe: Unregister fprobe even if memory allocation fails
  tracing/fprobe: Reject registration of a registered fprobe before init

io_uring/poll: ensure EPOLL_ONESHOT is propagated for EPOLL_URING_WAKE

Commit:

aacf2f9f382c ("io_uring: fix req->apoll_events")

fixed an issue where poll->events and req->apoll_events weren't
synchronized, but then when the commit referenced in Fixes got added,
it didn't ensure the same thing.

If we mask in EPOLLONESHOT in the regular EPOLL_URING_WAKE path, then
ensure it's done for both. Including a link to the original report
below, even though it's mostly nonsense. But it includes a reproducer
that does show that IORING_CQE_F_MORE is set in the previous CQE,
while no more CQEs will be generated for this request. Just ignore
anything that pretends this is security related in any way, it's just
the typical AI nonsense.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/io-uring/CAM0zi7yQzF3eKncgHo4iVM5yFLAjsiob_ucqyWKs=hyd_GqiMg@mail.gmail.com/
Reported-by: Azizcan Daştan <azizcan.d@mileniumsec.com>
Fixes: 4464853277d0 ("io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeups")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'amd-drm-next-7.1-2026-04-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-7.1-2026-04-17:

amdgpu:
- SMU 14 fixes
- Partition fixes
- SMUIO 15.x fix
- SR-IOV fixes
- JPEG fix
- PSP 15.x fix
- NBIF fix
- Devcoredump fixes
- DPC fix
- RAS fixes
- Aldebaran smu fix
- IP discovery fix
- SDMA 7.1 fix
- Runtime pm fix
- MES 12.1 fix
- DML2 fixes
- DCN 4.2 fixes
- YCbCr fixes
- Freesync fixes
- ISM fixes
- Overlay cursor fix
- DC FP fixes
- UserQ locking fixes

amdkfd:
- Fix memory clear handling

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260417225351.8714-1-alexander.deucher@amd.com

scsi: target: iscsi: reject invalid size Extended CDB AHS

If ecdb_ahdr->ahslength is zero, two bugs follow:

  kmalloc(be16_to_cpu(ecdb_ahdr->ahslength) + 15, ...)

allocates 15 bytes, but the immediately following memcpy writes
ISCSI_CDB_SIZE (16) bytes into it, a one-byte heap overflow. Also:

  memcpy(cdb + ISCSI_CDB_SIZE, ecdb_ahdr->ecdb,
           be16_to_cpu(ecdb_ahdr->ahslength) - 1);

(u16)0 - 1 promotes to (int)-1 which converts to SIZE_MAX as size_t,
causing a massive out-of-bounds write.

Reject ahslength == 0 with ISCSI_REASON_PROTOCOL_ERROR before the kmalloc.
Also reject ahslength values that exceed the actual AHS buffer advertised.

Fixes: 8f1f7d297bce ("scsi: target: iscsi: Add support for extended CDB AHS")
Signed-off-by: Carlos Bilbao <carlos.bilbao@kernel.org>
Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Link: https://patch.msgid.link/20260415040728.187680-1-carlos.bilbao@kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: ufs: core: Fix bRefClkFreq write failure in HS-LSS mode

According to the UFS spec, the bRefClkFreq attribute can only be written
when both sub-links are in LS-MODE. However, in HS LSS mode with
resetmode = HS_MODE, if the UFS device's default bRefClkFreq value
differs from the host controller's dev_ref_clk_freq setting, the write
operation will fail.

To fix this issue, introduce ufshcd_get_op_mode() function to detect the
current link operational mode. Call ufshcd_set_dev_ref_clk() only when
both sub-links are in LS-MODE to ensure the attribute can be written
successfully.

Signed-off-by: Wang Shuaiwei <wangshuaiwei1@xiaomi.com>
Link: https://patch.msgid.link/20260414033718.1459540-1-wangshuaiwei1@xiaomi.com
Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

Merge tag 'drm-next-2026-04-22' of https://gitlab.freedesktop.org/drm/kernel

Pull more drm updates from Dave Airlie:
"This is a followup which is mostly next material with some fixes.

  Alex pointed out I missed one of his AMD MRs from last week, so I
  added that, then Jani sent the pipe reordering stuff, otherwise it's
  just some minor i915 fixes and a dma-buf fix.

  drm:
   - Add support for AMD VSDB parsing to drm_edid

  dma-buf:
   - fix documentation formatting

  i915:
   - add support for reordered pipes to support joined pipes better
   - Fix VESA backlight possible check condition
   - Verify the correct plane DDB entry

  amdgpu:
   - Audio regression fix
   - Use drm edid parser for AMD VSDB
   - Misc cleanups
   - VCE cs parse fixes
   - VCN cs parse fixes
   - RAS fixes
   - Clean up and unify vram reservation handling
   - GPU Partition updates
   - system_wq cleanups
   - Add CONFIG_GCOV_PROFILE_AMDGPU kconfig option
   - SMU vram copy updates
   - SMU 13/14/15 fixes
   - UserQ fixes
   - Replace pasid idr with an xarray
   - Dither handling fix
   - Enable amdgpu by default for CIK APUs
   - Add IBs to devcoredump

  amdkfd:
   - system_wq cleanups

  radeon:
   - system_wq cleanups"

* tag 'drm-next-2026-04-22' of https://gitlab.freedesktop.org/drm/kernel: (62 commits)
  drm/i915/display: change pipe allocation order for discrete platforms
  drm/i915/wm: Verify the correct plane DDB entry
  drm/i915/backlight: Fix VESA backlight possible check condition
  drm/i915: Walk crtcs in pipe order
  drm/i915/joiner: Make joiner "nomodeset" state copy independent of pipe order
  dma-buf: fix htmldocs error for dma_buf_attach_revocable
  drm/amdgpu: dump job ibs in the devcoredump
  drm/amdgpu: store ib info for devcoredump
  drm/amdgpu: extract amdgpu_vm_lock_by_pasid from amdgpu_vm_handle_fault
  drm/amdgpu: Use amdgpu by default for CIK APUs too
  drm/amd/display: Remove unused NUM_ELEMENTS macros
  drm/amd/display: Replace inline NUM_ELEMENTS macro with ARRAY_SIZE
  drm/amdgpu: save ring content before resetting the device
  drm/amdgpu: make userq fence_drv drop explicit in queue destroy
  drm/amdgpu: rework userq fence driver alloc/destroy
  drm/amdgpu/userq: use dma_fence_wait_timeout without test for signalled
  drm/amdgpu/userq: call dma_resv_wait_timeout without test for signalled
  drm/amdgpu/userq: add the return code too in error condition
  drm/amdgpu/userq: fence wait for max time in amdgpu_userq_wait_for_signal
  drm/amd/display: Change dither policy for 10 bpc output back to dithering
  ...

selftests/ftrace: Add a testcase for multiple fprobe events

Add a testcase for multiple fprobe events on the same function
so that it clears ftrace hash map correctly when removing the
events.

Link: https://lore.kernel.org/all/177669370353.132053.16801520791509406141.stgit@mhiramat.tok.corp.google.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

selftests/ftrace: Add a testcase for fprobe events on module

Add a testcase for fprobe events on module, which unloads a kernel
module on which fprobe events are probing and ensure the ftrace
hash map is cleared correctly.

Link: https://lore.kernel.org/all/177669369564.132053.623527664540176496.stgit@mhiramat.tok.corp.google.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloading

Fix fprobe to unregister ftrace_ops if corresponding type of fprobe
does not exist on the fprobe_ip_table and it is expected to be empty
when unloading modules.

Since ftrace thinks that the empty hash means everything to be traced,
if we set fprobes only on the unloaded module, all functions are traced
unexpectedly after unloading module.
e.g.

# modprobe xt_LOG.ko
# echo 'f:test log_tg*' > dynamic_events
# echo 1 > events/fprobes/test/enable
# cat enabled_functions
log_tg [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1)               tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1)             tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
# rmmod xt_LOG
# wc -l enabled_functions
34085 enabled_functions

Link: https://lore.kernel.org/all/177669368776.132053.10042301916765771279.stgit@mhiramat.tok.corp.google.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

ceph: add subvolume metrics collection and reporting

Add complete infrastructure for per-subvolume I/O metrics collection
and reporting to the MDS. This enables administrators to monitor I/O
patterns at the subvolume granularity, which is useful for multi-tenant
CephFS deployments.

This patch adds:
- CEPHFS_FEATURE_SUBVOLUME_METRICS feature flag for MDS negotiation
- CEPH_SUBVOLUME_ID_NONE constant (0) for unknown/unset state
- Red-black tree based metrics tracker for efficient per-subvolume
aggregation with kmem_cache for entry allocations
- Wire format encoding matching the MDS C++ AggregatedIOMetrics struct
- Integration with the existing CLIENT_METRICS message
- Recording of I/O operations from file read/write and writeback paths
- Debugfs interfaces for monitoring (metrics/subvolumes, metrics/metric_features)

Metrics tracked per subvolume include:
- Read/write operation counts
- Read/write byte counts
- Read/write latency sums (for average calculation)

The metrics are periodically sent to the MDS as part of the existing
metrics reporting infrastructure when the MDS advertises support for
the SUBVOLUME_METRICS feature.

CEPH_SUBVOLUME_ID_NONE enforces subvolume_id immutability. Following
the FUSE client convention, 0 means unknown/unset. Once an inode has
a valid (non-zero) subvolume_id, it should not change during the
inode's lifetime.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: parse subvolume_id from InodeStat v9 and store in inode

Add support for parsing the subvolume_id field from InodeStat v9 and
storing it in the inode for later use by subvolume metrics tracking.

The subvolume_id identifies which CephFS subvolume an inode belongs to,
enabling per-subvolume I/O metrics collection and reporting.

This patch:
- Adds subvolume_id field to struct ceph_mds_reply_info_in
- Adds i_subvolume_id field to struct ceph_inode_info
- Parses subvolume_id from v9 InodeStat in parse_reply_info_in()
- Adds ceph_inode_set_subvolume() helper to propagate the ID to inodes
- Initializes i_subvolume_id in inode allocation and clears on destroy

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: handle InodeStat v8 versioned field in reply parsing

Add forward-compatible handling for the new versioned field introduced
in InodeStat v8. This patch only skips the field without using it,
preparing for future protocol extensions.

The v8 encoding adds a versioned sub-structure that needs to be properly
decoded and skipped to maintain compatibility with newer MDS versions.

Signed-off-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: Fix slab-out-of-bounds access in auth message processing

If a (potentially corrupted) message of type CEPH_MSG_AUTH_REPLY
contains a positive value in its result field, it is treated as an
error code by ceph_handle_auth_reply() and returned to
handle_auth_reply(). Thereafter, an attempt is made to send the
preallocated message of type CEPH_MSG_AUTH, where the returned value is
interpreted as the size of the front segment to send. If the result
value in the message is greater than the size of the memory buffer
allocated for the front segment, an out-of-bounds access occurs, and
the content of the memory region beyond this buffer is sent out.

This patch fixes the issue by treating only negative values in the
result field as errors. Positive values are therefore treated as success
in the same way as a zero value. Additionally, a BUG_ON is added to
__send_prepared_auth_request() comparing the len parameter to
front_alloc_len to prevent sending the message if it exceeds the bounds
of the allocation and to make it easier to catch any logic flaws leading
to this.

Cc: stable@vger.kernel.org
Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

rbd: fix null-ptr-deref when device_add_disk() fails

do_rbd_add() publishes the device with device_add() before calling
device_add_disk(). If device_add_disk() fails after device_add()
succeeds, the error path calls rbd_free_disk() directly and then later
falls through to rbd_dev_device_release(), which calls rbd_free_disk()
again. This double teardown can leave blk-mq cleanup operating on
invalid state and trigger a null-ptr-deref in
__blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set().

Fix this by following the normal remove ordering: call device_del()
before rbd_dev_device_release() when device_add_disk() fails after
device_add(). That keeps the teardown sequence consistent and avoids
re-entering disk cleanup through the wrong path.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available.

We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64
guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer
confines failslab injections to the __add_disk() range and injects
fail-nth while mapping an RBD image through
/sys/bus/rbd/add_single_major.

On the unpatched kernel, fail-nth=4 reliably triggered the fault:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 UID: 0 PID: 273 Comm: bash Not tainted 7.0.0-01247-gd60bc1401583 #6 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__blk_mq_free_map_and_rqs+0x8c/0x240
Code: 00 00 48 8b 6b 60 41 89 f4 49 c1 e4 03 4c 01 e5 45 85 ed 0f 85 0a 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 e9 48 c1 e9 03 <80> 3c 01 00 0f 85 31 01 00 00 4c 8b 6d 00 4d 85 ed 0f 84 e2 00 00
RSP: 0018:ff1100000ab0fac8 EFLAGS: 00000246
RAX: dffffc0000000000 RBX: ff1100000c4806a0 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ff1100000c4806f4
RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c000189001b
R10: ff1100000c4800df R11: ff1100006cf37be0 R12: 0000000000000000
R13: 0000000000000000 R14: ff1100000c480700 R15: ff1100000c480004
FS: 00007f0fbe8fe740(0000) GS:ff110000e5851000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe53473b2e0 CR3: 0000000012eef000 CR4: 00000000007516f0
PKRU: 55555554
Call Trace:
<TASK>
blk_mq_free_tag_set+0x77/0x460
do_rbd_add+0x1446/0x2b80
? __pfx_do_rbd_add+0x10/0x10
? lock_acquire+0x18c/0x300
? find_held_lock+0x2b/0x80
? sysfs_file_kobj+0xb6/0x1b0
? __pfx_sysfs_kf_write+0x10/0x10
kernfs_fop_write_iter+0x2f4/0x4a0
vfs_write+0x98e/0x1000
? expand_files+0x51f/0x850
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf2/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x690
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f0fbea15907
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffe22346ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000058 RCX: 00007f0fbea15907
RDX: 0000000000000058 RSI: 0000563ace6c0ef0 RDI: 0000000000000001
RBP: 0000563ace6c0ef0 R08: 0000563ace6c0ef0 R09: 6b6435726d694141
R10: 5250337279762f78 R11: 0000000000000246 R12: 0000000000000058
R13: 00007f0fbeb1c780 R14: ff1100000c480700 R15: ff1100000c480004
</TASK>

With this fix applied, rerunning the reproducer over fail-nth=1..256
yields no KASAN reports.

[ idryomov: rename err_out_device_del -> err_out_device ]

Cc: stable@vger.kernel.org
Fixes: 27c97abc30e2 ("rbd: add add_disk() error handling")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

crush: cleanup in crush_do_rule() method

Commit 41ebcc0907c5 ("crush: remove forcefeed functionality") from
May 7, 2012 (linux-next), leads to the following Smatch static
checker warning:

net/ceph/crush/mapper.c:1015 crush_do_rule()
warn: iterator 'j' not incremented

Before commit 41ebcc0907c5 ("crush: remove forcefeed functionality"),
we had this logic:

  j = 0;
  if (osize == 0 && force_pos >= 0) {
      o[osize] = force_context[force_pos];
      if (recurse_to_leaf)
          c[osize] = force_context[0];
      j++;           /* <-- this was the only increment, now gone */
      force_pos--;
  }
  /* then crush_choose_*(..., o+osize, j, ...) */

Now, the variable j is dead code — a variable that is set
and never meaningfully varied. This patch simply removes
the dead code.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails

This MDS reconnect error path leaves s_cap_reconnect set.
send_mds_reconnect() sets the bit at the beginning of the reconnect,
but the first failing operation after that, ceph_pagelist_encode_32(),
can jump to `fail:` without clearing it.

__ceph_remove_cap() consults that flag to decide whether cap releases
should be queued. A reconnect-preparation failure therefore leaves the
session in reconnect mode from the cap-release path's point of view
and can strand release work until some later state transition repairs
it.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: only d_add() negative dentries when they are unhashed

Ceph can call d_add(dentry, NULL) on a negative dentry that is already
present in the primary dcache hash.

In the current VFS that is not safe.  d_add() goes through __d_add()
to __d_rehash(), which unconditionally reinserts dentry->d_hash into
the hlist_bl bucket.  If the dentry is already hashed, reinserting the
same node can corrupt the bucket, including creating a self-loop.
Once that happens, __d_lookup() can spin forever in the hlist_bl walk,
typically looping only on the d_name.hash mismatch check and
eventually triggering RCU stall reports like this one:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu:         87-....: (2100 ticks this GP) idle=3a4c/1/0x4000000000000000 softirq=25003319/25003319 fqs=829
rcu:         (t=2101 jiffies g=79058445 q=698988 ncpus=192)
CPU: 87 UID: 2952868916 PID: 3933303 Comm: php-cgi8.3 Not tainted 6.18.17-i1-amd #950 NONE
Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.6 09/22/2023
RIP: 0010:__d_lookup+0x46/0xb0
Code: c1 e8 07 48 8d 04 c2 48 8b 00 49 89 fc 49 89 f5 48 89 c3 48 83 e3 fe 48 83 f8 01 77 0f eb 2d 0f 1f 44 00 00 48 8b 1b 48 85 db <74> 20 39 6b 18 75 f3 48 8d 7b 78 e8 ba 85 d0 00 4c 39 63 10 74 1f
RSP: 0018:ff745a70c8253898 EFLAGS: 00000282
RAX: ff26e470054cb208 RBX: ff26e470054cb208 RCX: 000000006e958966
RDX: ff26e48267340000 RSI: ff745a70c82539b0 RDI: ff26e458f74655c0
RBP: 000000006e958966 R08: 0000000000000180 R09: 9cd08d909b919a89
R10: ff26e458f74655c0 R11: 0000000000000000 R12: ff26e458f74655c0
R13: ff745a70c82539b0 R14: d0d0d0d0d0d0d0d0 R15: 2f2f2f2f2f2f2f2f
FS:  00007f5770896980(0000) GS:ff26e482c5d88000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5764de50c0 CR3: 000000a72abb5001 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
  <TASK>
  lookup_fast+0x9f/0x100
  walk_component+0x1f/0x150
  link_path_walk+0x20e/0x3d0
  path_lookupat+0x68/0x180
  filename_lookup+0xdc/0x1e0
  vfs_statx+0x6c/0x140
  vfs_fstatat+0x67/0xa0
  __do_sys_newfstatat+0x24/0x60
  do_syscall_64+0x6a/0x230
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is reachable with reused cached negative dentries.  A Ceph lookup
or atomic_open can be handed a negative dentry that is already hashed,
and fs/ceph/dir.c then hits one of two paths that incorrectly assume
"negative" also means "unhashed":

  - ceph_finish_lookup():
      MDS reply is -ENOENT with no trace
      -> d_add(dentry, NULL)

  - ceph_lookup():
      local ENOENT fast path for a complete directory with shared caps
      -> d_add(dentry, NULL)

Both paths can therefore re-add an already-hashed negative dentry.

Ceph already uses the correct pattern elsewhere: ceph_fill_trace() only
calls d_add(dn, NULL) for a negative null-dentry reply when d_unhashed(dn)
is true.

Fix both fs/ceph/dir.c sites the same way: only call d_add() for a
negative dentry when it is actually unhashed.  If the negative dentry
is already hashed, leave it in place and reuse it as-is.

This preserves the existing behavior for unhashed dentries while
avoiding d_hash list corruption for reused hashed negatives.

Cc: stable@vger.kernel.org
Fixes: 2817b000b02c ("ceph: directory operations")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: update outdated comment in ceph_sock_write_space()

The function try_write() was renamed to ceph_con_v1_try_write()
in commit 566050e17e53 ("libceph: separate msgr1 protocol
implementation") and subsequently moved to net/ceph/messenger_v1.c
in commit 2f713615ddd9 ("libceph: move msgr1 protocol implementation
to its own file"). Update the comment in ceph_sock_write_space()
accordingly.

[ idryomov: account for msgr2 in the updated comment as well ]

Signed-off-by: kexinsun <kexinsun@smail.nju.edu.cn>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: Remove obsolete session key alignment logic

Since the call to crypto_shash_setkey() was replaced with
hmac_sha256_preparekey() which doesn't allocate memory regardless of the
alignment of the input key, remove the session key alignment logic from
process_auth_done(). Also remove the inclusion of crypto/hash.h, which
is no longer needed since crypto_shash is no longer used.

[ idryomov: rewrap comment ]

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

ceph: fix num_ops off-by-one when crypto allocation fails

move_dirty_folio_in_page_array() may fail if the file is encrypted, the
dirty folio is not the first in the batch, and it fails to allocate a
bounce buffer to hold the ciphertext. When that happens,
ceph_process_folio_batch() simply redirties the folio and flushes the
current batch -- it can retry that folio in a future batch.

However, if this failed folio is not contiguous with the last folio that
did make it into the batch, then ceph_process_folio_batch() has already
incremented `ceph_wbc->num_ops`; because it doesn't follow through and
add the discontiguous folio to the array, ceph_submit_write() -- which
expects that `ceph_wbc->num_ops` accurately reflects the number of
contiguous ranges (and therefore the required number of "write extent"
ops) in the writeback -- will panic the kernel:

BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);

This issue can be reproduced on affected kernels by writing to
fscrypt-enabled CephFS file(s) with a 4KiB-written/4KiB-skipped/repeat
pattern (total filesize should not matter) and gradually increasing the
system's memory pressure until a bounce buffer allocation fails.

Fix this crash by decrementing `ceph_wbc->num_ops` back to the correct
value when move_dirty_folio_in_page_array() fails, but the folio already
started counting a new (i.e. still-empty) extent.

The defect corrected by this patch has existed since 2022 (see first
`Fixes:`), but another bug blocked multi-folio encrypted writeback until
recently (see second `Fixes:`). The second commit made it into 6.18.16,
6.19.6, and 7.0-rc1, unmasking the panic in those versions. This patch
therefore fixes a regression (panic) introduced by cac190c7674f.

Cc: stable@vger.kernel.org
Fixes: d55207717ded ("ceph: add encryption support to writepage and writepages")
Fixes: cac190c7674f ("ceph: fix write storm on fscrypted files")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()

If a message of type CEPH_MSG_AUTH_REPLY contains a zero value for both
protocol and result, this is currently not treated as an error. In case
of ac->negotiating == true and ac->protocol > 0, this leads to setting
ac->protocol = 0 and ac->ops = NULL. Thereafter, the check for
ac->protocol != protocol returns false, and init_protocol() is not
called. Subsequently, ac->ops->handle_reply() is called, which leads to
a null pointer dereference, because ac->ops is still NULL.

This patch changes the check for ac->protocol != protocol to
!ac->protocol, as this also includes the case when the protocol was set
to zero in the message. This causes the message to be treated as
containing a bad auth protocol.

Cc: stable@vger.kernel.org
Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

s390/bpf: Inline smp_processor_id and current_task

Inline these calls in bpf jit:
- bpf_get_smp_processor_id()
- bpf_get_current_task()
- bpf_get_current_task_btf()

s390 has a 8 KiB per-CPU prefix area in the CPU's
virtual address space, called the lowcore. It is a
struct that contains the cpu number and a pointer
to the current task. These are exactly the values
returned by the BPF helpers.

Emit a load from the lowcore instead of a helper
function call.

JIT output for `bpf_get_smp_processor_id`:

Before:                       After:
---------------               ----------------
brasl   %r14,0x3ffe0385460    ly      %r14,928
lgr     %r14,%r2

JIT output for `bpf_get_current_task`:

Before:                        After:
---------------                ----------------
brasl   %r14,0x3ffe0362a90     lg      %r14,832
lgr     %r14,%r2

Benchmark using [1] on KVM(virtme-ng).

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+--------------------+--------------------+--------------+
|     Name      |       Before       |       After        |   % change   |
|---------------+--------------------+--------------------+--------------|
| glob-arr-inc  | 244.954 ± 0.654M/s | 278.501 ± 0.834M/s |   + 13.70%   |
| arr-inc       | 311.597 ± 1.016M/s | 313.610 ± 0.331M/s |   + 0.65%    |
| hash-inc      | 47.421 ± 0.017M/s  | 47.600 ± 0.004M/s  |   + 0.38%    |
+---------------+--------------------+--------------------+--------------+

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Maxim Khmelevskii <max@linux.ibm.com>
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20260414142930.528751-1-max@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

x86/cpu: Disable FRED when PTI is forced on

FRED and PTI were never intended to work together. No FRED hardware is
vulnerable to Meltdown and all of it should have LASS anyway.
Nevertheless, if you boot a system with pti=on and fred=on, the kernel
tries to do what is asked of it and dies a horrible death on the first
attempt to run userspace (since it never switches to the user page
tables).

Disable FRED when PTI is forced on, and print a warning about it.

A quick brain dump about what a FRED+PTI implementation would look like
is below. I'm not sure it would make any sense to do it, but never say
never. All I know is that it's way too complicated to be worth it today.

<brain dump>
The SWITCH_TO_USER/KERNEL_CR3 bits are simple to fix (or at least we
have the assembly tools to do it already), as is sticking the FRED entry
text in .entry.text (it's not in there today).

The nasty part is the stacks. Today, the CPU pops into the kernel on
MSR_IA32_FRED_RSP0 which is normal old kernel memory and not mapped to
userspace. The hardware pushes gunk on to MSR_IA32_FRED_RSP0, which is
currently the task stacks. MSR_IA32_FRED_RSP0 would need to point
elsewhere, probably cpu_entry_stack(). Then, start playing games with
stacks on entry/exit, including copying gunk to and from the task stack.

While I'd *like* to have PTI everywhere, I'm not sure it's worth mucking
up the FRED code with PTI kludges. If a user wants fast entry/exit, they
use FRED. If you want PTI (and sekuritay), you certainly don't care
about fast entry and FRED isn't going to help you *all* that much, so
you can just stay with the IDT.

Plus, FRED hardware should have LASS which gives you a similar security
profile to PTI without the CR3 munging.
</brain dump>

Reported-by: Gayatri Kammela <Gayatri.Kammela@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Cc:stable@vger.kernel.org
Link: https://patch.msgid.link/20260421163136.E7C6788A@davehans-spike.ostc.intel.com

Merge tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
"In this round, the changes primarily focus on resolving race
  conditions, memory safety issues (UAF), and improving the robustness
  of garbage collection (GC), and folio management.

  Enhancements:
   - add page-order information for large folio reads in iostat
   - add defrag_blocks sysfs node

  Bug fixes:
   - fix uninitialized kobject put in f2fs_init_sysfs()
   - disallow setting an extension to both cold and hot
   - fix node_cnt race between extent node destroy and writeback
   - preserve previous reserve_{blocks,node} value when remount
   - freeze GC and discard threads quickly
   - fix false alarm of lockdep on cp_global_sem lock
   - fix data loss caused by incorrect use of nat_entry flag
   - skip empty sections in f2fs_get_victim
   - fix inline data not being written to disk in writeback path
   - fix fsck inconsistency caused by FGGC of node block
   - fix fsck inconsistency caused by incorrect nat_entry flag usage
   - call f2fs_handle_critical_error() to set cp_error flag
   - fix fiemap boundary handling when read extent cache is incomplete
   - fix use-after-free of sbi in f2fs_compress_write_end_io()
   - fix UAF caused by decrementing sbi->nr_pages[] in f2fs_write_end_io()
   - fix incorrect file address mapping when inline inode is unwritten
   - fix incomplete search range in f2fs_get_victim when f2fs_need_rand_seg is enabled
   - avoid memory leak in f2fs_rename()"

* tag 'f2fs-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (35 commits)
  f2fs: add page-order information for large folio reads in iostat
  f2fs: do not support mmap write for large folio
  f2fs: fix uninitialized kobject put in f2fs_init_sysfs()
  f2fs: protect extension_list reading with sb_lock in f2fs_sbi_show()
  f2fs: disallow setting an extension to both cold and hot
  f2fs: fix node_cnt race between extent node destroy and writeback
  f2fs: allow empty mount string for Opt_usr|grp|projjquota
  f2fs: fix to preserve previous reserve_{blocks,node} value when remount
  f2fs: invalidate block device page cache on umount
  f2fs: fix to freeze GC and discard threads quickly
  f2fs: fix to avoid uninit-value access in f2fs_sanity_check_node_footer
  f2fs: fix false alarm of lockdep on cp_global_sem lock
  f2fs: fix data loss caused by incorrect use of nat_entry flag
  f2fs: fix to skip empty sections in f2fs_get_victim
  f2fs: fix inline data not being written to disk in writeback path
  f2fs: fix fsck inconsistency caused by FGGC of node block
  f2fs: fix fsck inconsistency caused by incorrect nat_entry flag usage
  f2fs: fix to do sanity check on dcc->discard_cmd_cnt conditionally
  f2fs: refactor node footer flag setting related code
  f2fs: refactor f2fs_move_node_folio function
  ...

tools/power turbostat: Fix AMD RAPL regression on big systems

turbostat.c:8688: rapl_perf_init: Assertion `next_domain < num_domains' failed.

The initial fix for this regression was incomplete, as it did not
handle multi-package systems with sparse core ids.

Fixes: ef0e60083f76 ("tools/power turbostat: Fix AMD RAPL regression")
Signed-off-by: Len Brown <len.brown@intel.com>

Merge tag 'libnvdimm-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax updates from Ira Weiny:
"The series adds DAX support required for the upcoming fuse/famfs file
  system.[1] The support here is required because famfs is backed by
  devdax rather than pmem. This all lays the groundwork for using shared
  memory as a file system"

Link: https://lore.kernel.org/all/0100019d43e5f632-f5862a3e-361c-4b54-a9a6-96c242a8f17a-000000@email.amazonses.com/
* tag 'libnvdimm-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax/fsdev: fix uninitialized kaddr in fsdev_dax_zero_page_range()
  dax: export dax_dev_get()
  dax: Add fs_dax_get() func to prepare dax for fs-dax usage
  dax: Add dax_set_ops() for setting dax_operations at bind time
  dax: Add dax_operations for use by fs-dax on fsdev dax
  dax: Save the kva from memremap
  dax: add fsdev.c driver for fs-dax on character dax
  dax: Factor out dax_folio_reset_order() helper
  dax: move dax_pgoff_to_phys from [drivers/dax/] device.c to bus.c

drm/amd/display: Disable 10-bit truncation and dithering on DCE 6.x

DCE 6.x doesn't support 10-bit truncation and 10-bit dithering
because the following fields are 1-bit only:
FMT_TEMPORAL_DITHER_DEPTH
FMT_SPATIAL_DITHER_DEPTH
FMT_TRUNCATE_DEPTH
Programming these fields to "2" will program them as if the
dithering option was 6-bit, resulting in sub-par picture
quality and an ugly "color banding" effect.

Note that a recent commit changed the default 10-bit dithering
option to DITHER_OPTION_SPATIAL10 which improves the picture
quality because it happens to look better, but is still not
actually supported by DCE 6.x versions.

When the color depth is 10-bit or more, just disable
any kind of dithering options on DCE 6.x.

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5151
Fixes: 529cad0f945c ("drm/amd/display: Add function to set dither option")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6be8ced880dfe29ce38c2d5e74489822da5c250e)

drm/amdgpu: OR init_pte_flags into invalid leaf PTE updates

Invalid leaf clears that only set AMDGPU_PTE_EXECUTABLE match the old
GMC9 fault-priority workaround but omit adev->gmc.init_pte_flags.
On GFX12 that includes AMDGPU_PTE_IS_PTE; without it, some cleared
PTEs can fault as no-retry and bypass the SVM/XNACK handler when a
VA is reused after a BO unmap.

Apply init_pte_flags in amdgpu_vm_pte_update_flags() alongside
EXECUTABLE so range-driven clears (e.g. amdgpu_vm_clear_freed) match
amdgpu_vm_pt_clear() for leaf templates.

Signed-off-by: Siwei He <siwei.he@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9d47b2c36b9a6c6b844c33cab407a5d7ad102234)

Merge tag 'pull-coda' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull coda dcache updates from Al Viro:
"Coda dcache-related cleanups and fixes"

* tag 'pull-coda' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_flag_children(): fix a UAF
  sanitize coda_dentry_delete()
  coda: is_bad_inode() is always false there

drm/amd: Adjust ASPM support quirk to cover more Intel hosts

Some of the same issues identified in commit c770ef19673fb
("drm/amd/amdgpu: disable ASPM in some situations") also affect
Tiger Lake systems with GFX11 connected over USB4. Widen the net
to also match these hosts.

Fixes: d9b3a066dfcd ("drm/amd: Exclude dGPUs in eGPU enclosures from DPM quirks")
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5145
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a214d888485b9f35fe03882a92962e6d5697849)

drm/amd/display: Undo accidental fix revert in amdgpu_dm_ism.c

[Why]

Pausing DPM power profiles during static screen caused a bunch of
audio/performance/clock issues that were addressed in this fix:
'commit 1412482b7143 ("Revert "drm/amd/display: pause the workload setting in dm"")'

This logic in function amdgpu_dm_crtc_vblank_control_worker() was moved
to amdgpu_dm_ism.c, but the fix was lost in the process.

[How]

Reapply the fix to amdgpu_dm_ism.c

Fixes: 754003486c3c ("drm/amd/display: Add Idle state manager(ISM)")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bc621e91d6fc004cfae9148c5a91acad19ada3e4)

drm/amdkfd: Add upper bound check for num_of_nodes

drm/amdkfd: Add upper bound check for num_of_nodes
in kfd_ioctl_get_process_apertures_new.

Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Disable 10-bit truncation and dithering on DCE 6.x

DCE 6.x doesn't support 10-bit truncation and 10-bit dithering
because the following fields are 1-bit only:
FMT_TEMPORAL_DITHER_DEPTH
FMT_SPATIAL_DITHER_DEPTH
FMT_TRUNCATE_DEPTH
Programming these fields to "2" will program them as if the
dithering option was 6-bit, resulting in sub-par picture
quality and an ugly "color banding" effect.

Note that a recent commit changed the default 10-bit dithering
option to DITHER_OPTION_SPATIAL10 which improves the picture
quality because it happens to look better, but is still not
actually supported by DCE 6.x versions.

When the color depth is 10-bit or more, just disable
any kind of dithering options on DCE 6.x.

Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5151
Fixes: 529cad0f945c ("drm/amd/display: Add function to set dither option")
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: OR init_pte_flags into invalid leaf PTE updates

Invalid leaf clears that only set AMDGPU_PTE_EXECUTABLE match the old
GMC9 fault-priority workaround but omit adev->gmc.init_pte_flags.
On GFX12 that includes AMDGPU_PTE_IS_PTE; without it, some cleared
PTEs can fault as no-retry and bypass the SVM/XNACK handler when a
VA is reused after a BO unmap.

Apply init_pte_flags in amdgpu_vm_pte_update_flags() alongside
EXECUTABLE so range-driven clears (e.g. amdgpu_vm_clear_freed) match
amdgpu_vm_pt_clear() for leaf templates.

Signed-off-by: Siwei He <siwei.he@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd: Adjust ASPM support quirk to cover more Intel hosts

Some of the same issues identified in commit c770ef19673fb
("drm/amd/amdgpu: disable ASPM in some situations") also affect
Tiger Lake systems with GFX11 connected over USB4. Widen the net
to also match these hosts.

Fixes: d9b3a066dfcd ("drm/amd: Exclude dGPUs in eGPU enclosures from DPM quirks")
Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/5145
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx12.1: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx9.4.3: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx12: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx11: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx10: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx9: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx8: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/gfx7: align mqd settings with KFD

Make sure to set the quantum bits in the compute MQD
for better fairness across queues of the same priority.

Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/pm: Check SMUv13.0.6/12 metrics integrity

Check if data fetch is proper by matching the first few bytes against
0xFFs. If 0xFFs, that means data couldn't be read properly.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/display: Undo accidental fix revert in amdgpu_dm_ism.c

[Why]

Pausing DPM power profiles during static screen caused a bunch of
audio/performance/clock issues that were addressed in this fix:
'commit 1412482b7143 ("Revert "drm/amd/display: pause the workload setting in dm"")'

This logic in function amdgpu_dm_crtc_vblank_control_worker() was moved
to amdgpu_dm_ism.c, but the fix was lost in the process.

[How]

Reapply the fix to amdgpu_dm_ism.c

Fixes: 754003486c3c ("drm/amd/display: Add Idle state manager(ISM)")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Leo Li <sunpeng.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull more crypto library updates from Eric Biggers:
"Crypto library fix and documentation update:

   - Fix an integer underflow in the mpi library

   - Improve the crypto library documentation"

* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  lib/crypto: docs: Add rst documentation to Documentation/crypto/
  docs: kdoc: Expand 'at_least' when creating parameter list
  lib/crypto: mpi: Fix integer underflow in mpi_read_raw_from_sgl()

io_uring/zcrx: warn on freelist violations

The freelist is appropriately sized to always be able to take a free
niov, but let's be more defensive and check the invariant with a
warning. That should help to catch any double-free issues.

Suggested-by: Kai Aizen <kai@snailsploit.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/2f3cea363b04649755e3b6bb9ab66485a95936d5.1776760901.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: clear RQ headers on init

It might be unexpected to users if the RQ head/tail after a ring
creation are not zeroed, fix that.

Cc: stable@vger.kernel.org
Fixes: 6f377873cb239 ("io_uring/zcrx: add interface queue and refill queue")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/331f94663c3e8f021ffa3cb770ca2844a07d4855.1776760911.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: fix user_struct uaf

io_free_rbuf_ring() usees a struct user_struct, which
io_zcrx_ifq_free() puts it down before destroying the ring.

Cc: stable@vger.kernel.org
Fixes: 5c686456a4e83 ("io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/e560ae00960d27a810522a7efc0e201c82dff351.1776760917.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/register: fix ring resizing with mixed/large SQEs/CQEs

The ring resizing only properly handles "normal" sized SQEs or CQEs, if
there are pending entries around a resize. This normally should not be
the case, but the code is supposed to handle this regardless.

For the mixed SQE/CQE cases, the current copying works fine as they
are indexed in the same way. Each half is just copied separately. But
for fixed large SQEs and CQEs, the iteration and copy need to take that
into account.

Cc: stable@kernel.org
Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/futex: ensure partial wakes are appropriately dequeued

If a FUTEX_WAITV vectored operation is only partially woken, we
should call __futex_wake_mark() on the queue to account for that.
If not, then a later wakeup will wake the same entry, rather than
the next one in line.

Fixes: 8f350194d5cfd ("io_uring: add support for vectored futex waits")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rw: add defensive hardening for negative kbuf lengths

No real bug here, just being a bit defensive in ensuring that whatever
gets passed into io_put_kbuf() is always >= 0 and not some random error
value.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: use kvfree() for the imu cache

Currently anything that requires kvmalloc_flex() for allocations will
not get re-cached, and hence the cache freeing path is correct in that
it always uses kfree() to free the allocated memory. But this seems a
bit fragile as it's something that could get mix should that situation
change, so switch io_free_imu() and io_alloc_cache_free() to use kvfree
as the desctructor.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rsrc: unify nospec indexing for direct descriptors

For file updates, the node reset isn't capping the value via
array_index_nospec() like the other paths do. Ensure it's all sane and
have the update path do the proper capping as well.

Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix spurious fput in registered ring path

Fix an issue with io_uring_ctx_get_file() not gating fput() on whether
or not the file descriptor is a registered/direct one or not.

Fixes: c5e9f6a96bf7 ("io_uring: unify getting ctx from passed in file descriptor")
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'erofs-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:

- Fix dirent nameoff handling to avoid out-of-bound reads
   out of crafted images

- Fix two type truncation issues on 32-bit platforms

* tag 'erofs-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: unify lcn as u64 for 32-bit platforms
  erofs: fix offset truncation when shifting pgoff on 32-bit platforms
  erofs: fix the out-of-bounds nameoff handling for trailing dirents

vfio/cdx: Consolidate MSI configured state onto cdx_irqs

struct vfio_cdx_device carries three fields that track whether MSI has
been configured: vdev->cdx_irqs (the allocated vector array), vdev->
msi_count (the array length), and vdev->config_msi (a boolean flag).
The three are set together when vfio_cdx_msi_enable() succeeds and
cleared together by vfio_cdx_msi_disable().  However, the error paths
in vfio_cdx_msi_enable() free the cdx_irqs allocation on failure
without resetting the pointer, leaving it stale and skewed from the
other two fields until the next enable call overwrites it.

Clear vdev->cdx_irqs to NULL alongside the kfree() in both error paths
so the pointer consistently reflects the configured state.  With that
invariant restored and access to the MSI state serialized by
cdx_irqs_lock, vdev->config_msi is fully redundant with
(vdev->cdx_irqs != NULL).  Drop the config_msi field and switch all
readers to test cdx_irqs directly.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com>
Link: https://lore.kernel.org/r/20260417202800.88287-4-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/cdx: Serialize VFIO_DEVICE_SET_IRQS with a per-device mutex

vfio_cdx_set_msi_trigger() reads vdev->config_msi and operates on the
vdev->cdx_irqs array based on its value, but provides no serialization
against concurrent VFIO_DEVICE_SET_IRQS ioctls. Two callers can race
such that one observes config_msi as set while another clears it and
frees cdx_irqs via vfio_cdx_msi_disable(), resulting in a use-after-free
of the cdx_irqs array.

Add a cdx_irqs_lock mutex to struct vfio_cdx_device and acquire it in
vfio_cdx_set_msi_trigger(), which is the single chokepoint through
which all updates to config_msi, cdx_irqs, and msi_count flow, covering
both the ioctl path and the close-device cleanup path. This keeps the
test of config_msi atomic with the subsequent enable, disable, or
trigger operations.

Drop the pre-call !cdx_irqs test from vfio_cdx_irqs_cleanup() as part
of this change: the optimization it provided is redundant with the
!config_msi early-return inside vfio_cdx_msi_disable(), and leaving the
test in place would be an unsynchronized read of state the new lock is
meant to protect.

Fixes: 848e447e000c ("vfio/cdx: add interrupt support")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com>
Link: https://lore.kernel.org/r/20260417202800.88287-3-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/cdx: Fix NULL pointer dereference in interrupt trigger path

Add validation to ensure MSI is configured before accessing cdx_irqs
array in vfio_cdx_set_msi_trigger(). Without this check, userspace
can trigger a NULL pointer dereference by calling VFIO_DEVICE_SET_IRQS
with VFIO_IRQ_SET_DATA_BOOL or VFIO_IRQ_SET_DATA_NONE flags before
ever setting up interrupts via VFIO_IRQ_SET_DATA_EVENTFD.

The vfio_cdx_msi_enable() function allocates the cdx_irqs array and
sets config_msi to 1 only when called through the EVENTFD path. The
trigger loop (for DATA_BOOL/DATA_NONE) assumed this had already been
done, but there was no enforcement of this call ordering.

This matches the protection used in the PCI VFIO driver where
vfio_pci_set_msi_trigger() checks irq_is() before the trigger loop.

Fixes: 848e447e000c ("vfio/cdx: add interrupt support")
Cc: stable@vger.kernel.org
Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
Acked-by: Nipun Gupta <nipun.gupta@amd.com>
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Acked-by: Nikhil Agarwal <nikhil.agarwal@amd.com>
Link: https://lore.kernel.org/r/20260417202800.88287-2-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio: replace vfio->device_class with a const struct class

The class_create() call has been deprecated in favor of class_register()
as the driver core now allows for a struct class to be in read-only
memory. Replace vfio->device_class with a const struct class and drop
the class_create() call.

Compile tested with both CONFIG_VFIO_DEVICE_CDEV on and off (and
CONFIG_VFIO on); found no errors/warns in dmesg.

Link: https://lore.kernel.org/all/2023040244-duffel-pushpin-f738@gregkh/
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jori Koolstra <jkoolstra@xs4all.nl>
[Remove unused vfio_cdev_init() args]
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Link: https://lore.kernel.org/r/20260417152814.18026-1-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/virtio: Use guard() for bar_mutex in legacy I/O

Convert the bar_mutex acquisition in virtiovf_issue_legacy_rw_cmd()
to use guard(), eliminating the out label and goto-based error paths
in favor of direct returns.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260414200625.3601509-5-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/virtio: Use guard() for migf->lock where applicable

Convert migf->lock acquisitions in virtiovf_disable_fd() and
virtiovf_save_read() to use guard(). In virtiovf_save_read() this
eliminates the out_unlock label and multiple goto paths by allowing
direct returns, and removes the need for the done variable to double
as an error carrier.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260414200625.3601509-4-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/virtio: Use guard() for list_lock where applicable

Convert list_lock mutex acquisitions to use guard() and scoped_guard()
where the lock scope aligns with the function or block scope. This
simplifies virtiovf_get_data_buff_from_pos() by replacing goto-based
unwinding with direct returns.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260414200625.3601509-3-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/virtio: Convert list_lock from spinlock to mutex

The list_lock spinlock with IRQ disabling was copied from the mlx5
vfio-pci variant driver, where it is justified by a hardirq async
command completion callback that accesses the protected lists. The
virtio driver has no such interrupt context usage; all list_lock
acquisitions occur in process context via file read/write operations
or state transitions under state_mutex.

Convert list_lock to a mutex to be consistent with peer vfio-pci
variant drivers (hisilicon, pds, qat, xe) which all use mutexes for
equivalent migration data protection. This also fixes a mismatched
spin_lock()/spin_unlock_irq() pair in virtiovf_read_device_context_chunk()
that could incorrectly enable interrupts.

Reported-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Closes: https://lore.kernel.org/all/20260413073603.30538-1-guojinhui.liam@bytedance.com
Fixes: 0bbc82e4ec79 ("vfio/virtio: Add support for the basic live migration functionality")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20260414200625.3601509-2-alex.williamson@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

vfio/pci: Clean up DMABUFs before disabling function

On device shutdown, make vfio_pci_core_close_device() call
vfio_pci_dma_buf_cleanup() before the function is disabled via
vfio_pci_core_disable(). This ensures that all access via DMABUFs is
revoked before the function's BARs become inaccessible.

This fixes an issue where, if the function is disabled first, a tiny
window exists in which the function's MSE is cleared and yet BARs
could still be accessed via the DMABUF. The resources would also be
freed and up for grabs by a different driver.

Fixes: 5d74781ebc86c ("vfio/pci: Add dma-buf export support for MMIO regions")
Signed-off-by: Matt Evans <mattev@meta.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20260415181752.1027604-1-mattev@meta.com
Signed-off-by: Alex Williamson <alex@shazbot.org>

block: only restrict bio allocation gfp mask asked to block

If the caller is asking for a non-blocking allocation, we should not
further restrict the gfp mask, which just increases the likelihood
of failures.

Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20260415060813.807659-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

drm/xe/pat: Introduce xe_cache_pat_idx() macro helper

Wrap pat.idx[] reads with xe_cache_pat_idx() so invalid PAT index use
is caught by xe_assert() in debug builds.

Suggested-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Link: https://patch.msgid.link/20260416045526.536497-4-x.wang@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>

drm/xe/pat: Default XE_CACHE_NONE_COMPRESSION to invalid

Initialize XE_CACHE_NONE_COMPRESSION PAT index to XE_PAT_INVALID_IDX by
default, same as XE_CACHE_WB_COMPRESSION. Platforms that support this
cache mode will override it in xe_pat_init_early(). This ensures that
accidental use on unsupported platforms can be detected.

A subsequent patch introduces a helper to assert on invalid PAT index
access at all call sites.

Suggested-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Xin Wang <x.wang@intel.com>
Link: https://patch.msgid.link/20260416045526.536497-3-x.wang@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>

drm/xe: Standardize pat_index to u16 type

Ensure all pat_index definitions consistently use u16 type across
the XE driver. This addresses two remaining instances where pat_index
was incorrectly typed:

- xe_vm_snapshot structure used int for pat_index field
- xe_device pat.idx array used u32 instead of u16

This cleanup improves type consistency and ensures proper alignment
with the PAT subsystem design.

Signed-off-by: Xin Wang <x.wang@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patch.msgid.link/20260416045526.536497-2-x.wang@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>

drm/xe/xelp: Fix Wa_18022495364

Command parser relative MMIO addressing needs to be enabled when writing
to the register.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: ca33cd271ef9 ("drm/xe/xelp: Add Wa_18022495364")
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patch.msgid.link/20260420131603.70357-1-tvrtko.ursulin@igalia.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>

drm/xe/gsc: Fix BO leak on error in query_compatibility_version()

When xe_gsc_read_out_header() fails, query_compatibility_version()
returns directly instead of jumping to the out_bo label. This skips
the xe_bo_unpin_map_no_vm() call, leaving the BO pinned and mapped
with no remaining reference to free it.

Fix by using goto out_bo so the error path properly cleans up the BO,
consistent with the other error handling in the same function.

Fixes: 0881cbe04077 ("drm/xe/gsc: Query GSC compatibility version")
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patch.msgid.link/20260417163308.3416147-1-shuicheng.lin@intel.com
Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com>

Merge branch 'for-7.1-fixes' into for-7.2

Pull to receive:

05909810a946 ("tools/sched_ext: scx_qmap: Silence task_ctx lookup miss")

which conflicts with the cid-form qmap rework on for-7.2. Resolved
by applying the same silence-on-NULL semantics to the arena-backed
lookup_task_ctx() and qmap_select_cpu() on for-7.2.

Signed-off-by: Tejun Heo <tj@kernel.org>

ALSA: hda/realtek - Add mute LED support for HP Victus 15-fa2xxx

The mute LED on this laptop uses ALC245 but requires a quirk to work.
This patch enables the existing ALC245_FIXUP_HP_MUTE_LED_COEFBIT
quirk for the device.

Tested my Victus 15-fa2xxx (PCI SSID 103c:8dcd).
The LED behaviour works as intended.

Cc: stable@vger.kernel.org
Signed-off-by: Spencer Payton <spayton681@gmail.com>
Link: https://patch.msgid.link/20260421084918.14685-1-spayton681@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

tools/sched_ext: scx_qmap: Silence task_ctx lookup miss

scx_fork() dispatches ops.init_task to exactly one scheduler - the one
owning the forking task's cgroup. A task forked inside a sub-scheduler's
cgroup is init'd into the sub only; the root scheduler has no task_ctx
entry for it. When that task later appears as @prev in the root's
qmap_dispatch() (or flows through core-sched comparison via task_qdist),
the bpf_task_storage_get() legitimately misses.

qmap treated those misses as fatal via scx_bpf_error("task_ctx lookup
failed") and aborted the scheduler as soon as the first cross-sched
task hit the root. Drop the error in the sites where the miss is
legitimate: lookup_task_ctx() (helper; callers already check for NULL),
qmap_dispatch()'s @prev branch (bookkeeping-only), task_qdist()
(returns 0 which makes the comparison a no-op), and qmap_select_cpu()
(returns prev_cpu as a no-op fallback instead of -ESRCH). The existing
scx_error was a paranoid guard from the pre-sub-sched world where every
task was owned by the one and only scheduler.

v2: qmap_select_cpu() returns prev_cpu on NULL instead of -ESRCH, so
the root scheduler doesn't error on cross-sched tasks that pass
through it (Andrea Righi).

Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>

ALSA: pcmtest: Fix resource leaks in module init error paths

pcmtest allocates its pattern buffers and creates its debugfs tree
before registering the platform device and driver, but mod_init()
does not release those resources when a later init step fails.

As a result, a debugfs directory creation failure leaks the pattern
buffers, while platform_device_register() and
platform_driver_register() failures leave both the pattern buffers
and the debugfs tree behind. The recent fix for failed device
registration only dropped the embedded device reference.

Add the missing cleanup for the debugfs tree and pattern buffers in
the remaining module init error paths.

Fixes: 315a3d57c64c ("ALSA: Implement the new Virtual PCM Test Driver")
Cc: stable@vger.kernel.org
Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Link: https://patch.msgid.link/20260421-alsa-pcmtest-init-unwind-v1-1-03fe0c423dbb@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>