blk-mq: reinsert cached request to the list

A previous commit removed an optimization out of caution for a scenario
that turns out not to be real: all the "queue_exit" goto's are safe to
reinsert the request into the cached_rq's plug list as they are either
from a non-blocking path, or a successful merge that already holds the
queue reference. This optimization is most needed for small sequential
workloads that successfully merge into larger requests.

Fixes: dc278e9bf2b9 ("blk-mq: pop cached request if it is usable")
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://patch.msgid.link/20260526153531.2365935-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cxl/test: Update mock dev array before calling platform_device_add()

The CXL test environment sometimes hits the following error:

cxl_mem mem9: endpoint7 failed probe

All mock memdevs are platform firmware devices added by the cxl_test
module, and the cxl_test module also provides a platform device driver
for them that creates a memdev device in the CXL subsystem. The cxl_test
module uses the cxl_rcd/mem_single/mem arrays to store the different
types of mock memdevs. CXL drivers call the registered mock functions for
a mock memdev by checking whether a given memdev is in these arrays.

When the cxl_test module adds these mock memdevs, it always calls
platform_device_add() before adding them to a suitable mock memdev
array. However, there is a small window in which CXL drivers can call a
mock function for an added memdev before it has been added to a mock
memdev array. In that case, the cxl endpoint driver concludes that the
added memdev is not a mock memdev and calls
devm_cxl_endpoint_decoders_setup() for it rather than
mock_endpoint_decoders_setup().

An appropriate solution is to add a new mock device to a mock device
array before calling platform_device_add() for it. This guarantees that
the new mock device is visible to the CXL subsystem.

This patch introduces a new helper called cxl_mock_platform_device_add()
to handle the issue, and uses it for all mock device additions.
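
The ordering can be sketched in plain userspace C. This is only an
illustration under stated assumptions: struct mock_dev, is_mock_dev(),
fake_device_add() and mock_device_add() are hypothetical stand-ins, not
the cxl_test code, and fake_device_add() models "probe may run before the
add call returns" by doing the lookup synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MOCK_DEVS 4

/* Hypothetical stand-in for a mock memdev. */
struct mock_dev {
	int id;
	bool probed_as_mock;
};

/* Hypothetical stand-in for the mock memdev arrays. */
static struct mock_dev *mock_devs[MAX_MOCK_DEVS];

/* Stand-in for the lookup a driver performs while probing. */
static bool is_mock_dev(const struct mock_dev *dev)
{
	for (size_t i = 0; i < MAX_MOCK_DEVS; i++)
		if (mock_devs[i] == dev)
			return true;
	return false;
}

/* Stand-in for the add call: probing may happen before it returns. */
static int fake_device_add(struct mock_dev *dev)
{
	dev->probed_as_mock = is_mock_dev(dev);
	return 0;
}

/* Publish the device in the array first, then add it; undo on failure. */
static int mock_device_add(struct mock_dev **slot, struct mock_dev *dev)
{
	int rc;

	assert(slot && dev);
	*slot = dev;
	rc = fake_device_add(dev);
	if (rc)
		*slot = NULL;
	return rc;
}
```

With this ordering the lookup done during probe already finds the device,
whereas adding first and publishing afterwards leaves a window in which
the lookup fails.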

Fixes: 3a2b97b3210b ("cxl/test: Improve init-order fidelity relative to real-world systems")
Signed-off-by: Li Ming <ming.li@zohomail.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://patch.msgid.link/20260520121457.234404-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>

Merge tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
"Regressions:

   - Tighten bounds checking for sunrpc cache hash tables

   - Don't report key material in the ftrace log

  Stable fix:

   - Fix lockd's implementation of the NLM TEST procedure"

* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  lockd: fix TEST handling when not all permissions are available.
  NFSD: Report whether fh_key was actually updated
  sunrpc: prevent out-of-bounds read in __cache_seq_start()

module, riscv: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].

Force sh_addr=0 for all riscv-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

module, m68k: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].

Force sh_addr=0 for all m68k-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

module, arm64: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].

Force sh_addr=0 for all arm64-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

module, arm: force sh_addr=0 for arch-specific sections

When linking modules with 'ld.bfd -r', sections defined without an address
inherit the location counter, resulting in non-zero sh_addr values in the
resulting .ko files. Relocatable objects are expected to have sh_addr=0 for
all sections. Non-zero addresses are confusing in this context, typically
compress worse, and may cause tools to misbehave [1].

Force sh_addr=0 for all arm-specific module sections.

Link: https://sourceware.org/bugzilla/show_bug.cgi?id=33958
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>

Merge tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit fix from Shuah Khan:
"Fix a use-after-free in kunit debugfs when using kunit.filter when the
  executor frees dynamically allocated resources after running boot-time
  tests. This resulted in a fatal hardware exception due to invalidation
  of capability flags on the reclaimed memory on some architectures such
  as CHERI RISC-V that support the feature, and silent memory corruption
  on others.

  The fix for this couples the lifetime of the filtered suite memory
  allocation to the lifetime of the kunit subsystem and its associated
  VFS nodes. Ownership of the boot-time suite_set is now transferred to
  a global tracker ('kunit_boot_suites'), and the memory is cleanly
  released in kunit_exit() during module teardown"

* tag 'linux_kselftest-kunit-fixes-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: fix use-after-free in debugfs when using kunit.filter

x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest

Patch in Fixes: causes the usual:

  unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id)
  Call Trace:
   early_init_intel
   early_cpu_init
   setup_arch
   _printk
   start_kernel
   x86_64_start_reservations
   x86_64_start_kernel
   common_startup_64

because the kernel is booted in a guest.

In order to avoid it, this MSR access needs to be prevented when running
virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but
for this particular case it is too early yet.

The platform ID needs to be read as early as when microcode is loaded on
the BSP:

  load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature

and by that time, CPUID leafs haven't been parsed yet.

The microcode loader already has logic to check early whether the kernel
is running virtualized so make that globally available to arch/x86/. The
query whether running virtualized is getting more and more prominent in
recent times so might as well make it an arch-global var which the rest
of the code can use.

Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure")
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com

irqchip/gic-v4: Don't advertise VLPIs if no ITS is probed

When “kvm-arm.vgic_v4_enable=1” is accidentally set on a system that has
GICv4 but no MSI controller device tree node, the result is a panic, as
“gic_domain” is NULL and the kernel attempts to access it.

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
    Mem abort info:
      ESR = 0x0000000096000006

    CPU: 1 UID: 0 PID: 295 Comm: lkvm-static Not tainted 7.1.0-rc4-ge3f15ad3970e #5 PREEMPT
    Hardware name: linux,dummy-virt (DT)
    pstate: 81402005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    pc : __irq_domain_instantiate+0x1d4/0x578
    lr : __irq_domain_instantiate+0x1cc/0x578

Set vLPI support to false at init time if the host has no ITS, so it
propagates properly to kvm_vgic_global_state.has_gicv4.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260526125317.3672297-1-smostafa@google.com

drm/xe: Assign queue name in time for drm_sched_init

Currently the queue name is only assigned after the drm scheduler instance
has been created. This loses information with all logging or debug
workqueue facilities, so let's reorder things a bit so the name gets
assigned in time.

To be able to assign a GuC ID early we split the allocation into
reservation and publish phases.

First, with the submission state lock held, we reserve the ID in the GuC
ID manager, which serves as an authoritative source of truth. Then we can
drop the lock and reserve entries in the exec queue lookup XArray. This
can be lockless since the NULL entries are invisible both to the kernel
and userspace. Only after the queue has been fully created do we replace the
reserved entries with the queue pointer, which can be done locklessly for
single width queues.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/20260523103418.61832-1-tvrtko.ursulin@igalia.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

KVM: x86: Widen x86_exception's error_code to 64 bits

Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).

Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.
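
A minimal userspace sketch of why the narrow field is a problem. The two
mask values follow the bit positions given above; struct exception_narrow,
struct exception_wide and the store helpers are hypothetical names used
only for illustration, not the KVM definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as stated in the changelog. */
#define PFERR_GUEST_FINAL_MASK (1ULL << 32)
#define PFERR_GUEST_PAGE_MASK  (1ULL << 33)

/* Hypothetical stand-ins for the field before and after widening. */
struct exception_narrow {
	uint16_t error_code;
};

struct exception_wide {
	uint64_t error_code;
};

static_assert(sizeof(((struct exception_wide *)0)->error_code) == 8,
	      "error_code must be able to hold bits above 31");

/* Storing into the 16-bit field keeps only bits 15:0. */
static uint64_t store_narrow(uint64_t code)
{
	struct exception_narrow e = { .error_code = (uint16_t)code };

	return e.error_code;
}

/* Storing into the 64-bit field preserves every bit. */
static uint64_t store_wide(uint64_t code)
{
	struct exception_wide e = { .error_code = code };

	return e.error_code;
}
```

Passing a code with bit 32 set through store_narrow() returns only the
low 16 bits, while store_wide() returns the value unchanged.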

Signed-off-by: Kevin Cheng <chengkev@google.com>
Link: https://patch.msgid.link/20260522232701.3671446-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET

Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260523111857.195396-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

call_once: Fix typo in comment for call_once()

Change "succesfully" to "successfully" in the kerneldoc
comment of call_once().

Signed-off-by: Jiun Jeong <jiun.jeong.cs@gmail.com>
Link: https://patch.msgid.link/20260501144413.49419-1-jiun.jeong.cs@gmail.com
[sean: don't scope to KVM, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap

In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).

Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.

Reported-by: Wu Fei <wu.fei9@sanechips.com.cn>
Closes: https://lore.kernel.org/all/202605111130.64BBUXDN013040@mse-fl2.zte.com.cn
Cc: Wu Fei <atwufei@163.com>
Link: https://patch.msgid.link/20260522170230.3518669-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs

Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.
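
The close-and-invalidate idea can be sketched as below. This is a
hypothetical macro written for illustration; it is named free_fd() to
avoid implying it matches the real kvm_free_fd(), and skipping close()
for already-invalid descriptors is an assumption of this sketch.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical macro: close the descriptor if it is open, then poison the
 * variable with -1 so that any later use fails with EBADF instead of
 * silently operating on a descriptor number that has been recycled.
 */
#define free_fd(fd)					\
do {							\
	int *fdp_ = &(fd);				\
	if (*fdp_ >= 0) {				\
		int ret_ = close(*fdp_);		\
		assert(ret_ == 0);			\
		(void)ret_;				\
	}						\
	*fdp_ = -1;					\
} while (0)
```

Because the macro takes the variable itself rather than its value, the
caller's copy is what gets invalidated, and calling it twice on the same
variable is harmless.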

Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Fuad Tabba <tabba@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0

When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed int so that negative values are correctly
detected. Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.
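
A small sketch of the pitfall, assuming a hypothetical struct region that
stores the descriptor as an unsigned 32-bit value the way the uAPI struct
does. An unsigned field is never negative, so a "-1" sentinel is only
visible after casting to a signed type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a struct that stores the fd as a u32. */
struct region {
	uint32_t guest_memfd;
};

/*
 * Compare as a signed value: a stored -1 reads back as UINT32_MAX, which
 * would pass a plain ">= 0" check on the unsigned field.
 */
static bool guest_memfd_is_open(const struct region *r)
{
	assert(r);
	return (int32_t)r->guest_memfd >= 0;
}
```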

Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Remove unnecessary "%s" formatting of a constant string

Drop superfluous %s formatting from assertions in the guest_memfd overlap
testcases, as the string being printed doesn't require runtime formatting.

No functional change intended.

Reported-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522172151.3530267-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Test guest_memfd binding overlap without GPA overlap

The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot. KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.

Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.

Fixes: 2feabb855df8 ("KVM: selftests: Expand set_memory_region_test to validate guest_memfd()")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: keep the existing GPA overlap testcases]
Link: https://patch.msgid.link/20260522172151.3530267-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: guest_memfd: Return -EEXIST for overlapping bindings

KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path. That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.

Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.

Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST. Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.
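
The errno split can be sketched as follows. struct binding and
check_bind() are hypothetical and only model the distinction described
above: malformed parameters yield -EINVAL, a conflict with an existing
binding yields -EEXIST.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an existing binding: [start, start + npages). */
struct binding {
	uint64_t start;
	uint64_t npages;
};

/* Return 0 if the new range may be bound, or a negative errno. */
static int check_bind(const struct binding *existing, size_t n,
		      uint64_t start, uint64_t npages)
{
	assert(existing || n == 0);

	/* Malformed request: empty range or wrap-around. */
	if (npages == 0 || start + npages < start)
		return -EINVAL;

	/* Well-formed request that collides with an existing binding. */
	for (size_t i = 0; i < n; i++) {
		const struct binding *b = &existing[i];

		if (start < b->start + b->npages && b->start < start + npages)
			return -EEXIST;
	}
	return 0;
}
```

Userspace can then tell "my parameters are wrong" apart from "this part
of the file is already in use".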

Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST]
Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

selftests/nolibc: test against -Wwrite-strings

Users may use this warning when building their own applications.
Make sure that nolibc does not trigger any such warnings.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-3-ab5cc16c7b23@weissschuh.net

selftests/nolibc: use mutable buffer for execve() argv string

The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.
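
A minimal sketch of the mutable-buffer approach. The path "/bin/true" and
the names arg0_buf and make_argv() are hypothetical; the point is only
that a char array initialised from a literal is writable storage, so it
can be placed in a 'char *' argv slot without a cast and without a
-Wwrite-strings warning.

```c
#include <assert.h>
#include <stddef.h>

/* A char array, not a pointer to a string literal: writable storage. */
static char arg0_buf[] = "/bin/true";

/* Build a NULL-terminated argv whose elements are plain 'char *'. */
static char *const *make_argv(void)
{
	static char *argv[] = { arg0_buf, NULL };

	assert(argv[0] != NULL);
	return argv;
}
```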

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-2-ab5cc16c7b23@weissschuh.net

tools/nolibc: cast default values of program_invocation_name

With -Wwrite-strings the plain assignment triggers a warning as a
'const char *' is assigned to a 'char *', removing the const qualifier.

Casting the const away is fine, as there is no valid modification that
can be done to an empty string anyways.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-1-ab5cc16c7b23@weissschuh.net

spi: dt-bindings: spi-qpic-snand: Add ipq5210 compatible

Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible with
ipq9574 as the fallback.

Signed-off-by: Varadarajan Narayanan <varadarajan.narayanan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260514-ipq5210-nand-v1-1-cbdd7492e826@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>

thermal: intel: int340x: Check return value of ptc_create_groups()

proc_thermal_ptc_add() ignores the return value of ptc_create_groups(),
causing the driver to silently continue even if sysfs group creation
fails.

The thermal control interface would be unavailable with no indication
of failure.

Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.
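
The check-and-unwind pattern looks roughly like this. It is a generic
sketch, not the int340x code: NR_GROUPS, live_groups[], create_group(),
remove_group() and add_all_groups() are hypothetical, and fail_at exists
only to inject a failure.

```c
#include <assert.h>
#include <errno.h>

#define NR_GROUPS 3

/* Hypothetical bookkeeping: 1 while a group exists, 0 otherwise. */
static int live_groups[NR_GROUPS];

/* Create group i, failing with -ENOMEM when i == fail_at. */
static int create_group(int i, int fail_at)
{
	assert(i >= 0 && i < NR_GROUPS);
	if (i == fail_at)
		return -ENOMEM;
	live_groups[i] = 1;
	return 0;
}

static void remove_group(int i)
{
	live_groups[i] = 0;
}

/* Create all groups; on error remove the ones already created. */
static int add_all_groups(int fail_at)
{
	int ret = 0;
	int i;

	for (i = 0; i < NR_GROUPS; i++) {
		ret = create_group(i, fail_at);
		if (ret)
			goto err_remove;
	}
	return 0;

err_remove:
	while (--i >= 0)
		remove_group(i);
	return ret;
}
```

The caller sees the original error code and no partially created state
is left behind.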

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-3-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()

The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift amount pushes bits beyond 32, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.
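
The effect is easy to reproduce in userspace. The helper names below are
hypothetical; the first performs the shift in 32 bits and only then
widens the result, the second widens first.

```c
#include <assert.h>
#include <stdint.h>

/* Shift happens in 32 bits: bits pushed past bit 31 are discarded. */
static uint64_t place_field_narrow(uint32_t value, unsigned int shift)
{
	assert(shift < 32);
	return value << shift;
}

/* Widen first, then shift: all bits of value survive. */
static uint64_t place_field_wide(uint32_t value, unsigned int shift)
{
	assert(shift < 64);
	return (uint64_t)value << shift;
}
```

For value 0xFF and shift 28, the narrow variant yields 0xF0000000 while
the wide variant yields 0xFF0000000.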

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-2-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load

When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.

The race can be illustrated with two concurrent paths:

Path A (sysfs write, holds attr_set->update_lock):

governor_store()
  mutex_lock(&attr_set->update_lock)
  ignore_nice_load_store()
    dbs_data->ignore_nice_load = 1              /* (A1) */
    gov_update_cpu_data(dbs_data)
      mutex_lock(&policy_dbs->update_mutex)     /* (A2) */
        j_cdbs->prev_cpu_nice = kcpustat_field(...)
      mutex_unlock(&policy_dbs->update_mutex)
  mutex_unlock(&attr_set->update_lock)

Path B (work queue, wins the race between A1 and A2):

dbs_work_handler()
  mutex_lock(&policy_dbs->update_mutex)         /* acquired before A2 */
  dbs_update()
    ignore_nice = dbs_data->ignore_nice_load    /* sees new value: 1 */
    cur_nice = kcpustat_field(...)
    idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
    j_cdbs->prev_cpu_nice = cur_nice
  mutex_unlock(&policy_dbs->update_mutex)

Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.

To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:

  - gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
    prev_cpu_idle, so both baselines share the same timestamp when
    io_is_busy changes.  This prevents an interval mismatch between
    idle_time and nice_delta on the next dbs_update() when
    ignore_nice_load is enabled.
  - cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
    so the baseline is always valid from the first dbs_update() call;
    remove the ignore_nice guard and the now-unused ignore_nice variable.
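
The difference between the two schemes can be modelled in a few lines.
This is a single-threaded sketch with hypothetical names (struct
cpu_sample and the two nice_delta_*() helpers); it ignores locking and
only shows how the baseline behaves when the tunable flips from 0 to 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state: last sampled cumulative nice time. */
struct cpu_sample {
	uint64_t prev_nice;
};

/* Old scheme: the baseline only advances while ignore_nice is set. */
static uint64_t nice_delta_conditional(struct cpu_sample *s,
				       uint64_t cur_nice, bool ignore_nice)
{
	uint64_t delta = 0;

	if (ignore_nice) {
		assert(cur_nice >= s->prev_nice);
		delta = cur_nice - s->prev_nice;
		s->prev_nice = cur_nice;
	}
	return delta;
}

/* Always-tracking scheme: advance the baseline on every sample. */
static uint64_t nice_delta_tracking(struct cpu_sample *s,
				    uint64_t cur_nice, bool ignore_nice)
{
	uint64_t delta;

	assert(cur_nice >= s->prev_nice);
	delta = cur_nice - s->prev_nice;
	s->prev_nice = cur_nice;
	return ignore_nice ? delta : 0;
}
```

With the conditional scheme, the first sample taken after the tunable is
enabled reports the nice time accumulated since the baseline was last
written; with the tracking scheme it reports only the last interval.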

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix data races on per-CPU idle/nice baselines

gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex.  A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex.  cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable.  Fix this by reading io_busy
inside the mutex.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

block: remove blkdev_write_begin() and blkdev_write_end()

Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: fix use-after-free on service thread failure

If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.

The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.

Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.

Fixes: e8b58ef09e84 ("mtip32xx: fix device removal")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260525162531.1406677-1-dbgh9129@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't set BIO_QUIET for BLK_STS_AGAIN

Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when that commit was added nor now does any code using buffer_heads
ever set REQ_NOWAIT, which is what can lead to BLK_STS_AGAIN.

Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

direct-io: remove IOCB_NOWAIT support

None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Avoid mounting the bdev pseudo-filesystem in userspace

The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.

This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0, since `inode->i_op = &empty_iops;`.

[1]

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 23380067 P4D 23380067 PUD 23381067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN NOPTI
CPU: 2 PID: 17125 Comm: syz-executor.0 Not tainted 6.1.155-syzkaller-00350-g84221fde2681 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:0x0

Call Trace:
<TASK>
lookup_open.isra.0+0x700/0x1180 fs/namei.c:3460
open_last_lookups fs/namei.c:3550 [inline]
path_openat+0x953/0x2700 fs/namei.c:3780
do_filp_open+0x1c5/0x410 fs/namei.c:3810
do_sys_openat2+0x171/0x4d0 fs/open.c:1318
do_sys_open fs/open.c:1334 [inline]
__do_sys_openat fs/open.c:1350 [inline]
__se_sys_openat fs/open.c:1345 [inline]
__x64_sys_openat+0x13c/0x1f0 fs/open.c:1345
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/all/20131010004732.GJ13318@ZenIV.linux.org.uk/T/#
Cc: stable@vger.kernel.org
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260521072857.5078-1-arefev@swemel.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: switch numa_node to int in blk_mq_hw_ctx and init_request

numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].

Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.

Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: skip sync_blockdev() on surprise removal in bdev_mark_dead()

bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.

syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.

Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.
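
The intended control flow can be modelled in userspace roughly as
follows. This is a sketch, not the kernel function: struct mock_bdev
and mock_mark_dead() are hypothetical, and the two counters merely
stand in for sync_blockdev() and invalidate_bdev().

```c
#include <stdbool.h>

/* Hypothetical stand-in for a block device; counts what was called. */
struct mock_bdev {
	int synced;		/* times the sync step ran */
	int invalidated;	/* times the invalidate step ran */
};

static void mock_mark_dead(struct mock_bdev *bdev, bool surprise)
{
	/* Writeback cannot complete on a device that is already gone. */
	if (!surprise)
		bdev->synced++;		/* stands in for sync_blockdev() */

	bdev->invalidated++;		/* stands in for invalidate_bdev() */
}
```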

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead")
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: add tracepoint block_rq_tag_wait

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260525005123.722277-1-atomlin@atomlin.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

KVM: SEV: Mark source page dirty when writing back CPUID data on failure

When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before it's read by
userspace.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-5-3f196bfad5a1@google.com
[sean: use set_page_dirty(), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements

Per highmem.h, local kernel mappings must be unmapped in the reverse order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".

Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.

Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.
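
The LIFO rule can be pictured with a small userspace model. Everything
here is hypothetical (struct map_stack, model_map(), model_unmap()); it
does not use the kernel's kmap_local_page()/kunmap_local() API and only
mimics the stack discipline described above.

```c
#include <stddef.h>

#define MAX_LOCAL_MAPS 8	/* hypothetical depth of the mapping stack */

struct map_stack {
	const void *slot[MAX_LOCAL_MAPS];
	size_t depth;
};

/* Push a mapping; returns the "mapped address" or NULL if the stack is full. */
static const void *model_map(struct map_stack *s, const void *page)
{
	if (s->depth == MAX_LOCAL_MAPS)
		return NULL;
	s->slot[s->depth++] = page;
	return page;
}

/* Pop a mapping; fails with -1 unless addr is the most recent mapping. */
static int model_unmap(struct map_stack *s, const void *addr)
{
	if (s->depth == 0 || s->slot[s->depth - 1] != addr)
		return -1;
	s->depth--;
	return 0;
}
```

Mapping a source page and then a destination page therefore requires
unmapping the destination first, which is the order the patch restores.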

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-4-3f196bfad5a1@google.com
[sean: call out that the bug is benign]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Pin source page for write when adding CPUID data for SNP guest

When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.

Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.

Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Cc: stable@vger.kernel.org
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>

ASoC: SOF: ipc4-topology: Support for multiple src output formats

Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:

SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.

In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.

Link: https://patch.msgid.link/20260526105748.26149-1-peter.ujfalusi@linux.intel.com

ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output

The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.

Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc

SRC and ASRC modules must have at least one input and one output format
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats is 0.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-2-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

io_uring/zcrx: add shared-memory notification statistics

Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.

Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.

The kernel updates the stats counters in-place on every copy-fallback
event.

Signed-off-by: Clément Léger <cleger@meta.com>
[pavel: rename io_uring_zcrx_notif_stats]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user on frag copy fallback

Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.

Signed-off-by: Clément Léger <cleger@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user when out of buffers

There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
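
The post-then-re-arm behaviour described above can be modelled as a
small state machine. This is a userspace sketch under assumptions: the
MOCK_* bit values, struct mock_notif_state and all three helpers are
hypothetical, are not the uapi, and assume notifications start armed.

```c
#include <stdint.h>

#define MOCK_NOTIF_NO_BUFFERS	(1u << 0)	/* hypothetical bit value */
#define MOCK_NOTIF_OTHER	(1u << 1)	/* hypothetical future type */

struct mock_notif_state {
	uint32_t allowed;	/* event types requested at registration */
	uint32_t armed;		/* types that may currently post a CQE */
	unsigned int posted;	/* number of CQEs the model has "posted" */
};

static void mock_notif_init(struct mock_notif_state *s, uint32_t allowed)
{
	s->allowed = allowed;
	s->armed = allowed;	/* assumption: armed from the start */
	s->posted = 0;
}

/* Returns 1 if a notification is posted, 0 if it is suppressed. */
static int mock_notif_event(struct mock_notif_state *s, uint32_t type)
{
	if (!(s->allowed & type) || !(s->armed & type))
		return 0;
	s->armed &= ~type;	/* stays quiet until the user re-arms */
	s->posted++;
	return 1;
}

/* Models the acknowledgement step that re-arms a notification type. */
static void mock_notif_arm(struct mock_notif_state *s, uint32_t type)
{
	s->armed |= s->allowed & type;
}
```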

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: add ctx pointer to zcrx

zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: reorder fd allocation in zcrx_export()

Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.

While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.

Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: remove extra ifq close

By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: poison pointers on unregistration

Nobody should be touching area and other pointers after zcrx
destruction, poison them instead of zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: make scrubbing more reliable

Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

wifi: ath12k: fix error unwind on arch_init() failure in PCI probe

When arch_init() fails in ath12k_pci_probe(), the code jumps to
err_pci_msi_free, leaking resources in teardown.

Redirect the failure path to err_free_irq so teardown matches the setup order.

Compile-tested only.

Fixes: 614c23e24ee8 ("wifi: ath12k: Support arch-specific DP device allocation")
Signed-off-by: Ripan Deuri <ripan.deuri@oss.qualcomm.com>
Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Link: https://patch.msgid.link/20260519192815.3911324-1-ripan.deuri@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>

block: partitions: fix of_node refcount leak in of_partition()

of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:

1. The compatibility check at the top of the function returns 0
   without releasing partitions_np when the node exists but is not
   "fixed-partitions" compatible.

2. The function returns 1 at the end after successfully processing
   all partitions without releasing partitions_np.

Fix both leaks by adding of_node_put(partitions_np) on each path.
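
The balanced get/put pattern can be sketched in userspace as below.
The mock_* names and the 'compatible' flag are hypothetical stand-ins
for of_node_get(), of_node_put() and the "fixed-partitions" check; the
real function does considerably more work between the two exits.

```c
#include <stddef.h>

struct mock_node {
	int refcount;
	int compatible;		/* 1 if "fixed-partitions" compatible */
};

static struct mock_node *mock_node_get(struct mock_node *np)
{
	if (np)
		np->refcount++;
	return np;
}

static void mock_node_put(struct mock_node *np)
{
	if (np)
		np->refcount--;
}

static int mock_of_partition(struct mock_node *parent)
{
	struct mock_node *partitions_np = mock_node_get(parent);

	if (!partitions_np)
		return 0;

	if (!partitions_np->compatible) {
		mock_node_put(partitions_np);	/* leak path 1 */
		return 0;
	}

	/* ... partitions would be parsed here ... */

	mock_node_put(partitions_np);		/* leak path 2 */
	return 1;
}
```

After either exit the reference count is back to its value on entry,
which is the invariant the fix establishes.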

Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ASoC: cs35l56-shared-test: Fix possible null pointer dereference

The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.

Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)

Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260523211522.522616-1-ethantidmore06@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
  post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off

Merge branch 'ethtool-module-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: module: fix a handful of small bugs

I've been poking at the locking in ethtool and it appears
that the FW flashing is not currently taking the ops lock.
Existing drivers which implement module FW flashing seem
to have their own locking, so this series doesn't actually
add the ops lock (I'll add it in net-next). But a number
of other errors have been surfaced in the process.
====================

Link: https://patch.msgid.link/20260522231312.1710836-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate fw->size against start_cmd_payload_size

cmis_fw_update_start_download() copies start_cmd_payload_size bytes
from the firmware blob into the CDB LPL vendor_data[] payload without
validating that the FW has enough data.

Since the start_cmd_payload_size can only be ~120B an image too short
is most likely corrupted, so reject it.
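
A minimal userspace sketch of the added check, assuming the caller has
already bounded start_cmd_payload_size to the destination buffer.
mock_start_download() and its parameters are hypothetical, and -EINVAL
is an assumed error code rather than one taken from the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

static int mock_start_download(unsigned char *vendor_data,
			       const unsigned char *fw_data, size_t fw_size,
			       size_t start_cmd_payload_size)
{
	/* An image shorter than the start payload cannot be copied safely. */
	if (fw_size < start_cmd_payload_size)
		return -EINVAL;

	memcpy(vendor_data, fw_data, start_cmd_payload_size);
	return 0;
}
```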

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate start_cmd_payload_size from module

The CMIS firmware update code reads start_cmd_payload_size from
the module's FW Management Features CDB reply and uses it directly
as the byte count for memcpy. The destination buffer is 112 bytes
(ETHTOOL_CMIS_CDB_LPL_MAX_PL_LENGTH - 8). So a malicious
module (or corrupted response) can cause an OOB write later on in
cmis_fw_update_start_download().

Let's error out. If modules that expect longer LPL writes actually
exist we should revisit.

struct cmis_cdb_start_fw_download_pl's definition has to move,
no change there.
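
The bound can be written out as a short sketch. The MOCK_* names and
mock_check_start_payload_size() are hypothetical, the 120-byte LPL
payload length is inferred from the 112-byte figure above, and -EINVAL
is an assumed error code.

```c
#include <errno.h>

#define MOCK_LPL_MAX_PL_LENGTH	120	/* assumed LPL payload length */
#define MOCK_START_HDR_LEN	8	/* bytes ahead of vendor_data[] */
#define MOCK_VENDOR_DATA_MAX	(MOCK_LPL_MAX_PL_LENGTH - MOCK_START_HDR_LEN)

/* Reject a module-reported size that would overrun vendor_data[]. */
static int mock_check_start_payload_size(unsigned int size)
{
	if (size > MOCK_VENDOR_DATA_MAX)
		return -EINVAL;
	return 0;
}
```

Validating the size once, when the module reports it, means the later
memcpy in the download path can rely on the value.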

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: fix u16-to-u8 truncation of msleep_pre_rpl

ethtool_cmis_cdb_compose_args() accepts msleep_pre_rpl as u16 but stores
it into the u8 field ethtool_cmis_cdb_cmd_args::msleep_pre_rpl, silently
truncating values >= 256. Seven of the nine call sites pass 1000 ms
(it's the third argument from the end).
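
The truncation is easy to reproduce in userspace. The struct and
function names below are hypothetical, and widening the field to u16
is an assumption about the shape of the fix rather than a quote from
the patch.

```c
#include <stdint.h>

struct mock_args_u8 {
	uint8_t msleep_pre_rpl;		/* old field width */
};

struct mock_args_u16 {
	uint16_t msleep_pre_rpl;	/* assumed widened field */
};

/* Old behaviour: the u16 argument is narrowed when stored. */
static void mock_compose_u8(struct mock_args_u8 *args, uint16_t msleep_pre_rpl)
{
	args->msleep_pre_rpl = (uint8_t)msleep_pre_rpl;
}

/* With a u16 field the value is stored unchanged. */
static void mock_compose_u16(struct mock_args_u16 *args, uint16_t msleep_pre_rpl)
{
	args->msleep_pre_rpl = msleep_pre_rpl;
}
```

For the 1000 ms used by most call sites, the u8 field keeps only the
low eight bits, i.e. 1000 mod 256 = 232 ms.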

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: require exact CDB reply length

A malicious SFP module could respond with rpl_len longer than
what cmis_cdb_process_reply() expected, leading to OOB writes.
Malicious HW is a bit theoretical but some modules may just
be buggy and/or the reads may occasionally get corrupted,
so let's protect the kernel.

The existing check protects from short replies. We need to
protect from long ones, too. All callers that pass a non-zero
rpl_exp_len cast the reply payload to a fixed-layout struct
and read fields at fixed offsets, with no version negotiation
or short-reply handling:

  - cmis_cdb_validate_password()
  - cmis_cdb_module_features_get()
  - cmis_fw_update_fw_mng_features_get()

so let's assume that responses longer than expected do not
have to be handled gracefully here. Add a warning message
to make the debug easier in case my understanding is wrong...

Note that page_data->length (argument of kmalloc) comes from
last arg to ethtool_cmis_page_init() which is rpl_exp_len.

Note2 that AIs also like to point out overflows in args->req.payload
itself (which is a fixed-size 120 B buffer, on the stack),
but callers should be reading structs defined by the standard,
so protecting from requests for more data than max seems like
defensive programming.
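
The difference between the old and new length checks, for callers that
pass a non-zero rpl_exp_len, can be sketched as follows. Both helpers
are hypothetical simplifications and ignore the reply header and the
other validity checks the real code performs.

```c
/* Old rule (simplified): only replies that are too short are rejected. */
static int mock_reply_len_ok_old(unsigned int rpl_len, unsigned int rpl_exp_len)
{
	return rpl_len >= rpl_exp_len;
}

/* New rule (simplified): the reply must match the expected length exactly. */
static int mock_reply_len_ok_new(unsigned int rpl_len, unsigned int rpl_exp_len)
{
	return rpl_len == rpl_exp_len;
}
```

A reply of 130 bytes against an expected 120 passes the old rule and
is rejected by the new one, which is what keeps the copy within the
buffer sized from rpl_exp_len.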

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: fix cleanup if socket used for flashing multiple devices

When a single Netlink socket issues MODULE_FW_FLASH_ACT against multiple
devices, ethnl_sock_priv_set() overwrites sk_priv->dev on each call,
retaining only the last one. The socket priv is used on socket close,
to walk the global work list and mark the uncompleted flashing work
as "orphaned". Otherwise if another socket reuses the PID it will
unexpectedly receive the flashing notifications.

Don't record the device, record net pointer instead. The purpose of
the dev is to scope the work to a netns, anyway. If we store netns
the overrides are safe/a nop since all flashed devices must be in
the same netns as the socket.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: check fw_flash_in_progress under rtnl_lock

ethnl_set_module_validate() inspects module_fw_flash_in_progress
but validate is meant for _input_ validation, not state validation.
rtnl_lock is not held, yet. Move the check into ethnl_set_module().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid racy updates to dev->ethtool bitfield

When reviewing other changes Gemini points out that we currently
update module_fw_flash_in_progress without holding any locks.
Since module_fw_flash_in_progress is part of a bitfield this
is not great, updates to other fields may be lost.

We could use a bool and sprinkle some READ_ONCE/WRITE_ONCE here
but it seems the issue is rather that the work is an unusual
writer. The other writers already hold the right locks. So just
very briefly take these locks when the work completes.

Note that nothing ever cancels the FW update work, so there's
no concern with deadlocks vs cancel.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid leaking a netdev ref on module flash errors

module_flash_fw_schedule() is missing undo for setting
the "in_progress" flag and taking the netdev reference.
Delay taking these, the device can't disappear while
we are holding rtnl_lock.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: call ethnl_ops_complete() on module flash errors

When validate() fails we are skipping over ethnl_ops_complete()
even though we already called ethnl_ops_begin().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ethtool-rss-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: rss: fix a handful of small bugs

Fix a handful of small bugs in the ethtool Netlink RSS code.
====================

Link: https://patch.msgid.link/20260522230647.1705600-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid device context leak on reply-build failure

We wait with filling the reply for new RSS context creation
until after the driver ->create_rxfh_context call. The driver
needs to fill some of the defaults in the context. The failure
of rss_fill_reply() is somewhat theoretical, but doesn't take
much effort to handle it properly. Call ->remove_rxfh_context().

If the driver's remove callback fails (some implementations like sfc
can return real command errors from firmware RPCs) - skip the xa_erase
and kfree, leaving the context in the xarray. This matches how
ethnl_rss_delete_doit() behaves.

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix hkey leak when indir_size is 0

rss_get_data_alloc() allocates a single buffer that backs both the
indirection table and the hash key, but only assigned data->indir_table
when indir_size was nonzero. The expectation was that no driver
implements RSS without supporting indirection table but apparently
enic does just that (it's the only such in-tree driver).
enic has get_rxfh_key_size but no get_rxfh_indir_size.
data->indir_table stays as NULL, hkey gets set, but the
kfree(data->indir_table) in rss_get_data_free() is a nop and the
allocation leaks.

Always store the allocation base in data->indir_table so the free path
is unambiguous. No caller treats indir_table as a sentinel; everything
keys off indir_size.
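
A userspace model of the fixed layout, using malloc/free in place of
the kernel allocator. struct mock_rss_data and both helpers are
hypothetical; only the idea of storing the allocation base in
indir_table regardless of indir_size is taken from the text above.

```c
#include <stdint.h>
#include <stdlib.h>

struct mock_rss_data {
	uint32_t *indir_table;	/* always the allocation base */
	uint8_t *hkey;		/* points past the indirection table */
	size_t indir_size;	/* number of u32 entries */
	size_t hkey_size;	/* key length in bytes */
};

static int mock_rss_alloc(struct mock_rss_data *data)
{
	size_t indir_bytes = data->indir_size * sizeof(uint32_t);
	size_t total = indir_bytes + data->hkey_size;
	void *base;

	if (!total)
		return 0;

	base = calloc(1, total);
	if (!base)
		return -1;

	/* Store the base even when indir_size is 0, so free() finds it. */
	data->indir_table = base;
	if (data->hkey_size)
		data->hkey = (uint8_t *)base + indir_bytes;
	return 0;
}

static void mock_rss_free(struct mock_rss_data *data)
{
	free(data->indir_table);
	data->indir_table = NULL;
	data->hkey = NULL;
}
```

With indir_size == 0 the key starts at the allocation base, so both
pointers alias the same buffer and freeing indir_table releases it.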

Fixes: 7112a04664bf ("ethtool: add netlink based get rss support")
Link: https://patch.msgid.link/20260522230647.1705600-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix indir_table and hkey leak on get_rxfh failure

rss_prepare_get() allocates the indirection table and hash key buffer
via rss_get_data_alloc(), then calls ops->get_rxfh() to populate them.
If get_rxfh() fails, the function returns an error without freeing
the allocation.

Fixes: 4f038a6a02d2 ("net: ethtool: Don't call .cleanup_data when prepare_data fails")
Link: https://patch.msgid.link/20260522230647.1705600-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix falsely ignoring indir table updates

rss_set_prep_indir() compares the new indirection table against the
current one to determine whether any update is needed. The memcmp
call passes data->indir_size as the length argument, but indir_size
is the number of u32 entries, not the byte count.
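
The effect is easy to show in userspace. Both helpers below are
hypothetical stand-ins for the comparison in rss_set_prep_indir();
they differ only in the length passed to memcmp().

```c
#include <stdint.h>
#include <string.h>

/* Buggy: treats the entry count as a byte count. */
static int mock_indir_changed_buggy(const uint32_t *cur, const uint32_t *upd,
				    size_t indir_size)
{
	return memcmp(cur, upd, indir_size) != 0;
}

/* Fixed: scales the entry count by the entry size. */
static int mock_indir_changed_fixed(const uint32_t *cur, const uint32_t *upd,
				    size_t indir_size)
{
	return memcmp(cur, upd, indir_size * sizeof(*cur)) != 0;
}
```

With four entries the buggy version compares only the first four
bytes, i.e. the first entry, so a change confined to later entries is
reported as "no update needed".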

Fixes: c0ae03588bbb ("ethtool: rss: initial RSS_SET (indirection table handling)")
Link: https://patch.msgid.link/20260522230647.1705600-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: add missing errno on RSS context delete

Remember to set ret before jumping out if someone tries
to delete a context on a device which doesn't support
contexts.

Fixes: fbe09277fa63 ("ethtool: rss: support removing contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid modifying the RSS context response

Gemini says that we're modifying the RSS_CREATE response skb.
I think it's right, the comment says that unicast() should
unshare the skb but I'm not entirely sure what I meant there.
netlink_trim() does a copy but only if skb is not well sized
(it's at least 2x larger than necessary for the payload).

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host

A previous bug in __pkvm_init_vm error path showed that the hypervisor
could leak refcounted pages (i.e. losing access to a page while its
refcount is still elevated). This poses a threat to the pKVM state
machine.

Address this by introducing a fail-safe in __pkvm_hyp_donate_host.
Transitions are not a hot path so added security is worth the extra
check.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Fix __pkvm_init_vm error path

In the unlikely case where insert_vm_table_entry fails, __pkvm_init_vm
releases the memory donated by the host for the PGD, but as the stage-2
is still set-up the hypervisor keeps a refcount on those pages,
effectively leaking the references.

Fix the rollback with the newly added kvm_guest_destroy_stage2().

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Reset page order in pKVM hyp_pool

When a VM fails to initialise after its stage-2 hyp_pool has been
initialised, that stage-2 must be torn down entirely. This requires
resetting both the refcount and the order of its pages back to 0.

Currently, reclaim_pgtable_pages() implicitly resets the page order by
allocating the entire pool with order-0 granularity. However, in the VM
initialisation error path, the addresses of the donated memory (the PGD)
are already known, making it unnecessary to iterate over all pages in
the pool.

Since the vmemmap page order is a hyp_pool-specific field, leaving a
non-zero order on hyp_pool destruction is harmless until another pool
attempts to admit the page. Instead of resetting this field during
destruction, reset it during pool initialization in hyp_pool_init().

For 'external' pages, we can't trust the order either as they bypass
hyp_pool_init(). Since we never coalesce them, enforce order-0 to ensure
safe insertion into the pool.

This leaves no vmemmap order users outside of hyp_pool.

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

cpufreq/amd-pstate: drop stale @epp_cached kdoc

Commit 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP") removed
the epp_cached field from struct amd_cpudata in favour of always
reading from cppc_req_cached, but the kdoc above the struct still
documents @epp_cached.

Drop the now-stale @epp_cached entry.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Fixes: 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Link: https://lore.kernel.org/r/20260526022131.1302373-1-zhanxusheng@xiaomi.com
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

arm64: tegra: Enable PCIe for Jetson AGX Thor

Enable the PCIe controllers on the Jetson AGX Thor Developer Kit that
are used for ethernet and NVMe.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

arm64: tegra: Fix address of Tegra264 main GPIO controller

The 64-bit address of the main GPIO controller on Tegra264 is
0x810c300000. The main GPIO controller was incorrectly added under the
bus@0 node instead of the bus@8100000000 node, breaking the boot on
Tegra264. Fix this by moving the main GPIO controller node under
bus@8100000000.

Fixes: c70e6bc11d20 ("arm64: tegra: Add Tegra264 GPIO controllers")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: socfpga: Fix OF node refcount leak in SMP setup

socfpga_smp_prepare_cpus() looks up the Cortex-A9 SCU node with
of_find_compatible_node(), which returns a node reference that must be
released with of_node_put().

The function maps the SCU registers and then returns without dropping
that reference, leaking the node on both the success path and the
of_iomap() failure path.

Drop the reference once the mapping attempt is complete. The returned
MMIO mapping does not depend on keeping the device node reference held.
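
The get/put pairing described above can be modelled in plain userspace
C. This is only a sketch, not the kernel code: struct model_node,
model_find_node(), model_node_put() and model_map_regs() are hypothetical
stand-ins for struct device_node, of_find_compatible_node(),
of_node_put() and of_iomap(), and the reference count is a plain int.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node: only a reference count. */
struct model_node {
	int refcount;
};

struct model_node model_scu_node = { .refcount = 1 };
int model_scu_regs;	/* pretend MMIO window */

/* Models a lookup that returns the node with an extra reference held. */
struct model_node *model_find_node(void)
{
	model_scu_node.refcount++;
	return &model_scu_node;
}

/* Models dropping a reference taken by the lookup. */
void model_node_put(struct model_node *np)
{
	if (np)
		np->refcount--;
}

/* Models the mapping helper; fails when 'fail' is non-zero. */
void *model_map_regs(struct model_node *np, int fail)
{
	(void)np;
	return fail ? NULL : &model_scu_regs;
}

/* Leaky variant: returns without dropping the lookup reference. */
void *model_prepare_leaky(int fail)
{
	struct model_node *np = model_find_node();

	if (!np)
		return NULL;
	return model_map_regs(np, fail);
}

/* Fixed variant: drops the reference once the mapping attempt is done. */
void *model_prepare_fixed(int fail)
{
	struct model_node *np = model_find_node();
	void *base;

	if (!np)
		return NULL;
	base = model_map_regs(np, fail);
	model_node_put(np);	/* covers both success and failure */
	return base;
}
```

In this model the fixed variant leaves the count unchanged whether or
not the mapping succeeds, while each call to the leaky variant raises it
by one.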

Fixes: 122694a0c712 ("ARM: socfpga: use of_iomap to map the SCU")
Cc: stable@vger.kernel.org
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>

driver core: Guard deferred probe timeout extension with delayed_work_pending()

mod_delayed_work() unconditionally queues the work even when it wasn't
previously pending, which can fire the timeout prematurely or restart it
after it already fired. Add a delayed_work_pending() guard to restore
the originally intended semantics.

Premature firing calls fw_devlink_drivers_done() before all built-in
drivers have registered, causing fw_devlink to prematurely relax device
links for suppliers whose drivers haven't loaded yet.
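
The difference the guard makes can be sketched with a small userspace
model. Everything here is hypothetical: struct model_dwork and the
model_*() helpers only imitate the semantics described above for
mod_delayed_work() and delayed_work_pending(); they are not the
workqueue implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a delayed work item: armed flag plus expiry. */
struct model_dwork {
	bool pending;
	unsigned long expires;
};

/* Models the behaviour described above: (re)arms the timer whether or
 * not the work was pending beforehand. */
void model_mod_delayed_work(struct model_dwork *w, unsigned long now,
			    unsigned long delay)
{
	w->pending = true;
	w->expires = now + delay;
}

bool model_delayed_work_pending(const struct model_dwork *w)
{
	return w->pending;
}

/* Unguarded extension: arms the timeout even if it was never queued or
 * has already fired. */
void model_extend_unguarded(struct model_dwork *w, unsigned long now,
			    unsigned long delay)
{
	model_mod_delayed_work(w, now, delay);
}

/* Guarded extension: only pushes out a timeout that is still pending. */
void model_extend_guarded(struct model_dwork *w, unsigned long now,
			  unsigned long delay)
{
	if (model_delayed_work_pending(w))
		model_mod_delayed_work(w, now, delay);
}
```

With an idle work item, the guarded helper is a no-op, whereas the
unguarded one arms the timer from scratch.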

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-2-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

driver core: Fix missing jiffies conversion in deferred_probe_extend_timeout()

mod_delayed_work() takes jiffies, not seconds. Thus, restore the dropped
conversion.
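
As a rough illustration of why the unit matters, the sketch below
assumes a tick rate of 250 per second. MODEL_HZ and both helpers are
hypothetical; the kernel's HZ is configuration dependent and the actual
conversion used by the fix is not reproduced here.

```c
#include <assert.h>

/* Assumed tick rate for this sketch only. */
#define MODEL_HZ 250UL

/* Converts a timeout in seconds to ticks (jiffies in kernel terms). */
unsigned long model_secs_to_ticks(unsigned long secs)
{
	return secs * MODEL_HZ;
}

/* Converts a tick count back to milliseconds. */
unsigned long model_ticks_to_msecs(unsigned long ticks)
{
	return ticks * 1000UL / MODEL_HZ;
}
```

Under that assumption a 10 second timeout is 2500 ticks, while passing
the bare value 10 as a tick count would expire after only 40 ms.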

While at it, fix incorrect indentation.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

MIPS: Loongson64: dts: Add node for LS7A PCH LPC

The Loongson 7A series PCH contains an LPC IRQ controller.

Add the device tree node for it.

Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Loongson64: dts: Sort nodes

The RTC's address comes after those of the UARTs, but its node is
currently placed before them.

Reorder the nodes to match the address sequence.

Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: mobileye: Remove duplicate FIT_IMAGE_FDT_EPM5 from main Kconfig

kconfiglint reports:

  K008: config FIT_IMAGE_FDT_EPM5 has prompts in 2 separate definitions

The FIT_IMAGE_FDT_EPM5 Kconfig symbol is defined identically in two places:

  arch/mips/Kconfig:1052
  arch/mips/mobileye/Kconfig:17

Both have the same prompt, depends, default, and help text. Since
arch/mips/mobileye/Kconfig is sourced from arch/mips/Kconfig, both
definitions are parsed and the symbol ends up with two prompts.

The symbol was first introduced in commit 101bd58fde10 ("MIPS: Add
support for Mobileye EyeQ5") directly in arch/mips/Kconfig. Three months
later, commit fbe0fae601b7 ("MIPS: mobileye: Add EyeQ6H support")
created the arch/mips/mobileye/Kconfig sub-file to organize the growing
Mobileye platform code and added the MACH_EYEQ5/MACH_EYEQ6H choice along
with a copy of FIT_IMAGE_FDT_EPM5. However, the original definition in
arch/mips/Kconfig was not removed at that time, leaving a duplicate.

Remove the definition from arch/mips/Kconfig, keeping the one in
arch/mips/mobileye/Kconfig where it belongs alongside the related
MACH_EYEQ5 machine type definition that it depends on.

Assisted-by: Claude:claude-opus-4-6 kconfiglint
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: ralink: reduce ARCH_DMA_MINALIGN

Currently, Ralink SoCs use the default ARCH_DMA_MINALIGN value of 128
bytes defined in mach-generic. This is excessive for these platforms
and leads to significant memory waste in kmalloc.

Override ARCH_DMA_MINALIGN to use L1_CACHE_BYTES, which is 16 bytes for
RT288X and 32 bytes for other Ralink SoCs.
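
The effect on small allocations can be estimated with a deliberately
simplified model in which every request is rounded up to a multiple of
the minimum alignment. The helpers below are hypothetical and ignore the
real kmalloc size classes; they only illustrate the rounding arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Rounds 'size' up to a multiple of 'align' (a power of two). */
size_t model_align_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/* Bytes lost to padding for one allocation under this model. */
size_t model_waste(size_t size, size_t align)
{
	return model_align_up(size, align) - size;
}
```

In this model a 24-byte object occupies 128 bytes with a 128-byte
minimum alignment (104 bytes of padding), but only 32 bytes with a
32-byte or 16-byte minimum alignment.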

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

include: Remove unused jz4740-battery.h

The last user was removed in commit aea12071d6fc
("power/supply: Drop obsolete JZ4740 driver") and replaced by
a self-contained IIO-based driver. No file includes this header.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

include: Remove unused jz4740-adc.h

The last user was the JZ4740 MFD ADC driver, removed in commit
ff71266aa490 ("mfd: Drop obsolete JZ4740 driver") and replaced
by a self-contained IIO driver. No file includes or references
this header.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: n64: add __iomem for writel call

sparse: incorrect type in argument 2 (different address spaces) @@
expected void volatile [noderef] __iomem *mem @@
got unsigned int [usertype] *

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202105261445.AcvPd2EE-lkp@intel.com/
Fixes: baec970aa5ba ("mips: Add N64 machine type")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: ralink: mt7621: add missing __iomem

raw_readl and writel calls expect pointers annotated with __iomem.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild/202211060456.cnV6IK6G-lkp@intel.com/
Fixes: cc19db8b312a ("MIPS: ralink: mt7621: do memory detection on KSEG1")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Sergio Paracuellos <sergio.paracuellos@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: cps: Assemble jr.hb with an R2 ISA level

A MIPS allmodconfig built with LLVM can select CPU_MIPS32_R1 together
with MIPS_MT_SMP. In that configuration clang invokes the integrated
assembler with -march=mips32, and the MIPS MT path in cps-vec.S fails
to assemble two jr.hb instructions:

  arch/mips/kernel/cps-vec.S:376:2: error: instruction requires
  a CPU feature not currently enabled

  arch/mips/kernel/cps-vec.S:490:4: error: instruction requires
  a CPU feature not currently enabled

The earlier jr.hb in the same file is already assembled inside a .set
MIPS_ISA_LEVEL_RAW scope. The two failing sites are reached after
popping back to the file's base ISA level, so LLVM correctly rejects
them for an R1 target.

Wrap those jr.hb instructions in the same ISA-level push/pop used by
the working site. This keeps the MT code unchanged while making the
required R2 hazard-branch encoding explicit to the assembler.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Prevent initial console buffer from landing in XKPHYS

In 64-bit configurations calling the initial console output handler from
a kernel thread other than the initial one will result in a situation
where the stack has been placed in the XKPHYS 64-bit memory segment and
consequently so has been the buffer allocated there that is used as the
argument corresponding to the `%s' output conversion specifier for the
firmware's printf() entry point.

This 64-bit address will then be truncated by 32-bit firmware, resulting
in an attempt to access the wrong memory location, which in turn will
cause all kinds of unpredictable behaviour, such as a kernel crash:

  Console: colour dummy device 160x64
  Calibrating delay loop... 49.36 BogoMIPS (lpj=192512)
  pid_max: default: 32768 minimum: 301
  CPU 0 Unable to handle kernel paging request at virtual address 000000000203bd00, epc == ffffffffbfc08364, ra == ffffffffbfc08800
  Oops[#1]:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc2-00254-gfb649bda6f56-dirty #121
  $ 0   : 0000000000000000 0000000000000001 0000000000000023 ffffffff80684ba0
  $ 4   : 000000000203bd00 ffffffffbfc0f3b4 ffffffffffffffff 0000000000000073
  $ 8   : 0a303d7469000000 0000000000000000 0000000000000073 ffffffffbfc0f473
  $12   : 0000000000000002 0000000000000000 ffffffff80684c1c 0000000000000000
  $16   : 0000000000000000 ffffffff80596dc9 0000000000000000 ffffffffbfc09240
  $20   : ffffffff80684c40 ffffffffbfc0f400 000000000000002d 000000000000002b
  $24   : ffffffffffffffbf 000000000203bd00
  $28   : ffffffff805f0000 ffffffff80684b58 0000000000000030 ffffffffbfc08800
  Hi    : 0000000000000000
  Lo    : 0000000000000aa8
  epc   : ffffffffbfc08364 0xffffffffbfc08364
  ra    : ffffffffbfc08800 0xffffffffbfc08800
  Status: 140120e2        KX SX UX KERNEL EXL
  Cause : 00000008 (ExcCode 02)
  BadVA : 000000000203bd00
  PrId  : 00000430 (R4000SC)
  Modules linked in:
  Process swapper (pid: 0, threadinfo=(____ptrval____), task=(____ptrval____), tls=0000000000000000)
  Stack : 0000000000000000 0000000000000000 0000000000000000 0000004d0000004d
          80684cc0806a2a40 80596dc80000004d 8061000000000000 bfc0850c80684c38
          0000000000000000 000000000203bd00 0000000000000000 0000000000000000
          0000000000000000 00000000bfc0f3b4 0000000000000000 0000000000000000
          0000000000000000 0000000000000000 0000000000000000 0000000000000000
          0000000000000000 0000000000000000 0000000000000000 0000000000000000
          0000002500000000 0000000000000000 0000000000000000 802c1a7400000000
          0203bd0080596dc8 0203bd4d69000000 6c61632000000018 5f746567646e6172
          6c616320625f6d6f 5f736e5f6d6f7266 206361323778302b 303d74696e726320
          806a0a38806b0000 806a0a38806b0000 00000000806b0000 80683c58806b0000
          ...
  Call Trace:

  Code: a082ffff  03e00008  00601021 <80820000> 00001821  10400005  24840001  80820000  24630001

  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Fatal exception in interrupt

  KN04 V2.1k    (PC: 0xa0026768, SP: 0x806848e8)
  >>

In this case the pointer in $4 was truncated from 0x980000000203bd00 to
0x000000000203bd00.

This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap, or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started.

Fix the issue by making the buffer static and initdata, and therefore
placed in the CKSEG0 32-bit compatibility segment, observing that the
console output handler is called with the console lock held, implying
no need for this code to be reentrant.  Add an assertion to verify the
buffer actually has been placed in a compatibility segment.
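
The truncation and the kind of check such an assertion could perform
can be sketched in portable C. Both helpers are hypothetical and are not
the code added by this change; the model treats an address as 32-bit
compatible when it equals the sign extension of its own low 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Keeps only the low 32 bits, as 32-bit firmware would see the pointer. */
uint32_t model_truncate32(uint64_t addr)
{
	return (uint32_t)addr;
}

/* Returns non-zero if 'addr' survives truncation to 32 bits followed by
 * sign extension back to 64 bits. */
int model_is_compat32(uint64_t addr)
{
	uint64_t low = addr & 0xffffffffULL;
	uint64_t ext = (low & 0x80000000ULL) ?
		       (low | 0xffffffff00000000ULL) : low;

	return ext == addr;
}
```

Applied to the values from the oops above, the XKPHYS pointer
0x980000000203bd00 truncates to 0x0203bd00 and fails the check, while an
address such as 0xffffffff80684ba0 passes it.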

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v2.6.12+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Ensure 32-bit stack location for o32 prom_printf()

In 64-bit configurations calling any firmware entry points from a kernel
thread other than the initial one will result in a situation where the
stack has been placed in the XKPHYS 64-bit memory segment.

Consequently the stack pointer is no longer a 32-bit value and when the
32-bit firmware code called uses 32-bit ALU operations to manipulate the
stack pointer, the calculated result is incorrect (in fact in the 64-bit
MIPS ISA almost all 32-bit ALU operations will produce an unpredictable
result when executed on 64-bit data) and control goes astray.

This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap, or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started, and the kernel will hang at:

  pid_max: default: 32768 minimum: 301

or somewhat later, but always before:

  cblist_init_generic: Setting adjustable number of callback queues.

has been printed.

It seems that only the prom_printf() entry point is affected.  Of all
the other entry points wired, only rex_slot_address() and rex_gettcinfo()
are called from a kernel thread other than the initial one, specifically
kernel_init(), and they are leaf functions that do no business with the
stack, having worked with no issue ever since 64-bit support was added
for the platform back in 2002.

To address this issue then, arrange for the stack to be switched in the
o32 wrapper as required for prom_printf() only, by supplying call_o32()
with a pointer to a chunk of initdata space, which is placed in the
CKSEG0 32-bit compatibility segment, observing that prom_printf() is
only called from console output handler and therefore with the console
lock held, implying no need for this code to be reentrant.

Other firmware entry points may be called with interrupts enabled and no
lock held, and may therefore require that call_o32() be reentrant.  They
trigger no issue at this point and "if it ain't broke, don't fix it," so
just leave them alone.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v2.6.12+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Remove IRQF_ONESHOT reference for IOASIC DMA error IRQs

There is no need for IOASIC DMA error interrupts to use the IRQF_ONESHOT
flag, because while they do need to have the source cleared only at the
conclusion of handling, the action handler supplied is either run in the
hardirq context with interrupts disabled at the CPU level or, where IRQ
threading has been forced, the primary handler has the IRQF_ONESHOT flag
implicitly added and therefore the original action handler, now run as
the thread handler and with interrupts enabled in the CPU, is executed
with the originating interrupt line masked. Therefore no interrupt will
retrigger regardless until the original request has been handled.

Link: https://lore.kernel.org/r/20260127135334.qUEaYP9G@linutronix.de/
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Fix prototypes for halt/reset handlers

Remove a bunch of compilation warnings for halt/reset handlers:

arch/mips/dec/reset.c:22:17: warning: no previous prototype for 'dec_machine_restart' [-Wmissing-prototypes]
   22 | void __noreturn dec_machine_restart(char *command)
      |                 ^~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:27:17: warning: no previous prototype for 'dec_machine_halt' [-Wmissing-prototypes]
   27 | void __noreturn dec_machine_halt(void)
      |                 ^~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:32:17: warning: no previous prototype for 'dec_machine_power_off' [-Wmissing-prototypes]
   32 | void __noreturn dec_machine_power_off(void)
      |                 ^~~~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:38:13: warning: no previous prototype for 'dec_intr_halt' [-Wmissing-prototypes]
   38 | irqreturn_t dec_intr_halt(int irq, void *dev_id)
      |             ^~~~~~~~~~~~~

(which get promoted to compilation errors with CONFIG_WERROR), by moving
the local prototypes from arch/mips/dec/setup.c to a dedicated header
for arch/mips/dec/reset.c to use as well.  No functional change.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Remove do_IRQ() call indirection

As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code, as a tail call.  Remove the dec_irq_dispatch() stub
then and the indirection previously introduced with commit 187933f23679
("[MIPS] do_IRQ cleanup"), improving performance by reducing the number
of control flow changes and the overall instruction count, while fixing
a compiler's complaint about a missing prototype for said stub:

arch/mips/dec/setup.c:780:25: warning: no previous prototype for 'dec_irq_dispatch' [-Wmissing-prototypes]
  780 | asmlinkage unsigned int dec_irq_dispatch(unsigned int irq)
      |                         ^~~~~~~~~~~~~~~~

(which gets promoted to a compilation error with CONFIG_WERROR).

Fixes: 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section for MIPS")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Make do_IRQ() available for assembly callers

As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code again, however its `asmlinkage' annotation has never
been brought back from the previous removal of the function with commit
187933f23679 ("[MIPS] do_IRQ cleanup").

Since calling the function directly from assembly code has a performance
advantage, add the annotation back so that the DEC platform can make use
of this again.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Fix big-endian stack argument fetching in o32 wrapper

Fix an issue in call_o32() where the upper 32-bit half of incoming n64
stack arguments is fetched and used for outgoing o32 stack arguments on
big-endian platforms.

This code was adapted from arch/mips/dec/prom/call_o32.S which was meant
for a little-endian platform only and therefore using 32-bit loads from
64-bit stack slot locations holding incoming stack arguments resulted in
correct values being retrieved for data that is expected to be 32-bit.

This works on little-endian platforms where the lower 32-bit half of the
64-bit value is located at every 64-bit stack slot location. However on
big-endian platforms the lower 32-bit half is instead located at offset
4 from every 64-bit stack slot location.

So to fix the issue the offset of 4 would have to be used on big-endian
platforms only, or alternatively a 64-bit load from the 64-bit stack
slot location can be used across the board, as the subsequent 32-bit
store to the corresponding outgoing stack argument slot will correctly
truncate the value and cause no unpredictable result. We already take
advantage of this architectural feature for the incoming arguments held
in $a6 and $a7 registers, since the o32 wrapper does not know how many
incoming arguments there are and consequently propagates incoming data
which may not be 32-bit.

Since this code is generally supposed to be used with the stack located
in cached memory there is no extra overhead expected for 64-bit loads as
opposed to 32-bit ones, so pick this variant for code simplicity.
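
The byte layout argument can be checked with a small userspace model
that stores a 64-bit stack slot in an explicit byte order. All helpers
are hypothetical illustrations; the actual fix is in assembly and is not
reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Stores 'v' into an 8-byte slot using the requested byte order. */
void model_store64(uint8_t slot[8], uint64_t v, int big_endian)
{
	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? 8 * (7 - i) : 8 * i;

		slot[i] = (uint8_t)(v >> shift);
	}
}

/* 32-bit load from 'p' using the requested byte order. */
uint32_t model_load32(const uint8_t *p, int big_endian)
{
	uint32_t v = 0;

	for (int i = 0; i < 4; i++) {
		int shift = big_endian ? 8 * (3 - i) : 8 * i;

		v |= (uint32_t)p[i] << shift;
	}
	return v;
}

/* 64-bit load from the slot using the requested byte order. */
uint64_t model_load64(const uint8_t slot[8], int big_endian)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++) {
		int shift = big_endian ? 8 * (7 - i) : 8 * i;

		v |= (uint64_t)slot[i] << shift;
	}
	return v;
}
```

In this model a 32-bit load at offset 0 returns the argument only for
the little-endian layout; on the big-endian layout it returns the upper
half, and the argument sits at offset 4. A 64-bit load followed by
truncation yields the argument for both layouts.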

Fixes: 231a35d37293 ("[MIPS] RM: Collected changes")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: RB532: serial: statify setup_serial_port()

This function is not used outside of this compilation unit so make it
static.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: RB532: attach the software node to its target GPIO controller

GPIOLIB wants to remove the software node's name matching against GPIO
controller's label that is going on behind the scenes in software node
lookup. To that end, we need to convert all existing users to using
software nodes actually attached to the GPIO devices they represent.

In order to use an attached software node with the GPIO controller on
rb532: convert the GPIO module into a real platform device, provide
platform device info for it in device.c and assign the software node
using its swnode field.

The software node will get inherited by the GPIO chip from the parent
platform device in devm_gpiochip_add_data() as we don't set the fwnode
using any other of the mechanisms.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

ALSA: usb-audio: add IFB_SILENCE_ON_EMPTY quirk for Behringer Flow 8

The Behringer Flow 8 (1397:050c) is an 8-channel USB mixer that
declares OUT EP 0x01 with implicit feedback from capture EP 0x81 via
its UAC2 endpoint companion descriptor. After 5-35 minutes of
continuous playback, the device occasionally returns a capture URB in
which every iso_frame_desc has a non-zero status (-EXDEV bursts,
visible as rate-limited "frame N active: -18" lines in dmesg from
pcm.c).

In that case snd_usb_handle_sync_urb() in endpoint.c counts bytes==0
and falls into the early "skip empty packets" return originally added
for M-Audio Fast Track Ultra. As a result the playback EP loses its
sole IFB-driven feeder and the OUT ring starves permanently: hw_ptr
stops advancing while substream state remains RUNNING. Only USB
re-enumeration recovers.

Three independent ftrace captures (taken at the moment of stall via a
userspace watchdog) consistently show:

  - 60-70 capture URB completions in the 70ms window before the marker
  - 0 retire_playback_urb / queue_pending_output_urbs /
    snd_usb_endpoint_implicit_feedback_sink calls
  - every usb_submit_urb in the window comes from
    snd_complete_urb+0x64e (capture self-resubmit), none from the
    queue_pending_output_urbs path

Add a new opt-in quirk QUIRK_FLAG_IFB_SILENCE_ON_EMPTY: when set, the
early return is skipped and we fall through to enqueue a packet_info
whose packet_size[i] are all 0 (the existing loop already maps
status!=0 packets to size 0). prepare_outbound_urb then emits a
silence packet, the OUT ring keeps moving, and the device rides
through the glitch.

The default behaviour (early return) is preserved for all existing
devices including M-Audio Fast Track Ultra. Only Flow 8 opts in here.
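
The control flow described above can be sketched as a userspace model.
The struct and function names are hypothetical and heavily simplified;
this is not the code in endpoint.c, only an illustration of the early
return and of mapping failed packets to size 0.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_PACKS 8

/* Hypothetical stand-in for one isochronous frame descriptor. */
struct model_frame {
	int status;
	unsigned int actual_length;
};

/* Hypothetical stand-in for the packet info handed to playback. */
struct model_packet_info {
	unsigned int packets;
	unsigned int packet_size[MODEL_MAX_PACKS];
};

/* Returns true if packet info was enqueued for the playback side. */
bool model_handle_sync_urb(const struct model_frame *frames, unsigned int n,
			   bool silence_on_empty,
			   struct model_packet_info *out)
{
	unsigned int bytes = 0;
	unsigned int i;

	if (n > MODEL_MAX_PACKS)
		n = MODEL_MAX_PACKS;

	for (i = 0; i < n; i++)
		if (frames[i].status == 0)
			bytes += frames[i].actual_length;

	/* Default behaviour: skip empty packets entirely. */
	if (bytes == 0 && !silence_on_empty)
		return false;

	/* Packets with a non-zero status are mapped to size 0. */
	out->packets = n;
	for (i = 0; i < n; i++)
		out->packet_size[i] = frames[i].status == 0 ?
				      frames[i].actual_length : 0;
	return true;
}
```

With every frame in error, the model enqueues nothing by default, and
enqueues an all-zero packet list when the opt-in flag is set.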

Cc: stable@vger.kernel.org
Signed-off-by: Gordon Chen <chengordon326@gmail.com>
Link: https://patch.msgid.link/20260526072906.90106-1-chengordon326@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

genirq/proc: Speed up /proc/interrupts iteration

Reading /proc/interrupts iterates over the interrupt number space one by
one and looks up the descriptors one by one. That's just a waste of time.

When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and
cache the descriptor pointer efficiently for the sequence file operations.

Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and
leave the fs/proc/ variant for the legacy architectures which ignore generic
code.

This reduces the time wasted for looking up the next record significantly.
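
The saving can be illustrated with a toy model in which the allocated
interrupt numbers live in a sorted array and a binary search stands in
for the maple tree lookup. Both helpers are hypothetical and do not
reflect the actual genirq data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first entry >= 'from', or 'n' if none. */
size_t model_find_next(const unsigned int *irqs, size_t n, unsigned int from)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (irqs[mid] < from)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Visits every allocated interrupt by jumping straight to the next one
 * and returns the number of lookups performed. */
unsigned int model_walk_sparse(const unsigned int *irqs, size_t n)
{
	unsigned int lookups = 0;
	unsigned int next = 0;

	for (;;) {
		size_t idx = model_find_next(irqs, n, next);

		lookups++;
		if (idx == n)
			break;
		next = irqs[idx] + 1;
	}
	return lookups;
}
```

For four allocated interrupts spread over numbers 1 to 4000, this walk
performs five lookups, whereas probing every number from 0 to 4000 one
by one would need 4001.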

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194932.165280601@kernel.org