git.ipfire.org Git - thirdparty/linux.git/log

cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load

When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.

The race can be illustrated with two concurrent paths:

Path A (sysfs write, holds attr_set->update_lock):

governor_store()
  mutex_lock(&attr_set->update_lock)
  ignore_nice_load_store()
    dbs_data->ignore_nice_load = 1              /* (A1) */
    gov_update_cpu_data(dbs_data)
      mutex_lock(&policy_dbs->update_mutex)     /* (A2) */
        j_cdbs->prev_cpu_nice = kcpustat_field(...)
      mutex_unlock(&policy_dbs->update_mutex)
  mutex_unlock(&attr_set->update_lock)

Path B (work queue, wins the race between A1 and A2):

dbs_work_handler()
  mutex_lock(&policy_dbs->update_mutex)         /* acquired before A2 */
  dbs_update()
    ignore_nice = dbs_data->ignore_nice_load    /* sees new value: 1 */
    cur_nice = kcpustat_field(...)
    idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
    j_cdbs->prev_cpu_nice = cur_nice
  mutex_unlock(&policy_dbs->update_mutex)

Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.

To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:

  - gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
    prev_cpu_idle, so both baselines share the same timestamp when
    io_is_busy changes.  This prevents an interval mismatch between
    idle_time and nice_delta on the next dbs_update() when
    ignore_nice_load is enabled.
  - cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
    so the baseline is always valid from the first dbs_update() call;
    remove the ignore_nice guard and the now-unused ignore_nice variable.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix data races on per-CPU idle/nice baselines

gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex.  A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex.  cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable.  Fix this by reading io_busy
inside the mutex.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

block: remove blkdev_write_begin() and blkdev_write_end()

Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: fix use-after-free on service thread failure

If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.

The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.

Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.

Fixes: e8b58ef09e84 ("mtip32xx: fix device removal")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260525162531.1406677-1-dbgh9129@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't set BIO_QUIET for BLK_STS_AGAIN

Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.

Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

direct-io: remove IOCB_NOWAIT support

None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Avoid mounting the bdev pseudo-filesystem in userspace

The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.

This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0. `inode->i_op = &empty_iops;`

[1]

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 23380067 P4D 23380067 PUD 23381067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN NOPTI
CPU: 2 PID: 17125 Comm: syz-executor.0 Not tainted 6.1.155-syzkaller-00350-g84221fde2681 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:0x0

Call Trace:
<TASK>
lookup_open.isra.0+0x700/0x1180 fs/namei.c:3460
open_last_lookups fs/namei.c:3550 [inline]
path_openat+0x953/0x2700 fs/namei.c:3780
do_filp_open+0x1c5/0x410 fs/namei.c:3810
do_sys_openat2+0x171/0x4d0 fs/open.c:1318
do_sys_open fs/open.c:1334 [inline]
__do_sys_openat fs/open.c:1350 [inline]
__se_sys_openat fs/open.c:1345 [inline]
__x64_sys_openat+0x13c/0x1f0 fs/open.c:1345
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/all/20131010004732.GJ13318@ZenIV.linux.org.uk/T/#
Cc: stable@vger.kernel.org
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260521072857.5078-1-arefev@swemel.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: switch numa_node to int in blk_mq_hw_ctx and init_request

numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].

Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.

Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: skip sync_blockdev() on surprise removal in bdev_mark_dead()

bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.

syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.

Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead")
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: add tracepoint block_rq_tag_wait

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260525005123.722277-1-atomlin@atomlin.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

KVM: SEV: Mark source page dirty when writing back CPUID data on failure

When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before its read by
userspace.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-5-3f196bfad5a1@google.com
[sean: use set_page_dirty(), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements

Per highmem.h, local kernel mappings must be unmapped in the reserve order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".

Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.

Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-4-3f196bfad5a1@google.com
[sean: call out that the bug is benign]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Pin source page for write when adding CPUID data for SNP guest

When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.

Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.

Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Cc: stable@vger.kernel.org
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>

ASoC: SOF: ipc4-topology: Support for multiple src output formats

Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:

SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.

In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.

Link: https://patch.msgid.link/20260526105748.26149-1-peter.ujfalusi@linux.intel.com

ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output

The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.

Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc

SRC and ASRC modules must have at least one input and on one output formats
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats are 0.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-2-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

io_uring/zcrx: add shared-memory notification statistics

Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.

Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.

The kernel updates the stats counters in-place on every copy-fallback
event.

Signed-off-by: Clément Léger <cleger@meta.com>
[pavel: rename io_uring_zcrx_notif_stats]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user on frag copy fallback

Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.

Signed-off-by: Clément Léger <cleger@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user when out of buffers

There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: add ctx pointer to zcrx

zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: reorder fd allocation in zcrx_export()

Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.

While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.

Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: remove extra ifq close

By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: poison pointers on unregistration

Nobody should be touching area and other pointers after zcrx
destruction, poison them instead of zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: make scrubbing more reliable

Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

wifi: ath12k: fix error unwind on arch_init() failure in PCI probe

When arch_init() fails in ath12k_pci_probe(), the code jumps to
err_pci_msi_free, leaking resources in teardown.

Redirect the failure path to err_free_irq so teardown matches the setup order.

Compile-tested only.

Fixes: 614c23e24ee8 ("wifi: ath12k: Support arch-specific DP device allocation")
Signed-off-by: Ripan Deuri <ripan.deuri@oss.qualcomm.com>
Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Link: https://patch.msgid.link/20260519192815.3911324-1-ripan.deuri@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>

block: partitions: fix of_node refcount leak in of_partition()

of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:

1. The compatibility check at the top of the function returns 0
   without releasing partitions_np when the node exists but is not
   "fixed-partitions" compatible.

2. The function returns 1 at the end after successfully processing
   all partitions without releasing partitions_np.

Fix both leaks by adding of_node_put(partitions_np) on each path.

Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ASoC: cs35l56-shared-test: Fix possible null pointer dereference

The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.

Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)

Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260523211522.522616-1-ethantidmore06@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

mtd: spi-nor: debugfs: Add a locked sectors map

In order to get a very clear view of the sectors being locked, besides
the `params` output giving the ranges, we may want to see a proper map
of the sectors and for each of them, their status. Depending on the use
case, this map may be easier to parse by humans and gives a more acurate
feeling of the situation. At least myself, for the few locking-related
developments I recently went through, I found it very useful to get a
clearer mental model of what was locked/unlocked.

Here is an example of output:

$ cat /sys/kernel/debug/spi-nor/spi0.0/locked-sectors-map
Locked sectors map (x: locked, .: unlocked, unit: 64kiB)
0x00000000 (#    0): ................ ................ ................ ................
0x00400000 (#   64): ................ ................ ................ ................
0x00800000 (#  128): ................ ................ ................ ................
0x00c00000 (#  192): ................ ................ ................ ................
0x01000000 (#  256): ................ ................ ................ ................
0x01400000 (#  320): ................ ................ ................ ................
0x01800000 (#  384): ................ ................ ................ ................
0x01c00000 (#  448): ................ ................ ................ ................
0x02000000 (#  512): ................ ................ ................ ................
0x02400000 (#  576): ................ ................ ................ ................
0x02800000 (#  640): ................ ................ ................ ................
0x02c00000 (#  704): ................ ................ ................ ................
0x03000000 (#  768): ................ ................ ................ ................
0x03400000 (#  832): ................ ................ ................ ................
0x03800000 (#  896): ................ ................ ................ ................
0x03c00000 (#  960): ................ ................ ................ ..............xx

The output is wrapped at 64 sectors, spaces every 16 sectors are
improving the readability, every line starts by the first sector
offset (hex) and number (decimal).

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
[pratyush@kernel.org: split the debugfs_create_file() into two lines]
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
  post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off

mtd: spi-nor: debugfs: Add locking support

The ioctl output may be counter intuitive in some cases. Asking for a
"locked status" over a region that is only partially locked will return
"unlocked" whereas in practice maybe the biggest part is actually
locked.

Knowing what is the real software locking state through debugfs would be
very convenient for development/debugging purposes, hence this proposal
for adding an extra block at the end of the file: a "locked sectors"
array which lists every section, if it is locked or not, showing both
the address ranges and the sizes in numbers of "lock sectors" (which on
small density devices is typically different than erase blocks).

Here is an example of output, what is after the "sector map" is new.

$ cat /sys/kernel/debug/spi-nor/spi0.0/params
name (null)
id ef a0 20 00 00 00
size 64.0 MiB
write size 1
page size 256
address nbytes 4
flags HAS_SR_TB | 4B_OPCODES | HAS_4BAIT | HAS_LOCK | HAS_16BIT_SR | HAS_SR_TB_BIT6 | HAS_4BIT_BP | SOFT_RESET | NO_WP

opcodes
read 0xec
  dummy cycles 6
erase 0xdc
program 0x34
8D extension none

protocols
read 1S-4S-4S
write 1S-1S-4S
register 1S-1S-1S

erase commands
21 (4.00 KiB) [1]
dc (64.0 KiB) [3]
c7 (64.0 MiB)

sector map
region (in hex)   | erase mask | overlaid
------------------+------------+---------
00000000-03ffffff |     [   3] | no

locked sectors
region (in hex)   | status   | #sectors
------------------+----------+---------
00000000-03ffffff | unlocked | 1024

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Create a local SR cache

In order to be able to generate debugfs output without having to
actually reach the flash, create a SPI NOR local cache of the status
registers. What matters in our case are all the bits related to sector
locking. As such, in order to make it clear that this cache is not
intended to be used anywhere else, we zero the irrelevant bits.

The cache is initialized once during the early init, and then maintained
every time the write protection scheme is updated.

Suggested-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Cosmetic changes

As a final preparation step for the introduction of CMP support, make
a few more cosmetic changes to simplify the reading of the diff when
adding the CMP feature. In particular, define "min_prot_len" earlier as
it will be reused and move the definition of the "ret" variable at the
end of the stack just because it looks better.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Simplify checking the locked/unlocked range

In both the locking/unlocking steps, at the end we verify whether we do
not lock/unlock more than requested (in which case an error must be
returned).

While being possible to do that with very simple mask comparisons, it
does not scale when adding extra locking features such as the CMP
possibility. In order to make these checks slightly easier to read and
more future proof, use existing helpers to read the (future) status
register, extract the covered range, and compare it with very usual
algebric comparisons.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create helpers for building the SR register

The status register contains 3 or 4 BP (Block Protect) bits, 0 or 1
TB (Top/Bottom) bit, soon 0 or 1 CMP (Complement) bit. The last BP bit
and the TB bit locations change between vendors. The whole logic of
buildling the content of the status register based on some input
conditions is used two times and soon will be used 4 times.

Create dedicated helpers for these steps.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create a TB intermediate variable

Ease the future reuse of the tb (Top/Bottom) boolean by creating an
intermediate variable.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Rename a mask

"mask" is not very descriptive when we already manipulate two masks, and
soon will manipulate three. Rename it "bp_mask" to align with the
existing "tb_mask" and soon "cmp_mask".

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create a helper that writes SR, CR and checks

There are many helpers already to either read and/or write SR and/or CR,
as well as sometimes check the returned values. In order to be able to
switch from a 1 byte status register to a 2 bytes status register while
keeping the same level of verification, let's introduce a new helper
that writes them both (atomically) and then reads them back (separated)
to compare the values.

In case 2 bytes registers are not supported, we still have the usual
fallback available in the helper being exported to the rest of the core.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Use a pointer for SR instead of a single byte

At this stage, the Status Register is most often seen as a single
byte. This is subject to change when we will need to read the CMP bit
which is located in the Control Register (kind of secondary status
register). Both will need to be carried.

Change a few prototypes to carry a u8 pointer. This way it also makes it
very clear where we access the first register, and where we will access
the second.

There is no functional change.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Clarify a comment

The comment states that some power of two sizes are not supported. This
is very device dependent (based on the size), so modulate a bit the
sentence to make it more accurate.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Explain the MEMLOCK ioctl implementation behaviour

Add more details about how these requests are actually handled in the
SPI NOR core. Their behaviour was not entirely clear to me at first, and
explaining them in plain English sounds the way to go.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Enhance output

Align the number of dashes to the bigger column width (the title in this
case) to make the output more pleasant and aligned with what is done
in the "params" file output.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Align variable access with the rest of the file

The "params" variable is used everywhere else, align this particular
line of the file to use "params" directly rather than the "nor" pointer.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Improve opcodes documentation

There are two status registers, named 1 and 2. The current wording is
misleading as "1" may refer to the status register ID as well as the
number of bytes required (which, in this case can be 1 or 2).

Clarify the comments by aligning them on the same pattern:
"{read,write} status {1,2} register"

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Make sure the QE bit is kept enabled if useful

Not all chips implement the 4BAIT table which typically indicates the
program capability, while many of them do implement the relevant SFDP
parts indicating the read capabilities. In such a situation, programs
can happen in single mode (1-1-1) and reads in quad mode (1-1-4 or
1-4-4). For the reads to work in such condition, the QE bit must be set.
In case we later use the spi_nor_write_16bit_sr_and_check() helper with
a chip with such configuration, the QE bit would get incorrectly
cleared.

Make sure this doesn't happen by keeping the QE bit under a simpler
condition:
- the quad enable hook is there (no change)
- and at least one of the two protocols is based on quad I/O cycles

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Drop duplicate Kconfig dependency

I do not think the MTD dependency is needed twice. This is likely a
duplicate coming from a former rebase when the spi-nor core got cleaned
up a while ago. Remove the extra line.

Fixes: b35b9a10362d ("mtd: spi-nor: Move m25p80 code in spi-nor.c")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Improve locking user experience

In the case of the first block being locked (or the few first blocks),
if the user want to fully unlock the device it has two possibilities:
- either it asks to unlock the entire device, and this works;
- or it asks to unlock just the block(s) that are currently locked,
which fails.

It fails because the conditions "can_be_top" and "can_be_bottom" are
true. Indeed, in this case, we unlock everything, so the TB bit does not
matter. However in the current implementation, use_top would be true (as
this is the favourite option) and lock_len, which in practice should be
reduced down to 0, is set to "nor->params->size - (ofs + len)" which is
a positive number. This is wrong.

An easy way is to simply add an extra condition. In the unlock() path,
if we can achieve the same result from both sides, it means we unlock
everything and lock_len must simply be 0. A comment is added to clarify
that logic.

Fixes: 3dd8012a8eeb ("mtd: spi-nor: add TB (Top/Bottom) protect support")
Cc: stable@kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Fix the flags list

As mentioned above the spi_nor_option_flags enumeration in core.h, this
list should be kept in sync with the one in the core.

Add the missing flag.

Fixes: 6a42bc97ccda ("mtd: spi-nor: core: Allow specifying the byte order in Octal DTR mode")
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

Merge branch 'ethtool-module-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: module: fix a handful of small bugs

I've been poking at the locking in ethtool and it appears
that the FW flashing is not currently taking the ops lock.
Existing drivers which implement module FW flashing seem
to have their own locking, so this series doesn't actually
add the ops lock (I'll add it in net-next). But a number
of other errors have been surfaced in the process.
====================

Link: https://patch.msgid.link/20260522231312.1710836-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate fw->size against start_cmd_payload_size

cmis_fw_update_start_download() copies start_cmd_payload_size bytes
from the firmware blob into the CDB LPL vendor_data[] payload without
validating that the FW has enough data.

Since the start_cmd_payload_size can only be ~120B an image too short
is most likely corrupted, so reject it.

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate start_cmd_payload_size from module

The CMIS firmware update code reads start_cmd_payload_size from
the module's FW Management Features CDB reply and uses it directly
as the byte count for memcpy. The destination buffer is 112 bytes
(ETHTOOL_CMIS_CDB_LPL_MAX_PL_LENGTH - 8). So a malicious
module (or corrupted response) can cause a OOB write later on in
cmis_fw_update_start_download().

Let's error out. If modules that expect longer LPL writes actually
exist we should revisit.

struct cmis_cdb_start_fw_download_pl's definition has to move,
no change there.

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: fix u16-to-u8 truncation of msleep_pre_rpl

ethtool_cmis_cdb_compose_args() accepts msleep_pre_rpl as u16 but stores
it into the u8 field ethtool_cmis_cdb_cmd_args::msleep_pre_rpl, silently
truncating values >= 256. Seven of the nine call sites pass 1000 ms
(it's the third argument from the end).

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: require exact CDB reply length

Malicious SFP module could respond with rpl_len longer than
what cmis_cdb_process_reply() expected, leading to OOB writes.
Malicious HW is a bit theoretical but some modules may just
be buggy and/or the reads may occasionally get corrupted,
so let's protect the kernel.

The existing check protects from short replies. We need to
protect from long ones, too. All callers that pass a non-zero
rpl_exp_len cast the reply payload to a fixed-layout struct
and read fields at fixed offsets, with no version negotiation
or short-reply handling:

  - cmis_cdb_validate_password()
  - cmis_cdb_module_features_get()
  - cmis_fw_update_fw_mng_features_get()

so let's assume that responses longer than expected do not
have to be handled gracefully here. Add a warning message
to make the debug easier in case my understanding is wrong...

Note that page_data->length (argument of kmalloc) comes from
last arg to ethtool_cmis_page_init() which is rpl_exp_len.

Note2 that AIs also like to point out overflows in args->req.payload
itself (which is a fixed-size 120 B buffer, on the stack),
but callers should be reading structs defined by the standard,
so protecting from requests for more data than max seem like
defensive programming.

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: fix cleanup if socket used for flashing multiple devices

When a single Netlink socket issues MODULE_FW_FLASH_ACT against multiple
devices, ethnl_sock_priv_set() overwrites sk_priv->dev on each call,
retaining only the last one. The socket priv is used on socket close,
to walk the global work list and mark the uncompleted flashing work
as "orphaned". Otherwise if another socket reuses the PID it will
unexpectedly receive the flashing notifications.

Don't record the device, record net pointer instead. The purpose of
the dev is to scope the work to a netns, anyway. If we store netns
the overrides are safe/a nop since all flashed devices must be in
the same netns as the socket.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: check fw_flash_in_progress under rtnl_lock

ethnl_set_module_validate() inspects module_fw_flash_in_progress
but validate is meant for _input_ validation, not state validation.
rtnl_lock is not held, yet. Move the check into ethnl_set_module().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid racy updates to dev->ethtool bitfield

When reviewing other changes Gemini points out that we currently
update module_fw_flash_in_progress without holding any locks.
Since module_fw_flash_in_progress is part of a bitfield this
is not great, updates to other fields may be lost.

We could use a bool and sprinkle some READ_ONCE/WRITE_ONCE here
but seems like the issue is rather than the work is an unusual
writer. The other writers already hold the right locks. So just
very briefly take these locks when the work completes.

Note that nothing ever cancels the FW update work, so there's
no concern with deadlocks vs cancel.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid leaking a netdev ref on module flash errors

module_flash_fw_schedule() is missing undo for setting
the "in_progress" flag and taking the netdev reference.
Delay taking these, the device can't disappear while
we are holding rtnl_lock.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: call ethnl_ops_complete() on module flash errors

When validate() fails we are skipping over ethnl_ops_complete()
even tho we already called ethnl_ops_begin().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ethtool-rss-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: rss: fix a handful of small bugs

Fix a handful of small bugs in the ethtool Netlink RSS code.
====================

Link: https://patch.msgid.link/20260522230647.1705600-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid device context leak on reply-build failure

We wait with filling the reply for new RSS context creation
until after the driver ->create_rxfh_context call. The driver
needs to fill some of the defaults in the context. The failure
of rss_fill_reply() is somewhat theoretical, but doesn't take
much effort to handle it properly. Call ->remove_rxfh_context().

If the driver's remove callback fails (some implementations like sfc
can return real command errors from firmware RPCs) - skip the xa_erase
and kfree, leaving the context in the xarray. This matches how
ethnl_rss_delete_doit() behaves.

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix hkey leak when indir_size is 0

rss_get_data_alloc() allocates a single buffer that backs both the
indirection table and the hash key, but only assigned data->indir_table
when indir_size was nonzero. The expectation was that no driver
implements RSS without supporting indirection table but apparently
enic does just that (it's the only such in-tree driver).
enic has get_rxfh_key_size but no get_rxfh_indir_size.
data->indir_table stays as NULL, hkey gets set but rss_get_data_free()
kfree(data->indir_table) is a nop and the allocation leaks.

Always store the allocation base in data->indir_table so the free path
is unambiguous. No caller treats indir_table as a sentinel; everything
keys off indir_size.

Fixes: 7112a04664bf ("ethtool: add netlink based get rss support")
Link: https://patch.msgid.link/20260522230647.1705600-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix indir_table and hkey leak on get_rxfh failure

rss_prepare_get() allocates the indirection table and hash key buffer
via rss_get_data_alloc(), then calls ops->get_rxfh() to populate them.
If get_rxfh() fails, the function returns an error without freeing
the allocation.

Fixes: 4f038a6a02d2 ("net: ethtool: Don't call .cleanup_data when prepare_data fails")
Link: https://patch.msgid.link/20260522230647.1705600-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix falsely ignoring indir table updates

rss_set_prep_indir() compares the new indirection table against the
current one to determine whether any update is needed. The memcmp
call passes data->indir_size as the length argument, but indir_size
is the number of u32 entries, not the byte count.

Fixes: c0ae03588bbb ("ethtool: rss: initial RSS_SET (indirection table handling)")
Link: https://patch.msgid.link/20260522230647.1705600-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: add missing errno on RSS context delete

Remember to set ret before jumping out if someone tries
to delete a context on a device which doesn't support
contexts.

Fixes: fbe09277fa63 ("ethtool: rss: support removing contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid modifying the RSS context response

Gemini says that we're modifying the RSS_CREATE response skb.
I think it's right, the comment says that unicast() should
unshare the skb but I'm not entirely sure what I meant there.
netlink_trim() does a copy but only if skb is not well sized
(it's at least 2x larger than necessary for the payload).

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host

A previous bug in __pkvm_init_vm error path showed that the hypervisor
could leak refcounted pages, (i.e. losing access to a page while its
refcount is still elevated). This poses a threat to the pKVM state
machine.

Address this by introducing a fail-safe in __pkvm_hyp_donate_host.
Transitions are not a hot path so added security is worth the extra
check.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Fix __pkvm_init_vm error path

In the unlikely case where insert_vm_table_entry fails, __pkvm_init_vm
release the memory donated by the host for the PGD, but as the stage-2
is still set-up the hypervisor keeps a refcount on those pages,
effectively leaking the references.

Fix the rollback with the newly added kvm_guest_destroy_stage2().

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Reset page order in pKVM hyp_pool

When a VM fails to initialise after its stage-2 hyp_pool has been
initialised, that stage-2 must be torn down entirely. This requires
resetting both the refcount and the order of its pages back to 0.

Currently, reclaim_pgtable_pages() implicitly resets the page order by
allocating the entire pool with order-0 granularity. However, in the VM
initialisation error path, the addresses of the donated memory (the PGD)
are already known, making it unnecessary to iterate over all pages in
the pool.

Since the vmemmap page order is a hyp_pool-specific field, leaving a
non-zero order on hyp_pool destruction is harmless until another pool
attempts to admit the page. Instead of resetting this field during
destruction, reset it during pool initialization in hyp_pool_init().

For 'external' pages, we can't trust the order either as they bypass
hyp_pool_init(). Since we never coalesce them, enforce order-0 to ensure
safe insertion into the pool.

This leaves no vmemmap order users outside of hyp_pool.

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

cpufreq/amd-pstate: drop stale @epp_cached kdoc

Commit 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP") removed
the epp_cached field from struct amd_cpudata in favour of always
reading from cppc_req_cached, but the kdoc above the struct still
documents @epp_cached.

Drop the now-stale @epp_cached entry.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Fixes: 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Link: https://lore.kernel.org/r/20260526022131.1302373-1-zhanxusheng@xiaomi.com
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

arm64: tegra: Enable PCIe for Jetson AGX Thor

Enable the PCIe controllers on the Jetson AGX Thor Developer Kit that
are used for ethernet and NVMe.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

arm64: tegra: Fix address of Tegra264 main GPIO controller

The 64-bit address of the main GPIO controller on Tegra264 is
0x810c300000. The main GPIO controller was incorrectly added under the
bus@0 node instead of the bus@8100000000 node breaking the boot on
Tegra264. Fix this by moving to main GPIO controller node under
bus@8100000000.

Fixes: c70e6bc11d20 ("arm64: tegra: Add Tegra264 GPIO controllers")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: socfpga: Fix OF node refcount leak in SMP setup

socfpga_smp_prepare_cpus() looks up the Cortex-A9 SCU node with
of_find_compatible_node(), which returns a node reference that must be
released with of_node_put().

The function maps the SCU registers and then returns without dropping
that reference, leaking the node on both the success path and the
of_iomap() failure path.

Drop the reference once the mapping attempt is complete. The returned
MMIO mapping does not depend on keeping the device node reference held.

Fixes: 122694a0c712 ("ARM: socfpga: use of_iomap to map the SCU")
Cc: stable@vger.kernel.org
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>

driver core: Guard deferred probe timeout extension with delayed_work_pending()

mod_delayed_work() unconditionally queues the work even when it wasn't
previously pending, which can fire the timeout prematurely or restart it
after it already fired. Add a delayed_work_pending() guard to restore
the originally intended semantics.

Premature firing calls fw_devlink_drivers_done() before all built-in
drivers have registered, causing fw_devlink to prematurely relax device
links for suppliers whose drivers haven't loaded yet.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-2-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

driver core: Fix missing jiffies conversion in deferred_probe_extend_timeout()

mod_delayed_work() takes jiffies, not seconds. Thus, restore the dropped
conversion.

While at it, fix incorrect indentation.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

MIPS: Loongson64: dts: Add node for LS7A PCH LPC

Loongson 7A series PCH contain a LPC IRQ controller.

Add the device tree node of it.

Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Loongson64: dts: Sort nodes

The RTC's address is after UARTs, however the node is currently before
them.

Re-order the node to match address sequence.

Signed-off-by: Icenowy Zheng <zhengxingda@iscas.ac.cn>
Acked-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: mobileye: Remove duplicate FIT_IMAGE_FDT_EPM5 from main Kconfig

kconfiglint reports:

  K008: config FIT_IMAGE_FDT_EPM5 has prompts in 2 separate definitions

The FIT_IMAGE_FDT_EPM5 Kconfig symbol is defined identically in two places:

  arch/mips/Kconfig:1052
  arch/mips/mobileye/Kconfig:17

Both have the same prompt, depends, default, and help text. Since
arch/mips/mobileye/Kconfig is sourced from arch/mips/Kconfig, both
definitions are parsed and the symbol ends up with two prompts.

The symbol was first introduced in commit 101bd58fde10 ("MIPS: Add
support for Mobileye EyeQ5") directly in
arch/mips/Kconfig. Three months later, commit fbe0fae601b7 ("MIPS:
mobileye: Add EyeQ6H support") created the
arch/mips/mobileye/Kconfig sub-file to organize the growing Mobileye
platform code and added the MACH_EYEQ5/MACH_EYEQ6H choice along with
a copy of FIT_IMAGE_FDT_EPM5. However, the original definition in
arch/mips/Kconfig was not removed at that time, leaving a duplicate.

Remove the definition from arch/mips/Kconfig, keeping the one in
arch/mips/mobileye/Kconfig where it belongs alongside the related
MACH_EYEQ5 machine type definition that it depends on.

Assisted-by: Claude:claude-opus-4-6 kconfiglint
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: ralink: reduce ARCH_DMA_MINALIGN

Currently, Ralink SoCs use the default ARCH_DMA_MINALIGN value of 128
bytes defined in mach-generic. This is excessive for these platforms
and leads to significant memory waste in kmalloc.

Override ARCH_DMA_MINALIGN to use L1_CACHE_BYTES, which is 16 bytes for
RT288X and 32 bytes for other Ralink SoCs.

Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

include: Remove unused jz4740-battery.h

The last user was removed in commit aea12071d6fc
("power/supply: Drop obsolete JZ4740 driver") and replaced by
a self-contained IIO-based driver. No file includes this header.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

include: Remove unused jz4740-adc.h

The last user was the JZ4740 MFD ADC driver, removed in commit
ff71266aa490 ("mfd: Drop obsolete JZ4740 driver") and replaced
by a self-contained IIO driver. No file includes or references
this header.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: n64: add __iomem for writel call

sparse: incorrect type in argument 2 (different address spaces) @@
expected void volatile [noderef] __iomem *mem @@
got unsigned int [usertype] *

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202105261445.AcvPd2EE-lkp@intel.com/
Fixes: baec970aa5ba ("mips: Add N64 machine type")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: ralink: mt7621: add missing __iomem

raw_readl and writel calls expect pointers annotated with __iomem.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild/202211060456.cnV6IK6G-lkp@intel.com/
Fixes: cc19db8b312a ("MIPS: ralink: mt7621: do memory detection on KSEG1")
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Sergio Paracuellos <sergio.paracuellos@gmail.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

mips: cps: Assemble jr.hb with an R2 ISA level

A MIPS allmodconfig built with LLVM can select CPU_MIPS32_R1 together
with MIPS_MT_SMP. In that configuration clang invokes the integrated
assembler with -march=mips32, and the MIPS MT path in cps-vec.S fails
to assemble two jr.hb instructions:

  arch/mips/kernel/cps-vec.S:376:2: error: instruction requires
  a CPU feature not currently enabled

  arch/mips/kernel/cps-vec.S:490:4: error: instruction requires
  a CPU feature not currently enabled

The earlier jr.hb in the same file is already assembled inside a .set
MIPS_ISA_LEVEL_RAW scope. The two failing sites are reached after
popping back to the file's base ISA level, so LLVM correctly rejects
them for an R1 target.

Wrap those jr.hb instructions in the same ISA-level push/pop used by
the working site. This keeps the MT code unchanged while making the
required R2 hazard-branch encoding explicit to the assembler.

Assisted-by: Codex:GPT-5.5
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Prevent initial console buffer from landing in XKPHYS

In 64-bit configurations calling the initial console output handler from
a kernel thread other than the initial one will result in a situation
where the stack has been placed in the XKPHYS 64-bit memory segment and
consequently so has been the buffer allocated there that is used as the
argument corresponding to the `%s' output conversion specifier for the
firmware's printf() entry point.

This 64-bit address will then be truncated by 32-bit firmware, resulting
in an attempt to access the wrong memory location, which in turn will
cause all kinds of unpredictable behaviour, such as a kernel crash:

  Console: colour dummy device 160x64
  Calibrating delay loop... 49.36 BogoMIPS (lpj=192512)
  pid_max: default: 32768 minimum: 301
  CPU 0 Unable to handle kernel paging request at virtual address 000000000203bd00, epc == ffffffffbfc08364, ra == ffffffffbfc08800
  Oops[#1]:
  CPU: 0 PID: 0 Comm: swapper Not tainted 5.18.0-rc2-00254-gfb649bda6f56-dirty #121
  $ 0   : 0000000000000000 0000000000000001 0000000000000023 ffffffff80684ba0
  $ 4   : 000000000203bd00 ffffffffbfc0f3b4 ffffffffffffffff 0000000000000073
  $ 8   : 0a303d7469000000 0000000000000000 0000000000000073 ffffffffbfc0f473
  $12   : 0000000000000002 0000000000000000 ffffffff80684c1c 0000000000000000
  $16   : 0000000000000000 ffffffff80596dc9 0000000000000000 ffffffffbfc09240
  $20   : ffffffff80684c40 ffffffffbfc0f400 000000000000002d 000000000000002b
  $24   : ffffffffffffffbf 000000000203bd00
  $28   : ffffffff805f0000 ffffffff80684b58 0000000000000030 ffffffffbfc08800
  Hi    : 0000000000000000
  Lo    : 0000000000000aa8
  epc   : ffffffffbfc08364 0xffffffffbfc08364
  ra    : ffffffffbfc08800 0xffffffffbfc08800
  Status: 140120e2        KX SX UX KERNEL EXL
  Cause : 00000008 (ExcCode 02)
  BadVA : 000000000203bd00
  PrId  : 00000430 (R4000SC)
  Modules linked in:
  Process swapper (pid: 0, threadinfo=(____ptrval____), task=(____ptrval____), tls=0000000000000000)
  Stack : 0000000000000000 0000000000000000 0000000000000000 0000004d0000004d
          80684cc0806a2a40 80596dc80000004d 8061000000000000 bfc0850c80684c38
          0000000000000000 000000000203bd00 0000000000000000 0000000000000000
          0000000000000000 00000000bfc0f3b4 0000000000000000 0000000000000000
          0000000000000000 0000000000000000 0000000000000000 0000000000000000
          0000000000000000 0000000000000000 0000000000000000 0000000000000000
          0000002500000000 0000000000000000 0000000000000000 802c1a7400000000
          0203bd0080596dc8 0203bd4d69000000 6c61632000000018 5f746567646e6172
          6c616320625f6d6f 5f736e5f6d6f7266 206361323778302b 303d74696e726320
          806a0a38806b0000 806a0a38806b0000 00000000806b0000 80683c58806b0000
          ...
  Call Trace:

  Code: a082ffff  03e00008  00601021 <80820000> 00001821  10400005  24840001  80820000  24630001

  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Fatal exception in interrupt

  KN04 V2.1k    (PC: 0xa0026768, SP: 0x806848e8)
  >>

In this case the pointer in $4 was truncated from 0x980000000203bd00 to
0x000000000203bd00.

This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started.

Fix the issue by making the buffer static and initdata, and therefore
placed in the CKSEG0 32-bit compatibility segment, observing that the
console output handler is called with the console lock held, implying
no need for this code to be reentrant.  Add an assertion to verify the
buffer actually has been placed in a compatibility segment.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v2.6.12+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Ensure 32-bit stack location for o32 prom_printf()

In 64-bit configurations calling any firmware entry points from a kernel
thread other than the initial one will result in a situation where the
stack has been placed in the XKPHYS 64-bit memory segment.

Consequently the stack pointer is no longer a 32-bit value and when the
32-bit firmware code called uses 32-bit ALU operations to manipulate the
stack pointer, the calculated result is incorrect (in fact in the 64-bit
MIPS ISA almost all 32-bit ALU operations will produce an unpredictable
result when executed on 64-bit data) and control goes astray.

This may happen when no final console driver has been enabled in the
configuration and consequently the initial console continues being used
late into bootstrap, or with an upcoming change that will switch the zs
driver to use a platform device, which in turn will make the console
handover happen only after other kernel threads have already been
started, and the kernel will hang at:

  pid_max: default: 32768 minimum: 301

or somewhat later, but always before:

  cblist_init_generic: Setting adjustable number of callback queues.

has been printed.

It seems that only the prom_printf() entry point is affected.  Of all
the other entry points wired only rex_slot_address() and rex_gettcinfo()
are called from a kernel thread other than the initial one, specifically
kernel_init(), and they are leaf functions that do no business with the
stack, having worked with no issue ever since 64-bit support was added
for the platform back in 2002.

To address this issue then, arrange for the stack to be switched in the
o32 wrapper as required for prom_printf() only, by supplying call_o32()
with a pointer to a chunk of initdata space, which is placed in the
CKSEG0 32-bit compatibility segment, observing that prom_printf() is
only called from console output handler and therefore with the console
lock held, implying no need for this code to be reentrant.

Other firmware entry points may be called with interrupts enabled and no
lock held, and may therefore require that call_o32() be reentrant.  They
trigger no issue at this point and "if it ain't broke, don't fix it," so
just leave them alone.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: stable@vger.kernel.org # v2.6.12+
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Remove IRQF_ONESHOT reference for IOASIC DMA error IRQs

There is no need for IOASIC DMA error interrupts to use the IRQF_ONESHOT
flag, because while they do need to have the source cleared only at the
conclusion of handling, the action handler supplied is either run in the
hardirq context with interrupts disabled at the CPU level or, where IRQ
threading has been forced, the primary handler has the IRQF_ONESHOT flag
implicitly added and therefore the original action handler, now run as
the thread handler and with interrupts enabled in the CPU, is executed
with the originating interrupt line masked. Therefore no interrupt will
retrigger regardless until the original request has been handled.

Link: https://lore.kernel.org/r/20260127135334.qUEaYP9G@linutronix.de/
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Fix prototypes for halt/reset handlers

Remove a bunch of compilation warnings for halt/reset handlers:

arch/mips/dec/reset.c:22:17: warning: no previous prototype for 'dec_machine_restart' [-Wmissing-prototypes]
   22 | void __noreturn dec_machine_restart(char *command)
      |                 ^~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:27:17: warning: no previous prototype for 'dec_machine_halt' [-Wmissing-prototypes]
   27 | void __noreturn dec_machine_halt(void)
      |                 ^~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:32:17: warning: no previous prototype for 'dec_machine_power_off' [-Wmissing-prototypes]
   32 | void __noreturn dec_machine_power_off(void)
      |                 ^~~~~~~~~~~~~~~~~~~~~
arch/mips/dec/reset.c:38:13: warning: no previous prototype for 'dec_intr_halt'
[-Wmissing-prototypes]
   38 | irqreturn_t dec_intr_halt(int irq, void *dev_id)
      |             ^~~~~~~~~~~~~

(which get promoted to compilation errors with CONFIG_WERROR), by moving
the local prototypes from arch/mips/dec/setup.c to a dedicated header
for arch/mips/dec/reset.c to use as well.  No functional change.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: DEC: Remove do_IRQ() call indirection

As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code, as a tail call.  Remove the dec_irq_dispatch() stub
then and the indirection previously introduced with commit 187933f23679
("[MIPS] do_IRQ cleanup"), improving performance by reducing the number
of control flow changes and the overall instruction count, while fixing
a compiler's complaint about a missing prototype for said stub:

arch/mips/dec/setup.c:780:25: warning: no previous prototype for 'dec_irq_dispatch' [-Wmissing-prototypes]
  780 | asmlinkage unsigned int dec_irq_dispatch(unsigned int irq)
      |                         ^~~~~~~~~~~~~~~~

(which gets promoted to a compilation error with CONFIG_WERROR).

Fixes: 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section for MIPS")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Make do_IRQ() available for assembly callers

As from commit 8f99a1626535 ("MIPS: Tracing: Add IRQENTRY_EXIT section
for MIPS") do_IRQ() is not a macro anymore and can be invoked directly
from assembly code again, however its `asmlinkage' annotation has never
been brought back from the previous removal of the function with commit
187933f23679 ("[MIPS] do_IRQ cleanup").

Since calling the function directly from assembly code has a performance
advantage, add the annotation back so that the DEC platform can make use
of this again.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: Fix big-endian stack argument fetching in o32 wrapper

Fix an issue in call_o32() where the upper 32-bit half of incoming n64
stack arguments is fetched and used for outgoing o32 stack arguments on
big-endian platforms.

This code was adapted from arch/mips/dec/prom/call_o32.S which was meant
for a little-endian platform only and therefore using 32-bit loads from
64-bit stack slot locations holding incoming stack arguments resulted in
correct values being retrieved for data that is expected to be 32-bit.

This works on little-endian platforms where the lower 32-bit half of the
64-bit value is located at every 64-bit stack slot location. However on
big-endian platforms the lower 32-bit half is instead located at offset
4 from every 64-bit stack slot location.

So to fix the issue the offset of 4 would have to be used on big-endian
platforms only, or alternatively a 64-bit load from the 64-bit stack
slot location can be used across the board, as the subsequent 32-bit
store to the corresponding outgoing stack argument slot will correctly
truncate the value and cause no unpredictable result. We already take
advantage of this architectural feature for the incoming arguments held
in $a6 and $a7 registers, since the o32 wrapper does not know how many
incoming arguments there are and consequently propagates incoming data
which may not be 32-bit.

Since this code is generally supposed to be used with the stack located
in cached memory there is no extra overhead expected for 64-bit loads as
opposed to 32-bit ones, so pick this variant for code simplicity.

Fixes: 231a35d37293 ("[MIPS] RM: Collected changes")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: RB532: serial: statify setup_serial_port()

This function is not used outside of this compilation unit so make it
static.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

MIPS: RB532: attach the software node to its target GPIO controller

GPIOLIB wants to remove the software node's name matching against GPIO
controller's label that is going on behind the scenes in software node
lookup. To that end, we need to convert all existing users to using
software nodes actually attached to the GPIO devices they represent.

In order to use an attached software node with the GPIO controller on
rb532: convert the GPIO module into a real platform device, provide
platform device info for it in device.c and assign the software node
using its swnode field.

The software node will get inherited by the GPIO chip from the parent
platform device in devm_gpiochip_add_data() as we don't set the fwnode
using any other of the mechanisms.

Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

remoteproc: imx_rproc: Use device node name as processor name

As currently there are maybe multiple remote processors, so change from
using fixed name to using device node name as remote processor name in
order to make them can be distinguished by through of name in sys
filesystem.

Signed-off-by: Jiafei Pan <Jiafei.Pan@nxp.com>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Link: https://lore.kernel.org/r/20260508032016.27716-1-Jiafei.Pan@nxp.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>

ALSA: usb-audio: add IFB_SILENCE_ON_EMPTY quirk for Behringer Flow 8

The Behringer Flow 8 (1397:050c) is an 8-channel USB mixer that
declares OUT EP 0x01 with implicit feedback from capture EP 0x81 via
its UAC2 endpoint companion descriptor. After 5-35 minutes of
continuous playback, the device occasionally returns a capture URB in
which every iso_frame_desc has a non-zero status (-EXDEV bursts,
visible as rate-limited "frame N active: -18" lines in dmesg from
pcm.c).

In that case snd_usb_handle_sync_urb() at endpoint.c counts bytes==0
and falls into the early "skip empty packets" return originally added
for M-Audio Fast Track Ultra. As a result the playback EP loses its
sole IFB-driven feeder and the OUT ring starves permanently: hw_ptr
stops advancing while substream state remains RUNNING. Only USB
re-enumeration recovers.

Three independent ftrace captures (taken at the moment of stall via a
userspace watchdog) consistently show:

  - 60-70 capture URB completions in the 70ms window before the marker
  - 0 retire_playback_urb / queue_pending_output_urbs /
    snd_usb_endpoint_implicit_feedback_sink calls
  - every usb_submit_urb in the window comes from
    snd_complete_urb+0x64e (capture self-resubmit), none from the
    queue_pending_output_urbs path

Add a new opt-in quirk QUIRK_FLAG_IFB_SILENCE_ON_EMPTY: when set, the
early return is skipped and we fall through to enqueue a packet_info
whose packet_size[i] are all 0 (the existing loop already maps
status!=0 packets to size 0). prepare_outbound_urb then emits a
silence packet, the OUT ring keeps moving, and the device rides
through the glitch.

The default behaviour (early return) is preserved for all existing
devices including M-Audio Fast Track Ultra. Only Flow 8 opts in here.

Cc: stable@vger.kernel.org
Signed-off-by: Gordon Chen <chengordon326@gmail.com>
Link: https://patch.msgid.link/20260526072906.90106-1-chengordon326@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

genirq/proc: Speed up /proc/interrupts iteration

Reading /proc/interrupts iterates over the interrupt number space one by
one and looks up the descriptors one by one. That's just a waste of time.

When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and
cache the descriptor pointer efficiently for the sequence file operations.

Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and
leave the fs/proc/ variant for the legacy architectures which ignore generic
code.

This reduces the time wasted for looking up the next record significantly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194932.165280601@kernel.org

genirq/proc: Runtime size the chip name

The chip name column in the /proc/interrupt output is 8 characters and
right aligned, which causes visual clutter due to the fixed length and the
alignment. Many interrupt chips, e.g. PCI/MSI[X] have way longer names.

Update the length when a chip is assigned to an interrupt and utilize this
information for the output. Align it left so all chip names start at the
begin of the column.

Update the GDB script as well and disentangle the header maze so it
actually works with all .config combinations.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194932.085786035@kernel.org

genirq: Expose irq_find_desc_at_or_after() in core code

... in preparation for a smarter iterator for /proc/interrupts.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194932.005787611@kernel.org

genirq: Add rcuref count to struct irq_desc

Prepare for a smarter iterator for /proc/interrupts so that the next
interrupt descriptor can be cached after lookup.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194931.917415190@kernel.org

genirq/proc: Increase default interrupt number precision to four

Quite some architectures have four character wide acronyms for architecture
specific interrupts like IPI, NMI, etc.

The default precision of printing the Linux device interrupt numbers is
three, which causes quite some code to play games with adding or omitting
space after the acronym and the colon in order to keep the per CPU numbers
properly aligned.

Increase the default number precision to four in the core code and get rid
of the space games all over the place. At the same time align all
architecture specific descriptor texts left so that they show up in the
same column as the interrupt chip names, which makes the output more
uniform accross architectures. Fix up the GDB script to this new scheme as
well.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260517194931.839482411@kernel.org

genirq: Calculate precision only when required

Calculating the precision of the interrupt number column on every initial
show_interrupt() invocation is a pointless exercise as the underlying
maximum number of interrupts rarely changes.

Calculate it only when that number is modified and let show_interrupts()
use the cached value.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Link: https://patch.msgid.link/20260517194931.760664517@kernel.org

genirq: Cache the condition for /proc/interrupts exposure

show_interrupts() evaluates a boatload of conditions to establish whether
it should expose an interrupt in /proc/interrupts or not.

That can be simplified by caching the condition in an internal status flag,
which is updated when one of the relevant inputs changes.

The irq_desc::kstat_irq check is dropped because visible interrupt
descriptors always have a valid pointer.

As a result the number of instructions and branches for reading
/proc/interrupts is reduced significantly.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Link: https://patch.msgid.link/20260517194931.680943749@kernel.org