git.ipfire.org Git - thirdparty/kernel/linux.git/log

KVM: x86: Widen x86_exception's error_code to 64 bits

Widen the error_code field in struct x86_exception from u16 to u64 to
accommodate AMD's NPF error code, which defines information bits above
bit 31, e.g. PFERR_GUEST_FINAL_MASK (bit 32), and PFERR_GUEST_PAGE_MASK
(bit 33).

Retain the u16 type for the local errcode variable in walk_addr_generic
as the walker synthesizes conventional #PF error codes that are
architecturally limited to bits 15:0.

Signed-off-by: Kevin Cheng <chengkev@google.com>
Link: https://patch.msgid.link/20260522232701.3671446-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: hyperv_features: test write of 1 to HV_X64_MSR_RESET

Writing 1 to HV_X64_MSR_RESET triggers a real vCPU reset; the test
was writing 0 because the host loop was not prepared to handle the
resulting KVM_EXIT_SYSTEM_EVENT. Add the missing handling and write
1 to actually exercise the reset path.

Signed-off-by: Piotr Zarycki <piotr.zarycki@gmail.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/20260523111857.195396-1-piotr.zarycki@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

call_once:: Fix typo in comment for call_once()

Change "succesfully" to "successfully" in the kerneldoc
comment of call_once().

Signed-off-by: Jiun Jeong <jiun.jeong.cs@gmail.com>
Link: https://patch.msgid.link/20260501144413.49419-1-jiun.jeong.cs@gmail.com
[sean: don't scope to KVM, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Randomize dirty_log_test's delay before reaping the bitmap

In the dirty log test, randomize the delay before the initial call to get
the dirty log bitmap for a given iteration, so that the amount of memory
dirtied by the guest varies from iteration to iteration, and so that the
user can effectively control the duration (by increasing the interval).

Always waiting 1ms effectively hides a KVM RISC-V bug as the test reaps the
dirty bitmap before the guest has a chance to trigger the problematic flow
in KVM.

Reported-by: Wu Fei <wu.fei9@sanechips.com.cn>
Closes: https://lore.kernel.org/all/202605111130.64BBUXDN013040@mse-fl2.zte.com.cn
Cc: Wu Fei <atwufei@163.com>
Link: https://patch.msgid.link/20260522170230.3518669-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Add and use kvm_free_fd() to harden against fd goofs

Add a kvm_free_fd() macro to close and invalidate a file descriptor, and
use it through the core infrastructure to harden against goofs where a
selftest attempts to reuse a closed file descriptor.

Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Fuad Tabba <tabba@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Cast guest_memfd fd to a signed int when checking for >= 0

When conditionally closing a memory region's guest_memfd file descriptor,
cast the field to a signed it so that negative values are correctly
detected. Because selftests reuse "struct kvm_userspace_memory_region2"
instead of providing custom storage, they pick up the kernel uAPI's __u32
definition of the file descriptor, not the more common "int" definition,
e.g. that's used for userspace_mem_region.fd.

Fixes: bb2968ad6c33 ("KVM: selftests: Add support for creating private memslots")
Reported-by: Bibo Mao <maobibo@loongson.cn>
Closes: https://lore.kernel.org/all/20260508015013.4108345-1-maobibo@loongson.cn
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522171535.3525890-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Remove unnecessary "%s" formatting of a constant string

Drop superfluous %s formatting from assertions in the guest_memfd overlap
testcases, as the string being printed doesn't require runtime formatting.

No functional change intended.

Reported-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522172151.3530267-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Test guest_memfd binding overlap without GPA overlap

The guest_memfd binding overlap test recreates the deleted slot with GPA
ranges that overlap the still-live slot. KVM rejects those attempts from
the generic memslot overlap check before reaching kvm_gmem_bind(), so the
test can pass even if guest_memfd binding overlap detection is broken.

Recreate the slot at its original, non-overlapping GPA and use guest_memfd
offsets that overlap the front and back halves of the other slot's binding.
Expand the guest_memfd so the back-half case remains within the file size.

Fixes: 2feabb855df8 ("KVM: selftests: Expand set_memory_region_test to validate guest_memfd()")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: keep the existing GPA overlap testcases]
Link: https://patch.msgid.link/20260522172151.3530267-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: guest_memfd: Return -EEXIST for overlapping bindings

KVM_SET_USER_MEMORY_REGION2 rejects guest_memfd ranges that overlap an
existing binding, but kvm_gmem_bind() currently reports the failure through
its generic -EINVAL path. That makes binding conflicts indistinguishable
from malformed guest_memfd parameters.

Return -EEXIST when the target guest_memfd range is already bound, matching
the errno used for overlapping GPA memslots and making the two types of
range conflicts report the same class of error to userspace.

Note, returning -EINVAL was definitely not intentional, as guest_memfd
support was accompanied by a selftest to verify that attempting to create
overlapping bindings fails with -EEXIST. Except the selftest was also
flawed in that it unintentionally overlapped memslot GPAs, and so failed
on KVM's common memslot checks before reaching guest_memfd.

Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
[sean: call out that the original intent was to return -EEXIST]
Link: https://patch.msgid.link/20260522172151.3530267-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

selftests/nolibc: test against -Wwrite-strings

Users may use this warning when building their own applications.
Make sure that nolibc does not trigger any such warnings.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-3-ab5cc16c7b23@weissschuh.net

selftests/nolibc: use mutable buffer for execve() argv string

The existing code would trigger a warning under -Wwrite-strings which is
about to be enabled. Use a mutable buffer instead. While in this
specific case, casting away the 'const' would be fine, let's avoid casts
which are not really necessary.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-2-ab5cc16c7b23@weissschuh.net

tools/nolibc: cast default values of program_invocation_name

With -Wwrite-strings the plain assignment triggers a warning as a
'const char *' is assigned to a 'char *', removing the const qualifier.

Casting the const away is fine, as there is no valid modification that
can be done to an empty string anyways.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://patch.msgid.link/20260525-nolibc-write-strings-v2-1-ab5cc16c7b23@weissschuh.net

spi: dt-bindings: spi-qpic-snand: Add ipq5210 compatible

Since the QPIC-SPI-NAND flash controller present in ipq5210 is the same
as the one found in ipq9574, document the ipq5210 compatible and with
ipq9574 as the fallback.

Signed-off-by: Varadarajan Narayanan <varadarajan.narayanan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260514-ipq5210-nand-v1-1-cbdd7492e826@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>

iio: core: fix uninitialized data in debugfs

If *ppos is non-zero then simple_write_to_buffer() will not initialize
the start of buf[]. Non zero values for *ppos aren't going to work
anyway. Test for them at the start of the function and return -EINVAL.

Fixes: 6d5dd486c715 ("iio: core: make use of simple_write_to_buffer()")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Maxwell Doose <m32285159@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: backend: fix uninitialized data in debugfs

If the *ppos value is non-zero then simple_write_to_buffer() will not
initialize the start of the buf[] buffer. Non-zero ppos values aren't
going to work at all. Check for that at the start of the function and
return -ENOSPC.

Fixes: cdf01e0809a4 ("iio: backend: add debugFs interface")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: dac: ad3552r-hs: fix uninitialized data ni ad3552r_hs_write_data_source()

If the *ppos value is non-zero then the simple_write_to_buffer() function
won't initialized the start of the buf[] buffer. Non-zero values for
*ppos won't work here generally, so just test for them and return -ENOSPC
at the start of the function.

Fixes: b1c5d68ea66e ("iio: dac: ad3552r-hs: add support for internal ramp")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Angelo Dureghello <adureghello@baylibre.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: adc: qcom-spmi-iadc: balance enable_irq_wake() on driver unbind

iadc_probe() calls enable_irq_wake() after a successful
devm_request_irq(), but the driver has no remove callback or
matching disable_irq_wake(), so the wake reference count on the
IRQ is leaked on module unload or driver unbind.

Check the IRQ request error first, then register a devm action
that calls disable_irq_wake() so the wake reference is released
in the same scope as the enable. While here, drop the inverted
"if (!ret) ... else return ret" in favour of the standard
"if (ret) return ret;" pattern.

Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: light: al3320a: read both ALS ADC registers again

al3320a_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting latter.

Fixes: 1850e6ae7f91 ("iio: light: al3320a: Implement regmap support")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: light: al3010: read both ALS ADC registers again

al3010_read_raw() used to read two adjacent registers
until the driver was modernized using the regmap framework.
That cleanup accidentally replaced the 16-bit word read
with a single byte read. I'm reverting latter.

Fixes: 0e5e21e23dd6 ("iio: light: al3010: Implement regmap support")
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: temperature: tmp006: use devm_iio_trigger_register

tmp006_probe() allocates the DRDY trigger with devm_iio_trigger_alloc()
but registers it with plain iio_trigger_register(). The driver has no
.remove() callback, so on module unload the trigger stays in the global
trigger list while its memory is freed by devm, leaving a dangling
entry.

Switch to devm_iio_trigger_register() so the registration is undone in
the same devm scope as the allocation.

Fixes: 91f75ccf9f03 ("iio: temperature: tmp006: add triggered buffer support")
Cc: stable@vger.kernel.org
Signed-off-by: Stepan Ionichev <sozdayvek@gmail.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: buffer: hw-consumer: free scan_mask on buffer release

The scan_mask lifetime changed in commit 9a2e1233d38c ("iio: buffer:
hw-consumer: remove redundant scan_mask flexible array").

Before that change, the scan mask storage was embedded in struct
hw_consumer_buffer, so iio_hw_buf_release() could free the whole
allocation with a single kfree(hw_buf).

That commit moved the scan mask to a separate bitmap_zalloc() allocation
stored in buffer.scan_mask, but left iio_hw_buf_release() unchanged.

Free the scan mask in iio_hw_buf_release() before freeing the buffer
wrapper.

Fixes: 9a2e1233d38c ("iio: buffer: hw-consumer: remove redundant scan_mask flexible array")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

iio: adc: ad7768-1: Select GPIOLIB

This driver provides a gpio chip a some related functions are not
stubbed if GPIOLIB is not built.

ad7768-1.c:420: undefined reference to `gpiochip_get_data'
ad7768-1.c:498: undefined reference to `devm_gpiochip_add_data_with_key'

Dual sign off reflects change of affiliation whilst this was under review.

Fixes: d569ae0f052e ("iio: adc: ad7768-1: Add GPIO controller support")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512271235.WwAmAbOa-lkp@intel.com/
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <jic23@kernel.org>

thermal: intel: int340x: Check return value of ptc_create_groups()

proc_thermal_ptc_add() ignores the return value of ptc_create_groups()
causing the driver to silenty continue even if sysfs group creation
fails.

The thermal control interface would be unavailable with no indication
of failure.

Check the return value and on failure clean up any sysfs groups that
were successfully created before the error, then propagate the error to
the caller which already handles it correctly via goto err_rem_rapl.

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-3-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: intel: int340x: Fix potential shift overflow in ptc_mmio_write()

The value parameter is u32 but is shifted into a u64 register value
without casting first. If the shift amount pushes bits beyond 32, they
are lost. Cast value to u64 before shifting to ensure all bits are
preserved.

Signed-off-by: Aravind Anilraj <aravindanilraj0702@gmail.com>
Link: https://patch.msgid.link/20260329070642.10721-2-aravindanilraj0702@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load

When ignore_nice_load is toggled from 0 to 1 via sysfs, dbs_update() may
run concurrently and observe the new tunable value while prev_cpu_nice
still holds a stale baseline, producing a spurious massive idle_time that
results in an incorrect CPU load value.

The race can be illustrated with two concurrent paths:

Path A (sysfs write, holds attr_set->update_lock):

governor_store()
  mutex_lock(&attr_set->update_lock)
  ignore_nice_load_store()
    dbs_data->ignore_nice_load = 1              /* (A1) */
    gov_update_cpu_data(dbs_data)
      mutex_lock(&policy_dbs->update_mutex)     /* (A2) */
        j_cdbs->prev_cpu_nice = kcpustat_field(...)
      mutex_unlock(&policy_dbs->update_mutex)
  mutex_unlock(&attr_set->update_lock)

Path B (work queue, wins the race between A1 and A2):

dbs_work_handler()
  mutex_lock(&policy_dbs->update_mutex)         /* acquired before A2 */
  dbs_update()
    ignore_nice = dbs_data->ignore_nice_load    /* sees new value: 1 */
    cur_nice = kcpustat_field(...)
    idle_time += div_u64(cur_nice - j_cdbs->prev_cpu_nice, ..) /* stale */
    j_cdbs->prev_cpu_nice = cur_nice
  mutex_unlock(&policy_dbs->update_mutex)

Fix this by unconditionally sampling cur_nice and advancing prev_cpu_nice
in dbs_update() on every call, regardless of ignore_nice. With
prev_cpu_nice always reflecting the most recent sample, enabling
ignore_nice_load can never produce a stale-baseline spike: the delta will
always be the nice time accumulated in the last sampling interval, not
since boot. The additional kcpustat_field() call per CPU per sample is
negligible given that the sampling path already reads idle and load
accounting.

To keep prev_cpu_nice handling consistent with the always-tracking
semantics introduced above:

  - gov_update_cpu_data() unconditionally resets prev_cpu_nice alongside
    prev_cpu_idle, so both baselines share the same timestamp when
    io_is_busy changes.  This prevents an interval mismatch between
    idle_time and nice_delta on the next dbs_update() when
    ignore_nice_load is enabled.
  - cpufreq_dbs_governor_start() unconditionally initializes prev_cpu_nice
    so the baseline is always valid from the first dbs_update() call;
    remove the ignore_nice guard and the now-unused ignore_nice variable.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-3-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: governor: Fix data races on per-CPU idle/nice baselines

gov_update_cpu_data() resets per-CPU prev_cpu_idle for every CPU in the
governed domain, and conditionally resets prev_cpu_nice when
ignore_nice_load is set. It is called from sysfs store callbacks
(e.g. ignore_nice_load_store) which run under attr_set->update_lock,
held by the surrounding governor_store().

Concurrently, dbs_work_handler() calls gov->gov_dbs_update() (which calls
dbs_update()) under policy_dbs->update_mutex. dbs_update() both reads and
writes the same prev_cpu_idle / prev_cpu_nice fields. The potential race
path is:

Path A (sysfs write, holds attr_set->update_lock only):

  governor_store()
    mutex_lock(&attr_set->update_lock)
    ignore_nice_load_store()
      dbs_data->ignore_nice_load = input
      gov_update_cpu_data(dbs_data)
        list_for_each_entry(policy_dbs, ...)
          for_each_cpu(j, ...)
            j_cdbs->prev_cpu_idle = get_cpu_idle_time(...)  /* write */
            j_cdbs->prev_cpu_nice = kcpustat_field(...)     /* write */
    mutex_unlock(&attr_set->update_lock)

Path B (work queue, holds policy_dbs->update_mutex only):

  dbs_work_handler()
    mutex_lock(&policy_dbs->update_mutex)
    gov->gov_dbs_update(policy)
      dbs_update()
        for_each_cpu(j, policy->cpus)
          idle_time = cur - j_cdbs->prev_cpu_idle           /* read  */
          j_cdbs->prev_cpu_idle = cur_idle_time             /* write */
          idle_time += cur_nice - j_cdbs->prev_cpu_nice     /* read  */
          j_cdbs->prev_cpu_nice = cur_nice                  /* write */
    mutex_unlock(&policy_dbs->update_mutex)

Because attr_set->update_lock and policy_dbs->update_mutex are two
completely independent locks, the two paths are not mutually exclusive.
This results in a data race on cpu_dbs_info.prev_cpu_idle and
cpu_dbs_info.prev_cpu_nice.

Fix this by also acquiring policy_dbs->update_mutex in
gov_update_cpu_data() for each policy, so that path A participates in
the mutual exclusion already established by dbs_work_handler(). Also
update the function comment to accurately reflect the two-level locking
contract.

Additionally, cpufreq_dbs_governor_start() initializes prev_cpu_idle
using io_busy read from dbs_data->io_is_busy without holding
policy_dbs->update_mutex.  A concurrent io_is_busy_store() can update
io_is_busy and call gov_update_cpu_data(), which writes prev_cpu_idle
with the new value under the mutex.  cpufreq_dbs_governor_start() then
overwrites prev_cpu_idle with the stale io_busy value, leaving the
baseline inconsistent with the tunable.  Fix this by reading io_busy
inside the mutex.

The root of this race dates back to the original ondemand/conservative
governors. Before commit ee88415caf73 ("[CPUFREQ] Cleanup locking in
conservative governor") and commit 5a75c82828e7 ("[CPUFREQ] Cleanup
locking in ondemand governor"), all accesses to prev_cpu_idle and
prev_cpu_nice in cpufreq_governor_dbs() (path X), store_ignore_nice_load()
/io_is_busy_store() (path Y), and do_dbs_timer() (path Z) were serialised
by the same dbs_mutex, so no race existed. Those two commits switched
do_dbs_timer() from dbs_mutex to a per-policy/per-cpu timer_mutex to
reduce lock contention, but left path Y (store) still holding dbs_mutex.
As a result, path Y (store) and path Z (do_dbs_timer) no longer shared a
common lock, introducing a potential race on prev_cpu_idle/prev_cpu_nice
between path Y (store) and dbs_check_cpu().

Commit 326c86deaed54a ("[CPUFREQ] Remove unneeded locks") then removed
dbs_mutex from store_ignore_nice_load()/io_is_busy_store() entirely,
introducing an additional potential race between path Y (now lockless)
and cpufreq_governor_dbs() (path X, still holding dbs_mutex), while the
race between path Y and path Z remained.

Fixes: ee88415caf736b ("[CPUFREQ] Cleanup locking in conservative governor")
Fixes: 5a75c82828e7c0 ("[CPUFREQ] Cleanup locking in ondemand governor")
Fixes: 326c86deaed54a ("[CPUFREQ] Remove unneeded locks")
Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com>
Link: https://patch.msgid.link/20260419132655.3800673-2-zhongqiu.han@oss.qualcomm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

block: remove blkdev_write_begin() and blkdev_write_end()

Remove blkdev_write_begin(), blkdev_write_end(), and their entries in
def_blk_aops. These have been unreachable since commit 487c607df790
("block: use iomap for writes to block devices") switched block device
buffered writes from generic_perform_write() to
iomap_file_buffered_write(), which bypasses aops->write_begin/end.

Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260525-blk-write-cleanup-v1-1-391c073e3831@columbia.edu
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mtip32xx: fix use-after-free on service thread failure

If service thread creation fails after device_add_disk() succeeds,
mtip_block_initialize() calls del_gendisk() and then falls through to
put_disk(). Since mtip32xx uses .free_disk to free struct driver_data,
put_disk() can release dd on the added-disk path.

The same unwind then continues to use dd for blk_mq_free_tag_set() and
mtip_hw_exit(), and mtip_pci_probe() can later free dd again. This can
cause a use-after-free and double free.

Track whether the disk was added in the current initialization call.
For the post-add service-thread failure path, remove the disk, release
the local hardware resources, and return without dropping the final disk
reference. The probe error path can then finish its cleanup and call
put_disk() after it is done using dd. Keep the pre-add path using
put_disk() before blk_mq_free_tag_set(), and clear dd->disk so the outer
probe cleanup frees dd directly.

Fixes: e8b58ef09e84 ("mtip32xx: fix device removal")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260525162531.1406677-1-dbgh9129@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't set BIO_QUIET for BLK_STS_AGAIN

Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.

Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

direct-io: remove IOCB_NOWAIT support

None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Avoid mounting the bdev pseudo-filesystem in userspace

The bdev pseudo-filesystem is an internal kernel filesystem with which
userspace should not interfere. Unregister it so that userspace cannot
even attempt to mount it.

This fixes a bug [1] that occurs when attempting to access files,
because the system call move_mount() uses pointers declared in the
inode_operations structure, which for the bdev pseudo-filesystem
are always equal to 0. `inode->i_op = &empty_iops;`

[1]

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 23380067 P4D 23380067 PUD 23381067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN NOPTI
CPU: 2 PID: 17125 Comm: syz-executor.0 Not tainted 6.1.155-syzkaller-00350-g84221fde2681 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:0x0

Call Trace:
<TASK>
lookup_open.isra.0+0x700/0x1180 fs/namei.c:3460
open_last_lookups fs/namei.c:3550 [inline]
path_openat+0x953/0x2700 fs/namei.c:3780
do_filp_open+0x1c5/0x410 fs/namei.c:3810
do_sys_openat2+0x171/0x4d0 fs/open.c:1318
do_sys_open fs/open.c:1334 [inline]
__do_sys_openat fs/open.c:1350 [inline]
__se_sys_openat fs/open.c:1345 [inline]
__x64_sys_openat+0x13c/0x1f0 fs/open.c:1345
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/all/20131010004732.GJ13318@ZenIV.linux.org.uk/T/#
Cc: stable@vger.kernel.org
Signed-off-by: Denis Arefev <arefev@swemel.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260521072857.5078-1-arefev@swemel.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: switch numa_node to int in blk_mq_hw_ctx and init_request

numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].

Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.

Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: skip sync_blockdev() on surprise removal in bdev_mark_dead()

bdev_mark_dead()'s @surprise == true means the device is already gone.
The filesystem callback fs_bdev_mark_dead() honours this and skips
sync_filesystem(), but the bare block device path (no ->mark_dead op)
lost its !surprise guard when the holder ->mark_dead callback was wired
up (see Fixes), and now calls sync_blockdev() unconditionally, which can
hang forever waiting on writeback that can no longer complete.

syzkaller hit this via nvme_reset_work()'s "I/O queues lost" path:
nvme_mark_namespaces_dead() -> blk_mark_disk_dead() ->
bdev_mark_dead(bdev, true) -> sync_blockdev() blocks in
folio_wait_writeback(), wedging the reset worker and every task waiting
on it.

Skip the sync on surprise removal, matching fs_bdev_mark_dead();
invalidate_bdev() still runs. Orderly removal (surprise == false) is
unchanged.

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Fixes: d8530de5a6e8 ("block: call into the file system for bdev_mark_dead")
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260522220025.1770388-1-coshi036@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: add tracepoint block_rq_tag_wait

In high-performance storage environments, particularly when utilising
RAID controllers with shared tag sets (BLK_MQ_F_TAG_HCTX_SHARED), severe
latency spikes can occur when fast devices (SSDs) are starved of hardware
tags when sharing the same blk_mq_tag_set.

Currently, diagnosing this specific hardware queue contention is
difficult. When a CPU thread exhausts the tag pool, blk_mq_get_tag()
forces the current thread to block uninterruptible via io_schedule().
While this can be inferred via sched:sched_switch or dynamically
traced by attaching a kprobe to blk_mq_mark_tag_wait(), there is no
dedicated, out-of-the-box observability for this event.

This patch introduces the block_rq_tag_wait tracepoint in the tag
allocation slow-path. It triggers immediately before the task state
is altered to TASK_UNINTERRUPTIBLE (ensuring safety for PREEMPT_RT
locks). It exposes the exact hardware context (hctx) that is starved,
the specific pool experiencing starvation (driver, software scheduler,
or reserved), and the exact pool depth.

This provides storage engineers with a zero-configuration, low-overhead
mechanism to definitively identify shared-tag bottlenecks. For example,
userspace can trivially replicate tag starvation counters using bpftrace:

    # bpftrace -e 'tracepoint:block:block_rq_tag_wait { @tag_waits[cpu] = count(); }'
    Attaching 1 probe...
    ^C
    @tag_waits[4]: 12
    @tag_waits[12]: 87

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260525005123.722277-1-atomlin@atomlin.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

KVM: SEV: Mark source page dirty when writing back CPUID data on failure

When writing back CPUID data (provided by trusted firmware) to the source
page on failure, mark the page/folio as dirty so that the data isn't lost
in the unlikely scenario the page is reclaimed before its read by
userspace.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-5-3f196bfad5a1@google.com
[sean: use set_page_dirty(), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Unmap local kmaps in LIFO order, per highmem requirements

Per highmem.h, local kernel mappings must be unmapped in the reserve order
they were acquired, following a LIFO (last-in, first-out) stack-based
approach, and that failure to do so "is invalid and causes malfunction".

Swap the kunmap_local() calls in SNP post-populate flow to ensure the
mappings are released in the correct order.

Note, because SNP is 64-bit only, the bugs are benign as there are no
highmem mappings to unwind.

Fixes: 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-4-3f196bfad5a1@google.com
[sean: call out that the bug is benign]
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: SEV: Pin source page for write when adding CPUID data for SNP guest

When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.

Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.

Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Cc: stable@vger.kernel.org
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>

ASoC: SOF: ipc4-topology: Support for multiple src output formats

Peter Ujfalusi <peter.ujfalusi@linux.intel.com> says:

SRC can only change the rate, we can still allow different bit depth and
channels to be handled, the only restriction is that the input and output
must have matching bit depth and channel format.

In a separate patch do a sanity check for the number of formats on the
input and output side as SRC/ASRC must have at least one of them.

Link: https://patch.msgid.link/20260526105748.26149-1-peter.ujfalusi@linux.intel.com

ASoC: SOF: ipc4-topology: Allow the use of multiple formats for src output

The SRC module can only change the rate, it keeps the format and channels
intact, but this does not mean the num_output_formats must be 0:
The SRC module can support different formats/channels, we just need to
check if the output format lists the correct combination of out rate and
the input format/channels.

Change the logic to prioritize the sink_rate of the module as target rate,
then the rate of the FE in case of capture or in case of playback check the
single rate specified in the output formats.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Reviewed-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-3-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

ASoC: SOF: ipc4-topology: Validate the number of in/out formats for src/asrc

SRC and ASRC modules must have at least one input and on one output formats
to be usable.
Do a sanity check during setup type and fail if either the number of input
or output formats are 0.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260526105748.26149-2-peter.ujfalusi@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>

io_uring/zcrx: add shared-memory notification statistics

Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.

Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size / notif_stats_alignment),
then provides a stats_offset in zcrx_notification_desc pointing to a
location within the refill queue region.

The kernel updates the stats counters in-place on every copy-fallback
event.

Signed-off-by: Clément Léger <cleger@meta.com>
[pavel: rename io_uring_zcrx_notif_stats]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/f6af5a21015efea4b733b9d77aba22c637788fe4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user on frag copy fallback

Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.

Signed-off-by: Clément Léger <cleger@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/3d54bcd8bf10b3a1e88beb0cd39c40c3937bea4f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: notify user when out of buffers

There are currently no easy ways for the user to know if zcrx is out of
buffers and page pool fails to allocate. Add uapi for zcrx to communicate
it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and lower bits of cqe->res will be set to the event
mask. Before the kernel could post another notification of the given
type, the user needs to acknowledge that it processed the previous one
by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/35cd307a03a43583838a2e151fc641c69abd786f.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: add ctx pointer to zcrx

zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
Link: https://patch.msgid.link/b60514b3d1bd92f571e3bd91751166f8c3599256.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: reorder fd allocation in zcrx_export()

Currently, zcrx_export() allocates a file descriptor and copies the
control structure to userspace before the backing file is created.

While the operation returns an error on failure, it is cleaner to
follow the standard kernel pattern of performing the copy_to_user()
and fd_install() only after all resource allocations (like the
anon_inode) have succeeded. This aligns the code with other
fd-publishing paths in the VFS.

Signed-off-by: Bertie Tryner <Bertie.Tryner@warwick.ac.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/1513a3f4ae7161692ca6e991b9f01278a6bc60e4.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: remove extra ifq close

By the time io_zcrx_ifq_free() is called the interface queue should
already be closed, so io_close_queue() will be a no-op. Remove the call
and add a couple of warnings.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/be6c4a283a5bab5440e22fbccafe7b885acb7abc.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: poison pointers on unregistration

Nobody should be touching area and other pointers after zcrx
destruction, poison them instead of zeroing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/19112d1412539dcfc04a0317b5812e968623bc51.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/zcrx: make scrubbing more reliable

Currently, scrubbing is done once before killing all recvzc requests.
It's fine as those are cancelled and don't return buffers afterwards,
but it'll be more reliable not to rely that much on cancellations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://patch.msgid.link/c4ea127023494cbbedebd21a2b7ae5ff0448eb95.1779189667.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

wifi: ath12k: fix error unwind on arch_init() failure in PCI probe

When arch_init() fails in ath12k_pci_probe(), the code jumps to
err_pci_msi_free, leaking resources in teardown.

Redirect the failure path to err_free_irq so teardown matches the setup order.

Compile-tested only.

Fixes: 614c23e24ee8 ("wifi: ath12k: Support arch-specific DP device allocation")
Signed-off-by: Ripan Deuri <ripan.deuri@oss.qualcomm.com>
Reviewed-by: Rameshkumar Sundaram <rameshkumar.sundaram@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Link: https://patch.msgid.link/20260519192815.3911324-1-ripan.deuri@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>

block: partitions: fix of_node refcount leak in of_partition()

of_partition() calls of_node_get() on the parent device node at the
beginning of the function, storing the reference in 'partitions_np'.
This reference is leaked in two paths:

1. The compatibility check at the top of the function returns 0
   without releasing partitions_np when the node exists but is not
   "fixed-partitions" compatible.

2. The function returns 1 at the end after successfully processing
   all partitions without releasing partitions_np.

Fix both leaks by adding of_node_put(partitions_np) on each path.

Fixes: 2e3a191e89f9 ("block: add support for partition table defined in OF")
Cc: stable@vger.kernel.org
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev>
Link: https://patch.msgid.link/20260526102124.2283846-1-vulab@iscas.ac.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ASoC: cs35l56-shared-test: Fix possible null pointer dereference

The struct regmap_config is dereferenced before its check. Also, after
it is checked priv->reg_offset is assigned to regmap_config->reg_base,
making the removed line redundant.

Detected by Smatch:
sound/soc/codecs/cs35l56-shared-test.c:681 cs35l56_shared_test_case_base_init()
warn: variable dereferenced before check 'regmap_config' (see line 665)

Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260523211522.522616-1-ethantidmore06@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>

mtd: spi-nor: debugfs: Add a locked sectors map

In order to get a very clear view of the sectors being locked, besides
the `params` output giving the ranges, we may want to see a proper map
of the sectors and for each of them, their status. Depending on the use
case, this map may be easier to parse by humans and gives a more acurate
feeling of the situation. At least myself, for the few locking-related
developments I recently went through, I found it very useful to get a
clearer mental model of what was locked/unlocked.

Here is an example of output:

$ cat /sys/kernel/debug/spi-nor/spi0.0/locked-sectors-map
Locked sectors map (x: locked, .: unlocked, unit: 64kiB)
0x00000000 (#    0): ................ ................ ................ ................
0x00400000 (#   64): ................ ................ ................ ................
0x00800000 (#  128): ................ ................ ................ ................
0x00c00000 (#  192): ................ ................ ................ ................
0x01000000 (#  256): ................ ................ ................ ................
0x01400000 (#  320): ................ ................ ................ ................
0x01800000 (#  384): ................ ................ ................ ................
0x01c00000 (#  448): ................ ................ ................ ................
0x02000000 (#  512): ................ ................ ................ ................
0x02400000 (#  576): ................ ................ ................ ................
0x02800000 (#  640): ................ ................ ................ ................
0x02c00000 (#  704): ................ ................ ................ ................
0x03000000 (#  768): ................ ................ ................ ................
0x03400000 (#  832): ................ ................ ................ ................
0x03800000 (#  896): ................ ................ ................ ................
0x03c00000 (#  960): ................ ................ ................ ..............xx

The output is wrapped at 64 sectors, spaces every 16 sectors are
improving the readability, every line starts by the first sector
offset (hex) and number (decimal).

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
[pratyush@kernel.org: split the debugfs_create_file() into two lines]
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
  post-7.1 issues or aren't considered suitable for backporting.

  All patches are singletons - please see the individual changelogs for
  details"

* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Revert "mm: introduce a new page type for page pool in page type"
  mm/vmalloc: do not trigger BUG() on BH disabled context
  MAINTAINERS, mailmap: change email for Eugen Hristev
  mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
  kernel/fork: validate exit_signal in kernel_clone()
  mm: memcontrol: propagate NMI slab stats to memcg vmstats
  mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
  mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
  zram: fix use-after-free in zram_writeback_endio
  memfd: deny writeable mappings when implying SEAL_WRITE
  ipc: limit next_id allocation to the valid ID range
  Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
  MAINTAINERS: .mailmap: update after GEHC spin-off

mtd: spi-nor: debugfs: Add locking support

The ioctl output may be counter intuitive in some cases. Asking for a
"locked status" over a region that is only partially locked will return
"unlocked" whereas in practice maybe the biggest part is actually
locked.

Knowing what is the real software locking state through debugfs would be
very convenient for development/debugging purposes, hence this proposal
for adding an extra block at the end of the file: a "locked sectors"
array which lists every section, if it is locked or not, showing both
the address ranges and the sizes in numbers of "lock sectors" (which on
small density devices is typically different than erase blocks).

Here is an example of output, what is after the "sector map" is new.

$ cat /sys/kernel/debug/spi-nor/spi0.0/params
name (null)
id ef a0 20 00 00 00
size 64.0 MiB
write size 1
page size 256
address nbytes 4
flags HAS_SR_TB | 4B_OPCODES | HAS_4BAIT | HAS_LOCK | HAS_16BIT_SR | HAS_SR_TB_BIT6 | HAS_4BIT_BP | SOFT_RESET | NO_WP

opcodes
read 0xec
  dummy cycles 6
erase 0xdc
program 0x34
8D extension none

protocols
read 1S-4S-4S
write 1S-1S-4S
register 1S-1S-1S

erase commands
21 (4.00 KiB) [1]
dc (64.0 KiB) [3]
c7 (64.0 MiB)

sector map
region (in hex)   | erase mask | overlaid
------------------+------------+---------
00000000-03ffffff |     [   3] | no

locked sectors
region (in hex)   | status   | #sectors
------------------+----------+---------
00000000-03ffffff | unlocked | 1024

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Create a local SR cache

In order to be able to generate debugfs output without having to
actually reach the flash, create a SPI NOR local cache of the status
registers. What matters in our case are all the bits related to sector
locking. As such, in order to make it clear that this cache is not
intended to be used anywhere else, we zero the irrelevant bits.

The cache is initialized once during the early init, and then maintained
every time the write protection scheme is updated.

Suggested-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Cosmetic changes

As a final preparation step for the introduction of CMP support, make
a few more cosmetic changes to simplify the reading of the diff when
adding the CMP feature. In particular, define "min_prot_len" earlier as
it will be reused and move the definition of the "ret" variable at the
end of the stack just because it looks better.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Simplify checking the locked/unlocked range

In both the locking/unlocking steps, at the end we verify whether we do
not lock/unlock more than requested (in which case an error must be
returned).

While being possible to do that with very simple mask comparisons, it
does not scale when adding extra locking features such as the CMP
possibility. In order to make these checks slightly easier to read and
more future proof, use existing helpers to read the (future) status
register, extract the covered range, and compare it with very usual
algebric comparisons.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create helpers for building the SR register

The status register contains 3 or 4 BP (Block Protect) bits, 0 or 1
TB (Top/Bottom) bit, soon 0 or 1 CMP (Complement) bit. The last BP bit
and the TB bit locations change between vendors. The whole logic of
buildling the content of the status register based on some input
conditions is used two times and soon will be used 4 times.

Create dedicated helpers for these steps.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create a TB intermediate variable

Ease the future reuse of the tb (Top/Bottom) boolean by creating an
intermediate variable.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Rename a mask

"mask" is not very descriptive when we already manipulate two masks, and
soon will manipulate three. Rename it "bp_mask" to align with the
existing "tb_mask" and soon "cmp_mask".

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Create a helper that writes SR, CR and checks

There are many helpers already to either read and/or write SR and/or CR,
as well as sometimes check the returned values. In order to be able to
switch from a 1 byte status register to a 2 bytes status register while
keeping the same level of verification, let's introduce a new helper
that writes them both (atomically) and then reads them back (separated)
to compare the values.

In case 2 bytes registers are not supported, we still have the usual
fallback available in the helper being exported to the rest of the core.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Use a pointer for SR instead of a single byte

At this stage, the Status Register is most often seen as a single
byte. This is subject to change when we will need to read the CMP bit
which is located in the Control Register (kind of secondary status
register). Both will need to be carried.

Change a few prototypes to carry a u8 pointer. This way it also makes it
very clear where we access the first register, and where we will access
the second.

There is no functional change.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Clarify a comment

The comment states that some power of two sizes are not supported. This
is very device dependent (based on the size), so modulate a bit the
sentence to make it more accurate.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Explain the MEMLOCK ioctl implementation behaviour

Add more details about how these requests are actually handled in the
SPI NOR core. Their behaviour was not entirely clear to me at first, and
explaining them in plain English sounds the way to go.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Enhance output

Align the number of dashes to the bigger column width (the title in this
case) to make the output more pleasant and aligned with what is done
in the "params" file output.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Align variable access with the rest of the file

The "params" variable is used everywhere else, align this particular
line of the file to use "params" directly rather than the "nor" pointer.

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Improve opcodes documentation

There are two status registers, named 1 and 2. The current wording is
misleading as "1" may refer to the status register ID as well as the
number of bytes required (which, in this case can be 1 or 2).

Clarify the comments by aligning them on the same pattern:
"{read,write} status {1,2} register"

Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Make sure the QE bit is kept enabled if useful

Not all chips implement the 4BAIT table which typically indicates the
program capability, while many of them do implement the relevant SFDP
parts indicating the read capabilities. In such a situation, programs
can happen in single mode (1-1-1) and reads in quad mode (1-1-4 or
1-4-4). For the reads to work in such condition, the QE bit must be set.
In case we later use the spi_nor_write_16bit_sr_and_check() helper with
a chip with such configuration, the QE bit would get incorrectly
cleared.

Make sure this doesn't happen by keeping the QE bit under a simpler
condition:
- the quad enable hook is there (no change)
- and at least one of the two protocols is based on quad I/O cycles

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: Drop duplicate Kconfig dependency

I do not think the MTD dependency is needed twice. This is likely a
duplicate coming from a former rebase when the spi-nor core got cleaned
up a while ago. Remove the extra line.

Fixes: b35b9a10362d ("mtd: spi-nor: Move m25p80 code in spi-nor.c")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: swp: Improve locking user experience

In the case of the first block being locked (or the few first blocks),
if the user want to fully unlock the device it has two possibilities:
- either it asks to unlock the entire device, and this works;
- or it asks to unlock just the block(s) that are currently locked,
which fails.

It fails because the conditions "can_be_top" and "can_be_bottom" are
true. Indeed, in this case, we unlock everything, so the TB bit does not
matter. However in the current implementation, use_top would be true (as
this is the favourite option) and lock_len, which in practice should be
reduced down to 0, is set to "nor->params->size - (ofs + len)" which is
a positive number. This is wrong.

An easy way is to simply add an extra condition. In the unlock() path,
if we can achieve the same result from both sides, it means we unlock
everything and lock_len must simply be 0. A comment is added to clarify
that logic.

Fixes: 3dd8012a8eeb ("mtd: spi-nor: add TB (Top/Bottom) protect support")
Cc: stable@kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

mtd: spi-nor: debugfs: Fix the flags list

As mentioned above the spi_nor_option_flags enumeration in core.h, this
list should be kept in sync with the one in the core.

Add the missing flag.

Fixes: 6a42bc97ccda ("mtd: spi-nor: core: Allow specifying the byte order in Octal DTR mode")
Reviewed-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>

Merge branch 'ethtool-module-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: module: fix a handful of small bugs

I've been poking at the locking in ethtool and it appears
that the FW flashing is not currently taking the ops lock.
Existing drivers which implement module FW flashing seem
to have their own locking, so this series doesn't actually
add the ops lock (I'll add it in net-next). But a number
of other errors have been surfaced in the process.
====================

Link: https://patch.msgid.link/20260522231312.1710836-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate fw->size against start_cmd_payload_size

cmis_fw_update_start_download() copies start_cmd_payload_size bytes
from the firmware blob into the CDB LPL vendor_data[] payload without
validating that the FW has enough data.

Since the start_cmd_payload_size can only be ~120B an image too short
is most likely corrupted, so reject it.

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: validate start_cmd_payload_size from module

The CMIS firmware update code reads start_cmd_payload_size from
the module's FW Management Features CDB reply and uses it directly
as the byte count for memcpy. The destination buffer is 112 bytes
(ETHTOOL_CMIS_CDB_LPL_MAX_PL_LENGTH - 8). So a malicious
module (or corrupted response) can cause a OOB write later on in
cmis_fw_update_start_download().

Let's error out. If modules that expect longer LPL writes actually
exist we should revisit.

struct cmis_cdb_start_fw_download_pl's definition has to move,
no change there.

Fixes: c4f78134d45c ("ethtool: cmis_fw_update: add a layer for supporting firmware update using CDB")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: fix u16-to-u8 truncation of msleep_pre_rpl

ethtool_cmis_cdb_compose_args() accepts msleep_pre_rpl as u16 but stores
it into the u8 field ethtool_cmis_cdb_cmd_args::msleep_pre_rpl, silently
truncating values >= 256. Seven of the nine call sites pass 1000 ms
(it's the third argument from the end).

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: cmis: require exact CDB reply length

Malicious SFP module could respond with rpl_len longer than
what cmis_cdb_process_reply() expected, leading to OOB writes.
Malicious HW is a bit theoretical but some modules may just
be buggy and/or the reads may occasionally get corrupted,
so let's protect the kernel.

The existing check protects from short replies. We need to
protect from long ones, too. All callers that pass a non-zero
rpl_exp_len cast the reply payload to a fixed-layout struct
and read fields at fixed offsets, with no version negotiation
or short-reply handling:

  - cmis_cdb_validate_password()
  - cmis_cdb_module_features_get()
  - cmis_fw_update_fw_mng_features_get()

so let's assume that responses longer than expected do not
have to be handled gracefully here. Add a warning message
to make the debug easier in case my understanding is wrong...

Note that page_data->length (argument of kmalloc) comes from
last arg to ethtool_cmis_page_init() which is rpl_exp_len.

Note2 that AIs also like to point out overflows in args->req.payload
itself (which is a fixed-size 120 B buffer, on the stack),
but callers should be reading structs defined by the standard,
so protecting from requests for more data than max seem like
defensive programming.

Fixes: a39c84d79625 ("ethtool: cmis_cdb: Add a layer for supporting CDB commands")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: fix cleanup if socket used for flashing multiple devices

When a single Netlink socket issues MODULE_FW_FLASH_ACT against multiple
devices, ethnl_sock_priv_set() overwrites sk_priv->dev on each call,
retaining only the last one. The socket priv is used on socket close,
to walk the global work list and mark the uncompleted flashing work
as "orphaned". Otherwise if another socket reuses the PID it will
unexpectedly receive the flashing notifications.

Don't record the device, record net pointer instead. The purpose of
the dev is to scope the work to a netns, anyway. If we store netns
the overrides are safe/a nop since all flashed devices must be in
the same netns as the socket.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: check fw_flash_in_progress under rtnl_lock

ethnl_set_module_validate() inspects module_fw_flash_in_progress
but validate is meant for _input_ validation, not state validation.
rtnl_lock is not held, yet. Move the check into ethnl_set_module().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid racy updates to dev->ethtool bitfield

When reviewing other changes Gemini points out that we currently
update module_fw_flash_in_progress without holding any locks.
Since module_fw_flash_in_progress is part of a bitfield this
is not great, updates to other fields may be lost.

We could use a bool and sprinkle some READ_ONCE/WRITE_ONCE here
but seems like the issue is rather than the work is an unusual
writer. The other writers already hold the right locks. So just
very briefly take these locks when the work completes.

Note that nothing ever cancels the FW update work, so there's
no concern with deadlocks vs cancel.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: avoid leaking a netdev ref on module flash errors

module_flash_fw_schedule() is missing undo for setting
the "in_progress" flag and taking the netdev reference.
Delay taking these, the device can't disappear while
we are holding rtnl_lock.

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: module: call ethnl_ops_complete() on module flash errors

When validate() fails we are skipping over ethnl_ops_complete()
even tho we already called ethnl_ops_begin().

Fixes: 32b4c8b53ee7 ("ethtool: Add ability to flash transceiver modules' firmware")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Danielle Ratson <danieller@nvidia.com>
Link: https://patch.msgid.link/20260522231312.1710836-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ethtool-rss-fix-a-handful-of-small-bugs'

Jakub Kicinski says:

====================
ethtool: rss: fix a handful of small bugs

Fix a handful of small bugs in the ethtool Netlink RSS code.
====================

Link: https://patch.msgid.link/20260522230647.1705600-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid device context leak on reply-build failure

We wait with filling the reply for new RSS context creation
until after the driver ->create_rxfh_context call. The driver
needs to fill some of the defaults in the context. The failure
of rss_fill_reply() is somewhat theoretical, but doesn't take
much effort to handle it properly. Call ->remove_rxfh_context().

If the driver's remove callback fails (some implementations like sfc
can return real command errors from firmware RPCs) - skip the xa_erase
and kfree, leaving the context in the xarray. This matches how
ethnl_rss_delete_doit() behaves.

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix hkey leak when indir_size is 0

rss_get_data_alloc() allocates a single buffer that backs both the
indirection table and the hash key, but only assigned data->indir_table
when indir_size was nonzero. The expectation was that no driver
implements RSS without supporting indirection table but apparently
enic does just that (it's the only such in-tree driver).
enic has get_rxfh_key_size but no get_rxfh_indir_size.
data->indir_table stays as NULL, hkey gets set but rss_get_data_free()
kfree(data->indir_table) is a nop and the allocation leaks.

Always store the allocation base in data->indir_table so the free path
is unambiguous. No caller treats indir_table as a sentinel; everything
keys off indir_size.

Fixes: 7112a04664bf ("ethtool: add netlink based get rss support")
Link: https://patch.msgid.link/20260522230647.1705600-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix indir_table and hkey leak on get_rxfh failure

rss_prepare_get() allocates the indirection table and hash key buffer
via rss_get_data_alloc(), then calls ops->get_rxfh() to populate them.
If get_rxfh() fails, the function returns an error without freeing
the allocation.

Fixes: 4f038a6a02d2 ("net: ethtool: Don't call .cleanup_data when prepare_data fails")
Link: https://patch.msgid.link/20260522230647.1705600-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: fix falsely ignoring indir table updates

rss_set_prep_indir() compares the new indirection table against the
current one to determine whether any update is needed. The memcmp
call passes data->indir_size as the length argument, but indir_size
is the number of u32 entries, not the byte count.

Fixes: c0ae03588bbb ("ethtool: rss: initial RSS_SET (indirection table handling)")
Link: https://patch.msgid.link/20260522230647.1705600-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: add missing errno on RSS context delete

Remember to set ret before jumping out if someone tries
to delete a context on a device which doesn't support
contexts.

Fixes: fbe09277fa63 ("ethtool: rss: support removing contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: rss: avoid modifying the RSS context response

Gemini says that we're modifying the RSS_CREATE response skb.
I think it's right, the comment says that unicast() should
unshare the skb but I'm not entirely sure what I meant there.
netlink_trim() does a copy but only if skb is not well sized
(it's at least 2x larger than necessary for the payload).

Fixes: a166ab7816c5 ("ethtool: rss: support creating contexts via Netlink")
Link: https://patch.msgid.link/20260522230647.1705600-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

KVM: arm64: Add fail-safe for refcounted pages in __pkvm_hyp_donate_host

A previous bug in __pkvm_init_vm error path showed that the hypervisor
could leak refcounted pages, (i.e. losing access to a page while its
refcount is still elevated). This poses a threat to the pKVM state
machine.

Address this by introducing a fail-safe in __pkvm_hyp_donate_host.
Transitions are not a hot path so added security is worth the extra
check.

Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-4-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Fix __pkvm_init_vm error path

In the unlikely case where insert_vm_table_entry fails, __pkvm_init_vm
release the memory donated by the host for the PGD, but as the stage-2
is still set-up the hypervisor keeps a refcount on those pages,
effectively leaking the references.

Fix the rollback with the newly added kvm_guest_destroy_stage2().

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-3-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

KVM: arm64: Reset page order in pKVM hyp_pool

When a VM fails to initialise after its stage-2 hyp_pool has been
initialised, that stage-2 must be torn down entirely. This requires
resetting both the refcount and the order of its pages back to 0.

Currently, reclaim_pgtable_pages() implicitly resets the page order by
allocating the entire pool with order-0 granularity. However, in the VM
initialisation error path, the addresses of the donated memory (the PGD)
are already known, making it unnecessary to iterate over all pages in
the pool.

Since the vmemmap page order is a hyp_pool-specific field, leaving a
non-zero order on hyp_pool destruction is harmless until another pool
attempts to admit the page. Instead of resetting this field during
destruction, reset it during pool initialization in hyp_pool_init().

For 'external' pages, we can't trust the order either as they bypass
hyp_pool_init(). Since we never coalesce them, enforce order-0 to ensure
safe insertion into the pool.

This leaves no vmemmap order users outside of hyp_pool.

Fixes: 256b4668cd89 ("KVM: arm64: Introduce separate hypercalls for pKVM VM reservation and initialization")
Reported-by: Sashiko <sashiko-bot@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260521143626.1005660-2-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>

cpufreq/amd-pstate: drop stale @epp_cached kdoc

Commit 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP") removed
the epp_cached field from struct amd_cpudata in favour of always
reading from cppc_req_cached, but the kdoc above the struct still
documents @epp_cached.

Drop the now-stale @epp_cached entry.

Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Fixes: 4e16c1175238 ("cpufreq/amd-pstate: Stop caching EPP")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Link: https://lore.kernel.org/r/20260526022131.1302373-1-zhanxusheng@xiaomi.com
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

arm64: tegra: Enable PCIe for Jetson AGX Thor

Enable the PCIe controllers on the Jetson AGX Thor Developer Kit that
are used for ethernet and NVMe.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

arm64: tegra: Fix address of Tegra264 main GPIO controller

The 64-bit address of the main GPIO controller on Tegra264 is
0x810c300000. The main GPIO controller was incorrectly added under the
bus@0 node instead of the bus@8100000000 node breaking the boot on
Tegra264. Fix this by moving to main GPIO controller node under
bus@8100000000.

Fixes: c70e6bc11d20 ("arm64: tegra: Add Tegra264 GPIO controllers")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: socfpga: Fix OF node refcount leak in SMP setup

socfpga_smp_prepare_cpus() looks up the Cortex-A9 SCU node with
of_find_compatible_node(), which returns a node reference that must be
released with of_node_put().

The function maps the SCU registers and then returns without dropping
that reference, leaking the node on both the success path and the
of_iomap() failure path.

Drop the reference once the mapping attempt is complete. The returned
MMIO mapping does not depend on keeping the device node reference held.

Fixes: 122694a0c712 ("ARM: socfpga: use of_iomap to map the SCU")
Cc: stable@vger.kernel.org
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>

driver core: Guard deferred probe timeout extension with delayed_work_pending()

mod_delayed_work() unconditionally queues the work even when it wasn't
previously pending, which can fire the timeout prematurely or restart it
after it already fired. Add a delayed_work_pending() guard to restore
the originally intended semantics.

Premature firing calls fw_devlink_drivers_done() before all built-in
drivers have registered, causing fw_devlink to prematurely relax device
links for suppliers whose drivers haven't loaded yet.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-2-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

driver core: Fix missing jiffies conversion in deferred_probe_extend_timeout()

mod_delayed_work() takes jiffies, not seconds. Thus, restore the dropped
conversion.

While at it, fix incorrect indentation.

Fixes: 1137838865bf ("driver core: Use mod_delayed_work to prevent lost deferred probe work")
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525012340.3860581-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

PCI: qcom: Program T_POWER_ON

Some platforms have incorrect T_POWER_ON value programmed in hardware.
Generally these will be corrected by bootloaders, but not all targets
support bootloaders to program correct values. That means the
LTR_L1.2_THRESHOLD value calculated by aspm.c can be wrong, which can
result in improper L1.2 exit behavior. If AER happens to be supported and
enabled, the error may be *reported* via AER.

Parse the 't-power-on-us' property from each Root Port node and program it
as part of host initialization using dw_pcie_program_t_power_on() before
link training.

This property in added to the dtschema here [1].

Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
[mani: reworded comment]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link[1]: https://lore.kernel.org/all/20260205093346.667898-1-krishna.chundru@oss.qualcomm.com/
Link: https://patch.msgid.link/20260428-t_power_on_fux-v5-3-f1ef926a91ff@oss.qualcomm.com

PCI: dwc: Add dw_pcie_program_t_power_on() to program T_POWER_ON

The T_POWER_ON indicates the time (in μs) that a Port requires the port on
the opposite side of Link to wait in L1.2.Exit after sampling CLKREQ#
asserted before actively driving the interface. This value is used by the
ASPM driver to compute the LTR_L1.2_THRESHOLD.

Currently, some controllers expose T_POWER_ON value of zero in the L1SS
capability registers, leading to incorrect LTR_L1.2_THRESHOLD calculations,
which can result in improper L1.2 exit behavior and if AER happens to be
supported and enabled, the error may be *reported* via AER.

Add a helper to override T_POWER_ON value by the DWC controller drivers.

Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
[mani: changed t_power_on to u32]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Shawn Lin <shawn.lin@rock-chips.com>
Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com>
Link: https://patch.msgid.link/20260428-t_power_on_fux-v5-2-f1ef926a91ff@oss.qualcomm.com

PCI/ASPM: Add pcie_encode_t_power_on() helper to encode L1SS T_POWER_ON fields

Add pcie_encode_t_power_on() to encode the PCIe L1 PM Substates T_POWER_ON
parameter into the T_POWER_ON Scale and T_POWER_ON Value fields.

This helper can be used by the controller drivers to change the
default/wrong value of T_POWER_ON in L1SS capability register to avoid
incorrect calculation of LTR_L1.2_THRESHOLD value.

The helper converts a T_POWER_ON time specified in microseconds into the
appropriate scale/value encoding defined by PCIe r7.0, sec 7.8.3.2. Values
that exceed the maximum encodable range are clamped to the largest
representable encoding.

Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
[mani: changed t_power_on_us to u32, added helper name to subject]
Signed-off-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Shawn Lin <shawn.lin@rock-chips.com>
Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20260428-t_power_on_fux-v5-1-f1ef926a91ff@oss.qualcomm.com