git.ipfire.org Git - thirdparty/linux.git/log

HID: logitech-hidpp: Check bounds when deleting force-feedback effects

Without this bounds check, this might otherwise overwrite index -1.

Triggering this condition requires action both from the USB device and from
userspace, which reduces the scenarios in which it can be exploited.

Cc: Lee Jones <lee@kernel.org>
Signed-off-by: Günther Noack <gnoack@google.com>
Reviewed-by: Lee Jones <lee@kernel.org>
Signed-off-by: Jiri Kosina <jkosina@suse.com>

IB/core: Fix zero dmac race in neighbor resolution

dst_fetch_ha() checks nud_state without holding the neighbor lock, then
copies ha under the seqlock. A race in __neigh_update() where nud_state
is set to NUD_REACHABLE before ha is written allows dst_fetch_ha() to
read a zero MAC address while the seqlock reports no concurrent writer.

netevent_callback amplifies this by waking ALL pending addr_req workers
when ANY neighbor becomes NUD_VALID. At scale (N peers resolving ARP
concurrently), the hit probability scales as N^2, making it near-certain
for large RDMA workloads.

N(A): neigh_update(A)                   W(A): addr_resolve(A)
|                                       [sleep]
| write_lock_bh(&A->lock)               |
| A->nud_state = NUD_REACHABLE          |
| // A->ha is still 0                   |
|                                       [woken by netevent_cb() of
|                                         another neighbour]
|                                       | dst_fetch_ha(A)
|                                       |   A->nud_state & NUD_VALID
|                                       |   read_seqbegin(&A->ha_lock)
|                                       |   snapshot = A->ha  /* 0 */
|                                       |   read_seqretry(&A->ha_lock)
|                                       |   return snapshot
| seqlock(&A->ha_lock)
| A->ha = mac_A     /* too late */
| sequnlock(&A->ha_lock)
| write_unlock_bh(&A->lock)

The incorrect/zero mac is read and programmed in the device QP while it
was not yet updated. This causes silent packet loss and eventual
RETRY_EXC_ERR.

Fix by holding the neighbor read lock across the nud_state check and
ha copy in dst_fetch_ha(), ensuring it synchronizes with
__neigh_update() which is updating while holding the write lock.

Cc: stable@vger.kernel.org
Fixes: 92ebb6a0a13a ("IB/cm: Remove now useless rcu_lock in dst_fetch_ha")
Link: https://patch.msgid.link/r/20260405-fix-dmac-race-v1-1-cfa1ec2ce54a@nvidia.com
Signed-off-by: Chen Zhao <chezhao@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

efi: Tag memblock reservations of boot services regions as RSRV_KERN

By definition, EFI memory regions of type boot services code or data
have no special significance to the firmware at runtime, only to the OS.
In some cases, the firmware will allocate tables and other assets that
are passed in memory in regions of this type, and leave it up to the OS
to decide whether or not to treat the allocation as special, or simply
consume the contents at boot and recycle the RAM for ordinary use. The
reason for this approach is that it avoids needless memory reservations
for assets that the OS knows nothing about, and therefore doesn't know
how to free either.

This means that any memblock reservations covering such regions can be
marked as MEMBLOCK_RSRV_KERN - this is a better match semantically, and
is useful on x86 to distinguish true reservations from temporary
reservations that are only needed to work around firmware bugs.

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

memblock: Permit existing reserved regions to be marked RSRV_KERN

Permit existing memblock reservations to be marked as RSRV_KERN. This
will be used by the EFI code on x86 to distinguish between reservations
of boot services data regions that have actual significance to the
kernel and regions that are reserved temporarily to work around buggy
firmware.

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

jbd2: store jinode dirty range in PAGE_SIZE units

jbd2_inode fields are updated under journal->j_list_lock, but some paths
read them without holding the lock (e.g. fast commit helpers and ordered
truncate helpers).

READ_ONCE() alone is not sufficient for the dirty range fields when they
are stored as loff_t because 32-bit platforms can observe torn loads.
Store the dirty range in PAGE_SIZE units as pgoff_t instead.

Represent the dirty range end as an exclusive end page. This avoids a
special sentinel value and keeps MAX_LFS_FILESIZE on 32-bit representable.

Publish a new dirty range by updating end_page before start_page, and
treat start_page >= end_page as empty in the accessor for robustness.

Use READ_ONCE() on the read side and WRITE_ONCE() on the write side for the
dirty range and i_flags to match the existing lockless access pattern.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20260306085643.465275-5-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ocfs2: use jbd2 jinode dirty range accessor

ocfs2 journal commit callback reads jbd2_inode dirty range fields without
holding journal->j_list_lock.
Use jbd2_jinode_get_dirty_range() to get the range in bytes.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20260306085643.465275-4-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: use jbd2 jinode dirty range accessor

ext4 journal commit callbacks access jbd2_inode dirty range fields without
holding journal->j_list_lock.
Use jbd2_jinode_get_dirty_range() to get the range in bytes, and read
i_transaction with READ_ONCE() in the redirty check.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20260306085643.465275-3-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

jbd2: add jinode dirty range accessors

Provide a helper to fetch jinode dirty ranges in bytes. This lets
filesystem callbacks avoid depending on the internal representation,
preparing for a later conversion to page units.

Suggested-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Li Chen <me@linux.beauty>
Link: https://patch.msgid.link/20260306085643.465275-2-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

tracing: Documentation: Update histogram-design.rst for fn() handling

The histogram documentation describes the old method of the histogram
triggers using the fn() field of the histogram field structure to process
the field. But due to Spectre mitigation, the function pointer to handle
the fields at runtime caused a noticeable overhead. It was converted over
to a fn_num and hist_fn_call() is now used to call the specific functions
for the fields via a switch statement based on the field's fn_num value.

Update the documentation to reflect this change.

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260126181742.03e8f0d5@gandalf.local.home>

jbd2: gracefully abort on transaction state corruptions

Auditing the jbd2 codebase reveals several legacy J_ASSERT calls
that enforce internal state machine invariants (e.g., verifying
jh->b_transaction or jh->b_next_transaction pointers).

When these invariants are broken, the journal is in a corrupted
state. However, triggering a fatal panic brings down the entire
system for a localized filesystem error.

This patch targets a specific class of these asserts: those
residing inside functions that natively return integer error codes,
booleans, or error pointers. It replaces the hard J_ASSERTs with
WARN_ON_ONCE to capture the offending stack trace, safely drops
any held locks, gracefully aborts the journal, and returns -EINVAL.

This prevents a catastrophic kernel panic while ensuring the
corrupted journal state is safely contained and upstream callers
(like ext4 or ocfs2) can gracefully handle the aborted handle.

Functions modified in fs/jbd2/transaction.c:
- jbd2__journal_start()
- do_get_write_access()
- jbd2_journal_dirty_metadata()
- jbd2_journal_forget()
- jbd2_journal_try_to_free_buffers()
- jbd2_journal_file_inode()

Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20260304172016.23525-3-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

jbd2: gracefully abort instead of panicking on unlocked buffer

In jbd2_journal_get_create_access(), if the caller passes an unlocked
buffer, the code currently triggers a fatal J_ASSERT.

While an unlocked buffer here is a clear API violation and a bug in the
caller, crashing the entire system is an overly severe response. It brings
down the whole machine for a localized filesystem inconsistency.

Replace the J_ASSERT with a WARN_ON_ONCE to capture the offending caller's
stack trace, and return an error (-EINVAL). This allows the journal to
gracefully abort the transaction, protecting data integrity without
causing a kernel panic.

Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20260304172016.23525-2-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

docs: sysctl: Add documentation for /proc/sys/xen/

Add documentation for the Xen hypervisor sysctl controls in
/proc/sys/xen/balloon/.

Documents the hotplug_unpopulated tunable (available when
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is enabled) which controls
whether unpopulated memory regions are automatically hotplugged
when the Xen balloon driver needs to reclaim memory.

The documentation is based on source code analysis of
drivers/xen/balloon.c.

Signed-off-by: Shubham Chakraborty <chakrabortyshubham66@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260304150419.16738-1-chakrabortyshubham66@gmail.com>

ext4: simplify mballoc preallocation size rounding for small files

The if-else ladder in ext4_mb_normalize_request() manually rounds up
the preallocation size to the next power of two for files up to 1MB,
enumerating each step from 16KB to 1MB individually. Replace this with
a single roundup_pow_of_two() call clamped to a 16KB minimum, which
is functionally equivalent but much more concise.

Also replace raw byte constants with SZ_1M and SZ_16K from
<linux/sizes.h> for clarity, and remove the stale "XXX: should this
table be tunable?" comment that has been there since the original
mballoc code.

No functional change.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
Link: https://patch.msgid.link/tencent_E9C5F1B2E9939B3037501FD04A7E9CF0C407@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Docs: hid: intel-ish-hid: make long URL usable

The '\' line continuation character in this long URL
doesn't help anything. There is no documentation tooling that
handles the line continuation character to join the 2 lines
to make a usable URL. Web browsers terminate the URL just
before the '\' character so that the second line of the URL
is lost. See:
https://docs.kernel.org/hid/intel-ish-hid.html

Join the 2 lines together so that the URL is usable.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260321230934.435020-1-rdunlap@infradead.org>

ext4/move_extent: use folio_next_pos()

A series of patches such as commit 60a70e61430b ("mm: Use
folio_next_pos()") replace folio_pos() + folio_size() by
folio_next_pos(). The former performs x << z + y << z while
the latter performs (x + y) << z, which is slightly more
efficient. This case was not taken into account, perhaps
because the argument is not named folio.

The change was performed using the following Coccinelle
semantic patch:

@@
expression folio;
@@

- folio_pos(folio) + folio_size(folio)
+ folio_next_pos(folio)

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260222125049.1309075-1-Julia.Lawall@inria.fr
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ALSA: hda/alc269: Drop superfluous GPIO write at resume

alc269_resume() has an extra code to write GPIO data, but this is
basically already done in the standard alc_init(), hence it's
superfluous. Let's drop the code.

Since all external callers of alc_write_gpio_data() are gone after
this, fold the only usage of alc_write_gpio_data() into the caller and
drop the export as well.

Link: https://patch.msgid.link/20260409143735.1412134-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: usb-audio: Add quirk flags for Feaulle Rainbow

Feaulle Rainbow is a wired USB-C dynamic in-ear monitor (IEM) featuring
active noise cancellation (ANC).

The supported sample rates are 48000Hz and 96000Hz at 16bit or 24bit,
but it does not support reading the current sample rate and results in
an error message printed to kmsg. Set QUIRK_FLAG_GET_SAMPLE_RATE to skip
the sample rate check.

Its playback mixer reports val = -15360/0/128. Setting -15360 (-60dB)
mutes the playback, so QUIRK_FLAG_MIXER_PLAYBACK_MIN_MUTE is needed.

Add a quirk table entry matching VID/PID=0x0e0b/0xfa01 and applying
the mentioned quirk flags, so that it can work properly.

Quirky device sample:

  usb 7-1: New USB device found, idVendor=0e0b, idProduct=fa01, bcdDevice= 1.00
  usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
  usb 7-1: Product: Feaulle Rainbow
  usb 7-1: Manufacturer: Generic
  usb 7-1: SerialNumber: 20210726905926

Signed-off-by: Rong Zhang <i@rong.moe>
Link: https://patch.msgid.link/20260409-feaulle-rainbow-v1-1-09179e09000d@rong.moe
Signed-off-by: Takashi Iwai <tiwai@suse.de>

Documentation/kernel-parameters: fix architecture alignment for pt, nopt, and nobypass

Commit ab0e7f20768a ("Documentation: Merge x86-specific boot options doc
into kernel-parameters.txt") introduced a formatting regression where
architecture tags were placed on separate lines with broken indentation.
This caused the 'nopt' [X86] parameter to appear as if it belonged to
the [PPC/POWERNV] section.

Furthermore, since the main 'iommu=' parameter heading already specifies
it is for [X86, EARLY], the subsequent standalone [X86] tags for 'pt',
'nopt', and the AMD GART options are redundant and clutter the
documentation.

Clean up the formatting by removing these redundant tags and properly
attributing the 'nobypass' option to [PPC/POWERNV].

Fixes: ab0e7f20768a ("Documentation: Merge x86-specific boot options doc into kernel-parameters.txt")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260330105957.2271-1-lirongqing@baidu.com>

ext4: remove tl argument from ext4_fc_replay_{add,del}_range

Since commit a7ba36bc94f2 ("ext4: fix fast commit alignment issues"),
both ext4_fc_replay_add_range and ext4_fc_replay_del_range get
ex based on 'val' instead of 'tl'.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260121063805.19863-1-guoqing.jiang@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove unused i_fc_wait

i_fc_wait is only initialized in ext4_fc_init_inode() and never used for
waiting or wakeups. Drop it.

Signed-off-by: Li Chen <me@linux.beauty>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260120121941.144192-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

efi/memattr: Fix thinko in table size sanity check

While it is true that each PE/COFF runtime driver in memory can
generally be split into 3 different regions (the header, the code/rodata
region and the data/bss region), each with different permissions, it
does not mean that 3x the size of the memory map is a suitable upper
bound. This is due to the fact that all runtime drivers could be
coalesced into a single EFI runtime code region by the firmware, and if
the firmware does a good job of keeping the fragmentation down, it is
conceivable that the memory attributes table has more entries than the
EFI memory map itself.

So instead, base the sanity check on whether the descriptor size matches
the EFI memory map's descriptor size closely enough (which is not
mandated by the spec but extremely unlikely to differ in practice), and
whether the size of the whole table does not exceed 64k entries.

Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

sched/doc: Update yield_task description in sched-design-CFS

The yield_task description referenced the long-removed compat_yield
sysctl and described the function as a dequeue/enqueue cycle. Update
it to reflect current behavior: yielding the CPU by moving the
current task's position back in the runqueue.

Sync zh_CN and sp_SP translations.

Signed-off-by: fangqiurong <fangqiurong@kylinos.cn>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260403055806.358921-1-user@fqr-pc>

Documentation/rtla: Convert links to RST format

Web links in the documentation are not properly displayed.

In the man pages web links look like:
  Osnoise tracer  documentation:  <  <https://www.kernel.org/doc/html/lat‐
  est/trace/osnoise-tracer.html> >

On web pages the URL caption is the URL itself.

Convert tracer documentation links to RST anonymous hyperlink format
for better rendering. Use newer docs.kernel.org instead of
www.kernel.org/doc/html/latest for brevity.

After the change, the links in the man pages look like:
  Osnoise tracer <https://docs.kernel.org/trace/osnoise-tracer.html>

On web pages the captions are the titles of the links.

Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260405163847.3337981-1-costa.shul@redhat.com>

RDMA/mana_ib: Support memory windows

Implement .alloc_mw() and .dealloc_mw() for mana device.

This is just the basic infrastructure, MW is not practically usable until
additional kernel support for allowing user space to submit MW work
requests is completed.

Link: https://patch.msgid.link/r/20260331090851.2276205-1-kotaranov@linux.microsoft.com
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

docs: fix typos and duplicated words across documentation

Fix the following typos and duplicated words:

- admin-guide/pm/intel-speed-select.rst: "weather" -> "whether"
- core-api/real-time/differences.rst: "the the" -> "the"
- admin-guide/bcache.rst: "to to" -> "to"

Signed-off-by: Manuel Cortez <mdjesuscv@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260406030323.1196-1-mdjesuscv@gmail.com>

nvme-multipath: drop head pointer check in nvme_mpath_clear_current_path()

A NS will always have a head pointer, so drop the check. As proof in
practice, all the nvme_mpath_clear_current_path() callers also
dereference ns->head.

This check has endured since the original changes to support multipath.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvme: add quirk NVME_QUIRK_IGNORE_DEV_SUBNQN for 144d:a808 (Samsung PM981/983/970 EVO Plus )

The firmware for Samsung 970 Evo Plus / PM981 / PM983 does not support SUBNQN.
Make quirks to suppress warnings.

# nvme id-ctrl /dev/nvme1n1
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : ***
mn        : Samsung SSD 970 EVO Plus 500GB
fr        : 2B2QEXM7

mcdqpc    : 0
subnqn    :
ioccsz    : 0

Signed-off-by: Alan Cui <me@alancui.cc>
Signed-off-by: Keith Busch <kbusch@kernel.org>

nvmet-tcp: fix race between ICReq handling and queue teardown

nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.

If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.

If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.

The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.

Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.

Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.

Fixes: c46a6465bac2 ("nvmet-tcp: add NVMe over TCP target driver")
Cc: stable@vger.kernel.org
Reported-by: Shivam Kumar <skumar47@syr.edu>
Tested-by: Shivam Kumar <kumar.shivam43666@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

RDMA/rxe: Validate pad and ICRC before payload_size() in rxe_rcv

rxe_rcv() currently checks only that the incoming packet is at least
header_size(pkt) bytes long before payload_size() is used.

However, payload_size() subtracts both the attacker-controlled BTH pad
field and RXE_ICRC_SIZE from pkt->paylen:

payload_size = pkt->paylen - offset[RXE_PAYLOAD] - bth_pad(pkt)
- RXE_ICRC_SIZE

This means a short packet can still make payload_size() underflow even
if it includes enough bytes for the fixed headers. Simply requiring
header_size(pkt) + RXE_ICRC_SIZE is not sufficient either, because a
packet with a forged non-zero BTH pad can still leave payload_size()
negative and pass an underflowed value to later receive-path users.

Fix this by validating pkt->paylen against the full minimum length
required by payload_size(): header_size(pkt) + bth_pad(pkt) +
RXE_ICRC_SIZE.

Cc: stable@vger.kernel.org
Fixes: 8700e3e7c485 ("Soft RoCE driver")
Link: https://patch.msgid.link/r/20260401121907.1468366-1-hkbinbinbin@gmail.com
Signed-off-by: hkbinbin <hkbinbinbin@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

ext4: unmap invalidated folios from page tables in mpage_release_unused_pages()

When delayed block allocation fails (e.g., due to filesystem corruption
detected in ext4_map_blocks()), the writeback error handler calls
mpage_release_unused_pages(invalidate=true) which invalidates affected
folios by clearing their uptodate flag via folio_clear_uptodate().

However, these folios may still be mapped in process page tables. If a
subsequent operation (such as ftruncate calling ext4_block_truncate_page)
triggers a write fault, the existing page table entry allows access to
the now-invalidated folio. This leads to ext4_page_mkwrite() being called
with a non-uptodate folio, which then gets marked dirty, triggering:

    WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
    __folio_mark_dirty+0x578/0x880

    Call Trace:
     fault_dirty_shared_page+0x16e/0x2d0
     do_wp_page+0x38b/0xd20
     handle_pte_fault+0x1da/0x450

The sequence leading to this warning is:

1. Process writes to mmap'd file, folio becomes uptodate and dirty
2. Writeback begins, but delayed allocation fails due to corruption
3. mpage_release_unused_pages(invalidate=true) is called:
   - block_invalidate_folio() clears dirty flag
   - folio_clear_uptodate() clears uptodate flag
   - But folio remains mapped in page tables
4. Later, ftruncate triggers ext4_block_truncate_page()
5. This causes a write fault on the still-mapped folio
6. ext4_page_mkwrite() is called with folio that is !uptodate
7. block_page_mkwrite() marks buffers dirty
8. fault_dirty_shared_page() tries to mark folio dirty
9. block_dirty_folio() calls __folio_mark_dirty(warn=1)
10. WARNING triggers: WARN_ON_ONCE(warn && !uptodate && !dirty)

Fix this by unmapping folios from page tables before invalidating them
using unmap_mapping_pages(). This ensures that subsequent accesses
trigger new page faults rather than reusing invalidated folios through
stale page table entries.

Note that this results in data loss for any writes to the mmap'd region
that couldn't be written back, but this is expected behavior when
writeback fails due to filesystem corruption. The existing error message
already states "This should not happen!! Data will be lost".

Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Tested-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Link: https://patch.msgid.link/20251205055914.1393799-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

docs: fix typo in zoran driver documentation

Replace "an a few" with "and a few" in
Documentation/driver-api/media/drivers/zoran.rst.

Signed-off-by: Gleb Golovko <gaben123001@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260407212818.925-1-gaben123001@gmail.com>

gpio: swnode: defer probe on references to unregistered software nodes

fwnode_property_get_reference_args() now returns -ENOTCONN when called
on a software node referencing another software node which has not yet
been registered as a firmware node. It makes sense to defer probe in this
situation as the node will most likely be registered later on and we'll
be able to resolve the reference eventually. Change the behavior of
swnode_find_gpio() to return -EPROBE_DEFER if the software node reference
resolution returns -ENOTCONN.

Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260407-swnode-unreg-retcode-v4-2-1b2f0725eb9c@oss.qualcomm.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

RDMA/core: Prefer NLA_NUL_STRING

These attributes are evaluated as c-string (passed to strcmp), but
NLA_STRING doesn't check for the presence of a \0 terminator.

Either this needs to switch to nla_strcmp() and needs to adjust printf fmt
specifier to not use plain %s, or this needs to use NLA_NUL_STRING.

As the code has been this way for long time, it seems to me that userspace
does include the terminating nul, even tough its not enforced so far, and
thus NLA_NUL_STRING use is the simpler solution.

Fixes: 30dc5e63d6a5 ("RDMA/core: Add support for iWARP Port Mapper user space service")
Link: https://patch.msgid.link/r/20260330122742.13315-1-fw@strlen.de
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

sched/cache: Allow the user space to turn on and off cache aware scheduling

Provide a debugfs directory llc_balancing, and a knob named
"enabled" under it to allow the user to turn off and on the
cache aware scheduling at runtime.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/0aa56f7fc48db2f8f700cd1aa34dedd0ec88351b.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Enable cache aware scheduling for multi LLCs NUMA node

Introduce sched_cache_present to enable cache aware scheduling for
multi LLCs NUMA node Cache-aware load balancing should only be
enabled if there are more than 1 LLCs within 1 NUMA node.
sched_cache_present is introduced to indicate whether this
platform supports this topology.

Test results:
The first test platform is a 2 socket Intel Sapphire Rapids with 30
cores per socket. The DRAM interleaving is enabled in the BIOS so it
essential has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.

[TL;DR]
Sappire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of a LLC.
schbench shows limitted wakeup latency improvement.
ChaCha20-xiangshan(risc-v simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.

Genoa:
Significant improvement is observed in hackbench when
the active number of threads is lower than the number
of CPUs within 1 LLC. On v2, Aaron reported improvement
of hackbench/redis when system is underloaded.
ChaCha20-xiangshan shows huge throughput improvement.
Phoronix has tested v1 and shows good improvements in 30+
cases[3]. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.

Detail:
Due to length constraints, data without much difference with
baseline is not presented.

Sapphire Rapids:
[hackbench pipe]
================
case                    load            baseline(std%)  compare%( std%)
threads-pipe-10         1-groups         1.00 (  1.22)  +26.09 (  1.10)
threads-pipe-10         2-groups         1.00 (  4.90)  +22.88 (  0.18)
threads-pipe-10         4-groups         1.00 (  2.07)   +9.00 (  3.49)
threads-pipe-10         8-groups         1.00 (  8.13)   +3.45 (  3.62)
threads-pipe-16         1-groups         1.00 (  2.11)  +26.30 (  0.08)
threads-pipe-16         2-groups         1.00 ( 15.13)   -1.77 ( 11.89)
threads-pipe-16         4-groups         1.00 (  4.37)   +0.58 (  7.99)
threads-pipe-16         8-groups         1.00 (  2.88)   +2.71 (  3.50)
threads-pipe-2          1-groups         1.00 (  9.40)  +22.07 (  0.71)
threads-pipe-2          2-groups         1.00 (  9.99)  +18.01 (  0.95)
threads-pipe-2          4-groups         1.00 (  3.98)  +24.66 (  0.96)
threads-pipe-2          8-groups         1.00 (  7.00)  +21.83 (  0.23)
threads-pipe-20         1-groups         1.00 (  1.03)  +28.84 (  0.21)
threads-pipe-20         2-groups         1.00 (  4.42)  +31.90 (  3.15)
threads-pipe-20         4-groups         1.00 (  9.97)   +4.56 (  1.69)
threads-pipe-20         8-groups         1.00 (  1.87)   +1.25 (  0.74)
threads-pipe-4          1-groups         1.00 (  4.48)  +25.67 (  0.78)
threads-pipe-4          2-groups         1.00 (  9.14)   +4.91 (  2.08)
threads-pipe-4          4-groups         1.00 (  7.68)  +19.36 (  1.53)
threads-pipe-4          8-groups         1.00 ( 10.79)   +7.20 ( 12.20)
threads-pipe-8          1-groups         1.00 (  4.69)  +21.93 (  0.03)
threads-pipe-8          2-groups         1.00 (  1.16)  +25.29 (  0.65)
threads-pipe-8          4-groups         1.00 (  2.23)   -1.27 (  3.62)
threads-pipe-8          8-groups         1.00 (  4.65)   -3.08 (  2.75)

Note: The default number of fd in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example, the number of fd
is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows some improvements when the
system is underload, while it does not bring much difference with
the increasing of system utilization.

99th Wakeup Latencies Base (mean std)      Compare (mean std)   Change
=========================================================================
thread=2                 9.00(0.00)           9.00(1.73)           0.00%
thread=4                 7.33(0.58)           6.33(0.58)           +13.64%
thread=8                 9.00(0.00)           7.67(1.15)           +14.78%
thread=16                8.67(0.58)           8.67(1.53)           0.00%
thread=32                9.00(0.00)           7.00(0.00)           +22.22%
thread=64                9.33(0.58)           9.67(0.58)           -3.64%
thread=128              12.00(0.00)          12.00(0.00)           0.00%

[chacha20 on simulated risc-v]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms

Time reduced by 24%

Genoa:
[hackbench pipe]
The default number of fd is 20, which exceed the number of CPUs
in a LLC. So the fd is adjusted to 2, 4, 6, 8, 20 respectively.
Exclude the result with large run-to-run variance, 10% ~ 50%
improvement is observed when the system is underloaded:

[hackbench pipe]
================
case                    load            baseline(std%)  compare%( std%)
threads-pipe-2          1-groups         1.00 (  2.89)  +47.33 (  1.20)
threads-pipe-2          2-groups         1.00 (  3.88)  +39.82 (  0.61)
threads-pipe-2          4-groups         1.00 (  8.76)   +5.57 ( 13.10)
threads-pipe-20         1-groups         1.00 (  4.61)  +11.72 (  1.06)
threads-pipe-20         2-groups         1.00 (  6.18)  +14.55 (  1.47)
threads-pipe-20         4-groups         1.00 (  2.99)  +10.16 (  4.49)
threads-pipe-4          1-groups         1.00 (  4.23)  +43.70 (  2.14)
threads-pipe-4          2-groups         1.00 (  3.68)   +8.45 (  4.04)
threads-pipe-4          4-groups         1.00 ( 17.72)   +2.42 (  1.14)
threads-pipe-6          1-groups         1.00 (  3.10)   +7.74 (  3.83)
threads-pipe-6          2-groups         1.00 (  3.42)  +14.26 (  4.53)
threads-pipe-6          4-groups         1.00 ( 10.34)  +10.94 (  7.12)
threads-pipe-8          1-groups         1.00 (  4.21)   +9.06 (  4.43)
threads-pipe-8          2-groups         1.00 (  1.88)   +3.74 (  0.58)
threads-pipe-8          4-groups         1.00 (  2.78)  +23.96 (  1.18)

[chacha20 on simulated risc-v]
Host time spent: 54762ms
Host time spent: 28295ms

Time reduced by 48%

Suggested-by: Libo Chen <libchen@purestorage.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/71972e12ab4f08aff422b31e34df09bdbd94de84.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Respect LLC preference in task migration and detach

During load balancing, make can_migrate_task()
consider a task's LLC preference.
Prevent a task from being moved out of its preferred LLC.

During the regular load balancing, if
the task cannot be migrated due to LLC locality, the
nr_balance_failed also should not be increased.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/53da65f3d59de31e1a1dc59a4093d8dd9d4dc206.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Handle moving single tasks to/from their preferred LLC

Cache aware scheduling mainly does two things:
1. Prevent task from migrating out of its preferred LLC if not
nessasary.
2. Migrating task to their preferred LLC if nessasary.

For 1:
In the generic load balance, if the busiest runqueue has only one task,
active balancing may be invoked to move it away. However, this migration
might break LLC locality.

Prevent regular load balance from migrating a task that
prefers the current LLC. The load level and imbalance do not warrant
breaking LLC preference per the can_migrate_llc() policy. Here, the
benefit of LLC locality outweighs the power efficiency gained from
migrating the only runnable task away.

Before migration, check whether the task is running on its preferred
LLC: Do not move a lone task to another LLC if it would move the task
away from its preferred LLC or cause excessive imbalance between LLCs.

For 2:
On the other hand, if the migration type is migrate_llc_task, it means
that there are tasks on the env->src_cpu that want to be migrated to
their preferred LLC, launch the active load balance anyway.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/9b816d8c27fabf2a9c0e1f61a6b90afe8ec4ad52.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Add migrate_llc_task migration type for cache-aware balancing

Introduce a new migration type, migrate_llc_task, to support
cache-aware load balancing.

After identifying the busiest sched_group (having the most tasks
preferring the destination LLC), mark migrations with this type.
During load balancing, each runqueue in the busiest sched_group is
examined, and the runqueue with the highest number of tasks preferring
the destination CPU is selected as the busiest runqueue.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/b9df27c19cc5121ddb2a7d1be7f9d52fec1563dc.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Prioritize tasks preferring destination LLC during balancing

During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others.

Mark source sched groups containing tasks preferring non local LLCs
with the group_llc_balance flag. This ensures the load balancer later
pulls or pushes these tasks toward their preferred LLCs.
The priority of group_llc_balance is lower than that of group_overloaded
and higher than that of all other group types. This is because
group_llc_balance may exacerbate load imbalance, and if the LLC balancing
attempt fails, the nr_balance_failed mechanism will trigger other group
types to rebalance the load.

The load balancer selects the busiest sched_group and migrates tasks
to less busy groups to distribute load across CPUs.

With cache-aware scheduling enabled, the busiest sched_group is
the one with most tasks preferring the destination LLC. If
the group has the llc_balance flag set, cache aware load balancing is
triggered.

Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/baa458f45eab3f602af090c6d6af63dc864f5ec6.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Check local_group only once in update_sg_lb_stats()

There is no need to check the local group twice for both
group_asym_packing and group_smt_balance. Adjust the code
to facilitate future checks for group types (cache-aware
load balancing) as well.

No functional changes are expected.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/99a57865c8ae1847087a5c00e92d24351cf3e5a8.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Count tasks prefering destination LLC in a sched group

During LLC load balancing, tabulate the number of tasks on each runqueue
that prefer the LLC contains the env->dst_cpu in a sched group.

For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
selected as the busiest source to pick tasks from.

Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.

These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/3d8502a33a753c4384b368f97f64ee70b1cea0db.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Calculate the percpu sd task LLC preference

Calculate the number of tasks' LLC preferences for each runqueue.
This statistic is computed during task enqueue and dequeue
operations, and is used by the cache-aware load balancing.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/d15a64436d3acd19c5c53344c5e9d3d0b79b3233.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Introduce per CPU's tasks LLC preference counter

The lowest level of sched domain for each CPU is assigned an
array where each element tracks the number of tasks preferring
a given LLC, indexed from 0 to max_lid. Since each CPU
has its dedicated sd, this implies that each CPU will have
a dedicated task LLC preference counter.

For example, sd->llc_counts[3] = 2 signifies that there
are 2 tasks on this runqueue which prefer to run within LLC3.

The load balancer can use this information to identify busy
runqueues and migrate tasks to their preferred LLC domains.
This array will be reallocated at runtime during sched domain
rebuild.

Introduce the buffer allocation mechanism, and the statistics
will be calculated in the subsequent patch.

Note: the LLC preference statistics of each CPU are reset on
sched domain rebuild and may under count temporarily, until the
CPU becomes idle and the count is cleared. This is a trade off
to avoid complex data synchronization across sched domain builds.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/42e79eceb8cd6be8a032401d481d101913bc5703.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Track LLC-preferred tasks per runqueue

For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/459a37102f3d74a4e09ea58401d2094ac731d044.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Assign preferred LLC ID to processes

With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/f2ceecba5858680349ad4ce9303a2121f0bb7272.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Make LLC id continuous

Introduce an index mapping between CPUs and their LLCs. This provides
a roughly continuous per LLC index needed for cache-aware load balancing in
later patches.

The existing per_cpu llc_id usually points to the first CPU of the
LLC domain, which is sparse and unsuitable as an array index. Using
llc_id directly would waste memory.

With the new mapping, CPUs in the same LLC share an approximate
continuous id:

  per_cpu(llc_id, CPU=0...15)  = 0
  per_cpu(llc_id, CPU=16...31) = 1
  per_cpu(llc_id, CPU=32...47) = 2
  ...

Note that the LLC IDs are allocated via bitmask, so the IDs may be
reused during CPU offline->online transitions.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Originally-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/047ef46339e4db497b54a89940a7ebedf27fcf28.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Introduce helper functions to enforce LLC migration policy

Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency.
A mechanism is needed to limit aggregation so that the preferred LLC
does not become overloaded.

Introduce helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:

  1. Aggregate a task to its preferred LLC if both source and
     destination LLCs are not too busy, or if doing so will not
     leave the preferred LLC much more imbalanced than the
     non-preferred one (>20% utilization difference, a little
     higher than the default imbalance_pct(17%) of the LLC domain
     as hysteresis). Later this threshold will be turned into tunable
     debugfs.
  2. Allow moving a task from overloaded preferred LLC to a non
     preferred LLC if this will not cause the non preferred LLC
     to become too imbalanced to cause a later migration back.
  3. If both LLCs are too busy, let the generic load balance to
     spread the tasks.

Further (hysteresis)action could be taken in the future to prevent tasks
from being migrated into and out of the preferred LLC frequently (back and
forth): the threshold for migrating a task out of its preferred LLC should
be higher than that for migrating it into the LLC.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/d19b52589cdceaee5e625980959f4d1982d6d7c9.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Record per LLC utilization to guide cache aware scheduling decisions

When a system becomes busy and a process's preferred LLC is
saturated with too many threads, tasks within that LLC migrate
frequently. These in LLC migrations introduce latency and degrade
performance. To avoid this, task aggregation should be suppressed
when the preferred LLC is overloaded, which requires a metric to
indicate LLC utilization.

Record per LLC utilization/cpu capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/a48151b3d57f2a42a5971aaead1b7f81e69229f4.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Limit the scan number of CPUs when calculating task occupancy

When NUMA balancing is enabled, the kernel currently iterates over all
online CPUs to aggregate process-wide occupancy data. On large systems,
this global scan introduces significant overhead.

To reduce scan latency, limit the search to a subset of relevant CPUs:
1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC..

While focusing solely on the preferred NUMA node is ideal, a
process-wide scan must remain flexible because the "preferred node"
is a per-task attribute. Different threads within the same process may
have different preferred nodes, causing the process-wide preference to
migrate. Maintaining a mask that covers both the preferred and active
running nodes ensures accuracy while significantly reducing the number of
CPUs inspected.

Future work may integrate numa_group to further refine task aggregation.

Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b.1775065312.git.tim.c.chen@linux.intel.com

sched/cache: Introduce infrastructure for cache-aware load balancing

Adds infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.

In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.

This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.

At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.

In the future, more advanced policies could be integrated through NUMA
balancing-for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
Grouping of tasks could also be generalized from that of a process
to be that of a NUMA group, or be user configurable.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/6269a53221b9439b9ca00d18a9d1946fb64d8cff.1775065312.git.tim.c.chen@linux.intel.com

x86/topology: Add paramter to split LLC

Add a (debug) option to virtually split the LLC, no CAT involved, just fake
topology. Used to test code that depends (either in behaviour or directly) on
there being multiple LLC domains in a node.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Merge branch 'net-lan966x-fix-page_pool-error-handling-and-error-paths'

David Carlier says:

====================
net: lan966x: fix page_pool error handling and error paths

This series fixes error handling around the lan966x page pool:

    1/3 adds the missing IS_ERR check after page_pool_create(), preventing
        a kernel oops when the error pointer flows into
        xdp_rxq_info_reg_mem_model().

    2/3 plugs page pool leaks in the lan966x_fdma_rx_alloc() and
        lan966x_fdma_init() error paths, now reachable after 1/3.

    3/3 fixes a use-after-free and page pool leak in the
        lan966x_fdma_reload() restore path, where the hardware could
        resume DMA into pages already returned to the page pool.
====================

Link: https://patch.msgid.link/20260405055241.35767-1-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: lan966x: fix use-after-free and leak in lan966x_fdma_reload()

When lan966x_fdma_reload() fails to allocate new RX buffers, the restore
path restarts DMA using old descriptors whose pages were already freed
via lan966x_fdma_rx_free_pages(). Since page_pool_put_full_page() can
release pages back to the buddy allocator, the hardware may DMA into
memory now owned by other kernel subsystems.

Additionally, on the restore path, the newly created page pool (if
allocation partially succeeded) is overwritten without being destroyed,
leaking it.

Fix both issues by deferring the release of old pages until after the
new allocation succeeds. Save the old page array before the allocation
so old pages can be freed on the success path. On the failure path, the
old descriptors, pages and page pool are all still valid, making the
restore safe. Also ensure the restore path re-enables NAPI and wakes
the netdev, matching the success path.

Fixes: 89ba464fcf54 ("net: lan966x: refactor buffer reload function")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260405055241.35767-4-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: lan966x: fix page pool leak in error paths

lan966x_fdma_rx_alloc() creates a page pool but does not destroy it if
the subsequent fdma_alloc_coherent() call fails, leaking the pool.

Similarly, lan966x_fdma_init() frees the coherent DMA memory when
lan966x_fdma_tx_alloc() fails but does not destroy the page pool that
was successfully created by lan966x_fdma_rx_alloc(), leaking it.

Add the missing page_pool_destroy() calls in both error paths.

Fixes: 11871aba1974 ("net: lan96x: Use page_pool API")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260405055241.35767-3-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: lan966x: fix page_pool error handling in lan966x_fdma_rx_alloc_page_pool()

page_pool_create() can return an ERR_PTR on failure. The return value
is used unconditionally in the loop that follows, passing the error
pointer through xdp_rxq_info_reg_mem_model() into page_pool_use_xdp_mem(),
which dereferences it, causing a kernel oops.

Add an IS_ERR check after page_pool_create() to return early on failure.

Fixes: 11871aba1974 ("net: lan96x: Use page_pool API")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260405055241.35767-2-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

drm/atomic: Increase timeout in drm_atomic_helper_wait_for_vblanks()

Increase the timeout for vblank events from 100 ms to 1000 ms. This
is the same fix as in commit f050da08a4ed ("drm/vblank: Increase
timeout in drm_wait_one_vblank()") for another vblank timeout.

After merging generic DRM vblank timers [1] and converting several
DRM drivers for virtual hardware, these drivers synchronize their
vblank events to the display refresh rate. This can trigger timeouts
within the DRM framework.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://lore.kernel.org/dri-devel/20250904145806.430568-1-tzimmermann@suse.de/
Reported-by: syzbot+fcede535e7eb57cf5b43@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/dri-devel/69381d6c.050a0220.4004e.0017.GAE@google.com/
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Fixes: 74afeb812850 ("drm/vblank: Add vblank timer")
Link: https://patch.msgid.link/20251209143325.102056-1-tzimmermann@suse.de

platform/x86: thinkpad_acpi: remove obsolete TODO comment

This patch removes the obsolete TODO comment regarding fan speed
presets in fan_write_cmd_speed. After discussion with the
maintainers, it was decided that fixed presets (low/medium/high)
are not suitable due to platform-specific variations.

Signed-off-by: Daniil Bulgar <bulgardaniil18@gmail.com>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Link: https://patch.msgid.link/20260407190546.109900-1-bulgardaniil18@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

leds: class: Make led_remove_lookup() NULL-aware

It is a usual pattern in the kernel to make releasing functions be NULL-aware
so they become a no-op. This helps reducing unneeded checks in the code where
the given resource is optional.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260327102729.797254-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Lee Jones <lee@kernel.org>

platform/x86: dell-wmi-sysman: bound enumeration string aggregation

populate_enum_data() aggregates firmware-provided value-modifier
and possible-value strings into fixed 512-byte struct members.
The current code bounds each individual source string but then
appends every string and separator with raw strcat() and no
remaining-space check.

Switch the aggregation loops to a bounded append helper and
reject enumeration packages whose combined strings do not fit
in the destination buffers.

Fixes: e8a60aa7404b ("platform/x86: Introduce support for Systems Management Driver over WMI for Dell Systems")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260408084501.1-dell-wmi-sysman-v2-pengpeng@iscas.ac.cn
[ij: add include]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

kernfs: make directory seek namespace-aware

The rbtree backing kernfs directories is ordered by (hash, ns_id, name)
but kernfs_dir_pos() only searches by hash when seeking to a position
during readdir. When two nodes from different namespaces share the same
hash value, the binary search can land on a node in the wrong namespace.
The subsequent skip-forward loop walks rb_next() and may overshoot the
correct node, silently dropping an entry from the readdir results.

With the recent switch from raw namespace pointers to public namespace
ids as hash seeds, computing hash collisions became an offline operation.
An unprivileged user could unshare into a new network namespace, create
a single interface whose name-hash collides with a target entry in
init_net, and cause a victim's seekdir/readdir on /sys/class/net to miss
that entry.

Fix this by extending the rbtree search in kernfs_dir_pos() to also
compare namespace ids when hashes match. Since the rbtree is already
ordered by (hash, ns_id, name), this makes the seek land directly in the
correct namespace's range, eliminating the wrong-namespace overshoot.

Signed-off-by: Christian Brauner <brauner@kernel.org>

kernfs: use namespace id instead of pointer for hashing and comparison

kernfs uses the namespace tag as both a hash seed (via init_name_hash())
and a comparison key in the rbtree. The resulting hash values are exposed
to userspace through directory seek positions (ctx->pos), and the raw
pointer comparisons in kernfs_name_compare() encode kernel pointer
ordering into the rbtree layout.

This constitutes a KASLR information leak since the hash and ordering
derived from kernel pointers can be observed from userspace.

Fix this by using the 64-bit namespace id (ns_common::ns_id) instead of
the raw pointer value for both hashing and comparison. The namespace id
is a stable, non-secret identifier that is already exposed to userspace
through other interfaces (e.g., /proc/pid/ns/, ioctl NS_GET_NSID).

Introduce kernfs_ns_id() as a helper that extracts the namespace id from
a potentially-NULL ns_common pointer, returning 0 for the no-namespace
case.

All namespace equality checks in the directory iteration and dentry
revalidation paths are also switched from pointer comparison to ns_id
comparison for consistency.

Signed-off-by: Christian Brauner <brauner@kernel.org>

kernfs: pass struct ns_common instead of const void * for namespace tags

kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.

Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.

Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.

This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.

Signed-off-by: Christian Brauner <brauner@kernel.org>

Merge branches 'fixes', 'arm/smmu/updates', 'arm/smmu/bindings', 'riscv', 'intel/vt-d', 'amd/amd-vi' and 'core' into next

iommu: Ensure .iotlb_sync is called correctly

Many drivers have no reason to use the iotlb_gather mechanism, but do
still depend on .iotlb_sync being called to properly complete an unmap.
Since the core code is now relying on the gather to detect when there
is legitimately something to sync, it should also take care of encoding
a successful unmap when the driver does not touch the gather itself.

Fixes: 90c5def10bea ("iommu: Do not call drivers for empty gathers")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/r/8800a38b-8515-4bbe-af15-0dae81274bf7@nvidia.com
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Will Deacon <will@kernel.org>

iommu/vt-d: Restore IOMMU_CAP_CACHE_COHERENCY

In removing IOMMU_CAP_DEFERRED_FLUSH, the below referenced commit
was over-eager in removing the return, resulting in the test for
IOMMU_CAP_CACHE_COHERENCY falling through to an irrelevant option.

Restore dropped return.

Fixes: 1c18a1212c77 ("iommu/dma: Always allow DMA-FQ when iommupt provides the iommu_domain")
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Will Deacon <will@kernel.org>

platform/x86: hp-wmi: Ignore backlight and FnLock events

On HP OmniBook 7 the keyboard backlight and FnLock keys are handled
directly by the firmware. However, they still trigger WMI events which
results in "Unknown key code" warnings in dmesg.

Add these key codes to the keymap with KE_IGNORE to silence the warnings
since no software action is needed.

Tested-by: Artem S. Tashkinov <aros@gmx.com>
Reported-by: Artem S. Tashkinov <aros@gmx.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221181
Signed-off-by: Krishna Chomal <krishna.chomal108@gmail.com>
Link: https://patch.msgid.link/20260403080155.169653-1-krishna.chomal108@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: uniwill-laptop: Fix signedness bug

The function sysfs_match_string() can return negative error codes and
the variable assigned to it is the enum 'option'. Which could be an
unsigned int due to different compiler implementations.

Assign signed variable 'ret' to sysfs_match_string(), check for error,
then assign ret to option.

Detected by Smatch:
drivers/platform/x86/uniwill/uniwill-acpi.c:919 usb_c_power_priority_store()
warn: unsigned 'option' is never less than zero.

Fixes: 03ae0a0d0973b ("platform/x86: uniwill-laptop: Implement USB-C power priority setting")
Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260403070928.802196-1-ethantidmore06@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: dell_rbu: avoid uninit value usage in packet_size_write()

Ensure the temp value has been properly parsed from the user-provided
buffer and initialized to be used in later operations. While at it,
prefer a convenient kstrtoul() helper.

Found by Linux Verification Center (linuxtesting.org) with Svace static
analysis tool.

Fixes: ad6ce87e5bd4 ("[PATCH] dell_rbu: changes in packet update mechanism")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Link: https://patch.msgid.link/20260403134240.604837-1-pchelkin@ispras.ru
[ij: add include]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

nfc: pn533: allocate rx skb before consuming bytes

pn532_receive_buf() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already hand
a complete frame to pn533_recv_frame() before allocating a fresh receive
buffer.

If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().

Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.

Fixes: c656aa4c27b1 ("nfc: pn533: add UART phy driver")
Cc: stable@vger.kernel.org
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260405094003.3-pn533-v2-pengpeng@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

platform/x86: hp-wmi: add locking for concurrent hwmon access

hp_wmi_hwmon_priv.mode and .pwm are written by hp_wmi_hwmon_write() in
sysfs context and read by hp_wmi_hwmon_keep_alive_handler() in a
workqueue. A concurrent write and keep-alive expiry can observe an
inconsistent mode/pwm pair (e.g. mode=MANUAL with a stale pwm).

Add a mutex to hp_wmi_hwmon_priv protecting mode and pwm. Hold it in
hp_wmi_hwmon_write() across the field update and apply call, and in
hp_wmi_hwmon_keep_alive_handler() before calling apply.

In hp_wmi_hwmon_read(), only the pwm_enable path reads priv->mode; use
scoped_guard() there to avoid holding the lock across unrelated WMI
calls.

Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive")
Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Emre Cecanpunar <emreleno@gmail.com>
Link: https://patch.msgid.link/20260407142515.20683-6-emreleno@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: hp-wmi: fix u8 underflow in gpu_delta calculation

gpu_delta was declared as u8. If the firmware specifies a GPU RPM
lower than the CPU RPM, subtracting them causes an underflow
(e.g. 10 - 20 = 246), which forces the GPU fan to remain clamped at
U8_MAX (100% speed) during operation.

Change gpu_delta to int and use signed arithmetic. Existing signed logic
in hp_wmi_fan_speed_set() correctly handles negative deltas.

Fixes: 46be1453e6e6 ("platform/x86: hp-wmi: add manual fan control for Victus S models")
Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Emre Cecanpunar <emreleno@gmail.com>
Link: https://patch.msgid.link/20260407142515.20683-5-emreleno@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: hp-wmi: use mod_delayed_work to reset keep-alive timer

Currently, schedule_delayed_work() is used to queue the 90s keep-alive
timer. If a user manually changes the fan speed at T=85s,
schedule_delayed_work() leaves the existing timer in place as it is a
no-op if the work is already pending. This results in the keep-alive
timer firing unnecessarily at T=90s, just 5 seconds after the user
action.

Replace schedule_delayed_work() with mod_delayed_work() to reset the
90s timer whenever fan settings are applied. This guarantees a full 90s
delay after every user interaction, preventing redundant keep-alive
executions and improving efficiency.

Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive")
Signed-off-by: Emre Cecanpunar <emreleno@gmail.com>
Link: https://patch.msgid.link/20260407142515.20683-4-emreleno@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: hp-wmi: avoid cancel_delayed_work_sync from work handler

hp_wmi_apply_fan_settings() uses cancel_delayed_work_sync() to stop
the keep-alive timer in AUTO mode. However, since
hp_wmi_apply_fan_settings() is also called from the keep-alive
handler, a race condition with a sysfs write can cause the handler to
wait on itself, leading to a deadlock.

Replace cancel_delayed_work_sync() with cancel_delayed_work() in
hp_wmi_apply_fan_settings() to avoid the self-flush deadlock.

Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive")
Signed-off-by: Emre Cecanpunar <emreleno@gmail.com>
Link: https://patch.msgid.link/20260407142515.20683-3-emreleno@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

platform/x86: hp-wmi: fix ignored return values in fan settings

hp_wmi_get_fan_count_userdefine_trigger() can fail, but its return
value was silently ignored in hp_wmi_apply_fan_settings() for
PWM_MODE_MAX/AUTO. Propagate these errors consistently.

Additionally, handle the return value of hp_wmi_apply_fan_settings()
in its callers by adding appropriate warnings on failure, and remove an
unreachable "return 0" at the end of the function.

Fixes: 46be1453e6e6 ("platform/x86: hp-wmi: add manual fan control for Victus S models")
Signed-off-by: Emre Cecanpunar <emreleno@gmail.com>
Link: https://patch.msgid.link/20260407142515.20683-2-emreleno@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

arm64: dts: ti: k3: Use memory-region-names for r5f

Add the newly introduced memory-region-names to all occurences of
ti,*-r5f. This helps adding a name to each memory-region so it is
easier to see what memory regions are for.

Signed-off-by: Markus Schneider-Pargmann (TI) <msp@baylibre.com>
Link: https://patch.msgid.link/20260318-topic-am62a-ioddr-dt-v6-19-v3-3-c41473cb23c3@baylibre.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>

KVM: LoongArch: selftests: Add PMU overflow interrupt test

Extend the PMU test suite to cover overflow interrupts. The test enables
the PMI (Performance Monitor Interrupt), sets counter 0 to one less than
the overflow value, and verifies that an interrupt is raised when the
counter overflows. A guest interrupt handler checks the interrupt cause
and disables further PMU interrupts upon success.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

KVM: LoongArch: selftests: Add basic PMU event counting test

Introduce a basic PMU test that verifies hardware event counting for
four performance counters. The test enables the events for CPU cycles,
instructions retired, branch instructions, and branch misses, runs a
fixed number of loops, and checks that the counter values fall within
expected ranges. It also validates that the host supports PMU and that
the VM feature is enabled.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

KVM: LoongArch: selftests: Add cpucfg read/write helpers

Add helper macros and functions to read and write CPU configuration
registers (cpucfg) from the guest and from the VMM. This interface is
required in upcoming selftests for querying and setting CPU features,
such as PMU capabilities.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Add DMSINTC inject msi to vCPU

Implement irqfd that deliver msi to vCPU and vCPU dmsintc irq injection.
Add pch_msi_set_irq() choice dmsintc to set msi irq by the msg_addr and
implement dmsintc set msi irq.

Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Add DMSINTC device support

Add device model for DMSINTC interrupt controller, implement basic
create/destroy/set_attr interfaces, and register device model to kvm
device table.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function

vcpu_is_preempted() is performance sensitive that called in function
osq_lock(), here make it as a macro. So that parameter is not parsed
at most time, it can avoid cache line thrashing across numa nodes.

Here is part of UnixBench result on Loongson-3C5000 DualWay machine with
32 cores and 2 numa nodes.

          original    inline   macro
execl     7025.7      6991.2   7242.3
fstime    474.6       703.1    1071

From the test result, making vcpu_is_preempted() as a macro is the best,
and there is some improvment compared with the original function method.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch

CSR register LOONGARCH_CSR_GSTAT stores guest VMID information. With
existing implementation method, VMID is per vCPU, similar with ASID in
kernel. LOONGARCH_CSR_GSTAT is written at VM entry even if VMID is not
changed.

Here move LOONGARCH_CSR_GSTAT save/restore in vCPU context switch, and
update LOONGARCH_CSR_GSTAT only when VMID is updated at VM entry. At
most time VM enter/exit is much more frequent than vCPU thread context
switch.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch

CSR register LOONGARCH_CSR_EENTRY is shared between host CPU and guest
vCPU, KVM need save and restore LOONGARCH_CSR_EENTRY register. Here move
LOONGARCH_CSR_EENTRY saving in to context switch function rather than VM
entry.

At most time VM enter/exit is much more frequent than vCPU thread context
switch.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Check kvm_request_pending() in kvm_late_check_requests()

Add kvm_request_pending() checking firstly in kvm_late_check_requests(),
at most time there is no pending request, then the following pending bit
checking can be skipped.

Also embed function kvm_check_pmu() in to kvm_late_check_requests(), and
put it after the kvm_request_pending() checking.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Use CSR_CRMD_PLV in kvm_arch_vcpu_in_kernel()

The function reads LOONGARCH_CSR_CRMD but uses CSR_PRMD_PPLV to
extract the privilege level. While both masks have the same value
(0x3), CSR_CRMD_PLV is the semantically correct constant for CRMD.

Cc: stable@vger.kernel.org
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

ARM: 9471/1: module: fix unwind section relocation out of range error

In an armv7 system that uses non-3G/1G split and with more than 512MB physical memory, driver load may fail with following error:
section 29 reloc 0 sym '': relocation 42 out of range (0xc2ab9be8 ->
0x7fad5998)

This happens when relocation R_ARM_PREL31 from the unwind section
.ARM.extab and .ARM.exidx are allocated from the VMALLOC space while
.text section is from MODULES_VADDR space. It exceeds the +/-1GB
relocation requirement of R_ARM_PREL31 hence triggers the error.

The fix is to mark .ARM.extab and .ARM.exidx sections as executable so
they can be allocated along with .text section and always meet range
requirement.

Co-developed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: William Zhang <william.zhang@broadcom.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

fs/adfs: validate nzones in adfs_validate_bblk()

Reject ADFS disc records with a zero zone count during boot block
validation, before the disc record is used.

When nzones is 0, adfs_read_map() passes it to kmalloc_array(0, ...)
which returns ZERO_SIZE_PTR, and adfs_map_layout() then writes to
dm[-1], causing an out-of-bounds write before the allocated buffer.

adfs_validate_dr0() already rejects nzones != 1 for old-format
images. Add the equivalent check to adfs_validate_bblk() for
new-format images so that a crafted image with nzones == 0 is
rejected at probe time.

Found by syzkaller.

Fixes: f6f14a0d71b0 ("fs/adfs: map: move map-specific sb initialisation to map.c")
Signed-off-by: Bae Yeonju <iwasbaeyz@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>

Merge branch 'r8152-add-support-for-the-rtl8157-5gbit-usb-ethernet-chip'

Birger Koblitz says:

====================
r8152: Add support for the RTL8157 5Gbit USB Ethernet chip

Add support for the RTL8157, which is a 5GBit USB-Ethernet adapter
chip in the RTL815x family of chips.

The RTL8157 uses a different frame descriptor format, and different
SRAM/ADV access methods, plus offers 5GBit/s Ethernet, so support for these
features is added in addition to chip initialization and configuration.

The module was tested with an OEM RTL8157 USB adapter:
[25758.328238] usb 4-1: new SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.345565] usb 4-1: New USB device found, idVendor=0bda, idProduct=8157, bcdDevice=30.00
[25758.345585] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=7
[25758.345593] usb 4-1: Product: USB 10/100/1G/2.5G/5G LAN
[25758.345599] usb 4-1: Manufacturer: Realtek
[25758.345605] usb 4-1: SerialNumber: 000300E04C68xxxx
[25758.534241] r8152-cfgselector 4-1: reset SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.603511] r8152 4-1:1.0: skip request firmware
[25758.653351] r8152 4-1:1.0 eth0: v1.12.13
[25758.689271] r8152 4-1:1.0 enx00e04c68xxxx: renamed from eth0
[25763.271682] r8152 4-1:1.0 enx00e04c68xxxx: carrier on

The RTL8157 adapter was tested against an AQC107 PCIe-card supporting
10GBit/s and an RTL8126 5Gbit PCIe-card supporting 5GBit/s for
performance, link speed and EEE negotiation. Using USB3.2 Gen 1 with
the RTL8157 USB adapter and running iperf3 against the AQC107 PCIe
card resulted in 3.47 Gbits/sec, whereas using USB3.2 Gen2 resulted
in 4.70 Gbits/sec, speeds against the RTL8126-card were the same.

As the code integrates the RTL8157-specific code with existing RTL8156 code
in order to improve code maintainability (instead of adding RTL8157-specific
functions duplicaing most of the RTL8156 code), regression tests were done
with an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156.

The code is based on the out-of-tree r8152 driver published by Realtek under
the GPL.

This patch is on top of linux-next as the code re-uses the 2.5 Gbit EEE
recently added in r8152.c.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
====================

Link: https://patch.msgid.link/20260404-rtl8157_next-v7-0-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

r8152: Add support for the RTL8157 hardware

The RTL8157 uses a different packet descriptor format compared to the
previous generation of chips. Add support for this format by adding a
descriptor format structure into the r8152 structure and corresponding
desc_ops functions which abstract the vlan-tag, tx/rx len and
tx/rx checksum algorithms.

Also, add support for the ADV indirect access interface of the RTL8157
and PHY setup.

For initialization of the RTL8157, combine the existing RTL8156B and
RTL8156 init functions and add RTL8157-specific functinality in order
to improve code readability and maintainability.
r8156_init() is now called with RTL_VER_10 and RTL_VER_11 for the RTL8156,
with RTL_VER_12, RTL_VER_13 and RTL_VER_15 for the RTL8156B and with
RTL_VER_16 for the RTL8157 and checks the version for chip-specific code.
Also add USB power control functions for the RTL8157.

Add support for the USB device ID of Realtek RTL8157-based adapters. Detect
the RTL8157 as RTL_VER_16 and set it up.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260404-rtl8157_next-v7-2-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

r8152: Add support for 5Gbit Link Speeds and EEE

The RTL8157 supports 5GBit Link speeds. Add support for this speed
in the setup and setting/getting through ethtool. Also add 5GBit EEE.
Add functionality for setup and ethtool get/set methods.

Signed-off-by: Birger Koblitz <mail@birger-koblitz.de>
Link: https://patch.msgid.link/20260404-rtl8157_next-v7-1-039121318f23@birger-koblitz.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ALSA: hda/senarytech: Clean up with the new GPIO helper

Use the new GPIO helper function to clean up the open code.

Merely a code refactoring, and no behavior change.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-11-tiwai@suse.de

ALSA: hda/conexant: Clean up with the new GPIO helper

Use the new GPIO helper function to clean up the open code.

Merely a code refactoring, and no behavior change.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-10-tiwai@suse.de

ALSA: hda/cirrus: Clean up with the new GPIO helper

Use the new GPIO helper function to clean up the open code.

Merely a code refactoring, and no behavior change.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-9-tiwai@suse.de

ALSA: hda/ca0132: Clean up with the new GPIO helper

Use the new GPIO helper function to clean up the open code.

Merely a code refactoring, and no behavior change.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-8-tiwai@suse.de

ALSA: hda/sigmatel: Clean up with the new GPIO helper

Use the new GPIO helper function to clean up the open code.

Merely a code refactoring, and no behavior change.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-7-tiwai@suse.de

ALSA: hda/analog: Fix GPIO verb orders

So far we used the verb cache to restore the GPIO mask, direction and
data bits at PM resume. But, due to the nature of the cache resume
mechanism, the calling order isn't guaranteed, and this might lead to
some inconsistency at the restored state.

For assuring the GPIO verb orders, use the new GPIO helper function to
explicitly set up the GPIO bits, instead of using the codec verb
caches, while keeping the current data bits in ad198x_spec.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-6-tiwai@suse.de

ALSA: hda/alc662: Simplify the quirk for CSL Unity BF24B

The previous implementation of the quirk for CSL Unity BF24B in commit
de65275fc94e ("ALSA: hda/realtek: Add quirk for CSL Unity BF24B")
introduced the unnecessary GPIO caching which leads to a superfluous
write at each init/resume.

Use the new helper to write GPIO bits directly for optimization.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-5-tiwai@suse.de

ALSA: hda/realtek: Clean up with snd_hda_codec_set_gpio()

Use a new helper function to clean up the code.

Along with it, make alc_write_gpio() static as well, which is used
only locally in realtek.c.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-4-tiwai@suse.de

ALSA: hda: Add a simple GPIO setup helper function

Introduce a common GPIO setup helper function, so that we can clean up
the open code found in many codec drivers later.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-3-tiwai@suse.de

ALSA: hda: Add sync version of snd_hda_codec_write()

We used snd_hda_codec_read() for the verb write when a synchronization
is needed after the write, e.g. for the power state toggle or such
cases. It works in principle, but it looks rather confusing and too
hackish.

For improving the code readability, introduce a new helper function,
snd_hda_codec_write_sync(), which is another variant of
snd_hda_codec_write(), and replace the existing snd_hda_codec_read()
calls with this one.

No behavior change but just the code refactoring.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260409093826.1317626-2-tiwai@suse.de