Günther Noack [Tue, 31 Mar 2026 07:40:51 +0000 (09:40 +0200)]
HID: logitech-hidpp: Check bounds when deleting force-feedback effects
Without this bounds check, this might otherwise overwrite index -1.
Triggering this condition requires action both from the USB device and from
userspace, which reduces the scenarios in which it can be exploited.
Cc: Lee Jones <lee@kernel.org> Signed-off-by: Günther Noack <gnoack@google.com> Reviewed-by: Lee Jones <lee@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
IB/core: Fix zero dmac race in neighbor resolution
dst_fetch_ha() checks nud_state without holding the neighbor lock, then
copies ha under the seqlock. A race in __neigh_update() where nud_state
is set to NUD_REACHABLE before ha is written allows dst_fetch_ha() to
read a zero MAC address while the seqlock reports no concurrent writer.
netevent_callback amplifies this by waking ALL pending addr_req workers
when ANY neighbor becomes NUD_VALID. At scale (N peers resolving ARP
concurrently), the hit probability scales as N^2, making it near-certain
for large RDMA workloads.
N(A): neigh_update(A) W(A): addr_resolve(A)
| [sleep]
| write_lock_bh(&A->lock) |
| A->nud_state = NUD_REACHABLE |
| // A->ha is still 0 |
| [woken by netevent_cb() of
| another neighbour]
| | dst_fetch_ha(A)
| | A->nud_state & NUD_VALID
| | read_seqbegin(&A->ha_lock)
| | snapshot = A->ha /* 0 */
| | read_seqretry(&A->ha_lock)
| | return snapshot
| seqlock(&A->ha_lock)
| A->ha = mac_A /* too late */
| sequnlock(&A->ha_lock)
| write_unlock_bh(&A->lock)
The incorrect/zero mac is read and programmed in the device QP while it
was not yet updated. This causes silent packet loss and eventual
RETRY_EXC_ERR.
Fix by holding the neighbor read lock across the nud_state check and
ha copy in dst_fetch_ha(), ensuring it synchronizes with
__neigh_update() which is updating while holding the write lock.
Cc: stable@vger.kernel.org Fixes: 92ebb6a0a13a ("IB/cm: Remove now useless rcu_lock in dst_fetch_ha") Link: https://patch.msgid.link/r/20260405-fix-dmac-race-v1-1-cfa1ec2ce54a@nvidia.com Signed-off-by: Chen Zhao <chezhao@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Ard Biesheuvel [Wed, 25 Feb 2026 11:03:04 +0000 (12:03 +0100)]
efi: Tag memblock reservations of boot services regions as RSRV_KERN
By definition, EFI memory regions of type boot services code or data
have no special significance to the firmware at runtime, only to the OS.
In some cases, the firmware will allocate tables and other assets that
are passed in memory in regions of this type, and leave it up to the OS
to decide whether or not to treat the allocation as special, or simply
consume the contents at boot and recycle the RAM for ordinary use. The
reason for this approach is that it avoids needless memory reservations
for assets that the OS knows nothing about, and therefore doesn't know
how to free either.
This means that any memblock reservations covering such regions can be
marked as MEMBLOCK_RSRV_KERN - this is a better match semantically, and
is useful on x86 to distinguish true reservations from temporary
reservations that are only needed to work around firmware bugs.
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Ard Biesheuvel [Wed, 25 Feb 2026 12:39:48 +0000 (13:39 +0100)]
memblock: Permit existing reserved regions to be marked RSRV_KERN
Permit existing memblock reservations to be marked as RSRV_KERN. This
will be used by the EFI code on x86 to distinguish between reservations
of boot services data regions that have actual significance to the
kernel and regions that are reserved temporarily to work around buggy
firmware.
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Li Chen [Fri, 6 Mar 2026 08:56:42 +0000 (16:56 +0800)]
jbd2: store jinode dirty range in PAGE_SIZE units
jbd2_inode fields are updated under journal->j_list_lock, but some paths
read them without holding the lock (e.g. fast commit helpers and ordered
truncate helpers).
READ_ONCE() alone is not sufficient for the dirty range fields when they
are stored as loff_t because 32-bit platforms can observe torn loads.
Store the dirty range in PAGE_SIZE units as pgoff_t instead.
Represent the dirty range end as an exclusive end page. This avoids a
special sentinel value and keeps MAX_LFS_FILESIZE on 32-bit representable.
Publish a new dirty range by updating end_page before start_page, and
treat start_page >= end_page as empty in the accessor for robustness.
Use READ_ONCE() on the read side and WRITE_ONCE() on the write side for the
dirty range and i_flags to match the existing lockless access pattern.
Li Chen [Fri, 6 Mar 2026 08:56:41 +0000 (16:56 +0800)]
ocfs2: use jbd2 jinode dirty range accessor
ocfs2 journal commit callback reads jbd2_inode dirty range fields without
holding journal->j_list_lock.
Use jbd2_jinode_get_dirty_range() to get the range in bytes.
Li Chen [Fri, 6 Mar 2026 08:56:40 +0000 (16:56 +0800)]
ext4: use jbd2 jinode dirty range accessor
ext4 journal commit callbacks access jbd2_inode dirty range fields without
holding journal->j_list_lock.
Use jbd2_jinode_get_dirty_range() to get the range in bytes, and read
i_transaction with READ_ONCE() in the redirty check.
Li Chen [Fri, 6 Mar 2026 08:56:39 +0000 (16:56 +0800)]
jbd2: add jinode dirty range accessors
Provide a helper to fetch jinode dirty ranges in bytes. This lets
filesystem callbacks avoid depending on the internal representation,
preparing for a later conversion to page units.
Steven Rostedt [Mon, 26 Jan 2026 23:17:42 +0000 (18:17 -0500)]
tracing: Documentation: Update histogram-design.rst for fn() handling
The histogram documentation describes the old method of the histogram
triggers using the fn() field of the histogram field structure to process
the field. But due to Spectre mitigation, the function pointer to handle
the fields at runtime caused a noticeable overhead. It was converted over
to a fn_num and hist_fn_call() is now used to call the specific functions
for the fields via a switch statement based on the field's fn_num value.
Update the documentation to reflect this change.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260126181742.03e8f0d5@gandalf.local.home>
Milos Nikic [Wed, 4 Mar 2026 17:20:16 +0000 (09:20 -0800)]
jbd2: gracefully abort on transaction state corruptions
Auditing the jbd2 codebase reveals several legacy J_ASSERT calls
that enforce internal state machine invariants (e.g., verifying
jh->b_transaction or jh->b_next_transaction pointers).
When these invariants are broken, the journal is in a corrupted
state. However, triggering a fatal panic brings down the entire
system for a localized filesystem error.
This patch targets a specific class of these asserts: those
residing inside functions that natively return integer error codes,
booleans, or error pointers. It replaces the hard J_ASSERTs with
WARN_ON_ONCE to capture the offending stack trace, safely drops
any held locks, gracefully aborts the journal, and returns -EINVAL.
This prevents a catastrophic kernel panic while ensuring the
corrupted journal state is safely contained and upstream callers
(like ext4 or ocfs2) can gracefully handle the aborted handle.
Milos Nikic [Wed, 4 Mar 2026 17:20:15 +0000 (09:20 -0800)]
jbd2: gracefully abort instead of panicking on unlocked buffer
In jbd2_journal_get_create_access(), if the caller passes an unlocked
buffer, the code currently triggers a fatal J_ASSERT.
While an unlocked buffer here is a clear API violation and a bug in the
caller, crashing the entire system is an overly severe response. It brings
down the whole machine for a localized filesystem inconsistency.
Replace the J_ASSERT with a WARN_ON_ONCE to capture the offending caller's
stack trace, and return an error (-EINVAL). This allows the journal to
gracefully abort the transaction, protecting data integrity without
causing a kernel panic.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20260304172016.23525-2-nikic.milos@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
docs: sysctl: Add documentation for /proc/sys/xen/
Add documentation for the Xen hypervisor sysctl controls in
/proc/sys/xen/balloon/.
Documents the hotplug_unpopulated tunable (available when
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is enabled) which controls
whether unpopulated memory regions are automatically hotplugged
when the Xen balloon driver needs to reclaim memory.
The documentation is based on source code analysis of
drivers/xen/balloon.c.
Signed-off-by: Shubham Chakraborty <chakrabortyshubham66@gmail.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260304150419.16738-1-chakrabortyshubham66@gmail.com>
Weixie Cui [Wed, 25 Feb 2026 05:02:31 +0000 (13:02 +0800)]
ext4: simplify mballoc preallocation size rounding for small files
The if-else ladder in ext4_mb_normalize_request() manually rounds up
the preallocation size to the next power of two for files up to 1MB,
enumerating each step from 16KB to 1MB individually. Replace this with
a single roundup_pow_of_two() call clamped to a 16KB minimum, which
is functionally equivalent but much more concise.
Also replace raw byte constants with SZ_1M and SZ_16K from
<linux/sizes.h> for clarity, and remove the stale "XXX: should this
table be tunable?" comment that has been there since the original
mballoc code.
Randy Dunlap [Sat, 21 Mar 2026 23:09:34 +0000 (16:09 -0700)]
Docs: hid: intel-ish-hid: make long URL usable
The '\' line continuation character in this long URL
doesn't help anything. There is no documentation tooling that
handles the line continuation character to join the 2 lines
to make a usable URL. Web browsers terminate the URL just
before the '\' character so that the second line of the URL
is lost. See:
https://docs.kernel.org/hid/intel-ish-hid.html
Join the 2 lines together so that the URL is usable.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260321230934.435020-1-rdunlap@infradead.org>
Julia Lawall [Sun, 22 Feb 2026 12:50:49 +0000 (13:50 +0100)]
ext4/move_extent: use folio_next_pos()
A series of patches such as commit 60a70e61430b ("mm: Use
folio_next_pos()") replace folio_pos() + folio_size() by
folio_next_pos(). The former performs x << z + y << z while
the latter performs (x + y) << z, which is slightly more
efficient. This case was not taken into account, perhaps
because the argument is not named folio.
The change was performed using the following Coccinelle
semantic patch:
ALSA: hda/alc269: Drop superfluous GPIO write at resume
alc269_resume() has an extra code to write GPIO data, but this is
basically already done in the standard alc_init(), hence it's
superfluous. Let's drop the code.
Since all external callers of alc_write_gpio_data() are gone after
this, fold the only usage of alc_write_gpio_data() into the caller and
drop the export as well.
Rong Zhang [Wed, 8 Apr 2026 18:33:05 +0000 (02:33 +0800)]
ALSA: usb-audio: Add quirk flags for Feaulle Rainbow
Feaulle Rainbow is a wired USB-C dynamic in-ear monitor (IEM) featuring
active noise cancellation (ANC).
The supported sample rates are 48000Hz and 96000Hz at 16bit or 24bit,
but it does not support reading the current sample rate and results in
an error message printed to kmsg. Set QUIRK_FLAG_GET_SAMPLE_RATE to skip
the sample rate check.
Its playback mixer reports val = -15360/0/128. Setting -15360 (-60dB)
mutes the playback, so QUIRK_FLAG_MIXER_PLAYBACK_MIN_MUTE is needed.
Add a quirk table entry matching VID/PID=0x0e0b/0xfa01 and applying
the mentioned quirk flags, so that it can work properly.
Quirky device sample:
usb 7-1: New USB device found, idVendor=0e0b, idProduct=fa01, bcdDevice= 1.00
usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 7-1: Product: Feaulle Rainbow
usb 7-1: Manufacturer: Generic
usb 7-1: SerialNumber: 20210726905926
Li RongQing [Mon, 30 Mar 2026 10:59:57 +0000 (06:59 -0400)]
Documentation/kernel-parameters: fix architecture alignment for pt, nopt, and nobypass
Commit ab0e7f20768a ("Documentation: Merge x86-specific boot options doc
into kernel-parameters.txt") introduced a formatting regression where
architecture tags were placed on separate lines with broken indentation.
This caused the 'nopt' [X86] parameter to appear as if it belonged to
the [PPC/POWERNV] section.
Furthermore, since the main 'iommu=' parameter heading already specifies
it is for [X86, EARLY], the subsequent standalone [X86] tags for 'pt',
'nopt', and the AMD GART options are redundant and clutter the
documentation.
Clean up the formatting by removing these redundant tags and properly
attributing the 'nobypass' option to [PPC/POWERNV].
Fixes: ab0e7f20768a ("Documentation: Merge x86-specific boot options doc into kernel-parameters.txt") Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260330105957.2271-1-lirongqing@baidu.com>
Guoqing Jiang [Wed, 21 Jan 2026 06:38:05 +0000 (14:38 +0800)]
ext4: remove tl argument from ext4_fc_replay_{add,del}_range
Since commit a7ba36bc94f2 ("ext4: fix fast commit alignment issues"),
both ext4_fc_replay_add_range and ext4_fc_replay_del_range get
ex based on 'val' instead of 'tl'.
Ard Biesheuvel [Thu, 26 Mar 2026 13:26:57 +0000 (14:26 +0100)]
efi/memattr: Fix thinko in table size sanity check
While it is true that each PE/COFF runtime driver in memory can
generally be split into 3 different regions (the header, the code/rodata
region and the data/bss region), each with different permissions, it
does not mean that 3x the size of the memory map is a suitable upper
bound. This is due to the fact that all runtime drivers could be
coalesced into a single EFI runtime code region by the firmware, and if
the firmware does a good job of keeping the fragmentation down, it is
conceivable that the memory attributes table has more entries than the
EFI memory map itself.
So instead, base the sanity check on whether the descriptor size matches
the EFI memory map's descriptor size closely enough (which is not
mandated by the spec but extremely unlikely to differ in practice), and
whether the size of the whole table does not exceed 64k entries.
sched/doc: Update yield_task description in sched-design-CFS
The yield_task description referenced the long-removed compat_yield
sysctl and described the function as a dequeue/enqueue cycle. Update
it to reflect current behavior: yielding the CPU by moving the
current task's position back in the runqueue.
Sync zh_CN and sp_SP translations.
Signed-off-by: fangqiurong <fangqiurong@kylinos.cn> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260403055806.358921-1-user@fqr-pc>
Costa Shulyupin [Sun, 5 Apr 2026 16:38:45 +0000 (19:38 +0300)]
Documentation/rtla: Convert links to RST format
Web links in the documentation are not properly displayed.
In the man pages web links look like:
Osnoise tracer documentation: < <https://www.kernel.org/doc/html/lat‐
est/trace/osnoise-tracer.html> >
On web pages the URL caption is the URL itself.
Convert tracer documentation links to RST anonymous hyperlink format
for better rendering. Use newer docs.kernel.org instead of
www.kernel.org/doc/html/latest for brevity.
After the change, the links in the man pages look like:
Osnoise tracer <https://docs.kernel.org/trace/osnoise-tracer.html>
On web pages the captions are the titles of the links.
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260405163847.3337981-1-costa.shul@redhat.com>
Implement .alloc_mw() and .dealloc_mw() for mana device.
This is just the basic infrastructure, MW is not practically usable until
additional kernel support for allowing user space to submit MW work
requests is completed.
Signed-off-by: Manuel Cortez <mdjesuscv@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20260406030323.1196-1-mdjesuscv@gmail.com>
John Garry [Wed, 8 Apr 2026 08:03:57 +0000 (08:03 +0000)]
nvme-multipath: drop head pointer check in nvme_mpath_clear_current_path()
A NS will always have a head pointer, so drop the check. As proof in
practice, all the nvme_mpath_clear_current_path() callers also
dereference ns->head.
This check has endured since the original changes to support multipath.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvmet-tcp: fix race between ICReq handling and queue teardown
nvmet_tcp_handle_icreq() updates queue->state after sending an
Initialization Connection Response (ICResp), but it does so without
serializing against target-side queue teardown.
If an NVMe/TCP host sends an Initialization Connection Request
(ICReq) and immediately closes the connection, target-side teardown
may start in softirq context before io_work drains the already
buffered ICReq. In that case, nvmet_tcp_schedule_release_queue()
sets queue->state to NVMET_TCP_Q_DISCONNECTING and drops the queue
reference under state_lock.
If io_work later processes that ICReq, nvmet_tcp_handle_icreq() can
still overwrite the state back to NVMET_TCP_Q_LIVE. That defeats the
DISCONNECTING-state guard in nvmet_tcp_schedule_release_queue() and
allows a later socket state change to re-enter teardown and issue a
second kref_put() on an already released queue.
The ICResp send failure path has the same problem. If teardown has
already moved the queue to DISCONNECTING, a send error can still
overwrite the state with NVMET_TCP_Q_FAILED, again reopening the
window for a second teardown path to drop the queue reference.
Fix this by serializing both post-send state transitions with
state_lock and bailing out if teardown has already started.
Use -ESHUTDOWN as an internal sentinel for that bail-out path rather
than propagating it as a transport error like -ECONNRESET. Keep
nvmet_tcp_socket_error() setting rcv_state to NVMET_TCP_RECV_ERR before
honoring that sentinel so receive-side parsing stays quiesced until the
existing release path completes.
This means a short packet can still make payload_size() underflow even
if it includes enough bytes for the fixed headers. Simply requiring
header_size(pkt) + RXE_ICRC_SIZE is not sufficient either, because a
packet with a forged non-zero BTH pad can still leave payload_size()
negative and pass an underflowed value to later receive-path users.
Fix this by validating pkt->paylen against the full minimum length
required by payload_size(): header_size(pkt) + bth_pad(pkt) +
RXE_ICRC_SIZE.
ext4: unmap invalidated folios from page tables in mpage_release_unused_pages()
When delayed block allocation fails (e.g., due to filesystem corruption
detected in ext4_map_blocks()), the writeback error handler calls
mpage_release_unused_pages(invalidate=true) which invalidates affected
folios by clearing their uptodate flag via folio_clear_uptodate().
However, these folios may still be mapped in process page tables. If a
subsequent operation (such as ftruncate calling ext4_block_truncate_page)
triggers a write fault, the existing page table entry allows access to
the now-invalidated folio. This leads to ext4_page_mkwrite() being called
with a non-uptodate folio, which then gets marked dirty, triggering:
WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
__folio_mark_dirty+0x578/0x880
1. Process writes to mmap'd file, folio becomes uptodate and dirty
2. Writeback begins, but delayed allocation fails due to corruption
3. mpage_release_unused_pages(invalidate=true) is called:
- block_invalidate_folio() clears dirty flag
- folio_clear_uptodate() clears uptodate flag
- But folio remains mapped in page tables
4. Later, ftruncate triggers ext4_block_truncate_page()
5. This causes a write fault on the still-mapped folio
6. ext4_page_mkwrite() is called with folio that is !uptodate
7. block_page_mkwrite() marks buffers dirty
8. fault_dirty_shared_page() tries to mark folio dirty
9. block_dirty_folio() calls __folio_mark_dirty(warn=1)
10. WARNING triggers: WARN_ON_ONCE(warn && !uptodate && !dirty)
Fix this by unmapping folios from page tables before invalidating them
using unmap_mapping_pages(). This ensures that subsequent accesses
trigger new page faults rather than reusing invalidated folios through
stale page table entries.
Note that this results in data loss for any writes to the mmap'd region
that couldn't be written back, but this is expected behavior when
writeback fails due to filesystem corruption. The existing error message
already states "This should not happen!! Data will be lost".
gpio: swnode: defer probe on references to unregistered software nodes
fwnode_property_get_reference_args() now returns -ENOTCONN when called
on a software node referencing another software node which has not yet
been registered as a firmware node. It makes sense to defer probe in this
situation as the node will most likely be registered later on and we'll
be able to resolve the reference eventually. Change the behavior of
swnode_find_gpio() to return -EPROBE_DEFER if the software node reference
resolution returns -ENOTCONN.
Florian Westphal [Mon, 30 Mar 2026 12:27:39 +0000 (14:27 +0200)]
RDMA/core: Prefer NLA_NUL_STRING
These attributes are evaluated as c-string (passed to strcmp), but
NLA_STRING doesn't check for the presence of a \0 terminator.
Either this needs to switch to nla_strcmp() and needs to adjust printf fmt
specifier to not use plain %s, or this needs to use NLA_NUL_STRING.
As the code has been this way for long time, it seems to me that userspace
does include the terminating nul, even tough its not enforced so far, and
thus NLA_NUL_STRING use is the simpler solution.
Fixes: 30dc5e63d6a5 ("RDMA/core: Add support for iWARP Port Mapper user space service") Link: https://patch.msgid.link/r/20260330122742.13315-1-fw@strlen.de Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
sched/cache: Allow the user space to turn on and off cache aware scheduling
Provide a debugfs directory llc_balancing, and a knob named
"enabled" under it to allow the user to turn off and on the
cache aware scheduling at runtime.
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
Introduce sched_cache_present to enable cache aware scheduling for
multi LLCs NUMA node Cache-aware load balancing should only be
enabled if there are more than 1 LLCs within 1 NUMA node.
sched_cache_present is introduced to indicate whether this
platform supports this topology.
Test results:
The first test platform is a 2 socket Intel Sapphire Rapids with 30
cores per socket. The DRAM interleaving is enabled in the BIOS so it
essential has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sappire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of a LLC.
schbench shows limitted wakeup latency improvement.
ChaCha20-xiangshan(risc-v simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.
Genoa:
Significant improvement is observed in hackbench when
the active number of threads is lower than the number
of CPUs within 1 LLC. On v2, Aaron reported improvement
of hackbench/redis when system is underloaded.
ChaCha20-xiangshan shows huge throughput improvement.
Phoronix has tested v1 and shows good improvements in 30+
cases[3]. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.
Detail:
Due to length constraints, data without much difference with
baseline is not presented.
Note: The default number of fd in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example, the number of fd
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows some improvements when the
system is underload, while it does not bring much difference with
the increasing of system utilization.
[chacha20 on simulated risc-v]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by 24%
Genoa:
[hackbench pipe]
The default number of fd is 20, which exceed the number of CPUs
in a LLC. So the fd is adjusted to 2, 4, 6, 8, 20 respectively.
Exclude the result with large run-to-run variance, 10% ~ 50%
improvement is observed when the system is underloaded:
Tim Chen [Wed, 1 Apr 2026 21:52:26 +0000 (14:52 -0700)]
sched/cache: Handle moving single tasks to/from their preferred LLC
Cache aware scheduling mainly does two things:
1. Prevent task from migrating out of its preferred LLC if not
nessasary.
2. Migrating task to their preferred LLC if nessasary.
For 1:
In the generic load balance, if the busiest runqueue has only one task,
active balancing may be invoked to move it away. However, this migration
might break LLC locality.
Prevent regular load balance from migrating a task that
prefers the current LLC. The load level and imbalance do not warrant
breaking LLC preference per the can_migrate_llc() policy. Here, the
benefit of LLC locality outweighs the power efficiency gained from
migrating the only runnable task away.
Before migration, check whether the task is running on its preferred
LLC: Do not move a lone task to another LLC if it would move the task
away from its preferred LLC or cause excessive imbalance between LLCs.
For 2:
On the other hand, if the migration type is migrate_llc_task, it means
that there are tasks on the env->src_cpu that want to be migrated to
their preferred LLC, launch the active load balance anyway.
Tim Chen [Wed, 1 Apr 2026 21:52:25 +0000 (14:52 -0700)]
sched/cache: Add migrate_llc_task migration type for cache-aware balancing
Introduce a new migration type, migrate_llc_task, to support
cache-aware load balancing.
After identifying the busiest sched_group (having the most tasks
preferring the destination LLC), mark migrations with this type.
During load balancing, each runqueue in the busiest sched_group is
examined, and the runqueue with the highest number of tasks preferring
the destination CPU is selected as the busiest runqueue.
Tim Chen [Wed, 1 Apr 2026 21:52:24 +0000 (14:52 -0700)]
sched/cache: Prioritize tasks preferring destination LLC during balancing
During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others.
Mark source sched groups containing tasks preferring non local LLCs
with the group_llc_balance flag. This ensures the load balancer later
pulls or pushes these tasks toward their preferred LLCs.
The priority of group_llc_balance is lower than that of group_overloaded
and higher than that of all other group types. This is because
group_llc_balance may exacerbate load imbalance, and if the LLC balancing
attempt fails, the nr_balance_failed mechanism will trigger other group
types to rebalance the load.
The load balancer selects the busiest sched_group and migrates tasks
to less busy groups to distribute load across CPUs.
With cache-aware scheduling enabled, the busiest sched_group is
the one with most tasks preferring the destination LLC. If
the group has the llc_balance flag set, cache aware load balancing is
triggered.
Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/baa458f45eab3f602af090c6d6af63dc864f5ec6.1775065312.git.tim.c.chen@linux.intel.com
Tim Chen [Wed, 1 Apr 2026 21:52:23 +0000 (14:52 -0700)]
sched/cache: Check local_group only once in update_sg_lb_stats()
There is no need to check the local group twice for both
group_asym_packing and group_smt_balance. Adjust the code
to facilitate future checks for group types (cache-aware
load balancing) as well.
Tim Chen [Wed, 1 Apr 2026 21:52:22 +0000 (14:52 -0700)]
sched/cache: Count tasks prefering destination LLC in a sched group
During LLC load balancing, tabulate the number of tasks on each runqueue
that prefer the LLC contains the env->dst_cpu in a sched group.
For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
selected as the busiest source to pick tasks from.
Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.
These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.
Tim Chen [Wed, 1 Apr 2026 21:52:21 +0000 (14:52 -0700)]
sched/cache: Calculate the percpu sd task LLC preference
Calculate the number of tasks' LLC preferences for each runqueue.
This statistic is computed during task enqueue and dequeue
operations, and is used by the cache-aware load balancing.
Tim Chen [Wed, 1 Apr 2026 21:52:20 +0000 (14:52 -0700)]
sched/cache: Introduce per CPU's tasks LLC preference counter
The lowest level of sched domain for each CPU is assigned an
array where each element tracks the number of tasks preferring
a given LLC, indexed from 0 to max_lid. Since each CPU
has its dedicated sd, this implies that each CPU will have
a dedicated task LLC preference counter.
For example, sd->llc_counts[3] = 2 signifies that there
are 2 tasks on this runqueue which prefer to run within LLC3.
The load balancer can use this information to identify busy
runqueues and migrate tasks to their preferred LLC domains.
This array will be reallocated at runtime during sched domain
rebuild.
Introduce the buffer allocation mechanism, and the statistics
will be calculated in the subsequent patch.
Note: the LLC preference statistics of each CPU are reset on
sched domain rebuild and may under count temporarily, until the
CPU becomes idle and the count is cleared. This is a trade off
to avoid complex data synchronization across sched domain builds.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/42e79eceb8cd6be8a032401d481d101913bc5703.1775065312.git.tim.c.chen@linux.intel.com
Tim Chen [Wed, 1 Apr 2026 21:52:19 +0000 (14:52 -0700)]
sched/cache: Track LLC-preferred tasks per runqueue
For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.
Tim Chen [Wed, 1 Apr 2026 21:52:18 +0000 (14:52 -0700)]
sched/cache: Assign preferred LLC ID to processes
With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.
Tim Chen [Wed, 1 Apr 2026 21:52:17 +0000 (14:52 -0700)]
sched/cache: Make LLC id continuous
Introduce an index mapping between CPUs and their LLCs. This provides
a roughly continuous per LLC index needed for cache-aware load balancing in
later patches.
The existing per_cpu llc_id usually points to the first CPU of the
LLC domain, which is sparse and unsuitable as an array index. Using
llc_id directly would waste memory.
With the new mapping, CPUs in the same LLC share an approximate
continuous id:
sched/cache: Introduce helper functions to enforce LLC migration policy
Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency.
A mechanism is needed to limit aggregation so that the preferred LLC
does not become overloaded.
Introduce helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:
1. Aggregate a task to its preferred LLC if both source and
destination LLCs are not too busy, or if doing so will not
leave the preferred LLC much more imbalanced than the
non-preferred one (>20% utilization difference, a little
higher than the default imbalance_pct(17%) of the LLC domain
as hysteresis). Later this threshold will be turned into tunable
debugfs.
2. Allow moving a task from overloaded preferred LLC to a non
preferred LLC if this will not cause the non preferred LLC
to become too imbalanced to cause a later migration back.
3. If both LLCs are too busy, let the generic load balance to
spread the tasks.
Further (hysteresis)action could be taken in the future to prevent tasks
from being migrated into and out of the preferred LLC frequently (back and
forth): the threshold for migrating a task out of its preferred LLC should
be higher than that for migrating it into the LLC.
sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
When a system becomes busy and a process's preferred LLC is
saturated with too many threads, tasks within that LLC migrate
frequently. These in LLC migrations introduce latency and degrade
performance. To avoid this, task aggregation should be suppressed
when the preferred LLC is overloaded, which requires a metric to
indicate LLC utilization.
Record per LLC utilization/cpu capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.
sched/cache: Limit the scan number of CPUs when calculating task occupancy
When NUMA balancing is enabled, the kernel currently iterates over all
online CPUs to aggregate process-wide occupancy data. On large systems,
this global scan introduces significant overhead.
To reduce scan latency, limit the search to a subset of relevant CPUs:
1. The task's preferred NUMA node.
2. The node where the task is currently running.
3. The node that contains the task's current preferred LLC..
While focusing solely on the preferred NUMA node is ideal, a
process-wide scan must remain flexible because the "preferred node"
is a per-task attribute. Different threads within the same process may
have different preferred nodes, causing the process-wide preference to
migrate. Maintaining a mask that covers both the preferred and active
running nodes ensures accuracy while significantly reducing the number of
CPUs inspected.
Future work may integrate numa_group to further refine task aggregation.
sched/cache: Introduce infrastructure for cache-aware load balancing
Adds infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.
In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.
This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.
At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
In the future, more advanced policies could be integrated through NUMA
balancing-for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
Grouping of tasks could also be generalized from that of a process
to be that of a NUMA group, or be user configurable.
Peter Zijlstra [Thu, 19 Feb 2026 11:11:16 +0000 (12:11 +0100)]
x86/topology: Add paramter to split LLC
Add a (debug) option to virtually split the LLC, no CAT involved, just fake
topology. Used to test code that depends (either in behaviour or directly) on
there being multiple LLC domains in a node.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
====================
net: lan966x: fix page_pool error handling and error paths
This series fixes error handling around the lan966x page pool:
1/3 adds the missing IS_ERR check after page_pool_create(), preventing
a kernel oops when the error pointer flows into
xdp_rxq_info_reg_mem_model().
2/3 plugs page pool leaks in the lan966x_fdma_rx_alloc() and
lan966x_fdma_init() error paths, now reachable after 1/3.
3/3 fixes a use-after-free and page pool leak in the
lan966x_fdma_reload() restore path, where the hardware could
resume DMA into pages already returned to the page pool.
====================
David Carlier [Sun, 5 Apr 2026 05:52:41 +0000 (06:52 +0100)]
net: lan966x: fix use-after-free and leak in lan966x_fdma_reload()
When lan966x_fdma_reload() fails to allocate new RX buffers, the restore
path restarts DMA using old descriptors whose pages were already freed
via lan966x_fdma_rx_free_pages(). Since page_pool_put_full_page() can
release pages back to the buddy allocator, the hardware may DMA into
memory now owned by other kernel subsystems.
Additionally, on the restore path, the newly created page pool (if
allocation partially succeeded) is overwritten without being destroyed,
leaking it.
Fix both issues by deferring the release of old pages until after the
new allocation succeeds. Save the old page array before the allocation
so old pages can be freed on the success path. On the failure path, the
old descriptors, pages and page pool are all still valid, making the
restore safe. Also ensure the restore path re-enables NAPI and wakes
the netdev, matching the success path.
Fixes: 89ba464fcf54 ("net: lan966x: refactor buffer reload function") Cc: stable@vger.kernel.org Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260405055241.35767-4-devnexen@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Carlier [Sun, 5 Apr 2026 05:52:40 +0000 (06:52 +0100)]
net: lan966x: fix page pool leak in error paths
lan966x_fdma_rx_alloc() creates a page pool but does not destroy it if
the subsequent fdma_alloc_coherent() call fails, leaking the pool.
Similarly, lan966x_fdma_init() frees the coherent DMA memory when
lan966x_fdma_tx_alloc() fails but does not destroy the page pool that
was successfully created by lan966x_fdma_rx_alloc(), leaking it.
Add the missing page_pool_destroy() calls in both error paths.
David Carlier [Sun, 5 Apr 2026 05:52:39 +0000 (06:52 +0100)]
net: lan966x: fix page_pool error handling in lan966x_fdma_rx_alloc_page_pool()
page_pool_create() can return an ERR_PTR on failure. The return value
is used unconditionally in the loop that follows, passing the error
pointer through xdp_rxq_info_reg_mem_model() into page_pool_use_xdp_mem(),
which dereferences it, causing a kernel oops.
Add an IS_ERR check after page_pool_create() to return early on failure.
drm/atomic: Increase timeout in drm_atomic_helper_wait_for_vblanks()
Increase the timeout for vblank events from 100 ms to 1000 ms. This
is the same fix as in commit f050da08a4ed ("drm/vblank: Increase
timeout in drm_wait_one_vblank()") for another vblank timeout.
After merging generic DRM vblank timers [1] and converting several
DRM drivers for virtual hardware, these drivers synchronize their
vblank events to the display refresh rate. This can trigger timeouts
within the DRM framework.
Daniil Bulgar [Tue, 7 Apr 2026 19:05:46 +0000 (21:05 +0200)]
platform/x86: thinkpad_acpi: remove obsolete TODO comment
This patch removes the obsolete TODO comment regarding fan speed
presets in fan_write_cmd_speed. After discussion with the
maintainers, it was decided that fixed presets (low/medium/high)
are not suitable due to platform-specific variations.
Signed-off-by: Daniil Bulgar <bulgardaniil18@gmail.com> Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Link: https://patch.msgid.link/20260407190546.109900-1-bulgardaniil18@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Andy Shevchenko [Fri, 27 Mar 2026 10:27:29 +0000 (11:27 +0100)]
leds: class: Make led_remove_lookup() NULL-aware
It is a usual pattern in the kernel to make releasing functions be NULL-aware
so they become a no-op. This helps reducing unneeded checks in the code where
the given resource is optional.
populate_enum_data() aggregates firmware-provided value-modifier
and possible-value strings into fixed 512-byte struct members.
The current code bounds each individual source string but then
appends every string and separator with raw strcat() and no
remaining-space check.
Switch the aggregation loops to a bounded append helper and
reject enumeration packages whose combined strings do not fit
in the destination buffers.
Fixes: e8a60aa7404b ("platform/x86: Introduce support for Systems Management Driver over WMI for Dell Systems") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Link: https://patch.msgid.link/20260408084501.1-dell-wmi-sysman-v2-pengpeng@iscas.ac.cn
[ij: add include] Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The rbtree backing kernfs directories is ordered by (hash, ns_id, name)
but kernfs_dir_pos() only searches by hash when seeking to a position
during readdir. When two nodes from different namespaces share the same
hash value, the binary search can land on a node in the wrong namespace.
The subsequent skip-forward loop walks rb_next() and may overshoot the
correct node, silently dropping an entry from the readdir results.
With the recent switch from raw namespace pointers to public namespace
ids as hash seeds, computing hash collisions became an offline operation.
An unprivileged user could unshare into a new network namespace, create
a single interface whose name-hash collides with a target entry in
init_net, and cause a victim's seekdir/readdir on /sys/class/net to miss
that entry.
Fix this by extending the rbtree search in kernfs_dir_pos() to also
compare namespace ids when hashes match. Since the rbtree is already
ordered by (hash, ns_id, name), this makes the seek land directly in the
correct namespace's range, eliminating the wrong-namespace overshoot.
Signed-off-by: Christian Brauner <brauner@kernel.org>
kernfs: use namespace id instead of pointer for hashing and comparison
kernfs uses the namespace tag as both a hash seed (via init_name_hash())
and a comparison key in the rbtree. The resulting hash values are exposed
to userspace through directory seek positions (ctx->pos), and the raw
pointer comparisons in kernfs_name_compare() encode kernel pointer
ordering into the rbtree layout.
This constitutes a KASLR information leak since the hash and ordering
derived from kernel pointers can be observed from userspace.
Fix this by using the 64-bit namespace id (ns_common::ns_id) instead of
the raw pointer value for both hashing and comparison. The namespace id
is a stable, non-secret identifier that is already exposed to userspace
through other interfaces (e.g., /proc/pid/ns/, ioctl NS_GET_NSID).
Introduce kernfs_ns_id() as a helper that extracts the namespace id from
a potentially-NULL ns_common pointer, returning 0 for the no-namespace
case.
All namespace equality checks in the directory iteration and dentry
revalidation paths are also switched from pointer comparison to ns_id
comparison for consistency.
Signed-off-by: Christian Brauner <brauner@kernel.org>
kernfs: pass struct ns_common instead of const void * for namespace tags
kernfs has historically used const void * to pass around namespace tags
used for directory-level namespace filtering. The only current user of
this is sysfs network namespace tagging where struct net pointers are
cast to void *.
Replace all const void * namespace parameters with const struct
ns_common * throughout the kernfs, sysfs, and kobject namespace layers.
This includes the kobj_ns_type_operations callbacks, kobject_namespace(),
and all sysfs/kernfs APIs that accept or return namespace tags.
Passing struct ns_common is needed because various codepaths require
access to the underlying namespace. A struct ns_common can always be
converted back to the concrete namespace type (e.g., struct net) via
container_of() or to_ns_common() in the reverse direction.
This is a preparatory change for switching to ns_id-based directory
iteration to prevent a KASLR pointer leak through the current use of
raw namespace pointers as hash seeds and comparison keys.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Robin Murphy [Wed, 8 Apr 2026 14:40:57 +0000 (15:40 +0100)]
iommu: Ensure .iotlb_sync is called correctly
Many drivers have no reason to use the iotlb_gather mechanism, but do
still depend on .iotlb_sync being called to properly complete an unmap.
Since the core code is now relying on the gather to detect when there
is legitimately something to sync, it should also take care of encoding
a successful unmap when the driver does not touch the gather itself.
Fixes: 90c5def10bea ("iommu: Do not call drivers for empty gathers") Reported-by: Jon Hunter <jonathanh@nvidia.com> Closes: https://lore.kernel.org/r/8800a38b-8515-4bbe-af15-0dae81274bf7@nvidia.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Will Deacon <will@kernel.org>
Alex Williamson [Wed, 8 Apr 2026 18:44:42 +0000 (12:44 -0600)]
iommu/vt-d: Restore IOMMU_CAP_CACHE_COHERENCY
In removing IOMMU_CAP_DEFERRED_FLUSH, the below referenced commit
was over-eager in removing the return, resulting in the test for
IOMMU_CAP_CACHE_COHERENCY falling through to an irrelevant option.
Restore dropped return.
Fixes: 1c18a1212c77 ("iommu/dma: Always allow DMA-FQ when iommupt provides the iommu_domain") Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Will Deacon <will@kernel.org>
platform/x86: hp-wmi: Ignore backlight and FnLock events
On HP OmniBook 7 the keyboard backlight and FnLock keys are handled
directly by the firmware. However, they still trigger WMI events which
results in "Unknown key code" warnings in dmesg.
Add these key codes to the keymap with KE_IGNORE to silence the warnings
since no software action is needed.
Tested-by: Artem S. Tashkinov <aros@gmx.com> Reported-by: Artem S. Tashkinov <aros@gmx.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221181 Signed-off-by: Krishna Chomal <krishna.chomal108@gmail.com> Link: https://patch.msgid.link/20260403080155.169653-1-krishna.chomal108@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The function sysfs_match_string() can return negative error codes and
the variable assigned to it is the enum 'option'. Which could be an
unsigned int due to different compiler implementations.
Assign signed variable 'ret' to sysfs_match_string(), check for error,
then assign ret to option.
Detected by Smatch:
drivers/platform/x86/uniwill/uniwill-acpi.c:919 usb_c_power_priority_store()
warn: unsigned 'option' is never less than zero.
Fixes: 03ae0a0d0973b ("platform/x86: uniwill-laptop: Implement USB-C power priority setting") Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com> Link: https://patch.msgid.link/20260403070928.802196-1-ethantidmore06@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: dell_rbu: avoid uninit value usage in packet_size_write()
Ensure the temp value has been properly parsed from the user-provided
buffer and initialized to be used in later operations. While at it,
prefer a convenient kstrtoul() helper.
Found by Linux Verification Center (linuxtesting.org) with Svace static
analysis tool.
Fixes: ad6ce87e5bd4 ("[PATCH] dell_rbu: changes in packet update mechanism") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Link: https://patch.msgid.link/20260403134240.604837-1-pchelkin@ispras.ru
[ij: add include] Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
nfc: pn533: allocate rx skb before consuming bytes
pn532_receive_buf() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already hand
a complete frame to pn533_recv_frame() before allocating a fresh receive
buffer.
If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().
Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.
platform/x86: hp-wmi: add locking for concurrent hwmon access
hp_wmi_hwmon_priv.mode and .pwm are written by hp_wmi_hwmon_write() in
sysfs context and read by hp_wmi_hwmon_keep_alive_handler() in a
workqueue. A concurrent write and keep-alive expiry can observe an
inconsistent mode/pwm pair (e.g. mode=MANUAL with a stale pwm).
Add a mutex to hp_wmi_hwmon_priv protecting mode and pwm. Hold it in
hp_wmi_hwmon_write() across the field update and apply call, and in
hp_wmi_hwmon_keep_alive_handler() before calling apply.
In hp_wmi_hwmon_read(), only the pwm_enable path reads priv->mode; use
scoped_guard() there to avoid holding the lock across unrelated WMI
calls.
Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive") Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Emre Cecanpunar <emreleno@gmail.com> Link: https://patch.msgid.link/20260407142515.20683-6-emreleno@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: hp-wmi: fix u8 underflow in gpu_delta calculation
gpu_delta was declared as u8. If the firmware specifies a GPU RPM
lower than the CPU RPM, subtracting them causes an underflow
(e.g. 10 - 20 = 246), which forces the GPU fan to remain clamped at
U8_MAX (100% speed) during operation.
Change gpu_delta to int and use signed arithmetic. Existing signed logic
in hp_wmi_fan_speed_set() correctly handles negative deltas.
Fixes: 46be1453e6e6 ("platform/x86: hp-wmi: add manual fan control for Victus S models") Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Emre Cecanpunar <emreleno@gmail.com> Link: https://patch.msgid.link/20260407142515.20683-5-emreleno@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: hp-wmi: use mod_delayed_work to reset keep-alive timer
Currently, schedule_delayed_work() is used to queue the 90s keep-alive
timer. If a user manually changes the fan speed at T=85s,
schedule_delayed_work() leaves the existing timer in place as it is a
no-op if the work is already pending. This results in the keep-alive
timer firing unnecessarily at T=90s, just 5 seconds after the user
action.
Replace schedule_delayed_work() with mod_delayed_work() to reset the
90s timer whenever fan settings are applied. This guarantees a full 90s
delay after every user interaction, preventing redundant keep-alive
executions and improving efficiency.
Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive") Signed-off-by: Emre Cecanpunar <emreleno@gmail.com> Link: https://patch.msgid.link/20260407142515.20683-4-emreleno@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: hp-wmi: avoid cancel_delayed_work_sync from work handler
hp_wmi_apply_fan_settings() uses cancel_delayed_work_sync() to stop
the keep-alive timer in AUTO mode. However, since
hp_wmi_apply_fan_settings() is also called from the keep-alive
handler, a race condition with a sysfs write can cause the handler to
wait on itself, leading to a deadlock.
Replace cancel_delayed_work_sync() with cancel_delayed_work() in
hp_wmi_apply_fan_settings() to avoid the self-flush deadlock.
Fixes: c203c59fb5de ("platform/x86: hp-wmi: implement fan keep-alive") Signed-off-by: Emre Cecanpunar <emreleno@gmail.com> Link: https://patch.msgid.link/20260407142515.20683-3-emreleno@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: hp-wmi: fix ignored return values in fan settings
hp_wmi_get_fan_count_userdefine_trigger() can fail, but its return
value was silently ignored in hp_wmi_apply_fan_settings() for
PWM_MODE_MAX/AUTO. Propagate these errors consistently.
Additionally, handle the return value of hp_wmi_apply_fan_settings()
in its callers by adding appropriate warnings on failure, and remove an
unreachable "return 0" at the end of the function.
Fixes: 46be1453e6e6 ("platform/x86: hp-wmi: add manual fan control for Victus S models") Signed-off-by: Emre Cecanpunar <emreleno@gmail.com> Link: https://patch.msgid.link/20260407142515.20683-2-emreleno@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
arm64: dts: ti: k3: Use memory-region-names for r5f
Add the newly introduced memory-region-names to all occurences of
ti,*-r5f. This helps adding a name to each memory-region so it is
easier to see what memory regions are for.
Song Gao [Thu, 9 Apr 2026 10:56:38 +0000 (18:56 +0800)]
KVM: LoongArch: selftests: Add PMU overflow interrupt test
Extend the PMU test suite to cover overflow interrupts. The test enables
the PMI (Performance Monitor Interrupt), sets counter 0 to one less than
the overflow value, and verifies that an interrupt is raised when the
counter overflows. A guest interrupt handler checks the interrupt cause
and disables further PMU interrupts upon success.
Signed-off-by: Song Gao <gaosong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Song Gao [Thu, 9 Apr 2026 10:56:37 +0000 (18:56 +0800)]
KVM: LoongArch: selftests: Add basic PMU event counting test
Introduce a basic PMU test that verifies hardware event counting for
four performance counters. The test enables the events for CPU cycles,
instructions retired, branch instructions, and branch misses, runs a
fixed number of loops, and checks that the counter values fall within
expected ranges. It also validates that the host supports PMU and that
the VM feature is enabled.
Signed-off-by: Song Gao <gaosong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Add helper macros and functions to read and write CPU configuration
registers (cpucfg) from the guest and from the VMM. This interface is
required in upcoming selftests for querying and setting CPU features,
such as PMU capabilities.
Signed-off-by: Song Gao <gaosong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Song Gao [Thu, 9 Apr 2026 10:56:37 +0000 (18:56 +0800)]
LoongArch: KVM: Add DMSINTC inject msi to vCPU
Implement irqfd that deliver msi to vCPU and vCPU dmsintc irq injection.
Add pch_msi_set_irq() choice dmsintc to set msi irq by the msg_addr and
implement dmsintc set msi irq.
Signed-off-by: Song Gao <gaosong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Thu, 9 Apr 2026 10:56:36 +0000 (18:56 +0800)]
LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function
vcpu_is_preempted() is performance sensitive that called in function
osq_lock(), here make it as a macro. So that parameter is not parsed
at most time, it can avoid cache line thrashing across numa nodes.
Here is part of UnixBench result on Loongson-3C5000 DualWay machine with
32 cores and 2 numa nodes.
Bibo Mao [Thu, 9 Apr 2026 10:56:36 +0000 (18:56 +0800)]
LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch
CSR register LOONGARCH_CSR_GSTAT stores guest VMID information. With
existing implementation method, VMID is per vCPU, similar with ASID in
kernel. LOONGARCH_CSR_GSTAT is written at VM entry even if VMID is not
changed.
Here move LOONGARCH_CSR_GSTAT save/restore in vCPU context switch, and
update LOONGARCH_CSR_GSTAT only when VMID is updated at VM entry. At
most time VM enter/exit is much more frequent than vCPU thread context
switch.
Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Thu, 9 Apr 2026 10:56:36 +0000 (18:56 +0800)]
LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch
CSR register LOONGARCH_CSR_EENTRY is shared between host CPU and guest
vCPU, KVM need save and restore LOONGARCH_CSR_EENTRY register. Here move
LOONGARCH_CSR_EENTRY saving in to context switch function rather than VM
entry.
At most time VM enter/exit is much more frequent than vCPU thread context
switch.
Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Bibo Mao [Thu, 9 Apr 2026 10:56:36 +0000 (18:56 +0800)]
LoongArch: KVM: Check kvm_request_pending() in kvm_late_check_requests()
Add kvm_request_pending() checking firstly in kvm_late_check_requests(),
at most time there is no pending request, then the following pending bit
checking can be skipped.
Also embed function kvm_check_pmu() in to kvm_late_check_requests(), and
put it after the kvm_request_pending() checking.
Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Tao Cui [Thu, 9 Apr 2026 10:56:36 +0000 (18:56 +0800)]
LoongArch: KVM: Use CSR_CRMD_PLV in kvm_arch_vcpu_in_kernel()
The function reads LOONGARCH_CSR_CRMD but uses CSR_PRMD_PPLV to
extract the privilege level. While both masks have the same value
(0x3), CSR_CRMD_PLV is the semantically correct constant for CRMD.
Cc: stable@vger.kernel.org Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Tao Cui <cuitao@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
William Zhang [Thu, 2 Apr 2026 01:43:25 +0000 (02:43 +0100)]
ARM: 9471/1: module: fix unwind section relocation out of range error
In an armv7 system that uses non-3G/1G split and with more than 512MB physical memory, driver load may fail with following error:
section 29 reloc 0 sym '': relocation 42 out of range (0xc2ab9be8 ->
0x7fad5998)
This happens when relocation R_ARM_PREL31 from the unwind section
.ARM.extab and .ARM.exidx are allocated from the VMALLOC space while
.text section is from MODULES_VADDR space. It exceeds the +/-1GB
relocation requirement of R_ARM_PREL31 hence triggers the error.
The fix is to mark .ARM.extab and .ARM.exidx sections as executable so
they can be allocated along with .text section and always meet range
requirement.
Co-developed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: William Zhang <william.zhang@broadcom.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Bae Yeonju [Sat, 21 Mar 2026 04:45:02 +0000 (13:45 +0900)]
fs/adfs: validate nzones in adfs_validate_bblk()
Reject ADFS disc records with a zero zone count during boot block
validation, before the disc record is used.
When nzones is 0, adfs_read_map() passes it to kmalloc_array(0, ...)
which returns ZERO_SIZE_PTR, and adfs_map_layout() then writes to
dm[-1], causing an out-of-bounds write before the allocated buffer.
adfs_validate_dr0() already rejects nzones != 1 for old-format
images. Add the equivalent check to adfs_validate_bblk() for
new-format images so that a crafted image with nzones == 0 is
rejected at probe time.
Found by syzkaller.
Fixes: f6f14a0d71b0 ("fs/adfs: map: move map-specific sb initialisation to map.c") Signed-off-by: Bae Yeonju <iwasbaeyz@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
====================
r8152: Add support for the RTL8157 5Gbit USB Ethernet chip
Add support for the RTL8157, which is a 5GBit USB-Ethernet adapter
chip in the RTL815x family of chips.
The RTL8157 uses a different frame descriptor format, and different
SRAM/ADV access methods, plus offers 5GBit/s Ethernet, so support for these
features is added in addition to chip initialization and configuration.
The module was tested with an OEM RTL8157 USB adapter:
[25758.328238] usb 4-1: new SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.345565] usb 4-1: New USB device found, idVendor=0bda, idProduct=8157, bcdDevice=30.00
[25758.345585] usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=7
[25758.345593] usb 4-1: Product: USB 10/100/1G/2.5G/5G LAN
[25758.345599] usb 4-1: Manufacturer: Realtek
[25758.345605] usb 4-1: SerialNumber: 000300E04C68xxxx
[25758.534241] r8152-cfgselector 4-1: reset SuperSpeed Plus Gen 2x1 USB device number 2 using xhci_hcd
[25758.603511] r8152 4-1:1.0: skip request firmware
[25758.653351] r8152 4-1:1.0 eth0: v1.12.13
[25758.689271] r8152 4-1:1.0 enx00e04c68xxxx: renamed from eth0
[25763.271682] r8152 4-1:1.0 enx00e04c68xxxx: carrier on
The RTL8157 adapter was tested against an AQC107 PCIe-card supporting
10GBit/s and an RTL8126 5Gbit PCIe-card supporting 5GBit/s for
performance, link speed and EEE negotiation. Using USB3.2 Gen 1 with
the RTL8157 USB adapter and running iperf3 against the AQC107 PCIe
card resulted in 3.47 Gbits/sec, whereas using USB3.2 Gen2 resulted
in 4.70 Gbits/sec, speeds against the RTL8126-card were the same.
As the code integrates the RTL8157-specific code with existing RTL8156 code
in order to improve code maintainability (instead of adding RTL8157-specific
functions duplicaing most of the RTL8156 code), regression tests were done
with an Edimax EU-4307 V1.0 USB-Ethernet adapter with RTL8156.
The code is based on the out-of-tree r8152 driver published by Realtek under
the GPL.
This patch is on top of linux-next as the code re-uses the 2.5 Gbit EEE
recently added in r8152.c.
The RTL8157 uses a different packet descriptor format compared to the
previous generation of chips. Add support for this format by adding a
descriptor format structure into the r8152 structure and corresponding
desc_ops functions which abstract the vlan-tag, tx/rx len and
tx/rx checksum algorithms.
Also, add support for the ADV indirect access interface of the RTL8157
and PHY setup.
For initialization of the RTL8157, combine the existing RTL8156B and
RTL8156 init functions and add RTL8157-specific functinality in order
to improve code readability and maintainability.
r8156_init() is now called with RTL_VER_10 and RTL_VER_11 for the RTL8156,
with RTL_VER_12, RTL_VER_13 and RTL_VER_15 for the RTL8156B and with
RTL_VER_16 for the RTL8157 and checks the version for chip-specific code.
Also add USB power control functions for the RTL8157.
Add support for the USB device ID of Realtek RTL8157-based adapters. Detect
the RTL8157 as RTL_VER_16 and set it up.
The RTL8157 supports 5GBit Link speeds. Add support for this speed
in the setup and setting/getting through ethtool. Also add 5GBit EEE.
Add functionality for setup and ethtool get/set methods.
So far we used the verb cache to restore the GPIO mask, direction and
data bits at PM resume. But, due to the nature of the cache resume
mechanism, the calling order isn't guaranteed, and this might lead to
some inconsistency at the restored state.
For assuring the GPIO verb orders, use the new GPIO helper function to
explicitly set up the GPIO bits, instead of using the codec verb
caches, while keeping the current data bits in ad198x_spec.
ALSA: hda/alc662: Simplify the quirk for CSL Unity BF24B
The previous implementation of the quirk for CSL Unity BF24B in commit de65275fc94e ("ALSA: hda/realtek: Add quirk for CSL Unity BF24B")
introduced the unnecessary GPIO caching which leads to a superfluous
write at each init/resume.
Use the new helper to write GPIO bits directly for optimization.
ALSA: hda: Add sync version of snd_hda_codec_write()
We used snd_hda_codec_read() for the verb write when a synchronization
is needed after the write, e.g. for the power state toggle or such
cases. It works in principle, but it looks rather confusing and too
hackish.
For improving the code readability, introduce a new helper function,
snd_hda_codec_write_sync(), which is another variant of
snd_hda_codec_write(), and replace the existing snd_hda_codec_read()
calls with this one.