Faicker Mo [Mon, 11 May 2026 14:05:51 +0000 (22:05 +0800)]
net: net_failover: Fix the deadlock in slave register
There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
failover_slave_register() in failover_existing_slave_register() adds lock
and unlock ops too.
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP") Signed-off-by: Faicker Mo <faicker.mo@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
bpf: Maximum combined stack depth
This patchset dumps the maximum combined stack depth in verifier logs
and parses it in veristat.
Changes in v3:
- Increment spec_cnt field in veristat for new MAX_STACK id (AI bot).
Changes in v2:
- Remove unnecessary max_stack_depth assignment (Eduard).
- Fix and test incorrect handling of private stacks.
- Add veristat metric (Eduard).
====================
Paul Chaignon [Wed, 13 May 2026 19:35:36 +0000 (21:35 +0200)]
veristat: Report max stack depth
This patch adds a new "Max stack depth" field to the set of gathered
statistics. This field reports the maximum combined stack depth compared
to the 512 bytes limit. It is null for rejected programs.
Paul Chaignon [Wed, 13 May 2026 19:35:01 +0000 (21:35 +0200)]
selftests/bpf: Test reported max stack depth
This patch tests the maximum stack depth reporting in verifier logs,
with a couple special cases covered: fastcall, private stacks (main
subprog & callee), and rounding up to 16 bytes. For that last one, we
need to skip the test when JIT compilation is disabled as the rounding
is then to 32 bytes.
Paul Chaignon [Wed, 13 May 2026 19:34:50 +0000 (21:34 +0200)]
bpf: Report maximum combined stack depth
We've hit the 512 bytes limit on stack depth a few times in Cilium
recently. As a result, we started reporting in CI our current maximum
stack depth across all configurations for each BPF program.
Unfortunately, that is not trivial to compute in userspace. The
verifier reports the stack depths of individual subprogs at the end of
the logs. However the maximum combined stack depth also depends on the
callgraph of those subprogs (the max combined stack depth is the height
of the callgraph weighted by per-subprog stack depths). We can compute
a callgraph in userspace from the loaded instructions, but it often
doesn't match the verifier's own callgraph because of dead code
elimination. Our current approach relies on dumping the BPF_LOG_LEVEL2
logs, but this feels overkill considering the verifier already has the
information we need.
The patch lets the verifier dump the maximum combined stack depth in
the logs, on the same line as the per-subprog stack depths:
stack depth 16+256 max 272
The per-subprog stack depths and the new max stack depth are not
directly comparable. The former is sometimes updated during fixups,
while the latter is not. As a result, even with a single subprog, we
may end up with two slightly different values. The aim of the new max
value is to be closest to what is actually enforced by the verifier.
====================
netpoll: move out netconsole-specific functions
netpoll and netconsole were created together and their code has
been intermixed in net/core/netpoll.c for decades. The result is
that netpoll exposes two send-side interfaces:
* a generic "give me an sk_buff" path used by every stacked-device
driver (bonding, team, vlan, bridge, macvlan, dsa),
* a second path that takes raw bytes and builds a UDP/IP/Ethernet
packet -- exclusively for netconsole.
The packet builder, an skb pool allocator, and several
netconsole-specific helpers all live next to the generic plumbing even
though no other consumer ever touches them.
Worse, every netpoll user pays for that overlap: struct netpoll carries
an skb_pool and a refill work_struct that only netconsole's find_skb()
ever reads from, and net-core has to review unrelated changes (TTL, hop
limit, IP ID generation, source MAC selection, pool sizing) just because
they happen to be coded inside netpoll.
This is a waste of memory for something useless.
This series splits the netconsole-specific code out:
* netpoll_send_udp() and its private helpers (push_ipv6, push_ipv4,
push_eth, push_udp, netpoll_udp_checksum, find_skb) move into
drivers/net/netconsole.c, leaving netpoll with a single skb-only
send interface that is the same for every user.
The moves are one function per patch for reviewability; helpers are
temporarily EXPORT_SYMBOL_GPL'd while netpoll_send_udp() is still in
netpoll calling them, then those exports are dropped together once
netpoll_send_udp() itself moves.
The only new permanent export is zap_completion_queue(), needed because
find_skb() still drains the per-CPU TX completion queue before
allocating.
struct netpoll is unchanged in this series; making the pool itself
netconsole-private (and reclaiming the skb_pool / refill_wq fields for
the rest of netpoll's users) is the natural follow-up, once this patchset
lands.
====================
Breno Leitao [Tue, 12 May 2026 10:46:42 +0000 (03:46 -0700)]
netconsole: move find_skb() from netpoll
find_skb() is the netconsole-specific entry into the netpoll skb
pool: every other netpoll consumer (bonding, team, vlan, bridge,
macvlan, dsa) builds its own sk_buff and never touches the pool.
With netpoll_send_udp() (its only caller) now living in netconsole,
find_skb() can join it.
Move find_skb() into drivers/net/netconsole.c as a file-static
helper, drop EXPORT_SYMBOL_GPL(find_skb) and remove its prototype
from include/linux/netpoll.h. find_skb() drains TX completions via
netpoll_zap_completion_queue(), which is already exported in the
NETDEV_INTERNAL namespace, so netconsole picks up
MODULE_IMPORT_NS("NETDEV_INTERNAL") to consume it.
The skb pool's lifecycle (np->skb_pool, np->refill_wq, refill_skbs(),
refill_skbs_work_handler(), skb_pool_flush()) stays in netpoll: it
is initialised in __netpoll_setup() and torn down in
__netpoll_cleanup(), both of which remain netpoll's responsibility.
The refill work queued via schedule_work(&np->refill_wq) from the
moved find_skb() runs refill_skbs_work_handler() in netpoll without
any further plumbing.
This is pure code motion: the function body is unchanged and its
sole caller (netpoll_send_udp(), already moved by an earlier patch)
keeps invoking it the same way. Pre-existing concerns about
find_skb() running from NMI/printk context (zap_completion_queue()
re-entry, skb_pool spinlocks, GFP_ATOMIC allocation, fallback skb
sizing vs. MAX_SKB_SIZE, PREEMPT_RT semantics of __kfree_skb()) are
inherited as-is and are not addressed here; they predate this
series and are out of scope. Fixing them is left for follow-up
work.
Breno Leitao [Tue, 12 May 2026 10:46:41 +0000 (03:46 -0700)]
netpoll: rename and export netpoll_zap_completion_queue()
zap_completion_queue() drains the per-CPU softnet completion queue.
Rename it with the netpoll_ prefix shared by the rest of the
subsystem's public API, and promote it from file-static to
EXPORT_SYMBOL_NS_GPL in the NETDEV_INTERNAL namespace so the upcoming
netconsole-side find_skb() can call it once the function moves out.
A forward declaration is added to include/linux/netpoll.h, and the
old file-static forward declaration is dropped.
Breno Leitao [Tue, 12 May 2026 10:46:40 +0000 (03:46 -0700)]
netconsole: move netpoll_udp_checksum() from netpoll
netpoll_udp_checksum() computes the UDP checksum for netconsole's
packets. Move it into drivers/net/netconsole.c as a file-static
helper; drop its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
This was the last csum_ipv6_magic() consumer in net/core/netpoll.c,
so drop the now-stale <net/ip6_checksum.h> include there. Pull it
into netconsole.c so the moved code keeps building.
It was also the last udp_hdr() consumer in net/core/netpoll.c. The
file no longer needs anything from <net/udp.h> (the UDP socket-layer
helpers); MAX_SKB_SIZE only needs struct udphdr, which is provided
by the lighter <linux/udp.h>. Swap the include accordingly.
Breno Leitao [Tue, 12 May 2026 10:46:39 +0000 (03:46 -0700)]
netconsole: move push_udp() from netpoll
push_udp() builds the UDP header (and triggers the checksum) for
netconsole's UDP packets. Move it into drivers/net/netconsole.c as
a file-static helper; drop its EXPORT_SYMBOL_GPL and remove the
prototype from include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:38 +0000 (03:46 -0700)]
netconsole: move push_eth() from netpoll
push_eth() builds the Ethernet header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:37 +0000 (03:46 -0700)]
netconsole: move push_ipv4() from netpoll
push_ipv4() builds the IPv4 header for netconsole's UDP packets.
Move it into drivers/net/netconsole.c as a file-static helper; drop
its EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
put_unaligned() is no longer used in net/core/netpoll.c, so drop
the now-stale <linux/unaligned.h> include from there. Pull it into
netconsole.c so the moved code keeps building.
Breno Leitao [Tue, 12 May 2026 10:46:36 +0000 (03:46 -0700)]
netconsole: move push_ipv6() from netpoll
push_ipv6() builds the IPv6 header for netconsole's UDP packets.
Its only caller, netpoll_send_udp(), now lives in netconsole, so
the helper can move there as a file-static function. Drop its
EXPORT_SYMBOL_GPL and remove the prototype from
include/linux/netpoll.h.
Breno Leitao [Tue, 12 May 2026 10:46:35 +0000 (03:46 -0700)]
netconsole: move netpoll_send_udp() from netpoll
Move netpoll_send_udp() from net/core/netpoll.c into
drivers/net/netconsole.c as a static helper, drop EXPORT_SYMBOL(),
and remove the prototype from include/linux/netpoll.h.
netconsole was the only in-tree caller of this entry point. Every
other netpoll consumer (bonding, team, vlan, bridge, macvlan, dsa)
already builds its own sk_buff and hands it to netpoll_send_skb(),
so the netpoll send-side interface is now skb-only.
The helpers it depends on (find_skb(), push_ipv6(), push_ipv4(),
push_udp(), push_eth(), netpoll_udp_checksum()) were exposed in
the previous patches and stay in net/core/netpoll.c for now.
Subsequent patches move each of them into netconsole one at a time
and drop the corresponding EXPORT_SYMBOL_GPL.
Pull <linux/ip.h>, <linux/ipv6.h> and <linux/udp.h> into netconsole.c
so the moved code can name the header structures.
Breno Leitao [Tue, 12 May 2026 10:46:34 +0000 (03:46 -0700)]
netpoll: expose UDP packet builder helpers for netconsole
Promote each from file-static to EXPORT_SYMBOL_GPL and forward-
declare them in include/linux/netpoll.h so netconsole can call
them once netpoll_send_udp() moves out.
These exports are kept until the end of the series, when
al of them move into netconsole.
Zong Li [Tue, 28 Apr 2026 02:41:05 +0000 (19:41 -0700)]
riscv: cfi: reduce shadow stack size limit from 4GB to 2GB
Follow the ARM64 GCS (Guarded Control Stack) implementation approach
by reducing the shadow stack size allocation from min(RLIMIT_STACK, 4GB)
to min(RLIMIT_STACK/2, 2GB). See commit 506496bcbb42 ("arm64/gcs: Ensure
that new threads have a GCS")
Rationale:
1. Shadow stacks only store return addresses (8 bytes per entry), not
local variables, function parameters, or saved registers. A 2GB
shadow stack is far more than sufficient for any practical
application, even with extremely deep recursion. Using half the size
maintains adequate margin while being more resource-efficient.
2. On memory-constrained systems (e.g., platforms with only 4GB of
physical memory, which is a common configuration), allocating 4GB
of virtual address space for shadow stack per process/thread can
lead to virtual memory allocation failures when the overcommit mode
is set to OVERCOMMIT_GUESS or OVERCOMMIT_NEVER:
Error: "__vm_enough_memory: not enough memory for the allocation"
This reduces virtual address space consumption by 50% while maintaining
more than adequate space for return address storage.
Sajal Gupta [Sat, 9 May 2026 09:51:48 +0000 (15:21 +0530)]
net: usb: pegasus: replace simple_strtoul with kstrtouint
simple_strtoul() is deprecated as it has no error checking. Replace it
with kstrtouint() which returns an error code on invalid input, and add
appropriate error handling.
Also add a NULL check before parsing flags, since strsep() can set id
to NULL if the input has fewer tokens than expected.
Preserve the original behavior for a trailing colon by checking *id
before parsing flags, so an empty string results in flags = 0 rather
than an error.
Victor Nogueira [Mon, 11 May 2026 18:30:58 +0000 (14:30 -0400)]
selftests/tc-testing: Add QFQ/CBS qlen underflow test
Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.
Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Mon, 11 May 2026 18:30:57 +0000 (14:30 -0400)]
net/sched: sch_cbs: Call qdisc_reset for child qdisc
During a reset, CBS is not calling reset on its child qdisc, which
might cause qlen/backlog accounting issues. For example, if we have CBS
with a QFQ parent and a netem child with delay, we can create a scenario
where the parent's qlen underflows. QFQ, specifically, uses qlen to
check whether it should deference a pointer, so this scenario may cause
a null-ptr deref in QFQ:
Fix this by calling qdisc_reset for CBS' child qdisc.
Sashiko caught an issue which could result in a null ptr deref if
qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed
by this patch. We add an early return if the qdisc is null to address this.
This is a similar approach used by two other fixes[1][2].
The proper fix for this specific issue elucidated by sashiko is to remove
the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc
isn't attached anywhere yet at that point, calling the reset callback doesn't
make much sense (and as stated has been a source of two other bugs).
We plan on submitting this fix in a later patch.
[1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/
[2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/
Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Reported-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tun/tap & vhost-net: apply qdisc backpressure on full ptr_ring to reduce TX drops
This patch series deals with tun/tap & vhost-net which drop incoming
SKBs whenever their internal ptr_ring buffer is full. Instead, with this
patch series, the associated netdev queue is stopped - but only when a
qdisc is attached. If no qdisc is present the existing behavior is
preserved. The XDP transmit path is not affected. This patch series
touches tun/tap and vhost-net, as they share common logic and must be
updated together. Modifying only one of them would break the other.
By applying proper backpressure, this change allows the connected qdisc to
operate correctly, as reported in [1], and significantly improves
performance in real-world scenarios, as demonstrated in our paper [2]. For
example, we observed a 36% TCP throughput improvement for an OpenVPN
connection between Germany and the USA.
Synthetic pktgen benchmarks indicate a slight regression, and packet
loss is reduced to near zero. Pktgen benchmarks are provided per commit,
with the final commit showing the overall performance.
Simon Schippers [Sun, 10 May 2026 15:15:29 +0000 (17:15 +0200)]
tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
This commit prevents tail-drop when a qdisc is present and the ptr_ring
becomes full. Once the ring reaches capacity after a produce attempt,
the netdev queue is stopped instead of dropping subsequent packets.
If no qdisc is present, the previous tail-drop behavior is preserved.
If producing an entry fails anyway due to a race, tun_net_xmit() drops
the packet. Such races are expected because LLTX is enabled and the
transmit path operates without the usual locking.
The __tun_wake_queue() function of the consumer races with the producer
for waking/stopping the netdev queue, which could result in a stalled
queue. Therefore, an smp_mb__after_atomic() is introduced that pairs
with the smp_mb() of the consumer. It follows the principle of store
buffering described in tools/memory-model/Documentation/recipes.txt:
- The producer in tun_net_xmit() first sets __QUEUE_STATE_DRV_XOFF,
followed by an smp_mb__after_atomic() (= smp_mb()), and then reads the
ring with __ptr_ring_check_produce().
- The consumer in __tun_wake_queue() first writes zero to the ring in
__ptr_ring_consume(), followed by an smp_mb(), and then reads the queue
status with netif_tx_queue_stopped().
=> Following the aforementioned principle, it is impossible for the
producer to see a full ring (and therefore not wake the queue on the
re-check) while the consumer simultaneously fails to see a stopped
queue (and therefore also does not wake it).
Benchmarks:
The benchmarks show a slight regression in raw transmission performance
when using two sending threads. Packet loss also occurs only in the
two-thread sending case; no packet loss was observed with a single
sending thread.
Test setup:
AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
Average over 50 runs @ 100,000,000 packets. SRSO and spectre v2
mitigations disabled.
Note for tap+vhost-net:
XDP drop program active in VM -> ~2.5x faster; slower for tap due to
more syscalls (high utilization of entry_SYSRETQ_unsafe_stack in perf)
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-5-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:28 +0000 (17:15 +0200)]
ptr_ring: move free-space check into separate helper
This patch moves the check for available free space for a new entry into
a separate function. Existing callers that only check for a non-zero
return value are unaffected; __ptr_ring_produce() now returns -EINVAL
for a zero-size ring and -ENOSPC when full, whereas before both cases
returned -ENOSPC. The new helper allows callers to determine in advance
whether subsequent __ptr_ring_produce() calls will succeed. This
information can, for example, be used to temporarily stop producing until
__ptr_ring_check_produce() indicates that space is available again.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-4-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:27 +0000 (17:15 +0200)]
vhost-net: wake queue of tun/tap after ptr_ring consume
Add tun_wake_queue() to tun.c and export it for use by vhost-net. The
function validates that the file belongs to a tun/tap device and that
the tfile exists, dereferences the tun_struct under RCU, and delegates
to __tun_wake_queue().
vhost_net_buf_produce() now calls tun_wake_queue() after a successful
batched consume of the ring to allow the netdev subqueue to be woken up.
The point is to allow the queue to be stopped when it gets full, which
is required for traffic shaping - implemented by the following
"avoid ptr_ring tail-drop when a qdisc is present".
Without the corresponding queue stopping, this patch alone causes no
throughput regression for a tap+vhost-net setup sending to a qemu VM:
3.857 Mpps to 3.891 Mpps.
Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, XDP drop program active in VM, pktgen sender; Avg over
50 runs @ 100,000,000 packets. SRSO and spectre v2 mitigations disabled.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-3-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Schippers [Sun, 10 May 2026 15:15:26 +0000 (17:15 +0200)]
tun/tap: add ptr_ring consume helper with netdev queue wakeup
Introduce tun_ring_consume() that wraps ptr_ring_consume() and calls
__tun_wake_queue(). The latter wakes the stopped netdev subqueue once
half of the ring capacity has been consumed, tracked via the new
cons_cnt field in tun_file. As a safety net, the queue is also woken on
the last consumed entry if it leaves the ring empty. The point is to
allow the queue to be stopped when it gets full, which is required for
traffic shaping - implemented by the following "avoid ptr_ring tail-drop
when a qdisc is present".
Some implementation details:
- tun_ring_recv() replaces ptr_ring_consume() with tun_ring_consume()
to properly wake the queue.
- __tun_detach() locks the tx_ring.consumer_lock to avoid races with
the consumer on the queue_index.
- The ptr_ring_consume() call in tun_queue_purge() is not replaced with
tun_ring_consume(). Instead, within the same tx_ring.consumer_lock
in __tun_detach(), the netdev queue is woken for the ntfile taking
it over, to avoid a possible stall. This does not matter for
tun_detach_all(), as it is called during device teardown and no tfile
takes over any queue.
- Reset cons_cnt in tun_attach() so the half-ring wake threshold is
valid for the new ring size after ptr_ring_resize().
- tun_queue_resize() wakes all queues after resizing with the proper
tx_ring.consumer_lock and resets the cons_cnt to avoid a possible
stale queue.
- The aforementioned upcoming patch explains the pairing of the smp_mb()
of __tun_wake_queue().
Without the corresponding queue stopping, this patch alone causes no
regression for a tap setup sending to a qemu VM: 1.132 Mpps
to 1.134 Mpps.
Details: AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU
threads, pktgen sender; Avg over 50 runs @ 100,000,000 packets;
SRSO and spectre v2 mitigations disabled.
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de> Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20260510151529.43895-2-simon.schippers@tu-dortmund.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Niranjan H Y [Wed, 13 May 2026 01:55:42 +0000 (07:25 +0530)]
ASoC: sdw_utils: Remove dead code in asoc_sdw_ti_add_tac5xx2_routes()
Remove unnecessary checks for scnprintf() return values in
asoc_sdw_ti_add_tac5xx2_routes(). The function scnprintf() never
returns negative values and cannot return zero given the format
strings used ("%s SPK_L" and "%s SPK_R").
The existing length validation at line 110 already ensures that
name_prefix won't cause buffer overflow, and scnprintf() guarantees
null-termination even in case of truncation.
Fixes: e812de61e9a0 ("ASoC: sdw_utils: TI amp utility for tac5xx2 family") Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/linux-sound/agF8GBcHYUaGJbXY@stanley.mountain/ Signed-off-by: Niranjan H Y <niranjan.hy@ti.com> Link: https://patch.msgid.link/20260513015542.2420-1-niranjan.hy@ti.com Signed-off-by: Mark Brown <broonie@kernel.org>
devm_kasprintf() can fail while building the temporary speaker
component string. If that happens, spk_components is set to NULL, but
the current code can still pass it to strlen() on a later loop iteration
or after the loop when appending the speaker component list to
card->components.
Use NULL to represent the initial "no speaker components" state, and
return -ENOMEM immediately if building spk_components fails.
Alistair Popple [Fri, 1 May 2026 06:51:16 +0000 (16:51 +1000)]
mm/memory: fix spurious warning when unmapping device-private/exclusive pages
Device private and exclusive entries are only supported for anonymous
folios. This condition is tested in __migrate_device_pages() and
make_device_exclusive() using folio_test_anon(). However the unmap path
tests this assumption using vma_is_anonymous().
This is wrong because whilst anonymous VMAs can only contain folios where
folio_test_anon() is true the opposite relation does not hold. A folio
for which folio_test_anon() is true does not imply vma_is_anonymous() is
true. Such a condition can occur if for example a folio is part of a
private filebacked mapping.
In this case vma_is_anonymous() is false as the mapping is filebacked, but
folio_test_anon() may be true, thus permitting devices to migrate the
folio to device private memory. This can lead to the following spurious
warnings during process teardown:
mm: fix __vm_normal_page() to handle missing support for pmd_special()/pud_special()
On x86 32-bit with THP enabled, zap_huge_pmd() is seen to generate a
"WARNING: mm/memory.c:735 at __vm_normal_page+0x6a/0x7d", from the
VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); followed by
"BUG: Bad rss-counter state"s, then later "BUG: Bad page state"s when
reclaim gets to call shrink_huge_zero_folio_scan().
It's as if the _PAGE_SPECIAL bit never got set in the huge_zero pmd: and
indeed, whereas pte_special() and pte_mkspecial() are subject to a
dedicated CONFIG_ARCH_HAS_PTE_SPECIAL, pmd_special() and pmd_mkspecial()
are subject to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, which is never enabled on
any 32-bit architecture.
While the problem was exposed through commit d80a9cb1a64a
("mm/huge_memory: add and use normal_or_softleaf_folio_pmd()"), it was an
oversight in commit af38538801c6 ("mm/memory: factor out common code from
vm_normal_page_*()") and would result in other problems:
* huge zero folio accounted in smaps, pagemap (PAGE_IS_FILE) and
numamaps as file-backed THP
* folio_walk_start() returning the folio even without FW_ZEROPAGE set.
Callers seem to tolerate that, though.
... and triggering the VM_WARN_ON_ONE(), although never reported so far.
To fix it, teach vm_normal_page_pmd()/vm_normal_page_pud() to consider
whether pmd_special/pud_special is actually implemented.
Link: https://lore.kernel.org/20260430-pmd_special-v1-1-dbcbcfd72c20@kernel.org Fixes: af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reported-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/r/74a75b59-2e13-3985-ee99-d5521f39df2a@google.com Reported-by: Bibo Mao <maobibo@loongson.cn> Closes: https://lore.kernel.org/r/20260430041121.2839350-1-maobibo@loongson.cn Debugged-by: Hugh Dickins <hughd@google.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Bibo Mao <maobibo@loongson.cn> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:52:18 +0000 (16:52 +0800)]
drivers/base/memory: fix memory block reference leak in poison accounting
memblk_nr_poison_inc() and memblk_nr_poison_sub() look up a memory block
via find_memory_block_by_id(), which acquires a reference to the memory
block device.
Both helpers use the returned memory block without dropping that
reference, leaking the device reference on each successful lookup. Drop
the reference after updating nr_hwpoison.
Link: https://lore.kernel.org/20260428085219.1316047-3-songmuchun@bytedance.com Fixes: 5033091de814 ("mm/hwpoison: introduce per-memory_block hwpoison counter") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:52:17 +0000 (16:52 +0800)]
mm/memory_hotplug: fix memory block reference leak on remove
Patch series "mm: Fix memory block leaks and locking", v2.
This series fixes two memory block device reference leaks and one locking
issue around the per-memory_block hwpoison counter.
This patch (of 2):
remove_memory_blocks_and_altmaps() looks up each memory block with
find_memory_block(), which acquires a reference to the memory block
device.
That reference is never dropped on this path, resulting in a leaked device
reference when removing memory blocks and their altmaps. Drop the
reference after retrieving mem->altmap and clearing mem->altmap, before
removing the memory block device.
Link: https://lore.kernel.org/20260428085219.1316047-1-songmuchun@bytedance.com Link: https://lore.kernel.org/20260428085219.1316047-2-songmuchun@bytedance.com Fixes: 6b8f0798b85a ("mm/memory_hotplug: split memmap_on_memory requests across memblocks") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Increase buffer size to accommodate machines with 64K PAGE_SIZE.
Link: https://lore.kernel.org/20260421070707.992873-1-lk@c--e.de Fixes: 0913b7554726 ("lib: kunit_iov_iter: add tests for extract_iter_to_sg") Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Reported-by: David Gow <davidgow@google.com> Closes: https://lore.kernel.org/34a81ec2-af84-465d-9b5e-7bb5bf01680f@davidgow.net Tested-by: David Gow <davidgow@google.com> Tested-by: Josh Law <joshlaw48@gmail.com> Reviewed-by: Josh Law <joshlaw48@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/page_alloc: fix initialization of tags of the huge zero folio with init_on_free
__GFP_ZEROTAGS semantics are currently a bit weird, but effectively this
flag is only ever set alongside __GFP_ZERO and __GFP_SKIP_KASAN.
If we run with init_on_free, we will zero out pages during
__free_pages_prepare(), to skip zeroing on the allocation path.
However, when allocating with __GFP_ZEROTAG set, post_alloc_hook() will
consequently not only skip clearing page content, but also skip clearing
tag memory.
Not clearing tags through __GFP_ZEROTAGS is irrelevant for most pages that
will get mapped to user space through set_pte_at() later: set_pte_at() and
friends will detect that the tags have not been initialized yet
(PG_mte_tagged not set), and initialize them.
However, for the huge zero folio, which will be mapped through a PMD
marked as special, this initialization will not be performed, ending up
exposing whatever tags were still set for the pages.
The docs (Documentation/arch/arm64/memory-tagging-extension.rst) state
that allocation tags are set to 0 when a page is first mapped to user
space. That no longer holds with the huge zero folio when init_on_free is
enabled.
Fix it by decoupling __GFP_ZEROTAGS from __GFP_ZERO, passing to
tag_clear_highpages() whether we want to also clear page content.
Invert the meaning of the tag_clear_highpages() return value to have
clearer semantics.
Reproduced with the huge zero folio by modifying the check_buffer_fill
arm64/mte selftest to use a 2 MiB area, after making sure that pages have
a non-0 tag set when freeing (note that, during boot, we will not actually
initialize tags, but only set KASAN_TAG_KERNEL in the page flags).
$ ./check_buffer_fill
1..20
...
not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
not ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
...
This code needs more cleanups; we'll tackle that next, like
decoupling __GFP_ZEROTAGS from __GFP_SKIP_KASAN.
[akpm@linux-foundation.org: s/__GPF_ZERO/__GFP_ZERO/, per David] Link: https://lore.kernel.org/20260421-zerotags-v2-1-05cb1035482e@kernel.org Fixes: adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20260428124833.1903302-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Baoquan He <baoquan.he@linux.dev> Cc: Dave Young <ruirui.yang@linux.dev> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/20260428124833.1903302-1-rppt@kernel.org Link: https://lore.kernel.org/20260428124833.1903302-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: Baoquan He <baoquan.he@linux.dev> Acked-by: Pratyush Yadav <pratyush@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Young <ruirui.yang@linux.dev> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Destructive tests should be invoked with -d command-line option, but this
won't work today since 'd' is missing in getopts command-line. This
commit fixes it.
Link: https://lore.kernel.org/214fd9e4-5398-4c26-859e-c982c2e277c3@redhat.com Fixes: f16ff3b692ad ("selftests/mm: run_vmtests.sh: add missing tests") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scripts/gdb: slab: update field names of struct kmem_cache
The commit 5ba6bc27b1f9 ("slab: decouple pointer to barn from
kmem_cache_node") reorganized the struct kmem_cache to factor out the
per-node fields to the new struct kmem_cache_per_node_ptrs. This causes
the gdb scripts for lx-slabinfo and lx-slabtrace fail as they still
reference the old structure.
Adjust the gdb scripts to match the current state of struct kmem_cache.
Link: https://lore.kernel.org/20260427142448.666117-3-illia@yshyn.com Fixes: 5ba6bc27b1f9 ("slab: decouple pointer to barn from kmem_cache_node") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Hao Li <hao.li@linux.dev> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Seongjun Hong <hsj0512@snu.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scripts/gdb: mm: cast untyped symbols in x86_page_ops
The symbols phys_base, _text, and _end, used in x86_page_ops are either
defined in assembly or implicitly by the linker. Thus, they lack type
information and cause a conversion error after gdb.parse_and_eval.
Explicitly cast these expressions to unsigned long.
Link: https://lore.kernel.org/20260427142448.666117-2-illia@yshyn.com Fixes: 55f8b4518d14 ("scripts/gdb: implement x86_page_ops in mm.py") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Cc: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Vlastimil Babka <vbabka@suse.com> Cc: Hao Li <hao.li@linux.dev> Cc: Harry Yoo <harry@kernel.org> Cc: Seongjun Hong <hsj0512@snu.ac.kr> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_sysfs_memcg_path_to_id() breaks mem_cgroup_iter() loop without
calling mem_cgroup_iter_break(). This leaks the cgroup reference. Fix
the issue by calling mem_cgroup_iter_break() before the break.
mm/migrate_device: fix spinlock leak in migrate_vma_insert_huge_pmd_page
When check_stable_address_space() fails after the PMD spinlock has
been acquired via pmd_lock(), the code jumps directly to the abort
label, bypassing the spin_unlock() call in unlock_abort. This causes
the PMD spinlock to be permanently held, leading to a deadlock.
Change the goto target from abort to unlock_abort to ensure the
spinlock is always released on this error path.
Link: https://lore.kernel.org/20260425133537.17463-1-nueralspacetech@gmail.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Sunny Patel <nueralspacetech@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot. Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.
The safety mechanism depends on the origin of the reset. For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event. In that case no sleep is possible as a device halt is reported
at the hardirq level.
A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6ffbee
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:
The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to. Therefore there is a
very remote issue only this is a sign of.
Specifically if an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion
and then that second call is pre-empted after poking at the hardware,
but before the timer has been rearmed and owing to high system load
causing exceedingly high scheduling latency control is not handed back
before a transition to the uninitialised state has caused the timer to
be deleted even before it has been started, then fza_reset_timer() will
be called yet again and issue another reset even though by then the
adapter has already recovered.
Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean). It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection. For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).
Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.
Fixes: 61414f5ec9834 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter") Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Kelley [Tue, 17 Feb 2026 18:23:35 +0000 (10:23 -0800)]
drm/hyperv: During panic do VMBus unload after frame buffer is flushed
In a VM, Linux panic information (reason for the panic, stack trace,
etc.) may be written to a serial console and/or a virtual frame buffer
for a graphics console. The latter may need to be flushed back to the
host hypervisor for display.
The current Hyper-V DRM driver for the frame buffer does the flushing
*after* the VMBus connection has been unloaded, such that panic messages
are not displayed on the graphics console. A user with a Hyper-V graphics
console is left with just a hung empty screen after a panic. The enhanced
control that DRM provides over the panic display in the graphics console
is similarly non-functional.
Commit 3671f3777758 ("drm/hyperv: Add support for drm_panic") added
the Hyper-V DRM driver support to flush the virtual frame buffer. It
provided necessary functionality but did not handle the sequencing
problem with VMBus unload.
Fix the full problem by using VMBus functions to suppress the VMBus
unload that is normally done by the VMBus driver in the panic path. Then
after the frame buffer has been flushed, do the VMBus unload so that a
kdump kernel can start cleanly. As expected, CONFIG_DRM_PANIC must be
selected for these changes to have effect. As a side benefit, the
enhanced features of the DRM panic path are also functional.
Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Michael Kelley [Tue, 17 Feb 2026 18:23:34 +0000 (10:23 -0800)]
Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
Currently, VMBus code initiates a VMBus unload in the panic path so
that if a kdump kernel is loaded, it can start fresh in setting up its
own VMBus connection. However, a driver for the VMBus virtual frame
buffer may need to flush dirty portions of the frame buffer back to
the Hyper-V host so that panic information is visible in the graphics
console. To support such flushing, provide exported functions for the
frame buffer driver to specify that the VMBus unload should not be
done by the VMBus driver, and to initiate the VMBus unload itself.
Together these allow a frame buffer driver to delay the VMBus unload
until after it has completed the flush.
Ideally, the VMBus driver could use its own panic-path callback to do
the unload after all frame buffer drivers have finished. But DRM frame
buffer drivers use the kmsg dump callback, and there are no callbacks
after that in the panic path. Hence this somewhat messy approach to
properly sequencing the frame buffer flush and the VMBus unload.
Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Pengpeng Hou [Thu, 7 May 2026 08:18:11 +0000 (16:18 +0800)]
drivers/of: validate status properties in reconfig state changes
Live-tree reconfiguration properties also carry raw values plus explicit
lengths. `of_reconfig_get_state_change()` currently treats `status`
property values as NUL-terminated strings and feeds them straight into
`strcmp()`.
Factor the `"okay"` / `"ok"` check out into a helper that first verifies
that the property contains a bounded C string within `prop->length`.
Malformed `status` updates should be treated as not enabling the node.
Pengpeng Hou [Thu, 7 May 2026 08:18:10 +0000 (16:18 +0800)]
drivers/of: validate live-tree string properties before string use
`populate_properties()` stores live-tree property values as raw byte
sequences plus a separate `length`. They are not globally guaranteed to
be NUL-terminated.
`of_prop_next_string()` iterates string-list properties by walking raw
bytes, `__of_node_is_type()` checks `device_type`,
`__of_device_is_status()` checks `status`, and
`of_alias_from_compatible()` reads the first `compatible` entry. These
paths must validate that the relevant string fits within the property
bounds before they hand it to C string helpers.
Validate these live-tree string properties within their declared bounds.
In particular, make `of_prop_next_string()` reject malformed entries
before returning them, keep the `device_type` check inside the existing
no-lock helper path, and add unit coverage for malformed first and
trailing string-list entries.
i2c: tegra: make tegra_i2c_mutex_unlock() return void
tegra_i2c_mutex_unlock() returning an error that overwrites the transfer
result causes silent loss of I2C transfer errors. If the transfer failed
but the unlock succeeded, the error was lost and the function incorrectly
reported success.
Rather than propagating the unlock error (which is not actionable by the
caller - the I2C message may have been sent regardless), convert the
function to return void and WARN on the unexpected condition. If the
unlock fails, subsequent lock attempts will fail anyway, making the error
visible on the next transfer.
i2c: tegra: fix pm_runtime leak on mutex_lock failure
If tegra_i2c_mutex_lock() fails, the function returns without calling
pm_runtime_put(), leaking the runtime PM reference acquired by the
preceding pm_runtime_get_sync(). This prevents the device from ever
entering runtime suspend.
Add the missing pm_runtime_put() before returning on lock failure.
driver core: platform: remove software node on release()
If we pass a software node to a newly created device using struct
platform_device_info, it will not be removed when the device is
released. This may happen when a module creating the device is removed
or on failure in platform_device_add().
When we try to reuse that software node in a subsequent call to
platform_device_register_full(), it will fail with -EBUSY.
Provide a wrapper around the existing platform_device_release() that
additionally calls device_remove_software_node() and use it to replace
the former if we end up adding a software node.
While at it: check all three possible situations in which two software
nodes for a single platform device can be created/assigned in
platform_device_register_full() and bail-out early.
KVM: SEV: Allocate only as many bytes as needed for temp crypt buffers
When using a temporary buffer to {de,en}crypt unaligned memory for debug,
allocate only the number of bytes that are needed instead of allocating an
entire page. The most common case for unaligned accesses will be reading
or writing less than 16 bytes.
KVM: SEV: Rewrite logic to {de,en}crypt memory for debug
Wholesale rewrite the guts of the debug {de,en}crypt flows, as the existing
code is broken, e.g. doesn't handle cases where the access isn't naturally
sized and aligned, and is so wildly flawed that attempting to salvage the
current code in an iterative fashion would be more risky than a rewrite.
E.g. when encrypting 9 bytes at offset 8, KVM needs to _decrypt_
destination[31:0] into a temporary buffer, buffer[31:0], then copy 9 bytes
from source[8:0] to buffer[16:8], then encrypt buffer[31:0] back into
destination[31:0]. The current code only ever copies 16 bytes, and
bizarrely uses a temporary buffer for the source as well.
To keep the code easier to read and maintain, send the unaligned cases
down dedicated "slow" paths instead of trying to mix and match the possible
combinations in one helper.
For now, preserve the basic approach of the current code, e.g. allocate an
entire page for the temporary buffer, to minimize unwanted changes in
functionality.
KVM: SEV: Add helper function to pin/unpin a single page
Add helpers to pin/unpin a single page, and use it in all flows that pin
exactly one page. None of the single-page users actually check that the
correct number of pages was pinned, which is functionally ok, but visually
jarring, especially in the decrypt/encrypt flow, which separately pins two
pages, but uses a single variable to track how pages were pinned each time.
Again, it's functionally ok since core mm guarantees exactly one page will
be pinned on success, but it's ugly.
Opportunistically use page_to_phys() instead of open coding an equivalent
via page_to_pfn().
Note, all users of the single-page helper pre-validate the address and
length, i.e. don't rely on the sanity check in sev_pin_memory().
KVM: SEV: Explicitly validate the dst buffer for debug operations
When encrypting/decrypting guest memory, explicitly check that the
destination is non-NULL and doesn't wrap instead of subtly relying on
sev_pin_memory() to perform the check. This will allow adding and using
a more focused single-page pinning helper.
KVM: selftests: Add a test to verify SEV {en,de}crypt debug ioctls
Add a selftest to verify KVM's handling of {de,en}crypt debug ioctls,
specifically focusing on edge cases around the chunk (16 bytes) and page
(4096) sizes, where KVM had multiple bugs. E.g. KVM would fail to handle
small sizes that aren't naturally aligned and sized, would buffer overflow
if the destination was unaligned but the source was not, etc.
Attempt to strike a balance between an exhaustive test and a reasonable
runtime. On a system with both SEV and SEV-ES support, the current runtime
is under 45 seconds. Which isn't great, but it's tolerable, and it's not
obvious which of the combinations are "better" than the others.
Linus Torvalds [Wed, 13 May 2026 22:00:40 +0000 (15:00 -0700)]
Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
- PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
the failing task rather than the irq_work kthread
- Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
fixes"
* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
sched_ext: Fix ops->priv clobber on concurrent attach/detach
selftests/sched_ext: Fix build error in dequeue selftest
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
sched_ext: Close sub-sched init race with post-init DEAD recheck
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
sched_ext: Drop unused scx_find_sub_sched() stub
sched_ext: Move scx_error() out of scx_link_sched()'s lock region
Dmitry Osipenko [Fri, 1 May 2026 00:00:43 +0000 (03:00 +0300)]
drm/virtio: Extend blob UAPI with deferred-mapping hinting
If userspace never maps GEM object, then BO wastes hostmem space
because VirtIO-GPU driver maps VRAM BO at the BO's creating time.
Make mappings on-demand by adding new RESOURCE_CREATE_BLOB IOCTL/UAPI
hinting flag telling that host mapping should be deferred until first
mapping is made when the flag is set by userspace.
Chaitanya Sabnis [Wed, 13 May 2026 12:37:57 +0000 (18:07 +0530)]
dt-bindings: i2c: convert davinci i2c to dt-schema
Convert the Texas Instruments DaVinci and Keystone I2C controller
bindings from legacy text format to modern dt-schema (YAML).
During the conversion, the `interrupts` property was made required
to match the strict requirement in the driver probe function. The
custom `ti,has-pfunc` and `power-domains` properties were also
properly defined to match SoC-specific hardware features.
Linus Torvalds [Wed, 13 May 2026 21:56:31 +0000 (14:56 -0700)]
Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
Linus Torvalds [Wed, 13 May 2026 21:49:13 +0000 (14:49 -0700)]
Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path
- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
set with no owner
* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
workqueue: Release PENDING in __queue_work() drain/destroy reject path
Mikko Perttunen [Tue, 21 Apr 2026 04:02:38 +0000 (13:02 +0900)]
drm/msm: Fix iommu_map_sgtable() return value check and avoid WARN
Commit "iommu: return full error code from iommu_map_sg[_atomic]()"
changed iommu_map_sgtable() to return an ssize_t and negative values
in error cases, rather than a size_t and a zero.
Store the return value in the appropriate type and in case of error,
return it rather than WARNing.
Fixes: ad8f36e4b6b1 ("iommu: return full error code from iommu_map_sg[_atomic]()") Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com>
Patchwork: https://patchwork.freedesktop.org/patch/719685/
Message-ID: <20260421-iommu_map_sgtable-return-v1-3-fb484c07d2a1@nvidia.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Rob Clark [Sat, 11 Apr 2026 15:03:12 +0000 (08:03 -0700)]
drm/msm/a6xx: Restore sysprof_active
This got lost in the shuffle somehow when moving the vfunc table to
catalogue. Fixes inhibiting IFPC when userspace is collecting perfcntr
data.
Fixes: 491fadb2b818 ("drm/msm/adreno: Move adreno_gpu_func to catalogue") Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com> Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/717780/
Message-ID: <20260411150312.257937-1-robin.clark@oss.qualcomm.com>
drm/msm/adreno: fix userspace-triggered crash on a2xx-a4xx
Before a5xx Adreno driver will not try fetching UBWC params (because
those generations didn't support UBWC anyway), however it's still
possible to query UBWC-related params from the userspace, triggering
possible NULL pointer dereference. Check for UBWC config in
adreno_get_param() and return sane defaults if there is none.
Fixes: a452510aad53 ("drm/msm/adreno: Switch to the common UBWC config struct") Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Reviewed-by: Rob Clark <rob.clark@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/717778/
Message-ID: <20260411-adreno-fix-ubwc-v3-1-4983156f3f80@oss.qualcomm.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Jeremy Laratro [Tue, 12 May 2026 23:26:16 +0000 (08:26 +0900)]
ksmbd: fix null pointer dereference in compare_guid_key()
session_fd_check() walks the per-inode m_op_list during durable-handle
session teardown and sets op->conn = NULL for every opinfo whose conn
matched the closing session's connection. The matching opinfo, however,
stays linked in its per-ClientGuid lease_table_list entry's lb->lease_list
because destroy_lease_table() only runs on full TCP-connection teardown,
not on SESSION_LOGOFF.
If the same TCP connection then negotiates a fresh session with the
same ClientGuid (ClientGuid is bound to NEGOTIATE, not the session, and
is unchanged across LOGOFF + SETUP) and issues a SMB2 CREATE with a
lease context on a different inode, find_same_lease_key() walks
lb->lease_list, reaches the stale opinfo, and calls compare_guid_key(),
which unconditionally dereferences opinfo->conn->ClientGUID. The conn
pointer is NULL and the kernel panics.
Reproducer requires only a successful SMB2 SESSION_SETUP and a share
configured with 'durable handles = yes'. KASAN report on mainline 70390501d194:
general protection fault, probably for non-canonical address
0xdffffc0000000069: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000348-0x000000000000034f]
Workqueue: ksmbd-io handle_ksmbd_work
RIP: 0010:bcmp+0x5b/0x230
Call Trace:
compare_guid_key+0x4b/0xd0
find_same_lease_key+0x324/0x690
smb2_open+0x6aea/0x8e60
handle_ksmbd_work+0x796/0xee0
...
Faulting address 0x348 is the offset of ClientGUID within struct
ksmbd_conn, confirming opinfo->conn was NULL.
Read opinfo->conn once and bail out if it has been cleared by a
concurrent session_fd_check(). A half-detached opinfo cannot be the
owner of an active lease, so returning 0 is the correct match result.
Fixes: c8efcc786146 ("ksmbd: add support for durable handles v1/v2") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Jeremy Laratro [Tue, 12 May 2026 23:23:26 +0000 (08:23 +0900)]
ksmbd: fix null pointer dereference in proc_show_files()
When a SMB2 client opens a file with a durable v2 handle and then issues
SMB2 SESSION_LOGOFF, session_fd_check() clears fp->tcon = NULL on the
reconnectable file pointer but leaves the fp registered in global_ft.idr
until the durable scavenger fires (up to fp->durable_timeout seconds
later).
During that window any read of /proc/fs/ksmbd/files (mode 0400) panics
the kernel because proc_show_files() walks global_ft.idr and
unconditionally dereferences fp->tcon->id with no NULL guard.
Reproducer requires only a successful SMB2 SESSION_SETUP and a share
configured with 'durable handles = yes'. KASAN report on mainline 70390501d194:
general protection fault, probably for non-canonical address
0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:proc_show_files+0x118/0x740
Call Trace:
proc_show_files+0x118/0x740
seq_read_iter+0x4ef/0xe10
proc_reg_read_iter+0x1b7/0x280
...
Guard the dereference. A durable-disconnected fp legitimately has no
tcon; report its tree id as 0 rather than oopsing.
Fixes: b38f99c1217a ("ksmbd: add procfs interface for runtime monitoring and statistics") Cc: stable@vger.kernel.org Signed-off-by: Jeremy Laratro <research@aradex.io> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Ferry Meng [Mon, 11 May 2026 13:18:16 +0000 (21:18 +0800)]
ksmbd: fix SID memory leak in set_posix_acl_entries_dacl() on overflow
Commit 299f962c0b02 ("ksmbd: use check_add_overflow() to prevent u16
DACL size overflow") added check_add_overflow() guards that break out
of the ACE-building loops in set_posix_acl_entries_dacl() when the
accumulated DACL size would wrap past 65535.
However, each iteration allocates a struct smb_sid via kmalloc_obj()
at the top of the loop and relies on the kfree(sid) call at the end
of the loop body (the 'pass_same_sid' label in the first loop, and
the explicit kfree at the tail of the second loop) to release it.
The newly introduced 'break' statements bypass those kfree() calls,
leaking the sid buffer every time an overflow is detected.
A malicious or malformed file with enough POSIX ACL entries to trip
the overflow check will leak one or more struct smb_sid allocations
on every request that touches the file's DACL, providing a trivial
kernel memory exhaustion vector.
Free sid before breaking out of the loops to plug the leak.
Fixes: 299f962c0b02 ("ksmbd: use check_add_overflow() to prevent u16 DACL size overflow") Cc: stable@vger.kernel.org Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Felix Gu [Fri, 23 Jan 2026 16:37:38 +0000 (00:37 +0800)]
drm/msm/adreno: Fix a reference leak in a6xx_gpu_init()
In a6xx_gpu_init(), node is obtained via of_parse_phandle().
While there was a manual of_node_put() at the end of the
common path, several early error returns would bypass this call,
resulting in a reference leak.
Fix this by using the __free(device_node) cleanup handler to
release the reference when the variable goes out of scope.
Fixes: 5a903a44a984 ("drm/msm/a6xx: Introduce GMU wrapper support") Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Patchwork: https://patchwork.freedesktop.org/patch/700661/
Message-ID: <20260124-a6xx_gpu-v1-1-fa0c8b2dcfb1@gmail.com> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Stefan Dösinger [Wed, 28 Jan 2026 19:13:17 +0000 (22:13 +0300)]
ARM: dts: zte: Add D-Link DWR-932M support
This adds base DT definition for zx297520v3 and one board that consumes it.
The stock kernel does not use the armv7 timer, but it seems to work
fine. The board has other board-specific timers that would need a driver
and I see no reason to bother with them since the arm standard timer
works.
The caveat is the non-standard GIC setup needed to handle the timer's
level-low PPI. This is the responsibility of the boot loader and
documented in Documentation/arch/arm/zte/zx297520v3.rst.
Reviewed-by: Linus Walleij <linusw@kernel.org> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Changes in
v8: Remove redundant label, use "arm,pl011" for uart0 and 2 too.
v6: Squash board + timer + uart patches into one
v5: Prepend the SoC name in the device specific DTS filename.
v4:
Declare all uarts
Remove the UART aliases for now. I can revisit this when I get my
hands on a board that exposes two UARTs.
Stefan Dösinger [Wed, 28 Jan 2026 19:12:44 +0000 (22:12 +0300)]
dt-bindings: arm: zte: Add D-Link DWR932M board based on zx297520v3 SoC
This adds a new binding file for ZTE, containing their zx297520v3 SoC
and one board (D-Link DWR-932M) based on it.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Changelog:
v6:
Removed extra boards, I'll add them when submitting their individual
DTS files. Rephrase the subject to add "zte" and remove the redundant
use of "binding".
Moved the devicetree bindings patch ahead of the implementation patches.
Moved the MAINTAINERS section from "ZX29" to "ARM/ZTE".
Commit dc220915ddb2 ("drm/msm: Fix GMEM_BASE for gen8") changed the
GMEM_BASE check from adreno_is_a650_family() & adreno_is_a740_family()
to family >= ADRENO_6XX_GEN4.
This inadvertently excluded A650 (ADRENO_6XX_GEN3), causing it to report
an incorrect GMEM_BASE which results in severe rendering corruption.
Update check to also include ADRENO_6XX_GEN3 to fix A650.
Fixes: dc220915ddb2 ("drm/msm: Fix GMEM_BASE for gen8") Signed-off-by: Alexander Koskovich <akoskovich@pm.me> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Reviewed-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/711880/
Message-ID: <20260314-fix-gmem-base-a650-v1-1-3308f60cf74c@pm.me> Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Stefan Dösinger [Tue, 27 Jan 2026 17:52:08 +0000 (20:52 +0300)]
ARM: zte: Add zx297520v3 platform support
This SoC is used in low end LTE-to-WiFi routers, for example some D-Link
DWR 932 revisions, ZTE K10, ZLT S10 4G, but also models that are branded
and sold by ISPs themselves. They are widespread in Africa, China,
Russia and Eastern Europe.
This SoC is a relative of the zx296702 and zx296718 that had some
upstream support until commit 89d4f98ae90d ("ARM: remove zte zx
platform").
Reviewed-by: Linus Walleij <linusw@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Stefan Dösinger <stefandoesinger@gmail.com>
---
Patch changelog:
v8:
* Select ARM_PSCI_FW (Sashiko). This is an issue make defconfig pointed
out in the last patch in this series. The board does not have PSCI
firmware as far as I can tell, but the ARM_GIC_V3 option indirectly
assumes ARM_PSCI_FW is enabled.
* Include <linux/init.h> in the board file for __initdata (Sashiko),
removed other includes copypasted from another platform that aren't
needed. Let's see if Sashiko agrees.
* Add the SoC documentation to the documentation index (Sashiko)
* Add the SoC documentation to MAINTAINERS (Sashiko)
* Removed redundant if ARCH_ZTE (Sashiko)
* Point towards a sane (USB-Only) U-Boot and modify the example code for
booting from NAND to detect already fixed GIC setups.
Matt Evans [Mon, 11 May 2026 14:46:42 +0000 (07:46 -0700)]
vfio/pci: Make VFIO_PCI_OFFSET_TO_INDEX() return unsigned
VFIO_PCI_OFFSET_TO_INDEX() is used in several places with a signed
parameter (e.g. loff_t). Because it makes no sense for a BAR/resource
index to be negative, enforce this in the macro.
This fixes at least one current issue, where vfio_pci_ioeventfd() uses
this macro with an unvalidated signed loff_t returned into a signed
type, leading to a possible negative array access. This instance does
test against an out-of-bounds positive value, so treating the index as
unsigned fixes this issue.
Thomas Gleixner [Wed, 13 May 2026 12:59:29 +0000 (14:59 +0200)]
hrtimer: Fix the bogus return type of __hrtimer_start_range_ns()
__hrtimer_start_range_ns() has a bool return type, but returns actually
three different values, which are checked at the call site.
Make the return type int.
Fixes: bd5956166d20 ("hrtimer: Provide hrtimer_start_range_ns_user()") Reported-by: Dan Carpenter <error27@gmail.com Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Andrea Righi [Wed, 13 May 2026 11:24:38 +0000 (13:24 +0200)]
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.
Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:
In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.
Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.
Alex Williamson [Thu, 7 May 2026 14:35:46 +0000 (08:35 -0600)]
vfio/pci: fix dma-buf kref underflow after revoke
vfio_pci_dma_buf_move(revoked=true) and vfio_pci_dma_buf_cleanup()
ran the same drain sequence: set priv->revoked, invalidate mappings,
wait for fences, drop the registered kref, wait for completion.
When the VFIO device fd was closed after PCI_COMMAND_MEMORY had been
cleared, both ran in turn -- the second kref_put underflowed and the
subsequent wait_for_completion() blocked on a completion that the
first run had already consumed:
Collapse the duplication: vfio_pci_dma_buf_cleanup() now delegates
the drain to vfio_pci_dma_buf_move(true), which is idempotent for
already-revoked dma-bufs. cleanup retains only list removal and
the device registration drop; the dma_resv_lock that bracketed
those is dropped along with the in-line drain that required it,
memory_lock continues to protect them.
Re-arm the kref and the completion at the end of move()'s revoke
branch so post-revoke state matches post-creation (kref == 1,
completion ready). This keeps cleanup's call into move() a no-op
when revoke already ran, and replaces the explicit kref_init() that
the un-revoke branch used to perform for the un-revoke -> remap
path.
Fixes: 1a8a5227f229 ("vfio: Wait for dma-buf invalidation to complete") Reported-by: Joonas Kylmälä <joonas.kylmala@netum.fi> Closes: https://lore.kernel.org/all/GVXPR02MB12019AA6014F27EF5D773E89BFB372@GVXPR02MB12019.eurprd02.prod.outlook.com/ Cc: stable@vger.kernel.org Assisted-by: Claude:claude-opus-4-7 Reviewed-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260507143548.1018405-1-alex.williamson@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
block: pass a minsize argument to bio_iov_iter_bounce
When bouncing for block size > PAGE_SIZE file systems that require
file system block size alignment (e.g. zoned XFS), the bio needs to
be big enough to fit an entire block.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
Henry Tseng [Wed, 13 May 2026 10:28:46 +0000 (18:28 +0800)]
cpufreq: intel_pstate: Use HYBRID_SCALING_FACTOR_ADL for Bartlett Lake
Bartlett Lake P-core only SKUs (e.g. Intel Core 9 273PE)
do not report X86_FEATURE_HYBRID_CPU and are not in
intel_hybrid_scaling_factor[]. In hwp_get_cpu_scaling(), the
non-hybrid fallback then applies core_get_scaling() (100000),
producing cpuinfo_max_freq values that exceed the documented Max
Turbo Frequency:
Per the Intel datasheet [1], the Intel Core 9 273PE specifies:
Performance-cores: 12
Efficient-cores: 0
Max Turbo Frequency: 5.7 GHz
Intel Thermal Velocity Boost Frequency: 5.7 GHz
Intel Turbo Boost Max Technology 3.0 Frequency: 5.6 GHz
Performance-core Max Turbo Frequency: 5.4 GHz
Performance-core Base Frequency: 2.3 GHz
Bartlett Lake P-cores are Raptor Cove cores, per
commit d466304c4322 ("x86/cpu: Add CPU model number for Bartlett
Lake CPUs with Raptor Cove cores"). The Alder Lake and Raptor Lake
P-core entries in intel_hybrid_scaling_factor[] use
HYBRID_SCALING_FACTOR_ADL (78741). The same factor applies to
Bartlett Lake.
Add Bartlett Lake to intel_hybrid_scaling_factor[] with
HYBRID_SCALING_FACTOR_ADL so HWP performance levels map to the
correct CPU frequencies matching the datasheet's Max Turbo Frequency.
cpufreq: intel_pstate: Use correct scaling factor on Raptor Lake-E
Raptor Lake-E has the same processor ID as Raptor Lake-S, so there is
an entry in intel_hybrid_scaling_factor[] for it. It does not contain
E-cores though and hybrid_get_cpu_type() returns 0 for its P-cores, so
they get the default "core" scaling factor. However, the original
Raptor Lake scaling factor for P-cores still needs to be used for
mapping the HWP performance levels of the P-cores in Raptor Lake-E to
frequency, as though they were part of a real hybrid system.
To address this, update hwp_get_cpu_scaling() to return
hybrid_scaling_factor, which is the P-core scaling factor
retrieved from intel_hybrid_scaling_factor[], for all CPUs
that are not enumerated as E-cores.
Documentation: intel_pstate: Fix description of asymmetric packing with SMT
Patchset [1], including commits
046a5a95c3b0 ("x86/sched/itmt: Give all SMT siblings of a core the same priority") 995998ebdebd ("x86/sched: Remove SD_ASYM_PACKING from the SMT domain flags")
overhauled asym_packing handling in the scheduler on x86 hybrid
processors with SMT. It removed SD_ASYM_PACKING from the x86 SMT
scheduling domain and made all SMT siblings of a core share the same
priority. As a result, asym_packing operates only across physical
cores, spreading tasks among them and only using idle SMT siblings
once all physical cores are busy.
Ethan Tidmore [Fri, 8 May 2026 01:57:16 +0000 (20:57 -0500)]
memory: tegra: Fix possible null pointer dereference
The function tegra114_emc_find_timing() has the possibility of returning
null and it's return value 'timing' is dereferenced before it is
checked for null.
Place dereference after null pointer check.
Detected by Smatch:
drivers/memory/tegra/tegra114-emc.c:520 tegra114_emc_prepare_timing_change()
warn: variable dereferenced before check 'timing' (see line 515)
Nicholas Carlini [Mon, 11 May 2026 18:02:16 +0000 (18:02 +0000)]
io-wq: check that the predecessor is hashed in io_wq_remove_pending()
io_wq_remove_pending() needs to fix up wq->hash_tail[] if the cancelled
work was the tail of its hash bucket. When doing this, it checks whether
the preceding entry in acct->work_list has the same hash value, but
never checks that the predecessor is hashed at all. io_get_work_hash()
is simply atomic_read(&work->flags) >> IO_WQ_HASH_SHIFT, and the hash
bits are never set for non-hashed work, so it returns 0. Thus, when a
hashed bucket-0 work is cancelled while a non-hashed work is its list
predecessor, the check spuriously passes and a pointer to the non-hashed
io_kiocb is stored in wq->hash_tail[0].
Because non-hashed work is dequeued via the fast path in
io_get_next_work(), which never touches hash_tail[], the stale pointer
is never cleared. Therefore, after the non-hashed io_kiocb completes and
is freed back to req_cachep, wq->hash_tail[0] is a dangling pointer. The
io_wq is per-task (tctx->io_wq) and survives ring open/close, so the
dangling pointer persists for the lifetime of the task; the next hashed
bucket-0 enqueue dereferences it in io_wq_insert_work() and
wq_list_add_after() writes through freed memory.
Add the missing io_wq_is_hashed() check so a non-hashed predecessor
never inherits a hash_tail[] slot.
Cc: stable@vger.kernel.org Fixes: 204361a77f40 ("io-wq: fix hang after cancelling pending hashed work") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
thermal: hwmon: Use extra_groups for adding temperature attributes
Instead of passing NULL as the last argument to __hwmon_device_register()
in hwmon_device_register_for_thermal() and then adding each temperature
sysfs attribute to the hwmon device via device_create_file(), redefine
hwmon_device_register_for_thermal() to take an extra_groups argument
that will be passed to __hwmon_device_register(), define an attribute
group with a proper .is_visible() callback for the temperature
attributes and a related attribute groups pointer, and pass the latter
to hwmon_device_register_for_thermal().
This causes the code to be way more straightforward and closer to
what the other users of the hwmon subsystem do.
When thermal_zone0 is removed, say because the ACPI thermal driver is
unbound from the underlying platform device, the removal code skips the
removal of hwmon0 because of the temp2_input attribute belonging to
thermal_zone1 which effectively prevents thermal_zone0 removal from
making progress.
To address this problem, rework the thermal hwmon code to register one
hwmon device for each thermal zone, but since user space utilities
produce confusing output in some cases when there are multiple hwmon
devices with the same name attribute value present under thermal zones
of the same type, append the thermal zone ID preceded by an underline
character to the name of the hwmon device registered for that thermal
zone.
thermal: hwmon: Fix critical temperature attribute removal
Since the return value of thermal_zone_crit_temp_valid() depends on
the behavior of the thermal zone .get_crit_temp() callback which
may change over time in theory, thermal_remove_hwmon_sysfs() may
attempt to remove a critical temperature attribute that has not
been created, passing a pointer to an uninitialized attribute
structure to device_remove_file().
To avoid that, set a flag in struct thermal_hwmon_temp after creating
a critical temperature attribute and use the value of that flag to
decide whether or not the attribute needs to be removed.
Fixes: e8db5d6736a7 ("thermal: hwmon: Make the check for critical temp valid consistent") Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/2437056.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Lezcano [Fri, 8 May 2026 18:05:10 +0000 (20:05 +0200)]
thermal/core: Allocate the thermal class dynamically
Use class_create() instead of a statically allocated struct class.
This allows the thermal class to be managed through a dynamically
allocated class object and avoids keeping a static class instance
around.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
[ rjw: Added __ro_after_init to thermal_class ]
[ rjw: Used temporary local var to store class_create() return value ] Link: https://patch.msgid.link/20260508180511.1306659-4-daniel.lezcano@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Daniel Lezcano [Fri, 8 May 2026 18:05:09 +0000 (20:05 +0200)]
thermal/core: Add dedicated release callback for thermal zones
The thermal class release callback currently handles thermal zone
cleanup by checking the device name prefix.
Move the thermal zone cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the thermal zone lifetime handling local to the thermal
zone object.
Daniel Lezcano [Fri, 8 May 2026 18:05:08 +0000 (20:05 +0200)]
thermal/core: Add dedicated release callback for cooling devices
The thermal class release callback currently handles both thermal
zones and cooling devices by checking the device name prefix.
Move the cooling device cleanup to a dedicated struct device release
callback. This avoids relying on device names to select the release
path and keeps the cooling device lifetime handling local to the
cooling device object.
Lists should have fixed constraints, because binding must be specific in
respect to hardware. Add missing constraints to number of iommus in
Mediatek media devices and remove completely redundant and obvious
description.
Cheng-Yang Chou [Wed, 13 May 2026 08:17:11 +0000 (16:17 +0800)]
tools/sched_ext: scx_qmap: Fix qa arena placement
__arena is a pointer qualifier meaning "this pointer points to arena
memory". When used on a global variable declaration, it expands to
nothing in scx's build because __BPF_FEATURE_ADDR_SPACE_CAST is never
defined, leaving qa as a plain global in BSS. bpftool then generates
skel->bss->qa instead of the expected skel->arena->qa, causing:
scx_qmap.c: error: 'struct scx_qmap' has no member named 'arena'
__arena_global is the correct annotation for global variables that
reside in the arena. When __BPF_FEATURE_ADDR_SPACE_CAST is not defined
it expands to SEC(".addr_space.1"), placing qa in the arena ELF section.
When __BPF_FEATURE_ADDR_SPACE_CAST is defined it expands to
__attribute__((address_space(1))). In both cases bpftool generates the
typed skel->arena accessor.
Fixes: 60a59eaca71b ("sched_ext: scx_qmap: move globals and cpu_ctx into a BPF arena map") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().
Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
# b1 becomes partition root with CPUs 1-2, but sibling exclusion
# reduces its effective_xcpus to CPU 2 only
echo "1-2" > b1/cpuset.cpus
echo "root" > b1/cpuset.cpus.partition
Linus Torvalds [Wed, 13 May 2026 18:53:51 +0000 (11:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solving the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
Yu Miao [Wed, 13 May 2026 02:39:07 +0000 (10:39 +0800)]
selftests/cgroup: Fix error path leaks in test_percpu_basic
When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.
When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.