wifi: mac80211: add __packed to union members of struct ieee80211_rx_status
The arm-linux-gnueabi-gcc compiler aligns the field that follows the
union members, which pushes the size of struct ieee80211_rx_status
beyond skb->cb (48 bytes). Investigation shows that the union starts at
offset 32, but the next field, rate_idx, sits at offset 36 instead of
the expected 33, and the total size is an unexpected 52.
When compiling rtw88 driver, it throws:
In file included from /work/linux-src/linux-stable/include/linux/string.h:386,
from /work/linux-src/linux-stable/include/linux/bitmap.h:13,
from /work/linux-src/linux-stable/include/linux/cpumask.h:11,
from /work/linux-src/linux-stable/include/linux/smp.h:13,
from /work/linux-src/linux-stable/include/linux/lockdep.h:14,
from /work/linux-src/linux-stable/include/linux/mutex.h:17,
from /work/linux-src/linux-stable/include/linux/kernfs.h:11,
from /work/linux-src/linux-stable/include/linux/sysfs.h:16,
from /work/linux-src/linux-stable/include/linux/kobject.h:20,
from /work/linux-src/linux-stable/include/linux/dmi.h:6,
from pci.c:5:
In function 'fortify_memcpy_chk',
inlined from 'rtw_pci_rx_napi.constprop' at pci.c:1095:4:
/work/linux-src/linux-stable/include/linux/fortify-string.h:569:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
569 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After this patch, the size of struct ieee80211_rx_status is 48.
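The layout effect can be reproduced outside the kernel with a minimal sketch. The struct and field names below are hypothetical, and the 4-byte union alignment is forced with an explicit attribute to emulate what the commit message describes for the ARM toolchain; this is not the real struct ieee80211_rx_status.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumption for illustration: force 4-byte alignment on the union to
 * emulate an ABI that aligns (and pads) unions to a 4-byte boundary.
 */
struct demo_unpacked {
	uint8_t before;
	union {
		uint8_t a;
		uint8_t b;
	} __attribute__((aligned(4))) u;
	uint8_t after;
};

/* Same layout, but the union is marked packed as the patch does. */
struct demo_packed {
	uint8_t before;
	union {
		uint8_t a;
		uint8_t b;
	} __attribute__((packed)) u;
	uint8_t after;
};

/* The padded union pushes the next field 4 bytes further ... */
static_assert(offsetof(struct demo_unpacked, after) -
	      offsetof(struct demo_unpacked, u) == 4,
	      "aligned union occupies 4 bytes");

/* ... while the packed union occupies a single byte. */
static_assert(offsetof(struct demo_packed, after) -
	      offsetof(struct demo_packed, u) == 1,
	      "packed union occupies 1 byte");
```

The same 3-byte gap is what moves rate_idx from offset 33 to 36 in the report above.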
wifi: Rename EMLSR delay constants and add EMLMR helpers and definitions
In the final version of 802.11be-2024, the Transition Delay and Padding
Delay subfields apply to both EMLSR and EMLMR. Depending on whether the
mode is EMLSR or EMLMR, the interpretation of the encoded value might change.
Define all the constants and helpers to interpret delay subfields both
in EMLSR and EMLMR mode.
In the finalized version of 802.11be-2024, the EMLMR delay values have
been merged in the EMLSR Padding/Transition Delay subfields and
therefore the subfield EMLMR Delay has been converted to a reserved field.
In Table 9-417m of 802.11be-2024, Transition Timeout is defined up
to value 10 for a Transition Timeout of 64TUs. The value 11 is reserved
and does not correspond to a Transition Timeout of 128TUs.
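A minimal sketch of a Transition Timeout decoder that honours the reserved value. The helper name and the macro are hypothetical, and the doubling encoding (value 1 = 128 us, doubling up to value 10 = 64 TUs) is an assumption consistent with the table bound quoted above, not a copy of the kernel helper.

```c
#include <assert.h>
#include <stdint.h>

/* Highest defined encoding: 10 corresponds to 64 TUs (1 TU = 1024 us). */
#define DEMO_EML_TRANS_TIMEOUT_MAX 10

static_assert((128u << (DEMO_EML_TRANS_TIMEOUT_MAX - 1)) == 64u * 1024u,
	      "value 10 must decode to 64 TUs");

/*
 * Return the timeout in microseconds. Value 0 means no timeout; values
 * above 10 are treated as reserved and also yield 0 in this sketch.
 */
static inline uint32_t demo_eml_trans_timeout_us(uint8_t val)
{
	if (val == 0 || val > DEMO_EML_TRANS_TIMEOUT_MAX)
		return 0;

	return 128u << (val - 1);
}
```

Returning 0 for reserved values is one possible policy; a caller could equally reject the element.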
sched/fair: Clear rel_deadline when initializing forked entities
A yield-triggered crash can happen when a newly forked sched_entity
enters the fair class with se->rel_deadline unexpectedly set.
The failing sequence is:
1. A task is forked while se->rel_deadline is still set.
2. __sched_fork() initializes vruntime, vlag and other sched_entity
state, but does not clear rel_deadline.
3. On the first enqueue, enqueue_entity() calls place_entity().
4. Because se->rel_deadline is set, place_entity() treats se->deadline
as a relative deadline and converts it to an absolute deadline by
adding the current vruntime.
5. However, the forked entity's deadline is not a valid inherited
relative deadline for this new scheduling instance, so the conversion
produces an abnormally large deadline.
6. If the task later calls sched_yield(), yield_task_fair() advances
se->vruntime to se->deadline.
7. The inflated vruntime is then used by the following enqueue path,
where the vruntime-derived key can overflow when multiplied by the
entity weight.
8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility
calculation, and can eventually make all entities appear ineligible.
pick_next_entity() may then return NULL unexpectedly, leading to a
later NULL dereference.
A captured trace shows the effect clearly. Before yield, the entity's
vruntime was around:
This shows that the deadline had already become abnormally large before
yield_task_fair() copied it into vruntime.
rel_deadline is only meaningful when se->deadline really carries a
relative deadline that still needs to be placed against vruntime. A
freshly forked sched_entity should not inherit or retain this state.
Clear se->rel_deadline in __sched_fork(), together with the other
sched_entity runtime state, so that the first enqueue does not interpret
the new entity's deadline as a stale relative deadline.
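The placement logic described in the steps above can be modelled with a small userspace sketch. All names are hypothetical and the model is heavily simplified (no weights, no lag); it only shows why a stale rel_deadline flag inflates the deadline and how clearing it at fork time avoids that.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the relevant sched_entity fields. */
struct demo_se {
	uint64_t vruntime;
	uint64_t deadline;
	bool rel_deadline;
};

/* Model of placement on enqueue. */
static void demo_place_entity(struct demo_se *se, uint64_t vruntime,
			      uint64_t vslice)
{
	se->vruntime = vruntime;

	if (se->rel_deadline) {
		/* deadline is treated as relative: convert to absolute */
		se->deadline += se->vruntime;
		se->rel_deadline = false;
		return;
	}

	/* normal case: a fresh deadline one slice ahead */
	se->deadline = se->vruntime + vslice;
}

/* Model of the fork-time initialisation with the fix applied. */
static void demo_sched_fork(struct demo_se *se)
{
	se->vruntime = 0;
	se->rel_deadline = false;	/* the fix: do not retain stale state */
}
```

With the flag cleared, the first enqueue takes the normal path and the deadline stays within one slice of vruntime.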
Vincent Guittot [Wed, 22 Apr 2026 09:34:00 +0000 (11:34 +0200)]
sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
Similar to how pick_next_entity() must dequeue delayed entities, so too must
wakeup_preempt_fair(). Any delayed task being found means it is eligible and
hence past the 0-lag point, ready for removal.
Worse, by not removing delayed entities from consideration, it can skew the
preemption decision, with the end result that a short slice wakeup will not
result in a preemption.
Peter Zijlstra [Thu, 23 Apr 2026 11:22:22 +0000 (13:22 +0200)]
sched/fair: Fix the negative lag increase fix
Vincent reported that my rework of his original patch lost a little
something.
Specifically it got the return value wrong; it should not compare
against the old se->vlag, but rather against the current value. Since
the thing that matters is if the effective vruntime of an entity is
affected and the thing needs repositioning or not.
Fixes: 059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
ALSA: usb-audio: Avoid potential endless loop in convert_chmap_v3()
convert_chmap_v3() has a loop that advances by cs_desc->wLength on each
iteration, but we forgot to validate cs_desc->wLength itself, so a
malformed descriptor may lead to a potential endless loop.
Add a proper size check that aborts the loop to plug the hole.
ALSA: usb-audio: Fix potential leak of pd at parsing UAC3 streams
When parsing UAC3 streams, we allocate a PD object each time, and
either assign or free it. But there is a case where the PD object may
be leaked; namely, in __snd_usb_parse_audio_interface() loop, when an
audioformat shares the same endpoint with others, it's put to a link
and returns from snd_usb_add_audio_stream(), but the PD is forgotten
afterwards. Overall, the treatment of PD object in the parser code is
a bit flaky, and we should be more careful about the object ownership.
This patch tries to fix the above case and improve the code a bit.
The pd object is now managed with the auto-cleanup in the loop, and
the ownership is updated when the pd object gets assigned to the
stream, which guarantees the release of the leftover object.
ALSA: caiaq: Don't abort when no input device is available
The previous fix to handle the error from setup_card() caused a
regression for the models that have no dedicated input device;
snd_usb_caiaq_input_init() just returns -EINVAL, and we treat it as a
fatal error although it should be ignored.
As a regression fix, change the error code to -ENODEV, and ignore this
error in the callee, to continue probing.
ALSA: caiaq: Fix potentially leftover ep1_in_urb at error path
The previous fix for handling the error from setup_card() missed that
an internal URB cdev->ep1_in_urb might have been already submitted
beforehand. In the normal case, this URB gets killed at the
disconnection, but in the error path, we didn't do it, hence there can
be a potential leak.
xfrm: Don't clobber inner headers when already set
On VXLAN over IPsec egress, xfrm{4,6}_transport_output() blindly
overwrite inner_transport_header (== the inner TCP header saved in VXLAN
iptunnel_handle_offloads() -> skb_reset_inner_headers()) with the
current transport_header (== the VXLAN outer UDP header set by
udp_tunnel_xmit_skb()).
This was a latent bug, harmless until commit [1] added a doff validation
check in qdisc_pkt_len_segs_init() for encapsulated GSO packets. With
the wrong inner_transport_header set by xfrm, qdisc_pkt_len_segs_init()
interprets inner_transport_header as a TCP header, reads doff=0 from the
upper byte of the VNI and drops the packet with DROP_REASON_SKB_BAD_GSO.
Besides the use in GSO to determine the header size of segmented
packets, inner_transport_header might be used by drivers to set up
inner checksum offloading by pointing the HW to the inner transport
header. A quick browse through available drivers shows that mlx5 uses
skb->csum_start specifically for this scenario, while others either
don't support VXLAN over IPsec crypto offload (ixgbe) or the HW is
capable of parsing the packets itself (nfp, Chelsio).
But in all cases, it is more correct to let the inner_transport_header
point to the innermost header instead of overwriting it in xfrm.
So fix this by guarding all four inner header save sites in
xfrm_output.c (xfrm{4,6}_transport_output, xfrm{4,6}_tunnel_encap_add)
with a check for skb->inner_protocol. When inner_protocol is set, a
tunnel layer (VXLAN, Geneve, GRE, etc.) has already saved the correct
inner header offsets and they must not be overwritten. When
inner_protocol is zero, no prior tunnel encapsulation exists and xfrm
must save the inner headers itself. The tunnel mode checks are only
added for completeness; they aren't strictly required, since
xfrm_output() forces software GSO in tunnel mode before encap.
This makes the previously added test pass:
# ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
TAP version 13
1..4
ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4
ok 2 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v6
ok 3 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v4
ok 4 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v6
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] commit 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
Fixes: f1bd7d659ef0 ("xfrm: Add encapsulation header offsets while SKB is not encrypted")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
There are VXLAN tests and IPsec tests, but there is no test that
combines the two protocols and exercises the tunnel-over-ipsec code
paths. Fix that by adding a traffic test with VXLAN and IPsec using
crypto offload. This is runnable on HW which supports ESP offload (so no
nsim unfortunately).
Traffic is done with iperf3 and the test validates that there are no
packet drops and iperf3 can get to at least 100 Mbps (a very
conservative value on today's crypto offload HW, as it can typically
reach multi-Gbps rates).
As things stand, the test fails due to a recently exposed bug in xfrm,
which will be fixed in the next patch:
# ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
TAP version 13
1..4
# Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py,
# line 161, in test_vxlan_ipsec_crypto_offload:
# Check| ksft_eq(drops_after - drops_before, 0,
# Check failed 189 != 0 TX drops during VXLAN+IPsec
# Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py,
# line 163, in test_vxlan_ipsec_crypto_offload:
# Check| ksft_ge(bw_gbps, 0.1,
# Check failed 0.0015058278404812596 < 0.1 Minimum 100Mbps over
# VXLAN+IPsec
not ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4
...
tools/selftests: Use a sensible timeout value for iperf3 client
The default timeout of cmd() is 5 seconds and Iperf3Runner requests the
iperf3 client to run for 10 seconds, which clearly doesn't work since
commit [1] enforced the timeout parameter.
Use a value derived from duration as timeout (+5 seconds for
startup/teardown/various other overhead).
In aw88395_i2c_probe(), if `devm_gpiod_get_optional()` fails, it returns
an ERR_PTR() error pointer. The current code only prints a message and
continues execution, leaving `aw88395->reset_gpio` as an invalid pointer.
Later, in `aw88395_hw_reset()`, this invalid pointer is passed to
`gpiod_set_value_cansleep()`, which dereferences it and causes a kernel
panic.
For optional GPIOs, `devm_gpiod_get_optional()` returns NULL if the GPIO
is not defined in the DT, which is safe. If it returns an ERR_PTR, it
means a real error occurred (e.g., -EPROBE_DEFER) and the probe must be
aborted.
Also, since the GPIO is optional, remove the dev_err() log in
aw88395_hw_reset() when the GPIO is missing to match the optional
semantics. This also fixes a potential NULL pointer dereference as
aw_pa is not initialized when aw88395_hw_reset() is called.
firmware: google: Add bounds checks in coreboot_table_populate()
coreboot_table_populate() iterates over firmware-provided table entries
with no validation that the entries stay within the mapped memory
region. A corrupt table with a large `entry->size` advances `ptr_entry`
past the mapped region, causing an out-of-bounds read on the next
iteration.
Add a check before dereferencing `ptr_entry` to ensure the entry header
is readable, and a second check after reading `entry->size` to ensure
the full entry stays within the mapped region.
Pass `len` from coreboot_table_probe() into coreboot_table_populate() to
make the mapped region size available for validation.
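The two checks can be sketched in userspace as follows. The entry layout and function name are hypothetical stand-ins, and the extra rejection of sizes smaller than the header is an assumption added here so the walk always makes progress; the commit message does not state it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry header: a tag followed by the total entry size. */
struct demo_cb_entry {
	uint32_t tag;
	uint32_t size;
};

/*
 * Walk num_entries entries inside a mapped region of len bytes.
 * Returns the number of entries walked, or -1 if the table is corrupt.
 */
static int demo_cb_walk(const uint8_t *base, size_t len,
			unsigned int num_entries)
{
	size_t off = 0;
	unsigned int i;

	for (i = 0; i < num_entries; i++) {
		struct demo_cb_entry entry;

		/* Check 1: the header itself must be readable. */
		if (len - off < sizeof(entry))
			return -1;
		memcpy(&entry, base + off, sizeof(entry));

		/*
		 * Check 2: the full entry must stay inside the region.
		 * Assumption: sizes below the header size are also invalid.
		 */
		if (entry.size < sizeof(entry) || entry.size > len - off)
			return -1;

		off += entry.size;
	}

	return (int)num_entries;
}
```

Because off never exceeds len, the subtraction len - off cannot underflow.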
netpoll_setup() decides whether to auto-populate the local source
address by testing np->local_ip.ip, which only inspects the first 4
bytes of the union inet_addr storage.
For an IPv6 netpoll whose caller-supplied local address has all-zero
high 32 bits (::1, ::<suffix>, IPv4-mapped ::ffff:a.b.c.d, etc.), this
misdetects the address as unset (it is set, but its first 4 bytes are
zero), calls netpoll_take_ipv6() and overwrites it with
whatever matching link-local/global address the device happens to expose
first.
Introduce a helper netpoll_local_ip_unset() that picks the correct
family-aware test (ipv6_addr_any() for IPv6, !.ip for IPv4) and use it
from netpoll_setup().
Reproducer is something like:
echo "::2" > local_ip
echo 1 > enabled
cat local_ip
# before this fix: 2001:db8::1 (caller-supplied ::2 was clobbered)
# after this fix: ::2
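A userspace sketch of the family-aware test. The union and helper names are hypothetical, and a plain memcmp() against zero stands in for ipv6_addr_any().

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for union inet_addr. */
union demo_inet_addr {
	uint32_t ip;		/* IPv4 view: aliases the first 4 bytes */
	uint8_t in6[16];	/* IPv6 view: all 16 bytes */
};

/* Family-aware "is the local address unset?" test. */
static bool demo_local_ip_unset(const union demo_inet_addr *addr, bool ipv6)
{
	static const uint8_t zero[16];

	if (ipv6)
		return memcmp(addr->in6, zero, sizeof(zero)) == 0;

	return addr->ip == 0;
}
```

For ::2 the IPv4 view reads as zero, which is exactly the misdetection described above, while the IPv6 view correctly reports the address as set.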
tcp: make probe0 timer handle expired user timeout
tcp_clamp_probe0_to_user_timeout() computes remaining time in jiffies
using subtraction with an unsigned lvalue. If elapsed probing time
exceeds the configured TCP_USER_TIMEOUT, the underflow yields a large
value.
This ends up re-arming the probe timer for a full backoff interval
instead of expiring immediately, delaying connection teardown beyond
the configured timeout.
Fix this by preventing underflow so user-set timeout expiration is
handled correctly without extending the probe timer.
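The underflow can be illustrated with a small sketch that compares an unguarded subtraction with a guarded one. Function names and the minimum-timeout floor are hypothetical, all values are in abstract ticks, and the guarded variant is only one way to prevent the underflow; the actual patch may differ.

```c
#include <stdint.h>

#define DEMO_TIMEOUT_MIN 2u	/* hypothetical floor, in ticks */

static uint32_t demo_min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static uint32_t demo_max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Unguarded: wraps to a huge value once elapsed > user_timeout. */
static uint32_t demo_clamp_naive(uint32_t when, uint32_t user_timeout,
				 uint32_t elapsed)
{
	uint32_t remaining = user_timeout - elapsed;

	remaining = demo_max_u32(remaining, DEMO_TIMEOUT_MIN);
	return demo_min_u32(remaining, when);
}

/* Guarded: an expired timeout leaves zero remaining time. */
static uint32_t demo_clamp_fixed(uint32_t when, uint32_t user_timeout,
				 uint32_t elapsed)
{
	uint32_t remaining = 0;

	if (elapsed < user_timeout)
		remaining = user_timeout - elapsed;

	remaining = demo_max_u32(remaining, DEMO_TIMEOUT_MIN);
	return demo_min_u32(remaining, when);
}
```

With an expired timeout the unguarded version hands back the full backoff interval, whereas the guarded version returns the floor so the timer fires almost immediately.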
Mingming Cao [Fri, 24 Apr 2026 16:29:17 +0000 (09:29 -0700)]
ibmveth: Disable GSO for packets with small MSS
Some physical adapters on Power systems do not support segmentation
offload when the MSS is less than 224 bytes. Attempting to send such
packets causes the adapter to freeze, stopping all traffic until
manually reset.
Implement ndo_features_check to disable GSO for packets with small MSS
values. The network stack will perform software segmentation instead.
The 224-byte minimum matches ibmvnic
commit <f10b09ef687f> ("ibmvnic: Enforce stronger sanity checks
on GSO packets")
which uses the same physical adapters in SEA configurations.
The issue occurs specifically when the hardware attempts to perform
segmentation (gso_segs > 1) with a small MSS. Single-segment GSO packets
(gso_segs == 1) do not trigger the problematic LSO code path and are
transmitted normally without segmentation.
Add an ndo_features_check callback to disable GSO when MSS < 224 bytes.
Also call vlan_features_check() to ensure proper handling of VLAN packets,
particularly QinQ (802.1ad) configurations where the hardware parser may
not support certain offload features.
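A minimal sketch of the feature-masking idea, assuming a hypothetical feature bitmask and helper; it only models the MSS threshold and leaves out the VLAN handling.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_F_GSO_MASK  0x00ffu	/* hypothetical GSO feature bits */
#define DEMO_MIN_GSO_MSS 224u		/* threshold from the text above */

/*
 * Return the features usable for one packet: GSO bits are cleared when
 * a GSO packet carries an MSS below the threshold, so the stack would
 * fall back to software segmentation.
 */
static uint32_t demo_features_check(uint32_t features, bool is_gso,
				    uint16_t mss)
{
	if (is_gso && mss < DEMO_MIN_GSO_MSS)
		features &= ~DEMO_F_GSO_MASK;

	return features;
}
```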
Validated using iptables to force small MSS values. Without the fix,
the adapter freezes. With the fix, packets are segmented in software
and transmission succeeds. Comprehensive regression testing completed
(MSS tests, performance, stability).
Fixes: 8641dd85799f ("ibmveth: Add support for TSO")
Cc: stable@vger.kernel.org
Reviewed-by: Brian King <bjking1@linux.ibm.com>
Tested-by: Shaik Abdulla <shaik.abdulla1@ibm.com>
Tested-by: Naveed Ahmed <naveedaus@in.ibm.com>
Signed-off-by: Mingming Cao <mmc@linux.ibm.com>
Link: https://patch.msgid.link/20260424162917.65725-1-mmc@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
neigh_xmit always releases the skb, except when no neighbour table is
found. But even the first added user of neigh_xmit (mpls) relied on
neigh_xmit to release the skb (or queue it for tx).
sashiko reported:
If neigh_xmit() is called with an uninitialized neighbor table (for
example, NEIGH_ND_TABLE when IPv6 is disabled), it returns -EAFNOSUPPORT
and bypasses its internal out_kfree_skb error path. Because the return
value of neigh_xmit() is ignored here, does this leak the SKB?
Assume full ownership and remove the last code path that doesn't
xmit or free skb.
With CONFIG_IP_MROUTE_MULTIPLE_TABLES=n, ipmr_fib_lookup()
does not check if net->ipv4.mrt is NULL.
Since default_device_exit_batch() is called after ->exit_rtnl(),
a device could receive IGMP packets and access net->ipv4.mrt
during/after ipmr_rules_exit_rtnl().
If ipmr_rules_exit_rtnl() had already cleared it and freed the
memory, the access would trigger null-ptr-deref or use-after-free.
Let's fix it by using RCU helper and free mrt after RCU grace
period.
In addition, check_net(net) is added to mroute_clean_tables()
and ipmr_cache_unresolved() to synchronise via mfc_unres_lock.
This prevents ipmr_cache_unresolved() from putting skb into
c->_c.mfc_un.unres.unresolved after mroute_clean_tables()
purges it.
For the same reason, timer_shutdown_sync() is moved after
mroute_clean_tables().
Since rhltable_destroy() holds mutex internally, rcu_work is
used, and it is placed as the first member because rcu_head
must be placed within <4K offset. mr_table is already 3864
bytes without rcu_work.
Note that IP6MR is not yet converted to ->exit_rtnl(), so this
change is not needed for now but will be.
However pn_socket_bind() also returns -EINVAL when sk->sk_state is not
TCP_CLOSE, even when the socket has never been bound and pn_port() is
still 0. In that case the BUG_ON() fires and panics the kernel from a
user-triggerable path.
Treat the "bind returned -EINVAL but pn_port() is still 0" case as a
regular error and propagate -EINVAL to the caller instead of crashing.
Existing callers already translate a non-zero return from
pn_socket_autobind() into -ENOBUFS/-EAGAIN, so returning -EINVAL here
only changes behaviour from panic to a normal errno.
====================
net/sched: taprio: fix NULL pointer dereference in class dump
Patch 1/2 is the fix: replace NULL entries in q->qdiscs[] with the
global &noop_qdisc singleton so that control-plane dump paths, as well
as the existing NULL guards in the data-plane enqueue/dequeue paths,
cannot deref a NULL child qdisc.
Patch 2/2 is a tdc regression test that drives the graft + delete +
class-dump sequence on a multi-queue netdevsim device. It panics the
vulnerable kernel and passes on the fixed one.
====================
Weiming Shi [Wed, 22 Apr 2026 16:19:59 +0000 (00:19 +0800)]
selftests/tc-testing: add taprio test for class dump after child delete
Add a regression test for the NULL pointer dereference fixed in the
previous commit. Before the fix, taprio_graft() stored NULL into
q->qdiscs[cl - 1] when an explicitly grafted child qdisc was deleted
via RTM_DELQDISC; the next RTM_GETTCLASS dump then crashed the kernel
in taprio_dump_class() while reading child->handle.
The test installs a taprio root qdisc on a multi-queue netdevsim
device, grafts a pfifo child onto class 8001:1, deletes that child,
and then performs a class dump. On a fixed kernel the dump succeeds
and all eight taprio classes are listed; on an unpatched kernel the
class dump crashes, which surfaces as a test failure.
Weiming Shi [Wed, 22 Apr 2026 16:19:58 +0000 (00:19 +0800)]
net/sched: taprio: fix NULL pointer dereference in class dump
When a TAPRIO child qdisc is deleted via RTM_DELQDISC, taprio_graft()
is called with new == NULL and stores NULL into q->qdiscs[cl - 1].
Subsequent RTM_GETTCLASS dump operations walk all classes via
taprio_walk() and call taprio_dump_class(), which calls taprio_leaf()
returning the NULL pointer, then dereferences it to read child->handle,
causing a kernel NULL pointer dereference.
The bug is reachable with namespace-scoped CAP_NET_ADMIN on any kernel
with CONFIG_NET_SCH_TAPRIO enabled. On systems with unprivileged user
namespaces enabled, an unprivileged local user can trigger a kernel
panic by creating a taprio qdisc inside a new network namespace,
grafting an explicit child qdisc, deleting it, and requesting a class
dump. The RTM_GETTCLASS dump itself requires no capability.
Fix this by substituting &noop_qdisc when new is NULL in
taprio_graft(), a common pattern used by other qdiscs (e.g.,
multiq_graft()) to ensure the q->qdiscs[] slots are never NULL.
This makes control-plane dump paths safe without requiring individual
NULL checks.
Since the data-plane paths (taprio_enqueue and taprio_dequeue_from_txq)
previously had explicit NULL guards that would drop/skip the packet
cleanly, update those checks to test for &noop_qdisc instead. Without
this, packets would reach taprio_enqueue_one() which increments the root
qdisc's qlen and backlog before calling the child's enqueue; noop_qdisc
drops the packet but those counters are never rolled back, permanently
inflating the root qdisc's statistics.
After this change *old can be a valid qdisc, NULL, or &noop_qdisc.
Only call qdisc_put(*old) in the first case to avoid decreasing
noop_qdisc's refcount, which was never increased.
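The sentinel pattern can be sketched in userspace as below. All types and names are hypothetical simplifications (a plain counter instead of the real refcount, no locking, no offload handling); the sketch only shows that slots never hold NULL and that the shared sentinel is never put.

```c
#include <stddef.h>

struct demo_qdisc {
	unsigned int handle;
	int refcnt;
};

/* Shared sentinel standing in for the global noop qdisc. */
static struct demo_qdisc demo_noop_qdisc = { .handle = 0, .refcnt = 1 };

#define DEMO_NUM_QUEUES 4

struct demo_taprio {
	struct demo_qdisc *qdiscs[DEMO_NUM_QUEUES];
};

static void demo_qdisc_put(struct demo_qdisc *q)
{
	q->refcnt--;
}

/* Graft new onto class cl (1-based); NULL means "delete the child". */
static void demo_graft(struct demo_taprio *q, unsigned int cl,
		       struct demo_qdisc *new)
{
	struct demo_qdisc *old = q->qdiscs[cl - 1];

	if (!new)
		new = &demo_noop_qdisc;	/* slot is never left NULL */

	q->qdiscs[cl - 1] = new;

	/* Only a real child holds a reference that must be dropped. */
	if (old && old != &demo_noop_qdisc)
		demo_qdisc_put(old);
}

/* Dump path: safe to dereference once a slot has been grafted. */
static unsigned int demo_dump_class(const struct demo_taprio *q,
				    unsigned int cl)
{
	return q->qdiscs[cl - 1]->handle;
}
```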
Fixes: 665338b2a7a0 ("net/sched: taprio: dump class stats for the actual q->qdiscs[]")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260422161958.2517539-3-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'xsa48x-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"XSA-485 and XSA-487 security patches"
* tag 'xsa48x-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: fix double free via VMA splitting
Buffer overflow in drivers/xen/sys-hypervisor.c
Paul Geurts [Wed, 22 Apr 2026 10:09:30 +0000 (12:09 +0200)]
NFC: trf7970a: Ignore antenna noise when checking for RF field
The main channel Received Signal Strength Indicator (RSSI) measurement
is used to determine whether an RF field is present or not. RSSI != 0
is interpreted as an RF Field is present. This does not take RF noise
and measurement inaccuracy into account, and results in false positives
in the field.
Define a noise level and make sure the RF field is only interpreted as
present when the RSSI is above the noise level.
Fixes: 851ee3cbf850 ("NFC: trf7970a: Don't turn on RF if there is already an RF field")
Signed-off-by: Paul Geurts <paul.geurts@prodrive-technologies.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reviewed-by: Mark Greer <mgreer@animalcreek.com>
Link: https://patch.msgid.link/20260422100930.581237-1-paul.geurts@prodrive-technologies.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Gu [Mon, 27 Apr 2026 17:42:00 +0000 (01:42 +0800)]
spi: amlogic-spisg: initialize completion before requesting IRQ
Move init_completion(&spisg->completion) to before devm_request_irq()
to avoid a potential race condition where an interrupt could fire
before the completion structure is initialized.
net: usb: rtl8150: free skb on usb_submit_urb() failure in xmit
When rtl8150_start_xmit() fails to submit the tx URB, the URB is never
handed to the USB core and write_bulk_callback() will not run. The
driver returns NETDEV_TX_OK, which tells the networking stack that the
skb has been consumed, but nothing actually frees the skb on this
error path:
dev->tx_skb = skb;
...
if ((res = usb_submit_urb(dev->tx_urb, GFP_ATOMIC))) {
...
/* no kfree_skb here */
}
return NETDEV_TX_OK;
This leaks the skb on every submit failure and also leaves dev->tx_skb
pointing at memory that the driver itself may later free, which is
fragile.
Free the skb with dev_kfree_skb_any() in the error path and clear
dev->tx_skb so no stale pointer is left behind.
Zhan Jun [Thu, 23 Apr 2026 00:49:12 +0000 (08:49 +0800)]
net: usb: rtl8150: fix use-after-free in rtl8150_start_xmit()
syzbot reported a KASAN slab-use-after-free read in rtl8150_start_xmit()
when accessing skb->len for tx statistics after usb_submit_urb() has
been called:
BUG: KASAN: slab-use-after-free in rtl8150_start_xmit+0x71f/0x760
drivers/net/usb/rtl8150.c:712
Read of size 4 at addr ffff88810eb7a930 by task kworker/0:4/5226
The URB completion handler write_bulk_callback() frees the skb via
dev_kfree_skb_irq(dev->tx_skb). The URB may complete on another CPU
in softirq context before usb_submit_urb() returns in the submitter,
so by the time the submitter reads skb->len the skb has already been
queued to the per-CPU completion_queue and freed by net_tx_action():
CPU A (xmit) CPU B (USB completion softirq)
------------ ------------------------------
dev->tx_skb = skb;
usb_submit_urb() --+
|-------> write_bulk_callback()
| dev_kfree_skb_irq(dev->tx_skb)
| net_tx_action()
| napi_skb_cache_put() <-- free
netdev->stats.tx_bytes |
+= skb->len; <-- UAF read
Fix it by caching skb->len before submitting the URB and using the
cached value when updating the tx_bytes counter.
The pre-existing tx_bytes semantics are preserved: the counter tracks
the original frame length (skb->len), not the ETH_ZLEN/USB-alignment
padded "count" value that is handed to the device. Changing that
would be a user-visible accounting change and is out of scope for
this UAF fix.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+3f46c095ac0ca048cb71@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69e69ee7.050a0220.24bfd3.002b.GAE@google.com/
Closes: https://syzkaller.appspot.com/bug?extid=3f46c095ac0ca048cb71
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Zhan Jun <zhanjun@uniontech.com>
Link: https://patch.msgid.link/809895186B866C10+20260423004913.136655-1-zhangdandan@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6: rpl: reserve mac_len headroom when recompressed SRH grows
ipv6_rpl_srh_rcv() decompresses an RFC 6554 Source Routing Header, swaps
the next segment into ipv6_hdr->daddr, recompresses, then pulls the old
header and pushes the new one plus the IPv6 header back. The
recompressed header can be larger than the received one when the swap
reduces the common-prefix length the segments share with daddr (CmprI=0,
CmprE>0, seg[0][0] != daddr[0] gives the maximum +8 bytes).
pskb_expand_head() was gated on segments_left == 0, so on earlier
segments the push consumed unchecked headroom. Once skb_push() leaves
fewer than skb->mac_len bytes in front of data,
skb_mac_header_rebuild()'s call to:
skb_set_mac_header(skb, -skb->mac_len);
will store (data - head) - mac_len into the u16 mac_header field, which
wraps to ~65530, and the following memmove() writes mac_len bytes ~64KiB
past skb->head.
A single AF_INET6/SOCK_RAW/IPV6_HDRINCL packet over lo with a two
segment type-3 SRH (CmprI=0, CmprE=15) reaches headroom 8 after one
pass; KASAN reports a 14-byte OOB write in ipv6_rthdr_rcv.
Fix this by expanding the head whenever the remaining room is less than
the push size plus mac_len, and request that much extra so the rebuilt
MAC header fits afterwards.
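The wrap-around can be checked with plain integer arithmetic. The helpers below are hypothetical models of the offset computation and of the headroom condition stated above; they do not touch real skbs.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of storing (data - head) - mac_len into a 16-bit offset field.
 * When headroom < mac_len the unsigned result wraps.
 */
static uint16_t demo_mac_header_offset(unsigned int headroom,
				       unsigned int mac_len)
{
	return (uint16_t)(headroom - mac_len);
}

/*
 * Model of the fixed condition: expand whenever the current headroom
 * cannot hold both the pushed bytes and the rebuilt MAC header.
 */
static bool demo_needs_expand(unsigned int headroom, unsigned int push_len,
			      unsigned int mac_len)
{
	return headroom < push_len + mac_len;
}
```

With a headroom of 8 and a 14-byte MAC header, the stored offset becomes 65530, which matches the "~65530" figure and the 14-byte out-of-bounds write in the report.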
Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr")
Cc: stable <stable@kernel.org>
Reported-by: Anthropic
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/2026042133-gout-unvented-1bd9@gregkh
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
vrf: Fix a potential NPD when removing a port from a VRF
RCU readers that identified a net device as a VRF port using
netif_is_l3_slave() assume that a subsequent call to
netdev_master_upper_dev_get_rcu() will return a VRF device. They then
continue to dereference its l3mdev operations.
This assumption is not always correct and can result in a NPD [1]. There
is no RCU synchronization when removing a port from a VRF, so it is
possible for an RCU reader to see a new master device (e.g., a bridge)
that does not have l3mdev operations.
Fix by adding RCU synchronization after clearing the IFF_L3MDEV_SLAVE
flag. Skip this synchronization when a net device is removed from a VRF
as part of its deletion and when the VRF device itself is deleted. In
the latter case an RCU grace period will pass by the time RTNL is
released.
Eric Dumazet [Thu, 23 Apr 2026 06:28:39 +0000 (06:28 +0000)]
net/sched: sch_choke: annotate data-races in choke_dump_stats()
choke_dump_stats() only runs with RTNL held.
It reads fields that can be changed in qdisc fast path.
Add READ_ONCE()/WRITE_ONCE() annotations.
Fixes: edb09eb17ed8 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260423062839.2524324-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Fri, 24 Apr 2026 09:00:28 +0000 (11:00 +0200)]
net: airoha: Do not read uninitialized fragment address in airoha_dev_xmit()
The transmit loop in airoha_dev_xmit() reads fragment address and length
during its final iteration, when the loop index equals
skb_shinfo(skb)->nr_frags, at which point the fragment data is
uninitialized. While these values are never consumed, the read itself is
unsafe and may trigger a page fault. Fix this by avoiding the fragment
read on the last iteration.
Additionally, move the skb pointer from the first to the last used packet
descriptor, so that airoha_qdma_tx_napi_poll() defers freeing the skb
until the final descriptor is processed.
Lorenzo Bianconi [Tue, 21 Apr 2026 08:53:33 +0000 (10:53 +0200)]
net: airoha: Do not wake all netdev TX queues in airoha_qdma_wake_netdev_txqs()
Do not wake every netdev TX queue across all ports sharing the QDMA by
running the netif_tx_wake_all_queues() routine in
airoha_qdma_wake_netdev_txqs(); wake only the ones that are mapped to
the specific stopped QDMA hw TX queue.
This patch can potentially avoid waking already stopped netdev TX queues
that are mapped to a different QDMA hw TX queue.
Introduce airoha_qdma_get_txq utility routine.
Lorenzo Bianconi [Tue, 21 Apr 2026 06:43:07 +0000 (08:43 +0200)]
net: airoha: stop net_device TX queue before updating CPU index
Currently, the airoha_eth driver updates the CPU index register prior to
verifying whether the number of free descriptors has fallen below the
threshold.
Move the net_device TX queue length check before the TX CPU index
update, so that the TX CPU index is still updated when more packets are
pending but the net_device TX queue is about to be stopped because of
the inflight packets.
Lorenzo Bianconi [Tue, 21 Apr 2026 06:35:11 +0000 (08:35 +0200)]
net: airoha: fix BQL imbalance in TX path
Fix a possible BQL imbalance in airoha_dev_xmit(), where inflight
packets are accounted only for the AIROHA_NUM_TX_RING netdev TX
queues. The queue index is computed as:
Jakub Kicinski [Tue, 28 Apr 2026 00:30:48 +0000 (17:30 -0700)]
Merge branch 'netem-bug-fixes'
Stephen Hemminger says:
====================
netem: bug fixes
These bugs were found when doing AI-assisted review of sch_netem.c
during investigation of the packet duplication recursion problem
addressed in Jamal's series.
The fixes cover:
- probability gaps in the 4-state Markov loss model
- queue limit not accounting for reordered packets
- PRNG reseeded on every tc change, breaking reproducibility
- slot configuration not validated (inverted ranges, negative
delays, negative limits)
- slot delay arithmetic overflow for ranges above ~2.1 seconds
- negative latency and jitter wrapping to huge time_to_send
values via u64 arithmetic
====================
net/sched: netem: check for negative latency and jitter
Reject requests with negative latency or jitter.
A negative value added to current timestamp (u64) wraps
to an enormous time_to_send, disabling dequeue.
The original UAPI used u32 for these values; the conversion to 64-bit
time values via TCA_NETEM_LATENCY64 and TCA_NETEM_JITTER64
allowed signed values to reach the kernel without validation.
Jitter is already silently clamped by an abs() in netem_change();
that abs() can be removed in a follow-up once this rejection is in
place.
Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in nanoseconds")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-7-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
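The wrap described in the negative latency commit above can be reproduced in userspace. This is a minimal sketch, not the netem code: time_to_send_unchecked() and check_delay() are hypothetical names, and the wrap to a huge value is shown for a negative delay whose magnitude exceeds the timestamp.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical model: add a signed delay to an unsigned 64-bit timestamp
 * without validation. Converting a negative int64_t to uint64_t is well
 * defined in C (modulo 2^64), so the sum wraps instead of going negative.
 */
static uint64_t time_to_send_unchecked(uint64_t now_ns, int64_t latency_ns)
{
	return now_ns + (uint64_t)latency_ns;
}

/* Hypothetical validation step: reject negative latency or jitter. */
static int check_delay(int64_t latency_ns, int64_t jitter_ns)
{
	if (latency_ns < 0 || jitter_ns < 0)
		return -EINVAL;
	return 0;
}
```

With now_ns = 1000 and latency_ns = -2000 the unchecked sum is 2^64 - 1000, a send time that is effectively never reached.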
get_slot_next() computes a random delay between min_delay and
max_delay using:
get_random_u32() * (max_delay - min_delay) >> 32
This overflows signed 64-bit arithmetic when the delay range exceeds
approximately 2.1 seconds (2^31 nanoseconds), producing a negative
result that effectively disables slot-based pacing. This is a
realistic configuration for WAN emulation (e.g., slot 1s 5s).
Use mul_u64_u32_shr() which handles the widening multiply without
overflow.
Fixes: 0a9fe5c375b5 ("netem: slotting with non-uniform distribution")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-6-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
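The overflow in the slot delay computation above can be modelled in userspace. This is a sketch under assumptions: mul_u64_u32_shr_sketch() is a stand-in for the kernel helper built on the GCC/Clang unsigned __int128 extension, and slot_offset_old()/slot_offset_new() are hypothetical names. Because signed overflow is undefined behaviour in C, the old expression is imitated with an unsigned multiply that is then reinterpreted as signed (GCC and Clang wrap on that conversion and use an arithmetic right shift).

```c
#include <stdint.h>

/* Stand-in for mul_u64_u32_shr(): widen to 128 bits, multiply, shift. */
static uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul,
				       unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}

/* Model of the old 64-bit expression: rnd * range >> 32. */
static int64_t slot_offset_old(uint32_t rnd, int64_t range_ns)
{
	uint64_t product = (uint64_t)rnd * (uint64_t)range_ns;

	return (int64_t)product >> 32;
}

/* Model of the fixed expression using the widening multiply. */
static int64_t slot_offset_new(uint32_t rnd, int64_t range_ns)
{
	return (int64_t)mul_u64_u32_shr_sketch((uint64_t)range_ns, rnd, 32);
}
```

For a 4 second range (4000000000 ns) and the largest random value, the old model yields a negative offset while the widened version stays just below the range; for small ranges both agree.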
Reject slot configurations that have no defensible meaning:
- negative min_delay or max_delay
- min_delay greater than max_delay
- negative dist_delay or dist_jitter
- negative max_packets or max_bytes
Negative or out-of-order delays underflow in get_slot_next(),
producing garbage intervals. Negative limits trip the per-slot
accounting (packets_left/bytes_left <= 0) on the first packet of
every slot, defeating the rate-limiting half of the slot feature.
Note that dist_jitter has been silently coerced to its absolute
value by get_slot() since the feature was introduced; rejecting
negatives here converts that silent coercion into -EINVAL. The
abs() can be removed in a follow-up.
Fixes: 836af83b54e3 ("netem: support delivering packets in delayed time slots")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-5-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
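The four rejection rules listed in the slot validation commit above could be expressed roughly as follows. This is a userspace sketch: struct slot_conf is a hypothetical mirror of the fields named in the message, and validate_slot_sketch() is not the function added by the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the slot parameters named in the commit message. */
struct slot_conf {
	int64_t min_delay;
	int64_t max_delay;
	int32_t max_packets;
	int32_t max_bytes;
	int64_t dist_delay;
	int64_t dist_jitter;
};

/* Return 0 for an acceptable configuration, -EINVAL otherwise. */
static int validate_slot_sketch(const struct slot_conf *c)
{
	if (c->min_delay < 0 || c->max_delay < 0)
		return -EINVAL;		/* negative delays */
	if (c->min_delay > c->max_delay)
		return -EINVAL;		/* inverted range */
	if (c->dist_delay < 0 || c->dist_jitter < 0)
		return -EINVAL;		/* negative distribution parameters */
	if (c->max_packets < 0 || c->max_bytes < 0)
		return -EINVAL;		/* negative limits */
	return 0;
}
```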
net/sched: netem: only reseed PRNG when seed is explicitly provided
netem_change() unconditionally reseeds the PRNG on every tc change
command. If TCA_NETEM_PRNG_SEED is not specified, a new random seed
is generated, destroying reproducibility for users who set a
deterministic seed on a previous change.
Move the initial random seed generation to netem_init() and only
reseed in netem_change() when TCA_NETEM_PRNG_SEED is explicitly
provided by the user.
Fixes: 4072d97ddc44 ("netem: add prng attribute to netem_sched_data")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-4-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/sched: netem: fix queue limit check to include reordered packets
The queue limit check in netem_enqueue() uses q->t_len which only
counts packets in the internal tfifo. Packets placed in sch->q by
the reorder path (__qdisc_enqueue_head) are not counted, allowing
the total queue occupancy to exceed sch->limit under reordering.
Include sch->q.qlen in the limit check.
Fixes: f8d4bc455047 ("net/sched: netem: account for backlog updates from child qdisc")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-3-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/sched: netem: fix probability gaps in 4-state loss model
The 4-state Markov chain in loss_4state() has gaps at the boundaries
between transition probability ranges. The comparisons use:
if (rnd < a4)
else if (a4 < rnd && rnd < a1 + a4)
When rnd equals a boundary value exactly, neither branch matches and
no state transition occurs. The redundant lower-bound check (a4 < rnd)
is already implied by being in the else branch.
Remove the unnecessary lower-bound comparisons so the ranges are
contiguous and every random value produces a transition, matching
the GI (General and Intuitive) loss model specification.
This bug goes back to the original implementation of this model.
Fixes: 661b79725fea ("netem: revised correlated loss generator")
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260418032027.900913-2-stephen@networkplumber.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
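The boundary gap in the 4-state loss model commit above can be seen with a small userspace model. This is a sketch with hypothetical names: the first two comparisons follow the snippet quoted in the message, while the third range in pick_branch_old() is an assumption added so that the sketch has a complete set of ranges.

```c
#include <stdint.h>

/* Old style: explicit lower bounds leave holes at rnd == a4 and rnd == a1 + a4. */
static int pick_branch_old(uint32_t rnd, uint32_t a4, uint32_t a1)
{
	uint64_t upper = (uint64_t)a1 + a4;

	if (rnd < a4)
		return 1;
	else if (a4 < rnd && rnd < upper)
		return 2;
	else if (upper < rnd)	/* assumed third range, not quoted above */
		return 3;
	return 0;		/* boundary value: no branch matched */
}

/* New style: drop the lower bounds so the ranges are contiguous. */
static int pick_branch_new(uint32_t rnd, uint32_t a4, uint32_t a1)
{
	uint64_t upper = (uint64_t)a1 + a4;

	if (rnd < a4)
		return 1;
	else if (rnd < upper)
		return 2;
	return 3;
}
```

With a4 = 100 and a1 = 50, the old model matches nothing for rnd = 100 or rnd = 150, while the new one always selects a branch.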
Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix UAF race in psi pressure_write() against cgroup file release by
extending cgroup_mutex coverage and ordering of->priv access after
cgroup_kn_lock_live()
- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
by performing the increment in s64
- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
recording the CPU used by dl_bw_alloc() so cancel_attach() returns
the reservation to the same root domain
- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
rmdir by incrementing from kill_css() instead of offline_css()
- Typo fix in cgroup-v2 documentation
* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fix typo 'protetion' -> 'protection'
cgroup: Increment nr_dying_subsys_* from rmdir context
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
sched/psi: fix race between file release and pressure write
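The rdmacg_try_charge() item in the pull message above describes doing the increment in s64 when usage equals INT_MAX. A minimal userspace sketch of that idea follows; charge_would_exceed() is a hypothetical helper, not the cgroup code.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: would charging one more unit exceed the limit?
 * The increment is done in 64 bits, so usage == INT_MAX cannot overflow
 * a signed int.
 */
static bool charge_would_exceed(int usage, int max)
{
	int64_t new_usage = (int64_t)usage + 1;

	return new_usage > max;
}
```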
Documentation: net/smc: correct old value of smcr_max_recv_wr
The smc-sysctl.rst documentation incorrectly stated that
the previous hardcoded maximum number of WR buffers on
the receive path (smcr_max_recv_wr) was 16.
The correct historical value used before the introduction of
the sysctl control was 48. Update the documentation to reflect
the accurate historical value. Also fix a couple of minor typos.
Merge tag 'fs_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull isofs and udf fixes from Jan Kara:
"Several isofs and udf fixes"
* tag 'fs_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
docs: isofs: replace dead ECMA-119 FTP link
udf: reject descriptors with oversized CRC length
isofs: use QSTR_LEN() in isofs_cmp
isofs: validate block number from NFS file handle in isofs_export_iget
isofs: validate Rock Ridge CE continuation extent against volume size
Merge tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- space reservation fixes:
- correctly undo 'may_use' accounting for remap tree
- avoid double decrement of 'may_use' when submitting async io
- actually enable the shutdown ioctl callback (not just the superblock
ops)
- raid stripe tree fixes when deleting extents
- add missing error handling
- fix various incorrect values set
- fix transaction state when removing a directory, possibly leading to
EIO during log replay
- additional b-tree node key checks during metadata readahead
- error handling and transaction abort updates
* tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()
btrfs: check return value of btrfs_partially_delete_raid_extent()
btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer
btrfs: replace ASSERT with proper error handling in stripe lookup fallback
btrfs: fix wrong min_objectid in btrfs_previous_item() call
btrfs: fix raid stripe search missing entries at leaf boundaries
btrfs: copy devid in btrfs_partially_delete_raid_extent()
btrfs: handle unexpected free-space-tree key types
btrfs: fix missing last_unlink_trans update when removing a directory
btrfs: don't clobber errors in add_remap_tree_entries()
btrfs: enable shutdown ioctl for non-experimental builds
btrfs: apply first key check for readahead when possible
btrfs: abort transaction in do_remap_reloc_trans() on failure
btrfs: fix bytes_may_use leak in do_remap_reloc_trans()
btrfs: fix bytes_may_use leak in move_existing_remap()
David Windsor [Sun, 26 Apr 2026 23:23:49 +0000 (19:23 -0400)]
selinux: don't reserve xattr slot when we won't fill it
Move lsm_get_xattr_slot() below the SBLABEL_MNT check so we don't leave
a NULL-named slot in the array when returning -EOPNOTSUPP; filesystem
initxattrs() callbacks stop iterating at the first NULL ->name, silently
dropping xattrs installed by later LSMs.
Cc: stable@vger.kernel.org
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
selinux: use sk blob accessor in socket permission helpers
SELinux socket state lives in the composite LSM socket blob.
sock_has_perm() and nlmsg_sock_has_extended_perms() currently
dereference sk->sk_security directly, which assumes the SELinux socket
blob is at offset zero.
In stacked configurations that assumption does not hold. If another LSM
allocates socket blob storage before SELinux, these helpers may read the
wrong blob and feed invalid SID and class values into AVC checks.
Use selinux_sock() instead of accessing sk->sk_security directly.
Fixes: d1d991efaf34 ("selinux: Add netlink xperm support")
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Zongyao Chen <ZongYao.Chen@linux.alibaba.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
ASoC: Intel: cht_bsw_rt5672: Simplify probe() with local 'dev' pointer
In snd_cht_mc_probe(), &pdev->dev is dereferenced repeatedly throughout
the function. Introduce a local dev pointer
early in the function and use it consistently in place of all open-coded
&pdev->dev references.
It reduces repetition, improves readability, and aligns with the common
kernel driver pattern of caching the device pointer at function entry.
ASoC: wsa881x: Move custom workaround to gpiolib-of
The WSA881x codec driver has a local workaround for old device
trees that have the "powerdown" GPIO flagged as active high,
even though it is active low.
This quirk can be replaced by a single quirk entry in
gpiolib-of.c.
Drop all polarity inversion code and drop the surplus
gpiod_direction_output() call in probe() since we now set up
the line correctly when getting the GPIO.
Also drop the inclusion of the unused <linux/gpio.h>.
Syzbot reported an uninit-value bug [1] with a corrupted HFS+ image.
During file system mount, specifically while loading the catalog, a
corrupted node_size value of 1 caused the rec_off argument passed to
hfs_bnode_read_u16() (within hfs_bnode_find()) to be excessively large.
Consequently, the function failed to return a valid value to initialize
the off variable, triggering the bug [1].
Every node starts with a BTree node descriptor: struct hfs_bnode_desc.
So, the size of a node cannot be smaller than that. However, the technical
specification declares that: "The node size (which is expressed in bytes)
must be power of two, from 512 through 32,768, inclusive." Add a check
for the btree node size based on the technical specification.
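The rule quoted from the specification above (a power of two, from 512 through 32,768 inclusive) maps to a short check. This is a userspace sketch; node_size_valid() and the two macros are hypothetical names, not the ones used in the hfsplus patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NODE_SIZE_MIN 512u	/* lower bound quoted from the specification */
#define NODE_SIZE_MAX 32768u	/* upper bound quoted from the specification */

/* Accept only powers of two within [NODE_SIZE_MIN, NODE_SIZE_MAX]. */
static bool node_size_valid(uint32_t node_size)
{
	if (node_size < NODE_SIZE_MIN || node_size > NODE_SIZE_MAX)
		return false;
	/* A power of two has exactly one bit set. */
	return (node_size & (node_size - 1)) == 0;
}
```

The corrupted node_size of 1 from the report is rejected by the range test alone.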
Keith Busch [Tue, 21 Apr 2026 15:06:44 +0000 (08:06 -0700)]
PCI: Don't fallback to bus reset after failed slot reset
If a bus has hotplug slots that implement the slot's reset_slot callback,
it is not safe to do the non-slot-specific bus reset, so don't fall back to
it. If a slot reset does fail, the subsequent bus reset will attempt a second
link reset on top of the previous one and fail to handle the hotplug events.
hfsplus_get_block() only allows creating the next sequential block and
returns -EIO for direct writes beyond EOF. This patch first waits for any
in-flight DIO on the inode to finish. Then it extends the file by calling
generic_cont_expand_simple(), to guarantee that blockdev_direct_IO() finds
all needed blocks already reachable sequentially. Finally, it flushes and
invalidates the DIO range again so the page cache is clean before the
direct write begins.
hfsplus: Remove the duplicate attr inode dirty marking action
Syzbot reported a null-ptr-deref in [1].
If the attributes file is not loaded during system mount, the
null-ptr-deref [1] is triggered when setxattr is executed from userspace.
Merge tag 'mailbox-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox
Pull mailbox updates from Jassi Brar:
- core: fix NULL message handling and add API to query TX queue slots
- test: resolve concurrency bugs, dangling IRQs, and memory leaks
- dt-bindings: qcom: add Eliza IPCC
- mtk: fix address calculation and pointer handling bugs
- cix: resolve SCMI suspend timeouts
- misc memory allocation optimizations and cleanups
* tag 'mailbox-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox:
mailbox: mailbox-test: make data_ready a per-instance variable
mailbox: mailbox-test: initialize struct earlier
mailbox: mailbox-test: don't free the reused channel
mailbox: mailbox-test: handle channel errors consistently
mailbox: update kdoc for struct mbox_controller
mailbox: add sanity check for channel array
mailbox: mailbox-test: free channels on probe error
mailbox: prefix new constants with MBOX_
dt-bindings: mailbox: qcom-ipcc: Document the Eliza Inter-Processor Communication Controller
mailbox: cix: Add IRQF_NO_SUSPEND to mailbox interrupt
mailbox: Fix NULL message support in mbox_send_message()
mailbox: remove superfluous internal header
mailbox: correct kdoc title for mbox_bind_client
mailbox: test: really ignore optional memory resources
mailbox: exynos: drop superfluous mbox setting per channel
mailbox: mtk-cmdq: Fix CURR and END addr for task insert case
mailbox: mtk-vcp-mailbox: Fix the return value in mtk_vcp_mbox_xlate()
mailbox: hi6220: kzalloc + kcalloc to kzalloc
mailbox: rockchip: kzalloc + kcalloc to kzalloc
mailbox: add API to query available TX queue slots
Rick Edgecombe [Thu, 2 Apr 2026 06:32:05 +0000 (00:32 -0600)]
x86/virt/tdx: Remove kexec docs
Recent changes have removed the hard limitations for using kexec and
TDX together. So remove the section in the TDX docs.
Users on platforms with partial write errata will need an updated TDX module to
handle the rare edge cases. The docs do not currently provide any
guidance on recommended TDX module versions, so don't keep a whole
section around to document this interaction.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20260402-fuller_tdx_kexec_support-v3-5-34438d7094bf@intel.com
x86/tdx: Disable the TDX module during kexec and kdump
Use the TDH.SYS.DISABLE SEAMCALL, which disables the TDX module,
reclaims all memory resources assigned to TDX, and clears any
partial-write induced poison, to allow kexec and kdump on platforms with
the partial write errata.
On TDX-capable platforms with the partial write erratum, kexec has been
disabled because the new kernel could hit a machine check reading a
previously poisoned memory location.
Later TDX modules support TDH.SYS.DISABLE, which disables the module and
reclaims all TDX memory resources, allowing the new kernel to re-initialize
TDX from scratch. This operation also clears the old memory, cleaning up
any poison.
Add tdx_sys_disable() to tdx_shutdown(), which is called in the
syscore_shutdown path for kexec. This is done just before tdx_shutdown()
disables VMX on all CPUs.
For kdump, call tdx_sys_disable() in the crash path before
x86_virt_emergency_disable_virtualization_cpu() does VMXOFF.
Since this clears any poison on TDX-managed memory, remove the
X86_BUG_TDX_PW_MCE check in machine_kexec() that blocked kexec on
partial write errata platforms.
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20260402-fuller_tdx_kexec_support-v3-4-34438d7094bf@intel.com
x86/virt/tdx: Add SEAMCALL wrapper for TDH.SYS.DISABLE
Some early TDX-capable platforms have an erratum where a partial write
to TDX private memory can cause a machine check on a subsequent read.
On these platforms, kexec and kdump have been disabled
because the old kernel cannot safely hand off TDX state to the new
kernel. Later TDX modules support the TDH.SYS.DISABLE SEAMCALL, which
provides a way to cleanly disable TDX and allow kexec to proceed.
The new SEAMCALL has an enumeration bit, but that is ignored. It is
expected that users will be using the latest TDX module, and the failure
mode for running the missing SEAMCALL on an older module is not fatal.
This can be a long running operation, and the time needed largely
depends on the amount of memory that has been allocated to TDs. If all
TDs have been destroyed prior to the sys_disable call, then it is fast,
since only the TDX module memory needs to be overridden.
After the SEAMCALL completes, the TDX module is disabled and all memory
resources allocated to TDX are freed and reset. The next kernel can then
re-initialize the TDX module from scratch via the normal TDX bring-up
sequence.
The SEAMCALL can return two different error codes that expect a retry.
- TDX_INTERRUPTED_RESUMABLE can be returned in the case of a host
interrupt. However, it will not return until it makes some forward
progress, so we can expect to complete even in the case of interrupt
storms.
- TDX_SYS_BUSY will be returned on contention with other TDH.SYS.*
SEAMCALLs. However, a side effect of TDH.SYS.DISABLE is that it will
block other SEAMCALLs once it gets going, so this contention will be
short lived.
So loop infinitely on either of these error codes, until success or other
error.
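The retry policy just described (loop on the two retryable codes, stop on success or any other error) can be sketched in userspace. Everything here is hypothetical: the enum values are placeholders rather than the real TDX status encodings, and fake_seamcall() merely simulates a call that reports "busy" a few times before succeeding.

```c
#include <stdint.h>

/* Placeholder status values for illustration only. */
enum sketch_status {
	SKETCH_SUCCESS = 0,
	SKETCH_INTERRUPTED_RESUMABLE,
	SKETCH_SYS_BUSY,
	SKETCH_OTHER_ERROR,
};

typedef enum sketch_status (*seamcall_fn)(void *ctx);

/* Retry while the call reports one of the two retryable conditions. */
static enum sketch_status sys_disable_retry(seamcall_fn call, void *ctx)
{
	enum sketch_status st;

	do {
		st = call(ctx);
	} while (st == SKETCH_INTERRUPTED_RESUMABLE || st == SKETCH_SYS_BUSY);

	return st;	/* success or a non-retryable error */
}

/* Fake call: returns "busy" until the counter in ctx reaches zero. */
static enum sketch_status fake_seamcall(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return SKETCH_SYS_BUSY;
	}
	return SKETCH_SUCCESS;
}
```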
An error is printed if the SEAMCALL fails with anything other than the
error codes that cause retries, or 'synthesized' error codes produced
for #GP or #UD. e.g., an old module that has been properly initialized,
that doesn't implement SYS_DISABLE, returns TDX_OPERAND_INVALID. This
prints:
Rick Edgecombe [Thu, 2 Apr 2026 06:32:02 +0000 (00:32 -0600)]
x86/virt/tdx: Pull kexec cache flush logic into arch/x86
KVM tries to take care of some required cache flushing earlier in the
kexec path in order to be kind to some long standing races that can occur
later in the operation. Until recently, VMXOFF was handled within KVM.
Since VMX being enabled is required to make a SEAMCALL, it had the best
per-cpu scoped operation to plug the flushing into. So it is kicked off
from there.
This early kexec cache flushing in KVM happens via a syscore shutdown
callback. Now that VMX enablement control has moved to arch/x86, which has
grown its own syscore shutdown callback, it no longer makes sense for it to
live in KVM. It fits better with the TDX enablement managing code.
In addition, future changes will add a SEAMCALL that happens immediately
before VMXOFF, which means the cache flush in KVM will be too late to
flush the cache before the last SEAMCALL. So move it to the newly added TDX
arch/x86 syscore shutdown handler.
Since tdx_cpu_flush_cache_for_kexec() is no longer needed by KVM, make it
static and remove the export. Since it is also not part of an operation
spread across disparate components, remove the redundant comments and
verbose naming.
In the existing KVM based code, CPU offline also funnels through
tdx_cpu_flush_cache_for_kexec(). Add an explicit WBINVD in
tdx_offline_cpu() as well, even though it may be redundant with WBINVD
done elsewhere during CPU offline (e.g. hlt_play_dead()). This avoids
relying on fragile code ordering for cache coherency safety.
[Vishal: add explicit WBINVD in tdx_offline_cpu()]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260402-fuller_tdx_kexec_support-v3-2-34438d7094bf@intel.com
They have some overlap that is already defined similarly. Reduce the
duplication by unifying the architectural error codes at:
asm/shared/tdx_errno.h
...and update the headers that contained the duplicated definitions to
include the new unified header.
"asm/shared" is used for sharing TDX code between the early compressed
code and the normal kernel code. While the compressed code for the guest
doesn't use these error code header definitions today, it does make the
types of calls that return the values they define. So place the defines in
the "shared" location so that it can, but leave such cleanups for future
changes.
[Rick: enhance log]
[Vishal: reduce to a simple move of architectural defines only]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20260402-fuller_tdx_kexec_support-v3-1-34438d7094bf@intel.com
drm/xe/guc_pc: Reorder forcewake in xe_guc_pc_fini_hw()
xe_guc_pc_stop() doesn't perform any MMIO operation that requires
forcewake in its code path. Move it before pc_set_cur_freq() which
writes to RPNSWREQ and actually requires it.
cdrom, scsi: sr: propagate read-only status to block layer via set_disk_ro()
The cdrom core never calls set_disk_ro() for a registered device, so
BLKROGET on a CD-ROM device always returns 0 (writable), even when the
drive has no write capabilities and writes will inevitably fail. This
causes problems for userspace that relies on BLKROGET to determine
whether a block device is read-only. For example, systemd's loop device
setup uses BLKROGET to decide whether to create a loop device with
LO_FLAGS_READ_ONLY. Without the read-only flag, writes pass through the
loop device to the CD-ROM and fail with I/O errors. systemd-fsck
similarly checks BLKROGET to decide whether to run fsck in no-repair
mode (-n).
The write-capability bits in cdi->mask come from two different sources:
CDC_DVD_RAM and CDC_CD_RW are populated by the driver from the MODE
SENSE capabilities page (page 0x2A) before register_cdrom() is called,
while CDC_MRW_W and CDC_RAM require the MMC GET CONFIGURATION command
and were only probed by cdrom_open_write() at device open time. This
meant that any attempt to compute the writable state from the full
mask at probe time was incorrect, because the GET CONFIGURATION bits
were still unset (and cdi->mask is initialized such that capabilities
are assumed present).
Fix this by factoring the GET CONFIGURATION probing out of
cdrom_open_write() into a new exported helper,
cdrom_probe_write_features(), and having sr call it from sr_probe()
right after get_capabilities() has populated the MODE SENSE bits.
register_cdrom() then calls set_disk_ro() based on the full
write-capability mask (CDC_DVD_RAM | CDC_MRW_W | CDC_RAM | CDC_CD_RW)
so the block layer reflects the drive's actual write support. The
feature queries used (CDF_MRW and CDF_RWRT via GET CONFIGURATION with
RT=00) report drive-level capabilities that are persistent across
media, so a single probe before register_cdrom() is sufficient and the
redundant probe at open time is dropped.
With set_disk_ro() now accurate, the long-vestigial cd->writeable flag
in sr can go: get_capabilities() used to set cd->writeable based on
the same four mask bits, but because CDC_MRW_W and CDC_RAM default to
"capability present" in cdi->mask and aren't touched by MODE SENSE,
the condition that gated cd->writeable was always true, making it
unconditionally 1. Replace the corresponding gate in sr_init_command()
with get_disk_ro(cd->disk), which turns a previously no-op check into
a real one and also catches kernel-internal bio writers that bypass
blkdev_write_iter()'s bdev_read_only() check.
The sd driver (SCSI disks) does not have this problem because it
checks the MODE SENSE Write Protect bit and calls set_disk_ro()
accordingly. The sr driver cannot use the same approach because the
MMC specification does not define the WP bit in the MODE SENSE
device-specific parameter byte for CD-ROM devices.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Daan De Meyer <daan@amutable.com>
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://patch.msgid.link/20260427210139.1400-2-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
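The read-only decision described in the cdrom commit above (derive the state from the full write-capability mask once all four bits have been probed) can be sketched as follows. This is a userspace model with assumptions: the SK_* bit values are illustrative rather than the kernel's CDC_* constants, and it assumes that a bit set in the mask means the capability is masked out, which is consistent with the message's remark that unprobed bits default to "capability present".

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions, not the kernel's CDC_* values. */
#define SK_CDC_CD_RW	(1u << 0)
#define SK_CDC_DVD_RAM	(1u << 1)
#define SK_CDC_MRW_W	(1u << 2)
#define SK_CDC_RAM	(1u << 3)
#define SK_WRITE_CAPS	(SK_CDC_CD_RW | SK_CDC_DVD_RAM | \
			 SK_CDC_MRW_W | SK_CDC_RAM)

/*
 * The drive is treated as read-only when none of the four write
 * capabilities survives the mask.
 */
static bool drive_is_read_only(uint32_t capability, uint32_t mask)
{
	return (capability & ~mask & SK_WRITE_CAPS) == 0;
}
```

If only the MODE SENSE bits have been masked and the GET CONFIGURATION bits are still at their default, the model reports a writable drive, which is the stale state the commit message describes.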
Merge tag 'nvme-7.1-2026-04-24' of git://git.infradead.org/nvme into block-7.1
Pull NVMe fixes from Keith:
"- Target data transfer size configuration (Aurelien)
- Enable P2P for RDMA (Shivaji Kant)
- TCP target updates (Maurizio, Alistair, Chaitanya, Shivam Kumar)
- TCP host updates (Alistair, Chaitanya)
- Authentication updates (Alistair, Daniel, Chris Leech)
- Multipath fixes (John Garry)
- New quirks (Alan Cui, Tao Jiang)
- Apple driver fix (Fedor Pchelkin)
- PCI admin doorbell update fix (Keith)"
* tag 'nvme-7.1-2026-04-24' of git://git.infradead.org/nvme: (22 commits)
nvme-auth: Hash DH shared secret to create session key
nvme-pci: fix missed admin queue sq doorbell write
nvme-auth: Include SC_C in RVAL controller hash
nvme-tcp: teardown circular locking fixes
nvmet-tcp: Don't clear tls_key when freeing sq
Revert "nvmet-tcp: Don't free SQ on authentication success"
nvme: skip trace completion for host path errors
nvme-pci: add quirk for Memblaze Pblaze5 (0x1c5f:0x0555)
nvme-multipath: put module reference when delayed removal work is canceled
nvme: expose TLS mode
nvme-apple: drop invalid put of admin queue reference count
nvme-core: fix parameter name in comment
nvmet: avoid recursive nvmet-wq flush in nvmet_ctrl_free
nvme-multipath: drop head pointer check in nvme_mpath_clear_current_path()
nvme: add quirk NVME_QUIRK_IGNORE_DEV_SUBNQN for 144d:a808 (Samsung PM981/983/970 EVO Plus )
nvmet-tcp: fix race between ICReq handling and queue teardown
nvmet-tcp: remove redundant calls to nvmet_tcp_fatal_error()
nvmet-tcp: propagate nvmet_tcp_build_pdu_iovec() errors to its callers
nvme: enable PCI P2PDMA support for RDMA transport
nvmet: introduce new mdts configuration entry
...
Matt Roper [Fri, 24 Apr 2026 20:48:20 +0000 (13:48 -0700)]
drm/xe: Mark BCS engines as belonging to the GT forcewake domain
On all platforms supported by the Xe driver, BCS engines are part of the
GT forcewake domain, not the RENDER domain. Fix the engine list
definition to match the spec. This mistake didn't really cause any
real problems because the forcewake domain here was only used in a
couple assertions that aren't really necessary and included in the
information dumped during error capture.
Matt Roper [Fri, 24 Apr 2026 20:48:19 +0000 (13:48 -0700)]
drm/xe: Drop xe_hw_engine_mmio_write32()
xe_hw_engine_mmio_write32() is only used in a single place and is easily
replaced by a regular xe_mmio_write32() call. Register read/write
interfaces are already complicated enough with MCR vs non-MCR handling,
so we should avoid adding extra wrappers that just make it more
confusing what to use.
xe_hw_engine_mmio_write32() did have a forcewake assertion that we're
dropping here, but that assertion wasn't entirely correct anyway. It was
checking hwe->domain which is currently set to XE_FW_RENDER for the BCS
engine, even though BCS engines reside in the GT domain.
v2:
- Drop prototype in header file as well. (Shuicheng)
Matt Roper [Fri, 24 Apr 2026 20:48:18 +0000 (13:48 -0700)]
drm/xe: Drop unnecessary STOP_RING clearing
The STOP_RING bit in MI_MODE is already clear by default out of hardware
reset and will only be '1' if the driver intentionally sets it after
that.
The logic of clearing this bit appears to originate from very
early (pre-GuC, pre-execlist) code in i915 where we needed to stop the
ring before performing a host-initiated engine reset; after the reset
the STOP_RING bit needed to be cleared to allow execution to resume.
None of that is relevant to Xe (or even modern i915) since STOP_RING
isn't necessary for execlist-based engine resets (and even if it were,
Xe doesn't initiate any engine resets; the GuC handles that now).
Matt Roper [Fri, 24 Apr 2026 20:48:17 +0000 (13:48 -0700)]
drm/xe: Move GFX_MODE programming to RTP
The write to GFX_MODE to disable engine "legacy mode" and to enable MSI-X
support was unnecessarily open-coded in xe_hw_engine_enable_ring();
it's preferable to do such programming in the engine_entries[] RTP table
since it gets reflected/verified in debugfs, and will also automatically
ensure that the register is properly saved/restored around engine
resets. This also helps consolidate common logic that was duplicated
between the main driver initialization path and the dead-code execlist
initialization path.
This also allows us to drop GFX_MODE from the list of extra registers to
be added to the GuC ADS' save-restore list since all registers on the
RTP table are added automatically.
v2:
- Actually use the xe_rtp_match_has_msix match function added.
(Shuicheng)
Matt Roper [Fri, 24 Apr 2026 20:48:15 +0000 (13:48 -0700)]
drm/xe: Fix name and definition of GFX_MODE register
The register located at $base+0x29c is referred to as GFX_MODE in the
bspec. Although many other registers have RING_* prefixes for
historical reasons, this register does not, so using a name that does
not match the bspec just makes it harder to recognize/find.
Also, GFX_MODE is a masked register (updating bits [15:0] requires that
the corresponding bit(s) in [31:16] are also set), so add the
XE_REG_OPTION_MASKED flag to the register definition; this will become
important when we start programming this register via RTP tables in a
future patch.
Finally swap the order of the register's two bit definitions to match
our regular coding style of descending order for register bits/fields.
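The masked-register behaviour described in the GFX_MODE commit above (a change to bits [15:0] only takes effect when the matching bits in [31:16] are set) can be modelled in userspace. This is a sketch with hypothetical helper names; apply_masked_write() imitates what the hardware is described as doing and is not driver code.

```c
#include <stdint.h>

/* Build a write value that sets the given low bits. */
static uint32_t masked_bit_enable(uint16_t bits)
{
	return ((uint32_t)bits << 16) | bits;
}

/* Build a write value that clears the given low bits. */
static uint32_t masked_bit_disable(uint16_t bits)
{
	return (uint32_t)bits << 16;
}

/* Model of the register: only bits selected by [31:16] are updated. */
static uint16_t apply_masked_write(uint16_t cur, uint32_t wr)
{
	uint16_t mask = (uint16_t)(wr >> 16);
	uint16_t val = (uint16_t)(wr & 0xffff);

	return (uint16_t)((cur & ~mask) | (val & mask));
}
```

A write that sets a low bit without its mask bit leaves the register unchanged in this model.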
Matt Roper [Fri, 24 Apr 2026 20:48:14 +0000 (13:48 -0700)]
drm/xe: Move HWSTAM programming to RTP
The write to RING_HWSTAM to disable hardware status page writes on
interrupt was unnecessarily open-coded in xe_hw_engine_enable_ring();
it's preferable to do such programming in the engine_entries[] RTP table
since it gets reflected/verified in debugfs, and will also automatically
ensure that the register is properly saved/restored around engine
resets.
In this case the HWSTAM register wasn't explicitly added to the GuC ADS'
save-restore list, so there was the potential for the value to be lost
on engine resets. This doesn't seem to have happened in practice, so
likely the GuC firmware is automatically saving/restoring this register
on our behalf, but we shouldn't rely on this implicit behavior going
forward.
One other slight change with this patch is that HWSTAM will now be
programmed on the vestigial execlist (non-GuC) initialization path.
Since the register's default value is 0x0 and the documentation
indicates that it's only legal to leave a single bit unmasked at a time,
this likely would have been an illegal situation if the execlist code
were actually usable.
Matt Roper [Fri, 24 Apr 2026 20:48:13 +0000 (13:48 -0700)]
drm/xe: Stop programming BLIT_CCTL on Xe2 and later platforms
Xe1 platforms used the BLIT_CCTL register to specify the MOCS value that
would be used for BCS engine instructions that did not have a way of
specifying a MOCS index directly. From Xe2 onward, all BCS instructions
now have explicit instruction fields for specifying a MOCS index and the
BLIT_CCTL register is now a dummy register with no valid fields.
Although continuing to write to it today has no effect, the register
could be repurposed in future platforms, so restrict the BLIT_CCTL RTP
entry to only apply to Xe1 platforms.
Matt Roper [Fri, 24 Apr 2026 20:48:12 +0000 (13:48 -0700)]
drm/xe/rtp: Add "always true" match function
All RTP table entries are required to have at least one rule. In cases
where an entry should apply unconditionally across all platforms we've
been using a graphics version range of 12.00 - forever since this covers
all platforms supported by the driver. However if the primary GT is
disabled via configfs (not actually possible today, but probably
possible in the future) or if we have a future platform that lacks a
primary GT and only supports media/display, this rule would cause
important programming to fail to apply on the media GT.
Add a simple match function that always returns true. This addresses
the concerns above while also being more immediately human-readable.
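The difference between the two idioms can be sketched with a small userspace model. All types and names here (struct dev_info, match_fn, entry_applies) are hypothetical stand-ins rather than the driver's API, and the model assumes that a device without a usable primary GT reports a graphics version of 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the device info a rule inspects.
 * graphics_verx100 == 0 models a device with no usable primary GT.
 */
struct dev_info {
	unsigned int graphics_verx100;	/* 1200 means version 12.00 */
};

typedef bool (*match_fn)(const struct dev_info *info);

/* The old idiom: "graphics version 12.00 and later". */
static bool match_graphics_12_and_later(const struct dev_info *info)
{
	return info->graphics_verx100 >= 1200;
}

/* The new idiom: matches on every platform. */
static bool match_always(const struct dev_info *info)
{
	(void)info;
	return true;
}

/* An entry applies only if all of its rules match. */
static bool entry_applies(const match_fn *rules, size_t n,
			  const struct dev_info *info)
{
	assert(n > 0);	/* every entry needs at least one rule */
	for (size_t i = 0; i < n; i++)
		if (!rules[i](info))
			return false;
	return true;
}
```

With the version-range rule, an entry is silently skipped on the modeled media-only device; with the always-true rule it applies everywhere, and the intent is visible at a glance.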
Matt Roper [Fri, 24 Apr 2026 20:48:11 +0000 (13:48 -0700)]
drm/xe: Move CCS enablement to engine setup RTP
Most register programming for engine setup happens via RTP tables in
hw_engine_setup_default_state(). Move the programming of RCU_MODE[0]
which enables the platform's CCS engine(s) there. This both makes the
code more consistent (other RCU_MODE register programming is already
happening in this RTP table) and improves debuggability (since RTP
contents and checks of their correct programming are exposed via
debugfs). It also helps consolidate the regular driver initialization
paths with the vestigial and currently unused execlist (i.e., non-GuC)
initialization.
With the original programming, the RCU_MODE register (which is a single
global register, not a per-engine register) was getting re-programmed
with the same value during the initialization of each CCS engine. When
moved to the RTP table, we use the xe_rtp_match_first_render_or_compute
match function so that it will just be programmed once, while doing the
initialization for the first RCS/CCS engine, which avoids the redundant
and unnecessary repetition.
We can also safely drop the explicit addition of RCU_MODE from the GuC
ADS save-restore list now since all registers programmed via RTP tables
are automatically added to the GuC's list.
v2:
- Only enable CCS engines on Xe_HP and later. Even though Xe_LP
platforms technically have a CCS engine, it's never been enabled on
i915 or Xe due to other issues on these old platforms.
ACPI: video: force native backlight on HP OMEN 16 (8A44)
The HP OMEN 16 Gaming Laptop (board name 8A44) has a mux-less hybrid
GPU configuration with AMD Rembrandt (Radeon 680M) and NVIDIA GA104
(RTX 3070 Ti). The internal eDP panel is wired to the AMD iGPU.
When Nouveau loads without GSP firmware, the ACPI video backlight
device (acpi_video0) gets registered alongside the native AMD
backlight (amdgpu_bl2). In this state, writes to amdgpu_bl2 update
the software brightness value but fail to change the physical panel
brightness.
Force native backlight to prevent acpi_video0 from registering.
Confirmed that booting with acpi_backlight=native resolves the
issue.
ACPI: TAD: RTC: Refine timer value computations and checks
Since rtc_tm_to_ktime() may overflow for large RTC time values and
full second granularity is sufficient in timer value computations
in acpi_tad_rtc_set_alarm() and acpi_tad_rtc_read_alarm(), use
rtc_tm_to_time64() instead of that function, which also allows the
computations to be simplified.
Moreover, U32_MAX is a special "timer disabled" value, so make
acpi_tad_rtc_set_alarm() reject it when attempting to program the
alarm timers.
Fixes: 7572dcabe38d ("ACPI: TAD: Add alarm support to the RTC class device interface")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Link: https://patch.msgid.link/3414608.aeNJFYEL58@rafael.j.wysocki
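A minimal sketch of the kind of check described above, working in whole seconds. The function name alarm_to_timer_value, the error codes, and the rejection of alarms that are not in the future are assumptions made for illustration; this is not the driver's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* UINT32_MAX is reserved to mean "timer disabled". */
#define TIMER_DISABLED UINT32_MAX

/*
 * Hypothetical helper: convert an absolute alarm time to a relative
 * timer value, both given as seconds since the epoch.
 */
static int alarm_to_timer_value(int64_t now_s, int64_t alarm_s,
				uint32_t *value)
{
	int64_t delta;

	assert(value != NULL);

	/* Assumed policy: the alarm must lie strictly in the future. */
	if (now_s < 0 || alarm_s <= now_s)
		return -EINVAL;

	/* Cannot overflow: now_s >= 0 and alarm_s > now_s. */
	delta = alarm_s - now_s;

	/* Reject values that do not fit or that mean "disabled". */
	if (delta >= (int64_t)TIMER_DISABLED)
		return -ERANGE;

	*value = (uint32_t)delta;
	return 0;
}
```

Staying in seconds keeps every intermediate value far from the 64-bit limit, whereas a nanosecond representation can only cover a few hundred years before a signed 64-bit value overflows.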
The code in acpi_tad_remove() needs to run after the unregistration of
the devres-managed RTC class device so that it doesn't race with the
class callbacks of the latter.
To make that happen, pass it to devm_add_action_or_reset() before
registering the RTC class device.
Fixes: 7572dcabe38d ("ACPI: TAD: Add alarm support to the RTC class device interface")
Fixes: 8a1e7f4b1764 ("ACPI: TAD: Add RTC class device interface")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/14001754.uLZWGnKmhe@rafael.j.wysocki
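The ordering argument relies on devres releasing resources in the reverse order of their registration. The userspace model below sketches that behavior with a plain LIFO stack; struct action_stack, add_action() and release_all() are hypothetical stand-ins, not the kernel's devres implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACTIONS 8

typedef void (*action_fn)(void *data);

/* Hypothetical model of a device's managed-resource list. */
struct action_stack {
	struct {
		action_fn fn;
		void *data;
	} items[MAX_ACTIONS];
	size_t count;
};

static void add_action(struct action_stack *s, action_fn fn, void *data)
{
	assert(s->count < MAX_ACTIONS);
	s->items[s->count].fn = fn;
	s->items[s->count].data = data;
	s->count++;
}

/* Run the actions last-registered-first, as devres does on release. */
static void release_all(struct action_stack *s)
{
	while (s->count > 0) {
		s->count--;
		s->items[s->count].fn(s->items[s->count].data);
	}
}

/* Each callback appends one character so the order can be inspected. */
struct order_log {
	char buf[MAX_ACTIONS + 1];
	size_t len;
};

static void log_char(struct order_log *l, char c)
{
	assert(l->len < MAX_ACTIONS);
	l->buf[l->len++] = c;
}

static void driver_teardown(void *data) { log_char(data, 'T'); }
static void rtc_unregister(void *data) { log_char(data, 'R'); }
```

Registering driver_teardown first and rtc_unregister second makes release_all() run the RTC unregistration before the driver teardown, which is the ordering the commit message asks for.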
Recent commit 93afe8ba9b01 ("ACPI: TAD: Use dev_groups in struct
device_driver") switched the ACPI TAD driver over to using device
attribute groups instead of creating and removing the device sysfs
attributes directly, but it could go one step further and use the
__ATTRIBUTE_GROUPS() macro, which would reduce the code size slightly.
Add devicetree changes to enable the second Mobile Display Subsystem
(mdss1), Display Processing Unit (DPU), DisplayPort (DP), display clock
controller (dispcc1) and eDP PHYs on the Qualcomm Lemans platform.
ACPI: CPPC: Fix related_cpus inconsistency during CPU hotplug
When concurrently bringing up and down two SMT threads of a physical
core, many warning call traces occur as below:
The issue timeline is as follows:
1. When the system starts,
cpufreq: CPU: 220, policy->related_cpus: 220-221, policy->cpus: 220-221
2. Offline CPU 220 and CPU 221.
3. Online CPU 220
- CPU 221 is still offline and acpi_get_psd_map() uses
for_each_online_cpu(), so cpu_data->shared_cpu_map,
policy->cpus, and related_cpus contain only CPU 220.
5. Online CPU 221, the below call trace occurs:
- CPU 220 and CPU 221 share one policy, and
policy->related_cpus = 220 after step 3, so CPU 221
is not in policy->related_cpus, but
per_cpu(cpufreq_cpu_data, cpu221) is not NULL.
After reverting commit 56eb0c0ed345 ("ACPI: CPPC: Fix remaining
for_each_possible_cpu() to use online CPUs"), the issue disappeared.
The _PSD (P-State Dependency) defines the hardware-level dependency of
frequency control across CPU cores. Since this relationship is a physical
attribute of the hardware topology, it remains constant regardless of the
online or offline status of the CPUs.
Using for_each_online_cpu() in acpi_get_psd_map() is problematic. If a
CPU is offline, it will be excluded from the shared_cpu_map.
Consequently, if that CPU is brought online later, the kernel will fail
to recognize it as part of any shared frequency domain.
Switch back to for_each_possible_cpu() to ensure that all cores defined
in the ACPI tables are correctly mapped into their respective performance
domains from the start. This aligns with the logic of policy->related_cpus,
which must encompass all potentially available cores in the domain to
prevent logic gaps during CPU hotplug operations.
As for the original issue with the "nosmt" or "nosmt=force" boot
parameter, send_pcc_cmd() already does if (!desc) continue, so
reverting that loop back to for_each_possible_cpu() is fine. The only
change needed is to make the match_cpc_ptr NULL case in
acpi_get_psd_map() continue, as Sean suggested.
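The effect of the iteration mask can be sketched with a small userspace model. Everything here is hypothetical (the function build_shared_map, the bitmask representation, the domain[] array standing in for _PSD data); it only illustrates why iterating over online CPUs loses domain members and how a missing descriptor can be skipped rather than treated as an error.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/*
 * Hypothetical model: domain[i] is the _PSD domain of CPU i, desc_mask
 * has a bit set for each CPU that has a CPPC descriptor, and iter_mask
 * is the set of CPUs the loop walks (online or possible).
 */
static uint32_t build_shared_map(const int domain[NR_CPUS],
				 uint32_t desc_mask, int cpu,
				 uint32_t iter_mask)
{
	uint32_t map = 0;

	assert(cpu >= 0 && cpu < NR_CPUS);

	for (int i = 0; i < NR_CPUS; i++) {
		if (!(iter_mask & (1u << i)))
			continue;	/* not part of this iteration */
		if (!(desc_mask & (1u << i)))
			continue;	/* no descriptor: skip, do not fail */
		if (domain[i] == domain[cpu])
			map |= 1u << i;
	}
	return map;
}
```

With CPU 1 offline, walking only the online CPUs yields a map holding just CPU 0, while walking all possible CPUs yields both siblings, matching the related_cpus expectation described above.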
How to reproduce, on an arm64 machine with SMT support that uses the
ACPI CPPC cpufreq driver:
bash test.sh 220 & bash test.sh 221 &
The test.sh is as below:
while true
do
echo 0 > /sys/devices/system/cpu/cpu${1}/online
sleep 0.5
cat /sys/devices/system/cpu/cpu${1}/cpufreq/related_cpus
echo 1 > /sys/devices/system/cpu/cpu${1}/online
cat /sys/devices/system/cpu/cpu${1}/cpufreq/related_cpus
done