Aditya Garg [Mon, 8 Jun 2026 10:13:41 +0000 (03:13 -0700)]
net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check
mana_create_txq() has several error paths (after mana_alloc_queues() or
mana_create_wq_obj() failure) where tx_qp[i].tx_object stays as the
INVALID_MANA_HANDLE sentinel set at allocation. mana_destroy_txq() then
unconditionally calls mana_destroy_wq_obj() with (u64)-1, which firmware
rejects and logs an error.
Mirror the RX-side pattern in mana_destroy_rxq() and skip the destroy
when the handle is still INVALID_MANA_HANDLE.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/20260608101345.2267320-3-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aditya Garg [Mon, 8 Jun 2026 10:13:40 +0000 (03:13 -0700)]
net: mana: initialize gdma queue id to INVALID_QUEUE_ID
mana_gd_create_mana_wq_cq() leaves queue->id as 0 (from kzalloc_obj())
until mana_create_wq_obj() assigns the firmware-returned id. If creation
fails before that, cleanup calls mana_gd_destroy_cq() with id 0, NULLing
gc->cq_table[0] and silently breaking whichever real CQ owns that slot.
Initialize queue->id to INVALID_QUEUE_ID right after allocation, matching
mana_gd_create_eq(). The existing (id >= max_num_cqs) guard then
short-circuits cleanly.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/20260608101345.2267320-2-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Avoid mistaken parent class deactivation during peek
Several qdiscs (fq_codel, codel and dualpi2) may drop packets while
peeking at their queue. When that happens they call
qdisc_tree_reduce_backlog() to notify the parent of the backlog/qlen
change. The problem is that they do so *before* reincrementing the qlen
that peek had temporarily decremented.
If the qlen momentarily drops to zero while peek still has an skb to
return, qdisc_tree_reduce_backlog() ends up invoking the parent's
qlen_notify() callback even though the child is not actually empty. The
parent then deactivates the class, while the child still holds a packet.
For parents such as QFQ this desync corrupts the active class list and
leads to wild memory accesses and NULL pointer dereferences (see the
per-patch splats). For HFSC it might lead to stalls [1].
Fix all three qdiscs the same way: only call qdisc_tree_reduce_backlog()
once the qlen has been restored, so the parent never observes a
transient empty child during peek.
Patch 1 fixes this for fq_codel, patch 2 for codel, patch 3 for dualpi2
and patch 4 adds test cases for these 3 setups.
Note: Patch 1 is one of two fixes for the stall reported in [1]; the
companion fix is "net/sched: sch_hfsc: Don't make class passive twice",
sent separately.
Note2: A possible cleaner fix is to create a new helper function for peek
that only calls qdisc_tree_reduce_backlog after reincrementing the qlen.
This would be called from the 3 vulnerable qdiscs, however we thought this
might make it harder for backporting so, if people agree, we can submit
this cleaner version to net-next after this one is merged.
Victor Nogueira [Wed, 10 Jun 2026 19:28:55 +0000 (16:28 -0300)]
selftests/tc-testing: Verify child qdisc will not mistakenly deactivate QFQ parent
Create 3 test cases:
- Verify fq_codel won't mistakenly deactivate QFQ parent class during peek
- Verify codel won't mistakenly deactivate QFQ parent class during peek
- Verify dualpi2 won't mistakenly deactivate QFQ parent class during peek
Verify that these 3 qdiscs (fq_codel, codel, dualpi2) will not call
qdisc_tree_reduce_backlog with an incorrect qlen (0) during peek and
mistakenly deactivate a parent class.
Victor Nogueira [Wed, 10 Jun 2026 19:28:54 +0000 (16:28 -0300)]
net/sched: sch_dualpi2: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever dualpi2 drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though dualpi2 still has 1 packet on the queue and, thus,
mistakenly deactivates the parent's class which leads to a null-ptr-deref:
Victor Nogueira [Wed, 10 Jun 2026 19:28:53 +0000 (16:28 -0300)]
net/sched: sch_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will
be executed even though codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a wild
memory access when qfq has codel as a child:
Victor Nogueira [Wed, 10 Jun 2026 19:28:52 +0000 (16:28 -0300)]
net/sched: sch_fq_codel: Do not call qdisc_tree_reduce_backlog during peek before restoring qlen
Whenever fq_codel drops packets during peek, it calls
qdisc_tree_reduce_backlog. An issue arises because it calls
qdisc_tree_reduce_backlog before it reincrements the qlen. If qlen drops
to zero, but peek returns an skb, the parent's qlen_notify callback will be
executed even though fq_codel still has 1 packet on the queue and, thus,
will mistakenly deactivate the parent's class causing issues like a recent
report [1] and a wild memory access in qfq:
Jakub Kicinski [Fri, 12 Jun 2026 23:48:57 +0000 (16:48 -0700)]
Merge branch 'rxrpc-miscellaneous-fixes'
David Howells says:
====================
rxrpc: Miscellaneous fixes
Here are some miscellaneous AF_RXRPC fixes:
(1) Make sure rxrpc_verify_data() allocates a buffer, even if the DATA
packet being looked at is zero length to avoid potential NULL-pointer
exceptions.
(2) Don't move an OOB message (e.g. an RxGK CHALLENGE) off the receive
queue onto the pending queue in recvmsg() if MSG_PEEK is specified.
(3) Fix a potential UAF in rxgk_issue_challenge() in which a tracepoint
refers to memory just freed by a different pointer.
(4) Fix afs net namespace teardown to cancel the incoming call
preallocation charger before we disable listening (which will delete
the preallocation queue).
(5) Fix rxrpc_kernel_charge_accept() to use the socket mutex to defend
against listen(0)/shutdown simultaneously deleting the preallocation
queue.
====================
Li Daming [Tue, 9 Jun 2026 14:09:09 +0000 (15:09 +0100)]
rxrpc: serialize kernel accept preallocation with socket teardown
rxrpc_kernel_charge_accept() reads rx->backlog without any
socket/backlog synchronization and passes that raw pointer into
rxrpc_service_prealloc_one(). A concurrent rxrpc_discard_prealloc()
sets rx->backlog = NULL and frees the backlog rings, so a kernel
preallocation worker can keep using a freed struct rxrpc_backlog
while updating *_backlog_head/tail and array slots.
Serialize the state check and backlog lookup with the socket lock,
and reject kernel preallocation once teardown has disabled
listening or discarded the service backlog.
Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Li Daming <d4n.for.sec@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260609140911.838677-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jun 2026 14:09:08 +0000 (15:09 +0100)]
afs: Fix netns teardown to cancel the preallocation charger
Fix the teardown of an afs network namespace to make sure it cancels the
work item that keeps the preallocated rxrpc call/conn/peer queue charged
before incoming calls are disabled (i.e. listen 0).
Also, if net->live is false because the afs netns is being deleted, make
afs_charge_preallocation() skip charging and make afs_rx_new_call() avoid
requeuing the charger.
(This was found by AI review).
Fixes: 00e907127e6f ("rxrpc: Preallocate peers, conns and calls for incoming service requests") Reported-by: Simon Horman <horms@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Li Daming <d4n.for.sec@gmail.com>
cc: Ren Wei <n05ec@lzu.edu.cn>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260609140911.838677-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Tue, 9 Jun 2026 14:09:07 +0000 (15:09 +0100)]
rxrpc: Fix UAF in rxgk_issue_challenge()
Fix rxgk_issue_challenge() to free the page containing the challenge
content after invoking the tracepoint as the whdr passed to the tracepoint
points into the page just freed.
Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260609140911.838677-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hyunwoo Kim [Tue, 9 Jun 2026 14:09:06 +0000 (15:09 +0100)]
rxrpc: Don't move a peeked OOB message onto the pending queue
rxrpc_recvmsg_oob() takes a received oob message off recvmsg_oobq and,
if a response is needed, moves it onto the pending_oobq tree. However,
only the unlink from recvmsg_oobq is guarded by MSG_PEEK; the move onto
pending_oobq always runs.
As a result, reading a challenge with MSG_PEEK leaves the skb on
recvmsg_oobq while also adding it to pending_oobq. Since struct
sk_buff's rbnode shares storage with its next and prev pointers,
rb_insert_color() overwrites the list linkage, and the skb, which holds
a single reference, becomes reachable from both queues at once.
When the socket is closed both queues are drained in turn. While
draining recvmsg_oobq, __skb_unlink() follows the next and prev
pointers that rbnode has overwritten and writes to a bad address. Also,
as the skb holds a single reference but is freed from each queue, both
the skb and the connection reference it holds are released twice. This
leads to memory corruption and to a use-after-free caused by the
connection refcount underflow.
MSG_PEEK does not consume the message from the queue, so only unlink it
from recvmsg_oobq and then move it onto pending_oobq or free it when
the message is actually consumed.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE") Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260609140911.838677-3-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxrpc_recvmsg_data() calls rxrpc_verify_data() whenever the
rxrpc_call.rx_dec_buffer is unallocated and assumes that upon
successful return that rx_dec_buffer must be allocated.
However, rxrpc_verify_data() does not request an allocation if
the rxrpc_skb_priv.len is zero.
In addition, failure to allocate rx_dec_buffer will result in a
call to skb_copy_bits() with a NULL destination which can
trigger a NULL pointer dereference.
To prevent these issues rxrpc_verify_data() is modified to
always attempt to allocate the rxrpc_call.rx_dec_buffer if it
is NULL.
This issue was identified with assistance of a private
sashiko instance.
Fixes: d2bc90cf6c75cb ("rxrpc: Fix DATA decrypt vs splice() by copying data to buffer in recvmsg") Reported-by: Simon Horman <simon.horman@redhat.com> Signed-off-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jiayuan Chen <jiayuan.chen@linux.dev>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260609140911.838677-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Thu, 11 Jun 2026 16:36:48 +0000 (18:36 +0200)]
virtio_net: do not allow tunnel csum offload for non GSO packets
Fiona reports broken connectivity for virtio net setup using UDP tunnel
inside the guest and NIC with not UDP tunnel TSO support in the host.
Currently the virtio_net driver exposes csum offload for UDP-tunneled,
TCP non GSO packets. Such packet reach the host as CSUM_PARTIAL ones
with the 'encapsulation' flag cleared, as the virtio specification do
not support this specific kind of offload.
HW NICs with UDP tunnel TSO support - and those drivers directly
accessing skb->csum_start/csum_offset - are still capable of computing
the needed csum correctly, but otherwise the packets reach the wire with
bad csum on both the inner and outer transport header.
Address the issue explicitly disabling csum offload for UDP tunneled,
non GSO packets via the ndo_features_check op.
Fixes: 56a06bd40fab ("virtio_net: enable gso over UDP tunnel support.") Reported-by: Fiona Ebner <f.ebner@proxmox.com> Closes: https://bugzilla.proxmox.com/show_bug.cgi?id=7627 Tested-by: Fiona Ebner <f.ebner@proxmox.com> Tested-by: Gabriel Goller <g.goller@proxmox.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Gabriel Goller <g.goller@proxmox.com> Tested-by: Gabriel Goller <g.goller@proxmox.com> Link: https://patch.msgid.link/6c3b6c47fb05c100f384630dc48f3975cf37b67a.1781195144.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: atm: reject out-of-range traffic classes in QoS validation
Reject ATM traffic classes above ATM_ANYCLASS in check_tp().
SO_ATMQOS stores the supplied QoS after check_qos() succeeds, so
accepting larger values leaves invalid traffic_class values in
vcc->qos.
That bad state later reaches pvc_info(), which indexes class_name[]
with vcc->qos.{rx,tp}.traffic_class. Values above ATM_ANYCLASS cause
an out-of-bounds read when /proc/net/atm/pvc is read.
Tighten the existing QoS validation so invalid traffic_class values
are rejected at the point where user supplied QoS is accepted.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/58f02c6f73d9818fd5d2022e1116759fdde6116b.1780965530.git.zcliangcn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sechang Lim [Thu, 11 Jun 2026 09:29:18 +0000 (09:29 +0000)]
tcp: clear sock_ops cb flags before force-closing a child socket
A child socket inherits the listener's bpf_sock_ops_cb_flags via
sk_clone_lock(). If its setup fails in tcp_v4_syn_recv_sock() /
tcp_v6_syn_recv_sock(), the child is freed through put_and_exit, where
inet_csk_prepare_forced_close() drops the socket lock and tcp_done() runs
without it.
If BPF_SOCK_OPS_STATE_CB_FLAG was inherited, tcp_done() -> tcp_set_state()
calls tcp_call_bpf(), which expects the lock and trips sock_owned_by_me():
The child is freed before it is ever established, so it should run no
sock_ops callback. Clear its cb flags in inet_csk_prepare_for_destroy_sock(),
the common point for the IPv4, IPv6 and chtls forced-close paths and for the
MPTCP ->syn_recv_sock() failure path (dispose_child), which reaches tcp_done()
on a child that was never established too.
The test test_xdp_native_adjst_head_grow_data tests a case where the
head is adjusted by -256.
When this test runs, data_ptr is shifted to frag_start + 2 (where
frag_start = page_address(page) + offset).
Then, bnxt_rx_multi_page_skb is invoked and the napi_build_skb
expression subtracts 258, landing at an address before frag_start. This
could be either the previous fragment or the previous physical page when
the offset is < 256 (e.g. if the fragment started at offset 0).
When the skb is freed, the page pool fragment reference is dropped on
either the wrong page or the wrong frag of the right page. In either
case, the corrupted reference count can lead to the page being
prematurely recycled while still in use. Once (incorrectly) recycled, it
can be handed out again and on driver teardown this would result in a
double free.
The commit under fixes updated this code to handle the case where the
native page size is >= 64k, but it unintentionally broke the head grow
case.
To fix this, add an offset field to struct bnxt_sw_rx_bd, mirroring the
existing offset field in struct bnxt_sw_rx_agg_bd. Populate it on
allocation and preserve it on reuse.
In bnxt_rx_multi_page_skb, use the newly added offset field to compute
the fragment start and pass that to napi_build_skb. Adjust the layout
with skb_reserve.
There are two cases, the non-adjustment case and the adjustment case.
In both cases, the skb is built at page_address(page) + offset to
account for the case where the native page size >= 64K and skb_reserve
is called with data_ptr - (page_address(page) + offset). That
difference equals bp->rx_offset when data_ptr was not moved, or
bp->rx_offset + xdp_adjust when XDP adjusted the head.
Re-running the failing test with this commit applied causes the test to
run successfully to completion.
The other rx_skb_func implementations don't have this issue.
Fixes: f6974b4c2d8e ("bnxt_en: Fix page pool logic for page size >= 64K") Signed-off-by: Joe Damato <joe@dama.to> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20260609204458.2237787-2-joe@dama.to Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tipc: fix netlink gate and receive-path bugs
This is v4 of the public TIPC series. The only change from v3 is in
patch 1: TIPC_NL_MEDIA_SET now uses GENL_UNS_ADMIN_PERM like the other
mutators, instead of GENL_ADMIN_PERM, so the whole series uses the
namespace-aware CAP_NET_ADMIN check that matches the legacy TIPC netlink
path. Patches 2 and 3 are unchanged.
Patch 1 gives the TIPCv2 mutating generic-netlink operations the admin
gate the legacy API already has, so a local unprivileged process can no
longer change TIPC state. Patch 2 drops CONN_ACK messages that
acknowledge more outstanding sends than exist, preventing the
snt_unacked underflow. Patch 3 rejects peer bindings with lower > upper,
which would otherwise leak binding-table memory.
====================
tipc: reject inverted service ranges from peer bindings
tipc_update_nametbl() inserts a binding advertised by a peer node using
the lower and upper service-range bounds taken directly from the wire,
without checking that lower <= upper. The local bind path validates the
ordering (tipc_uaddr_valid()), but the name-distribution path does not.
A binding with lower > upper is inserted at the far end of the
service-range rbtree (keyed on lower) where no lookup or withdrawal can
ever match it (service_range_foreach_match() requires sr->lower <= end).
The publication, its service_range node and the augmented rbtree entry
are then leaked for the lifetime of the namespace, and there is no
per-peer cap equivalent to TIPC_MAX_PUBL on locally created bindings.
Reject inverted ranges in the network path as well. A peer node can
otherwise leak unbounded binding-table memory by sending PUBLICATION
items with lower > upper.
Fixes: 37922ea4a310 ("tipc: permit overlapping service ranges in name table") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech> Link: https://patch.msgid.link/20260610124003.3831170-4-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tipc_sk_conn_proto_rcv() subtracts the peer-supplied connection ack count
from the unsigned 16-bit send counter snt_unacked without checking that it
does not exceed the number of messages actually outstanding:
tsk->snt_unacked -= msg_conn_ack(hdr);
msg_conn_ack() is read straight from a received CONN_MANAGER/CONN_ACK
message. If the ack count is larger than snt_unacked, the subtraction
wraps to a near-maximum value, leaving tsk_conn_cong() permanently true
and starving the connection of further transmits.
Validate the ACK count at the start of the CONN_ACK block and drop the
message if it acknowledges more messages than are outstanding. A peer (or,
for a local connection, the connected peer socket) can otherwise wedge a
TIPC connection's send side by sending an oversized connection ack.
tipc: require net admin for TIPCv2 netlink mutators
TIPCv2 registers mutating generic-netlink operations without admin
permission flags. Generic netlink only checks CAP_NET_ADMIN when an
operation sets GENL_ADMIN_PERM or GENL_UNS_ADMIN_PERM, so a local
unprivileged process can currently change TIPC state through commands
such as TIPC_NL_NET_SET, TIPC_NL_KEY_SET, TIPC_NL_KEY_FLUSH, and
bearer enable/disable.
The legacy TIPC netlink API already checks netlink_net_capable(...,
CAP_NET_ADMIN) for administrative commands. Give the TIPCv2 mutators
the equivalent generic-netlink gate. Use GENL_UNS_ADMIN_PERM, which
maps to the same namespace-aware CAP_NET_ADMIN check that
netlink_net_capable() performs, so the behaviour matches the legacy
path and keeps working for CAP_NET_ADMIN holders in a non-initial user
namespace (containers).
A QEMU/KASAN repro run as uid/gid 65534 with zero effective
capabilities previously succeeded in changing the network id and node
identity, setting and flushing key material, and enabling/disabling a
UDP bearer. With this patch applied the same operations fail with
-EPERM.
Victor Nogueira [Wed, 10 Jun 2026 13:28:24 +0000 (10:28 -0300)]
net/sched: sch_hfsc: Don't make class passive twice
update_vf() is called from two places for the same class during a single
dequeue when the class's child qdisc (e.g. codel/fq_codel) drops its last
packets while dequeuing:
1. The child calls qdisc_tree_reduce_backlog(), which, now that the child
is empty, invokes hfsc_qlen_notify() -> update_vf(cl, 0, 0) and turns
the class passive (cl_nactive is decremented up the hierarchy).
2. hfsc_dequeue() then calls update_vf(cl, qdisc_pkt_len(skb), cur_time)
to charge the dequeued bytes.
On the second call the class is already passive, but its child qdisc is
still empty, so update_vf() arms go_passive again:
The leaf is then skipped by the cl_nactive == 0 check inside the loop,
which does not clear go_passive, so the stale go_passive propagates to the
parent and decrements its cl_nactive a second time. A parent that still
has other active children is driven to cl_nactive == 0 and removed from
the vttree, even though those siblings are still backlogged. They are
never dequeued again and the qdisc stalls.
Fix this by only arming go_passive when the class is actually active, so an
already-passive class no longer triggers a second passive transition. The
byte accounting (cl->cl_total += len) still runs for every ancestor, so
dequeued bytes continue to be counted exactly once.
Fixes: 51eb3b65544c ("sch_hfsc: make hfsc_qlen_notify() idempotent") Reported-by: Anirudh Gupta <anirudhrudr@gmail.com> Closes: https://lore.kernel.org/netdev/CAN2cbVe79oj0O9==m4+4x3v+O+qzRagA=2=wkrp9i9=CqYvyZA@mail.gmail.com/ Tested-by: Anirudh Gupta <anirudhrudr@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260610132824.3027549-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann [Tue, 9 Jun 2026 21:22:40 +0000 (23:22 +0200)]
net: Stop leased rxq before uninstalling its memory provider
netif_rxq_cleanup_unlease() tears down the memory provider that was
installed on a physical RX queue through a netkit queue lease. It
currently revokes the provider's DMA mappings before stopping the
physical queue:
This inverts the ordering used by the regular teardown paths (normal
device unregister and the io_uring zcrx close path), which stop the
queue before revoking the provider's mappings.
With the physical queue still live, its NAPI can keep consuming
net_iov entries from the page_pool alloc cache after the
__netif_mp_uninstall_rxq() has already cleared their dma_addr,
opening a window for the device to DMA to a stale or zero address.
Fix it by swapping the two calls so the queue is stopped (and its
NAPI quiesced) before the provider is uninstalled. No functional
regression was observed across repeated runs of the nk_qlease.py
HW selftest, which exercises the lease teardown path; this was
tested against fbnic QEMU emulation.
Fixes: 5602ad61ebee ("net: Proxy netif_mp_{open,close}_rxq for leased queues") Reported-by: Ahmed Abdelmoemen <ahmedabdelmoumen05@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Wei <dw@davidwei.uk> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260609212240.677889-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yizhou Zhao [Tue, 9 Jun 2026 08:00:52 +0000 (16:00 +0800)]
6lowpan: fix NHC entry use-after-free on error path
lowpan_nhc_do_uncompression() looks up an NHC descriptor while holding
lowpan_nhc_lock. If the descriptor has no uncompress callback, the error
path drops the lock before printing nhc->name.
lowpan_nhc_del() removes descriptors under the same lock and then relies
on synchronize_net() before the owning module can be unloaded. That only
waits for net RX RCU readers. lowpan_header_decompress() is also exported
and can be reached from callers that are not necessarily covered by the net
core RX critical section, for example the Bluetooth 6LoWPAN L2CAP receive
path.
This leaves a race where one task drops lowpan_nhc_lock in the error path,
another task unregisters and frees the matching descriptor after
synchronize_net() returns, and the first task then dereferences nhc->name
for the warning.
With the post-unlock window widened, KASAN reports:
BUG: KASAN: slab-use-after-free in lowpan_nhc_do_uncompression+0x1f4/0x220
Read of size 8
lowpan_nhc_do_uncompression
lowpan_header_decompress
Fix this by printing the warning before dropping lowpan_nhc_lock, so the
descriptor name is read while unregister is still excluded. The malformed
packet is still rejected with -ENOTSUPP.
Fixes: 92aa7c65d295 ("6lowpan: add generic nhc layer interface") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://patch.msgid.link/20260609080054.4541-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Samuel Moelius [Tue, 9 Jun 2026 23:22:45 +0000 (23:22 +0000)]
net: pfcp: allocate per-cpu tstats for PFCP netdevs
PFCP uses dev_get_tstats64() as its ndo_get_stats64 callback, but
pfcp_link_setup() does not request NETDEV_PCPU_STAT_TSTATS. The net
core therefore leaves dev->tstats NULL for PFCP devices.
Creating a PFCP rtnetlink device can immediately ask the new netdev for
stats while building the RTM_NEWLINK notification. That reaches
dev_get_tstats64() and dereferences the NULL dev->tstats pointer.
Set pcpu_stat_type to NETDEV_PCPU_STAT_TSTATS during PFCP link setup so
the net core allocates the storage expected by dev_get_tstats64().
Xin Long [Tue, 9 Jun 2026 22:14:28 +0000 (18:14 -0400)]
sctp: validate embedded address parameter length
sctp_verify_asconf() and sctp_verify_param() only validate ADD_IP, DEL_IP,
and SET_PRIMARY parameters against a fixed minimum size of sizeof(struct
sctp_addip_param) + sizeof(struct sctp_paramhdr). This ensures the outer
parameter is large enough to contain an embedded address parameter header,
but does not verify that the embedded address parameter's declared length
fits within the bounds of the outer parameter.
Later, sctp_process_param() and sctp_process_asconf_param() extract the
embedded address parameter and pass it to af->from_addr_param(), which uses
the address parameter length to parse the variable-length address payload.
A malformed peer can therefore advertise an embedded address parameter
length that exceeds the remaining bytes in the enclosing parameter.
Validate that addr_param->p.length does not exceed the space available
after the sctp_addip_param header before processing the embedded address
parameter. Reject malformed parameters when the embedded address length
extends beyond the enclosing parameter bounds.
This prevents out-of-bounds reads when parsing malformed parameters carried
in INIT or ASCONF processing paths.
Xiang Mei [Tue, 9 Jun 2026 06:51:16 +0000 (23:51 -0700)]
bridge: cfm: reject invalid CCM interval at configuration time
ccm_tx_work_expired() re-arms itself via queue_delayed_work() using
the configured exp_interval converted by interval_to_us(). When
exp_interval is BR_CFM_CCM_INTERVAL_NONE or out of range,
interval_to_us() returns 0, causing the worker to fire immediately in
a tight loop that allocates skbs until OOM.
Fix this by validating exp_interval at configuration time:
- Constrain IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL to the valid range
[BR_CFM_CCM_INTERVAL_3_3_MS, BR_CFM_CCM_INTERVAL_10_MIN] in the
netlink policy so userspace cannot set an invalid value.
- Reject starting CCM TX in br_cfm_cc_ccm_tx() when exp_interval has
not yet been configured (defaults to 0 from kzalloc).
Fixes: 2be665c3940d ("bridge: cfm: Netlink SET configuration Interface.") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609065116.2818837-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Wed, 10 Jun 2026 10:18:39 +0000 (06:18 -0400)]
net/sched: cls_flow: Dont expose folded kernel pointers
The flow classifier falls back to addr_fold() for fields that are missing
from packet headers. In map mode, userspace controls mask, xor, rshift,
addend and divisor, and can observe the resulting classid through class
statistics. This allows a tc classifier in a user/network namespace to
recover the 32-bit folded value of skb->sk, skb_dst() or skb_nfct().
Align with standard kernel practices for pointer hashing and replace the
XOR folding with a keyed siphash (which is cryptographically secure)
Fixes: e5dfb815181f ("[NET_SCHED]: Add flow classifier") Reported-by: Kyle Zeng <kylebot@openai.com> Tested-by: Kyle Zeng <kylebot@openai.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260610101839.14135-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: fix led devicename when using external mdio bus
The qca8k dsa switch can use either an external or internal mdio bus.
This depends on whether the mdio node is defined under the switch node
itself. Upon registering the internal mdio bus, the internal_mdio_bus
of the dsa switch is assigned to this bus. When an external mdio bus is
used, the driver still uses the internal_mdio_bus id which is used to
create the device names of the leds.
This leads to the leds being prefixed with '(efault)' as the
internal_mii_bus is null. So let's fix this by adding a null check and
use the devicename of the external bus instead when an external bus is
configured.
Linus Torvalds [Thu, 11 Jun 2026 17:17:49 +0000 (10:17 -0700)]
Merge tag 'net-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from IPsec and netfilter.
This is relatively small, mostly because we are a bit behind our PW
queue. I'm not aware of any pending regression.
Current release - regressions:
- netfilter: nf_tables_offload: drop device refcount on error
Previous releases - regressions:
- core: add pskb_may_pull() to skb_gro_receive_list()
- xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
- ipv6: fix a potential NPD in cleanup_prefix_route()
- ipv4: fix use-after-free caused by the fqdir_pre_exit() flush
- eth:
- bnxt_en: fix NULL pointer dereference
- emac: fix use-after-free during device removal
- octeontx2-af: fix memory leak in rvu_setup_hw_resources()
- tun: zero the whole vnet header in tun_put_user()
- sit: reload inner IPv6 header after GSO offloads
Previous releases - always broken:
- core: fix double-free in netdev_nl_bind_rx_doit()
- netfilter: nf_log: validate MAC header was set before dumping it
- xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
- tcp: restrict SO_ATTACH_FILTER to priv users
- mctp: usb: fix race between urb completion and rx_retry
cancellation
- eth:
- mlx5: fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list
- mvpp2: sync RX data at the hardware packet offset"
* tag 'net-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
octeontx2-af: fix IP fragment flag corruption on custom KPU profile load
ipv6: Fix a potential NPD in cleanup_prefix_route()
net: txgbe: initialize PHY interface to 0
net: txgbe: distinguish module types by checking identifier
net: txgbe: initialize module info buffer
net: mvpp2: build skb from XDP-adjusted data on XDP_PASS
net: mvpp2: refill RX buffers before XDP or skb use
net: mvpp2: limit XDP frame size to the RX buffer
net: mvpp2: sync RX data at the hardware packet offset
netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
netfilter: nft_fib: fix stale stack leak via the OIFNAME register
netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
netfilter: nf_log: validate MAC header was set before dumping it
netfilter: x_tables: avoid leaking percpu counter pointers
netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
netfilter: nf_tables_offload: drop device refcount on error
netfilter: revalidate bridge ports
rds: mark snapshot pages dirty in rds_info_getsockopt()
ip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup()
ptp: ocp: fix resource freeing order
...
Linus Torvalds [Thu, 11 Jun 2026 16:54:51 +0000 (09:54 -0700)]
Merge tag 'pmdomain-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm
Pull pmdomain fixes from Ulf Hansson:
- imx: Fix OF node refcount
- ti: Fix wakeup configuration for parent devices of wakeup sources
* tag 'pmdomain-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: imx: fix OF node refcount
pmdomain: ti_sci: add wakeup constraint to parent devices of wakeup source
Kiran Kumar K [Mon, 8 Jun 2026 09:54:55 +0000 (15:24 +0530)]
octeontx2-af: fix IP fragment flag corruption on custom KPU profile load
npc_cn20k_apply_custom_kpu() overwrites KPU profile entries with custom
firmware values and then calls npc_cn20k_update_action_entries_n_flags()
over all entries. Since the same function already ran during default
profile initialisation, entries not overridden by the custom firmware
get their flags translated twice, corrupting the CN20K-specific values.
Fix this by extracting the per-entry translation into a helper
npc_cn20k_translate_action_flags() and calling it as each custom entry
is loaded, removing the redundant batch call at the end.
Paolo Abeni [Thu, 11 Jun 2026 10:29:59 +0000 (12:29 +0200)]
Merge tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Revalidate bridge ports, add missing NULL checks to fetch the bridge
device by the port. From Florian Westphal.
2) Fix netdevice refcount leak in the error path of nft_fwd hardware
offload function, also from Florian.
3) Unregister helper expectfn callback on conntrack helper module
removal, otherwise dangling pointer remains in place,
from Weiming Shi.
4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES,
From Kyle Zeng.
5) Validate that device MAC header is present before nf_syslog
accesses it. From Xiang Mei.
6-8) Three patches to address a possible infoleak of stale stack
data in three nf_tables expressions, due to mismatch in the
_init() and _eval() function which is possible since 14fb07130c7d.
From Davide Ornaghi and Florian Westphal.
netfilter pull request 26-06-10
* tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
netfilter: nft_fib: fix stale stack leak via the OIFNAME register
netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
netfilter: nf_log: validate MAC header was set before dumping it
netfilter: x_tables: avoid leaking percpu counter pointers
netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
netfilter: nf_tables_offload: drop device refcount on error
netfilter: revalidate bridge ports
====================
1) xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
Propagate SKBFL_SHARED_FRAG when paged fragments are moved between
skbs so ESP can decide whether in-place crypto is safe.
2) xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload
Replace the unlocked read of xtfs->ra_newskb with a local flag so a
concurrent reassembly can no longer free first_skb between
spin_unlock and the post-loop check.
3) xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()
Prune the inexact bin under xfrm_policy_lock so a concurrent
xfrm_hash_rebuild() can no longer free it before xfrm_policy_kill()
dereferences it.
4) xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
Move hrtimer_cancel() for the output and drop timers ahead of their
spinlocks, breaking the softirq/lock cycle that could deadlock
against the timer callbacks on SMP.
5) xfrm: espintcp: do not reuse an in-progress partial send
Fail a new send when espintcp_push_msgs() returns with emsg->len
still set, so a blocking caller can no longer overwrite ctx->partial
while a previous transfer still owns it.
6) esp: fix page frag reference leak on skb_to_sgvec failure
Add a flag to esp_ssg_unref() to unconditionally unref the source
scatterlist, releasing the old page references that are otherwise
leaked when the second skb_to_sgvec() in esp_output_tail() fails.
Please pull or let me know if there are problems.
ipsec-2026-06-10
* tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
esp: fix page frag reference leak on skb_to_sgvec failure
xfrm: espintcp: do not reuse an in-progress partial send
xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()
xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload
xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
====================
Ido Schimmel [Tue, 9 Jun 2026 14:54:48 +0000 (17:54 +0300)]
ipv6: Fix a potential NPD in cleanup_prefix_route()
addrconf_get_prefix_route() can return the fib6_null_entry sentinel
entry which has a NULL fib6_table pointer. Therefore, before setting the
route's expiration time, check that we are not working with this entry,
as otherwise a NPD will be triggered [1].
Note that the other callers of addrconf_get_prefix_route() are not
susceptible to this bug:
1. addrconf_prefix_rcv(): Requests a route with the 'RTF_ADDRCONF |
RTF_PREFIX_RT' flags which are not set on fib6_null_entry.
2. modify_prefix_route(): Fixed by commit a747e02430df ("ipv6: avoid
possible NULL deref in modify_prefix_route()").
3. __ipv6_ifa_notify(): Calls ip6_del_rt() which specifically checks for
fib6_null_entry and returns an error.
Fixes: 5eb902b8e719 ("net/ipv6: Remove expired routes with a separated list of routes.") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Reviewed-by: David Ahern <dahern@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609145448.768318-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For AML devices, there are some issues where the wrong module
indentified then configure PHY failed.
The module info buffers should be initialized to 0 before the firmware
returns information. And DECLARE_PHY_INTERFACE_MASK() does not guarantee
zeroed contents, so explicitly clear the temporary interface masks before
setting supported interfaces.
Rework txgbe_identify_module() to validate module identifiers through
explicit type checks instead of relying on transceiver_type heuristics.
When using the SFP module, transceiver_type could be a random value,
because it was read from an invalid register.
====================
Jiawen Wu [Mon, 8 Jun 2026 07:08:42 +0000 (15:08 +0800)]
net: txgbe: initialize PHY interface to 0
DECLARE_PHY_INTERFACE_MASK() does not guarantee zeroed contents. Add a
new macro DECLARE_PHY_INTERFACE_MASK_ZERO(), make the stack variable to
be zeroed before setting supported interfaces.
Jiawen Wu [Mon, 8 Jun 2026 07:08:41 +0000 (15:08 +0800)]
net: txgbe: distinguish module types by checking identifier
Rework txgbe_identify_module() to validate module identifiers through
explicit type checks instead of relying on transceiver_type heuristics.
When using the SFP module, transceiver_type could be a random value,
because it was read from an invalid register.
Jiawen Wu [Mon, 8 Jun 2026 07:08:40 +0000 (15:08 +0800)]
net: txgbe: initialize module info buffer
The module info buffer should be initialized to 0 before the firmware
returns information. Otherwise, there is a risk that the buffer field
not filled by the firmware is random value.
This is v5 of the earlier XDP_PASS fix. The XDP_PASS change is
retained, and the series also fixes related RX/XDP buffer handling
issues found during review.
Tested with tools/testing/selftests/drivers/net/xdp.py on mvpp2
hardware.
====================
Til Kaiser [Sun, 7 Jun 2026 13:49:43 +0000 (15:49 +0200)]
net: mvpp2: build skb from XDP-adjusted data on XDP_PASS
When an XDP program uses bpf_xdp_adjust_head() or bpf_xdp_adjust_tail()
and then returns XDP_PASS, mvpp2 still builds the skb from fixed offsets
derived from the original RX descriptor. Packet geometry changes made by
the XDP program are therefore discarded before the skb reaches the stack.
Update rx_offset and rx_bytes from xdp.data and xdp.data_end for
XDP_PASS. This makes skb_reserve() and skb_put() reflect the packet seen
by XDP, and makes RX byte accounting for XDP_PASS follow the length of the
skb passed to the network stack.
Keep a separate rx_sync_size for page-pool recycling on skb allocation
failure, which must stay tied to the received buffer range.
Non-PASS verdicts continue to account the descriptor length because no skb
is passed up in those cases.
Til Kaiser [Sun, 7 Jun 2026 13:49:42 +0000 (15:49 +0200)]
net: mvpp2: refill RX buffers before XDP or skb use
The RX error path returns the current descriptor buffer to the hardware
BM pool. That is only valid while the driver still owns the buffer.
mvpp2_rx_refill() can fail after the current buffer has been handed to
XDP or attached to an skb. In those cases mvpp2_run_xdp() may have
recycled, redirected, or queued the page for XDP_TX, and an skb free also
retires the data buffer. Returning such a buffer to BM lets hardware DMA
into memory that is no longer owned by the RX ring.
Refill the BM pool before handing the current buffer to XDP or to the
skb. If the allocation fails there, drop the packet and return the
still-owned current buffer to BM, preserving the pool depth. Once the
refill succeeds, later local drops retire/free the current buffer instead
of returning it to BM.
Fixes: 07dd0a7aae7f ("mvpp2: add basic XDP support") Fixes: d6526926de73 ("net: mvpp2: fix memory leak in mvpp2_rx") Signed-off-by: Til Kaiser <mail@tk154.de> Link: https://patch.msgid.link/20260607134943.21996-4-mail@tk154.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Til Kaiser [Sun, 7 Jun 2026 13:49:41 +0000 (15:49 +0200)]
net: mvpp2: limit XDP frame size to the RX buffer
mvpp2 has short and long BM pools, and short pool buffers can be smaller
than PAGE_SIZE. The XDP path nevertheless initializes every xdp_buff with
PAGE_SIZE as frame size.
XDP helpers use frame_sz to validate tail growth and to derive the hard
end of the data area. Advertising PAGE_SIZE for short buffers can let
bpf_xdp_adjust_tail() grow a packet past the real allocation, corrupting
memory or later tripping skb tailroom checks.
Initialize the XDP buffer with bm_pool->frag_size so XDP tailroom matches
the actual buffer backing the packet.
Til Kaiser [Sun, 7 Jun 2026 13:49:40 +0000 (15:49 +0200)]
net: mvpp2: sync RX data at the hardware packet offset
mvpp2 programs the RX queue packet offset, so hardware writes received
data at dma_addr + MVPP2_SKB_HEADROOM. The current CPU sync starts at
dma_addr and only covers rx_bytes + MVPP2_MH_SIZE bytes, which syncs the
unused headroom and misses the same number of bytes at the packet tail.
On non-coherent DMA systems this can leave the CPU reading stale cache
contents for the end of the received frame.
Use dma_sync_single_range_for_cpu() with MVPP2_SKB_HEADROOM as the range
offset so the sync covers the Marvell header and packet data actually
written by hardware.
Linus Torvalds [Wed, 10 Jun 2026 18:53:55 +0000 (11:53 -0700)]
Merge tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address some remaining fallout after introducing dynamic EPP
support in the amd-pstate driver during the current development cycle:
- Restore allowing writing EPP of 0 when in performance mode in the
amd-pstate driver which was unnecessarily disallowed by one of the
recent updates (Mario Limonciello)
- Remove stale documentation of the epp_cached field in struct
amd_cpudata that has been dropped recently (Zhan Xusheng)"
* tag 'pm-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq/amd-pstate: Fix setting EPP in performance mode
cpufreq/amd-pstate: drop stale @epp_cached kdoc
Davide Ornaghi [Wed, 10 Jun 2026 10:39:13 +0000 (12:39 +0200)]
netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
NFT_META_BRI_IIFHWADDR declares its destination register with
len = ETH_ALEN (6 bytes), which the register-init tracking rounds up to
two 32-bit registers (8 bytes). nft_meta_bridge_get_eval() then does
memcpy(dest, br_dev->dev_addr, ETH_ALEN), writing only 6 bytes and
leaving the upper 2 bytes of the second register as uninitialised
nft_do_chain() stack. A downstream load of that register span leaks
those stale bytes to userspace.
Zero the second register before the memcpy so the full declared span is
written.
Davide Ornaghi [Wed, 10 Jun 2026 10:39:12 +0000 (12:39 +0200)]
netfilter: nft_fib: fix stale stack leak via the OIFNAME register
For NFT_FIB_RESULT_OIFNAME the destination register is declared with
len = IFNAMSIZ (four 32-bit registers), but on the lookup-fail,
RTN_LOCAL and oif-mismatch paths nft_fib{4,6}_eval() only writes one
register via "*dest = 0". The remaining three registers are left as
whatever was on the stack in nft_do_chain()'s struct nft_regs, and a
downstream expression that loads the register span can leak that
uninitialised kernel stack to userspace.
The NFTA_FIB_F_PRESENT existence check has the same shape: it is only
meaningful for NFT_FIB_RESULT_OIF, yet it was accepted for any result type
while the eval stores a single byte via nft_reg_store8(), leaving the rest
of the declared span stale.
Fix both:
- replace the bare "*dest = 0" in the eval with nft_fib_store_result(),
which strscpy_pad()s the whole IFNAMSIZ for OIFNAME (and is already
used on the other early-return path), and
- restrict NFTA_FIB_F_PRESENT to NFT_FIB_RESULT_OIF and declare its
destination as a single u8, so the marked span matches the one byte
the eval writes.
netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
nft_exthdr_init() passes user-controlled priv->len to
nft_parse_register_store(), which marks that many bytes in the
register bitmap as initialized. However, when NFT_EXTHDR_F_PRESENT
is set, the eval paths write only 1 byte (nft_reg_store8) or
4 bytes (*dest = 0 on TCP/DCCP error path). When len > 4,
registers beyond the first are never written, retaining
uninitialized stack data from nft_regs.
Bail out if userspace requests too much data when F_PRESENT is set.
Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Fixes: c078ca3b0c5b ("netfilter: nft_exthdr: Add support for existence check") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Xiang Mei [Tue, 9 Jun 2026 22:55:02 +0000 (15:55 -0700)]
netfilter: nf_log: validate MAC header was set before dumping it
The fallback path of dump_mac_header() guards the MAC header access
only with "skb->mac_header != skb->network_header", without checking
skb_mac_header_was_set(). When the MAC header is unset, mac_header is
0xffff, so the test passes and skb_mac_header(skb) returns
skb->head + 0xffff, ~64 KiB past the buffer; the loop then reads
dev->hard_header_len bytes out of bounds into the kernel log.
This is reachable via the netdev logger: nf_log_unknown_packet() calls
dump_mac_header() unconditionally, and an skb sent through AF_PACKET
with PACKET_QDISC_BYPASS reaches the egress hook with mac_header still
unset (__dev_queue_xmit(), which would reset it, is bypassed).
Add the skb_mac_header_was_set() check the ARPHRD_ETHER path already
uses, and replace the open-coded MAC header length test with
skb_mac_header_len(). Only skbs with an unset MAC header are affected;
valid ones are dumped as before.
The native and compat get-entries paths copy the fixed rule entry header
from the kernelized rule blob to userspace before overwriting the entry's
counter fields with a sanitized counter snapshot.
On SMP kernels, entry->counters.pcnt contains the percpu allocation
address used by x_tables rule counters. A caller can provide a userspace
buffer that faults during the initial fixed-header copy after pcnt has
been copied but before the later sanitized counter copy runs. The syscall
then returns -EFAULT while leaving the raw percpu pointer in userspace.
Copy only the fixed entry prefix before counters from the kernelized rule
blob, then copy the sanitized counter snapshot into the counter field.
Apply this ordering to the IPv4, IPv6, and ARP native and compat
get-entries implementations so a fault cannot expose the internal percpu
counter pointer.
Fixes: 71ae0dff02d7 ("netfilter: xtables: use percpu rule counters") Signed-off-by: Kyle Zeng <kylebot@openai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Weiming Shi [Wed, 3 Jun 2026 07:38:17 +0000 (00:38 -0700)]
netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
NAT helpers such as nf_nat_h323 store a raw pointer to module text in
exp->expectfn (e.g. ip_nat_q931_expect). nf_ct_helper_expectfn_unregister()
only unlinks the callback descriptor and never walks the expectation table,
so an expectation pending at module removal survives with a dangling
exp->expectfn into freed module text.
When the expected connection arrives, init_conntrack() invokes
exp->expectfn(), now a stale pointer into the unloaded module. Reproduced
on a KASAN build by loading the H.323 helpers, creating a Q.931
expectation, unloading nf_nat_h323, then connecting to the expected port:
Reaching the dangling state requires CAP_SYS_MODULE in the initial user
namespace to remove a NAT helper that still has live expectations, so this
is a robustness fix; leaving an expectation pointing at freed text is wrong
regardless.
Add nf_ct_helper_expectfn_destroy(), which walks the expectation table and
drops every expectation whose ->expectfn matches the descriptor being torn
down. Call it from each NAT helper's exit path after the existing RCU grace
period, so no expectation outlives the code it points at and no extra
synchronize_rcu() is introduced. With the fix, the same reproducer runs to
completion without the Oops.
Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Reported-by: Xiang Mei <xmei5@asu.edu> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebt_redirect_tg() dereferences br_port_get_rcu() return without a
NULL check, causing a kernel panic when the bridge port has been
removed between the original hook invocation and an NFQUEUE
reinject.
A mere NULL check isn't sufficient, however. As sashiko review
points out userspace can not only remove the port from the bridge,
it could also place the device in a different virtual device, e.g.
macvlan.
If this happens, we must drop the packet, there is no way for us to
reinject it into the bridge path.
Switch to _upper API, we don't need the bridge port structure.
Also, this fix keeps another bug intact:
Both nfnetlink_log and nfnetlink_queue use CONFIG_BRIDGE_NETFILTER
too aggressive, which prevents certain logging features when queueing
in bridge family: NETFILTER_FAMILY_BRIDGE can be enabled while the old
CONFIG_BRIDGE_NETFILTER cruft is off.
Fixes tag is a common ancestor, this was always broken.
Fixes: f350a0a87374 ("bridge: use rx_handler_data pointer to store net_bridge_port pointer") Reported-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Breno Leitao [Mon, 8 Jun 2026 09:32:05 +0000 (02:32 -0700)]
rds: mark snapshot pages dirty in rds_info_getsockopt()
rds_info_getsockopt() pins the destination user pages with FOLL_WRITE and
the RDS_INFO_* producers memcpy the snapshot into them through
kmap_atomic(). Because that copy goes through the kernel direct map, the
dirty bit on the user PTE is never set, so unpin_user_pages() releases the
pages without marking them dirty. A file-backed destination page can then
be reclaimed without writeback, silently discarding the copied data.
Use unpin_user_pages_dirty_lock() with make_dirty=true so the modified
pages are marked dirty before they are unpinned.
Eric Dumazet [Mon, 8 Jun 2026 16:46:13 +0000 (16:46 +0000)]
ip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup()
In vti6_tnl_lookup(), when an exact match for a tunnel fails,
the code falls back to searching for wildcard tunnels:
- Tunnels matching the packet's local address, with any remote address
wildcard remote).
- Tunnels matching the packet's remote address, with any local address
(wildcard local).
However, vti6 stores all these different types of tunnels in the same
hash table (ip6n->tnls_r_l) prone to hash collisions.
The bug is that the fallback search loops in vti6_tnl_lookup() were
missing checks to ensure that the candidate tunnel actually has
a wildcard address.
Fixes: fbe68ee87522 ("vti6: Add a lookup method for tunnels with wildcard endpoints.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608164613.933023-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 10 Jun 2026 14:18:32 +0000 (07:18 -0700)]
Merge tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Fix the implementation of the CFI branch landing pad control prctl()s
to return -EINVAL if unknown control bits are set, rather than
silently ignoring the request; and add a kselftest for this case
- Fix unaligned access performance testing to happen earlier in boot,
which fixes a performance regression in the lib/checksum code
- Fix a binfmt_elf warning when dumping core (due to missing
.core_note_name for CFI registers)
* tag 'riscv-for-linux-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: cfi: reject unknown flags in PR_SET_CFI
riscv: Fix fast_unaligned_access_speed_key not getting initialized
riscv/ptrace: Use USER_REGSET_NOTE_TYPE for REGSET_CFI
Jann Horn [Fri, 5 Jun 2026 20:27:33 +0000 (22:27 +0200)]
namespace: restrict OPEN_TREE_NAMESPACE/FSMOUNT_NAMESPACE to directories
open_tree(..., OPEN_TREE_NAMESPACE) and
fsmount(..., FSMOUNT_NAMESPACE, ...) currently work on non-directories,
like regular files. That's bad for two reasons:
- It ends up mounting a regular file over the inherited namespace root,
which is a directory; mounting a non-directory over a directory is
normally explicitly forbidden, see for example do_move_mount()
- It causes setns() on the new namespace to set the cwd to a regular
file, which the rest of VFS does not expect
Fix it by restricting create_new_namespace() (which is used by both of
these flags) to directories.
Leave the behavior for OPEN_TREE_CLONE as-is, that seems unproblematic.
Fixes: 9b8a0ba68246 ("mount: add OPEN_TREE_NAMESPACE") Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Drake [Mon, 8 Jun 2026 21:01:08 +0000 (22:01 +0100)]
gpiolib: handle gpio-hogs only once
Commit d1d564ec49929 ("gpio: move hogs into GPIO core") introduced a
behaviour change that breaks boot on Raspberry Pi 5 when using the
firmware-supplied device tree:
gpiochip_add_data_with_key: GPIOs 544..575
(/soc@107c000000/gpio@7d517c00) failed to register, -22
brcmstb-gpio 107d517c00.gpio: Could not add gpiochip for bank 1
brcmstb-gpio 107d517c00.gpio: probe with driver brcmstb-gpio failed
with error -22
gpio-brcmstb registers two gpio_chips against the device tree
node gpio@7d517c00, one for each bank. The firmware-supplied DT includes
a gpio-hog on RP1 RUN, and this gpio-hog is attempted to be applied to
*both* gpio_chips. This succeeds against bank 0 (which hosts the GPIO)
and fails for bank 1 (which does not).
In the previous implementation, failures to apply gpio-hogs were
quietly ignored. In the new code, the error code propagates and causes
probe to fail.
Closely approximate the previous behaviour by using the OF_POPULATED flag
to ensure that each gpio-hog is processed only once. The flag was
previously being set before the gpio-hogs were processed, so as part
of this change, the flag now gets set only after the gpio-hog is actioned.
The handling of gpio-hogs on a DT node with multiple gpio_chips remains a
bit incomplete/unclear, but this at least retains the ability to apply
hogs to the first gpio_chip per node.
If gpiochip_hog_lines() successfully processes some hogs but fails on
a later one, the error handling path in gpiochip_add_data_with_key()
jumps directly to err_remove_of_chip. This leaks resources allocated
earlier for ACPI, interrupts and hogs that were successfully processed.
Use the right label in error path.
Closes: https://sashiko.dev/#/patchset/20260608210108.36248-1-dan%40reactivated.net Fixes: d1d564ec4992 ("gpio: move hogs into GPIO core") Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://patch.msgid.link/20260609-gpio-hogs-fixes-v1-2-b4064f8070e7@oss.qualcomm.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Vadim Fedorenko [Mon, 8 Jun 2026 15:59:52 +0000 (15:59 +0000)]
ptp: ocp: fix resource freeing order
Commit a60fc3294a37 ("ptp: rework ptp_clock_unregister() to disable
events") added a call to ptp_disable_all_events() which changes the
configuration of pins if they support EXTTS events. In ptp_ocp_detach()
pins resources are freed before ptp_clock_unregister() and it leads to
use-after-free during driver removal. Fix it by changing the order of
free/unregister calls. To avoid irq handler running on the other core
while ptp device unregistering, call synchronize_irq() after HW is
configured to stop producing irqs and no irqs are in-flight.
Xiang Mei [Sun, 7 Jun 2026 05:44:28 +0000 (22:44 -0700)]
tun: zero the whole vnet header in tun_put_user()
tun_put_user() declares an on-stack struct virtio_net_hdr_v1_hash_tunnel
without zeroing it. For a non-tunnel skb, virtio_net_hdr_tnl_from_skb()
only initializes the first 10 bytes (sizeof(struct virtio_net_hdr)),
leaving bytes 10..23 (num_buffers and the hash/tunnel fields) as stack
garbage.
An unprivileged user can set the vnet header size to 24 with
TUNSETVNETHDRSZ, so __tun_vnet_hdr_put() copies all 24 bytes of the
partially-initialized struct to userspace, leaking 14 bytes of kernel
stack on every read of a non-tunnel packet.
Fix it the same way tun_get_user() already does by zeroing the whole
header right after declaration.
Fixes: 288f30435132 ("tun: enable gso over UDP tunnel support.") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260607054428.3050243-1-xmei5@asu.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Weiming Shi [Sat, 6 Jun 2026 19:24:48 +0000 (12:24 -0700)]
net/rds: fix NULL deref in rds_ib_send_cqe_handler() on masked atomic completion
rds_ib_xmit_atomic() always programs a masked atomic opcode
(IB_WR_MASKED_ATOMIC_CMP_AND_SWP or IB_WR_MASKED_ATOMIC_FETCH_AND_ADD)
for every RDS atomic cmsg. But the completion-side switch in
rds_ib_send_unmap_op() only handles the non-masked opcodes, so a masked
atomic completion falls through to default and returns rm == NULL while
send->s_op is left set. rds_ib_send_cqe_handler() then dereferences the
NULL rm via rm->m_final_op, oopsing in softirq context. An unprivileged
AF_RDS sendmsg() of an atomic cmsg over an active RDS/IB connection
triggers it; on hardware that natively accepts masked atomics (mlx4,
mlx5) no extra setup is needed.
RDS/IB: rds_ib_send_unmap_op: unexpected opcode 0xd in WR!
Oops: general protection fault [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000190-0x0000000000000197]
RIP: rds_ib_send_cqe_handler+0x25c/0xb10 (net/rds/ib_send.c:282)
Call Trace:
<IRQ>
rds_ib_send_cqe_handler (net/rds/ib_send.c:282)
poll_scq (net/rds/ib_cm.c:274)
rds_ib_tasklet_fn_send (net/rds/ib_cm.c:294)
tasklet_action_common (kernel/softirq.c:943)
handle_softirqs (kernel/softirq.c:573)
run_ksoftirqd (kernel/softirq.c:479)
</IRQ>
Kernel panic - not syncing: Fatal exception in interrupt
Handle the masked atomic opcodes in the same case as the non-masked
ones: they map to the same struct rds_message.atomic union member, so
the existing container_of()/rds_ib_send_unmap_atomic() body is correct
for them.
Fixes: 20c72bd5f5f9 ("RDS: Implement masked atomic operations") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260606192447.1179255-2-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kyle Zeng [Sun, 7 Jun 2026 02:18:19 +0000 (19:18 -0700)]
net: guard timestamp cmsgs to real error queue skbs
skb_is_err_queue() treats PACKET_OUTGOING as the sole marker for an skb
from sk_error_queue. That assumption is not true for AF_PACKET sockets:
outgoing packet taps are also delivered to packet sockets with
skb->pkt_type == PACKET_OUTGOING, but their skb->cb is owned by AF_PACKET
instead of struct sock_exterr_skb.
If such an skb is received with timestamping enabled, the generic
timestamp cmsg path can read AF_PACKET control-buffer state as
sock_exterr_skb::opt_stats. With SO_RXQ_OVFL enabled, the packet drop
counter overlaps opt_stats. An odd drop count makes the path emit
SCM_TIMESTAMPING_OPT_STATS with skb->len and skb->data. For non-linear
skbs this copies past the linear head and can trigger hardened usercopy or
disclose adjacent heap contents.
Keep skb_is_err_queue() local to net/socket.c, but make it verify that
the PACKET_OUTGOING marker is paired with the sock_rmem_free destructor
installed by sock_queue_err_skb(). AF_PACKET receive skbs use normal
receive ownership and no longer pass as error-queue skbs, while legitimate
sk_error_queue entries keep the PACKET_OUTGOING marker and sock_rmem_free
ownership.
Fixes: 8605330aac5a ("tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260607021819.49698-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sun, 7 Jun 2026 23:03:47 +0000 (19:03 -0400)]
sctp: validate embedded INIT chunk and address list lengths in cookie
sctp_unpack_cookie() only checked that the embedded INIT chunk length
did not exceed the remaining cookie payload, but did not ensure that the
INIT chunk is large enough to contain a complete INIT header.
A malformed COOKIE_ECHO can therefore carry a truncated INIT chunk whose
length field is smaller than sizeof(struct sctp_init_chunk). Later,
sctp_process_init() accesses INIT parameters unconditionally, which may
lead to out-of-bounds reads.
In addition, raw_addr_list_len is not fully validated against the
remaining cookie payload. When cookie authentication is disabled, an
attacker can supply an oversized raw_addr_list_len and cause
sctp_raw_to_bind_addrs() to read beyond the end of the cookie. The
address parser also lacks sufficient bounds checks for parameter headers
and lengths, allowing malformed address parameters to trigger
out-of-bounds reads.
Fix this by:
- requiring the embedded INIT chunk length to be at least sizeof(struct
sctp_init_chunk);
- validating that the INIT chunk and raw address list together fit
within the cookie payload;
- verifying sufficient data exists for each address parameter header and
payload before parsing it.
Note that sctp_verify_init() must be called after sctp_unpack_cookie()
and before sctp_process_init() when cookie authentication is disabled.
This will be addressed in a separate patch.
Eric Dumazet [Mon, 8 Jun 2026 15:59:18 +0000 (15:59 +0000)]
ip6_vti: set netns_immutable on the fallback device.
john1988 and Noam Rathaus reported that vti6_init_net() does not set the
netns_immutable flag on the per-netns fallback tunnel device (ip6_vti0).
Other similar tunnel drivers (like ip6_tunnel, sit, ip6_gre, and ip_tunnel)
correctly set this flag during their fallback device initialization to
prevent them from being moved to another network namespace.
Fixes: 61220ab34948 ("vti6: Enable namespace changing") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20260608155918.787644-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sctp: fix uninit-value in __sctp_rcv_asconf_lookup()
__sctp_rcv_asconf_lookup() in net/sctp/input.c only checks that the ASCONF
chunk can hold the ADDIP header and a parameter header, then calls
af->from_addr_param(), which reads the full address (16 bytes for IPv6)
trusting the parameter's declared length.
An unauthenticated peer can send a truncated trailing ASCONF chunk that
declares an IPv6 address parameter but stops after the 4-byte parameter
header; reached from the no-association lookup path, from_addr_param() then
reads uninitialized bytes past the parameter.
Impact: an unauthenticated SCTP peer makes the receive path read up to 16
bytes of uninitialized memory past a truncated ASCONF address parameter.
The sibling __sctp_rcv_init_lookup() bounds parameters with
sctp_walk_params(); this path open-codes the fetch and omits the bound.
Verify the whole address parameter lies within the chunk before
from_addr_param() reads it, the same class of fix as commit 51e5ad549c43
("net: sctp: fix KMSAN uninit-value in sctp_inq_pop").
Fixes: df2185771439 ("[SCTP]: Update association lookup to look at ASCONF chunks as well") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20260608122234.459098-1-michael.bommarito@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kyle Meyer [Fri, 5 Jun 2026 22:25:24 +0000 (17:25 -0500)]
bnxt_en: Fix NULL pointer dereference
PCIe errors detected by a Root Port or Downstream Port cause error
recovery services to run on all subordinate devices regardless of
administrative state.
The .error_detected() callback, bnxt_io_error_detected(), disables
and synchronizes IRQs via bnxt_disable_int_sync(), which calls
bnxt_cp_num_to_irq_num() to map completion rings to IRQs using
bp->bnapi.
Since bp->bnapi is allocated on NIC open and freed on NIC close, PCIe
error recovery on a closed NIC can dereference a NULL pointer.
Check if bp->bnapi is NULL before disabling and synchronizing IRQs.
Wyatt Feng [Fri, 5 Jun 2026 05:53:42 +0000 (13:53 +0800)]
sctp: stream: fully roll back denied add-stream state
When ADD_OUT_STREAMS is denied, SCTP only shrinks the queued chunks and
then lowers outcnt. That leaves removed stream metadata behind, so a
later re-add can reuse a stale ext and hit a null-pointer dereference in
the scheduler get path.
Fix the rollback by tearing down the removed stream state the same way
other stream resizes do. Unschedule the current scheduler state, drop
the removed stream ext state with sctp_stream_outq_migrate(), and then
reschedule the remaining streams.
This keeps scheduler-private RR/FC/PRIO lists consistent while fully
rolling back denied outgoing stream additions.
Fixes: 637784ade221 ("sctp: introduce priority based stream scheduler") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/d78954ecd94954653ee299400e98d74a03a6f7d3.1780603399.git.bronzed_45_vested@icloud.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 10 Jun 2026 00:20:00 +0000 (17:20 -0700)]
Merge tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bound memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitors are not initialized relying on potentially corrupted
state like the monitoring flag, that is not reset by all monitors
type and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix suffix strip in dot2k
tools/rv: Fix cleanup after failed trace setup
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix substring match bug in monitor name search
tools/rv: Ensure monitor name and desc are NUL-terminated
rv: Use 0 to check preemption enabled in opid
rv: Prevent task migration while handling per-CPU events
rv: Ensure synchronous cleanup for HA monitors
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Fix monitor start ordering and memory ordering for monitoring flag
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Reset per-task DA monitors before releasing the slot
rv: Fix __user specifier usage in extract_params()
Linus Torvalds [Wed, 10 Jun 2026 00:05:19 +0000 (17:05 -0700)]
Merge tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull RTLA fix from Steven Rostedt:
- Fix multi-character short option parsing
Fix regression in parsing of multiple-character short options
(eg -p100 /= -p 100/, -un /= -u -n/) caused by getopt_long()
internal state corruption after a refactoring.
* tag 'trace-tools-v7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix parsing of multi-character short options
Linus Torvalds [Tue, 9 Jun 2026 15:24:25 +0000 (08:24 -0700)]
Merge tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"11 hotfixes. 9 are for MM. 8 are cc:stable and the remaining 3 address
post-7.1 issues or aren't considered suitable for backporting.
Thre's a two-patch series "mm/damon/{reclaim,lru_sort}: handle ctx
allocation failures" from SeongJae Park which fixes a couple of DAMON
-ENOMEM bloopers. The rest are singletons - please see the individual
changelogs for details"
* tag 'mm-hotfixes-stable-2026-06-08-20-51' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mincore: handle non-swap entries before !CONFIG_SWAP guard
arm64: mm: call pagetable dtor when freeing hot-removed page tables
mm/list_lru: drain before clearing xarray entry on reparent
mm/huge_memory: use correct flags for device private PMD entry
mm/damon/lru_sort: handle ctx allocation failure
mm/damon/reclaim: handle ctx allocation failure
zram: fix use-after-free in zram_bvec_write_partial()
MAINTAINERS: update Baoquan He's email address
tools headers UAPI: sync linux/taskstats.h for procacct.c
mm/cma_sysfs: skip inactive CMA areas in sysfs
ipc/shm: serialize orphan cleanup with shm_nattch updates
Linus Torvalds [Tue, 9 Jun 2026 15:19:48 +0000 (08:19 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Several significant bug fixes of pre-existing issues:
- Missing validation on ucap fd types passed from userspace
- Missing validation of HW DMA space vs userpace expected sizes in
EFA queue setup
- DMA corruption when using DMA block sizes >= 4G when setting up MRs
in all drivers
- Missing validation of CPU IDs when setting up dma handles
- Missing validation of IB_MR_REREG_ACCESS when changing writability
of a MR
- Missing validation of received message/packet size in ISER and SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/srp: bound SRP_RSP sense copy by the received length
IB/isert: Reject login PDUs shorter than ISER_HEADERS_LEN
RDMA: During rereg_mr ensure that REREG_ACCESS is compatible
RDMA/core: Validate cpu_id against nr_cpu_ids in DMAH alloc
RDMA/umem: Fix truncation for block sizes >= 4G
RDMA/efa: Validate SQ ring size against max LLQ size
RDMA/core: Validate the passed in fops for ib_get_ucaps()
esp: fix page frag reference leak on skb_to_sgvec failure
In esp_output_tail(), when esp->inplace is false, the old skb page frags
are replaced with a new page from the xfrm page_frag cache The source
scatterlist (sg) is built from the old frags before the replacement, and
esp_ssg_unref() is responsible for releasing the old page references
after the crypto operation completes
However, if the second skb_to_sgvec() call (which builds the destination
scatterlist from the new page) fails, the code jumps to error_free which
only calls kfree(tmp). The old page frag references captured in the
source scatterlist are never released:
1 sg[] is built from old frags via skb_to_sgvec() (no extra get_page)
2 nr_frags is set to 1 and frag[0] is replaced with the new page
3 Second skb_to_sgvec() fails -> goto error_free
Fix this by adding a bool parameter to esp_ssg_unref() that, when true,
unconditionally unrefs the source scatterlist frags. Since req->src is
not yet initialized by aead_request_set_crypt() at the point of the
error, the source scatterlist is obtained directly via esp_req_sg()
Existing callers pass false to preserve the original behavior
The same issue exists in both esp4 and esp6 as the code is identical
====================
net: mctp: usb: minor fixes for MCTP over USB transport driver
This series adds a couple of fixes in the ndo_open / ndo_stop path for
the MCTP over USB transport, where we are incorrectly sequencing two
error cases.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
====================
Jeremy Kerr [Mon, 8 Jun 2026 01:25:41 +0000 (09:25 +0800)]
net: mctp: usb: don't fail mctp_usb_rx_queue on a deferred submission
In the ndo_open path, a deferred queue open will report a failure, and
so the netdev will not be ndo_stop()ed, leaving us with the rx_retry
work potentially pending.
Don't report a deferred queue as an error, as we are still operational.
This means we use the ndo_stop() path for future cleanup, which handles
rx_retry_work cancellation.
That urb completion can then re-schedule rx_retry_work.
Strenghen the sequencing between the stop (preventing another requeue)
and the cancel by updating both atomically under a new rx lock. After
setting ->rx_stopped, and cancelling pending work, we know that the
requeue cannot occur, so all that's left is killing any pending urb.
Marco Scardovi [Sun, 7 Jun 2026 23:05:02 +0000 (01:05 +0200)]
gpio: rockchip: fix generic IRQ chip leak on remove
The driver allocates domain generic chips using
irq_alloc_domain_generic_chips() during probe. However, on driver
remove/teardown, the generic chips are not automatically freed when the
IRQ domain is removed because the domain flags do not include
IRQ_DOMAIN_FLAG_DESTROY_GC.
This causes both the domain generic chips structure and the associated
generic chips to be leaked. Additionally, the generic chips remain on
the global gc_list and may later be visited by generic IRQ chip suspend,
resume, or shutdown callbacks after the GPIO bank has been removed,
potentially resulting in a use-after-free and kernel crash.
Fix the resource leak by explicitly calling
irq_domain_remove_generic_chips() before removing the IRQ domain in
rockchip_gpio_remove().
Fixes: 936ee2675eee ("gpio/rockchip: add driver for rockchip gpio") Assisted-by: Antigravity:gemini-3.5-flash Signed-off-by: Marco Scardovi <scardracs@disroot.org> Link: https://patch.msgid.link/20260607230504.35392-2-scardracs@disroot.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
gpio-mockup validates only that each second gpio_mockup_ranges value is
non-negative before creating the mock chips. The fixed-base form uses
the second value as the first GPIO number after the range, while the
dynamic-base form uses it as the number of GPIOs.
gpio_mockup_register_chip() stores the resulting number of GPIOs in a
u16 and passes it through a PROPERTY_ENTRY_U16("nr-gpios", ...). Values
greater than U16_MAX therefore truncate silently. For example,
gpio_mockup_ranges=-1,65537 creates a one-line mock GPIO chip instead of
rejecting the invalid request.
Reject zero-width, reversed, and over-U16 ranges before registering any
mock chip.
Ruoyu Wang [Tue, 9 Jun 2026 07:33:13 +0000 (15:33 +0800)]
gpio: zynq: fix runtime PM leak on remove
pm_runtime_get_sync() increments the runtime PM usage counter even when it
returns an error. zynq_gpio_remove() uses it to keep the controller active
while removing the GPIO chip, but never drops the usage counter again.
Balance the get with pm_runtime_put_noidle() after disabling runtime PM.
Fixes: 3242ba117e9b ("gpio: Add driver for Zynq GPIO controller") Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com> Link: https://patch.msgid.link/20260609073313.5-1-ruoyuw560@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Anton Leontev [Thu, 4 Jun 2026 16:59:38 +0000 (19:59 +0300)]
hv_netvsc: use kmap_local_page in netvsc_copy_to_send_buf
netvsc_copy_to_send_buf() copies page buffer entries into the VMBus
send buffer using phys_to_virt() on the entry PFN. Entries for the
RNDIS header and the skb linear data come from kmalloc'd memory and
are always in the kernel direct map, but entries for skb fragments
reference page cache or user pages, which on 32-bit x86 with
CONFIG_HIGHMEM=y can live above the LOWMEM boundary. For such a page
phys_to_virt() returns an address outside the direct map and the
subsequent memcpy() faults on the transmit softirq path, which is
fatal.
Map the pages with kmap_local_page() instead, handling two properties
of the page buffer entries:
- pb[i].pfn is a Hyper-V PFN at HV_HYP_PAGE_SIZE (4K) granularity,
not a native PFN. Reconstruct the physical address first and derive
the native page from it, so the mapping stays correct where
PAGE_SIZE > HV_HYP_PAGE_SIZE (e.g. arm64 with 64K pages).
- Since commit 41a6328b2c55 ("hv_netvsc: Preserve contiguous PFN
grouping in the page buffer array"), an entry describes a full
physically contiguous fragment and pb[i].len can exceed PAGE_SIZE,
while kmap_local_page() maps a single page. Copy page by page,
splitting at native page boundaries.
The copy path only handles packets smaller than the send section size
(6144 bytes by default); larger packets take the cp_partial path where
only the RNDIS header is copied. So entries here are bounded by the
section size and a copy is split at most once on 4K-page systems. On
!CONFIG_HIGHMEM configs kmap_local_page() folds to page_address() and
no mapping work is added.
Dawei Feng [Thu, 4 Jun 2026 14:37:56 +0000 (22:37 +0800)]
octeontx2-af: fix memory leak in rvu_setup_hw_resources()
If rvu_npc_exact_init() fails in rvu_setup_hw_resources(), the function
returns directly instead of jumping to the error handling path. This
causes a resource leak for the previously initialized CGX, NPC, fwdata,
and MSI-X states.
Fix this by replacing the direct return with goto cgx_err to ensure
proper cleanup.
The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still present in
v7.1-rc6.
An x86_64 allyesconfig build showed no new warnings. As we do not have
access to Marvell OcteonTX2 RVU AF hardware to test with, no runtime
testing was able to be performed.
Fixes: 3571fe07a090 ("octeontx2-af: Drop rules for NPC MCAM") Cc: stable@vger.kernel.org Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Link: https://patch.msgid.link/20260604143756.1524482-1-dawei.feng@seu.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Howells [Thu, 4 Jun 2026 11:46:00 +0000 (12:46 +0100)]
rxrpc: Fix the ACK parser to extract the SACK table for parsing
Fix modification of the received skbuff in rxrpc_input_soft_acks() and a
potential incorrect access of the buffer in a fragmented UDP packet (the
packet would probably have to be deliberately pre-generated as fragmented)
when AF_RXRPC tries to extract the contents of the SACK table by copying
out the contents of the SACK table into a buffer before attempting to parse
AF_RXRPC assumes that it can just call skb_condense() and then validly
access the SACK table from skb->data and that it will be a flat buffer -
but skb_condense() can silently fail to do anything under some
circumstances.
Note that whilst rxrpc_input_soft_acks() should be able to parse extended
ACKs, the rest of AF_RXRPC doesn't currently support that.
Further, there's then no need to call skb_condense() in rxrpc_input_ack(),
so don't.
Fixes: d57a3a151660 ("rxrpc: Save last ACK's SACK table rather than marking txbufs") Reported-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://lore.kernel.org/r/20260513180907.2061972-1-michael.bommarito@gmail.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: stable@kernel.org Link: https://patch.msgid.link/105362.1780573560@warthog.procyon.org.uk Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chih Kai Hsu [Thu, 4 Jun 2026 09:22:47 +0000 (17:22 +0800)]
r8152: handle the return value of usb_reset_device()
If usb_reset_device() returns a negative error code, stop the
process of probing.
Fixes: 10c3271712f5 ("r8152: disable the ECM mode") Signed-off-by: Chih Kai Hsu <hsu.chih.kai@realtek.com> Reviewed-by: Hayes Wang <hayeswang@realtek.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260604092247.27158-450-nic_swsd@realtek.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adrian Moreno [Thu, 4 Jun 2026 12:19:46 +0000 (14:19 +0200)]
net: openvswitch: fix possible kfree_skb of ERR_PTR
After the patch in the "Fixes" tag, the allocation of the "reply" skb
can happen either before or after locking the ovs_mutex.
However, error cleanups still follow the classical reversed order,
assuming "reply" is allocated before locking: it is freed after unlocking.
If "reply" allocation happens after locking the mutex and it fails,
"reply" is left with an ERR_PTR, and execution jumps to the correspondent
cleanup stage which will try to free an invalid pointer.
Fix this by setting the pointer to NULL after having saved its error
value.
Kyle Zeng [Fri, 5 Jun 2026 07:34:48 +0000 (00:34 -0700)]
ipv6: sit: reload inner IPv6 header after GSO offloads
ipip6_tunnel_xmit() caches the inner IPv6 header pointer at function
entry and continues using it after iptunnel_handle_offloads().
For GSO skbs, iptunnel_handle_offloads() calls skb_header_unclone().
When the skb header is cloned, skb_header_unclone() can call
pskb_expand_head(), which may move the skb head. The pskb_expand_head()
contract requires pointers into the skb header to be reloaded after the
call.
If the later skb_realloc_headroom() branch is not taken, SIT uses the
stale iph6 pointer to read the inner hop limit and DS field. That can
read from a freed skb head after the old head's remaining clone is
released.
Reload iph6 after the offload helper succeeds and before subsequent
reads from the inner IPv6 header. Keep the existing reload after
skb_realloc_headroom(), since that branch can also replace the skb.
Fixes: 14909664e4e1 ("sit: Setup and TX path for sit/UDP foo-over-udp encapsulation") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6eb9ca986d80f6f88cf9@syzkaller.appspotmail.com Link: https://patch.msgid.link/20260605073448.6524-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fushuai Wang [Fri, 5 Jun 2026 10:21:12 +0000 (18:21 +0800)]
net/mlx5: Use effective affinity mask for IRQ selection
When a sf is created after a CPU has been taken offline, the IRQ pool may
contain IRQs with affinity masks that include the offline CPU. Since only
online CPUs should be considered for IRQ placement, cpumask_subset() check
would fail because the iter_mask contains offline CPUs that are not present
in req_mask, causing sf creation to fail.
This is an example:
1. When mlx5 driver loads, it initializes the IRQ pools.
For sf_ctrl_pool with ≤64 sf:
- xa_num_irqs = {N, N} (There is only one slot)
2. When the first SF is created:
- The ctrl IRQ is allocated with mask=cpu_online_mask={0-191}
2. We take CPU 20 offline
3. Existing ctl irq still have mask={0-191}
4. Create a new SF:
- req_mask={0-19,21-191}
- iter_mask={0-191}
- {0-191} is NOT a subset of {0-19,21-191}
- least_loaded_irq=NULL
5. Try to allocate a new irq via irq_pool_request_irq()
6. xa_alloc() fails because the pool is full(There is only one slot)
7. sf creation fails with error
Use irq_get_effective_affinity_mask() instead, which returns the IRQ's
actual effective affinity that already excludes offline CPUs.
Fixes: 061f5b23588a ("net/mlx5: SF, Use all available cpu for setting cpu affinity") Suggested-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260605102112.91772-1-fushuai.wang@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dragos Tatulea [Thu, 4 Jun 2026 13:54:46 +0000 (16:54 +0300)]
net/mlx5e: xsk: Fix DMA and xdp_frame leak on XDP_TX xmit failure
In the XSK branch of mlx5e_xmit_xdp_buff(), when sq->xmit_xdp_frame()
returns false (e.g. XDPSQ is full), the function returns without
unmapping the DMA address or freeing the xdp_frame allocated by
xdp_convert_zc_to_xdp_frame(). The xdpi_fifo push only happens on
success, so the completion path cannot recover these entries.
With CONFIG_DMA_API_DEBUG=y, the leak surfaces on driver unbind:
DMA-API: pci 0000:08:00.0: device driver has pending DMA
allocations while released from device [count=1116]
One of leaked entries details: [device address=0x000000010ffd7028]
[size=1534 bytes] [mapped with DMA_TO_DEVICE] [mapped as phy]
WARNING: kernel/dma/debug.c:881 at dma_debug_device_change+0x127/0x180
...
DMA-API: Mapped at:
debug_dma_map_phys+0x4b/0xd0
dma_map_phys+0xfd/0x2d0
mlx5e_xdp_handle+0x5ae/0xac0 [mlx5_core]
mlx5e_xsk_skb_from_cqe_mpwrq_linear+0xc4/0x170 [mlx5_core]
mlx5e_handle_rx_cqe_mpwrq+0xc1/0x290 [mlx5_core]
Add the missing unmap + xdp_return_frame, matching the cleanup already
done in mlx5e_xdp_xmit(). has_frags is rejected earlier in this branch,
so no per-frag unmap is needed.
Dragos Tatulea [Thu, 4 Jun 2026 13:58:49 +0000 (16:58 +0300)]
net/mlx5: Fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list
mlx5_query_nic_vport_mac_list() sizes its firmware command buffer using
the PF's log_max_current_uc/mc_list capabilities. When querying a VF
vport with a larger configured max (via devlink), the firmware response
can overflow this buffer:
BUG: KASAN: slab-out-of-bounds in mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
Read of size 4 at addr ff1100013ffc8a12 by task kworker/u96:2/385
Fix by querying the vport's own HCA caps to size the buffer correctly.
Refactor the function to allocate and return the MAC list internally,
removing the caller's dependency on knowing the correct max.
Fixes: e16aea2744ab ("net/mlx5: Introduce access functions to modify/query vport mac lists") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260604135849.458060-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mingyu Wang [Thu, 4 Jun 2026 06:48:01 +0000 (14:48 +0800)]
net: qrtr: fix refcount saturation and potential UAF in qrtr_port_remove
In qrtr_port_remove(), the socket reference count is decremented via
__sock_put() before the port is removed from the qrtr_ports XArray and
before the RCU grace period elapses.
This breaks the fundamental RCU update paradigm. It exposes a race
window where a concurrent RCU reader (such as qrtr_reset_ports() or
qrtr_port_lookup()) can obtain a pointer to the socket from the XArray,
and attempt to call sock_hold() on a socket whose reference count has
already dropped to zero.
This exact race condition was hit during syzkaller fuzzing, leading to
the following refcount saturation warning and a potential Use-After-Free:
Fix this by deferring the reference count decrement until after the
xa_erase() and the synchronize_rcu() complete.
(Note: The v1 of this patch incorrectly replaced __sock_put() with
sock_put(). As Simon Horman pointed out, the callers of qrtr_port_remove()
still hold a reference to the socket, so freeing the socket memory here
would lead to a subsequent UAF in the caller. Thus, the __sock_put() is
kept, but only repositioned to close the RCU race.)
Fixes: bdabad3e363d ("net: Add Qualcomm IPC router") Signed-off-by: Mingyu Wang <25181214217@stu.xidian.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260604064801.1180388-1-w15303746062@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: some cleanups following phy_port SFP
While posting the v11 of phy_port netlink, sashiko found some
pre-existing issues, and following the tentative fix, Nicolai found
some more :)
This is V3, with a re-ordering of the port/sfp cleanup, as well as a new
patch (patch 3) that also reorders the phy_remove() path.
====================
net: phy: don't try to setup PHY-driven SFP cages when using genphy
We don't have support for PHY-driver SFP cages with the genphy code.
On top of that, it was found by sashiko that running
sfp_bus_add_upstream() for genphy deadlocks, as for genphy the PHY
probing runs under RTNL, which isn't the case for non-genphy drivers.
This problem was reproduced, and does lead to a deadlock on RTNL.
Before the blamed commit, the phy_sfp_probe() call was made by
individual PHY drivers, so there was no way to get to the SFP probing
path when using genphy.
Let's therefore only run phy_sfp_probe when not using genphy.
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Fixes: bad869b5e41a ("net: phy: Only rely on phy_port for PHY-driven SFP") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260604092819.723505-5-maxime.chevallier@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: Clean the phy_ports after unregistering the downstream SFP bus
As reported by sashiko when looking a other patches, we need to ensure
that the downstream SFP bus gets unregistered prior to destroying the
phy_ports attached to a phy_device, as the SFP code may reference these
ports. Let's make sure we follow that ordering in phy_remove().
net: phy: clean the sfp upstream if phy probing fails
Sashiko reported that we don't call sfp_bus_del_upstream() in the probe
failure path, so let's add it, otherwise the sfp-bus is left with a
dangling 'upstream' field, that may be used later on during SFP events.
This issue existed before the generic phylib sfp support, back when
drivers were calling phy_sfp_probe themselves.
Jakub Kicinski [Sat, 6 Jun 2026 01:21:24 +0000 (18:21 -0700)]
netdev: fix double-free in netdev_nl_bind_rx_doit()
Sashiko flags that genlmsg_reply() always consumes the skb.
The error path calls nlmsg_free(rsp) so we can't jump directly
to it. Let's not unbind, just propagate the error to the user.
This is the typical way of handling genlmsg_reply() failures.
They shouldn't happen unless user does something silly like
calling the kernel with an already-full rcvbuf.
Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice") Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Santosh Kalluri [Thu, 4 Jun 2026 00:08:43 +0000 (17:08 -0700)]
net: phonet: free phonet_device after RCU grace period
phonet_device_destroy() removes a phonet_device from the per-net device
list with list_del_rcu(), but frees it immediately. RCU readers walking
the same list can still hold a pointer to the object after it has been
removed, leading to a slab-use-after-free.
Use kfree_rcu(), matching the lifetime rule already used by
phonet_address_del() for the same object type.
Fixes: eeb74a9d45f7 ("Phonet: convert devices list to RCU") Cc: stable@vger.kernel.org Signed-off-by: Santosh Kalluri <santosh.kalluri129@gmail.com> Acked-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>