Merge tag 'net-7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter, IPsec and wireless. This is again
considerably bigger than the old average. No known outstanding
regressions.
Current release - regressions:
- net: increase IP_TUNNEL_RECURSION_LIMIT to 5
- eth: ice: fix PTP timestamping broken by SyncE code on E825C
Current release - new code bugs:
- eth: stmmac: dwmac-motorcomm: fix eFUSE MAC address read failure
Previous releases - regressions:
- core: fix cross-cache free of KFENCE-allocated skb head
- sched: act_csum: validate nested VLAN headers
- rxrpc: fix call removal to use RCU safe deletion
- xfrm:
- wait for RCU readers during policy netns exit
- fix refcount leak in xfrm_migrate_policy_find
- wifi: rt2x00usb: fix devres lifetime
- mptcp: fix slab-use-after-free in __inet_lookup_established
- ipvs: fix NULL deref in ip_vs_add_service error path
- eth:
- airoha: fix memory leak in airoha_qdma_rx_process()
- lan966x: fix use-after-free and leak in lan966x_fdma_reload()
Previous releases - always broken:
- ipv6: ioam: fix potential NULL dereferences in __ioam6_fill_trace_data()
- ipv4: nexthop: avoid duplicate NHA_HW_STATS_ENABLE on nexthop group
dump
- bridge: guard local VLAN-0 FDB helpers against NULL vlan group
- xsk: tailroom reservation and MTU validation
- rxrpc:
- fix to request an ack if window is limited
- fix RESPONSE authenticator parser OOB read
- netfilter: nft_ct: fix use-after-free in timeout object destroy
- batman-adv: hold claim backbone gateways by reference
- eth:
- stmmac: fix PTP ref clock for Tegra234
- idpf: fix PREEMPT_RT raw/bh spinlock nesting for async VC handling
- ipa: fix GENERIC_CMD register field masks for IPA v5.0+"
* tag 'net-7.0-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (104 commits)
net: lan966x: fix use-after-free and leak in lan966x_fdma_reload()
net: lan966x: fix page pool leak in error paths
net: lan966x: fix page_pool error handling in lan966x_fdma_rx_alloc_page_pool()
nfc: pn533: allocate rx skb before consuming bytes
l2tp: Drop large packets with UDP encap
net: ipa: fix event ring index not programmed for IPA v5.0+
net: ipa: fix GENERIC_CMD register field masks for IPA v5.0+
MAINTAINERS: Add Prashanth as additional maintainer for amd-xgbe driver
devlink: Fix incorrect skb socket family dumping
af_unix: read UNIX_DIAG_VFS data under unix_state_lock
Revert "mptcp: add needs_id for netlink appending addr"
mptcp: fix slab-use-after-free in __inet_lookup_established
net: txgbe: leave space for null terminators on property_entry
net: ioam6: fix OOB and missing lock
rxrpc: proc: size address buffers for %pISpc output
rxrpc: only handle RESPONSE during service challenge
rxrpc: Fix buffer overread in rxgk_do_verify_authenticator()
rxrpc: Fix leak of rxgk context in rxgk_verify_response()
rxrpc: Fix integer overflow in rxgk_verify_response()
rxrpc: Fix missing error checks for rxkad encryption/decryption failure
...
Merge tag 'iommu-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull IOMMU fix from Will Deacon:
- Fix regression introduced by the empty MMU gather fix in -rc7, where
the ->iotlb_sync() callback can be elided incorrectly, resulting in
boot failures (hangs), crashes and potential memory corruption.
* tag 'iommu-fixes-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu: Ensure .iotlb_sync is called correctly
Merge tag 'platform-drivers-x86-v7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fixes from Ilpo Järvinen:
- amd/pmc: Add Thinkpad L14 Gen3 to quirk_s2idle_bug
- asus-armoury: Add support for FA607NU, GU605MU, and GV302XU.
- intel-uncore-freq: Handle autonomous UFS status bit
- ISST: Handle cases with less than max buckets correctly
- intel-uncore-freq & ISST: Mark minor version 3 supported (no
additional driver changes required)
* tag 'platform-drivers-x86-v7.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: asus-armoury: add support for GU605MU
platform/x86: asus-armoury: add support for FA607NU
platform/x86: asus-armoury: add support for GV302XU
platform/x86/amd: pmc: Add Thinkpad L14 Gen3 to quirk_s2idle_bug
platform/x86/intel-uncore-freq: Increase minor version
platform/x86: ISST: Increase minor version
platform/x86/intel-uncore-freq: Handle autonomous UFS status bit
platform/x86: ISST: Reset core count to 0
====================
net: lan966x: fix page_pool error handling and error paths
This series fixes error handling around the lan966x page pool:
1/3 adds the missing IS_ERR check after page_pool_create(), preventing
a kernel oops when the error pointer flows into
xdp_rxq_info_reg_mem_model().
2/3 plugs page pool leaks in the lan966x_fdma_rx_alloc() and
lan966x_fdma_init() error paths, now reachable after 1/3.
3/3 fixes a use-after-free and page pool leak in the
lan966x_fdma_reload() restore path, where the hardware could
resume DMA into pages already returned to the page pool.
====================
David Carlier [Sun, 5 Apr 2026 05:52:41 +0000 (06:52 +0100)]
net: lan966x: fix use-after-free and leak in lan966x_fdma_reload()
When lan966x_fdma_reload() fails to allocate new RX buffers, the restore
path restarts DMA using old descriptors whose pages were already freed
via lan966x_fdma_rx_free_pages(). Since page_pool_put_full_page() can
release pages back to the buddy allocator, the hardware may DMA into
memory now owned by other kernel subsystems.
Additionally, on the restore path, the newly created page pool (if
allocation partially succeeded) is overwritten without being destroyed,
leaking it.
Fix both issues by deferring the release of old pages until after the
new allocation succeeds. Save the old page array before the allocation
so old pages can be freed on the success path. On the failure path, the
old descriptors, pages and page pool are all still valid, making the
restore safe. Also ensure the restore path re-enables NAPI and wakes
the netdev, matching the success path.
Fixes: 89ba464fcf54 ("net: lan966x: refactor buffer reload function") Cc: stable@vger.kernel.org Signed-off-by: David Carlier <devnexen@gmail.com> Link: https://patch.msgid.link/20260405055241.35767-4-devnexen@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Carlier [Sun, 5 Apr 2026 05:52:40 +0000 (06:52 +0100)]
net: lan966x: fix page pool leak in error paths
lan966x_fdma_rx_alloc() creates a page pool but does not destroy it if
the subsequent fdma_alloc_coherent() call fails, leaking the pool.
Similarly, lan966x_fdma_init() frees the coherent DMA memory when
lan966x_fdma_tx_alloc() fails but does not destroy the page pool that
was successfully created by lan966x_fdma_rx_alloc(), leaking it.
Add the missing page_pool_destroy() calls in both error paths.
David Carlier [Sun, 5 Apr 2026 05:52:39 +0000 (06:52 +0100)]
net: lan966x: fix page_pool error handling in lan966x_fdma_rx_alloc_page_pool()
page_pool_create() can return an ERR_PTR on failure. The return value
is used unconditionally in the loop that follows, passing the error
pointer through xdp_rxq_info_reg_mem_model() into page_pool_use_xdp_mem(),
which dereferences it, causing a kernel oops.
Add an IS_ERR check after page_pool_create() to return early on failure.
Robin Murphy [Wed, 8 Apr 2026 14:40:57 +0000 (15:40 +0100)]
iommu: Ensure .iotlb_sync is called correctly
Many drivers have no reason to use the iotlb_gather mechanism, but do
still depend on .iotlb_sync being called to properly complete an unmap.
Since the core code is now relying on the gather to detect when there
is legitimately something to sync, it should also take care of encoding
a successful unmap when the driver does not touch the gather itself.
Fixes: 90c5def10bea ("iommu: Do not call drivers for empty gathers") Reported-by: Jon Hunter <jonathanh@nvidia.com> Closes: https://lore.kernel.org/r/8800a38b-8515-4bbe-af15-0dae81274bf7@nvidia.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Will Deacon <will@kernel.org>
nfc: pn533: allocate rx skb before consuming bytes
pn532_receive_buf() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already hand
a complete frame to pn533_recv_frame() before allocating a fresh receive
buffer.
If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().
Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.
syzbot reported a WARN on my patch series [1]. The actual issue is an
overflow of 16-bit UDP length field, and it exists in the upstream code.
My series added a debug WARN with an overflow check that exposed the
issue, that's why syzbot tripped on my patches, rather than on upstream
code.
It basically sends an oversized (0x34000 bytes) PPPoL2TP packet with UDP
encapsulation, and l2tp_xmit_core doesn't check for overflows when it
assigns the UDP length field. The value gets trimmed to 16 bites.
Add an overflow check that drops oversized packets and avoids sending
packets with trimmed UDP length to the wire.
net: ipa: fix event ring index not programmed for IPA v5.0+
For IPA v5.0+, the event ring index field moved from CH_C_CNTXT_0 to
CH_C_CNTXT_1. The v5.0 register definition intended to define this
field in the CH_C_CNTXT_1 fmask array but used the old identifier of
ERINDEX instead of CH_ERINDEX.
Without a valid event ring, GSI channels could never signal transfer
completions. This caused gsi_channel_trans_quiesce() to block
forever in wait_for_completion().
At least for IPA v5.2 this resolves an issue seen where runtime
suspend, system suspend, and remoteproc stop all hanged forever. It
also meant the IPA data path was completely non functional.
Fixes: faf0678ec8a0 ("net: ipa: add IPA v5.0 GSI register definitions") Signed-off-by: Alexander Koskovich <akoskovich@pm.me> Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260403-milos-ipa-v1-2-01e9e4e03d3e@fairphone.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Emil converts to use spinlock_t for virtchnl transactions to make
consistent use of the xn_bm_lock when accessing the free_xn_bm bitmap,
while also avoiding nested raw/bh spinlock issue on PREEMPT_RT kernels.
He also sets payload size before calling the async handler, to make sure
it doesn't error out prematurely due to invalid size check for idpf.
Kohei Enju changes WARN_ON for missing PTP control PF to a dev_info() on
ice as there are cases where this is expected and acceptable.
Petr Oros fixes conditions in which error paths failed to call
ice_ptp_port_phy_restart() breaking PTP functionality on ice.
Alex significantly reduces reporting of driver information, and time
under RTNL locl, on ixgbe e610 devices by reducing reads of flash info
only on events that could change it.
Michal Schmidt adds missing Hyper-V op on ixgbevf.
Alex Dvoretsky removes call to napi_synchronize() in igb_down() to
resolve a deadlock.
Agalakov Daniil adds error check on e1000 for failed EEPROM read.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000: check return value of e1000_read_eeprom
igb: remove napi_synchronize() in igb_down()
ixgbevf: add missing negotiate_features op to Hyper-V ops table
ixgbe: stop re-reading flash on every get_drvinfo for e610
ice: fix PTP timestamping broken by SyncE code on E825C
ice: ptp: don't WARN when controlling PF is unavailable
idpf: set the payload size before calling the async handler
idpf: improve locking around idpf_vc_xn_push_free()
idpf: fix PREEMPT_RT raw/bh spinlock nesting for async VC handling
====================
Li RongQing [Tue, 7 Apr 2026 02:27:30 +0000 (22:27 -0400)]
devlink: Fix incorrect skb socket family dumping
The devlink_fmsg_dump_skb function was incorrectly using the socket
type (sk->sk_type) instead of the socket family (sk->sk_family)
when filling the "family" field in the fast message dump.
This patch fixes this to properly display the socket family.
Jiexun Wang [Tue, 7 Apr 2026 08:00:14 +0000 (16:00 +0800)]
af_unix: read UNIX_DIAG_VFS data under unix_state_lock
Exact UNIX diag lookups hold a reference to the socket, but not to
u->path. Meanwhile, unix_release_sock() clears u->path under
unix_state_lock() and drops the path reference after unlocking.
Read the inode and device numbers for UNIX_DIAG_VFS while holding
unix_state_lock(), then emit the netlink attribute after dropping the
lock.
This keeps the VFS data stable while the reply is being built.
Fixes: 5f7b0569460b ("unix_diag: Unix inode info NLA") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260407080015.1744197-1-n05ec@lzu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Revert "mptcp: add needs_id for netlink appending addr"
This commit was originally adding the ability to add MPTCP endpoints
with ID 0 by accident. The in-kernel PM, handling MPTCP endpoints at the
net namespace level, is not supposed to handle endpoints with such ID,
because this ID 0 is reserved to the initial subflow, as mentioned in
the MPTCPv1 protocol [1], a per-connection setting.
Note that 'ip mptcp endpoint add id 0' stops early with an error, but
other tools might still request the in-kernel PM to create MPTCP
endpoints with this restricted ID 0.
In other words, it was wrong to call the mptcp_pm_has_addr_attr_id
helper to check whether the address ID attribute is set: if it was set
to 0, a new MPTCP endpoint would be created with ID 0, which is not
expected, and might cause various issues later.
mptcp: fix slab-use-after-free in __inet_lookup_established
The ehash table lookups are lockless and rely on
SLAB_TYPESAFE_BY_RCU to guarantee socket memory stability
during RCU read-side critical sections. Both tcp_prot and
tcpv6_prot have their slab caches created with this flag
via proto_register().
However, MPTCP's mptcp_subflow_init() copies tcpv6_prot into
tcpv6_prot_override during inet_init() (fs_initcall, level 5),
before inet6_init() (module_init/device_initcall, level 6) has
called proto_register(&tcpv6_prot). At that point,
tcpv6_prot.slab is still NULL, so tcpv6_prot_override.slab
remains NULL permanently.
This causes MPTCP v6 subflow child sockets to be allocated via
kmalloc (falling into kmalloc-4k) instead of the TCPv6 slab
cache. The kmalloc-4k cache lacks SLAB_TYPESAFE_BY_RCU, so
when these sockets are freed without SOCK_RCU_FREE (which is
cleared for child sockets by design), the memory can be
immediately reused. Concurrent ehash lookups under
rcu_read_lock can then access freed memory, triggering a
slab-use-after-free in __inet_lookup_established.
Fix this by splitting the IPv6-specific initialization out of
mptcp_subflow_init() into a new mptcp_subflow_v6_init(), called
from mptcp_proto_v6_init() before protocol registration. This
ensures tcpv6_prot_override.slab correctly inherits the
SLAB_TYPESAFE_BY_RCU slab cache.
net: txgbe: leave space for null terminators on property_entry
Lists of struct property_entry are supposed to be terminated with an
empty property, this driver currently seems to be allocating exactly the
amount of entry used.
Change the struct definition to leave an extra element for all
property_entry.
Fixes: c3e382ad6d15 ("net: txgbe: Add software nodes to support phylink") Signed-off-by: Fabio Baltieri <fabio.baltieri@gmail.com> Tested-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/20260405222013.5347-1-fabio.baltieri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This code can lead to an out-of-bounds access of the dev->_tx[] array
when is_input is true. In such a case, the packet is on the RX path and
skb->queue_mapping contains the RX queue index of the ingress device. If
the ingress device has more RX queues than the egress device (dev) has
TX queues, skb_get_queue_mapping(skb) will exceed dev->num_tx_queues.
Add a check to avoid this situation since skb_get_tx_queue() does not
clamp the index. This issue has also revealed that per queue visibility
cannot be accurate and will be replaced later as a new feature.
While at it, add missing lock around qdisc_qstats_qlen_backlog(). The
function __ioam6_fill_trace_data() is called from both softirq and
process contexts, hence the use of spin_lock_bh() here.
Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260403214418.2233266-2-kuba@kernel.org/ Signed-off-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404134137.24553-1-justin.iurman@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 9 Apr 2026 01:56:17 +0000 (18:56 -0700)]
Merge tag 'wireless-2026-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
A few last-minute fixes:
- rfkill: prevent boundless event list
- rt2x00: fix USB resource management
- brcmfmac: validate firmware IDs
- brcmsmac: fix DMA free size
* tag 'wireless-2026-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
net: rfkill: prevent unlimited numbers of rfkill events from being created
wifi: rt2x00usb: fix devres lifetime
wifi: brcmfmac: validate bsscfg indices in IF events
wifi: brcmsmac: Fix dma_free_coherent() size
====================
1) Clear trailing padding in build_polexpire() to prevent
leaking unititialized memory. From Yasuaki Torimaru.
2) Fix aevent size calculation when XFRMA_IF_ID is used.
From Keenan Dong.
3) Wait for RCU readers during policy netns exit before
freeing the policy hash tables.
4) Fix dome too eaerly dropped references on the netdev
when uding transport mode. From Qi Tang.
5) Fix refcount leak in xfrm_migrate_policy_find().
From Kotlyarov Mihail.
6) Fix two fix info leaks in build_report() and
in build_mapping(). From Greg Kroah-Hartman.
7) Zero aligned sockaddr tail in PF_KEY exports.
From Zhengchuan Liang.
* tag 'ipsec-2026-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
net: af_key: zero aligned sockaddr tail in PF_KEY exports
xfrm_user: fix info leak in build_report()
xfrm_user: fix info leak in build_mapping()
xfrm: fix refcount leak in xfrm_migrate_policy_find
xfrm: hold dev ref until after transport_finish NF_HOOK
xfrm: Wait for RCU readers during policy netns exit
xfrm: account XFRMA_IF_ID in aevent size calculation
xfrm: clear trailing padding in build_polexpire()
====================
Jakub Kicinski [Thu, 9 Apr 2026 01:50:27 +0000 (18:50 -0700)]
Merge tag 'batadv-net-pullrequest-20260408' of https://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- reject oversized global TT response buffers, by Ruide Cao
- hold claim backbone gateways by reference, by Haoze Xie
* tag 'batadv-net-pullrequest-20260408' of https://git.open-mesh.org/linux-merge:
batman-adv: hold claim backbone gateways by reference
batman-adv: reject oversized global TT response buffers
====================
Jakub Kicinski [Thu, 9 Apr 2026 01:48:44 +0000 (18:48 -0700)]
Merge tag 'nf-26-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Florian Westphal says:
====================
netfilter updates for net
I only included crash fixes, as we're closer to a release, rest will
be handled via -next.
1) Fix a NULL pointer dereference in ip_vs_add_service error path, from
Weiming Shi, bug added in 6.2 development cycle.
2) Don't leak kernel data bytes from allocator to userspace: nfnetlink_log
needs to init the trailing NLMSG_DONE terminator. From Xiang Mei.
3) xt_multiport match lacks range validation, bogus userspace request will
cause out-of-bounds read. From Ren Wei.
4) ip6t_eui64 match must reject packets with invalid mac header before
calling eth_hdr. Make existing check unconditional. From Zhengchuan
Liang.
5) nft_ct timeout policies are free'd via kfree() while they may still
be reachable by other cpus that process a conntrack object that
uses such a timeout policy. Existing reaping of entries is not
sufficient because it doesn't wait for a grace period. Use kfree_rcu().
From Tuan Do.
6/7) Make nfnetlink_queue hash table per queue. As-is we can hit a page
fault in case underlying page of removed element was free'd. Per-queue
hash prevents parallel lookups. This comes with a test case that
demonstrates the bug, from Fernando Fernandez Mancera.
* tag 'nf-26-04-08' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: nft_queue.sh: add a parallel stress test
netfilter: nfnetlink_queue: make hash table per queue
netfilter: nft_ct: fix use-after-free in timeout object destroy
netfilter: ip6t_eui64: reject invalid MAC header for all packets
netfilter: xt_multiport: validate range encoding in checkentry
netfilter: nfnetlink_log: initialize nfgenmsg in NLMSG_DONE terminator
ipvs: fix NULL deref in ip_vs_add_service error path
====================
Jakub Kicinski [Thu, 9 Apr 2026 01:44:37 +0000 (18:44 -0700)]
Merge branch 'rxrpc-miscellaneous-fixes'
David Howells says:
====================
rxrpc: Miscellaneous fixes
Here are some fixes for rxrpc:
(1) Fix key quota calculation.
(2) Fix a memory leak.
(3) Fix rxrpc_new_client_call_for_sendmsg() to substitute NULL for an
empty key.
Might want to remove this substitution entirely or handle it in
rxrpc_init_client_call_security() instead.
(4) Fix deletion of call->link to be RCU safe.
(5) Fix missing bounds checks when parsing RxGK tickets.
(6) Fix use of wrong skbuff to get challenge serial number. Also actually
substitute the newer response skbuff and release the older one.
(7) Fix unexpected RACK timer warning to report old mode.
(8) Fix call key refcount leak.
(9) Fix the interaction of jumbograms with Tx window space, setting the
request-ack flag when the window space is getting low, typically
because each jumbogram take a big bite out of the window and fewer UDP
packets get traded.
(10) Don't call rxrpc_put_call() with a NULL pointer.
(11) Reject undecryptable rxkad response tickets by checking result of
decryption.
(12) Fix buffer bounds calculation in the RESPONSE authenticator parser.
(13) Fix oversized response length check.
(14) Fix refcount leak on multiple setting of server keyring.
(15) Fix checks made by RXRPC_SECURITY_KEY and RXRPC_SECURITY_KEYRING (both
should be allowed).
(16) Fix lack of result checking on calls to crypto_skcipher_en/decrypt().
(17) Fix token_len limit check in rxgk_verify_response().
(18) Fix rxgk context leak in rxgk_verify_response().
(19) Fix read beyond end of buffer in rxgk_do_verify_authenticator().
(20) Fix parsing of RESPONSE packet on a connection that has already been set
from a prior response.
(21) Fix size of buffers used for rendering addresses into for procfiles.
====================
rxrpc: proc: size address buffers for %pISpc output
The AF_RXRPC procfs helpers format local and remote socket addresses into
fixed 50-byte stack buffers with "%pISpc".
That is too small for the longest current-tree IPv6-with-port form the
formatter can produce. In lib/vsprintf.c, the compressed IPv6 path uses a
dotted-quad tail not only for v4mapped addresses, but also for ISATAP
addresses via ipv6_addr_is_isatap().
is possible with the current formatter. That is 50 visible characters, so
51 bytes including the trailing NUL, which does not fit in the existing
char[50] buffers used by net/rxrpc/proc.c.
Size the buffers from the formatter's maximum textual form and switch the
call sites to scnprintf().
Changes since v1:
- correct the changelog to cite the actual maximum current-tree case
explicitly
- frame the proof around the ISATAP formatting path instead of the earlier
mapped-v4 example
Fixes: 75b54cb57ca3 ("rxrpc: Add IPv6 support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Anderson Nascimento <anderson@allelesecurity.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-22-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wang Jie [Wed, 8 Apr 2026 12:12:48 +0000 (13:12 +0100)]
rxrpc: only handle RESPONSE during service challenge
Only process RESPONSE packets while the service connection is still in
RXRPC_CONN_SERVICE_CHALLENGING. Check that state under state_lock before
running response verification and security initialization, then use a local
secured flag to decide whether to queue the secured-connection work after
the state transition. This keeps duplicate or late RESPONSE packets from
re-running the setup path and removes the unlocked post-transition state
test.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jie Wang <jiewang2024@lzu.edu.cn> Signed-off-by: Yang Yang <n05ec@lzu.edu.cn> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-21-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:45 +0000 (13:12 +0100)]
rxrpc: Fix integer overflow in rxgk_verify_response()
In rxgk_verify_response(), there's a potential integer overflow due to
rounding up token_len before checking it, thereby allowing the length check to
be bypassed.
Fix this by checking the unrounded value against len too (len is limited as
the response must fit in a single UDP packet).
Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)") Closes: https://sashiko.dev/#/patchset/20260401105614.1696001-10-dhowells@redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-18-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:44 +0000 (13:12 +0100)]
rxrpc: Fix missing error checks for rxkad encryption/decryption failure
Add error checking for failure of crypto_skcipher_en/decrypt() to various
rxkad function as the crypto functions can fail with ENOMEM at least.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Closes: https://sashiko.dev/#/patchset/20260401105614.1696001-10-dhowells@redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-17-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:43 +0000 (13:12 +0100)]
rxrpc: Fix key/keyring checks in setsockopt(RXRPC_SECURITY_KEY/KEYRING)
An AF_RXRPC socket can be both client and server at the same time. When
sending new calls (ie. it's acting as a client), it uses rx->key to set the
security, and when accepting incoming calls (ie. it's acting as a server),
it uses rx->securities.
setsockopt(RXRPC_SECURITY_KEY) sets rx->key to point to an rxrpc-type key
and setsockopt(RXRPC_SECURITY_KEYRING) sets rx->securities to point to a
keyring of rxrpc_s-type keys.
Now, it should be possible to use both rx->key and rx->securities on the
same socket - but for userspace AF_RXRPC sockets rxrpc_setsockopt()
prevents that.
Fix this by:
(1) Remove the incorrect check rxrpc_setsockopt(RXRPC_SECURITY_KEYRING)
makes on rx->key.
(2) Move the check that rxrpc_setsockopt(RXRPC_SECURITY_KEY) makes on
rx->key down into rxrpc_request_key().
(3) Remove rxrpc_request_key()'s check on rx->securities.
This (in combination with a previous patch) pushes the checks down into the
functions that set those pointers and removes the cross-checks that prevent
both key and keyring being set.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Closes: https://sashiko.dev/#/patchset/20260401105614.1696001-10-dhowells@redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Anderson Nascimento <anderson@allelesecurity.com>
cc: Luxiao Xu <rakukuip@gmail.com>
cc: Yuan Tan <yuantan098@gmail.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxgk_verify_response() decodes auth_len from the packet and is supposed
to verify that it fits in the remaining bytes. The existing check is
inverted, so oversized RESPONSE authenticators are accepted and passed
to rxgk_decrypt_skb(), which can later reach skb_to_sgvec() with an
impossible length and hit BUG_ON(len).
Decoded from the original latest-net reproduction logs with
scripts/decode_stacktrace.sh:
rxgk_verify_authenticator() copies auth_len bytes into a temporary
buffer and then passes p + auth_len as the parser limit to
rxgk_do_verify_authenticator(). Since p is a __be32 *, that inflates the
parser end pointer by a factor of four and lets malformed RESPONSE
authenticators read past the kmalloc() buffer.
Decoded from the original latest-net reproduction logs with
scripts/decode_stacktrace.sh:
rxkad_decrypt_ticket() decrypts the RXKAD response ticket and then
parses the buffer as plaintext without checking whether
crypto_skcipher_decrypt() succeeded.
A malformed RESPONSE can therefore use a non-block-aligned ticket
length, make the decrypt operation fail, and still drive the ticket
parser with attacker-controlled bytes.
Check the decrypt result and abort the connection with RXKADBADTICKET
when ticket decryption fails.
Fixes: 17926a79320a ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Yuqi Xu <xuyuqiabc@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-12-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Douya Le [Wed, 8 Apr 2026 12:12:38 +0000 (13:12 +0100)]
rxrpc: Only put the call ref if one was acquired
rxrpc_input_packet_on_conn() can process a to-client packet after the
current client call on the channel has already been torn down. In that
case chan->call is NULL, rxrpc_try_get_call() returns NULL and there is
no reference to drop.
The client-side implicit-end error path does not account for that and
unconditionally calls rxrpc_put_call(). This turns a protocol error
path into a kernel crash instead of rejecting the packet.
Only drop the call reference if one was actually acquired. Keep the
existing protocol error handling unchanged.
Fixes: 5e6ef4f1017c ("rxrpc: Make the I/O thread take over the call and local processor work") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Signed-off-by: Douya Le <ldy3087146292@gmail.com> Co-developed-by: Yuan Tan <tanyuan98@gmail.com> Signed-off-by: Yuan Tan <tanyuan98@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-11-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marc Dionne [Wed, 8 Apr 2026 12:12:37 +0000 (13:12 +0100)]
rxrpc: Fix to request an ack if window is limited
Peers may only send immediate acks for every 2 UDP packets received.
When sending a jumbogram, it is important to check that there is
sufficient window space to send another same sized jumbogram following
the current one, and request an ack if there isn't. Failure to do so may
cause the call to stall waiting for an ack until the resend timer fires.
Where jumbograms are in use this causes a very significant drop in
performance.
Fixes: fe24a5494390 ("rxrpc: Send jumbo DATA packets") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-10-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxrpc: Fix key reference count leak from call->key
When creating a client call in rxrpc_alloc_client_call(), the code obtains
a reference to the key. This is never cleaned up and gets leaked when the
call is destroyed.
Fix this by freeing call->key in rxrpc_destroy_call().
Before the patch, it shows the key reference counter elevated:
rxrpc: Fix rack timer warning to report unexpected mode
rxrpc_rack_timer_expired() clears call->rack_timer_mode to OFF before
the switch. The default case warning therefore always prints OFF and
doesn't identify the unexpected timer mode.
Log the saved mode value instead so the warning reports the actual
unexpected rack timer mode.
Fixes: 7c482665931b ("rxrpc: Implement RACK/TLP to deal with transmission stalls [RFC8985]") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-8-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxrpc: Fix use of wrong skb when comparing queued RESP challenge serial
In rxrpc_post_response(), the code should be comparing the challenge serial
number from the cached response before deciding to switch to a newer
response, but looks at the newer packet private data instead, rendering the
comparison always false.
Fix this by switching to look at the older packet.
Fix further[1] to substitute the new packet in place of the old one if
newer and also to release whichever we don't use.
Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://sashiko.dev/#/patchset/20260319150150.4189381-1-dhowells%40redhat.com Link: https://patch.msgid.link/20260408121252.2249051-7-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Oleh Konko [Wed, 8 Apr 2026 12:12:33 +0000 (13:12 +0100)]
rxrpc: Fix RxGK token loading to check bounds
rxrpc_preparse_xdr_yfs_rxgk() reads the raw key length and ticket length
from the XDR token as u32 values and passes each through round_up(x, 4)
before using the rounded value for validation and allocation. When the raw
length is >= 0xfffffffd, round_up() wraps to 0, so the bounds check and
kzalloc both use 0 while the subsequent memcpy still copies the original
~4 GiB value, producing a heap buffer overflow reachable from an
unprivileged add_key() call.
Fix this by:
(1) Rejecting raw key lengths above AFSTOKEN_GK_KEY_MAX and raw ticket
lengths above AFSTOKEN_GK_TOKEN_MAX before rounding, consistent with
the caps that the RxKAD path already enforces via AFSTOKEN_RK_TIX_MAX.
(2) Sizing the flexible-array allocation from the validated raw key
length via struct_size_t() instead of the rounded value.
(3) Caching the raw lengths so that the later field assignments and
memcpy calls do not re-read from the token, eliminating a class of
TOCTOU re-parse.
The control path (valid token with lengths within bounds) is unaffected.
Fixes: 0ca100ff4df6 ("rxrpc: Add YFS RxGK (GSSAPI) security class") Signed-off-by: Oleh Konko <security@1seal.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:32 +0000 (13:12 +0100)]
rxrpc: Fix call removal to use RCU safe deletion
Fix rxrpc call removal from the rxnet->calls list to use list_del_rcu()
rather than list_del_init() to prevent stuffing up reading
/proc/net/rxrpc/calls from potentially getting into an infinite loop.
This, however, means that list_empty() no longer works on an entry that's
been deleted from the list, making it harder to detect prior deletion. Fix
this by:
Firstly, make rxrpc_destroy_all_calls() only dump the first ten calls that
are unexpectedly still on the list. Limiting the number of steps means
there's no need to call cond_resched() or to remove calls from the list
here, thereby eliminating the need for rxrpc_put_call() to check for that.
rxrpc_put_call() can then be fixed to unconditionally delete the call from
the list as it is the only place that the deletion occurs.
Fixes: 2baec2c3f854 ("rxrpc: Support network namespacing") Closes: https://sashiko.dev/#/patchset/20260319150150.4189381-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-5-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:31 +0000 (13:12 +0100)]
rxrpc: Fix anonymous key handling
In rxrpc_new_client_call_for_sendmsg(), a key with no payload is meant to
be substituted for a NULL key pointer, but the variable this is done with
is subsequently not used.
Fix this by using "key" rather than "rx->key" when filling in the
connection parameters.
Note that this only affects direct use of AF_RXRPC; the kAFS filesystem
doesn't use sendmsg() directly and so bypasses the issue. Further,
AF_RXRPC passes a NULL key in if no key is set, so using an anonymous key
in that manner works. Since this hasn't been noticed to this point, it
might be better just to remove the "key" variable and the code that sets it
- and, arguably, rxrpc_init_client_call_security() would be a better place
to handle it.
Fixes: 19ffa01c9c45 ("rxrpc: Use structs to hold connection params and protocol info") Closes: https://sashiko.dev/#/patchset/20260319150150.4189381-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-4-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Howells [Wed, 8 Apr 2026 12:12:29 +0000 (13:12 +0100)]
rxrpc: Fix key quota calculation for multitoken keys
In the rxrpc key preparsing, every token extracted sets the proposed quota
value, but for multitoken keys, this will overwrite the previous proposed
quota, losing it.
Fix this by adding to the proposed quota instead.
Fixes: 8a7a3eb4ddbe ("KEYS: RxRPC: Use key preparsing") Closes: https://sashiko.dev/#/patchset/20260319150150.4189381-1-dhowells%40redhat.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: stable@kernel.org Link: https://patch.msgid.link/20260408121252.2249051-2-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
x86: shadow stacks: proper error handling for mmap lock
김영민 reports that shstk_pop_sigframe() doesn't check for errors from
mmap_read_lock_killable(), which is a silly oversight, and also shows
that we haven't marked those functions with "__must_check", which would
have immediately caught it.
So let's fix both issues.
Reported-by: 김영민 <osori@hspace.io> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
selftests: nft_queue.sh: add a parallel stress test
Introduce a new stress test to check for race conditions in the
nfnetlink_queue subsystem, where an entry is freed while another CPU is
concurrently walking the global rhashtable.
To trigger this, `nf_queue.c` is extended with two new flags:
* -O (out-of-order): Buffers packet IDs and flushes them in reverse.
* -b (bogus verdicts): Floods the kernel with non-existent packet IDs.
The bogus verdict loop forces the kernel's lookup function to perform
full rhashtable bucket traversals (-ENOENT). Combined with reverse-order
flushing and heavy parallel UDP/ping flooding across 8 queues, this puts
the nfnetlink_queue code under pressure.
Joint work with Florian Westphal.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
Tuan Do [Fri, 3 Apr 2026 07:33:17 +0000 (00:33 -0700)]
netfilter: nft_ct: fix use-after-free in timeout object destroy
nft_ct_timeout_obj_destroy() frees the timeout object with kfree()
immediately after nf_ct_untimeout(), without waiting for an RCU grace
period. Concurrent packet processing on other CPUs may still hold
RCU-protected references to the timeout object obtained via
rcu_dereference() in nf_ct_timeout_data().
Add an rcu_head to struct nf_ct_timeout and use kfree_rcu() to defer
freeing until after an RCU grace period, matching the approach already
used in nfnetlink_cttimeout.c.
KASAN report:
BUG: KASAN: slab-use-after-free in nf_conntrack_tcp_packet+0x1381/0x29d0
Read of size 4 at addr ffff8881035fe19c by task exploit/80
netfilter: ip6t_eui64: reject invalid MAC header for all packets
`eui64_mt6()` derives a modified EUI-64 from the Ethernet source address
and compares it with the low 64 bits of the IPv6 source address.
The existing guard only rejects an invalid MAC header when
`par->fragoff != 0`. For packets with `par->fragoff == 0`, `eui64_mt6()`
can still reach `eth_hdr(skb)` even when the MAC header is not valid.
Fix this by removing the `par->fragoff != 0` condition so that packets
with an invalid MAC header are rejected before accessing `eth_hdr(skb)`.
Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Ren Wei <enjou1224z@gmail.com> Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
netfilter: xt_multiport: validate range encoding in checkentry
ports_match_v1() treats any non-zero pflags entry as the start of a
port range and unconditionally consumes the next ports[] element as
the range end.
The checkentry path currently validates protocol, flags and count, but
it does not validate the range encoding itself. As a result, malformed
rules can mark the last slot as a range start or place two range starts
back to back, leaving ports_match_v1() to step past the last valid
ports[] element while interpreting the rule.
Reject malformed multiport v1 rules in checkentry by validating that
each range start has a following element and that the following element
is not itself marked as another range start.
Fixes: a89ecb6a2ef7 ("[NETFILTER]: x_tables: unify IPv4/IPv6 multiport match") Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Co-developed-by: Yuan Tan <yuantan098@gmail.com> Signed-off-by: Yuan Tan <yuantan098@gmail.com> Suggested-by: Xin Liu <bird@lzu.edu.cn> Tested-by: Yuhang Zheng <z1652074432@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
Xiang Mei [Wed, 1 Apr 2026 21:20:57 +0000 (14:20 -0700)]
netfilter: nfnetlink_log: initialize nfgenmsg in NLMSG_DONE terminator
When batching multiple NFLOG messages (inst->qlen > 1), __nfulnl_send()
appends an NLMSG_DONE terminator with sizeof(struct nfgenmsg) payload via
nlmsg_put(), but never initializes the nfgenmsg bytes. The nlmsg_put()
helper only zeroes alignment padding after the payload, not the payload
itself, so four bytes of stale kernel heap data are leaked to userspace
in the NLMSG_DONE message body.
Use nfnl_msg_put() to build the NLMSG_DONE terminator, which initializes
the nfgenmsg payload via nfnl_fill_hdr(), consistent with how
__build_packet_message() already constructs NFULNL_MSG_PACKET headers.
Fixes: 29c5d4afba51 ("[NETFILTER]: nfnetlink_log: fix sending of multipart messages") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Florian Westphal <fw@strlen.de>
Weiming Shi [Wed, 1 Apr 2026 07:58:01 +0000 (15:58 +0800)]
ipvs: fix NULL deref in ip_vs_add_service error path
When ip_vs_bind_scheduler() succeeds in ip_vs_add_service(), the local
variable sched is set to NULL. If ip_vs_start_estimator() subsequently
fails, the out_err cleanup calls ip_vs_unbind_scheduler(svc, sched)
with sched == NULL. ip_vs_unbind_scheduler() passes the cur_sched NULL
check (because svc->scheduler was set by the successful bind) but then
dereferences the NULL sched parameter at sched->done_service, causing a
kernel panic at offset 0x30 from NULL.
Oops: general protection fault, [..] [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:ip_vs_unbind_scheduler (net/netfilter/ipvs/ip_vs_sched.c:69)
Call Trace:
<TASK>
ip_vs_add_service.isra.0 (net/netfilter/ipvs/ip_vs_ctl.c:1500)
do_ip_vs_set_ctl (net/netfilter/ipvs/ip_vs_ctl.c:2809)
nf_setsockopt (net/netfilter/nf_sockopt.c:102)
[..]
Fix by simply not clearing the local sched variable after a successful
bind. ip_vs_unbind_scheduler() already detects whether a scheduler is
installed via svc->scheduler, and keeping sched non-NULL ensures the
error path passes the correct pointer to both ip_vs_unbind_scheduler()
and ip_vs_scheduler_put().
While the bug is older, the problem popups in more recent kernels (6.2),
when the new error path is taken after the ip_vs_start_estimator() call.
Fixes: 705dd3444081 ("ipvs: use kthreads for stats estimation") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
====================
seg6: fix dst_cache sharing in seg6 lwtunnel
The seg6 lwtunnel encap uses a single per-route dst_cache shared
between seg6_input_core() and seg6_output_core(). These two paths
can perform the post-encap SID lookup in different routing contexts
(e.g., ip rules matching on the ingress interface, or VRF table
separation). Whichever path runs first populates the cache, and the
other reuses it blindly, bypassing its own lookup.
Patch 1 fixes this by splitting the cache into cache_input and
cache_output. Patch 2 adds a selftest that validates the isolation.
====================
Andrea Mayer [Sat, 4 Apr 2026 00:44:05 +0000 (02:44 +0200)]
selftests: seg6: add test for dst_cache isolation in seg6 lwtunnel
Add a selftest that verifies the dst_cache in seg6 lwtunnel is not
shared between the input (forwarding) and output (locally generated)
paths.
The test creates three namespaces (ns_src, ns_router, ns_dst)
connected in a line. An SRv6 encap route on ns_router encapsulates
traffic destined to cafe::1 with SID fc00::100. The SID is
reachable only for forwarded traffic (from ns_src) via an ip rule
matching the ingress interface (iif veth-r0 lookup 100), and
blackholed in the main table.
The test verifies that:
1. A packet generated locally on ns_router does not reach
ns_dst with an empty cache, since the SID is blackholed;
2. A forwarded packet from ns_src populates the input cache
from table 100 and reaches ns_dst;
3. A packet generated locally on ns_router still does not
reach ns_dst after the input cache is populated,
confirming the output path does not reuse the input
cache entry.
Both the forwarded and local packets are pinned to the same CPU
with taskset, since dst_cache is per-cpu.
Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404004405.4057-3-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrea Mayer [Sat, 4 Apr 2026 00:44:04 +0000 (02:44 +0200)]
seg6: separate dst_cache for input and output paths in seg6 lwtunnel
The seg6 lwtunnel uses a single dst_cache per encap route, shared
between seg6_input_core() and seg6_output_core(). These two paths
can perform the post-encap SID lookup in different routing contexts
(e.g., ip rules matching on the ingress interface, or VRF table
separation). Whichever path runs first populates the cache, and the
other reuses it blindly, bypassing its own lookup.
Fix this by splitting the cache into cache_input and cache_output,
so each path maintains its own cached dst independently.
Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Cc: stable@vger.kernel.org Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Justin Iurman <justin.iurman@gmail.com> Link: https://patch.msgid.link/20260404004405.4057-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Golle [Sun, 5 Apr 2026 21:29:19 +0000 (22:29 +0100)]
selftests: net: bridge_vlan_mcast: wait for h1 before querier check
The querier-interval test adds h1 (currently a slave of the VRF created
by simple_if_init) to a temporary bridge br1 acting as an outside IGMP
querier. The kernel VRF driver (drivers/net/vrf.c) calls cycle_netdev()
on every slave add and remove, toggling the interface admin-down then up.
Phylink takes the PHY down during the admin-down half of that cycle.
Since h1 and swp1 are cable-connected, swp1 also loses its link may need
several seconds to re-negotiate.
Use setup_wait_dev $h1 0 which waits for h1 to return to UP state, so the
test can rely on the link being back up at this point.
Jakub Kicinski [Sat, 4 Apr 2026 00:19:38 +0000 (17:19 -0700)]
net: avoid nul-deref trying to bind mp to incapable device
Sashiko points out that we use qops in __net_mp_open_rxq()
but never validate they are null. This was introduced when
check was moved from netdev_rx_queue_restart().
Look at ops directly instead of the locking config.
qops imply netdev_need_ops_lock(). We used netdev_need_ops_lock()
initially to signify that the real_num_rx_queues check below
is safe without rtnl_lock, but I'm not sure if this is actually
clear to most people, anyway.
Fixes: da7772a2b4ad ("net: move mp->rx_page_size validation to __net_mp_open_rxq()") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20260404001938.2425670-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johan Alvarado [Mon, 6 Apr 2026 07:44:25 +0000 (07:44 +0000)]
net: stmmac: dwmac-motorcomm: fix eFUSE MAC address read failure
This patch fixes an issue where reading the MAC address from the eFUSE
fails due to a race condition.
The root cause was identified by comparing the driver's behavior with a
custom U-Boot port. In U-Boot, the MAC address was read successfully
every time because the driver was loaded later in the boot process, giving
the hardware ample time to initialize. In Linux, reading the eFUSE
immediately returns all zeros, resulting in a fallback to a random MAC address.
Hardware cold-boot testing revealed that the eFUSE controller requires a
short settling time to load its internal data. Adding a 2000-5000us
delay after the reset ensures the hardware is fully ready, allowing the
native MAC address to be read consistently.
Fixes: 02ff155ea281 ("net: stmmac: Add glue driver for Motorcomm YT6801 ethernet controller") Reported-by: Georg Gottleuber <ggo@tuxedocomputers.com> Closes: https://lore.kernel.org/24cfefff-1233-4745-8c47-812b502d5d19@tuxedocomputers.com Signed-off-by: Johan Alvarado <contact@c127.dev> Reviewed-by: Yao Zi <me@ziyao.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/fc5992a4-9532-49c3-8ec1-c2f8c5b84ca1@smtp-relay.sendinblue.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
John Pavlick [Mon, 6 Apr 2026 13:23:33 +0000 (13:23 +0000)]
net: sfp: add quirks for Hisense and HSGQ GPON ONT SFP modules
Several GPON ONT SFP sticks based on Realtek RTL960x report
1000BASE-LX at 1300MBd in their EEPROM but can operate at 2500base-X.
On hosts capable of 2500base-X (e.g. Banana Pi R3 / MT7986), the
kernel negotiates only 1G because it trusts the incorrect EEPROM data.
Each quirk advertises 2500base-X and ignores TX_FAULT during the
module's ~40s Linux boot time.
Tested on Banana Pi R3 (MT7986) with OpenWrt 25.12.1, confirmed
2.5Gbps link and full throughput with flow offloading.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Suggested-by: Marcin Nita <marcin.nita@leolabs.pl> Signed-off-by: John Pavlick <jspavlick@posteo.net> Link: https://patch.msgid.link/20260406132321.72563-1-jspavlick@posteo.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'ata-7.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Add a quirk for JMicron JMB582/JMB585 AHCI controllers such that
they only use 32-bit DMA addresses.
While these controllers do report that they support 64-bit DMA
addresses, a user reports that using 64-bit DMA addresses cause
silent corruption even on modern x86 systems (Arthur)
* tag 'ata-7.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: force 32-bit DMA for JMicron JMB582/JMB585
Merge tag 'hyperv-fixes-signed-20260406' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull Hyper-V fixes from Wei Liu:
- Two fixes for Hyper-V PCI driver (Long Li, Sahil Chandna)
- Fix an infinite loop issue in MSHV driver (Stanislav Kinsburskii)
* tag 'hyperv-fixes-signed-20260406' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: Fix infinite fault loop on permission-denied GPA intercepts
PCI: hv: Fix double ida_free in hv_pci_probe error path
PCI: hv: Set default NUMA node to 0 for devices without affinity info
net/tls: fix use-after-free in -EBUSY error path of tls_do_encryption
The -EBUSY handling in tls_do_encryption(), introduced by commit 859054147318 ("net: tls: handle backlogging of crypto requests"), has
a use-after-free due to double cleanup of encrypt_pending and the
scatterlist entry.
When crypto_aead_encrypt() returns -EBUSY, the request is enqueued to
the cryptd backlog and the async callback tls_encrypt_done() will be
invoked upon completion. That callback unconditionally restores the
scatterlist entry (sge->offset, sge->length) and decrements
ctx->encrypt_pending. However, if tls_encrypt_async_wait() returns an
error, the synchronous error path in tls_do_encryption() performs the
same cleanup again, double-decrementing encrypt_pending and
double-restoring the scatterlist.
The double-decrement corrupts the encrypt_pending sentinel (initialized
to 1), making tls_encrypt_async_wait() permanently skip the wait for
pending async callbacks. A subsequent sendmsg can then free the
tls_rec via bpf_exec_tx_verdict() while a cryptd callback is still
pending, resulting in a use-after-free when the callback fires on the
freed record.
Fix this by skipping the synchronous cleanup when the -EBUSY async
wait returns an error, since the callback has already handled
encrypt_pending and sge restoration.
Fixes: 859054147318 ("net: tls: handle backlogging of crypto requests") Cc: stable@vger.kernel.org Signed-off-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260403013617.2838875-1-ramdhan@starlabs.sg Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: rfkill: prevent unlimited numbers of rfkill events from being created
Userspace can create an unlimited number of rfkill events if the system
is so configured, while not consuming them from the rfkill file
descriptor, causing a potential out of memory situation. Prevent this
from bounding the number of pending rfkill events at a "large" number
(i.e. 1000) to prevent abuses like this.
Cc: Johannes Berg <johannes@sipsolutions.net> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/2026033013-disfigure-scroll-e25e@gregkh Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Johan Hovold [Fri, 27 Mar 2026 11:32:19 +0000 (12:32 +0100)]
wifi: rt2x00usb: fix devres lifetime
USB drivers bind to USB interfaces and any device managed resources
should have their lifetime tied to the interface rather than parent USB
device. This avoids issues like memory leaks when drivers are unbound
without their devices being physically disconnected (e.g. on probe
deferral or configuration changes).
Fix the USB anchor lifetime so that it is released on driver unbind.
Fixes: 8b4c0009313f ("rt2x00usb: Use usb anchor to manage URB") Cc: stable@vger.kernel.org # 4.7 Cc: Vishal Thanki <vishalthanki@gmail.com> Signed-off-by: Johan Hovold <johan@kernel.org> Acked-by: Stanislaw Gruszka <stf_xl@wp.pl> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://patch.msgid.link/20260327113219.1313748-1-johan@kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pengpeng Hou [Mon, 23 Mar 2026 07:45:51 +0000 (15:45 +0800)]
wifi: brcmfmac: validate bsscfg indices in IF events
brcmf_fweh_handle_if_event() validates the firmware-provided interface
index before it touches drvr->iflist[], but it still uses the raw
bsscfgidx field as an array index without a matching range check.
Reject IF events whose bsscfg index does not fit in drvr->iflist[]
before indexing the interface array.
Zhengchuan Liang [Sun, 22 Mar 2026 18:46:08 +0000 (11:46 -0700)]
net: af_key: zero aligned sockaddr tail in PF_KEY exports
PF_KEY export paths use `pfkey_sockaddr_size()` when reserving sockaddr
payload space, so IPv6 addresses occupy 32 bytes on the wire. However,
`pfkey_sockaddr_fill()` initializes only the first 28 bytes of
`struct sockaddr_in6`, leaving the final 4 aligned bytes uninitialized.
Not every PF_KEY message is affected. The state and policy dump builders
already zero the whole message buffer before filling the sockaddr
payloads. Keep the fix to the export paths that still append aligned
sockaddr payloads with plain `skb_put()`:
struct xfrm_user_report is a __u8 proto field followed by a struct
xfrm_selector which means there is three "empty" bytes of padding, but
the padding is never zeroed before copying to userspace. Fix that up by
zeroing the structure before setting individual member variables.
Cc: stable <stable@kernel.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
struct xfrm_usersa_id has a one-byte padding hole after the proto
field, which ends up never getting set to zero before copying out to
userspace. Fix that up by zeroing out the whole structure before
setting individual variables.
Fixes: 3a2dfbe8acb1 ("xfrm: Notify changes in UDP encapsulation via netlink") Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Assisted-by: gregkh_clanker_t1000 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The root cause is a double call to xfrm_pol_hold_rcu() in
xfrm_migrate_policy_find(). The lookup function already returns
a policy with held reference, making the second call redundant.
Remove the redundant xfrm_pol_hold_rcu() call to fix the refcount
imbalance and prevent the memory leak.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 563d5ca93e88 ("xfrm: switch migrate to xfrm_policy_lookup_bytype") Signed-off-by: Kotlyarov Mihail <mihailkotlyarow@gmail.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm: hold dev ref until after transport_finish NF_HOOK
After async crypto completes, xfrm_input_resume() calls dev_put()
immediately on re-entry before the skb reaches transport_finish.
The skb->dev pointer is then used inside NF_HOOK and its okfn,
which can race with device teardown.
Remove the dev_put from the async resumption entry and instead
drop the reference after the NF_HOOK call in transport_finish,
using a saved device pointer since NF_HOOK may consume the skb.
This covers NF_DROP, NF_QUEUE and NF_STOLEN paths that skip
the okfn.
For non-transport exits (decaps, gro, drop) and secondary
async return points, release the reference inline when
async is set.
xfrm: Wait for RCU readers during policy netns exit
xfrm_policy_fini() frees the policy_bydst hash tables after flushing the
policy work items and deleting all policies, but it does not wait for
concurrent RCU readers to leave their read-side critical sections first.
The policy_bydst tables are published via rcu_assign_pointer() and are
looked up through rcu_dereference_check(), so netns teardown must also
wait for an RCU grace period before freeing the table memory.
Fix this by adding synchronize_rcu() before freeing the policy hash tables.
Arthur Husband [Mon, 6 Apr 2026 22:23:35 +0000 (15:23 -0700)]
ata: ahci: force 32-bit DMA for JMicron JMB582/JMB585
The JMicron JMB585 (and JMB582) SATA controllers advertise 64-bit DMA
support via the S64A bit in the AHCI CAP register, but their 64-bit DMA
implementation is defective. Under sustained I/O, DMA transfers targeting
addresses above 4GB silently corrupt data -- writes land at incorrect
memory addresses with no errors logged.
The failure pattern is similar to the ASMedia ASM1061
(commit 20730e9b2778 ("ahci: add 43-bit DMA address quirk for ASMedia
ASM1061 controllers")), which also falsely advertised full 64-bit DMA
support. However, the JMB585 requires a stricter 32-bit DMA mask rather
than 43-bit, as corruption occurs with any address above 4GB.
On the Minisforum N5 Pro specifically, the combination of the JMB585's
broken 64-bit DMA with the AMD Family 1Ah (Strix Point) IOMMU causes
silent data corruption that is only detectable via checksumming
filesystems (BTRFS/ZFS scrub). The corruption occurs when 32-bit IOVA
space is exhausted and the kernel transparently switches to 64-bit DMA
addresses.
Add device-specific PCI ID entries for the JMB582 (0x0582) and JMB585
(0x0585) before the generic JMicron class match, using a new board type
that combines AHCI_HFLAG_IGN_IRQ_IF_ERR (preserving existing behavior)
with AHCI_HFLAG_32BIT_ONLY to force 32-bit DMA masks.
Signed-off-by: Arthur Husband <artmoty@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org>
net: skb: fix cross-cache free of KFENCE-allocated skb head
SKB_SMALL_HEAD_CACHE_SIZE is intentionally set to a non-power-of-2
value (e.g. 704 on x86_64) to avoid collisions with generic kmalloc
bucket sizes. This ensures that skb_kfree_head() can reliably use
skb_end_offset to distinguish skb heads allocated from
skb_small_head_cache vs. generic kmalloc caches.
However, when KFENCE is enabled, kfence_ksize() returns the exact
requested allocation size instead of the slab bucket size. If a caller
(e.g. bpf_test_init) allocates skb head data via kzalloc() and the
requested size happens to equal SKB_SMALL_HEAD_CACHE_SIZE, then
slab_build_skb() -> ksize() returns that exact value. After subtracting
skb_shared_info overhead, skb_end_offset ends up matching
SKB_SMALL_HEAD_HEADROOM, causing skb_kfree_head() to incorrectly free
the object to skb_small_head_cache instead of back to the original
kmalloc cache, resulting in a slab cross-cache free:
kmem_cache_free(skbuff_small_head): Wrong slab cache. Expected
skbuff_small_head but got kmalloc-1k
Fix this by always calling kfree(head) in skb_kfree_head(). This keeps
the free path generic and avoids allocator-specific misclassification
for KFENCE objects.
Fixes: bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head") Reported-by: Antonius <antonius@bluedragonsec.com> Closes: https://lore.kernel.org/netdev/CAK8a0jxC5L5N7hq-DT2_NhUyjBxrPocoiDazzsBk4TGgT1r4-A@mail.gmail.com/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260403014517.142550-1-jiayuan.chen@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When send() or recv() returns -1 with errno == EINTR, the code skips
the break but still adds the return value to nwritten/nread, making it
decrease by 1. This leads to wrong buffer offsets and wrong bytes count.
Fix it by explicitly continuing the loop on EINTR, so the return value
is only added when it is positive.
====================
xsk: tailroom reservation and MTU validation
here we fix a long-standing issue regarding multi-buffer scenario in ZC
mode - we have not been providing space at the end of the buffer where
multi-buffer XDP works on skb_shared_info. This has been brought to our
attention via [0].
Unaligned mode does not get any specific treatment, it is user's
responsibility to properly handle XSK addresses in queues.
With adjustments included here in this set against xskxceiver I have
been able to pass the full test suite on ice.
selftests: bpf: adjust rx_dropped xskxceiver's test to respect tailroom
Since we have changed how big user defined headroom in umem can be,
change the logic in testapp_stats_rx_dropped() so we pass updated
headroom validation in xdp_umem_reg() and still drop half of frames.
Test works on non-mbuf setup so __xsk_pool_get_rx_frame_size() that is
called on xsk_rcv_check() will not account skb_shared_info size. Taking
the tailroom size into account in test being fixed is needed as
xdp_umem_reg() defaults to respect it.
selftests: bpf: have a separate variable for drop test
Currently two different XDP programs share a static variable for
different purposes (picking where to redirect on shared umem test &
whether to drop a packet). This can be a problem when running full test
suite - idx can be written by shared umem test and this value can cause
a false behavior within XDP drop half test.
Introduce a dedicated variable for drop half test so that these two
don't step on each other toes. There is no real need for using
__sync_fetch_and_add here as XSK tests are executed on single CPU.
Skip tail adjust tests in xskxceiver for SKB mode as it is not very
friendly for it. multi-buffer case does not work as xdp_rxq_info that is
registered for generic XDP does not report ::frag_size. The non-mbuf
path copies packet via skb_pp_cow_data() which only accounts for
headroom, leaving us with no tailroom and causing underlying XDP prog to
drop packets therefore.
For multi-buffer test on other modes, change the amount of bytes we use
for growth, assume worst-case scenario and take care of headroom and
tailroom.
selftests: bpf: introduce a common routine for reading procfs
Parametrize current way of getting MAX_SKB_FRAGS value from {sys,proc}fs
so that it can be re-used to get cache line size of system's CPU. All
that just to mimic and compute size of kernel's struct skb_shared_info
which for xsk and test suite interpret as tailroom.
Introduce two variables to ifobject struct that will carry count of skb
frags and tailroom size. Do the reading and computing once, at the
beginning of test suite execution in xskxceiver, but for test_progs such
way is not possible as in this environment each test setups and torns
down ifobject structs.
xsk: validate MTU against usable frame size on bind
AF_XDP bind currently accepts zero-copy pool configurations without
verifying that the device MTU fits into the usable frame space provided
by the UMEM chunk.
This becomes a problem since we started to respect tailroom which is
subtracted from chunk_size (among with headroom). 2k chunk size might
not provide enough space for standard 1500 MTU, so let us catch such
settings at bind time. Furthermore, validate whether underlying HW will
be able to satisfy configured MTU wrt XSK's frame size multiplied by
supported Rx buffer chain length (that is exposed via
net_device::xdp_zc_max_segs).
Currently xp_assign_dev_shared() is missing XDP_USE_SG being propagated
to flags so set it in order to preserve mtu check that is supposed to be
done only when no multi-buffer setup is in picture.
Also, this flag has the same value as XDP_UMEM_TX_SW_CSUM so we could
get unexpected SG setups for software Tx checksums. Since csum flag is
UAPI, modify value of XDP_UMEM_SG_FLAG.
Fixes: d609f3d228a8 ("xsk: add multi-buffer support for sockets sharing umem") Reviewed-by: Björn Töpel <bjorn@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-4-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multi-buffer XDP stores information about frags in skb_shared_info that
sits at the tailroom of a packet. The storage space is reserved via
xdp_data_hard_end():
Currently we do not respect this tailroom space in multi-buffer AF_XDP
ZC scenario. To address this, introduce xsk_pool_get_tailroom() and use
it within xsk_pool_get_rx_frame_size() which is used in ZC drivers to
configure length of HW Rx buffer.
Typically drivers on Rx Hw buffers side work on 128 byte alignment so
let us align the value returned by xsk_pool_get_rx_frame_size() in order
to avoid addressing this on driver's side. This addresses the fact that
idpf uses mentioned function *before* pool->dev being set so we were at
risk that after subtracting tailroom we would not provide 128-byte
aligned value to HW.
Since xsk_pool_get_rx_frame_size() is actively used in xsk_rcv_check()
and __xsk_rcv(), add a variant of this routine that will not include 128
byte alignment and therefore old behavior is preserved.
Reviewed-by: Björn Töpel <bjorn@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-3-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
xsk: tighten UMEM headroom validation to account for tailroom and min frame
The current headroom validation in xdp_umem_reg() could leave us with
insufficient space dedicated to even receive minimum-sized ethernet
frame. Furthermore if multi-buffer would come to play then
skb_shared_info stored at the end of XSK frame would be corrupted.
HW typically works with 128-aligned sizes so let us provide this value
as bare minimum.
Multi-buffer setting is known later in the configuration process so
besides accounting for 128 bytes, let us also take care of tailroom space
upfront.
Reviewed-by: Björn Töpel <bjorn@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Fixes: 99e3a236dd43 ("xsk: Add missing check on user supplied headroom size") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20260402154958.562179-2-maciej.fijalkowski@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Agalakov Daniil [Wed, 18 Mar 2026 12:05:05 +0000 (15:05 +0300)]
e1000: check return value of e1000_read_eeprom
[Why]
e1000_set_eeprom() performs a read-modify-write operation when the write
range is not word-aligned. This requires reading the first and last words
of the range from the EEPROM to preserve the unmodified bytes.
However, the code does not check the return value of e1000_read_eeprom().
If the read fails, the operation continues using uninitialized data from
eeprom_buff. This results in corrupted data being written back to the
EEPROM for the boundary words.
Add the missing error checks and abort the operation if reading fails.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Alex Dvoretsky [Thu, 12 Mar 2026 13:52:55 +0000 (14:52 +0100)]
igb: remove napi_synchronize() in igb_down()
When an AF_XDP zero-copy application terminates abruptly (e.g., kill -9),
the XSK buffer pool is destroyed but NAPI polling continues.
igb_clean_rx_irq_zc() repeatedly returns the full budget, preventing
napi_complete_done() from clearing NAPI_STATE_SCHED.
igb_down() calls napi_synchronize() before napi_disable() for each queue
vector. napi_synchronize() spins waiting for NAPI_STATE_SCHED to clear,
which never happens. igb_down() blocks indefinitely, the TX watchdog
fires, and the TX queue remains permanently stalled.
napi_disable() already handles this correctly: it sets NAPI_STATE_DISABLE.
After a full-budget poll, __napi_poll() checks napi_disable_pending(). If
set, it forces completion and clears NAPI_STATE_SCHED, breaking the loop
that napi_synchronize() cannot.
napi_synchronize() was added in commit 41f149a285da ("igb: Fix possible
panic caused by Rx traffic arrival while interface is down").
napi_disable() provides stronger guarantees: it prevents further
scheduling and waits for any active poll to exit.
Other Intel drivers (ixgbe, ice, i40e) use napi_disable() without a
preceding napi_synchronize() in their down paths.
Remove redundant napi_synchronize() call and reorder napi_disable()
before igb_set_queue_napi() so the queue-to-NAPI mapping is only
cleared after polling has fully stopped.
Fixes: 2c6196013f84 ("igb: Add AF_XDP zero-copy Rx support") Cc: stable@vger.kernel.org Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Alex Dvoretsky <advoretsky@gmail.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: Patryk Holda <patryk.holda@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Schmidt [Fri, 13 Mar 2026 08:22:29 +0000 (09:22 +0100)]
ixgbevf: add missing negotiate_features op to Hyper-V ops table
Commit a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by
negotiating supported features") added the .negotiate_features callback
to ixgbe_mac_operations and populated it in ixgbevf_mac_ops, but forgot
to add it to ixgbevf_hv_mac_ops. This leaves the function pointer NULL
on Hyper-V VMs.
During probe, ixgbevf_negotiate_api() calls ixgbevf_set_features(),
which unconditionally dereferences hw->mac.ops.negotiate_features().
On Hyper-V this results in a NULL pointer dereference:
Add ixgbevf_hv_negotiate_features_vf() that returns -EOPNOTSUPP and
wire it into ixgbevf_hv_mac_ops. The caller already handles -EOPNOTSUPP
gracefully.
Fixes: a7075f501bd3 ("ixgbevf: fix mailbox API compatibility by negotiating supported features") Reported-by: Xiaoqiang Xiong <xxiong@redhat.com> Closes: https://issues.redhat.com/browse/RHEL-155455 Assisted-by: Claude:claude-4.6-opus-high Cursor Tested-by: Xiaoqiang Xiong <xxiong@redhat.com> Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ixgbe: stop re-reading flash on every get_drvinfo for e610
ixgbe_get_drvinfo() calls ixgbe_refresh_fw_version() on every ethtool
query for e610 adapters. That ends up in ixgbe_discover_flash_size(),
which bisects the full 16 MB NVM space issuing one ACI command per
step (~20 ms each, ~24 steps total = ~500 ms).
Profiling on an idle E610-XAT2 system with telegraf scraping ethtool
stats every 10 seconds:
kretprobe:ixgbe_get_drvinfo took 527603 us
kretprobe:ixgbe_get_drvinfo took 523978 us
kretprobe:ixgbe_get_drvinfo took 552975 us
kretprobe:ice_get_drvinfo took 3 us
kretprobe:igb_get_drvinfo took 2 us
kretprobe:i40e_get_drvinfo took 5 us
The half-second stall happens under the RTNL lock, causing visible
latency on ip-link and friends.
The FW version can only change after an EMPR reset. All flash data is
already populated at probe time and the cached adapter->eeprom_id is
what get_drvinfo should be returning. The only place that needs to
trigger a re-read is ixgbe_devlink_reload_empr_finish(), right after
the EMPR completes and new firmware is running. Additionally, refresh
the FW version in ixgbe_reinit_locked() so that any PF that undergoes a
reinit after an EMPR (e.g. triggered by another PF's devlink reload)
also picks up the new version in adapter->eeprom_id.
ixgbe_devlink_info_get() keeps its refresh call for explicit
"devlink dev info" queries, which is fine given those are user-initiated.
Fixes: c9e563cae19e ("ixgbe: add support for devlink reload") Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Petr Oros [Fri, 27 Mar 2026 07:46:58 +0000 (08:46 +0100)]
ice: fix PTP timestamping broken by SyncE code on E825C
The E825C SyncE support added in commit ad1df4f2d591 ("ice: dpll:
Support E825-C SyncE and dynamic pin discovery") introduced a SyncE
reconfiguration block in ice_ptp_link_change() that prevents
ice_ptp_port_phy_restart() from being called in several error paths.
Without the PHY restart, PTP timestamps stop working after any link
change event.
There are three ways the PHY restart gets blocked:
1. When DPLL initialization fails (e.g. missing ACPI firmware node
properties), ICE_FLAG_DPLL is not set and the function returns early
before reaching the PHY restart.
2. When ice_tspll_bypass_mux_active_e825c() fails to read the CGU
register, WARN_ON_ONCE fires and the function returns early.
3. When ice_tspll_cfg_synce_ethdiv_e825c() fails to configure the
clock divider for an active pin, same early return.
SyncE and PTP are independent features. SyncE reconfiguration failures
must not prevent the PTP PHY restart that is essential for timestamp
recovery after link changes.
Fix by making the entire SyncE block conditional on ICE_FLAG_DPLL
without an early return, and replacing the WARN_ON_ONCE + return error
handling inside the loop with dev_err_once + break. The function always
proceeds to ice_ptp_port_phy_restart() regardless of SyncE errors.
Fixes: ad1df4f2d591 ("ice: dpll: Support E825-C SyncE and dynamic pin discovery") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kohei Enju [Sun, 1 Feb 2026 14:14:00 +0000 (14:14 +0000)]
ice: ptp: don't WARN when controlling PF is unavailable
In VFIO passthrough setups, it is possible to pass through only a PF
which doesn't own the source timer. In that case the PTP controlling PF
(adapter->ctrl_pf) is never initialized in the VM, so ice_get_ctrl_ptp()
returns NULL and triggers WARN_ON() in ice_ptp_setup_pf().
Since this is an expected behavior in that configuration, replace
WARN_ON() with an informational message and return -EOPNOTSUPP.
Fixes: e800654e85b5 ("ice: Use ice_adapter for PTP shared data instead of auxdev") Signed-off-by: Kohei Enju <kohei@enjuk.jp> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:35 +0000 (14:13 -0700)]
idpf: set the payload size before calling the async handler
Set the payload size before forwarding the reply to the async handler.
Without this, xn->reply_sz will be 0 and idpf_mac_filter_async_handler()
will never get past the size check.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Li Li <boolli@google.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:34 +0000 (14:13 -0700)]
idpf: improve locking around idpf_vc_xn_push_free()
Protect the set_bit() operation for the free_xn bitmask in
idpf_vc_xn_push_free(), to make the locking consistent with rest of the
code and avoid potential races in that logic.
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Reported-by: Ray Zhang <sgzhang@google.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Emil Tantilov [Thu, 19 Mar 2026 21:13:33 +0000 (14:13 -0700)]
idpf: fix PREEMPT_RT raw/bh spinlock nesting for async VC handling
Switch from using the completion's raw spinlock to a local lock in the
idpf_vc_xn struct. The conversion is safe because complete/_all() are
called outside the lock and there is no reason to share the completion
lock in the current logic. This avoids invalid wait context reported by
the kernel due to the async handler taking BH spinlock:
Fixes: 34c21fa894a1 ("idpf: implement virtchnl transaction manager") Cc: stable@vger.kernel.org Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: Ray Zhang <sgzhang@google.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Joseph Qi [Fri, 3 Apr 2026 06:38:30 +0000 (14:38 +0800)]
ocfs2: fix out-of-bounds write in ocfs2_write_end_inline
KASAN reports a use-after-free write of 4086 bytes in
ocfs2_write_end_inline, called from ocfs2_write_end_nolock during a
copy_file_range splice fallback on a corrupted ocfs2 filesystem mounted on
a loop device. The actual bug is an out-of-bounds write past the inode
block buffer, not a true use-after-free. The write overflows into an
adjacent freed page, which KASAN reports as UAF.
The root cause is that ocfs2_try_to_write_inline_data trusts the on-disk
id_count field to determine whether a write fits in inline data. On a
corrupted filesystem, id_count can exceed the physical maximum inline data
capacity, causing writes to overflow the inode block buffer.