Breno Leitao [Wed, 20 May 2026 16:53:46 +0000 (09:53 -0700)]
af_iucv: convert to getsockopt_iter
Convert IUCV socket's getsockopt implementation to use the new
getsockopt_iter callback with sockopt_t.
Key changes:
- Replace (char __user *optval, int __user *optlen) with sockopt_t *opt
- Use opt->optlen for buffer length (input) and returned size (output)
- Use copy_to_iter() instead of put_user()/copy_to_user()
net: mana: Expose hardware diagnostic info via debugfs
Add debugfs entries to expose hardware configuration and diagnostic
information that aids in debugging driver initialization and runtime
operations without adding noise to dmesg.
The debugfs directory for each PCI device is named using pci_name()
(the unique BDF address), and its creation and removal is integrated
into mana_gd_setup() and mana_gd_cleanup_device() respectively, so
that all callers (probe, remove, suspend, resume, shutdown) share a
single code path.
Device-level entries (under /sys/kernel/debug/mana/<BDF>/):
- num_msix_usable, max_num_queues: Max resources from hardware
- gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
- num_vports, bm_hostmode: Device configuration
Commit 150061a20651 ("net/sched: fq_codel: local packets no longer count against memory limit")
made fq_codel not account for local packets in the
memory limit. Since tests a4bb, a4be, a4bf, a4c0, a4c1 were relying on
these packets being accounted so that parent's qlen notify callback was
executed, they broke.
Fix the tests by adding the qdiscs to ifb instead and making it see
mirred packets that came from scapy. That way the packets are accounted
in the memory limit and the parent's qlen notify callback is still
executed.
Jakub Kicinski [Thu, 21 May 2026 23:00:06 +0000 (16:00 -0700)]
Merge tag 'wireless-next-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Not much going on here right now:
- mac80211/hwsim:
- some NAN related things
- MCS/NSS rate issues with S1G
- p54: port SPI version to device-tree
- (a few other random things)
* tag 'wireless-next-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next:
ARM: dts: omap2: add stlc4560 spi-wireless node
p54spi: convert to devicetree
dt-bindings: net: add st,stlc4560/p54spi binding
wifi: mac80211: allow cipher change on NAN_DATA interfaces
wifi: mac80211_hwsim: Do not declare NAN support for Extended Key ID
wifi: cfg80211: add a function to parse UHR DBE
wifi: mac80211: don't call ieee80211_handle_reconfig_failure when not needed
wifi: mac80211: Allow per station GTK for NAN Data interfaces
wifi: mac80211_hwsim: advertise NPCA capability
wifi: mac80211_hwsim: reject NAN on multi-radio wiphys
wifi: plfxlc: use module_usb_driver() macro
wifi: mac80211: don't recalc min def for S1G chan ctx
wifi: mac80211: skip NSS and BW init for S1G sta
wifi: mac80211: check stations are removed before MLD change
wifi: rt2x00: allocate anchor with rt2x00dev
====================
Linus Torvalds [Thu, 21 May 2026 21:39:12 +0000 (14:39 -0700)]
Merge tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth, wireless and netfilter.
Craziness continues with no end in sight. Even discounting the driver
revert this is a pretty huge PR for standards of the previous era. I'd
speculate - we haven't seen the worst of it, yet. Good news, I guess,
is that so far we haven't seen many (any?) cases of "AI reported a
bug, we fixed it and a real user regressed".
Current release - fix to a fix:
- Bluetooth: btmtk: accept too short WMT FUNC_CTRL events
- vsock/virtio: relax the recently added memory limit a little
Current release - regressions:
- IB/IPoIB: make sure IB drivers always use async set_rx_mode since
some (mlx5) are now required to use it due to locking changes
Previous releases - regressions:
- udp: fix UDP length on last GSO_PARTIAL segment
- af_unix: fix UAF read of tail->len in unix_stream_data_wait()
- tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction
- mlx5e: fix unlocked writing to ICOSQ, breaking AF_XDP
Previous releases - always broken:
- tap: fix stack info leak in tap_ioctl() SIOCGIFHWADDR
- ipv4: raw: reject IP_HDRINCL packets with ihl < 5
- Bluetooth: a lot of locking and concurrency fixes (as always)
- batman-adv (mesh wireless networking): a lot of random fixes for
issues reported by security researchers and Sashiko
- netfilter: same thing, a lot of small security-ish fixes all over
the place, nothing really stands out
Misc:
- bring back the old 3c509 driver, Maciej wants to maintain it"
* tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (187 commits)
net: enetc: avoid VF->PF mailbox timeout during SR-IOV teardown
net: enetc: fix init and teardown order to prevent use of unsafe resources
net: enetc: fix unbounded loop and interrupt handling in VF-to-PF messaging
net: enetc: fix DMA write to freed memory in enetc_msg_free_mbx()
net: enetc: fix race condition in VF MAC address configuration
net: enetc: fix TOCTOU race and validate VF MAC address
net: enetc: add ratelimiting to VF mailbox error messages
net: enetc: fix missing error code when pf->vf_state allocation fails
net: enetc: fix incorrect mailbox message status returned to VFs
net: bridge: prevent too big nested attributes in br_fill_linkxstats()
l2tp: use list_del_rcu in l2tp_session_unhash
net: bcmgenet: keep RBUF EEE/PM disabled
ethernet: 3c509: Fix most coding style issues
ethernet: 3c509: Update documentation to match MAINTAINERS
ethernet: 3c509: Add GPL 2.0 SPDX license identifier
ethernet: 3c509: Fix AUI transceiver type selection
Revert "drivers: net: 3com: 3c509: Remove this driver"
tools: ynl: support listening on all nsids
net: gro: don't merge zcopy skbs
pds_core: ensure null-termination for firmware version strings
...
Linus Torvalds [Thu, 21 May 2026 21:17:28 +0000 (14:17 -0700)]
Merge tag 'ceph-for-7.1-rc5' of https://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A fix for an 'rbd unmap' race condition which popped up on a
production setup where many RBD devices are frequently mapped and
unmapped, marked for stable"
* tag 'ceph-for-7.1-rc5' of https://github.com/ceph/ceph-client:
rbd: eliminate a race in lock_dwork draining on unmap
Linus Torvalds [Thu, 21 May 2026 21:05:09 +0000 (14:05 -0700)]
Merge tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fixes from Steven Rostedt:
- Fix reporting MISSED EVENTS in trace iterator
When the "trace" file is read with tracing enabled, if the writer
were to pass the iterator reader, it resets, sets a "missed_events"
flag and continues. The tracing output checks for missed events and
if there are some, it prints out "[LOST EVENTS]" to let the user know
events were dropped.
But the clearing of the missed_events happened when the tracing
system queried the ring buffer iterator about missed events. This was
premature as the ring buffer is per CPU, and the tracing code reads
all the CPU buffers and checks for missed events when it is read. If
the CPU iterator that had missed events isn't printed next, the
output for the LOST EVENTS is lost.
Clear the missed_events flag when the iterator moves to the next
event and not when the missed_events flag is queried. Also clear it
on reset.
- Flush and stop the persistent ring buffer on panic
On panic the persistent ring buffer is used to debug what caused the
panic. But on some architectures, it requires flushing the memory
from cache, otherwise, the ring buffer persistent memory may not have
the last events and this could also cause the ring buffer to be
corrupted on the next boot.
- Fix nr_subbufs initialization in simple_ring_buffer_init_mm
The remote simple ring buffer meta data nr_subbufs is initialized too
early and gets cleared later on, making it zero and not reflect the
actual number of sub-buffers.
- Fix unload_page for simple_ring_buffer init rollback
On error, the pages loaded need to be unloaded. To unload a page it
is expected that: page = load_page(va); -> unload_page(page). But the
code was doing: unload_page(va) and not unload_page(page).
- Create output file from cmd_check_undefined
The check for undefined symbols checks if the file *.o.checked exists
and if so it skips doing the work. But the *.o.checked file never was
created making every build do the work even when it was already done
previously.
* tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Create output file from cmd_check_undefined
tracing: Fix unload_page for simple_ring_buffer init rollback
tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Fix reporting of missed events in iterator
Jakub Kicinski [Thu, 21 May 2026 18:03:58 +0000 (11:03 -0700)]
Merge tag 'wireless-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Quite a few more updates:
- cfg80211/mac80211:
- various security(-ish) fixes
- fix A-MSDU subframe handling
- fix multi-link element parsing
- ath10: avoid sending commands to dead device
- ath11k:
- fix WMI buffer leaks on error conditions
- fix UAF in RX MSDU coalesce path
- allow peer ID 0 on RX path (legal for mobile devices)
- reinitialize shared SRNG pointers on restart
- ath12k:
- fix 20 MHz-only parsing of EHT-MCS map
- iwlwifi:
- fix TSO segmentation explosion
- don't TX to dead device
- fix warning in WoWLAN
- fix TX rates on old devices
- disconnect on beacon loss only if also no other traffic
- fill NULL-ptr deref
- fix STEP_URM hardware access
* tag 'wireless-2026-05-21' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (24 commits)
wifi: cfg80211: wext: validate chandef in monitor mode
wifi: mac80211: consume only present negotiated TTLM maps
wifi: wilc1000: fix dma_buffer leak on bus acquire failure
wifi: mac80211: capture fast-RX rate before mesh reuses skb->cb
wifi: mac80211: fix multi-link element inheritance
wifi: mac80211: fix MLE defragmentation
wifi: mac80211: don't override max_amsdu_subframes
wifi: mac80211: bounds-check link_id in ieee80211_ml_epcs
wifi: ath12k: fix EHT TX MCS limitation due to wrong 20 MHz-only parsing
wifi: ath11k: clear shared SRNG pointer state on restart
wifi: ath11k: fix use after free in ath11k_dp_rx_msdu_coalesce()
wifi: ath11k: fix peer resolution on rx path when peer_id=0
wifi: iwlwifi: mld: disconnect only after 6 beacons without Rx
wifi: iwlwifi: mld: don't WARN on WoWLAN suspend w/o BSS vif
wifi: iwlwifi: use correct function to read STEP_URM register
wifi: iwlwifi: mvm: fix driver-set TX rates on old devices
wifi: iwlwifi: mld: don't dereference a pointer before NULL checking it
wifi: iwlwifi: mld: stop TX during firmware restart
wifi: iwlwifi: mld: fix TSO segmentation explosion when AMSDU is disabled
wifi: ath10k: skip WMI and beacon transmission when device is wedged
...
====================
====================
net: enetc: SR-IOV robustness and security fixes
This patch series addresses a number of robustness, security, and
correctness issues in the ENETC driver's SR-IOV subsystem, focusing
primarily on the VF-to-PF mailbox communication path.
The series can be grouped into the following categories:
1. DoS and security fixes:
- Prevent an unbounded loop DoS in the VF-to-PF message handler,
which could be triggered by a malicious or misbehaving VF.
- Fix a TOCTOU (Time-of-Check-Time-of-Use) race and add proper
validation of VF MAC addresses to prevent spoofing or invalid
configuration from being applied.
2. Race condition fixes:
- Fix a race condition in VF MAC address configuration that could
lead to inconsistent state between the VF request and PF
application.
- Fix a race condition during SR-IOV teardown that could cause
VF->PF mailbox operations to time out, resulting in unnecessary
errors during shutdown.
3. Memory safety fixes:
- Fix a DMA write to freed memory in enetc_msg_free_mbx(), which
could cause silent memory corruption or system instability.
4. Error handling and initialization fixes:
- Fix missing error code propagation when pf->vf_state allocation
fails, ensuring callers receive a proper errno instead of
succeeding silently.
- Fix incorrect mailbox message status values returned to VFs,
which could cause VFs to misinterpret PF responses.
- Fix initialization order to prevent the use of uninitialized
resources during driver probe, which could cause undefined
behavior on certain configurations.
5. Diagnostics improvement:
- Add rate limiting to VF mailbox error messages to prevent log
flooding in the presence of a misbehaving VF.
These fixes improve the overall stability and security of the ENETC
SR-IOV implementation, particularly in multi-tenant environments where
VFs may be assigned to untrusted guests.
====================
Wei Fang [Wed, 20 May 2026 06:44:21 +0000 (14:44 +0800)]
net: enetc: avoid VF->PF mailbox timeout during SR-IOV teardown
During SR-IOV teardown, enetc_msg_psi_free() disables the MR interrupt
before pci_disable_sriov() removes the VFs. If a VF sends a mailbox
message during this window, the PF cannot receive it, causing the VF to
timeout waiting for a reply.
Since the timeout occurs during SR-IOV teardown when the VF is about to
be removed anyway, it has no functional impact on operation. However,
more messages will be added in the future, some visible error logs may
confuse users. So fix it by calling pci_disable_sriov() first to remove
all VFs, then safely clean up the mailbox resources. This eliminates the
race window where VFs could send messages to an unresponsive PF.
This ordering is unsafe because if a spurious interrupt or pending
interrupt from a previous device state fires immediately after
request_irq() returns, the registered ISR enetc_msg_psi_msix() will
execute and unconditionally call:
schedule_work(&pf->msg_task)
At this point, pf->msg_task has not been initialized by INIT_WORK(), so
the work_struct contains garbage values in its internal linked list
pointers (work_struct->entry). Passing an uninitialized work_struct to
schedule_work() could corrupt the kernel's workqueue linked lists,
potentially leading to:
- Kernel panic in __queue_work()
- Memory corruption in workqueue data structures
- System deadlock or undefined behavior
Additionally, even if the work_struct was initialized, the mailbox DMA
buffers (pf->rxmsg[]) may not yet be allocated when the work handler
enetc_msg_task() runs, resulting in NULL pointer dereference.
Fix by reordering the initialization sequence to ensure all resources are
properly initialized before the interrupt handler can execute:
1. enetc_msg_alloc_mbx() <- Allocate all mailboxes
2. INIT_WORK(&pf->msg_task, ...) <- Initialize work first
3. request_irq(enetc_msg_psi_msix) <- Register IRQ last
4. Configure hardware & enable MR interrupts
This guarantees that when enetc_msg_psi_msix() runs:
- pf->msg_task is properly initialized (safe for schedule_work)
- pf->rxmsg[] buffers are allocated (safe for work handler access)
- Hardware is configured appropriately
As the inverse of enetc_msg_psi_init(), enetc_msg_psi_free() also has
similar problems. For example, if a pending interrupt fires between
enetc_msg_free_mbx() and free_irq(), the ISR enetc_msg_psi_msix() may
schedule the work handler again via schedule_work(), which could then
access already-freed DMA buffers (pf->rxmsg[]), leading to use-after-free
and potential memory corruption.
Therefore, the order of enetc_msg_psi_free() is adjusted:
1. enetc_msg_disable_mr_int() <- Stop new interrupts first
2. free_irq() <- Ensure no IRQ handler can run
3. cancel_work_sync() <- Wait for any pending work
4. enetc_msg_disable_mr_int() <- Re-disable in case work
re-enabled it
5. enetc_msg_free_mbx() <- Safe to free DMA buffers now
Wei Fang [Wed, 20 May 2026 06:44:19 +0000 (14:44 +0800)]
net: enetc: fix unbounded loop and interrupt handling in VF-to-PF messaging
The enetc_msg_task() function has several issues that need to be addressed:
1. Unbounded loop causing potential DoS:
enetc_msg_task() processes VF-to-PF mailbox messages in an unbounded
for(;;) loop that keeps polling ENETC_PSIMSGRR until no MR bits are set.
A malicious guest VM can exploit this by continuously sending messages at
a high rate - immediately sending a new message as soon as the PF
acknowledges the previous one. Since the worker thread never yields or
enforces a processing budget, the mr_mask check frequently evaluates to
non-zero, causing the PF to spin indefinitely and starving other tasks.
Fix this by replacing the unbounded loop with a single snapshot read at
task entry. The task processes only the VFs whose MR bits were set at
that point, then re-enables message interrupts before returning. This
bounds work per invocation to at most num_vfs iterations. No messages are
lost because the message interrupt is disabled in enetc_msg_psi_msix()
before scheduling enetc_msg_task(), so any new messages arriving during
processing will trigger a fresh interrupt once re-enabled, scheduling
another task invocation.
2. Write order of ENETC_PSIIDR and ENETC_PSIMSGRR:
Both ENETC_PSIIDR and ENETC_PSIMSGRR contain MR bits indicating messages
have been received from VSIs, but only ENETC_PSIIDR trigger the CPU
interrupt. Previously, ENETC_PSIMSGRR was written before ENETC_PSIIDR.
Writing ENETC_PSIMSGRR returns the message code to the VSI in its upper
16 bits, signaling to the VF that message processing is complete and it
may send the next message. If the VF sends a new message before
ENETC_PSIIDR is written, the subsequent w1c write to ENETC_PSIIDR would
inadvertently clear the MR bit set by the new message, causing the
interrupt to be lost and the new message to go unprocessed.
Therefore, write ENETC_PSIIDR first to clear the interrupt source, then
write ENETC_PSIMSGRR to acknowledge the message to the VSI.
3. Check both ENETC_PSIMSGRR and ENETC_PSIIDR for mr_status:
The write order change above introduces a potential race: if a VF sends
a new message in the window between the ENETC_PSIIDR w1c and the
ENETC_PSIMSGRR w1c, the ENETC_PSIMSGRR MR bit for the new message may
not be set. If mr_status was derived solely from ENETC_PSIMSGRR, this
message would never be detected despite ENETC_PSIIDR retaining its MR
bit, leading to an unacknowledged interrupt storm.
Fix this by computing mr_status as the union of both ENETC_PSIMSGRR and
ENETC_PSIIDR MR bits, ensuring all pending messages are detected
regardless of which register reflects the new message state.
Additionally, rename the per-register MR macros (ENETC_PSI*_MR_MASK,
ENETC_PSI*_MR) to register-agnostic names (ENETC_PSIMR_MASK,
ENETC_PSIMR_BIT) since the MR bit layout is shared across ENETC_PSIMSGRR,
ENETC_PSIIER, and ENETC_PSIIDR. Make the mask macro dynamic based on
the actual number of active VFs rather than hardcoded.
Wei Fang [Wed, 20 May 2026 06:44:18 +0000 (14:44 +0800)]
net: enetc: fix DMA write to freed memory in enetc_msg_free_mbx()
The teardown sequence in enetc_msg_psi_free() frees the DMA buffer before
clearing the device's DMA address registers. If a VF sends a message or a
pending DMA transfer completes within this window, the hardware will
perform a DMA write into the kernel memory that has already been returned
to the allocator.
The result is silent memory corruption that can affect arbitrary kernel
data structures. Therefore, clear the DMA address registers before the
DMA buffer is freed.
Wei Fang [Wed, 20 May 2026 06:44:17 +0000 (14:44 +0800)]
net: enetc: fix race condition in VF MAC address configuration
Sashiko reported a potential race condition between the VF message
handler and administrative VF MAC configuration from the host [1].
The VF message handler (enetc_msg_pf_set_vf_primary_mac_addr) runs
asynchronously in a workqueue context and accesses vf_state->flags
without any locking. Concurrently, the host can administratively
change the VF MAC address via enetc_pf_set_vf_mac(), which executes
under RTNL lock and modifies both vf_state->flags and hardware
registers.
This creates two race windows:
1) TOCTOU race on vf_state->flags: The check of ENETC_VF_FLAG_PF_SET_MAC
and subsequent MAC programming are not atomic, allowing the flag state
to change between check and use.
2) Torn MAC address writes: Hardware MAC programming requires multiple
non-atomic register writes (__raw_writel for lower 32 bits and
__raw_writew for upper 16 bits). Concurrent updates from VF mailbox
and PF admin paths can interleave these operations, resulting in a
corrupted MAC address being programmed into the hardware.
Fix by introducing a per-VF mutex to serialize access to vf_state and
hardware MAC register updates. Both enetc_pf_set_vf_mac() and
enetc_msg_pf_set_vf_primary_mac_addr() now acquire this lock before
accessing vf_state->flags or programming the MAC address, ensuring
atomic read-modify-write sequences and preventing register write
interleaving.
Wei Fang [Wed, 20 May 2026 06:44:16 +0000 (14:44 +0800)]
net: enetc: fix TOCTOU race and validate VF MAC address
Sashiko reported that the PF driver accepts arbitrary MAC address from
from VF mailbox messages without proper validation, creating a security
vulnerability [1].
In enetc_msg_pf_set_vf_primary_mac_addr(), the MAC address is extracted
directly from the message buffer (cmd->mac.sa_data) and programmed into
hardware via pf->ops->set_si_primary_mac() without any validity checks.
A malicious VF can configure a multicast, broadcast, or all-zero MAC
address. Therefore, a validation to check the MAC address provided by VF
is required.
However, simply checking the MAC address is not enough, because it also
has the potential TOCTOU race [2]: The code reads the MAC address from
the DMA buffer to validate it via is_valid_ether_addr(), if validation
passes, reads the same DMA buffer a second time when calling
enetc_pf_set_primary_mac_addr() to program the hardware. A malicious VF
can exploit this window by overwriting the MAC address in the DMA buffer
between the validation check and the hardware programming, bypassing the
validation entirely.
Therefore, allocate a local buffer in enetc_msg_handle_rxmsg() and copy
the message content from the DMA buffer via memcpy() before processing.
This ensures the PF operates on a stable snapshot that the VF cannot
modify.
Wei Fang [Wed, 20 May 2026 06:44:15 +0000 (14:44 +0800)]
net: enetc: add ratelimiting to VF mailbox error messages
Sashiko reported that a buggy or malicious guest VM can flood the host
kernel log by repeatedly sending VF-to-PF messages at a high rate,
degrading host performance and hiding important system logs [1].
Fix by replacing dev_err()/dev_warn() with dev_err_ratelimited(),
limiting output to the default kernel ratelimit. This ensures errors are
still logged for debugging while preventing log flooding attacks.
Wei Fang [Wed, 20 May 2026 06:44:14 +0000 (14:44 +0800)]
net: enetc: fix missing error code when pf->vf_state allocation fails
In enetc_pf_probe(), when the memory allocation for pf->vf_state fails,
the code jumps to the error handling label but the variable 'err' is not
assigned an appropriate error code beforehand. This causes the function
to return 0 (success) on an allocation failure path, misleading the
caller into thinking the probe succeeded. So set err to -ENOMEM before
jumping to the error handling label when the allocation for pf->vf_state
returns NULL.
Wei Fang [Wed, 20 May 2026 06:44:13 +0000 (14:44 +0800)]
net: enetc: fix incorrect mailbox message status returned to VFs
There are two cases where VFs receive an incorrect success status from
the PF mailbox message handler, misleading them into believing their
requests have been fulfilled:
In enetc_msg_handle_rxmsg(), *status is pre-initialized to
ENETC_MSG_CMD_STATUS_OK. When an unsupported command type is received,
the default case only logs an error without updating *status, so it
remains as ENETC_MSG_CMD_STATUS_OK.
In enetc_msg_pf_set_vf_primary_mac_addr(), when the PF has already
assigned a MAC address for the VF (ENETC_VF_FLAG_PF_SET_MAC is set),
the function rejects the request but returns ENETC_MSG_CMD_STATUS_OK
instead of ENETC_MSG_CMD_STATUS_FAIL.
Therefore, correct the status value for the two cases mentioned above.
Eric Dumazet [Wed, 20 May 2026 11:42:07 +0000 (11:42 +0000)]
net: bridge: prevent too big nested attributes in br_fill_linkxstats()
After commit ff205bf8c554 ("netlink: add one debug check in nla_nest_end()")
syzbot found that br_fill_linkxstats() can send corrupted netlink packets.
Make sure the nested attribute size is bounded.
Fixes: a60c090361ea ("bridge: netlink: export per-vlan stats") Reported-by: syzbot+a35f9259d08f907c06e6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a0b0da3.050a0220.175f0c.0000.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260520114207.1394241-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An unprivileged local user can pin a host CPU indefinitely in
l2tp_session_get_by_ifname() by issuing L2TP_CMD_SESSION_GET on
L2TP_ATTR_IFNAME concurrently with L2TP_CMD_SESSION_CREATE and
L2TP_CMD_SESSION_DELETE on the same tunnel. All three commands take
GENL_UNS_ADMIN_PERM, so CAP_NET_ADMIN in the netns user namespace
suffices; on any host that has l2tp_core loaded the trigger is
reachable from a standard `unshare -Urn` sandbox.
l2tp_session_unhash() removes a session from tunnel->session_list
with list_del_init(), but that list is walked by
l2tp_session_get_by_ifname() with list_for_each_entry_rcu() under
rcu_read_lock_bh(). list_del_init() leaves the deleted entry's
next/prev self-pointing; a reader that has loaded the entry and
then advances pos->list.next reads &session->list, container_of()s
back to the same session, and list_for_each_entry_rcu() never
reaches the list head. The CPU stays in strcmp() inside the
walker, with BH and preemption disabled, so RCU grace periods on
the host stall behind it and the wedged thread cannot be killed
(SIGKILL is delivered on syscall return).
Use list_del_rcu() to match the existing list_add_rcu() in
l2tp_session_register(); the deleted session remains visible to
in-flight walkers with consistent next/prev pointers until
kfree_rcu() in l2tp_session_free() releases it. tunnel->session_list
has exactly one list_del_init() call site; the list_del_init
(&session->clist) at l2tp_core.c:533 operates on the per-collision
list, which is not walked under RCU. list_empty(&session->list) is
not used anywhere in net/l2tp/ after the unhash point, so dropping
the post-delete self-init is safe; the fix has no userspace-visible
behavior change.
Linus Torvalds [Thu, 21 May 2026 15:43:26 +0000 (08:43 -0700)]
Merge tag 'soc-fixes-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
- The ff-a firmware driver gets 11 individual bugfixes for a number of
issues with robustness to buggy firmware or client implementations.
Another firmware fix address suspend to RAM via PSCI firmware.
- The final code change is for the old Arm Integrator reference
platform that recently started exposing an old NULL pointer
dereference bug.
- The MAINTAINERS file gets two updates, notably James Tai and Yu-Chun
Lin are stepping up as co-maintainers for the Realtek platform.
- The remaining patches are all for devicetree files. Two of these are
for riscv boards, the rest are all for enesas Arm platforms,
addressing build time checking issues as well as minor configuration
problems.
* tag 'soc-fixes-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
firmware: psci: Set pm_set_resume/suspend_via_firmware() for SYSTEM_SUSPEND
ARM: realtek: MAINTAINERS: Include pin controller drivers
MAINTAINERS: Add maintainers for ARM/REALTEK ARCHITECTURE
ARM: integrator: Fix early initialization
firmware: arm_ffa: Fix sched-recv callback partition lookup
firmware: arm_ffa: Snapshot notifier callbacks under lock
firmware: arm_ffa: Align RxTx buffer size before mapping
firmware: arm_ffa: Validate framework notification message layout
firmware: arm_ffa: Keep framework RX release under lock
firmware: arm_ffa: Bound PARTITION_INFO_GET_REGS copies
firmware: arm_ffa: Unregister bus notifier on teardown for FF-A v1.0
firmware: arm_ffa: Fix per-vcpu self notifications handling in workqueue
firmware: arm_ffa: Avoid collapsing NPI work from different CPUs
firmware: arm_ffa: Skip free_pages on RX buffer alloc failure
firmware: arm_ffa: Check for NULL FF-A ID table while driver registration
riscv: dts: microchip: fix icicle i2c pinctrl configuration
riscv: dts: starfive: jh7110: Drop CAMSS node
arm64: dts: renesas: r9a09g056: Add #mux-state-cells to usb20phyrst
arm64: dts: renesas: r9a09g057: Add #mux-state-cells to usb2{0,1}phyrst
ARM: dts: renesas: rskrza1: Drop superfluous cells
...
Nicolai Buchwitz [Wed, 20 May 2026 18:43:20 +0000 (20:43 +0200)]
net: bcmgenet: keep RBUF EEE/PM disabled
Setting RBUF_EEE_EN | RBUF_PM_EN in RBUF_ENERGY_CTRL breaks the RX
path on GENET hardware once MAC EEE becomes active. RX traffic stops
flowing while the link stays up and the usual descriptor/RX error
counters remain quiet. In that state the MAC still accepts frames
(rbuf_ovflow_cnt keeps climbing) but RBUF no longer forwards them to
DMA, so rx_packets is no longer incremented at the netdev level. On
some boards the corruption ends up as a paging fault in
skb_release_data via bcmgenet_rx_poll on an LPI exit.
Reproduced on Pi 4B (BCM2711 + BCM54213PE) and confirmed by Florian
Fainelli on an internal Broadcom 4908-family board with the same crash
signature. RBUF_PM_EN is not publicly documented.
This shows up more often now that phy_support_eee() enables EEE by
default, but it also affects older kernels as soon as TX LPI is
turned on via ethtool, so it is not specific to recent changes.
Always clear RBUF_EEE_EN | RBUF_PM_EN in bcmgenet_eee_enable_set so
the bits stay off across resets. UMAC and TBUF setup is left alone so
TX-side EEE keeps working.
====================
ethernet: 3c509: Bring driver back and make some fixes
As per the previous discussions[1][2] this patch series brings the 3c509
driver back. Picking up net rather than net-next as I consider it a fix
to accidental removal and so that any downstream users do not suffer from
disruption when using released kernels.
In the course of making the coding style changes requested I have come
across an actual bug in transceiver type selection code, where the old
setting is not masked out before ORing in the new one, causing no change
to be actually made in a requested transition from BNC to AUI. I guess
this code must have been executed exceedingly rarely, as it's always been
wrong ever since it was added in 2.5.42 back in 2002.
Therefore I find it not worth backporting to stable branches, however for
the sake of appropriateness, in case someone downstream does want to have
the fix, I chose to apply it second in the series, right after the actual
revert and before code clean-ups.
The remaining patches of the series should be obvious; see the respective
commit descriptions for details.
[1] "drivers: net: 3com: 3c509: Remove this driver",
<https://lore.kernel.org/r/alpine.DEB.2.21.2604240004280.28583@angie.orcam.me.uk/>.
[2] "MAINTAINERS: Add self for the 3c509 network driver",
<https://lore.kernel.org/r/alpine.DEB.2.21.2604271056460.28583@angie.orcam.me.uk/>.
====================
Update the driver for our current coding style according to output from
`checkpatch.pl' and manual code review, where no change to binary code
results, as indicated by `objdump -dr'. Exceptions are as follows:
- incomplete reverse xmas tree in set_multicast_list(), as that would
change binary output,
- referring el3_start_xmit() verbatim rather than via `__func__' with
pr_debug(), likewise,
- a bunch of pr_cont() calls, likewise,
- a long udelay() call in el3_netdev_set_ecmd() made under a spinlock,
likewise plus it's not eligible for conversion to a sleep in the first
place,
- a blank line at the start of a block in el3_interrupt(), to improve
readability where the first statement would otherwise visually merge
with the controlling expression of the enclosing `while' statement.
These issues are benign and depending on circumstances may be adressed
with suitable code refactoring later on.
ethernet: 3c509: Update documentation to match MAINTAINERS
There has been apparently a single message only ever publicly posted by
David Ruggiero, back in 2002, which added this documentation piece among
others, and MAINTAINERS was never updated accordingly. It is therefore
doubtful that his maintainer status has actually come into effect. Just
replace the reference then so as not to confuse people.
This driver has landed with Linux 0.99.13k, which was covered by the GNU
General Public License version 2, and no further conditions as to
licensing terms have been specified within the copyright notice included
with the driver itself.
ethernet: 3c509: Fix AUI transceiver type selection
The transceiver type is held in bits 15:14 of the Address Configuration
Register, with the values of 0b00, 0b01, and 0b11 denoting TP, AUI, and
BNC types respectively. Therefore switching from BNC to AUI requires
bits to be cleared before setting bit 14 or the setting won't change.
NB this has always been wrong ever since this code was added in 2.5.42.
Sabrina Dubroca [Wed, 20 May 2026 20:44:42 +0000 (22:44 +0200)]
net: gro: don't merge zcopy skbs
skb_gro_receive() can currently copy frags between the source and GRO
skb, without checking the zerocopy status, and in particular the
SKBFL_MANAGED_FRAG_REFS flag.
When SKBFL_MANAGED_FRAG_REFS is set, the skb doesn't hold a reference
on the pages in shinfo->frags. Appending those frags to another skb's
frags without fixing up the page refcount can lead to UAF.
When either the last skb in the GRO chain (the one we would append
frags to) or the source skb is zerocopy, don't merge the skbs.
Nikhil P. Rao [Wed, 20 May 2026 20:58:42 +0000 (20:58 +0000)]
pds_core: ensure null-termination for firmware version strings
The driver passes fw_version directly to devlink_info_version_stored_put()
without ensuring null-termination. While current firmware null-terminates
these strings, the driver should not rely on this behavior. Add explicit
null-termination to prevent potential issues if firmware behavior changes.
Justin Iurman [Wed, 20 May 2026 12:42:42 +0000 (14:42 +0200)]
ipv6: ioam: refresh hdr pointer before ioam6_event()
Reported by Sashiko:
In ipv6_hop_ioam(), the hdr pointer is initialized to point into the
skb's linear data buffer. Later, the code calls skb_ensure_writable(),
which might reallocate the buffer:
if (skb_ensure_writable(skb, optoff + 2 + hdr->opt_len))
goto drop;
/* Trace pointer may have changed */
trace = (struct ioam6_trace_hdr *)(skb_network_header(skb)
+ optoff + sizeof(*hdr));
If the skb is cloned or lacks sufficient linear headroom,
skb_ensure_writable() will invoke pskb_expand_head(), which reallocates
the skb's data buffer and frees the old one, invalidating pointers to
it. While the code recalculates the trace pointer immediately after the
call to skb_ensure_writable(), it fails to recalculate the hdr pointer.
This patch fixes the above by recalculating the hdr pointer before
passing hdr->opt_len to ioam6_event(), so that we avoid any UaF.
Weiming Shi [Wed, 20 May 2026 07:57:38 +0000 (00:57 -0700)]
tap: fix stack info leak in tap_ioctl() SIOCGIFHWADDR
In the SIOCGIFHWADDR path, tap_ioctl() copies 16 bytes of an
uninitialised on-stack struct sockaddr_storage to userspace via
ifr_hwaddr, but netif_get_mac_address() only writes sa_family and
dev->addr_len (6 for Ethernet) bytes, leaving sa_data[6..13] uninitialised.
Those 8 trailing bytes leak kernel stack contents; SIOCGIFHWADDR on a
macvtap chardev returns kernel .text and direct-map pointers, defeating
KASLR.
Initialise ss at declaration.
Fixes: 3b23a32a6321 ("net: fix dev_ifsioc_locked() race condition") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260520075736.3415676-3-bestswngs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dawei Feng [Wed, 20 May 2026 07:03:23 +0000 (15:03 +0800)]
qed: fix double free in qed_cxt_tables_alloc()
If one of the later PF or VF CID bitmap allocations fails,
qed_cid_map_alloc() jumps to cid_map_fail and frees the previously
allocated CID bitmaps before returning an error. qed_cxt_tables_alloc()
then calls qed_cxt_mngr_free(), which invokes qed_cid_map_free()
again.
Fix this by setting each CID bitmap pointer to NULL after bitmap_free()
to avoid double free.
The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc3.
Runtime reproduction was not attempted because exercising the failing
allocation path requires device-specific setup.
Fixes: fe56b9e6a8d9 ("qed: Add module with basic common support") Cc: stable@vger.kernel.org Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Link: https://patch.msgid.link/20260520070323.2762379-1-dawei.feng@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aditya Garg [Wed, 20 May 2026 05:15:53 +0000 (22:15 -0700)]
net: mana: validate rx_req_idx to prevent out-of-bounds array access
In mana_hwc_rx_event_handler(), rx_req_idx is derived from
sge->address in DMA-coherent memory. In Confidential VMs
(SEV-SNP/TDX), this memory is shared unencrypted and HW can modify
WQE contents at any time. No bounds check exists on rx_req_idx,
which can lead to an out-of-bounds access into reqs[].
Add bounds check on rx_req_idx in mana_hwc_rx_event_handler() before
using it to index the reqs[] array.
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)") Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/20260520051553.857120-1-gargaditya@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ratheesh Kannoth [Wed, 20 May 2026 04:30:36 +0000 (10:00 +0530)]
octeontx2-af: npc: Fix allmulticast skip logic for LBK and SDP VFs
When installing the allmulticast NPC rule, rvu_npc_install_allmulti_entry()
should skip LBK and SDP VFs (only CGX PF/VF may add the entry). The
code combined is_lbk_vf() and is_sdp_vf() with logical AND, which is
never true for a single pcifunc, so the intended early return never ran.
Zhang Cen [Tue, 19 May 2026 10:46:47 +0000 (18:46 +0800)]
netpoll: normalize skb->dev to the netpoll device
__netpoll_send_skb() always transmits through np->dev and queues busy
packets on np->dev->npinfo->txq, but it leaves skb->dev unchanged.
Stacked callers such as DSA and macvlan can reach netpoll with skb->dev
still naming the upper device while np->dev is the lower device that
owns the netpoll state.
If the skb has to be deferred, queue_process() later dequeues it from
the lower device's txq but retries it through skb->dev. That can
re-enter the upper ndo_start_xmit path on an already transformed skb,
and if the upper device disappears before the lower txq drains the
workqueue can dereference a stale skb->dev pointer.
The buggy scenario involves two paths, with each column showing the
order within that path:
path A label: netpoll enqueue path path B label: upper-device teardown
1. Stacked xmit calls netpoll 1. Teardown unregisters the upper
with lower np->dev and upper net_device while lower npinfo
skb->dev. stays alive.
2. __netpoll_send_skb() uses 2. netdev_release() runs for the
np->dev->npinfo as the txq upper net_device.
owner.
3. Busy transmit queues the skb 3. The lower txq still owns the
on that lower txq with upper deferred skb.
skb->dev.
4. queue_process() drains the 4. queue_process() dereferences
lower txq and reads skb->dev. that stale upper skb->dev.
Normalize skb->dev to np->dev after loading np->dev from the netpoll
instance, before either the direct transmit path or the fallback enqueue.
This keeps the queued skb in the same device and txq domain as the
netpoll state that owns it.
KASAN report as below:
KASAN slab-use-after-free in queue_process+0x7c/0x480
Workqueue: events queue_process
The buggy address belongs to the object at ffff88810906c000 which belongs
to the cache kmalloc-4k of size 4096
The buggy address is located 168 bytes inside of freed 4096-byte region
[ffff88810906c000, ffff88810906d000)
Read of size 8
Call trace:
dump_stack_lvl+0x73/0xb0 (?:?)
print_report+0xd1/0x620 (?:?)
srso_alias_return_thunk+0x5/0xfbef5 (?:?)
__virt_addr_valid+0x215/0x420 (?:?)
kasan_complete_mode_report_info+0x64/0x200 (?:?)
kasan_report+0xf7/0x130 (?:?)
queue_process+0x7c/0x480 (net/core/netpoll.c:88)
kasan_check_range+0x10c/0x1c0 (?:?)
__kasan_check_read+0x15/0x20 (?:?)
process_one_work+0x8b7/0x1af0 (kernel/workqueue.c:3200)
assign_work+0x170/0x3f0 (?:?)
worker_thread+0x574/0xf10 (?:?)
_raw_spin_unlock_irqrestore+0x4b/0x60 (?:?)
trace_hardirqs_on+0x2a/0x180 (?:?)
kthread+0x2fc/0x3f0 (?:?)
ret_from_fork+0x58b/0x830 (?:?)
__switch_to+0x58e/0xe90 (?:?)
__switch_to_asm+0x39/0x70 (?:?)
ret_from_fork_asm+0x1a/0x30 (?:?)
Freed by task stack:
kasan_save_stack+0x3d/0x60 (?:?)
kasan_save_track+0x18/0x40 (?:?)
kasan_save_free_info+0x3f/0x60 (?:?)
__kasan_slab_free+0x48/0x70 (?:?)
kfree+0x20e/0x4e0 (?:?)
kvfree+0x31/0x40 (?:?)
netdev_release+0x71/0x90 (net/core/net-sysfs.c:2227)
device_release+0xd2/0x250 (?:?)
kobject_put+0x181/0x4c0 (lib/kobject.c:730)
netdev_run_todo+0x700/0x1000 (net/core/dev.c:11666)
rtnl_dellink+0x396/0xc00 (net/core/rtnetlink.c:3558)
rtnetlink_rcv_msg+0x740/0xc20 (net/core/rtnetlink.c:6897)
netlink_rcv_skb+0x147/0x3a0 (?:?)
rtnetlink_rcv+0x19/0x20 (net/core/rtnetlink.c:7021)
netlink_unicast+0x4d1/0x830 (net/netlink/af_netlink.c:1327)
netlink_sendmsg+0x840/0xe10 (net/netlink/af_netlink.c:1812)
____sys_sendmsg+0x8a7/0xb50 (?:?)
___sys_sendmsg+0x104/0x190 (?:?)
__sys_sendmsg+0x135/0x1d0 (?:?)
__x64_sys_sendmsg+0x7b/0xc0 (?:?)
x64_sys_call+0x205c/0x2130 (?:?)
do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
entry_SYSCALL_64_after_hwframe+0x77/0x7f (?:?)
Abdun Nihaal [Tue, 19 May 2026 06:27:39 +0000 (11:57 +0530)]
net: wwan: iosm: fix potential memory leaks in ipc_imem_init()
The memory allocated in ipc_protocol_init() is not freed on the error
paths that follow in ipc_imem_init(). Fix that by calling the
corresponding release function ipc_protocol_deinit() in the error path.
Jakub Kicinski [Thu, 21 May 2026 00:41:51 +0000 (17:41 -0700)]
MAINTAINERS: add missing entry for Bluetooth include files
We X-out net/bluetooth/ from "NETWORKING [GENERAL]" so that only
the dedicated list is CCed on patches, and networking gets them
once already processed by Luiz. We missed include/net/bluetooth.
Nimrod Oren [Wed, 20 May 2026 15:39:28 +0000 (18:39 +0300)]
selftests: net: Fix checksums in xdp_native
Data adjustment cases failed with "Data exchange failed" when using IPv4
because the program did not update the IP and UDP checksums in the IPv4
branch. The issue was masked when both IPv4 and IPv6 were configured,
since the test harness prefers IPv6.
While here, generalize csum_fold_helper() to fold twice so it works for
any 32-bit input.
Yuho Choi [Wed, 20 May 2026 03:03:28 +0000 (23:03 -0400)]
ipv6: route: Unregister netdevice notifier on BPF init failure
ip6_route_init() registers ip6_route_dev_notifier before registering the
IPv6 route BPF iterator target. If bpf_iter_register() fails after the
notifier has been registered, the error path currently jumps to
out_register_late_subsys and unwinds the RTNL handlers and pernet route
state without removing the notifier from the netdevice notifier chain.
This leaves ip6_route_dev_notify() callable after the IPv6 route state it
uses has been torn down. Add a separate unwind label for the BPF iterator
failure path and unregister the netdevice notifier before continuing with
the existing cleanup.
The run.sh script explicitly checks that CONFIG_MODULES is disabled.
By default, this config option is enabled. Explicitly disable it to be
able to run the RDS tests.
Note that writing '# CONFIG_(...) is not set' is usually recommended to
disable an option in the .config, but it looks like selftests usually
set 'CONFIG_(...)=n', which looks clearer.
Zijing Yin [Tue, 19 May 2026 17:26:33 +0000 (10:26 -0700)]
phonet/pep: disable BH around forwarded sk_receive_skb()
The networking receive path is usually run from softirq context, but
protocols that take the socket lock may have packets stored in the
backlog and processed later from process context. In that case
release_sock() -> __release_sock() drops the slock with spin_unlock_bh()
and then calls sk->sk_backlog_rcv() with bottom halves enabled.
Typical sk_backlog_rcv handlers process the socket whose backlog is
being drained, so the BH state at entry is irrelevant for the slocks
they touch. pep_do_rcv() is different: when the inbound skb targets an
existing PEP pipe, it forwards the skb to a different *child* socket
via sk_receive_skb(). That helper takes the child slock with
bh_lock_sock_nested(), which is just spin_lock_nested() and assumes BH
is already off. The same child slock therefore ends up acquired with
BH on (process path) and with BH off (softirq path):
Lockdep flags this as inconsistent lock state, and it can become a real
self-deadlock if a softirq on the same CPU tries to receive to the same
child socket while its slock is held in the BH-enabled path:
Wrap the forwarded sk_receive_skb() in local_bh_disable() /
local_bh_enable() so the child slock is always acquired with BH off.
local_bh_disable() nests safely on the softirq path.
Discovered via in-house syzkaller fuzzing; the same root cause also
on the linux-6.1.y syzbot dashboard as extid 44f0626dd6284f02663c.
Reproduced under KASAN + LOCKDEP + PROVE_LOCKING, reproducer:
https://pastebin.com/A3t8xzCR
tracing: Fix unload_page for simple_ring_buffer init rollback
The unload_page callback expects the return value of load_page() as its
argument: ret = load_page(va); unload(ret). Fix the rollback code in
simple_ring_buffer_init_mm() where the descriptor's VA is used instead
of the loaded page address.
David Carlier [Tue, 12 May 2026 13:54:20 +0000 (14:54 +0100)]
tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
nr_subbufs in the ring buffer metadata is always initialized to zero
because it is assigned from cpu_buffer->nr_pages before the page
initialization loop has run. While nr_subbufs is not currently read
by the kernel, it should reflect the actual buffer geometry in the
meta page for correctness.
Move the assignment after the page loop so that cpu_buffer->nr_pages
holds the final count.
Link: https://patch.msgid.link/20260512135420.99194-1-devnexen@gmail.com Fixes: 34e5b958bdad ("tracing: Introduce simple_ring_buffer") Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring-buffer: Flush and stop persistent ring buffer on panic
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Thu, 21 May 2026 02:08:01 +0000 (22:08 -0400)]
ring-buffer: Fix reporting of missed events in iterator
When tracing is active while reading the trace file, if the iterator
reading the buffer detects that the writer has passed the iterator head,
it will reset and set a "missed events" flag. This flag is passed to the
output processing to show the user that events were missed:
CPU:4 [LOST EVENTS]
The problem is that the flag is reset after it is checked in
ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU
ring buffers and it will check if they are dropped when figuring out which
buffer to print next. This prematurely clears the missed_events flag if
the CPU buffer with the missed events is not the one that is printed next.
On the iteration where the CPU buffer with the missed events is printed,
the check if it had missed events would return false and the output does
not show that events were missed.
Do not reset the missed_events flag when checking if there were missed
events, but instead clear it when moving the iterator head to the next
event.
====================
vsock/virtio: fix skb overhead accounting to preserve full buf_alloc
Patch 1 resets the connection when we can no longer queue packets,
this prevents silent data loss, and both peers are notified.
Patch 2 increases the total budget to `buf_alloc * 2` for payload
plus skb overhead similar to how SO_RCVBUF is doubled to reserve
space for sk_buff metadata. This preserves the full buf_alloc for
payload under normal operation, while still bounding the skb queue
growth.
In the future, we plan to improve how we handle the merging of packets
to minimize overhead and avoid closing connections.
vsock/virtio: fix skb overhead accounting to preserve full buf_alloc
After commit 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb
queue"), virtio_transport_inc_rx_pkt() subtracts per-skb overhead from
buf_alloc when checking whether a new packet fits. This reduces the
effective receive buffer below what the user configured via
SO_VM_SOCKETS_BUFFER_SIZE, causing legitimate data packets to be
silently dropped and applications that rely on the full buffer size
to deadlock.
Also, the reduced space is not communicated to the remote peer, so
its credit calculation accounts more credit than the receiver will
actually accept, causing data loss (there is no retransmission).
With this approach we currently have failures in
tools/testing/vsock/vsock_test.c. Test 18 sometimes fails, while
test 22 always fails in this way:
18 - SOCK_STREAM MSG_ZEROCOPY...hash mismatch
Fix by allowing at most `buf_alloc * 2` as the total budget for payload
plus skb overhead in virtio_transport_inc_rx_pkt(), similar to how
SO_RCVBUF is doubled to reserve space for sk_buff metadata.
This preserves the full buf_alloc for payload under normal operation,
while still bounding the skb queue growth.
With this patch, all tests in tools/testing/vsock/vsock_test.c are
now passing again.
vsock/virtio: reset connection on receiving queue overflow
When there is no more space to queue an incoming packet, the packet is
silently dropped. This causes data loss without any notification to
either peer, since there is no retransmission.
Under normal circumstances, this should never happen. However, it could
happen if the other peer doesn't respect the credit, or if the skb
overhead, which we recently began to take into account with commit 059b7dbd20a6 ("vsock/virtio: fix potential unbounded skb queue"),
is too high.
Fix this by resetting the connection and setting the local socket error
to ENOBUFS when virtio_transport_recv_enqueue() can no longer queue a
packet, so both peers are explicitly notified of the failure rather than
silently losing data.
Fixes: ae6fcfbf5f03 ("vsock/virtio: discard packets if credit is not respected") Cc: stable@vger.kernel.org Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20260518090656.134588-2-sgarzare@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Add preliminary NETC switch support for i.MX94
i.MX94 NETC (v4.3) integrates 802.1Q Ethernet switch functionality, the
switch provides advanced QoS with 8 traffic classes and a full range of
TSN standards capabilities. It has 3 user ports and 1 CPU port, and the
CPU port is connected to an internal ENETC through the pseduo link, so
instead of a back-to-back MAC, the lightweight "pseudo MAC" is used at
both ends of the pseudo link to transfer Ethernet frames. The pseudo
link provides a zero-copy interface (no serialization delay) and lower
power (less logic and memory).
Like most Ethernet switches, the NETC switch also supports a proprietary
switch tag, is used to carry in-band metadata information about frames.
This in-band metadata information can include the source port from which
the frame was received, what was the reason why this frame got forwarded
to the entity, and for the entity to indicate the precise destination
port of a frame. The NETC switch tag is added to frames after the source
MAC address. There are three types of switch tags, and each type has 1
to 4 subtypes, more details are as follows.
Forward switch tag (Type = 0): Represents forwarded frames.
- SubType = 0 - Normal frame processing.
To_Port switch tag (Type = 1): Represents frames that are to be sent to
a specific switch port.
- SubType = 0. No request to perform timestamping.
- SubType = 1. Request to perform one-step timestamping.
- SubType = 2. Request to perform two-step timestamping.
- SubType = 3. Request to perform both one-step timestamping and
two-step timestamping.
To_Host switch tag (Type = 2): Represents frames redirected or copied to
the switch management port.
- SubType = 0. Received frames redirected or copied to the switch
management port.
- SubType = 1. Received frames redirected or copied to the switch
management port with captured timestamp at the switch port where
the frame was received.
- SubType = 2. Transmit timestamp response (two-step timestamping).
Currently, this patch set supports Forward tag, SubType 0 of To_Port tag
and SubType 0 of To_Host tag. More tags will be supported in the future.
In addition, the switch supports NETC Table Management Protocol (NTMP),
some switch functionality is controlled using control messages sent to
the hardware using BD ring interface with 32B descriptors similar to the
packet Transmit BD ring used on ENETC. This interface is referred to as
the command BD ring. This is used to configure functionality where the
underlying resources may be shared between different entities or being
too large to configure using direct registers.
For this patch set, we have supported the following tables through the
command BD ring interface.
FDB Table: It contains forwarding and/or filtering information about MAC
addresses. The FDB table is used for MAC learning lookups and MAC
forwarding lookups.
VLAN Filter Table: It contains configuration and control information for
each VLAN configured on the switch.
Buffer Pool Table: It contains buffer pool configuration and operational
information. Each entry corresponds to a buffer pool. Currently, we use
this table to implement flow control feature on each port.
Ingress Port Filter Table: It contains a set of filters each capable of
classifying incoming traffic using a mix of L2, L3, and L4 parsed and
arbitrary field data. We use this table to implement host flood support
to the switch port.
The switch also supports other tables, and we will add more advanced
features through them in the future.
====================
Wei Fang [Mon, 18 May 2026 08:25:06 +0000 (16:25 +0800)]
net: dsa: netc: add support for ethtool private statistics
Implement the ethtool private statistics interface to expose additional
port-level and MAC-level counters that are not covered by the standard
IEEE 802.3 statistics. The pMAC counters are only reported when the port
supports Frame Preemption (802.1Qbu/802.3br).
Note that although rtnl_link_stats64 provides some standard statistics
such as rx octets, rx frame errors, rx dropped packets, and tx packets,
these are overall port statistics. The NETC switch supports preemption
on each port, and each port has two MACs (eMAC and pMAC). The driver
private statistics are used to obtain statistics for each MAC, allowing
users to perform analysis and debugging.
Wei Fang [Mon, 18 May 2026 08:25:05 +0000 (16:25 +0800)]
net: dsa: netc: add support for the standardized counters
Each user port of the NETC switch supports 802.3 basic and mandatory
managed objects statistic counters and IETF Management Information
Database (MIB) package (RFC2665) and Remote Network Monitoring (RMON)
counters. And all of these counters are 64-bit registers. In addition,
some user ports support preemption, so these ports have two MACs, MAC
0 is the express MAC (eMAC), MAC 1 is the preemptible MAC (pMAC). So
for ports that support preemption, the statistics are the sum of the
pMAC and eMAC statistics.
Note that the current switch driver does not support preemption, all
frames are sent and received via the eMAC by default. The statistics
read from the pMAC should be zero.
Wei Fang [Mon, 18 May 2026 08:25:04 +0000 (16:25 +0800)]
net: dsa: netc: initialize buffer pool table and implement flow-control
The buffer pool is a quantity of memory available for buffering a group
of flows (e.g. frames having the same priority, frames received from the
same port), while waiting to be transmitted on a port. The buffer pool
tracks internal memory consumption with upper bound limits and optionally
a non-shared portion when associated with a shared buffer pool. Currently
the shared buffer pool is not supported, it will be added in the future.
For i.MX94, the switch has 4 ports and 8 buffer pools, so each port is
allocated two buffer pools. For frames with priorities of 0 to 3, they
will be mapped to the first buffer pool; For frames with priorities of
4 to 7, they will be mapped to the second buffer pool. Each buffer pool
has a flow control on threshold and a flow control off threshold. By
setting these threshold, add the flow control support to each port.
Wei Fang [Mon, 18 May 2026 08:25:03 +0000 (16:25 +0800)]
net: dsa: netc: add FDB, STP, MTU, port setup and host flooding support
Expand the NETC switch driver with several foundational features:
- FDB and MDB management
- STP state handling
- MTU configuration
- Port setup/teardown
- Host flooding support
At this stage, the driver operates only in standalone port mode. Each
port uses VLAN 0 as its PVID, meaning ingress frames are internally
assigned VID 0 regardless of whether they arrive tagged or untagged.
Note that this does not inject a VLAN 0 header into the frame, the VID
is used purely for subsequent VLAN processing within the switch.
Wei Fang [Mon, 18 May 2026 08:25:02 +0000 (16:25 +0800)]
net: dsa: netc: add phylink MAC operations
Different versions of NETC switches have different numbers of ports and
MAC capabilities. Add .phylink_get_caps() to struct netc_switch_info,
allowing each NETC switch version to implement its own callback for
obtaining MAC capabilities.
Implement the phylink_mac_ops callbacks: .mac_config(), .mac_link_up(),
and .mac_link_down(). Note that flow-control configuration is not yet
supported in .mac_link_up(), but will be implemented in a subsequent
patch.
Wei Fang [Mon, 18 May 2026 08:25:01 +0000 (16:25 +0800)]
net: dsa: netc: introduce NXP NETC switch driver for i.MX94
For i.MX94 series, the NETC IP provides full 802.1Q Ethernet switch
functionality, advanced QoS with 8 traffic classes, and a full range of
TSN standards capabilities. The switch has 3 user ports and 1 CPU port,
the CPU port is connected to an internal ENETC. Since the switch and the
internal ENETC are fully integrated within the NETC IP, no back-to-back
MAC connection is required. Instead, a light-weight "pseudo MAC" is used
between the switch and the ENETC. This translates to lower power (less
logic and memory) and lower delay (as there is no serialization delay
across this link).
Introduce the initial NETC switch driver with basic probe and remove
functionality. More features will be added in subsequent patches.
Wei Fang [Mon, 18 May 2026 08:25:00 +0000 (16:25 +0800)]
net: dsa: add NETC switch tag support
The NXP NETC switch tag is a proprietary header added to frames after the
source MAC address. The switch tag has 3 types, and each type has 1 ~ 4
subtypes, the details are as follows.
Forward NXP switch tag (Type=0): Represents forwarded frames.
- SubType = 0 - Normal frame processing.
To_Port NXP switch tag (Type=1): Represents frames that are to be sent
to a specific switch port.
- SubType = 0. No request to perform timestamping.
- SubType = 1. Request to perform one-step timestamping.
- SubType = 2. Request to perform two-step timestamping.
- SubType = 3. Request to perform both one-step timestamping and
two-step timestamping.
To_Host NXP switch tag (Type=2): Represents frames redirected or copied
to the switch management port.
- SubType = 0. Received frames redirected or copied to the switch
management port.
- SubType = 1. Received frames redirected or copied to the switch
management port with captured timestamp at the switch port where
the frame was received.
- SubType = 2. Transmit timestamp response (two-step timestamping).
In addition, the length of different type switch tag is different, the
minimum length is 6 bytes, the maximum length is 14 bytes. Currently,
Forward tag, SubType 0 of To_Port tag and Subtype 0 of To_Host tag are
supported. More tags will be supported in the future.
Wei Fang [Mon, 18 May 2026 08:24:59 +0000 (16:24 +0800)]
net: enetc: add multiple command BD rings support
All the tables of NETC switch are managed through the command BD ring,
but unlike ENETC, the switch has two command BD rings, if the current
ring is busy, the switch driver can switch to another ring to manage
the table. Currently, the NTMP driver does not support multiple rings.
Therefore, update ntmp_select_and_lock_cbdr() to select a appropriate
ring to execute the command for the switch.
Wei Fang [Mon, 18 May 2026 08:24:58 +0000 (16:24 +0800)]
net: enetc: add support for "Add" and "Delete" operations to IPFT
The ingress port filter table (IPFT )contains a set of filters each
capable of classifying incoming traffic using a mix of L2, L3, and L4
parsed and arbitrary field data. As a result of a filter match, several
actions can be specified such as on whether to deny or allow a frame,
overriding internal QoS attributes associated with the frame and setting
parameters for the subsequent frame processing functions, such as stream
identification, policing, ingress mirroring. Each entry corresponds to a
filter. The ingress port filter entries are added using a precedence
value. If a frame matches multiple entries, the entry with the higher
precedence is used. Currently, this patch only adds "Add" and "Delete"
operations to the ingress port filter table. These two interfaces will
be used by both ENETC driver and NETC switch driver.
Wei Fang [Mon, 18 May 2026 08:24:57 +0000 (16:24 +0800)]
net: enetc: add support for the "Update" operation to buffer pool table
The buffer pool table contains buffer pool configuration and operational
information. Each entry corresponds to a buffer pool. The Entry ID value
represents the buffer pool ID to access.
The buffer pool table is a static bounded index table, buffer pools are
always present and enabled. It only supports Update and Query operations,
This patch only adds ntmp_bpt_update_entry() helper to support updating
the specified entry of the buffer pool table. Query action to the table
will be added in the future.
Wei Fang [Mon, 18 May 2026 08:24:56 +0000 (16:24 +0800)]
net: enetc: add support for the "Add" operation to VLAN filter table
The VLAN filter table contains configuration and control information for
each VLAN configured on the switch. Each VLAN entry includes the VLAN
port membership, which FID to use in the FDB lookup, which spanning tree
group to use, the egress frame modification actions to apply to a frame
exiting form this VLAN, and various configuration and control parameters
for this VLAN.
The VLAN filter table can only be managed by the command BD ring using
table management protocol version 2.0. The table supports Add, Delete,
Update and Query operations. And the table supports 3 access methods:
Entry ID, Exact Match Key Element and Search. But currently we only add
the ntmp_vft_add_entry() helper to support the upcoming switch driver to
add an entry to the VLAN filter table. Other interfaces will be added in
the future.
Wei Fang [Mon, 18 May 2026 08:24:55 +0000 (16:24 +0800)]
net: enetc: add basic operations to the FDB table
The FDB table is used for MAC learning lookups and MAC forwarding lookups.
Each table entry includes information such as a FID and MAC address that
may be unicast or multicast and a forwarding destination field containing
a port bitmap identifying the associated port(s) with the MAC address.
FDB table entries can be static or dynamic. Static entries are added from
software whereby dynamic entries are added either by software or by the
hardware as MAC addresses are learned in the datapath.
The FDB table can only be managed by the command BD ring using table
management protocol version 2.0. Table management command operations Add,
Delete, Update and Query are supported. And the FDB table supports three
access methods: Entry ID, Exact Match Key Element and Search. This patch
adds the following basic supports to the FDB table.
ntmp_fdbt_update_entry() - update the configuration element data of a
specified FDB entry
ntmp_fdbt_delete_entry() - delete a specified FDB entry
ntmp_fdbt_add_entry() - add an entry into the FDB table
ntmp_fdbt_search_port_entry() - Search the FDB entry on the specified
port based on RESUME_ENTRY_ID.
Wei Fang [Mon, 18 May 2026 08:24:54 +0000 (16:24 +0800)]
net: enetc: add pre-boot initialization for i.MX94 switch
Before probing the NETC switch driver, some pre-initialization needs to
be set in NETCMIX and IERB to ensure that the switch can work properly.
For example, i.MX94 NETC switch has three external ports and each port
is bound to a link. And each link needs to be configured so that it can
work properly, such as I/O variant and MII protocol.
In addition, the switch port 2 (MAC 2) and ENETC 0 (MAC 3) share the same
parallel interface, they cannot be used at the same time due to the SoC
constraint. And the MAC selection is controlled by the mac2_mac3_sel bit
of EXT_PIN_CONTROL register. Currently, the interface is set for ENETC 0
by default unless the switch port 2 is enabled in the DT node.
Like ENETC, each external port of the NETC switch can manage its external
PHY through its port MDIO registers. And the port can only access its own
external PHY by setting the PHY address to the LaBCR[MDIO_PHYAD_PRTAD].
If the accessed PHY address is not equal to LaBCR[MDIO_PHYAD_PRTAD], then
the MDIO access initiated by port MDIO will be invalid.
Wei Fang [Mon, 18 May 2026 08:24:53 +0000 (16:24 +0800)]
dt-bindings: net: dsa: add NETC switch
Add bindings for NETC switch. This switch is a PCIe function of NETC IP,
it supports advanced QoS with 8 traffic classes and 4 drop resilience
levels, and a full range of TSN standards capabilities. The switch CPU
port connects to an internal ENETC port, which is also a PCIe function
of NETC IP. So these two ports use a light-weight "pseudo MAC" instead
of a back-to-back MAC, because the "pseudo MAC" provides the delineation
between switch and ENETC, this translates to lower power (less logic and
memory) and lower delay (as there is no serialization delay across this
link).
Wei Fang [Mon, 18 May 2026 08:24:52 +0000 (16:24 +0800)]
dt-bindings: net: dsa: update the description of 'dsa,member' property
The current description indicates that the 'dsa,member' property cannot
be set for a switch that is not part of any cluster. Vladimir thinks
that this is a case where the actual technical limitation was poorly
transposed into words when this restriction was first documented, in
commit 8c5ad1d6179d ("net: dsa: Document new binding").
The true technical limitation is that many DSA tagging protocols are
topology-unaware, and always call dsa_conduit_find_user() with a
switch_id of 0. Specifying a custom "dsa,member" property with a
non-zero switch_id would break them.
Therefore, for topology-aware switches, it is fine to specify this
property for them, even if they are not part of any cluster. Our NETC
switch is a good example which is topology-aware, the switch_id is
carried in the switch tag, but the switch_id 0 is reserved for VEPA
switch and cannot be used, so we need to use this property to assign
a non-zero switch_id for it.
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Wei Fang <wei.fang@nxp.com> Acked-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20260518082506.1318236-2-wei.fang@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net/mlx5: Prepare eswitch infrastructure for satellite PF support
A satellite PF is a new SmartNIC configuration that adds another
physical function on the DPU that is not an eswitch manager and not a
page manager. The satellite PF can have its own SFs and can be passed
through to a VM on the DPU, providing an isolated function for users who
should not have access to the privileged ECPF. The ECPF handles the
satellite PF and the host PF in a similar way, using the same management
framework.
This series prepares the mlx5 eswitch command interface and vport
infrastructure for satellite PF support.
The first two patches abstract host PF data parsing behind a helper and
switch to the v1 response layout for query_esw_functions when supported,
so callers are insulated from layout differences.
The IPsec VF checks are tightened to use mlx5_eswitch_is_vf_vport()
instead of comparing against a specific vport number.
The remaining patches refactor SET_HCA_CAP and enable/disable_hca
command helpers to support vhca_id-based addressing, which is required
for managing functions that are not directly addressable by function_id.
A follow-up series will introduce satellite PF discovery and management
using this infrastructure.
Moshe Shemesh [Mon, 18 May 2026 07:13:56 +0000 (10:13 +0300)]
net/mlx5: Generalize enable/disable HCA for any PF vport
Refactor the host-PF-specific mlx5_cmd_host_pf_enable/disable_hca()
into generic mlx5_cmd_pf_enable/disable_hca() that accept a vport
number. The new functions use vhca_id as function_id when supported.
Similarly, refactor the eswitch layer into generic static helpers
mlx5_esw_pf_enable/disable_hca() with thin wrappers for the host PF
case, in preparation for enable_hca on satellite PF vports.
Moshe Shemesh [Mon, 18 May 2026 07:13:55 +0000 (10:13 +0300)]
net/mlx5: Use vport helper for IPsec eswitch set caps
Use mlx5_vport_set_other_func_cap() and
mlx5_vport_set_other_func_general_cap() in the IPsec eswitch functions
instead of open-coding the SET_HCA_CAP command. This removes redundant
buffer allocation and boilerplate, and also enables vhca_id based
addressing when supported.
Use mlx5_vport_set_other_func_general_cap() instead of open-coding the
SET_HCA_CAP command. This removes redundant buffer allocation and
ensures consistent use of vport-based function addressing.
mlx5_vport_set_other_func_general_cap() supports both function_id and
vhca_id based addressing, so this also enables SET_HCA_CAP for vhca_id
indexed functions which was not supported before.
Add mlx5_vport_set_other_func_general_cap() convenience macro, symmetric
to the existing mlx5_vport_get_other_func_general_cap(), and use it in
mlx5_devlink_port_fn_roce_set().
Moshe Shemesh [Mon, 18 May 2026 07:13:52 +0000 (10:13 +0300)]
net/mlx5: Switch vport HCA cap helpers to kvzalloc
mlx5_vport_set_other_func_cap() and mlx5_vport_get_vhca_id() allocate
command buffers that embed the HCA capability union, exceeding 4KiB.
Use kvzalloc/kvfree so the allocation can fall back to vmalloc when
contiguous memory is scarce.
Moshe Shemesh [Mon, 18 May 2026 07:13:51 +0000 (10:13 +0300)]
net/mlx5: Use mlx5_eswitch_is_vf_vport() for IPsec VF checks
IPsec eswitch offload operations and the enabled_ipsec_vf_count counter
are intended for VF vports only. Replace the MLX5_VPORT_HOST_PF checks
with mlx5_eswitch_is_vf_vport() to properly identify VF vports, as
preparation for adding another type of PF vports.
Moshe Shemesh [Mon, 18 May 2026 07:13:50 +0000 (10:13 +0300)]
net/mlx5: Use v1 response layout for query_esw_functions
Use the v1 response layout for the query_esw_functions command when
supported by the device. When query_host_net_function_v1 capability is
set, use MLX5_QUERY_ESW_FUNC_OP_MOD_LAYOUT_V1 to retrieve parameters
for multiple network functions, allocating the output buffer according
to query_host_net_function_num_max. Validate that firmware does not
return more entries than the allocated buffer.
The v1 layout reports vhca_state instead of the legacy host_pf_disabled
bit. PFs transition through ALLOCATED, ACTIVE, and IN_USE states (they
do not use TEARDOWN_REQUEST as SFs do). When the ECPF calls disable_hca,
firmware resets the PF and moves it to ALLOCATED. When the ECPF calls
enable_hca, the PF moves to ACTIVE, and once the PF driver enables it,
it reaches IN_USE. The PF is only fully operational in IN_USE, so
pf_disabled is derived as vhca_state != IN_USE, equivalent to the legacy
host_pf_disabled bit.
The mlx5_esw_get_host_pf_info() helper abstracts parsing the command
output in both legacy and new formats, so callers do not need to handle
the different layouts.
Moshe Shemesh [Mon, 18 May 2026 07:13:49 +0000 (10:13 +0300)]
net/mlx5: Use helper to parse host PF info
Add a helper mlx5_esw_get_host_pf_info() to retrieve host PF data from
the query_esw_functions command output, so callers no longer need to
parse the layout to obtain the required information.
Convert all callers of mlx5_esw_query_functions() to use the new helper,
preparing for upcoming support of the new op_mod that returns data in
the network_function_params layout.
Zhi Li [Mon, 18 May 2026 02:22:13 +0000 (10:22 +0800)]
net: stmmac: eswin: validate RGMII delay values
Validate rx-internal-delay-ps and tx-internal-delay-ps against the
hardware capabilities of the EIC7700 MAC.
The programmable RGMII delay supports 20 ps steps and a maximum value of
2540 ps. The driver previously accepted arbitrary values and silently
truncated unsupported settings when converting them to hardware units.
As a result, invalid device tree values could lead to unexpected delay
programming and incorrect RGMII timing.
Reject delay values that are not multiples of 20 ps or exceed the
supported hardware range.
Zhi Li [Mon, 18 May 2026 02:21:52 +0000 (10:21 +0800)]
net: stmmac: eswin: correct RGMII delay granularity to 20 ps
The EIC7700 MAC implements programmable RGMII delay adjustment with a
granularity of 20 ps per hardware step.
The driver previously converted rx-internal-delay-ps and
tx-internal-delay-ps values using a 100 ps step size, resulting in
incorrect delay programming.
Update the conversion to use the correct 20 ps granularity so the
programmed delay matches the values described in the device tree.
Zhi Li [Mon, 18 May 2026 02:21:37 +0000 (10:21 +0800)]
net: stmmac: eswin: clear TXD and RXD delay registers during initialization
Clear the TXD and RXD delay control registers during EIC7700 DWMAC
initialization.
These registers may retain values programmed by the bootloader. If left
unchanged, residual delays can alter the effective RGMII timing seen by
the MAC and override the configuration described by the device tree.
This may violate the expected RGMII timing model and can cause link
instability or prevent the Ethernet controller from operating correctly.
Explicitly clearing these registers ensures that the MAC delay settings
are determined solely by the kernel configuration.
The corresponding register offsets are optional, and the registers are
only cleared when the offsets are provided in the device tree.
Fix the initialization ordering of the HSP CSR configuration in the
EIC7700 DWMAC glue driver.
The HSP CSR registers control MAC-side RGMII delay behavior and must
only be accessed after the corresponding clocks are enabled. The
previous implementation could trigger register access before clock
enablement, leading to undefined behavior depending on boot state.
Move the HSP CSR configuration into the post-clock-enable initialization
path to ensure all register accesses occur under valid clock domains.
This change ensures deterministic initialization and prevents
clock-dependent register access failures during probe or resume.
Document two optional cells in eswin,hsp-sp-csr for the TXD and RXD
delay control register offsets.
These registers are used by the driver to clear any residual delay
configuration left by the bootloader, ensuring that MAC-side RGMII delay
settings are applied solely according to the kernel configuration.
Add a reference to the EIC7700X SoC Technical Reference Manual for
background information about the HSP CSR block.
Fixes: 888bd0eca93c ("dt-bindings: ethernet: eswin: Document for EIC7700 SoC") Signed-off-by: Zhi Li <lizhi2@eswincomputing.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20260518022023.427-1-lizhi2@eswincomputing.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Hyunwoo Kim [Fri, 15 May 2026 22:28:53 +0000 (07:28 +0900)]
net: skbuff: propagate shared-frag marker through frag-transfer helpers
Two frag-transfer helpers (__pskb_copy_fclone() and skb_shift()) fail
to propagate the SKBFL_SHARED_FRAG bit in skb_shinfo()->flags when
moving frags from source to destination. __pskb_copy_fclone() defers
the rest of the shinfo metadata to skb_copy_header() after copying
frag descriptors, but that helper only carries over gso_{size,segs,
type} and never touches skb_shinfo()->flags; skb_shift() moves frag
descriptors directly and leaves flags untouched. As a result, the
destination skb keeps a reference to the same externally-owned or
page-cache-backed pages while reporting skb_has_shared_frag() as
false.
The mismatch is harmful in any in-place writer that uses
skb_has_shared_frag() to decide whether shared pages must be detoured
through skb_cow_data(). ESP input is one such writer (esp4.c,
esp6.c), and a single nft 'dup to <local>' rule -- or any other
nf_dup_ipv4() / xt_TEE caller -- is enough to land a pskb_copy()'d
skb in esp_input() with the marker stripped, letting an unprivileged
user write into the page cache of a root-owned read-only file via
authencesn-ESN stray writes.
Set SKBFL_SHARED_FRAG on the destination whenever frag descriptors
were actually moved from the source. skb_copy() and skb_copy_expand()
share skb_copy_header() too but linearize all paged data into freshly
allocated head storage and emerge with nr_frags == 0, so
skb_has_shared_frag() returns false on its own; they need no change.
The same omission exists in skb_gro_receive() and skb_gro_receive_list().
The former moves the incoming skb's frag descriptors into the
accumulator's last sub-skb via two paths (a direct frag-move loop and
the head_frag + memcpy path); the latter chains the incoming skb whole
onto p's frag_list. Downstream skb_segment() reads only
skb_shinfo(p)->flags, and skb_segment_list() reuses each sub-skb's
shinfo as the nskb -- both p and lp must carry the marker.
The same omission also exists in tcp_clone_payload(), which builds an
MTU probe skb by moving frag descriptors from skbs on sk_write_queue
into a freshly allocated nskb. The helper falls into the same family
and warrants the same fix for consistency; no TCP TX-side in-place
writer is currently known to reach a user page through this gap, but
a future consumer depending on the marker would regress silently.
The same omission exists in skb_segment(): the per-iteration flag
merge takes only head_skb's flag, and the inner switch that rebinds
frag_skb to list_skb on head_skb-frags exhaustion does not fold the
new frag_skb's flag into nskb. Fold frag_skb's flag at both sites
so segments drawing frags from frag_list members carry the marker.
Fixes: cef401de7be8 ("net: fix possible wrong checksum generation") Fixes: f4c50a4034e6 ("xfrm: esp: avoid in-place decrypt on shared skb frags") Suggested-by: Sabrina Dubroca <sd@queasysnail.net> Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Suggested-by: Ben Hutchings <ben@decadent.org.uk> Suggested-by: Lin Ma <malin89@huawei.com> Suggested-by: Jingguo Tan <tanjingguo@huawei.com> Suggested-by: Aaron Esau <aaron1esau@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Tested-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/ageeJfJHwgzmKXbh@v4bel Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet [Tue, 19 May 2026 08:46:11 +0000 (08:46 +0000)]
tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN prediction
Blamed commit moved the TIME_WAIT-derived ISN from the skb control
block to a per-CPU variable, assuming the value would always be consumed
by tcp_conn_request() for the same packet that wrote it. That assumption
is violated by multiple drop paths between the producer
(__this_cpu_write(tcp_tw_isn, isn) in tcp_v{4,6}_rcv()) and the consumer
(tcp_conn_request()):
- min_ttl / min_hopcount check
- xfrm policy check
- tcp_inbound_hash() MD5/AO mismatch
- tcp_filter() eBPF/SO_ATTACH_FILTER drop
- th->syn && th->fin discard in tcp_rcv_state_process() TCP_LISTEN
- psp_sk_rx_policy_check() in tcp_v{4,6}_do_rcv()
- tcp_checksum_complete() in tcp_v{4,6}_do_rcv()
- tcp_v{4,6}_cookie_check() returning NULL
When a packet is dropped on any of these paths, tcp_tw_isn is left set.
The next SYN processed on the same CPU then consumes the non zero value in
tcp_conn_request(), receiving a potentially predictable ISN.
This patch moves back tcp_tw_isn to skb->cb[], getting rid of the per-cpu
variable.
Note that tcp_v{4,6}_fill_cb() do not set it.
Very litle impact on overall code size/complexity:
The warning is not new and exists since the initial bareudp commit 571912c69f0e ("net: UDP tunnel encapsulation module for tunnelling
different protocols like MPLS, IP, NSH etc.").
Let's use rtnl_dereference().
Note that bareudp_sock_release() is called from bareudp_stop()
under RTNL, so there is no real issue even without the helper.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605062359.e3gOfZCr-lkp@intel.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260518050726.318824-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bareudp: Remove synchronize_net() in bareudp_sock_release().
synchronize_net() in bareudp_sock_release() has existed since
day 1, commit 571912c69f0e ("net: UDP tunnel encapsulation module
for tunnelling different protocols like MPLS, IP, NSH etc.").
It was most likely copied from a similar tunneling device like
vxlan or geneve.
bareudp_sock_release() is called from dev->netdev_ops->ndo_stop(),
and synchronize_net() in unregister_netdevice_many_notify() ensures
that inflight bareudp fast paths finish before bareudp_dev is freed.
Let's remove the redundant synchronize_net() in bareudp_sock_release().
geneve: Remove synchronize_net() in geneve_unquiesce().
When changing the geneve config, geneve_changelink() sandwiches
the config memcpy() between geneve_quiesce() and geneve_unquiesce().
geneve_quiesce() temporarily clears geneve->sock[46] and their
sk_user_data, and then calls synchronize_net() to wait for inflight
fast paths to finish.
geneve_unquiesce() then restores the cleared pointers, but it also
superfluously calls synchronize_net().
The latter synchronize_net() provides no benefit; with or without it,
inflight fast paths can see either the NULL pointers or the original
pointers alongside the new configuration.
Let's remove the redundant synchronize_net() in geneve_unquiesce().
geneve: Remove synchronize_net() in geneve_sock_release().
vxlan previously had an issue where the fast path could access
stale pointers, which was fixed by commit c6fcc4fc5f8b ("vxlan:
avoid using stale vxlan socket.").
geneve later followed the same pattern, and commit fceb9c3e3825
("geneve: avoid using stale geneve socket.") copied synchronize_net()
from vxlan_sock_release() into geneve_sock_release().
However, that change occurred after commit ca065d0cf80f ("udp: no
longer use SLAB_DESTROY_BY_RCU"), and geneve had already been
using kfree_rcu() to free geneve_sock.
Therefore, the synchronize_net() was never actually needed there.
Let's remove the redundant synchronize_net() in geneve_sock_release().
vxlan: Remove synchronize_net() in vxlan_sock_release().
Initially, a dedicated workqueue was used to defer calling
udp_tunnel_sock_release(vxlan_sock->sock) and kfree(vxlan_sock).
Later, commit 0412bd931f5f ("vxlan: synchronously and race-free
destruction of vxlan sockets") removed the workqueue and instead
invoked these two functions immediately after synchronize_net().
This was intended to prevent UAF of the UDP socket in the fast path.
( Note that the "nondeterministic behaviour" mentioned in that
commit was not addressed, as another thread not waiting RCU gp
still sees the same behaviour. )
However, a week prior to that change, commit ca065d0cf80f ("udp:
no longer use SLAB_DESTROY_BY_RCU") had already moved UDP socket
freeing to after the RCU grace period. This made the synchronize_net()
in vxlan_sock_release() completely redundant.
Since vxlan_sock now uses kfree_rcu() and is invoked after
udp_tunnel_sock_release(), vxlan_sock is guaranteed to be freed
either at the same time or after the UDP socket is released,
following the RCU grace period.
Let's remove the redundant synchronize_net() in vxlan_sock_release().
That made vmci_transport_recv_listen() skip vsock_remove_pending(),
leaving the pending socket on the listener's pending_links with
sk_state = TCP_CLOSE while destroy: still dropped the explicit
reference taken before schedule_delayed_work().
One second later vsock_pending_work() observed is_pending=true and
performed full cleanup: vsock_remove_pending() then the two trailing
sock_put(sk) calls -- the first reached refcount 0 and __sk_freed
the socket, and the second wrote into the freed object:
BUG: KASAN: slab-use-after-free in refcount_warn_saturate
Write of size 4 at addr ffff88800b1cac80 by task kworker
Workqueue: events vsock_pending_work
Treat peer RST like any other unexpected packet type (err = -EINVAL).
All destroy: arms now return err < 0, so vmci_transport_recv_listen()
removes pending from pending_links synchronously and
vsock_pending_work() takes the is_pending=false / !rejected branch,
dropping only its own work reference. This also closes the
multi-packet race Sashiko reported on v2: pending is removed from
the list before any subsequent packet can find it.
The pre-existing sk_acceptq_removed() gap on the err < 0 path of
vmci_transport_recv_listen() that Sashiko also noted is not
introduced or changed by this patch.
Tested on lts-6.12.79 with KASAN: 52/100 unpatched -> 0/100 patched.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Cc: stable@vger.kernel.org Signed-off-by: Minh Nguyen <minhnguyen.080505@gmail.com> Acked-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260519102310.237181-1-minhnguyen.080505@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>