This series addresses several issues in the Renesas Ethernet AVB (ravb)
driver related descriptor ordering.
A potential ordering hazard in descriptor setup could cause
the DMA engine to start prematurely, leading to TX stalls on some
platforms.
The series includes the following changes:
Enforce descriptor type ordering to prevent early DMA start
Ensure proper write ordering of TX descriptor type fields to prevent the
DMA engine from observing an incomplete descriptor chain. This fixes
observed TX stalls on RZ/G2L platforms running RT kernels.
Tested on R/G1x Gen2, RZ/G2x Gen3 and RZ/G2L family hardware.
====================
Lad Prabhakar [Fri, 17 Oct 2025 15:18:30 +0000 (16:18 +0100)]
net: ravb: Ensure memory write completes before ringing TX doorbell
Add a final dma_wmb() barrier before triggering the transmit request
(TCCR_TSRQ) to ensure all descriptor and buffer writes are visible to
the DMA engine.
According to the hardware manual, a read-back operation is required
before writing to the doorbell register to guarantee completion of
previous writes. Instead of performing a dummy read, a dma_wmb() is
used to both enforce the same ordering semantics on the CPU side and
also to ensure completion of writes.
Lad Prabhakar [Fri, 17 Oct 2025 15:18:29 +0000 (16:18 +0100)]
net: ravb: Enforce descriptor type ordering
Ensure the TX descriptor type fields are published in a safe order so the
DMA engine never begins processing a descriptor chain before all descriptor
fields are fully initialised.
For multi-descriptor transmits the driver writes DT_FEND into the last
descriptor and DT_FSTART into the first. The DMA engine begins processing
when it observes DT_FSTART. Move the dma_wmb() barrier so it executes
immediately after DT_FEND and immediately before writing DT_FSTART
(and before DT_FSINGLE in the single-descriptor case). This guarantees
that all prior CPU writes to the descriptor memory are visible to the
device before DT_FSTART is seen.
This avoids a situation where compiler/CPU reordering could publish
DT_FSTART ahead of DT_FEND or other descriptor fields, allowing the DMA to
start on a partially initialised chain and causing corrupted transmissions
or TX timeouts. Such a failure was observed on RZ/G2L with an RT kernel as
transmit queue timeouts and device resets.
Heiner Kallweit [Mon, 20 Oct 2025 06:54:54 +0000 (08:54 +0200)]
net: hibmcge: select FIXED_PHY
hibmcge uses fixed_phy_register() et al, but doesn't cater for the case
that hibmcge is built-in and fixed_phy is a module. To solve this
select FIXED_PHY.
Fixes: 1d7cd7a9c69c ("net: hibmcge: support scenario without PHY") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Jijie Shao <shaojijie@huawei.com> Link: https://patch.msgid.link/c4fc061f-b6d5-418b-a0dc-6b238cdbedce@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This issue can occur when the link state changes from off to on
(e.g., plugging or unplugging the LAN cable) while transmitting a
packet. If the skb has a destructor, a warning message may be
printed in this situation.
Jakub Kicinski [Wed, 22 Oct 2025 00:42:54 +0000 (17:42 -0700)]
Merge tag 'linux-can-fixes-for-6.18-20251020' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-10-20
All patches are by me. The first 3 update the bxcan, esd and rockchip
driver to drop skbs in xmit of the device is in listen only mode.
The last patch targets the CAN netlink implementation to allow the
disabling of automatic restart after Bus-Off, even if the a driver
doesn't implement that callback.
* tag 'linux-can-fixes-for-6.18-20251020' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: netlink: can_changelink(): allow disabling of automatic restart
can: rockchip-canfd: rkcanfd_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
can: esd: acc_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
can: bxcan: bxcan_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
====================
Given the introduction of @have_bh_lock variable, it seems the author
intent was to have the local_unlock_nested_bh() after the @unlock label.
Fixes: 25718fdcbdd2 ("net: gro_cells: Use nested-BH locking for gro_cell") Reported-by: syzbot+f9651b9a8212e1c8906f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68f65eb9.a70a0220.205af.0034.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20251020161114.1891141-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
mptcp: handle late ADD_ADDR + selftests skip
Here are a few independent fixes related to MPTCP and its selftests:
- Patch 1: correctly handle ADD_ADDR being received after the switch to
'fully-established'. A fix for another recent fix backported up to
v5.14.
- Patches 2-5: properly mark some MPTCP Join subtests as 'skipped' if
the tested kernel doesn't support the feature being validated. Some
fixes for up to v5.13, v5.18, v6.11 and v6.18-rc1 respectively.
====================
mptcp: pm: in-kernel: C-flag: handle late ADD_ADDR
The special C-flag case expects the ADD_ADDR to be received when
switching to 'fully-established'. But for various reasons, the ADD_ADDR
could be sent after the "4th ACK", and the special case doesn't work.
On NIPA, the new test validating this special case for the C-flag failed
a few times, e.g.
102 default limits, server deny join id 0
syn rx [FAIL] got 0 JOIN[s] syn rx expected 2
Server ns stats
(...)
MPTcpExtAddAddrTx 1
MPTcpExtEchoAdd 1
synack rx [FAIL] got 0 JOIN[s] synack rx expected 2
ack rx [FAIL] got 0 JOIN[s] ack rx expected 2
join Rx [FAIL] see above
syn tx [FAIL] got 0 JOIN[s] syn tx expected 2
join Tx [FAIL] see above
I had a suspicion about what the issue could be: the ADD_ADDR might have
been received after the switch to the 'fully-established' state. The
issue was not easy to reproduce. The packet capture shown that the
ADD_ADDR can indeed be sent with a delay, and the client would not try
to establish subflows to it as expected.
A simple fix is not to mark the endpoints as 'used' in the C-flag case,
when looking at creating subflows to the remote initial IP address and
port. In this case, there is no need to try.
Note: newly added fullmesh endpoints will still continue to be used as
expected, thanks to the conditions behind mptcp_pm_add_addr_c_flag_case.
Aksh Garg [Thu, 16 Oct 2025 11:57:55 +0000 (17:27 +0530)]
net: ethernet: ti: am65-cpts: fix timestamp loss due to race conditions
Resolve race conditions in timestamp events list handling between TX
and RX paths causing missed timestamps.
The current implementation uses a single events list for both TX and RX
timestamps. The am65_cpts_find_ts() function acquires the lock,
splices all events (TX as well as RX events) to a temporary list,
and releases the lock. This function performs matching of timestamps
for TX packets only. Before it acquires the lock again to put the
non-TX events back to the main events list, a concurrent RX
processing thread could acquire the lock (as observed in practice),
find an empty events list, and fail to attach timestamp to it,
even though a relevant event exists in the spliced list which is yet to
be restored to the main list.
Fix this by creating separate events lists to handle TX and RX
timestamps independently.
Fixes: c459f606f66df ("net: ethernet: ti: am65-cpts: Enable RX HW timestamp for PTP packets using CPTS FIFO") Signed-off-by: Aksh Garg <a-garg7@ti.com> Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com> Link: https://patch.msgid.link/20251016115755.1123646-1-a-garg7@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With CONFIG_DEBUG_LOCK_ALLOC=y, the smc->clcsock is unexpectedly changed
in inet_init_csk_locks(). The INET_PROTOSW_ICSK flag is no need by smc,
just remove it.
After removing the INET_PROTOSW_ICSK flag, this patch alse revert
commit 6fd27ea183c2 ("net/smc: fix lacks of icsk_syn_mss with IPPROTO_SMC")
to avoid casting smc_sock to inet_connection_sock.
Reported-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f775be4458668f7d220e Tested-by: syzbot+f775be4458668f7d220e@syzkaller.appspotmail.com Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC") Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: D. Wythe <alibuda@linux.alibaba.com> Link: https://patch.msgid.link/20251017024827.3137512-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amery Hung [Thu, 16 Oct 2025 19:55:40 +0000 (22:55 +0300)]
net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for striding RQ
XDP programs can change the layout of an xdp_buff through
bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
cannot assume the size of the linear data area nor fragments. Fix the
bug in mlx5 by generating skb according to xdp_buff after XDP programs
run.
Currently, when handling multi-buf XDP, the mlx5 driver assumes the
layout of an xdp_buff to be unchanged. That is, the linear data area
continues to be empty and fragments remain the same. This may cause
the driver to generate erroneous skb or triggering a kernel
warning. When an XDP program added linear data through
bpf_xdp_adjust_head(), the linear data will be ignored as
mlx5e_build_linear_skb() builds an skb without linear data and then
pull data from fragments to fill the linear data area. When an XDP
program has shrunk the non-linear data through bpf_xdp_adjust_tail(),
the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
data size and trigger the BUG_ON in it.
To fix the issue, first record the original number of fragments. If the
number of fragments changes after the XDP program runs, rewind the end
fragment pointer by the difference and recalculate the truesize. Then,
build the skb with the linear data area matching the xdp_buff. Finally,
only pull data in if there is non-linear data and fill the linear part
up to 256 bytes.
Fixes: f52ac7028bec ("net/mlx5e: RX, Add XDP multi-buffer support in Striding RQ") Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1760644540-899148-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amery Hung [Thu, 16 Oct 2025 19:55:39 +0000 (22:55 +0300)]
net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for legacy RQ
XDP programs can release xdp_buff fragments when calling
bpf_xdp_adjust_tail(). The driver currently assumes the number of
fragments to be unchanged and may generate skb with wrong truesize or
containing invalid frags. Fix the bug by generating skb according to
xdp_buff after the XDP program runs.
Fixes: ea5d49bdae8b ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ") Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1760644540-899148-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Fri, 17 Oct 2025 20:06:14 +0000 (16:06 -0400)]
selftests: net: fix server bind failure in sctp_vrf.sh
sctp_vrf.sh could fail:
TEST 12: bind vrf-2 & 1 in server, connect from client 1 & 2, N [FAIL]
not ok 1 selftests: net: sctp_vrf.sh # exit=3
The failure happens when the server bind in a new run conflicts with an
existing association from the previous run:
[1] ip netns exec $SERVER_NS ./sctp_hello server ...
[2] ip netns exec $CLIENT_NS ./sctp_hello client ...
[3] ip netns exec $SERVER_NS pkill sctp_hello ...
[4] ip netns exec $SERVER_NS ./sctp_hello server ...
It occurs if the client in [2] sends a message and closes immediately.
With the message unacked, no SHUTDOWN is sent. Killing the server in [3]
triggers a SHUTDOWN the client also ignores due to the unacked message,
leaving the old association alive. This causes the bind at [4] to fail
until the message is acked and the client responds to a second SHUTDOWN
after the server’s T2 timer expires (3s).
This patch fixes the issue by preventing the client from sending data.
Instead, the client blocks on recv() and waits for the server to close.
It also waits until both the server and the client sockets are fully
released in stop_server and wait_client before restarting.
Additionally, replace 2>&1 >/dev/null with -q in sysctl and grep, and
drop other redundant 2>&1 >/dev/null redirections, and fix a typo from
N to Y (connect successfully) in the description of the last test.
can: netlink: can_changelink(): allow disabling of automatic restart
Since the commit c1f3f9797c1f ("can: netlink: can_changelink(): fix NULL
pointer deref of struct can_priv::do_set_mode"), the automatic restart
delay can only be set for devices that implement the restart handler struct
can_priv::do_set_mode. As it makes no sense to configure a automatic
restart for devices that doesn't support it.
However, since systemd commit 13ce5d4632e3 ("network/can: properly handle
CAN.RestartSec=0") [1], systemd-networkd correctly handles a restart delay
of "0" (i.e. the restart is disabled). Which means that a disabled restart
is always configured in the kernel.
On systems with both changes active this causes that CAN interfaces that
don't implement a restart handler cannot be brought up by systemd-networkd.
Solve this problem by allowing a delay of "0" to be configured, even if the
device does not implement a restart handler.
Merge patch series "can: drivers: drop skb in xmit if device is in listen only mode"
Marc Kleine-Budde <mkl@pengutronix.de> says:
I notived that 3 drivers (bxcan, esd and rockchip) use the function
can_dropped_invalid_skb(), that doesn't check if the device is in listen
only mode. This series converts these driver to use the new
can_dev_dropped_skb() function.
can: rockchip-canfd: rkcanfd_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
In addition to can_dropped_invalid_skb(), the helper function
can_dev_dropped_skb() checks whether the device is in listen-only mode and
discards the skb accordingly.
Replace can_dropped_invalid_skb() by can_dev_dropped_skb() to also drop
skbs in for listen-only mode.
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Closes: https://lore.kernel.org/all/20251017-bizarre-enchanted-quokka-f3c704-mkl@pengutronix.de/ Fixes: ff60bfbaf67f ("can: rockchip_canfd: add driver for Rockchip CAN-FD controller") Link: https://patch.msgid.link/20251017-fix-skb-drop-check-v1-3-556665793fa4@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can: esd: acc_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
In addition to can_dropped_invalid_skb(), the helper function
can_dev_dropped_skb() checks whether the device is in listen-only mode and
discards the skb accordingly.
Replace can_dropped_invalid_skb() by can_dev_dropped_skb() to also drop
skbs in for listen-only mode.
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Closes: https://lore.kernel.org/all/20251017-bizarre-enchanted-quokka-f3c704-mkl@pengutronix.de/ Fixes: 9721866f07e1 ("can: esd: add support for esd GmbH PCIe/402 CAN interface family") Link: https://patch.msgid.link/20251017-fix-skb-drop-check-v1-2-556665793fa4@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can: bxcan: bxcan_start_xmit(): use can_dev_dropped_skb() instead of can_dropped_invalid_skb()
In addition to can_dropped_invalid_skb(), the helper function
can_dev_dropped_skb() checks whether the device is in listen-only mode and
discards the skb accordingly.
Replace can_dropped_invalid_skb() by can_dev_dropped_skb() to also drop
skbs in for listen-only mode.
Reported-by: Marc Kleine-Budde <mkl@pengutronix.de> Closes: https://lore.kernel.org/all/20251017-bizarre-enchanted-quokka-f3c704-mkl@pengutronix.de/ Fixes: f00647d8127b ("can: bxcan: add support for ST bxCAN controller") Link: https://patch.msgid.link/20251017-fix-skb-drop-check-v1-1-556665793fa4@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
When splitting the RTL8221B-VM-CG into C22 and C45 variants, the name was
accidentally changed to RTL8221B-VN-CG. This patch brings back the previous
part number.
Fixes: ad5ce743a6b0 ("net: phy: realtek: Add driver instances for rtl8221b via Clause 45") Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251016192325.2306757-1-olek2@wp.pl Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ioana Ciornei [Thu, 16 Oct 2025 13:58:07 +0000 (16:58 +0300)]
dpaa2-eth: fix the pointer passed to PTR_ALIGN on Tx path
The blamed commit increased the needed headroom to account for
alignment. This means that the size required to always align a Tx buffer
was added inside the dpaa2_eth_needed_headroom() function. By doing
that, a manual adjustment of the pointer passed to PTR_ALIGN() was no
longer correct since the 'buffer_start' variable was already pointing
to the start of the skb's memory.
The behavior of the dpaa2-eth driver without this patch was to drop
frames on Tx even when the headroom was matching the 128 bytes
necessary. Fix this by removing the manual adjust of 'buffer_start' from
the PTR_MODE call.
Closes: https://lore.kernel.org/netdev/70f0dcd9-1906-4d13-82df-7bbbbe7194c6@app.fastmail.com/T/#u Fixes: f422abe3f23d ("dpaa2-eth: increase the needed headroom to account for alignment") Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Tested-by: Mathew McBride <matt@traverse.com.au> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251016135807.360978-1-ioana.ciornei@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tonghao Zhang [Thu, 16 Oct 2025 12:51:36 +0000 (20:51 +0800)]
net: bonding: update the slave array for broadcast mode
This patch fixes ce7a381697cb ("net: bonding: add broadcast_neighbor option for 802.3ad").
Before this commit, on the broadcast mode, all devices were traversed using the
bond_for_each_slave_rcu. This patch supports traversing devices by using all_slaves.
Therefore, we need to update the slave array when enslave or release slave.
Fixes: ce7a381697cb ("net: bonding: add broadcast_neighbor option for 802.3ad") Cc: Simon Horman <horms@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Lunn <andrew+netdev@lunn.ch> Cc: <stable@vger.kernel.org> Reported-by: Jiri Slaby <jirislaby@kernel.org> Tested-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/all/a97e6e1e-81bc-4a79-8352-9e4794b0d2ca@kernel.org/ Signed-off-by: Tonghao Zhang <tonghao@bamaicloud.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Link: https://patch.msgid.link/20251016125136.16568-1-tonghao@bamaicloud.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bagas Sanjaya [Thu, 16 Oct 2025 09:39:37 +0000 (16:39 +0700)]
Documentation: net: net_failover: Separate cloud-ifupdown-helper and reattach-vf.sh code blocks marker
cloud-ifupdown-helper patch and reattach-vf.sh script are rendered in
htmldocs output as normal paragraphs instead of literal code blocks
due to missing separator from respective code block marker. Add it.
Wei Fang [Thu, 16 Oct 2025 08:01:31 +0000 (16:01 +0800)]
net: enetc: correct the value of ENETC_RXB_TRUESIZE
The ENETC RX ring uses the page halves flipping mechanism, each page is
split into two halves for the RX ring to use. And ENETC_RXB_TRUESIZE is
defined to 2048 to indicate the size of half a page. However, the page
size is configurable, for ARM64 platform, PAGE_SIZE is default to 4K,
but it could be configured to 16K or 64K.
When PAGE_SIZE is set to 16K or 64K, ENETC_RXB_TRUESIZE is not correct,
and the RX ring will always use the first half of the page. This is not
consistent with the description in the relevant kernel doc and commit
messages.
This issue is invisible in most cases, but if users want to increase
PAGE_SIZE to receive a Jumbo frame with a single buffer for some use
cases, it will not work as expected, because the buffer size of each
RX BD is fixed to 2048 bytes.
Based on the above two points, we expect to correct ENETC_RXB_TRUESIZE
to (PAGE_SIZE >> 1), as described in the comment.
Jianpeng Chang [Wed, 15 Oct 2025 02:14:27 +0000 (10:14 +0800)]
net: enetc: fix the deadlock of enetc_mdio_lock
After applying the workaround for err050089, the LS1028A platform
experiences RCU stalls on RT kernel. This issue is caused by the
recursive acquisition of the read lock enetc_mdio_lock. Here list some
of the call stacks identified under the enetc_poll path that may lead to
a deadlock:
After enetc_poll acquires the read lock, a higher-priority writer attempts
to acquire the lock, causing preemption. The writer detects that a
read lock is already held and is scheduled out. However, readers under
enetc_poll cannot acquire the read lock again because a writer is already
waiting, leading to a thread hang.
Currently, the deadlock is avoided by adjusting enetc_lock_mdio to prevent
recursive lock acquisition.
Fixes: 6d36ecdbc441 ("net: enetc: take the MDIO lock only once per NAPI poll cycle") Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Acked-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20251015021427.180757-1-jianpeng.chang.cn@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On all platforms set_clock_selection() writes to a GRF register. This
requires certain clocks running and thus should happen before the
clocks are disabled.
This has been noticed on RK3576 Sige5, which hangs during system suspend
when trying to suspend the second network interface. Note, that
suspending the first interface works, because the second device ensures
that the necessary clocks for the GRF are enabled.
Cc: stable@vger.kernel.org Fixes: 2f2b60a0ec28 ("net: ethernet: stmmac: dwmac-rk: Add gmac support for rk3588") Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251014-rockchip-network-clock-fix-v1-1-c257b4afdf75@collabora.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnetlink: Allow deleting FDB entries in user namespace
Creating FDB entries is possible from a non-initial user namespace when
having CAP_NET_ADMIN, yet, when deleting FDB entries, processes receive
an EPERM because the capability is always checked against the initial
user namespace. This restricts the FDB management from unprivileged
containers.
Drop the netlink_capable check in rtnl_fdb_del as it was originally
dropped in c5c351088ae7 and reintroduced in 1690be63a27b without
intention.
This patch was tested using a container on GyroidOS, where it was
possible to delete FDB entries from an unprivileged user namespace and
private network namespace.
Fixes: 1690be63a27b ("bridge: Add vlan support to static neighbors") Reviewed-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> Tested-by: Harshal Gohel <hg@simonwunderlich.de> Signed-off-by: Johannes Wiesböck <johannes.wiesboeck@aisec.fraunhofer.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20251015201548.319871-1-johannes.wiesboeck@aisec.fraunhofer.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'accel' parameter of mlx5e_txwqe_build_eseg_csum() and the similar
'state' parameter of mlx5e_accel_tx_ids_len() were NULL when called
from mlx5i_sq_xmit() and were causing kernel panics from that context.
Fix that by passing in a local empty mlx5e_accel_tx_state variable, thus
guaranteeing that 'accel' is never NULL. Also remove an unnecessary
check from mlx5e_tx_wqe_inline_mode().
Fixes: e5a1861a298e ("net/mlx5e: Implement PSP Tx data path") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1760511923-890650-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 15 Oct 2025 06:32:21 +0000 (06:32 +0000)]
net: gro: clear skb_shinfo(skb)->hwtstamps in napi_reuse_skb()
Some network drivers assume this field is zero after napi_get_frags().
We must clear it in napi_reuse_skb() otherwise the following can happen:
1) A packet is received, and skb_shinfo(skb)->hwtstamps is populated
because a bit in the receive descriptor announced hwtstamp
availability for this packet.
2) Packet is given to gro layer via napi_gro_frags().
3) Packet is merged to a prior one held in GRO queues.
4) skb is saved after some cleanup in napi->skb via a call
to napi_reuse_skb().
5) Next packet is received 10 seconds later, gets the recycled skb
from napi_get_frags().
6) The receive descriptor does not announce hwtstamp availability.
Driver does not clear shinfo->hwtstamps.
7) We have in shinfo->hwtstamps an old timestamp.
Fixes: ac45f602ee3d ("net: infrastructure for hardware time stamping") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://patch.msgid.link/20251015063221.4171986-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Return 1 instead of 0 in invalid case in mlx5e_mpwrq_umr_entry_size()
When building with Clang 20 or newer, there are some objtool warnings
from unexpected fallthroughs to other functions:
vmlinux.o: warning: objtool: mlx5e_mpwrq_mtts_per_wqe() falls through to next function mlx5e_mpwrq_max_num_entries()
vmlinux.o: warning: objtool: mlx5e_mpwrq_max_log_rq_size() falls through to next function mlx5e_get_linear_rq_headroom()
LLVM 20 contains an (admittedly problematic [1]) optimization [2] to
convert divide by zero into the equivalent of __builtin_unreachable(),
which invokes undefined behavior and destroys code generation when it is
encountered in a control flow graph.
mlx5e_mpwrq_umr_entry_size() returns 0 in the default case of an
unrecognized mlx5e_mpwrq_umr_mode value. mlx5e_mpwrq_mtts_per_wqe(),
which is inlined into mlx5e_mpwrq_max_log_rq_size(), uses the result of
mlx5e_mpwrq_umr_entry_size() in a divide operation without checking for
zero, so LLVM is able to infer there will be a divide by zero in this
case and invokes undefined behavior. While there is some proposed work
to isolate this undefined behavior and avoid the destructive code
generation that results in these objtool warnings, code should still be
defensive against divide by zero.
As the WARN_ONCE() implies that an invalid value should be handled
gracefully, return 1 instead of 0 in the default case so that the
results of this division operation is always valid.
Michal Pecio [Tue, 14 Oct 2025 18:35:28 +0000 (20:35 +0200)]
net: usb: rtl8150: Fix frame padding
TX frames aren't padded and unknown memory is sent into the ether.
Theoretically, it isn't even guaranteed that the extra memory exists
and can be sent out, which could cause further problems. In practice,
I found that plenty of tailroom exists in the skb itself (in my test
with ping at least) and skb_padto() easily succeeds, so use it here.
In the event of -ENOMEM drop the frame like other drivers do.
The use of one more padding byte instead of a USB zero-length packet
is retained to avoid regression. I have a dodgy Etron xHCI controller
which doesn't seem to support sending ZLPs at all.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Michal Pecio <michal.pecio@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251014203528.3f9783c4.michal.pecio@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 16 Oct 2025 16:41:21 +0000 (09:41 -0700)]
Merge tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from CAN
Current release - regressions:
- udp: do not use skb_release_head_state() before
skb_attempt_defer_free()
- gro_cells: use nested-BH locking for gro_cell
- dpll: zl3073x: increase maximum size of flash utility
Previous releases - regressions:
- core: fix lockdep splat on device unregister
- tcp: fix tcp_tso_should_defer() vs large RTT
- tls:
- don't rely on tx_work during send()
- wait for pending async decryptions if tls_strp_msg_hold fails
- can: j1939: add missing calls in NETDEV_UNREGISTER notification
handler
- eth: lan78xx: fix lost EEPROM write timeout in
lan78xx_write_raw_eeprom
Previous releases - always broken:
- ip6_tunnel: prevent perpetual tunnel growth
- dpll: zl3073x: handle missing or corrupted flash configuration
- can: m_can: fix pm_runtime and CAN state handling
- eth:
- ixgbe: fix too early devlink_free() in ixgbe_remove()
- ixgbevf: fix mailbox API compatibility
- gve: Check valid ts bit on RX descriptor before hw timestamping
- idpf: cleanup remaining SKBs in PTP flows
- r8169: fix packet truncation after S4 resume on RTL8168H/RTL8111H"
* tag 'net-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
netdevsim: set the carrier when the device goes up
selftests: tls: add test for short splice due to full skmsg
selftests: net: tls: add tests for cmsg vs MSG_MORE
tls: don't rely on tx_work during send()
tls: wait for pending async decryptions if tls_strp_msg_hold fails
tls: always set record_type in tls_process_cmsg
tls: wait for async encrypt in case of error during latter iterations of sendmsg
tls: trim encrypted message to match the plaintext on short splice
tg3: prevent use of uninitialized remote_adv and local_adv variables
MAINTAINERS: new entry for IPv6 IOAM
gve: Check valid ts bit on RX descriptor before hw timestamping
net: core: fix lockdep splat on device unregister
MAINTAINERS: add myself as maintainer for b53
selftests: net: check jq command is supported
net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()
tcp: fix tcp_tso_should_defer() vs large RTT
r8152: add error handling in rtl8152_driver_init
usbnet: Fix using smp_processor_id() in preemptible code warnings
...
Linus Torvalds [Thu, 16 Oct 2025 16:39:29 +0000 (09:39 -0700)]
Merge tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Do not print an error message (and assume that the General Purpose
Log Directory log page is not supported) for a device that reports a
bogus General Purpose Logging Version.
Unsurprisingly, many vendors fail to report the only valid General
Purpose Logging Version (Damien)
* tag 'ata-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: relax checks in ata_read_log_directory()
Eric Dumazet [Wed, 15 Oct 2025 05:27:15 +0000 (05:27 +0000)]
udp: do not use skb_release_head_state() before skb_attempt_defer_free()
Michal reported and bisected an issue after recent adoption
of skb_attempt_defer_free() in UDP.
The issue here is that skb_release_head_state() is called twice per skb,
one time from skb_consume_udp(), then a second time from skb_defer_free_flush()
and napi_consume_skb().
As Sabrina suggested, remove skb_release_head_state() call from
skb_consume_udp().
Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in skb_attempt_defer_free()
Many thanks to Michal, Sabrina, Paolo and Florian for their help.
Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()") Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz> Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Cc: Sabrina Dubroca <sd@queasysnail.net> Cc: Florian Westphal <fw@strlen.de> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
I Viswanath [Mon, 13 Oct 2025 18:16:48 +0000 (23:46 +0530)]
net: usb: lan78xx: fix use of improperly initialized dev->chipid in lan78xx_reset
dev->chipid is used in lan78xx_init_mac_address before it's initialized:
lan78xx_reset() {
lan78xx_init_mac_address()
lan78xx_read_eeprom()
lan78xx_read_raw_eeprom() <- dev->chipid is used here
dev->chipid = ... <- dev->chipid is initialized correctly here
}
Reorder initialization so that dev->chipid is set before calling
lan78xx_init_mac_address().
Fixes: a0db7d10b76e ("lan78xx: Add to handle mux control per chip id") Signed-off-by: I Viswanath <viswanathiyyappan@gmail.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Khalid Aziz <khalid@kernel.org> Link: https://patch.msgid.link/20251013181648.35153-1-viswanathiyyappan@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 16 Oct 2025 00:56:20 +0000 (17:56 -0700)]
Merge tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-10-14
The first 2 paches are by Celeste Liu and target the gS_usb driver.
The first patch remove the limitation to 3 CAN interface per USB
device. The second patch adds the missing population of
net_device->dev_port.
The next 4 patches are by me and fix the m_can driver. They add a
missing pm_runtime_disable(), fix the CAN state transition back to
Error Active and fix the state after ifup and suspend/resume.
Another patch by me targets the m_can driver, too and replaces Dong
Aisheng's old email address.
The next 2 patches are by Vincent Mailhol and update the CAN
networking Documentation.
Tetsuo Handa contributes the last patch that add missing cleanup calls
in the NETDEV_UNREGISTER notification handler.
* tag 'linux-can-fixes-for-6.18-20251014' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: j1939: add missing calls in NETDEV_UNREGISTER notification handler
can: add Transmitter Delay Compensation (TDC) documentation
can: remove false statement about 1:1 mapping between DLC and length
can: m_can: replace Dong Aisheng's old email address
can: m_can: fix CAN state in system PM
can: m_can: m_can_chip_config(): bring up interface in correct state
can: m_can: m_can_handle_state_errors(): fix CAN state transition to Error Active
can: m_can: m_can_plat_remove(): add missing pm_runtime_disable()
can: gs_usb: gs_make_candev(): populate net_device->dev_port
can: gs_usb: increase max interface to U8_MAX
====================
Breno Leitao [Tue, 14 Oct 2025 09:17:25 +0000 (02:17 -0700)]
netdevsim: set the carrier when the device goes up
Bringing a linked netdevsim device down and then up causes communication
failure because both interfaces lack carrier. Basically a ifdown/ifup on
the interface make the link broken.
Commit 3762ec05a9fbda ("netdevsim: add NAPI support") added supported
for NAPI, calling netif_carrier_off() in nsim_stop(). This patch
re-enables the carrier symmetrically on nsim_open(), in case the device
is linked and the peer is up.
Jakub Kicinski [Thu, 16 Oct 2025 00:41:47 +0000 (17:41 -0700)]
Merge branch 'tls-misc-bugfixes'
Sabrina Dubroca says:
====================
tls: misc bugfixes
Jann Horn reported multiple bugs in kTLS. This series addresses them,
and adds some corresponding selftests for those that are reproducible
(and without failure injection).
====================
Sabrina Dubroca [Tue, 14 Oct 2025 09:17:01 +0000 (11:17 +0200)]
selftests: net: tls: add tests for cmsg vs MSG_MORE
We don't have a test to check that MSG_MORE won't let us merge records
of different types across sendmsg calls.
Add new tests that check:
- MSG_MORE is only allowed for DATA records
- a pending DATA record gets closed and pushed before a non-DATA
record is processed
Sabrina Dubroca [Tue, 14 Oct 2025 09:17:00 +0000 (11:17 +0200)]
tls: don't rely on tx_work during send()
With async crypto, we rely on tx_work to actually transmit records
once encryption completes. But while send() is running, both the
tx_lock and socket lock are held, so tx_work_handler cannot process
the queue of encrypted records, and simply reschedules itself. During
a large send(), this could last a long time, and use a lot of memory.
Transmit any pending encrypted records before restarting the main
loop of tls_sw_sendmsg_locked.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:59 +0000 (11:16 +0200)]
tls: wait for pending async decryptions if tls_strp_msg_hold fails
Async decryption calls tls_strp_msg_hold to create a clone of the
input skb to hold references to the memory it uses. If we fail to
allocate that clone, proceeding with async decryption can lead to
various issues (UAF on the skb, writing into userspace memory after
the recv() call has returned).
In this case, wait for all pending decryption requests.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:58 +0000 (11:16 +0200)]
tls: always set record_type in tls_process_cmsg
When userspace wants to send a non-DATA record (via the
TLS_SET_RECORD_TYPE cmsg), we need to send any pending data from a
previous MSG_MORE send() as a separate DATA record. If that DATA record
is encrypted asynchronously, tls_handle_open_record will return
-EINPROGRESS. This is currently treated as an error by
tls_process_cmsg, and it will skip setting record_type to the correct
value, but the caller (tls_sw_sendmsg_locked) handles that return
value correctly and proceeds with sending the new message with an
incorrect record_type (DATA instead of whatever was requested in the
cmsg).
Always set record_type before handling the open record. If
tls_handle_open_record returns an error, record_type will be
ignored. If it succeeds, whether with synchronous crypto (returning 0)
or asynchronous (returning -EINPROGRESS), the caller will proceed
correctly.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:57 +0000 (11:16 +0200)]
tls: wait for async encrypt in case of error during latter iterations of sendmsg
If we hit an error during the main loop of tls_sw_sendmsg_locked (eg
failed allocation), we jump to send_end and immediately
return. Previous iterations may have queued async encryption requests
that are still pending. We should wait for those before returning, as
we could otherwise be reading from memory that userspace believes
we're not using anymore, which would be a sort of use-after-free.
This is similar to what tls_sw_recvmsg already does: failures during
the main loop jump to the "wait for async" code, not straight to the
unlock/return.
Sabrina Dubroca [Tue, 14 Oct 2025 09:16:56 +0000 (11:16 +0200)]
tls: trim encrypted message to match the plaintext on short splice
During tls_sw_sendmsg_locked, we pre-allocate the encrypted message
for the size we're expecting to send during the current iteration, but
we may end up sending less, for example when splicing: if we're
getting the data from small fragments of memory, we may fill up all
the slots in the skmsg with less data than expected.
In this case, we need to trim the encrypted message to only the length
we actually need, to avoid pushing uninitialized bytes down the
underlying TCP socket.
Justin Iurman [Tue, 14 Oct 2025 17:06:50 +0000 (19:06 +0200)]
MAINTAINERS: new entry for IPv6 IOAM
Create a maintainer entry for IPv6 IOAM. Add myself as I authored most
if not all of the IPv6 IOAM code in the kernel and actively participate
in the related IETF groups.
Tim Hostetler [Tue, 14 Oct 2025 00:47:39 +0000 (00:47 +0000)]
gve: Check valid ts bit on RX descriptor before hw timestamping
The device returns a valid bit in the LSB of the low timestamp byte in
the completion descriptor that the driver should check before
setting the SKB's hardware timestamp. If the timestamp is not valid, do not
hardware timestamp the SKB.
Cc: stable@vger.kernel.org Fixes: b2c7aeb49056 ("gve: Implement ndo_hwtstamp_get/set for RX timestamping") Reviewed-by: Joshua Washington <joshwash@google.com> Signed-off-by: Tim Hostetler <thostet@google.com> Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20251014004740.2775957-1-hramamurthy@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 15 Oct 2025 14:57:28 +0000 (07:57 -0700)]
Remove long-stale ext3 defconfig option
Inspired by commit c065b6046b34 ("Use CONFIG_EXT4_FS instead of
CONFIG_EXT3_FS in all of the defconfigs") I looked around for any other
left-over EXT3 config options, and found some old defconfig files still
mentioned CONFIG_EXT3_DEFAULTS_TO_ORDERED.
That config option was removed a decade ago in commit c290ea01abb7 ("fs:
Remove ext3 filesystem driver"). It had a good run, but let's remove it
for good.
Linus Torvalds [Wed, 15 Oct 2025 14:51:57 +0000 (07:51 -0700)]
Merge tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
- Fix regression caused by removing CONFIG_EXT3_FS when testing some
very old defconfigs
- Avoid a BUG_ON when opening a file on a maliciously corrupted file
system
- Avoid mm warnings when freeing a very large orphan file metadata
- Avoid a theoretical races between metadata writeback and checkpoints
(it's very hard to hit in practice, since the race requires that the
writeback take a very long time)
* tag 'ext4_for_linus-6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
ext4: free orphan info with kvfree
ext4: detect invalid INLINE_DATA + EXTENTS flag combination
ext4, doc: fix and improve directory hash tree description
ext4: wait for ongoing I/O to complete before freeing blocks
jbd2: ensure that all ongoing I/O complete before freeing blocks
PM / devfreq: rockchip-dfi: switch to FIELD_PREP_WM16 macro
The era of hand-rolled HIWORD_UPDATE macros is over, at least for those
drivers that use constant masks.
Like many other Rockchip drivers, rockchip-dfi brings with it its own
HIWORD_UPDATE macro. This variant doesn't shift the value (and like the
others, doesn't do any checking).
Remove it, and replace instances of it with hw_bitfield.h's
FIELD_PREP_WM16. Since FIELD_PREP_WM16 requires contiguous masks and
shifts the value for us, some reshuffling of definitions needs to
happen.
This gives us better compile-time error checking, and in my opinion,
nicer code.
Tested on an RK3568 ODROID-M1 board (LPDDR4X at 1560 MHz, an RK3588
Radxa ROCK 5B board (LPDDR4X at 2112 MHz) and an RK3588 Radxa ROCK 5T
board (LPDDR5 at 2400 MHz). perf measurements were consistent with the
measurements of stress-ng --stream in all cases.
Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@collabora.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Florian Westphal [Mon, 13 Oct 2025 18:50:52 +0000 (20:50 +0200)]
net: core: fix lockdep splat on device unregister
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.
If the device list is too long, this will lock more device mutexes than
lockdep can handle:
unshare -n \
bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'
BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48 max: 48!
48 locks held by kworker/u16:1/69:
#0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
#1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
#2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
#3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
#4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]
Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.
Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.
Fixes: 7e4d784f5810 ("net: hold netdev instance lock during rtnetlink operations") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20251013185052.14021-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Tue, 14 Oct 2025 16:15:45 +0000 (09:15 -0700)]
Merge tag 'for-linus-6.18-2' of https://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"A few bug fixes for patches that went in this release: a refcount
error and some missing or incorrect error checks"
* tag 'for-linus-6.18-2' of https://github.com/cminyard/linux-ipmi:
ipmi: Fix handling of messages with provided receive message pointer
mfd: ls2kbmc: check for devm_mfd_add_devices() failure
mfd: ls2kbmc: Fix an IS_ERR() vs NULL check in probe()
Wang Liang [Mon, 13 Oct 2025 08:00:39 +0000 (16:00 +0800)]
selftests: net: check jq command is supported
The jq command is used in vlan_bridge_binding.sh, if it is not supported,
the test will spam the following log.
# ./vlan_bridge_binding.sh: line 51: jq: command not found
# ./vlan_bridge_binding.sh: line 51: jq: command not found
# ./vlan_bridge_binding.sh: line 51: jq: command not found
# ./vlan_bridge_binding.sh: line 51: jq: command not found
# ./vlan_bridge_binding.sh: line 51: jq: command not found
# TEST: Test bridge_binding on->off when lower down [FAIL]
# Got operstate of , expected 0
The rtnetlink.sh has the same problem. It makes sense to check if jq is
installed before running these tests. After this patch, the
vlan_bridge_binding.sh skipped if jq is not supported:
# timeout set to 3600
# selftests: net: vlan_bridge_binding.sh
# TEST: jq not installed [SKIP]
Fixes: dca12e9ab760 ("selftests: net: Add a VLAN bridge binding selftest") Fixes: 6a414fd77f61 ("selftests: rtnetlink: Add an address proto test") Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20251013080039.3035898-1-wangliang74@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lorenzo Bianconi [Sun, 12 Oct 2025 09:19:44 +0000 (11:19 +0200)]
net: airoha: Take into account out-of-order tx completions in airoha_dev_xmit()
Completion napi can free out-of-order tx descriptors if hw QoS is
enabled and packets with different priority are queued to same DMA ring.
Take into account possible out-of-order reports checking if the tx queue
is full using circular buffer head/tail pointer instead of the number of
queued packets.
Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC") Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251012-airoha-tx-busy-queue-v2-1-a600b08bab2d@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Eric Dumazet [Sat, 11 Oct 2025 11:57:42 +0000 (11:57 +0000)]
tcp: fix tcp_tso_should_defer() vs large RTT
Neal reported that using neper tcp_stream with TCP_TX_DELAY
set to 50ms would often lead to flows stuck in a small cwnd mode,
regardless of the congestion control.
While tcp_stream sets TCP_TX_DELAY too late after the connect(),
it highlighted two kernel bugs.
The following heuristic in tcp_tso_should_defer() seems wrong
for large RTT:
delta = tp->tcp_clock_cache - head->tstamp;
/* If next ACK is likely to come too late (half srtt), do not defer */
if ((s64)(delta - (u64)NSEC_PER_USEC * (tp->srtt_us >> 4)) < 0)
goto send_now;
If next ACK is expected to come in more than 1 ms, we should
not defer because we prefer a smooth ACK clocking.
While blamed commit was a step in the good direction, it was not
generic enough.
Another patch fixing TCP_TX_DELAY for established flows
will be proposed when net-next reopens.
For historical and portability reasons, the netif_rx() is usually
run in the softirq or interrupt context, this commit therefore add
local_bh_disable/enable() protection in the usbnet_resume_rx().
Raju Rangoju [Fri, 10 Oct 2025 06:51:42 +0000 (12:21 +0530)]
amd-xgbe: Avoid spurious link down messages during interface toggle
During interface toggle operations (ifdown/ifup), the driver currently
resets the local helper variable 'phy_link' to -1. This causes the link
state machine to incorrectly interpret the state as a link change event,
resulting in spurious "Link is down" messages being logged when the
interface is brought back up.
Preserve the phy_link state across interface toggles to avoid treating
the -1 sentinel value as a legitimate link state transition.
Fixes: 88131a812b16 ("amd-xgbe: Perform phy connect/disconnect at dev open/stop") Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Link: https://patch.msgid.link/20251010065142.1189310-1-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Theodore Ts'o [Tue, 14 Oct 2025 01:50:40 +0000 (21:50 -0400)]
Use CONFIG_EXT4_FS instead of CONFIG_EXT3_FS in all of the defconfigs
Commit d6ace46c82fd ("ext4: remove obsolete EXT3 config options")
removed the obsolete EXT3_CONFIG options, since it had been over a
decade since fs/ext3 had been removed. Unfortunately, there were a
number of defconfigs that still used CONFIG_EXT3_FS which the cleanup
commit didn't fix up. This led to a large number of defconfig test
builds to fail. Oops.
Marek Vasut [Sat, 11 Oct 2025 11:02:49 +0000 (13:02 +0200)]
net: phy: realtek: Avoid PHYCR2 access if PHYCR2 not present
The driver is currently checking for PHYCR2 register presence in
rtl8211f_config_init(), but it does so after accessing PHYCR2 to
disable EEE. This was introduced in commit bfc17c165835 ("net:
phy: realtek: disable PHY-mode EEE"). Move the PHYCR2 presence
test before the EEE disablement and simplify the code.
Fixes: bfc17c165835 ("net: phy: realtek: disable PHY-mode EEE") Signed-off-by: Marek Vasut <marek.vasut@mailbox.org> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20251011110309.12664-1-marek.vasut@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Intel Wired LAN Driver Updates 2025-10-01 (idpf, ixgbe, ixgbevf)
For idpf:
Milena fixes a memory leak in the idpf reset logic when the driver resets
with an outstanding Tx timestamp.
For ixgbe and ixgbevf:
Jedrzej fixes an issue with reporting link speed on E610 VFs.
Jedrzej also fixes the VF mailbox API incompatibilities caused by the
confusion with API v1.4, v1.5, and v1.6. The v1.4 API introduced IPSEC
offload, but this was only supported on Linux hosts. The v1.5 API
introduced a new mailbox API which is necessary to resolve issues on ESX
hosts. The v1.6 API introduced a new link management API for E610. Jedrzej
introduces a new v1.7 API with a feature negotiation which enables properly
checking if features such as IPSEC or the ESX mailbox APIs are supported.
This resolves issues with compatibility on different hosts, and aligns the
API across hosts instead of having Linux require custom mailbox API
versions for IPSEC offload.
Koichiro fixes a KASAN use-after-free bug in ixgbe_remove().
====================
Koichiro Den [Fri, 10 Oct 2025 00:03:51 +0000 (17:03 -0700)]
ixgbe: fix too early devlink_free() in ixgbe_remove()
Since ixgbe_adapter is embedded in devlink, calling devlink_free()
prematurely in the ixgbe_remove() path can lead to UAF. Move devlink_free()
to the end.
ixgbevf: fix mailbox API compatibility by negotiating supported features
There was backward compatibility in the terms of mailbox API. Various
drivers from various OSes supporting 10G adapters from Intel portfolio
could easily negotiate mailbox API.
This convention has been broken since introducing API 1.4.
Commit 0062e7cc955e ("ixgbevf: add VF IPsec offload code") added support
for IPSec which is specific only for the kernel ixgbe driver. None of the
rest of the Intel 10G PF/VF drivers supports it. And actually lack of
support was not included in the IPSec implementation - there were no such
code paths. No possibility to negotiate support for the feature was
introduced along with introduction of the feature itself.
Commit 339f28964147 ("ixgbevf: Add support for new mailbox communication
between PF and VF") increasing API version to 1.5 did the same - it
introduced code supported specifically by the PF ESX driver. It altered API
version for the VF driver in the same time not touching the version
defined for the PF ixgbe driver. It led to additional discrepancies,
as the code provided within API 1.6 cannot be supported for Linux ixgbe
driver as it causes crashes.
The issue was noticed some time ago and mitigated by Jake within the commit d0725312adf5 ("ixgbevf: stop attempting IPSEC offload on Mailbox API 1.5").
As a result we have regression for IPsec support and after increasing API
to version 1.6 ixgbevf driver stopped to support ESX MBX.
To fix this mess add new mailbox op asking PF driver about supported
features. Basing on a response determine whether to set support for IPSec
and ESX-specific enhanced mailbox.
New mailbox op, for compatibility purposes, must be added within new API
revision, as API version of OOT PF & VF drivers is already increased to
1.6 and doesn't incorporate features negotiate op.
Features negotiation mechanism gives possibility to be extended with new
features when needed in the future.
Reported-by: Jacob Keller <jacob.e.keller@intel.com> Closes: https://lore.kernel.org/intel-wired-lan/20241101-jk-ixgbevf-mailbox-v1-5-fixes-v1-0-f556dc9a66ed@intel.com/ Fixes: 0062e7cc955e ("ixgbevf: add VF IPsec offload code") Fixes: 339f28964147 ("ixgbevf: Add support for new mailbox communication between PF and VF") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-4-ef32a425b92a@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Update supported API version and provide handler for
IXGBE_VF_GET_PF_LINK_STATE cmd.
Simply put stored values of link speed and link_up from adapter context.
ixgbevf: fix getting link speed data for E610 devices
E610 adapters no longer use the VFLINKS register to read PF's link
speed and linkup state. As a result VF driver cannot get actual link
state and it incorrectly reports 10G which is the default option.
It leads to a situation where even 1G adapters print 10G as actual
link speed. The same happens when PF driver set speed different than 10G.
Add new mailbox operation to let the VF driver request a PF driver
to provide actual link data. Update the mailbox api to v1.6.
Incorporate both ways of getting link status within the legacy
ixgbe_check_mac_link_vf() function.
Fixes: 4c44b450c69b ("ixgbevf: Add support for Intel(R) E610 device") Co-developed-by: Andrzej Wilczynski <andrzejx.wilczynski@intel.com> Signed-off-by: Andrzej Wilczynski <andrzejx.wilczynski@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-2-ef32a425b92a@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Milena Olech [Fri, 10 Oct 2025 00:03:46 +0000 (17:03 -0700)]
idpf: cleanup remaining SKBs in PTP flows
When the driver requests Tx timestamp value, one of the first steps is
to clone SKB using skb_get. It increases the reference counter for that
SKB to prevent unexpected freeing by another component.
However, there may be a case where the index is requested, SKB is
assigned and never consumed by PTP flows - for example due to reset during
running PTP apps.
Add a check in release timestamping function to verify if the SKB
assigned to Tx timestamp latch was freed, and release remaining SKBs.
Fixes: 4901e83a94ef ("idpf: add Tx timestamp capabilities negotiation") Signed-off-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20251009-jk-iwl-net-2025-10-01-v3-1-ef32a425b92a@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dmitry Safonov [Thu, 9 Oct 2025 15:02:19 +0000 (16:02 +0100)]
net/ip6_tunnel: Prevent perpetual tunnel growth
Similarly to ipv4 tunnel, ipv6 version updates dev->needed_headroom, too.
While ipv4 tunnel headroom adjustment growth was limited in
commit 5ae1e9922bbd ("net: ip_tunnel: prevent perpetual headroom growth"),
ipv6 tunnel yet increases the headroom without any ceiling.
Reflect ipv4 tunnel headroom adjustment limit on ipv6 version.
Credits to Francesco Ruggeri, who was originally debugging this issue
and wrote local Arista-specific patch and a reproducer.
The Broadcom bcm54811 is hardware-strapped to select among RGMII and
GMII/MII/MII-Lite modes. However, the corresponding bit, RGMII Enable
in Miscellaneous Control Register must be also set to select desired
RGMII or MII(-lite)/GMII mode.
Linmao Li [Thu, 9 Oct 2025 12:25:49 +0000 (20:25 +0800)]
r8169: fix packet truncation after S4 resume on RTL8168H/RTL8111H
After resume from S4 (hibernate), RTL8168H/RTL8111H truncates incoming
packets. Packet captures show messages like "IP truncated-ip - 146 bytes
missing!".
The issue is caused by RxConfig not being properly re-initialized after
resume. Re-initializing the RxConfig register before the chip
re-initialization sequence avoids the truncation and restores correct
packet reception.
This follows the same pattern as commit ef9da46ddef0 ("r8169: fix data
corruption issue on RTL8402").
Fixes: 6e1d0b898818 ("r8169:add support for RTL8168H and RTL8107E") Signed-off-by: Linmao Li <lilinmao@kylinos.cn> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/20251009122549.3955845-1-lilinmao@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: gro_cells: Use nested-BH locking for gro_cell
The gro_cell data structure is per-CPU variable and relies on disabled
BH for its locking. Without per-CPU locking in local_bh_disable() on
PREEMPT_RT this data structure requires explicit locking.
Add a local_lock_t to the data structure and use
local_lock_nested_bh() for locking. This change adds only lockdep
coverage and does not alter the functional behaviour for !PREEMPT_RT.
Reported-by: syzbot+8715dd783e9b0bef43b1@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68c6c3b1.050a0220.2ff435.0382.GAE@google.com/ Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251009094338.j1jyKfjR@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Wed, 8 Oct 2025 14:14:45 +0000 (16:14 +0200)]
dpll: zl3073x: Handle missing or corrupted flash configuration
If the internal flash contains missing or corrupted configuration,
basic communication over the bus still functions, but the device
is not capable of normal operation (for example, using mailboxes).
This condition is indicated in the info register by the ready bit.
If this bit is cleared, the probe procedure times out while fetching
the device state.
Handle this case by checking the ready bit value in zl3073x_dev_start()
and skipping DPLL device and pin registration if it is cleared.
Do not report this condition as an error, allowing the devlink device
to be registered and enabling the user to flash the correct configuration.
Prior this patch:
[ 31.112299] zl3073x-i2c 1-0070: Failed to fetch input state: -ETIMEDOUT
[ 31.116332] zl3073x-i2c 1-0070: error -ETIMEDOUT: Failed to start device
[ 31.136881] zl3073x-i2c 1-0070: probe with driver zl3073x-i2c failed with error -110
After this patch:
[ 41.011438] zl3073x-i2c 1-0070: FW not fully ready - missing or corrupted config
Fixes: 75a71ecc24125 ("dpll: zl3073x: Register DPLL devices and pins") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251008141445.841113-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
can: j1939: add missing calls in NETDEV_UNREGISTER notification handler
Currently NETDEV_UNREGISTER event handler is not calling
j1939_cancel_active_session() and j1939_sk_queue_drop_all().
This will result in these calls being skipped when j1939_sk_release() is
called. And I guess that the reason syzbot is still reporting
unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
is caused by lack of these calls.
Calling j1939_cancel_active_session(priv, sk) from j1939_sk_release() can
be covered by calling j1939_cancel_active_session(priv, NULL) from
j1939_netdev_notify().
Calling j1939_sk_queue_drop_all() from j1939_sk_release() can be covered
by calling j1939_sk_netdev_event_netdown() from j1939_netdev_notify().
Therefore, we can reuse j1939_cancel_active_session(priv, NULL) and
j1939_sk_netdev_event_netdown(priv) for NETDEV_UNREGISTER event handler.
Merge patch series "can: add Transmitter Delay Compensation (TDC) documentation"
Vincent Mailhol <mailhol@kernel.org> says:
TDC was added to the kernel in 2021 but I never took time to update the
documentation. The year is now 2025... As we say: "better late than never"!
The first patch is a small clean up which fixes an incorrect statement
concerning the CAN DLC, the second patch is the real thing and adds the
documentation of how to use the ip tool to configure the TDC.
Vincent Mailhol [Mon, 13 Oct 2025 10:10:22 +0000 (19:10 +0900)]
can: remove false statement about 1:1 mapping between DLC and length
The CAN-FD section of can.rst still states that there is a 1:1 mapping
between the Classical CAN DLC and its length. This is only true for
the DLC values up to 8. Beyond that point, the length remains at 8.
For reference, the mapping between the CAN DLC and the length is given
in below table [1]:
Damien Le Moal [Thu, 9 Oct 2025 10:46:00 +0000 (19:46 +0900)]
ata: libata-core: relax checks in ata_read_log_directory()
Commit 6d4405b16d37 ("ata: libata-core: Cache the general purpose log
directory") introduced caching of a device general purpose log directory
to avoid repeated access to this log page during device scan. This
change also added a check on this log page to verify that the log page
version is 0x0001 as mandated by the ACS specifications.
And it turns out that some devices do not bother reporting this version,
instead reporting a version 0, resulting in error messages such as:
ata6.00: Invalid log directory version 0x0000
and to the device being marked as not supporting the general purpose log
directory log page.
Since before commit 6d4405b16d37 the log page version check did not
exist and things were still working correctly for these devices, relax
ata_read_log_directory() version check and only warn about the invalid
log page version number without disabling access to the log directory
page.
Fixes: 6d4405b16d37 ("ata: libata-core: Cache the general purpose log directory") Cc: stable@vger.kernel.org Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220635 Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <cassel@kernel.org>
Nicolas Dichtel [Fri, 10 Oct 2025 14:18:59 +0000 (16:18 +0200)]
doc: fix seg6_flowlabel path
This sysctl is not per interface; it's global per netns.
Fixes: 292ecd9f5a94 ("doc: move seg6_flowlabel to seg6-sysctl.rst") Reported-by: Philippe Guibert <philippe.guibert@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 12 Oct 2025 20:27:56 +0000 (13:27 -0700)]
Merge tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fix from Wolfram Sang:
"One revert because of a regression in the I2C core which has sadly not
showed up during its time in -next"
* tag 'i2c-for-6.18-rc1-hotfix' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
Revert "i2c: boardinfo: Annotate code used in init phase only"
Convert remaining __init__ files similar to what we did in
commit b615879dbfea ("selftests: drv-net: make linters happy with our imports")
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
There is no error handling for `dma_map_single()` failures.
Add error handling by checking `dma_mapping_error()` and freeing
the `skb` using `dev_kfree_skb()` (process context) when it fails.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Yeounsu Moon <yyyynoom@gmail.com>
Tested-on: D-Link DGE-550T Rev-A3 Suggested-by: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Rex Lu [Thu, 9 Oct 2025 06:29:34 +0000 (08:29 +0200)]
net: mtk: wed: add dma mask limitation and GFP_DMA32 for device with more than 4GB DRAM
Limit tx/rx buffer address to 32-bit address space for board with more
than 4GB DRAM.
Fixes: 804775dfc2885 ("net: ethernet: mtk_eth_soc: add support for Wireless Ethernet Dispatch (WED)") Fixes: 6757d345dd7db ("net: ethernet: mtk_wed: introduce hw_rro support for MT7988") Tested-by: Daniel Pawlik <pawlik.dan@gmail.com> Tested-by: Matteo Croce <teknoraver@meta.com> Signed-off-by: Rex Lu <rex.lu@mediatek.com> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: usb: lan78xx: Fix lost EEPROM write timeout error(-ETIMEDOUT) in lan78xx_write_raw_eeprom
The function lan78xx_write_raw_eeprom failed to properly propagate EEPROM
write timeout errors (-ETIMEDOUT). In the timeout fallthrough path, it first
attempted to restore the pin configuration for LED outputs and then
returned only the status of that restore operation, discarding the
original timeout error saved in ret.
As a result, callers could mistakenly treat EEPROM write operation as
successful even though the EEPROM write had actually timed out with no
or partial data write.
To fix this, handle errors in restoring the LED pin configuration separately.
If the restore succeeds, return any prior EEPROM write timeout error saved
in ret to the caller.
Suggested-by: Oleksij Rempel <o.rempel@pengutronix.de> Fixes: 8b1b2ca83b20 ("net: usb: lan78xx: Improve error handling in EEPROM and OTP operations")
cc: stable@vger.kernel.org Signed-off-by: Bhanu Seshu Kumar Valluri <bhanuseshukumar@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 12 Oct 2025 15:45:52 +0000 (08:45 -0700)]
Merge tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Skip interrupt ID 0 in sifive-plic during suspend/resume because
ID 0 is reserved and accessing reserved register space could result
in undefined behavior
- Fix a function's retval check in aspeed-scu-ic
* tag 'irq_urgent_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/sifive-plic: Avoid interrupt ID 0 handling during suspend/resume
irqchip/aspeed-scu-ic: Fix an IS_ERR() vs NULL check
Linus Torvalds [Sat, 11 Oct 2025 23:06:04 +0000 (16:06 -0700)]
Merge tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
"The previous fix to trace_marker required updating trace_marker_raw as
well. The difference between trace_marker_raw from trace_marker is
that the raw version is for applications to write binary structures
directly into the ring buffer instead of writing ASCII strings. This
is for applications that will read the raw data from the ring buffer
and get the data structures directly. It's a bit quicker than using
the ASCII version.
Unfortunately, it appears that our test suite has several tests that
test writes to the trace_marker file, but lacks any tests to the
trace_marker_raw file (this needs to be remedied). Two issues came
about the update to the trace_marker_raw file that syzbot found:
- Fix tracing_mark_raw_write() to use per CPU buffer
The fix to use the per CPU buffer to copy from user space was
needed for both the trace_maker and trace_maker_raw file.
The fix for reading from user space into per CPU buffers properly
fixed the trace_marker write function, but the trace_marker_raw
file wasn't fixed properly. The user space data was correctly
written into the per CPU buffer, but the code that wrote into the
ring buffer still used the user space pointer and not the per CPU
buffer that had the user space data already written.
- Stop the fortify string warning from writing into trace_marker_raw
After converting the copy_from_user_nofault() into a memcpy(),
another issue appeared. As writes to the trace_marker_raw expects
binary data, the first entry is a 4 byte identifier. The entry
structure is defined as:
struct {
struct trace_entry ent;
int id;
char buf[];
};
The size of this structure is reserved on the ring buffer with:
size = sizeof(*entry) + cnt;
Then it is copied from the buffer into the ring buffer with:
memcpy(&entry->id, buf, cnt);
This use to be a copy_from_user_nofault(), but now converting it to
a memcpy() triggers the fortify-string code, and causes a warning.
The allocated space is actually more than what is copied, as the
cnt used also includes the entry->id portion. Allocating
sizeof(*entry) plus cnt is actually allocating 4 bytes more than
what is needed.
* tag 'trace-v6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Stop fortify-string from warning in tracing_mark_raw_write()
tracing: Fix tracing_mark_raw_write() to use buf and not ubuf