Bluetooth: MGMT: Fix UAF on mgmt_remove_adv_monitor_complete
This reworks MGMT_OP_REMOVE_ADV_MONITOR to not use mgmt_pending_add to
avoid crashes like bellow:
==================================================================
BUG: KASAN: slab-use-after-free in mgmt_remove_adv_monitor_complete+0xe5/0x540 net/bluetooth/mgmt.c:5406
Read of size 8 at addr ffff88801c53f318 by task kworker/u5:5/5341
Bluetooth: btintel_pcie: Reduce driver buffer posting to prevent race condition
Modify the driver to post 3 fewer buffers than the maximum rx buffers
(64) allowed for the firmware. This change mitigates a hardware issue
causing a race condition in the firmware, improving stability and data
handling.
Signed-off-by: Chandrashekar Devegowda <chandrashekar.devegowda@intel.com> Signed-off-by: Kiran K <kiran.k@intel.com> Fixes: c2b636b3f788 ("Bluetooth: btintel_pcie: Add support for PCIe transport") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: btintel_pcie: Increase the tx and rx descriptor count
This change addresses latency issues observed in HID use cases where
events arrive in bursts. By increasing the Rx descriptor count to 64,
the firmware can handle bursty data more effectively, reducing latency
and preventing buffer overflows.
Signed-off-by: Chandrashekar Devegowda <chandrashekar.devegowda@intel.com> Signed-off-by: Kiran K <kiran.k@intel.com> Fixes: c2b636b3f788 ("Bluetooth: btintel_pcie: Add support for PCIe transport") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Kiran K [Tue, 3 Jun 2025 10:04:38 +0000 (15:34 +0530)]
Bluetooth: btintel_pcie: Fix driver not posting maximum rx buffers
The driver was posting only 6 rx buffers, despite the maximum rx buffers
being defined as 16. Having fewer RX buffers caused firmware exceptions
in HID use cases when events arrived in bursts.
Exception seen on android 6.12 kernel.
E Bluetooth: hci0: Received hw exception interrupt
E Bluetooth: hci0: Received gp1 mailbox interrupt
D Bluetooth: hci0: 00000000: ff 3e 87 80 03 01 01 01 03 01 0c 0d 02 1c 10 0e
D Bluetooth: hci0: 00000010: 01 00 05 14 66 b0 28 b0 c0 b0 28 b0 ac af 28 b0
D Bluetooth: hci0: 00000020: 14 f1 28 b0 00 00 00 00 fa 04 00 00 00 00 40 10
D Bluetooth: hci0: 00000030: 08 00 00 00 7a 7a 7a 7a 47 00 fb a0 10 00 00 00
D Bluetooth: hci0: 00000000: 10 01 0a
E Bluetooth: hci0: ---- Dump of debug registers —
E Bluetooth: hci0: boot stage: 0xe0fb0047
E Bluetooth: hci0: ipc status: 0x00000004
E Bluetooth: hci0: ipc control: 0x00000000
E Bluetooth: hci0: ipc sleep control: 0x00000000
E Bluetooth: hci0: mbox_1: 0x00badbad
E Bluetooth: hci0: mbox_2: 0x0000101c
E Bluetooth: hci0: mbox_3: 0x00000008
E Bluetooth: hci0: mbox_4: 0x7a7a7a7a
Signed-off-by: Chandrashekar Devegowda <chandrashekar.devegowda@intel.com> Signed-off-by: Kiran K <kiran.k@intel.com> Fixes: c2b636b3f788 ("Bluetooth: btintel_pcie: Add support for PCIe transport") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Releasing + re-acquiring RCU lock inside list_for_each_entry_rcu() loop
body is not correct.
Fix by taking the update-side hdev->lock instead.
Fixes: c7eaf80bfb0c ("Bluetooth: Fix hci_link_tx_to RCU lock usage") Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Ido Schimmel [Wed, 4 Jun 2025 11:32:52 +0000 (14:32 +0300)]
seg6: Fix validation of nexthop addresses
The kernel currently validates that the length of the provided nexthop
address does not exceed the specified length. This can lead to the
kernel reading uninitialized memory if user space provided a shorter
length than the specified one.
Fix by validating that the provided length exactly matches the specified
one.
Fixes: d1df6fd8a1d2 ("ipv6: sr: define core operations for seg6local lightweight tunnel") Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250604113252.371528-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Wed, 4 Jun 2025 10:58:15 +0000 (10:58 +0000)]
net: prevent a NULL deref in rtnl_create_link()
At the time rtnl_create_link() is running, dev->netdev_ops is NULL,
we must not use netdev_lock_ops() or risk a NULL deref if
CONFIG_NET_SHAPER is defined.
Jakub Kicinski [Wed, 4 Jun 2025 01:20:55 +0000 (18:20 -0700)]
selftests: drv-net: tso: make bkg() wait for socat to quit
Commit 846742f7e32f ("selftests: drv-net: add a warning for
bkg + shell + terminate") added a warning for bkg() used
with terminate=True. The tso test was missed as we didn't
have it running anywhere in NIPA. Add exit_wait=True, to avoid:
# Warning: combining shell and terminate is risky!
# SIGTERM may not reach the child on zsh/ksh!
Jakub Kicinski [Wed, 4 Jun 2025 01:20:31 +0000 (18:20 -0700)]
selftests: drv-net: tso: fix the GRE device name
The device type for IPv4 GRE is "gre" not "ipgre",
unlike for IPv6 which uses "ip6gre".
Not sure how I missed this when writing the test, perhaps
because all HW I have access to is on an IPv6-only network.
Fixes: 0d0f4174f6c8 ("selftests: drv-net: add a simple TSO test") Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250604012031.891242-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 4 Jun 2025 00:16:52 +0000 (17:16 -0700)]
selftests: drv-net: add configs for the TSO test
Add missing config options for the tso.py test, specifically
to make sure the kernel is built with vxlan and gre tunnels.
I noticed this while adding a TSO-capable device QEMU to the CI.
Previously we only run virtio tests and it doesn't report LSO
stats on the QEMU we have.
Jakub Kicinski [Thu, 5 Jun 2025 14:59:31 +0000 (07:59 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
iavf: get rid of the crit lock
Przemek Kitszel says:
Fix some deadlocks in iavf, and make it less error prone for the future.
Patch 1 is simple and independent from the rest.
Patches 2, 3, 4 are strictly a refactor, but it enables the last patch
to be much smaller.
(Technically Jake given his RB tags not knowing I will send it to -net).
Patch 5 just adds annotations, this also helps prove last patch to be correct.
Patch 6 removes the crit lock, with its unusual try_lock()s.
I have more refactoring for scheduling done for -next, to be sent soon.
There is a simple test:
add VF; decrease number of queueus; remove VF
that was way too hard to pass without this series :)
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
iavf: get rid of the crit lock
iavf: sprinkle netdev_assert_locked() annotations
iavf: extract iavf_watchdog_step() out of iavf_watchdog_task()
iavf: simplify watchdog_task in terms of adminq task scheduling
iavf: centralize watchdog requeueing itself
iavf: iavf_suspend(): take RTNL before netdev_lock()
====================
Mirco Barone [Thu, 5 Jun 2025 12:06:16 +0000 (14:06 +0200)]
wireguard: device: enable threaded NAPI
Enable threaded NAPI by default for WireGuard devices in response to low
performance behavior that we observed when multiple tunnels (and thus
multiple wg devices) are deployed on a single host. This affects any
kind of multi-tunnel deployment, regardless of whether the tunnels share
the same endpoints or not (i.e., a VPN concentrator type of gateway
would also be affected).
The problem is caused by the fact that, in case of a traffic surge that
involves multiple tunnels at the same time, the polling of the NAPI
instance of all these wg devices tends to converge onto the same core,
causing underutilization of the CPU and bottlenecking performance.
This happens because NAPI polling is hosted by default in softirq
context, but the WireGuard driver only raises this softirq after the rx
peer queue has been drained, which doesn't happen during high traffic.
In this case, the softirq already active on a core is reused instead of
raising a new one.
As a result, once two or more tunnel softirqs have been scheduled on
the same core, they remain pinned there until the surge ends.
In our experiments, this almost always leads to all tunnel NAPIs being
handled on a single core shortly after a surge begins, limiting
scalability to less than 3× the performance of a single tunnel, despite
plenty of unused CPU cores being available.
The proposed mitigation is to enable threaded NAPI for all WireGuard
devices. This moves the NAPI polling context to a dedicated per-device
kernel thread, allowing the scheduler to balance the load across all
available cores.
On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
improvement with 16 tunnels, increasing throughput from ~13 Gbps to
~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
jumps from 20% to 100%.
We have found no performance regressions in any scenario we tested.
Single-tunnel throughput remains unchanged.
Paolo Abeni [Thu, 5 Jun 2025 11:37:02 +0000 (13:37 +0200)]
Merge tag 'nf-25-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Zero out the remainder in nft_pipapo AVX2 implementation, otherwise
next lookup could bogusly report a mismatch. This is followed by two
patches to update nft_pipapo selftests to cover for the previous bug.
From Florian Westphal.
2) Check for reverse tuple too in case of esoteric NAT collisions for
UDP traffic and extend selftest coverage. Also from Florian.
netfilter pull request 25-06-05
* tag 'nf-25-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
netfilter: nf_nat: also check reverse tuple to obtain clashing entry
selftests: netfilter: nft_concat_range.sh: add datapath check for map fill bug
selftests: netfilter: nft_concat_range.sh: prefer per element counters for testing
netfilter: nf_set_pipapo_avx2: fix initial map fill
====================
Adding GRE tunnels to the .config for driver tests caused
some unhappiness in YNL, as it can't decode all the link
attrs on the system. Add ip6gre support to fix the tests.
This is similar to commit 6ffdbb93a59c ("netlink: specs:
rt_link: decode ip6tnl, vti and vti6 link attrs").
====================
Jakub Kicinski [Tue, 3 Jun 2025 13:53:57 +0000 (06:53 -0700)]
netlink: specs: rt-link: decode ip6gre
Driver tests now require GRE tunnels, while we don't configure
them with YNL, YNL will complain when it sees link types it
doesn't recognize. Teach it decoding ip6gre tunnels. The attrs
are largely the same as IPv4 GRE.
Correct the type of encap-limit, but note that this attr is
only used in ip6gre, so the mistake didn't matter until now.
Fixes: 0d0f4174f6c8 ("selftests: drv-net: add a simple TSO test") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250603135357.502626-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A number of fields in the ip tunnels are lacking the big-endian
designation. I suspect this is not intentional, as decoding
the ports with the right endian seems objectively beneficial.
Fixes: 6ffdbb93a59c ("netlink: specs: rt_link: decode ip6tnl, vti and vti6 link attrs") Fixes: 077b6022d24b ("doc/netlink/specs: Add sub-message type to rt_link family") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250603135357.502626-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 5 Jun 2025 10:37:10 +0000 (12:37 +0200)]
Merge tag 'ovpn-net-20250603' of https://github.com/OpenVPN/ovpn-net-next
Antonio Quartulli says:
====================
In this batch you can find the following bug fixes:
Patch 1: when releasing a UDP socket we were wrongly invoking
setup_udp_tunnel_sock() with an empty config. This was not
properly shutting down the UDP encap state.
With this patch we simply undo what was done during setup.
Patch 2: ovpn was holding a reference to a 'struct socket'
without increasing its reference counter. This was intended
and worked as expected until we hit a race condition where
user space tries to close the socket while kernel space is
also releasing it. In this case the (struct socket *)->sk
member would disappear under our feet leading to a null-ptr-deref.
This patch fixes this issue by having struct ovpn_socket hold
a reference directly to the sk member while also increasing
its reference counter.
Patch 3: in case of errors along the TCP RX path (softirq)
we want to immediately delete the peer, but this operation may
sleep. With this patch we move the peer deletion to a scheduled
worker.
Patch 4 and 5 are instead fixing minor issues in the ovpn
kselftests.
* tag 'ovpn-net-20250603' of https://github.com/OpenVPN/ovpn-net-next:
selftest/net/ovpn: fix missing file
selftest/net/ovpn: fix TCP socket creation
ovpn: avoid sleep in atomic context in TCP RX error path
ovpn: ensure sk is still valid during cleanup
ovpn: properly deconfigure UDP-tunnel
====================
Daniele Palmas [Tue, 3 Jun 2025 09:12:04 +0000 (11:12 +0200)]
net: wwan: mhi_wwan_mbim: use correct mux_id for multiplexing
Recent Qualcomm chipsets like SDX72/75 require MBIM sessionId mapping
to muxId in the range (0x70-0x8F) for the PCIe tethered use.
This has been partially addressed by the referenced commit, mapping
the default data call to muxId = 112, but the multiplexed data calls
scenario was not properly considered, mapping sessionId = 1 to muxId
1, while it should have been 113.
Fix this by moving the session_id assignment logic to mhi_mbim_newlink,
in order to map sessionId = n to muxId = n + WDS_BIND_MUX_DATA_PORT_MUX_ID.
Fixes: 65bc58c3dcad ("net: wwan: mhi: make default data link id configurable") Signed-off-by: Daniele Palmas <dnlplm@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@oss.qualcomm.com> Link: https://patch.msgid.link/20250603091204.2802840-1-dnlplm@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lachlan Hodges [Tue, 3 Jun 2025 05:35:38 +0000 (15:35 +1000)]
wifi: cfg80211/mac80211: correctly parse S1G beacon optional elements
S1G beacons are not traditional beacons but a type of extension frame.
Extension frames contain the frame control and duration fields, followed
by zero or more optional fields before the frame body. These optional
fields are distinct from the variable length elements.
The presence of optional fields is indicated in the frame control field.
To correctly locate the elements offset, the frame control must be parsed
to identify which optional fields are present. Currently, mac80211 parses
S1G beacons based on fixed assumptions about the frame layout, without
inspecting the frame control field. This can result in incorrect offsets
to the "variable" portion of the frame.
Properly parse S1G beacon frames by using the field lengths defined in
IEEE 802.11-2024, section 9.3.4.3, ensuring that the elements offset is
calculated accurately.
Fixes: 9eaffe5078ca ("cfg80211: convert S1G beacon to scan results") Fixes: cd418ba63f0c ("mac80211: convert S1G beacon to scan results") Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20250603053538.468562-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
RGMII ports on BCM63xx were not really working, especially with PHYs
that support EEE and are capable of configuring their own RGMII delays.
So let's make them work, and fix additional minor rgmii related issues
found while working on it.
With a BCM96328BU-P300:
Before:
[ 3.580000] b53-switch 10700000.switch GbE3 (uninitialized): validation of rgmii with support 0000000,00000000,00000000,000062ff and advertisement 0000000,00000000,00000000,000062ff failed: -EINVAL
[ 3.600000] b53-switch 10700000.switch GbE3 (uninitialized): failed to connect to PHY: -EINVAL
[ 3.610000] b53-switch 10700000.switch GbE3 (uninitialized): error -22 setting up PHY for tree 0, switch 0, port 4
[ 3.620000] b53-switch 10700000.switch GbE1 (uninitialized): validation of rgmii with support 0000000,00000000,00000000,000062ff and advertisement 0000000,00000000,00000000,000062ff failed: -EINVAL
[ 3.640000] b53-switch 10700000.switch GbE1 (uninitialized): failed to connect to PHY: -EINVAL
[ 3.650000] b53-switch 10700000.switch GbE1 (uninitialized): error -22 setting up PHY for tree 0, switch 0, port 5
[ 3.660000] b53-switch 10700000.switch GbE4 (uninitialized): validation of rgmii with support 0000000,00000000,00000000,000062ff and advertisement 0000000,00000000,00000000,000062ff failed: -EINVAL
[ 3.680000] b53-switch 10700000.switch GbE4 (uninitialized): failed to connect to PHY: -EINVAL
[ 3.690000] b53-switch 10700000.switch GbE4 (uninitialized): error -22 setting up PHY for tree 0, switch 0, port 6
[ 3.700000] b53-switch 10700000.switch GbE5 (uninitialized): validation of rgmii with support 0000000,00000000,00000000,000062ff and advertisement 0000000,00000000,00000000,000062ff failed: -EINVAL
[ 3.720000] b53-switch 10700000.switch GbE5 (uninitialized): failed to connect to PHY: -EINVAL
[ 3.730000] b53-switch 10700000.switch GbE5 (uninitialized): error -22 setting up PHY for tree 0, switch 0, port 7
Jonas Gorski [Mon, 2 Jun 2025 19:39:53 +0000 (21:39 +0200)]
net: dsa: b53: do not touch DLL_IQQD on bcm53115
According to OpenMDK, bit 2 of the RGMII register has a different
meaning for BCM53115 [1]:
"DLL_IQQD 1: In the IDDQ mode, power is down0: Normal function
mode"
Configuring RGMII delay works without setting this bit, so let's keep it
at the default. For other chips, we always set it, so not clearing it
is not an issue.
One would assume BCM53118 works the same, but OpenMDK is not quite sure
what this bit actually means [2]:
"BYPASS_IMP_2NS_DEL #1: In the IDDQ mode, power is down#0: Normal
function mode1: Bypass dll65_2ns_del IP0: Use
dll65_2ns_del IP"
Jonas Gorski [Mon, 2 Jun 2025 19:39:52 +0000 (21:39 +0200)]
net: dsa: b53: allow RGMII for bcm63xx RGMII ports
Add RGMII to supported interfaces for BCM63xx RGMII ports so they can be
actually used in RGMII mode.
Without this, phylink will fail to configure them:
[ 3.580000] b53-switch 10700000.switch GbE3 (uninitialized): validation of rgmii with support 0000000,00000000,00000000,000062ff and advertisement 0000000,00000000,00000000,000062ff failed: -EINVAL
[ 3.600000] b53-switch 10700000.switch GbE3 (uninitialized): failed to connect to PHY: -EINVAL
[ 3.610000] b53-switch 10700000.switch GbE3 (uninitialized): error -22 setting up PHY for tree 0, switch 0, port 4
Fixes: ce3bf94871f7 ("net: dsa: b53: add support for BCM63xx RGMIIs") Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Link: https://patch.msgid.link/20250602193953.1010487-5-jonas.gorski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jonas Gorski [Mon, 2 Jun 2025 19:39:49 +0000 (21:39 +0200)]
net: dsa: b53: do not enable EEE on bcm63xx
BCM63xx internal switches do not support EEE, but provide multiple RGMII
ports where external PHYs may be connected. If one of these PHYs are EEE
capable, we may try to enable EEE for the MACs, which then hangs the
system on access of the (non-existent) EEE registers.
Fix this by checking if the switch actually supports EEE before
attempting to configure it.
Fixes: 22256b0afb12 ("net: dsa: b53: Move EEE functions to b53") Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Tested-by: Álvaro Fernández Rojas <noltari@gmail.com> Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Link: https://patch.msgid.link/20250602193953.1010487-2-jonas.gorski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Meghana Malladi [Tue, 3 Jun 2025 05:29:04 +0000 (10:59 +0530)]
net: ti: icssg-prueth: Fix swapped TX stats for MII interfaces.
In MII mode, Tx lines are swapped for port0 and port1, which means
Tx port0 receives data from PRU1 and the Tx port1 receives data from
PRU0. This is an expected hardware behavior and reading the Tx stats
needs to be handled accordingly in the driver. Update the driver to
read Tx stats from the PRU1 for port0 and PRU0 for port1.
Florian Westphal [Fri, 30 May 2025 10:34:03 +0000 (12:34 +0200)]
selftests: netfilter: nft_nat.sh: add test for reverse clash with nat
This will fail without the previous bug fix because we erronously
believe that the clashing entry went way.
However, the clash exists in the opposite direction due to an
existing nat mapping:
PASS: IP statless for ns2-LgTIuS
ERROR: failed to test udp ns1-x4iyOW to ns2-LgTIuS with dnat rule step 2, result: ""
This is partially adapted from test instructions from the below
ubuntu tracker.
Florian Westphal [Fri, 30 May 2025 10:34:02 +0000 (12:34 +0200)]
netfilter: nf_nat: also check reverse tuple to obtain clashing entry
The logic added in the blamed commit was supposed to only omit nat source
port allocation if neither the existing nor the new entry are subject to
NAT.
However, its not enough to lookup the conntrack based on the proposed
tuple, we must also check the reverse direction.
Otherwise there are esoteric cases where the collision is in the reverse
direction because that colliding connection has a port rewrite, but the
new entry doesn't. In this case, we only check the new entry and then
erronously conclude that no clash exists anymore.
The existing (udp) tuple is:
a:p -> b:P, with nat translation to s:P, i.e. pure daddr rewrite,
reverse tuple in conntrack table is s:P -> a:p.
When another UDP packet is sent directly to s, i.e. a:p->s:P, this is
correctly detected as a colliding entry: tuple is taken by existing reply
tuple in reverse direction.
But the colliding conntrack is only searched for with unreversed
direction, and we can't find such entry matching a:p->s:P.
The incorrect conclusion is that the clashing entry has timed out and
that no port address translation is required.
Such conntrack will then be discarded at nf_confirm time because the
proposed reverse direction clashes with an existing mapping in the
conntrack table.
Search for the reverse tuple too, this will then check the NAT bits of
the colliding entry and triggers port reallocation.
Followp patch extends nft_nat.sh selftest to cover this scenario.
The IPS_SEQ_ADJUST change is also a bug fix:
Instead of checking for SEQ_ADJ this tested for SEEN_REPLY and ASSURED
by accident -- _BIT is only for use with the test_bit() API.
This bug has little consequence in practice, because the sequence number
adjustments are only useful for TCP which doesn't support clash resolution.
The existing test case (conntrack_reverse_clash.sh) exercise a race
condition path (parallel conntrack creation on different CPUs), so
the colliding entries have neither SEEN_REPLY nor ASSURED set.
Thanks to Yafang Shao and Shaun Brady for an initial investigation
of this bug.
Fixes: d8f84a9bc7c4 ("netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash") Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1795 Reported-by: Yafang Shao <laoar.shao@gmail.com> Reported-by: Shaun Brady <brady.1345@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Tested-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Fri, 23 May 2025 12:20:46 +0000 (14:20 +0200)]
selftests: netfilter: nft_concat_range.sh: add datapath check for map fill bug
commit 0935ee6032df ("selftests: netfilter: add test case for recent mismatch bug")
added a regression check for incorrect initial fill of the result map
that was fixed with 791a615b7ad2 ("netfilter: nf_set_pipapo: fix initial map fill").
The test used 'nft get element', i.e., control plane checks for
match/nomatch results.
The control plane however doesn't use avx2 version, so we need to
send+match packets.
As the additional packet match/nomatch is slow, don't do this for
every element added/removed: add and use maybe_send_(no)match
helpers and use them.
Florian Westphal [Fri, 23 May 2025 12:20:45 +0000 (14:20 +0200)]
selftests: netfilter: nft_concat_range.sh: prefer per element counters for testing
The selftest uses following rule:
... @test counter name "test"
Then sends a packet, then checks if the named counter did increment or
not.
This is fine for the 'no-match' test case: If anything matches the
counter increments and the test fails as expected.
But for the 'should match' test cases this isn't optimal.
Consider buggy matching, where the packet matches entry x, but it
should have matched entry y.
In that case the test would erronously pass.
Rework the selftest to use per-element counters to avoid this.
After sending packet that should have matched entry x, query the
relevant element via 'nft reset element' and check that its counter
had incremented.
The 'nomatch' case isn't altered, no entry should match so the named
counter must be 0, changing it to the per-element counter would then
pass if another entry matches.
The downside of this change is a slight increase in test run-time by
a few seconds.
The regulatory domain information was initialized every time the
FW was loaded and the device was restarted. This was unnecessary
and useless as at this stage the wiphy channels information was
not setup yet so while the regulatory domain was set to the wiphy,
the channel information was not updated.
In case that a specific MCC was configured during FW initialization
then following updates with this MCC are ignored, and thus the
wiphy channels information is left with information not matching
the regulatory domain.
This commit moves the regulatory domain initialization to after the
operational firmware is started, i.e., after the wiphy channels were
configured and the regulatory information is needed.
Miri Korenblit [Wed, 4 Jun 2025 03:13:19 +0000 (06:13 +0300)]
wifi: iwlwifi: mld: avoid panic on init failure
In case of an error during init, in_hw_restart will be set, but it will
never get cleared.
Instead, we will retry to init again, and then we will act like we are in a
restart when we are actually not.
This causes (among others) to a NULL pointer dereference when canceling
rx_omi::finished_work, that was not even initialized, because we thought
that we are in hw_restart.
Set in_hw_restart to true only if the fw is running, then we know that
FW was loaded successfully and we are not going to the retry loop.
Miri Korenblit [Wed, 4 Jun 2025 03:13:18 +0000 (06:13 +0300)]
wifi: iwlwifi: mvm: fix assert on suspend
After using DEFINE_RAW_FLEX, cmd is a pointer to iwl_rxq_sync_cmd,
and not a variable containing both the command and notification.
Adjust hcmd->data and hcmd->len assignment as well.
Alok Tiwari [Mon, 2 Jun 2025 10:34:29 +0000 (03:34 -0700)]
gve: add missing NULL check for gve_alloc_pending_packet() in TX DQO
gve_alloc_pending_packet() can return NULL, but gve_tx_add_skb_dqo()
did not check for this case before dereferencing the returned pointer.
Add a missing NULL check to prevent a potential NULL pointer
dereference when allocation fails.
This improves robustness in low-memory scenarios.
Fixes: a57e5de476be ("gve: DQO: Add TX path") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of the crit lock.
That frees us from the error prone logic of try_locks.
Thanks to netdev_lock() by Jakub it is now easy, and in most cases we were
protected by it already - replace crit lock by netdev lock when it was not
the case.
Lockdep reports that we should cancel the work under crit_lock [splat1],
and that was the scheme we have mostly followed since [1] by Slawomir.
But when that is done we still got into deadlocks [splat2]. So instead
we should look at the bigger problem, namely "weird locking/scheduling"
of the iavf. The first step to fix that is to remove the crit lock.
I will followup with a -next series that simplifies scheduling/tasks.
Cancel the work without netdev lock (weird unlock+lock scheme),
to fix the [splat2] (which would be totally ugly if we would kept
the crit lock).
Extend protected part of iavf_watchdog_task() to include scheduling
more work.
Note that the removed comment in iavf_reset_task() was misplaced,
it belonged to inside of the removed if condition, so it's gone now.
[splat1] - w/o this patch - The deadlock during VF removal:
WARNING: possible circular locking dependency detected
sh/3825 is trying to acquire lock:
((work_completion)(&(&adapter->watchdog_task)->work)){+.+.}-{0:0}, at: start_flush_work+0x1a1/0x470
but task is already holding lock:
(&adapter->crit_lock){+.+.}-{4:4}, at: iavf_remove+0xd1/0x690 [iavf]
which lock already depends on the new lock.
[splat2] - when cancelling work under crit lock, w/o this series,
see [2] for the band aid attempt
WARNING: possible circular locking dependency detected
sh/3550 is trying to acquire lock:
((wq_completion)iavf){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90
but task is already holding lock:
(&dev->lock){+.+.}-{4:4}, at: iavf_remove+0xa6/0x6e0 [iavf]
which lock already depends on the new lock.
iavf: iavf_suspend(): take RTNL before netdev_lock()
Fix an obvious violation of lock ordering.
Jakub's [1] added netdev_lock() call that is wrong ordered wrt RTNL,
but the Fixes tag points to crit_lock being wrongly placed (by lockdep
standards).
Actual reason we got it wrong is dated back to critical section managed by
pure flag checks, which is with us since the very beginning.
[1] afc664987ab3 ("eth: iavf: extend the netdev_lock usage")
Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
TCP sockets cannot be created with AF_UNSPEC, but
one among the supported family must be used.
Since commit 944f8b6abab6 ("selftest/net/ovpn: extend
coverage with more test cases") the default address
family for all tests was changed from AF_INET to AF_UNSPEC,
thus breaking all TCP cases.
Restore AF_INET as default address family for TCP listeners.
Fixes: 944f8b6abab6 ("selftest/net/ovpn: extend coverage with more test cases") Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Since strp_data_ready() may be invoked from softirq context, and
ovpn_socket_release() may sleep, the above sequence may cause a
sleep in atomic context like the following:
This happens because the peer deletion operation reaches
ovpn_socket_release() while ovpn_sock->sock (struct socket *)
and its sk member (struct sock *) are still both valid.
Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk
becomes NULL, due to the concurrent socket closing triggered
from userspace.
After having invoked synchronize_rcu(), ovpn_socket_release() will
attempt dereferencing ovpn_sock->sock->sk, triggering the crash
reported above.
The reason for accessing sk is that we need to retrieve its
protocol and continue the cleanup routine accordingly.
This crash can be easily produced by running openvpn userspace in
client mode with `--keepalive 10 20`, while entirely omitting this
option on the server side.
After 20 seconds ovpn will assume the peer (server) to be dead,
will start removing it and will notify userspace. The latter will
receive the notification and close the transport socket, thus
triggering the crash.
To fix the race condition for good, we need to refactor struct ovpn_socket.
Since ovpn is always only interested in the sock->sk member (struct sock *)
we can directly hold a reference to it, raher than accessing it via
its struct socket container.
This means changing "struct socket *ovpn_socket->sock" to
"struct sock *ovpn_socket->sk".
While acquiring a reference to sk, we can increase its refcounter
without affecting the socket close()/destroy() notification
(which we rely on when userspace closes a socket we are using).
By increasing sk's refcounter we know we can dereference it
in ovpn_socket_release() without incurring in any race condition
anymore.
ovpn_socket_release() will ultimately decrease the reference
counter.
Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cbd60ea ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: https://github.com/OpenVPN/ovpn-net-next/issues/1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
When deconfiguring a UDP-tunnel from a socket, we cannot
call setup_udp_tunnel_sock() with an empty config, because
this helper is expected to be invoked only during setup.
Get rid of the call to setup_udp_tunnel_sock() and just
revert what it did during socket initialization..
Note that the global udp_encap_needed_key and the GRO state
are left untouched: udp_destroy_socket() will eventually
take care of them.
Fix IPv6 hw acceleration in bridge mode resolving ib2 and
airoha_foe_mac_info_common overwrite in
airoha_ppe_foe_commit_subflow_entry routine.
Introduce UPDMEM source-mac table used to set source mac address for
L3 IPv6 hw accelerated traffic.
====================
net: airoha: Fix smac_id configuration in bridge mode
Set PPE entry smac_id field to 0xf in airoha_ppe_foe_commit_subflow_entry
routine for IPv6 traffic in order to instruct the hw to keep original
source mac address for IPv6 hw accelerated traffic in bridge mode.
net: airoha: Fix IPv6 hw acceleration in bridge mode
ib2 and airoha_foe_mac_info_common have not the same offsets in
airoha_foe_bridge and airoha_foe_ipv6 structures. Current codebase does
not accelerate IPv6 traffic in bridge mode since ib2 and l2 info are not
set properly copying airoha_foe_bridge struct into airoha_foe_ipv6 one
in airoha_ppe_foe_commit_subflow_entry routine.
Fix IPv6 hw acceleration in bridge mode resolving ib2 and
airoha_foe_mac_info_common overwrite in
airoha_ppe_foe_commit_subflow_entry() and configuring them with proper
values.
net: airoha: Initialize PPE UPDMEM source-mac table
UPDMEM source-mac table is a key-value map used to store devices mac
addresses according to the port identifier. UPDMEM source mac table is
used during IPv6 traffic hw acceleration since PPE entries, for space
constraints, do not contain the full source mac address but just the
identifier in the UPDMEM source-mac table.
Configure UPDMEM source-mac table with device mac addresses and set
the source-mac ID field for PPE IPv6 entries in order to select the
proper device mac address as source mac for L3 IPv6 hw accelerated traffic.
selftests: net: build net/lib dependency in all target
We have the logic to include net/lib automatically for net related
selftests. However, currently, this logic is only in install target
which means only `make install` will have net/lib included. This commit
adds the logic to all target so that all `make`, `make run_tests` and
`make install` will have net/lib included in net related selftests.
Ronak Doshi [Fri, 30 May 2025 15:27:00 +0000 (15:27 +0000)]
vmxnet3: correctly report gso type for UDP tunnels
Commit 3d010c8031e3 ("udp: do not accept non-tunnel GSO skbs landing
in a tunnel") added checks in linux stack to not accept non-tunnel
GRO packets landing in a tunnel. This exposed an issue in vmxnet3
which was not correctly reporting GRO packets for tunnel packets.
This patch fixes this issue by setting correct GSO type for the
tunnel packets.
Currently, vmxnet3 does not support reporting inner fields for LRO
tunnel packets. The issue is not seen for egress drivers that do not
use skb inner fields. The workaround is to enable tnl-segmentation
offload on the egress interfaces if the driver supports it. This
problem pre-exists this patch fix and can be addressed as a separate
future patch.
Fixes: dacce2be3312 ("vmxnet3: add geneve and vxlan tunnel offload support") Signed-off-by: Ronak Doshi <ronak.doshi@broadcom.com> Acked-by: Guolin Yang <guolin.yang@broadcom.com> Link: https://patch.msgid.link/20250530152701.70354-1-ronak.doshi@broadcom.com
[pabeni@redhat.com: dropped the changelog] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The commit in question breaks kunit for older compilers:
$ gcc --version
gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
$ ./tools/testing/kunit/kunit.py run --alltests --json --arch=x86_64
Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=x86_64 O=.kunit olddefconfig
ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config.
This is probably due to unsatisfied dependencies.
Missing: CONFIG_INIT_STACK_ALL_PATTERN=y
Jinjian Song [Fri, 30 May 2025 03:16:48 +0000 (11:16 +0800)]
net: wwan: t7xx: Fix napi rx poll issue
When driver handles the napi rx polling requests, the netdev might
have been released by the dellink logic triggered by the disconnect
operation on user plane. However, in the logic of processing skb in
polling, an invalid netdev is still being used, which causes a panic.
For idpf:
Brian Vazquez updates netif_subqueue_maybe_stop() condition check to
prevent possible races.
Emil shuts down virtchannel mailbox during reset to reduce timeout
delays as it's unavailable during that time.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: avoid mailbox timeout delays during reset
idpf: fix a race in txq wakeup
ice: fix rebuilding the Tx scheduler tree for large queue counts
ice: create new Tx scheduler nodes for new queues only
ice: fix Tx scheduler error handling in XDP callback
====================
Shiming Cheng [Fri, 30 May 2025 01:26:08 +0000 (09:26 +0800)]
net: fix udp gso skb_segment after pull from frag_list
Commit a1e40ac5b5e9 ("net: gso: fix udp gso fraglist segmentation after
pull from frag_list") detected invalid geometry in frag_list skbs and
redirects them from skb_segment_list to more robust skb_segment. But some
packets with modified geometry can also hit bugs in that code. We don't
know how many such cases exist. Addressing each one by one also requires
touching the complex skb_segment code, which risks introducing bugs for
other types of skbs. Instead, linearize all these packets that fail the
basic invariants on gso fraglist skbs. That is more robust.
If only part of the fraglist payload is pulled into head_skb, it will
always cause exception when splitting skbs by skb_segment. For detailed
call stack information, see below.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify fraglist skbs, breaking these invariants.
In extreme cases they pull one part of data into skb linear. For UDP,
this causes three payloads with lengths of (11,11,10) bytes were
pulled tail to become (12,10,10) bytes.
The skbs no longer meets the above SKB_GSO_FRAGLIST conditions because
payload was pulled into head_skb, it needs to be linearized before pass
to regular skb_segment.
Fixes: a1e40ac5b5e9 ("gso: fix udp gso fraglist segmentation after pull from frag_list") Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: Fix inet_proto_csum_replace_by_diff for IPv6
This patchset fixes a bug that causes skb->csum to hold an incorrect
value when calling inet_proto_csum_replace_by_diff for an IPv6 packet
in CHECKSUM_COMPLETE state. This bug affects BPF helper
bpf_l4_csum_replace and IPv6 ILA in adj-transport mode.
In those cases, inet_proto_csum_replace_by_diff updates the L4 checksum
field after an IPv6 address change. These two changes cancel each other
in terms of checksum, so skb->csum shouldn't be updated.
Paul Chaignon [Thu, 29 May 2025 10:28:35 +0000 (12:28 +0200)]
bpf: Fix L4 csum update on IPv6 in CHECKSUM_COMPLETE
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.
When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:
The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.
For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.
The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.
This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.
This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.
Paul Chaignon [Thu, 29 May 2025 10:28:05 +0000 (12:28 +0200)]
net: Fix checksum update for ILA adj-transport
During ILA address translations, the L4 checksums can be handled in
different ways. One of them, adj-transport, consist in parsing the
transport layer and updating any found checksum. This logic relies on
inet_proto_csum_replace_by_diff and produces an incorrect skb->csum when
in state CHECKSUM_COMPLETE.
This bug can be reproduced with a simple ILA to SIR mapping, assuming
packets are received with CHECKSUM_COMPLETE:
$ ip a show dev eth0
14: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 62:ae:35:9e:0f:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 3333:0:0:1::c078/64 scope global
valid_lft forever preferred_lft forever
inet6 fd00:10:244:1::c078/128 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::60ae:35ff:fe9e:f8d/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ ip ila add loc_match fd00:10:244:1 loc 3333:0:0:1 \
csum-mode adj-transport ident-type luid dev eth0
Then I hit [fd00:10:244:1::c078]:8000 with a server listening only on
[3333:0:0:1::c078]:8000. With the bug, the SYN packet is dropped with
SKB_DROP_REASON_TCP_CSUM after inet_proto_csum_replace_by_diff changed
skb->csum. The translation and drop are visible on pwru [1] traces:
This is happening because inet_proto_csum_replace_by_diff is updating
skb->csum when it shouldn't. The L4 checksum is updated such that it
"cancels" the IPv6 address change in terms of checksum computation, so
the impact on skb->csum is null.
Note this would be different for an IPv4 packet since three fields
would be updated: the IPv4 address, the IP checksum, and the L4
checksum. Two would cancel each other and skb->csum would still need
to be updated to take the L4 checksum change into account.
This patch fixes it by passing an ipv6 flag to
inet_proto_csum_replace_by_diff, to skip the skb->csum update if we're
in the IPv6 case. Note the behavior of the only other user of
inet_proto_csum_replace_by_diff, the BPF subsystem, is left as is in
this patch and fixed in the subsequent patch.
With the fix, using the reproduction from above, I can confirm
skb->csum is not touched by inet_proto_csum_replace_by_diff and the TCP
SYN proceeds to the application after the ILA translation.
Alexis Lothoré [Thu, 29 May 2025 09:07:24 +0000 (11:07 +0200)]
net: stmmac: make sure that ptp_rate is not 0 before configuring EST
If the ptp_rate recorded earlier in the driver happens to be 0, this
bogus value will propagate up to EST configuration, where it will
trigger a division by 0.
Prevent this division by 0 by adding the corresponding check and error
code.
Suggested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Fixes: 8572aec3d0dc ("net: stmmac: Add basic EST support for XGMAC") Link: https://patch.msgid.link/20250529-stmmac_tstamp_div-v4-2-d73340a794d5@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexis Lothoré [Thu, 29 May 2025 09:07:23 +0000 (11:07 +0200)]
net: stmmac: make sure that ptp_rate is not 0 before configuring timestamping
The stmmac platform drivers that do not open-code the clk_ptp_rate value
after having retrieved the default one from the device-tree can end up
with 0 in clk_ptp_rate (as clk_get_rate can return 0). It will
eventually propagate up to PTP initialization when bringing up the
interface, leading to a divide by 0:
Division by zero in kernel.
CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.30-00001-g48313bd5768a #22
Hardware name: STM32 (Device Tree Support)
Call trace:
unwind_backtrace from show_stack+0x18/0x1c
show_stack from dump_stack_lvl+0x6c/0x8c
dump_stack_lvl from Ldiv0_64+0x8/0x18
Ldiv0_64 from stmmac_init_tstamp_counter+0x190/0x1a4
stmmac_init_tstamp_counter from stmmac_hw_setup+0xc1c/0x111c
stmmac_hw_setup from __stmmac_open+0x18c/0x434
__stmmac_open from stmmac_open+0x3c/0xbc
stmmac_open from __dev_open+0xf4/0x1ac
__dev_open from __dev_change_flags+0x1cc/0x224
__dev_change_flags from dev_change_flags+0x24/0x60
dev_change_flags from ip_auto_config+0x2e8/0x11a0
ip_auto_config from do_one_initcall+0x84/0x33c
do_one_initcall from kernel_init_freeable+0x1b8/0x214
kernel_init_freeable from kernel_init+0x24/0x140
kernel_init from ret_from_fork+0x14/0x28
Exception stack(0xe0815fb0 to 0xe0815ff8)
Prevent this division by 0 by adding an explicit check and error log
about the actual issue. While at it, remove the same check from
stmmac_ptp_register, which then becomes duplicate
Fixes: 19d857c9038e ("stmmac: Fix calculations for ptp counters when clock input = 50Mhz.") Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250529-stmmac_tstamp_div-v4-1-d73340a794d5@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This deadlock was not observed so far because net_shaper_ops was never set,
and thus the lock was effectively a no-op in this case. Fix this by using
netif_xdp_propagate() instead of dev_xdp_propagate() to avoid recursive
locking in this path.
And, since no deadlock is observed on the other path which is via
netvsc_probe, add the lock exclusivly for that path.
Also, clean up the unregistration path by removing the unnecessary call to
netvsc_vf_setxdp(), since unregister_netdevice_many_notify() already
performs this cleanup via dev_xdp_uninstall().
Fixes: 97246d6d21c2 ("net: hold netdev instance lock during ndo_bpf") Cc: stable@vger.kernel.org Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Tested-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com> Link: https://patch.msgid.link/1748513910-23963-1-git-send-email-ssengar@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 31 May 2025 02:15:58 +0000 (19:15 -0700)]
Merge tag 'for-net-2025-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_qca: move the SoC type check to the right place
- MGMT: reject malformed HCI_CMD_SYNC commands
- btnxpuart: Fix missing devm_request_irq() return value check
- L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION
* tag 'for-net-2025-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION
Bluetooth: hci_qca: move the SoC type check to the right place
Bluetooth: btnxpuart: Fix missing devm_request_irq() return value check
Bluetooth: MGMT: reject malformed HCI_CMD_SYNC commands
====================
Emil Tantilov [Thu, 8 May 2025 18:47:15 +0000 (11:47 -0700)]
idpf: avoid mailbox timeout delays during reset
Mailbox operations are not possible while the driver is in reset.
Operations that require MBX exchange with the control plane will result
in long delays if executed while a reset is in progress:
Disable mailbox communication in case of a reset, unless it's done during
a driver load, where the virtchnl operations are needed to configure the
device.
Fixes: 8077c727561aa ("idpf: add controlq init and reset checks") Co-developed-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Ahmed Zaki <ahmed.zaki@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Brian Vazquez [Thu, 1 May 2025 17:06:17 +0000 (17:06 +0000)]
idpf: fix a race in txq wakeup
Add a helper function to correctly handle the lockless
synchronization when the sender needs to block. The paradigm is
if (no_resources()) {
stop_queue();
barrier();
if (!no_resources())
restart_queue();
}
netif_subqueue_maybe_stop already handles the paradigm correctly, but
the code split the check for resources in three parts, the first one
(descriptors) followed the protocol, but the other two (completions and
tx_buf) were only doing the first part and so race prone.
Luckily netif_subqueue_maybe_stop macro already allows you to use a
function to evaluate the start/stop conditions so the fix only requires
the right helper function to evaluate all the conditions at once.
The patch removes idpf_tx_maybe_stop_common since it's no longer needed
and instead adjusts separately the conditions for singleq and splitq.
Note that idpf_tx_buf_hw_update doesn't need to check for resources
since that will be covered in idpf_tx_splitq_frame.
To reproduce:
Reduce the threshold for pending completions to increase the chances of
hitting this pause by changing your kernel:
Michal Kubiak [Tue, 13 May 2025 10:55:29 +0000 (12:55 +0200)]
ice: fix rebuilding the Tx scheduler tree for large queue counts
The current implementation of the Tx scheduler allows the tree to be
rebuilt as the user adds more Tx queues to the VSI. In such a case,
additional child nodes are added to the tree to support the new number
of queues.
Unfortunately, this algorithm does not take into account that the limit
of the VSI support node may be exceeded, so an additional node in the
VSI layer may be required to handle all the requested queues.
Such a scenario occurs when adding XDP Tx queues on machines with many
CPUs. Although the driver still respects the queue limit returned by
the FW, the Tx scheduler was unable to add those queues to its tree
and returned one of the errors below.
Such a scenario occurs when adding XDP Tx queues on machines with many
CPUs (e.g. at least 321 CPUs, if there is already 128 Tx/Rx queue pairs).
Although the driver still respects the queue limit returned by the FW,
the Tx scheduler was unable to add those queues to its tree and returned
the following errors:
Failed VSI LAN queue config for XDP, error: -5
or:
Failed to set LAN Tx queue context, error: -22
Fix this problem by extending the tree rebuild algorithm to check if the
current VSI node can support the requested number of queues. If it
cannot, create as many additional VSI support nodes as necessary to
handle all the required Tx queues. Symmetrically, adjust the VSI node
removal algorithm to remove all nodes associated with the given VSI.
Also, make the search for the next free VSI node more restrictive. That is,
add queue group nodes only to the VSI support nodes that have a matching
VSI handle.
Finally, fix the comment describing the tree update algorithm to better
reflect the current scenario.
Fixes: b0153fdd7e8a ("ice: update VSI config dynamically") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Kubiak [Tue, 13 May 2025 10:55:28 +0000 (12:55 +0200)]
ice: create new Tx scheduler nodes for new queues only
The current implementation of the Tx scheduler tree attempts
to create nodes for all Tx queues, ignoring the fact that some
queues may already exist in the tree. For example, if the VSI
already has 128 Tx queues and the user requests for 16 new queues,
the Tx scheduler will compute the tree for 272 queues (128 existing
queues + 144 new queues), instead of 144 queues (128 existing queues
and 16 new queues).
Fix that by modifying the node count calculation algorithm to skip
the queues that already exist in the tree.
Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue support") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Michal Kubiak [Tue, 13 May 2025 10:55:27 +0000 (12:55 +0200)]
ice: fix Tx scheduler error handling in XDP callback
When the XDP program is loaded, the XDP callback adds new Tx queues.
This means that the callback must update the Tx scheduler with the new
queue number. In the event of a Tx scheduler failure, the XDP callback
should also fail and roll back any changes previously made for XDP
preparation.
The previous implementation had a bug that not all changes made by the
XDP callback were rolled back. This caused the crash with the following
call trace:
[ +9.549584] ice 0000:ca:00.0: Failed VSI LAN queue config for XDP, error: -5
[ +0.382335] Oops: general protection fault, probably for non-canonical address 0x50a2250a90495525: 0000 [#1] SMP NOPTI
[ +0.010710] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.14.0-net-next-mar-31+ #14 PREEMPT(voluntary)
[ +0.010175] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
[ +0.010946] RIP: 0010:__ice_update_sample+0x39/0xe0 [ice]
Fix this by performing the missing unmapping of XDP queues from
q_vectors and setting the XDP rings pointer back to NULL after all those
queues are released.
Also, add an immediate exit from the XDP callback in case of ring
preparation failure.
Fixes: efc2214b6047 ("ice: Add support for XDP") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Bluetooth: L2CAP: Fix not responding with L2CAP_CR_LE_ENCRYPTION
Depending on the security set the response to L2CAP_LE_CONN_REQ shall be
just L2CAP_CR_LE_ENCRYPTION if only encryption when BT_SECURITY_MEDIUM
is selected since that means security mode 2 which doesn't require
authentication which is something that is covered in the qualification
test L2CAP/LE/CFC/BV-25-C.
Link: https://github.com/bluez/bluez/issues/1270 Fixes: 27e2d4c8d28b ("Bluetooth: Add basic LE L2CAP connect request receiving support") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: hci_qca: move the SoC type check to the right place
Commit 3d05fc82237a ("Bluetooth: qca: set power_ctrl_enabled on NULL
returned by gpiod_get_optional()") accidentally changed the prevous
behavior where power control would be disabled without the BT_EN GPIO
only on QCA_WCN6750 and QCA_WCN6855 while also getting the error check
wrong. We should treat every IS_ERR() return value from
devm_gpiod_get_optional() as a reason to bail-out while we should only
set power_ctrl_enabled to false on the two models mentioned above. While
at it: use dev_err_probe() to save a LOC.
Cc: stable@vger.kernel.org Fixes: 3d05fc82237a ("Bluetooth: qca: set power_ctrl_enabled on NULL returned by gpiod_get_optional()") Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Tested-by: Hsin-chen Chuang <chharry@chromium.org> Reviewed-by: Hsin-chen Chuang <chharry@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Setting up wakeup IRQ handler is not really critical, because the
handler is empty, so just log the informational message so user could
submit proper bug report and silences the clang warning.
Fixes: c50b56664e48 ("Bluetooth: btnxpuart: Implement host-wakeup feature") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
In 'mgmt_hci_cmd_sync()', check whether the size of parameters passed
in 'struct mgmt_cp_hci_cmd_sync' matches the total size of the data
(i.e. 'sizeof(struct mgmt_cp_hci_cmd_sync)' plus trailing bytes).
Otherwise, large invalid 'params_len' will cause 'hci_cmd_sync_alloc()'
to do 'skb_put_data()' from an area beyond the one actually passed to
'mgmt_hci_cmd_sync()'.
Reported-by: syzbot+5fe2d5bfbfbec0b675a0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5fe2d5bfbfbec0b675a0 Fixes: 827af4787e74 ("Bluetooth: MGMT: Add initial implementation of MGMT_OP_HCI_CMD_SYNC") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Oliver Neukum [Wed, 28 May 2025 11:03:54 +0000 (13:03 +0200)]
net: usb: aqc111: debug info before sanitation
If we sanitize error returns, the debug statements need
to come before that so that we don't lose information.
Signed-off-by: Oliver Neukum <oneukum@suse.com> Fixes: 405b0d610745 ("net: usb: aqc111: fix error handling of usbnet read calls") Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Wed, 28 May 2025 09:36:19 +0000 (11:36 +0200)]
net: lan966x: Make sure to insert the vlan tags also in host mode
When running these commands on DUT (and similar at the other end)
ip link set dev eth0 up
ip link add link eth0 name eth0.10 type vlan id 10
ip addr add 10.0.0.1/24 dev eth0.10
ip link set dev eth0.10 up
ping 10.0.0.2
The ping will fail.
The reason why is failing is because, the network interfaces for lan966x
have a flag saying that the HW can insert the vlan tags into the
frames(NETIF_F_HW_VLAN_CTAG_TX). Meaning that the frames that are
transmitted don't have the vlan tag inside the skb data, but they have
it inside the skb. We already get that vlan tag and put it in the IFH
but the problem is that we don't configure the HW to rewrite the frame
when the interface is in host mode.
The fix consists in actually configuring the HW to insert the vlan tag
if it is different than 0.
The "freq" variable is in terms of MHz and "max_val_cycles" is in terms
of Hz. The fact that "max_val_cycles" is a u64 suggests that support
for high frequency is intended but the "freq_khz * 1000" would overflow
the u32 type if we went above 4GHz. Use unsigned long long type for the
mutliplication to prevent that.
Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according to HW frequency") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/aDbFHe19juIJKjsb@stanley.mountain Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Identify the cause of the suspend/resume hang: netif_carrier_off()
is called during link state changes and becomes stuck while
executing linkwatch_work().
To resolve this issue, call netif_device_detach() during the Ethernet
suspend process to temporarily detach the network device from the
kernel and prevent the suspend/resume hang.
Fixes: 8c7bd5a454ff ("net: ethernet: mtk-star-emac: new driver") Signed-off-by: Yanqing Wang <ot_yanqing.wang@mediatek.com> Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com> Signed-off-by: Biao Huang <biao.huang@mediatek.com> Link: https://patch.msgid.link/20250528075351.593068-1-macpaul.lin@mediatek.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net: tipc: fix refcount warning in tipc_aead_encrypt
syzbot reported a refcount warning [1] caused by calling get_net() on
a network namespace that is being destroyed (refcount=0). This happens
when a TIPC discovery timer fires during network namespace cleanup.
The recently added get_net() call in commit e279024617134 ("net/tipc:
fix slab-use-after-free Read in tipc_aead_encrypt_done") attempts to
hold a reference to the network namespace. However, if the namespace
is already being destroyed, its refcount might be zero, leading to the
use-after-free warning.
Replace get_net() with maybe_get_net(), which safely checks if the
refcount is non-zero before incrementing it. If the namespace is being
destroyed, return -ENODEV early, after releasing the bearer reference.
David Howells [Tue, 27 May 2025 15:01:43 +0000 (16:01 +0100)]
rxrpc: Fix return from none_validate_challenge()
Fix the return value of none_validate_challenge() to be explicitly true
(which indicates the source packet should simply be discarded) rather than
implicitly true (because rxrpc_abort_conn() always returns -EPROTO which
gets converted to true).
Note that this change doesn't change the behaviour of the code (which is
correct by accident) and, in any case, we *shouldn't* get a CHALLENGE
packet to an rxnull connection (ie. no security).
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lists.infradead.org/pipermail/linux-afs/2025-April/009738.html Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Simon Horman <horms@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/10720.1748358103@warthog.procyon.org.uk Fixes: 5800b1cf3fd8 ("rxrpc: Allow CHALLENGEs to the passed to the app for a RESPONSE") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Alok Tiwari [Tue, 27 May 2025 13:08:16 +0000 (06:08 -0700)]
gve: Fix RX_BUFFERS_POSTED stat to report per-queue fill_cnt
Previously, the RX_BUFFERS_POSTED stat incorrectly reported the
fill_cnt from RX queue 0 for all queues, resulting in inaccurate
per-queue statistics.
Fix this by correctly indexing priv->rx[idx].fill_cnt for each RX queue.
Quentin Schulz [Tue, 27 May 2025 11:56:23 +0000 (13:56 +0200)]
net: stmmac: platform: guarantee uniqueness of bus_id
bus_id is currently derived from the ethernetX alias. If one is missing
for the device, 0 is used. If ethernet0 points to another stmmac device
or if there are 2+ stmmac devices without an ethernet alias, then bus_id
will be 0 for all of those.
This is an issue because the bus_id is used to generate the mdio bus id
(new_bus->id in drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
stmmac_mdio_register) and this needs to be unique.
This allows to avoid needing to define ethernet aliases for devices with
multiple stmmac controllers (such as the Rockchip RK3588) for multiple
stmmac devices to probe properly.
Obviously, the bus_id isn't guaranteed to be stable across reboots if no
alias is set for the device but that is easily fixed by simply adding an
alias if this is desired.
Fixes: 25c83b5c2e82 ("dt:net:stmmac: Add support to dwmac version 3.610 and 3.710") Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250527-stmmac-mdio-bus_id-v2-1-a5ca78454e3c@cherry.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
echo_skb_max should define the supported upper limit of echo_skb[]
allocated inside the netdevice's priv. The corresponding size value
provided by this driver to alloc_candev() is KVASER_PCIEFD_CAN_TX_MAX_COUNT
which is 17.
But later echo_skb_max is rounded up to the nearest power of two (for the
max case, that would be 32) and the tx/ack indices calculated further
during tx/rx may exceed the upper array boundary. Kasan reported this for
the ack case inside kvaser_pciefd_handle_ack_packet(), though the xmit
function has actually caught the same thing earlier.
BUG: KASAN: slab-out-of-bounds in kvaser_pciefd_handle_ack_packet+0x2d7/0x92a drivers/net/can/kvaser_pciefd.c:1528
Read of size 8 at addr ffff888105e4f078 by task swapper/4/0
Tx max count definitely matters for kvaser_pciefd_tx_avail(), but for seq
numbers' generation that's not the case - we're free to calculate them as
would be more convenient, not taking tx max count into account. The only
downside is that the size of echo_skb[] should correspond to the max seq
number (not tx max count), so in some situations a bit more memory would
be consumed than could be.
Thus make the size of the underlying echo_skb[] sufficient for the rounded
max tx value.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Dong Chenchen [Tue, 27 May 2025 11:41:52 +0000 (19:41 +0800)]
page_pool: Fix use-after-free in page_pool_recycle_in_ring
syzbot reported a uaf in page_pool_recycle_in_ring:
BUG: KASAN: slab-use-after-free in lock_release+0x151/0xa30 kernel/locking/lockdep.c:5862
Read of size 8 at addr ffff8880286045a0 by task syz.0.284/6943
page_pool can be free while page pool recycle the last page in ring.
Add producer-lock barrier to page_pool_release to prevent the page
pool from being free before all pages have been recycled.
recycle_stat_inc() is empty when CONFIG_PAGE_POOL_STATS is not
enabled, which will trigger Wempty-body build warning. Add definition
for pool stat macro to fix warning.
Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250513083123.3514193-1-dongchenchen2@huawei.com Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Reported-by: syzbot+204a4382fcb3311f3858@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=204a4382fcb3311f3858 Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250527114152.3119109-1-dongchenchen2@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Qasim Ijaz [Mon, 26 May 2025 18:36:07 +0000 (19:36 +0100)]
net: ch9200: fix uninitialised access during mii_nway_restart
In mii_nway_restart() the code attempts to call
mii->mdio_read which is ch9200_mdio_read(). ch9200_mdio_read()
utilises a local buffer called "buff", which is initialised
with control_read(). However "buff" is conditionally
initialised inside control_read():
if (err == size) {
memcpy(data, buf, size);
}
If the condition of "err == size" is not met, then
"buff" remains uninitialised. Once this happens the
uninitialised "buff" is accessed and returned during
ch9200_mdio_read():
return (buff[0] | buff[1] << 8);
The problem stems from the fact that ch9200_mdio_read()
ignores the return value of control_read(), leading to
uinit-access of "buff".
To fix this we should check the return value of
control_read() and return early on error.
Reported-by: syzbot <syzbot+3361c2d6f78a3e0892f9@syzkaller.appspotmail.com> Closes: https://syzkaller.appspot.com/bug?extid=3361c2d6f78a3e0892f9 Tested-by: syzbot <syzbot+3361c2d6f78a3e0892f9@syzkaller.appspotmail.com> Fixes: 4a476bd6d1d9 ("usbnet: New driver for QinHeng CH9200 devices") Cc: stable@vger.kernel.org Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Link: https://patch.msgid.link/20250526183607.66527-1-qasdev00@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tengteng Yang [Tue, 27 May 2025 03:04:19 +0000 (11:04 +0800)]
Fix sock_exceed_buf_limit not being triggered in __sk_mem_raise_allocated
When a process under memory pressure is not part of any cgroup and
the charged flag is false, trace_sock_exceed_buf_limit was not called
as expected.
This regression was introduced by commit 2def8ff3fdb6 ("sock:
Code cleanup on __sk_mem_raise_allocated()"). The fix changes the
default value of charged to true while preserving existing logic.
Fixes: 2def8ff3fdb6 ("sock: Code cleanup on __sk_mem_raise_allocated()") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Signed-off-by: Tengteng Yang <yangtengteng@bytedance.com> Link: https://patch.msgid.link/20250527030419.67693-1-yangtengteng@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 28 May 2025 22:52:42 +0000 (15:52 -0700)]
Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
- Fix and improve BTF deduplication of identical BTF types (Alan
Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and
Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on
riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton
Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)
* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
bpf, arm64: Remove unused-but-set function and variable.
selftests/bpf: Add tests with stack ptr register in conditional jmp
bpf: Do not include stack ptr register in precision backtracking bookkeeping
selftests/bpf: enable many-args tests for arm64
bpf, arm64: Support up to 12 function arguments
bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
bpf: Avoid __bpf_prog_ret0_warn when jit fails
bpftool: Add support for custom BTF path in prog load/loadall
selftests/bpf: Add unit tests with __bpf_trap() kfunc
bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
bpf: Remove special_kfunc_set from verifier
selftests/bpf: Add test for open coded dmabuf_iter
selftests/bpf: Add test for dmabuf_iter
bpf: Add open coded dmabuf iterator
bpf: Add dmabuf iterator
dma-buf: Rename debugfs symbols
bpf: Fix error return value in bpf_copy_from_user_dynptr
libbpf: Use mmap to parse vmlinux BTF from sysfs
selftests: bpf: Add a test for mmapable vmlinux BTF
btf: Allow mmap of vmlinux btf
...
Linus Torvalds [Wed, 28 May 2025 22:24:36 +0000 (15:24 -0700)]
Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
Linus Torvalds [Wed, 28 May 2025 21:55:35 +0000 (14:55 -0700)]
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The headline feature is the re-enablement of support for Arm's
Scalable Matrix Extension (SME) thanks to a bumper crop of fixes
from Mark Rutland.
If matrices aren't your thing, then Ryan's page-table optimisation
work is much more interesting.
Summary:
ACPI, EFI and PSCI:
- Decouple Arm's "Software Delegated Exception Interface" (SDEI)
support from the ACPI GHES code so that it can be used by platforms
booted with device-tree
- Remove unnecessary per-CPU tracking of the FPSIMD state across EFI
runtime calls
- Fix a node refcount imbalance in the PSCI device-tree code
CPU Features:
- Ensure register sanitisation is applied to fields in ID_AA64MMFR4
- Expose AIDR_EL1 to userspace via sysfs, primarily so that KVM
guests can reliably query the underlying CPU types from the VMM
- Re-enabling of SME support (CONFIG_ARM64_SME) as a result of fixes
to our context-switching, signal handling and ptrace code
Entry code:
- Hook up TIF_NEED_RESCHED_LAZY so that CONFIG_PREEMPT_LAZY can be
selected
Memory management:
- Prevent BSS exports from being used by the early PI code
- Propagate level and stride information to the low-level TLB
invalidation routines when operating on hugetlb entries
- Use the page-table contiguous hint for vmap() mappings with
VM_ALLOW_HUGE_VMAP where possible
- Optimise vmalloc()/vmap() page-table updates to use "lazy MMU mode"
and hook this up on arm64 so that the trailing DSB (used to publish
the updates to the hardware walker) can be deferred until the end
of the mapping operation
- Extend mmap() randomisation for 52-bit virtual addresses (on par
with 48-bit addressing) and remove limited support for
randomisation of the linear map
Perf and PMUs:
- Add support for probing the CMN-S3 driver using ACPI
- Minor driver fixes to the CMN, Arm-NI and amlogic PMU drivers
Selftests:
- Fix FPSIMD and SME tests to align with the freshly re-enabled SME
support
- Fix default setting of the OUTPUT variable so that tests are
installed in the right location
vDSO:
- Replace raw counter access from inline assembly code with a call to
the the __arch_counter_get_cntvct() helper function
Miscellaneous:
- Add some missing header inclusions to the CCA headers
- Rework rendering of /proc/cpuinfo to follow the x86-approach and
avoid repeated buffer expansion (the user-visible format remains
identical)
- Remove redundant selection of CONFIG_CRC32
- Extend early error message when failing to map the device-tree
blob"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (83 commits)
arm64: cputype: Add cputype definition for HIP12
arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again
perf/arm-cmn: Add CMN S3 ACPI binding
arm64/boot: Disallow BSS exports to startup code
arm64/boot: Move global CPU override variables out of BSS
arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
perf/arm-cmn: Initialise cmn->cpu earlier
kselftest/arm64: Set default OUTPUT path when undefined
arm64: Update comment regarding values in __boot_cpu_mode
arm64: mm: Drop redundant check in pmd_trans_huge()
arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1
arm64/mm: Permit lazy_mmu_mode to be nested
arm64/mm: Disable barrier batching in interrupt contexts
arm64/cpuinfo: only show one cpu's info in c_show()
arm64/mm: Batch barriers when updating kernel mappings
mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
arm64/mm: Support huge pte-mapped pages in vmap
mm/vmalloc: Gracefully unmap huge ptes
mm/vmalloc: Warn on improper use of vunmap_range()
arm64/mm: Hoist barriers out of set_ptes_anysz() loop
...
Linus Torvalds [Wed, 28 May 2025 21:52:12 +0000 (14:52 -0700)]
Merge tag 'nios2_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux
Pull nios2 updates fromDinh Nguyen:
- Use strscpy() and simply setup_cpuinfo()
- Remove conflicting mappings when flushing tlb entries
- Force update_mmu_cache on spurious pagefaults
* tag 'nios2_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
nios2: Replace strcpy() with strscpy() and simplify setup_cpuinfo()
nios2: do not introduce conflicting mappings when flushing tlb entries
nios2: force update_mmu_cache on spurious tlb-permission--related pagefaults
Linus Torvalds [Wed, 28 May 2025 20:36:38 +0000 (13:36 -0700)]
Merge tag 'jfs-6.16' of github.com:kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A few small fixes for jfs"
* tag 'jfs-6.16' of github.com:kleikamp/linux-shaggy:
jfs: fix array-index-out-of-bounds read in add_missing_indices
jfs: Fix null-ptr-deref in jfs_ioc_trim
jfs: validate AG parameters in dbMount() to prevent crashes