net/mlx5: HWS, do not initialize native API queues
HWS has two types of APIs:
- Native: fastest and slimmest, async API.
The user of this API is required to manage rule handles memory,
and to poll for completion for each rule.
- BWC: backward compatible API, similar semantics to SWS API.
This layer is implemented above native API and it does all
the work for the user, so that it is easy to switch between
SWS and HWS.
Right now the existing users of HWS require only BWC API.
Therefore, in order to not waste resources, this patch disables
send queues allocation for native API.
If in the future support for faster HWS rule insertion will be required
(such as for Connection Tracking), native queues can be enabled.
Mark Bloch [Thu, 19 Dec 2024 17:58:35 +0000 (19:58 +0200)]
net/mlx5: fs, retry insertion to hash table on EBUSY
When inserting into an rhashtable faster than it can grow, an -EBUSY error
may be encountered. Modify the insertion logic to retry on -EBUSY until
either a successful insertion or a genuine error is returned.
Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20241219175841.1094544-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Moshe Shemesh [Thu, 19 Dec 2024 17:58:34 +0000 (19:58 +0200)]
net/mlx5: fs, add mlx5_fs_pool API
Refactor fc_pool API to create generic fs_pool API, as HW steering has
more flow steering elements which can take advantage of the same pool of
bulks API. Change fs_counters code to use the fs_pool API.
Note, removed __counted_by from struct mlx5_fc_bulk as bulk_len is now
inner struct member. It will be added back once __counted_by can support
inner struct members.
Moshe Shemesh [Thu, 19 Dec 2024 17:58:33 +0000 (19:58 +0200)]
net/mlx5: fs, add counter object to flow destination
Currently mlx5_flow_destination includes counter_id which is assigned in
case we use flow counter on the flow steering rule. However, counter_id
is not enough data in case of using HW Steering. Thus, have mlx5_fc
object as part of mlx5_flow_destination instead of counter_id and assign
it where needed.
In case counter_id is received from user space, create a local counter
object to represent it.
Rongwei Liu [Thu, 19 Dec 2024 17:58:32 +0000 (19:58 +0200)]
net/mlx5: LAG, Support LAG over Multi-Host NICs
New multi-host NICs provide each host with partial ports,
allowing each host to maintain its unique LAG configuration.
On these multi-host NICs, the 'native_port_num' capability
is no longer continuous on each host and can exceed the
'num_lag_ports' capability. Therefore, it is necessary to
skip the PFs with ldev->pf[i].dev == NULL when querying/modifying
the lag devices' information.
There is no need to check dev.native_port_num against ldev->ports.
Rongwei Liu [Thu, 19 Dec 2024 17:58:31 +0000 (19:58 +0200)]
net/mlx5: LAG, Refactor lag logic
Wrap the lag pf access into two new macros:
1. ldev_for_each()
2. ldev_for_each_reverse()
The maximum number of lag ports and the index to `natvie_port_num`
mapping will be handled by the two new macros.
Users shouldn't use the for loop anymore.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241219175841.1094544-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Add rds ptp library for Microchip phys
Adds support for rds ptp library in Microchip phys, where rds is internal
code name for ptp IP or hardware. This library will be re-used in
Microchip phys where same ptp hardware is used. Register base addresses
and mmd may changes, due to which base addresses and mmd is made variable
in this library.
====================
Patch 1 is a non-functional preparatory cleanup.
Patch 2 is a test suite extension for picking specific tests.
Patch 3 explains the need of kmemleak scans.
Patch 4 adapts utility functions to handle MSG_ZEROCOPY.
Patches 5-6-7 add the tests.
NOTE: Test in the last patch ("vsock/test: Add test for MSG_ZEROCOPY
completion memory leak") may stop working even before this series is
merged. See changes proposed in [2]. The failslab variant would be
unaffected.
Michal Luczaj [Thu, 19 Dec 2024 09:49:31 +0000 (10:49 +0100)]
vsock/test: Adapt send_byte()/recv_byte() to handle MSG_ZEROCOPY
For a zerocopy send(), buffer (always byte 'A') needs to be preserved (thus
it can not be on the stack) or the data recv()ed check in recv_byte() might
fail.
While there, change the printf format to 0x%02x so the '\0' bytes can be
seen.
Yuyang Huang [Sat, 21 Dec 2024 10:00:07 +0000 (19:00 +0900)]
netlink: correct nlmsg size for multicast notifications
Corrected the netlink message size calculation for multicast group
join/leave notifications. The previous calculation did not account for
the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the
payload. This fix ensures that the allocated message size is
sufficient to hold all necessary information.
This patch also includes the following improvements:
* Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex.
* Uses nla_total_size(sizeof(struct in6_addr)) instead of
nla_total_size(16).
* Removes unnecessary EXPORT_SYMBOL().
Fixes: 2c2b61d2138f ("netlink: add IGMP/MLD join/leave notifications") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 20 Dec 2024 00:31:16 +0000 (16:31 -0800)]
selftests: drv-net: assume stats refresh is 0 if no ethtool -c support
Tests using HW stats wait for them to stabilize, using data from
ethtool -c as the delay. Not all drivers implement ethtool -c
so handle the errors gracefully.
YiFei Zhu [Thu, 19 Dec 2024 17:30:04 +0000 (17:30 +0000)]
sfc: Use netdev refcount tracking in struct efx_async_filter_insertion
I was debugging some netdev refcount issues in OpenOnload, and one
of the places I was looking at was in the sfc driver. Only
struct efx_async_filter_insertion was not using netdev refcount tracker,
so add it here. GFP_ATOMIC because this code path is called by
ndo_rx_flow_steer which holds RCU.
This patch should be a no-op if !CONFIG_NET_DEV_REFCNT_TRACKER
====================
net/bridge: Add skb drop reasons to the most common drop points
The bridge input code may drop frames for various reasons and at various
points in the ingress handling logic. Currently kfree_skb() is used
everywhere, and therefore no drop reason is specified. Add drop reasons
to the most common drop points.
The purpose of this series is to address the most common drop points on
the bridge ingress path. It does not exhaustively add drop reasons to
the entire bridge code. The intention here is to incrementally add drop
reasons to the rest of the bridge code in follow up patches.
Most of the skb drop points that are addressed in this series can be
easily tested by sending crafted packets. The diagram below shows a
simple test configuration, and some examples using `packit`(*) are
also included. The bridge is set up with STP disabled.
(*) https://github.com/resurrecting-open-source-projects/packit
The following changes were *not* tested:
* SKB_DROP_REASON_NOMEM in br_flood(). It's not easy to trigger an OOM
condition for testing purposes, while everything else works correctly.
* All drop reasons in br_multicast_flood(). I could not find an easy way
to make a crafted packet get there.
* SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE in br_handle_frame_finish()
when the port state is BR_STATE_DISABLED, because in that case the
frame is already dropped in the switch/case block at the end of
br_handle_frame().
Radu Rendec [Thu, 19 Dec 2024 16:36:06 +0000 (11:36 -0500)]
net: bridge: add skb drop reasons to the most common drop points
The bridge input code may drop frames for various reasons and at various
points in the ingress handling logic. Currently kfree_skb() is used
everywhere, and therefore no drop reason is specified. Add drop reasons
to the most common drop points.
Drop reasons are not added exhaustively to the entire bridge code. The
intention is to incrementally add drop reasons to the rest of the bridge
code in follow up patches.
The SKB_DROP_REASON_VXLAN_NO_REMOTE skb drop reason was introduced in
the specific context of vxlan. As it turns out, there are similar cases
when a packet needs to be dropped in other parts of the network stack,
such as the bridge module.
Rename SKB_DROP_REASON_VXLAN_NO_REMOTE and give it a more generic name,
so that it can be used in other parts of the network stack. This is not
a functional change, and the numeric value of the drop reason even
remains unchanged.
====================
Add more feautues for ENETC v4 - round 1
Compared to ENETC v1 (LS1028A), ENETC v4 (i.MX95) adds more features, and
some features are configured completely differently from v1. In order to
more fully support ENETC v4, these features will be added through several
rounds of patch sets. This round adds these features, such as Tx and Rx
checksum offload, increase maximum chained Tx BD number and Large send
offload (LSO).
Wei Fang [Thu, 19 Dec 2024 05:47:54 +0000 (13:47 +0800)]
net: enetc: add LSO support for i.MX95 ENETC PF
ENETC rev 4.1 supports large send offload (LSO), segmenting large TCP
and UDP transmit units into multiple Ethernet frames. To support LSO,
software needs to fill some auxiliary information in Tx BD, such as LSO
header length, frame length, LSO maximum segment size, etc.
At 1Gbps link rate, TCP segmentation was tested using iperf3, and the
CPU performance before and after applying the patch was compared through
the top command. It can be seen that LSO saves a significant amount of
CPU cycles compared to software TSO.
Before applying the patch:
%Cpu(s): 0.1 us, 4.1 sy, 0.0 ni, 85.7 id, 0.0 wa, 0.5 hi, 9.7 si
After applying the patch:
%Cpu(s): 0.1 us, 2.3 sy, 0.0 ni, 94.5 id, 0.0 wa, 0.4 hi, 2.6 si
Wei Fang [Thu, 19 Dec 2024 05:47:53 +0000 (13:47 +0800)]
net: enetc: update max chained Tx BD number for i.MX95 ENETC
The max chained Tx BDs of latest ENETC (i.MX95 ENETC, rev 4.1) has been
increased to 63, but since the range of MAX_SKB_FRAGS is 17~45, so for
i.MX95 ENETC and later revision, it is better to set ENETC4_MAX_SKB_FRAGS
to MAX_SKB_FRAGS.
In addition, add max_frags in struct enetc_drvdata to indicate the max
chained BDs supported by device. Because the max number of chained BDs
supported by LS1028A and i.MX95 ENETC is different.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241219054755.1615626-3-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang [Thu, 19 Dec 2024 05:47:52 +0000 (13:47 +0800)]
net: enetc: add Tx checksum offload for i.MX95 ENETC
In addition to supporting Rx checksum offload, i.MX95 ENETC also supports
Tx checksum offload. The transmit checksum offload is implemented through
the Tx BD. To support Tx checksum offload, software needs to fill some
auxiliary information in Tx BD, such as IP version, IP header offset and
size, whether L4 is UDP or TCP, etc.
Same as Rx checksum offload, Tx checksum offload capability isn't defined
in register, so tx_csum bit is added to struct enetc_drvdata to indicate
whether the device supports Tx checksum offload.
Stefano Brivio [Wed, 18 Dec 2024 16:21:16 +0000 (17:21 +0100)]
udp: Deal with race between UDP socket address change and rehash
If a UDP socket changes its local address while it's receiving
datagrams, as a result of connect(), there is a period during which
a lookup operation might fail to find it, after the address is changed
but before the secondary hash (port and address) and the four-tuple
hash (local and remote ports and addresses) are updated.
Secondary hash chains were introduced by commit 30fff9231fad ("udp:
bind() optimisation") and, as a result, a rehash operation became
needed to make a bound socket reachable again after a connect().
This operation was introduced by commit 719f835853a9 ("udp: add
rehash on connect()") which isn't however a complete fix: the
socket will be found once the rehashing completes, but not while
it's pending.
This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a
client sending datagrams to it. After the server receives the first
datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to
the address of the sender, in order to set up a directed flow.
Now, if the client, running on a different CPU thread, happens to
send a (subsequent) datagram while the server's socket changes its
address, but is not rehashed yet, this will result in a failed
lookup and a port unreachable error delivered to the client, as
apparent from the following reproducer:
where the client will eventually get ECONNREFUSED on a write()
(typically the second or third one of a given iteration):
2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused
This issue was first observed as a seldom failure in Podman's tests
checking UDP functionality while using pasta(1) to connect the
container's network namespace, which leads us to a reproducer with
the lookup error resulting in an ICMP packet on a tap device:
and, somewhat confusingly, after a connect() on the same socket
succeeded.
Until commit 4cdeeee9252a ("net: udp: prefer listeners bound to an
address"), the race between receive address change and lookup didn't
actually cause visible issues, because, once the lookup based on the
secondary hash chain failed, we would still attempt a lookup based on
the primary hash (destination port only), and find the socket with the
outdated secondary hash.
That change, however, dropped port-only lookups altogether, as side
effect, making the race visible.
To fix this, while avoiding the need to make address changes and
rehash atomic against lookups, reintroduce primary hash lookups as
fallback, if lookups based on four-tuple and secondary hashes fail.
To this end, introduce a simplified lookup implementation, which
doesn't take care of SO_REUSEPORT groups: if we have one, there are
multiple sockets that would match the four-tuple or secondary hash,
meaning that we can't run into this race at all.
v2:
- instead of synchronising lookup operations against address change
plus rehash, reintroduce a simplified version of the original
primary hash lookup as fallback
v1:
- fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr
usage (Kuniyuki Iwashima)
- directly use sk_rcv_saddr for IPv4 receive addresses instead of
fetching inet_rcv_saddr (Kuniyuki Iwashima)
- move inet_update_saddr() to inet_hashtables.h and use that
to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima)
- rebase onto net-next, update commit message accordingly
Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/24147 Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 30fff9231fad ("udp: bind() optimisation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
ipv4: Consolidate route lookups from IPv4 sockets.
Create inet_sk_init_flowi4() so that the different IPv4 code paths that
need to do a route lookup based on an IPv4 socket don't need to
reimplement that logic.
====================
Guillaume Nault [Mon, 16 Dec 2024 17:21:54 +0000 (18:21 +0100)]
l2tp: Use inet_sk_init_flowi4() in l2tp_ip_sendmsg().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in l2tp_ip_sendmsg() instead of passing parameters manually
to ip_route_output_ports().
Override ->daddr with the value passed in the msghdr structure if
provided.
Guillaume Nault [Mon, 16 Dec 2024 17:21:51 +0000 (18:21 +0100)]
ipv4: Use inet_sk_init_flowi4() in __ip_queue_xmit().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in __ip_queue_xmit() instead of passing parameters manually
to ip_route_output_ports().
Override ->flowi4_tos with the value passed as parameter since that's
required by SCTP.
Guillaume Nault [Mon, 16 Dec 2024 17:21:48 +0000 (18:21 +0100)]
ipv4: Use inet_sk_init_flowi4() in inet_csk_rebuild_route().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in inet_csk_rebuild_route() instead of passing parameters
manually to ip_route_output_ports().
Guillaume Nault [Mon, 16 Dec 2024 17:21:46 +0000 (18:21 +0100)]
ipv4: Use inet_sk_init_flowi4() in ip4_datagram_release_cb().
Use inet_sk_init_flowi4() to automatically initialise the flowi4
structure in ip4_datagram_release_cb() instead of passing parameters
manually to ip_route_output_ports().
Guillaume Nault [Mon, 16 Dec 2024 17:21:44 +0000 (18:21 +0100)]
ipv4: Define inet_sk_init_flowi4() and use it in inet_sk_rebuild_header().
IPv4 code commonly has to initialise a flowi4 structure from an IPv4
socket. This requires looking at potential IPv4 options to set the
proper destination address, call flowi4_init_output() with the correct
set of parameters and run the sk_classify_flow security hook.
Instead of reimplementing these operations in different parts of the
stack, let's define inet_sk_init_flowi4() which does all these
operations.
The first user is inet_sk_rebuild_header(), where inet_sk_init_flowi4()
replaces ip_route_output_ports(). Unlike ip_route_output_ports(), which
sets the flowi4 structure and performs the route lookup in one go,
inet_sk_init_flowi4() only initialises the flow. The route lookup is
then done by ip_route_output_flow(). Decoupling flow initialisation
from route lookup makes this new interface applicable more broadly as
it will allow some users to overwrite specific struct flowi4 members
before the route lookup.
Tristram Ha [Wed, 18 Dec 2024 02:02:40 +0000 (18:02 -0800)]
net: dsa: microchip: Do not execute PTP driver code for unsupported switches
The PTP driver code only works for certain KSZ switches like KSZ9477,
KSZ9567, LAN937X and their varieties. This code is enabled by kernel
configuration CONFIG_NET_DSA_MICROCHIP_KSZ_PTP. As the DSA driver is
common to work with all KSZ switches this PTP code is not appropriate
for other unsupported switches. The ptp_capable indication is added to
the chip data structure to signal whether to execute those code.
====================
bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING
When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for a newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.
In this patchset, have bridge react to bridge_binding toggles on VLAN
uppers.
There has been another attempt at supporting this behavior in 2022 by
Sevinj Aghayeva [0]. A discussion ensued that informed how this new
patchset is constructed, namely that the logic is in the bridge as opposed
to the 8021q driver, and the bridge reacts to NETDEV_CHANGE events on the
8021q upper.
Patches #1 and #2 contain the implementation, patches #3 and #4 a
selftest.
Petr Machata [Wed, 18 Dec 2024 17:15:58 +0000 (18:15 +0100)]
selftests: net: lib: Add a couple autodefer helpers
Alongside the helper ip_link_set_up(), one to set the link down will be
useful as well. Add a helper to determine the link state as well,
ip_link_is_up(), and use it to short-circuit any changes if the state is
already the desired one.
Petr Machata [Wed, 18 Dec 2024 17:15:57 +0000 (18:15 +0100)]
net: bridge: Handle changes in VLAN_FLAG_BRIDGE_BINDING
When bridge binding is enabled on a VLAN netdevice, its link state should
track bridge ports that are members of the corresponding VLAN. This works
for newly-added netdevices. However toggling the option does not have the
effect of enabling or disabling the behavior as appropriate.
In this patch, react to bridge_binding toggles on VLAN uppers.
Petr Machata [Wed, 18 Dec 2024 17:15:56 +0000 (18:15 +0100)]
net: bridge: Extract a helper to handle bridge_binding toggles
Currently, the BROPT_VLAN_BRIDGE_BINDING bridge option is only toggled when
VLAN devices are added on top of a bridge or removed from it. Extract the
toggling of the option to a function so that it could be invoked by a
subsequent patch when the state of an upper VLAN device changes.
With hns_dsaf_roce_reset() removed in a previous patch, the two
helper member pointers, 'hns_dsaf_roce_srst', and 'hns_dsaf_srst_chns'
are now unread.
Remove them, and the helper functions that they were initialised
to, that is hns_dsaf_srst_chns(), hns_dsaf_srst_chns_acpi(),
hns_dsaf_roce_srst() and hns_dsaf_roce_srst_acpi().
====================
xdp: a fistful of generic changes pt. III
XDP for idpf is currently 5.(6) chapters:
* convert Rx to libeth;
* convert Tx and stats to libeth;
* generic XDP and XSk code changes;
* generic XDP and XSk code additions pt. 1;
* generic XDP and XSk code additions pt. 2 (you are here);
* actual XDP for idpf via new libeth_xdp;
* XSk for idpf (via ^).
Part III.3 does the following:
* adds generic functions to build skbs from xdp_buffs (regular and
XSk) and attach frags to xdp_buffs (regular and XSk);
* adds helper to optimize XSk xmit in drivers.
Everything is prereq for libeth_xdp, but will be useful standalone
as well: less code in drivers, faster XSk XDP_PASS, smaller object
code.
====================
Same as with converting &xdp_buff to skb on Rx, the code which allocates
a new skb and copies the XSk frame there is identical across the
drivers, so make it generic. This includes copying all the frags if they
are present in the original buff.
System percpu page_pools greatly improve XDP_PASS performance on XSk:
instead of page_alloc() + page_free(), the net core recycles the same
pages, so the only overhead left is memcpy()s. When the Page Pool is
not compiled in, the whole function is a return-NULL (but it always
gets selected when eBPF is enabled).
Note that the passed buff gets freed if the conversion is done w/o any
error, assuming you don't need this buffer after you convert it to an
skb.
xsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag()
Currently, xsk_buff_add_frag() only adds the frag to pool's linked list,
not doing anything with the &xdp_buff. The drivers do that manually and
the logic is the same.
Make it really add an skb frag, just like xdp_buff_add_frag() does that,
and freeing frags on error if needed. This allows to remove repeating
code from i40e and ice and not add the same code again and again.
The code which builds an skb from an &xdp_buff keeps multiplying itself
around the drivers with almost no changes. Let's try to stop that by
adding a generic function.
Unlike __xdp_build_skb_from_frame(), always allocate an skbuff head
using napi_build_skb() and make use of the available xdp_rxq pointer to
assign the Rx queue index. In case of PP-backed buffer, mark the skb to
be recycled, as every PP user's been switched to recycle skbs.
The code piece which would attach a frag to &xdp_buff is almost
identical across the drivers supporting XDP multi-buffer on Rx.
Make it a generic elegant "oneliner".
Also, I see lots of drivers calculating frags_truesize as
`xdp->frame_sz * nr_frags`. I can't say this is fully correct, since
frags might be backed by chunks of different sizes, especially with
stuff like the header split. Even page_pool_alloc() can give you two
different truesizes on two subsequent requests to allocate the same
buffer size. Add a field to &skb_shared_info (unionized as there's no
free slot currently on x86_64) to track the "true" truesize. It can
be used later when updating the skb.
Similarly to other _dev shorthands, add one for page_pool_alloc_netmem()
to allocate a netmem using the default Rx GFP flags (ATOMIC | NOWARN) to
make the page -> netmem transition of drivers easier.
Furong Xu [Wed, 18 Dec 2024 08:34:07 +0000 (16:34 +0800)]
net: stmmac: Drop useless code related to ethtool rx-copybreak
After commit 2af6106ae949 ("net: stmmac: Introducing support for Page
Pool"), the driver always copies frames to get a better performance,
zero-copy for RX frames is no more, then these code turned to be
useless and users of ethtool may get confused about the unhandled
rx-copybreak parameter.
This patch mostly reverts
commit 22ad38381547 ("stmmac: do not perform zero-copy for rx frames")
Lorenzo Bianconi [Mon, 16 Dec 2024 17:47:33 +0000 (18:47 +0100)]
net: airoha: Fix error path in airoha_probe()
Do not run napi_disable() if airoha_hw_init() fails since Tx/Rx napi
has not been started yet. In order to fix the issue, introduce
airoha_qdma_stop_napi routine and remove napi_disable in
airoha_hw_cleanup().
Jakub Kicinski [Fri, 20 Dec 2024 03:07:54 +0000 (19:07 -0800)]
Merge branch 'net-add-and-use-phy_disable_eee'
Heiner Kallweit says:
====================
net: add and use phy_disable_eee
If a MAC driver doesn't support EEE, then the PHY shouldn't advertise it.
Add phy_disable_eee() for this purpose, and use it in cpsw driver.
====================
It seems the cpsw MAC doesn't support EEE. See e.g. the commit message of ce2899428ec0 ("ARM: dts: am335x-baltos: disable EEE for Atheros 8035 PHY").
There are cases where this causes issues if the PHY's on both sides have
negotiated EEE. As a workaround EEE modes of the PHY are marked broken
in DT, effectively disabling EEE advertisement.
Improve this by using new function phy_disable_eee() in the MAC driver.
This properly disables EEE advertisement, and allows to remove the
eee-broken-xxx properties from DT. As EEE is disabled anyway, we can
remove also the set_eee ethtool op.
Jakub Kicinski [Fri, 20 Dec 2024 02:54:07 +0000 (18:54 -0800)]
Merge tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.14
Multi-Link Operation implementation continues, both in stack and in
drivers. Otherwise it has been relatively quiet.
Major changes:
cfg80211/mac80211
- define wiphy guard
- get TX power per link
- EHT 320 MHz channel support for mesh
ath11k
- QCA6698AQ support
ath9k
- RX inactivity detection
rtl8xxxu
- add more USB device IDs
rtw88
- add more USB device IDs
- enable USB RX aggregation and USB 3 to improve performance
rtw89
- PowerSave flow for Multi-Link Operation
* tag 'wireless-next-2024-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (121 commits)
wifi: wlcore: sysfs: constify 'struct bin_attribute'
wifi: brcmfmac: clarify unmodifiable headroom log message
wifi: brcmfmac: add missing header include for brcmf_dbg
wifi: brcmsmac: add gain range check to wlc_phy_iqcal_gainparams_nphy()
wifi: qtnfmac: fix spelling error in core.h
wifi: rtw89: phy: add dummy C2H event handler for report of TAS power
wifi: rtw89: 8851b: rfk: remove unnecessary assignment of return value of _dpk_dgain_read()
wifi: rtw89: 8852c: rfk: refine target channel calculation in _rx_dck_channel_calc()
wifi: rtlwifi: pci: wait for firmware loading before releasing memory
wifi: rtlwifi: fix memory leaks and invalid access at probe error path
wifi: rtlwifi: destroy workqueue at rtl_deinit_core
wifi: rtlwifi: remove unused check_buddy_priv
wifi: rtw89: 8922a: update format of RFK pre-notify H2C command v2
wifi: rtw89: regd: update regulatory map to R68-R51
wifi: rtw89: 8852c: disable ER SU when 4x HE-LTF and 0.8 GI capability differ
wifi: rtw89: disable firmware training HE GI and LTF
wifi: rtw89: ps: update data for firmware and settings for hardware before/after PS
wifi: rtw89: ps: refactor channel info to firmware before entering PS
wifi: rtw89: ps: refactor PS flow to support MLO
wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
...
====================
Jakub Kicinski [Wed, 18 Dec 2024 02:44:00 +0000 (18:44 -0800)]
net: netlink: catch attempts to send empty messages
syzbot can figure out a way to redirect a netlink message to a tap.
Sending empty skbs to devices is not valid and we end up hitting
a skb_assert_len() in __dev_queue_xmit().
Make catching these mistakes easier, assert the skb size directly
in netlink core.
Tristram Ha [Wed, 18 Dec 2024 02:03:11 +0000 (18:03 -0800)]
net: dsa: microchip: Add suspend/resume support to KSZ DSA driver
The KSZ DSA driver starts a timer to read MIB counters periodically to
avoid count overrun. During system suspend this will give an error for
not able to write to register as the SPI system returns an error when
it is in suspend state. This implementation stops the timer when the
system goes into suspend and restarts it when resumed.
Jakub Kicinski [Fri, 20 Dec 2024 01:30:03 +0000 (17:30 -0800)]
Merge branch 'bnxt_en-driver-update'
Michael Chan says:
====================
bnxt_en: Driver update
The first patch configures context memory for RoCE resources based
on FW limits. The next 4 patches restrict certain ethtool
operations when they are not supported. The last patch adds Pavan
Chebbi as co-maintainer of the driver.
Michael Chan [Tue, 17 Dec 2024 18:26:19 +0000 (10:26 -0800)]
bnxt_en: Skip reading PXP registers during ethtool -d if unsupported
Newer firmware does not allow reading the PXP registers during
ethtool -d, so skip the firmware call in that case. Userspace
(bnxt.c) always expects the register block to be populated so
zeroes will be returned instead.
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20241217182620.2454075-6-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Chan [Tue, 17 Dec 2024 18:26:17 +0000 (10:26 -0800)]
bnxt_en: Skip PHY loopback ethtool selftest if unsupported by FW
Skip PHY loopback selftest if firmware advertises that it is unsupported
in the HWRM_PORT_PHY_QCAPS call. Only show PHY loopback test result to
be 0 if the test has run and passes. Do the same for external loopback
to be consistent.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20241217182620.2454075-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hongguang Gao [Tue, 17 Dec 2024 18:26:15 +0000 (10:26 -0800)]
bnxt_en: Use FW defined resource limits for RoCE
If FW supports setting resource limits for RoCE, then just use the
FW limits instead of using some fixed values in the driver. These
limits will be used to allocate context memory for QP, SRQ, AH, and
MR resources for RoCE.
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20241217182620.2454075-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* tag 'net-6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
net: mctp: handle skb cleanup on sock_queue failures
net: mdiobus: fix an OF node reference leak
octeontx2-pf: fix error handling of devlink port in rvu_rep_create()
octeontx2-pf: fix netdev memory leak in rvu_rep_create()
psample: adjust size if rate_as_probability is set
netdev-genl: avoid empty messages in queue dump
net: dsa: restore dsa_software_vlan_untag() ability to operate on VLAN-untagged traffic
selftests: openvswitch: fix tcpdump execution
net: usb: qmi_wwan: add Quectel RG255C
net: phy: avoid undefined behavior in *_led_polarity_set()
netfilter: ipset: Fix for recursive locking warning
ipvs: Fix clamp() of ip_vs_conn_tab on small memory systems
can: m_can: fix missed interrupts with m_can_pci
can: m_can: set init flag earlier in probe
rtnetlink: Try the outer netns attribute in rtnl_get_peer_net().
net: netdevsim: fix nsim_pp_hold_write()
idpf: trigger SW interrupt when exiting wb_on_itr mode
idpf: add support for SW triggered interrupts
qed: fix possible uninit pointer read in qed_mcp_nvm_info_populate()
net: ethernet: bgmac-platform: fix an OF node reference leak
...
Linus Torvalds [Thu, 19 Dec 2024 16:53:51 +0000 (08:53 -0800)]
Merge tag 'mmc-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
- mtk-sd: Cleanup the wakeup configuration in error/remove-path
- sdhci-tegra: Correct quirk for ADMA2 length
* tag 'mmc-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: mtk-sd: disable wakeup in .remove() and in the error path of .probe()
mmc: sdhci-tegra: Remove SDHCI_QUIRK_BROKEN_ADMA_ZEROLEN_DESC quirk
Linus Torvalds [Thu, 19 Dec 2024 16:50:05 +0000 (08:50 -0800)]
Merge tag 'pwm/for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux
Pull pwm fix from Uwe Kleine-König:
"Fix regression in pwm-stm32 driver when converting to new waveform
support
Fabrice Gasnier found and fixed a regression I introduced with
v6.13-rc1 when converting the stm32 pwm driver to support the new
waveform stuff. On some hardware variants this completely broke the
driver"
* tag 'pwm/for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
pwm: stm32: Fix complementary output in round_waveform_tohw()
Linus Torvalds [Thu, 19 Dec 2024 16:45:37 +0000 (08:45 -0800)]
Merge tag 'v6.13-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Two fixes for better handling maximum outstanding requests
- Fix simultaneous negotiate protocol race
* tag 'v6.13-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: conn lock to serialize smb2 negotiate
ksmbd: fix broken transfers when exceeding max simultaneous operations
ksmbd: count all requests in req_running counter
====================
net: fib_rules: Add flow label selector support
In some deployments users would like to encode path information into
certain bits of the IPv6 flow label, the UDP source port and the DSCP
and use this information to route packets accordingly.
Redirecting traffic to a routing table based on the flow label is not
currently possible with Linux as FIB rules cannot match on it despite
the flow label being available in the IPv6 flow key.
This patchset extends FIB rules to match on the flow label with a mask.
Future patches will add mask attributes to L4 ports and DSCP matches.
Patches #1-#5 gradually extend FIB rules to match on the flow label.
Patches #6-#7 allow user space to specify a flow label in route get
requests. This is useful for both debugging and testing.
Patch #8 adjusts the fib6_table_lookup tracepoint to print the flow
label to the trace buffer for better observability.
Patch #9 extends the FIB rule selftest with flow label test cases while
utilizing the route get functionality from patch #6.
====================
Ido Schimmel [Mon, 16 Dec 2024 17:12:01 +0000 (19:12 +0200)]
selftests: fib_rule_tests: Add flow label selector match tests
Add tests for the new FIB rule flow label selector. Test both good and bad
flows and with both input and output routes.
# ./fib_rule_tests.sh
IPv6 FIB rule tests
[...]
TEST: rule6 check: flowlabel redirect to table [ OK ]
TEST: rule6 check: flowlabel no redirect to table [ OK ]
TEST: rule6 del by pref: flowlabel redirect to table [ OK ]
TEST: rule6 check: iif flowlabel redirect to table [ OK ]
TEST: rule6 check: iif flowlabel no redirect to table [ OK ]
TEST: rule6 del by pref: iif flowlabel redirect to table [ OK ]
TEST: rule6 check: flowlabel masked redirect to table [ OK ]
TEST: rule6 check: flowlabel masked no redirect to table [ OK ]
TEST: rule6 del by pref: flowlabel masked redirect to table [ OK ]
TEST: rule6 check: iif flowlabel masked redirect to table [ OK ]
TEST: rule6 check: iif flowlabel masked no redirect to table [ OK ]
TEST: rule6 del by pref: iif flowlabel masked redirect to table [ OK ]
[...]
Tests passed: 268
Tests failed: 0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Mon, 16 Dec 2024 17:12:00 +0000 (19:12 +0200)]
tracing: ipv6: Add flow label to fib6_table_lookup tracepoint
The different parameters affecting the IPv6 route lookup are printed to
the trace buffer by the fib6_table_lookup tracepoint. Add the IPv6 flow
label for better observability as it can affect the route lookup both in
terms of multipath hash calculation and policy based routing (FIB
rules). Example:
# echo 1 > /sys/kernel/tracing/events/fib6/fib6_table_lookup/enable
# ip -6 route get ::1 flowlabel 0x12345 ipproto udp sport 12345 dport 54321 &> /dev/null
# cat /sys/kernel/tracing/trace_pipe
ip-358 [010] ..... 44.897484: fib6_table_lookup: table 255 oif 0 iif 1 proto 17 ::/12345 -> ::1/54321 flowlabel 0x12345 tos 0 scope 0 flags 0 ==> dev lo gw :: err 0
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Mon, 16 Dec 2024 17:11:58 +0000 (19:11 +0200)]
ipv6: Add flow label to route get requests
The default IPv6 multipath hash policy takes the flow label into account
when calculating a multipath hash and previous patches added a flow
label selector to IPv6 FIB rules.
Allow user space to specify a flow label in route get requests by adding
a new netlink attribute and using its value to populate the "flowlabel"
field in the IPv6 flow info structure prior to a route lookup.
Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in
rtm_to_fib6_config() and returning an error if present.
A subsequent patch will use this capability to test the new flow label
selector in IPv6 FIB rules.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Mon, 16 Dec 2024 17:11:56 +0000 (19:11 +0200)]
net: fib_rules: Enable flow label selector usage
Now that both IPv4 and IPv6 correctly handle the new flow label
attributes, enable user space to configure FIB rules that make use of
the flow label by changing the policy to stop rejecting them and
accepting 32 bit values in big-endian byte order.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Mon, 16 Dec 2024 17:11:55 +0000 (19:11 +0200)]
ipv6: fib_rules: Add flow label support
Implement support for the new flow label selector which allows IPv6 FIB
rules to match on the flow label with a mask. Ensure that both flow
label attributes are specified (or none) and that the mask is valid.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Ido Schimmel [Mon, 16 Dec 2024 17:11:54 +0000 (19:11 +0200)]
ipv4: fib_rules: Reject flow label attributes
IPv4 FIB rules cannot match on flow label so reject requests that try to
add such rules. Do that in the IPv4 configure callback as the netlink
policy resides in the core and used by both IPv4 and IPv6.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add new FIB rule attributes which will allow user space to match on the
IPv6 flow label with a mask. Temporarily set the type of the attributes
to 'NLA_REJECT' while support is being added in the IPv6 code.
Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jeremy Kerr [Wed, 18 Dec 2024 03:53:01 +0000 (11:53 +0800)]
net: mctp: handle skb cleanup on sock_queue failures
Currently, we don't use the return value from sock_queue_rcv_skb, which
means we may leak skbs if a message is not successfully queued to a
socket.
Instead, ensure that we're freeing the skb where the sock hasn't
otherwise taken ownership of the skb by adding checks on the
sock_queue_rcv_skb() to invoke a kfree on failure.
In doing so, rather than using the 'rc' value to trigger the
kfree_skb(), use the skb pointer itself, which is more explicit.
Also, add a kunit test for the sock delivery failure cases.
Joe Hattori [Wed, 18 Dec 2024 03:51:06 +0000 (12:51 +0900)]
net: mdiobus: fix an OF node reference leak
fwnode_find_mii_timestamper() calls of_parse_phandle_with_fixed_args()
but does not decrement the refcount of the obtained OF node. Add an
of_node_put() call before returning from the function.
This bug was detected by an experimental static analysis tool that I am
developing.
Fixes: bc1bee3b87ee ("net: mdiobus: Introduce fwnode_mdiobus_register_phy()") Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20241218035106.1436405-1-joe@pf.is.s.u-tokyo.ac.jp Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 19 Dec 2024 08:55:21 +0000 (09:55 +0100)]
Merge tag 'nf-24-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following series contains two fixes for Netfilter/IPVS:
1) Possible build failure in IPVS on systems with less than 512MB
memory due to incorrect use of clamp(), from David Laight.
2) Fix bogus lockdep nesting splat with ipset list:set type,
from Phil Sutter.
netfilter pull request 24-12-19
* tag 'nf-24-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ipset: Fix for recursive locking warning
ipvs: Fix clamp() of ip_vs_conn_tab on small memory systems
====================