Petr Machata [Thu, 17 Oct 2024 09:45:46 +0000 (11:45 +0200)]
selftests: RED: Use defer for test cleanup
Instead of having a suite of dedicated cleanup functions, use the defer
framework to schedule cleanups right as their setup functions are run.
The sleep after stop_traffic() in mlxsw selftests is necessary, but
scheduling it as "defer sleep; defer stop_traffic" is silly. Instead, add a
local helper to stop traffic and sleep afterwards.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Machata [Thu, 17 Oct 2024 09:45:45 +0000 (11:45 +0200)]
selftests: forwarding: lib: Allow passing PID to stop_traffic()
Now that it is possible to schedule a deferral of stop_traffic() right
after the traffic is started, we do not have to rely on the %% magic to
kill the background process that was started last. Instead we can just give
the PID explicitly. This makes it possible to start other background
processes after the traffic is started without confusing the cleanup.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Machata [Thu, 17 Oct 2024 09:45:44 +0000 (11:45 +0200)]
selftests: forwarding: Add a fallback cleanup()
Consistent use of defers obviates the need for a separate test-specific
cleanup function -- everything is just taken care of in defers. So in this
patch, introduce a cleanup() helper in the forwarding lib.sh, which calls
just pre_cleanup() and defer_scopes_cleanup(). Selftests are obviously
still free to override the function.
Since pre_cleanup() is too entangled with forwarding-specific minutiae, the
function cannot currently be in net/lib.sh.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Petr Machata [Thu, 17 Oct 2024 09:45:43 +0000 (11:45 +0200)]
selftests: net: lib: Introduce deferred commands
In commit 8510801a9dbd ("selftests: drv-net: add ability to schedule
cleanup with defer()"), a defer helper was added to Python selftests.
The idea is to keep cleanup commands close to their dirtying counterparts,
thereby making it more transparent what is cleaning up what, making it
harder to miss a cleanup, and making the whole cleanup business exception
safe. All these benefits are applicable to bash as well; exception safety
can be interpreted in terms of safety vs. a SIGINT.
This patch therefore introduces a framework of several helpers that serve
to schedule cleanups in bash selftests:
- defer_scope_push(), defer_scope_pop(): Deferred statements can be batched
together in scopes. When a scope is popped, the deferred commands
scheduled in that scope are executed in the reverse of the order in which
they were scheduled.
- defer(): Schedules a defer to the most recently pushed scope (or the
default scope if none was pushed.)
- defer_prio(): Schedules a defer on the priority track. The priority defer
queue is run before the default defer queue when a scope is popped.
The issue that this is addressing is specifically the one of restoring
devlink shared buffer threshold type. When setting up static thresholds,
one has to first change the threshold type to static, then override the
individual thresholds. When cleaning up, it would be natural to reset the
threshold values first, then change the threshold type. But the values
that are valid for dynamic thresholds are generally invalid for static
thresholds and vice versa. Attempts to restore the values first would be
bounced. Thus one has to first reset the threshold type, then adjust the
thresholds.
(You could argue that the shared buffer threshold type API is broken and
you would be right, but here we are.)
This cannot be solved by pure defers easily. I considered making it
possible to disable an existing defer, so that one could then schedule a
new defer and disable the original. But this forward-shifting of the
defer job would have to take place after every threshold-adjusting
command, which would make it very awkward to schedule these jobs.
- defer_scopes_cleanup(): Pops any unpopped scopes, including the default
one. The selftests that use defer should run this in their exit trap.
This is important to get cleanups of interrupted scripts.
- in_defer_scope(): Sometimes a function would like to introduce a new
defer scope, then run whatever it is that it wants to run, and then pop
the scope to run the deferred cleanups. The helper in_defer_scope() can
be used to run another command within such environment, such that any
scheduled defers run after the command finishes.
The framework is added as a separate file lib/sh/defer.sh so that it can be
used by all bash selftests, including those that do not currently use
lib.sh. lib.sh however includes the file by default, because ideally all
tests would use these helpers instead of hand-rolling their cleanups.
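The ordering rules described above can be modelled in a few lines. The sketch below is a user-space C model of the semantics only, not the bash helpers from lib/sh/defer.sh: it keeps one scope with a default track and a priority track, and popping the scope runs the priority track first, then the default track, each in reverse order of scheduling. Running the priority track in reverse order is an assumption, and the restore_threshold_*() callbacks and the trace buffer are hypothetical stand-ins for the devlink shared buffer cleanups.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEFERS 16

typedef void (*defer_fn)(void);

/* One defer scope: a default track and a priority track. */
struct defer_scope {
	defer_fn normal[MAX_DEFERS];
	size_t n_normal;
	defer_fn prio[MAX_DEFERS];
	size_t n_prio;
};

/* Schedule a cleanup on the default track. */
void defer(struct defer_scope *s, defer_fn fn)
{
	assert(s->n_normal < MAX_DEFERS);
	s->normal[s->n_normal++] = fn;
}

/* Schedule a cleanup on the priority track. */
void defer_prio(struct defer_scope *s, defer_fn fn)
{
	assert(s->n_prio < MAX_DEFERS);
	s->prio[s->n_prio++] = fn;
}

/* Pop the scope: priority track first, then the default track,
 * each in reverse order of scheduling. */
void defer_scope_pop(struct defer_scope *s)
{
	while (s->n_prio > 0)
		s->prio[--s->n_prio]();
	while (s->n_normal > 0)
		s->normal[--s->n_normal]();
}

/* Hypothetical cleanups that only record the order they ran in. */
char trace[8];
size_t trace_len;

static void record(char c)
{
	assert(trace_len + 1 < sizeof(trace));
	trace[trace_len++] = c;
	trace[trace_len] = '\0';
}

void restore_threshold_type(void) { record('T'); }
void restore_threshold_a(void) { record('a'); }
void restore_threshold_b(void) { record('b'); }
```

If the threshold type restore is scheduled with defer_prio() and the two threshold restores with defer(), popping the scope runs the type restore first even though it was scheduled first, which is the ordering the shared buffer case needs.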
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Abhishek Chauhan [Wed, 16 Oct 2024 23:43:13 +0000 (16:43 -0700)]
net: stmmac: Programming sequence for VLAN packets with split header
Currently, the reset state configuration of split header works fine for
non-tagged packets and we see no corruption in payload of any size.
We need an additional programming sequence with reset configuration to
handle VLAN tagged packets to avoid corruption in payload for packets
of size greater than 256 bytes.
Without this change the ping application complains about corruption
in payload when the size of the VLAN packet exceeds 256 bytes.
With this change tagged and non-tagged packets of any size work fine
and there is no corruption seen.
Current configuration which has the issue for VLAN packet
----------------------------------------------------------
Split happens at the position at Layer 3 header
|MAC-DA|MAC-SA|Vlan Tag|Ether type|IP header|IP data|Rest of the payload|
2 bytes ^
|
With the fix we are making sure that the split happens now at
Layer 2 which is end of ethernet header and start of IP payload
IP traffic split
-----------------
Bits which take care of this are SPLM and SPLOFST
SPLM = Split mode is set to Layer 2
SPLOFST = These bits indicate the value of offset from the beginning
of Length/Type field at which header split should take place when the
appropriate SPLM is selected. Reset value is 2 bytes.
Un-tagged data (without VLAN)
|MAC-DA|MAC-SA|Ether type|IP header|IP data|Rest of the payload|
                 2 bytes ^
                         |
Tagged data (with VLAN)
|MAC-DA|MAC-SA|VLAN Tag|Ether type|IP header|IP data|Rest of the payload|
                          2 bytes ^
                                  |
Non-IP traffic split such as AV packets
------------------------------------
Bits which take care of this are
SAVE = Split AV Enable
SAVO = Split AV Offset, similar to SPLOFST but this is for AVTP
packets.
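As a worked example of the SPLOFST arithmetic, the sketch below computes where a Layer 2 split would land for tagged and untagged frames. It is a user-space model, not stmmac driver code: the function name and constants are hypothetical, it assumes a single 802.1Q tag with TPID 0x8100, and it assumes the Length/Type field is the one that follows the VLAN tag, as drawn in the diagrams above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_ADDR_BYTES 6      /* MAC-DA and MAC-SA are 6 bytes each */
#define VLAN_TAG_BYTES 4      /* 802.1Q tag: TPID + TCI */
#define VLAN_TPID      0x8100 /* assumed: single 802.1Q tag only */
#define SPLOFST_RESET  2      /* reset value quoted in the text */

/* Return the byte offset from the start of the frame at which the
 * header/payload split would land: the start of the Length/Type
 * field plus the SPLOFST offset. */
size_t l2_split_offset(const uint8_t *frame, size_t len, size_t splofst)
{
	size_t lt_off = 2 * MAC_ADDR_BYTES;
	uint16_t tpid;

	assert(frame != NULL && len >= lt_off + 2);

	/* The two bytes after the MAC addresses are either the
	 * Length/Type field or the TPID of a VLAN tag. */
	tpid = (uint16_t)((frame[lt_off] << 8) | frame[lt_off + 1]);
	if (tpid == VLAN_TPID)
		lt_off += VLAN_TAG_BYTES;

	return lt_off + splofst;
}
```

With the reset offset of 2 bytes this gives 14 for an untagged frame and 18 for a tagged one, so in both cases the split falls right after the Ether type field.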
Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_af_ops is alive during inflight RTM_SETLINK
even when its module is being unloaded.
Let's use SRCU to protect ops.
rtnl_af_lookup() now iterates rtnl_af_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_af_put()
to release the pointer after the use.
Also, rtnl_af_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_SETLINK requests to
complete.
Note that rtnl_af_ops needs to be protected by its dedicated lock
when RTNL is removed.
Note also that BUG_ON() in do_setlink() is changed to the normal
error handling as a different af_ops might be found after
validate_linkmsg().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rtnetlink: Call rtnl_link_get_net_capable() in do_setlink().
We will push RTNL down to rtnl_setlink().
RTM_SETLINK could call rtnl_link_get_net_capable() in do_setlink()
to move a dev to a new netns, but the netns needs to be fetched before
holding rtnl_net_lock().
Let's move it to rtnl_setlink() and pass the netns to do_setlink().
Now, RTM_NEWLINK paths (rtnl_changelink() and rtnl_group_changelink())
can pass the prefetched netns to do_setlink().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rtnetlink: Call rtnl_link_get_net_capable() in rtnl_newlink().
As a prerequisite of per-netns RTNL, we must fetch netns before
looking up dev or moving it to another netns.
rtnl_link_get_net_capable() is called in rtnl_newlink_create() and
do_setlink(), but both of them need to be moved to the RTNL-independent
region, which will be rtnl_newlink().
Let's call rtnl_link_get_net_capable() in rtnl_newlink() and pass the
netns down to where needed.
Note that the latter two have not passed the netns to do_setlink() yet
but will do so after the remaining rtnl_link_get_net_capable() is moved
to rtnl_setlink() later.
While at it, dest_net is renamed to tgt_net in rtnl_newlink_create() to
align with rtnl_{del,set}link().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rtnetlink: Protect struct rtnl_link_ops with SRCU.
Once RTNL is replaced with rtnl_net_lock(), we need a mechanism to
guarantee that rtnl_link_ops is alive during inflight RTM_NEWLINK
even when its module is being unloaded.
Let's use SRCU to protect ops.
rtnl_link_ops_get() now iterates link_ops under RCU and returns
SRCU-protected ops pointer. The caller must call rtnl_link_ops_put()
to release the pointer after the use.
Also, __rtnl_link_unregister() unlinks the ops first and calls
synchronize_srcu() to wait for inflight RTM_NEWLINK requests to
complete.
Note that link_ops needs to be protected by its dedicated lock
when RTNL is removed.
Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
rtnetlink: Move rtnl_link_ops_get() and retry to rtnl_newlink().
Currently, if neither dev nor rtnl_link_ops is found in __rtnl_newlink(),
we release RTNL and redo the whole process after request_module(), which
complicates the logic.
The ops will be RTNL-independent later.
Let's move the ops lookup to rtnl_newlink() and do the retry earlier.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
net/mlx5: Refactor esw QoS to support generalized operations
This patch series from the team to the mlx5 core driver consists of one main
QoS part followed by small misc patches.
This main part (patches 1 to 11) by Carolina refactors the QoS handling
to generalize operations on scheduling groups and vports. These changes
are necessary to support new features that will extend group
functionality, introduce new group types, and support deeper
hierarchies.
Additionally, this refactor updates the terminology from "group" to
"node" to better reflect the hardware’s rate hierarchy and its use
of scheduling element nodes.
Simplify group scheduling element creation:
- net/mlx5: Refactor QoS group scheduling element creation
Refactor to support generalized operations for QoS:
- net/mlx5: Introduce node type to rate group structure
- net/mlx5: Add parent group support in rate group structure
- net/mlx5: Restrict domain list insertion to root TSAR ancestors
- net/mlx5: Rename vport QoS group reference to parent
- net/mlx5: Introduce node struct and rename group terminology to node
- net/mlx5: Refactor vport scheduling element creation function
- net/mlx5: Refactor vport QoS to use scheduling node structure
- net/mlx5: Remove vport QoS enabled flag
Support generalized operations for QoS elements:
- net/mlx5: Simplify QoS scheduling element configuration
- net/mlx5: Generalize QoS operations for nodes and vports
On top, patch 12 by Moshe handles FW request to move to drop mode.
In patch 13, Benjamin Poirier removes an empty eswitch flow table when
not used, which improves packet processing performance.
Patches 14 and 15 by Moshe are small field renamings as preparation for
future fields addition to these structures.
Moshe Shemesh [Wed, 16 Oct 2024 17:36:17 +0000 (20:36 +0300)]
net/mlx5: fs, rename modify header struct member action
As preparation for HW Steering support, rename modify header struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code lines shorter and
more readable.
Moshe Shemesh [Wed, 16 Oct 2024 17:36:16 +0000 (20:36 +0300)]
net/mlx5: fs, rename packet reformat struct member action
As preparation for HW Steering support, rename packet reformat struct
member action to fs_dr_action, to distinguish from fs_hws_action which
will be added. Add a pointer where needed to keep code lines shorter and
more readable.
Benjamin Poirier [Wed, 16 Oct 2024 17:36:15 +0000 (20:36 +0300)]
net/mlx5: Only create VEPA flow table when in VEPA mode
Currently, when VFs are created, two flow tables are added for the eswitch:
the "fdb" table, which contains rules for each VF and the "vepa_fdb" table.
In the default VEB mode, the vepa_fdb table is empty. When switching to
VEPA mode, flow steering rules are added to vepa_fdb. Even though the
vepa_fdb table is empty in VEB mode, its presence adds some cost to packet
processing. In some workloads, this leads to drops which are reported by
the rx_discards_phy ethtool counter.
In order to improve performance, only create vepa_fdb when in VEPA mode.
Tests were done on a ConnectX-6 Lx adapter forwarding 64B packets between
both ports using dpdk-testpmd. Numbers are Rx-pps for each port, as
reported by testpmd.
Moshe Shemesh [Wed, 16 Oct 2024 17:36:14 +0000 (20:36 +0300)]
net/mlx5: Add sync reset drop mode support
On sync reset flow, firmware may request a PF, which already
acknowledged the unload event, to move to drop mode. Drop mode means
that this PF will reduce polling frequency, as this PF is not going to
have another active part in the reset, but only reload back after the
reset.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Carolina Jubran [Wed, 16 Oct 2024 17:36:13 +0000 (20:36 +0300)]
net/mlx5: Generalize QoS operations for nodes and vports
Refactor QoS normalization and rate calculation functions to operate
on mlx5_esw_sched_node, allowing for generalized handling of both
vports and nodes.
Carolina Jubran [Wed, 16 Oct 2024 17:36:11 +0000 (20:36 +0300)]
net/mlx5: Remove vport QoS enabled flag
Remove the `enabled` flag from the `vport->qos` struct, as QoS now
relies solely on the `sched_node` pointer to determine whether QoS
features are in use.
Currently, the vport `qos` struct consists only of the `sched_node`,
introducing an unnecessary two-level reference. However, the qos struct
is retained as it will be extended in future patches to support new QoS
features.
Carolina Jubran [Wed, 16 Oct 2024 17:36:10 +0000 (20:36 +0300)]
net/mlx5: Refactor vport QoS to use scheduling node structure
Refactor the vport QoS structure by moving group membership and
scheduling details into the `mlx5_esw_sched_node` structure.
This change consolidates the vport into the rate hierarchy by unifying
the handling of different types of scheduling element nodes.
In addition, add a direct reference to the mlx5_vport within the
mlx5_esw_sched_node structure, to ensure that the vport is easily
accessible when a scheduling node is associated with a vport.
Carolina Jubran [Wed, 16 Oct 2024 17:36:08 +0000 (20:36 +0300)]
net/mlx5: Introduce node struct and rename group terminology to node
Introduce the `mlx5_esw_sched_node` struct, consolidating all rate
hierarchy related details, including membership and scheduling
parameters.
Since the group concept aligns with the `mlx5_esw_sched_node`, replace
the `mlx5_esw_rate_group` struct with it and rename the "group"
terminology to "node" throughout the rate hierarchy.
All relevant code paths and structures have been updated to use the
"node" terminology accordingly, laying the groundwork for future
patches that will unify the handling of different types of members
within the rate hierarchy.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Carolina Jubran [Wed, 16 Oct 2024 17:36:07 +0000 (20:36 +0300)]
net/mlx5: Rename vport QoS group reference to parent
Rename the `group` field in the `mlx5_vport` structure to `parent` to
clarify the vport's role as a member of a parent group and distinguish
it from the concept of a general group.
Additionally, rename `group_entry` to `parent_entry` to reflect this
update.
This distinction will be important for handling more complex group
structures and scheduling elements.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Carolina Jubran [Wed, 16 Oct 2024 17:36:06 +0000 (20:36 +0300)]
net/mlx5: Restrict domain list insertion to root TSAR ancestors
Update the logic for adding rate groups to the E-Switch domain list,
ensuring only groups with the root Transmit Scheduling Arbiter as their
parent are included.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Daniel Machon <daniel.machon@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:14 +0000 (20:35 +0800)]
net: hibmcge: Implement some ethtool_ops functions
Implement the .get_drvinfo .get_link .get_link_ksettings to get
the basic information and working status of the driver.
Implement the .set_link_ksettings to modify the rate, duplex,
and auto-negotiation status.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:13 +0000 (20:35 +0800)]
net: hibmcge: Implement rx_poll function to receive packets
Implement rx_poll function to read the rx descriptor after
receiving the rx interrupt. Adjust the skb based on the
descriptor to complete the reception of the packet.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:12 +0000 (20:35 +0800)]
net: hibmcge: Implement .ndo_start_xmit function
Implement .ndo_start_xmit function to fill the information of the packet
to be transmitted into the tx descriptor, and then the hardware will
transmit the packet using the information in the tx descriptor.
In addition, we also implemented the tx_handler function to enable the
tx descriptor to be reused, and .ndo_tx_timeout function to print some
information when the hardware is busy.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:11 +0000 (20:35 +0800)]
net: hibmcge: Implement some .ndo functions
Implement the .ndo_open(), .ndo_stop(), .ndo_set_mac_address()
and .ndo_change_mtu() functions.
In addition, .ndo_validate_addr calls the eth_validate_addr() function directly.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:10 +0000 (20:35 +0800)]
net: hibmcge: Add interrupt supported in this module
The driver supports four interrupts: TX interrupt, RX interrupt,
mdio interrupt, and error interrupt.
Actually, the driver does not use the mdio interrupt.
Therefore, the driver does not request the mdio interrupt.
The error interrupt distinguishes different error information
by using different masks. To distinguish different errors,
the statistics count is added for each error.
To ensure the consistency of the code process, masks are added for the
TX interrupt and RX interrupt.
This patch implements interrupt request, and provides a
unified entry for the interrupt handler function. However,
the specific interrupt handler function of each interrupt
is not implemented currently.
Because of pcim_enable_device(), the interrupt vector
is already device managed and does not need to be freed explicitly.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:08 +0000 (20:35 +0800)]
net: hibmcge: Add read/write registers supported through the bar space
Add support for reading and writing registers through the PCI BAR space.
Some driver parameters, such as mac_id, are determined by the
board form. Therefore, these parameters are initialized
from the register as device specifications.
The device specification registers are initialized and written by the BMC.
The driver reads these registers when loading.
Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jijie Shao [Tue, 15 Oct 2024 12:35:07 +0000 (20:35 +0800)]
net: hibmcge: Add pci table supported in this module
Add pci table supported in this module, and implement pci_driver function
to initialize this driver.
hibmcge is a passthrough network device. Its software runs
on the host side, and the MAC hardware runs on the BMC side
to reduce the host CPU area. The software interacts with the
MAC hardware through the PCIe.
WangYuli [Fri, 18 Oct 2024 02:19:10 +0000 (10:19 +0800)]
eth: Fix typo 'accelaration'. 'exprienced' and 'rewritting'
There are some spelling mistakes of 'accelaration', 'exprienced' and
'rewritting' in comments which should be 'acceleration', 'experienced'
and 'rewriting'.
Suggested-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/all/20241017162846.GA51712@kernel.org/ Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <90D42CB167CA0842+20241018021910.31359-1-wangyuli@uniontech.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Heiner Kallweit [Thu, 17 Oct 2024 20:27:44 +0000 (22:27 +0200)]
r8169: enable EEE at 2.5G per default on RTL8125B
Register a6d/12 is shadowing register MDIO_AN_EEE_ADV2. So this line
disables advertisement of EEE at 2.5G. Latest vendor driver r8125
doesn't do this (any longer?), so this mode seems to be safe.
EEE saves quite some energy, therefore enable this mode per default.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <95dd5a0c-09ea-4847-94d9-b7aa3063e8ff@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Heiner Kallweit [Thu, 17 Oct 2024 16:01:13 +0000 (18:01 +0200)]
net: phy: realtek: add RTL8125D-internal PHY
The first boards show up with Realtek's RTL8125D. This MAC/PHY chip
comes with an integrated 2.5Gbps PHY with ID 0x001cc841. It's not
clear yet whether there's an external version of this PHY and what
Realtek calls it, therefore use the numeric id for now.
Heiner Kallweit [Wed, 16 Oct 2024 20:29:39 +0000 (22:29 +0200)]
r8169: avoid duplicated messages if loading firmware fails and switch to warn level
In case of a problem with firmware loading we inform at the driver level,
in addition the firmware load code itself issues warnings. Therefore
switch to firmware_request_nowarn() to avoid duplicated error messages.
In addition switch to warn level because the firmware is optional and
typically just fixes compatibility issues.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <d9c5094c-89a6-40e2-b5fe-8df7df4624ef@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Heiner Kallweit [Wed, 16 Oct 2024 20:06:53 +0000 (22:06 +0200)]
r8169: replace custom flag with disable_work() et al
So far we use a custom flag to define when a task can be scheduled and
when not. Let's use the standard mechanism with disable_work() et al
instead.
Note that in rtl8169_close() we can remove the call to cancel_work()
because we now call disable_work_sync() in rtl8169_down() already.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Heiner Kallweit [Wed, 16 Oct 2024 20:05:57 +0000 (22:05 +0200)]
r8169: don't take RTNL lock in rtl_task()
There's not really a benefit here in taking the RTNL lock. The task
handler does exception handling only, so we're in trouble anyway when
we come here, and there's no need to protect against e.g. a parallel
ethtool call.
A benefit of removing the RTNL lock here is that we now can
synchronously cancel the workqueue from a context holding the RTNL mutex.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
fbnic fails to link as built-in when PTP support is in a loadable
module:
aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_ethtool.o: in function `fbnic_get_ts_info':
fbnic_ethtool.c:(.text+0x428): undefined reference to `ptp_clock_index'
aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_time_start':
fbnic_time.c:(.text+0x820): undefined reference to `ptp_schedule_worker'
aarch64-linux-ld: drivers/net/ethernet/meta/fbnic/fbnic_time.o: in function `fbnic_ptp_setup':
fbnic_time.c:(.text+0xa68): undefined reference to `ptp_clock_register'
Menglong Dong [Tue, 15 Oct 2024 09:02:44 +0000 (17:02 +0800)]
net: vxlan: update the document for vxlan_snoop()
The function vxlan_snoop() returns drop reasons now, so update its
documentation too.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Tue, 15 Oct 2024 08:28:30 +0000 (16:28 +0800)]
net: vxlan: replace VXLAN_INVALID_HDR with VNI_NOT_FOUND
Replace the drop reason "SKB_DROP_REASON_VXLAN_INVALID_HDR" with
"SKB_DROP_REASON_VXLAN_VNI_NOT_FOUND" in encap_bypass_if_local(), as the
latter is more accurate.
Fixes: 790961d88b0e ("net: vxlan: use kfree_skb_reason() in encap_bypass_if_local()") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Tue, 15 Oct 2024 07:58:09 +0000 (09:58 +0200)]
net: airoha: Fix typo in REG_CDM2_FWD_CFG configuration
Fix typo in airoha_fe_init routine configuring CDM2_OAM_QSEL_MASK field
of REG_CDM2_FWD_CFG register.
This bug is not introducing any user visible problem since Frame Engine
CDM2 port is used just by the second QDMA block and we currently enable
just QDMA1 block connected to the MT7530 dsa switch via CDM1 port.
Introduced by commit 23020f049327 ("net: airoha: Introduce ethernet
support for EN7581 SoC")
Reported-by: ChihWei Cheng <chihwei.cheng@airoha.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <20241015-airoha-eth-cdm2-fixes-v1-1-9dc6993286c3@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Paul Barker [Tue, 15 Oct 2024 13:36:31 +0000 (14:36 +0100)]
net: ravb: Simplify UDP TX checksum offload
The GbEth IP will pass through a zero UDP checksum without asserting any
error flags so we do not need to resort to software checksum calculation
in this case.
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Paul Barker [Tue, 15 Oct 2024 13:36:30 +0000 (14:36 +0100)]
net: ravb: Disable IP header TX checksum offloading
For IPv4 packets, the header checksum will always be calculated in software
in the TX path (Documentation/networking/checksum-offloads.rst says "No
offloading of the IP header checksum is performed; it is always done in
software.") so there is no advantage in asking the hardware to also
calculate this checksum.
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Paul Barker [Tue, 15 Oct 2024 13:36:29 +0000 (14:36 +0100)]
net: ravb: Simplify types in RX csum validation
The hardware checksum value is used as a 16-bit flag, it is zero when
the checksum has been validated and non-zero otherwise. Therefore we
don't need to treat this as an actual __wsum type or call csum_unfold(),
we can just use a u16 pointer.
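A minimal sketch of treating the hardware value as a plain 16-bit flag follows. This is a user-space model, not the ravb code: the function name is hypothetical, and it assumes the hardware appends the 16-bit value as the last two bytes of the received data. Since only "zero or non-zero" matters, no byte order conversion or checksum folding is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return true when the trailing 16-bit hardware value is zero,
 * which the text describes as "checksum validated". */
bool rx_hw_csum_ok(const uint8_t *data, size_t len)
{
	uint16_t flag;

	assert(data != NULL);
	if (len < sizeof(flag))
		return false;

	/* memcpy avoids alignment problems; a zero test gives the same
	 * answer in either byte order. */
	memcpy(&flag, data + len - sizeof(flag), sizeof(flag));
	return flag == 0;
}
```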
Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Paul Barker [Tue, 15 Oct 2024 13:36:28 +0000 (14:36 +0100)]
net: ravb: Combine if conditions in RX csum validation
We can merge the two if conditions on skb_is_nonlinear(). Since
skb_frag_size_sub() and skb_trim() do not free memory, it is still safe
to access the trimmed bytes at the end of the packet after these calls.
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Paul Barker [Tue, 15 Oct 2024 13:36:26 +0000 (14:36 +0100)]
net: ravb: Disable IP header RX checksum offloading
For IPv4 packets, the header checksum will always be checked in software
in the RX path (inet_gro_receive() calls ip_fast_csum() unconditionally)
so there is no advantage in asking the hardware to also calculate this
checksum.
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Andy Shevchenko [Wed, 16 Oct 2024 09:05:54 +0000 (12:05 +0300)]
tg3: Increase buffer size for IRQ label
GCC is not happy with the current code, e.g.:
.../tg3.c:11313:37: error: ‘-txrx-’ directive output may be truncated writing 6 bytes into a region of size between 1 and 16 [-Werror=format-truncation=]
11313 | "%s-txrx-%d", tp->dev->name, irq_num);
      |    ^~~~~~
.../tg3.c:11313:34: note: using the range [-2147483648, 2147483647] for directive argument
11313 | "%s-txrx-%d", tp->dev->name, irq_num);
When `make W=1` is supplied, this prevents kernel building. Fix it by
increasing the buffer size for IRQ label and using sizeof() instead of
hard coded constants.
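The warning can be reproduced in spirit with a small user-space sketch. The names and sizes below are assumptions for illustration and are not the values used in tg3: IFNAME_MAX stands in for the interface name limit, and IRQ_LABEL_MAX is a worst-case bound derived from the format string. The point is that the buffer is sized for the longest possible output and callers pass sizeof() rather than a literal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed: interface name limit including the terminating NUL. */
#define IFNAME_MAX 16
/* Worst case: 15 name chars + "-txrx-" (6) + "-2147483648" (11) + NUL. */
#define IRQ_LABEL_MAX (IFNAME_MAX - 1 + 6 + 11 + 1)

/* Format the label; returns what snprintf() returns, i.e. the length
 * the full string would need, excluding the NUL. */
int format_irq_label(char *buf, size_t size, const char *name, int irq_num)
{
	assert(buf != NULL && size > 0 && name != NULL);
	return snprintf(buf, size, "%s-txrx-%d", name, irq_num);
}

/* snprintf() truncated the output if it needed size bytes or more. */
int irq_label_truncated(int ret, size_t size)
{
	return ret < 0 || (size_t)ret >= size;
}
```

A caller would declare `char label[IRQ_LABEL_MAX];` and pass `sizeof(label)`, so the buffer size and the bound given to snprintf() cannot drift apart.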
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Message-ID: <20241016090647.691022-1-andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
With DSA's implementation of the mac_select_pcs() method removed, we
can now remove the detection of mac_select_pcs() implementation.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
net: phylink: remove use of pl->pcs in phylink_validate_mac_and_pcs()
When the mac_select_pcs() method is not implemented, there is no way
for pl->pcs to be set to a non-NULL value. This was here to support
the old phylink_set_pcs() method which was removed a few years
ago. Simplify the code in phylink_validate_mac_and_pcs().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
net: phylink: allow mac_select_pcs() to remove a PCS
phylink has historically not permitted a PCS to be removed. An attempt
to permit this with phylink_set_pcs() resulted in comments indicating
that there was no need for this. This behaviour has been propagated
forward to the mac_select_pcs() approach as it was believed from these
comments that changing this would be NAK'd.
However, with mac_select_pcs(), it takes more code and thus complexity
to maintain this behaviour, which can - and in this case has - resulted
in a bug. If mac_select_pcs() returns NULL for a particular interface
type, but there is already a PCS in-use, then we skip the pcs_validate()
method, but continue using the old PCS. Also, it wouldn't be expected
behaviour by implementers of mac_select_pcs().
Allow this by removing this old unnecessary restriction.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
There is no longer any reason to implement the mac_select_pcs()
callback in DSA. Returning ERR_PTR(-EOPNOTSUPP) is functionally
equivalent to not providing the function.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Andy Shevchenko [Wed, 16 Oct 2024 13:25:26 +0000 (16:25 +0300)]
net: ks8851: use %*ph to print small buffer
Use %*ph format to print small buffer as hex string. It will change
the output format from 32-bit words to byte hexdump, but this is not
critical as it's only a debug message.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org>
Message-ID: <20241016132615.899037-1-andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
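For readers unfamiliar with the byte-hexdump layout, the following userspace sketch emulates the space-separated "xx xx xx" output that %*ph produces in the kernel. The helper hex_bytes() is a hypothetical name used only for illustration; %*ph itself is available only through the kernel's printk family.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Format len bytes of buf as lowercase hex pairs separated by single
 * spaces. Stops early, keeping only whole bytes, if out is too small.
 * Returns out.
 */
static char *hex_bytes(char *out, size_t out_len,
		       const uint8_t *buf, size_t len)
{
	size_t pos = 0;

	if (out_len == 0)
		return out;
	out[0] = '\0';

	for (size_t i = 0; i < len; i++) {
		int n = snprintf(out + pos, out_len - pos,
				 i ? " %02x" : "%02x", buf[i]);

		if (n < 0 || (size_t)n >= out_len - pos) {
			out[pos] = '\0';	/* drop the partial byte */
			break;
		}
		pos += (size_t)n;
	}
	return out;
}
```

The output is independent of host byte order, unlike a dump of the same buffer read as 32-bit words.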
Simon Horman [Wed, 16 Oct 2024 14:31:14 +0000 (15:31 +0100)]
net: usb: sr9700: only store little-endian values in __le16 variable
In sr_mdio_read() the local variable res is used to store both
little-endian and host byte order values. This prevents Sparse
from helping us by flagging when endianness mismatches occur - the
detection process hinges on the type of variables matching the
byte order of values stored in them.
Address this by adding a new local variable, word, to store little-endian
values; change the type of res to int, and use it to store host-byte
order values.
Flagged by Sparse as:
.../sr9700.c:205:21: warning: incorrect type in assignment (different base types)
.../sr9700.c:205:21: expected restricted __le16 [addressable] [usertype] res
.../sr9700.c:205:21: got int
.../sr9700.c:207:21: warning: incorrect type in assignment (different base types)
.../sr9700.c:207:21: expected restricted __le16 [addressable] [usertype] res
.../sr9700.c:207:21: got int
.../sr9700.c:212:16: warning: incorrect type in return expression (different base types)
.../sr9700.c:212:16: expected int
.../sr9700.c:212:16: got restricted __le16 [addressable] [usertype] res
Compile tested only.
No functional change intended.
Signed-off-by: Simon Horman <horms@kernel.org>
Message-ID: <20241016-blackbird-le16-v1-1-97ba8de6b38f@kernel.org> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
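The separation described above, one variable for the little-endian value and another for the host-order result, can be sketched in portable userspace C. All names here (fake_read_le_word(), le16_to_host(), mdio_read_sketch()) are hypothetical; the kernel would use __le16 and le16_to_cpu() rather than a raw byte array.

```c
#include <stdint.h>

/* Hypothetical stand-in for a register read that yields little-endian bytes. */
static int fake_read_le_word(uint8_t raw[2])
{
	raw[0] = 0x34;	/* low byte first on the wire */
	raw[1] = 0x12;
	return 0;
}

/* Decode a little-endian 16-bit value regardless of host byte order. */
static uint16_t le16_to_host(const uint8_t raw[2])
{
	return (uint16_t)(raw[0] | (raw[1] << 8));
}

/*
 * 'word' only ever holds the little-endian representation and 'res'
 * only ever holds host-order values, so each variable's type matches
 * the byte order of what is stored in it.
 */
static int mdio_read_sketch(void)
{
	uint8_t word[2];
	int res;

	if (fake_read_le_word(word) < 0)
		return -1;

	res = le16_to_host(word);
	return res;
}
```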
The *ndev pointer needs to be set or it leads to an uninitialized variable
bug in the caller.
Fixes: 4a7b2ba94a59 ("net: ethernet: ti: am65-cpsw: Use tstats instead of open coded version") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Roger Quadros <rogerq@kernel.org>
Message-ID: <b168d5c7-704b-4452-84f9-1c1762b1f4ce@stanley.mountain> Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Linus Torvalds [Thu, 17 Oct 2024 16:31:18 +0000 (09:31 -0700)]
Merge tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Current release - new code bugs:
- eth: mlx5: HWS, don't destroy more bwc queue locks than allocated
Previous releases - regressions:
- ipv4: give an IPv4 dev to blackhole_netdev
- udp: compute L4 checksum as usual when not segmenting the skb
- tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().
- eth: mlx5e: don't call cleanup on profile rollback failure
- eth: microchip: vcap api: fix memory leaks in
vcap_api_encode_rule_test()
- eth: enetc: disable Tx BD rings after they are empty
- eth: macb: avoid 20s boot delay by skipping MDIO bus registration
for fixed-link PHY
Previous releases - always broken:
- posix-clock: fix missing timespec64 check in pc_clock_settime()
- genetlink: hold RCU in genlmsg_mcast()
- mptcp: prevent MPC handshake on port-based signal endpoints
- eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame
- eth: stmmac: dwmac-tegra: fix link bring-up sequence
- eth: bcmasp: fix potential memory leak in bcmasp_xmit()
Misc:
- add Andrew Lunn as a co-maintainer of all networking drivers"
* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
net/mlx5e: Don't call cleanup on profile rollback failure
net/mlx5: Unregister notifier on eswitch init failure
net/mlx5: Fix command bitmask initialization
net/mlx5: Check for invalid vector index on EQ creation
net/mlx5: HWS, use lock classes for bwc locks
net/mlx5: HWS, don't destroy more bwc queue locks than allocated
net/mlx5: HWS, fixed double free in error flow of definer layout
net/mlx5: HWS, removed wrong access to a number of rules variable
mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
net: phy: mdio-bcm-unimac: Add BCM6846 support
dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
udp: Compute L4 checksum as usual when not segmenting the skb
genetlink: hold RCU in genlmsg_mcast()
net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
...
Heiner Kallweit [Tue, 15 Oct 2024 05:47:14 +0000 (07:47 +0200)]
net: phy: realtek: merge the drivers for internal NBase-T PHY's
The Realtek RTL8125/RTL8126 NBase-T MAC/PHY chips have internal PHY's
which are register-compatible, at least for the registers we use here.
So let's use just one PHY driver to support all of them.
These internal PHY's exist also as external C45 PHY's, but on the
internal PHY's no access to MMD registers is possible. This can be
used to differentiate between the internal and external version.
As a side effect, the drivers for the two PHYs that are now external-only
no longer require read_mmd/write_mmd hooks.
Sanman Pradhan [Mon, 14 Oct 2024 15:27:09 +0000 (08:27 -0700)]
eth: fbnic: Add hardware monitoring support via HWMON interface
This patch adds support for hardware monitoring to the fbnic driver,
allowing for temperature and voltage sensor data to be exposed to
userspace via the HWMON interface. The driver registers a HWMON device
and provides callbacks for reading sensor data, enabling system
admins to monitor the health and operating conditions of fbnic.
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:08 +0000 (12:32 +0300)]
net/mlx5e: Don't call cleanup on profile rollback failure
When profile rollback fails in mlx5e_netdev_change_profile, the netdev
profile var is left set to NULL. Avoid a crash when unloading the driver
by not calling profile->cleanup in such a case.
This was encountered while testing. The original trigger was that
the wq rescuer thread creation got interrupted (presumably due to
Ctrl+C-ing modprobe), which mlx5e_priv_init converts to ENOMEM (-12).
The profile rollback then also fails for the same reason (signal still
active), so the profile is left as NULL, leading to a crash later in
_mlx5e_remove.
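The guard described above amounts to checking the profile pointer before dereferencing it. The sketch below shows that pattern with hypothetical types (struct profile, struct priv) and a hypothetical remove_sketch() function; it is not the mlx5e code, only an illustration of skipping cleanup when the profile was left as NULL.

```c
#include <stddef.h>

/* Hypothetical, simplified profile with a single cleanup hook. */
struct profile {
	void (*cleanup)(void *priv);
};

/* Hypothetical private state; 'profile' is NULL after a failed rollback. */
struct priv {
	const struct profile *profile;
	int cleaned;
};

/* Example cleanup hook that just records that it ran. */
static void mark_cleaned(void *p)
{
	((struct priv *)p)->cleaned = 1;
}

/* Only call the cleanup hook when a profile is actually attached. */
static void remove_sketch(struct priv *priv)
{
	if (priv->profile && priv->profile->cleanup)
		priv->profile->cleanup(priv);
}
```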
Shay Drory [Tue, 15 Oct 2024 09:32:06 +0000 (12:32 +0300)]
net/mlx5: Fix command bitmask initialization
The command bitmask has a dedicated bit for the MANAGE_PAGES command. This
bit isn't initialized during command bitmask initialization, only during
MANAGE_PAGES.
In addition, mlx5_cmd_trigger_completions() is trying to trigger
completion for MANAGE_PAGES command as well.
Hence, if a health error occurs before any MANAGE_PAGES command
has been invoked (for example, during mlx5_enable_hca()),
mlx5_cmd_trigger_completions() will try to trigger completion for the
MANAGE_PAGES command, which will result in a null-ptr-deref error.[1]
Fix it by initializing the command bitmask correctly.
While at it, rewrite the code to make it easier to understand.
Maher Sanalla [Tue, 15 Oct 2024 09:32:05 +0000 (12:32 +0300)]
net/mlx5: Check for invalid vector index on EQ creation
Currently, the mlx5 driver does not enforce that the vector index is lower
than the maximum number of supported completion vectors when requesting a
new completion EQ. Thus, mlx5_comp_eqn_get() fails when trying to
acquire an IRQ with an improper vector index.
To prevent the case above, enforce that vector index value is
valid and lower than maximum in mlx5_comp_eqn_get() before handling the
request.
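The check described above is an early bounds test on the vector index. A minimal sketch follows; comp_eqn_get_sketch() and its parameters are hypothetical, and returning -EINVAL is an assumption about the error code rather than something stated in this log.

```c
#include <errno.h>

/*
 * Reject an out-of-range vector index before doing any further work.
 * On success, store a placeholder EQ number (here simply the index)
 * in *eqn and return 0.
 */
static int comp_eqn_get_sketch(unsigned int vecidx,
			       unsigned int max_vectors, int *eqn)
{
	if (vecidx >= max_vectors)
		return -EINVAL;	/* assumed error code */

	*eqn = (int)vecidx;	/* placeholder for the real lookup */
	return 0;
}
```

Validating up front means the failure is reported at the request boundary instead of surfacing later as an IRQ acquisition error.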
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:04 +0000 (12:32 +0300)]
net/mlx5: HWS, use lock classes for bwc locks
The HWS BWC API uses one lock per queue and usually acquires one of
them, except when doing changes which require locking all queues in
order. Naturally, lockdep isn't too happy about acquiring the same lock
class multiple times, so inform it that each queue lock is a different
class to avoid false positives.
Cosmin Ratiu [Tue, 15 Oct 2024 09:32:03 +0000 (12:32 +0300)]
net/mlx5: HWS, don't destroy more bwc queue locks than allocated
hws_send_queues_bwc_locks_destroy destroyed more queue locks than
allocated, leading to memory corruption (occasionally) and warnings such
as DEBUG_LOCKS_WARN_ON(mutex_is_locked(lock)) in __mutex_destroy because
sometimes, the 'mutex' being destroyed was random memory.
The severity of this problem is proportional to the number of queues
configured because the code overreaches beyond the end of the
bwc_send_queue_locks array by 2x its length.
Fix that by using the correct number of bwc queues.
net/mlx5: HWS, removed wrong access to a number of rules variable
Removed the incorrect access to the num_of_rules field of the matcher.
This is a plain u32 variable, but it was accessed as if it were atomic.
This fixes the following CI warnings:
mlx5hws_bwc.c:708:17: warning: large atomic operation may incur significant performance penalty;
the access size (4 bytes) exceeds the max lock-free size (0 bytes) [-Watomic-alignment]
Fixes: 510f9f61a112 ("net/mlx5: HWS, added API and enabled HWS support") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409291101.6NdtMFVC-lkp@intel.com/ Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Itamar Gozlan <igozlan@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>