Jacob Keller [Fri, 28 Oct 2022 11:04:16 +0000 (04:04 -0700)]
ptp: mlx5: convert to .adjfine and adjust_by_scaled_ppm
The mlx5 implementation of .adjfreq is implemented in terms of a
straightforward "base * ppb / 1 billion" calculation.
Convert this to the .adjfine interface and use adjust_by_scaled_ppm for the
calculation of the new mult value.
Note that the mlx5_ptp_adjfreq_real_time function expects input in terms of
ppb, so use the scaled_ppm_to_ppb to convert before passing to this
function.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Shirly Ohnona <shirlyo@nvidia.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Cc: Gal Pressman <gal@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Aya Levin <ayal@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
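The scaled ppm to ppb conversion mentioned in the mlx5 change above can be sketched in plain C. This is a userspace illustration only, not the kernel's scaled_ppm_to_ppb(); the name ex_scaled_ppm_to_ppb is hypothetical. It assumes scaled ppm is ppm with a 16-bit binary fractional part, so ppb = scaled_ppm * 1000 / 2^16, which reduces to scaled_ppm * 125 / 2^13.

```c
#include <stdint.h>

/*
 * Hypothetical userspace sketch (not the kernel helper): convert a
 * frequency adjustment from scaled ppm to ppb.
 *
 * scaled_ppm is ppm with a 16-bit binary fraction, so
 *   ppb = scaled_ppm * 1000 / 65536 = scaled_ppm * 125 / 8192
 * Integer division truncates toward zero here.
 */
static int64_t ex_scaled_ppm_to_ppb(int64_t scaled_ppm)
{
	return scaled_ppm * 125 / 8192;
}
```

For example, a request of 65536 scaled ppm (exactly 1 ppm) maps to 1000 ppb.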
Jacob Keller [Fri, 28 Oct 2022 11:04:15 +0000 (04:04 -0700)]
ptp: mlx4: convert to .adjfine and adjust_by_scaled_ppm
The mlx4 implementation of .adjfreq is implemented in terms of a
straightforward "base * ppb / 1 billion" calculation.
Convert this driver to .adjfine and use adjust_by_scaled_ppm to perform the
calculation.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Cc: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Fri, 28 Oct 2022 11:04:14 +0000 (04:04 -0700)]
drivers: convert unsupported .adjfreq to .adjfine
A few PTP drivers implement a .adjfreq handler which indicates the
operation is not supported. Convert all of these to .adjfine.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Dexuan Cui <decui@microsoft.com> Cc: Vivek Thampi <vithampi@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Fri, 28 Oct 2022 11:04:13 +0000 (04:04 -0700)]
ptp: introduce helpers to adjust by scaled parts per million
Many drivers implement the .adjfreq or .adjfine PTP op function with the
same basic logic:
1. Determine a base frequency value
2. Multiply this by the abs() of the requested adjustment, then divide by
the appropriate divisor (1 billion for ppb, or 65.536 billion for scaled ppm).
3. Add or subtract this difference from the base frequency to calculate a
new adjustment.
A few drivers need the difference and direction rather than the combined
new increment value.
I recently converted the Intel drivers to .adjfine and the scaled parts per
million (ppm with a 16-bit binary fractional part) logic. To avoid overflow
with minimal loss of precision, mul_u64_u64_div_u64 was used.
The basic logic used by all of these drivers is very similar, and leads to
a lot of duplicate code to perform the same task.
Rather than keep this duplicate code, introduce diff_by_scaled_ppm and
adjust_by_scaled_ppm. These helper functions calculate the difference or
adjustment necessary based on the scaled parts per million input.
The diff_by_scaled_ppm function returns true if the difference should be
subtracted, and false otherwise.
Update the Intel drivers to use the new helper functions. Other vendor
drivers will be converted to .adjfine and this helper function in the
following changes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
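The three steps listed in the helper commit above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel implementation: the ex_ names are hypothetical, and unsigned __int128 (a GCC/Clang extension) stands in for mul_u64_u64_div_u64() to keep the intermediate product from overflowing.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace sketch of the logic described above; the ex_ names are
 * hypothetical and this is not the kernel implementation.
 * unsigned __int128 (a GCC/Clang extension) stands in for the kernel's
 * mul_u64_u64_div_u64() to avoid overflow in base * |scaled_ppm|.
 */
static bool ex_diff_by_scaled_ppm(uint64_t base, long scaled_ppm, uint64_t *diff)
{
	bool negative = scaled_ppm < 0;
	/* Negate in unsigned arithmetic so LONG_MIN is handled safely. */
	uint64_t mag = negative ? 0 - (uint64_t)scaled_ppm : (uint64_t)scaled_ppm;
	/* 1 scaled ppm = 1 / (1,000,000 * 2^16) */
	const uint64_t divisor = UINT64_C(1000000) << 16;

	*diff = (uint64_t)(((unsigned __int128)base * mag) / divisor);
	return negative;	/* true: the caller should subtract *diff */
}

static uint64_t ex_adjust_by_scaled_ppm(uint64_t base, long scaled_ppm)
{
	uint64_t diff;

	if (ex_diff_by_scaled_ppm(base, scaled_ppm, &diff))
		return base - diff;
	return base + diff;
}
```

With a base of 1,000,000,000 and a request of 65536 scaled ppm (1 ppm), the difference is 1000, so the adjusted value is 1,000,001,000; the same request with a negative sign yields 999,999,000.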
Jacob Keller [Fri, 28 Oct 2022 11:04:12 +0000 (04:04 -0700)]
ptp: add missing documentation for parameters
The ptp_find_pin_unlocked function and the ptp_system_timestamp structure
didn't document their parameters and fields. Fix this.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Frank [Fri, 28 Oct 2022 09:26:21 +0000 (17:26 +0800)]
net: phy: Add driver for Motorcomm yt8521 gigabit ethernet phy
Add a driver for the Motorcomm yt8521 gigabit ethernet phy. We have verified
the driver on the StarFive VisionFive development board, which is developed by
Shanghai StarFive Technology Co., Ltd. On the board, the yt8521 gigabit
ethernet phy works in utp mode with an RGMII interface, and supports
1000M/100M/10M speeds and WoL (magic packet).
Signed-off-by: Frank <Frank.Sae@motor-comm.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Fri, 28 Oct 2022 08:11:06 +0000 (16:11 +0800)]
net: microchip: sparx5: kunit test: change test_callbacks and test_vctrl to static
test_callbacks and test_vctrl are only used in vcap_api_kunit.c now,
change them to static.
Fixes: 67d637516fa9 ("net: microchip: sparx5: Adding KUNIT test for the VCAP API") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Fri, 28 Oct 2022 07:34:57 +0000 (15:34 +0800)]
net: hns: hnae: remove unnecessary __module_get() and module_put()
hnae_ae_register() is called from hns_dsaf_probe(); the refcount of
module hnae has already been taken in resolve_symbol() when calling the
function, so the __module_get()/module_put() can be removed.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The queue configuration is referenced by snps,mtl-rx-config and
snps,mtl-tx-config. Some in-tree DTs and the example put the
referenced config nodes directly beneath the root node, but
most in-tree DTs put it as child node of the dwmac node.
This adds proper description for this setup, which has the
advantage of validating the queue configuration node content.
The example is also updated to use the sub-node style, incl.
the axi bus configuration node, which got the same treatment
as the queues config in 5361660af6d3 ("dt-bindings: net: snps,dwmac:
Document stmmac-axi-config subnode").
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Juhee Kang [Thu, 27 Oct 2022 16:04:24 +0000 (01:04 +0900)]
net: remove unused netdev_unregistering()
The code currently checks dev->reg_state == NETREG_UNREGISTERING directly
rather than calling netdev_unregistering(). As a result, the
netdev_unregistering() helper in netdevice.h is no longer used, so remove it
from netdevice.h.
Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Sat, 29 Oct 2022 05:07:47 +0000 (22:07 -0700)]
Merge tag 'mlx5-updates-2022-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-10-24
SW steering updates from Yevgeny Kliteynik:
1) First four patches: small fixes / optimizations for SW steering:
- Patch 1: Don't abort destroy flow if failed to destroy table - continue
and free everything else.
- Patches 2 and 3 deal with fast teardown:
+ Skip sync during fast teardown, as PCI device is not there any more.
+ Check device state when polling CQ - otherwise SW steering keeps polling
the CQ forever, because nobody is there to flush it.
- Patch 4: Removing unneeded function argument.
2) Deal with the hiccups that we get during rules insertion/deletion,
which sometimes reach 1/4 of a second. While insertion/deletion rate
improvement was not the focus here, it still is a by-product of removing these
hiccups.
Another by-product is the reduced standard deviation in measuring the duration
of rules insertion/deletion bursts.
In the testing we add K rules (warm-up phase), and then continuously do
insertion/deletion bursts of N rules.
During the test execution, the driver measures hiccups (amount and duration)
and total time for insertion/deletion of a batch of rules.
Here are some numbers, before and after these patches:
+--------------------------------------------+-----------------+----------------+
|                                            |   Create rules  |  Delete rules  |
|                                            +--------+--------+--------+-------+
|                                            | Before |  After | Before | After |
+--------------------------------------------+--------+--------+--------+-------+
| Max hiccup [msec]                          |    253 |     42 |    254 |    68 |
+--------------------------------------------+--------+--------+--------+-------+
| Avg duration of 10K rules add/remove [msec]| 140.07 | 124.32 | 106.99 | 99.51 |
+--------------------------------------------+--------+--------+--------+-------+
| Num of hiccups per 100K rules add/remove   |   7.77 |   7.97 |  12.60 | 11.57 |
+--------------------------------------------+--------+--------+--------+-------+
| Avg hiccup duration [msec]                 |  36.92 |  33.25 |  36.15 | 33.74 |
+--------------------------------------------+--------+--------+--------+-------+
- Patch 5: Allocate a short array on stack instead of dynamically - it is
destroyed at the end of the function.
- Patch 6: Rather than cleaning the corresponding chunk's section of
ste_arrays on chunk deletion, initialize these areas upon chunk creation.
Chunk destruction tends to come in large batches (during pool syncing),
so instead of doing huge memory initialization during pool sync,
we amortize this by doing small initializations on chunk creation.
- Patch 7: To simplify the error flow and allow cleaner addition
of new pools, handle creation/destruction of all the domain's memory pools
and other memory-related fields in separate init/uninit functions.
- Patch 8: During rehash, write each table row immediately instead of waiting
for the whole table to be ready and writing it all - saves allocations
of ste_send_info structures and improves performance.
- Patch 9: Instead of allocating/freeing send info objects dynamically,
manage them in pool. The number of send info objects doesn't depend on
number of rules, so after pre-populating the pool with an initial batch of
send info objects, the pool is not expected to grow.
This way we save alloc/free during writing STEs to ICM, which by itself can
sometimes take up to 40msec.
- Patch 10: Allocate icm_chunks from their own slab allocator, which lowered
the alloc/free "hiccups" frequency.
- Patch 11: Similar to patch 10, allocate htbl from its own slab allocator.
- Patch 12: Lower sync threshold for ICM hot memory - set the threshold for
sync to 1/4 of the pool instead of 1/2 of the pool. Although we will have
more syncs, each sync will be shorter and will help with insertion rate
stability. Also, notice that the overall number of hiccups wasn't increased
due to all the other patches.
- Patch 13: Keep track of hot ICM chunks in an array instead of a list.
After steering sync, we traverse the hot list and finally free all the
chunks. It appears that traversing a long list takes an unusually long time
due to cache misses on many entries, which causes a big "hiccup" during
rule insertion. This patch replaces the list with a pre-allocated array that
stores only the bookkeeping information that is needed to later free the
chunks in their buddy allocator.
- Patch 14: Remove the unneeded buddy used_list - we don't need to have the
list of used chunks, we only need the total amount of used memory.
* tag 'mlx5-updates-2022-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Remove the buddy used_list
net/mlx5: DR, Keep track of hot ICM chunks in an array instead of list
net/mlx5: DR, Lower sync threshold for ICM hot memory
net/mlx5: DR, Allocate htbl from its own slab allocator
net/mlx5: DR, Allocate icm_chunks from their own slab allocator
net/mlx5: DR, Manage STE send info objects in pool
net/mlx5: DR, In rehash write the line in the entry immediately
net/mlx5: DR, Handle domain memory resources init/uninit separately
net/mlx5: DR, Initialize chunk's ste_arrays at chunk creation
net/mlx5: DR, For short chains of STEs, avoid allocating ste_arr dynamically
net/mlx5: DR, Remove unneeded argument from dr_icm_chunk_destroy
net/mlx5: DR, Check device state when polling CQ
net/mlx5: DR, Fix the SMFS sync_steering for fast teardown
net/mlx5: DR, In destroy flow, free resources even if FW command failed
====================
The biggest change for IPA v5.0 is that it supports more than 32
endpoints. However there are two other unrelated changes:
- The STATS_TETHERING memory region is not required
- Filter tables no longer support a "global" filter
Beyond this, refactoring some code makes supporting more than 32
endpoints (in an upcoming series) easier. So this series includes
a few other changes (not in this order):
- The maximum endpoint ID in use is determined during config
- Loops over all endpoints only involve those in use
- Endpoint IDs and their directions are checked for validity
differently to simplify comparison against the maximum
====================
Alex Elder [Thu, 27 Oct 2022 12:26:32 +0000 (07:26 -0500)]
net: ipa: record and use the number of defined endpoint IDs
Define a new field in the IPA structure that records the maximum
number of entries that will be used in the IPA endpoint array. Use
that value rather than IPA_ENDPOINT_MAX to determine the end
condition for two loops that iterate over all endpoints.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Thu, 27 Oct 2022 12:26:31 +0000 (07:26 -0500)]
net: ipa: determine the maximum endpoint ID
Each endpoint ID has an entry in the IPA endpoint array. But the
size of that array is defined at compile time. Instead, rename
ipa_endpoint_data_valid() to be ipa_endpoint_max() and have it
return the maximum endpoint ID defined in configuration data.
That function will still validate configuration data.
Zero is returned on error; it's a valid endpoint ID, but we need
more than one, so it can't be the maximum. The next patch makes use
of the returned maximum value.
Finally, rename the "initialized" mask of endpoints defined by
configuration data to be "defined".
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
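The return convention described above (maximum endpoint ID on success, zero on error) can be sketched as follows. This is a minimal illustration, not the driver's ipa_endpoint_max(): the struct, the EX_ENDPOINT_MAX limit and the function name are all hypothetical, and the only validation modelled is that each ID fits in the endpoint array.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_ENDPOINT_MAX 32	/* hypothetical compile-time array size */

/* Hypothetical configuration entry; only the ID matters here. */
struct ex_endpoint_data {
	uint32_t endpoint_id;
};

/*
 * Return the highest endpoint ID found in the configuration data, or 0
 * if the data is invalid. Zero is a valid ID, but more than one
 * endpoint is required, so it can never be a valid maximum.
 */
static uint32_t ex_endpoint_max(const struct ex_endpoint_data *data, size_t count)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (data[i].endpoint_id >= EX_ENDPOINT_MAX)
			return 0;	/* ID does not fit in the endpoint array */
		if (data[i].endpoint_id > max)
			max = data[i].endpoint_id;
	}
	return max;
}
```

A caller would treat a zero return as a configuration error and otherwise use the returned value to bound loops over endpoints.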
Alex Elder [Thu, 27 Oct 2022 12:26:29 +0000 (07:26 -0500)]
net: ipa: more completely check endpoint validity
Ensure all defined TX endpoints are in the range [0, CONS_PIPES) and
defined RX endpoints are within [PROD_LOWEST, PROD_LOWEST+PROD_PIPES).
Modify the way local variables are used to make the checks easier
to understand. Check for each endpoint being in valid range in the
loop, and drop the logical-AND check of initialized against
unavailable IDs.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
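The two ranges named above can be expressed as a small predicate. This is a sketch under assumptions: the function name is hypothetical, and the CONS_PIPES, PROD_LOWEST and PROD_PIPES values are passed in as plain parameters because how the driver obtains them is not described here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the range checks described above. The
 * parameters mirror the CONS_PIPES, PROD_LOWEST and PROD_PIPES values
 * named in the commit message; how the driver obtains them is not shown.
 */
static bool ex_endpoint_id_valid(uint32_t id, bool is_tx, uint32_t cons_pipes,
				 uint32_t prod_lowest, uint32_t prod_pipes)
{
	if (is_tx)
		return id < cons_pipes;			/* [0, CONS_PIPES) */

	/* [PROD_LOWEST, PROD_LOWEST + PROD_PIPES) */
	return id >= prod_lowest && id - prod_lowest < prod_pipes;
}
```

Checking each endpoint against its own range in the loop replaces a single bitmask comparison, at the cost of a few more instructions per endpoint.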
Alex Elder [Thu, 27 Oct 2022 12:26:28 +0000 (07:26 -0500)]
net: ipa: no more global filtering starting with IPA v5.0
IPA v5.0 eliminates the global filter table entry. As a result,
there is no need to shift the filtered endpoint bitmap when it is
written to IPA local memory.
Update comments to explain this. Also delete a redundant block of
comments above the function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Extend packet socket option PACKET_IGNORE_OUTGOING to fanout groups.
The socket option sets ptype.ignore_outgoing, which makes
dev_queue_xmit_nit skip the socket.
When the socket joins a fanout group, the option is not reflected in
the struct ptype of the group. dev_queue_xmit_nit only tests the
fanout ptype, so the flag is ignored once a socket joins a
fanout group.
Inheriting the option from a socket would change established behavior.
Different sockets in the group can set different flags, and can also
change them at runtime.
Testing in packet_rcv_fanout defeats the purpose of the original
patch, which is to avoid skb_clone in dev_queue_xmit_nit (esp. for
MSG_ZEROCOPY packets).
Instead, introduce a new fanout group flag with the same behavior.
Tested with https://github.com/wdebruij/kerneltools/blob/master/tests/test_psock_fanout_ignore_outgoing.c
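From userspace, a fanout flag is passed in the upper 16 bits of the PACKET_FANOUT option value. The sketch below only builds that value. The constants are local copies that are assumed to match <linux/if_packet.h> on kernels carrying this change; the flag name and its value (0x4000) are not stated in the text above and should be checked against your headers.

```c
#include <stdint.h>

/*
 * Local copies of constants, assumed to match <linux/if_packet.h> on
 * kernels that carry this change; check your headers before relying
 * on them.
 */
#define EX_PACKET_FANOUT_HASH			0
#define EX_PACKET_FANOUT_FLAG_IGNORE_OUTGOING	0x4000

/*
 * Build the integer passed to setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...):
 * the low 16 bits carry the group ID, the high 16 bits the fanout
 * mode ORed with any flags.
 */
static uint32_t ex_fanout_arg(uint16_t group_id, uint16_t type_and_flags)
{
	return (uint32_t)group_id | ((uint32_t)type_and_flags << 16);
}
```

A socket would then join the group with setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)), which requires a packet socket and the matching privileges.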
Lukasz Czapnik [Thu, 27 Oct 2022 10:42:39 +0000 (03:42 -0700)]
ice: Add additional CSR registers to ETHTOOL_GREGS
In the event of a Tx hang it can be useful to read a variety of hardware
registers to capture some state about why the transmit queue got stuck.
Extend the ETHTOOL_GREGS dump provided by the ice driver with several CSR
registers that provide such relevant information regarding the hardware Tx
state. This enables capturing relevant data to enable debugging such a Tx
hang.
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Link: https://lore.kernel.org/r/20221027104239.1691549-1-jacob.e.keller@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Sat, 29 Oct 2022 04:56:22 +0000 (21:56 -0700)]
Merge branch 'clean-up-sfp-register-definitions'
Russell King says:
====================
Clean up SFP register definitions
This two-part patch series cleans up the SFP register definitions by
1. converting them from hex to decimal; all the definitions in the
documents use decimal, so this makes it easier to cross-reference.
2. moving the bit definitions for each register alongside their
register address definition
====================
net: sfp: move field definitions along side register index
Just as we do for the A2h enum, arrange the A0h enum to have the
field definitions next to their corresponding register index.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: sfp: convert register indexes from hex to decimal
The register indexes in the standards are in decimal rather than hex,
so let's specify them in decimal in the header file so we can easily
cross-reference without converting between hex and decimal.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a result of investigations from Frank Wunderlich, we know a lot more
about the Mediatek "SGMII" PCS block, and can implement the PCS support
correctly. This series achieves that, and Frank has tested the final
result and reports that it works for him. The series could do with
further testing by others, but I suspect that is unlikely to happen
until it is merged based on past performances with this driver.
Briefly, the patches in order:
1. Add a new helper to get the link timer duration in nanoseconds
2. Add definitions for the newly discovered registers and updates to
bit definitions, including bitmasks for the BMCR, BMSR and two
advertisement registers.
3. Remove unnecessary/unused error handling (functions always returning
zero.)
4. Adding the missing pcs_get_state() implementation.
5. Converting the code to use regmap_update_bits() rather than
open-coding read-modify-write sequences.
6. Adding out-of-band speed and duplex forcing for all non-inband modes,
not just the 802.3z link modes the code currently handles.
7. Moving the release of the PHY power down to the main pcs_config()
function.
8. Moving the interface speed selection to the main pcs_config()
function.
9. Adding advertisement programming.
10. Adding correct link timer programming using the new helper in the
first patch.
11. Adding support for 802.3z negotiation.
There is one remaining issue - when configuring the PCS for in-band,
for some reason the AN restart bit is always set. This should not be
necessary, but requires further investigation with the hardware to
find out whether it is really necessary. I suspect this was a work
around for a previous poor implementation.
====================
net: mtk_eth_soc: add support for in-band 802.3z negotiation
As a result of help from Frank Wunderlich to investigate and test, we
now know how to program this PCS for in-band 802.3z negotiation. Add
support for this by moving the contents of the two functions into the
common mtk_pcs_config() function and adding the register settings for
802.3z negotiation.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: mtk_eth_soc: move and correct link timer programming
Program the link timer appropriately for the interface mode being
used, using the newly introduced phylink helper that provides the
nanosecond link timer interval.
The intervals are 1.6ms for SGMII based protocols and 10ms for
802.3z based protocols.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
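The mapping from interface mode to link timer interval stated above can be sketched as a lookup. This is an illustration, not the phylink helper itself: the enum and function name are hypothetical, and only one SGMII based and one 802.3z based mode are shown.

```c
#include <errno.h>

/* Hypothetical subset of PHY interface modes, for illustration only. */
enum ex_phy_interface {
	EX_IFACE_SGMII,
	EX_IFACE_1000BASEX,
	EX_IFACE_RGMII,
};

/*
 * Return the link timer in nanoseconds: 1.6 ms for SGMII based modes,
 * 10 ms for 802.3z based modes, or -EINVAL for modes that have no
 * link timer.
 */
static int ex_link_timer_ns(enum ex_phy_interface iface)
{
	switch (iface) {
	case EX_IFACE_SGMII:
		return 1600000;
	case EX_IFACE_1000BASEX:
		return 10000000;
	default:
		return -EINVAL;
	}
}
```

Returning nanoseconds leaves the conversion to hardware timer ticks to each driver, since the tick length is device specific.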
net: mtk_eth_soc: add out of band forcing of speed and duplex in pcs_link_up
Add support for forcing the link speed and duplex setting in the
pcs_link_up() method for out of band modes, which will be useful when
we finish converting the pcs_config() method. Until then, we still have
to force duplex for 802.3z modes to work correctly.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: mtk_eth_soc: convert mtk_sgmii to use regmap_update_bits()
mtk_sgmii does a lot of read-modify-write operations, for which there
is a specific regmap function. Use this function instead of open-coding
the operations.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
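The read-modify-write pattern that regmap_update_bits() replaces can be sketched in userspace. This is a model of the semantics only: a plain variable stands in for the register, and ex_update_bits is a hypothetical name, not a regmap function.

```c
#include <stdint.h>

/*
 * Userspace sketch of a read-modify-write helper, modelled on the
 * semantics of regmap_update_bits(): only the bits selected by mask
 * change. A plain variable stands in for the hardware register.
 */
static void ex_update_bits(uint32_t *reg, uint32_t mask, uint32_t val)
{
	uint32_t tmp = *reg;		/* read */

	tmp &= ~mask;			/* modify: clear the masked field */
	tmp |= val & mask;		/* modify: set the new field value */
	*reg = tmp;			/* write */
}
```

Folding the three steps into one call removes the temporary variable from every call site and makes the affected field explicit through the mask.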
Add a pcs_get_state() implementation which uses the advertisements
to compute the resulting link modes, and BMSR contents to determine
negotiation and link status.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The functions called by the pcs_config() method always return zero, so
there is no point trying to handle an error from these functions. Make
these functions void, eliminate the "err" variable and simply return
zero from the pcs_config() function itself.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As a result of help from Frank Wunderlich to investigate and test, we
know a bit more about the PCS on the Mediatek platforms. Update the
definitions from this investigation.
This PCS appears similar, but not identical to the Lynx PCS.
Although not included in this patch, for future reference, the PHY
ID registers at offset 4 read as 0x4d544950 'MTIP'.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a helper to convert the PHY interface mode to the required link
timer setting as stated by the appropriate standard. Inappropriate
interface modes return an error.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Gleixner [Wed, 26 Oct 2022 13:22:15 +0000 (15:22 +0200)]
net: Remove the obsolete u64_stats_fetch_*_irq() users (net).
Now that the 32bit UP oddity is gone and 32bit always uses a sequence
count, there is no need for the fetch_irq() variants anymore.
Convert to the regular interface.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thomas Gleixner [Wed, 26 Oct 2022 13:22:14 +0000 (15:22 +0200)]
net: Remove the obsolete u64_stats_fetch_*_irq() users (drivers).
Now that the 32bit UP oddity is gone and 32bit always uses a sequence
count, there is no need for the fetch_irq() variants anymore.
Convert to the regular interface.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First set of patches for v6.2. mac80211 refactoring continues for Wi-Fi 7.
All mac80211 drivers are now converted to use internal TX queues; this
might cause some regressions, so we wanted to do this early in the
cycle.
Note: wireless tree was merged[1] to wireless-next to avoid some
conflicts with mac80211 patches between the trees. Unfortunately there
are still two smaller conflicts in net/mac80211/util.c which Stephen
also reported[2]. In the first conflict initialise scratch_len to
"params->scratch_len ?: 3 * params->len" (note number 3, not 2!) and
in the second conflict take the version which uses elems->scratch_pos.
mac80211
- preparation for Wi-Fi 7 Multi-Link Operation (MLO) continues
- add API to show the link STAs in debugfs
- all mac80211 drivers are now using mac80211 internal TX queues (iTXQs)
rtw89
- support 8852BE
rtl8xxxu
- support RTL8188FU
brcmfmac
- support two station interfaces concurrently
Dmitry Torokhov [Thu, 27 Oct 2022 07:34:02 +0000 (00:34 -0700)]
nfc: s3fwrn5: use devm_clk_get_optional_enabled() helper
Because we enable the clock immediately after acquiring it in probe,
we can combine the 2 operations and use devm_clk_get_optional_enabled()
helper.
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series adds support for WangXun 10 gigabit NIC, to initialize
hardware, set mac address, and register netdev.
Change log:
v6: address comments:
Jakub Kicinski: check with scripts/kernel-doc
v5: address comments:
Jakub Kicinski: clean build with W=1 C=1
v4: address comments:
Andrew Lunn: https://lore.kernel.org/all/YzXROBtztWopeeaA@lunn.ch/
v3: address comments:
Andrew Lunn: remove hw function ops, reorder functions, use BIT(n)
for register bit offset, move the same code of txgbe
and ngbe to libwx
v2: address comments:
Andrew Lunn: https://lore.kernel.org/netdev/YvRhld5rD%2FxgITEg@lunn.ch/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 28 Oct 2022 09:47:42 +0000 (10:47 +0100)]
Merge branch 'tcp-plb'
Mubashir Adnan Qureshi says:
====================
net: Add PLB functionality to TCP
This patch series adds PLB (Protective Load Balancing) to TCP and hooks
it up to DCTCP. PLB is disabled by default and can be enabled using
relevant sysctls and support from underlying CC.
PLB (Protective Load Balancing) is a host based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from transport layer to randomly change the path of the connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
https://doi.org/10.1145/3544216.3544226
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rcv_wnd can be useful to diagnose TCP performance where receiver window
becomes the bottleneck. rehash reports the PLB and timeout triggered
rehash attempts by the TCP connection.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: add u32 counter in tcp_sock and an SNMP counter for PLB
A u32 counter is added to tcp_sock for counting the number of PLB
triggered rehashes for a TCP connection. An SNMP counter is also
added to count overall PLB triggered rehash events for a host. These
counters are hooked up to PLB implementation for DCTCP.
TCP_NLA_REHASH is added to SCM_TIMESTAMPING_OPT_STATS that reports
the rehash attempts triggered due to PLB or timeouts. This gives
a historical view of sustained congestion or timeouts experienced
by the TCP connection.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
PLB support is added to TCP DCTCP code. As DCTCP uses ECN as the
congestion signal, PLB also uses ECN to make decisions whether to change
the path or not upon sustained congestion.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Congestion control algorithms track PLB state and cause the connection
to trigger a path change when either of the 2 conditions is satisfied:
- No packets are in flight and (# consecutive congested rounds >=
sysctl_tcp_plb_idle_rehash_rounds)
- (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds)
A round (RTT) is marked as congested when congestion signal
(ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh.
In the event of RTO, PLB (via tcp_write_timeout()) triggers a path
change and disables congestion-triggered path changes for a random time
between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec)
to avoid hopping onto the "connectivity blackhole". RTO-triggered
path changes can still happen during this cool-off period.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
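The two rehash conditions and the congested-round rule above can be sketched as a small state machine. This is an illustration under assumptions, not the kernel code: the struct and function names are hypothetical, the fields stand in for the sysctls named in the text, ce_ratio is assumed to use the same scale as the threshold, and the RTO-triggered suspension is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the sysctl knobs named above. */
struct ex_plb_params {
	uint32_t cong_thresh;		/* sysctl_tcp_plb_cong_thresh */
	uint32_t rehash_rounds;		/* sysctl_tcp_plb_rehash_rounds */
	uint32_t idle_rehash_rounds;	/* sysctl_tcp_plb_idle_rehash_rounds */
};

struct ex_plb_state {
	uint32_t consec_cong_rounds;	/* consecutive congested rounds */
};

/* Call once per round (RTT) with that round's ECN CE ratio. */
static void ex_plb_update_state(struct ex_plb_state *plb,
				const struct ex_plb_params *p, uint32_t ce_ratio)
{
	if (ce_ratio > p->cong_thresh)
		plb->consec_cong_rounds++;
	else
		plb->consec_cong_rounds = 0;	/* streak broken */
}

/* Return true when either rehash condition from the text holds. */
static bool ex_plb_should_rehash(const struct ex_plb_state *plb,
				 const struct ex_plb_params *p,
				 uint32_t packets_in_flight)
{
	if (packets_in_flight == 0 &&
	    plb->consec_cong_rounds >= p->idle_rehash_rounds)
		return true;

	return plb->consec_cong_rounds >= p->rehash_rounds;
}
```

Using a lower round count when nothing is in flight lets an idle connection change path sooner, since a rehash at that point cannot reorder packets already on the wire.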
PLB (Protective Load Balancing) is a host based mechanism for load
balancing across switch links. It leverages congestion signals (e.g. ECN)
from transport layer to randomly change the path of the connection
experiencing congestion. PLB changes the path of the connection by
changing the outgoing IPv6 flow label for IPv6 connections (implemented
in Linux by calling sk_rethink_txhash()). Because of this implementation
mechanism, PLB can currently only work for IPv6 traffic. For more
information, see the SIGCOMM 2022 paper:
https://doi.org/10.1145/3544216.3544226
This commit adds new sysctl knobs and sets their default values for
TCP PLB.
Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter updates for net-next
1) Move struct nft_payload_set definition to .c file where it is
only used.
2) Shrink transport and inner header offset fields in the nft_pktinfo
structure to 16-bits, from Florian Westphal.
3) Get rid of nft_objref Kbuild toggle, make it built-in into
nf_tables. This expression is used to instantiate conntrack helpers
in nftables. After removing the conntrack helper auto-assignment
toggle, this feature became more important, so move it to the nf_tables
core module. Also from Florian.
4) Extend the existing function to calculate payload inner header offset
to deal with the GRE and IPIP transport protocols.
6) Add inner expression support for nf_tables. This new expression
provides a packet parser for tunneled packets which uses a userspace
description of the expected inner headers. The inner expression
invokes the payload expression (via direct call) to match on the
inner header protocol fields using the inner link, network and
transport header offsets.
An example of the bytecode generated from userspace to match on
IP source encapsulated in a VxLAN packet:
# nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4
netdev x y
[ meta load l4proto => reg 1 ]
[ cmp eq reg 1 0x00000011 ]
[ payload load 2b @ transport header + 2 => reg 1 ]
[ cmp eq reg 1 0x0000b512 ]
[ inner type vxlan hdrsize 8 flags f [ meta load protocol => reg 1 ] ]
[ cmp eq reg 1 0x00000008 ]
[ inner type vxlan hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ]
[ cmp eq reg 1 0x04030201 ]
7) Store inner link, network and transport header offsets in percpu
area to parse inner packet header once only. Matching on a different
tunnel type invalidates existing offsets in the percpu area and it
invokes the inner tunnel parser again.
8) Add support for inner meta matching. This adds support for
NFTA_META_PROTOCOL, which specifies the inner ethertype, and
NFT_META_L4PROTO, which specifies the inner transport protocol.
9) Extend nft_inner to parse GENEVE optional fields to calculate the
link layer offset.
10) Update inner expression so tunnel offset points to GRE header
to normalize tunnel header handling. This also allows userspace to
interpret the GRE header in different ways.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nft_inner: set tunnel offset to GRE header offset
netfilter: nft_inner: add geneve support
netfilter: nft_meta: add inner match support
netfilter: nft_inner: add percpu inner context
netfilter: nft_inner: support for inner tunnel header matching
netfilter: nft_payload: access ipip payload for inner offset
netfilter: nft_payload: access GRE payload via inner offset
netfilter: nft_objref: make it builtin
netfilter: nf_tables: reduce nft_pktinfo by 8 bytes
netfilter: nft_payload: move struct nft_payload_set definition where it belongs
====================
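The register constants in the bytecode example above can be checked by hand. The following is a minimal sketch, assuming the dump was taken on a little-endian host where the first packet byte lands in the least significant byte of the register; the helper names are hypothetical and not part of nf_tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack up to four wire bytes into a 32-bit value the
 * way a little-endian host sees them (first wire byte is the low byte). */
static uint32_t reg_from_wire(const uint8_t *bytes, size_t len)
{
	uint32_t reg = 0;
	size_t i;

	assert(len <= 4);
	for (i = 0; i < len; i++)
		reg |= (uint32_t)bytes[i] << (8 * i);
	return reg;
}

/* UDP destination port 4789 (0x12b5) in network byte order: 12 b5. */
static uint32_t vxlan_port_reg(void)
{
	const uint8_t wire[2] = { 0x12, 0xb5 };

	return reg_from_wire(wire, sizeof(wire));
}

/* Inner ethertype 0x0800 (IPv4) in network byte order: 08 00. */
static uint32_t ethertype_ipv4_reg(void)
{
	const uint8_t wire[2] = { 0x08, 0x00 };

	return reg_from_wire(wire, sizeof(wire));
}

/* Inner IPv4 source address 1.2.3.4 in network byte order: 01 02 03 04. */
static uint32_t ipv4_saddr_reg(void)
{
	const uint8_t wire[4] = { 1, 2, 3, 4 };

	return reg_from_wire(wire, sizeof(wire));
}
```

Under that assumption the three helpers reproduce the compare operands 0x0000b512, 0x00000008 and 0x04030201 shown in the dump.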
====================
ionic: VF attr replay and other updates
For better VF management when a FW update restart or a FW crash recovery is
detected, the PF now will replay any user specified VF attributes to be
sure the FW hasn't lost them in the restart.
Newer FW offers more packet processing offloads, so we now support them in
the driver.
A small refactor of the Rx buffer fill cleans a bit of code and will help
future work on buffer caching.
====================
Shannon Nelson [Wed, 26 Oct 2022 14:37:42 +0000 (07:37 -0700)]
ionic: new ionic device identity level and VF start control
A new ionic dev_cmd is added to the interface in ionic_if.h,
with a new capabilities field in the ionic device identity to
signal its availability in the FW. The identity level code is
incremented to '2' to show support for this new capabilities
bitfield.
If the driver has indicated with the new identity level that
it has the VF_CTRL command, newer FW will wait for the start
command before starting the VFs after a FW update or crash
recovery.
This patch updates the driver to make use of the new VF start
control in fw_up path to be sure that the PF has set the user
attributes on the VF before the FW allows the VFs to restart.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Wed, 26 Oct 2022 14:37:41 +0000 (07:37 -0700)]
ionic: only save the user set VF attributes
Report the current FW values for the VF attributes, but don't
save the FW values locally, only save the vf attributes that
are given to us from the user. This allows us to replay user
data, and doesn't end up confusing things like "who set the
mac address".
Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Shannon Nelson [Wed, 26 Oct 2022 14:37:40 +0000 (07:37 -0700)]
ionic: replay VF attributes after fw crash recovery
The VF attributes that the user has set into the FW through
the PF can be lost over a FW crash recovery. Much like we
already replay the PF mac/vlan filters, we now add a replay
in the recovery path to be sure the FW has the up-to-date
VF configurations.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported")
Linus Torvalds [Thu, 27 Oct 2022 20:36:59 +0000 (13:36 -0700)]
Merge tag 'net-6.1-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from 802.15.4 (Zigbee et al).
Current release - regressions:
- ipa: fix bugs in the register conversion for IPA v3.1 and v3.5.1
Current release - new code bugs:
- mptcp: fix abba deadlock on fastopen
- eth: stmmac: rk3588: allow multiple gmac controllers in one system
Previous releases - regressions:
- ip: rework the fix for dflt addr selection for connected nexthop
- net: couple more fixes for misinterpreting bits in struct page
after the signature was added
Previous releases - always broken:
- ipv6: ensure sane device mtu in tunnels
- openvswitch: switch from WARN to pr_warn on a user-triggerable path
- ethtool: eeprom: fix null-deref on genl_info in dump
- ieee802154: more return code fixes for corner cases in
dgram_sendmsg
- mac802154: fix link-quality-indicator recording
- eth: mlx5: fixes for IPsec, PTP timestamps, OvS and conntrack
offload
- eth: fec: limit register access on i.MX6UL
- eth: bcm4908_enet: update TX stats after actual transmission
- can: rcar_canfd: improve IRQ handling for RZ/G2L
Misc:
- genetlink: piggy back on the newly added resv_op_start to enforce
more sanity checks on new commands"
* tag 'net-6.1-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
net: enetc: survive memory pressure without crashing
kcm: do not sense pfmemalloc status in kcm_sendpage()
net: do not sense pfmemalloc status in skb_append_pagefrags()
net/mlx5e: Fix macsec sci endianness at rx sa update
net/mlx5e: Fix wrong bitwise comparison usage in macsec_fs_rx_add_rule function
net/mlx5e: Fix macsec rx security association (SA) update/delete
net/mlx5e: Fix macsec coverity issue at rx sa update
net/mlx5: Fix crash during sync firmware reset
net/mlx5: Update fw fatal reporter state on PCI handlers successful recover
net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed
net/mlx5e: TC, Reject forwarding from internal port to internal port
net/mlx5: Fix possible use-after-free in async command interface
net/mlx5: ASO, Create the ASO SQ with the correct timestamp format
net/mlx5e: Update restore chain id for slow path packets
net/mlx5e: Extend SKB room check to include PTP-SQ
net/mlx5: DR, Fix matcher disconnect error flow
net/mlx5: Wait for firmware to enable CRS before pci_restore_state
net/mlx5e: Do not increment ESN when updating IPsec ESN state
netdevsim: remove dir in nsim_dev_debugfs_init() when creating ports dir failed
netdevsim: fix memory leak in nsim_drv_probe() when nsim_dev_resources_register() failed
...
Linus Torvalds [Thu, 27 Oct 2022 20:16:36 +0000 (13:16 -0700)]
Merge tag 'execve-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve fixes from Kees Cook:
- Fix an ancient signal action copy race (Bernd Edlinger)
- Fix a memory leak in ELF loader, when under memory pressure (Li
Zetao)
* tag 'execve-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
fs/binfmt_elf: Fix memory leak in load_elf_binary()
exec: Copy oldsighand->action under spin-lock
Linus Torvalds [Thu, 27 Oct 2022 19:31:57 +0000 (12:31 -0700)]
Merge tag 'hardening-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Fix older Clang vs recent overflow KUnit test additions (Nick
Desaulniers, Kees Cook)
- Fix kern-doc visibility for overflow helpers (Kees Cook)
* tag 'hardening-v6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
overflow: Refactor test skips for Clang-specific issues
overflow: disable failing tests for older clang versions
overflow: Fix kern-doc markup for functions
Linus Torvalds [Thu, 27 Oct 2022 19:21:57 +0000 (12:21 -0700)]
Merge tag 'media/v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A bunch of patches addressing issues in the vivid driver and adding
new checks in V4L2 to validate the input parameters from some ioctls"
* tag 'media/v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: vivid.rst: loop_video is set on the capture devnode
media: vivid: set num_in/outputs to 0 if not supported
media: vivid: drop GFP_DMA32
media: vivid: fix control handler mutex deadlock
media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced'
media: v4l2-dv-timings: add sanity checks for blanking values
media: vivid: dev->bitmap_cap wasn't freed in all cases
media: vivid: s_fbuf: add more sanity checks
Vladimir Oltean [Thu, 27 Oct 2022 18:29:25 +0000 (21:29 +0300)]
net: enetc: survive memory pressure without crashing
Under memory pressure, enetc_refill_rx_ring() may fail, and when called
during the enetc_open() -> enetc_setup_rxbdr() procedure, this is not
checked for.
An extreme case of memory pressure will result in exactly zero buffers
being allocated for the RX ring, and in such a case it is expected that
hardware drops all RX packets due to lack of buffers.
This does not happen, because the reset-default value of the consumer
and producer index is 0, and this makes the ENETC think that all buffers
have been initialized and that it owns them (when in reality none were).
The hardware guide explains this best:
| Configure the receive ring producer index register RBaPIR with a value
| of 0. The producer index is initially configured by software but owned
| by hardware after the ring has been enabled. Hardware increments the
| index when a frame is received which may consume one or more BDs.
| Hardware is not allowed to increment the producer index to match the
| consumer index since it is used to indicate an empty condition. The ring
| can hold at most RBLENR[LENGTH]-1 received BDs.
|
| Configure the receive ring consumer index register RBaCIR. The
| consumer index is owned by software and updated during operation of the
| BD ring by software, to indicate that any receive data occupied
| in the BD has been processed and it has been prepared for new data.
| - If consumer index and producer index are initialized to the same
| value, it indicates that all BDs in the ring have been prepared and
| hardware owns all of the entries.
| - If consumer index is initialized to producer index plus N, it would
| indicate N BDs have been prepared. Note that hardware cannot start if
| only a single buffer is prepared due to the restrictions described in
| (2).
| - Software may write consumer index to match producer index anytime
| while the ring is operational to indicate all received BDs prior have
| been processed and new BDs prepared for hardware.
Normally, the value of rx_ring->rcir (consumer index) is brought in sync
with the rx_ring->next_to_use software index, but this only happens if
page allocation ever succeeded.
When PI==CI==0, the hardware appears to receive frames and write them to
DMA address 0x0 (?!), then set the READY bit in the BD.
The enetc_clean_rx_ring() function (and its XDP derivative) is naturally
not prepared to handle such a condition. It will attempt to process
those frames using the rx_swbd structure associated with index i of the
RX ring, but that structure is not fully initialized (enetc_new_page()
does all of that). So what happens next is undefined behavior.
To operate using no buffer, we must initialize the CI to PI + 1, which
will block the hardware from advancing the PI any further, and drop
everything.
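The index rules quoted from the hardware guide can be modelled with a little modular arithmetic. This is a simplified sketch, not the enetc driver code: the function name is hypothetical and 'len' stands in for RBLENR[LENGTH].

```c
#include <assert.h>
#include <stdint.h>

/* Number of BDs the hardware may still fill, given the producer index (pi),
 * the consumer index (ci) and the ring length. Hardware advances pi but may
 * never make it equal to ci, so one slot always stays unused. */
static uint32_t rx_bds_hw_may_fill(uint32_t pi, uint32_t ci, uint32_t len)
{
	assert(len > 1 && pi < len && ci < len);
	return (ci + len - pi - 1) % len;
}
```

In this model PI == CI yields len - 1 usable BDs, which is why the reset default of 0/0 makes the hardware believe the whole ring is ready, while CI == PI + 1 yields 0 and the hardware has nothing to write into.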
The issue was seen while adding support for zero-copy AF_XDP sockets,
where buffer memory comes from user space, which can even decide to
supply no buffers at all (example: "xdpsock --txonly"). However, the bug
is present also with the network stack code, even though it would take a
very determined person to trigger a page allocation failure at the
perfect time (a series of ifup/ifdown under memory pressure should
eventually reproduce it given enough retries).
Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20221027182925.3256653-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 27 Oct 2022 04:03:46 +0000 (04:03 +0000)]
net: do not sense pfmemalloc status in skb_append_pagefrags()
skb_append_pagefrags() is used by af_unix and udp sendpage()
implementation so far.
In commit 326140063946 ("tcp: TX zerocopy should not sense
pfmemalloc status") we explained why we should not sense
pfmemalloc status for pages owned by user space.
We should also use skb_fill_page_desc_noacc()
in skb_append_pagefrags() to avoid following KCSAN report:
BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags
value changed: 0x0000000000000000 -> 0xffffea00058fc188
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Raed Salem [Wed, 26 Oct 2022 13:51:53 +0000 (14:51 +0100)]
net/mlx5e: Fix macsec sci endianness at rx sa update
On rx sa update, the cited commit passes the sci object attribute in the
wrong endianness, not as the HW expects. This creates a malformed hw sa
context, and consequently the HW produces unexpected MACsec packets for
traffic that uses this sa.
Fix by passing sci to the macsec object create function in the correct
endianness. While at it, add a __force u64 cast to prevent the sparse error
"sparse: error: incorrect type in assignment".
Raed Salem [Wed, 26 Oct 2022 13:51:52 +0000 (14:51 +0100)]
net/mlx5e: Fix wrong bitwise comparison usage in macsec_fs_rx_add_rule function
The cited commit produces a sparse check error of type
"sparse: error: restricted __be64 degrades to integer". The
offending line wrongly did a bitwise operation between two operands of
different storage sizes, one 64 bits wide and the other 16 bits wide,
which caused the above sparse error. Furthermore, a bitwise operation
is wrong here in the first place, as the constant MACSEC_PORT_ES
is not a bit field.
Fix by using the right mask to get the lower 16 bits of the sci number,
and use the comparison operator '==' instead of the bitwise '&' operator.
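The difference between the two checks can be shown with a small host-order model. This is only a sketch: PORT_ES and its value are assumptions for illustration, the port is assumed to occupy the lower 16 bits of a host-order SCI, and neither function is the mlx5 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constant for illustration: a port number, not a flag bit. */
#define PORT_ES 0x0001u

/* Buggy pattern: '&' treats PORT_ES as a flag, so any odd port matches. */
static bool port_is_es_bitwise(uint64_t sci)
{
	return (sci & PORT_ES) != 0;
}

/* Fixed pattern: isolate the lower 16 bits, then compare for equality. */
static bool port_is_es_compare(uint64_t sci)
{
	return (sci & 0xffffu) == PORT_ES;
}
```

With a port of 0x0003 the bitwise variant reports a match while the comparison variant does not, which is the kind of false positive the fix removes.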
Raed Salem [Wed, 26 Oct 2022 13:51:51 +0000 (14:51 +0100)]
net/mlx5e: Fix macsec rx security association (SA) update/delete
The cited commit adds support for update/delete of a MACsec Rx SA.
Naturally, these operations need to check that the SA in question exists
before updating/deleting it, and return an error code otherwise. However,
they do just the opposite, i.e. return an error if the SA exists.
Fix by changing the check to return an error when the SA in question does
not exist, and adjust the error message and code accordingly.
Raed Salem [Wed, 26 Oct 2022 13:51:50 +0000 (14:51 +0100)]
net/mlx5e: Fix macsec coverity issue at rx sa update
On rx sa update, the cited commit passes object attributes
to the MACsec object create function without initializing all
attribute fields, leaving some of them with garbage values. This
violates the implicit assumption of the create object function, which
assumes that all input object attribute fields are set.
Fix by initializing the object attributes struct to zero, thus leaving
unset fields with the legal zero value.
When setting Bluefield to DPU NIC mode using mlxconfig tool + sync
firmware reset flow, we run into scenario where the host was not
eswitch manager at the time of mlx5 driver load but becomes eswitch manager
after the sync firmware reset flow. This results in null pointer
access of mpfs structure during mac filter add. This change prevents null
pointer access but mpfs table entries will not be added.
Fixes: 5ec697446f46 ("net/mlx5: Add support for devlink reload action fw activate") Signed-off-by: Suresh Devarakonda <ramad@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Bodong Wang <bodong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-12-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roy Novich [Wed, 26 Oct 2022 13:51:48 +0000 (14:51 +0100)]
net/mlx5: Update fw fatal reporter state on PCI handlers successful recover
After a PCI error handler completes recovery, the devlink health fw_fatal
reporter state must be set back to "healthy" by explicitly calling
devlink_health_reporter_state_update(). This is needed when the fw_fatal
reporter was triggered by a PCI error: poll health runs and sets the
reporter state to error, and health recovery fails (since EEH didn't
re-enable the PCI). The PCI handlers then continue the recovery flow and
succeed later without devlink acknowledgment. Fix this by adding a devlink
state update at the end of the PCI handler recovery process.
Fixes: 6181e5cb752e ("devlink: add support for reporter recovery completion") Signed-off-by: Roy Novich <royno@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-11-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roi Dayan [Wed, 26 Oct 2022 13:51:47 +0000 (14:51 +0100)]
net/mlx5e: TC, Fix cloned flow attr instance dests are not zeroed
On multi table split, the driver creates a new attr instance with
data copied from the previous attr instance and zeroes the action flags.
The dests properties also need to be reset to avoid incorrect dests per attr.
Fixes: 8300f225268b ("net/mlx5e: Create new flow attr for multi table actions") Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-10-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ariel Levkovich [Wed, 26 Oct 2022 13:51:46 +0000 (14:51 +0100)]
net/mlx5e: TC, Reject forwarding from internal port to internal port
Reject TC rules that forward from internal port to internal port
as it is not supported.
This includes rules that explicitly have an internal port as
the filter device, as well as rules that apply to tunnel interfaces,
since the route device for the tunnel interface can be an internal
port.
Fixes: 27484f7170ed ("net/mlx5e: Offload tc rules that redirect to ovs internal port") Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-9-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tariq Toukan [Wed, 26 Oct 2022 13:51:45 +0000 (14:51 +0100)]
net/mlx5: Fix possible use-after-free in async command interface
mlx5_cmd_cleanup_async_ctx should return only after all its callback
handlers have completed. Before this patch, the race below between
mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
led to a use-after-free:
1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
elevated by 1, a single inflight callback).
2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
is about to call wake_up().
4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
immediately as the condition (num_inflight == 0) holds.
5. mlx5_cmd_cleanup_async_ctx returns.
6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
object.
7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
object.
Fix it by syncing using a completion object. Mark it completed when
num_inflight reaches 0.
Trace:
BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0
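The "last reference signals completion" idea behind the fix can be sketched in userspace C11. This is a single-threaded model under stated assumptions, not the mlx5 implementation: struct async_ctx, its helpers and the 'done' flag (standing in for the completion object) are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the async context. */
struct async_ctx {
	atomic_int num_inflight; /* starts at 1: the owner's own reference */
	atomic_bool done;        /* plays the role of the completion object */
};

static void ctx_init(struct async_ctx *ctx)
{
	atomic_init(&ctx->num_inflight, 1);
	atomic_init(&ctx->done, false);
}

/* Taken when an async command is issued. */
static void ctx_get(struct async_ctx *ctx)
{
	atomic_fetch_add(&ctx->num_inflight, 1);
}

/* Drop one reference. Whoever drops the last one marks completion, and
 * that store is its final access to ctx. */
static void ctx_put(struct async_ctx *ctx)
{
	if (atomic_fetch_sub(&ctx->num_inflight, 1) == 1)
		atomic_store(&ctx->done, true);
}

/* Cleanup may return (and the caller may free ctx) only once this is true.
 * It tests the completion flag, not the counter. */
static bool ctx_cleanup_may_return(struct async_ctx *ctx)
{
	return atomic_load(&ctx->done);
}

/* Replays steps 1-3 of the sequence above in one thread. Returns true if
 * cleanup is held back until the callback has signalled completion. */
static bool replay_cleanup_vs_callback(void)
{
	struct async_ctx ctx;
	bool before, after;

	ctx_init(&ctx);
	ctx_get(&ctx);                         /* one inflight callback: 2 */
	ctx_put(&ctx);                         /* cleanup drops its ref: 1 */
	before = ctx_cleanup_may_return(&ctx); /* callback still pending */
	ctx_put(&ctx);                         /* callback finishes: 0 */
	after = ctx_cleanup_may_return(&ctx);
	return !before && after;
}
```

Because cleanup waits on the flag rather than on the counter reaching 0, it cannot return in the window between the callback's decrement and its wake-up, which is the window the race description exploits.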
Saeed Mahameed [Wed, 26 Oct 2022 13:51:44 +0000 (14:51 +0100)]
net/mlx5: ASO, Create the ASO SQ with the correct timestamp format
mlx5 SQs must select the timestamp format explicitly according to the
active clock mode. Select the currently active timestamp mode so that ASO SQ
creation succeeds.
This fixes the following error prints when trying to create an ipsec ASO SQ
while the timestamp format is real time mode.
mlx5_cmd_out_err:778:(pid 34874): CREATE_SQ(0x904) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xd61c0b), err(-22)
mlx5_aso_create_sq:285:(pid 34874): Failed to open aso wq sq, err=-22
mlx5e_ipsec_init:436:(pid 34874): IPSec initialization failed, -22
Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reported-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-7-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paul Blakey [Wed, 26 Oct 2022 13:51:43 +0000 (14:51 +0100)]
net/mlx5e: Update restore chain id for slow path packets
Currently encap slow path rules just forward to software without
setting the chain id miss register, so driver doesn't restore
the chain, and packets hitting this rule will restart from tc chain
0 instead of continuing to the chain the encap rule was on.
Fix this by setting the chain id miss register to the chain id mapping.
Fixes: 8f1e0b97cc70 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping") Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-6-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Aya Levin [Wed, 26 Oct 2022 13:51:42 +0000 (14:51 +0100)]
net/mlx5e: Extend SKB room check to include PTP-SQ
When tx_port_ts is set, the driver diverts all UDP traffic over the PTP port
to a dedicated PTP-SQ. The SKBs are cached until the wire-CQE arrives.
When the packet size is greater than the MTU, the firmware might drop it and
the packet won't be transmitted to the wire, hence the wire-CQE won't
reach the driver. In this case the SKBs accumulate in the SKB fifo.
Extend the room check to consider the PTP-SQ SKB fifo. When the SKB fifo is
full, the driver stops the queue, resulting in a TX timeout. The devlink
TX-reporter can recover from it.
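A room check of this kind can be sketched with free-running producer and consumer counters. The structure and function names below are hypothetical and only model the idea; they are not the mlx5e definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the PTP-SQ SKB fifo. */
struct skb_fifo {
	uint16_t pc;   /* producer counter: SKBs pushed at transmit time */
	uint16_t cc;   /* consumer counter: SKBs popped on wire-CQE arrival */
	uint16_t size; /* capacity in entries */
};

/* Counters are free-running, so the 16-bit difference stays correct
 * across wraparound. */
static bool skb_fifo_has_room(const struct skb_fifo *f)
{
	return (uint16_t)(f->pc - f->cc) < f->size;
}

/* The queue may accept another packet only if both the send queue itself
 * and the SKB fifo have room. */
static bool ptpsq_can_xmit(bool sq_has_room, const struct skb_fifo *f)
{
	return sq_has_room && skb_fifo_has_room(f);
}
```

If wire-CQEs stop arriving, cc stops advancing, the fifo fills, and the combined check keeps the queue stopped instead of overrunning the fifo.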
Fixes: 1880bc4e4a96 ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rongwei Liu [Wed, 26 Oct 2022 13:51:41 +0000 (14:51 +0100)]
net/mlx5: DR, Fix matcher disconnect error flow
When a 2nd flow rule arrives, it is merged with the
1st one if the matcher criteria are the same.
If the merge fails, the driver rolls back the merged contents and
rejects the 2nd rule. At the rollback stage, the matcher can't be
disconnected unconditionally, otherwise the 1st rule can't be
hit anymore.
Add logic to check if the matcher should be disconnected or not.
Fixes: cc2295cd54e4 ("net/mlx5: DR, Improve steering for empty or RX/TX-only matchers") Signed-off-by: Rongwei Liu <rongweil@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-4-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Moshe Shemesh [Wed, 26 Oct 2022 13:51:40 +0000 (14:51 +0100)]
net/mlx5: Wait for firmware to enable CRS before pci_restore_state
After a firmware reset, the driver should verify that firmware has already
enabled CRS and become responsive to PCI config cycles before restoring PCI
state. Fix that by waiting until device_id is readable through PCI again.
Hyong Youb Kim [Wed, 26 Oct 2022 13:51:39 +0000 (14:51 +0100)]
net/mlx5e: Do not increment ESN when updating IPsec ESN state
An offloaded SA stops receiving after about 2^32 + replay_window
packets. For example, when SA reaches <seq-hi 0x1, seq 0x2c>, all
subsequent packets get dropped with SA-icv-failure (integrity_failed).
To reproduce the bug:
- ConnectX-6 Dx with crypto enabled (FW 22.30.1004)
- ipsec.conf:
nic-offload = yes
replay-window = 32
esn = yes
salifetime=24h
- Run netperf for a long time to send more than 2^32 packets
netperf -H <device-under-test> -t TCP_STREAM -l 20000
When 2^32 + replay_window packets are received, the replay window
moves from the 2nd half of subspace (overlap=1) to the 1st half
(overlap=0). The driver then updates the 'esn' value in NIC
(i.e. seq_hi) as follows.
seq_hi = xfrm_replay_seqhi(seq_bottom)
new esn in NIC = seq_hi + 1
The +1 increment is wrong, as seq_hi already contains the correct
seq_hi. For example, when seq_hi=1, the driver actually tells NIC to
use seq_hi=2 (esn). This incorrect esn value causes all subsequent
packets to fail integrity checks (SA-icv-failure). So, do not
increment.
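The effect of the extra increment can be illustrated with plain arithmetic. With ESN the integrity check covers the full 64-bit sequence number, so a wrong high half is enough to make every packet fail. The helpers below are hypothetical and only model the values, not the driver or xfrm code.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 64-bit extended sequence number from its two halves. */
static uint64_t esn64(uint32_t seq_hi, uint32_t seq)
{
	return ((uint64_t)seq_hi << 32) | seq;
}

/* High half programmed into the NIC before the fix: one too many. */
static uint32_t nic_esn_buggy(uint32_t seq_hi)
{
	return seq_hi + 1;
}

/* High half programmed after the fix: seq_hi is already correct. */
static uint32_t nic_esn_fixed(uint32_t seq_hi)
{
	return seq_hi;
}
```

For the example state <seq-hi 0x1, seq 0x2c>, the fixed variant gives 0x10000002c, matching the sender, while the buggy variant gives 0x20000002c.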
Fixes: cb01008390bb ("net/mlx5: IPSec, Add support for ESN") Signed-off-by: Hyong Youb Kim <hyonkim@cisco.com> Acked-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhengchao Shao [Wed, 26 Oct 2022 01:46:42 +0000 (09:46 +0800)]
netdevsim: remove dir in nsim_dev_debugfs_init() when creating ports dir failed
Remove dir in nsim_dev_debugfs_init() when creating ports dir failed.
Otherwise, the netdevsim device will not be created next time. Kernel
reports an error: debugfs: Directory 'netdevsim1' with parent 'netdevsim'
already present!
Fixes: ab1d0cc004d7 ("netdevsim: change debugfs tree topology") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhengchao Shao [Wed, 26 Oct 2022 01:54:05 +0000 (09:54 +0800)]
netdevsim: fix memory leak in nsim_bus_dev_new()
If device_register() failed in nsim_bus_dev_new(), the value of reference
in nsim_bus_dev->dev is 1. obj->name in nsim_bus_dev->dev will not be
released.
Jakub Kicinski [Thu, 27 Oct 2022 17:30:41 +0000 (10:30 -0700)]
Merge tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2022-10-27
Anssi Hannula fixes the use of the completions in the kvaser_usb
driver.
Biju Das contributes 2 patches for the rcar_canfd driver. An IRQ storm
that can be triggered by high CAN bus load and channel specific IRQ
handlers are fixed.
Yang Yingliang fixes the j1939 transport protocol by moving a
kfree_skb() out of a spin_lock_irqsave protected section.
* tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: j1939: transport: j1939_session_skb_drop_old(): spin_unlock_irqrestore() before kfree_skb()
can: rcar_canfd: fix channel specific IRQ handling for RZ/G2L
can: rcar_canfd: rcar_canfd_handle_global_receive(): fix IRQ storm on global FIFO receive
can: kvaser_usb: Fix possible completions during init_completion
====================
Rafał Miłecki [Thu, 27 Oct 2022 11:24:30 +0000 (13:24 +0200)]
net: broadcom: bcm4908_enet: update TX stats after actual transmission
Queueing packets doesn't guarantee their transmission. Update TX stats
after hardware confirms consuming submitted data.
This also fixes a possible race and NULL dereference.
bcm4908_enet_start_xmit() could try to access skb after freeing it in
the bcm4908_enet_poll_tx().
====================
ip: rework the fix for dflt addr selection for connected nexthop
This series reworks the fix that is reverted in the second commit.
As Julian explained, nhc_scope is related to nhc_gw, it's not the scope of
the route.
====================
Nicolas Dichtel [Thu, 20 Oct 2022 10:09:52 +0000 (12:09 +0200)]
nh: fix scope used to find saddr when adding non gw nh
As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but
fib_info_update_nhc_saddr() needs the scope of the route, which is
the scope "before" fib_nh_scope, ie fib_nh_scope - 1.
This patch fixes the problem described in commit 747c14307214 ("ip: fix
dflt addr selection for connected nexthop").
As explained by Julian, nhc_scope is related to nhc_gw, not to the route.
Revert the original patch. The initial problem is fixed differently in the
next commit.
Jakub Kicinski [Wed, 26 Oct 2022 00:15:24 +0000 (17:15 -0700)]
genetlink: limit the use of validation workarounds to old ops
During review of previous change another thing came up - we should
limit the use of validation workarounds to old commands.
Don't list the workarounds one by one, as we're rejecting all existing
ones. We can deal with the masking in the unlikely event that a new flag
is added.