git.ipfire.org Git - thirdparty/kernel/linux.git/log

net: stmmac: dwmac-loongson: Init ref and PTP clocks rate

Reference and PTP clocks rate of the Loongson GMAC devices is 125MHz.
(So is in the GNET devices which support is about to be added.) Set
the respective plat_stmmacenet_data field up in accordance with that
so to have the coalesce command and timestamping work correctly.

Fixes: 30bba69d7db4 ("stmmac: pci: Add dwmac support for Loongson")
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-loongson: Detach GMAC-specific platform data init

Loongson delivers two types of the network devices: Loongson GMAC and
Loongson GNET in the framework of four SOC/Chipsets revisions:

   Chip             Network  PCI Dev ID   Synopys Version   DMA-channel
LS2K1000 SOC         GMAC      0x7a03       v3.50a/v3.73a        1
LS7A1000 Chipset     GMAC      0x7a03       v3.50a/v3.73a        1
LS2K2000 SOC         GMAC      0x7a03          v3.73a            8
LS2K2000 SOC         GNET      0x7a13          v3.73a            8
LS7A2000 Chipset     GNET      0x7a13          v3.73a            1

The driver currently supports the chips with the Loongson GMAC network
device synthesized with a single DMA-channel available. As a
preparation before adding the Loongson GNET support detach the
Loongson GMAC-specific platform data initializations to the
loongson_gmac_data() method and preserve the common settings in the
loongson_default_data().

While at it drop the return value statement from the
loongson_default_data() method as redundant.

Note there is no intermediate vendor-specific PCS in between the MAC
and PHY on Loongson GMAC and GNET. So the plat->mac_interface field
can be freely initialized with the PHY_INTERFACE_MODE_NA value.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-loongson: Use PCI_DEVICE_DATA() macro for device identification

For the readability sake convert the hard-coded Loongson GMAC PCI ID to
the respective macro and use the PCI_DEVICE_DATA() macro-function to
create the pci_device_id array entry. The later change will be
specifically useful in order to assign the device-specific data for the
currently supported device and for about to be added Loongson GNET
controller.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-loongson: Drop pci_enable/disable_msi calls

The Loongson GMAC driver currently doesn't utilize the MSI IRQs, but
retrieves the IRQs specified in the device DT-node. Let's drop the
direct pci_enable_msi()/pci_disable_msi() calls then as redundant

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: dwmac-loongson: Drop duplicated hash-based filter size init

The plat_stmmacenet_data::multicast_filter_bins field is twice
initialized in the loongson_default_data() method. Drop the redundant
initialization, but for the readability sake keep the filters init
statements defined in the same place of the method.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Export dwmac1000_dma_ops

Export the DW GMAC DMA-ops descriptor so one could be available in
the low-level platform drivers. It will be utilized to override some
callbacks in order to handle the LS2K2000 GNET device specifics. The
GNET controller support is being added in one of the following up
commits.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Add multi-channel support

DW GMAC v3.73 can be equipped with the Audio Video (AV) feature which
enables transmission of time-sensitive traffic over bridged local area
networks (DWC Ethernet QoS Product). In that case there can be up to two
additional DMA-channels available with no Tx COE support (unless there is
vendor-specific IP-core alterations). Each channel is implemented as a
separate Control and Status register (CSR) for managing the transmit and
receive functions, descriptor handling, and interrupt handling.

Add the multi-channels DW GMAC controllers support just by making sure the
already implemented DMA-configs are performed on the per-channel basis.

Note the only currently known instance of the multi-channel DW GMAC
IP-core is the LS2K2000 GNET controller, which has been released with the
vendor-specific feature extension of having eight DMA-channels. The device
support will be added in one of the following up commits.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: stmmac: Move the atds flag to the stmmac_dma_cfg structure

ATDS (Alternate Descriptor Size) is a part of the DMA Bus Mode configs
(together with PBL, ALL, EME, etc) of the DW GMAC controllers. Seeing
it's not changed at runtime but is activated as long as the IP-core
has it supported (at least due to the Type 2 Full Checksum Offload
Engine feature), move the respective parameter from the
stmmac_dma_ops::init() callback argument to the stmmac_dma_cfg
structure, which already have the rest of the DMA-related configs
defined.

Besides the being added in the next commit DW GMAC multi-channels
support will require to add the stmmac_dma_ops::init_chan() callback
and have the ATDS flag set/cleared for each channel in there. Having
the atds-flag in the stmmac_dma_cfg structure will make the parameter
accessible from stmmac_dma_ops::init_chan() callback too.

Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Yinggang Gu <guyinggang@loongson.cn>
Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
Acked-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Tested-by: Serge Semin <fancer.lancer@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/smc: Use static_assert() to check struct sizes

Commit 9748dbc9f265 ("net/smc: Avoid -Wflex-array-member-not-at-end
warnings") introduced tagged `struct smc_clc_v2_extension_fixed` and
`struct smc_clc_smcd_v2_extension_fixed`. We want to ensure that when
new members need to be added to the flexible structures, they are
always included within these tagged structs.

So, we use `static_assert()` to ensure that the memory layout for
both the flexible structure and the tagged struct is the same after
any changes.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jan Karcher <jaka@linux.ibm.com>
Link: https://patch.msgid.link/ZrVBuiqFHAORpFxE@cute
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfp: Use static_assert() to check struct sizes

Commit d88cabfd9abc ("nfp: Avoid -Wflex-array-member-not-at-end
warnings") introduced tagged `struct nfp_dump_tl_hdr`. We want
to ensure that when new members need to be added to the flexible
structure, they are always included within this tagged struct.

So, we use `static_assert()` to ensure that the memory layout for
both the flexible structure and the tagged struct is the same after
any changes.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ZrVB43Hen0H5WQFP@cute
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sched: act_ct: avoid -Wflex-array-member-not-at-end warning

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

Remove unnecessary flex-array member `pad[]` and refactor the related
code a bit.

Fix the following warning:
net/sched/act_ct.c:57:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/ZrY0JMVsImbDbx6r@cute
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-nexthop-increase-weight-to-u16'

Petr Machata says:

====================
net: nexthop: Increase weight to u16

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patchset increase the next hop weight from u8 to u16.

Patch #1 adds a flag that indicates whether the reserved fields are zeroed.
This is a follow-up to a new fix merged in commit 6d745cd0e972 ("net:
nexthop: Initialize all fields in dumped nexthops"). The theory behind this
patch is that there is a strict ordering between the fields actually being
zeroed, the kernel declaring that they are, and the kernel repurposing the
fields. Thus clients can use the flag to tell if it is safe to interpret
the reserved fields in any way.

Patch #2 contains the substantial code and the commit message covers the
details of the changes.

Patches #3 to #6 add selftests.
====================

Link: https://patch.msgid.link/cover.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: fib_nexthops: Test 16-bit next hop weights

Add tests that attempt to create NH groups that use full 16 bits of NH
weight.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/101cdd3f2bfd9511c9bec95f909d20ff56f70ba5.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: router_mpath_nh_res: Test 16-bit next hop weights

Add tests that exercise full 16 bits of NH weight.

Like in the previous patch, omit the 255:65535 test when KSFT_MACHINE_SLOW.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/a91d6ead9d1b1b4b7e276ca58a71ef814f42b7dd.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: router_mpath_nh: Test 16-bit next hop weights

Add tests that exercise full 16 bits of NH weight.

To test the 255:65535, it is necessary to run more packets than for the
other tests. On a debug kernel, the test can take up to a minute, therefore
avoid the test when KSFT_MACHINE_SLOW.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/c0c257c00ad30b07afc3fa5e2afd135925405544.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: router_mpath: Sleep after MZ

In the context of an offloaded datapath, it may take a while for the ip
link stats to be updated. This causes the test to fail when MZ_DELAY is too
low. Sleep after the packets are sent for the link stats to get up to date.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/8b1971d948273afd7de2da3d6a2ba35200540e55.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: nexthop: Increase weight to u16

In CLOS networks, as link failures occur at various points in the network,
ECMP weights of the involved nodes are adjusted to compensate. With high
fan-out of the involved nodes, and overall high number of nodes,
a (non-)ECMP weight ratio that we would like to configure does not fit into
8 bits. Instead of, say, 255:254, we might like to configure something like
1000:999. For these deployments, the 8-bit weight may not be enough.

To that end, in this patch increase the next hop weight from u8 to u16.

Increasing the width of an integral type can be tricky, because while the
code still compiles, the types may not check out anymore, and numerical
errors come up. To prevent this, the conversion was done in two steps.
First the type was changed from u8 to a single-member structure, which
invalidated all uses of the field. This allowed going through them one by
one and audit for type correctness. Then the structure was replaced with a
vanilla u16 again. This should ensure that no place was missed.

The UAPI for configuring nexthop group members is that an attribute
NHA_GROUP carries an array of struct nexthop_grp entries:

struct nexthop_grp {
__u32 id;   /* nexthop id - must exist */
__u8 weight;   /* weight of this nexthop */
__u8 resvd1;
__u16 resvd2;
};

The field resvd1 is currently validated and required to be zero. We can
lift this requirement and carry high-order bits of the weight in the
reserved field:

struct nexthop_grp {
__u32 id;   /* nexthop id - must exist */
__u8 weight;   /* weight of this nexthop */
__u8 weight_high;
__u16 resvd2;
};

Keeping the fields split this way was chosen in case an existing userspace
makes assumptions about the width of the weight field, and to sidestep any
endianness issues.

The weight field is currently encoded as the weight value minus one,
because weight of 0 is invalid. This same trick is impossible for the new
weight_high field, because zero must mean actual zero. With this in place:

- Old userspace is guaranteed to carry weight_high of 0, therefore
  configuring 8-bit weights as appropriate. When dumping nexthops with
  16-bit weight, it would only show the lower 8 bits. But configuring such
  nexthops implies existence of userspace aware of the extension in the
  first place.

- New userspace talking to an old kernel will work as long as it only
  attempts to configure 8-bit weights, where the high-order bits are zero.
  Old kernel will bounce attempts at configuring >8-bit weights.

Renaming reserved fields as they are allocated for some purpose is commonly
done in Linux. Whoever touches a reserved field is doing so at their own
risk. nexthop_grp::resvd1 in particular is currently used by at least
strace, however they carry an own copy of UAPI headers, and the conversion
should be trivial. A helper is provided for decoding the weight out of the
two fields. Forcing a conversion seems preferable to bending backwards and
introducing anonymous unions or whatever.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/483e2fcf4beb0d9135d62e7d27b46fa2685479d4.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: nexthop: Add flag to assert that NHGRP reserved fields are zero

There are many unpatched kernel versions out there that do not initialize
the reserved fields of struct nexthop_grp. The issue with that is that if
those fields were to be used for some end (i.e. stop being reserved), old
kernels would still keep sending random data through the field, and a new
userspace could not rely on the value.

In this patch, use the existing NHA_OP_FLAGS, which is currently inbound
only, to carry flags back to the userspace. Add a flag to indicate that the
reserved fields in struct nexthop_grp are zeroed before dumping. This is
reliant on the actual fix from commit 6d745cd0e972 ("net: nexthop:
Initialize all fields in dumped nexthops").

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/21037748d4f9d8ff486151f4c09083bcf12d5df8.1723036486.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ipv6: eliminate ndisc_ops_is_useropt()

as it doesn't seem to offer anything of value.

There's only 1 trivial user:
  int lowpan_ndisc_is_useropt(u8 nd_opt_type) {
    return nd_opt_type == ND_OPT_6CO;
  }

but there's no harm to always treating that as
a useropt...

Cc: David Ahern <dsahern@kernel.org>
Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://patch.msgid.link/20240730003010.156977-1-maze@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'eth-fbnic-add-basic-stats'

Jakub Kicinski says:

====================
eth: fbnic: add basic stats

Add basic interface stats to fbnic.

v1: https://lore.kernel.org/20240807022631.1664327-1-kuba@kernel.org
====================

Link: https://patch.msgid.link/20240810054322.2766421-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: add support for basic qstats

Implement netdev_stat_ops and export the basic per-queue stats.

This interface expect users to set the values that are used
either to zero or to some other preserved value (they are 0xff by
default). So here we export bytes/packets/drops from tx and rx_stats
plus set some of the values that are exposed by queue stats
to zero.

  $ cd tools/testing/selftests/drivers/net && ./stats.py
  [...]
  Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20240810054322.2766421-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

eth: fbnic: add basic rtnl stats

Count packets, bytes and drop on the datapath, and report
to the user. Since queues are completely freed when the
device is down - accumulate the stats in the main netdev struct.
This means that per-queue stats will only report values since
last reset (per qstat recommendation).

Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20240810054322.2766421-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ethtool-rss-driver-tweaks'

Jakub Kicinski says:

====================
ethtool: rss: driver tweaks and netlink context dumps

This series is a semi-related collection of RSS patches.
Main point is supporting dumping RSS contexts via ethtool netlink.
At present additional RSS contexts can be queried one by one, and
assuming user know the right IDs. This series uses the XArray
added by Ed to provide netlink dump support for ETHTOOL_GET_RSS.

Patch 1 is a trivial selftest debug patch.
Patch 2 coverts mvpp2 for no real reason other than that I had
a grand plan of converting all drivers at some stage.
Patch 3 removes a now moot check from mlx5 so that all tests
can pass.
Patch 4 and 5 make a bit used for context support optional,
for easier grepping of drivers which need converting
if nothing else.
Patch 6 OTOH adds a new cap bit; some devices don't support
using a different key per context and currently act
in surprising ways.
Patch 7 and 8 update the RSS netlink code to use XArray.
Patch 9 and 10 add support for dumping contexts.
Patch 11 and 12 are small adjustments to spec and a new test.

I'm getting distracted with other work, so probably won't have
the time soon to complete next steps, but things which are missing
are (and some of these may be bad ideas):

- better discovery

   Some sort of API to tell the user who many contexts the device
   can create. Upper bound, devices often share contexts between
   ports etc. so it's hard to tell exactly and upfront number of
   contexts for a netdev. But order of magnitude (4 vs 10s) may
   be enough for container management system to know whether to bother.

- create/modify/delete via netlink

   The only question here is how to handle all the tricky IOCTL
   legacy. "No change" maps trivially to attribute not present.
   "reset" (indir_size = 0) probably needs to be a new NLA_FLAG?

- better table size handling

   The current API assumes the LUT has fixed size, which isn't
   true for modern devices. We should have better APIs for the
   drivers to resize the tables, and in user facing API -
   the ability to specify pattern and min size rather than
   exact table expected (sort of like ethtool CLI already does).

- recounted / socket-bound contexts

   Support for contexts which get "cleaned up" when their parent
   netlink socket gets closed. The major catch is that ntuple
   filters (which we don't currently track) depend on the context,
   so we need auto-removal for both.

v5:
- fix build
v4: https://lore.kernel.org/20240809031827.2373341-1-kuba@kernel.org
- adjust to the meaning of max context from net
v3: https://lore.kernel.org/20240806193317.1491822-1-kuba@kernel.org
- quite a few code comments and commit message changes
- mvpp2: fix interpretation of max_context_id (I'll take care of
   the net -> net-next merge as needed)
- filter by ifindex in the selftest
v2: https://lore.kernel.org/20240803042624.970352-1-kuba@kernel.org
- fix bugs and build in mvpp2
v1: https://lore.kernel.org/20240802001801.565176-1-kuba@kernel.org
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: drv-net: rss_ctx: test dumping RSS contexts

Add a test for dumping RSS contexts. Make sure indir table
and key are sane when contexts are created with various
combination of inputs. Test the dump filtering by ifname
and start-context.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: specs: decode indirection table as u32 array

Indirection table is dumped as a raw u32 array, decode it.
It's tempting to decode hash key, too, but it is an actual
bitstream, so leave it be for now.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: rss: support skipping contexts during dump

Applications may want to deal with dynamic RSS contexts only.
So dumping context 0 will be counter-productive for them.
Support starting the dump from a given context ID.

Alternative would be to implement a dump flag to skip just
context 0, not sure which is better...

Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: rss: support dumping RSS contexts

Now that we track RSS contexts in the core we can easily dump
them. This is a major introspection improvement, as previously
the only way to find all contexts would be to try all ids
(of which there may be 2^32 - 1).

Don't use the XArray iterators (like xa_for_each_start()) as they
do not move the index past the end of the array once done, which
caused multiple bugs in Netlink dumps in the past.

Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: rss: report info about additional contexts from XArray

IOCTL already uses the XArray when reporting info about additional
contexts. Do the same thing in netlink code.

Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: rss: move the device op invocation out of rss_prepare_data()

Factor calling device ops out of rss_prepare_data().
Next patch will add alternative path using xarray.
No functional changes.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: rss: don't report key if device doesn't support it

marvell/otx2 and mvpp2 do not support setting different
keys for different RSS contexts. Contexts have separate
indirection tables but key is shared with all other contexts.
This is likely fine, indirection table is the most important
piece.

Don't report the key-related parameters from such drivers.
This prevents driver-errors, e.g. otx2 always writes
the main key, even when user asks to change per-context key.
The second reason is that without this change tracking
the keys by the core gets complicated. Even if the driver
correctly reject setting key with rss_context != 0,
change of the main key would have to be reflected in
the XArray for all additional contexts.

Since the additional contexts don't have their own keys
not including the attributes (in Netlink speak) seems
intuitive. ethtool CLI seems to deal with it just fine.

Having to set the flag in majority of the drivers is
a bit tedious but not reporting the key is a safer
default.

Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

eth: remove .cap_rss_ctx_supported from updated drivers

Remove .cap_rss_ctx_supported from drivers which moved to the new API.
This makes it easy to grep for drivers which still need to be converted.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: make ethtool_ops::cap_rss_ctx_supported optional

cap_rss_ctx_supported was created because the API for creating
and configuring additional contexts is mux'ed with the normal
RSS API. Presence of ops does not imply driver can actually
support rss_context != 0 (in fact drivers mostly ignore that
field). cap_rss_ctx_supported lets core check that the driver
is context-aware before calling it.

Now that we have .create_rxfh_context, there is no such
ambiguity. We can depend on presence of the op.
Make setting the bit optional.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

eth: mlx5: allow disabling queues when RSS contexts exist

Since commit 24ac7e544081 ("ethtool: use the rss context XArray
in ring deactivation safety-check") core will prevent queues from
being disabled while being used by additional RSS contexts.
The safety check is no longer necessary, and core will do a more
accurate job of only rejecting changes which can actually break
things.

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

eth: mvpp2: implement new RSS context API

Implement the separate create/modify/delete ops for RSS.

No problems with IDs - even tho RSS tables are per device
the driver already seems to allocate IDs linearly per port.
There's a translation table from per-port context ID
to device context ID.

mvpp2 doesn't have a key for the hash, it defaults to
an empty/previous indir table.

Note that there is no key at all, so we don't have to be
concerned with reporting the wrong one (which is addressed
by a patch later in the series).

Compile-tested only.

Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: drv-net: rss_ctx: add identifier to traffic comments

Include the "name" of the context in the comment for traffic
checks. Makes it easier to reason about which context failed
when we loop over 32 contexts (it may matter if we failed in
first vs last, for example).

Reviewed-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: vxlan: remove duplicated initialization in vxlan_xmit

The variable "did_rsc" is initialized twice, which is unnecessary. Just
remove one of them.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sunvnet: use ethtool_sprintf/puts

Simpler and allows avoiding manual pointer addition.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: microchip: ksz9477: split half-duplex monitoring function

In order to respect the 80 columns limit, split the half-duplex
monitoring function in two.

This is just a styling change, no functional change.

Signed-off-by: Enguerrand de Ribaucourt <enguerrand.de-ribaucourt@savoirfairelinux.com>
Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'phylib-fixed-speed-1G'

Russell King says:

====================
net: phylib: fix fixed-speed >= 1G

This is v2 of the patch (now patches) adding support for ethtool
!autoneg while respecting the requirements of IEEE 802.3.

v2 fixes the build errors in the previous patch by first constifying
the "advertisement" argument to the linkmode functions that only
read from this pointer. It also fixes the incorrectly named
linkmode_set function.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phylib: do not disable autoneg for fixed speeds >= 1G

We have an increasing number of drivers that are forcing
auto-negotiation to be enabled for speeds of 1G or faster.

It would appear that auto-negotiation is mandatory for speeds above
100M. In 802.3, Annex 40C's state diagrams seems to imply that
mr_autoneg_enable (BMCR AN ENABLE) doesn't affect whether or not the
AN state machines work for 1000base-T, and some PHY datasheets (e.g.
Marvell Alaska) state that disabling mr_autoneg_enable leaves AN
enabled but forced to 1G full duplex.

Other PHY datasheets imply that BMCR AN ENABLE should not be cleared
for >= 1G.

Thus, this should be handled in phylib rather than in each driver.

Rather than erroring out, arrange to implement the Marvell Alaska
solution but in software for all PHYs: generate an appropriate
single-speed advertisement for the requested speed, and keep AN
enabled to the PHY driver. However, to avoid userspace API breakage,
continue to report to userspace that we have AN disabled.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mii: constify advertising mask

Constify the advertising mask to linkmode functions that only read from
the advertising mask.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mvpp2-child-port-removal'

Javier Carrasco says:

====================
net: mvpp2: rework child node/port removal handling

These two patches used to be part of another series [1] that did not
apply to the networking tree without conflicts. This is therefore just a
partial resend with no code modifications, just rebased onto net/main.

Link: https://lore.kernel.org/all/20240806181026.5fe7f777@kernel.org/
====================

Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: use device_for_each_child_node() to access device child nodes

The iterated nodes are direct children of the device node, and the
`device_for_each_child_node()` macro accounts for child node
availability.

`fwnode_for_each_available_child_node()` is meant to access the child
nodes of an fwnode, and therefore not direct child nodes of the device
node.

The child nodes within mvpp2_probe are not accessed outside the loops,
and the scoped version of the macro can be used to automatically
decrement the refcount on early exits.

Use `device_for_each_child_node()` and its scoped variant to indicate
device's direct child nodes.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvpp2: use port_count to remove ports

As discussed in [1], there is no need to iterate over child nodes to
remove the list of ports. Instead, a loop up to `port_count` ports can
be used, and is in fact more reliable in case the child node
availability changes.

The suggested approach removes the need for the `fwnode` and
`port_fwnode` variables in mvpp2_remove() as well.

Link: https://lore.kernel.org/all/ZqdRgDkK1PzoI2Pf@shell.armlinux.org.uk/
Suggested-by: Russell King <linux@armlinux.org.uk>
Signed-off-by: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-fix-queue-reset-when-queue-active'

David Wei says:

====================
fix bnxt_en queue reset when queue is active

The current bnxt_en queue API implementation is buggy when resetting a
queue that has active traffic. The problem is that there is no FW
involved to stop the flow of packets and relying on napi_disable() isn't
enough.

To fix this, call bnxt_hwrm_vnic_update() with MRU set to 0 for both the
default and the ntuple vnic to stop the flow of packets. This works for
any Rx queue and not only those that have ntuple rules since every Rx
queue is either in the default or the ntuple vnic.

For bnxt_hwrm_vnic_update() to work, proper flushing must be done by the
FW. A FW flag is there to indicate support and queue_mgmt_ops is keyed
behind this.

The first three patches are from Michael Chan and adds the prerequisite
vnic functions and FW flags indicating that it will properly flush
during vnic update.

Tested on BCM957504 while iperf3 is active:

1. Reset a queue that has an ntuple rule steering flow into it
2. Reset all queues in order, one at a time

In both cases the flow is not interrupted.

Sending this to net-next as there is no in-tree kernel consumer of queue
API just yet, and there is a patch that changes when the queue_mgmt_ops
is registered.

Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
---
v3:
- include patches from Michael Chan that adds a FW flag for vnic flush
capability
- key support for queue_mgmt_ops behind this new flag

v2:
- split setting vnic->mru into a separate patch (Wojciech)
- clarify why napi_enable()/disable() is removed
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: only set dev->queue_mgmt_ops if supported by FW

The queue API calls bnxt_hwrm_vnic_update() to stop/start the flow of
packets, which can only properly flush the pipeline if FW indicates
support.

Add a macro BNXT_SUPPORTS_QUEUE_API that checks for the required flags
and only set queue_mgmt_ops if true.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: stop packet flow during bnxt_queue_stop/start

The current implementation when resetting a queue while packets are
flowing puts the queue into an inconsistent state.

There needs to be some synchronisation with the FW. Add calls to
bnxt_hwrm_vnic_update() to set the MRU for both the default and ntuple
vnic during queue start/stop. When the MRU is set to 0, flow is stopped.
Each Rx queue belongs to either the default or the ntuple vnic.

With calling bnxt_hwrm_vnic_update() the calls to napi_enable() and
napi_disable() must be removed for reset to work on a queue that has
active traffic flowing e.g. iperf3.

Co-developed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: set vnic->mru in bnxt_hwrm_vnic_cfg()

Set the newly added vnic->mru field in bnxt_hwrm_vnic_cfg().

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Check the FW's VNIC flush capability

Check the HWRM_VNIC_QCAPS FW response for the receive engine flush
capability. This capability indicates that we can reliably support
RX ring restart when calling HWRM_VNIC_UPDATE with MRU set to 0.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add support to call FW to update a VNIC

Add the function bnxt_hwrm_vnic_update() to call FW to update
a VNIC. This call can be used when disabling and enabling a
receive ring within a VNIC. The mru which is the maximum receive
size of packets received by the VNIC can be updated.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Update firmware interface to 1.10.3.68

The main changes are:

1. HWRM_VNIC_UPDATE used to safely disable and enable an RX ring within
the VNIC.

2. New flag in HWRM_VNIC_QCAPS to indicate FW will do the proper flush
during HWRM_VNIC_UPDATE.

3. New flag in HWRM_FUNC_QCAPS to indicate that reservations for some
resources such as VNIC can be reduced.

4. New backing store memory types not used by the driver yet.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tp-misc-improvements'

James Chapman says:

====================
l2tp: misc improvements

This series makes several improvements to l2tp:

* update documentation to be consistent with recent l2tp changes.
* move l2tp_ip socket tables to per-net data.
* fix handling of hash key collisions in l2tp_v3_session_get
* implement and use get-next APIs for management and procfs/debugfs.
* improve l2tp refcount helpers.
* use per-cpu dev->tstats in l2tpeth devices.
* fix a lockdep splat.
* fix a race between l2tp_pre_exit_net and pppol2tp_release.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: flush workqueue before draining it

syzbot exposes a race where a net used by l2tp is removed while an
existing pppol2tp socket is closed. In l2tp_pre_exit_net, l2tp queues
TUNNEL_DELETE work items to close each tunnel in the net. When these
are run, new SESSION_DELETE work items are queued to delete each
session in the tunnel. This all happens in drain_workqueue. However,
drain_workqueue allows only new work items if they are queued by other
work items which are already in the queue. If pppol2tp_release runs
after drain_workqueue has started, it may queue a SESSION_DELETE work
item, which results in the warning below in drain_workqueue.

Address this by flushing the workqueue before drain_workqueue such
that all queued TUNNEL_DELETE work items run before drain_workqueue is
started. This will queue SESSION_DELETE work items for each session in
the tunnel, hence pppol2tp_release or other API requests won't queue
SESSION_DELETE requests once drain_workqueue is started.

  WARNING: CPU: 1 PID: 5467 at kernel/workqueue.c:2259 __queue_work+0xcd3/0xf50 kernel/workqueue.c:2258
  Modules linked in:
  CPU: 1 UID: 0 PID: 5467 Comm: syz.3.43 Not tainted 6.11.0-rc1-syzkaller-00247-g3608d6aca5e7 #0
  Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
  RIP: 0010:__queue_work+0xcd3/0xf50 kernel/workqueue.c:2258
  Code: ff e8 11 84 36 00 90 0f 0b 90 e9 1e fd ff ff e8 03 84 36 00 eb 13 e8 fc 83 36 00 eb 0c e8 f5 83 36 00 eb 05 e8 ee 83 36 00 90 <0f> 0b 90 48 83 c4 60 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc
  RSP: 0018:ffffc90004607b48 EFLAGS: 00010093
  RAX: ffffffff815ce274 RBX: ffff8880661fda00 RCX: ffff8880661fda00
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: 0000000000000000 R08: ffffffff815cd6d4 R09: 0000000000000000
  R10: ffffc90004607c20 R11: fffff520008c0f85 R12: ffff88802ac33800
  R13: ffff88802ac339c0 R14: dffffc0000000000 R15: 0000000000000008
  FS:  00005555713eb500(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000008 CR3: 000000001eda6000 CR4: 00000000003506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   queue_work_on+0x1c2/0x380 kernel/workqueue.c:2392
   pppol2tp_release+0x163/0x230 net/l2tp/l2tp_ppp.c:445
   __sock_release net/socket.c:659 [inline]
   sock_close+0xbc/0x240 net/socket.c:1421
   __fput+0x24a/0x8a0 fs/file_table.c:422
   task_work_run+0x24f/0x310 kernel/task_work.c:228
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
   exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline]
   __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
   syscall_exit_to_user_mode+0x168/0x370 kernel/entry/common.c:218
   do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f061e9779f9
  Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007ffff1c1fce8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4
  RAX: 0000000000000000 RBX: 000000000001017d RCX: 00007f061e9779f9
  RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003
  RBP: 00007ffff1c1fdc0 R08: 0000000000000001 R09: 00007ffff1c1ffcf
  R10: 00007f061e800000 R11: 0000000000000246 R12: 0000000000000032
  R13: 00007ffff1c1fde0 R14: 00007ffff1c1fe00 R15: ffffffffffffffff
  </TASK>

Fixes: fc7ec7f554d7 ("l2tp: delete sessions using work queue")
Reported-by: syzbot+0e85b10481d2f5478053@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=0e85b10481d2f5478053
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: l2tp_eth: use per-cpu counters from dev->tstats

l2tp_eth uses old-style dev->stats for fastpath packet/byte
counters. Convert it to use dev->tstats per-cpu counters.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: improve tunnel/session refcount helpers

l2tp_tunnel_inc_refcount and l2tp_session_inc_refcount wrap
refcount_inc. They add no value so just use the refcount APIs directly
and drop l2tp's helpers. l2tp already uses refcount_inc_not_zero
anyway.

Rename l2tp_tunnel_dec_refcount and l2tp_session_dec_refcount to
l2tp_tunnel_put and l2tp_session_put to better match their use pairing
various _get getters.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: use get_next APIs for management requests and procfs/debugfs

l2tp netlink and procfs/debugfs iterate over tunnel and session lists
to obtain data. They currently use very inefficient get_nth functions
to do so. Replace these with get_next.

For netlink, use nl cb->ctx[] for passing state instead of the
obsolete cb->args[].

l2tp_tunnel_get_nth and l2tp_session_get_nth are no longer used so
they can be removed.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: add tunnel/session get_next helpers

l2tp management APIs and procfs/debugfs iterate over l2tp tunnel and
session lists. Since these lists are now implemented using IDR, we can
use IDR get_next APIs to iterate them. Add tunnel/session get_next
functions to do so.

The session get_next functions get the next session in a given tunnel
and need to account for l2tpv2 and l2tpv3 differences:

* l2tpv2 sessions are keyed by tunnel ID / session ID. Iteration for
   a given tunnel ID, TID, can therefore start with a key given by
   TID/0 and finish when the next entry's tunnel ID is not TID. This
   is possible only because the tunnel ID part of the key is the upper
   16 bits and the session ID part the lower 16 bits; when idr_next
   increments the key value, it therefore finds the next sessions of
   the current tunnel before those of the next tunnel. Entries with
   session ID 0 are always skipped because they are used internally by
   pppol2tp.

* l2tpv3 sessions are keyed by session ID. Iteration starts at the
   first IDR entry and skips entries where the tunnel does not
   match. Iteration must also consider session ID collisions and walk
   the list of colliding sessions (if any) for one which matches the
   supplied tunnel.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: handle hash key collisions in l2tp_v3_session_get

To handle colliding l2tpv3 session IDs, l2tp_v3_session_get searches a
hashed list keyed by ID and sk. Although unlikely, if hash keys
collide, it is possible that hash_for_each_possible loops over a
session which doesn't have the ID that we are searching for. So check
for session ID match when looping over possible hash key matches.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: move l2tp_ip and l2tp_ip6 data to pernet

l2tp_ip[6] have always used global socket tables. It is therefore not
possible to create l2tpip sockets in different namespaces with the
same socket address.

To support this, move l2tpip socket tables to pernet data.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove inline from functions in c sources

Update l2tp to remove the inline keyword from several functions in C
sources, since this is now discouraged.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

documentation/networking: update l2tp docs

l2tp no longer uses sk_user_data in tunnel sockets and now manages
tunnel/session lifetimes slightly differently. Update docs to cover
this.

CC: linux-doc@vger.kernel.org
CC: corbet@lwn.net
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx5-misc-patches-2024-08-08'

Tariq Toukan says:

====================
mlx5 misc patches 2024-08-08

This patchset contains multiple enhancements from the team to the mlx5
core and Eth drivers.

Patch #1 by Chris bumps a defined value to permit more devices doing TC
offloads.

Patch #2 by Jianbo adds an IPsec fast-path optimization to replace the
slow async handling.

Patches #3 and #4 by Jianbo add TC offload support for complicated rules
to overcome firmware limitation.

Patch #5 by Gal unifies the access macro to advertised/supported link
modes.

Patches #6 to #9 by Gal adds extack messages in ethtool ops to replace
prints to the kernel log.

Patch #10 by Cosmin switches to using 'update' verb instead of 'replace'
to better reflect the operation.

Patch #11 by Cosmin exposes an update connection tracking operation to
replace the assumed delete+add implementaiton.
====================

Link: https://patch.msgid.link/20240808055927.2059700-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: CT: Update connection tracking steering entries

Previously, replacing a connection tracking steering entry was done by
adding a new rule (with the same tag but possibly different mod hdr
actions/labels) then removing the old rule.

This approach doesn't work in hardware steering because two steering
entries with the same tag cannot coexist in a hardware steering table.

This commit prepares for that by adding a new ct_rule_update operation on
the ct_fs_ops struct which is used instead of add+delete.
Implementations for both dmfs (firmware steering) and smfs (software
steering) are provided, which simply add the new rule and delete the old
one.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-12-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: CT: 'update' rules instead of 'replace'

Offloaded rules can be updated with a new modify header action
containing a changed restore cookie. This was done using the verb
'replace', while in some configurations 'update' is a better fit.

This commit renames the functions used to reflect that.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-11-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use extack in get module eeprom by page callback

In case of errors in get module eeprom by page, reflect it through
extack instead of a dmesg print.
While at it, make the messages more human friendly.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use extack in set coalesce callback

In case of errors in set coalesce, reflect it through extack instead of
a dmesg print.
While at it, make the messages more human friendly.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-9-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use extack in get coalesce callback

In case of errors in get coalesce, reflect it through extack instead of
a dmesg print.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-8-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Use extack in set ringparams callback

In case of errors in set ringparams, reflect it through extack instead
of a dmesg print.
While at it, make the messages more human friendly and remove two
redundant checks that are already validated by the core.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-7-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Be consistent with bitmap handling of link modes

Use the bitmap operations when accessing the advertised/supported link
modes and remove places that access them as arrays of unsigned longs
(underlying implementation of the bitmap), this makes the code much more
readable and clear.

Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, Offload rewrite and mirror to both internal and external dests

Firmware has the limitation that it cannot offload a rule with rewrite
and mirror to internal and external destinations simultaneously.

This patch adds a workaround to this issue. Here the destination array
is split again, just like what's done in previous commit, but after
the action indexed by split_count - 1. An extra rule is added for the
leftover destinations. Such rule can be offloaded, even there are
destinations to both internal and external destinations, because the
header rewrite is left in the original FTE.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: TC, Offload rewrite and mirror on tunnel over ovs internal port

To offload the encap rule when the tunnel IP is configured on an
openvswitch internal port, driver need to overwrite vport metadata in
reg_c0 to the value assigned to the internal port, then forward
packets to root table to be processed again by the rules matching on
the metadata for such internal port.

When such rule is combined with header rewrite and mirror, openvswitch
generates the rule like the following, because it resets mirror after
packets are modified.
in_port(enp8s0f0npf0sf1),..,
actions:enp8s0f0npf0sf2,set(tunnel(...)),set(ipv4(...)),vxlan_sys_4789,enp8s0f0npf0sf2

The split_count was introduced before to support rewrite and mirror.
Driver splits the rule into two different hardware rules in order to
offload it. But it's not enough to offload the above complicated rule
because of the limitations, in both driver and firmware.

To resolve this issue, the destination array is split again after the
destination indexed by split_count. An extra rule is added for the
leftover destinations (in the above example, it is enp8s0f0npf0sf2),
and is inserted to post_act table. And the extra destination is added
in the original rule to forward to post_act table, so the extra mirror
is done there.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Enable remove flow for hard packet limit

In the commit a2a73ea14b1a ("net/mlx5e: Don't listen to remove flows
event"), remove_flow_enable event is removed, and the hard limit
usually relies on software mechanism added in commit b2f7b01d36a9
("net/mlx5e: Simulate missing IPsec TX limits hardware
functionality"). But the delayed work is rescheduled every one second,
which is slow for fast traffic. As a result, traffic can't be blocked
even reaches the hard limit, which usually happens when soft and hard
limits are very close.

In reality it won't happen because soft limit is much lower than hard
limit. But, as an optimization for RX to block traffic when reaching
hard limit, need to set remove_flow_enable. When remove flow is
enabled, IPSEC HARD_LIFETIME ASO syndrome will be set in the metadata
defined in the ASO return register if packets reach hard lifetime
threshold. And those packets are dropped immediately by the steering
table.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: E-Switch, Increase max int port number for offload

Currently MLX5E_TC_MAX_INT_PORT_NUM is 8. Usually int port has one
ingress and one egress rules. But sometimes, a temporary rule can be
offloaded as well, eg:

recirc_id(0),in_port(br-phy),eth(src=10:70:fd:87:57:c0,dst=33:33:00:00:00:16),
eth_type(0x86dd),ipv6(frag=no), packets:2, bytes:180, used:0.060s,
actions:enp8s0f0

If one int port device offloads 3 rules, only 2 devices can offload.
Other devices will hit the limit and fail to offload. Actually it is
insufficient for customers. So increase the number to 32.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240808055927.2059700-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ag71xx: use phylink_mii_ioctl

f1294617d2f38bd2b9f6cce516b0326858b61182 removed the custom function for
ndo_eth_ioctl and used the standard phy_do_ioctl which calls
phy_mii_ioctl. However since then, this driver was ported to phylink
where it makes more sense to call phylink_mii_ioctl.

Bring back custom function that calls phylink_mii_ioctl.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240807215834.33980-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ibmvnic-ibmvnic-rr-patchset'

Nick Child says:

====================
ibmvnic: ibmvnic rr patchset

v1 - https://lore.kernel.org/netdev/20240801212340.132607-1-nnac123@linux.ibm.com/
v2 - https://lore.kernel.org/netdev/20240806193706.998148-1-nnac123@linux.ibm.com/
====================

Link: https://patch.msgid.link/20240807211809.1259563-1-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Perform tx CSO during send scrq direct

During initialization with the vnic server, a bitstring is communicated
to the client regarding header info needed during CSO (See "VNIC
Capabilities" in PAPR). Most of the time, to be safe, vnic server
requests header info for CSO. When header info is needed, multiple TX
descriptors are required per skb; This limits the driver to use
send_subcrq_indirect instead of send_subcrq_direct.

Previously, the vnic server request for header info was ignored. This
allowed the use of send_sub_crq_direct. Transmissions were successful
because the bitstring returned by vnic server is broad and over
cautionary. It was observed that mlx backing devices could actually
transmit and handle CSO packets without the vnic server receiving
header info (despite the fact that the bitstring requested it).

There was a trust issue: The bitstring was overcautionary. This extra
precaution (requesting header info when the backing device may not use
it) comes at the cost of performance (using direct vs indirect hcalls
has a 30% delta in small packet RR transaction rate). So it has been
requested that the vnic server team tries to ensure that the bitstring
is more exact. In the meantime, disable CSO when it is possible to use
the skb in the send_subcrq_direct path. In other words, calculate the
checksum before handing the packet to FW when the packet is not
segmented and xmit_more is false.

Since the code path is only possible if the skb is non GSO and xmit_more
is false, the cost of doing checksum in the send_subcrq_direct path is
minimal. Any large segmented skb will have xmit_more set to true more
frequently and it is inexpensive to do checksumming on a small skb.
The worst-case workload would be a 9000 MTU TCP_RR test with close
to MTU sized packets (and TSO off). This allows xmit_more to be false
more frequently and open the code path up to use send_subcrq_direct.
Observing trace data (graph-time = 1) and packet rate with this workload
shows minimal performance degradation:

1. NIC does checksum w headers, safely use send_subcrq_indirect:
  - Packet rate: 631k txs
  - Trace data:
    ibmvnic_xmit = 44344685.87 us / 6234576 hits = AVG 7.11 us
      skb_checksum_help = 4.07 us / 2 hits = AVG 2.04 us
       ^ Notice hits, tracing this just for reassurance
      ibmvnic_tx_scrq_flush = 33040649.69 us / 5638441 hits = AVG 5.86 us
        send_subcrq_indirect = 37438922.24 us / 6030859 hits = AVG 6.21 us

2. NIC does checksum w/o headers, dangerously use send_subcrq_direct:
  - Packet rate: 831k txs
  - Trace data:
    ibmvnic_xmit = 48940092.29 us / 8187630 hits = AVG 5.98 us
      skb_checksum_help = 2.03 us / 1 hits = AVG 2.03
      ibmvnic_tx_scrq_flush = 31141879.57 us / 7948960 hits = AVG 3.92 us
        send_subcrq_indirect = 8412506.03 us / 728781 hits = AVG 11.54
         ^ notice hits is much lower b/c send_subcrq_direct was called
                                            ^ wasn't traceable

3. driver does checksum, safely use send_subcrq_direct (THIS PATCH):
  - Packet rate: 829k txs
  - Trace data:
    ibmvnic_xmit = 56696077.63 us / 8066168 hits = AVG 7.03 us
      skb_checksum_help = 8587456.16 us / 7526072 hits = AVG 1.14 us
      ibmvnic_tx_scrq_flush = 30219545.55 us / 7782409 hits = AVG 3.88 us
        send_subcrq_indirect = 8638326.44 us / 763693 hits = AVG 11.31 us

When the bitstring ever specifies that CSO does not require headers
(dependent on VIOS vnic server changes), then this patch should be
removed and replaced with one that investigates the bitstring before
using send_subcrq_direct.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-8-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Only record tx completed bytes once per handler

Byte Queue Limits depends on dql_completed being called once per tx
completion round in order to adjust its algorithm appropriately. The
dql->limit value is an approximation of the amount of bytes that the NIC
can consume per irq interval. If this approximation is too high then the
NIC will become over-saturated. Too low and the NIC will starve.

The dql->limit depends on dql->prev-* stats to calculate an optimal
value. If dql_completed() is called more than once per irq handler then
those prev-* values become unreliable (because they are not an accurate
representation of the previous state of the NIC) resulting in a
sub-optimal limit value.

Therefore, move the call to netdev_tx_completed_queue() to the end of
ibmvnic_complete_tx().

When performing 150 sessions of TCP rr (request-response 1 byte packets)
workloads, one could observe:
  PREVIOUSLY: - limit and inflight values hovering around 130
              - transaction rate of around 750k pps.

  NOW:        - limit rises and falls in response to inflight (130-900)
              - transaction rate of around 1M pps (33% improvement)

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-7-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Introduce send sub-crq direct

Firmware supports two hcalls to send a sub-crq request:
H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows
for submission of batched messages while the other hcall is limited to
only one message. This protocol is defined in PAPR section 17.2.3.3.

Previously, the ibmvnic xmit function only used the indirect hcall. This
allowed the driver to batch it's skbs. A single skb can occupy a few
entries per hcall depending on if FW requires skb header information or
not. The FW only needs header information if the packet is segmented.

By this logic, if an skb is not GSO then it can fit in one sub-crq
message and therefore is a candidate for H_SEND_SUB_CRQ.
Batching skb transmission is only useful when there are more packets
coming down the line (ie netdev_xmit_more is true).

As it turns out, H_SEND_SUB_CRQ induces less latency than
H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where
appropriate.

Small latency gains seen when doing TCP_RR_150 (request/response
workload). Ftrace results (graph-time=1):
  Previous:
     ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
     ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
  Now:
     ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
     ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-6-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Remove duplicate memory barriers in tx

send_subcrq_[in]direct() already has a dma memory barrier.
Remove the earlier one.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-5-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Reduce memcpys in tx descriptor generation

Previously when creating the header descriptors, the driver would:
1. allocate a temporary buffer on the stack (in build_hdr_descs_arr)
2. memcpy the header info into the temporary buffer (in build_hdr_data)
3. memcpy the temp buffer into a local variable (in create_hdr_descs)
4. copy the local variable into the return buffer (in create_hdr_descs)

Since, there is no opportunity for errors during this process, the temp
buffer is not needed and work can be done on the return buffer directly.

Repurpose build_hdr_data() to only calculate the header lengths. Rename
it to get_hdr_lens().
Edit create_hdr_descs() to read from the skb directly and copy directly
into the returned useful buffer.

The process now involves less memory and write operations while
also being more readable.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-4-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Use header len helper functions on tx

Use the header length helper functions rather than trying to calculate
it within the driver. There are defined functions for mac and network
headers (skb_mac_header_len and skb_network_header_len) but no such
function exists for the transport header length.

Also, hdr_data was memset during allocation to all 0's so no need to
memset again.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-3-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ibmvnic: Only replenish rx pool when resources are getting low

Previously, the driver would replenish the rx pool if the polling
function consumed less than the budget. The logic being that the driver
did not exhaust its budget so that must mean that the driver is not busy
and has cycles to spare for replenishing the pool.

So pool replenishment happens on every poll which did not consume
the budget. This can very costly during request-response tests.

In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove
this conditional. Trace results (ftrace, graph-time=1) for the poll
function are below:
Previous results:
    ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
    replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
Now:
    ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
    replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88

While the replenish function takes longer, it is hit less frequently
meaning the ibmvnic_poll function, on average, is faster.

Furthermore, this change does not have a negative effect on
performance bandwidth/latency measurements.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://patch.msgid.link/20240807211809.1259563-2-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: fs_enet: Fix warning due to wrong type

Building fs_enet on powerpc e500 leads to following warning:

    CC      drivers/net/ethernet/freescale/fs_enet/mac-scc.o
  In file included from ./include/linux/build_bug.h:5,
                   from ./include/linux/container_of.h:5,
                   from ./include/linux/list.h:5,
                   from ./include/linux/module.h:12,
                   from drivers/net/ethernet/freescale/fs_enet/mac-scc.c:15:
  drivers/net/ethernet/freescale/fs_enet/mac-scc.c: In function 'allocate_bd':
  ./include/linux/err.h:28:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     28 | #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
        |                                                 ^
  ./include/linux/compiler.h:77:45: note: in definition of macro 'unlikely'
     77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
        |                                             ^
  drivers/net/ethernet/freescale/fs_enet/mac-scc.c:138:13: note: in expansion of macro 'IS_ERR_VALUE'
    138 |         if (IS_ERR_VALUE(fep->ring_mem_addr))
        |             ^~~~~~~~~~~~

This is due to fep->ring_mem_addr not being a pointer but a DMA
address which is 64 bits on that platform while pointers are
32 bits as this is a 32 bits platform with wider physical bus.

However, using fep->ring_mem_addr is just wrong because
cpm_muram_alloc() returns an offset within the muram and not
a physical address directly. So use fpi->dpram_offset instead.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ec67ea3a3bef7e58b8dc959f7c17d405af0d27e4.1723101144.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: cdc_ether: don't spew notifications

The usbnet_link_change function is not called, if the link has not changed.

...
[16913.807393][ 3] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
[16913.822266][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped
[16913.826296][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 11 may have been dropped
...

kevent 11 is scheduled too frequently and may affect other event schedules.

Signed-off-by: zhangxiangqian <zhangxiangqian@kylinos.cn>
Acked-by: Oliver Neukum <oneukum@suse.com>
Link: https://patch.msgid.link/1723109985-11996-1-git-send-email-zhangxiangqian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethtool: refactor checking max channels

Currently ethtool_set_channel calls separate functions to check whether
the new channel number violates rss configuration or flow steering
configuration.

Very soon we need to check whether the new channel number violates
memory provider configuration as well.

To do all 3 checks cleanly, add a wrapper around
ethtool_get_max_rxnfc_channel() and ethtool_get_max_rxfh_channel(),
which does both checks. We can later extend this wrapper to add the
memory provider check in one place.

Note that in the current code, we put a descriptive genl error message
when we run into issues. To preserve the error message, we pass the
genl_info* to the common helper. The ioctl calls can pass NULL instead.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20240808205345.2141858-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftest-rds'

Allison Henderson says:

====================
selftests: rds selftest

This series is a new selftest that Vegard, Chuck and myself have been
working on to provide some test coverage for rds.  I've modified the
scripts to include the feedback from the last version, but let me know
if there's anything missed.  Questions and comments appreciated.

Thanks everyone!

Allison

Changes in v2:
- Removed qemu vm creation and related code
- Updated README.txt with examples of running the test with virtme
- Removed init.sh. run.sh now directly calls test.py
- Some clean up done with the return code handling since there is no
  vm between the scripts anymore
- Imported ip python function in
tools/testing/selftests/net/lib/py/utils.py into test.py
- Adapted test.py to use the imported ip function, and removed the
local ip wrapper
- Some line wrap clean up
- Link to v1:
https://lore.kernel.org/netdev/20240626012834.5678-3-allison.henderson@oracle.com/T
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: rds: add testing infrastructure

This adds some basic self-testing infrastructure for RDS-TCP.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: rds: add option for GCOV profiling

To better our unit tests we need code coverage to be part of the kernel.
This patch borrows heavily from how CONFIG_GCOV_PROFILE_FTRACE is
implemented

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

.gitignore: add .gcda files

These files contain the runtime coverage data generated by gcov.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: xgmac: use const char arrays for string constants

Jiri Slaby advises me that the preferred mechanism for declaring
string constants is static char arrays, so use that here.

This mostly reverts
commit 1692b9775e74 ("net: stmmac: xgmac: use #define for string constants")

That commit was a fix for
commit 46eba193d04f ("net: stmmac: xgmac: fix handling of DPP safety error for DMA channels").
The fix being replacing const char * with #defines in order to address
compilation failures observed on GCC 6 through 10.

Compile tested only.
No functional change intended.

Suggested-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/netdev/485dbc5a-a04b-40c2-9481-955eaa5ce2e2@kernel.org/
Signed-off-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: vsc73xx: use defined values in phy operations

This commit changes magic numbers in phy operations.
Some shifted registers was replaced with bitfield macros.

No functional changes done.

Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Pawel Dembicki <paweldembicki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ti: icssg_prueth: populate netdev of_node

Allow of_find_net_device_by_node() to find icssg_prueth ports and make
the individual ports' of_nodes available in sysfs.

Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Reviewed-by: MD Danish Anwar <danishanwar@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20240807121215.3178964-1-matthias.schiffer@ew.tq-group.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sungem_phy: Constify struct mii_phy_def

'struct mii_phy_def' are not modified in this driver.

Constifying these structures moves some data to a read-only section, so
increase overall security.

While at it fix the checkpatch warning related to this patch (some missing
newlines and spaces around *)

On a x86_64, with allmodconfig:
Before:
======
  27709     928       0   28637    6fdd drivers/net/sungem_phy.o

After:
=====
   text    data     bss     dec     hex filename
  28157     476       0   28633    6fd9 drivers/net/sungem_phy.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/54c3b30930f80f4895e6fa2f4234714fdea4ef4e.1723033266.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethtool: check rxfh_max_num_contexts != 1 at register time

A value of 1 doesn't make sense, as it implies the only allowed
context ID is 0, which is reserved for the default context - in
which case the driver should just not claim to support custom
RSS contexts at all.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/c07725b3a3d0b0a63b85e230f9c77af59d4d07f8.1723045898.git.ecree.xilinx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: atlantic: use ethtool_sprintf

Allows simplifying get_strings and avoids manual pointer manipulation.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20240807190303.6143-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnx2x: Provide declaration of dmae_reg_go_c in header

Provide declaration of dmae_reg_go_c in header.
This symbol is defined in bnx2x_main.c.
And used in that file and bnx2x_stats.c.

However, Sparse complains that there is no declaration
of the symbol in dmae_reg_go_c nor is the symbol static.

.../bnx2x_main.c:291:11: warning: symbol 'dmae_reg_go_c' was not declared. Should it be static?

Address this by moving the declaration from bnx2x_stats.c to bnx2x_reg.h.

No functional change intended.
Compile tested only.

Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240806-bnx2x-dec-v1-1-ae844ec785e4@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR.

No conflicts or adjacent changes.

Link: https://patch.msgid.link/20240808170148.3629934-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth.

  Current release - regressions:

   - eth: bnxt_en: fix memory out-of-bounds in bnxt_fill_hw_rss_tbl() on
     older chips

  Current release - new code bugs:

   - ethtool: fix off-by-one error / kdoc contradicting the code for max
     RSS context IDs

   - Bluetooth: hci_qca:
      - QCA6390: fix support on non-DT platforms
      - QCA6390: don't call pwrseq_power_off() twice
      - fix a NULL-pointer derefence at shutdown

   - eth: ice: fix incorrect assigns of FEC counters

  Previous releases - regressions:

   - mptcp: fix handling endpoints with both 'signal' and 'subflow'
     flags set

   - virtio-net: fix changing ring count when vq IRQ coalescing not
     supported

   - eth: gve: fix use of netif_carrier_ok() during reconfig / reset

  Previous releases - always broken:

   - eth: idpf: fix bugs in queue re-allocation on reconfig / reset

   - ethtool: fix context creation with no parameters

  Misc:

   - linkwatch: use system_unbound_wq to ease RTNL contention"

* tag 'net-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (41 commits)
  net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.
  ethtool: Fix context creation with no parameters
  net: ethtool: fix off-by-one error in max RSS context IDs
  net: pse-pd: tps23881: include missing bitfield.h header
  net: fec: Stop PPS on driver remove
  net: bcmgenet: Properly overlay PHY and MAC Wake-on-LAN capabilities
  l2tp: fix lockdep splat
  net: stmmac: dwmac4: fix PCS duplex mode decode
  idpf: fix UAFs when destroying the queues
  idpf: fix memleak in vport interrupt configuration
  idpf: fix memory leaks and crashes while performing a soft reset
  bnxt_en : Fix memory out-of-bounds in bnxt_fill_hw_rss_tbl()
  net: dsa: bcm_sf2: Fix a possible memory leak in bcm_sf2_mdio_register()
  net/smc: add the max value of fallback reason count
  Bluetooth: hci_sync: avoid dup filtering when passive scanning with adv monitor
  Bluetooth: l2cap: always unlock channel in l2cap_conless_channel()
  Bluetooth: hci_qca: fix a NULL-pointer derefence at shutdown
  Bluetooth: hci_qca: fix QCA6390 support on non-DT platforms
  Bluetooth: hci_qca: don't call pwrseq_power_off() twice for QCA6390
  ice: Fix incorrect assigns of FEC counts
  ...

Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

- Have reading of event format files test if the metadata still exists.

   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
   is set to state that it is to prevent any new references to it from
   happening while waiting for existing references to close. When the
   last reference closes, the metadata is freed. But the "format" was
   missing a check to this flag (along with some other files) that
   allowed new references to happen, and a use-after-free bug to occur.

- Have the trace event meta data use the refcount infrastructure
   instead of relying on its own atomic counters.

- Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.

- Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
   callers expect a real object or an ERR_PTR.

- Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.

- Fix ftrace_graph_ret_addr() to use the task passed in and not
   current.

- Fix overflow bug in get_free_elt() where the counter can overflow the
   integer and cause an infinite loop.

- Remove unused function ring_buffer_nr_pages()

- Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own.

   When the kernel had randomize structure fields enabled, the rcu field
   of the tracefs_inode was overlapping the rcu field of the inode
   structure, and corrupting it. Instead, use the destroy_inode()
   callback to do the initial cleanup of the code, and then have
   free_inode() free it.

* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: Use generic inode RCU for synchronizing freeing
  ring-buffer: Remove unused function ring_buffer_nr_pages()
  tracing: Fix overflow in get_free_elt()
  function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
  eventfs: Use SRCU for freeing eventfs_inodes
  eventfs: Don't return NULL in eventfs_create_dir()
  tracefs: Fix inode allocation
  tracing: Use refcount for trace_event_file reference counter
  tracing: Have format file honor EVENT_FILE_FL_FREED

Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"Assorted little stuff:

   - lockdep fixup for lockdep_set_notrack_class()

   - we can now remove a device when using erasure coding without
     deadlocking, though we still hit other issues

   - the 'allocator stuck' timeout is now configurable, and messages are
     ratelimited. The default timeout has been increased from 10 seconds
     to 30"

* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
  bcachefs: Make allocator stuck timeout configurable, ratelimit messages
  bcachefs: Add missing path_traverse() to btree_iter_next_node()
  bcachefs: ec should not allocate from ro devs
  bcachefs: Improved allocator debugging for ec
  bcachefs: Add missing bch2_trans_begin() call
  bcachefs: Add a comment for bucket helper types
  bcachefs: Don't rely on implicit unsigned -> signed integer conversion
  lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
  bcachefs: Fix double free of ca->buckets_nouse