git.ipfire.org Git - thirdparty/linux.git/log

ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.

Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.

The remaining things to do are

  1. pass false to lwtunnel_valid_encap_type_attr()
  2. use rcu_dereference_rtnl() in fib6_check_nexthop()
  3. place rcu_read_lock() before ip6_route_info_create_nh().

Let's complete the RTNL-free conversion.

When each CPU-X adds 100000 routes on table-X in a batch
concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 62854 milliseconds.

I changed the number of routes in each table (1000 ~ 100000)
and consistently saw it finish 3x faster with this series.

Note that now every caller of lwtunnel_valid_encap_type() passes
false as the last argument, and this can be removed later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Protect nh->f6i_list with spinlock and flag.

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we may be going to add a route tied to a dying nexthop.

The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.

To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.

__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.

While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().

The next patch adds per-nexthop spinlock which protects nh->f6i_list.

When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.

Then, deadlock could happen between two nexthops.

Let's defer fib6_purge_rt() after fib6_add_rt2node().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Protect fib6_link_table() with spinlock.

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.

Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].

Let's add a spinlock to protect the hash bucket linkage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Factorise ip6_route_multipath_add().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, the RCU section will start before ip6_route_info_create_nh()
in ip6_route_multipath_add(), but ip6_route_info_create() is called
in the same loop and will sleep.

Let's split the loop into ip6_route_mpath_info_create() and
ip6_route_mpath_info_create_nh().

Note that ip6_route_info_append() is now integrated into
ip6_route_mpath_info_create_nh() because we need to call different
free functions for nexthops that passed ip6_route_info_create_nh().

In case of failure, the remaining nexthops that ip6_route_info_create_nh()
has not been called for will be freed by ip6_route_mpath_info_cleanup().

OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked
to a local temporary list, which will be spliced back to rt6_nh_list.
In case of failure, these nexthops will be released by fib6_info_release()
in ip6_route_multipath_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Rename rt6_nh.next to rt6_nh.list.

ip6_route_multipath_add() allocates struct rt6_nh for each config
of multipath routes to link them to a local list rt6_nh_list.

struct rt6_nh.next is the list node of each config, so the name
is quite misleading.

Let's rename it to list.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Don't pass net to ip6_route_info_append().

net is not used in ip6_route_info_append() after commit 36f19d5b4f99
("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().

ip6_route_info_create_nh() will be called under RCU.

It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output.

As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid
GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU.

Let's preallocate it in ip6_route_info_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

ip6_route_info_create_nh() will be called under RCU.

Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes
and we would see a bunch of this message in dmesg.

percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left

Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

If something fails before the original memory allocation in
fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(),
which releases the preallocated per-cpu memory.

Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via
ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Split ip6_route_info_create().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, we want to allocate as much as possible before entering
the RCU section.

The RCU section will start in the middle of ip6_route_info_create(),
and this is problematic for ip6_route_multipath_add() that calls
ip6_route_info_create() multiple times.

Let's split ip6_route_info_create() into two parts; one for memory
allocation and another for nexthop setup.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Move nexthop_find_by_id() after fib6_info_alloc().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we must perform two lookups for nexthop and dev under RCU
to guarantee their lifetime.

ip6_route_info_create() calls nexthop_find_by_id() first if
RTA_NH_ID is specified, and then allocates struct fib6_info.

nexthop_find_by_id() must be called under RCU, but we do not want
to use GFP_ATOMIC for memory allocation here, which will be likely
to fail in ip6_route_multipath_add().

Let's move nexthop_find_by_id() after the memory allocation so
that we can later split ip6_route_info_create() into two parts:
the sleepable part and the RCU part.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().

In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
entry.  If it returns false, the request fails.

rt6_qualify_for_ecmp() returns false if either of the conditions below
is true:

  1. f6i->fib6_flags has RTF_ADDRCONF
  2. f6i->nh is not NULL
  3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC

1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
to cfg->fc_flags.

2. is equivalent with cfg->fc_nh_id.

3. can be replaced by checking RTF_GATEWAY in the base and each multipath
entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
false.

These checks do not require RCU and can be done earlier.

Let's perform the equivalent checks in rtm_to_fib6_multipath_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().

ip6_route_info_create() is called from 3 functions:

  * ip6_route_add()
  * ip6_route_multipath_add()
  * addrconf_f6i_alloc()

addrconf_f6i_alloc() does not need validation for struct fib6_config in
ip6_route_info_create().

ip6_route_multipath_add() calls ip6_route_info_create() for multiple
routes with slightly different fib6_config instances, which is copied
from the base config passed from userspace.  So, we need not validate
the same config repeatedly.

Let's move such validation into rtm_to_fib6_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.

Basically, removing an IPv6 route does not require RTNL because
the IPv6 routing tables are protected by per table lock.

inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
the top-down walk can be safely performed under RCU.

ip6_route_del() already relies on RCU and the table lock, but we
need to extend the RCU critical section a bit more to cover
__ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
inet6_rt_notify() needs RCU.

Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and
get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT.

Even if the nexthop is removed after rcu_read_unlock() in
inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes
tied to the nexthop, and ip6_route_del() returns -ESRCH.  So the
request was at least valid as of nexthop_find_by_id(), and it's just
a matter of timing.

Note that we need to pass false to lwtunnel_valid_encap_type_attr().
The following patches also use the newroute bool.

Note also that fib6_get_table() does not require RCU because once
allocated fib6_table is not freed until netns dismantle.  I will
post a follow-up series to convert such callers to RCU-lockless
version.  [0]

Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().

We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
we want to perform some validation out of the RCU scope.

When creating / removing an IPv6 route with RTA_MULTIPATH,
inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
in each multipath entry.

Let's do that in rtm_to_fib6_config().

Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
0 entries, which was accepted but should result in -EINVAL as
RTM_NEWROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-mlx5-hws-improve-ip-version-handling'

Mark Bloch says:

====================
net/mlx5: HWS, Improve IP version handling

This small series hardens our checks against a single matcher containing
rules that match on IPv4 and IPv6. This scenario is not supported by
hardware steering and the implementation now signals this instead of
failing silently.

Patches:
* Patch 1 forbids a single definer to match on mixed IP versions for
  source and destination address.
* Patch 2 reproduces a couple of firmware checks: it forbids creating
  a definer that matches on IP address without matching on IP version,
  and also disallows matching on IPv6 addresses and the IPv4 IHL fields
  in the same definer.
* Patch 3 forbids mixing rules that match on IPv4 and IPv6 addresses in
  the same matcher. The underlying definer mechanism does not support
  that.
====================

Link: https://patch.msgid.link/20250422092540.182091-1-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, Disallow matcher IP version mixing

Signal clearly to the user, via an error, that mixing IPv4 and IPv6
rules in the same matcher is not supported. Previously such cases
silently failed by adding a rule that did not work correctly.

Rules can specify an IP version by one of two fields: IP version or
ethertype. At matcher creation, store whether the template matches on
any of these two fields. If yes, inspect each rule for its corresponding
match value and store the IP version inside the matcher to guard against
inconsistencies with subsequent rules.

Furthermore, also check rules for internal consistency, i.e. verify that
the ethertype and IP version match values do not contradict each other.

The logic applies to inner and outer headers independently, to account
for tunneling.

Rules that do not match on IP addresses are not affected.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, Harden IP version definer checks

Replicate some sanity checks that firmware does, since hardware steering
does not go through firmware.

When creating a definer, disallow matching on IP addresses without also
matching on IP version. The latter can be satisfied by matching either
on the version field in the IP header, or on the ethertype field.

Also refuse to match IPv4 IHL alongside IPv6.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: HWS, Fix IP version decision

Unify the check for IP version when creating a definer. A given matcher
is deemed to match on IPv6 if any of the higher order (>31) bits of
source or destination address mask are set.

A single packet cannot mix IP versions between source and destination
addresses, so it makes no sense that they would be decided on
independently.

Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: AF_XDP: code clean up

The current API, otx2_xdp_sq_append_pkt, verifies the number of available
descriptors before sending packets to the hardware.

However, for AF_XDP, this check is unnecessary because the batch value
is already determined based on the free descriptors.

This patch introduces a new API, "otx2_xsk_sq_append_pkt" to address this.

Remove the logic for releasing the TX buffers, as it is implicitly handled
by xsk_tx_peek_release_desc_batch

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://patch.msgid.link/20250420032350.4047706-1-hkelam@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
igc: Add support for Frame Preemption

Faizal Rahim says:

Introduce support for the FPE feature in the IGC driver.

The patches aligns with the upstream FPE API:
https://patchwork.kernel.org/project/netdevbpf/cover/20230220122343.1156614-1-vladimir.oltean@nxp.com/
https://patchwork.kernel.org/project/netdevbpf/cover/20230119122705.73054-1-vladimir.oltean@nxp.com/

It builds upon earlier work:
https://patchwork.kernel.org/project/netdevbpf/cover/20220520011538.1098888-1-vinicius.gomes@intel.com/

The patch series adds the following functionalities to the IGC driver:
a) Configure FPE using `ethtool --set-mm`.
b) Display FPE settings via `ethtool --show-mm`.
c) View FPE statistics using `ethtool --include-statistics --show-mm'.
e) Block setting preemptible tc in taprio since it is not supported yet.
   Existing code already blocks it in mqprio.

Tested:
Enabled CONFIG_PROVE_LOCKING, CONFIG_DEBUG_ATOMIC_SLEEP, CONFIG_DMA_API_DEBUG, and CONFIG_KASAN
1) selftests
2) netdev down/up cycles
3) suspend/resume cycles
4) fpe verification

No bugs or unusual dmesg logs were observed.
Ran 1), 2) and 3) with and without the patch series, compared dmesg and selftest logs - no differences found.

* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  igc: add support to get frame preemption statistics via ethtool
  igc: add support to get MAC Merge data via ethtool
  igc: block setting preemptible traffic class in taprio
  igc: add support to set tx-min-frag-size
  igc: add support for frame preemption verification
  igc: set the RX packet buffer size for TSN mode
  igc: use FIELD_PREP and GENMASK for existing RX packet buffer size
  igc: optimize TX packet buffer utilization for TSN mode
  igc: use FIELD_PREP and GENMASK for existing TX packet buffer size
  igc: rename I225_RXPBSIZE_DEFAULT and I225_TXPBSIZE_DEFAULT
  igc: rename xdp_get_tx_ring() for non-xdp usage
  net: ethtool: mm: reset verification status when link is down
  net: ethtool: mm: extract stmmac verification logic into common library
  net: stmmac: move frag_size handling out of spin_lock
====================

Link: https://patch.msgid.link/20250418163822.3519810-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'enable-multiple-irq-lines-support-in-airoha_eth-driver'

Lorenzo Bianconi says:

====================
Enable multiple IRQ lines support in airoha_eth driver

EN7581 ethernet SoC supports 4 programmable IRQ lines each one composed
by 4 IRQ configuration registers to map Tx/Rx queues. Enable multiple
IRQ lines support.
====================

Link: https://patch.msgid.link/20250418-airoha-eth-multi-irq-v1-0-1ab0083ca3c1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Enable multiple IRQ lines support in airoha_eth driver.

EN7581 ethernet SoC supports 4 programmable IRQ lines for Tx and Rx
interrupts. Enable multiple IRQ lines support. Map Rx/Tx queues to the
available IRQ lines using the default scheme used in the vendor SDK:

- IRQ0: rx queues [0-4],[7-9],15
- IRQ1: rx queues [21-30]
- IRQ2: rx queues 5
- IRQ3: rx queues 6

Tx queues interrupts are managed by IRQ0.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250418-airoha-eth-multi-irq-v1-2-1ab0083ca3c1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Introduce airoha_irq_bank struct

EN7581 ethernet SoC supports 4 programmable IRQ lines each one composed
by 4 IRQ configuration registers. Add airoha_irq_bank struct as a
container for independent IRQ lines info (e.g. IRQ number, enabled source
interrupts, ecc). This is a preliminary patch to support multiple IRQ lines
in airoha_eth driver.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20250418-airoha-eth-multi-irq-v1-1-1ab0083ca3c1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'r8169-merge-chip-versions'

Heiner Kallweit says:

====================
r8169: merge chip versions

After 2b065c098c37 ("r8169: refactor chip version detection") we can
merge handling of few chip versions.
====================

Link: https://patch.msgid.link/5e1e14ea-d60f-4608-88eb-3104b6bbace8@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: merge chip versions 52 and 53 (RTL8117)

Handling of both chip versions is the same, only difference is
the firmware. So we can merge handling of both chip versions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ae866b71-c904-434e-befb-848c831e33ff@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: merge chip versions 64 and 65 (RTL8125D)

Handling of both chip versions is the same, only difference is
the firmware. So we can merge handling of both chip versions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/0baad123-c679-4154-923f-fdc12783e900@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: merge chip versions 70 and 71 (RTL8126A)

Handling of both chip versions is the same, only difference is
the firmware. So we can merge handling of both chip versions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/97d7ae79-d021-4b6b-b424-89e5e305b029@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: remove function stubs

All callers of these functions depend on PHYLIB or select it directly
or indirectly by selecting PHYLINK. Stubs make sense for optional
functionality, but that's not the case here.

MDIO_XGENE usually is selected by NET_XGENE which also selects PHYLIB.
Add a dependency to PHYLIB nevertheless, in order not to break
randconfig builds.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/f7a69a1f-60e9-4ac0-8b7c-481e0cc850e7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Fix wild-memory-access in __register_pernet_operations() when CONFIG_NET_NS=n.

kernel test robot reported the splat below. [0]

Before commit fed176bf3143 ("net: Add ops_undo_single for module
load/unload."), if CONFIG_NET_NS=n, ops was linked to pernet_list
only when init_net had not been initialised, and ops was unlinked
from pernet_list only under the same condition.

Let's say an ops is loaded before the init_net setup but unloaded
after that.  Then, the ops remains in pernet_list, which seems odd.

The cited commit added ops_undo_single(), which calls list_add() for
ops to link it to a temporary list, so a minor change was added to
__register_pernet_operations() and __unregister_pernet_operations()
under CONFIG_NET_NS=n to avoid the pernet_list corruption.

However, the corruption must have been left as is.

When CONFIG_NET_NS=n, pernet_list was used to keep ops registered
before the init_net setup, and after that, pernet_list was not used
at all.

This was because some ops annotated with __net_initdata are cleared
out of memory at some point during boot.

Then, such ops is initialised by POISON_FREE_INITMEM (0xcc), resulting
in that ops->list.{next,prev} suddenly switches from a valid pointer
to a weird value, 0xcccccccccccccccc.

To avoid such wild memory access, let's allow the pernet_list
corruption for CONFIG_NET_NS=n.

[0]:
Oops: general protection fault, probably for non-canonical address 0xf999959999999999: 0000 [#1] SMP KASAN NOPTI
KASAN: maybe wild-memory-access in range [0xccccccccccccccc8-0xcccccccccccccccf]
CPU: 2 UID: 0 PID: 346 Comm: modprobe Not tainted 6.15.0-rc1-00294-ga4cba7e98e35 #85 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__list_add_valid_or_report (lib/list_debug.c:32)
Code: 48 c1 ea 03 80 3c 02 00 0f 85 5a 01 00 00 49 39 74 24 08 0f 85 83 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 1f 01 00 00 4c 39 26 0f 85 ab 00 00 00 4c 39 ee
RSP: 0018:ff11000135b87830 EFLAGS: 00010a07
RAX: dffffc0000000000 RBX: ffffffffc02223c0 RCX: ffffffff8406fcc2
RDX: 1999999999999999 RSI: cccccccccccccccc RDI: ffffffffc02223c0
RBP: ffffffff86064e40 R08: 0000000000000001 R09: fffffbfff0a9f5b5
R10: ffffffff854fadaf R11: 676552203a54454e R12: ffffffff86064e40
R13: ffffffffc02223c0 R14: ffffffff86064e48 R15: 0000000000000021
FS:  00007f6fb0d9e1c0(0000) GS:ff11000858ea0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6fb0eda580 CR3: 0000000122fec005 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
register_pernet_operations (./include/linux/list.h:150 (discriminator 5) ./include/linux/list.h:183 (discriminator 5) net/core/net_namespace.c:1315 (discriminator 5) net/core/net_namespace.c:1359 (discriminator 5))
register_pernet_subsys (net/core/net_namespace.c:1401)
inet6_init (net/ipv6/af_inet6.c:535) ipv6
do_one_initcall (init/main.c:1257)
do_init_module (kernel/module/main.c:2942)
load_module (kernel/module/main.c:3409)
init_module_from_file (kernel/module/main.c:3599)
idempotent_init_module (kernel/module/main.c:3611)
__x64_sys_finit_module (./include/linux/file.h:62 ./include/linux/file.h:83 kernel/module/main.c:3634 kernel/module/main.c:3621 kernel/module/main.c:3621)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f6fb0df7e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007fffdc6a8968 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055b535721b70 RCX: 00007f6fb0df7e5d
RDX: 0000000000000000 RSI: 000055b51e44aa2a RDI: 0000000000000004
RBP: 0000000000040000 R08: 0000000000000000 R09: 000055b535721b30
R10: 0000000000000004 R11: 0000000000000246 R12: 000055b51e44aa2a
R13: 000055b535721bf0 R14: 000055b5357220b0 R15: 0000000000000000
</TASK>
Modules linked in: ipv6(+) crc_ccitt

Fixes: fed176bf3143 ("net: Add ops_undo_single for module load/unload.")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504181511.1c3f23e4-lkp@intel.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418215025.87871-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'netlink-specs-rtnetlink-adjust-specs-for-c-codegen'

Jakub Kicinski says:

====================
netlink: specs: rtnetlink: adjust specs for C codegen

The first patch brings a schema extension allowing specifying
"header" (as in .h file) properties in attribute sets.
This is used for rare cases where we carry attributes from
another family in a nest - we need to include the extra
headers. If we were to generate kernel code we'd also
need to skip it in the uAPI output.

The remaining 11 patches are pretty boring schema adjustments.
====================

Link: https://patch.msgid.link/20250418021706.1967583-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-rule: add C naming info

Add properties needed for C codegen to match names with uAPI headers.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-13-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rtnetlink: correct notify properties

The notify property should point at the object the notifications
carry, usually the get object, not the cmd which triggers
the notification:

  notify:
    description: Name of the command sharing the reply type with
                 this notification.

Not treating this as a fix, I think that only C codegen cares.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-12-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-neigh: make sure getneigh is consistent

The consistency check complains replies to do and dump don't match
because dump has no value. It doesn't have to by the schema... but
fixing this in code gen would be more code than adjusting the spec.
This is rare.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-neigh: add C naming info

Add properties needed for C codegen to match names with uAPI headers.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: add notification for newlink

Add a notification entry for netlink so that we can test ntf handling
in classic netlink and C.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: make bond's ipv6 address attribute fixed size

ns-ip6-target is an indexed-array. Codegen for variable size binary
array would be a bit tedious, tell C that we know the size of these
attributes, since they are IPv6 addrs.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: adjust AF_ nest for C codegen

The AF nest is indexed by AF ID, so it's a bit strange,
but with minor adjustments C codegen deals with it just fine.
Entirely unclear why the names have been in quotes here.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: add C naming info

Add properties needed for C codegen to match names with uAPI headers.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: remove duplicated group in attr list

group is listed twice for newlink.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: remove if-netnsid from attr list

if-netnsid an alias to target-netnsid:

IFLA_TARGET_NETNSID = IFLA_IF_NETNSID, /* new alias */

We don't have a definition for this attr.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: rt-link: remove the fixed members from attrs

The purpose of the attribute list is to list the attributes
which will be included in a given message to shrink the objects
for families with huge attr spaces. Fixed headers are always
present in their entirety (between netlink header and the attrs)
so there's no point in listing their members. Current C codegen
doesn't expect them and tries to look them up in the attribute space.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netlink: specs: allow header properties for attribute sets

rt-link has a number of disjoint headers, plus it uses attributes
of other families (e.g. DPLL). Allow declaring a attribute set
as "foreign" by specifying which header its definition is coming
from.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250418021706.1967583-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwc-qos: calibrate tegra with mdio bus idle

Thierry states that there are prerequists for Tegra's calibration
that should be met before starting calibration - both the RGMII and
MDIO interfaces should be idle.

This commit adds the necessary MII bus locking to ensure that the MDIO
interface is idle during calibration.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Thierry Reding <treding@nvidia.com>
Link: https://patch.msgid.link/E1u7EYR-001ZAS-Cr@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: hide CONFIG_DETECT_HUNG_TASK specific code

The CONFIG_DEFAULT_HUNG_TASK_TIMEOUT setting is only available when the
hung task detection is enabled, otherwise the code now produces a build
failure:

drivers/net/ethernet/broadcom/bnxt/bnxt.c:10188:21: error: use of undeclared identifier 'CONFIG_DEFAULT_HUNG_TASK_TIMEOUT'
10188 | max_tmo_secs > CONFIG_DEFAULT_HUNG_TASK_TIMEOUT) {

Enclose this warning logic in an #ifdef to ensure this builds.

Fixes: 0fcad44a86bd ("bnxt_en: Change FW message timeout warning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250423162827.2189658-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bridge-mc-per-vlan-qquery'

Yong Wang says:

====================
bridge: multicast: per vlan query improvement when port or vlan state changes

The current implementation of br_multicast_enable_port() only operates on
port's multicast context, which doesn't take into account in case of vlan
snooping, one downside is the port's igmp query timer will NOT resume when
port state gets changed from BR_STATE_BLOCKING to BR_STATE_FORWARDING etc.

Such code flow will briefly look like:
1.vlan snooping
  --> br_multicast_port_query_expired with per vlan port_mcast_ctx
  --> port in BR_STATE_BLOCKING state --> then one-shot timer discontinued

The port state could be changed by STP daemon or kernel STP, taking mstpd
as example:

2.mstpd --> netlink_sendmsg --> br_setlink --> br_set_port_state with non
  blocking states, i.e. BR_STATE_LEARNING or BR_STATE_FORWARDING
  --> br_port_state_selection --> br_multicast_enable_port
  --> enable multicast with port's multicast_ctx

Here for per vlan snooping, the vlan context of the port should be used
instead of port's multicast_ctx. The first patch corrects such behavior.

Similarly, vlan state change also impacts multicast behavior, the 2nd patch
adds function to update the corresponding multicast context when vlan state
changes.

The 3rd patch adds the selftests to confirm that IGMP/MLD query does happen
when the STP state becomes forwarding.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net/bridge : add tests for per vlan snooping with stp state changes

Change ALL_TESTS definition to "test-per-line".

Add the test case of per vlan snooping with port stp state change to
forwarding and also vlan equivalent case in both bridge_igmp.sh and
bridge_mld.sh.

Signed-off-by: Yong Wang <yongwang@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: mcast: update multicast contex when vlan state is changed

When the vlan STP state is changed, which could be manipulated by
"bridge vlan" commands, similar to port STP state, this also impacts
multicast behaviors such as igmp query. In the scenario of per-VLAN
snooping, there's a need to update the corresponding multicast context
to re-arm the port query timer when vlan state becomes "forwarding" etc.

Update br_vlan_set_state() function to enable vlan multicast context
in such scenario.

Before the patch, the IGMP query does not happen in the last step of the
following test sequence, i.e. no growth for tx counter:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
# bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
# ip link add name swp1 up master br1 type dummy
# sleep 1
# bridge vlan set vid 1 dev swp1 state 4
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# sleep 1
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# bridge vlan set vid 1 dev swp1 state 3
# sleep 2
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1

After the patch, the IGMP query happens in the last step of the test:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
# bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
# ip link add name swp1 up master br1 type dummy
# sleep 1
# bridge vlan set vid 1 dev swp1 state 4
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# sleep 1
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# bridge vlan set vid 1 dev swp1 state 3
# sleep 2
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
3

Signed-off-by: Yong Wang <yongwang@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: mcast: re-implement br_multicast_{enable, disable}_port functions

When a bridge port STP state is changed from BLOCKING/DISABLED to
FORWARDING, the port's igmp query timer will NOT re-arm itself if the
bridge has been configured as per-VLAN multicast snooping.

Solve this by choosing the correct multicast context(s) to enable/disable
port multicast based on whether per-VLAN multicast snooping is enabled or
not, i.e. using per-{port, VLAN} context in case of per-VLAN multicast
snooping by re-implementing br_multicast_enable_port() and
br_multicast_disable_port() functions.

Before the patch, the IGMP query does not happen in the last step of the
following test sequence, i.e. no growth for tx counter:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
# bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
# ip link add name swp1 up master br1 type dummy
# bridge link set dev swp1 state 0
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# sleep 1
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# bridge link set dev swp1 state 3
# sleep 2
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1

After the patch, the IGMP query happens in the last step of the test:
# ip link add name br1 up type bridge vlan_filtering 1 mcast_snooping 1 mcast_vlan_snooping 1 mcast_querier 1 mcast_stats_enabled 1
# bridge vlan global set vid 1 dev br1 mcast_snooping 1 mcast_querier 1 mcast_query_interval 100 mcast_startup_query_count 0
# ip link add name swp1 up master br1 type dummy
# bridge link set dev swp1 state 0
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# sleep 1
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
1
# bridge link set dev swp1 state 3
# sleep 2
# ip -j -p stats show dev swp1 group xstats_slave subgroup bridge suite mcast | jq '.[]["multicast"]["igmp_queries"]["tx_v2"]'
3

Signed-off-by: Yong Wang <yongwang@nvidia.com>
Reviewed-by: Andy Roulin <aroulin@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: create locked/unlocked instances of xdp redirect target setters

Commit 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with
netdev->lock") introduces the netdev lock to xdp_set_features_flag().
The change includes a _locked version of the method, as it is possible
for a driver to have already acquired the netdev lock before calling
this helper. However, the same applies to
xdp_features_(set|clear)_redirect_flags(), which ends up calling the
unlocked version of xdp_set_features_flags() leading to deadlocks in
GVE, which grabs the netdev lock as part of its suspend, reset, and
shutdown processes:

[  833.265543] WARNING: possible recursive locking detected
[  833.270949] 6.15.0-rc1 #6 Tainted: G            E
[  833.276271] --------------------------------------------
[  833.281681] systemd-shutdow/1 is trying to acquire lock:
[  833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90
[  833.295470]
[  833.295470] but task is already holding lock:
[  833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]
[  833.309508]
[  833.309508] other info that might help us debug this:
[  833.316130]  Possible unsafe locking scenario:
[  833.316130]
[  833.322142]        CPU0
[  833.324681]        ----
[  833.327220]   lock(&dev->lock);
[  833.330455]   lock(&dev->lock);
[  833.333689]
[  833.333689]  *** DEADLOCK ***
[  833.333689]
[  833.339701]  May be due to missing lock nesting notation
[  833.339701]
[  833.346582] 5 locks held by systemd-shutdow/1:
[  833.351205]  #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210
[  833.360695]  #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0
[  833.369144]  #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0
[  833.377603]  #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve]
[  833.386138]  #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve]

Introduce xdp_features_(set|clear)_redirect_target_locked() versions
which assume that the netdev lock has already been acquired before
setting the XDP feature flag and update GVE to use the locked version.

Fixes: 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock")
Tested-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Signed-off-by: Joshua Washington <joshwash@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'implement-udp-tunnel-port-for-txgbe'

Jiawen Wu says:

====================
Implement udp tunnel port for txgbe

v3: https://lore.kernel.org/20250417080328.426554-1-jiawenwu@trustnetic.com
v2: https://lore.kernel.org/20250414091022.383328-1-jiawenwu@trustnetic.com
v1: https://lore.kernel.org/20250410074456.321847-1-jiawenwu@trustnetic.com
====================

Link: https://patch.msgid.link/20250421022956.508018-1-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: wangxun: restrict feature flags for tunnel packets

Implement ndo_features_check to restrict Tx checksum offload flags, since
there are some inner layer length and protocols unsupported.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/20250421022956.508018-3-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: txgbe: Support to set UDP tunnel port

Tunnel types VXLAN/VXLAN_GPE/GENEVE are supported for txgbe devices. The
hardware supports to set only one port for each tunnel type.

Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Link: https://patch.msgid.link/20250421022956.508018-2-jiawenwu@trustnetic.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-followup-series-for-exit_rtnl'

Kuniyuki Iwashima says:

====================
net: Followup series for ->exit_rtnl().

Patch 1 drops the hold_rtnl arg in ops_undo_list() as suggested by Jakub.
Patch 2 & 3 apply ->exit_rtnl() to pfcp and ppp.

v1: https://lore.kernel.org/20250415022258.11491-1-kuniyu@amazon.com
====================

Link: https://patch.msgid.link/20250418003259.48017-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ppp: Split ppp_exit_net() to ->exit_rtnl().

ppp_exit_net() unregisters devices related to the netns under
RTNL and destroys lists and IDR.

Let's use ->exit_rtnl() for the device unregistration part to
save RTNL dances for each netns.

Note that we delegate the for_each_netdev_safe() part to
default_device_exit_batch() and replace unregister_netdevice_queue()
with ppp_nl_dellink() to align with bond, geneve, gtp, and pfcp.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418003259.48017-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pfcp: Convert pfcp_net_exit() to ->exit_rtnl().

pfcp_net_exit() holds RTNL and cleans up all devices in the netns
and other devices tied to sockets in the netns.

We can use ->exit_rtnl() to save RTNL dance for all dying netns.

Note that we delegate the for_each_netdev() part to
default_device_exit_batch() to avoid a list corruption splat
like the one reported in commit 4ccacf86491d ("gtp: Suppress
list corruption splat in gtp_net_exit_batch_rtnl().")

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418003259.48017-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Drop hold_rtnl arg from ops_undo_list().

ops_undo_list() first iterates over ops_list for ->pre_exit().

Let's check if any of the ops has ->exit_rtnl() there and drop
the hold_rtnl argument.

Note that nexthop uses ->exit_rtnl() and is built-in, so hold_rtnl
is always true for setup_net() and cleanup_net() for now.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250414170148.21f3523c@kernel.org/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418003259.48017-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: visconti: convert to set_clk_tx_rate() method

Convert visconti to use the set_clk_tx_rate() method. By doing so,
the GMAC control register will already have been updated (unlike with
the fix_mac_speed() method) so this code can be removed while porting
to the set_clk_tx_rate() method.

There is also no need for the spinlock, and has never been - neither
fix_mac_speed() nor set_clk_tx_rate() can be called by more than one
thread at a time, so the lock does nothing useful.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1u5SiQ-001I0B-OQ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: rzn1_a5psw: Make the read-only array offsets static const

Don't populate the read-only array offsets on the stack at run time,
instead make it static const.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://patch.msgid.link/20250417161353.490219-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-gbeth-glue-layer-driver-for-renesas-rz-v2h-p-soc'

Lad Prabhakar says:

====================
Add GBETH glue layer driver for Renesas RZ/V2H(P) SoC

This patch series adds support for the GBETH (Gigabit Ethernet) glue layer
driver for the Renesas RZ/V2H(P) SoC. The GBETH IP is integrated with
the Synopsys DesignWare MAC (version 5.20). The changes include updating
the device tree bindings, documenting the GBETH bindings, and adding the
DWMAC glue layer for the Renesas GBETH.

v1: https://lore.kernel.org/20250302181808.728734-1-prabhakar.mahadev-lad.rj@bp.renesas.com
====================

Link: https://patch.msgid.link/20250417084015.74154-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: Add entry for Renesas RZ/V2H(P) DWMAC GBETH glue layer driver

Add a new MAINTAINERS entry for the Renesas RZ/V2H(P) DWMAC GBETH
glue layer driver.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250417084015.74154-5-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: Add DWMAC glue layer for Renesas GBETH

Add the DWMAC glue layer for the GBETH IP found in the Renesas RZ/V2H(P)
SoC.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Link: https://patch.msgid.link/20250417084015.74154-4-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: Document support for Renesas RZ/V2H(P) GBETH

GBETH IP on the Renesas RZ/V2H(P) SoC is integrated with Synopsys
DesignWare MAC (version 5.20). Document the device tree bindings for
the GBETH glue layer.

Generic compatible string 'renesas,rzv2h-gbeth' is added since this
module is identical on both the RZ/V2H(P) and RZ/G3E SoCs.

The Rx/Tx clocks supplied for GBETH on the RZ/V2H(P) SoC is depicted
below:

                      Rx / Tx
-------+------------- on / off -------
       |
       |            Rx-180 / Tx-180
       +---- not ---- on / off -------

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250417084015.74154-3-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dwmac: Increase 'maxItems' for 'interrupts' and 'interrupt-names'

Increase the `maxItems` value for the `interrupts` and `interrupt-names`
properties to 11 to support additional per-channel Tx/Rx completion
interrupts on the Renesas RZ/V2H(P) SoC, which features the
`snps,dwmac-5.20` IP.

Refactor the `interrupt-names` property by replacing repeated `enum`
entries with a `oneOf` list. Add support for per-channel receive and
transmit completion interrupts using regex patterns.

Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250417084015.74154-2-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ptp: Do not enable by default during compile testing

Enabling the compile test should not cause automatic enabling of all
drivers, but only allow to choose to compile them.

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250417074643.81448-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

emulex/benet: Annotate flash_cookie as nonstring

GCC 15's new -Wunterminated-string-initialization notices that the 32
character "flash_cookie" (which is not used as a C-String)
needs to be marked as "nonstring":

drivers/net/ethernet/emulex/benet/be_cmds.c:2618:51: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization]
2618 | static char flash_cookie[2][16] = {"*** SE FLAS", "H DIRECTORY *** "};
| ^~~~~~~~~~~~~~~~~~

Add this annotation, avoid using a multidimensional array, but keep the
string split (with a comment about why). Additionally mark it const
and annotate the "cookie" member that is being memcmp()ed against as
nonstring too.

Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250416221028.work.967-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-phy-dp83822-add-support-for-changing-the-mac-series-termination'

Dimitri Fedrau says:

====================
net: phy: dp83822: Add support for changing the MAC series termination

The dp83822 provides the possibility to set the resistance value of the
the MAC series termination. Modifying the resistance to an appropriate
value can reduce signal reflections and therefore improve signal quality.

v2: https://lore.kernel.org/20250408-dp83822-mac-impedance-v2-0-fefeba4a9804@liebherr.com
v1: https://lore.kernel.org/20250307-dp83822-mac-impedance-v1-0-bdd85a759b45@liebherr.com
====================

Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-0-028ac426cddb@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: dp83822: Add support for changing the MAC termination

The dp83822 provides the possibility to set the resistance value of the
the MAC termination. Modifying the resistance to an appropriate value can
reduce signal reflections and therefore improve signal quality.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-4-028ac426cddb@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: Add helper for getting MAC termination resistance

Add helper which returns the MAC termination resistance value. Modifying
the resistance to an appropriate value can reduce signal reflections and
therefore improve signal quality.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-3-028ac426cddb@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dp83822: add constraints for mac-termination-ohms

Property mac-termination-ohms is defined in ethernet-phy.yaml. Add allowed
values for the property.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-2-028ac426cddb@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: ethernet-phy: add property mac-termination-ohms

Add property mac-termination-ohms in the device tree bindings for selecting
the resistance value of the builtin series termination resistors of the
PHY. Changing the resistance to an appropriate value can reduce signal
reflections and therefore improve signal quality.

Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-1-028ac426cddb@liebherr.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: use pci_prepare_to_sleep in rtl_shutdown

Use pci_prepare_to_sleep() like PCI core does in pci_pm_suspend_noirq.
This aligns setting a low-power mode during shutdown with the handling
of the transition to system suspend. Also the transition to runtime
suspend uses pci_target_state() instead of setting D3hot unconditionally.

Note: pci_prepare_to_sleep() uses device_may_wakeup() to check whether
device may generate wakeup events. So we don't lose anything by
not passing tp->saved_wolopts any longer.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/f573fdbd-ba6d-41c1-b68f-311d3c88db2c@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Remove unused rvu_npc_enable_bcast_entry

The last use of rvu_npc_enable_bcast_entry() was removed in 2021 by
commit 967db3529eca ("octeontx2-af: add support for multicast/promisc
packet replication feature")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250420225810.171852-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: 802: Remove unused p8022 code

p8022.c defines two external functions, register_8022_client()
and unregister_8022_client(), the last use of which was removed in
2018 by
commit 7a2e838d28cf ("staging: ipx: delete it from the tree")

Remove the p8022.c file, it's corresponding header, and glue
surrounding it. There was one place the header was included in vlan.c
but it didn't use the functions it declared.

There was a comment in net/802/Makefile about checking
against net/core/Makefile, but that's at least 20 years old and
there's no sign of net/core/Makefile mentioning it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250418011519.145320-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtase: Add ndo_setup_tc support for CBS offload in traffic control setup

Add support for ndo_setup_tc to enable CBS offload functionality as
part of traffic control configuration for network devices, where CBS
is applied from the CPU to the switch. More specifically, CBS is
applied at the GMAC in the topmost architecture diagram.

Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250416115757.28156-1-justinlai0215@realtek.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rxrpc: rxgk: Set error code in rxgk_yfs_decode_ticket()

Propagate the error code if key_alloc() fails. Don't return
success.

Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/Z_-P_1iLDWksH1ik@stanley.mountain
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'ionic-support-qsfp-cmis'

Shannon Nelson says:

====================
ionic: support QSFP CMIS

This patchset sets up support for additional pages and better
handling of the QSFP CMIS data.

v1: https://lore.kernel.org/netdev/20250411182140.63158-1-shannon.nelson@amd.com/
====================

Link: https://patch.msgid.link/20250415231317.40616-1-shannon.nelson@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ionic: add module eeprom channel data to ionic_if and ethtool

Make the CMIS module type's page 17 channel data available for
ethtool to request. As done previously, carve space for this
data from the port_info reserved space.

In the future, if additional pages are needed, a new firmware
AdminQ command will be added for accessing random pages.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415231317.40616-4-shannon.nelson@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ionic: support ethtool get_module_eeprom_by_page

Add support for the newer get_module_eeprom_by_page interface.
Only the upper half of the 256 byte page is available for
reading, and the firmware puts the two sections into the
extended sprom buffer, so a union is used over the extended
sprom buffer to make clear which page is to be accessed.

With get_module_eeprom_by_page implemented there is no need
for the older get_module_info or git_module_eeprom interfaces,
so remove them.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415231317.40616-3-shannon.nelson@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ionic: extend the QSFP module sprom for more pages

Some QSFP modules have more eeprom to be read by ethtool than
the initial high and low page 0 that is currently available
in the DSC's ionic sprom[] buffer. Since the current sprom[]
is baked into the middle of an existing API struct, to make
the high end of page 1 and page 2 available a block is carved
from a reserved space of the existing port_info struct and the
ionic_get_module_eeprom() service is taught how to get there.

Newer firmware writes the additional QSFP page info here,
yet this remains backward compatible because older firmware
sets this space to all 0 and older ionic drivers do not use
the reserved space.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250415231317.40616-2-shannon.nelson@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'vxlan-convert-fdb-table-to-rhashtable'

Ido Schimmel says:

====================
vxlan: Convert FDB table to rhashtable

The VXLAN driver currently stores FDB entries in a hash table with a
fixed number of buckets (256), resulting in reduced performance as the
number of entries grows. This patchset solves the issue by converting
the driver to use rhashtable which maintains a more or less constant
performance regardless of the number of entries.

Measured transmitted packets per second using a single pktgen thread
with varying number of entries when the transmitted packet always hits
the default entry (worst case):

Number of entries | Improvement
------------------|------------
1k                | +1.12%
4k                | +9.22%
16k               | +55%
64k               | +585%
256k              | +2460%

The first patches are preparations for the conversion in the last patch.
Specifically, the series is structured as follows:

Patch #1 adds RCU read-side critical sections in the Tx path when
accessing FDB entries. Targeting at net-next as I am not aware of any
issues due to this omission despite the code being structured that way
for a long time. Without it, traces will be generated when converting
FDB lookup to rhashtable_lookup().

Patch #2-#5 simplify the creation of the default FDB entry (all-zeroes).
Current code assumes that insertion into the hash table cannot fail,
which will no longer be true with rhashtable.

Patches #6-#10 add FDB entries to a linked list for entry traversal
instead of traversing over them using the fixed size hash table which is
removed in the last patch.

Patches #11-#12 add wrappers for FDB lookup that make it clear when each
should be used along with lockdep annotations. Needed as a preparation
for rhashtable_lookup() that must be called from an RCU read-side
critical section.

Patch #13 treats dst cache initialization errors as non-fatal. See more
info in the commit message. The current code happens to work because
insertion into the fixed size hash table is slow enough for the per-CPU
allocator to be able to create new chunks of per-CPU memory.

Patch #14 adds an FDB key structure that includes the MAC address and
source VNI. To be used as rhashtable key.

Patch #15 does the conversion to rhashtable.
====================

Link: https://patch.msgid.link/20250415121143.345227-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Convert FDB table to rhashtable

FDB entries are currently stored in a hash table with a fixed number of
buckets (256), resulting in performance degradation as the number of
entries grows. Solve this by converting the driver to use rhashtable
which maintains more or less constant performance regardless of the
number of entries.

Measured transmitted packets per second using a single pktgen thread
with varying number of entries when the transmitted packet always hits
the default entry (worst case):

Number of entries | Improvement
------------------|------------
1k                | +1.12%
4k                | +9.22%
16k               | +55%
64k               | +585%
256k              | +2460%

In addition, the change reduces the size of the VXLAN device structure
from 2584 bytes to 672 bytes.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-16-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Introduce FDB key structure

In preparation for converting the FDB table to rhashtable, introduce a
key structure that includes the MAC address and source VNI.

No functional changes intended.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-15-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Do not treat dst cache initialization errors as fatal

FDB entries are allocated in an atomic context as they can be added from
the data path when learning is enabled.

After converting the FDB hash table to rhashtable, the insertion rate
will be much higher (*) which will entail a much higher rate of per-CPU
allocations via dst_cache_init().

When adding a large number of entries (e.g., 256k) in a batch, a small
percentage (< 0.02%) of these per-CPU allocations will fail [1]. This
does not happen with the current code since the insertion rate is low
enough to give the per-CPU allocator a chance to asynchronously create
new chunks of per-CPU memory.

Given that:

a. Only a small percentage of these per-CPU allocations fail.

b. The scenario where this happens might not be the most realistic one.

c. The driver can work correctly without dst caches. The dst_cache_*()
APIs first check that the dst cache was properly initialized.

d. The dst caches are not always used (e.g., 'tos inherit').

It seems reasonable to not treat these allocation failures as fatal.

Therefore, do not bail when dst_cache_init() fails and suppress warnings
by specifying '__GFP_NOWARN'.

[1] percpu: allocation failed, size=40 align=8 atomic=1, atomic alloc failed, no space left

(*) 97% reduction in average latency of vxlan_fdb_update() when adding
256k entries in a batch.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-14-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Create wrappers for FDB lookup

__vxlan_find_mac() is called from both the data path (e.g., during
learning) and the control path (e.g., when replacing an entry). The
function is missing lockdep annotations to make sure that the FDB hash
lock is held during FDB updates.

Rename __vxlan_find_mac() to vxlan_find_mac_rcu() to reflect the fact
that it should be called from an RCU read-side critical section and call
it from vxlan_find_mac() which checks that the FDB hash lock is held.
Change callers to invoke the appropriate function.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-13-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Rename FDB Tx lookup function

vxlan_find_mac() is only expected to be called from the Tx path as it
updates the 'used' timestamp. Rename it to vxlan_find_mac_tx() to
reflect that and to avoid incorrect updates of this timestamp like those
addressed by commit 9722f834fe9a ("vxlan: Avoid unnecessary updates to
FDB 'used' time").

No functional changes intended.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-12-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Convert FDB flushing to RCU

Instead of holding the FDB hash lock when traversing the FDB linked list
during flushing, use RCU and only acquire the lock for entries that need
to be flushed.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-11-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Convert FDB garbage collection to RCU

Instead of holding the FDB hash lock when traversing the FDB linked list
during garbage collection, use RCU and only acquire the lock for entries
that need to be removed (aged out).

Avoid races by using hlist_unhashed() to check that the entry has not
been removed from the list by another thread.

Note that vxlan_fdb_destroy() uses hlist_del_init_rcu() to remove an
entry from the list which should cause list_unhashed() to return true.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-10-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Use linked list to traverse FDB entries

In preparation for removing the fixed size hash table, convert FDB entry
traversal to use the newly added FDB linked list.

No functional changes intended.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-9-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Add a linked list of FDB entries

Currently, FDB entries are stored in a hash table with a fixed number of
buckets. The table is used for both lookups and entry traversal.
Subsequent patches will convert the table to rhashtable which is not
suitable for entry traversal.

In preparation for this conversion, add FDB entries to a linked list.
Subsequent patches will convert the driver to use this list when
traversing entries during dump, flush, etc.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-8-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Use a single lock to protect the FDB table

Currently, the VXLAN driver stores FDB entries in a hash table with a
fixed number of buckets (256). Subsequent patches are going to convert
this table to rhashtable with a linked list for entry traversal, as
rhashtable is more scalable.

In preparation for this conversion, move from a per-bucket spin lock to
a single spin lock that protects the entire FDB table.

The per-bucket spin locks were introduced by commit fe1e0713bbe8
("vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention") citing
"huge contention when inserting/deleting vxlan_fdbs into the fdb_head".

It is not clear from the commit message which code path was holding the
spin lock for long periods of time, but the obvious suspect is the FDB
cleanup routine (vxlan_cleanup()) that periodically traverses the entire
table in order to delete aged-out entries.

This will be solved by subsequent patches that will convert the FDB
cleanup routine to traverse the linked list of FDB entries using RCU,
only acquiring the spin lock when deleting an aged-out entry.

The change reduces the size of the VXLAN device structure from 3600
bytes to 2576 bytes.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-7-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Relocate assignment of default remote device

The default FDB entry can be associated with a net device if a physical
device (i.e., 'dev PHYS_DEV') was specified during the creation of the
VXLAN device.

The assignment of the net device pointer to 'dst->remote_dev' logically
belongs in the if block that resolves the pointer from the specified
ifindex, so move it there.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-6-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Unsplit default FDB entry creation and notification

Commit 0241b836732f ("vxlan: fix default fdb entry netlink notify
ordering during netdev create") split the creation of the default FDB
entry from its notification to avoid sending a RTM_NEWNEIGH notification
before RTM_NEWLINK.

Previous patches restructured the code so that the default FDB entry is
created after registering the VXLAN device and the notification about
the new entry immediately follows its creation.

Therefore, simplify the code and revert back to vxlan_fdb_update() which
takes care of both creating the FDB entry and notifying user space
about it.

Hold the FDB hash lock when calling vxlan_fdb_update() like it expects.
A subsequent patch will add a lockdep assertion to make sure this is
indeed the case.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-5-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Insert FDB into hash table in vxlan_fdb_create()

Commit 7c31e54aeee5 ("vxlan: do not destroy fdb if register_netdevice()
is failed") split the insertion of FDB entries into the FDB hash table
from the function where they are created.

This was done in order to work around a problem that is no longer
possible after the previous patch. Simplify the code and move the body
of vxlan_fdb_insert() back into vxlan_fdb_create().

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-4-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Simplify creation of default FDB entry

There is asymmetry in how the default FDB entry (all-zeroes) is created
and destroyed in the VXLAN driver. It is created as part of the driver's
newlink() routine, but destroyed as part of its ndo_uninit() routine.

This caused multiple problems in the past. First, commit 0241b836732f
("vxlan: fix default fdb entry netlink notify ordering during netdev
create") split the notification about the entry from its creation so
that it will not be notified to user space before the VXLAN device is
registered.

Then, commit 6db924687139 ("vxlan: Fix error path in
__vxlan_dev_create()") made the error path in __vxlan_dev_create()
asymmetric by destroying the FDB entry before unregistering the net
device. Otherwise, the FDB entry would have been freed twice: By
ndo_uninit() as part of unregister_netdevice() and by
vxlan_fdb_destroy() in the error path.

Finally, commit 7c31e54aeee5 ("vxlan: do not destroy fdb if
register_netdevice() is failed") split the insertion of the FDB entry
into the hash table from its creation, moving the insertion after the
registration of the net device. Otherwise, like before, the FDB entry
would have been freed twice: By ndo_uninit() as part of
register_netdevice()'s error path and by vxlan_fdb_destroy() in the
error path of __vxlan_dev_create().

The end result is that the code is unnecessarily complex. In addition,
the fixed size hash table cannot be converted to rhashtable as
vxlan_fdb_insert() cannot fail, which will no longer be true with
rhashtable.

Solve this by making the addition and deletion of the default FDB entry
completely symmetric. Namely, as part of newlink() routine, create the
entry, insert it into to the hash table and send a notification to user
space after the net device was registered. Note that at this stage the
net device is still administratively down and cannot transmit / receive
packets.

Move the deletion from ndo_uninit() to the dellink routine(): Flush the
default entry together with all the other entries, before unregistering
the net device.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-3-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

vxlan: Add RCU read-side critical sections in the Tx path

The Tx path does not run from an RCU read-side critical section which
makes the current lockless accesses to FDB entries invalid. As far as I
am aware, this has not been a problem in practice, but traces will be
generated once we transition the FDB lookup to rhashtable_lookup().

Add rcu_read_{lock,unlock}() around the handling of FDB entries in the
Tx path. Remove the RCU read-side critical section from vxlan_xmit_nh()
as now the function is always called from an RCU read-side critical
section.

Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250415121143.345227-2-idosch@nvidia.com
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-04-17

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 18 files changed, 1748 insertions(+), 19 deletions(-).

The main changes are:

1) bpf qdisc support, from Amery Hung.
   A qdisc can be implemented in bpf struct_ops programs and
   can be used the same as other existing qdiscs in the
   "tc qdisc" command.

2) Add xsk tail adjustment tests, from Tushar Vyavahare.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Test attaching bpf qdisc to mq and non root
  selftests/bpf: Add a bpf fq qdisc to selftest
  selftests/bpf: Add a basic fifo qdisc test
  libbpf: Support creating and destroying qdisc
  bpf: net_sched: Disable attaching bpf qdisc to non root
  bpf: net_sched: Support updating bstats
  bpf: net_sched: Add a qdisc watchdog timer
  bpf: net_sched: Add basic bpf qdisc kfuncs
  bpf: net_sched: Support implementation of Qdisc_ops in bpf
  bpf: Prepare to reuse get_ctx_arg_idx
  selftests/xsk: Add tail adjustment tests and support check
  selftests/xsk: Add packet stream replacement function
====================

Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'bnxt_en-update-for-net-next'

Michael Chan says:

====================
bnxt_en: Update for net-next

The first patch changes the FW message timeout threshold for a warning
message. The second patch adjusts the ethtool -w coredump length to
suppress a warning. The last 2 patches are small cleanup patches for
the bnxt_ulp RoCE auxbus code.

v1: https://lore.kernel.org/netdev/20250415174818.1088646-1-michael.chan@broadcom.com/
====================

Link: https://patch.msgid.link/20250417172448.1206107-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove unused macros in bnxt_ulp.h

BNXT_ROCE_ULP and BNXT_MAX_ULP are no longer used. Remove them to
clean up the code.

Reviewed-by: Shruti Parab <shruti.parab@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bnxt_en: Remove unused field "ref_count" in struct bnxt_ulp

The "ref_count" field in struct bnxt_ulp is unused after
commit a43c26fa2e6c ("RDMA/bnxt_re: Remove the sriov config callback").
So we can just remove it now.

Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20250417172448.1206107-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>