]> git.ipfire.org Git - thirdparty/linux.git/log
thirdparty/linux.git
6 weeks agoidpf: remove unreachable code from setting mailbox
Michal Swiatkowski [Wed, 9 Apr 2025 06:29:45 +0000 (08:29 +0200)] 
idpf: remove unreachable code from setting mailbox

Remove code that isn't reached. There is no need to check for
adapter->req_vec_chunks, because if it isn't set idpf_set_mb_vec_id()
won't be called.

Only one path when idpf_set_mb_vec_id() is called:
idpf_intr_req()
 -> idpf_send_alloc_vectors_msg() -> adapter->req_vec_chunk is allocated
 here, otherwise an error is returned and idpf_intr_req() exits with an
 error.

The idpf_set_mb_vec_id() becomes one-liner and it is called only once.
Remove it and set mailbox vector index directly.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoidpf: assign extracted ptype to struct libeth_rqe_info field
Mateusz Polchlopek [Sat, 1 Mar 2025 19:02:44 +0000 (20:02 +0100)] 
idpf: assign extracted ptype to struct libeth_rqe_info field

Assign the ptype extracted from qword to the ptype field of struct
libeth_rqe_info.
Remove the now excess ptype param of idpf_rx_singleq_extract_fields(),
idpf_rx_singleq_extract_base_fields() and
idpf_rx_singleq_extract_flex_fields().

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoixgbe: devlink: add devlink region support for E610
Slawomir Mrozowicz [Fri, 11 Apr 2025 13:06:26 +0000 (15:06 +0200)] 
ixgbe: devlink: add devlink region support for E610

Provide support for the following devlink cmds:
 -DEVLINK_CMD_REGION_GET
 -DEVLINK_CMD_REGION_NEW
 -DEVLINK_CMD_REGION_DEL
 -DEVLINK_CMD_REGION_READ

ixgbe devlink region implementation, similarly to the ice one,
lets user to create snapshots of content of Non Volatile Memory,
content of Shadow RAM, and capabilities of the device.

For both NVM and SRAM regions provide .read() handler to let user
read their contents without the need to create full snapshots.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoixgbe: add E610 .set_phys_id() callback implementation
Jedrzej Jagielski [Mon, 3 Mar 2025 12:06:30 +0000 (13:06 +0100)] 
ixgbe: add E610 .set_phys_id() callback implementation

Legacy implementation of .set_phys_id() ethtool callback is not
applicable for E610 device.

Add new implementation which uses 0x06E9 command by calling
ixgbe_aci_set_port_id_led().

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoixgbe: apply different rules for setting FC on E610
Jedrzej Jagielski [Mon, 3 Mar 2025 12:06:29 +0000 (13:06 +0100)] 
ixgbe: apply different rules for setting FC on E610

E610 device doesn't support disabling FC autonegotiation.

Create dedicated E610 .set_pauseparam() implementation and assign
it to ixgbe_ethtool_ops_e610.

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoixgbe: add support for ACPI WOL for E610
Jedrzej Jagielski [Mon, 3 Mar 2025 12:06:28 +0000 (13:06 +0100)] 
ixgbe: add support for ACPI WOL for E610

Currently only APM (Advanced Power Management) is supported by
the ixgbe driver. It works for magic packets only, as for different
sources of wake-up E610 adapter utilizes different feature.

Add E610 specific implementation of ixgbe_set_wol() callback. When
any of broadcast/multicast/unicast wake-up is set, disable APM and
configure ACPI (Advanced Configuration and Power Interface).

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoixgbe: create E610 specific ethtool_ops structure
Jedrzej Jagielski [Mon, 3 Mar 2025 12:06:27 +0000 (13:06 +0100)] 
ixgbe: create E610 specific ethtool_ops structure

E610's implementation of various ethtool ops is different than
the ones corresponding to ixgbe legacy products. Therefore create
separate E610 ethtool_ops struct which will be filled out in the
forthcoming patches.

Add adequate ops struct basing on MAC type. This step requires
changing a bit the flow of probing by placing ixgbe_set_ethtool_ops
after hw.mac.type is assigned. So move the whole netdev assignment
block after hw.mac.type is known. This step doesn't have any additional
impact on probing sequence.

Suggested-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigc: Change Tx mode for MQPRIO offloading
Kurt Kanzenbach [Fri, 21 Mar 2025 13:52:39 +0000 (14:52 +0100)] 
igc: Change Tx mode for MQPRIO offloading

The current MQPRIO offload implementation uses the legacy TSN Tx mode. In
this mode the hardware uses four packet buffers and considers queue
priorities.

In order to harmonize the TAPRIO implementation with MQPRIO, switch to the
regular TSN Tx mode. This mode also uses four packet buffers and considers
queue priorities. In addition to the legacy mode, transmission is always
coupled to Qbv. The driver already has mechanisms to use a dummy schedule
of 1 second with all gates open for ETF. Simply use this for MQPRIO too.

This reduces code and makes it easier to add support for frame preemption
later.

Tested on i225 with real time application using high priority queue, iperf3
using low priority queue and network TAP device.

Acked-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigc: Limit netdev_tc calls to MQPRIO
Kurt Kanzenbach [Fri, 21 Mar 2025 13:52:38 +0000 (14:52 +0100)] 
igc: Limit netdev_tc calls to MQPRIO

Limit netdev_tc calls to MQPRIO. Currently these calls are made in
igc_tsn_enable_offload() and igc_tsn_disable_offload() which are used by
TAPRIO and ETF as well. However, these are only required for MQPRIO.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigb: Get rid of spurious interrupts
Kurt Kanzenbach [Wed, 19 Mar 2025 10:26:42 +0000 (11:26 +0100)] 
igb: Get rid of spurious interrupts

When running the igc with XDP/ZC in busy polling mode with deferral of hard
interrupts, interrupts still happen from time to time. That is caused by
the igb task watchdog which triggers Rx interrupts periodically.

That mechanism has been introduced to overcome skb/memory allocation
failures [1]. So the Rx clean functions stop processing the Rx ring in case
of such failure. The task watchdog triggers Rx interrupts periodically in
the hope that memory became available in the mean time.

The current behavior is undesirable for real time applications, because the
driver induced Rx interrupts trigger also the softirq processing. However,
all real time packets should be processed by the application which uses the
busy polling method.

Therefore, only trigger the Rx interrupts in case of real allocation
failures. Introduce a new flag for signaling that condition.

Follow the same logic as in commit 8dcf2c212078 ("igc: Get rid of spurious
interrupts").

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=3be507547e6177e5c808544bd6a2efa2c7f1d436

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Sweta Kumari <sweta.kumari@intel.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigb: Add support for persistent NAPI config
Kurt Kanzenbach [Wed, 19 Mar 2025 10:26:41 +0000 (11:26 +0100)] 
igb: Add support for persistent NAPI config

Use netif_napi_add_config() to assign persistent per-NAPI config.

This is useful for preserving NAPI settings when changing queue counts or
for user space programs using SO_INCOMING_NAPI_ID.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigb: Link queues to NAPI instances
Kurt Kanzenbach [Wed, 19 Mar 2025 10:26:40 +0000 (11:26 +0100)] 
igb: Link queues to NAPI instances

Link queues to NAPI instances via netdev-genl API. This is required to use
XDP/ZC busy polling. See commit 5ef44b3cb43b ("xsk: Bring back busy polling
support") for details.

This also allows users to query the info with netlink:

|$ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
|                               --dump queue-get --json='{"ifindex": 2}'
|[{'id': 0, 'ifindex': 2, 'napi-id': 8201, 'type': 'rx'},
| {'id': 1, 'ifindex': 2, 'napi-id': 8202, 'type': 'rx'},
| {'id': 2, 'ifindex': 2, 'napi-id': 8203, 'type': 'rx'},
| {'id': 3, 'ifindex': 2, 'napi-id': 8204, 'type': 'rx'},
| {'id': 0, 'ifindex': 2, 'napi-id': 8201, 'type': 'tx'},
| {'id': 1, 'ifindex': 2, 'napi-id': 8202, 'type': 'tx'},
| {'id': 2, 'ifindex': 2, 'napi-id': 8203, 'type': 'tx'},
| {'id': 3, 'ifindex': 2, 'napi-id': 8204, 'type': 'tx'}]

Add rtnl locking to PCI error handlers, because netif_queue_set_napi()
requires the lock held.

While at __igb_open() use RCT coding style.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Sweta Kumari <sweta.kumari@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoigb: Link IRQs to NAPI instances
Kurt Kanzenbach [Wed, 19 Mar 2025 10:26:39 +0000 (11:26 +0100)] 
igb: Link IRQs to NAPI instances

Link IRQs to NAPI instances via netdev-genl API. This allows users to query
that information via netlink:

|$ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
|                               --dump napi-get --json='{"ifindex": 2}'
|[{'defer-hard-irqs': 0,
|  'gro-flush-timeout': 0,
|  'id': 8204,
|  'ifindex': 2,
|  'irq': 127,
|  'irq-suspend-timeout': 0},
| {'defer-hard-irqs': 0,
|  'gro-flush-timeout': 0,
|  'id': 8203,
|  'ifindex': 2,
|  'irq': 126,
|  'irq-suspend-timeout': 0},
| {'defer-hard-irqs': 0,
|  'gro-flush-timeout': 0,
|  'id': 8202,
|  'ifindex': 2,
|  'irq': 125,
|  'irq-suspend-timeout': 0},
| {'defer-hard-irqs': 0,
|  'gro-flush-timeout': 0,
|  'id': 8201,
|  'ifindex': 2,
|  'irq': 124,
|  'irq-suspend-timeout': 0}]
|$ cat /proc/interrupts | grep enp2s0
|123:          0          1 IR-PCI-MSIX-0000:02:00.0   0-edge      enp2s0
|124:          0          7 IR-PCI-MSIX-0000:02:00.0   1-edge      enp2s0-TxRx-0
|125:          0          0 IR-PCI-MSIX-0000:02:00.0   2-edge      enp2s0-TxRx-1
|126:          0          5 IR-PCI-MSIX-0000:02:00.0   3-edge      enp2s0-TxRx-2
|127:          0          0 IR-PCI-MSIX-0000:02:00.0   4-edge      enp2s0-TxRx-3

Reviewed-by: Joe Damato <jdamato@fastly.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
6 weeks agoMerge branch 'ip-improve-tcp-sock-multipath-routing'
Paolo Abeni [Tue, 29 Apr 2025 14:22:26 +0000 (16:22 +0200)] 
Merge branch 'ip-improve-tcp-sock-multipath-routing'

Willem de Bruijn says:

====================
ip: improve tcp sock multipath routing

From: Willem de Bruijn <willemb@google.com>

Improve layer 4 multipath hash policy for local tcp connections:

patch 1: Select a source address that matches the nexthop device.
         Due to tcp_v4_connect making separate route lookups for saddr
         and route, the two can currently be inconsistent.

patch 2: Use all paths when opening multiple local tcp connections to
         the same ip address and port.

patch 3: Test the behavior. Extend the fib_tests.sh testsuite with one
         opening many connections, and count SYNs on both egress
         devices, for packets matching the source address of the dev.

Changelog in the individual patches
====================

Link: https://patch.msgid.link/20250424143549.669426-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoselftests/net: test tcp connection load balancing
Willem de Bruijn [Thu, 24 Apr 2025 14:35:20 +0000 (10:35 -0400)] 
selftests/net: test tcp connection load balancing

Verify that TCP connections use both routes when connecting multiple
times to a remote service over a two nexthop multipath route.

Use socat to create the connections. Use tc prio + tc filter to
count routes taken, counting SYN packets across the two egress
devices. Also verify that the saddr matches that of the device.

To avoid flaky tests when testing inherently randomized behavior,
set a low bar and pass if even a single SYN is observed on each
device.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-4-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoip: load balance tcp connections to single dst addr and port
Willem de Bruijn [Thu, 24 Apr 2025 14:35:19 +0000 (10:35 -0400)] 
ip: load balance tcp connections to single dst addr and port

Load balance new TCP connections across nexthops also when they
connect to the same service at a single remote address and port.

This affects only port-based multipath hashing:
fib_multipath_hash_policy 1 or 3.

Local connections must choose both a source address and port when
connecting to a remote service, in ip_route_connect. This
"chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and
simplify ip_route_{connect,newports}()")) is resolved by first
selecting a source address, by looking up a route using the zero
wildcard source port and address.

As a result multiple connections to the same destination address and
port have no entropy in fib_multipath_hash.

This is not a problem when forwarding, as skb-based hashing has a
4-tuple. Nor when establishing UDP connections, as autobind there
selects a port before reaching ip_route_connect.

Load balance also TCP, by using a random port in fib_multipath_hash.
Port assignment in inet_hash_connect is not atomic with
ip_route_connect. Thus ports are unpredictable, effectively random.

Implementation details:

Do not actually pass a random fl4_sport, as that affects not only
hashing, but routing more broadly, and can match a source port based
policy route, which existing wildcard port 0 will not. Instead,
define a new wildcard flowi flag that is used only for hashing.

Selecting a random source is equivalent to just selecting a random
hash entirely. But for code clarity, follow the normal 4-tuple hash
process and only update this field.

fib_multipath_hash can be reached with zero sport from other code
paths, so explicitly pass this flowi flag, rather than trying to infer
this case in the function itself.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agoipv4: prefer multipath nexthop that matches source address
Willem de Bruijn [Thu, 24 Apr 2025 14:35:18 +0000 (10:35 -0400)] 
ipv4: prefer multipath nexthop that matches source address

With multipath routes, try to ensure that packets leave on the device
that is associated with the source address.

Avoid the following tcpdump example:

    veth0 Out IP 10.1.0.2.38640 > 10.2.0.3.8000: Flags [S]
    veth1 Out IP 10.1.0.2.38648 > 10.2.0.3.8000: Flags [S]

Which can happen easily with the most straightforward setup:

    ip addr add 10.0.0.1/24 dev veth0
    ip addr add 10.1.0.1/24 dev veth1

    ip route add 10.2.0.3 nexthop via 10.0.0.2 dev veth0 \
       nexthop via 10.1.0.2 dev veth1

This is apparently considered WAI, based on the comment in
ip_route_output_key_hash_rcu:

    * 2. Moreover, we are allowed to send packets with saddr
    *    of another iface. --ANK

It may be ok for some uses of multipath, but not all. For instance,
when using two ISPs, a router may drop packets with unknown source.

The behavior occurs because tcp_v4_connect makes three route
lookups when establishing a connection:

1. ip_route_connect calls to select a source address, with saddr zero.
2. ip_route_connect calls again now that saddr and daddr are known.
3. ip_route_newports calls again after a source port is also chosen.

With a route with multiple nexthops, each lookup may make a different
choice depending on available entropy to fib_select_multipath. So it
is possible for 1 to select the saddr from the first entry, but 3 to
select the second entry. Leading to the above situation.

Address this by preferring a match that matches the flowi4 saddr. This
will make 2 and 3 make the same choice as 1. Continue to update the
backup choice until a choice that matches saddr is found.

Do this in fib_select_multipath itself, rather than passing an fl4_oif
constraint, to avoid changing non-multipath route selection. Commit
e6b45241c57a ("ipv4: reset flowi parameters on route connect") shows
how that may cause regressions.

Also read ipv4.sysctl_fib_multipath_use_neigh only once. No need to
refresh in the loop.

This does not happen in IPv6, which performs only one lookup.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250424143549.669426-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
6 weeks agonet: ti: icssg-prueth: Add ICSSG FW Stats
MD Danish Anwar [Thu, 24 Apr 2025 09:53:16 +0000 (15:23 +0530)] 
net: ti: icssg-prueth: Add ICSSG FW Stats

The ICSSG firmware maintains set of stats called PA_STATS.
Currently the driver only dumps 4 stats. Add support for dumping more
stats.

The offset for different stats are defined as MACROs in icssg_switch_map.h
file. All the offsets are for Slice0. Slice1 offsets are slice0 + 4.
The offset calculation is taken care while reading the stats in
emac_update_hardware_stats().

The statistics are documented in
Documentation/networking/device_drivers/icssg_prueth.rst

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: MD Danish Anwar <danishanwar@ti.com>
Link: https://patch.msgid.link/20250424095316.2643573-1-danishanwar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agotools/Makefile: Add ynl target
Joe Damato [Wed, 23 Apr 2025 20:46:44 +0000 (20:46 +0000)] 
tools/Makefile: Add ynl target

Add targets to build, clean, and install ynl headers, libynl.a, and
python tooling.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250423204647.190784-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agortase: Modify the format specifier in snprintf to %u
Justin Lai [Fri, 25 Apr 2025 06:40:57 +0000 (14:40 +0800)] 
rtase: Modify the format specifier in snprintf to %u

Modify the format specifier in snprintf to %u.

Signed-off-by: Justin Lai <justinlai0215@realtek.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250425064057.30035-1-justinlai0215@realtek.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'phase-out-hybrid-pci-devres-api'
Jakub Kicinski [Mon, 28 Apr 2025 23:19:19 +0000 (16:19 -0700)] 
Merge branch 'phase-out-hybrid-pci-devres-api'

Philipp Stanner says:

====================
Phase out hybrid PCI devres API

Fixes a number of minor issues with the usage of the PCI API in net.
Notbaly, it replaces calls to the sometimes-managed
pci_request_regions() to the always-managed pcim_request_all_regions(),
enabling us to remove that hybrid functionality from PCI.
====================

Link: https://patch.msgid.link/20250425085740.65304-2-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: thunder_bgx: Don't disable PCI device manually
Philipp Stanner [Fri, 25 Apr 2025 08:57:41 +0000 (10:57 +0200)] 
net: thunder_bgx: Don't disable PCI device manually

thunder_bgx's PCI device is enabled with pcim_enable_device(), a managed
devres function which ensures that the device gets enabled on driver
detach automatically.

Remove the calls to pci_disable_device().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-10-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: thunder_bgx: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:40 +0000 (10:57 +0200)] 
net: thunder_bgx: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Furthermore, the PCI function being managed implies that it's not
necessary to call pci_release_regions() manually.

Remove the calls to pci_release_regions().

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-9-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: mdio: thunder: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:39 +0000 (10:57 +0200)] 
net: mdio: thunder: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Furthermore, the PCI function being managed implies that it's not
necessary to call pci_release_regions() manually.

Remove the calls to pci_release_regions().

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-8-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethernet: sis900: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:38 +0000 (10:57 +0200)] 
net: ethernet: sis900: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Daniele Venzano <venza@brownhat.org>
Link: https://patch.msgid.link/20250425085740.65304-7-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: ethernet: natsemi: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:37 +0000 (10:57 +0200)] 
net: ethernet: natsemi: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-6-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: tulip: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:36 +0000 (10:57 +0200)] 
net: tulip: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-5-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: octeontx2: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:35 +0000 (10:57 +0200)] 
net: octeontx2: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Furthermore, the PCI function being managed implies that it's not
necessary to call pci_release_regions() manually.

Remove the calls to pci_release_regions().

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20250425085740.65304-4-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: prestera: Use pure PCI devres API
Philipp Stanner [Fri, 25 Apr 2025 08:57:34 +0000 (10:57 +0200)] 
net: prestera: Use pure PCI devres API

The currently used function pci_request_regions() is one of the
problematic "hybrid devres" PCI functions, which are sometimes managed
through devres, and sometimes not (depending on whether
pci_enable_device() or pcim_enable_device() has been called before).

The PCI subsystem wants to remove this behavior and, therefore, needs to
port all users to functions that don't have this problem.

Furthermore, the PCI function being managed implies that it's not
necessary to call pci_release_regions() manually.

Remove the calls to pci_release_regions().

Replace pci_request_regions() with pcim_request_all_regions().

Signed-off-by: Philipp Stanner <phasta@kernel.org>
Acked-by: Elad Nachman <enachman@marvell.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250425085740.65304-3-phasta@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'virtio-net-disable-delayed-refill-when-pausing-rx'
Jakub Kicinski [Mon, 28 Apr 2025 22:49:14 +0000 (15:49 -0700)] 
Merge branch 'virtio-net-disable-delayed-refill-when-pausing-rx'

Bui Quang Minh says:

====================
virtio-net: disable delayed refill when pausing rx

Hi everyone,

This only includes the selftest for virtio-net deadlock bug. The fix
commit has been applied already.

Link: https://lore.kernel.org/virtualization/174537302875.2111809.8543884098526067319.git-patchwork-notify@kernel.org/T/
====================

Link: https://patch.msgid.link/20250425071018.36078-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: net: add a virtio_net deadlock selftest
Bui Quang Minh [Fri, 25 Apr 2025 07:10:18 +0000 (14:10 +0700)] 
selftests: net: add a virtio_net deadlock selftest

The selftest reproduces the deadlock scenario when binding/unbinding XDP
program, XDP socket, rx ring resize on virtio_net interface.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250425071018.36078-5-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: net: retry when bind returns EBUSY in xdp_helper
Bui Quang Minh [Fri, 25 Apr 2025 07:10:17 +0000 (14:10 +0700)] 
selftests: net: retry when bind returns EBUSY in xdp_helper

When binding the XDP socket, we may get EBUSY because the deferred
destructor of XDP socket in previous test has not been executed yet. If
that is the case, just sleep and retry some times.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250425071018.36078-4-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: net: add flag to force zerocopy mode in xdp_helper
Bui Quang Minh [Fri, 25 Apr 2025 07:10:16 +0000 (14:10 +0700)] 
selftests: net: add flag to force zerocopy mode in xdp_helper

This commit adds an optional -z flag to xdp_helper. When this flag is
provided, the XDP socket binding is forced to be in zerocopy mode.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250425071018.36078-3-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoselftests: net: move xdp_helper to net/lib
Bui Quang Minh [Fri, 25 Apr 2025 07:10:15 +0000 (14:10 +0700)] 
selftests: net: move xdp_helper to net/lib

Move xdp_helper to net/lib to make it easier for other selftests to use
the helper.

Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250425071018.36078-2-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoMerge branch 'veth-qdisc-backpressure-and-qdisc-check-refactor'
Jakub Kicinski [Mon, 28 Apr 2025 21:07:00 +0000 (14:07 -0700)] 
Merge branch 'veth-qdisc-backpressure-and-qdisc-check-refactor'

Jesper Dangaard Brouer says:

====================
veth: qdisc backpressure and qdisc check refactor

This patch series addresses TX drops seen on veth devices under load,
particularly when using threaded NAPI, which is our setup in production.

The root cause is that the NAPI consumer often runs on a different CPU
than the producer. Combined with scheduling delays or simply slower
consumption, this increases the chance that the ptr_ring fills up before
packets are drained, resulting in drops from veth_xmit() (ndo_start_xmit()).

To make this easier to reproduce, we’ve created a script that sets up a
test scenario using network namespaces. The script inserts 1000 iptables
rules in the consumer namespace to slow down packet processing and
amplify the issue. Reproducer script:

https://github.com/xdp-project/xdp-project/blob/main/areas/core/veth_setup01_NAPI_TX_drops.sh

This series first introduces a helper to detect no-queue qdiscs and then
uses it in the veth driver to conditionally apply qdisc-level
backpressure when a real qdisc is attached. The behavior is off by
default and opt-in, ensuring minimal impact and easy activation.

v6: https://lore.kernel.org/174549933665.608169.392044991754158047.stgit@firesoul
v5: https://lore.kernel.org/174489803410.355490.13216831426556849084.stgit@firesoul
v4 https://lore.kernel.org/174472463778.274639.12670590457453196991.stgit@firesoul
v3: https://lore.kernel.org/174464549885.20396.6987653753122223942.stgit@firesoul
v2: https://lore.kernel.org/174412623473.3702169.4235683143719614624.stgit@firesoul
RFC-v1: https://lore.kernel.org/174377814192.3376479.16481605648460889310.stgit@firesoul
====================

Link: https://patch.msgid.link/174559288731.827981.8748257839971869213.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agoveth: apply qdisc backpressure on full ptr_ring to reduce TX drops
Jesper Dangaard Brouer [Fri, 25 Apr 2025 14:55:40 +0000 (16:55 +0200)] 
veth: apply qdisc backpressure on full ptr_ring to reduce TX drops

In production, we're seeing TX drops on veth devices when the ptr_ring
fills up. This can occur when NAPI mode is enabled, though it's
relatively rare. However, with threaded NAPI - which we use in
production - the drops become significantly more frequent.

The underlying issue is that with threaded NAPI, the consumer often runs
on a different CPU than the producer. This increases the likelihood of
the ring filling up before the consumer gets scheduled, especially under
load, leading to drops in veth_xmit() (ndo_start_xmit()).

This patch introduces backpressure by returning NETDEV_TX_BUSY when the
ring is full, signaling the qdisc layer to requeue the packet. The txq
(netdev queue) is stopped in this condition and restarted once
veth_poll() drains entries from the ring, ensuring coordination between
NAPI and qdisc.

Backpressure is only enabled when a qdisc is attached. Without a qdisc,
the driver retains its original behavior - dropping packets immediately
when the ring is full. This avoids unexpected behavior changes in setups
without a configured qdisc.

With a qdisc in place (e.g. fq, sfq) this allows Active Queue Management
(AQM) to fairly schedule packets across flows and reduce collateral
damage from elephant flows.

A known limitation of this approach is that the full ring sits in front
of the qdisc layer, effectively forming a FIFO buffer that introduces
base latency. While AQM still improves fairness and mitigates flow
dominance, the latency impact is measurable.

In hardware drivers, this issue is typically addressed using BQL (Byte
Queue Limits), which tracks in-flight bytes needed based on physical link
rate. However, for virtual drivers like veth, there is no fixed bandwidth
constraint - the bottleneck is CPU availability and the scheduler's ability
to run the NAPI thread. It is unclear how effective BQL would be in this
context.

This patch serves as a first step toward addressing TX drops. Future work
may explore adapting a BQL-like mechanism to better suit virtual devices
like veth.

Reported-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://patch.msgid.link/174559294022.827981.1282809941662942189.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agonet: sched: generalize check for no-queue qdisc on TX queue
Jesper Dangaard Brouer [Fri, 25 Apr 2025 14:55:31 +0000 (16:55 +0200)] 
net: sched: generalize check for no-queue qdisc on TX queue

The "noqueue" qdisc can either be directly attached, or get default
attached if net_device priv_flags has IFF_NO_QUEUE. In both cases, the
allocated Qdisc structure gets it's enqueue function pointer reset to
NULL by noqueue_init() via noqueue_qdisc_ops.

This is a common case for software virtual net_devices. For these devices
with no-queue, the transmission path in __dev_queue_xmit() will bypass
the qdisc layer. Directly invoking device drivers ndo_start_xmit (via
dev_hard_start_xmit).  In this mode the device driver is not allowed to
ask for packets to be queued (either via returning NETDEV_TX_BUSY or
stopping the TXQ).

The simplest and most reliable way to identify this no-queue case is by
checking if enqueue == NULL.

The vrf driver currently open-codes this check (!qdisc->enqueue). While
functionally correct, this low-level detail is better encapsulated in a
dedicated helper for clarity and long-term maintainability.

To make this behavior more explicit and reusable, this patch introduce a
new helper: qdisc_txq_has_no_queue(). Helper will also be used by the
veth driver in the next patch, which introduces optional qdisc-based
backpressure.

This is a non-functional change.

Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/174559293172.827981.7583862632045264175.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 weeks agomdio: fix CONFIG_MDIO_DEVRES selects
Arnd Bergmann [Fri, 25 Apr 2025 11:27:56 +0000 (13:27 +0200)] 
mdio: fix CONFIG_MDIO_DEVRES selects

The newly added rtl9300 driver needs MDIO_DEVRES:

x86_64-linux-ld: drivers/net/mdio/mdio-realtek-rtl9300.o: in function `rtl9300_mdiobus_probe':
mdio-realtek-rtl9300.c:(.text+0x941): undefined reference to `devm_mdiobus_alloc_size'
x86_64-linux-ld: mdio-realtek-rtl9300.c:(.text+0x9e2): undefined reference to `__devm_mdiobus_register'
Since this is a hidden symbol, it needs to be selected by each user,
rather than the usual 'depends on'. I see that there are a few other
drivers that accidentally use 'depends on', so fix these as well for
consistency and to avoid dependency loops.

Fixes: 37f9b2a6c086 ("net: ethernet: Add missing depends on MDIO_DEVRES")
Fixes: 24e31e474769 ("net: mdio: Add RTL9300 MDIO driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Link: https://patch.msgid.link/20250425112819.1645342-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-stmmac-dwmac-loongson-add-loongson-2k3000-support'
Jakub Kicinski [Mon, 28 Apr 2025 20:37:28 +0000 (13:37 -0700)] 
Merge branch 'net-stmmac-dwmac-loongson-add-loongson-2k3000-support'

Huacai Chen says:

====================
net: stmmac: dwmac-loongson: Add Loongson-2K3000 support

This series add stmmac driver support for Loongson-2K3000/Loongson-3B6000M,
which introduces a new CORE ID (0x12) and a new PCI device ID (0x7a23). The
new core reduces channel numbers from 8 to 4, but checksum is supported for
all channels.
====================

Note that the first patch of the series has been merged separately as
commit f438eee2c8c9 ("net: stmmac: dwmac-loongson: Move queue number init
to common function")

Link: https://patch.msgid.link/20250424072209.3134762-1-chenhuacai@loongson.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: dwmac-loongson: Add new GMAC's PCI device ID support
Huacai Chen [Thu, 24 Apr 2025 07:22:09 +0000 (15:22 +0800)] 
net: stmmac: dwmac-loongson: Add new GMAC's PCI device ID support

Add a new GMAC's PCI device ID (0x7a23) support which is used in
Loongson-2K3000/Loongson-3B6000M. The new GMAC device use external PHY,
so it reuses loongson_gmac_data() as the old GMAC device (0x7a03), and
the new GMAC device still doesn't support flow control now.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Tested-by: Henry Chen <chenx97@aosc.io>
Tested-by: Biao Dong <dongbiao@loongson.cn>
Signed-off-by: Baoqi Zhang <zhangbaoqi@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://patch.msgid.link/20250424072209.3134762-4-chenhuacai@loongson.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: dwmac-loongson: Add new multi-chan IP core support
Huacai Chen [Thu, 24 Apr 2025 07:22:08 +0000 (15:22 +0800)] 
net: stmmac: dwmac-loongson: Add new multi-chan IP core support

Add a new multi-chan IP core (0x12) support which is used in Loongson-
2K3000/Loongson-3B6000M. Compared with the 0x10 core, the new 0x12 core
reduces channel numbers from 8 to 4, but checksum is supported for all
channels.

Add a "multichan" flag to loongson_data, so that we can simply use a
"if (ld->multichan)" condition rather than the complicated condition
"if (ld->loongson_id == DWMAC_CORE_MULTICHAN_V1 || ld->loongson_id ==
DWMAC_CORE_MULTICHAN_V2)".

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Henry Chen <chenx97@aosc.io>
Tested-by: Biao Dong <dongbiao@loongson.cn>
Signed-off-by: Baoqi Zhang <zhangbaoqi@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Link: https://patch.msgid.link/20250424072209.3134762-3-chenhuacai@loongson.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-stmmac-socfpga-1000basex-support-and-cleanups'
Jakub Kicinski [Sat, 26 Apr 2025 02:22:03 +0000 (19:22 -0700)] 
Merge branch 'net-stmmac-socfpga-1000basex-support-and-cleanups'

Maxime Chevallier says:

====================
net: stmmac: socfpga: 1000BaseX support and cleanups

This small series sorts-out 1000BaseX support and does a bit of cleanup
for the Lynx conversion.

Patch 1 makes sure that we set the right phy_mode when working in
1000BaseX mode, so that the internal GMII is configured correctly.

Patch 2 removes a check for phy_device upon calling fix_mac_speed(). As
the SGMII adapter may be chained to a Lynx PCS, checking for a
phy_device to be attached to the netdev before enabling the SGMII
adapter doesn't make sense, as we won't have a downstream PHY when using
1000BaseX.

Patch 3 cleans an unused field from the PCS conversion.

v1: https://lore.kernel.org/20250422094701.49798-1-maxime.chevallier@bootlin.com
v2: https://lore.kernel.org/20250423104646.189648-1-maxime.chevallier@bootlin.com
====================

Link: https://patch.msgid.link/20250424071223.221239-1-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: socfpga: Remove unused pcs-mdiodev field
Maxime Chevallier [Thu, 24 Apr 2025 07:12:22 +0000 (09:12 +0200)] 
net: stmmac: socfpga: Remove unused pcs-mdiodev field

When dwmac-socfpga was converted to using the Lynx PCS (previously
referred to in the driver as the Altera TSE PCS), the
lynx_pcs_create_mdiodev() was used to create the pcs instance.

As this function didn't exist in the early versions of the series, a
local mdiodev object was stored for PCS creation. It was never used, but
still made it into the driver, so remove it.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250424071223.221239-4-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: socfpga: Don't check for phy to enable the SGMII adapter
Maxime Chevallier [Thu, 24 Apr 2025 07:12:21 +0000 (09:12 +0200)] 
net: stmmac: socfpga: Don't check for phy to enable the SGMII adapter

The SGMII adapter needs to be enabled for both Cisco SGMII and 1000BaseX
operations. It doesn't make sense to check for an attached phydev here,
as we simply might not have any, in particular if we're using the
1000BaseX interface mode.

Make so that we only re-enable the SGMII adapter when it's present, and
when we use a phy_mode that is handled by said adapter.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250424071223.221239-3-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: socfpga: Enable internal GMII when using 1000BaseX
Maxime Chevallier [Thu, 24 Apr 2025 07:12:20 +0000 (09:12 +0200)] 
net: stmmac: socfpga: Enable internal GMII when using 1000BaseX

Dwmac Socfpga may be used with an instance of a Lynx / Altera TSE PCS,
in which case it gains support for 1000BaseX.

It appears that the PCS is wired to the MAC through an internal GMII
bus. Make sure that we enable the GMII_MII mode for the internal MAC when
using 1000BaseX.

Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250424071223.221239-2-maxime.chevallier@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: stmmac: dwmac-loongson: Move queue number init to common function
Huacai Chen [Thu, 24 Apr 2025 07:22:07 +0000 (15:22 +0800)] 
net: stmmac: dwmac-loongson: Move queue number init to common function

Currently, the tx and rx queue number initialization is duplicated in
loongson_gmac_data() and loongson_gnet_data(), so move it to the common
function loongson_default_data().

This is a preparation for later patches.

Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Tested-by: Henry Chen <chenx97@aosc.io>
Tested-by: Biao Dong <dongbiao@loongson.cn>
Signed-off-by: Baoqi Zhang <zhangbaoqi@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 weeks agobonding: assign random address if device address is same as bond
Hangbin Liu [Thu, 24 Apr 2025 04:22:38 +0000 (04:22 +0000)] 
bonding: assign random address if device address is same as bond

This change addresses a MAC address conflict issue in failover scenarios,
similar to the problem described in commit a951bc1e6ba5 ("bonding: correct
the MAC address for 'follow' fail_over_mac policy").

In fail_over_mac=follow mode, the bonding driver expects the formerly active
slave to swap MAC addresses with the newly active slave during failover.
However, under certain conditions, two slaves may end up with the same MAC
address, which breaks this policy:

1) ip link set eth0 master bond0
   -> bond0 adopts eth0's MAC address (MAC0).

2) ip link set eth1 master bond0
   -> eth1 is added as a backup with its own MAC (MAC1).

3) ip link set eth0 nomaster
   -> eth0 is released and restores its MAC (MAC0).
   -> eth1 becomes the active slave, and bond0 assigns MAC0 to eth1.

4) ip link set eth0 master bond0
   -> eth0 is re-added to bond0, now both eth0 and eth1 have MAC0.

This results in a MAC address conflict and violates the expected behavior
of the failover policy.

To fix this, we assign a random MAC address to any newly added slave if
its current MAC address matches that of the bond. The original (permanent)
MAC address is saved and will be restored when the device is released
from the bond.

This ensures that each slave has a unique MAC address during failover
transitions, preserving the integrity of the fail_over_mac=follow policy.

Fixes: 3915c1e8634a ("bonding: Add "follow" option to fail_over_mac")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 weeks agoMerge branch 'io_uring-zcrx-fix-selftests-and-add-new-test-for-rss-ctx'
Jakub Kicinski [Sat, 26 Apr 2025 01:44:11 +0000 (18:44 -0700)] 
Merge branch 'io_uring-zcrx-fix-selftests-and-add-new-test-for-rss-ctx'

David Wei says:

====================
io_uring/zcrx: fix selftests and add new test for rss ctx

Update io_uring zero copy receive selftest. Patch 1 does a requested
cleanup to use defer() for undoing ethtool actions during the test and
restoring the NIC under test back to its original state.

Patch 2 adds a required call to set hds_thresh to 0. This is needed for
the queue API.

Patch 3 adds a new test case for steering into RSS contexts. A real
application using io_uring zero copy receive relies on this working to
shard work across multiple queues. There seems to be some
differences/bugs with steering into RSS contexts and individual queues.
====================

Link: https://patch.msgid.link/20250425022049.3474590-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoio_uring/zcrx: selftests: add test case for rss ctx
David Wei [Fri, 25 Apr 2025 02:20:49 +0000 (19:20 -0700)] 
io_uring/zcrx: selftests: add test case for rss ctx

RSS contexts are used to shard work across multiple queues for an
application using io_uring zero copy receive. Add a test case checking
that steering flows into an RSS context works.

Until I add multi-thread support to the selftest binary, this test case
only has 1 queue in the RSS context.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250425022049.3474590-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoio_uring/zcrx: selftests: set hds_thresh to 0
David Wei [Fri, 25 Apr 2025 02:20:48 +0000 (19:20 -0700)] 
io_uring/zcrx: selftests: set hds_thresh to 0

Setting hds_thresh to 0 is required for queue reset.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250425022049.3474590-3-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoio_uring/zcrx: selftests: switch to using defer() for cleanup
David Wei [Fri, 25 Apr 2025 02:20:47 +0000 (19:20 -0700)] 
io_uring/zcrx: selftests: switch to using defer() for cleanup

Switch to using defer() for putting the NIC back to the original state
prior to running the selftest.

Signed-off-by: David Wei <dw@davidwei.uk>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250425022049.3474590-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'fix-netdevim-to-correctly-mark-napi-ids'
Jakub Kicinski [Fri, 25 Apr 2025 01:30:36 +0000 (18:30 -0700)] 
Merge branch 'fix-netdevim-to-correctly-mark-napi-ids'

Joe Damato says:

====================
Fix netdevim to correctly mark NAPI IDs

This series fixes netdevsim to correctly set the NAPI ID on the skb.
This is helpful for writing tests around features that use
SO_INCOMING_NAPI_ID.

In addition to the netdevsim fix in patch 1, patches 2 & 3 do some self
test refactoring and add a test for NAPI IDs. The test itself (patch 3)
introduces a C helper because apparently python doesn't have
socket.SO_INCOMING_NAPI_ID.

v3: https://lore.kernel.org/20250418013719.12094-1-jdamato@fastly.com
v2: https://lore.kernel.org/20250417013301.39228-1-jdamato@fastly.com
rfcv1: https://lore.kernel.org/20250329000030.39543-1-jdamato@fastly.com
====================

Link: https://patch.msgid.link/20250424002746.16891-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoselftests: drv-net: Test that NAPI ID is non-zero
Joe Damato [Thu, 24 Apr 2025 00:27:33 +0000 (00:27 +0000)] 
selftests: drv-net: Test that NAPI ID is non-zero

Test that the SO_INCOMING_NAPI_ID of a network file descriptor is
non-zero. This ensures that either the core networking stack or, in some
cases like netdevsim, the driver correctly sets the NAPI ID.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250424002746.16891-4-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoselftests: drv-net: Factor out ksft C helpers
Joe Damato [Thu, 24 Apr 2025 00:27:32 +0000 (00:27 +0000)] 
selftests: drv-net: Factor out ksft C helpers

Factor ksft C helpers to a header so they can be used by other C-based
tests.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250424002746.16891-3-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonetdevsim: Mark NAPI ID on skb in nsim_rcv
Joe Damato [Thu, 24 Apr 2025 00:27:31 +0000 (00:27 +0000)] 
netdevsim: Mark NAPI ID on skb in nsim_rcv

Previously, nsim_rcv was not marking the NAPI ID on the skb, leading to
applications seeing a napi ID of 0 when using SO_INCOMING_NAPI_ID.

To add to the userland confusion, netlink appears to correctly report
the NAPI IDs for netdevsim queues but the resulting file descriptor from
a call to accept() was reporting a NAPI ID of 0.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250424002746.16891-2-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agotools: ynl: fix the header guard name for OVPN
Jakub Kicinski [Wed, 23 Apr 2025 22:02:31 +0000 (15:02 -0700)] 
tools: ynl: fix the header guard name for OVPN

Thorsten reports that after upgrading system headers from linux-next
the YNL build breaks. I typo'ed the header guard, _H is missing.

Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Link: https://lore.kernel.org/59ba7a94-17b9-485f-aa6d-14e4f01a7a39@leemhuis.info
Fixes: 12b196568a3a ("tools: ynl: add missing header deps")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250423220231.1035931-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ethernet: mtk_wed: annotate RCU release in attach()
Johannes Berg [Wed, 23 Apr 2025 15:08:08 +0000 (17:08 +0200)] 
net: ethernet: mtk_wed: annotate RCU release in attach()

There are some sparse warnings in wifi, and it seems that
it's actually possible to annotate a function pointer with
__releases(), making the sparse warnings go away. In a way
that also serves as documentation that rcu_read_unlock()
must be called in the attach method, so add that annotation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://patch.msgid.link/20250423150811.456205-2-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'tcp-fastopen-observability'
Jakub Kicinski [Fri, 25 Apr 2025 01:21:07 +0000 (18:21 -0700)] 
Merge branch 'tcp-fastopen-observability'

Jeremy Harris says:

====================
tcp: fastopen: observability

Whether TCP Fast Open was used for a connection is not reliably
observable by an accepting application when the SYN passed no data.

Fix this by noting during SYN receive processing that an acceptable Fast
Open option was used, and provide this to userland via getsockopt TCP_INFO.
====================

Link: https://patch.msgid.link/20250423124334.4916-1-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agotcp: fastopen: pass TFO child indication through getsockopt
Jeremy Harris [Wed, 23 Apr 2025 12:43:34 +0000 (13:43 +0100)] 
tcp: fastopen: pass TFO child indication through getsockopt

tcp: fastopen: pass TFO child indication through getsockopt

Note that this uses up the last bit of a field in struct tcp_info

Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-3-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agotcp: fastopen: note that a child socket was created
Jeremy Harris [Wed, 23 Apr 2025 12:43:33 +0000 (13:43 +0100)] 
tcp: fastopen: note that a child socket was created

tcp: fastopen: note that a child socket was created

This uses up the last bit in a field of tcp_sock.

Signed-off-by: Jeremy Harris <jgh@exim.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250423124334.4916-2-jgh@exim.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ip_gre: Fix spelling mistake "demultiplexor" -> "demultiplexer"
Colin Ian King [Wed, 23 Apr 2025 11:37:19 +0000 (12:37 +0100)] 
net: ip_gre: Fix spelling mistake "demultiplexor" -> "demultiplexer"

There is a spelling mistake in a pr_info message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250423113719.173539-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agorxrpc: rxgk: Fix some reference count leaks
Dan Carpenter [Wed, 23 Apr 2025 08:25:45 +0000 (11:25 +0300)] 
rxrpc: rxgk: Fix some reference count leaks

These paths should call rxgk_put(gk) but they don't.  In the
rxgk_construct_response() function the "goto error;" will free the
"response" skb as well calling rxgk_put() so that's a bonus.

Fixes: 9d1d2b59341f ("rxrpc: rxgk: Implement the yfs-rxgk security class (GSSAPI)")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/aAikCbsnnzYtVmIA@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: ethernet: mtk_eth_soc: convert cap_bit in mtk_eth_muxc struct to u64
Bo-Cun Chen [Wed, 23 Apr 2025 00:48:02 +0000 (01:48 +0100)] 
net: ethernet: mtk_eth_soc: convert cap_bit in mtk_eth_muxc struct to u64

With commit 51a4df60db5c2 ("net: ethernet: mtk_eth_soc: convert caps in
mtk_soc_data struct to u64") the capabilities bitfield was converted to
a 64-bit value, but a cap_bit in struct mtk_eth_muxc which is used to
store a full bitfield (rather than the bit number, as the name would
suggest) still holds only a 32-bit value.

Change the type of cap_bit to u64 in order to avoid truncating the
bitfield which results in path selection to not work with capabilities
above the 32-bit limit.

The values currently stored in the cap_bit field are
MTK_ETH_MUX_GDM1_TO_GMAC1_ESW:
 BIT_ULL(18) | BIT_ULL(5)

MTK_ETH_MUX_GMAC2_GMAC0_TO_GEPHY:
 BIT_ULL(19) | BIT_ULL(5) | BIT_ULL(6)

MTK_ETH_MUX_U3_GMAC2_TO_QPHY:
 BIT_ULL(20) | BIT_ULL(5) | BIT_ULL(6)

MTK_ETH_MUX_GMAC1_GMAC2_TO_SGMII_RGMII:
 BIT_ULL(20) | BIT_ULL(5) | BIT_ULL(7)

MTK_ETH_MUX_GMAC12_TO_GEPHY_SGMII:
 BIT_ULL(21) | BIT_ULL(5)

While all those values are currently still within 32-bit boundaries,
the addition of new capabilities of MT7988 as well as future SoC's
like MT7987 will exceed them. Also, the use of a 32-bit 'int' type to
store the result of a BIT_ULL(...) is misleading.

Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/ded98b0d716c3203017a7a92151516ec2bf1abee.1745369249.git.daniel@makrotopia.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agorxrpc: Remove deadcode
Dr. David Alan Gilbert [Tue, 22 Apr 2025 23:51:47 +0000 (00:51 +0100)] 
rxrpc: Remove deadcode

Remove three functions that are no longer used.

rxrpc_get_txbuf() last use was removed by 2020's
commit 5e6ef4f1017c ("rxrpc: Make the I/O thread take over the call and
local processor work")

rxrpc_kernel_get_epoch() last use was removed by 2020's
commit 44746355ccb1 ("afs: Don't get epoch from a server because it may be
ambiguous")

rxrpc_kernel_set_max_life() last use was removed by 2023's
commit db099c625b13 ("rxrpc: Fix timeout of a call that hasn't yet been
granted a channel")

Both of the rxrpc_kernel_* functions were documented.  Remove that
documentation as well as the code.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20250422235147.146460-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge branch 'net-bcmasp-add-v3-0-and-remove-v2-0'
Jakub Kicinski [Thu, 24 Apr 2025 23:59:55 +0000 (16:59 -0700)] 
Merge branch 'net-bcmasp-add-v3-0-and-remove-v2-0'

Justin Chen says:

====================
net: bcmasp: Add v3.0 and remove v2.0

asp-v2.0 had one supported SoC that never saw the light of day.
Given that it was the first iteration of the HW, it ended up with
some one off HW design decisions that were changed in futher iterations
of the HW. We remove support to simplify the code and make it easier to
add future revisions.

Add support for asp-v3.0. asp-v3.0 reduces the feature set for cost
savings. We reduce the number of channel/network filters. And also
remove some features and statistics.
====================

Link: https://patch.msgid.link/20250422233645.1931036-1-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phy: mdio-bcm-unimac: Add asp-v3.0
Justin Chen [Tue, 22 Apr 2025 23:36:45 +0000 (16:36 -0700)] 
net: phy: mdio-bcm-unimac: Add asp-v3.0

Add mdio compat string for asp-v3.0 ethernet driver.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-9-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: bcmasp: Add support for asp-v3.0
Justin Chen [Tue, 22 Apr 2025 23:36:44 +0000 (16:36 -0700)] 
net: bcmasp: Add support for asp-v3.0

The asp-v3.0 is a major HW revision that reduced the number of
channels and filters. The goal was to save cost by reducing the
feature set.

Changes for asp-v3.0
- Number of network filters were reduced.
- Number of channels were reduced.
- EDPKT stats were removed.
- Fix a bug with csum offload.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-8-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: brcm,unimac-mdio: Add asp-v3.0
Justin Chen [Tue, 22 Apr 2025 23:36:43 +0000 (16:36 -0700)] 
dt-bindings: net: brcm,unimac-mdio: Add asp-v3.0

The asp-v3.0 Ethernet controller uses a brcm unimac like its
predecessor.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-7-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: brcm,asp-v2.0: Add asp-v3.0
Justin Chen [Tue, 22 Apr 2025 23:36:42 +0000 (16:36 -0700)] 
dt-bindings: net: brcm,asp-v2.0: Add asp-v3.0

Add asp-v3.0 support. v3.0 is a major revision that reduces
the feature set for cost savings. We have a reduced amount of
channels and network filters.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-6-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: phy: mdio-bcm-unimac: Remove asp-v2.0
Justin Chen [Tue, 22 Apr 2025 23:36:41 +0000 (16:36 -0700)] 
net: phy: mdio-bcm-unimac: Remove asp-v2.0

Remove asp-v2.0 which will no longer be supported.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-5-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agonet: bcmasp: Remove support for asp-v2.0
Justin Chen [Tue, 22 Apr 2025 23:36:40 +0000 (16:36 -0700)] 
net: bcmasp: Remove support for asp-v2.0

The SoC that supported asp-v2.0 never saw the light of day. asp-v2.0 has
quirks that makes the logic overly complicated. For example, asp-v2.0 is
the only revision that has a different wake up IRQ hook up. Remove asp-v2.0
support to make supporting future HW revisions cleaner.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-4-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: brcm,unimac-mdio: Remove asp-v2.0
Justin Chen [Tue, 22 Apr 2025 23:36:39 +0000 (16:36 -0700)] 
dt-bindings: net: brcm,unimac-mdio: Remove asp-v2.0

Remove asp-v2.0 which was only supported on one SoC that never
saw the light of day.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-3-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agodt-bindings: net: brcm,asp-v2.0: Remove asp-v2.0
Justin Chen [Tue, 22 Apr 2025 23:36:38 +0000 (16:36 -0700)] 
dt-bindings: net: brcm,asp-v2.0: Remove asp-v2.0

Remove asp-v2.0 which was only supported on one SoC that never
saw the light of day.

Signed-off-by: Justin Chen <justin.chen@broadcom.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250422233645.1931036-2-justin.chen@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoselftests: iou-zcrx: Get the page size at runtime
Haiyue Wang [Sat, 19 Apr 2025 14:10:15 +0000 (22:10 +0800)] 
selftests: iou-zcrx: Get the page size at runtime

Use the API `sysconf()` to query page size at runtime, instead of using
hard code number 4096.

And use `posix_memalign` to allocate the page size aligned momory.

Signed-off-by: Haiyue Wang <haiyuewa@163.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20250419141044.10304-1-haiyuewa@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Jakub Kicinski [Thu, 24 Apr 2025 18:14:16 +0000 (11:14 -0700)] 
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Cross-merge networking fixes after downstream PR (net-6.15-rc4).

This pull includes wireless and a fix to vxlan which isn't
in Linus's tree just yet. The latter creates with a silent conflict
/ build breakage, so merging it now to avoid causing problems.

drivers/net/vxlan/vxlan_vnifilter.c
  094adad91310 ("vxlan: Use a single lock to protect the FDB table")
  087a9eb9e597 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry")
https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com

No "normal" conflicts, or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agovxlan: vnifilter: Fix unlocked deletion of default FDB entry
Ido Schimmel [Wed, 23 Apr 2025 14:51:31 +0000 (17:51 +0300)] 
vxlan: vnifilter: Fix unlocked deletion of default FDB entry

When a VNI is deleted from a VXLAN device in 'vnifilter' mode, the FDB
entry associated with the default remote (assuming one was configured)
is deleted without holding the hash lock. This is wrong and will result
in a warning [1] being generated by the lockdep annotation that was
added by commit ebe642067455 ("vxlan: Create wrappers for FDB lookup").

Reproducer:

 # ip link add vx0 up type vxlan dstport 4789 external vnifilter local 192.0.2.1
 # bridge vni add vni 10010 remote 198.51.100.1 dev vx0
 # bridge vni del vni 10010 dev vx0

Fix by acquiring the hash lock before the deletion and releasing it
afterwards. Blame the original commit that introduced the issue rather
than the one that exposed it.

[1]
WARNING: CPU: 3 PID: 392 at drivers/net/vxlan/vxlan_core.c:417 vxlan_find_mac+0x17f/0x1a0
[...]
RIP: 0010:vxlan_find_mac+0x17f/0x1a0
[...]
Call Trace:
 <TASK>
 __vxlan_fdb_delete+0xbe/0x560
 vxlan_vni_delete_group+0x2ba/0x940
 vxlan_vni_del.isra.0+0x15f/0x580
 vxlan_process_vni_filter+0x38b/0x7b0
 vxlan_vnifilter_process+0x3bb/0x510
 rtnetlink_rcv_msg+0x2f7/0xb70
 netlink_rcv_skb+0x131/0x360
 netlink_unicast+0x426/0x710
 netlink_sendmsg+0x75a/0xc20
 __sock_sendmsg+0xc1/0x150
 ____sys_sendmsg+0x5aa/0x7b0
 ___sys_sendmsg+0xfc/0x180
 __sys_sendmsg+0x121/0x1b0
 do_syscall_64+0xbb/0x1d0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250423145131.513029-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git...
Jakub Kicinski [Thu, 24 Apr 2025 18:10:57 +0000 (11:10 -0700)] 
Merge tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless

Johannes Berg says:

====================
Some more fixes, notably:
 * iwlwifi: various regression and iwlmld fixes
 * mac80211: fix TX frames in monitor mode
 * brcmfmac: error handling for firmware load

* tag 'wireless-2025-04-24' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: iwlwifi: restore missing initialization of async_handlers_list
  wifi: brcm80211: fmac: Add error handling for brcmf_usb_dl_writeimage()
  wifi: plfxlc: Remove erroneous assert in plfxlc_mac_release
  wifi: iwlwifi: fix the check for the SCRATCH register upon resume
  wifi: iwlwifi: don't warn if the NIC is gone in resume
  wifi: iwlwifi: mld: fix BAID validity check
  wifi: iwlwifi: back off on continuous errors
  wifi: iwlwifi: mld: only create debugfs symlink if it does not exist
  wifi: iwlwifi: mld: inform trans on init failure
  wifi: iwlwifi: mld: properly handle async notification in op mode start
  Revert "wifi: iwlwifi: make no_160 more generic"
  Revert "wifi: iwlwifi: add support for BE213"
  wifi: mac80211: restore monitor for outgoing frames
====================

Link: https://patch.msgid.link/20250424120535.56499-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 weeks agoMerge tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 24 Apr 2025 16:14:50 +0000 (09:14 -0700)] 
Merge tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "No fixes from any subtree.

  Current release - regressions:

   - net: fix the missing unlock for detached devices

  Previous releases - regressions:

   - sched: fix UAF vulnerability in HFSC qdisc

   - lwtunnel: disable BHs when required

   - mptcp: pm: defer freeing of MPTCP userspace path manager entries

   - tipc: fix NULL pointer dereference in tipc_mon_reinit_self()

   - eth: virtio-net: disable delayed refill when pausing rx

  Previous releases - always broken:

   - phylink: fix suspend/resume with WoL enabled and link down

   - eth:
       - mlx5: fix null-ptr-deref in mlx5_create_{inner_,}ttc_table()
       - xen-netfront: handle NULL returned by xdp_convert_buff_to_frame()
       - enetc: fix frame corruption on bpf_xdp_adjust_head/tail() and XDP_PASS
       - stmmac: fix dwmac1000 ptp timestamp status offset
       - pds_core: prevent possible adminq overflow/stuck condition

  Misc:

   - a bunch of MAINTAINERS updates"

* tag 'net-6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
  net: stmmac: fix multiplication overflow when reading timestamp
  net: stmmac: fix dwmac1000 ptp timestamp status offset
  net: dp83822: Fix OF_MDIO config check
  pds_core: make wait_context part of q_info
  pds_core: Remove unnecessary check in pds_client_adminq_cmd()
  pds_core: handle unsupported PDS_CORE_CMD_FW_CONTROL result
  pds_core: Prevent possible adminq overflow/stuck condition
  net: dsa: mt7530: sync driver-specific behavior of MT7531 variants
  selftests/tc-testing: Add test for HFSC queue emptying during peek operation
  net_sched: hfsc: Fix a potential UAF in hfsc_dequeue() too
  net_sched: hfsc: Fix a UAF vulnerability in class handling
  selftests: mptcp: diag: use mptcp_lib_get_info_value
  mptcp: pm: Defer freeing of MPTCP userspace path manager entries
  net: ethernet: mtk_eth_soc: net: revise NETSYSv3 hardware configuration
  tipc: fix NULL pointer dereference in tipc_mon_reinit_self()
  virtio-net: disable delayed refill when pausing rx
  net: phy: leds: fix memory leak
  net: phylink: mac_link_(up|down)() clarifications
  net: phylink: fix suspend/resume with WoL enabled and link down
  net: lwtunnel: disable BHs when required
  ...

7 weeks agoMerge tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Thu, 24 Apr 2025 16:10:01 +0000 (09:10 -0700)] 
Merge tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:

 - Revert acomp multibuffer tests which were buggy

 - Fix off-by-one regression in new scomp code

 - Lower quality setting on atmel-sha204a as it may not be random

* tag 'v6.15-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: atmel-sha204a - Set hwrng quality to lowest possible
  crypto: scomp - Fix off-by-one bug when calculating last page
  Revert "crypto: testmgr - Add multibuffer acomp testing"

7 weeks agonet: phy: marvell-88q2xxx: Enable temperature sensor for mv88q211x
Niklas Söderlund [Fri, 18 Apr 2025 14:58:00 +0000 (16:58 +0200)] 
net: phy: marvell-88q2xxx: Enable temperature sensor for mv88q211x

The temperature sensor enabled for mv88q222x devices also functions for
mv88q211x based devices. Unify the two devices probe functions to enable
the sensors for all devices supported by this driver.

The same oddity as for mv88q222x devices exists, the PHY link must be up
for a correct temperature reading to be reported.

    # cat /sys/class/hwmon/hwmon9/temp1_input
    -75000

    # ifconfig end5 up

    # cat /sys/class/hwmon/hwmon9/temp1_input
    59000

Worth noting is that while the temperature register offsets and layout
are the same between mv88q211x and mv88q222x devices their names in the
datasheets are different. This change keeps the mv88q222x names for the
mv88q211x support.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Dimitri Fedrau <dima.fedrau@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250418145800.2420751-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoMerge branch 'net-stmmac-fix-timestamp-snapshots-on-dwmac1000'
Paolo Abeni [Thu, 24 Apr 2025 09:50:45 +0000 (11:50 +0200)] 
Merge branch 'net-stmmac-fix-timestamp-snapshots-on-dwmac1000'

Alexis Lothore says:

====================
net: stmmac: fix timestamp snapshots on dwmac1000

this is the v2 of a small series containing two small fixes for the
timestamp snapshot feature on stmmac, especially on dwmac1000 version.
Those issues have been detected on a socfpga (Cyclone V) platform. They
kind of follow the big rework sent by Maxime at the end of last year to
properly split this feature support between different versions of the
DWMAC IP.

v1: https://lore.kernel.org/r/20250422-stmmac_ts-v1-0-b59c9f406041@bootlin.com

Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
====================

Link: https://patch.msgid.link/20250423-stmmac_ts-v2-0-e2cf2bbd61b1@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agonet: stmmac: fix multiplication overflow when reading timestamp
Alexis Lothoré [Wed, 23 Apr 2025 07:12:10 +0000 (09:12 +0200)] 
net: stmmac: fix multiplication overflow when reading timestamp

The current way of reading a timestamp snapshot in stmmac can lead to
integer overflow, as the computation is done on 32 bits. The issue has
been observed on a dwmac-socfpga platform returning chaotic timestamp
values due to this overflow. The corresponding multiplication is done
with a MUL instruction, which returns 32 bit values. Explicitly casting
the value to 64 bits replaced the MUL with a UMLAL, which computes and
returns the result on 64 bits, and so returns correctly the timestamps.

Prevent this overflow by explicitly casting the intermediate value to
u64 to make sure that the whole computation is made on u64. While at it,
apply the same cast on the other dwmac variant (GMAC4) method for
snapshot retrieval.

Fixes: 477c3e1f6363 ("net: stmmac: Introduce dwmac1000 timestamping operations")
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250423-stmmac_ts-v2-2-e2cf2bbd61b1@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agonet: stmmac: fix dwmac1000 ptp timestamp status offset
Alexis Lothore [Wed, 23 Apr 2025 07:12:09 +0000 (09:12 +0200)] 
net: stmmac: fix dwmac1000 ptp timestamp status offset

When a PTP interrupt occurs, the driver accesses the wrong offset to
learn about the number of available snapshots in the FIFO for dwmac1000:
it should be accessing bits 29..25, while it is currently reading bits
19..16 (those are bits about the auxiliary triggers which have generated
the timestamps). As a consequence, it does not compute correctly the
number of available snapshots, and so possibly do not generate the
corresponding clock events if the bogus value ends up being 0.

Fix clock events generation by reading the correct bits in the timestamp
register for dwmac1000.

Fixes: 477c3e1f6363 ("net: stmmac: Introduce dwmac1000 timestamping operations")
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250423-stmmac_ts-v2-1-e2cf2bbd61b1@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agonet: dp83822: Fix OF_MDIO config check
Johannes Schneider [Wed, 23 Apr 2025 04:47:24 +0000 (06:47 +0200)] 
net: dp83822: Fix OF_MDIO config check

When CONFIG_OF_MDIO is set to be a module the code block is not
compiled. Use the IS_ENABLED macro that checks for both built in as
well as module.

Fixes: 5dc39fd5ef35 ("net: phy: DP83822: Add ability to advertise Fiber connection")
Signed-off-by: Johannes Schneider <johannes.schneider@leica-geosystems.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20250423044724.1284492-1-johannes.schneider@leica-geosystems.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoMerge branch 'ipv6-no-rtnl-for-ipv6-routing-table'
Paolo Abeni [Thu, 24 Apr 2025 07:30:02 +0000 (09:30 +0200)] 
Merge branch 'ipv6-no-rtnl-for-ipv6-routing-table'

Kuniyuki Iwashima says:

====================
ipv6: No RTNL for IPv6 routing table.

IPv6 routing tables are protected by each table's lock and work in
the interrupt context, which means we basically don't need RTNL to
modify an IPv6 routing table itself.

Currently, the control paths require RTNL because we may need to
perform device and nexthop lookups; we must prevent dev/nexthop from
going away from the netns.

This, however, can be achieved by RCU as well.

If we are in the RCU critical section while adding an IPv6 route,
synchronize_net() in __dev_change_net_namespace() and
unregister_netdevice_many_notify() guarantee that the dev will not be
moved to another netns or removed.

Also, nexthop is guaranteed not to be freed during the RCU grace period.

If we care about a race between nexthop removal and IPv6 route addition,
we can get rid of RTNL from the control paths.

Patch 1 moves a validation for RTA_MULTIPATH earlier.
Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE.
Patch 3 ~ 11 moves validation and memory allocation earlier.
Patch 12 prevents a race between two requests for the same table.
Patch 13 & 14 prevents the nexthop race mentioned above.
Patch 15 removes RTNL for SIOCADDRT and RTM_NEWROUTE.

Test:

The script [0] lets each CPU-X create 100000 routes on table-X in a
batch.

On c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  total routes: 19200006
  Time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  total routes: 19200006
  Time elapsed: 62854 milliseconds.

I changed the number of routes (1000 ~ 100000 per CPU/table) and
consistently saw it finish 3x faster with this series.

[0]

mkdir tmp

NS="test"
ip netns add $NS
ip -n $NS link add veth0 type veth peer veth1
ip -n $NS link set veth0 up
ip -n $NS link set veth1 up

TABLES=()
for i in $(seq $(nproc)); do
    TABLES+=("$i")
done

ROUTES=()
for i in {1..100}; do
    for j in {1..1000}; do
ROUTES+=("2001:$i:$j::/64")
    done
done

for TABLE in "${TABLES[@]}"; do
    (
FILE="./tmp/batch-table-$TABLE.txt"
> $FILE
for ROUTE in "${ROUTES[@]}"; do
            echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE
done
    ) &
done

wait

echo "start adding routes"

START_TIME=$(date +%s%3N)
for TABLE in "${TABLES[@]}"; do
    ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" &
done

wait
END_TIME=$(date +%s%3N)
ELAPSED_TIME=$((END_TIME - START_TIME))

echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)."
echo "total routes: $(ip -n $NS -6 route show table all | wc -l)"  # Just for debug
echo "Time elapsed: ${ELAPSED_TIME} milliseconds."

ip netns del $NS
rm -fr ./tmp/

v2: https://lore.kernel.org/netdev/20250409011243.26195-1-kuniyu@amazon.com/
v1: https://lore.kernel.org/netdev/20250321040131.21057-1-kuniyu@amazon.com/
====================

Link: https://patch.msgid.link/20250418000443.43734-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:56 +0000 (17:03 -0700)] 
ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.

Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE.

The remaining things to do are

  1. pass false to lwtunnel_valid_encap_type_attr()
  2. use rcu_dereference_rtnl() in fib6_check_nexthop()
  3. place rcu_read_lock() before ip6_route_info_create_nh().

Let's complete the RTNL-free conversion.

When each CPU-X adds 100000 routes on table-X in a batch
concurrently on c7a.metal-48xl EC2 instance with 192 CPUs,

without this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 191577 milliseconds.

with this series:

  $ sudo ./route_test.sh
  ...
  added 19200000 routes (100000 routes * 192 tables).
  time elapsed: 62854 milliseconds.

I changed the number of routes in each table (1000 ~ 100000)
and consistently saw it finish 3x faster with this series.

Note that now every caller of lwtunnel_valid_encap_type() passes
false as the last argument, and this can be removed later.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Protect nh->f6i_list with spinlock and flag.
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:55 +0000 (17:03 -0700)] 
ipv6: Protect nh->f6i_list with spinlock and flag.

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we may be going to add a route tied to a dying nexthop.

The nexthop itself is not freed during the RCU grace period, but
if we link a route after __remove_nexthop_fib() is called for the
nexthop, the route will be leaked.

To avoid the race between IPv6 route addition under RCU vs nexthop
deletion under RTNL, let's add a dead flag and protect it and
nh->f6i_list with a spinlock.

__remove_nexthop_fib() acquires the nexthop's spinlock and sets false
to nh->dead, then calls ip6_del_rt() for the linked route one by one
without the spinlock because fib6_purge_rt() acquires it later.

While adding an IPv6 route, fib6_add() acquires the nexthop lock and
checks the dead flag just before inserting the route.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:54 +0000 (17:03 -0700)] 
ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().

The next patch adds per-nexthop spinlock which protects nh->f6i_list.

When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock.
fib6_add_rt2node() could call fib6_purge_rt() for another route, which
could holds another nexthop lock.

Then, deadlock could happen between two nexthops.

Let's defer fib6_purge_rt() after fib6_add_rt2node().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Protect fib6_link_table() with spinlock.
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:53 +0000 (17:03 -0700)] 
ipv6: Protect fib6_link_table() with spinlock.

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

If the request specifies a new table ID, fib6_new_table() is
called to create a new routing table.

Two concurrent requests could specify the same table ID, so we
need a lock to protect net->ipv6.fib_table_hash[h].

Let's add a spinlock to protect the hash bucket linkage.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Factorise ip6_route_multipath_add().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:52 +0000 (17:03 -0700)] 
ipv6: Factorise ip6_route_multipath_add().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, the RCU section will start before ip6_route_info_create_nh()
in ip6_route_multipath_add(), but ip6_route_info_create() is called
in the same loop and will sleep.

Let's split the loop into ip6_route_mpath_info_create() and
ip6_route_mpath_info_create_nh().

Note that ip6_route_info_append() is now integrated into
ip6_route_mpath_info_create_nh() because we need to call different
free functions for nexthops that passed ip6_route_info_create_nh().

In case of failure, the remaining nexthops that ip6_route_info_create_nh()
has not been called for will be freed by ip6_route_mpath_info_cleanup().

OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked
to a local temporary list, which will be spliced back to rt6_nh_list.
In case of failure, these nexthops will be released by fib6_info_release()
in ip6_route_multipath_add().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Rename rt6_nh.next to rt6_nh.list.
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:51 +0000 (17:03 -0700)] 
ipv6: Rename rt6_nh.next to rt6_nh.list.

ip6_route_multipath_add() allocates struct rt6_nh for each config
of multipath routes to link them to a local list rt6_nh_list.

struct rt6_nh.next is the list node of each config, so the name
is quite misleading.

Let's rename it to list.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Don't pass net to ip6_route_info_append().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:50 +0000 (17:03 -0700)] 
ipv6: Don't pass net to ip6_route_info_append().

net is not used in ip6_route_info_append() after commit 36f19d5b4f99
("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case").

Let's remove the argument.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:49 +0000 (17:03 -0700)] 
ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().

ip6_route_info_create_nh() will be called under RCU.

It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output.

As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid
GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU.

Let's preallocate it in ip6_route_info_create().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:48 +0000 (17:03 -0700)] 
ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

ip6_route_info_create_nh() will be called under RCU.

Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation
is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes
and we would see a bunch of this message in dmesg.

  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left
  percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left

Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().

If something fails before the original memory allocation in
fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(),
which releases the preallocated per-cpu memory.

Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via
ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Split ip6_route_info_create().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:47 +0000 (17:03 -0700)] 
ipv6: Split ip6_route_info_create().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely
on RCU to guarantee dev and nexthop lifetime.

Then, we want to allocate as much as possible before entering
the RCU section.

The RCU section will start in the middle of ip6_route_info_create(),
and this is problematic for ip6_route_multipath_add() that calls
ip6_route_info_create() multiple times.

Let's split ip6_route_info_create() into two parts; one for memory
allocation and another for nexthop setup.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Move nexthop_find_by_id() after fib6_info_alloc().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:46 +0000 (17:03 -0700)] 
ipv6: Move nexthop_find_by_id() after fib6_info_alloc().

We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT.

Then, we must perform two lookups for nexthop and dev under RCU
to guarantee their lifetime.

ip6_route_info_create() calls nexthop_find_by_id() first if
RTA_NH_ID is specified, and then allocates struct fib6_info.

nexthop_find_by_id() must be called under RCU, but we do not want
to use GFP_ATOMIC for memory allocation here, which will be likely
to fail in ip6_route_multipath_add().

Let's move nexthop_find_by_id() after the memory allocation so
that we can later split ip6_route_info_create() into two parts:
the sleepable part and the RCU part.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:45 +0000 (17:03 -0700)] 
ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().

In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each
entry.  If it returns false, the request fails.

rt6_qualify_for_ecmp() returns false if either of the conditions below
is true:

  1. f6i->fib6_flags has RTF_ADDRCONF
  2. f6i->nh is not NULL
  3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC

1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF
to cfg->fc_flags.

2. is equivalent with cfg->fc_nh_id.

3. can be replaced by checking RTF_GATEWAY in the base and each multipath
entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when
cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always
false.

These checks do not require RCU and can be done earlier.

Let's perform the equivalent checks in rtm_to_fib6_multipath_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:44 +0000 (17:03 -0700)] 
ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().

ip6_route_info_create() is called from 3 functions:

  * ip6_route_add()
  * ip6_route_multipath_add()
  * addrconf_f6i_alloc()

addrconf_f6i_alloc() does not need validation for struct fib6_config in
ip6_route_info_create().

ip6_route_multipath_add() calls ip6_route_info_create() for multiple
routes with slightly different fib6_config instances, which is copied
from the base config passed from userspace.  So, we need not validate
the same config repeatedly.

Let's move such validation into rtm_to_fib6_config().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:43 +0000 (17:03 -0700)] 
ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.

Basically, removing an IPv6 route does not require RTNL because
the IPv6 routing tables are protected by per table lock.

inet6_rtm_delroute() calls nexthop_find_by_id() to check if the
nexthop specified by RTA_NH_ID exists.  nexthop uses rbtree and
the top-down walk can be safely performed under RCU.

ip6_route_del() already relies on RCU and the table lock, but we
need to extend the RCU critical section a bit more to cover
__ip6_del_rt().  For example, nexthop_for_each_fib6_nh() and
inet6_rt_notify() needs RCU.

Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and
get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT.

Even if the nexthop is removed after rcu_read_unlock() in
inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes
tied to the nexthop, and ip6_route_del() returns -ESRCH.  So the
request was at least valid as of nexthop_find_by_id(), and it's just
a matter of timing.

Note that we need to pass false to lwtunnel_valid_encap_type_attr().
The following patches also use the newroute bool.

Note also that fib6_get_table() does not require RCU because once
allocated fib6_table is not freed until netns dismantle.  I will
post a follow-up series to convert such callers to RCU-lockless
version.  [0]

Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
7 weeks agoipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
Kuniyuki Iwashima [Fri, 18 Apr 2025 00:03:42 +0000 (17:03 -0700)] 
ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().

We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then
we want to perform some validation out of the RCU scope.

When creating / removing an IPv6 route with RTA_MULTIPATH,
inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY
in each multipath entry.

Let's do that in rtm_to_fib6_config().

Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with
0 entries, which was accepted but should result in -EINVAL as
RTM_NEWROUTE.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>