git.ipfire.org Git - thirdparty/kernel/stable.git/log

udp: remove busylock and add per NUMA queues

busylock was protecting UDP sockets against packet floods,
but unfortunately was not protecting the host itself.

Under stress, many cpus could spin while acquiring the busylock,
and NIC had to drop packets. Or packets would be dropped
in cpu backlog if RPS/RFS were in place.

This patch replaces the busylock by intermediate
lockless queues. (One queue per NUMA node).

This means that fewer number of cpus have to acquire
the UDP receive queue lock.

Most of the cpus can either:
- immediately drop the packet.
- or queue it in their NUMA aware lockless queue.

Then one of the cpu is chosen to process this lockless queue
in a batch.

The batch only contains packets that were cooked on the same
NUMA node, thus with very limited latency impact.

Tested:

DDOS targeting a victim UDP socket, on a platform with 6 NUMA nodes
(Intel(R) Xeon(R) 6985P-C)

Before:

nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1004179            0.0
Udp6InErrors                    3117               0.0
Udp6RcvbufErrors                3117               0.0

After:
nstat -n ; sleep 1 ; nstat | grep Udp
Udp6InDatagrams                 1116633            0.0
Udp6InErrors                    14197275           0.0
Udp6RcvbufErrors                14197275           0.0

We can see this host can now proces 14.2 M more packets per second
while under attack, and the victim socket can receive 11 % more
packets.

I used a small bpftrace program measuring time (in us) spent in
__udp_enqueue_schedule_skb().

Before:

@udp_enqueue_us[398]:
[0]                24901 |@@@                                                 |
[1]                63512 |@@@@@@@@@                                           |
[2, 4)            344827 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            244673 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[8, 16)            54022 |@@@@@@@@                                            |
[16, 32)          222134 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[32, 64)          232042 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[64, 128)           4219 |                                                    |
[128, 256)           188 |                                                    |

After:

@udp_enqueue_us[398]:
[0]              5608855 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]              1111277 |@@@@@@@@@@                                          |
[2, 4)            501439 |@@@@                                                |
[4, 8)            102921 |                                                    |
[8, 16)            29895 |                                                    |
[16, 32)           43500 |                                                    |
[32, 64)           31552 |                                                    |
[64, 128)            979 |                                                    |
[128, 256)            13 |                                                    |

Note that the remaining bottleneck for this platform is in
udp_drops_inc() because we limited struct numa_drop_counters
to only two nodes so far.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250922104240.2182559-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'dibs-direct-internal-buffer-sharing'

Alexandra Winter says:

====================
dibs - Direct Internal Buffer Sharing

This series introduces a generic abstraction of existing components like:
- the s390 specific ISM device (Internal Shared Memory),
- the SMC-D loopback mechanism (Shared Memory Communication - Direct)
- the client interface of the SMC-D module to the transport devices
This generic shim layer can be extended with more devices, more clients and
more features in the future.

This layer is called 'dibs' for Direct Internal Buffer Sharing based on the
common scheme that these mechanisms enable controlled sharing of memory
buffers within some containing entity such as a hypervisor or a Linux
instance.

Benefits:
- Cleaner separation of ISM and SMC-D functionality
- simpler and less module dependencies
- Clear interface definition.
- Extendable for future devices and clients.

An overview was given at the Netdev 0x19 conference, recordings and slides
are available [1].

Background / Status quo:
------------------------
Currently s390 hardware provides virtual PCI ISM devices (Internal Shared
Memory). Their driver is in drivers/s390/net/ism_drv.c. The main user is
SMC-D (net/smc). The ism driver offers a client interface so other
users/protocols can also use them, but it is still heavily intermingled
with the smc code. Namely, the ism module cannot be used without the smc
module, which feels artificial.

There is ongoing work to extend the ISM concept of shared buffers that can
be accessed directly by another instance on the same hardware: [2] proposed
a loopback interface (ism_lo), that can be used on non-s390 architectures
(e.g. between containers or to test SMC-D). A minimal implementation went
upstream with [3]: ism_lo currently is a part of the smc protocol and
rather hidden.

[4] proposed a virtio definition of ism (ism_virtio) that can be used
between kvm guests.

We will shortly send an RFC for an dibs client that uses dibs as transport
for TTY.

Concept:
--------
Create a shim layer in net/dibs that contains common definitions and code
for all dibs devices and all dibs clients. Any device or client module only
needs to depend on this dibs layer module and any device or client code
only needs to include the definitions in include/linux/dibs.h.

The name dibs was chosen to clearly distinguish it from the existing s390
ism devices. And to emphasize that it is not about sharing whole memory
regions with anybody, but dedicating single buffers for another system.

Implementation:
---------------
The end result of this series is: A dibs shim layer with
- One dibs client: smc-d
- Two dibs device drivers: ism and dibs-loopback
- Everything prepared to add more clients and more device drivers.

Patches 1-2 contain some issues that were found along the way. They make
sense on their own, but also enable a better structured dibs series.

There are three components that exist today:
a) smc module (especially SMC-D functionality, which is an ism client today)
b) ism device driver (supports multiple ism clients today)
c) smc-loopback (integrated with smc today)
In order to preserve existing functionality at each step, these are not
moved to dibs layer by component, instead:
- the dibs layer is established in parallel to existing code [patches 3-6]
- then some service functions are moved to the dibs layer [patches 7-12]
- the actual data movement is moved to the dibs layer [patch 13]
- and last event handling is moved to the dibs layer [patch 14]

Future:
-------
Items that are not part of this patchset but can be added later:
- dynamically add or remove dibs_loopback. That will be allow for simple
  testing of add_dev()/del_dev()
- handle_irq(): Call clients without interrupt context. e.g using
  threaded interrupts. I left this for a follow-on, because it includes
  conceptual changes for the smcd receive code.
- Any improvements of locking scopes. I mainly moved some of the the
  existing locks to dibs layer. I have the feeling there is room for
  improvements.
- The device drivers should not loop through the client array
- dibs_dev_op.*_dmb() functions reveal unnecessary details of the
  internal dmb struct to the clients
- Check whether client calls to dibs_dev_ops should be replaced by
  interface functions that provide additional value
- Check whether device driver calls to dibs_client_ops should be replaced
  by interface functions that provide additional value.

Link: [1] https://netdevconf.info/0x19/sessions/talk/communication-via-internal-shared-memory-ism-time-to-open-up.html
Link: [2] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
Link: [3] https://lore.kernel.org/linux-kernel//20240428060738.60843-1-guwen@linux.alibaba.com/
Link: [4] https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=c060ecf9-ea1a-49a2-9827-c92f0e6447b2&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
====================

Link: https://patch.msgid.link/20250918110500.1731261-1-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Move event handling to dibs layer

Add defines for all event types and subtypes an ism device is known to
produce as it can be helpful for debugging purposes.

Introduces a generic 'struct dibs_event' and adopt ism device driver
and smc-d client accordingly. Tolerate and ignore other type and subtype
values to enable future device extensions.

SMC-D and ISM are now independent.
struct ism_dev can be moved to drivers/s390/net/ism.h.

Note that in smc, the term 'ism' is still used. Future patches could
replace that with 'dibs' or 'smc-d' as appropriate.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-15-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Move data path to dibs layer

Use struct dibs_dmb instead of struct smc_dmb and move the corresponding
client tables to dibs_dev. Leave driver specific implementation details
like sba in the device drivers.

Register and unregister dmbs via dibs_dev_ops. A dmb is dedicated to a
single client, but a dibs device can have dmbs for more than one client.

Trigger dibs clients via dibs_client_ops->handle_irq(), when data is
received into a dmb. For dibs_loopback replace scheduling an smcd receive
tasklet with calling dibs_client_ops->handle_irq().

For loopback devices attach_dmb(), detach_dmb() and move_data() need to
access the dmb tables, so move those to dibs_dev_ops in this patch as well.

Remove remaining definitions of smc_loopback as they are no longer
required, now that everything is in dibs_loopback.

Note that struct ism_client and struct ism_dev are still required in smc
until a follow-on patch moves event handling to dibs. (Loopback does not
use events).

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-14-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Move query_remote_gid() to dibs_dev_ops

Provide the dibs_dev_ops->query_remote_gid() in ism and dibs_loopback
dibs_devices. And call it in smc dibs_client.

Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-13-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Move vlan support to dibs_dev_ops

It can be debated how much benefit definition of vlan ids for dibs devices
brings, as the dmbs are accessible only by a single peer anyhow. But ism
provides vlan support and smcd exploits it, so move it to dibs layer as an
optional feature.

smcd_loopback simply ignores all vlan settings, do the same in
dibs_loopback.

SMC-D and ISM have a method to use the invalid VLAN ID 1FFF
(ISM_RESERVED_VLANID), to indicate that both communication peers support
routable SMC-Dv2. Tolerate it in dibs, but move it to SMC only.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-12-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Local gid for dibs devices

Define a uuid_t GID attribute to identify a dibs device.

SMC uses 64 Bit and 128 Bit Global Identifiers (GIDs) per device, that
need to be sent via the SMC protocol. Because the smc code uses integers,
network endianness and host endianness need to be considered. Avoid this
in the dibs layer by using uuid_t byte arrays. Future patches could change
SMC to use uuid_t. For now conversion helper functions are introduced.

ISM devices provide 64 Bit GIDs. Map them to dibs uuid_t GIDs like this:
_________________________________________
| 64 Bit ISM-vPCI GID | 00000000_00000000 |
-----------------------------------------
If interpreted as UUID [1], this would be interpreted as the UIID variant,
that is reserved for NCS backward compatibility. So it will not collide
with UUIDs that were generated according to the standard.

smc_loopback already uses version 4 UUIDs as 128 Bit GIDs, move that to
dibs loopback. A temporary change to smc_lo_query_rgid() is required,
that will be moved to dibs_loopback with a follow-on patch.

Provide gid of a dibs device as sysfs read-only attribute.

Link: https://datatracker.ietf.org/doc/html/rfc4122
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-11-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Create class dibs

Create '/sys/class/dibs' to represent multiple kinds of dibs devices in
sysfs. Show s390/ism devices as well as dibs_loopback devices.

Show attribute fabric_id using dibs_ops.get_fabric_id(). This can help
users understand which dibs devices are connected to the same fabric in
different systems and which dibs devices are loopback devices
(fabric_id 0xffff)

Instead of using the same name as the pci device, give the ism devices
their own readable names based on uid or fid from the HW definition.

smc_loopback was never visible in sysfs. dibs_loopback is now represented
as a virtual device.

For the SMC feature "software defined pnet-id" either the ib device name or
the PCI-ID (actually the parent device name) can be used for SMC-R entries.
Mimic this behaviour for SMC-D, and check the parent device name as well.
So device name or PCI-ID can be used for ism and device name can be used
for dibs-loopback. Note that this:
IB_DEVICE_NAME_MAX - 1 == smc_pnet_policy.[SMC_PNETID_IBNAME].len
is the length of smcd_name. Future SW-pnetid cleanup patches to could use a
meaningful define, but that would touch too much unrelated code here.

Examples:
---------
ism before:
> ls /sys/bus/pci/devices/0000:00:00.0/0000:00:00.0
uevent

ism now:
> ls /sys/bus/pci/devices/0000:00:00.0/dibs/ism30
device -> ../../../0000:00:00.0/
fabric_id
subsystem -> ../../../../../class/dibs/
uevent

dibs loopback:
> ls /sys/devices/virtual/dibs/lo/
fabric_id
subsystem -> ../../../../class/dibs/
uevent

dibs class:
> ls -l /sys/class/dibs/
ism30 -> ../../devices/pci0000:00/0000:00:00.0/dibs/ism30/
lo -> ../../devices/virtual/dibs/lo/

For comparison:
> ls -l /sys/class/net/
enc8410 -> ../../devices/qeth/0.0.8410/net/enc8410/
ens1693 -> ../../devices/pci0001:00/0001:00:00.0/net/ens1693/
lo -> ../../devices/virtual/net/lo/

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-10-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Move struct device to dibs_dev

Move struct device from ism_dev and smc_lo_dev to dibs_dev, and define a
corresponding release function. Free ism_dev in ism_remove() and smc_lo_dev
in smc_lo_dev_remove().

Replace smcd->ops->get_dev(smcd) by using dibs->dev directly.

An alternative design would be to embed dibs_dev as a field in ism_dev and
do the same for other dibs device driver specific structs. However that
would have the disadvantage that each dibs device driver needs to allocate
dibs_dev and each dibs device driver needs a different device release
function. The advantage would be that ism_dev and other device driver
specific structs would be covered by device reference counts.

Signed-off-by: Julian Ruess <julianr@linux.ibm.com>
Co-developed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-9-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Define dibs_client_ops and dibs_dev_ops

Move the device add() and remove() functions from ism_client to
dibs_client_ops and call add_dev()/del_dev() for ism devices and
dibs_loopback devices. dibs_client_ops->add_dev() = smcd_register_dev() for
the smc_dibs_client. This is the first step to handle ism and loopback
devices alike (as dibs devices) in the smc dibs client.

Define dibs_dev->ops and move smcd_ops->get_chid to
dibs_dev_ops->get_fabric_id() for ism and loopback devices. See below for
why this needs to be in the same patch as dibs_client_ops->add_dev().

The following changes contain intermediate steps, that will be obsoleted by
follow-on patches, once more functionality has been moved to dibs:

Use different smcd_ops and max_dmbs for ism and loopback. Follow-on patches
will change SMC-D to directly use dibs_ops instead of smcd_ops.

In smcd_register_dev() it is now necessary to identify a dibs_loopback
device before smcd_dev and smcd_ops->get_chid() are available. So provide
dibs_dev_ops->get_fabric_id() in this patch and evaluate it in
smc_ism_is_loopback().

Call smc_loopback_init() in smcd_register_dev() and call
smc_loopback_exit() in smcd_unregister_dev() to handle the functionality
that is still in smc_loopback. Follow-on patches will move all smc_loopback
code to dibs_loopback.

In smcd_[un]register_dev() use only ism device name, this will be replaced
by dibs device name by a follow-on patch.

End of changes with intermediate parts.

Allocate an smcd event workqueue for all dibs devices, although
dibs_loopback does not generate events.

Use kernel memory instead of devres memory for smcd_dev and smcd->conn.
Since commit a72178cfe855 ("net/smc: Fix dependency of SMC on ISM") an ism
device and its driver can have a longer lifetime than the smc module, so
smc should not rely on devres to free its resources [1]. It is now the
responsibility of the smc client to free smcd and smcd->conn for all dibs
devices, ism devices as well as loopback. Call client->ops->del_dev() for
all existing dibs devices in dibs_unregister_client(), so all device
related structures can be freed in the client.

When dibs_unregister_client() is called in the context of smc_exit() or
smc_core_reboot_event(), these functions have already called
smc_lgrs_shutdown() which calls smc_smcd_terminate_all(smcd) and sets
going_away. This is done a second time in smcd_unregister_dev(). This is
analogous to how smcr is handled in these functions, by calling first
smc_lgrs_shutdown() and then smc_ib_unregister_client() >
smc_ib_remove_dev(), so leave it that way. It may be worth investigating,
whether smc_lgrs_shutdown() is still required or useful.

Remove CONFIG_SMC_LO. CONFIG_DIBS_LO now controls whether a dibs loopback
device exists or not.

Link: https://www.kernel.org/doc/Documentation/driver-model/devres.txt
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-8-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Define dibs loopback

The first stage of loopback-ism was implemented as part of the
SMC module [1]. Now that we have the dibs layer, provide access to a
dibs_loopback device to all dibs clients.

This is the first step of moving loopback-ism from net/smc/smc_loopback.*
to drivers/dibs/dibs_loopback.*. One global structure lo_dev is allocated
and added to the dibs devices. Follow-on patches will move functionality.

Same as smc_loopback, dibs_loopback is provided by a config option.
Note that there is no way to dynamically add or remove the loopback
device. That could be a future improvement.

When moving code to drivers/dibs, replace ism_ prefix with dibs_ prefix.
As this is mostly a move of existing code, copyright and authors are
unchanged.

Link: https://lore.kernel.org/lkml/20240428060738.60843-1-guwen@linux.alibaba.com/
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-7-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Register ism as dibs device

Register ism devices with the dibs layer. Follow-on patches will move
functionality to the dibs layer.

As DIBS is only a shim layer without any dependencies, we can depend ISM
on DIBS without adding indirect dependencies. A follow-on patch will
remove implication of SMC by ISM.

Define struct dibs_dev. Follow-on patches will move more content into
dibs_dev. The goal of follow-on patches is that ism_dev will only
contain fields that are special for this device driver. The same concept
will apply to other dibs device drivers.

Define dibs_dev_alloc(), dibs_dev_add() and dibs_dev_del() to be called
by dibs device drivers and call them from ism_drv.c
Use ism_dev.dibs for a pointer to dibs_dev.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-6-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Register smc as dibs_client

Formally register smc as dibs client. Functionality will be moved by
follow-on patches from ism_client to dibs_client until eventually
ism_client can be removed.
As DIBS is only a shim layer without any dependencies, we can depend SMC
on DIBS without adding indirect dependencies. A follow-on patch will
remove dependency of SMC on ISM.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-5-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

dibs: Create drivers/dibs

Create the file structure for a 'DIBS - Direct Internal Buffer Sharing'
shim layer that will provide generic functionality and declarations for
dibs device drivers and dibs clients.

Following patches will add functionality.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://patch.msgid.link/20250918110500.1731261-4-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/smc: Decouple sf and attached send_buf in smc_loopback

Before this patch there was the following assumption in
smc_loopback.c>smc_lo_move_data():
sf (signalling flag) == 0 : data is already in an attached target dmb
sf == 1 : data is not yet in the target dmb

This is true for the 2 callers in smc client
smcd_cdc_msg_send() : sf=1
smcd_tx_rdma_writes() : sf=0
but should not be a general assumption.

Add a bool to struct smc_buf_desc to indicate whether an SMC-D sndbuf_desc
is an attached buffer. Don't call move_data() for attached send_buffers,
because it is not necessary.

Move the data in smc_lo_move_data() if len != 0 and signal when requested.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-3-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net/smc: Remove error handling of unregister_dmb()

smcd_buf_free() calls smc_ism_unregister_dmb(lgr->smcd, buf_desc) and
then unconditionally frees buf_desc.

Remove the cleaning up of fields of buf_desc in
smc_ism_unregister_dmb(), because it is not helpful.

This removes the only usage of ISM_ERROR from the smc module. So move it
to drivers/s390/net/ism.h.

Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://patch.msgid.link/20250918110500.1731261-2-wintera@linux.ibm.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'tcp-update-bind-bucket-state-on-port-release'

Jakub Sitnicki says:

====================
tcp: Update bind bucket state on port release

TL;DR
-----

This is another take on addressing the issue we already raised earlier [1].

This time around, instead of trying to relax the bind-conflict checks in
connect(), we make an attempt to fix the tcp bind bucket state accounting.

The goal of this patch set is to make the bind buckets return to "port
reusable by ephemeral connections" state when all sockets blocking the port
from reuse get unhashed.

Changelog
---------
Changes in v5:
- Fix initial port-addr bucket state on saddr update with ip_dynaddr=1
- Add Kuniyuki's tag for tests
- Link to v4: https://lore.kernel.org/r/20250913-update-bind-bucket-state-on-unhash-v4-0-33a567594df7@cloudflare.com

Changes in v4:
- Drop redundant sk_is_connect_bind helper doc comment
- Link to v3: https://lore.kernel.org/r/20250910-update-bind-bucket-state-on-unhash-v3-0-023caaf4ae3c@cloudflare.com

Changes in v3:
- Move the flag from inet_flags to sk_userlocks (Kuniyuki)
- Rename the flag from AUTOBIND to CONNECT_BIND to avoid a name clash (Kuniyuki)
- Drop unreachable code for sk_state == TCP_NEW_SYN_RECV (Kuniyuki)
- Move the helper to inet_hashtables where it's used
- Reword patch 1 description for conciseness
- Link to v2: https://lore.kernel.org/r/20250821-update-bind-bucket-state-on-unhash-v2-0-0c204543a522@cloudflare.com

Changes in v2:
- Rename the inet_sock flag from LAZY_BIND to AUTOBIND (Eric)
- Clear the AUTOBIND flag on disconnect path (Eric)
- Add a test to cover the disconnect case (Eric)
- Link to RFC v1: https://lore.kernel.org/r/20250808-update-bind-bucket-state-on-unhash-v1-0-faf85099d61b@cloudflare.com

Situation
---------

We observe the following scenario in production:

                                                  inet_bind_bucket
                                                state for port 54321
                                                --------------------

                                                (bucket doesn't exist)

// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(192.51.100.1, 443)
                                                tb->fastreuse = -1
                                                tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0
s2.connect(192.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()

                                                // bucket remains in this
                                                // state even though port
                                                // was released by s2
                                                tb->fastreuse = 0
                                                tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 192.51.100.1:443.
// Assume only port 54321 is still available.

s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(192.51.100.1, 443) -> EADDRNOTAVAIL (99)

Problem
-------

We end up in a state where Process A can't reuse ephemeral port 54321 for
as long as there are sockets, like s1, that keep the bind bucket alive. The
bucket does not return to "reusable" state even when all sockets which
blocked it from reuse, like s2, are gone.

The ephemeral port becomes available for use again only after all sockets
bound to it are gone and the bind bucket is destroyed.

Programs which behave like Process B in this scenario - that is, binding to
an IP address without setting IP_BIND_ADDRESS_NO_PORT - might be considered
poorly written. However, the reality is that such implementation is not
actually uncommon. Trying to fix each and every such program is like
playing whack-a-mole.

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

        dialer := &net.Dialer{
                LocalAddr: &net.TCPAddr{IP: srcIP},
        }
        conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

        $ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.

Solution
--------

Please see the description in patch 1.

[1] https://lore.kernel.org/r/20250714-connect-port-search-harder-v3-0-b1a41f249865@cloudflare.com

Reported-by: Lee Valentine <lvalentine@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
====================

Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-0-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

selftests/net: Test tcp port reuse after unbinding a socket

Exercise the scenario described in detail in the cover letter:

  1) socket A: connect() from ephemeral port X
  2) socket B: explicitly bind() to port X
  3) check that port X is now excluded from ephemeral ports
  4) close socket B to release the port bind
  5) socket C: connect() from ephemeral port X

As well as a corner case to test that the connect-bind flag is cleared:

  1) connect() from ephemeral port X
  2) disconnect the socket with connect(AF_UNSPEC)
  3) bind() it explicitly to port X
  4) check that port X is now excluded from ephemeral ports

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-2-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

tcp: Update bind bucket state on port release

Today, once an inet_bind_bucket enters a state where fastreuse >= 0 or
fastreuseport >= 0 after a socket is explicitly bound to a port, it remains
in that state until all sockets are removed and the bucket is destroyed.

In this state, the bucket is skipped during ephemeral port selection in
connect(). For applications using a reduced ephemeral port
range (IP_LOCAL_PORT_RANGE socket option), this can cause faster port
exhaustion since blocked buckets are excluded from reuse.

The reason the bucket state isn't updated on port release is unclear.
Possibly a performance trade-off to avoid scanning bucket owners, or just
an oversight.

Fix it by recalculating the bucket state when a socket releases a port. To
limit overhead, each inet_bind2_bucket stores its own (fastreuse,
fastreuseport) state. On port release, only the relevant port-addr bucket
is scanned, and the overall state is derived from these.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250917-update-bind-bucket-state-on-unhash-v5-1-57168b661b47@cloudflare.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'tcp-move-few-fields-for-data-locality'

Eric Dumazet says:

====================
tcp: move few fields for data locality

After recent additions (PSP and AccECN) I wanted to make another
round on fields locations to increase data locality.

This series manages to shrink TCP and TCPv6 objects by 128 bytes,
but more importantly should reduce number of touched cache lines
in TCP fast paths.

There is more to come.

v2: removed tcp CACHELINE_ASSERT_GROUP_SIZE after a kernel build bot
reported an error.
====================

Link: https://patch.msgid.link/20250919204856.2977245-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: reclaim 8 bytes in struct request_sock_queue

synflood_warned had to be u32 for xchg(), but ensuring
atomicity is not really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move mtu_info to remove two 32bit holes

This removes 8bytes waste on 64bit builds.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp_clean_acked to tcp_sock_read_tx group

tp->tcp_clean_acked is fetched in tx path when snd_una is updated.

This field thus belongs to tcp_sock_read_tx group.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-7-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move recvmsg_inq to tcp_sock_read_txrx

Fill a hole in tcp_sock_read_txrx, instead of possibly wasting
a cache line.

Note that tcp_recvmsg_locked() is also reading tp->repair,
so this removes one cache line miss in tcp recvmsg().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp->rcv_tstamp to tcp_sock_write_txrx group

tcp_ack() writes this field, it belongs to tcp_sock_write_txrx.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: remove CACHELINE_ASSERT_GROUP_SIZE() uses

Maintaining the CACHELINE_ASSERT_GROUP_SIZE() uses
for struct tcp_sock has been painful.

This had little benefit, so remove them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move sk->sk_err_soft and sk->sk_sndbuf

sk->sk_sndbuf is read-mostly in tx path, so move it from
sock_write_tx group to more appropriate sock_read_tx.

sk->sk_err_soft was not identified previously, but
is used from tcp_ack().

Move it to sock_write_tx group for better cache locality.

Also change tcp_ack() to clear sk->sk_err_soft only if needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move sk_uid and sk_protocol to sock_read_tx

sk_uid and sk_protocol are read from inet6_csk_route_socket()
for each TCP transmit.

Also read from udpv6_sendmsg(), udp_sendmsg() and others.

Move them to sock_read_tx for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250919204856.2977245-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-more-functionality-to-bnge'

Bhargava Marreddy says:

====================
Add more functionality to BNGE

This patch series adds the infrastructure to make the netdevice
functional. It allocates data structures for core resources,
followed by their initialisation and registration with the firmware.
The core resources include the RX, TX, AGG, CMPL, and NQ rings,
as well as the VNIC. RX/TX functionality will be introduced in the
next patch series to keep this one at a reviewable size.
====================

Link: https://patch.msgid.link/20250919174742.24969-1-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Configure default VNIC

Add functions to add a filter to the VNIC to configure unicast
addresses. Also, add multicast, broadcast, and promiscuous settings
to the default VNIC.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-11-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Register default VNIC

Allocate the default VNIC with the firmware and configure its RSS,
HDS, and Jumbo parameters. Add related functions to support VNIC
configuration for these parameters.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-10-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Register rings with the firmware

Enable ring functionality by registering RX, AGG, TX, CMPL, and
NQ rings with the firmware. Initialise the doorbells associated
with the rings.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-9-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Allocate stat contexts

Allocate the hardware statistics context with the firmware and
register DMA memory required for ring statistics. This helps the
driver to collect ring statistics provided by the firmware.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-8-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Allocate packet buffers

Populate packet buffers into the RX and AGG rings while these
rings are being initialized.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-7-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Initialise core resources

Add initial settings to all core resources, such as
the RX, AGG, TX, CQ, and NQ rings, as well as the VNIC.
This will help enable these resources in future patches.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-6-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Introduce VNIC

Add the VNIC-specific structures and DMA memory necessary to support
UC/MC and RSS functionality.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-5-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Add initial support for CP and NQ rings

Allocate CP and NQ related data structures and add support to
associate NQ and CQ rings. Also, add the association of NQ, NAPI,
and interrupts.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-4-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: Add initial support for RX and TX rings

Allocate data structures to support RX, AGG, and TX rings.
While data structures for RX/AGG rings are allocated,
initialise the page pool accordingly.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-3-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

bng_en: make bnge_alloc_ring() self-unwind on failure

Ensure bnge_alloc_ring() frees any intermediate allocations
when it fails. This enables later patches to rely on this
self-unwinding behavior.

Signed-off-by: Bhargava Marreddy <bhargava.marreddy@broadcom.com>
Reviewed-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Rajashekar Hudumula <rajashekar.hudumula@broadcom.com>
Link: https://patch.msgid.link/20250919174742.24969-2-bhargava.marreddy@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-replace-wq-users-and-add-wq_percpu-to-alloc_workqueue-users'

Marco Crivellari says:

====================
net: replace wq users and add WQ_PERCPU to alloc_workqueue() users

Below is a summary of a discussion about the Workqueue API and cpu isolation
considerations. Details and more information are available here:

  "workqueue: Always use wq_select_unbound_cpu() for WORK_CPU_UNBOUND."

Link: https://lore.kernel.org/20250221112003.1dSuoGyc@linutronix.de
=== Current situation: problems ===

Let's consider a nohz_full system with isolated CPUs: wq_unbound_cpumask is
set to the housekeeping CPUs, for !WQ_UNBOUND the local CPU is selected.

This leads to different scenarios if a work item is scheduled on an isolated
CPU where "delay" value is 0 or greater then 0:
        schedule_delayed_work(, 0);

This will be handled by __queue_work() that will queue the work item on the
current local (isolated) CPU, while:

        schedule_delayed_work(, 1);

Will move the timer on an housekeeping CPU, and schedule the work there.

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

=== Plan and future plans ===

This patchset is the first stone on a refactoring needed in order to
address the points aforementioned; it will have a positive impact also
on the cpu isolation, in the long term, moving away percpu workqueue in
favor to an unbound model.

These are the main steps:
1)  API refactoring (that this patch is introducing)
    -   Make more clear and uniform the system wq names, both per-cpu and
        unbound. This to avoid any possible confusion on what should be
        used.

    -   Introduction of WQ_PERCPU: this flag is the complement of WQ_UNBOUND,
        introduced in this patchset and used on all the callers that are not
        currently using WQ_UNBOUND.

        WQ_UNBOUND will be removed in a future release cycle.

        Most users don't need to be per-cpu, because they don't have
        locality requirements, because of that, a next future step will be
        make "unbound" the default behavior.

2)  Check who really needs to be per-cpu
    -   Remove the WQ_PERCPU flag when is not strictly required.

3)  Add a new API (prefer local cpu)
    -   There are users that don't require a local execution, like mentioned
        above; despite that, local execution yeld to performance gain.

        This new API will prefer the local execution, without requiring it.

=== Introduced Changes by this series ===

1) [P 1-2] Replace use of system_wq and system_unbound_wq

        system_wq is a per-CPU workqueue, but his name is not clear.
        system_unbound_wq is to be used when locality is not required.

        Because of that, system_wq has been renamed in system_percpu_wq, and
        system_unbound_wq has been renamed in system_dfl_wq.

2) [P 3] add WQ_PERCPU to remaining alloc_workqueue() users

        Every alloc_workqueue() caller should use one among WQ_PERCPU or
        WQ_UNBOUND.

        WQ_UNBOUND will be removed in a next release cycle.
====================

Link: https://patch.msgid.link/20250918142427.309519-1-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: WQ_PERCPU added to alloc_workqueue users

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly
request the use of the per-CPU behavior. Both flags coexist for one release
cycle to allow callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: replace use of system_wq with system_percpu_wq

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: replace use of system_unbound_wq with system_dfl_wq

Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.

This lack of consistentcy cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-2-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-09-19 (ice, idpf, iavf, ixgbevf, fm10k)

Paul adds support for Earliest TxTime First (ETF) hardware offload
for E830 devices on ice. ETF is configured per-queue using tc-etf Qdisc;
a new Tx flow mechanism utilizes a dedicated timestamp ring alongside
the standard Tx ring. The timestamp ring contains descriptors that
specify when hardware should transmit packets; up to 2048 Tx queues can
be supported.

Additional info: https://lore.kernel.org/intel-wired-lan/20250818132257.21720-1-paul.greenwalt@intel.com/

Dave removes excess cleanup call to ice_lag_move_new_vf_nodes() in error
path.

Milena adds reporting of timestamping statistics to idpf.

Alex changes error variable type for code clarity for iavf and ixgbevf.

Brahmajit Das removes unused parameter from fm10k_unbind_hw_stats_q().

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  net: intel: fm10k: Fix parameter idx set but not used
  ixgbevf: fix proper type for error code in ixgbevf_resume()
  iavf: fix proper type for error code in iavf_resume()
  idpf: add HW timestamping statistics
  ice: Remove deprecated ice_lag_move_new_vf_nodes() call
  ice: add E830 Earliest TxTime First Offload support
  ice: move ice_qp_[ena|dis] for reuse
====================

Link: https://patch.msgid.link/20250919175412.653707-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: ax88796b: Replace hard-coded values with PHY_ID_MATCH_MODEL()

Use the PHY_ID_MATCH_MODEL() macro instead of hardcoding the values in
asix_driver[] and asix_tbl[].

In asix_tbl[], the macro also uses designated initializers instead of
positional initializers, which allows the struct fields to be reordered.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20250919103944.854845-2-thorsten.blum@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: xilinx: axienet: Fix kernel-doc warnings for missing return descriptions

Add missing "Return:" sections to kernel-doc comments for four functions:
- axienet_calc_cr()
- axienet_device_reset()
- axienet_free_tx_chain()
- axienet_dim_coalesce_count_rx()

Also standardize the return documentation format by replacing inline
"Returns" text with proper "Return:" tags as per kernel documentation
guidelines.

Fixes below kernel-doc warnings:
- Warning: No description found for return value of 'axienet_calc_cr'
- Warning: No description found for return value of 'axienet_device_reset'
- Warning: No description found for return value of 'axienet_free_tx_chain'
- Warning: No description found for return value of
'axienet_dim_coalesce_count_rx'

Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Link: https://patch.msgid.link/20250919103754.434711-1-suraj.gupta2@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-dsa-microchip-add-strap-description-to-set-spi-as-interface-bus'

Bastien Curutchet says:

====================
net: dsa: microchip: Add strap description to set SPI as interface bus

At reset, the KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing
(by mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.

This small series aims to allow to configure the KSZ8463 switch at
reset when the hardware straps are missing.

PATCH 0 and 1 add a new property to the bindings that describes the GPIOs
to be set during reset in order to configure the switch properly.

PATCH 2 implements the use of these properties in the driver.
====================

Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-0-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: microchip: Set SPI as bus interface during reset for KSZ8463

At reset, the KSZ8463 uses a strap-based configuration to set SPI as
bus interface. SPI is the only bus supported by the driver. If the
required pull-ups/pull-downs are missing (by mistake or by design to
save power) the pins may float and the configuration can go wrong
preventing any communication with the switch.

Introduce a ksz8463_configure_straps_spi() function called during the
device reset. It relies on the 'straps-rxd-gpios' OF property and the
'reset' pinmux configuration to enforce SPI as bus interface.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-3-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: microchip: Add strap description to set SPI mode

At reset, KSZ8463 uses a strap-based configuration to set SPI as
interface bus. If the required pull-ups/pull-downs are missing (by
mistake or by design to save power) the pins may float and the
configuration can go wrong preventing any communication with the switch.

Add a 'reset' pinmux state
Add a KSZ8463 specific strap description that can be used by the driver
to drive the strap pins during reset. Two GPIOs are used. Users must
describe either both of them or none of them.

Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-2-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: dsa: microchip: Group if clause under allOf tag

Upcoming patch adds a new if/then clause. It requires to be grouped with
the already existing if/then clause under an 'allOf:' tag.

Move the if/then clause under the already existing 'allOf:' tag to
prepare next patch.

Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Bastien Curutchet (Schneider Electric) <bastien.curutchet@bootlin.com>
Link: https://patch.msgid.link/20250918-ksz-strap-pins-v3-1-16662e881728@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Tariq Toukan says:

====================
mlx5-next updates 2025-09-21

* tag 'mlx5-next-counters' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Add uar access and odp page fault counters
====================

Link: https://patch.msgid.link/1758443940-708689-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-rework-sfp-capability-parsing-and-quirks'

Russell King says:

====================
net: rework SFP capability parsing and quirks

The original SPF module parsing was implemented prior to gaining any
quirks, and was designed such that the upstream calls the parsing
functions to get the translated capabilities of the module.

SFP quirks were then added to cope with modules that didn't correctly
fill out their ID EEPROM. The quirk function was called from
sfp_parse_support() to allow quirks to modify the ethtool link mode
masks.

Using just ethtool link mode masks eventually lead to difficulties
determining the correct phy_interface_t mode, so a bitmap of these
modes were added - needing both the upstream API and quirks to be
updated.

We have had significantly more SFP module quirks added since, some
which are modifying the ID EEPROM as a way of influencing the data
we provide to the upstream - for example, sfp_fixup_10gbaset_30m()
changes id.base.connector so we report PORT_TP. This could be done
more cleanly if the quirks had access to the parsed SFP port.

In order to improve flexibility, and to simplify some of the upstream
code, we group all module capabilities into a single structure that
the upstream can access via sfp_module_get_caps(). This will allow
the module capabilities to be expanded if required without reworking
all the infrastructure and upstreams again.

In this series, we rework the SFP code to use the capability structure
and then rework all the upstream implementations, finally removing the
old kernel internal APIs.
====================

Link: https://patch.msgid.link/aMnaoPjIuzEAsESZ@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: remove old sfp_parse_* functions

Remove the old sfp_parse_*() functions that are now no longer used.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVz-000000061Wj-13Yd@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: update all PHYs to use sfp_get_module_caps()

Update all PHYs to use sfp_get_module_caps() rather than the
sfp_parse_*() family of functions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVu-000000061Wd-0cAG@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phylink: use sfp_get_module_caps()

Use sfp_get_module_caps() to get SFP module's capabilities.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVp-000000061WW-08YM@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: provide sfp_get_module_caps()

Provide a function to retrieve the current sfp_module_caps structure
so that upstreams can get the entire module support in one go.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVj-000000061WQ-3q47@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: convert sfp quirks to modify struct sfp_module_support

In order to provide extensible module support properties, arrange for
the SFP quirks to modify any member of the sfp_module_support struct,
rather than just the ethtool link modes and interfaces.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVe-000000061WK-3KwI@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sfp: pre-parse the module support

Pre-parse the module support on insert rather than when the upstream
requests the data. This will allow more flexible and extensible
parsing.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVZ-000000061WE-2pXD@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: add phy_interface_copy()

Add a helper for copying PHY interface bitmasks. This will be used by
the SFP bus code, which will then be moved to phylink in the subsequent
patches.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uydVU-000000061W8-2IDT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-pm-netlink-announce-server-side-flag'

Matthieu Baerts says:

====================
mptcp: pm: netlink: announce server-side flag

Now that the 'flags' attribute is used, it seems interesting to add one
flag for 'server-side', a boolean value.

Here are a few patches related to the 'server-side' attribute:

- Patch 1: only announce this attribute on the server side.

- Patch 2: announce the 'server-side' flag when this is the case.

- Patch 3: deprecate the 'server-side' attribute.

- Patch 4: use the 'server-side' flag in the selftests.

- Patches 5, 6: small cleanups when working on code around.
====================

Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-0-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: remove unused returned value of check_data_fin

When working on a fix modifying mptcp_check_data_fin(), I noticed the
returned value was no longer used.

It looks like it was used for 3 days, between commit 7ed90803a213
("mptcp: send explicit ack on delayed ack_seq incr") and commit
ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling").

This returned value can be safely removed.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-6-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: use _BITUL() instead of (1 << x)

Simply to use the proper way to declare bits, and to align with all
other flags declared in this file.

No functional changes intended.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-5-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: pm: get server-side flag

server-side info linked to the MPTCP connect/established events can now
come from the flags, in addition to the dedicated attribute.

The attribute is now deprecated -- in favour of the new flag, and will
be removed later on.

Print this info only once.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-4-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: netlink: deprecate server-side attribute

Now that such info is in the 'flags' attribute, it is time to deprecate
the dedicated 'server-side' attribute.

It will be removed in a few versions.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-3-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: netlink: announce server-side flag

Now that the 'flags' attribute is used, it seems interesting to add one
flag for 'server-side', a boolean value.

This is duplicating the info from the dedicated 'server-side' attribute,
but it will be deprecated in the next commit, and removed in a few
versions.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-2-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: pm: netlink: only add server-side attr when true

This attribute is a boolean. No need to add it to set it to 'false'.

Indeed, the default value when this attribute is not set is naturally
'false'. A few bytes can then be saved by not adding this attribute if
the connection is not on the server side.

This prepares the future deprecation of its attribute, in favour of a
new flag.

Reviewed-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250919-net-next-mptcp-server-side-flag-v1-1-a97a5d561a8b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: spacemit: Make stats_lock softirq-safe

While most of the statistics functions (emac_get_stats64() and such) are
called with softirqs enabled, emac_stats_timer() is, as its name
suggests, also called from a timer, i.e. called in softirq context.

All of these take stats_lock. Therefore, make stats_lock softirq-safe by
changing spin_lock() into spin_lock_bh() for the functions that get
statistics.

Also, instead of directly calling emac_stats_timer() in emac_up() and
emac_resume(), set the timer to trigger instead, so that
emac_stats_timer() is only called from the timer. It will keep using
spin_lock().

This fixes a lockdep warning, and potential deadlock when stats_timer is
triggered in the middle of getting statistics.

Fixes: bfec6d7f2001 ("net: spacemit: Add K1 Ethernet MAC")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Closes: https://lore.kernel.org/all/a52c0cf5-0444-41aa-b061-a0a1d72b02fe@samsung.com/
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20250919-k1-ethernet-fix-lock-v1-1-c8b700aa4954@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-enetc-improve-the-interface-for-obtaining-phc_index'

Wei Fang says:

====================
net: enetc: improve the interface for obtaining phc_index

The first patch is to fix the issue that a sleeping function is called
in the context of rcu_read_lock(). The second patch is to use the generic
API instead of the custom API to get phc_index. In addition, the second
patch depends on the first patch to work.

v1: https://lore.kernel.org/20250918074454.1742328-1-wei.fang@nxp.com
====================

Link: https://patch.msgid.link/20250919084509.1846513-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: use generic interfaces to get phc_index for ENETC v1

The commit 61f132ca8c46 ("ptp: add helpers to get the phc_index by
of_node or dev") has added two generic interfaces to get the phc_index
of the PTP clock. This eliminates the need for PTP device drivers to
provide custom APIs for consumers to retrieve the phc_index. This has
already been implemented for ENETC v4 and is also applicable to ENETC
v1. Therefore, the global variable enetc_phc_index is removed from the
driver. ENETC v1 now uses the same interface as v4 to get phc_index.

Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250919084509.1846513-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: enetc: fix sleeping function called from rcu_read_lock() context

The rcu_read_lock() has been introduced in __ethtool_get_ts_info() since
the commit 4c61d809cf60 ("net: ethtool: Fix suspicious rcu_dereference
usage"). Therefore, the device drivers cannot use any sleeping functions
when implementing the callback of ethtool_ops::get_ts_info(). Currently,
pci_get_slot() is used in enetc_get_ts_info(), but it calls down_read()
which might sleep, so this is a potential issue. Therefore, to fix this
issue, pci_get_domain_bus_and_slot() is used to replace pci_get_slot()
in enetc_get_ts_info().

Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Closes: https://lore.kernel.org/netdev/20250918124823.t3xlzn7w2glzkhnx@skbuf/
Fixes: f5b9a1cde0a2 ("net: enetc: add PTP synchronization support for ENETC v4")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20250919084509.1846513-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-clean-up-inet_hash-and-inet_unhash'

Kuniyuki Iwashima says:

====================
tcp: Clean up inet_hash() and inet_unhash().

While reviewing the ehash fix series from Xuanqiang Luo [0],
I noticed that inet_twsk_hashdance_schedule() checks the
retval of __sk_nulls_del_node_init_rcu(), which looks confusing.

The test exists from the pre-git era:

$ git blame -L:tcp_tw_hashdance net/ipv4/tcp_minisocks.c e48c414ee61f4~

Patch 3 is to clarify that the retval check is unnecessary in
inet_twsk_hashdance_schedule(), but I'll delegate its removal
to the Xuanqiang's series.

Patch 1 & 2 are minor cleanups.

[0]: https://lore.kernel.org/netdev/20250916103054.719584-4-xuanqiang.luo@linux.dev/
====================

Link: https://patch.msgid.link/20250919083706.1863217-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Remove redundant sk_unhashed() in inet_unhash().

inet_unhash() checks sk_unhashed() twice at the entry and after locking
ehash/lhash bucket.

The former was somehow added redundantly by commit 4f9bf2a2f5aa ("tcp:
Don't acquire inet_listen_hashbucket::lock with disabled BH.").

inet_unhash() is called for the full socket from 4 places, and it is
always under lock_sock() or the socket is not yet published to other
threads:

  1. __sk_prot_rehash()
     -> called from inet_sk_reselect_saddr(), which has
        lockdep_sock_is_held()

  2. sk_common_release()
     -> called when inet_create() or inet6_create() fail, then the
        socket is not yet published

  3. tcp_set_state()
     -> calls tcp_call_bpf_2arg(), and tcp_call_bpf() has
        sock_owned_by_me()

  4. inet_ctl_sock_create()
     -> creates a kernel socket and unhashes it immediately, but TCP
        socket is not hashed in sock_create_kern() (only SOCK_RAW is)

So we do not need to check sk_unhashed() twice before/after ehash/lhash
lock in inet_unhash().

Let's remove the 2nd one.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Remove inet6_hash().

inet_hash() and inet6_hash() are exactly the same.

Also, we do not need to export inet6_hash().

Let's consolidate the two into __inet_hash() and rename it to inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Remove osk from __inet_hash() arg.

__inet_hash() is called from inet_hash() and inet6_hash with osk NULL.

Let's remove the 2nd arg from __inet_hash().

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250919083706.1863217-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: forwarding: Reorder (ar)ping arguments to obey POSIX getopt

Quoted from musl wiki:

  GNU getopt permutes argv to pull options to the front, ahead of
  non-option arguments. musl and the POSIX standard getopt stop
  processing options at the first non-option argument with no
  permutation.

Thus these scripts stop working on musl since non-option arguments for
tools using getopt() (in this case, (ar)ping) do not always come last.
Fix it by reordering arguments.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250919053538.1106753-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-netpoll-remove-dead-code-and-speed-up-rtnl-locked-region'

Breno Leitao says:

====================
net: netpoll: remove dead code and speed up rtnl-locked region

This patchset introduces two minor modernizations to the netpoll
infrastructure:

The first patch removes the unused netpoll pointer from the netpoll_info
structure. This member is redundant and its presence does not benefit
multi-instance setups, as reported by Jay Vosburgh. Eliminating it cleans up
the structure and removes unnecessary code.

The second patch updates the netpoll resource cleanup routine to use
synchronize_net() instead of synchronize_rcu(). As __netpoll_free() is always
called under the RTNL lock, using synchronize_net() leverages the more
efficient synchronize_rcu_expedited() in these contexts, reducing time spent in
critical sections and improving performance.

Both changes simplify maintenance and enhance efficiency without altering
netpoll behavior.
====================

Link: https://patch.msgid.link/20250918-netpoll_jv-v1-0-67d50eeb2c26@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netpoll: use synchronize_net() instead of synchronize_rcu()

Replace synchronize_rcu() with synchronize_net() in __netpoll_free().

synchronize_net() is RTNL-aware and will use the more efficient
synchronize_rcu_expedited() when called under RTNL lock, avoiding
the potentially expensive synchronize_rcu() in RTNL critical sections.

Since __netpoll_free() is called with RTNL held (as indicated by
ASSERT_RTNL()), this change improves performance by reducing the
time spent in the RTNL critical section.

Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250918-netpoll_jv-v1-2-67d50eeb2c26@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: netpoll: remove unused netpoll pointer from netpoll_info

The netpoll_info structure contains an useless pointer back to its
associated netpoll. This field is never used, and the assignment in
__netpoll_setup() is does not comtemplate multiple instances, as
reported by Jay[1].

Drop both the member and its initialization to simplify the structure.

Link: https://lore.kernel.org/all/2930648.1757463506@famine/
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250918-netpoll_jv-v1-1-67d50eeb2c26@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipv4-some-drop-reason-cleanup-and-improvements'

Antoine Tenart says:

====================
net: ipv4: some drop reason cleanup and improvements

A few patches that were laying around cleaning up and improving drop
reasons in net/ipv4.
====================

Link: https://patch.msgid.link/20250915091958.15382-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: convert ip_rcv_options to drop reasons

This converts the only path not returning drop reasons in
ip_rcv_finish_core.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250915091958.15382-4-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: simplify drop reason handling in ip_rcv_finish_core

Instead of setting the drop reason to SKB_DROP_REASON_NOT_SPECIFIED
early and having to reset it each time it is overridden by a function
returned value, just set the drop reason to the expected value before
returning from ip_rcv_finish_core.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://patch.msgid.link/20250915091958.15382-3-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv4: make udp_v4_early_demux explicitly return drop reason

udp_v4_early_demux already returns drop reasons as it either returns 0
or ip_mc_validate_source, which itself returns drop reasons. Its return
value is also already used as a drop reason itself.

Makes this explicit by making it return drop reasons.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250915091958.15382-2-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

rtnetlink: add needed_{head,tail}room attributes

Various network interface types make use of needed_{head,tail}room values
to efficiently reserve buffer space for additional encapsulation headers,
such as VXLAN, Geneve, IPSec, etc. However, it is not currently possible
to query these values in a generic way.

Introduce ability to query the needed_{head,tail}room values of a network
device via rtnetlink, such that applications that may wish to use these
values can do so.

For example, Cilium agent iterates over present devices based on user config
(direct routing, vxlan, geneve, wireguard etc.) and in future will configure
netkit in order to expose the needed_{head,tail}room into K8s pods. See
b9ed315d3c4c ("netkit: Allow for configuring needed_{head,tail}room").

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alasdair McWilliam <alasdair@mcwilliam.dev>
Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20250917095543.14039-1-alasdair@mcwilliam.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-remove-mac_interface'

Russell King says:

====================
net: stmmac: remove mac_interface

The dwmac core supports a range of interfaces, but when it comes to
SerDes interfaces, the core itself does not include the SerDes block.
Consequently, it has to provide an interface suitable to interface such
a block to, and uses TBI for this.

The driver also uses "PCS" for RGMII, even though the dwmac PCS block
is not present for this interface type - it was a convenice for the
code structure as RGMII includes inband signalling of the PHY state,
much like Cisco SGMII does at a high level.

As such, the code refers to RGMII and SGMII modes for the PCS, and
there used to be STMMAC_PCS_TBI and STMMAC_PCS_RTBI constants as well
but these were never set, although they were used in the code.

The selection of the PCS mode was from mac_interface. Thus, it seems
that the original intention was for mac_interface to describe the
interface mode used within the dwmac core, and phy_interface to
describe the external world-accessible interface (e.g. which would
connect to a PHY or SFP cage.)

It appears that many glue drivers misinterpret these. A good exmple
is socfpga. This supports SGMII and 1000BASE-X, but does not include
the dwmac PCS, relying on the Lynx PCS instead. However, it makes use
of mac_interface to configure the dwmac core to its GMII/MII mode.

So, when operating in either of these modes, the dwmac is configured
for GMII mode to communicate with the Lynx PCS which then provides
the SGMII or 1000BASE-X interface mode to the external world.

Given that phy_interface is the external world interface, and
mac_interface is the dwmac core interface, selecting the interface
mode based on mac_interface being 1000BASEX makes no sense.

Moreover, mac_interface is initialised by the stmmac core code. If
the "mac-mode" property is set in DT, this will be used. Otherwise,
it will reflect the "phy-mode" property - meaning that it defaults
to phy_interface. As no in-kernel DT makes reference to a "mac-mode"
property, we can assume that for all in-kernel platforms, these two
interface variables are the same. The exception are those platform
glues which I reviwed and suggested they use phy_interface, setting
mac_interface to PHY_INTERFACE_MODE_NA.

The conclusion to all of this is that mac_interface serves no useful
purpose, and causes confusion as the platform glue code doesn't seem
to know which one should be used.

Thus, let's get rid of mac_interface, which makes all this code more
understandable. It also helps to untangle some of the questions such
as:
- should this be using the interface passed from phylink
- should we set the range of phylink supported interfaces to be
  more than one
- when we get phylink PCS support for the dwmac PCS, should we be
  selecting it based on mac_interface or phy_interface, and how
  should we populate the PCS' supported_interface bitmap.

Having converted socfpga to use phy_interface, this turns out to
feel like "the right way" to do this - convert the external world
"phy_interface" to whatever phy_intf_sel value that the dwmac core
needs to achieve the connection to whatever hardware blocks are
integrated inside the SoC to achieve the requested external world
interface.

As an illustration why - consider that in the case of socfpga, it
_could_ have been implemented such that the dwmac PCS was used for
SGMII, and the Lynx PCS for 1000BASE-X, or vice versa. Only the
platform glue would know which it is.

I will also note that several cores provide their currently configured
interface mode via the ACTPHYIF field of Hardware Feature 0, and thus
can be read back in the platform-independent parts of the driver to
decide whether the internal PCS or the RGMII (or actually SMII) "PCS"
should be used.

This is hot-off-the-press, and has only been build tested. As I have
none of these platforms, this series has not been run-tested, so
please test on your hardware.
====================

Link: https://patch.msgid.link/aMrPpc8oRxqGtVPJ@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove mac_interface

mac_interface has served little purpose, and has only caused confusion.
Now that we have cleaned up all platform glue drivers which should not
have been using mac_interface, there are no users remaining. Remove
mac_interface.

This results in the special dwmac specific "mac-mode" DT property
becoming redundant, and an in case, no DTS files in the kernel make use
of this property. Add a warning if the property is set, and it is
different from the "phy-mode".

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Vladimir Zapolskiy <vz@mleia.com>
Link: https://patch.msgid.link/E1uytpv-00000006H2x-196h@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: thead: convert to use phy_interface

dwmac-thead supports either MII or RGMII interface modes only.

None of the DTS files set "mac-mode", so mac_interface will be
identical to phy_interface.

Convert dwmac-thead to use phy_interface when determining the
interface mode rather than mac_interface.

Also convert the error prints to use phy_modes() so that we get a
meaningful string rather than a number for the interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpq-00000006H2q-0ajY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: sun8i: convert to use phy_interface

dwmac-sun8i supports MII, RMII and RGMII interface modes only. It
is unclear whether the dwmac core interface is different from the
one presented to the outside world.

However, as none of the DTS files set "mac-mode", mac_interface will
be identical to phy_interface.

Convert dwmac-sun8i to use phy_interface when determining the
interface mode rather than mac_interface.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chen-Yu Tsai <wens@csie.org>
Link: https://patch.msgid.link/E1uytpl-00000006H2k-08pH@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: stm32: convert to use phy_interface

dwmac-stm32 supports MII, RMII, GMII and RGMII interface modes,
selecting the dwmac core interface mode via bits 23:21 of the
SYSCFG register. The bit combinations are identical to the
dwmac phy_intf_sel_i signals.

None of the DTS files set "mac-mode", so mac_interface will be
identical to phy_interface.

Convert dwmac-stm32 to use phy_interface when determining the
interface mode rather than mac_interface.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpf-00000006H2c-3hiU@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: starfive: convert to use phy_interface

dwmac-starfive uses RMII or RGMII interface modes without any PCS,
and selects the dwmac core accordingly using a register field with
the same bit encoding as the core's phy_intf_sel_i signals.

None of the DTS files set "mac-mode", so mac_interface will be
identical to phy_interface.

Convert dwmac-starfive to use phy_interface when determining the
interface mode rather than mac_interface. Also convert the error
prints to use phy_modes() so that we get a meaningful string rather
than a number for the interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpa-00000006H2X-3GWx@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: socfpga: convert to use phy_interface

dwmac-socfpga uses MII, RMII, GMII, RGMII, SGMII and 1000BASE-X
interface modes, and supports the Lynx PCS. The Lynx PCS will only be
used for SGMII and 1000BASE-X modes, with the MAC programmed to use
GMII or MII mode to talk to the PCS. This suggests that the Synopsys
optional dwmac PCS is not present.

None of the DTS files set "mac-mode", so mac_interface will be
identical to phy_interface.

Convert dwmac-socfpga to use phy_interface when determining the
interface mode rather than mac_interface.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpV-00000006H2R-2nA6@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: ingenic: convert to use phy_interface

dwmac-ingenic uses only MII, RMII, GMII or RGMII interface modes, none
of which require any kind of conversion between the MAC and external
world. Thus, mac_interface and phy_interface will be the same.

Convert dwmac-ingenic to use phy_interface when determining the
interface mode that the dwmac core should be configured to at reset,
rather than mac_interface.

Also convert the error prints to use phy_modes() and terminate with a
newline so that we get a human readable string rather than a number for
the interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpQ-00000006H2L-2Jzb@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: imx: convert to use phy_interface

Checking the IMX8MP documentation, there is no requirement for a
separate mac_interface mode definition. As mac_interface and
phy_interface will be the same, use phy_interface internally rather
than mac_interface.

Also convert the error prints to use phy_modes() so that we get a
meaningful string rather than a number for the interface mode.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpL-00000006H2F-1o6b@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: use phy_interface in stmmac_check_pcs_mode()

In the majority, if not all cases, mac_interface and phy_interface
are the same with the exception of some drivers that I have suggested
only use phy_interface and set mac_interface to PHY_INTERFACE_MODE_NA.

The only two that currently set mac_interface to PHY_INTERFACE_MODE_NA
are dwmac-loongson and dwmac-lpc18xx, neither of which use RGMII nor
SGMII.

In order to phase out the use of mac_interface, we need to have a path
for existing drivers so they can update to only using phy_interface
without causing regressions.

Therefore, in order to keep the "pcs" code working, we need to choose
the STMMAC integrated PCS mode based on phy_interface if mac_interface
is PHY_INTERFACE_MODE_NA.

This will allow more drivers to set mac_interface to
PHY_INTERFACE_MODE_NA without risking regressions.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpG-00000006H29-1Ltk@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: rework mac_interface and phy_interface documentation

Based on new research, it has come to light that the comment that I
added in a014c35556b9 ("net: stmmac: clarify difference between
"interface" and "phy_interface"") is not fully correct.

Update the comment to properly describe the difference between the two.

All of the DTS files in the kernel tree do not mention the "mac-mode"
property, which results in mac_interface and phy_interface being the
same. Also, none of the platform glue drivers set mac_interface to
anything but PHY_INTERFACE_MODE_NA. This means that for all the
platforms known to mainline, mac_interface is either the same as
phy_interface, or it is PHY_INTERFACE_MODE_NA.

Thus, updating the definition for mac_interface in stmmac.h has no
material effect on current uses known to mainline, but the change opens
the door to cleaning up all uses.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1uytpB-00000006H23-0pRi@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Remove dead code from total_vfs setter

The mlx5_devlink_total_vfs_set function branches based on per_pf_support
twice. Remove the second branch as the first one exits the function when
per_pf_support is false.

Accidentally added as part of commit a4c49611cf4f ("net/mlx5: Implement
devlink total_vfs parameter").

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-rdma/aMQWenzpdjhAX4fm@stanley.mountain/
Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Link: https://patch.msgid.link/a6142a60-1948-439a-b0ae-ff1df26a37f8@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

psp: clarify checksum behavior of psp_dev_rcv()

psp_dev_rcv() decapsulates psp headers from a received frame. This
will make any csum complete computed by the device inaccurate. Rather
than attempt to patch up skb->csum in psp_dev_rcv() just make it clear
to callers what they can expect regarding checksum complete.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918212723.17495-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

psp: Fix typo in kdoc for struct psp_dev_caps.assoc_drv_spc.

assoc_drv_spc is the size of psp_assoc.drv_data[].

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250918192539.1587586-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: micrel: use %pe in print format

New cocci check complains:

drivers/net/phy/micrel.c:4308:6-13: WARNING: Consider using %pe to print PTR_ERR()
drivers/net/phy/micrel.c:5742:6-13: WARNING: Consider using %pe to print PTR_ERR()

Link: https://lore.kernel.org/1758192227-701925-1-git-send-email-tariqt@nvidia.com
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250918183119.2396019-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'address-miscellaneous-issues-with-psp_sk_get_assoc_rcu'

Daniel Zahka says:

====================
address miscellaneous issues with psp_sk_get_assoc_rcu()

There were a few minor issues with psp_sk_get_assoc_rcu() identified
by Eric in his review of the initial psp series. This series addresses
them.
====================

Link: https://patch.msgid.link/20250918155205.2197603-1-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

psp: don't use flags for checking sk_state

Using flags to check sk_state only makes sense to check for a subset
of states in parallel e.g. sk_fullsock(). We are not doing that
here. Compare for individual states directly.

Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250918155205.2197603-4-daniel.zahka@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>