git.ipfire.org Git - thirdparty/kernel/linux.git/log

inet: add ip_local_port_step_width sysctl to improve port usage distribution

With the current port selection algorithm, ports after a reserved port
range or long time used port are used more often than others [1]. This
causes an uneven port usage distribution. This combines with cloud
environments blocking connections between the application server and the
database server if there was a previous connection with the same source
port, leading to connectivity problems between applications on cloud
environments.

The real issue here is that these firewalls cannot cope with
standards-compliant port reuse. This is a workaround for such situations
and an improvement on the distribution of ports selected.

The proposed solution is to implement a variant of RFC 6056 Algorithm 5.
The step size is selected randomly on every connect() call ensuring it
is a coprime with respect to the size of the range of ports we want to
scan. This way, we can ensure that all ports within the range are
scanned before returning an error. To enable this algorithm, the user
must configure the new sysctl option "net.ipv4.ip_local_port_step_width".

In addition, on graphs generated we can observe that the distribution of
source ports is more even with the proposed approach. [2]

[1] https://0xffsoftware.com/port_graph_current_alg.html

[2] https://0xffsoftware.com/port_graph_random_step_alg.html

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260309023946.5473-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-rds-ksft-cleanups'

Allison Henderson says:

====================
selftests: rds: ksft cleanups

This set addresses a few rds selftests clean ups and bugs encountered
when running in the ksft framework.  The first patch is a clean up
patch that addresses pylint warnings, but otherwise no functional
changes.  The next patch moves the test time out to a ksft settings
file so that the time out is set appropriately.  And lastly we fix a
tcpdump segfault caused by deprecated a os.fork() call.
====================

Link: https://patch.msgid.link/20260308055835.1338257-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix tcpdump segfault in rds selftests

net/rds/test.py sees a segfault in tcpdump when executed through the
ksft runner.

[   21.903713] tcpdump[1469]: segfault at 0 ip 000072100e99126d
sp 00007ffccf740fd0 error 4
[   21.903721]  in libc.so.6[16a26d,7798b149a000+188000]
[   21.905074]  in libc.so.6[16a26d,72100e84f000+188000] likely on
CPU 5 (core 5, socket 0)
[   21.905084] Code: 00 0f 85 a0 00 00 00 48 83 c4 38 89 d8 5b 41 5c
41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 05 91 8b 09 00 8b 4d ac
64 89 08 <41> 0f b6 07 83 e8 2b a8 fd 0f 84 54 ff ff ff 49 8b 36 4c 89
ff e8
[   21.906760]  likely on CPU 9 (core 9, socket 0)
[   21.913469] Code: 00 0f 85 a0 00 00 00 48 83 c4 38 89 d8 5b 41 5c 41
5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 8b 05 91 8b 09 00 8b 4d ac 64 89
08 <41> 0f b6 07 83 e8 2b a8 fd 0f 84 54 ff ff ff 49 8b 36 4c 89 ff e8

The os.fork() call creates extra complexity because it forks the entire
process including the python interpreter.  ip() then calls cmd() which
creates a subprocess.Popen.  We can avoid the extra layering by simply
calling subprocess.Popen directly. Track the process handles directly
and terminate them at cleanup rather than relying on killall. Further
tcpdump's -Z flag attempts to change savefile ownership, which is not
supported by the 9p protocol.  Fix this by writing pcap captures to
"/tmp" during the test and move them to the log directory after tcpdump
exits.

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260308055835.1338257-4-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Add ksft timeout

rds/run.sh sets a timer of 400s when calling test.py. However when
tests are run through ksft, a default 45s timer is applied. Fix this
by adding a ksft timeout in tools/testing/selftests/net/rds/settings

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260308055835.1338257-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: rds: Fix pylint warnings

Tidy up all exiting pylint errors in test.py. No functional
changes are introduced in this patch

Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260308055835.1338257-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: cli: order set->list conversion in JSON output

NIPA tries to make sure that HW tests don't modify system state.
It dumps some well known configs before and after the test and
compares the outputs.

Make sure that YNL json output is stable. Converting sets to lists
with a naive list(o) results in a random order.

Link: https://patch.msgid.link/20260307175916.1652518-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'smc-sysctl-formatting-and-missing-entries'

Kyoji Ogasawara says:

====================
smc-sysctl formatting and missing entries

update SMC sysctl documentation in two small steps.

- patch 1 fixes indentation in the smcr_buf_type section
- patch 2 documents missing sysctl parameters limit_smc_hs and hs_ctrl,
including values/defaults and hs_ctrl usage notes
====================

Link: https://patch.msgid.link/20260309124541.22723-1-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: Add documentation for limit_smc_hs and hs_ctrl

Document missing SMC sysctl parameters limit_smc_hs and hs_ctrl

Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: D. Wythe<alibuda@linux.alibaba.com>
Link: https://patch.msgid.link/20260309124541.22723-3-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: fix indentation in smcr_buf_type section

smcr_buf_type section used inconsistent indentation compared
with the rest of this document.

Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com>
Reviewed-by: D. Wythe<alibuda@linux.alibaba.com>
Link: https://patch.msgid.link/20260309124541.22723-2-sawara04.o@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'eth-fbnic-add-fbnic-self-tests'

Mike Marciniszyn says:

====================
eth fbnic: Add fbnic self tests

From: "Mike Marciniszyn (Meta)" <mike.marciniszyn@gmail.com>

This series adds self tests to test the registers, the
msix interrupts, the tlv, and the firmware mailbox.

This series assumes that the
[PATCH net-next 0/2] Add debugfs hooks [1]
is present.

When the self tests are run the with ethtool -t:

        ethtool -t eth0
        The test result is PASS
        The test extra info:
        Register test (offline)  0
        MSI-X Interrupt test (offline)   0
        FW mailbox test (on/offline)     0
====================

Link: https://patch.msgid.link/20260307105847.1438-1-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth fbnic: Add mailbox self test

The mailbox self test ensures the interface to and from
the firmware is healthy by sending a test message and
fielding the response from the firmware.

This patch uses the new completion API [1][2] that allocates a
completion structure, binds the completion to the TEST
message, and uses a new FW parsing routine that wraps the
completion processing around the TLV parser.

Link: https://patch.msgid.link/20250516164804.741348-1-lee@trager.us
Link: https://patch.msgid.link/20260115003353.4150771-6-mohsin.bashr@gmail.com
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-6-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth fbnic: TLV support for use by MBX self test

The TLV (Type-Value-Length) self uses a known set of data to create a
TLV message. These routines support the MBX self test by creating
the test messages and parsing the response message coming back
from the firmware.

Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-5-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth fbnic: Add msix self test

This function is meant to test the global interrupt registers and the
PCIe IP MSI-X functionality. It essentially goes through and tests
various combinations of the set, clear, and mask bits in order to
verify the behavior is as we expect it to be from the driver.

Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-4-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

eth fbnic: Add register self test

The register test will be used to verify hardware is behaving as expected.

The test itself will have us writing to registers that should have no
side effects due to us resetting after the test has been completed.

While the test is being run the interface should be offline.

This patch counts on the first patch of this series to export netif_open()
and also ensures that the half close calls netif_close() to
avoid deadlock.

Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-3-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: export netif_open for self_test usage

dev_open() already is exported, but drivers which use the netdev
instance lock need to use netif_open() instead. netif_close() is
also already exported [1] so this completes the pairing.

This export is required for the following fbnic self tests to
avoid calling ndo_stop() and ndo_open() in favor of the
more appropriate netif_open() and netif_close() that notifies
any listeners that the interface went down to test and is now
coming back up.

Link: https://patch.msgid.link/20250309215851.2003708-1-sdf@fomichev.me
Signed-off-by: Mike Marciniszyn (Meta) <mike.marciniszyn@gmail.com>
Link: https://patch.msgid.link/20260307105847.1438-2-mike.marciniszyn@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: mana: hardening: Validate doorbell ID from GDMA_REGISTER_DEVICE response

As a part of MANA hardening for CVM, add validation for the doorbell
ID (db_id) received from hardware in the GDMA_REGISTER_DEVICE response
to prevent out-of-bounds memory access when calculating the doorbell
page address.

In mana_gd_ring_doorbell(), the doorbell page address is calculated as:
  addr = db_page_base + db_page_size * db_index
       = (bar0_va + db_page_off) + db_page_size * db_index

A hardware could return values that cause this address to fall outside
the BAR0 MMIO region. In Confidential VM environments, hardware responses
cannot be fully trusted.

Add the following validations:
- Store the BAR0 size (bar0_size) in gdma_context during probe.
- Validate the doorbell page offset (db_page_off) read from device
  registers does not exceed bar0_size during initialization, converting
  mana_gd_init_registers() to return an error code.
- Validate db_id from GDMA_REGISTER_DEVICE response against the
  maximum number of doorbell pages that fit within BAR0.

Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Link: https://patch.msgid.link/20260306211212.543376-1-ernis@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: airoha: Move GDM forward port configuration in ndo_open/ndo_stop callbacks

This change allows to set GDM forward port configuration to
FE_PSE_PORT_DROP stopping the network device. Hw design requires to stop
packet forwarding putting the interface down. Moreover, PPE firmware
requires to use PPE1 for GDM3 or GDM4.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260306-airoha-gdm-forward-ndo-open-stop-v1-1-7b7a20dd9ef0@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge branch 'net-stmmac-further-ptp-cleanups'

Russell King says:

====================
net: stmmac: further ptp cleanups

The first uses a local variable when setting n_ext_ts which is a minor
simplification of the code. The second removes the now unnecessary
"available" flag for the PPS outputs.
====================

Link: https://patch.msgid.link/aawDiK7DjcSXSs1X@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: ptp: remove redundant priv->pps[].available

priv->pps[].available is set in stmmac_ptp_register() for all PPS
outputs reported by hardware up to STMMAC_PPS_MAX.

Since we now set priv->ptp_clock_ops.n_per_out to the number of PPS
outputs that both the hardware and driver can support to prevent
array overflow in stmmac_enable(), this makes priv->pps[].available
redundant. Remove this struct member.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vypHc-0000000CSbl-1X6v@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: ptp: rearrange n_ext_ts initialisation

Use local variables for n_ext_ts rather than referencing the DMA
capability several times.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vypHX-0000000CSbc-123K@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: remove stmmac_dwmac4_get_mac_addr()

stmmac_dwmac4_get_mac_addr() is identical to stmmac_get_mac_addr().
Remove stmmac_dwmac4_get_mac_addr() to avoid this code duplication.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1vypJM-0000000CSiJ-48yO@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/tc-testing: Adapt test's output to HFSC's iproute2 printing changes

To make the printing of HFSC's defcls consistent with HTB's,
iproute2 is now printing defcls prepended with "0x".

This commit adapts test a4c3 to this change.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307220724.2501212-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: inline tcp_chrono_start()

tcp_chrono_start() is small enough, and used in TCP sendmsg()
fast path (from tcp_skb_entail()).

Note clang is already inlining it from functions in tcp_output.c.

Inlining it improves performance and reduces bloat :

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 1/-84 (-83)
Function                                     old     new   delta
tcp_skb_entail                               280     281      +1
__pfx_tcp_chrono_start                        16       -     -16
tcp_chrono_start                              68       -     -68
Total: Before=25192434, After=25192351, chg -0.00%

Note that tcp_chrono_stop() is too big.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260308123549.2924460-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tp->chrono_type next tp->chrono_stat[]

chrono_type is currently in tcp_sock_read_txrx group, which
is supposed to hold read-mostly fields.

But chrono_type is mostly written in tx path, it should
be moved to tcp_sock_write_tx group, close to other
chrono fields (chrono_stat[], chrono_start).

Note this adds holes, but data locality is far more important.

Use a full u8 for the time being, compiler can generate
more efficient code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260308122302.2895067-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: refine indirect call mitigation in tc_wrapper.h

Some modern cpus disable X86_FEATURE_RETPOLINE feature,
even if a direct call can still be beneficial.

Even when IBRS is present, an indirect call is more expensive
than a direct one:

Direct Calls:
  Compilers can perform powerful optimizations like inlining,
  where the function body is directly inserted at the call site,
  eliminating call overhead entirely.

Indirect Calls:
  Inlining is much harder, if not impossible, because the compiler
  doesn't know the target function at compile time.
  Techniques like Indirect Call Promotion can help by using
  profile-guided optimization to turn frequently taken indirect calls
  into conditional direct calls, but they still add complexity
  and potential overhead compared to a truly direct call.

In this patch, I split tc_skip_wrapper in two different
static keys, one for tc_act() (tc_skip_wrapper_act)
and one for tc_classify() (tc_skip_wrapper_cls).

Then I enable the tc_skip_wrapper_cls only if the count
of builtin classifiers is above one.

I enable tc_skip_wrapper_act only it the count of builtin
actions is above one.

In our production kernels, we only have CONFIG_NET_CLS_BPF=y
and CONFIG_NET_ACT_BPF=y. Other are modules or are not compiled.

Tested on AMD Turin cpus, cls_bpf_classify() cost went
from 1% down to 0.18 %, and FDO will be able to inline
it in tcf_classify() for further gains.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307133601.3863071-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move sysctl_tcp_shrink_window to netns_ipv4_read_txrx group

Commit 18fd64d25422 ("netns-ipv4: reorganize netns_ipv4 fast path
variables") missed that __tcp_select_window() is reading
net->ipv4.sysctl_tcp_shrink_window.

Move this field to netns_ipv4_read_txrx group, as __tcp_select_window()
is used both in tx and rx paths.

Saves a potential cache line miss.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260307092214.2433548-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Make flow control source port mapping dependent on nbq parameter

Flow control source port mapping for USB serdes needs to be configured
according to the GDM port nbq parameter. This is a preliminary patch
since nbq parameter is specific for the given port serdes and needs to
be read from the DTS (in the current codebase is assigned statically).

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260306-airoha-fix-loopback-for-usb-serdes-v2-1-319de9c96826@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gve: add support for UDP GSO for DQO format

Enable support for UDP GSO when using DQO format. Advertise the feature
flag during device initialization and enable offload by default.

Signed-off-by: Ankit Garg <nktgrg@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20260306224816.3391551-1-hramamurthy@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: fib_tests: fix link-local retrieval in fib6_nexthop()

fib6_nexthop() retrieves the link-local address for two interfaces used
in the test. However, both lldummy and llv1 are obtained from dummy0.

llv1 is expected to be retrieved from veth1, which is the interface used
later in the test. The subsequent check and error message also expect
the address to be retrieved from veth1.

Fix this by retrieving llv1 from veth1.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20260306180830.2329477-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'ib-gpio-remove-of-gpio-h-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into mbox

Bartosz Golaszewski says:

====================
Immutable branch between GPIO and net

Convert remaining users of of_gpio.h to using GPIO descriptors and
remove the header.

* tag 'ib-gpio-remove-of-gpio-h-for-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpio: remove of_get_named_gpio() and <linux/of_gpio.h>
  nfc: nfcmrvl: convert to gpio descriptors
  nfc: s3fwrn5: convert to gpio descriptors
====================

Link: https://patch.msgid.link/20260309093153.10446-1-bartosz.golaszewski@oss.qualcomm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ppp: simplify input error handling

Currently, ppp_input_error() indicates an error by allocating a 0-length
skb and calling ppp_do_recv(). It takes an error code argument, which is
stored in skb->cb, but not used by ppp_receive_frame().

Simplify the error handling by removing the unused parameter and the
unnecessary skb allocation. Instead, call ppp_receive_error() directly
from ppp_input_error() under the recv lock, and the length check in
ppp_receive_frame() can be removed.

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: use rtnl_kfree_skbs() in pfifo_fast_reset()

rtnl_kfree_skbs() reduces RTNL and qdisc spinlock hold time.

skbs are freed later after RTNL has been released.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260306133154.678730-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: Use also port number to identify timestamps

The driver uses packet-type (RX/TX) PTP-message type and PTP-sequence
number to identify a matching timestamp packet for a skb. If the same
PTP packet arrives on both ports (as in a PRP environment) then it is
not obvious which event belongs to which skb.

The event contains also the port number on which it was received.
Instead of masking it out, use it for matching.

Tested-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Martin Kaistra <martin.kaistra@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20260306144439.cVwaaopR@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: do not reset queues in graft operations

Following typical script is extremely disruptive,
because each graft operation calls dev_deactivate()
which resets all the queues of the device.

QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 $TXQS`
do
   slot=$( printf %x $(( i )) )
   tc qd add dev $ETH parent 1:$slot fq $QPARAM
done
done

One can add "ip link set dev $ETH down/up" to reduce the disruption time:

QPARAM="limit 100000 flow_limit 1000 buckets 4096"
TXQS=64
for ETH in eth1
do
ip link set dev $ETH down
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
for i in `seq 1 $TXQS`
do
   slot=$( printf %x $(( i )) )
   tc qd add dev $ETH parent 1:$slot fq $QPARAM
done
ip link set dev $ETH up
done

Or we can add a @reset_needed flag to dev_deactivate() and
dev_deactivate_many().

This flag is set to true at device dismantle or linkwatch_do_dev(),
and to false for graft operations.

In the future, we might only stop one queue instead of the whole
device, ie call dev_deactivate_queue() instead of dev_deactivate().

I think the problem (quadratic behavior) was added in commit
2fb541c862c9 ("net: sch_generic: aviod concurrent reset and enqueue op
for lockless qdisc") but this does not look serious enough to deserve
risky backports.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260307163430.470644-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: avoid dst->ops->check() call in tcp_v{4,6}_do_rcv()

If incoming skb dst matches the socket cached one,
there is no need to call again dst->ops->check().

Network layer already validated the skb dst for us,
usually from tcp_v{4,6}_early_demux().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260306154322.1086539-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: rocker: kzalloc + kcalloc to kzalloc_flex

Combining the allocations simplifies things, especially the free path.

Remove ofdpa_group_tbl_entry_free as a result. kfree is shorter.

Add __counted_by for extra runtime analysis.

Signed-off-by: Rosen Penev <rosenp@gmail.com>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20260306025449.12333-1-rosenp@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: move tcp_v4_early_demux() to net/ipv4/ip_input.c

tcp_v4_early_demux() has a single caller : ip_rcv_finish_core().

Move it to net/ipv4/ip_input.c and mark it static, for possible
compiler/linker optimizations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260306131130.654991-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: remove unused hash_size from struct tcp_out_options

hash_size is declared but never read. The MD5 path always uses a
fixed size of 16, and the TCP-AO path uses tcp_ao_maclen().

This closes a 7-byte hole and reduces the struct size from 96 to
88 bytes.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Keita Morisaki <kmta1236@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260307051619.51685-1-kmta1236@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Add SPDX ids to some source files

Add SPDX-License-Identifier lines to several source
files under the network sub-directory. Work on files
in the core, dns_resolver, ipv4, ipv6 and
netfilter sub-dirs. Remove boilerplate
and license reference text to avoid ambiguity.

Rusty Russell has expressed that his contributions
were intended to be GPL-2.0-or-later.

Signed-off-by: Tim Bird <tim.bird@sony.com>
Link: https://patch.msgid.link/20260305004724.87469-1-tim.bird@sony.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tools-ynl-convert-samples-into-selftests'

Jakub Kicinski says:

====================
tools: ynl: convert samples into selftests

The "samples" were always poor man's tests, used to manually
confirm that C YNL works as expected. Since a proper tests/
directory now exists move the samples and use the kselftest
harness to turn them into selftests outputting KTAP.
====================

Link: https://patch.msgid.link/20260307033630.1396085-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert rt-route sample to selftest

Convert rt-route.c to use kselftest_harness.h with FIXTURE/TEST_F.
This is the last test to convert so clean up the Makefile.

Validate that the connected routes for 192.168.1.0/24 and
2001:db8::/64 appear in the dump.

Output:

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           rt_route.dump ...
  # oif: nsim0            dst: 192.168.1.0/24
  # oif: lo               dst: ::1/128
  # oif: nsim0            dst: 2001:db8::1/128
  # oif: nsim0            dst: 2001:db8::/64
  # oif: nsim0            dst: fe80::/64
  # oif: nsim0            dst: ff00::/8
  #            OK  rt_route.dump
  ok 1 rt_route.dump
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert rt-addr sample to selftest

Convert rt-addr.c to use kselftest_harness.h with FIXTURE/TEST_F.

Validate that the addresses configured by the wrapper (192.168.1.1
and 2001:db8::1) appear in the dump.

Output:

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           rt_addr.dump ...
  #               lo: 127.0.0.1
  #            nsim0: 192.168.1.1
  #               lo: ::1
  #            nsim0: 2001:db8::1
  #            nsim0: fe80::7c66:c9ff:fe5f:bf01
  #            OK  rt_addr.dump
  ok 1 rt_addr.dump
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert ethtool sample to selftest

Convert ethtool.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move ethtool from BINS to TEST_GEN_FILES and add ethtool.sh wrapper
which sets up a netdevsim device before running the test binary.

Output:

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           ethtool.channels ...
  #    nsim0: combined 1
  #            OK  ethtool.channels
  ok 1 ethtool.channels
  #  RUN           ethtool.rings ...
  #    nsim0: rx 512 tx 512
  #            OK  ethtool.rings
  ok 2 ethtool.rings
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert devlink sample to selftest

Convert devlink.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move devlink from BINS to TEST_GEN_FILES in the Makefile since
it's invoked via the devlink.sh wrapper which sets up netdevsim.

Output:

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           devlink.dump ...
  # netdevsim/netdevsim1337
  #            OK  devlink.dump
  ok 1 devlink.dump
  #  RUN           devlink.info ...
  # netdevsim/netdevsim1337:
  #   driver: netdevsim
  #   running fw:
  #     fw.mgmt: 10.20.30
  #            OK  devlink.info
  ok 2 devlink.info
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add netdevsim wrapper library for YNL tests

Some tests need netdevsim setup which is painful to do from C.

Add ynl_nsim_lib.sh, a shared library providing nsim_setup and
nsim_cleanup functions for tests that need a netdevsim device.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert tc and tc-filter-add samples to selftest

Convert tc.c and tc-filter-add.c to produce KTAP output with
kselftest_harness. Merge the two tests together. They both
test TC one is testing qdisc and the other classifiers but
they can easily live in a single selftest.

Make the test spawn a new netns, and run the operations on
lo to avoid onerous setup and cleanup.

  TAP version 13
  1..2
  # Starting 2 tests from 1 test cases.
  #  RUN           tc.qdisc ...
  #               lo: fq_codel  limit: 10240p target: 5ms new_flow_cnt: 0
  #            OK  tc.qdisc
  ok 1 tc.qdisc
  #  RUN           tc.flower ...
  # flower pref 1 proto: 0x8100
  # flower:
  #   vlan_id: 100
  #   vlan_prio: 5
  #   num_of_vlans: 3
  # action order: 1 vlan push id 200 protocol 0x8100 priority 0
  # action order: 2 vlan push id 300 protocol 0x8100 priority 0
  #            OK  tc.flower
  ok 2 tc.flower
  # PASSED: 2 / 2 tests passed.
  # Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert rt-link sample to selftest

Convert rt-link.c to use kselftest_harness.h with FIXTURE/TEST_F.
Move rt-link from BINS to TEST_GEN_PROGS.

Output:

  TAP version 13
  1..3
  # Starting 3 tests from 1 test cases.
  #  RUN           rt_link.dump ...
  #   1:          lo: mtu 65536
  #   2:          sit0: mtu  1480  kind sit
  #            OK  rt_link.dump
  ok 1 rt_link.dump
  #  RUN           rt_link.netkit ...
  #   4:          nk1: mtu  1500  kind netkit    primary 1  policy blackhole
  #            OK  rt_link.netkit
  ok 2 rt_link.netkit
  #  RUN           rt_link.netkit_err_msg ...
  #            OK  rt_link.netkit_err_msg
  ok 3 rt_link.netkit_err_msg
  # PASSED: 3 / 3 tests passed.
  # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert ovs sample to selftest

Convert ovs.c to produce KTAP output with kselftest_harness.
The single "crud" test creates a new OVS datapath, fetches it back
by name, then dumps all datapaths verifying the new one appears.

IIRC I added this test because ovs is a genetlink family but
has a family-specific fixed header.

  TAP version 13
  1..1
  # Starting 1 tests from 1 test cases.
  #  RUN           ovs.crud ...
  # get:
  # ynl-test(3): pid:0 cache:256
  # dump:
  # ynl-test(3): pid:0 cache:256
  #            OK  ovs.crud
  ok 1 ovs.crud
  # PASSED: 1 / 1 tests passed.
  # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: convert netdev sample to selftest

Convert netdev.c to produce KTAP output with 3 tests:
- dev_dump: dump all netdev devices, skip if empty
- dev_get: query first device from dump by ifindex
- ntf_check: subscribe to "mgmt", create a veth via rt-link,
  verify netdev notification is received, then delete the veth

Remove stdin/scanf-based UI. Add rt-link dependency for the veth
notification test.

  TAP version 13
  1..3
  # Starting 3 tests from 1 test cases.
  #  RUN           netdev.dump ...
  #       lo[1] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #     sit0[2] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #            OK  netdev.dump
  ok 1 netdev.dump
  #  RUN           netdev.get ...
  #       lo[1] xdp-features (0): xdp-rx-metadata-features (0): xsk-fea...
  #            OK  netdev.get
  ok 2 netdev.get
  #  RUN           netdev.ntf_check ...
  #    veth0[7] xdp-features (0): xdp-rx-metadata-features (7): timesta...
  #            OK  netdev.ntf_check
  ok 3 netdev.ntf_check
  # PASSED: 3 / 3 tests passed.
  # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: move samples to tests

The "samples" were always poor man's tests (used to manually
confirm that C YNL works).

Move all C sample programs from tools/net/ynl/samples/ to
tools/net/ynl/tests/, "merge" the Makefiles. The subsequent
changes will convert each sample into a proper KTAP selftests.

Since these are now tests rather than samples - default to
enabling asan. After all we're testing user space code here.

Sort the gitignore while at it, the page-pool entry was a leftover
so delete it.

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Tested-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20260307033630.1396085-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gpio: remove of_get_named_gpio() and <linux/of_gpio.h>

All in-tree consumers have been converted to the descriptor-based API.
Remove the deprecated of_get_named_gpio() helper, delete the
<linux/of_gpio.h> header, and drop the corresponding entry from
MAINTAINERS.

Also remove the completed TODO item for this cleanup.

Signed-off-by: Jialu Xu <xujialu@vimux.org>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/02ABDA1F9E3FAF1F+20260307030623.3495092-6-xujialu@vimux.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

nfc: nfcmrvl: convert to gpio descriptors

Replace the legacy of_get_named_gpio() / gpio_request_one() /
gpio_set_value() API with the descriptor-based devm_gpiod_get_optional() /
gpiod_set_value() API from <linux/gpio/consumer.h>, removing the
dependency on <linux/of_gpio.h>.

The "reset-n-io" property rename quirk already exists in gpiolib-of.c
(added in commit 9c2cc7171e08), so no additional quirk is needed.

Signed-off-by: Jialu Xu <xujialu@vimux.org>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/DD684946FD7EE161+20260307030623.3495092-4-xujialu@vimux.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

nfc: s3fwrn5: convert to gpio descriptors

Replace the legacy of_get_named_gpio() / gpio_request_one() /
gpio_set_value() API with the descriptor-based devm_gpiod_get() /
gpiod_set_value() API from <linux/gpio/consumer.h>, removing the
dependency on <linux/of_gpio.h>.

This removes the s3fwrn5_i2c_parse_dt() and s3fwrn82_uart_parse_dt()
functions since devm_gpiod_get() handles both DT lookup and resource
management. The gpio_en and gpio_fw_wake fields in struct phy_common
are changed from int to struct gpio_desc *.

Add rename quirks in gpiolib-of.c for the deprecated "s3fwrn5,en-gpios"
and "s3fwrn5,fw-gpios" properties to maintain backward compatibility
with old device trees.

Signed-off-by: Jialu Xu <xujialu@vimux.org>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Link: https://patch.msgid.link/94FF47746A92BD6B+20260307030623.3495092-2-xujialu@vimux.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>

selftests: net: make ovs-dpctl.py fail when pyroute2 is unsupported

The pmtu.sh kselftest configures OVS using ovs-dpctl.py and falls back
to ovs-vsctl only when ovs-dpctl.py fails. However, ovs-dpctl.py exits
with a success status when the installed pyroute2 package version is
lower than 0.6, even though the OVS datapath is not configured.

As a result, pmtu.sh assumes that the setup was successful and
continues running the test, which later fails due to the missing
OVS configuration.

Fix the exit code handling in ovs-dpctl.py so that pmtu.sh can detect
that the setup did not complete successfully and fall back to
ovs-vsctl.

Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20260306000127.519064-3-aleksey.oladko@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: usb: lan78xx: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305105006.16415-1-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ntb_netdev-add-multi-queue-support'

Koichiro Den says:

====================
net: ntb_netdev: Add Multi-queue support

ntb_netdev currently hard-codes a single NTB transport queue pair, which
means the datapath effectively runs as a single-queue netdev regardless
of available CPUs / parallel flows.

The longer-term motivation here is throughput scale-out: allow
ntb_netdev to grow beyond the single-QP bottleneck and make it possible
to spread TX/RX work across multiple queue pairs as link speeds and core
counts keep increasing.

Multi-queue also unlocks the standard networking knobs on top of it. In
particular, once the device exposes multiple TX queues, qdisc/tc can
steer flows/traffic classes into different queues (via
skb->queue_mapping), enabling per-flow/per-class scheduling and QoS in a
familiar way.

Usage
=====

  1. Ensure the NTB device you want to use has multiple Memory Windows.
  2. modprobe ntb_transport on both sides, if it's not built-in.
  3. modprobe ntb_netdev on both sides, if it's not built-in.
  4. Use ethtool -L to configure the desired number of queues.
     The default number of real (combined) queues is 1.

     e.g. ethtool -L eth0 combined 2 # to increase
          ethtool -L eth0 combined 1 # to reduce back to 1

  Note:
    * If the NTB device has only a single Memory Window, ethtool -L eth0
      combined N (N > 1) fails with:
      "netlink error: No space left on device".
    * ethtool -L can be executed while the net_device is up.

Compatibility
=============

  The default remains a single queue, so behavior is unchanged unless
  the user explicitly increases the number of queues.

Kernel base
===========

  ntb-next (latest as of 2026-03-06):
  commit 7b3302c687ca ("ntb_hw_amd: Fix incorrect debug message in link
                        disable path")

Testing / Results
=================

  Environment / command line:
    - 2x R-Car S4 Spider boards
      "Kernel base" (see above) + this series

  TCP:
    [RC] $ sudo iperf3 -s
    [EP] $ sudo iperf3 -Z -c ${SERVER_IP} -l 65480 -w 512M -P 4
  UDP:
    [RC] $ sudo iperf3 -s
    [EP] $ sudo iperf3 -ub0 -c ${SERVER_IP} -l 65480 -w 512M -P 4

  Without this series:
      TCP / UDP : 589 Mbps / 580 Mbps

  With this series (default single queue):
      TCP / UDP : 583 Mbps / 583 Mbps

  With this series + `ethtool -L eth0 combined 2`:
      TCP / UDP : 576 Mbps / 584 Mbps

  With this series + `ethtool -L eth0 combined 2` + [1], where flows are
  properly distributed across queues:
      TCP / UDP : 1.13 Gbps / 1.16 Gbps (re-measured with v3)

  The 575~590 Mbps variation is run-to-run variance i.e. no measurable
  regression or improvement is observed with a single queue. The key
  point is scaling from ~600 Mbps to ~1.20 Gbps once flows are
  distributed across multiple queues.

  Note: On R-Car S4 Spider, only BAR2 is usable for ntb_transport MW.
  For testing, BAR2 was expanded from 1 MiB to 2 MiB and split into two
  Memory Windows. A follow-up series is planned to add split BAR support
  for vNTB. On platforms where multiple BARs can be used for the
  datapath, this series should allow >=2 queues without additional
  changes.

  [1] [PATCH v2 00/10] NTB: epf: Enable per-doorbell bit handling while keeping legacy offset
      https://lore.kernel.org/linux-pci/20260227084955.3184017-1-den@valinux.co.jp/
      (subject was accidentally incorrect in the original posting)
====================

Link: https://patch.msgid.link/20260305155639.1885517-1-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ntb_netdev: Support ethtool channels for multi-queue

Support dynamic queue pair addition/removal via ethtool channels.
Use the combined channel count to control the number of netdev TX/RX
queues, each corresponding to a ntb_transport queue pair.

When the number of queues is reduced, tear down and free the removed
ntb_transport queue pairs (not just deactivate them) so other
ntb_transport clients can reuse the freed resources.

When the number of queues is increased, create additional queue pairs up
to NTB_NETDEV_MAX_QUEUES (=64). The effective limit is determined by the
underlying ntb_transport implementation and NTB hardware resources (the
number of MWs), so set_channels may return -ENOSPC if no more QPs can be
allocated.

Keep the default at one queue pair to preserve the previous behavior.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-5-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ntb_netdev: Factor out multi-queue helpers

Implementing .set_channels will otherwise duplicate the same multi-queue
operations at multiple call sites. Factor out the following helpers:

  - ntb_netdev_update_carrier(): carrier is switched on when at least
                                 one QP link is up
  - ntb_netdev_queue_rx_drain(): drain and free all queued RX packets
                                 for one QP
  - ntb_netdev_queue_rx_fill():  prefill RX ring for one QP

No functional change.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-4-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ntb_netdev: Gate subqueue stop/wake by transport link

When ntb_netdev is extended to multiple ntb_transport queue pairs, the
netdev carrier can be up as long as at least one QP link is up. In that
setup, a given QP may be link-down while the carrier remains on.

Make the link event handler start/stop the corresponding netdev TX
subqueue and drive carrier state based on whether any QP link is up.
Also guard subqueue wake/start points in the TX completion and timer
paths so a subqueue is not restarted while its QP link is down.

Stop all queues in ndo_open() and let the link event handler wake each
subqueue once ntb_transport link negotiation succeeds.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-3-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ntb_netdev: Introduce per-queue context

Prepare ntb_netdev for multi-queue operation by moving queue-pair state
out of struct ntb_netdev.

Introduce struct ntb_netdev_queue to carry the ntb_transport_qp pointer,
the per-QP TX timer and queue id. Pass this object as the callback
context and convert the RX/TX handlers and link event path accordingly.

The probe path allocates a fixed upper bound for netdev queues while
instantiating only a single ntb_transport queue pair, preserving the
previous behavior. Also store client_dev for future queue pair
creation/removal via the ntb_transport API.

Signed-off-by: Koichiro Den <den@valinux.co.jp>
Link: https://patch.msgid.link/20260305155639.1885517-2-den@valinux.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: spacemit: Remove unused buff_addr fields

These were never used. Just remove them.

No functional change intended.

Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20260305-k1-ethernet-cleanup-buff_addr-v1-1-e978ef119231@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'nfc-drop-redundant-usb-device-references'

Johan Hovold says:

====================
nfc: drop redundant USB device references

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop redundant device references to reduce cargo culting, make it easier
to spot drivers where an extra reference is needed, and reduce the risk
of memory leaks when drivers fail to release them.
====================

Link: https://patch.msgid.link/20260305111019.18030-1-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfc: port100: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305111019.18030-3-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nfc: pn533: drop redundant device reference

Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.

Drop the redundant device reference to reduce cargo culting, make it
easier to spot drivers where an extra reference is needed, and reduce
the risk of memory leaks when drivers fail to release it.

Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305111019.18030-2-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

virtio-net: xsk: Support wakeup on RX side

When XDP_USE_NEED_WAKEUP is used and the fill ring is empty so no buffer
is allocated on RX side, allow RX NAPI to be descheduled. This avoids
wasting CPU cycles on polling. Users will be notified and they need to
make a wakeup call after refilling the ring.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20260304154317.7506-1-minhquangbui99@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: net: forwarding: fix IPv6 address leak in cleanup

Several forwarding tests (e.g., gre_multipath.sh) initialize both IPv4
and IPv6 addresses using simple_if_init, but only clean up IPv4
in simple_if_fini. This leaves stale IPv6 addresses on the interfaces,
which causes subsequent tests to fail when they encounter unexpected
address configuration.

The issue can be reproduced by running tests in sequence:
  # run_kselftest.sh -t net/forwarding:ipip_hier_gre.sh
  # run_kselftest.sh -t net/forwarding:min_max_mtu.sh
  TAP version 13
  1..1
  # timeout set to 0
  # selftests: net/forwarding: min_max_mtu.sh
  # TEST: ping                                                          [ OK ]
  # TEST: ping6                                                         [ OK ]
  # TEST: Test maximum MTU configuration                                [ OK ]
  # TEST: Test traffic, packet size is maximum MTU                      [FAIL]
  #       Ping6, packet size: 65487 succeeded, but should have failed
  # TEST: Test minimum MTU configuration                                [ OK ]
  # TEST: Test traffic, packet size is minimum MTU                      [ OK ]
  not ok 1 selftests: net/forwarding: min_max_mtu.sh # exit=1

Fix this by removing the unused IPv6 argument from simple_if_init in
tests that don't use IPv6 (gre_multipath.sh, ipip_lib.sh), and by
adding the missing IPv6 argument to simple_if_fini in tests that
use IPv6 (gre_multipath_nh.sh, gre_multipath_nh_res.sh).

Signed-off-by: Aleksei Oladko <aleksey.oladko@virtuozzo.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260305211000.515301-1-aleksey.oladko@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: annotate data races around sk->sk_prot

inet_sendmsg() and inet_recvmsg() access sk->sk_prot without
lock_sock() or any other synchronization.

sock_replace_proto() (used by sockmap), TLS and MPTCP can change
sk->sk_prot under us, so these functions need READ_ONCE() to avoid
load tearing.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260304064253.16955-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: remove phy_attach

378e6523ebb1 ("net: bcmgenet: remove unused platform code") removed
the last user of phy_attach(). So remove this function.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/8812176a-e319-4e9f-815d-99ea339df8b2@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

inet_diag: report delayed ack timer information

inet_sk_diag_fill() populates r->idiag_timer with the following
precedence order:

1 - Retransmit timer.
4 - Probe0 timer.
2 - Keepalive timer.

This patch adds a new value, last in the list, if other timers
are not active.

5 - Delayed ACK timer.

A corresponding iproute2 patch will follow to replace "unknown"
with "delack":

ESTAB 10     0   [2002:a05:6830:1f86::]:12875 [2002:a05:6830:1f85::]:50438

    timer:(unknown,003ms,0) ino:152178 sk:3004 cgroup:unreachable:189 <->

    skmem:(r1344,rb12780520,t0,tb262144,f2752,w0,o250,bl0,d0) ts usec_ts
    ...

Also add the following enum in uapi/linux/inet_diag.h
as suggested by David Ahern.

enum {
IDIAG_TIMER_OFF,
IDIAG_TIMER_ON,
IDIAG_TIMER_KEEPALIVE,
IDIAG_TIMER_TIMEWAIT,
IDIAG_TIMER_PROBE0,
IDIAG_TIMER_DELACK,
};

Neal Cardwell suggested to test for ICSK_ACK_TIMER:
inet_csk_clear_xmit_timer() does not call sk_stop_timer()
because INET_CSK_CLEAR_TIMERS is unset.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260305114829.2163276-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-stmmac-mdio-related-cleanups'

Russell King says:

====================
net: stmmac: mdio related cleanups

The first four patches clean up the MDC clock divisor selection code,
turning the three different ways we choose a divisor into tabular form,
rather than doing the selection purely in code.

Convert MDIO to use field_prep() which allows a non-constant mask to be
used when preparing fields.

Then use u32 and the associated typed GENMASK for MDIO register field
definitions.

Finally, an extra couple of patches that use appropriate types in
struct mdio_bus_data.
====================

Link: https://patch.msgid.link/aald--qJquWGIvmO@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: make pcs_mask and phy_mask u32

The PCS and PHY masks are passed to the mdio bus layer as phy_mask
to prevent bus addresses between 0 and 31 inclusive being scanned,
and this is declared as u32. Also declare these as u32 in stmmac
for type consistency.

Since this is a u32, use BIT_U32() rather than BIT() to generate
values for these fields.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AY-0000000BtxJ-3smT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mdio_bus_data->default_an_inband is boolean

default_an_inband is declared as an unsigned int, but is set to true/
false and is assigned to phylink_config's member of the same name
which is a bool. Declare this also as a bool for consistency.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AT-0000000BtxD-2qm7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: use GENMASK_U32() for mdio bitfields

Rather than using hex numbers, use GENMASK() for mdio bitfields.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AO-0000000Btx7-2NDV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: use u32 for MDIO register field masks

MDIO registers are 32-bit, so use u32 to describe the masks for these
registers. Convert the GENMASK() initialisers to GENMASK_U32() for
type compatibility.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AJ-0000000Btx1-1teC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mdio: convert field prep to use field_prep()

Convert the MDIO field preparation to use field_prep(), which removes
the need to store separate mask and shifts. Also convert the clk_csr
value using __ffs() to do the shift as we need to detect overflows
for this.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6AE-0000000Btwv-1LM4@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mdio: simplify MDC clock divisor lookup

As each lookup now iterates over each table in the same way, simplfy
the code to select the table, and then walk that table.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6A9-0000000Btwp-0lxY@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mdio: use same test for MDC clock divisor lookups

Use the same frequency test for all clk_csr value lookups (clock
rate > table rate). This has the side effect that the standard rate
table results in the divider being used for the maximum frequency
for the divider rather than the next higher divider. This still
allows MDC to meet the IEE 802.3 specification, but at a rate closer
to 2.5MHz for these frequencies.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy6A4-0000000Btwj-0ATB@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: mdio: convert MDC clock divisor selection to tables

Convert the MDC clock divisor selection to tabular format.

Note that there is a change for 300MHz, but this is not a problem,
as the MDC clock remains within the useable ranges, which are:

STMMAC_CSR_500_800M /324 1.54 - 2.47MHz
STMMAC_CSR_300_500M /204 1.47 - 2.45MHz
STMMAC_CSR_250_300M /124 2.02 - 2.42MHz
STMMAC_CSR_150_250M /102 1.47 - 2.45MHz
STMMAC_CSR_100_150M /62  1.61 - 2.42MHz
STMMAC_CSR_60_100M /42  1.43 - 2.38MHz
STMMAC_CSR_35_60M /26  1.35 - 2.31MHz
STMMAC_CSR_20_35M /16  1.25 - 2.19MHz

Thus, with the change of divisor for exactly 300MHz, MDC temporarily
changes from 2.42MHz to 1.47MHz for the sake of consistency.

The databook does not specify whether the frequency limits for the
CSR divider are inclusive or exclusive.

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/E1vy69y-0000000Btwd-3oq7@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: drv-net: iou-zcrx: wait for memory cleanup of probe run

The large chunks test does a probe run of iou-zcrx before it runs the
actual test. After the probe run finishes, the context will still exist
until the deferred io_uring teardown. When running iou-zcrx the second
time, io_uring_register_ifq() can return -EEXIST due to the existence of
the old context.

The fix is simple: wait for the context teardown using the new
mp_clear_wait() utility before running the second instance of iou-zcrx.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260305080446.897628-2-dtatulea@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Revert "net: phy: improve mdiobus_stats_acct"

This reverts commit 1afccc5a201ec7c9023370958bae1312369b64da.

As reported by Marek the change causes a warning on non-PREEMPT_RT
32 bit systems.

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/c3a1aba9-3fae-4c4b-bcb1-fb620fb7a309@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: netdev: refine netdevsim testing guidance

The library to create tests for both NIC HW and netdevsim has existed
for almost a year. netdevsim-only tests we get increasingly feel like
a waste, we should try to write tests that work both on netdevsim and
real HW. Refine the guidance accordingly.

Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260304151647.2770466-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-net-add-netkit-container-env-and-test'

David Wei says:

====================
selftests/net: add netkit container env and test

Add a new Python selftest env NetDrvContEnv that sets up a pair of
netkit netdevs, with one inside of a netns, and a bpf prog that forwards
skbs from NETIF to the netkit inside the netns.

  NETIF           = "eth0"
  LOCAL_V6        = "2001:db8:1::1"
  REMOTE_V6       = "2001:db8:1::2"
  LOCAL_PREFIX_V6 = "2001:db8:2::0/64"

          +-----------------------------+        +------------------------------+
  dst     | INIT NS                     |        | TEST NS                      |
  2001:   | +---------------+           |        |                              |
  db8:2::2| | NETIF         |           |  bpf   |                              |
      +---|>| 2001:db8:1::1 |           |redirect| +-------------------------+  |
      |   | |               |-----------|--------|>| Netkit                  |  |
      |   | +---------------+           | _peer  | | nk_guest                |  |
      |   | +-------------+ Netkit pair |        | | fe80::2/64              |  |
      |   | | Netkit      |.............|........|>| 2001:db8:2::2/64        |  |
      |   | | nk_host     |             |        | +-------------------------+  |
      |   | | fe80::1/64  |             |        |                              |
      |   | +-------------+             |        | route:                       |
      |   |                             |        |   default                    |
      |   | route:                      |        |     via fe80::1 dev nk_guest |
      |   |   2001:db8:2::2/128         |        +------------------------------+
      |   |     via fe80::2 dev nk_host |
      |   +-----------------------------+
      |
      |   +---------------+
      |   | REMOTE        |
      +---| 2001:db8:1::2 |
          +---------------+

I will use this series for queue leasing selftests. Include a basic ping
test in this series as demonstration.
====================

Link: https://patch.msgid.link/20260305181803.2912736-1-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Add netkit container ping test

Add a basic ping test using NetDrvContEnv that sets up a netkit pair,
with one end in a netns. Use LOCAL_PREFIX_V6 and nk_forward BPF program
to ping from a remote host to the netkit in netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260305181803.2912736-5-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Add env for container based tests

Add an env NetDrvContEnv for container based selftests. This automates
the setup of a netns, netkit pair with one inside the netns, and a BPF
program that forwards skbs from the NETIF host inside the container.

Currently only netkit is used, but other virtual netdevs e.g. veth can
be used too.

Expect netkit container datapath selftests to have a publicly routable
IP prefix to assign to netkit in a container, such that packets will
land on eth0. The BPF skb forward program will then forward such packets
from the host netns to the container netns.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260305181803.2912736-4-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Export Netlink class via lib.py

Making rtnl newlink calls requires constants defined in Netlink class in
pyynl. Export it.

Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://patch.msgid.link/20260305181803.2912736-3-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: Add bpf skb forwarding program

Add nk_forward.bpf.c, a BPF program that forwards skbs matching some IPv6
prefix received on eth0 ifindex to a specified netkit ifindex. This will
be needed by netkit container tests.

Signed-off-by: David Wei <dw@davidwei.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260305181803.2912736-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tools: ynl: add uns-admin-perm to genetlink

GENL_UNS_ADMIN_PERM may be required by protocols using
the `genetlink` family, however, this flag is currently
only allowed in `genetlink-legacy`.

Add it to the list of possible values in genetlink.yaml too.

Cc: Simon Horman <horms@kernel.org>
Cc: Donald Hunter <donald.hunter@gmail.com>
Link: https://github.com/OpenVPN/ovpn-net-next/issues/33
Suggested-by: Ralf Lici <ralf@mandelbit.com>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20260304141020.23270-1-antonio@openvpn.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: airoha: Rely __field_prep for non-constant masks

Rely on __field_prep macros for non-constant masks preparing the values
for register updates instead of open-coding.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260304-airoha-__field_prep-v1-1-b185facc4e2f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-cadence-macb-add-ieee-802-3az-eee-support'

Nicolai Buchwitz says:

====================
net: cadence: macb: add IEEE 802.3az EEE support

Add Energy Efficient Ethernet (IEEE 802.3az) support to the Cadence GEM
(macb) driver using phylink's managed EEE framework. The GEM MAC has
hardware LPI registers but no built-in idle timer, so the driver
implements software-managed TX LPI using a delayed_work timer while
delegating EEE negotiation and ethtool state to phylink.

The series is structured as follows:

  1. LPI statistics: Expose the four hardware EEE counters (RX/TX LPI
     transitions and time) through ethtool -S, accumulated in software
     since they are clear-on-read. Adds register offset definitions
     GEM_RXLPI/RXLPITIME/TXLPI/TXLPITIME (0x270-0x27c).

  2. TX LPI engine: Introduces GEM_TXLPIEN (NCR bit 19) and
     MACB_CAPS_EEE alongside the implementation that uses them.
     phylink mac_enable_tx_lpi / mac_disable_tx_lpi callbacks with a
     delayed_work-based idle timer. LPI entry is deferred 1 second
     after link-up per IEEE 802.3az. Wake before transmit with a
     conservative 50us PHY wake delay (IEEE 802.3az Tw_sys_tx).

  3. ethtool EEE ops: get_eee/set_eee delegating to phylink for PHY
     negotiation and timer management.

  4. RP1 enablement: Set MACB_CAPS_EEE for the Raspberry Pi 5's RP1
     southbridge (Cadence GEM_GXL rev 0x00070109 + BCM54213PE PHY).

  5. EyeQ5 enablement: Set MACB_CAPS_EEE for the Mobileye EyeQ5 GEM
     instance, verified with a hardware loopback by Théo Lebrun.

Tested on Raspberry Pi 5 (1000BASE-T, BCM54213PE PHY, 250ms LPI timer):

  iperf3 throughput (no regression):
    TCP TX: 937.8 Mbit/s (EEE on) vs 937.0 Mbit/s (EEE off)
    TCP RX: 936.5 Mbit/s both

  Latency (ping RTT, small expected increase from LPI wake):
    1s interval:  0.273 ms (EEE on) vs 0.181 ms (EEE off)
    10ms interval: 0.206 ms (EEE on) vs 0.168 ms (EEE off)
    flood ping:   0.200 ms (EEE on) vs 0.156 ms (EEE off)

  LPI counters (ethtool -S, 1s-interval ping, EEE on):
    tx_lpi_transitions: 112
    tx_lpi_time: 15574651

  Zero packet loss across all tests. Also verified with
  ethtool --show-eee / --set-eee and cable unplug/replug cycling.
====================

Link: https://patch.msgid.link/20260304105432.631186-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cadence: macb: enable EEE for Mobileye EyeQ5

Set MACB_CAPS_EEE for the Mobileye EyeQ5 GEM instance. EEE has been
verified on EyeQ5 hardware using a loopback setup with ethtool
--show-eee confirming EEE active on both ends at 100baseT/Full and
1000baseT/Full.

Tested-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-6-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cadence: macb: enable EEE for Raspberry Pi RP1

Set MACB_CAPS_EEE for the Raspberry Pi 5 RP1 southbridge
(Cadence GEM_GXL rev 0x00070109 paired with BCM54213PE PHY).

EEE has been verified on RP1 hardware: the LPI counter registers
at 0x270-0x27c return valid data, the TXLPIEN bit in NCR (bit 19)
controls LPI transmission correctly, and ethtool --show-eee reports
the negotiated state after link-up.

Other GEM variants that share the same LPI register layout (SAMA5D2,
SAME70, PIC32CZ) can be enabled by adding MACB_CAPS_EEE to their
respective config entries once tested.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-5-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cadence: macb: add ethtool EEE support

Implement get_eee and set_eee ethtool ops for GEM as simple passthroughs
to phylink_ethtool_get_eee() and phylink_ethtool_set_eee().

No MACB_CAPS_EEE guard is needed: phylink returns -EOPNOTSUPP from both
ops when mac_supports_eee is false, which is the case when
lpi_capabilities and lpi_interfaces are not populated. Those fields are
only set when MACB_CAPS_EEE is present (previous patch), so phylink
already handles the unsupported case correctly.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-4-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cadence: macb: implement EEE TX LPI support

The GEM MAC has hardware LPI registers (NCR bit 19: TXLPIEN) but no
built-in idle timer, so asserting TXLPIEN blocks all TX immediately
with no automatic wake. A software idle timer is required, as noted
in Microchip documentation (section 40.6.19): "It is best to use
firmware to control LPI."

Implement phylink managed EEE using the mac_enable_tx_lpi and
mac_disable_tx_lpi callbacks:

- macb_tx_lpi_set(): sets or clears TXLPIEN; requires bp->lock to be
  held by the caller (asserted with lockdep_assert_held). Returns bool
  indicating whether the register actually changed, avoiding redundant
  writes and unnecessary udelay on the xmit fast path.

- macb_tx_lpi_work_fn(): delayed_work handler that enters LPI if all
  TX queues are idle and EEE is still active. Takes bp->lock with
  irqsave before calling macb_tx_lpi_set().

- macb_tx_lpi_schedule(): arms the work timer using the LPI timer
  value provided by phylink (default 250 ms). Called from
  macb_tx_complete() after each TX drain so the idle countdown
  restarts whenever the ring goes quiet.

- macb_tx_lpi_wake(): called from macb_start_xmit() under bp->lock,
  immediately before TSTART. Returns early if eee_active is false to
  avoid a register read on the common path when EEE is disabled.
  Clears TXLPIEN and applies a 50 us udelay for PHY wake (IEEE
  802.3az Tw_sys_tx is 16.5 us for 1000BASE-T / 30 us for
  100BASE-TX; GEM has no hardware enforcement). Only delays when
  TXLPIEN was actually set. The delay is placed after tx_head is
  advanced so the work_fn's queue-idle check sees a non-empty ring
  and cannot race back into LPI before the frame is transmitted.

- mac_enable_tx_lpi: stores the timer and sets eee_active under
  bp->lock, then defers the first LPI entry by 1 second per IEEE
  802.3az section 22.7a.

- mac_disable_tx_lpi: cancels the work (sync, without the lock to
  avoid deadlock with the work_fn), then takes bp->lock to clear
  eee_active and deassert TXLPIEN.

Populate phylink_config lpi_interfaces (MII, GMII, RGMII variants)
and lpi_capabilities (MAC_100FD | MAC_1000FD) so phylink can
negotiate EEE with the PHY and call the callbacks appropriately.
Set lpi_timer_default to 250000 us and eee_enabled_default to true.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-3-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: cadence: macb: add EEE LPI statistics counters

The GEM MAC provides four read-only, clear-on-read LPI statistics
registers at offsets 0x270-0x27c:

  GEM_RXLPI     (0x270): RX LPI transition count (16-bit)
  GEM_RXLPITIME (0x274): cumulative RX LPI time (24-bit)
  GEM_TXLPI     (0x278): TX LPI transition count (16-bit)
  GEM_TXLPITIME (0x27c): cumulative TX LPI time (24-bit)

Add register offset definitions, extend struct gem_stats with
corresponding u64 software accumulators, and register the four
counters in gem_statistics[] so they appear in ethtool -S output.
Because the hardware counters clear on read, the existing
macb_update_stats() path accumulates them into the u64 fields on
every stats poll, preventing loss between userspace reads.

These registers are present on SAMA5D2, SAME70, PIC32CZ, and RP1
variants of the Cadence GEM IP and have been confirmed on RP1 via
devmem reads.

Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260304105432.631186-2-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Initialise ehash secrets during connect() and listen().

inet_ehashfn() and inet6_ehashfn() initialise random secrets
on the first call by net_get_random_once().

While the init part is patched out using static keys, with
CONFIG_STACKPROTECTOR_STRONG=y, this causes a compiler to
generate a stack canary due to an automatic variable,
unsigned long ___flags, in the DO_ONCE() macro being passed
to __do_once_start().

With FDO, this is visible in __inet_lookup_established() and
__inet6_lookup_established() too.

Let's initialise the secrets by get_random_sleepable_once()
in the slow paths: inet_hash() for listen(), and
inet_hash_connect() and inet6_hash_connect() for connect().

Note that IPv6 listener will initialise both IPv4 & IPv6 secrets
in inet_hash() for IPv4-mapped IPv6 address.

With the patch, the stack size is reduced by 16 bytes (___flags
+ a stack canary) and NOPs for the static key go away.

Before: __inet6_lookup_established()

       ...
       push   %rbx
       sub    $0x38,%rsp                # stack is 56 bytes
       mov    %edx,%ebx                 # sport
       mov    %gs:0x299419f(%rip),%rax  # load stack canary
       mov    %rax,0x30(%rsp)              and store it onto stack
       mov    0x440(%rdi),%r15          # net->ipv4.tcp_death_row.hashinfo
       nop
32:   mov    %r8d,%ebp                 # hnum
       shl    $0x10,%ebp                # hnum << 16
       nop
3d:   mov    0x70(%rsp),%r14d          # sdif
       or     %ebx,%ebp                 # INET_COMBINED_PORTS(sport, hnum)
       mov    0x11a8382(%rip),%eax      # inet6_ehashfn() ...

After: __inet6_lookup_established()

       ...
       push   %rbx
       sub    $0x28,%rsp                # stack is 40 bytes
       mov    0x60(%rsp),%ebp           # sdif
       mov    %r8d,%r14d                # hnum
       shl    $0x10,%r14d               # hnum << 16
       or     %edx,%r14d                # INET_COMBINED_PORTS(sport, hnum)
       mov    0x440(%rdi),%rax          # net->ipv4.tcp_death_row.hashinfo
       mov    0x1194f09(%rip),%r10d     # inet6_ehashfn() ...

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260303235424.3877267-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'doc-netlink-expand-nftables-specification'

Remy D. Farley says:

====================
doc/netlink: Expand nftables specification

Getting out some changes I've accumulated while making nftables work
with Rust netlink-bindings. Hopefully, this will be useful upstream.
====================

Link: https://patch.msgid.link/20260303195638.381642-1-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: nftables: Fill out operation attributes

Filled out operation attributes:
- newtable
- gettable
- deltable
- destroytable
- newchain
- getchain
- delchain
- destroychain
- newrule
- getrule
- getrule-reset
- delrule
- destroyrule
- newset
- getset
- delset
- destroyset
- newsetelem
- getsetelem
- getsetelem-reset
- delsetelem
- destroysetelem
- getgen
- newobj
- getobj
- delobj
- destroyobj
- newflowtable
- getflowtable
- delflowtable
- destroyflowtable

Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-6-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: nftables: Add sub-messages

New sub-messsages:
- log
- match
- numgen
- range

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-5-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: nftables: Update attribute sets

New attribute sets:
- log-attrs
- numgen-attrs
- range-attrs
- compat-target-attrs
- compat-match-attrs
- compat-attrs

Added missing attributes:
- table-attrs (pad, owner)
- set-attrs (type, count)

Added missing checks:
- range-attrs
- expr-bitwise-attrs
- compat-target-attrs
- compat-match-attrs
- compat-attrs

Annotated doc comment or associated enum:
- batch-attrs
- verdict-attrs
- expr-payload-attrs

Fixed byte order:
- nft-counter-attrs
- expr-counter-attrs
- rule-compat-attrs

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-4-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

doc/netlink: nftables: Add definitions

New enums/flags:
- payload-base
- range-ops
- registers
- numgen-types
- log-level
- log-flags

Added missing enumerations:
- bitwise-ops

Annotated doc comment or associated enum:
- bitwise-ops

Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: Remy D. Farley <one-d-wide@protonmail.com>
Link: https://patch.msgid.link/20260303195638.381642-3-one-d-wide@protonmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>