Chia-Yu Chang [Sat, 31 Jan 2026 22:25:12 +0000 (23:25 +0100)]
tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.
Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:11 +0000 (23:25 +0100)]
tcp: accecn: fallback outgoing half link to non-AccECN
According to Section 3.2.2.1 of AccECN spec (RFC9768), if the Server
is in AccECN mode and in SYN-RCVD state, and if it receives a value of
zero on a pure ACK with SYN=0 and no SACK blocks, for the rest of the
connection the Server MUST NOT set ECT on outgoing packets and MUST
NOT respond to AccECN feedback. Nonetheless, as a Data Receiver it
MUST NOT disable AccECN feedback.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:10 +0000 (23:25 +0100)]
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in
AccECN mode MUST NOT set ECT on any packet for the rest of the connection,
if it has received or sent at least one valid SYN or Acceptable SYN/ACK
with (AE,CWR,ECE) = (0,0,0) during the handshake.
In addition, a host in AccECN mode that is feeding back the IP-ECN
field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the
latest valid SYN or acceptable SYN/ACK to arrive.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:09 +0000 (23:25 +0100)]
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
For Accurate ECN, the first SYN/ACK sent by the TCP server shall set
the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the
capability negotiation. However, if the TCP server needs to retransmit
such a SYN/ACK (for example, because it did not receive an ACK
acknowledging its SYN/ACK, or received a second SYN requesting AccECN
support), the TCP server retransmits the SYN/ACK without the AccECN
option. This is because the SYN/ACK may be lost due to congestion, or a
middlebox may block the AccECN option. Furthermore, if this retransmission
also times out, to expedite connection establishment, the TCP server
should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the
AccECN option, while maintaining AccECN feedback mode.
This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:08 +0000 (23:25 +0100)]
tcp: add TCP_SYNACK_RETRANS synack_type
Before this patch, retransmitted SYN/ACK did not have a specific
synack_type; however, the upcoming patch needs to distinguish between
retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to
transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this
patch introduces a new synack_type (TCP_SYNACK_RETRANS).
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:07 +0000 (23:25 +0100)]
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an
AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it
SHOULD attempt to negotiate the use of AccECN at least one more time
by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on
the first retransmitted SYN (using the usual retransmission time-outs).
If this first retransmission also fails to be acknowledged, in
deployment scenarios where AccECN path traversal might be problematic,
the TCP Client SHOULD send subsequent retransmissions of the SYN with
the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0).
According to Sections 3.1.2 and 3.1.3 of AccECN spec (RFC9768).
In Section 3.1.2, it says an AccECN implementation has no need to
recognize or support the Server response labelled 'Nonce' or ECN-nonce
feedback more generally, as RFC 3540 has been reclassified as Historic.
AccECN is compatible with alternative ECN feedback integrity approaches
to the nonce. The SYN/ACK labelled 'Nonce' with (AE,CWR,ECE) = (1,0,1)
is reserved for future use. A TCP Client (A) that receives such a SYN/ACK
follows the procedure for forward compatibility given in Section 3.1.3.
Then in Section 3.1.3, it says if a TCP Client has sent a SYN requesting
AccECN feedback with (AE,CWR,ECE) = (1,1,1) then receives a SYN/ACK with
the currently reserved combination (AE,CWR,ECE) = (1,0,1) but it does not
have logic specific to such a combination, the Client MUST enable AccECN
mode as if the SYN/ACK onfirmed that the Server supported AccECN and as
if it fed back that the IP-ECN field on the SYN had arrived unchanged.
Fixes: 3cae34274c79 ("tcp: accecn: AccECN negotiation"). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:05 +0000 (23:25 +0100)]
tcp: disable RFC3168 fallback identifier for CC modules
When AccECN is not successfully negociated for a TCP flow, it defaults
fallback to classic ECN (RFC3168). However, L4S service will fallback
to non-ECN.
This patch enables congestion control module to control whether it
should not fallback to classic ECN after unsuccessful AccECN negotiation.
A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this
behavior expected by the CA.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:04 +0000 (23:25 +0100)]
tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
Two flags for congestion control (CC) module are added in this patch
related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN)
defines that the CC expects to negotiate AccECN functionality using the
ECE, CWR and AE flags in the TCP header.
Second, during ECN negotiation, ECT(0) in the IP header is used. This
patch enables CC to control whether ECT(0) or ECT(1) should be used on
a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the
expected ECT value in the IP header by the CA when not-yet initialized
for the connection.
The detailed AccECN negotiaotn can be found in IETF RFC9768.
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:03 +0000 (23:25 +0100)]
selftests/net: gro: add self-test for TCP CWR flag
Currently, GRO does not flush packets when the CWR bit is set.
A corresponding self-test is being added, in which the CWR flag
is set for two consecutive packets, but the first packet with the
CWR flag set will not be flushed immediately.
Ilpo Järvinen [Sat, 31 Jan 2026 22:25:02 +0000 (23:25 +0100)]
gro: flushing when CWR is set negatively affects AccECN
As AccECN may keep CWR bit asserted due to different
interpretation of the bit, flushing with GRO because of
CWR may effectively disable GRO until AccECN counter
field changes such that CWR-bit becomes 0.
There is no harm done from not immediately forwarding the
CWR'ed segment with RFC3168 ECN.
Ilpo Järvinen [Sat, 31 Jan 2026 22:25:01 +0000 (23:25 +0100)]
tcp: try to avoid safer when ACKs are thinned
Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta in AccECN processing based
on it. If the packets ACKed per ACK tends to be large, don't
conservatively assume ACE field overflow.
This patch uses the existing 2-byte holes in the rx group for new
u16 variables withtout creating more holes. Below are the pahole
outcomes before and after this patch:
====================
net: phy: remove modalias-based MDIO device bus matching
modalias-based MDIO device bus matching has only one user (dsa-loop),
where we can replace modalias-based matching with a simple custom
match function. This, and first patch of the series, lay the foundation
for removing modalias-based matching.
====================
Heiner Kallweit [Sat, 31 Jan 2026 17:40:00 +0000 (18:40 +0100)]
net: phy: remove modalias-based mdio bus matching
Last user dsa_loop has been migrated away from modalias-based matching,
so we can remove this feature now. It was the only user of MDIO_NAME_SIZE,
so remove also this constant.
Heiner Kallweit [Sat, 31 Jan 2026 17:38:56 +0000 (18:38 +0100)]
net: dsa: loop: remove MDIO device modalias
This change is a prerequisite for removing the MDIO device modalias,
as dsa_loop is the only user. Switch from modalias to a custom
bus match function.
Heiner Kallweit [Sat, 31 Jan 2026 17:37:15 +0000 (18:37 +0100)]
net: ethernet: adi: make name member of struct adin1110_cfg a pointer
Primary reason for this change is to remove the misuse of MDIO_NAME_SIZE
here, so that this constant can be removed in a follow-up patch.
Use case here is simply a chip name w/o any relationship to a MDIO
device. Also there's no need to reserve a longer char array, so make
the name a pointer.
Cosmin Ratiu [Wed, 28 Jan 2026 11:25:35 +0000 (13:25 +0200)]
devlink: Refactor devlink_rate_nodes_check
devlink_rate_nodes_check() was used to verify there are no devlink rate
nodes created when switching the esw mode.
Rate management code is about to become more complex, so refactor this
function:
- remove unused param 'mode'.
- add a new 'rate_filter' param.
- rename to devlink_rates_check().
- expose devlink_rate_is_node() to be used as a rate filter.
This makes it more usable from multiple places, so use it from those
places as well.
Cosmin Ratiu [Wed, 28 Jan 2026 11:25:33 +0000 (13:25 +0200)]
devlink: Reverse locking order for nested instances
Commit [1] defined the locking expectations for nested devlink
instances: the nested-in devlink instance lock needs to be acquired
before the nested devlink instance lock. The code handling devlink rels
was architected with that assumption in mind.
There are no actual users of double locking yet but that is about to
change in the upcoming patches in the series.
Code operating on nested devlink instances will require also obtaining
the nested-in instance lock, but such code may already be called from a
variety of places with the nested devlink instance lock. Then, there's
no way to acquire the nested-in lock other than making sure that all
callers acquire it first.
Reversing the nested lock order allows incrementally acquiring the
nested-in instance lock when needed (perhaps even a chain of locks up to
the root) without affecting any caller.
The only affected use of nesting is devlink_nl_nested_fill(), which
iterates over nested devlink instances with the RCU lock, without
locking them, so there's no possibility of deadlock.
So this commit just updates a comment regarding the nested locks.
The dwmac core has no support for SGMII without using its integrated
PCS. Thus, PHY_INTF_SEL_SGMII is only supported when this block is
present, and it makes no sense for stmmac_get_phy_intf_sel() to decode
this.
None of the platform glue users that use stmmac_get_phy_intf_sel()
directly accept PHY_INTF_SEL_SGMII as a valid mode.
Check whether a PCS will be used by the driver for the interface mode,
and if it is the integrated PCS, query the integrated PCS for the
phy_intf_sel_i value to use.
net: stmmac: move most PCS register definitions to stmmac_pcs.c
Move most of the PCS register offset definitions to stmmac_pcs.c.
Since stmmac_pcs.c only ever passes zero into the register offset
macros, remove that ability, making them simple constant integer
definitions.
Add appropriate descriptions of the registers, pointing out their
similarity with their IEEE 802.3 counterparts. Make use of the
BMSR definitions for the GMAC_AN_STATUS register and remove the
driver private versions.
Note that BMSR_LSTATUS is non-low-latching, unlike it's 802.3z
counterpart.
net: stmmac: clear half-duplex caps where unsupported
Where a core supports hardware features, but does not indicate support
for half-duplex, clear phylink's half-duplex 1G, 100M and 10M
capability bits to disallow half-duplex operation and advertisement of
these link modes.
This will avoid the need for special code in the PCS driver to do this
based on the ESTATUS register bits, as the support in the PCS is
dependent on the same synthesis choice as the MAC core.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Mohd Ayaan Anwar <mohd.anwar@oss.qualcomm.com> Link: https://patch.msgid.link/E1vlmOQ-00000006zuz-0ffN@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Add support for Renesas RZ/G3L GBETH
From: Biju Das <biju.das.jz@bp.renesas.com>
The Renesas RZ/G3L GBETH IP uses Synopsys DesignWare MAC version 5.30
compared to other Renesas SoC such as RZ/V2H that use MAC version 5.20.
The RZ/G3L GBETH requires an extra clock compared to RZ/G3E and has pps
interrupts. Document the Renesas RZ/G3L GBETH IP in bindings and add
support for the RZ/G3L GBETH in dwmac-renesas-gbeth glue driver.
====================
Biju Das [Sat, 31 Jan 2026 16:12:43 +0000 (16:12 +0000)]
net: stmmac: dwmac-renesas-gbeth: Add support for RZ/G3L SoC
Compared to other Renesas GBETH stmmac glue drivers, RZ/G3L GBETH IP use
the version Synopsys DesignWare MAC (version 5.30). It has an extra clock
compared to RZ/V2H and has ptp_pps_o interrupts. Add support for RZ/G3L
GBETH by reusing device data of RZ/V2H and can be extended to add other
functionalities later.
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Link: https://patch.msgid.link/20260131161250.5047-3-biju.das.jz@bp.renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add device tree binding support for the Gigabit Ethernet (GBETH) IP on
Renesas RZ/G3L SoC. This SoC uses different Synopsys DesignWare MAC
version 5.30 compared to RZ/G3E.
RZ/G3L requires an extra clock compared to RZ/G3E and has pps interrupts.
Add a new compatible string "renesas,r9a08g046-gbeth" for RZ/G3L SoC and
update the schema to handle hardware differences between SoC variants.
Extend the base snps,dwmac.yaml schema to accommodate the PPS interrupts.
Acked-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Link: https://patch.msgid.link/20260131161250.5047-2-biju.das.jz@bp.renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
mptcp: implement .read_sock and .splice_read
This series is a preparation work for future in-kernel MPTCP sockets
usage. Here, two interfaces are implemented: read_sock and splice_read.
As a result of this series, splice() with MPTCP sockets -- which was
already supported -- is now improved.
- Patches 1-2: .read_sock implementation
- Patches 3-4: .splice_read implementation
- Patches 5-6: validate splice() support with MPTCP sockets.
====================
Geliang Tang [Fri, 30 Jan 2026 19:24:28 +0000 (20:24 +0100)]
selftests: mptcp: add splice io mode
This patch adds a new 'splice' io mode for mptcp_connect to test
the newly added read_sock() and splice_read() functions of MPTCP.
do_splice() efficiently transfers data directly between two file
descriptors (infd and outfd) without copying to userspace, using
Linux's splice() system call.
Geliang Tang [Fri, 30 Jan 2026 19:24:25 +0000 (20:24 +0100)]
mptcp: implement .read_sock
Current in-kernel TCP sockets -- i.e. from nvme_tcp_try_recv() -- need
to call .read_sock interface of struct proto_ops, but it's not
implemented in MPTCP.
This patch implements it with reference to __tcp_read_sock() and
__mptcp_recvmsg_mskq().
Corresponding to tcp_recv_skb(), a new helper for MPTCP named
mptcp_recv_skb() is added to peek a skb from sk->sk_receive_queue.
Compared with __mptcp_recvmsg_mskq(), mptcp_read_sock() uses
sk->sk_rcvbuf as the max read length. The LISTEN status is checked
before the while loop, and mptcp_recv_skb() and mptcp_cleanup_rbuf()
are invoked after the loop. In the loop, all flags checks for
__mptcp_recvmsg_mskq() are removed.
Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-2-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
ptp: vmclock: Add VM generation counter and ACPI notification
Similarly to live migration, starting a VM from some serialized state
(aka snapshot) is an event which calls for adjusting guest clocks, hence
a hypervisor should increase the disruption_marker before resuming the
VM vCPUs, letting the guest know.
However, loading a snapshot, is slightly different than live migration,
especially since we can start multiple VMs from the same serialized
state. Apart from adjusting clocks, the guest needs to take additional
action during such events, e.g. recreate UUIDs, reset network
adapters/connections, reseed entropy pools, etc. These actions are not
necessary during live migration. This calls for a differentiation
between the two triggering events.
We differentiate between the two events via an extra field in the
vmclock_abi, called vm_generation_counter. Whereas hypervisors should
increase the disruption marker in both cases, they should only increase
vm_generation_counter when a snapshot is loaded in a VM (not during live
migration).
Additionally, we attach an ACPI notification to VMClock. Implementing
the notification is optional for the device. VMClock device will declare
that it implements the notification by setting
VMCLOCK_FLAG_NOTIFICATION_PRESENT bit in vmclock_abi flags. Hypervisors
that implement the notification must send an ACPI notification every
time seq_count changes to an even number. The driver will propagate
these notifications to userspace via the poll() interface.
====================
David Woodhouse [Fri, 30 Jan 2026 17:36:06 +0000 (17:36 +0000)]
ptp: ptp_vmclock: return TAI not UTC
To output UTC would involve complex calculations about whether the time
elapsed since the reference time has crossed the end of the month when
a leap second takes effect. I've prototyped that, but it made me sad.
Much better to report TAI, which is what PHCs should do anyway.
And much much simpler.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Babis Chalios <bchalios@amazon.es> Tested-by: Takahiro Itazuri <itazur@amazon.com> Link: https://patch.msgid.link/20260130173704.12575-8-itazur@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Woodhouse [Fri, 30 Jan 2026 17:36:04 +0000 (17:36 +0000)]
ptp: ptp_vmclock: add 'VMCLOCK' to ACPI device match
As we finalised the spec, we spotted that vmgenid actually says that the
_HID is supposed to be hypervisor-specific. Although in the 13 years
since the original vmgenid doc was published, nobody seems to have cared
about using _HID to distinguish between implementations on different
hypervisors, and we only ever use the _CID.
For consistency, match the _CID of "VMCLOCK" too.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Babis Chalios <bchalios@amazon.es> Tested-by: Takahiro Itazuri <itazur@amazon.com> Link: https://patch.msgid.link/20260130173704.12575-6-itazur@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David Woodhouse [Fri, 30 Jan 2026 17:36:03 +0000 (17:36 +0000)]
ptp: ptp_vmclock: Add device tree support
Add device tree support to the ptp_vmclock driver, allowing it to probe
via device tree in addition to ACPI.
Handle optional interrupt for clock disruption notifications, mirroring
the ACPI notification behaviour.
Although the interrupt is marked as 'optional' in the DT bindings, if
the device *advertises* the VMCLOCK_FLAG_NOTIFICATION_ABSENT then it
*should* have an interrupt. The driver will refuse to initialize if not.
Babis Chalios [Fri, 30 Jan 2026 17:36:01 +0000 (17:36 +0000)]
ptp: vmclock: support device notifications
Add optional support for device notifications in VMClock. When
supported, the hypervisor will send a device notification every time it
updates the seq_count to a new even value.
Moreover, add support for poll() in VMClock as a means to propagate this
notification to user space. poll() will return a POLLIN event to
listeners every time seq_count changes to a value different than the one
last seen (since open() or last read()/pread()). This means that when
poll() returns a POLLIN event, listeners need to use read() to observe
what has changed and update the reader's view of seq_count. In other
words, after a poll() returned, all subsequent calls to poll() will
immediately return with a POLLIN event until the listener calls read().
The device advertises support for the notification mechanism by setting
flag VMCLOCK_FLAG_NOTIFICATION_PRESENT in vmclock_abi flags field. If
the flag is not present the driver won't setup the ACPI notification
handler and poll() will always immediately return POLLHUP.
Signed-off-by: Babis Chalios <bchalios@amazon.es> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Takahiro Itazuri <itazur@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Takahiro Itazuri <itazur@amazon.com> Link: https://patch.msgid.link/20260130173704.12575-3-itazur@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Babis Chalios [Fri, 30 Jan 2026 17:36:00 +0000 (17:36 +0000)]
ptp: vmclock: add vm generation counter
Similar to live migration, loading a VM from some saved state (aka
snapshot) is also an event that calls for clock adjustments in the
guest. However, guests might want to take more actions as a response to
such events, e.g. as discarding UUIDs, resetting network connections,
reseeding entropy pools, etc. These are actions that guests don't
typically take during live migration, so add a new field in the
vmclock_abi called vm_generation_counter which informs the guest about
such events.
Hypervisor advertises support for vm_generation_counter through the
VMCLOCK_FLAG_VM_GEN_COUNTER_PRESENT flag. Users need to check the
presence of this bit in vmclock_abi flags field before using this flag.
Signed-off-by: Babis Chalios <bchalios@amazon.es> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Takahiro Itazur <itazur@amazon.com> Link: https://patch.msgid.link/20260130173704.12575-2-itazur@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Many network drivers have unnecessary empty module_init and module_exit
functions. Remove them (including some that just print a message). Note
that if a module_init function exists, a module_exit function must also
exist; otherwise, the module cannot be unloaded.
Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Acked-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Ping-Ke Shih <pkshih@realtek.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Link: https://patch.msgid.link/20260131004327.18112-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: ethernet: use module_pci_driver; remove useless driver versions
The module version is useless, and the only thing these drivers' init
routines did besides pci_register_driver was to print the driver name
and/or version.
Acked-by: Francois Romieu <romieu@fr.zoreil.com> (epic100) Reviewed-by: Simon Horman <horms@kernel.org> (epic100, sis900) Reviewed-by: Sai Krishna <saikrishnag@marvell.com> (epic100) Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Link: https://patch.msgid.link/20260131022441.56274-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: phy: dp83867: Always program R/SGMII enable bits
The hardware designers at my company neglected to read the datasheet for
this PHY and did not add appropriate resistors to configure it for
SGMII. Add support for configuring the it based on phy-mode instead of
relying on the resistors for a suitable default.
====================
Sean Anderson [Thu, 29 Jan 2026 17:12:05 +0000 (12:12 -0500)]
net: phy: dp83867: Always program R/SGMII enable bits
If the board designers have neglected to populate the appropriate
resistors on the strapping pins then the phy may default to the wrong
interface mode. Enable/disable the RGMII/SGMII enable bits as necessary
to select the correct interface.
The dp83867 strapping pins have four levels and typically configure two
features at once. LED_0 controls both port mirroring and whether SGMII
is enabled. If it is pulled to VDDIO, both port mirroring and SGMII
will be enabled. For variants of the dp83867 that do not support SGMII,
this will prevent data from being transferred. As we now explicitly set
the SGMII and RGMII enable bits, we do not need to detect whether SGMII
has been inadvertently enabled.
Sean Anderson [Thu, 29 Jan 2026 17:12:04 +0000 (12:12 -0500)]
net: phy: dp83867: Program TX FIFO for all interfaces
All supported interfaces use the TX FIFO register at least some of the
time, so there's no point in checking the interface. Retain the check
for the RX FIFO level since it is only used by SGMII.
The issue is that the existing code uses ethtool_get_flow_spec_ring_vf,
which will return a non-zero value if the ring_cookie is set to
RX_CLS_FLOW_DISC, which then causes bnxt_add_ntuple_cls_rule to return
-EOPNOTSUPP because it thinks the user is trying to set an ntuple filter
for a vf.
Fix this by first checking that the ring_cookie is not RX_CLS_FLOW_DISC.
After this patch, ntuple filters for drops can be added:
% sudo ethtool -U eth0 flow-type udp4 src-ip 1.1.1.1 action -1
Added rule with ID 0
% ethtool -n eth0
44 RX rings available
Total 1 rules
Filter: 0
Rule Type: UDP over IPv4
Src IP addr: 1.1.1.1 mask: 0.0.0.0
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 0 mask: 0xffff
Action: Drop
Jakub Kicinski [Sat, 31 Jan 2026 20:30:29 +0000 (12:30 -0800)]
tools: ynl: cli: make the output compact
Make the default (non-JSON) output more compact. Looking at RSS
context dumps is pretty much impossible without this, because
default print shows the indirection table with line per entry:
Jakub Kicinski [Sat, 31 Jan 2026 22:54:54 +0000 (14:54 -0800)]
docs: networking: mention that RSS table should be 4x the queue count
Spell out the recommendation that the RSS table should be
4x the queue count to avoid traffic imbalance. Include minor
rephrasing and removal of the explicit 128 entry example
since a 128 entry table is inadequate on modern machines.
Jakub Kicinski [Sat, 31 Jan 2026 22:54:53 +0000 (14:54 -0800)]
selftests: drv-net: rss: validate min RSS table size
Add a test which checks that the RSS table is at least 4x the max
queue count supported by the device. The original RSS spec from
Microsoft stated that the RSS indirection table should be 2 to 8
times the CPU count, presumably assuming queue per CPU. If the
CPU count is not a power of two, however, a power-of-2 table
2x larger than queue count results in a 33% traffic imbalance.
Validate that the indirection table is at least 4x the queue
count. This lowers the imbalance to 16% which empirically
appears to be more acceptable to memcache-like workloads.
This first 2 patches are by Biju Das, target the rcar_canfd driver and
add support for FD-only mode.
Lad Prabhakar's patches, also for the rcar_canfd driver add support
for the RZ/T2H SoC.
The last 2 patches are by Michael Tretter and me, target the sja1000
driver and clean up the CAN state handling.
* tag 'linux-can-next-for-6.20-20260131' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: sja1000: sja1000_err(): use error counter for error state
can: sja1000: sja1000_err(): make use of sja1000_get_berr_counter() to read error counters
can: rcar_canfd: Add RZ/T2H support
dt-bindings: can: renesas,rcar-canfd: Document RZ/T2H and RZ/N2H SoCs
dt-bindings: can: renesas,rcar-canfd: Document RZ/V2H(P) and RZ/V2N SoCs
dt-bindings: can: renesas,rcar-canfd: Specify reset-names
can: rcar_canfd: Add support for FD-Only mode
dt-bindings: can: renesas,rcar-canfd: Document renesas,fd-only property
====================
As reported by Al Viro, the TCP ULP support for SMC is fundamentally
broken. The implementation attempts to convert an active TCP socket
into an SMC socket by modifying the underlying `struct file`, dentry,
and inode in-place, which violates core VFS invariants that assume
these structures are immutable for an open file, creating a risk of
use after free errors and general system instability.
Given the severity of this design flaw and the fact that cleaner
alternatives (e.g., LD_PRELOAD, BPF) exist for legacy application
transparency, the correct course of action is to remove this feature
entirely.
net: ax25: remove plumbing for never-implemented DAMA Master support
The AX25_DAMA_MASTER option has been unimplemented and marked broken
ever since it was introduced in 2007 in commit 954b2e7f4c37 ("[NET]
AX.25 Kconfig and docs updates and fixes"). At this point, it is very
unlikely it will be implemented. Remove it.
====================
net: wwan: add NMEA port type support
The series introduces a long discussed NMEA port type support for the
WWAN subsystem. There are two goals. From the WWAN driver perspective,
NMEA exported as any other port type (e.g. AT, MBIM, QMI, etc.). From
user space software perspective, the exported chardev belongs to the
GNSS class what makes it easy to distinguish desired port and the WWAN
device common to both NMEA and control (AT, MBIM, etc.) ports makes it
easy to locate a control port for the GNSS receiver activation.
Done by exporting the NMEA port via the GNSS subsystem with the WWAN
core acting as proxy between the WWAN modem driver and the GNSS
subsystem.
The series starts from a cleanup patch. Then three patches prepares the
WWAN core for the proxy style operation. Followed by a patch introding a
new WWNA port type, integration with the GNSS subsystem and demux. The
series ends with a couple of patches that introduce emulated EMEA port
to the WWAN HW simulator.
The series is the product of the discussion with Loic about the pros and
cons of possible models and implementation. Also Muhammad and Slark did
a great job defining the problem, sharing the code and pushing me to
finish the implementation. Daniele has caught an issue on driver
unloading and suggested an investigation direction. What was concluded
by Loic. Many thanks.
====================
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:57 +0000 (14:21 +0800)]
net: wwan: hwsim: support NMEA port emulation
Support NMEA port emulation for the WWAN core GNSS port testing purpose.
Emulator produces pair of GGA + RMC sentences every second what should
be enough to fool gpsd into believing it is working with a NMEA GNSS
receiver.
If the GNSS system is enabled then one NMEA port will be created
automatically for the simulated WWAN device. Manual NMEA port creation
is not supported at the moment.
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:56 +0000 (14:21 +0800)]
net: wwan: hwsim: refactor to support more port types
Just introduced WWAN NMEA port type needs a testing option. The WWAN HW
simulator was developed with the AT port type in mind and cannot be
easily extended. Refactor it now to make it capable to support more port
types.
No big functional changes, mostly renaming with a little code
rearrangement.
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:55 +0000 (14:21 +0800)]
net: wwan: add NMEA port support
Many WWAN modems come with embedded GNSS receiver inside and have a
dedicated port to output geopositioning data. On the one hand, the
GNSS receiver has little in common with WWAN modem and just shares a
host interface and should be exported using the GNSS subsystem. On the
other hand, GNSS receiver is not automatically activated and needs a
generic WWAN control port (AT, MBIM, etc.) to be turned on. And a user
space software needs extra information to find the control port.
Introduce the new type of WWAN port - NMEA. When driver asks to register
a NMEA port, the core allocates common parent WWAN device as usual, but
exports the NMEA port via the GNSS subsystem and acts as a proxy between
the device driver and the GNSS subsystem.
From the WWAN device driver perspective, a NMEA port is registered as a
regular WWAN port without any difference. And the driver interacts only
with the WWAN core. From the user space perspective, the NMEA port is a
GNSS device which parent can be used to enumerate and select the proper
control port for the GNSS receiver management.
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:54 +0000 (14:21 +0800)]
net: wwan: core: split port unregister and stop
Upcoming GNSS (NMEA) port type support requires exporting it via the
GNSS subsystem. On another hand, we still need to do basic WWAN core
work: call the port stop operation, purge queues, release the parent
WWAN device, etc. To reuse as much code as possible, split the port
unregistering function into the deregistration of a regular WWAN port
device, and the common port tearing down code.
In order to keep more code generic, break the device_unregister() call
into device_del() and put_device(), which release the port memory
uniformly.
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:53 +0000 (14:21 +0800)]
net: wwan: core: split port creation and registration
Upcoming GNSS (NMEA) port type support requires exporting it via the
GNSS subsystem. On another hand, we still need to do basic WWAN core
work: find or allocate the WWAN device, make it the port parent, etc. To
reuse as much code as possible, split the port creation function into
the registration of a regular WWAN port device, and basic port struct
initialization.
To be able to use put_device() uniformly, break the device_register()
call into device_initialize() and device_add() and call device
initialization earlier.
While at it, fix a minor number leak upon WWAN port registration
failure.
We need information about existing WWAN device children since we remove
the device after removing the last child. Previously, we tracked users
implicitly by checking whether ops was registered and existence of a
child device of the wwan_class class. Upcoming GNSS (NMEA) port type
support breaks this approach by introducing a child device of the
gnss_class class.
And a modem driver can easily trigger a kernel Oops by removing regular
(e.g., MBIM, AT) ports first and then removing a GNSS port. The WWAN
device will be unregistered on removal of a last regular WWAN port. And
subsequent GNSS port removal will cause NULL pointer dereference in
simple_recursive_removal().
In order to support ports of classes other than wwan_class, switch to
explicit references counting. Introduce a dedicated counter to the WWAN
device struct, increment it on every wwan_create_dev() call, decrement
on wwan_remove_dev(), and actually unregister the WWAN device when there
are no more references.
Run tested with wwan_hwsim with NMEA support patches applied and
different port removing sequences.
Sergey Ryazanov [Mon, 26 Jan 2026 06:21:51 +0000 (14:21 +0800)]
net: wwan: core: remove unused port_id field
It was used initially for a port id allocation, then removed, and then
accidently introduced again, but it is still unused. Drop it again to
keep code clean.
Add debugfs hooks to display tx/rx rings for each napi
vector.
Note that the cloning mechanism in fbnic_ethtool.c for configuration
changes protects against concurrency issues with simultaneous config
changes along with debugs ring accesses.
The configuration switch builds up the new configuration offline,
takes the current config down, which removes the debugfs nv files, and
switches to the new configuration. The new configuration is brought
up which brings the debugfs files back on top of the new configuration
rings.
The interaction with fbnic_queue_stop() and fbnic_queue_start() will
similarly delete and add the files for the indicated vector.
net: usb: remove unnecessary get_drvinfo code and driver versions
Many USB network drivers define get_drvinfo functions which add no
value over usbnet_get_drvinfo, only setting the driver name and
version. usbnet_get_drvinfo automatically sets the driver name, and
separate driver versions are now frowned upon in the kernel. Remove all
driver versions and replace these get_drvinfo functions with references
to usbnet_get_drvinfo where possible. Where that is not possible,
remove unnecessary code to set the driver name. Also remove two
unnecessary initializations from aqc111_get_drvinfo, an inaccurate
comment in pegasus.c, and an unused macro in catc.c.
Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Peter Korsgaard <peter@korsgaard.com> (for dm9601.c) Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Link: https://patch.msgid.link/20260129042435.13395-2-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds SNMP drop count increment for the packets in
per NUMA queues which were introduced in commit b650bf0977d3
("udp: remove busylock and add per NUMA queues"). note that SNMP
counters are incremented currently by the caller for skb. And
that these skbs on the intermediate queue cannot be counted
there so need similar logic in their error path.
Signed-off-by: Mahdi Faramarzpour <mahdifrmx@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260129083806.204752-1-mahdifrmx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Thu, 29 Jan 2026 15:34:58 +0000 (15:34 +0000)]
tcp: reduce tcp sockets size by one cache line
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU,
slub has to use extra storage for the freelist pointer after each
object, because slub assumes that any bit in the object
can be used by RCU readers.
Because proto_register() is also using SLAB_HWCACHE_ALIGN,
this forces slub to use one extra cache line per object.
We can instead put the slub freelist anywhere in the object,
granted the concurrent RCU readers are not supposed to
use the pointer value.
Add a new (struct sock)sk_freeptr field, in an union
with sk_rcu: No RCU readers would need to look at sk_rcu,
which is only used at free phase.
Add functions to support xmit along with TSO/GSO.
Also, add functions to handle TX completion events in the NAPI context.
This commit introduces the fundamental transmit data path
Add support to receive packet using NAPI, build and deliver the skb
to stack. With help of meta data available in completions, fill the
appropriate information in skb.
Geetha sowjanya [Wed, 28 Jan 2026 02:24:48 +0000 (07:54 +0530)]
octeontx2-pf: cn10k/cn20k: Update count_eot in NPA_LF_AURA_BATCH_FREE0
This patch updates the count_eot calculation for CN20K devices.
Where the count_eot feild extended to 2 bits, while maintaining
CN10K compatibility where only bit 0 is used.
Jakub Kicinski [Fri, 30 Jan 2026 03:17:42 +0000 (19:17 -0800)]
Merge tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Another fairly large set of changes, notably:
- cfg80211/mac80211
- most of EPPKE/802.1X over auth frames support
- additional FTM capabilities
- split up drop reasons better, removing generic RX_DROP
- NAN cleanups/fixes
- ath11k:
- support for Channel Frequency Response measurement
- ath12k:
- support for the QCC2072 chipset
- iwlwifi:
- partial NAN support
- UNII-9 support
- some UHR/802.11bn FW APIs
- remove most of MLO/EHT from iwlmvm
(such devices use iwlmld)
- rtw89:
- preparations for RTL8922DE support
* tag 'wireless-next-2026-01-29' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (184 commits)
wifi: iwlegacy: add missing mutex protection in il4965_store_tx_power()
wifi: iwlegacy: add missing mutex protection in il3945_store_measurement()
wifi: mac80211: use u64_stats_t with u64_stats_sync properly
wifi: p54: Fix memory leak in p54_beacon_update()
wifi: cfg80211: treat deprecated INDOOR_SP_AP_OLD control value as LPI mode
wifi: rtw88: sdio: Migrate to use sdio specific shutdown function
wifi: rsi: sdio: Migrate to use sdio specific shutdown function
sdio: Provide a bustype shutdown function
wifi: nl80211/cfg80211: support operating as RSTA in PMSR FTM request
wifi: nl80211/cfg80211: add negotiated burst period to FTM result
wifi: nl80211/cfg80211: clarify periodic FTM parameters for non-EDCA based ranging
wifi: nl80211/cfg80211: add new FTM capabilities
wifi: iwlwifi: rename struct iwl_mcc_allowed_ap_type_cmd::offset_map
wifi: iwlwifi: mvm: Remove link_id from time_events
wifi: iwlwifi: mld: change cluster_id type to u8 array
wifi: iwlwifi: support V13 of iwl_lari_config_change_cmd
wifi: iwlwifi: split bios_value_u32 to separate the header
wifi: iwlwifi: uefi: cache the DSM functions
wifi: iwlwifi: acpi: cache the DSM functions
wifi: iwlwifi: mvm: Cleanup MLO code
...
====================