Cong Wang [Fri, 30 Jan 2026 21:20:21 +0000 (13:20 -0800)]
MAINTAINERS: Remove myself from TC maintainers
Recently TC maintainer Jamal intentionally a broke reasonable
use case:
https://lore.kernel.org/netdev/aG10rqwjX6elG1Gx@pop-os.localdomain/
Although I tried my best to help by:
1) Strongly objecting this breakage from the very beginning
2) Reverting it and offering a much better solution
3) Offering Jamal for video chat on 8 Jul 2025 and 26 Nov 2025
None of them worked.
So it makes no sense for me to continue caring about this subsystem.
Most importantly, intentionally breaking reasonable use cases is
against my moral, I don't want to get ashamed.
Qingfang Deng [Thu, 29 Jan 2026 01:29:02 +0000 (09:29 +0800)]
ppp: enable TX scatter-gather
PPP channels using chan->direct_xmit prepend the PPP header to a skb and
call dev_queue_xmit() directly. In this mode the skb does not need to be
linear, but the PPP netdevice currently does not advertise
scatter-gather features, causing unnecessary linearization and
preventing GSO.
Enable NETIF_F_SG and NETIF_F_FRAGLIST on PPP devices. In case a linear
buffer is required (PPP compression, multilink, and channels without
direct_xmit), call skb_linearize() explicitly.
Alexei Lazar [Tue, 3 Feb 2026 07:30:21 +0000 (09:30 +0200)]
net/mlx5e: Extend TC max ratelimit using max_bw_value_msb
The per-TC rate limit was restricted to 255 Gbps due to the 8-bit
max_bw_value field in the QETC register.
This limit is insufficient for newer, higher-bandwidth NICs.
Extend the rate limit by using the full 16-bit max_bw_value field.
This allows the finer 100Mbps granularity to be used for rates up to
~6.5 Tbps, instead of switching to 1Gbps granularity at higher rates.
The extended range is only used when the device advertises support
via the qetcr_qshr_max_bw_val_msb capability bit in the QCAM register.
Dragos Tatulea [Tue, 3 Feb 2026 07:21:30 +0000 (09:21 +0200)]
net/mlx5e: SHAMPO, Improve allocation recovery
When memory providers are used, there is a disconnect between the
page_pool size and the available memory in the provider. This means
that the page_pool can run out of memory if the user didn't provision
a large enough buffer.
Under these conditions, mlx5 gets stuck trying to allocate new
buffers without being able to release existing buffers. This happens due
to the optimization introduced in commit 4c2a13236807
("net/mlx5e: RX, Defer page release in striding rq for better recycling")
which delays WQE releases to increase the chance of page_pool direct
recycling. The optimization was developed before memory providers
existed and this circumstance was not considered.
This patch unblocks the queue by reclaiming pages from WQEs that can be
freed and doing a one-shot retry. A WQE can be freed when:
1) All its strides have been consumed (WQE is no longer in linked list).
2) The WQE pages/netmems have not been previously released.
This reclaim mechanism is useful for regular pages as well.
Note that provisioning memory that can't fill even one MPWQE (64
4K pages) will still render the queue unusable. Same when
the application doesn't release its buffers for various reasons.
Or a combination of the two: a very small buffer is provisioned,
application releases buffers in bulk, bulk size never reached
=> queue is stuck.
Dragos Tatulea [Tue, 3 Feb 2026 07:21:29 +0000 (09:21 +0200)]
net/mlx5e: RX, Drop oversized packets in non-linear mode
Currently the driver has an inconsistent behaviour between modes when it
comes to oversized packets that are not dropped through the physical MTU
check in HW. This can happen for Multi Host configurations where each
port has a different MTU.
Current behavior:
1) Striding RQ in linear mode drops the packet in SW and counts it
with oversize_pkts_sw_drop.
2) Striding RQ in non-linear mode allows it like a normal packet.
3) Legacy RQ can't receive oversized packets by design:
the RX WQE uses MTU sized packet buffers.
This inconsistency is not a violation of the netdev policy [1]
but it is better to be consistent across modes.
This patch aligns (2) with (1) and (3). One exception is added for
LRO: don't drop the oversized packet if it is an LRO packet.
As now rq->hw_mtu always needs to be updated during the MTU change flow,
drop the reset avoidance optimization from mlx5e_change_mtu().
Extract the CQE LRO segments reading into a helper function as it
is used twice now.
[1] Documentation/networking/netdevices.rst#L205
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260203072130.1710255-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The dwmac databook for v3.74a states that lpi_intr_o is a sideband
signal which should be used to ungate the application clock, and this
signal is synchronous to the receive clock. The receive clock can run
at 2.5, 25 or 125MHz depending on the media speed, and can stop under
the control of the link partner. This means that the time it takes to
clear is dependent on the negotiated media speed, and thus can be 8,
40, or 400ns after reading the LPI control and status register.
It has been observed with some aggressive link partners, this clock
can stop while lpi_intr_o is still asserted, meaning that the signal
remains asserted for an indefinite period that the local system has
no direct control over.
The LPI interrupts will still be signalled through the main interrupt
path in any case, and this path is not dependent on the receive clock.
This, since we do not gate the application clock, and the chances of
adding clock gating in the future are slim due to the clocks being
ill-defined, lpi_intr_o serves no useful purpose. Remove the code which
requests the interrupt, and all associated code.
Reported-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> Tested-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com> # Renesas RZ/V2H board Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1vnJbt-00000007YYN-28nm@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: stmmac: fix serdes power methods
The stmmac serdes powerup/powerdown methods are not guaranteed to be
called in a balancing fashion, but these are used to call the generic
PHY subsystem's phy_power_up() and phy_power_down() methods which do
require balanced calls.
This series addresses this by making the stmmac serdes methods balanced.
====================
net: stmmac: move serdes power methods to stmmac_[open|release]()
Move the SerDes power up and down calls for the non-"after linkup"
case out of __stmmac_open() and __stmmac_release() into the
stmmac_open() and stmmac_release() methods, which means the SerDes
will only change power state on administrative changes or suspend/
resume, not while changing the interface MTU.
net: stmmac: add state tracking for legacy serdes power state
Avoid calling the serdes_powerdown() method if we have not had a
preceeding successful call to the serdes_powerup() method. This
avoids unbalancing refcounted resources that may be used in the
these platform glue serdes methods.
====================
net/rds: RDS-TCP protocol and extension improvements
This is subset 3 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we introduce extension headers for byte accounting
and fix several RDS/TCP protocol issues including message preservation
during connection transitions and multipath lane handling.
The entire set can be viewed in the rfc here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
====================
Gerd Rausch [Tue, 3 Feb 2026 05:57:23 +0000 (22:57 -0700)]
net/rds: Trigger rds_send_ping() more than once
Even though a peer may have already received a
non-zero value for "RDS_EXTHDR_NPATHS" from a node in the past,
the current peer may not.
Therefore it is important to initiate another rds_send_ping()
after a re-connect to any peer:
It is unknown at that time if we're still talking to the same
instance of RDS kernel modules on the other side.
Otherwise, the peer may just operate on a single lane
("c_npaths == 0"), not knowing that more lanes are available.
However, if "c_with_sport_idx" is supported,
we also need to check that the connection we accepted on lane#0
meets the proper source port modulo requirement, as we fan out:
Since the exchange of "RDS_EXTHDR_NPATHS" and "RDS_EXTHDR_SPORT_IDX"
is asynchronous, initially we have no choice but to accept an incoming
connection (via "accept") in the first slot ("cp_index == 0")
for backwards compatibility.
But that very connection may have come from a different lane
with "cp_index != 0", since the peer thought that we already understood
and handled "c_with_sport_idx" properly, as indicated by a previous
exchange before a module was reloaded.
In short:
If a module gets reloaded, we recover from that, but do *not*
allow a downgrade to support fewer lanes.
Downgrades would require us to merge messages from separate lanes,
which is rather tricky with the current RDS design.
Each lane has its own sequence number space and all messages
would need to be re-sequenced as we merge, all while
handling "RDS_FLAG_RETRANSMITTED" and "cp_retrans" properly.
Gerd Rausch [Tue, 3 Feb 2026 05:57:22 +0000 (22:57 -0700)]
net/rds: Use the first lane until RDS_EXTHDR_NPATHS arrives
Instead of just blocking the sender until "c_npaths" is known
(it gets updated upon the receipt of a MPRDS PONG message),
simply use the first lane (cp_index#0).
But just using the first lane isn't enough.
As soon as we enqueue messages on a different lane, we'd run the risk
of out-of-order delivery of RDS messages.
Earlier messages enqueued on "cp_index == 0" could be delivered later
than more recent messages enqueued on "cp_index > 0", mostly because of
possible head of line blocking issues causing the first lane to be
slower.
To avoid that, we simply take a snapshot of "cp_next_tx_seq" at the
time we're about to fan-out to more lanes.
Then we delay the transmission of messages enqueued on other lanes
with "cp_index > 0" until cp_index#0 caught up with the delivery of
new messages (from "cp_send_queue") as well as in-flight
messages (from "cp_retrans") that haven't been acknowledged yet
by the receiver.
We also add a new counter "mprds_catchup_tx0_retries" to keep track
of how many times "rds_send_xmit" had to suspend activities,
because it was waiting for the first lane to catch up.
Håkon Bugge [Tue, 3 Feb 2026 05:57:20 +0000 (22:57 -0700)]
net/rds: Clear reconnect pending bit
When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
Gerd Rausch [Tue, 3 Feb 2026 05:57:19 +0000 (22:57 -0700)]
net/rds: Kick-start TCP receiver after accept
In cases where the server (the node with the higher IP-address)
in an RDS/TCP connection is overwhelmed it is possible that the
socket that was just accepted is chock-full of messages, up to
the limit of what the socket receive buffer permits.
Subsequently, "rds_tcp_data_ready" won't be called anymore,
because there is no more space to receive additional messages.
Nor was it called prior to the point of calling "rds_tcp_set_callbacks",
because the "sk_data_ready" pointer didn't even point to
"rds_tcp_data_ready" yet.
We fix this by simply kick-starting the receive-worker
for all cases where the socket state is neither
"TCP_CLOSE_WAIT" nor "TCP_CLOSE".
Gerd Rausch [Tue, 3 Feb 2026 05:57:18 +0000 (22:57 -0700)]
net/rds: rds_tcp_conn_path_shutdown must not discard messages
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged by the
TCP stack of a peer, rds_tcp_write_space() goes on to discard
prior messages from the send queue.
Which is fine, for as long as the receiver never throws any messages
away.
The dequeuing of messages in RDS/TCP is done either from the
"sk_data_ready" callback pointing to rds_tcp_data_ready()
(the most common case), or from the receive worker pointing
to rds_tcp_recv_path() which is called for as long as the
connection is "RDS_CONN_UP".
However, as soon as rds_conn_path_drop() is called for whatever reason,
including "DR_USER_RESET", "cp_state" transitions to "RDS_CONN_ERROR",
and rds_tcp_restore_callbacks() ends up restoring the callbacks
and thereby disabling message receipt.
So messages already acknowledged to the sender were dropped.
Furthermore, the "->shutdown" callback was always called
with an invalid parameter ("RCV_SHUTDOWN | SEND_SHUTDOWN == 3"),
instead of the correct pre-increment value ("SHUT_RDWR == 2").
inet_shutdown() returns "-EINVAL" in such cases, rendering
this call a NOOP.
So we change rds_tcp_conn_path_shutdown() to do the proper
"->shutdown(SHUT_WR)" call in order to signal EOF to the peer
and make it transition to "TCP_CLOSE_WAIT" (RFC 793).
This should make the peer also enter rds_tcp_conn_path_shutdown()
and do the same.
This allows us to dequeue all messages already received
and acknowledged to the peer.
We do so, until we know that the receive queue no longer has data
(skb_queue_empty()) and that we couldn't have any data
in flight anymore, because the socket transitioned to
any of the states "CLOSING", "TIME_WAIT", "CLOSE_WAIT",
"LAST_ACK", or "CLOSE" (RFC 793).
However, if we do just that, we suddenly see duplicate RDS
messages being delivered to the application.
So what gives?
Turns out that with MPRDS and its multitude of backend connections,
retransmitted messages ("RDS_FLAG_RETRANSMITTED") can outrace
the dequeuing of their original counterparts.
And the duplicate check implemented in rds_recv_local() only
discards duplicates if flag "RDS_FLAG_RETRANSMITTED" is set.
Rather curious, because a duplicate is a duplicate; it shouldn't
matter which copy is looked at and delivered first.
To avoid this entire situation, we simply make the sender discard
messages from the send-queue right from within
rds_tcp_conn_path_shutdown(). Just like rds_tcp_write_space() would
have done, were it called in time or still called.
This makes sure that we no longer have messages that we know
the receiver already dequeued sitting in our send-queue,
and therefore avoid the entire "RDS_FLAG_RETRANSMITTED" fiasco.
Now we got rid of the duplicate RDS message delivery, but we
still run into cases where RDS messages are dropped.
This time it is due to the delayed setting of the socket-callbacks
in rds_tcp_accept_one() via either rds_tcp_reset_callbacks()
or rds_tcp_set_callbacks().
By the time rds_tcp_accept_one() gets there, the socket
may already have transitioned into state "TCP_CLOSE_WAIT",
but rds_tcp_state_change() was never called.
Subsequently, "->shutdown(SHUT_WR)" did not happen either.
So the peer ends up getting stuck in state "TCP_FIN_WAIT2".
We fix that by checking for states "TCP_CLOSE_WAIT", "TCP_LAST_ACK",
or "TCP_CLOSE" and drop the freshly accepted socket in that case.
This problem is observable by running "rds-stress --reset"
frequently on either of the two sides of a RDS connection,
or both while other "rds-stress" processes are exchanging data.
Those "rds-stress" processes reported out-of-sequence
errors, with the expected sequence number being smaller
than the one actually received (due to the dropped messages).
Gerd Rausch [Tue, 3 Feb 2026 05:57:17 +0000 (22:57 -0700)]
net/rds: Encode cp_index in TCP source port
Upon "sendmsg", RDS/TCP selects a backend connection based
on a hash calculated from the source-port ("RDS_MPATH_HASH").
However, "rds_tcp_accept_one" accepts connections
in the order they arrive, which is non-deterministic.
Therefore the mapping of the sender's "cp->cp_index"
to that of the receiver changes if the backend
connections are dropped and reconnected.
However, connection state that's preserved across reconnects
(e.g. "cp_next_rx_seq") relies on that sender<->receiver
mapping to never change.
So we make sure that client and server of the TCP connection
have the exact same "cp->cp_index" across reconnects by
encoding "cp->cp_index" in the lower three bits of the
client's TCP source port.
A new extension "RDS_EXTHDR_SPORT_IDX" is introduced,
that allows the server to tell the difference between
clients that do the "cp->cp_index" encoding, and
legacy clients that pick source ports randomly.
Introduce a new extension header type RDSV3_EXTHDR_RDMA_BYTES for
an RDMA initiator to exchange rdma byte counts to its target.
Currently, RDMA operations cannot precisely account how many bytes a
peer just transferred via RDMA, which limits per-connection statistics
and future policy (e.g., monitoring or rate/cgroup accounting of RDMA
traffic).
In this patch we expand rds_message_add_extension to accept multiple
extensions, and add new flag to RDS header: RDS_FLAG_EXTHDR_EXTENSION,
along with a new extension to RDS header: rds_ext_header_rdma_bytes.
Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20260203055723.1085751-2-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric Dumazet [Tue, 3 Feb 2026 21:47:16 +0000 (21:47 +0000)]
net_sched: sch_fq: tweak unlikely() hints in fq_dequeue()
After 076433bd78d7 ("net_sched: sch_fq: add fast path
for mostly idle qdisc") we need to remove one unlikely()
because q->internal holds all the fast path packets.
skb = fq_peek(&q->internal);
if (unlikely(skb)) {
q->internal.qlen--;
Calling INET_ECN_set_ce() is very unlikely.
These changes allow fq_dequeue_skb() to be (auto)inlined,
thus making fq_dequeue() faster.
Marco Crivellari [Wed, 24 Dec 2025 15:50:06 +0000 (16:50 +0100)]
ovpn: Replace use of system_wq with system_percpu_wq
This patch continues the effort to refactor workqueue APIs, which has begun
with the changes introducing new workqueues and a new alloc_workqueue flag:
commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
The point of the refactoring is to eventually alter the default behavior of
workqueues to become unbound by default so that their workload placement is
optimized by the scheduler.
Before that to happen after a careful review and conversion of each individual
case, workqueue users must be converted to the better named new workqueues with
no intended behaviour changes:
Randy Dunlap [Tue, 3 Feb 2026 07:52:48 +0000 (23:52 -0800)]
net/iucv: clean up iucv kernel-doc warnings
Fix numerous (many) kernel-doc warnings in iucv.[ch]:
- convert function documentation comments to a common (kernel-doc) look,
even for static functions (without "/**")
- use matching parameter and parameter description names
- use better wording in function descriptions (Jakub & AI)
- remove duplicate kernel-doc comments from the header file (Jakub)
Examples:
Warning: include/net/iucv/iucv.h:210 missing initial short description
on line: * iucv_unregister
Warning: include/net/iucv/iucv.h:216 function parameter 'handle' not
described in 'iucv_unregister'
Warning: include/net/iucv/iucv.h:467 function parameter 'answer' not
described in 'iucv_message_send2way'
Warning: net/iucv/iucv.c:727 missing initial short description on line:
* iucv_cleanup_queue
Build-tested with both "make htmldocs" and "make ARCH=s390 defconfig all".
Jakub Kicinski [Thu, 5 Feb 2026 04:31:04 +0000 (20:31 -0800)]
Merge tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Some more changes, including pulls from drivers:
- ath drivers: small features/cleanups
- rtw drivers: mostly refactoring for rtw89 RTL8922DE support
- mac80211: use hrtimers for CAC to avoid too long delays
- cfg80211/mac80211: some initial UHR (Wi-Fi 8) support
* tag 'wireless-next-2026-02-04' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (59 commits)
wifi: brcmsmac: phy: Remove unreachable error handling code
wifi: mac80211: Add eMLSR/eMLMR action frame parsing support
wifi: mac80211: add initial UHR support
wifi: cfg80211: add initial UHR support
wifi: ieee80211: add some initial UHR definitions
wifi: mac80211: use wiphy_hrtimer_work for CAC timeout
wifi: mac80211: correct ieee80211-{s1g/eht}.h include guard comments
wifi: ath12k: clear stale link mapping of ahvif->links_map
wifi: ath12k: Add support TX hardware queue stats
wifi: ath12k: Add support RX PDEV stats
wifi: ath12k: Fix index decrement when array_len is zero
wifi: ath12k: support OBSS PD configuration for AP mode
wifi: ath12k: add WMI support for spatial reuse parameter configuration
dt-bindings: net: wireless: ath11k-pci: deprecate 'firmware-name' property
wifi: ath11k: add usecase firmware handling based on device compatible
wifi: ath10k: sdio: add missing lock protection in ath10k_sdio_fw_crashed_dump()
wifi: ath10k: fix lock protection in ath10k_wmi_event_peer_sta_ps_state_chg()
wifi: ath10k: snoc: support powering on the device via pwrseq
wifi: rtw89: pci: warn if SPS OCP happens for RTL8922DE
wifi: rtw89: pci: restore LDO setting after device resume
...
====================
Jakub Kicinski [Thu, 5 Feb 2026 02:45:34 +0000 (18:45 -0800)]
Merge branch 'mptcp-misc-features-for-v6-20-7-0'
Matthieu Baerts says:
====================
mptcp: misc. features for v6.20/7.0
This series contains a few independent new features, and small fixes for
net-next:
- Patches 1-2: two small fixes linked to the MPTCP receive buffer that
are not urgent, requiring code that has been recently changed, and is
needed for the next patch. Because we are at the end of the cycle, it
seems easier to send them to net-next, instead of dealing with
conflicts between net and net-next.
- Patch 3: a refactoring to simplify the code around MPTCP DRS.
- Patch 4: a new trace event for MPTCP to help debugging receive buffer
auto-tuning issues.
- Patch 5: align internal MPTCP PM structure with NL specs, just to
manipulate the same thing.
- Patch 6: convert some min_t(int, ...) to min(): cleaner, and to avoid
future warnings.
- Patch 7: [removed]
- Patch 8: sort all #include in MPTCP Diag tool in the selftests to
prevent future potential conflicts and ease the reading.
- Patches 9-11: improve the MPTCP Join selftest by waiting for an event
instead of a "random" sleep.
- Patches 12-14: some small cleanups in the selftests, seen while
working on the previous patches.
- Patch 15: avoid marking subtests as skipped while still validating
most checks when executing the last MPTCP selftests on older kernels.
====================
In fact, behind each line, a few counters are checked, and likely not
all of them have been skipped because the they are not available on
these kernels. Instead, "new" and unsupported counters for these groups
are now ignored, and [ OK ] will be printed instead of [SKIP].
Note that on the MPTCP CI, when validating the dev versions, any
unsupported counter will cause the tests to fail. So this is safe not to
print 'SKIP' for these group checks.
nstat outputs are already printed when calling 'fail_test', no need to
do it again.
While at it, no need to use the dump_stats variable, print the extra
stats directly. And use 'ip -n $ns' instead of 'ip netns exec $ns',
shorter and clearer.
selftests: mptcp: join: userspace: wait for new events
Instead of waiting for a random amount of time (1 second), wait for an
event to be received on the other side.
To do that, when an address is announced (userspace_pm_add_addr), the
ANNOUNCED is expected. When a new subflow is created
(userspace_pm_add_sf), the SUB_ESTABLISHED event is expected.
It looks like most of the time, this helper was simply waiting a bit
more than one second: the previous MPJoin counter was often already at
the expected value. So at the end, it was just checking 10 times for
the MPJoin counter to change, but it was not happening. For the tests,
that was time, it was just waiting longer for nothing.
Instead, use 'wait_mpj' with the expected counter: in the tests, the MPJ
counter can easily be predicted. While at it, stop passing the netns as
argument: here the received MPJoin ACK is checked, which happens on the
server side. If later on, this needs to be checked on the client side,
the helper can be adapted for this case, but better avoid confusions now
if it is not needed.
While at it, stop using 'i' for the variable if it is not used.
selftests: mptcp: join: wait for estab event instead of MPJ
'wait_mpj' was used just after having created a background connection,
but before creating new subflows. So no MPJ were sent. The intention was
to wait for the connection to be established, which was the same as
doing a simple sleep with a "random" value.
Instead, wait for an "established" event. With this, the tests can
finish quicker.
This file is the only one from this directory not to have all these
header inclusions sorted by type and alphabetical order.
Adapt them, to ease the reading, prevent conflicts during potential
future backport modifying these lines, and also to avoid having UAPI
header inclusions before libc ones, see [1].
Both mptcp_wnd_end(msk) and msk->snd_nxt are u64, their difference
(aka the window size) might be limited to 32 bits - but that isn't
knowable from this code.
So checks being added to min_t() detect the potential discard of
significant bits.
Provided the 'avail_size' and return of mptcp_check_allowed_size()
are changed to an unsigned type (size_t matches the type the caller
uses) both min_t() can be changed to min().
Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ wrapped too long lines when declaring mptcp_check_allowed_size() ] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260203-net-next-mptcp-misc-feat-6-20-v1-6-31ec8bfc56d1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
mptcp: pm: align endpoint flags size with the NL specs
The MPTCP Netlink specs describe the 'flags' as a u32 type. Internally,
a u8 type was used.
Using a u8 is currently fine, because only the 5 first bits are used.
But there is also no reason not to be aligns with the specs, and
to stick to a u8. Especially because there is a whole of 3 bytes after
in both mptcp_pm_local and mptcp_pm_addr_entry structures.
Also, setting it to a u32 will allow future flags, just in case.
Paolo Abeni [Tue, 3 Feb 2026 18:41:18 +0000 (19:41 +0100)]
mptcp: fix receive space timestamp initialization
MPTCP initialize the receive buffer stamp in mptcp_rcv_space_init(),
using the provided subflow stamp. Such helper is invoked in several
places; for passive sockets, space init happened at clone time.
In such scenario, MPTCP ends-up accesses the subflow stamp before
its initialization, leading to quite randomic timing for the first
receive buffer auto-tune event, as the timestamp for newly created
subflow is not refreshed there.
Fix the issue moving the stamp initialization out of the mentioned helper,
at the data transfer start, and always using a fresh timestamp.
Paolo Abeni [Tue, 3 Feb 2026 18:41:17 +0000 (19:41 +0100)]
mptcp: do not account for OoO in mptcp_rcvbuf_grow()
MPTCP-level OoOs are physiological when multiple subflows are active
concurrently and will not cause retransmissions nor are caused by
drops.
Accounting for them in mptcp_rcvbuf_grow() causes the rcvbuf slowly
drifting towards tcp_rmem[2].
Remove such accounting. Note that subflows will still account for TCP-level
OoO when the MPTCP-level rcvbuf is propagated.
This also closes a subtle and very unlikely race condition with rcvspace
init; active sockets with user-space holding the msk-level socket lock,
could complete such initialization in the receive callback, after that the
first OoO data reaches the rcvbuf and potentially triggering a divide by
zero Oops.
Arnd Bergmann [Tue, 3 Feb 2026 16:34:00 +0000 (17:34 +0100)]
vmw_vsock: bypass false-positive Wnonnull warning with gcc-16
The gcc-16.0.1 snapshot produces a false-positive warning that turns
into a build failure with CONFIG_WERROR:
In file included from arch/x86/include/asm/string.h:6,
from net/vmw_vsock/vmci_transport.c:10:
In function 'vmci_transport_packet_init',
inlined from '__vmci_transport_send_control_pkt.constprop' at net/vmw_vsock/vmci_transport.c:198:2:
arch/x86/include/asm/string_32.h:150:25: error: argument 2 null where non-null expected because argument 3 is nonzero [-Werror=nonnull]
150 | #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
arch/x86/include/asm/string_32.h:150:25: note: in a call to built-in function '__builtin_memcpy'
net/vmw_vsock/vmci_transport.c:164:17: note: in expansion of macro 'memcpy'
164 | memcpy(&pkt->u.wait, wait, sizeof(pkt->u.wait));
| ^~~~~~
This seems relatively harmless, and it so far the only instance of this
warning I have found. The __vmci_transport_send_control_pkt function
is called either with wait=NULL or with one of the type values that
pass 'wait' into memcpy() here, but not from the same caller.
Replacing the memcpy with a struct assignment is otherwise the same
but avoids the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bryan Tan <bryan-bt.tan@broadcom.com> Link: https://patch.msgid.link/20260203163406.2636463-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per the RZ/G3L Hardware manual, CPG_CLKON_ETH register bits{12,13} are
to control the RMII{tx, rx} clocks. Document the rmii{tx.rx} clocks for
RZ/G3L SoC.
Jakub Kicinski [Thu, 5 Feb 2026 02:17:57 +0000 (18:17 -0800)]
Merge branch 's32g-use-a-syscon-for-gpr'
Dan Carpenter says:
====================
s32g: Use a syscon for GPR
The s32g devices have a GPR register region which holds a number of
miscellaneous registers. Currently only the stmmac/dwmac-s32.c uses
anything from there and we just add a line to the device tree to
access that GMAC_0_CTRL_STS register:
I have included the whole list of registers below.
We still have to maintain backwards compatibility to this format,
of course, but it would be better to access these registers through a
syscon. Putting all the registers together is more organized and shows
how the hardware actually is implemented.
Secondly, in some versions of this chipset those registers can only be
accessed via SCMI. It's relatively straight forward to handle this
by writing a syscon driver and registering it with of_syscon_register_regmap()
but it's complicated to deal with if the registers aren't grouped
together.
Here is the whole list of registers in the GPR region
Starting from 0x4007C000
0 Software-Triggered Faults (SW_NCF)
4 GMAC Control (GMAC_0_CTRL_STS)
28 CMU Status 1 (CMU_STATUS_REG1)
2C CMUs Status 2 (CMU_STATUS_REG2)
30 FCCU EOUT Override Clear (FCCU_EOUT_OVERRIDE_CLEAR_REG)
38 SRC POR Control (SRC_POR_CTRL_REG)
54 GPR21 (GPR21)
5C GPR23 (GPR23)
60 GPR24 Register (GPR24)
CC Debug Control (DEBUG_CONTROL)
F0 Timestamp Control (TIMESTAMP_CONTROL_REGISTER)
F4 FlexRay OS Tick Input Select (FLEXRAY_OS_TICK_INPUT_SELECT_REG)
FC GPR63 Register (GPR63)
Starting from 0x4007CA00
0 Coherency Enable for PFE Ports (PFE_COH_EN)
4 PFE EMAC Interface Mode (PFE_EMACX_INTF_SEL)
20 PFE EMACX Power Control (PFE_PWR_CTRL)
28 Error Injection on Cortex-M7 AHB and AXI Pipe (CM7_TCM_AHB_SLICE)
2C Error Injection AHBP Gasket Cortex-M7 (ERROR_INJECTION_AHBP_GASKET_CM7)
40 LLCE Subsystem Status (LLCE_STAT)
44 LLCE Power Control (LLCE_CTRL)
48 DDR Urgent Control (DDR_URGENT_CTRL)
4C FTM Global Load Control (FLXTIM_CTRL)
50 FTM LDOK Status (FLXTIM_STAT)
54 Top CMU Status (CMU_STAT)
58 Accelerator NoC No Pending Trans Status (NOC_NOPEND_TRANS)
90 SerDes RD/WD Toggle Control (PCIE_TOGGLE)
94 SerDes Toggle Done Status (PCIE_TOGGLEDONE_STAT)
E0 Generic Control 0 (GENCTRL0)
E4 Generic Control 1 (GENCTRL1)
F0 Generic Status 0 (GENSTAT0)
FC Cortex-M7 AXI Parity Error and AHBP Gasket Error Alarm (CM7_AXI_AHBP_GASKET_ERROR_ALARM)
Dan Carpenter [Fri, 30 Jan 2026 13:19:47 +0000 (16:19 +0300)]
dt-bindings: net: nxp,s32-dwmac: Use the GPR syscon
The S32 chipsets have a GPR region which has a miscellaneous registers
including the GMAC_0_CTRL_STS register. Originally, this code accessed
that register in a sort of ad-hoc way, but it's cleaner to use a
syscon interface to access these registers.
We still need to maintain the old method of accessing the GMAC register
but using a syscon will let us access other registers more cleanly.
Dan Carpenter [Fri, 30 Jan 2026 13:19:41 +0000 (16:19 +0300)]
net: stmmac: s32: use a syscon for S32_PHY_INTF_SEL_RGMII
On the s32 chipsets the GMAC_0_CTRL_STS register is in GPR region.
Originally, accessing this register was done in a sort of ad-hoc way,
but we want to use the syscon interface to do it.
This is a little bit ugly because we have to maintain backwards
compatibility to the old device trees so we have to support both ways
to access this register.
====================
STP/RSTP SWITCH support for PRU-ICSSM Ethernet driver
The DUAL-EMAC patch series for Megabit Industrial Communication Sub-system
(ICSSM), which provides the foundational support for Ethernet functionality
over PRU-ICSS on the TI SOCs (AM335x, AM437x, and AM57x), was merged into
net-next recently [1].
This patch series enhances the PRU-ICSSM Ethernet driver to support bridge
(STP/RSTP) SWITCH mode, which has been implemented using the "switchdev"
framework and interacts with the "mstp daemon" for STP and RSTP management
in userspace.
When the SWITCH mode is enabled, forwarding of Ethernet packets using
either the traditional store-and-forward mechanism or via cut-through is
offloaded to the two PRU based Ethernet interfaces available within the
ICSSM. The firmware running on the PRU inspects the bridge port states and
performs necessary checks before forwarding a packet. This improves the
overall system performance and significantly reduces the packet forwarding
latency.
Protocol switching from Dual-EMAC to bridge (STP/RSTP) SWITCH mode can be
done as follows.
Assuming eth2 and eth3 are the two physical ports of the ICSS2 instance:
>> brctl addbr br0
>> ip maddr add 01:80:c2:00:00:00 dev br0
>> ip link set dev br0 address $(cat /sys/class/net/eth2/address)
>> brctl addif br0 eth2
>> brctl addif br0 eth3
>> mstpd
>> brctl stp br0 on
>> mstpctl setforcevers br0 rstp
>> ip link set dev br0 up
To revert back to the default dual EMAC mode, the steps are as follows:
>> ip link set dev br0 down
>> brctl delif br0 eth2
>> brctl delif br0 eth3
>> brctl delbr br0
The patches presented in this series have gone through the patch verification
tools and no warnings or errors are reported.
====================
Roger Quadros [Fri, 30 Jan 2026 12:43:45 +0000 (18:13 +0530)]
net: ti: icssm-prueth: Add support for ICSSM RSTP switch
Add support for RSTP switch mode by enhancing the existing ICSSM dual EMAC
driver with switchdev support.
Enable the PRU-ICSSM to operate in switch mode, with the 2 PRU ports acting
as external ports and the host acting as an internal port. Packets received
from the PRU ports will be forwarded to the host (store and forward mode)
and also to the other PRU port (either using store and forward mode or via
cut-through mode). Packets coming from the host will be transmitted either
from one or both of the PRU ports (depending on the FDB decision).
By default, the dual EMAC firmware will be loaded in the PRU-ICSS
subsystem. To configure the PRU-ICSS to operate as a switch, a different
firmware must to be loaded.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260130124559.1182780-4-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roger Quadros [Fri, 30 Jan 2026 12:43:44 +0000 (18:13 +0530)]
net: ti: icssm-prueth: Add switchdev support for icssm_prueth driver
Add support for offloading the RSTP switch feature to the PRU-ICSS
subsystem by adding switchdev support. PRU-ICSS is capable of operating
in RSTP switch mode with two external ports and one host port.
PRUETH driver and firmware interface support will be added into
icssm_prueth in the subsequent commits.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260130124559.1182780-3-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Roger Quadros [Fri, 30 Jan 2026 12:43:43 +0000 (18:13 +0530)]
net: ti: icssm-prueth: Add helper functions to configure and maintain FDB
Introduce helper functions to configure and maintain Forwarding
Database (FDB) tables to aid with the switch mode feature for PRU-ICSS
ports. The PRU-ICSS FDB is maintained such that it is always in sync with
the Linux bridge driver FDB.
The FDB is used by the driver to determine whether to flood a packet,
received from the user plane, to both ports or direct it to a specific port
using the flags in the FDB table entry.
The FDB is implemented in two main components: the Index table and the
MAC Address table. Adding, deleting, and maintaining entries are handled
by the PRUETH driver. There are two types of entries:
Dynamic: created from the received packets and are subject to aging.
Static: created by the user and these entries never age out.
8-bit hash value obtained using the source MAC address is used to identify
the index to the Index/Hash table. A bucket-based approach is used to
collate source MAC addresses with the same hash value. The Index/Hash table
holds the bucket index (16-bit value) and the number of entries in the
bucket with the same hash value (16-bit value). This table can hold up to
256 entries, with each entry consuming 4 bytes of memory. The bucket index
value points to the MAC address table indicating the start of MAC addresses
having the same hash values.
Each entry in the MAC Address table consists of:
1. 6 bytes of the MAC address,
2. 2-byte aging time, and
3. 1-byte each for port information and flags respectively.
When a new entry is added to the FDB, the hash value is calculated using an
XOR operation on the 6-byte MAC address. The result is used as an index
into the Hash/Index table to check if any entries exist. If no entries are
present, the first available empty slot in the MAC Address table is
allocated to insert this MAC address. If entries with the same hash value
are already present, the new MAC address entry is added to the MAC Address
table in such a way that it ensures all entries are grouped together and
sorted in ascending MAC address order. This approach helps efficiently
manage FDB entries.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Basharath Hussain Khaja <basharath@couthit.com> Signed-off-by: Parvathi Pudi <parvathi@couthit.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260130124559.1182780-2-parvathi@couthit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: usb: sr9700: remove code to drive nonexistent multicast filter
Several registers referenced in this driver's source code do not
actually exist (they are not writable and read as zero in my testing).
They exist in this driver because it originated as a copy of the dm9601
driver. Notably, these include the multicast filter registers - this
causes the driver to not support multicast packets correctly. Remove
the multicast filter code and register definitions. Instead, set the
chip to receive all multicast filter packets when any multicast
addresses are in the list.
net: usb: introduce usbnet_mii_ioctl helper function
Many USB network drivers use identical code to pass ioctl
requests on to the MII layer. Reduce code duplication by
refactoring this code into a helper function.
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> (v1) Reviewed-by: Andrew Lunn <andrew@lunn.ch> (v3) Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20260203013517.26170-1-enelsonmoore@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: ethernet: renesas: rcar_gen4_ptp: Hide private data
The R-Car Gen4 PTP module started out as an exclusive feature of a
single driver, but have since been extended to cover both R-Car Switch
and TSN driver implementations on Gen4.
The feature have already been extended to be built as its own module
with an interface exposed thru a local header file. The header file
however also exposes the modules private data structure. The two
existing users have already started to poke at members of the struct.
The exposed private data being manipulated by users makes refactoring
and future rework hard as the interface for the module becomes to
chaotic. This small series aims to create two helpers to hide the
private data.
This is done as a small preparation before a third, new, users of the
Gen4 PTP will be added in a follow up series.
====================
net: ethernet: renesas: rcar_gen4_ptp: Hide private data from users
The Gen4 PTP helper module is already used by RTSN and RSWITCH to
support PTP clocks and will be used by RAVB too. Hide the Gen4 PTP
private data structure to make sure none of the users poke at it.
This will be more important for RAVB use-cases as more then one RAVB
device will need to cooperate using one PTP clock source.
net: ethernet: renesas: rcar_gen4_ptp: Add helper to read time
Instead of accessing the Gen4 PTP specific structure directly in drivers
add a helper to read the time. This is done in preparation to
completely hide the Gen4 PTP specific structure from users.
net: ethernet: renesas: rcar_gen4_ptp: Add helper to get clock index
Instead of accessing the Gen4 PTP specific structure directly in drivers
add a helper to read the clock index. This is done in preparation to
completely hide the Gen4 PTP specific structure from users.
Instead of accessing the Gen4 PTP specific structure directly in drivers
move the device address assignment into the preparation call. This is
done in preparation to completely hide the Gen4 PTP specific structure
from users.
net: stmmac: rk: remove need for ->set_speed() method
As we can detect whether the SoC provides the parameters necessary for
rk_set_reg_speed(), we don't need to have explicit calls to this.
Instead, we can move the contents of this function to
rk_set_clk_tx_rate().
This remsoves all the .set_speed() implementations that merely go on to
invoke rk_set_reg_speed().
net: stmmac: rk: use rk_encode_wm16() for RMII clock
The RMII clock is a single bit, which is set for 100M and clear for
10M. Move this out of struct rk_reg_speed_data (which gets rid of
this structure) into the struct rk_clock_fields as the bitmask for
this bit.
This gets rid of the per-SoC variability in the calls to
rk_set_reg_speed().
net: stmmac: rk: use rk_encode_wm16() for RMII speed
The RMII speed configuration is encoded as a single bit, which is set
for 100M and clean for 10M. Provide the bitfield definition in
struct rk_clock_fields, moving it out of struct rk_reg_speed_data's
rmii_10 and rmii_100 initialisers. Update rk_set_reg_speed() to handle
the new definition location of this bit.
net: stmmac: rk: use rk_encode_wm16() for RGMII clocks
As all of the RGMII clock selection bitfields (gmii_clk_sel) use the
same encoding, parameterise this by providing the bitfield mask in
the BSP private data.
This is the last user of GRF_FIELD_CONST(), so remove that definition
as well.
One additional change is for RK3328 - as only gmac2io supports RGMII,
only initialise the mask for this instance.
net: stmmac: rk: move speed GRF register offset to private data
Move the speed/clocking related GRF register offset into the driver
private data, convert rk_set_reg_speed() to use it and initialise this
member either from the corresponding member in struct rk_gmac_ops, or
the SoC specific initialisation function.
net: stmmac: rk: convert rk3588 to mask-based interface mode config
rk3588 has a quirk compared to the other Rockchip implementations in
that the interface mode configuration register is in the php_grf
regmap rather than the grf regmap. Add a flag to indicate this, and
a separate function to write to the appropriate regmap. This allows
rk3588 to be converted.
net: stmmac: rk: convert to mask-based interface mode configuration
The majority of Rockchip implementations require three common pieces
of information to configure the PHY interface mode:
- The grf register offset for configuring the GMAC phy_intf_sel field
and the RMII mode bit.
- The bitfield in this register for the GMAC's phy_intf_sel.
- The bit position for RMII mode but clear for RGMII mode.
Introduce members for this information into struct rk_priv_data and
struct rk_gmac_ops, which will be used to pre-initialise the struct
rk_priv_data members. We describe the register contents using
bitfields, even for those that are a single bit for consistency.
As each register comprises of two halves, where the upper half enables
changing the bit state in the lower half, we can describe these
bitfields using a 16-bit data type, and provide rk_encode_wm16() to
generate the actual register values from the field mask and field
value. We are unable to use the FIELD_PREP_WM16() macros for this as
these require the field mask to be a constant.
Add code to rk_gmac_powerup() to get the phy_intf_sel value, validating
that the resulting mode is either RMII or RGMII. No other modes are
supported by any of the Rockchip SoCs supported by this driver.
If either of the bitfield mask values are populated in struct
rk_priv_data, use these to generate the register contents, and write
the resulting value to the specified GRF register.
Convert many Rockchip implementations to use this new infrastructure.
For those where there is a single GMAC instance, it is merely a case of
filling in the new members of struct rk_gmac_ops. For those with
multiple instances, one or more of these members depends on the GMAC
instance, so setup of the members in struct rk_gmac has to be done via
the .init method of struct rk_gmac_ops. The corresponding code is
removed from the set_to_rgmii() and set_to_rmii() implementations.
Since the member name documents the purpose of the field that is being
initialised, providing preprocessor macros to define the bitfields is
deemed to be less than useful given the massive size of this driver.
The existing mechanisms remain behind for those SoCs that can not be
converted to this scheme.
====================
AccECN protocol case handling series
Plesae find the v13 AccECN case handling patch series, which covers
several excpetional case handling of Accurate ECN spec (RFC9768),
adds new identifiers to be used by CC modules, adds ecn_delta into
rate_sample, and keeps the ACE counter for computation, etc.
This patch series is part of the full AccECN patch series, which is at
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/
---
Chia-Yu Chang (13):
selftests/net: gro: add self-test for TCP CWR flag
tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
tcp: disable RFC3168 fallback identifier for CC modules
tcp: accecn: handle unexpected AccECN negotiation feedback
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
tcp: add TCP_SYNACK_RETRANS synack_type
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN
SYN/ACK
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
tcp: accecn: fallback outgoing half link to non-AccECN
tcp: accecn: detect loss ACK w/ AccECN option and add
TCP_ACCECN_OPTION_PERSIST
tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info
tcp: accecn: enable AccECN
selftests/net: packetdrill: add TCP Accurate ECN cases
Ilpo Järvinen (2):
tcp: try to avoid safer when ACKs are thinned
gro: flushing when CWR is set negatively affects AccECN
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:15 +0000 (23:25 +0100)]
selftests/net: packetdrill: add TCP Accurate ECN cases
Linux Accurate ECN test sets using ACE counters and AccECN options to
cover several scenarios: Connection teardown, different ACK conditions,
counter wrapping, SACK space grabbing, fallback schemes, negotiation
retransmission/reorder/loss, AccECN option drop/loss, different
handshake reflectors, data with marking, and different sysctl values.
The packetdrill used is commit cbe405666c9c8698ac1e72f5e8ffc551216dfa56
of repo: https://github.com/minuscat/packetdrill/tree/upstream_accecn.
And corresponding patches are sent to google/packetdrill email list.
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-16-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:13 +0000 (23:25 +0100)]
tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info
Add 2-bit tcpi_ecn_mode feild within tcp_info to indicate which ECN
mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN,
or ECN_MODE_PENDING. This is done by utilizing available bits from
tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and
tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits).
Also, an extra 24-bit tcpi_options2 field is identified to represent
newer options and connection features, as all 8 bits of tcpi_options
field have been used.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:12 +0000 (23:25 +0100)]
tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.
Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:11 +0000 (23:25 +0100)]
tcp: accecn: fallback outgoing half link to non-AccECN
According to Section 3.2.2.1 of AccECN spec (RFC9768), if the Server
is in AccECN mode and in SYN-RCVD state, and if it receives a value of
zero on a pure ACK with SYN=0 and no SACK blocks, for the rest of the
connection the Server MUST NOT set ECT on outgoing packets and MUST
NOT respond to AccECN feedback. Nonetheless, as a Data Receiver it
MUST NOT disable AccECN feedback.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:10 +0000 (23:25 +0100)]
tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in
AccECN mode MUST NOT set ECT on any packet for the rest of the connection,
if it has received or sent at least one valid SYN or Acceptable SYN/ACK
with (AE,CWR,ECE) = (0,0,0) during the handshake.
In addition, a host in AccECN mode that is feeding back the IP-ECN
field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the
latest valid SYN or acceptable SYN/ACK to arrive.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:09 +0000 (23:25 +0100)]
tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
For Accurate ECN, the first SYN/ACK sent by the TCP server shall set
the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the
capability negotiation. However, if the TCP server needs to retransmit
such a SYN/ACK (for example, because it did not receive an ACK
acknowledging its SYN/ACK, or received a second SYN requesting AccECN
support), the TCP server retransmits the SYN/ACK without the AccECN
option. This is because the SYN/ACK may be lost due to congestion, or a
middlebox may block the AccECN option. Furthermore, if this retransmission
also times out, to expedite connection establishment, the TCP server
should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the
AccECN option, while maintaining AccECN feedback mode.
This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:08 +0000 (23:25 +0100)]
tcp: add TCP_SYNACK_RETRANS synack_type
Before this patch, retransmitted SYN/ACK did not have a specific
synack_type; however, the upcoming patch needs to distinguish between
retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to
transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this
patch introduces a new synack_type (TCP_SYNACK_RETRANS).
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:07 +0000 (23:25 +0100)]
tcp: accecn: retransmit downgraded SYN in AccECN negotiation
Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an
AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it
SHOULD attempt to negotiate the use of AccECN at least one more time
by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on
the first retransmitted SYN (using the usual retransmission time-outs).
If this first retransmission also fails to be acknowledged, in
deployment scenarios where AccECN path traversal might be problematic,
the TCP Client SHOULD send subsequent retransmissions of the SYN with
the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0).
According to Sections 3.1.2 and 3.1.3 of AccECN spec (RFC9768).
In Section 3.1.2, it says an AccECN implementation has no need to
recognize or support the Server response labelled 'Nonce' or ECN-nonce
feedback more generally, as RFC 3540 has been reclassified as Historic.
AccECN is compatible with alternative ECN feedback integrity approaches
to the nonce. The SYN/ACK labelled 'Nonce' with (AE,CWR,ECE) = (1,0,1)
is reserved for future use. A TCP Client (A) that receives such a SYN/ACK
follows the procedure for forward compatibility given in Section 3.1.3.
Then in Section 3.1.3, it says if a TCP Client has sent a SYN requesting
AccECN feedback with (AE,CWR,ECE) = (1,1,1) then receives a SYN/ACK with
the currently reserved combination (AE,CWR,ECE) = (1,0,1) but it does not
have logic specific to such a combination, the Client MUST enable AccECN
mode as if the SYN/ACK onfirmed that the Server supported AccECN and as
if it fed back that the IP-ECN field on the SYN had arrived unchanged.
Fixes: 3cae34274c79 ("tcp: accecn: AccECN negotiation"). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:05 +0000 (23:25 +0100)]
tcp: disable RFC3168 fallback identifier for CC modules
When AccECN is not successfully negociated for a TCP flow, it defaults
fallback to classic ECN (RFC3168). However, L4S service will fallback
to non-ECN.
This patch enables congestion control module to control whether it
should not fallback to classic ECN after unsuccessful AccECN negotiation.
A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this
behavior expected by the CA.
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:04 +0000 (23:25 +0100)]
tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
Two flags for congestion control (CC) module are added in this patch
related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN)
defines that the CC expects to negotiate AccECN functionality using the
ECE, CWR and AE flags in the TCP header.
Second, during ECN negotiation, ECT(0) in the IP header is used. This
patch enables CC to control whether ECT(0) or ECT(1) should be used on
a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the
expected ECT value in the IP header by the CA when not-yet initialized
for the connection.
The detailed AccECN negotiaotn can be found in IETF RFC9768.
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Chia-Yu Chang [Sat, 31 Jan 2026 22:25:03 +0000 (23:25 +0100)]
selftests/net: gro: add self-test for TCP CWR flag
Currently, GRO does not flush packets when the CWR bit is set.
A corresponding self-test is being added, in which the CWR flag
is set for two consecutive packets, but the first packet with the
CWR flag set will not be flushed immediately.
Ilpo Järvinen [Sat, 31 Jan 2026 22:25:02 +0000 (23:25 +0100)]
gro: flushing when CWR is set negatively affects AccECN
As AccECN may keep CWR bit asserted due to different
interpretation of the bit, flushing with GRO because of
CWR may effectively disable GRO until AccECN counter
field changes such that CWR-bit becomes 0.
There is no harm done from not immediately forwarding the
CWR'ed segment with RFC3168 ECN.
Ilpo Järvinen [Sat, 31 Jan 2026 22:25:01 +0000 (23:25 +0100)]
tcp: try to avoid safer when ACKs are thinned
Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta in AccECN processing based
on it. If the packets ACKed per ACK tends to be large, don't
conservatively assume ACE field overflow.
This patch uses the existing 2-byte holes in the rx group for new
u16 variables withtout creating more holes. Below are the pahole
outcomes before and after this patch:
wlc_phy_txpwr_srom_read_lcnphy() in wlc_phy_attach_lcnphy() always
returns true, making the error handling code unreachable. Change the
function's return type to void and remove the dead code, similar to
the cleanup done for wlc_phy_txpwr_srom_read_nphy() in commit 47f0e32ffe4e ("wifi: brcmsmac: phy: Remove unreachable code").
====================
net: phy: remove modalias-based MDIO device bus matching
modalias-based MDIO device bus matching has only one user (dsa-loop),
where we can replace modalias-based matching with a simple custom
match function. This, and first patch of the series, lay the foundation
for removing modalias-based matching.
====================
Heiner Kallweit [Sat, 31 Jan 2026 17:40:00 +0000 (18:40 +0100)]
net: phy: remove modalias-based mdio bus matching
Last user dsa_loop has been migrated away from modalias-based matching,
so we can remove this feature now. It was the only user of MDIO_NAME_SIZE,
so remove also this constant.
Heiner Kallweit [Sat, 31 Jan 2026 17:38:56 +0000 (18:38 +0100)]
net: dsa: loop: remove MDIO device modalias
This change is a prerequisite for removing the MDIO device modalias,
as dsa_loop is the only user. Switch from modalias to a custom
bus match function.
Heiner Kallweit [Sat, 31 Jan 2026 17:37:15 +0000 (18:37 +0100)]
net: ethernet: adi: make name member of struct adin1110_cfg a pointer
Primary reason for this change is to remove the misuse of MDIO_NAME_SIZE
here, so that this constant can be removed in a follow-up patch.
Use case here is simply a chip name w/o any relationship to a MDIO
device. Also there's no need to reserve a longer char array, so make
the name a pointer.