While stress testing TCP I saw unexpected retransmits and SACK packets
when a single CPU received data from multiple high-throughput flows.
super_netperf 4 -H srv -T,10 -l 3000 &
Tcpdump extract:
00:00:00.000007 IP6 clnt > srv: Flags [.], seq 26062848:26124288, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26124288:26185728, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26185728:26243072, ack 1, win 66, options [nop,nop,TS val 651460834 ecr 3100749131], length 57344
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26243072:26304512, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26304512:26365952, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 61440
00:00:00.000007 IP6 clnt > srv: Flags [P.], seq 26365952:26423296, ack 1, win 66, options [nop,nop,TS val 651460844 ecr 3100749141], length 57344
00:00:00.000006 IP6 clnt > srv: Flags [.], seq 26423296:26484736, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [.], seq 26484736:26546176, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 61440
00:00:00.000005 IP6 clnt > srv: Flags [P.], seq 26546176:26603520, ack 1, win 66, options [nop,nop,TS val 651460853 ecr 3100749150], length 57344
00:00:00.003932 IP6 clnt > srv: Flags [P.], seq 26603520:26619904, ack 1, win 66, options [nop,nop,TS val 651464844 ecr 3100753141], length 16384
00:00:00.006602 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651471419 ecr 3100759716], length 4096
00:00:00.013000 IP6 clnt > srv: Flags [.], seq 24862720:24866816, ack 1, win 66, options [nop,nop,TS val 651484421 ecr 3100772718], length 4096
00:00:00.000416 IP6 srv > clnt: Flags [.], ack 26619904, win 1393, options [nop,nop,TS val 3100773185 ecr 651484421,nop,nop,sack 1 {24862720:24866816}], length 0
After analysis, it appears this is because of the cond_resched()
call from __release_sock().
When the current thread yields while still holding the TCP socket lock,
it might regain the CPU only after a very long time.
Meanwhile, the peer's TLP/RTO timers fire (multiple times) and packets are
retransmitted, while the initial copy is still waiting in the socket backlog
or receive queue.
In this patch, I call cond_resched() only once every 16 packets.
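
As an illustration only, the idea of yielding once per batch of 16 packets
can be sketched in userspace C. This is not the kernel code: struct pkt,
drain_backlog(), RESCHED_INTERVAL and the yield_fn callback are hypothetical
names standing in for the socket backlog list and cond_resched().

```c
#include <stddef.h>

/* Batch size taken from the commit text; the macro name is hypothetical. */
#define RESCHED_INTERVAL 16

/* Hypothetical stand-in for a queued packet. */
struct pkt {
	struct pkt *next;
};

/*
 * Walk a singly linked backlog and reach the yield point once every
 * RESCHED_INTERVAL packets instead of after each packet.
 * Returns how many times the yield point was reached.
 */
static unsigned int drain_backlog(struct pkt *head, void (*yield_fn)(void))
{
	unsigned int nb = 0;
	unsigned int yields = 0;

	while (head != NULL) {
		struct pkt *next = head->next;

		head->next = NULL;	/* detach the packet from the list */
		/* ... per-packet processing would go here ... */

		if (++nb % RESCHED_INTERVAL == 0) {
			if (yield_fn != NULL)
				yield_fn();
			yields++;
		}
		head = next;
	}
	return yields;
}
```

With a backlog of 40 packets, this sketch reaches the yield point twice
(after packets 16 and 32) rather than 40 times.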
The modern TCP stack now spends less time per packet in the backlog,
especially because ACKs are no longer sent (commit 133c4c0d3717
"tcp: defer regular ACK while processing socket backlog").
Before:
clnt:/# nstat -n;sleep 10;nstat|egrep "TcpOutSegs|TcpRetransSegs|TCPFastRetrans|TCPTimeouts|Probes|TCPSpuriousRTOs|DSACK"
TcpOutSegs 19046186 0.0
TcpRetransSegs 1471 0.0
TcpExtTCPTimeouts 1397 0.0
TcpExtTCPLossProbes 1356 0.0
TcpExtTCPDSACKRecv 1352 0.0
TcpExtTCPSpuriousRTOs 114 0.0
TcpExtTCPDSACKRecvSegs 1352 0.0
After:
clnt:/# nstat -n;sleep 10;nstat|egrep "TcpOutSegs|TcpRetransSegs|TCPFastRetrans|TCPTimeouts|Probes|TCPSpuriousRTOs|DSACK"
TcpOutSegs 19218936 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250903174811.1930820-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>