From: Sasha Levin Date: Sat, 14 Oct 2023 16:41:19 +0000 (-0400) Subject: Fixes for 6.1 X-Git-Tag: v6.1.58~7 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=d0ce0fc09aa7174f8d362789d42b3f6c1687fe89;p=thirdparty%2Fkernel%2Fstable-queue.git Fixes for 6.1 Signed-off-by: Sasha Levin --- diff --git a/queue-6.1/series b/queue-6.1/series index 60212a8ade8..af6c12034f4 100644 --- a/queue-6.1/series +++ b/queue-6.1/series @@ -69,3 +69,4 @@ net-smc-fix-pos-miscalculation-in-statistics.patch pinctrl-renesas-rzn1-enable-missing-pinmux.patch nfc-nci-assert-requested-protocol-is-valid.patch workqueue-override-implicit-ordered-attribute-in-wor.patch +tcp-enforce-receive-buffer-memory-limits-by-allowing.patch diff --git a/queue-6.1/tcp-enforce-receive-buffer-memory-limits-by-allowing.patch b/queue-6.1/tcp-enforce-receive-buffer-memory-limits-by-allowing.patch new file mode 100644 index 00000000000..0a75f528bc9 --- /dev/null +++ b/queue-6.1/tcp-enforce-receive-buffer-memory-limits-by-allowing.patch @@ -0,0 +1,275 @@ +From 5d2fbc5db765aaae71b290e1f592e9bbf9bfb5c2 Mon Sep 17 00:00:00 2001 +From: Sasha Levin +Date: Sun, 11 Jun 2023 22:05:24 -0500 +Subject: tcp: enforce receive buffer memory limits by allowing the tcp window + to shrink + +From: mfreemon@cloudflare.com + +[ Upstream commit b650d953cd391595e536153ce30b4aab385643ac ] + +Under certain circumstances, the tcp receive buffer memory limit +set by autotuning (sk_rcvbuf) is increased due to incoming data +packets as a result of the window not closing when it should be. +This can result in the receive buffer growing all the way up to +tcp_rmem[2], even for tcp sessions with a low BDP. + +To reproduce: Connect a TCP session with the receiver doing +nothing and the sender sending small packets (an infinite loop +of socket send() with 4 bytes of payload with a sleep of 1 ms +in between each send()). This will cause the tcp receive buffer +to grow all the way up to tcp_rmem[2]. 
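The reproduction steps above can be sketched as a short userspace script. This is only an illustration, not part of the patch: it uses a loopback connection in place of a real network path and cuts the loop short so it finishes quickly; the real reproducer must run for much longer before autotuning grows the receive buffer to tcp_rmem[2].

```python
import socket
import time

# Hypothetical stand-in for the reproducer: loopback replaces a real
# network path, and the loop is cut short so the script finishes quickly.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

sender = socket.create_connection(listener.getsockname())
receiver, _ = listener.accept()   # the receiver never calls recv()

ITERS = 100                       # the real reproducer loops indefinitely
for _ in range(ITERS):
    sender.sendall(b"ping")       # 4 bytes of payload per send()
    time.sleep(0.001)             # ~1 ms between each send()

time.sleep(0.2)                   # let the final segments arrive
# Peek at (without consuming) the bytes queued, unread, in the
# receiver's buffer; with the loop left running, this pile-up is what
# drives sk_rcvbuf toward tcp_rmem[2].
queued = receiver.recv(65536, socket.MSG_PEEK)
print(len(queued))

sender.close()
receiver.close()
listener.close()
```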
+ +As a result, a host can have individual tcp sessions with receive +buffers of size tcp_rmem[2], and the host itself can reach tcp_mem +limits, causing the host to go into tcp memory pressure mode. + +The fundamental issue is the relationship between the granularity +of the window scaling factor and the number of bytes ACKed back +to the sender. This problem has previously been identified in +RFC 7323, appendix F [1]. + +The Linux kernel currently adheres to never shrinking the window. + +In addition to the overallocation of memory mentioned above, the +current behavior is functionally incorrect, because once tcp_rmem[2] +is reached and no remediations remain (i.e. tcp collapse fails to +free up any more memory and there are no packets to prune from the +out-of-order queue), the receiver will drop in-window packets +resulting in retransmissions and an eventual timeout of the tcp +session. A receive buffer full condition should instead result +in a zero window and an indefinite wait. + +In practice, this problem is largely hidden for most flows. It +is not applicable to mice flows. Elephant flows can send data +fast enough to "overrun" the sk_rcvbuf limit (in a single ACK), +triggering a zero window. + +But this problem does show up for other types of flows. Examples +are websockets and other types of flows that send small amounts of +data spaced apart slightly in time. In these cases, we directly +encounter the problem described in [1]. + +RFC 7323, section 2.4 [2], says there are instances when a retracted +window can be offered, and that TCP implementations MUST ensure +that they handle a shrinking window, as specified in RFC 1122, +section 4.2.2.16 [3]. All prior RFCs on the topic of tcp window +management have made clear that the sender must accept a shrunk window +from the receiver, including RFC 793 [4] and RFC 1323 [5].
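The granularity problem identified in RFC 7323, appendix F, can be sketched numerically. The following is a minimal userspace model with hypothetical window values, mirroring the ALIGN()/round_down() arithmetic the patch touches in net/ipv4/tcp_output.c; it is not kernel code.

```python
# With rcv_wscale = 7 the advertised window is a 16-bit value shifted
# left by 7 bits, so its granularity is 2**7 = 128 bytes: the receiver
# cannot express a reduction smaller than 128 bytes, and the old
# never-shrink path re-ALIGNs *up* to the current window whenever the
# computed window is smaller.

WSCALE = 7
UNIT = 1 << WSCALE           # 128-byte advertisement granularity

def align_up(x, a):          # kernel ALIGN() for positive values
    return -(-x // a) * a

def round_down(x, a):        # kernel round_down()
    return (x // a) * a

cur_win = 65536              # currently offered window (hypothetical)
new_win = cur_win - 100      # __tcp_select_window() wants 100 bytes less

# Old behavior: new_win < cur_win, so the right edge is held in place;
# the 100 bytes are effectively granted on top of the memory limit.
kept = align_up(cur_win, UNIT)

# With tcp_shrink_window=1 and a non-zero wscale, the patch instead
# rounds the desired window down to the scaling unit, letting it shrink.
shrunk = round_down(new_win, UNIT)
print(kept, shrunk)
```

Note that the shrunk advertisement lands on a 128-byte boundary below the desired value, while the old path never moves below the current window at all.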
+ +This patch implements the functionality to shrink the tcp window +when necessary to keep the right edge within the memory limit set by +autotuning (sk_rcvbuf). This new functionality is enabled with +the new sysctl: net.ipv4.tcp_shrink_window + +Additional information can be found at: +https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/ + +[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F +[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4 +[3] https://www.rfc-editor.org/rfc/rfc1122#page-91 +[4] https://www.rfc-editor.org/rfc/rfc793 +[5] https://www.rfc-editor.org/rfc/rfc1323 + +Signed-off-by: Mike Freemon +Reviewed-by: Eric Dumazet +Signed-off-by: David S. Miller +Signed-off-by: Sasha Levin +--- + Documentation/networking/ip-sysctl.rst | 15 +++++++ + include/net/netns/ipv4.h | 1 + + net/ipv4/sysctl_net_ipv4.c | 9 ++++ + net/ipv4/tcp_ipv4.c | 2 + + net/ipv4/tcp_output.c | 60 ++++++++++++++++++++++---- + 5 files changed, 78 insertions(+), 9 deletions(-) + +diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst +index f5f7a464605f9..b47b3d0ce5596 100644 +--- a/Documentation/networking/ip-sysctl.rst ++++ b/Documentation/networking/ip-sysctl.rst +@@ -967,6 +967,21 @@ tcp_tw_reuse - INTEGER + tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + ++tcp_shrink_window - BOOLEAN ++ This changes how the TCP receive window is calculated. ++ ++ RFC 7323, section 2.4, says there are instances when a retracted ++ window can be offered, and that TCP implementations MUST ensure ++ that they handle a shrinking window, as specified in RFC 1122. ++ ++ - 0 - Disabled. The window is never shrunk. ++ - 1 - Enabled. The window is shrunk when necessary to remain within ++ the memory limit set by autotuning (sk_rcvbuf). ++ This only occurs if a non-zero receive window ++ scaling factor is also in effect.
++ ++ Default: 0 ++ + tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP sockets. + Each TCP socket has rights to use it due to fact of its birth. +diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h +index 1b80046794451..ede2ff1da53a3 100644 +--- a/include/net/netns/ipv4.h ++++ b/include/net/netns/ipv4.h +@@ -64,6 +64,7 @@ struct netns_ipv4 { + #endif + bool fib_has_custom_local_routes; + bool fib_offload_disabled; ++ u8 sysctl_tcp_shrink_window; + #ifdef CONFIG_IP_ROUTE_CLASSID + atomic_t fib_num_tclassid_users; + #endif +diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c +index f68762ce4d8a3..73e5821584c18 100644 +--- a/net/ipv4/sysctl_net_ipv4.c ++++ b/net/ipv4/sysctl_net_ipv4.c +@@ -1387,6 +1387,15 @@ static struct ctl_table ipv4_net_table[] = { + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_TWO, + }, ++ { ++ .procname = "tcp_shrink_window", ++ .data = &init_net.ipv4.sysctl_tcp_shrink_window, ++ .maxlen = sizeof(u8), ++ .mode = 0644, ++ .proc_handler = proc_dou8vec_minmax, ++ .extra1 = SYSCTL_ZERO, ++ .extra2 = SYSCTL_ONE, ++ }, + { } + }; + +diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c +index f9b8a4a1d2edc..5df19f93f86ab 100644 +--- a/net/ipv4/tcp_ipv4.c ++++ b/net/ipv4/tcp_ipv4.c +@@ -3221,6 +3221,8 @@ static int __net_init tcp_sk_init(struct net *net) + else + net->ipv4.tcp_congestion_control = &tcp_reno; + ++ net->ipv4.sysctl_tcp_shrink_window = 0; ++ + return 0; + } + +diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c +index 5921b0f6f9f41..443b1cab25299 100644 +--- a/net/ipv4/tcp_output.c ++++ b/net/ipv4/tcp_output.c +@@ -259,8 +259,8 @@ static u16 tcp_select_window(struct sock *sk) + u32 old_win = tp->rcv_wnd; + u32 cur_win = tcp_receive_window(tp); + u32 new_win = __tcp_select_window(sk); ++ struct net *net = sock_net(sk); + +- /* Never shrink the offered window */ + if (new_win < cur_win) { + /* Danger Will Robinson! 
+ * Don't update rcv_wup/rcv_wnd here or else +@@ -269,11 +269,14 @@ static u16 tcp_select_window(struct sock *sk) + * + * Relax Will Robinson. + */ +- if (new_win == 0) +- NET_INC_STATS(sock_net(sk), +- LINUX_MIB_TCPWANTZEROWINDOWADV); +- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); ++ if (!READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) || !tp->rx_opt.rcv_wscale) { ++ /* Never shrink the offered window */ ++ if (new_win == 0) ++ NET_INC_STATS(net, LINUX_MIB_TCPWANTZEROWINDOWADV); ++ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale); ++ } + } ++ + tp->rcv_wnd = new_win; + tp->rcv_wup = tp->rcv_nxt; + +@@ -281,7 +284,7 @@ static u16 tcp_select_window(struct sock *sk) + * scaled window. + */ + if (!tp->rx_opt.rcv_wscale && +- READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)) ++ READ_ONCE(net->ipv4.sysctl_tcp_workaround_signed_windows)) + new_win = min(new_win, MAX_TCP_WINDOW); + else + new_win = min(new_win, (65535U << tp->rx_opt.rcv_wscale)); +@@ -293,10 +296,9 @@ static u16 tcp_select_window(struct sock *sk) + if (new_win == 0) { + tp->pred_flags = 0; + if (old_win) +- NET_INC_STATS(sock_net(sk), +- LINUX_MIB_TCPTOZEROWINDOWADV); ++ NET_INC_STATS(net, LINUX_MIB_TCPTOZEROWINDOWADV); + } else if (old_win == 0) { +- NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV); ++ NET_INC_STATS(net, LINUX_MIB_TCPFROMZEROWINDOWADV); + } + + return new_win; +@@ -2949,6 +2951,7 @@ u32 __tcp_select_window(struct sock *sk) + { + struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_sock *tp = tcp_sk(sk); ++ struct net *net = sock_net(sk); + /* MSS for the peer's data. Previous versions used mss_clamp + * here. I don't know if the value based on our guesses + * of peer's MSS is better for the performance. It's more correct +@@ -2970,6 +2973,15 @@ u32 __tcp_select_window(struct sock *sk) + if (mss <= 0) + return 0; + } ++ ++ /* Only allow window shrink if the sysctl is enabled and we have ++ * a non-zero scaling factor in effect. 
++ */ ++ if (READ_ONCE(net->ipv4.sysctl_tcp_shrink_window) && tp->rx_opt.rcv_wscale) ++ goto shrink_window_allowed; ++ ++ /* do not allow window to shrink */ ++ + if (free_space < (full_space >> 1)) { + icsk->icsk_ack.quick = 0; + +@@ -3024,6 +3036,36 @@ u32 __tcp_select_window(struct sock *sk) + } + + return window; ++ ++shrink_window_allowed: ++ /* new window should always be an exact multiple of scaling factor */ ++ free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale); ++ ++ if (free_space < (full_space >> 1)) { ++ icsk->icsk_ack.quick = 0; ++ ++ if (tcp_under_memory_pressure(sk)) ++ tcp_adjust_rcv_ssthresh(sk); ++ ++ /* if free space is too low, return a zero window */ ++ if (free_space < (allowed_space >> 4) || free_space < mss || ++ free_space < (1 << tp->rx_opt.rcv_wscale)) ++ return 0; ++ } ++ ++ if (free_space > tp->rcv_ssthresh) { ++ free_space = tp->rcv_ssthresh; ++ /* new window should always be an exact multiple of scaling factor ++ * ++ * For this case, we ALIGN "up" (increase free_space) because ++ * we know free_space is not zero here, it has been reduced from ++ * the memory-based limit, and rcv_ssthresh is not a hard limit ++ * (unlike sk_rcvbuf). ++ */ ++ free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale)); ++ } ++ ++ return free_space; + } + + void tcp_skb_collapse_tstamp(struct sk_buff *skb, +-- +2.40.1 +
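For illustration, the shrink_window_allowed branch added to __tcp_select_window() can be modeled in userspace. This is a simplified sketch with hypothetical byte counts, not kernel code; the parameter names (free_space, rcv_ssthresh, etc.) follow the fields used in the patch.

```python
# Userspace model of the shrink_window_allowed branch the patch adds
# to __tcp_select_window(). All inputs are hypothetical byte counts;
# the real code reads them from struct tcp_sock / struct sock.

def round_down(x, a):
    return (x // a) * a

def align_up(x, a):
    # Equivalent of the kernel ALIGN() macro for positive values.
    return -(-x // a) * a

def select_window_shrink(free_space, full_space, allowed_space,
                         mss, rcv_ssthresh, rcv_wscale):
    unit = 1 << rcv_wscale
    # New window should always be an exact multiple of the scaling factor.
    free_space = round_down(free_space, unit)

    if free_space < (full_space >> 1):
        # If free space is too low, return a zero window.
        if (free_space < (allowed_space >> 4)
                or free_space < mss
                or free_space < unit):
            return 0

    if free_space > rcv_ssthresh:
        # rcv_ssthresh is a soft limit (unlike sk_rcvbuf), so the
        # patch ALIGNs *up* after clamping to it.
        free_space = align_up(rcv_ssthresh, unit)

    return free_space

# Ample buffer: the window is clamped to rcv_ssthresh, aligned up to 128.
roomy = select_window_shrink(70000, 131072, 131072, 1460, 65535, 7)
# Nearly full buffer: the window is allowed to shrink all the way to zero,
# instead of dropping in-window packets once remediations are exhausted.
tight = select_window_shrink(1000, 131072, 131072, 1460, 65535, 7)
print(roomy, tight)
```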