Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" This man page is Copyright (C) 1999 Andi Kleen <ak@muc.de>. |
2 | .\" Permission is granted to distribute possibly modified copies | |
3 | .\" of this page provided the header is included verbatim, | |
4 | .\" and in case of nontrivial modification author and date | |
5 | .\" of the modification is added to the header. | |
6 | .\" | |
7 | .\" 2.4 Updates by Nivedita Singhvi 4/20/02 <nivedita@us.ibm.com>. | |
8dd9ef47 MK |
8 | .\" Modified, 2004-11-11, Michael Kerrisk and Andries Brouwer |
9 | .\" Updated details of interaction of TCP_CORK and TCP_NODELAY. | |
fea681da | 10 | .\" |
5c45d5f5 | 11 | .TH TCP 7 2005-06-15 "Linux Man Page" "Linux Programmer's Manual" |
fea681da MK |
12 | .SH NAME |
13 | tcp \- TCP protocol | |
14 | .SH SYNOPSIS | |
15 | .B #include <sys/socket.h> | |
16 | .br | |
17 | .B #include <netinet/in.h> | |
18 | .br | |
19 | .B #include <netinet/tcp.h> | |
20 | .br | |
21 | .B tcp_socket = socket(PF_INET, SOCK_STREAM, 0); | |
22 | .SH DESCRIPTION | |
23 | This is an implementation of the TCP protocol defined in | |
ccca85be | 24 | RFC\ 793, RFC\ 1122 and RFC\ 2001 with the NewReno and SACK |
fd1835be MK |
25 | extensions. It provides a reliable, stream-oriented, |
26 | full-duplex connection between two sockets on top of | |
fea681da MK |
27 | .BR ip (7), |
28 | for both v4 and v6 versions. | |
29 | TCP guarantees that the data arrives in order and | |
fd1835be MK |
30 | retransmits lost packets. |
31 | It generates and checks a per-packet checksum to catch transmission errors. | |
32 | TCP does not preserve record boundaries. | |
fea681da | 33 | |
fd1835be | 34 | A newly created TCP socket has no remote or local address and is not |
fea681da MK |
35 | fully specified. To create an outgoing TCP connection use |
36 | .BR connect (2) | |
37 | to establish a connection to another TCP socket. | |
fd1835be | 38 | To receive new incoming connections, first |
fea681da | 39 | .BR bind (2) |
fd1835be | 40 | the socket to a local address and port and then call |
fea681da | 41 | .BR listen (2) |
fd1835be | 42 | to put the socket into the listening state. After that a new |
fea681da MK |
43 | socket for each incoming connection can be accepted |
44 | using | |
45 | .BR accept (2). | |
46 | A socket which has had | |
fd1835be | 47 | .B accept() |
fea681da | 48 | or |
fd1835be | 49 | .B connect() |
fea681da MK |
50 | successfully called on it is fully specified and may |
51 | transmit data. Data cannot be transmitted on listening or | |
52 | not yet connected sockets. | |
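The call sequence described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names `make_listener` and `connect_to_listener` are hypothetical, and the loopback address is an assumption chosen so that the example needs no external peer.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, bind it to the loopback
 * address and put it into the listening state. Passing port 0 lets the
 * kernel choose a free port. Returns the descriptor, or -1 on error. */
int make_listener(unsigned short port, int backlog)
{
    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: open a second socket and connect() it to the
 * address the listener is bound to. Returns the descriptor, or -1. */
int connect_to_listener(int listener)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(listener, (struct sockaddr *)&addr, &len) < 0)
        return -1;

    int fd = socket(PF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&addr, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would then call accept(2) on the descriptor returned by `make_listener` to obtain one fully specified socket per incoming connection.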
53 | ||
ccca85be | 54 | Linux supports RFC\ 1323 TCP high performance |
fea681da MK |
55 | extensions. These include Protection Against Wrapped |
56 | Sequence Numbers (PAWS), Window Scaling and | |
57 | Timestamps. Window scaling allows the use | |
58 | of large (> 64K) TCP windows in order to support links with high | |
59 | latency or bandwidth. To make use of them, the send and | |
60 | receive buffer sizes must be increased. | |
61 | They can be set globally with the | |
62 | .B net.ipv4.tcp_wmem | |
63 | and | |
64 | .B net.ipv4.tcp_rmem | |
65 | sysctl variables, or on individual sockets by using the | |
66 | .B SO_SNDBUF | |
67 | and | |
68 | .B SO_RCVBUF | |
69 | socket options with the | |
70 | .BR setsockopt (2) | |
71 | call. | |
72 | ||
73 | The maximum sizes for socket buffers declared via the | |
74 | .B SO_SNDBUF | |
75 | and | |
76 | .B SO_RCVBUF | |
77 | mechanisms are limited by the global | |
78 | .B net.core.rmem_max | |
79 | and | |
80 | .B net.core.wmem_max | |
81 | sysctls. Note that TCP actually allocates twice the size of | |
82 | the buffer requested in the | |
83 | .BR setsockopt (2) | |
84 | call, and so a succeeding | |
85 | .BR getsockopt (2) | |
86 | call will not return the same size of buffer as requested | |
87 | in the | |
88 | .BR setsockopt (2) | |
fd1835be | 89 | call. TCP uses the extra space for administrative purposes and internal |
fea681da MK |
90 | kernel structures, and the sysctl variables reflect the |
91 | larger sizes compared to the actual TCP windows. | |
92 | On individual connections, the socket buffer size must be | |
93 | set prior to the | |
94 | .B listen() | |
95 | or | |
96 | .B connect() | |
97 | calls in order to have it take effect. See | |
98 | .BR socket (7) | |
99 | for more information. | |
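As an illustration of the buffer sizing described above, the following sketch requests a receive buffer size and reads back what the kernel reports. The helper name `request_rcvbuf` is hypothetical. The value read back is expected to differ from the request (typically doubled, and capped by net.core.rmem_max), so callers should not assume equality.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: ask for a receive buffer of `bytes` bytes and
 * return the size reported by getsockopt(), or -1 on error. Call this
 * before listen() or connect() for the size to take effect. */
int request_rcvbuf(int fd, int bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;

    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```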
100 | .PP | |
101 | TCP supports urgent data. Urgent data is used to signal the | |
102 | receiver that some important message is part of the data | |
103 | stream and that it should be processed as soon as possible. | |
104 | To send urgent data specify the | |
105 | .B MSG_OOB | |
106 | option to | |
107 | .BR send (2). | |
108 | When urgent data is received, the kernel sends a | |
109 | .B SIGURG | |
61f4934a | 110 | signal to the process or process group that has been set as the |
ccca85be | 111 | socket "owner" using the |
fea681da MK |
112 | .B SIOCSPGRP |
113 | or | |
114 | .B FIOSETOWN | |
ccca85be MK |
115 | ioctls (or the SUSv3-specified |
116 | .BR fcntl (2) | |
117 | .B F_SETOWN | |
118 | operation). | |
119 | When the | |
fea681da MK |
120 | .B SO_OOBINLINE |
121 | socket option is enabled, urgent data is put into the normal | |
fd1835be | 122 | data stream (a program can test for its location using the |
fea681da | 123 | .B SIOCATMARK |
ccca85be | 124 | ioctl described below), |
fea681da MK |
125 | otherwise it can only be received when the |
126 | .B MSG_OOB | |
127 | flag is set for | |
95d29ab2 MK |
128 | .BR recv (2) |
129 | or | |
130 | .BR recvmsg (2). | |
fea681da MK |
131 | |
132 | Linux 2.4 introduced a number of changes for improved | |
133 | throughput and scaling, as well as enhanced functionality. | |
fd1835be | 134 | Some of these features include support for zero-copy |
fea681da MK |
135 | .BR sendfile (2), |
136 | Explicit Congestion Notification, new | |
137 | management of TIME_WAIT sockets, keep-alive socket options | |
138 | and support for Duplicate SACK extensions. | |
139 | .SH "ADDRESS FORMATS" | |
140 | TCP is built on top of IP (see | |
141 | .BR ip (7)). | |
142 | The address formats defined by | |
143 | .BR ip (7) | |
144 | apply to TCP. TCP only supports point-to-point | |
145 | communication; broadcasting and multicasting are not | |
146 | supported. | |
147 | .SH SYSCTLS | |
148 | These variables can be accessed by the | |
149 | .B /proc/sys/net/ipv4/* | |
150 | files or with the | |
151 | .BR sysctl (2) | |
152 | interface. In addition, most IP sysctls also apply to TCP; see | |
153 | .BR ip (7). | |
ccca85be MK |
154 | Variables described as |
155 | .I Boolean | |
156 | take an integer value, with a non-zero value ("true") meaning that | |
157 | the corresponding option is enabled, and a zero value ("false") | |
158 | meaning that the option is disabled. | |
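A program can read one of these variables through the /proc interface roughly as sketched below. The helper name `read_tcp_sysctl` is hypothetical, and the sketch assumes a variable that holds a single integer (vector-valued variables such as tcp_mem would need a different parser).

```c
#include <stdio.h>

/* Hypothetical helper: read a single-integer variable from
 * /proc/sys/net/ipv4/<name>. Returns 0 on success, -1 on error. */
int read_tcp_sysctl(const char *name, long *value)
{
    char path[256];
    int n = snprintf(path, sizeof(path), "/proc/sys/net/ipv4/%s", name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* name too long for the buffer */

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* no such variable, or /proc not mounted */

    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, `read_tcp_sysctl("tcp_fin_timeout", &v)` would store the current timeout in `v` on a system where that file is readable.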
5c45d5f5 MK |
159 | .\" FIXME: As at 14 Jun 2005, kernel 2.6.12, the following are |
160 | .\" not yet documented (shown with default values): | |
161 | .\" | |
162 | .\" /proc/sys/net/ipv4/tcp_bic_beta | |
163 | .\" 819 | |
164 | .\" /proc/sys/net/ipv4/tcp_moderate_rcvbuf | |
165 | .\" 1 | |
166 | .\" /proc/sys/net/ipv4/tcp_no_metrics_save | |
167 | .\" 0 | |
168 | .\" /proc/sys/net/ipv4/tcp_vegas_alpha | |
169 | .\" 2 | |
170 | .\" /proc/sys/net/ipv4/tcp_vegas_beta | |
171 | .\" 6 | |
172 | .\" /proc/sys/net/ipv4/tcp_vegas_gamma | |
173 | .\" 2 | |
fea681da | 174 | .TP |
5c45d5f5 | 175 | .BR tcp_abort_on_overflow " (Boolean; default: disabled)" |
fea681da | 176 | Enable resetting connections if the listening service is too |
5c45d5f5 MK |
177 | slow and unable to keep up and accept them. |
178 | Leaving this option disabled means that if an overflow occurred due | |
fea681da | 179 | to a burst, the connection will recover. Enable this option |
fd1835be MK |
180 | .I only |
181 | if you are really sure that the listening daemon | |
fea681da MK |
182 | cannot be tuned to accept connections faster. Enabling this |
183 | option can harm the clients of your server. | |
184 | .TP | |
5c45d5f5 | 185 | .BR tcp_adv_win_scale " (integer; default: 2)" |
fea681da MK |
186 | Count buffering overhead as bytes/2^tcp_adv_win_scale |
187 | (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), | |
5c45d5f5 | 188 | if it is <= 0. |
fea681da MK |
189 | |
190 | The socket receive buffer space is shared between the | |
191 | application and kernel. TCP maintains part of the buffer as | |
192 | the TCP window; this is the size of the receive window | |
193 | advertised to the other end. The rest of the space is used | |
194 | as the "application" buffer, used to isolate the network | |
195 | from scheduling and application latencies. The | |
5c45d5f5 | 196 | .BR tcp_adv_win_scale |
fea681da MK |
197 | default value of 2 implies that the space |
198 | used for the application buffer is one fourth that of the | |
199 | total. | |
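The overhead formula above can be written out as a small function. This is only a sketch of the arithmetic as stated in this page; the function names are hypothetical, and integer shifts stand in for division by a power of two.

```c
/* Buffering overhead, in bytes, for a receive buffer of `space` bytes:
 *   scale > 0  : space / 2^scale
 *   scale <= 0 : space - space / 2^(-scale)                          */
long adv_win_overhead(long space, int scale)
{
    if (scale > 0)
        return space >> scale;
    return space - (space >> -scale);
}

/* Remaining part of the buffer, usable as the advertised TCP window. */
long adv_window(long space, int scale)
{
    return space - adv_win_overhead(space, scale);
}
```

With the default scale of 2 and a buffer of 87380 bytes, one fourth (21845 bytes) is counted as overhead, which leaves 65535 bytes for the window.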
200 | .TP | |
5c45d5f5 | 201 | .BR tcp_app_win " (integer; default: 31)" |
fea681da MK |
202 | This variable defines how many |
203 | bytes of the TCP window are reserved for buffering | |
204 | overhead. | |
205 | ||
206 | max(window/2^tcp_app_win, mss) bytes in the window | |
207 | are reserved for the application buffer. A value of 0 | |
5c45d5f5 MK |
208 | implies that no amount is reserved. |
209 | .\" | |
6f802359 | 210 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
fea681da | 211 | .TP |
5c45d5f5 MK |
212 | .BR tcp_bic " (Boolean; default: disabled)" |
213 | Enable BIC TCP congestion control algorithm. | |
214 | BIC-TCP is a sender-side only change that ensures a linear RTT | |
215 | fairness under large windows while offering both scalability and | |
216 | bounded TCP-friendliness. The protocol combines two schemes | |
217 | called additive increase and binary search increase. When the | |
218 | congestion window is large, additive increase with a large | |
219 | increment ensures linear RTT fairness as well as good | |
220 | scalability. Under small congestion windows, binary search | |
221 | increase provides TCP friendliness. | |
222 | .\" | |
6f802359 | 223 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
224 | .TP |
225 | .BR tcp_bic_low_window " (integer; default: 14)" | |
226 | Sets the threshold window (in packets) where BIC TCP starts to | |
227 | adjust the congestion window. Below this threshold BIC TCP behaves | |
228 | the same as the default TCP Reno. | |
229 | .\" | |
6f802359 | 230 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
231 | .TP |
232 | .BR tcp_bic_fast_convergence " (Boolean; default: enabled)" | |
233 | Forces BIC TCP to more quickly respond to changes in congestion | |
234 | window. Allows two flows sharing the same connection to converge | |
235 | more rapidly. | |
236 | .TP | |
237 | .BR tcp_dsack " (Boolean; default: enabled)" | |
ccca85be | 238 | Enable RFC\ 2883 TCP Duplicate SACK support. |
fea681da | 239 | .TP |
5c45d5f5 | 240 | .BR tcp_ecn " (Boolean; default: disabled)" |
ccca85be | 241 | Enable RFC\ 3168 Explicit Congestion Notification.
5c45d5f5 | 242 | When enabled, connectivity to some |
fea681da MK |
243 | destinations could be affected due to older, misbehaving |
244 | routers along the path causing connections to be dropped. | |
245 | .TP | |
5c45d5f5 MK |
246 | .BR tcp_fack " (Boolean; default: enabled)" |
247 | Enable TCP Forward Acknowledgement support. | |
fea681da | 248 | .TP |
5c45d5f5 | 249 | .BR tcp_fin_timeout " (integer; default: 60)" |
fd1835be | 250 | This specifies how many seconds to wait for a final FIN packet before the |
fea681da MK |
251 | socket is forcibly closed. This is strictly a violation of |
252 | the TCP specification, but required to prevent | |
5c45d5f5 MK |
253 | denial-of-service attacks. |
254 | In Linux 2.2, the default value was 180. | |
255 | .\" | |
6f802359 | 256 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
257 | .TP |
258 | .BR tcp_frto " (Boolean; default: disabled)" | |
259 | Enables F-RTO, an enhanced recovery algorithm for TCP retransmission | |
260 | timeouts. It is particularly beneficial in wireless environments | |
261 | where packet loss is typically due to random radio interference | |
262 | rather than intermediate router congestion. | |
fea681da | 263 | .TP |
5c45d5f5 | 264 | .BR tcp_keepalive_intvl " (integer; default: 75)" |
fea681da | 265 | The number of seconds between TCP keep-alive probes. |
fea681da | 266 | .TP |
5c45d5f5 | 267 | .BR tcp_keepalive_probes " (integer; default: 9)" |
fea681da MK |
268 | The maximum number of TCP keep-alive probes to send |
269 | before giving up and killing the connection if | |
270 | no response is obtained from the other end. | |
fea681da | 271 | .TP |
5c45d5f5 | 272 | .BR tcp_keepalive_time " (integer; default: 7200)" |
fea681da MK |
273 | The number of seconds a connection needs to be idle |
274 | before TCP begins sending out keep-alive probes. | |
275 | Keep-alives are only sent when the | |
276 | .B SO_KEEPALIVE | |
277 | socket option is enabled. The default value is 7200 seconds | |
278 | (2 hours). An idle connection is terminated after | |
279 | approximately an additional 11 minutes (9 probes sent at | |
280 | intervals of 75 seconds) when keep-alive is enabled. | |
281 | ||
282 | Note that underlying connection tracking mechanisms and | |
283 | application timeouts may be much shorter. | |
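The timing above follows from simple arithmetic on the three keep-alive variables. The function below is a sketch of that calculation only; its name is hypothetical, and it assumes that every probe goes unanswered.

```c
/* Seconds from the last activity on a connection until TCP gives up on
 * an unresponsive peer: the idle time before the first probe, plus one
 * probe interval for each unanswered probe. */
long keepalive_dead_after(long time_s, long intvl_s, long probes)
{
    return time_s + intvl_s * probes;
}
```

With the defaults (7200, 75, 9) this gives 7875 seconds: two hours of idle time plus 675 seconds, or about 11 minutes, of probing.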
5c45d5f5 | 284 | .\" |
6f802359 | 285 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
286 | .TP |
287 | .BR tcp_low_latency " (Boolean; default: disabled)" | |
288 | If enabled, the TCP stack makes decisions that prefer lower | |
289 | latency as opposed to higher throughput. | |
290 | If this option is disabled, then higher throughput is preferred. | |
291 | An example of an application where this default should be | |
292 | changed would be a Beowulf compute cluster. | |
fea681da | 293 | .TP |
5c45d5f5 | 294 | .BR tcp_max_orphans " (integer; default: see below)" |
fea681da MK |
295 | The maximum number of orphaned (not attached to any user file |
296 | handle) TCP sockets allowed in the system. When this number | |
297 | is exceeded, the orphaned connection is reset and a warning | |
fd1835be | 298 | is printed. This limit exists only to prevent simple denial-of-service |
fea681da MK |
299 | attacks. Lowering this limit is not recommended. Network |
300 | conditions might require you to increase the number of | |
301 | orphans allowed, but note that each orphan can eat up to ~64K | |
302 | of unswappable memory. The default initial value is set | |
303 | equal to the kernel parameter NR_FILE. This initial default | |
304 | is adjusted depending on the memory in the system. | |
305 | .TP | |
5c45d5f5 | 306 | .BR tcp_max_syn_backlog " (integer; default: see below)" |
fea681da MK |
307 | The maximum number of queued connection requests which have |
308 | still not received an acknowledgement from the connecting | |
309 | client. If this number is exceeded, the kernel will begin | |
310 | dropping requests. The default value of 256 is increased to | |
311 | 1024 when the memory present in the system is adequate or | |
312 | greater (>= 128MB), and reduced to 128 for those systems with | |
313 | very low memory (<= 32MB). It is recommended that if this | |
314 | needs to be increased above 1024, TCP_SYNQ_HSIZE in | |
315 | include/net/tcp.h be modified to keep | |
316 | TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and the kernel be | |
317 | recompiled. | |
318 | .TP | |
5c45d5f5 | 319 | .BR tcp_max_tw_buckets " (integer; default: see below)" |
fea681da | 320 | The maximum number of sockets in TIME_WAIT state allowed in |
fd1835be | 321 | the system. This limit exists only to prevent simple denial-of-service |
fea681da MK |
322 | attacks. The default value of NR_FILE*2 is adjusted |
323 | depending on the memory in the system. If this number is | |
324 | exceeded, the socket is closed and a warning is printed. | |
325 | .TP | |
5c45d5f5 | 326 | .BR tcp_mem |
fea681da MK |
327 | This is a vector of 3 integers: [low, pressure, high]. These |
328 | bounds are used by TCP to track its memory usage. The | |
329 | defaults are calculated at boot time from the amount of | |
330 | available memory. | |
331 | ||
332 | .I low | |
333 | - TCP doesn't regulate its memory allocation when the number | |
334 | of pages it has allocated globally is below this number. | |
335 | ||
336 | .I pressure | |
337 | - when the amount of memory allocated by TCP | |
338 | exceeds this number of pages, TCP moderates its memory | |
339 | consumption. This memory pressure state is exited | |
340 | once the number of pages allocated falls below | |
341 | the | |
342 | .I low | |
343 | mark. | |
344 | ||
345 | .I high | |
346 | - the maximum number of pages, globally, that TCP | |
347 | will allocate. This value overrides any other limits | |
348 | imposed by the kernel. | |
349 | .TP | |
5c45d5f5 | 350 | .BR tcp_orphan_retries " (integer; default: 8)" |
fea681da MK |
351 | The maximum number of attempts made to probe the other |
352 | end of a connection which has been closed by our end. | |
fea681da | 353 | .TP |
5c45d5f5 | 354 | .BR tcp_reordering " (integer; default: 3)" |
fea681da MK |
355 | The maximum a packet can be reordered in a TCP packet stream |
356 | without TCP assuming packet loss and going into slow start. | |
5c45d5f5 | 357 | It is not advisable to change this number. |
fea681da MK |
358 | This is a packet reordering detection metric designed to |
359 | minimize unnecessary back off and retransmits provoked by | |
360 | reordering of packets on a connection. | |
361 | .TP | |
5c45d5f5 | 362 | .BR tcp_retrans_collapse " (Boolean; default: enabled)" |
fea681da | 363 | Try to send full-sized packets during retransmit. |
fea681da | 364 | .TP |
5c45d5f5 | 365 | .BR tcp_retries1 " (integer; default: 3)" |
fea681da MK |
366 | The number of times TCP will attempt to retransmit a |
367 | packet on an established connection normally, | |
368 | without the extra effort of getting the network | |
369 | layers involved. Once we exceed this number of | |
370 | retransmits, we first have the network layer | |
371 | update the route if possible before each new retransmit. | |
372 | The default is the RFC specified minimum of 3. | |
373 | .TP | |
5c45d5f5 | 374 | .BR tcp_retries2 " (integer; default: 15)" |
fea681da MK |
375 | The maximum number of times a TCP packet is retransmitted |
376 | in established state before giving up. The default | |
377 | value is 15, which corresponds to a duration of | |
378 | approximately 13 to 30 minutes, depending | |
ccca85be | 379 | on the retransmission timeout. The RFC\ 1122 specified |
fea681da MK |
380 | minimum limit of 100 seconds is typically deemed too |
381 | short. | |
382 | .TP | |
5c45d5f5 | 383 | .BR tcp_rfc1337 " (Boolean; default: disabled)" |
fea681da | 384 | Enable TCP behaviour conformant with RFC 1337. |
5c45d5f5 | 385 | When disabled, |
fea681da MK |
386 | if a RST is received in TIME_WAIT state, we close |
387 | the socket immediately without waiting for the end | |
388 | of the TIME_WAIT period. | |
389 | .TP | |
5c45d5f5 | 390 | .BR tcp_rmem |
fea681da MK |
391 | This is a vector of 3 integers: [min, default, |
392 | max]. These parameters are used by TCP to regulate receive | |
393 | buffer sizes. TCP dynamically adjusts the size of the | |
394 | receive buffer from the defaults listed below, in the range | |
395 | of these sysctl variables, depending on memory available | |
396 | in the system. | |
397 | ||
398 | .I min | |
399 | - minimum size of the receive buffer used by each TCP | |
400 | socket. The default value is 4K, and is lowered to | |
fd1835be | 401 | PAGE_SIZE bytes in low-memory systems. This value |
fea681da MK |
402 | is used to ensure that in memory pressure mode, |
403 | allocations below this size will still succeed. This is not | |
404 | used to bound the size of the receive buffer declared | |
405 | using | |
406 | .B SO_RCVBUF | |
407 | on a socket. | |
408 | ||
409 | .I default | |
410 | - the default size of the receive buffer for a TCP socket. | |
411 | This value overwrites the initial default buffer size from | |
412 | the generic global | |
413 | .B net.core.rmem_default | |
414 | defined for all protocols. The default value is 87380 | |
fd1835be | 415 | bytes, and is lowered to 43689 in low-memory systems. If |
fea681da MK |
416 | larger receive buffer sizes are desired, this value should |
417 | be increased (to affect all sockets). To employ large TCP | |
418 | windows, the | |
419 | .B net.ipv4.tcp_window_scaling | |
420 | must be enabled (default). | |
421 | ||
422 | .I max | |
423 | - the maximum size of the receive buffer used by | |
424 | each TCP socket. This value does not override the global | |
425 | .BR net.core.rmem_max . | |
426 | This is not used to limit the size of the receive buffer | |
427 | declared using | |
428 | .B SO_RCVBUF | |
429 | on a socket. | |
430 | The default value of 87380*2 bytes is lowered to 87380 | |
fd1835be | 431 | in low-memory systems. |
fea681da | 432 | .TP |
5c45d5f5 | 433 | .BR tcp_sack " (Boolean; default: enabled)" |
ccca85be | 434 | Enable RFC\ 2018 TCP Selective Acknowledgements. |
fea681da | 435 | .TP |
5c45d5f5 | 436 | .BR tcp_stdurg " (Boolean; default: disabled)" |
ccca85be | 437 | If this option is enabled, then use the RFC\ 1122 interpretation |
6f802359 | 438 | of the TCP urgent-pointer field. |
ccca85be MK |
439 | .\" RFC 793 was ambiguous in its specification of the meaning of the |
440 | .\" urgent pointer. RFC 1122 (and RFC 961) fixed on a particular | |
441 | .\" resolution of this ambiguity (unfortunately the "wrong" one). | |
6f802359 MK |
442 | According to this interpretation, the urgent pointer points |
443 | to the last byte of urgent data. | |
444 | If this option is disabled, then use the BSD-compatible interpretation of | |
ccca85be | 445 | the urgent pointer: |
6f802359 MK |
446 | the urgent pointer points to the first byte after the urgent data. |
447 | Enabling this option may lead to interoperability problems. | |
fea681da | 448 | .TP |
5c45d5f5 | 449 | .BR tcp_synack_retries " (integer; default: 5)" |
fea681da MK |
450 | The maximum number of times a SYN/ACK segment |
451 | for a passive TCP connection will be retransmitted. | |
5c45d5f5 | 452 | This number should not be higher than 255. |
fea681da | 453 | .TP |
5c45d5f5 | 454 | .BR tcp_syncookies " (Boolean)" |
fea681da MK |
455 | Enable TCP syncookies. The kernel must be compiled with |
456 | .BR CONFIG_SYN_COOKIES . | |
457 | Send out syncookies when the syn backlog queue of a socket | |
458 | overflows. The syncookies feature attempts to protect a | |
459 | socket from a SYN flood attack. This should be used as a | |
460 | last resort, if at all. This is a violation of the TCP | |
461 | protocol, and conflicts with other areas of TCP such as TCP | |
462 | extensions. It can cause problems for clients and relays. | |
463 | It is not recommended as a tuning mechanism for heavily | |
464 | loaded servers to help with overloaded or misconfigured | |
465 | conditions. For recommended alternatives see | |
466 | .BR tcp_max_syn_backlog , | |
467 | .BR tcp_synack_retries , | |
5c45d5f5 | 468 | and |
fea681da MK |
469 | .BR tcp_abort_on_overflow . |
470 | .TP | |
5c45d5f5 | 471 | .BR tcp_syn_retries " (integer; default: 5)" |
fea681da MK |
472 | The maximum number of times initial SYNs for an active TCP |
473 | connection attempt will be retransmitted. This value should | |
474 | not be higher than 255. The default value is 5, which | |
475 | corresponds to approximately 180 seconds. | |
476 | .TP | |
5c45d5f5 | 477 | .BR tcp_timestamps " (Boolean; default: enabled)" |
ccca85be | 478 | Enable RFC\ 1323 TCP timestamps. |
fea681da | 479 | .TP |
5c45d5f5 MK |
480 | .BR tcp_tw_recycle " (Boolean; default: disabled)" |
481 | Enable fast recycling of TIME-WAIT sockets. | |
482 | Enabling this option is not | |
fea681da MK |
483 | recommended since this causes problems when working |
484 | with NAT (Network Address Translation). | |
5c45d5f5 | 485 | .\" |
6f802359 | 486 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
487 | .TP |
488 | .BR tcp_tw_reuse " (Boolean; default: disabled)" | |
489 | Allow reuse of TIME-WAIT sockets for new connections when it is | |
490 | safe from the protocol viewpoint. | |
491 | It should not be changed without advice/request of technical | |
492 | experts. | |
fea681da | 493 | .TP |
5c45d5f5 | 494 | .BR tcp_window_scaling " (Boolean; default: enabled)"
ccca85be | 495 | Enable RFC\ 1323 TCP window scaling. |
5c45d5f5 | 496 | This feature allows the use of a large window |
fea681da MK |
497 | (> 64K) on a TCP connection, should the other end support it. |
498 | Normally, the 16 bit window length field in the TCP header | |
499 | limits the window size to less than 64K bytes. If larger | |
500 | windows are desired, applications can increase the size of | |
501 | their socket buffers and the window scaling option will be | |
502 | employed. If | |
5c45d5f5 | 503 | .BR tcp_window_scaling |
fea681da MK |
504 | is disabled, TCP will not negotiate the use of window |
505 | scaling with the other end during connection setup. | |
5c45d5f5 | 506 | .\" |
6f802359 | 507 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
508 | .TP |
509 | .BR tcp_vegas_cong_avoid " (Boolean; default: disabled)" | |
510 | Enable TCP Vegas congestion avoidance algorithm. | |
511 | TCP Vegas is a sender-side only change to TCP that anticipates | |
512 | the onset of congestion by estimating the bandwidth. TCP Vegas | |
513 | adjusts the sending rate by modifying the congestion | |
514 | window. TCP Vegas should provide less packet loss, but it is | |
515 | not as aggressive as TCP Reno. | |
516 | .\" | |
6f802359 | 517 | .\" The following is from 2.6.12: Documentation/networking/ip-sysctl.txt |
5c45d5f5 MK |
518 | .TP |
519 | .BR tcp_westwood " (Boolean; default: disabled)" | |
520 | Enable TCP Westwood+ congestion control algorithm. | |
521 | TCP Westwood+ is a sender-side only modification of the TCP Reno | |
522 | protocol stack that optimizes the performance of TCP congestion | |
523 | control. It is based on end-to-end bandwidth estimation to set | |
524 | congestion window and slow start threshold after a congestion | |
525 | episode. Using this estimation, TCP Westwood+ adaptively sets a | |
526 | slow start threshold and a congestion window which takes into | |
527 | account the bandwidth used at the time congestion is experienced. | |
ccca85be MK |
528 | TCP Westwood+ significantly increases fairness with respect to |
529 | TCP Reno in wired networks and throughput over wireless links. | |
fea681da | 530 | .TP |
5c45d5f5 | 531 | .BR tcp_wmem |
fea681da MK |
532 | This is a vector of 3 integers: [min, default, max]. These |
533 | parameters are used by TCP to regulate send buffer sizes. | |
534 | TCP dynamically adjusts the size of the send buffer from the | |
535 | default values listed below, in the range of these sysctl | |
536 | variables, depending on memory available. | |
537 | ||
538 | .I min | |
539 | - minimum size of the send buffer used by each TCP socket. | |
540 | The default value is 4K bytes. | |
541 | This value is used to ensure that in memory pressure mode, | |
542 | allocations below this size will still succeed. This is not | |
543 | used to bound the size of the send buffer declared | |
544 | using | |
545 | .B SO_SNDBUF | |
546 | on a socket. | |
547 | ||
548 | .I default | |
549 | - the default size of the send buffer for a TCP socket. | |
550 | This value overwrites the initial default buffer size from | |
551 | the generic global | |
552 | .B net.core.wmem_default | |
553 | defined for all protocols. The default value is 16K bytes. | |
554 | If larger send buffer sizes are desired, this value | |
555 | should be increased (to affect all sockets). To employ | |
556 | large TCP windows, the sysctl variable | |
557 | .B net.ipv4.tcp_window_scaling | |
558 | must be enabled (default). | |
559 | ||
560 | .I max | |
561 | - the maximum size of the send buffer used by | |
562 | each TCP socket. This value does not override the global | |
563 | .BR net.core.wmem_max . | |
564 | This is not used to limit the size of the send buffer | |
565 | declared using | |
566 | .B SO_SNDBUF | |
567 | on a socket. | |
568 | The default value is 128K bytes. It is lowered to 64K | |
569 | depending on the memory available in the system. | |
570 | .SH "SOCKET OPTIONS" | |
571 | To set or get a TCP socket option, call | |
572 | .BR getsockopt (2) | |
573 | to read or | |
574 | .BR setsockopt (2) | |
575 | to write the option with the option level argument set to | |
576 | .BR SOL_TCP . | |
577 | In addition, | |
578 | most | |
579 | .B SOL_IP | |
580 | socket options are valid on TCP sockets. For more | |
581 | information see | |
582 | .BR ip (7). | |
583 | .TP | |
584 | .B TCP_CORK | |
585 | If set, don't send out partial frames. All queued | |
586 | partial frames are sent when the option is cleared again. | |
587 | This is useful for prepending headers before calling | |
588 | .BR sendfile (2), | |
8dd9ef47 MK |
589 | or for throughput optimization. |
590 | This option can be | |
fea681da | 591 | combined with |
8dd9ef47 MK |
592 | .BR TCP_NODELAY |
593 | only since Linux 2.5.71. | |
fea681da MK |
594 | This option should not be used in code intended to be |
595 | portable. | |
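The header-then-file pattern mentioned above might look like the following sketch. The helper names are hypothetical. IPPROTO_TCP is used as the option level here; on Linux it has the same numeric value as SOL_TCP. Partial writes and partial sendfile(2) transfers are not retried, which a real program would need to handle.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK on a socket. */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Read back the current TCP_CORK setting, or -1 on error. */
int get_cork(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) < 0)
        return -1;
    return on;
}

/* Hypothetical helper: queue a header, then `count` bytes from in_fd,
 * and uncork so that the queued partial frames are sent. */
int send_with_header(int sock, const void *hdr, size_t hdr_len,
                     int in_fd, size_t count)
{
    int rc = 0;

    if (set_cork(sock, 1) < 0)
        return -1;
    if (write(sock, hdr, hdr_len) != (ssize_t)hdr_len)
        rc = -1;
    else if (sendfile(sock, in_fd, NULL, count) < 0)
        rc = -1;
    if (set_cork(sock, 0) < 0)      /* clearing the option flushes */
        rc = -1;
    return rc;
}
```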
596 | .TP | |
597 | .B TCP_DEFER_ACCEPT | |
598 | Allows a listener to be awakened only when data arrives on | |
599 | the socket. Takes an integer value (seconds); this can | |
600 | bound the maximum number of attempts TCP will make to | |
601 | complete the connection. This option should not be used in | |
602 | code intended to be portable. | |
603 | .TP | |
604 | .B TCP_INFO | |
605 | Used to collect information about this socket. The kernel | |
fd1835be | 606 | returns a \fIstruct tcp_info\fP as defined in the file |
fea681da MK |
607 | /usr/include/linux/tcp.h. This option should not be used in |
608 | code intended to be portable. | |
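A query for this structure could be sketched as below. The helper name `get_tcp_info` is hypothetical, and the sketch assumes the glibc header netinet/tcp.h, which also provides a definition of struct tcp_info. IPPROTO_TCP is used as the option level; on Linux it has the same numeric value as SOL_TCP.

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *info with the kernel's view of the socket.
 * Returns 0 on success, -1 on error (with errno set by getsockopt()). */
int get_tcp_info(int fd, struct tcp_info *info)
{
    socklen_t len = sizeof(*info);

    memset(info, 0, sizeof(*info));
    return getsockopt(fd, IPPROTO_TCP, TCP_INFO, info, &len);
}
```

After a successful call, fields such as `tcpi_state` and `tcpi_rtt` can be inspected; which fields are filled in depends on the kernel version.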
609 | .TP | |
610 | .B TCP_KEEPCNT | |
611 | The maximum number of keepalive probes TCP should send | |
612 | before dropping the connection. This option should not be | |
613 | used in code intended to be portable. | |
614 | .TP | |
615 | .B TCP_KEEPIDLE | |
616 | The time (in seconds) the connection needs to remain idle | |
617 | before TCP starts sending keepalive probes, if the socket | |
618 | option SO_KEEPALIVE has been set on this socket. This | |
619 | option should not be used in code intended to be portable. | |
620 | .TP | |
621 | .B TCP_KEEPINTVL | |
622 | The time (in seconds) between individual keepalive probes. | |
623 | This option should not be used in code intended to be | |
624 | portable. | |
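The three keepalive options above are usually set together with SO_KEEPALIVE. The following is a minimal sketch with a hypothetical helper name; IPPROTO_TCP is used as the option level, which on Linux has the same numeric value as SOL_TCP.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalives on one socket and override the
 * system-wide timing for it. Returns 0 on success, -1 on error. */
int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

For instance, `enable_keepalive(fd, 60, 10, 5)` asks for the first probe after 60 idle seconds and up to 5 probes sent 10 seconds apart.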
625 | .TP | |
626 | .B TCP_LINGER2 | |
627 | The lifetime of orphaned FIN_WAIT2 state sockets. This | |
628 | option can be used to override the system wide sysctl | |
629 | .B tcp_fin_timeout | |
630 | on this socket. This is not to be confused with the | |
631 | .BR socket (7) | |
632 | level option | |
633 | .BR SO_LINGER . | |
634 | This option should not be used in code intended to be | |
635 | portable. | |
636 | .TP | |
637 | .B TCP_MAXSEG | |
638 | The maximum segment size for outgoing TCP packets. If this | |
639 | option is set before connection establishment, it also | |
640 | changes the MSS value announced to the other end in the | |
641 | initial packet. Values greater than the (eventual) | |
642 | interface MTU have no effect. TCP will also impose | |
643 | its minimum and maximum bounds over the value provided. | |
644 | .TP | |
645 | .B TCP_NODELAY | |
646 | If set, disable the Nagle algorithm. This means that segments | |
647 | are always sent as soon as possible, even if there is only a | |
648 | small amount of data. When not set, data is buffered until there | |
649 | is a sufficient amount to send out, thereby avoiding the | |
650 | frequent sending of small packets, which results in poor | |
8dd9ef47 | 651 | utilization of the network. |
704a18f0 | 652 | This option is overridden by |
925d4d6a MK |
653 | .BR TCP_CORK ; |
654 | however, setting this option forces an explicit flush of | |
655 | pending output, even if | |
656 | .B TCP_CORK | |
657 | is currently set. | |
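As an illustration, the option can be toggled as sketched below. The helper name set_nodelay() is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: pass enable != 0 to disable the Nagle
 * algorithm on fd, or 0 to restore the default buffering.
 * Returns 0 on success, -1 on error with errno set. */
int set_nodelay(int fd, int enable)
{
    int flag = enable ? 1 : 0;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}
```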
fea681da MK |
658 | .TP |
659 | .B TCP_QUICKACK | |
660 | Enable quickack mode if set or disable quickack | |
661 | mode if cleared. In quickack mode, acks are sent | |
662 | immediately, rather than delayed if needed in accordance | |
663 | with normal TCP operation. This flag is not permanent; | |
664 | it only enables a switch to or from quickack mode. | |
665 | Subsequent operation of the TCP protocol will | |
666 | once again enter/leave quickack mode depending on | |
667 | internal protocol processing and factors such as | |
668 | delayed ack timeouts occurring and data transfer. | |
669 | This option should not be used in code intended to be | |
670 | portable. | |
671 | .TP | |
672 | .B TCP_SYNCNT | |
673 | Set the number of SYN retransmits that TCP should send before | |
674 | aborting the attempt to connect. It cannot exceed 255. | |
675 | This option should not be used in code intended to be | |
676 | portable. | |
677 | .TP | |
678 | .B TCP_WINDOW_CLAMP | |
679 | Bound the size of the advertised window to this value. The | |
680 | kernel imposes a minimum size of SOCK_MIN_RCVBUF/2. | |
681 | This option should not be used in code intended to be | |
682 | portable. | |
683 | .SH IOCTLS | |
fd1835be MK |
684 | The following
685 | .BR ioctl (2) | |
686 | calls return information in | |
687 | .IR value . | |
fea681da MK |
688 | The correct syntax is: |
689 | .PP | |
690 | .RS | |
691 | .nf | |
692 | .BI int " value"; | |
693 | .IB error " = ioctl(" tcp_socket ", " ioctl_type ", &" value ");" | |
694 | .fi | |
695 | .RE | |
fd1835be MK |
696 | .PP |
697 | .I ioctl_type | |
698 | is one of the following: | |
fea681da MK |
699 | .TP |
700 | .BR SIOCINQ | |
fd1835be MK |
701 | Returns the amount of queued unread data in the receive buffer. |
702 | The socket must not be in LISTEN state, otherwise an error (EINVAL) | |
fea681da MK |
703 | is returned. |
704 | .TP | |
705 | .B SIOCATMARK | |
fd1835be MK |
706 | Returns true (i.e., |
707 | .I value | |
708 | is non-zero) if the inbound data stream is at the urgent mark. | |
95d29ab2 MK |
709 | .sp |
710 | If the | |
711 | .BR SO_OOBINLINE | |
712 | socket option is set, and | |
fd1835be | 713 | .B SIOCATMARK |
95d29ab2 | 714 | returns true, then the |
fd1835be | 715 | next read from the socket will return the urgent data. |
95d29ab2 MK |
716 | If the |
717 | .BR SO_OOBINLINE | |
718 | socket option is not set, and | |
719 | .B SIOCATMARK | |
720 | returns true, then the | |
721 | next read from the socket will return the bytes following | |
722 | the urgent data (to actually read the urgent data requires the | |
723 | .B recv(MSG_OOB) | |
724 | flag). | |
725 | .sp | |
726 | Note that a read never reads across the urgent mark. | |
ccca85be MK |
727 | If an application is informed of the presence of urgent data via |
728 | .BR select (2) | |
729 | (using the | |
730 | .I exceptfds | |
731 | argument) or through delivery of a | |
732 | .B SIGURG | |
733 | signal, | |
734 | then it can advance up to the mark using a loop which repeatedly tests | |
735 | .B SIOCATMARK | |
736 | and performs a read (requesting any number of bytes) as long as | |
fd1835be | 737 | .B SIOCATMARK |
ccca85be | 738 | returns false. |
fea681da MK |
739 | .TP |
740 | .B SIOCOUTQ | |
fd1835be MK |
741 | Returns the amount of unsent data in the socket send queue. |
742 | The socket must not be in LISTEN state, otherwise an error (EINVAL) | |
fea681da MK |
743 | is returned. |
744 | .SH "ERROR HANDLING" | |
745 | When a network error occurs, TCP tries to resend the | |
746 | packet. If it doesn't succeed after some time, either | |
747 | .B ETIMEDOUT | |
748 | or the last received error on this connection is reported. | |
749 | .PP | |
750 | Some applications require a quicker error notification. | |
751 | This can be enabled with the | |
752 | .B SOL_IP | |
753 | level | |
754 | .B IP_RECVERR | |
755 | socket option. When this option is enabled, all incoming | |
756 | errors are immediately passed to the user program. Use this | |
757 | option with care \- it makes TCP less tolerant to routing | |
758 | changes and other normal network conditions. | |
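.PP
Enabling the option might look like the sketch below. The helper name enable_recverr() is hypothetical, and IPPROTO_IP is used for the level argument on the assumption that it has the same value as SOL_IP on Linux.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to pass incoming errors on fd
 * to the program immediately.
 * Returns 0 on success, -1 on error with errno set. */
int enable_recverr(int fd)
{
    int on = 1;

    /* IPPROTO_IP is assumed equivalent to the SOL_IP level here. */
    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```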
759 | .SH NOTES | |
760 | When an error that occurred during connection setup is reported by a | |
761 | socket write, | |
762 | .B SIGPIPE | |
763 | is only raised when the | |
764 | .B SO_KEEPALIVE | |
765 | socket option is set. | |
766 | .PP | |
767 | TCP has no real out-of-band data; it has urgent data. In | |
768 | Linux this means that if the other end sends newer out-of-band | |
769 | data, the older urgent data is inserted as normal data into | |
770 | the stream (even when | |
771 | .B SO_OOBINLINE | |
ccca85be | 772 | is not set). This differs from BSD-based stacks. |
fea681da MK |
773 | .PP |
774 | Linux uses the BSD compatible interpretation of the urgent | |
ccca85be | 775 | pointer field by default. This violates RFC\ 1122, but is |
fea681da MK |
776 | required for interoperability with other stacks. It can be |
777 | changed by the | |
778 | .B tcp_stdurg | |
779 | sysctl. | |
780 | .SH ERRORS | |
781 | .TP | |
782 | .B EPIPE | |
783 | The other end closed the socket unexpectedly or a read is | |
784 | executed on a shut down socket. | |
785 | .TP | |
786 | .B ETIMEDOUT | |
787 | The other end didn't acknowledge retransmitted data after | |
788 | some time. | |
789 | .TP | |
790 | .B EAFNOTSUPPORT | |
791 | Passed socket address type in | |
792 | .I sin_family | |
793 | was not | |
794 | .BR AF_INET . | |
795 | .PP | |
796 | Any errors defined for | |
797 | .BR ip (7) | |
798 | or the generic socket layer may also be returned for TCP. | |
799 | .SH BUGS | |
800 | Not all errors are documented. | |
801 | .br | |
802 | IPv6 is not described. | |
803 | .\" Only a single Linux kernel version is described | |
804 | .\" Info for 2.2 was lost. Should be added again, | |
805 | .\" or put into a separate page. | |
806 | .SH VERSIONS | |
fd1835be | 807 | Support for Explicit Congestion Notification, zero-copy |
fea681da MK |
808 | sendfile, reordering support and some SACK extensions |
809 | (DSACK) was introduced in 2.4. | |
810 | Support for forward acknowledgement (FACK), TIME_WAIT recycling, | |
811 | per-connection keepalive socket options and sysctls | |
812 | was introduced in 2.3. | |
813 | ||
814 | The default values and descriptions for the sysctl variables | |
815 | given above are applicable for the 2.4 kernel. | |
816 | .SH AUTHORS | |
817 | This man page was originally written by Andi Kleen. | |
818 | It was updated for 2.4 by Nivedita Singhvi with input from | |
819 | Alexey Kuznetsov's Documentation/networking/ip-sysctls.txt | |
820 | document. | |
821 | .SH "SEE ALSO" | |
822 | .BR accept (2), | |
823 | .BR bind (2), | |
824 | .BR connect (2), | |
825 | .BR getsockopt (2), | |
826 | .BR listen (2), | |
827 | .BR recvmsg (2), | |
828 | .BR sendfile (2), | |
829 | .BR sendmsg (2), | |
830 | .BR socket (2), | |
831 | .BR sysctl (2), | |
832 | .BR ip (7), | |
833 | .BR socket (7) | |
834 | .sp | |
ccca85be | 835 | RFC\ 793 for the TCP specification. |
fea681da | 836 | .br |
ccca85be | 837 | RFC\ 1122 for the TCP requirements and a description |
fea681da MK |
838 | of the Nagle algorithm. |
839 | .br | |
ccca85be | 840 | RFC\ 1323 for TCP timestamp and window scaling options. |
fea681da | 841 | .br |
ccca85be | 842 | RFC\ 1644 for a description of TIME_WAIT assassination |
fea681da MK |
843 | hazards. |
844 | .br | |
ccca85be | 845 | RFC\ 2481 for a description of Explicit Congestion |
fea681da MK |
846 | Notification. |
847 | .br | |
ccca85be | 848 | RFC\ 2581 for TCP congestion control algorithms. |
fea681da | 849 | .br |
ccca85be | 850 | RFC\ 2018 and RFC\ 2883 for SACK and extensions to SACK. |