.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH epoll 7 (date) "Linux man-pages (unreleased)"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.nf
.B #include <sys/epoll.h>
.fi
.SH DESCRIPTION
The
.B epoll
API performs a similar task to
.BR poll (2):
monitoring multiple file descriptors to see if I/O is possible on any of them.
The
.B epoll
API can be used either as an edge-triggered or a level-triggered
interface and scales well to large numbers of watched file descriptors.
.P
The central concept of the
.B epoll
API is the
.B epoll
.IR instance ,
an in-kernel data structure which, from a user-space perspective,
can be considered as a container for two lists:
.IP \[bu] 3
The
.I interest
list (sometimes also called the
.B epoll
set): the set of file descriptors that the process has registered
an interest in monitoring.
.IP \[bu]
The
.I ready
list: the set of file descriptors that are "ready" for I/O.
The ready list is a subset of
(or, more precisely, a set of references to)
the file descriptors in the interest list.
The ready list is dynamically populated
by the kernel as a result of I/O activity on those file descriptors.
.P
The following system calls are provided to
create and manage an
.B epoll
instance:
.IP \[bu] 3
.BR epoll_create (2)
creates a new
.B epoll
instance and returns a file descriptor referring to that instance.
(The more recent
.BR epoll_create1 (2)
extends the functionality of
.BR epoll_create (2).)
.IP \[bu]
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2),
which adds items to the interest list of the
.B epoll
instance.
.IP \[bu]
.BR epoll_wait (2)
waits for I/O events,
blocking the calling thread if no events are currently available.
(This system call can be thought of as fetching items from
the ready list of the
.B epoll
instance.)
.\"
.SS Level-triggered and edge-triggered
The
.B epoll
event distribution interface is able to behave both as edge-triggered
(ET) and as level-triggered (LT).
The difference between the two mechanisms
can be described as follows.
Suppose that
this scenario happens:
.IP (1) 5
The file descriptor that represents the read side of a pipe
.RI ( rfd )
is registered on the
.B epoll
instance.
.IP (2)
A pipe writer writes 2\ kB of data on the write side of the pipe.
.IP (3)
A call to
.BR epoll_wait (2)
is done that will return
.I rfd
as a ready file descriptor.
.IP (4)
The pipe reader reads 1\ kB of data from
.IR rfd .
.IP (5)
A call to
.BR epoll_wait (2)
is done.
.P
If the
.I rfd
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
(edge-triggered)
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer;
meanwhile the remote peer might be expecting a response based on the
data it already sent.
The reason for this is that edge-triggered mode
delivers events only when changes occur on the monitored file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.I rfd
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
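.P
The five steps above can be tried out with a short program.
The following is only a sketch under stated assumptions:
the helper name
.I second_wait_count()
is hypothetical, and both waits use a zero timeout so that
step 5 reports "nothing ready" instead of hanging.
.P
.in +4n
.EX
```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: run steps (1) to (5) on a fresh pipe and
   return what the second epoll_wait() reports (the number of ready
   file descriptors), or -1 on error.  Pass EPOLLET in 'extra' for
   edge-triggered mode, or 0 for level-triggered mode. */
int
second_wait_count(uint32_t extra)
{
    char buf[2048] = {0};
    int pfd[2], epfd, n = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | extra;                     /* step (1) */
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;
    if (write(pfd[1], buf, 2048) != 2048)            /* step (2) */
        goto done;
    if (epoll_wait(epfd, &out, 1, 0) != 1)           /* step (3) */
        goto done;
    if (read(pfd[0], buf, 1024) != 1024)             /* step (4) */
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);                /* step (5) */
done:
    if (epfd != -1)
        close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```
.EE
.in
.P
Called with 0, the helper is expected to report one ready descriptor,
because 1\ kB is still unread;
called with
.BR EPOLLET ,
it is expected to report none, which corresponds to the hang described above.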
.P
An application that employs the
.B EPOLLET
flag should use nonblocking file descriptors to avoid having a blocking
read or write starve a task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an edge-triggered
.RB ( EPOLLET )
interface is as follows:
.IP (1) 5
with nonblocking file descriptors; and
.IP (2)
by waiting for an event only after
.BR read (2)
or
.BR write (2)
return
.BR EAGAIN .
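.P
The two rules above can be sketched for the read side as follows.
The helper names
.I set_nonblocking()
and
.I drain_fd()
are hypothetical, and a real application would process the data
instead of discarding it.
.P
.in +4n
.EX
```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Rule (1): switch 'fd' to nonblocking mode.
   Returns 0 on success, -1 on error. */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Rule (2): read from a nonblocking 'fd' until read() fails with
   EAGAIN.  Returns the number of bytes consumed (also on end of
   file), or -1 on any other error. */
long
drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));

        if (r > 0) {
            total += r;         /* real code would process buf */
        } else if (r == 0) {
            return total;       /* end of file */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;       /* drained: safe to wait again */
        } else if (errno != EINTR) {
            return -1;
        }
    }
}
```
.EE
.in
.P
Only after
.I drain_fd()
has returned is it safe to block in
.BR epoll_wait (2)
again for that file descriptor.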
.P
By contrast, when used as a level-triggered interface
(the default, when
.B EPOLLET
is not specified),
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
.P
Since even with edge-triggered
.BR epoll ,
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
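.P
The disable-and-rearm cycle can be sketched as follows.
The helper name
.I oneshot_demo()
is hypothetical;
it uses a pipe holding one unread byte and zero-timeout waits.
.P
.in +4n
.EX
```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demonstration of EPOLLONESHOT.  Stores the results
   of three zero-timeout epoll_wait() calls in res[]: the first
   wait, a second wait without rearming, and a wait after rearming
   with EPOLL_CTL_MOD.  Returns 0 on success, -1 if setup fails. */
int
oneshot_demo(int res[3])
{
    int pfd[2], epfd, ret = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;
    if (write(pfd[1], "x", 1) != 1)
        goto done;

    res[0] = epoll_wait(epfd, &out, 1, 0);  /* event delivered */
    res[1] = epoll_wait(epfd, &out, 1, 0);  /* fd now disabled */

    /* Rearm: the events mask has to be supplied again. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, pfd[0], &ev) == -1)
        goto done;
    res[2] = epoll_wait(epfd, &out, 1, 0);  /* reported again */
    ret = 0;
done:
    if (epfd != -1)
        close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return ret;
}
```
.EE
.in
.P
The byte is never read in this sketch, so the middle wait shows that
the file descriptor stays silent until it is rearmed,
even though data is still pending.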
.P
If multiple threads
(or processes, if child processes have inherited the
.B epoll
file descriptor across
.BR fork (2))
are blocked in
.BR epoll_wait (2)
waiting on the same epoll file descriptor and a file descriptor
in the interest list that is marked for edge-triggered
.RB ( EPOLLET )
notification becomes ready,
just one of the threads (or processes) is awoken from
.BR epoll_wait (2).
This provides a useful optimization for avoiding "thundering herd" wake-ups
in some scenarios.
.\"
.SS Interaction with autosleep
If the system is in
.B autosleep
mode via
.I /sys/power/autosleep
and an event happens which wakes the device from sleep, the device
driver will keep the device awake only until that event is queued.
To keep the device awake until the event has been processed,
it is necessary to use the
.BR epoll_ctl (2)
.B EPOLLWAKEUP
flag.
.P
When the
.B EPOLLWAKEUP
flag is set in the
.B events
field for a
.IR "struct epoll_event" ,
the system will be kept awake from the moment the event is queued,
through the
.BR epoll_wait (2)
call which returns the event until the subsequent
.BR epoll_wait (2)
call.
If the event should keep the system awake beyond that time,
then a separate
.I wake_lock
should be taken before the second
.BR epoll_wait (2)
call.
.SS /proc interfaces
The following interfaces can be used to limit the amount of
kernel memory consumed by epoll:
.\" Following was added in Linux 2.6.28, but then removed in Linux 2.6.29
.\" .TP
.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
.\" This specifies an upper limit on the number of epoll instances
.\" that can be created per real user ID.
.TP
.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
This specifies a limit on the total number of
file descriptors that a user can register across
all epoll instances on the system.
The limit is per real user ID.
Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
and roughly 160 bytes on a 64-bit kernel.
Currently,
.\" Linux 2.6.29 (in Linux 2.6.28, the default was 1/32 of lowmem)
the default value for
.I max_user_watches
is 1/25 (4%) of the available low memory,
divided by the registration cost in bytes.
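.P
As a rough illustration of the limit described above,
the following sketch reads the current value and recomputes the
default from the stated formula.
Both helper names are hypothetical,
and the estimate only mirrors the arithmetic in the text;
it is not taken from the kernel source.
.P
.in +4n
.EX
```c
#include <stdio.h>

/* Read the current limit; returns -1 if the file cannot be read. */
long long
read_max_user_watches(void)
{
    long long v = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%lld", &v) != 1)
        v = -1;
    fclose(f);
    return v;
}

/* Estimate of the default: 1/25 of low memory (in bytes),
   divided by the per-registration cost (in bytes). */
unsigned long long
estimate_default_watches(unsigned long long lowmem, unsigned int cost)
{
    return lowmem / 25 / cost;
}
```
.EE
.in
.P
For example, with 1\ GiB of low memory and a cost of 160 bytes,
the estimate is 1073741824 / 25 / 160, or about 268435 watches.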
.SS Example for suggested usage
While the usage of
.B epoll
when employed as a level-triggered interface does have the same
semantics as
.BR poll (2),
the edge-triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example,
.I listen_sock
is a
nonblocking socket on which
.BR listen (2)
has been called.
The function
.I do_use_fd()
uses the new ready file descriptor until
.B EAGAIN
is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
.BR EAGAIN ,
record its current state so that at the next call to
.I do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.P
.in +4n
.EX
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd, n;
struct sockaddr_storage addr;
socklen_t addrlen;
\&
/* Code to set up listening socket, \[aq]listen_sock\[aq],
   (socket(), bind(), listen()) omitted. */
\&
epollfd = epoll_create1(0);
if (epollfd == \-1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}
\&
ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}
\&
for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
    if (nfds == \-1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }
\&
    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            addrlen = sizeof(addr);  /* value-result argument */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &addr, &addrlen);
            if (conn_sock == \-1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == \-1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.EE
.in
.P
When used as an edge-triggered interface, for performance reasons, it is
possible to add the file descriptor inside the
.B epoll
interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SS Questions and answers
.IP \[bu] 3
What is the key used to distinguish the file descriptors registered in an
interest list?
.IP
The key is the combination of the file descriptor number and
the open file description
(also known as an "open file handle",
the kernel's internal representation of an open file).
.IP \[bu]
What happens if you register the same file descriptor on an
.B epoll
instance twice?
.IP
You will probably get
.BR EEXIST .
However, it is possible to add a duplicate
.RB ( dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD )
file descriptor to the same
.B epoll
instance.
.\" But a file descriptor duplicated by fork(2) can't be added to the
.\" set, because the [file *, fd] pair is already in the epoll set.
.\" That is a somewhat ugly inconsistency.  On the one hand, a child process
.\" cannot add the duplicate file descriptor to the epoll set.  (In every
.\" other case that I can think of, file descriptors duplicated by fork have
.\" similar semantics to file descriptors duplicated by dup() and friends.)  On
.\" the other hand, the very fact that the child has a duplicate of the
.\" file descriptor means that even if the parent closes its file descriptor,
.\" then epoll_wait() in the parent will continue to receive notifications for
.\" that file descriptor because of the duplicated file descriptor in the child.
.\"
.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
.\" "epoll design problems with common fork/exec patterns"
.\"
.\" mtk, Feb 2008
This can be a useful technique for filtering events,
if the duplicate file descriptors are registered with different
.I events
masks.
.IP \[bu]
Can two
.B epoll
instances wait for the same file descriptor?
If so, are events reported to both
.B epoll
file descriptors?
.IP
Yes, and events would be reported to both.
However, careful programming may be needed to do this correctly.
.IP \[bu]
Is the
.B epoll
file descriptor itself poll/epoll/selectable?
.IP
Yes.
If an
.B epoll
file descriptor has events waiting, then it will
be reported as readable.
.IP \[bu]
What happens if one attempts to put an
.B epoll
file descriptor into its own file descriptor set?
.IP
The
.BR epoll_ctl (2)
call fails
.RB ( EINVAL ).
However, you can add an
.B epoll
file descriptor inside another
.B epoll
file descriptor set.
.IP \[bu]
Can I send an
.B epoll
file descriptor over a UNIX domain socket to another process?
.IP
Yes, but it does not make sense to do this, since the receiving process
would not have copies of the file descriptors in the interest list.
.IP \[bu]
Will closing a file descriptor cause it to be removed from all
.B epoll
interest lists?
.IP
Yes, but be aware of the following point.
A file descriptor is a reference to an open file description (see
.BR open (2)).
Whenever a file descriptor is duplicated via
.BR dup (2),
.BR dup2 (2),
.BR fcntl (2)
.BR F_DUPFD ,
or
.BR fork (2),
a new file descriptor referring to the same open file description is
created.
An open file description continues to exist until all
file descriptors referring to it have been closed.
.IP
A file descriptor is removed from an
interest list only after all the file descriptors referring to the underlying
open file description have been closed.
This means that even after a file descriptor that is part of an
interest list has been closed,
events may be reported for that file descriptor if other file
descriptors referring to the same underlying file description remain open.
To prevent this happening,
the file descriptor must be explicitly removed from the interest list (using
.BR epoll_ctl (2)
.BR EPOLL_CTL_DEL )
before it is duplicated.
Alternatively,
the application must ensure that all file descriptors are closed
(which may be difficult if file descriptors were duplicated
behind the scenes by library functions that used
.BR dup (2)
or
.BR fork (2)).
.IP \[bu]
If more than one event occurs between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.IP
They will be combined.
.IP \[bu]
Does an operation on a file descriptor affect the
already collected but not yet reported events?
.IP
You can do two operations on an existing file descriptor.
Remove would be meaningless for
this case.
Modify will reread available I/O.
.IP \[bu]
Do I need to continuously read/write a file descriptor
until
.B EAGAIN
when using the
.B EPOLLET
flag (edge-triggered behavior)?
.IP
Receiving an event from
.BR epoll_wait (2)
should suggest to you that the
file descriptor is ready for the requested I/O operation.
You must consider it ready until the next (nonblocking)
read/write yields
.BR EAGAIN .
When and how you will use the file descriptor is entirely up to you.
.IP
For packet/token-oriented files (e.g., datagram socket,
terminal in canonical mode),
the only way to detect the end of the read/write I/O space
is to continue to read/write until
.BR EAGAIN .
.IP
For stream-oriented files (e.g., pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be detected by
checking the amount of data read from / written to the target file
descriptor.
For example, if you call
.BR read (2)
by asking to read a certain amount of data and
.BR read (2)
returns a lower number of bytes, you
can be sure of having exhausted the read I/O space for the file
descriptor.
The same is true when writing using
.BR write (2).
(Avoid this latter technique if you cannot guarantee that
the monitored file descriptor always refers to a stream-oriented file.)
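.P
Two of the answers above, on registering a descriptor twice and on
closing a registered descriptor, can be checked with a short sketch.
The helper names
.I duplicate_registration_check()
and
.I events_after_close()
are hypothetical, and a pipe stands in for any monitored file.
.P
.in +4n
.EX
```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Adding the same descriptor twice should fail with EEXIST, while
   adding a dup(2) of it should succeed.  Returns 0 if both hold,
   -1 otherwise. */
int
duplicate_registration_check(void)
{
    int pfd[2], epfd, dupfd = -1, ret = -1;
    struct epoll_event ev;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;
    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) != -1
        || errno != EEXIST)
        goto done;
    dupfd = dup(pfd[0]);
    if (dupfd == -1)
        goto done;
    ev.data.fd = dupfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, dupfd, &ev) == 0)
        ret = 0;
done:
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return ret;
}

/* Register the read end of a pipe, duplicate it, close the
   registered descriptor, then write to the pipe.  Returns the
   result of a zero-timeout epoll_wait(), or -1 on setup error. */
int
events_after_close(void)
{
    int pfd[2], epfd, dupfd = -1, n = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;
    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;
    dupfd = dup(pfd[0]);    /* keeps the open file description */
    if (dupfd == -1)
        goto done;
    close(pfd[0]);          /* does not remove the registration */
    pfd[0] = -1;
    if (write(pfd[1], "x", 1) != 1)
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);
done:
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    if (pfd[0] != -1)
        close(pfd[0]);
    close(pfd[1]);
    return n;
}
```
.EE
.in
.P
In the second helper, the reported event carries the number of the
descriptor that was already closed,
which is why explicit removal with
.B EPOLL_CTL_DEL
is the safer habit.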
.SS Possible pitfalls and ways to avoid them
.IP \[bu] 3
.B Starvation (edge-triggered)
.IP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed causing starvation.
(This problem is not specific to
.BR epoll .)
.IP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for file descriptors that are already ready.
.IP \[bu]
.B If using an event cache...
.IP
If you use an event cache or store all the file descriptors returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
its closure dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the file descriptor for event #13, then your
event cache might still say there are events waiting for that
file descriptor causing confusion.
.IP
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete file descriptor 13 and
.BR close (2),
then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for file descriptor 13 in your batch processing,
you will discover the file descriptor had been
previously removed and there will be no confusion.
.SH VERSIONS
Some other systems provide similar mechanisms;
for example,
FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH STANDARDS
Linux.
.SH HISTORY
Linux 2.5.44.
.\" Its interface should be finalized in Linux 2.5.66.
glibc 2.3.2.
.SH NOTES
The set of file descriptors that is being monitored via
an epoll file descriptor can be viewed via the entry for
the epoll file descriptor in the process's
.IR /proc/ pid /fdinfo
directory.
See
.BR proc (5)
for further details.
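.P
As a sketch of how that entry might be inspected from within a program,
the following code counts the lines that begin with "tfd:",
on the assumption (based on the format described in
.BR proc (5))
that each such line describes one entry in the interest list.
The helper names
.I count_interest_entries()
and
.I fdinfo_demo()
are hypothetical.
.P
.in +4n
.EX
```c
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Count the "tfd:" lines in /proc/self/fdinfo/<epfd>.
   Returns -1 if the file cannot be opened. */
int
count_interest_entries(int epfd)
{
    char path[64], line[256];
    int count = 0;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", epfd);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
        if (strncmp(line, "tfd:", 4) == 0)
            count++;
    fclose(f);
    return count;
}

/* Register the read end of a pipe and report how many entries
   the fdinfo file shows; returns -1 on any error. */
int
fdinfo_demo(void)
{
    int pfd[2], epfd, n = -1;
    struct epoll_event ev;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd != -1) {
        ev.events = EPOLLIN;
        ev.data.fd = pfd[0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == 0)
            n = count_interest_entries(epfd);
        close(epfd);
    }
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```
.EE
.in
.P
This depends on
.I /proc
being mounted;
where it is not, both helpers return \-1.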
.P
The
.BR kcmp (2)
.B KCMP_EPOLL_TFD
operation can be used to test whether a file descriptor
is present in an epoll instance.
.SH SEE ALSO
.BR epoll_create (2),
.BR epoll_create1 (2),
.BR epoll_ctl (2),
.BR epoll_wait (2),
.BR poll (2),
.BR select (2)