git.ipfire.org Git - thirdparty/man-pages.git/blame - man7/epoll.7
Commit Line Data
fea681da
MK
1.\" Copyright (C) 2003 Davide Libenzi
2.\"
f0008367 3.\" %%%LICENSE_START(GPLv2+_SW_3_PARA)
fea681da
MK
4.\" This program is free software; you can redistribute it and/or modify
5.\" it under the terms of the GNU General Public License as published by
6.\" the Free Software Foundation; either version 2 of the License, or
7.\" (at your option) any later version.
8.\"
9.\" This program is distributed in the hope that it will be useful,
10.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
11.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12.\" GNU General Public License for more details.
13.\"
68fa4398
MK
14.\" You should have received a copy of the GNU General Public
15.\" License along with this manual; if not, see
16.\" <http://www.gnu.org/licenses/>.
8ff7380d 17.\" %%%LICENSE_END
fea681da
MK
18.\"
19.\" Davide Libenzi <davidel@xmailserver.org>
20.\"
b8efb414 21.TH EPOLL 7 2016-10-08 "Linux" "Linux Programmer's Manual"
fea681da
MK
22.SH NAME
23epoll \- I/O event notification facility
24.SH SYNOPSIS
25.B #include <sys/epoll.h>
26.SH DESCRIPTION
2b348e56
MK
27The
28.B epoll
29API performs a similar task to
30.BR poll (2):
31monitoring multiple file descriptors to see if I/O is possible on any of them.
32The
fea681da 33.B epoll
2b348e56 34API can be used either as an edge-triggered or a level-triggered
fc15f317 35interface and scales well to large numbers of watched file descriptors.
9d0f3fcb 36The following system calls are provided to
7547121f 37create and manage an
fea681da 38.B epoll
7547121f
MK
39instance:
40.IP * 3
2b348e56
MK
41.BR epoll_create (2)
42creates an
fea681da 43.B epoll
2b348e56 44instance and returns a file descriptor referring to that instance.
9d0f3fcb
MK
45(The more recent
46.BR epoll_create1 (2)
47extends the functionality of
48.BR epoll_create (2).)
7547121f
MK
49.IP *
50Interest in particular file descriptors is then registered via
fea681da 51.BR epoll_ctl (2).
540cc87f 52The set of file descriptors currently registered on an
7547121f
MK
53.B epoll
54instance is sometimes called an
55.I epoll
56set.
57.IP *
2b348e56
MK
58.BR epoll_wait (2)
59waits for I/O events,
60blocking the calling thread if no events are currently available.
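As a minimal sketch of how the three calls fit together, the helper below creates an instance, registers a single descriptor for EPOLLIN, and waits once. The name wait_readable_once() and its timeout_ms parameter are hypothetical and not part of the epoll API.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of ready events (0 or 1),
   or -1 on error. */
int wait_readable_once(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    epfd = epoll_create1(0);            /* create the epoll instance */
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN;                /* interest: fd becomes readable */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    /* Blocks for up to timeout_ms milliseconds (-1 waits forever). */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```

A real application would keep the epoll file descriptor open and call epoll_wait(2) in a loop rather than creating an instance per wait.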
c634028a 61.SS Level-triggered and edge-triggered
fea681da
MK
62The
63.B epoll
fc15f317 64event distribution interface is able to behave both as edge-triggered
7547121f 65(ET) and as level-triggered (LT).
69eb01fd
MK
66The difference between the two mechanisms
67can be described as follows.
c13182ef 68Suppose that
7025a2fe 69this scenario happens:
69eb01fd 70.IP 1. 3
fc15f317
MK
71The file descriptor that represents the read side of a pipe
72.RI ( rfd )
7547121f 73is registered on the
fea681da 74.B epoll
7547121f 75instance.
69eb01fd
MK
76.IP 2.
77A pipe writer writes 2 kB of data on the write side of the pipe.
78.IP 3.
fea681da
MK
79A call to
80.BR epoll_wait (2)
81is done that will return
fc15f317
MK
82.I rfd
83as a ready file descriptor.
69eb01fd
MK
84.IP 4.
85The pipe reader reads 1 kB of data from
fc15f317 86.IR rfd .
69eb01fd 87.IP 5.
fea681da
MK
88A call to
89.BR epoll_wait (2)
90is done.
91.PP
fea681da 92If the
fc15f317 93.I rfd
fea681da
MK
94file descriptor has been added to the
95.B epoll
96interface using the
97.B EPOLLET
f2e101d0 98(edge-triggered)
fea681da
MK
99flag, the call to
100.BR epoll_wait (2)
988db661 101done in step
fea681da 102.B 5
fc15f317
MK
103will probably hang despite the available data still present in the file
104input buffer;
105meanwhile the remote peer might be expecting a response based on the
c13182ef 106data it already sent.
33a0ccb2
MK
107The reason for this is that edge-triggered mode
108delivers events only when changes occur on the monitored file descriptor.
fea681da
MK
109So, in step
110.B 5
111the caller might end up waiting for some data that is already present inside
c13182ef
MK
112the input buffer.
113In the above example, an event on
fc15f317 114.I rfd
fea681da 115will be generated because of the write done in
0daa9e92 116.B 2
66eca51e 117and the event is consumed in
fea681da
MK
118.BR 3 .
119Since the read operation done in
120.B 4
121does not consume the whole buffer data, the call to
122.BR epoll_wait (2)
123done in step
124.B 5
fc15f317
MK
125might block indefinitely.
126
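The five steps above can be replayed on a pipe. In this sketch the function name step5_event_count() is hypothetical, and the final epoll_wait(2) uses a zero timeout so that the demonstration reports "no event" instead of hanging. On a typical Linux kernel the edge-triggered run would be expected to return 0 and the level-triggered run 1.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Runs steps 1-5 on a fresh pipe and returns what the step-5
   epoll_wait() reports, or -1 if any earlier step fails. */
int step5_event_count(int edge_triggered)
{
    char buf[2048];
    struct epoll_event ev, out;
    int p[2], epfd, n = -1;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)    /* step 1 */
        goto done;

    memset(buf, 'x', sizeof(buf));
    if (write(p[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        goto done;                                          /* step 2 */
    if (epoll_wait(epfd, &out, 1, 0) != 1)                  /* step 3 */
        goto done;
    if (read(p[0], buf, 1024) != 1024)                      /* step 4 */
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);                       /* step 5 */

done:
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return n;
}
```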
127An application that employs the
fea681da 128.B EPOLLET
ff40dbb3 129flag should use nonblocking file descriptors to avoid having a blocking
fc15f317 130read or write starve a task that is handling multiple file descriptors.
fea681da
MK
131The suggested way to use
132.B epoll
fc15f317 133as an edge-triggered
66eca51e 134.RB ( EPOLLET )
fc15f317 135interface is as follows:
fea681da 136.RS
3bc917f6 137.TP 4
fea681da 138.B i
ff40dbb3 139with nonblocking file descriptors; and
c13182ef 140.TP
fea681da 141.B ii
69eb01fd 142by waiting for an event only after
fea681da 143.BR read (2)
c13182ef 144or
fea681da 145.BR write (2)
097585ed
MK
146return
147.BR EAGAIN .
fea681da
MK
148.RE
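A minimal sketch of the two points above, assuming a descriptor whose data may simply be discarded. The helper names set_nonblocking() and drain_fd() are hypothetical; a real handler would process the bytes instead of dropping them.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* (i) Put the descriptor into nonblocking mode; returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* (ii) Read until EAGAIN (or end of file). Returns the total number
   of bytes consumed, or -1 on any other error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));

        if (r > 0) {
            total += r;                 /* more data may be pending */
        } else if (r == 0) {
            return total;               /* end of file */
        } else if (errno == EINTR) {
            continue;                   /* interrupted: retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;               /* safe to wait for events again */
        } else {
            return -1;
        }
    }
}
```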
149.PP
f2e101d0
MK
150By contrast, when used as a level-triggered interface
151(the default, when
152.B EPOLLET
153is not specified),
fea681da 154.B epoll
512a1783 155is simply a faster
fea681da
MK
156.BR poll (2),
157and can be used wherever the latter is used since it shares the
c13182ef 158same semantics.
fc15f317 159
7547121f
MK
160Since even with edge-triggered
161.BR epoll ,
fc15f317 162multiple events can be generated upon receipt of multiple chunks of data,
fea681da
MK
163the caller has the option to specify the
164.B EPOLLONESHOT
165flag, to tell
166.B epoll
3f1c1b0a 167to disable the associated file descriptor after the receipt of an event with
fea681da
MK
168.BR epoll_wait (2).
169When the
170.B EPOLLONESHOT
c13182ef 171flag is specified,
fc15f317 172it is the caller's responsibility to rearm the file descriptor using
fea681da
MK
173.BR epoll_ctl (2)
174with
175.BR EPOLL_CTL_MOD .
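The disable-and-rearm cycle can be sketched on a pipe as follows. The names ctl_oneshot() and oneshot_demo() are hypothetical; the demo records the result of epoll_wait(2) after the first event, after further data arrives while the descriptor is disabled, and after rearming.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add (EPOLL_CTL_ADD) or rearm (EPOLL_CTL_MOD) fd with EPOLLONESHOT. */
int ctl_oneshot(int epfd, int op, int fd, uint32_t events)
{
    struct epoll_event ev;

    ev.events = events | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Fills counts[0..2] with epoll_wait() results; returns 0 or -1. */
int oneshot_demo(int counts[3])
{
    struct epoll_event out;
    int p[2], epfd, rc = -1;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1 ||
        ctl_oneshot(epfd, EPOLL_CTL_ADD, p[0], EPOLLIN) == -1 ||
        write(p[1], "a", 1) != 1)
        goto done;

    counts[0] = epoll_wait(epfd, &out, 1, 0);   /* event delivered once */
    if (write(p[1], "b", 1) != 1)
        goto done;
    counts[1] = epoll_wait(epfd, &out, 1, 0);   /* fd is now disabled */
    if (ctl_oneshot(epfd, EPOLL_CTL_MOD, p[0], EPOLLIN) == -1)
        goto done;
    counts[2] = epoll_wait(epfd, &out, 1, 0);   /* rearmed by the caller */
    rc = 0;

done:
    if (epfd != -1)
        close(epfd);
    close(p[0]);
    close(p[1]);
    return rc;
}
```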
6db5acce
N
176.SS Interaction with autosleep
177If the system is in
178.B autosleep
179mode via
180.I /sys/power/autosleep
181and an event happens which wakes the device from sleep, the device
8e798cce 182driver will keep the device awake only until that event is queued.
d3695ae2
MK
183To keep the device awake until the event has been processed,
184it is necessary to use the
bf7bc8b8 185.BR epoll_ctl (2)
6db5acce
N
186.B EPOLLWAKEUP
187flag.
188
d3695ae2
MK
189When the
190.B EPOLLWAKEUP
191flag is set in the
6db5acce
N
192.B events
193field for a
d3695ae2
MK
194.IR "struct epoll_event" ,
195the system will be kept awake from the moment the event is queued,
6db5acce 196through the
d3695ae2 197.BR epoll_wait (2)
6db5acce 198call which returns the event until the subsequent
d3695ae2
MK
199.BR epoll_wait (2)
200call.
201If the event should keep the system awake beyond that time,
202then a separate
6db5acce
N
203.I wake_lock
204should be taken before the second
d3695ae2 205.BR epoll_wait (2)
6db5acce 206call.
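Requesting the flag is a matter of setting it in the events field at registration time, as in this sketch. The helper name add_with_wakeup() is hypothetical. According to epoll_ctl(2), the flag is silently ignored if the caller lacks the CAP_BLOCK_SUSPEND capability, so a successful return does not by itself show that the system will be kept awake.

```c
#include <sys/epoll.h>

/* Register fd for input and request EPOLLWAKEUP handling. */
int add_with_wakeup(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLWAKEUP;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```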
5ee0575d
MK
207.SS /proc interfaces
208The following interfaces can be used to limit the amount of
209kernel memory consumed by epoll:
597fa43c 210.\" Following was added in 2.6.28, but then removed in 2.6.29
f09cbcf3 211.\" .TP
597fa43c
MK
212.\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)"
213.\" This specifies an upper limit on the number of epoll instances
214.\" that can be created per real user ID.
5ee0575d
MK
215.TP
216.IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)"
217This specifies a limit on the total number of
218file descriptors that a user can register across
219all epoll instances on the system.
220The limit is per real user ID.
221Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel,
222and roughly 160 bytes on a 64-bit kernel.
223Currently,
597fa43c 224.\" 2.6.29 (in 2.6.28, the default was 1/32 of lowmem)
5ee0575d
MK
225the default value for
226.I max_user_watches
597fa43c 227is 1/25 (4%) of the available low memory,
5ee0575d 228divided by the registration cost in bytes.
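As a rough illustration, the current limit can be read from the /proc file, and the default can be estimated from the rule stated above. Both function names are hypothetical, and the estimate uses plain integer arithmetic on a byte count, so the kernel's own rounding may differ slightly.

```c
#include <stdio.h>

/* Returns the current limit, or -1 if the file cannot be read. */
long read_max_user_watches(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Estimate: 1/25 of low memory divided by the per-watch cost
   (roughly 90 bytes on 32-bit kernels, 160 bytes on 64-bit kernels). */
unsigned long long estimate_default_watches(unsigned long long lowmem_bytes,
                                            unsigned int cost_bytes)
{
    return lowmem_bytes / 25 / cost_bytes;
}
```

For example, with 1 GiB of low memory and a cost of 160 bytes, the estimate is 268435 watches.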
c634028a 229.SS Example for suggested usage
fea681da
MK
230While the usage of
231.B epoll
fc15f317
MK
232when employed as a level-triggered interface does have the same
233semantics as
fea681da 234.BR poll (2),
fc15f317 235the edge-triggered usage requires more clarification to avoid stalls
c13182ef
MK
236in the application event loop.
237In this example, the listening socket is a
ff40dbb3 238nonblocking socket on which
fea681da 239.BR listen (2)
c13182ef 240has been called.
54d02f32
MK
241The function
242.I do_use_fd()
243uses the new ready file descriptor until
097585ed
MK
244.B EAGAIN
245is returned by either
fea681da
MK
246.BR read (2)
247or
248.BR write (2).
fc15f317 249An event-driven state machine application should, after having received
097585ed 250.BR EAGAIN ,
54d02f32
MK
251record its current state so that at the next call to
252.I do_use_fd()
fea681da
MK
253it will continue to
254.BR read (2)
255or
256.BR write (2)
c13182ef 257from where it stopped before.
fea681da 258
3bc917f6 259.in +4n
fea681da 260.nf
#define MAX_EVENTS 10
struct epoll_event ev, events[MAX_EVENTS];
int listen_sock, conn_sock, nfds, epollfd, n;
struct sockaddr_storage addr;
socklen_t addrlen;

/* Code to set up listening socket, \(aqlisten_sock\(aq,
   (socket(), bind(), listen()) omitted */

epollfd = epoll_create1(0);
if (epollfd == \-1) {
    perror("epoll_create1");
    exit(EXIT_FAILURE);
}

/* The listening socket is monitored level-triggered */
ev.events = EPOLLIN;
ev.data.fd = listen_sock;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) {
    perror("epoll_ctl: listen_sock");
    exit(EXIT_FAILURE);
}

for (;;) {
    nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1);
    if (nfds == \-1) {
        perror("epoll_wait");
        exit(EXIT_FAILURE);
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listen_sock) {
            addrlen = sizeof(addr);  /* Reset before each accept() */
            conn_sock = accept(listen_sock,
                               (struct sockaddr *) &addr, &addrlen);
            if (conn_sock == \-1) {
                perror("accept");
                exit(EXIT_FAILURE);
            }
            setnonblocking(conn_sock);
            /* Connections are monitored edge-triggered */
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = conn_sock;
            if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock,
                          &ev) == \-1) {
                perror("epoll_ctl: conn_sock");
                exit(EXIT_FAILURE);
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
309.fi
3bc917f6 310.in
fea681da 311
fc15f317 312When used as an edge-triggered interface, for performance reasons, it is
3bc917f6
MK
313possible to add the file descriptor inside the
314.B epoll
315interface
fc15f317 316.RB ( EPOLL_CTL_ADD )
69eb01fd 317once by specifying
fc15f317 318.RB ( EPOLLIN | EPOLLOUT ).
c13182ef 319This allows you to avoid
fea681da
MK
320continuously switching between
321.B EPOLLIN
322and
323.B EPOLLOUT
324by calling
325.BR epoll_ctl (2)
326with
327.BR EPOLL_CTL_MOD .
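A sketch of this single registration, using a UNIX domain socket pair for the demonstration. The names add_rw_edge() and rw_edge_demo() are hypothetical. The demo records the event mask reported for an idle socket and the mask reported after the peer has sent one byte.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Register fd once for both directions, edge-triggered. */
int add_rw_edge(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* masks[0]: mask for the idle socket; masks[1]: mask after the peer
   has written one byte. Returns 0 on success, -1 on failure. */
int rw_edge_demo(uint32_t masks[2])
{
    struct epoll_event out;
    int sv[2], epfd, rc = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1 || add_rw_edge(epfd, sv[0]) == -1)
        goto done;

    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto done;
    masks[0] = out.events;              /* writable, nothing to read */

    if (write(sv[1], "x", 1) != 1)
        goto done;
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto done;
    masks[1] = out.events;              /* readable and still writable */
    rc = 0;

done:
    if (epfd != -1)
        close(epfd);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

No EPOLL_CTL_MOD call is needed between the two waits; the application decides from the returned mask which direction to service.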
c634028a 328.SS Questions and answers
28afd4f4 329.TP 4
7fb5cf0f 330.B Q0
7547121f 331What is the key used to distinguish the file descriptors registered in an
3bc917f6
MK
332.B epoll
333set?
7fb5cf0f
MK
334.TP
335.B A0
336The key is the combination of the file descriptor number and
337the open file description
d377b54d 338(also known as an "open file handle",
7fb5cf0f
MK
339the kernel's internal representation of an open file).
340.TP
c13182ef 341.B Q1
7547121f 342What happens if you register the same file descriptor on an
3bc917f6 343.B epoll
7547121f 344instance twice?
fea681da 345.TP
c13182ef 346.B A1
097585ed
MK
347You will probably get
348.BR EEXIST .
2b229334
MK
349However, it is possible to add a duplicate
350.RB ( dup (2),
351.BR dup2 (2),
352.BR fcntl (2)
7fb5cf0f 353.BR F_DUPFD )
d9cb0d7d 354file descriptor to the same
2b229334 355.B epoll
7547121f 356instance.
d9cb0d7d 357.\" But a file descriptor duplicated by fork(2) can't be added to the
d377b54d
MK
358.\" set, because the [file *, fd] pair is already in the epoll set.
359.\" That is a somewhat ugly inconsistency. On the one hand, a child process
7fb5cf0f 360.\" cannot add the duplicate file descriptor to the epoll set. (In every
d9cb0d7d
MK
361.\" other case that I can think of, file descriptors duplicated by fork have
362.\" similar semantics to file descriptors duplicated by dup() and friends.) On
7fb5cf0f 363.\" the other hand, the very fact that the child has a duplicate of the
d9cb0d7d
MK
364.\" file descriptor means that even if the parent closes its file descriptor,
365.\" then epoll_wait() in the parent will continue to receive notifications for
366.\" that file descriptor because of the duplicated file descriptor in the child.
7fb5cf0f 367.\"
d377b54d
MK
368.\" See http://thread.gmane.org/gmane.linux.kernel/596462/
369.\" "epoll design problems with common fork/exec patterns"
31981fa1 370.\"
7fb5cf0f 371.\" mtk, Feb 2008
2b229334
MK
372This can be a useful technique for filtering events,
373if the duplicate file descriptors are registered with different
374.I events
375masks.
fea681da 376.TP
c13182ef 377.B Q2
fea681da
MK
378Can two
379.B epoll
7547121f 380instances wait for the same file descriptor?
1c44bd5b 381If so, are events reported to both
fea681da 382.B epoll
fc15f317 383file descriptors?
fea681da
MK
384.TP
385.B A2
fc15f317 386Yes, and events would be reported to both.
882bbb69 387However, careful programming may be needed to do this correctly.
fea681da
MK
388.TP
389.B Q3
390Is the
391.B epoll
fc15f317 392file descriptor itself poll/epoll/selectable?
fea681da
MK
393.TP
394.B A3
395Yes.
cc65f7d8
MK
396If an
397.B epoll
1c4070c7 398file descriptor has events waiting, then it will
cc65f7d8 399be reported as readable.
fea681da 400.TP
c13182ef 401.B Q4
7547121f 402What happens if one attempts to put an
fea681da 403.B epoll
7547121f 404file descriptor into its own file descriptor set?
fea681da
MK
405.TP
406.B A4
4fecd703
MK
407The
408.BR epoll_ctl (2)
409call will fail
410.RB ( EINVAL ).
c13182ef 411However, you can add an
fea681da 412.B epoll
3bc917f6
MK
413file descriptor inside another
414.B epoll
415file descriptor set.
fea681da
MK
416.TP
417.B Q5
54d02f32 418Can I send an
fea681da 419.B epoll
008f1ecc 420file descriptor over a UNIX domain socket to another process?
fea681da
MK
421.TP
422.B A5
54d02f32
MK
423Yes, but it does not make sense to do this, since the receiving process
424would not have copies of the file descriptors in the
425.B epoll
426set.
fea681da
MK
427.TP
428.B Q6
fc15f317 429Will closing a file descriptor cause it to be removed from all
fea681da
MK
430.B epoll
431sets automatically?
432.TP
433.B A6
a4a120c7
MK
434Yes, but be aware of the following point.
435A file descriptor is a reference to an open file description (see
436.BR open (2)).
d9cb0d7d 437Whenever a file descriptor is duplicated via
a4a120c7
MK
438.BR dup (2),
439.BR dup2 (2),
440.BR fcntl (2)
441.BR F_DUPFD ,
442or
443.BR fork (2),
444a new file descriptor referring to the same open file description is
445created.
446An open file description continues to exist until all
447file descriptors referring to it have been closed.
d377b54d 448A file descriptor is removed from an
a4a120c7
MK
449.B epoll
450set only after all the file descriptors referring to the underlying
31981fa1 451open file description have been closed
d9cb0d7d 452(or before if the file descriptor is explicitly removed using
0b80cf56 453.BR epoll_ctl (2)
d377b54d 454.BR EPOLL_CTL_DEL ).
a4a120c7
MK
455This means that even after a file descriptor that is part of an
456.B epoll
457set has been closed,
458events may be reported for that file descriptor if other file
459descriptors referring to the same underlying file description remain open.
fea681da 460.TP
c13182ef 461.B Q7
fc15f317 462If more than one event occurs between
fea681da
MK
463.BR epoll_wait (2)
464calls, are they combined or reported separately?
465.TP
466.B A7
467They will be combined.
468.TP
469.B Q8
988db661 470Does an operation on a file descriptor affect the
fc15f317 471already collected but not yet reported events?
fea681da
MK
472.TP
473.B A8
fc15f317 474Two operations can be performed on an already registered file descriptor.
c13182ef
MK
475Removing it
476is meaningless in this case.
3b777aff 477Modifying it causes the available I/O to be reread.
fea681da
MK
478.TP
479.B Q9
fc15f317 480Do I need to continuously read/write a file descriptor
097585ed
MK
481until
482.B EAGAIN
483when using the
fea681da 484.B EPOLLET
fc15f317 485flag (edge-triggered behavior)?
fea681da
MK
486.TP
487.B A9
c13182ef 488Receiving an event from
fea681da 489.BR epoll_wait (2)
f11af7da 490indicates that the
160c5be1 491file descriptor is ready for the requested I/O operation.
ff40dbb3 492You must consider it ready until the next (nonblocking)
cb1de8d7 493read/write yields
097585ed 494.BR EAGAIN .
f11af7da
MK
495When and how you will use the file descriptor is entirely up to you.
496.sp
cb1de8d7
MK
497For packet/token-oriented files (e.g., datagram socket,
498terminal in canonical mode),
146c1764 499the only way to detect the end of the read/write I/O space
cb1de8d7
MK
500is to continue to read/write until
501.BR EAGAIN .
502.sp
503For stream-oriented files (e.g., pipe, FIFO, stream socket), the
f11af7da
MK
504condition that the read/write I/O space is exhausted can also be detected by
505checking the amount of data read from / written to the target file
506descriptor.
c13182ef 507For example, if you call
fea681da 508.BR read (2)
160c5be1 509by asking to read a certain amount of data and
fea681da 510.BR read (2)
f11af7da
MK
511returns fewer bytes, you
512can be sure of having exhausted the read I/O space for the file
513descriptor.
160c5be1 514The same is true when writing using
fc15f317 515.BR write (2).
cb1de8d7
MK
516(Avoid this latter technique if you cannot guarantee that
517the monitored file descriptor always refers to a stream-oriented file.)
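Some of the answers above (A1, A4 and A6) can be checked with a short experiment. This is only a sketch: the helper names try_add() and events_after_close() are hypothetical, and the demonstration uses a pipe with EPOLLIN interest.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 on success, or the errno value set by epoll_ctl(). */
int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        return errno;
    return 0;
}

/* A6: close a registered descriptor while a duplicate stays open,
   then make the pipe readable. Returns the epoll_wait() result. */
int events_after_close(void)
{
    struct epoll_event out;
    int p[2], epfd, dupfd, n = -1;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    dupfd = dup(p[0]);
    if (epfd != -1 && dupfd != -1 && try_add(epfd, p[0]) == 0) {
        close(p[0]);            /* description stays open via dupfd */
        p[0] = -1;
        if (write(p[1], "x", 1) == 1)
            n = epoll_wait(epfd, &out, 1, 0);
    }
    if (p[0] != -1)
        close(p[0]);
    if (dupfd != -1)
        close(dupfd);
    if (epfd != -1)
        close(epfd);
    close(p[1]);
    return n;
}
```

Adding the same descriptor twice yields EEXIST, a dup(2) copy can be added, adding an epoll descriptor to itself yields EINVAL, and an event is still reported after the registered descriptor has been closed while its duplicate remains open.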
c634028a 518.SS Possible pitfalls and ways to avoid them
fea681da 519.TP
fc15f317 520.B o Starvation (edge-triggered)
fea681da 521.PP
c13182ef
MK
522If there is a large amount of I/O space,
523it is possible that, while one file descriptor is being drained,
524the other files will not get processed, causing starvation.
fc15f317
MK
525(This problem is not specific to
526.BR epoll .)
fea681da 527.PP
c13182ef
MK
528The solution is to maintain a ready list
529and mark the file descriptor as ready
fea681da
MK
530in its associated data structure, thereby allowing the application to
531remember which files need to be processed but still round robin amongst
c13182ef
MK
532all the ready files.
533This also supports ignoring subsequent events you
fc15f317 534receive for file descriptors that are already ready.
fea681da 535.TP
c13182ef 536.B o If using an event cache...
fea681da 537.PP
fc15f317 538If you use an event cache or store all the file descriptors returned from
fea681da 539.BR epoll_wait (2),
c13182ef 540then make sure to provide a way to mark
fc15f317 541a file descriptor as closed dynamically (i.e., as a result of
c13182ef
MK
542a previous event's processing).
543Suppose you receive 100 events from
fea681da 544.BR epoll_wait (2),
c13182ef
MK
545and, while processing event #47, a condition causes the file descriptor for event #13 to be closed.
546If you remove the structure and
63f6a20a 547.BR close (2)
fc15f317
MK
548the file descriptor for event #13, then your
549event cache might still say there are events waiting for that
550file descriptor, causing confusion.
c13182ef 551.PP
fea681da
MK
552One solution for this is to call, during the processing of event 47,
553.BR epoll_ctl ( EPOLL_CTL_DEL )
fc15f317 554to delete file descriptor 13 and
63f6a20a 555.BR close (2),
f87925c6 556then mark its associated
c13182ef
MK
557data structure as removed and link it to a cleanup list.
558If you find another
fc15f317
MK
559event for file descriptor 13 in your batch processing,
560you will discover the file descriptor had been
fea681da 561previously removed and there will be no confusion.
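One way to implement the cleanup-list approach is sketched below. It assumes that the application stores a pointer to its own per-descriptor record in the data.ptr field of each registered event. The struct conn layout and the names conn_retire() and count_live() are hypothetical.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record kept by the application. */
struct conn {
    int fd;
    int removed;                /* set once the descriptor is retired */
    struct conn *next_cleanup;  /* link in the deferred cleanup list */
};

/* Retire a connection while a batch of events is still being
   processed: deregister, close, mark, and defer the free. */
void conn_retire(int epfd, struct conn *c, struct conn **cleanup)
{
    if (c->removed)
        return;                 /* already retired in this batch */
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->fd = -1;
    c->removed = 1;
    c->next_cleanup = *cleanup;
    *cleanup = c;
}

/* Count the events in a batch whose record is still live; a real
   loop would dispatch these and skip the retired ones. */
int count_live(const struct epoll_event *events, int nfds)
{
    int i, live = 0;

    for (i = 0; i < nfds; i++) {
        const struct conn *c = events[i].data.ptr;

        if (!c->removed)
            live++;
    }
    return live;
}
```

The records on the cleanup list are freed only after the whole batch has been processed, so a stale event that still points at a retired record can be recognized and skipped safely.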
2b2581ee 562.SH VERSIONS
04be4241
MK
563The
564.B epoll
565API was introduced in Linux kernel 2.5.44.
7547121f 566.\" Its interface should be finalized in Linux kernel 2.5.66.
d1d87801 567Support was added to glibc in version 2.3.2.
fea681da 568.SH CONFORMING TO
3bc917f6
MK
569The
570.B epoll
571API is Linux-specific.
c803c3e3 572Some other systems provide similar
75b94dc3 573mechanisms; for example, FreeBSD has
c13182ef
MK
574.IR kqueue ,
575and Solaris has
c803c3e3 576.IR /dev/poll .
58a80cd4
MK
577.SH NOTES
578The set of file descriptors that is being monitored via
579an epoll file descriptor can be viewed via the entry for
580the epoll file descriptor in the process's
581.IR /proc/[pid]/fdinfo
582directory.
583See
584.BR proc (5)
585for further details.
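As a small sketch, the number of monitored descriptors can be counted by reading the fdinfo entry of the calling process. This assumes the layout described in proc(5), where each monitored descriptor appears on a line beginning with "tfd:". The function name count_watched_fds() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Counts the "tfd:" lines in /proc/self/fdinfo/<epfd>.
   Returns -1 if the entry cannot be opened. */
int count_watched_fds(int epfd)
{
    char path[64], line[256];
    FILE *f;
    int count = 0;

    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", epfd);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (strncmp(line, "tfd:", 4) == 0)
            count++;            /* one line per monitored descriptor */
    }
    fclose(f);
    return count;
}
```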
47297adb 586.SH SEE ALSO
fea681da 587.BR epoll_create (2),
9d0f3fcb 588.BR epoll_create1 (2),
fea681da 589.BR epoll_ctl (2),
634c92fb
MK
590.BR epoll_wait (2),
591.BR poll (2),
592.BR select (2)