.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH EPOLL 7 2002-10-23 "Linux" "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
.B epoll
is a variant of
.BR poll (2)
that can be used as either an Edge Triggered or a Level Triggered
interface and scales well to large numbers of watched file descriptors.
Three system calls are provided to
set up and control an
.B epoll
set:
.BR epoll_create (2),
.BR epoll_ctl (2),
and
.BR epoll_wait (2).

An
.B epoll
set is connected to a file descriptor created by
.BR epoll_create (2).
Interest in particular file descriptors is then registered via
.BR epoll_ctl (2).
Finally, the actual wait is started by
.BR epoll_wait (2).
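
The create, register, wait sequence described above can be sketched as follows.
This is only an illustration: the helper name wait_readable() is hypothetical,
and a real program would normally keep one epoll set open rather than creating
one per call.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable using a throwaway
   epoll set.  Returns 1 if ready, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    epfd = epoll_create(1);          /* size is only a hint; must be > 0 */
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;             /* Level Triggered unless EPOLLET is set */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n;
}
```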
.SS Level Triggered and Edge Triggered
The
.B epoll
event distribution interface is able to behave both as Edge Triggered
(ET) and Level Triggered (LT).
The difference between the ET and LT
event distribution mechanisms can be described as follows.
Suppose that
this scenario happens:
.TP
.B 1
The file descriptor that represents the read side of a pipe
.RB ( RFD )
is added to the
.B epoll
set.
.TP
.B 2
The pipe writer writes 2 kB of data on the write side of the pipe.
.TP
.B 3
A call to
.BR epoll_wait (2)
is done that will return
.B RFD
as a ready file descriptor.
.TP
.B 4
The pipe reader reads 1 kB of data from
.BR RFD .
.TP
.B 5
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.B RFD
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer; meanwhile the remote peer might be expecting a response based
on the data it already sent.
The reason for this is that Edge Triggered event
distribution delivers events only when changes occur on the monitored file.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.B RFD
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
An application that uses the
.B epoll
interface with the
.B EPOLLET
flag (Edge Triggered)
should use non-blocking file descriptors to avoid having a blocking
read or write starve the task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an Edge Triggered
.RB ( EPOLLET )
interface is as follows; possible pitfalls to avoid are described later.
.RS
.TP
.B i
with non-blocking file descriptors
.TP
.B ii
by going to wait for an event only after
.BR read (2)
or
.BR write (2)
return EAGAIN
.RE
.PP
By contrast, when used as a Level Triggered interface,
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
Since even with Edge Triggered
.B epoll
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
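
Rearming a one-shot descriptor might look like the sketch below.
The helper name rearm_oneshot() is hypothetical, and the event mask is
an assumption: it should match whatever mask the descriptor was
originally registered with.

```c
#include <sys/epoll.h>

/* Hypothetical helper: re-enable a one-shot descriptor once the
   event it delivered has been fully handled. */
int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;   /* assumed original mask */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```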
.SS Example for Suggested Usage
While the usage of
.B epoll
when employed as a Level Triggered interface has the same
semantics as
.BR poll (2),
Edge Triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example, listener is a
non-blocking socket on which
.BR listen (2)
has been called.
The function do_use_fd() uses the new ready
file descriptor until EAGAIN is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
EAGAIN, record its current state so that at the next call to do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.

.nf
struct epoll_event ev, *events;

for (;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, \-1);
    if (nfds < 0) {
        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        perror("epoll_wait");
        return \-1;
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listener) {
            addrlen = sizeof(local);    /* value-result argument */
            client = accept(listener, (struct sockaddr *) &local,
                            &addrlen);
            if (client < 0) {
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\\n",
                        client);
                return \-1;
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.fi

When used as an Edge Triggered interface, for performance reasons, it is
possible to add the file descriptor to the epoll interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
.SS Questions and Answers
.TP
.B Q1
What happens if you add the same fd to an epoll set twice?
.TP
.B A1
You will probably get EEXIST.
However, it is possible that two
threads may add the same fd twice.
This is a harmless condition.
.TP
.B Q2
Can two
.B epoll
sets wait for the same fd?
If so, are events reported to both
.B epoll
sets?
.TP
.B A2
Yes, and events would be reported to both.
However, it is not recommended.
.TP
.B Q3
Is the
.B epoll
fd itself poll/epoll/selectable?
.TP
.B A3
Yes.
.TP
.B Q4
What happens if the
.B epoll
fd is put into its own fd set?
.TP
.B A4
It will fail.
However, you can add an
.B epoll
fd inside another epoll fd set.
.TP
.B Q5
Can I send the
.B epoll
fd over a unix-socket to another process?
.TP
.B A5
No.
.TP
.B Q6
Will the close of an fd cause it to be removed from all
.B epoll
sets automatically?
.TP
.B A6
Yes.
.TP
.B Q7
If more than one event comes in between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on an fd affect the already collected but not yet reported
events?
.TP
.B A8
You can do two operations on an existing fd.
Remove would be meaningless for
this case.
Modify will re-read available I/O.
.TP
.B Q9
Do I need to continuously read/write an fd until EAGAIN when using the
.B EPOLLET
flag (Edge Triggered behavior)?
.TP
.B A9
No, you don't.
Receiving an event from
.BR epoll_wait (2)
should suggest to you that the file descriptor is ready
for the requested I/O operation.
You simply have to consider it ready until you receive the
next EAGAIN.
When and how you use the file descriptor is entirely up
to you.
Also, the condition that the read/write I/O space is exhausted can
be detected by checking the amount of data read from or written to the target
file descriptor.
For example, if you call
.BR read (2)
by asking to read a certain amount of data and
.BR read (2)
returns a lower number of bytes, you can be sure to have exhausted the read
I/O space for that file descriptor.
The same is valid when writing using the
.BR write (2)
function.
.SS Possible Pitfalls and Ways to Avoid Them
.TP
.B o Starvation ( Edge Triggered )
.PP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed, causing starvation.
This is not specific to
.BR epoll .
.PP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for fds that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the fds returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
its closure dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the fd for event #13, then your
event cache might still say there are events waiting for that fd, causing
confusion.
.PP
One solution for this is to call, during the processing of event 47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete fd 13 and
.BR close (2)
it, then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for fd 13 in your batch processing, you will discover the fd had been
previously removed and there will be no confusion.
.SH VERSIONS
.BR epoll (7)
is a new API introduced in Linux kernel 2.5.44.
Its interface should be finalized in Linux kernel 2.5.66.
.SH CONFORMING TO
The epoll API is Linux specific.
Some other systems provide similar
mechanisms, e.g., FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2)