.\"
.\" epoll by Davide Libenzi ( efficient event notification retrieval )
.\" Copyright (C) 2003 Davide Libenzi
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or
.\" (at your option) any later version.
.\"
.\" This program is distributed in the hope that it will be useful,
.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
.\" GNU General Public License for more details.
.\"
.\" You should have received a copy of the GNU General Public License
.\" along with this program; if not, write to the Free Software
.\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
.\"
.\" Davide Libenzi <davidel@xmailserver.org>
.\"
.TH EPOLL 7 2002-10-23 "Linux" "Linux Programmer's Manual"
.SH NAME
epoll \- I/O event notification facility
.SH SYNOPSIS
.B #include <sys/epoll.h>
.SH DESCRIPTION
.B epoll
is a variant of
.BR poll (2)
that can be used either as an Edge Triggered or as a Level Triggered
interface and scales well to large numbers of watched file descriptors.
Three system calls are provided to
set up and control an
.B epoll
set:
.BR epoll_create (2),
.BR epoll_ctl (2),
and
.BR epoll_wait (2).
.PP
An
.B epoll
set is connected to a file descriptor created by
.BR epoll_create (2).
Interest in certain file descriptors is then registered via
.BR epoll_ctl (2).
Finally, the actual wait is started by
.BR epoll_wait (2).
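.PP
The create, register, and wait sequence described above can be sketched
as follows.
This is a minimal illustration rather than a recommended design:
the helper name wait_readable() and its return convention are assumptions
made for this example, and a real program would keep one
.B epoll
set alive instead of creating a new one for every call.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is readable within timeout_ms
   milliseconds, 0 on timeout, and -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    /* Step 1: create the epoll set (the size argument is only a hint,
       but it must be greater than zero). */
    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    /* Step 2: register interest in "readable" events on fd.
       Without EPOLLET the registration is Level Triggered. */
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    /* Step 3: wait for at most one event. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);
    if (n == 1)
        assert(out.data.fd == fd);   /* only fd was registered */

    close(epfd);
    return n;
}
```

.PP
Called on the read end of an empty pipe with a timeout of 0, the helper
reports a timeout; after a byte is written to the pipe it reports one
ready file descriptor.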
.SS Level Triggered and Edge Triggered
The
.B epoll
event distribution interface is able to behave both as Edge Triggered
(ET) and as Level Triggered (LT).
The difference between the ET and LT
event distribution mechanisms can be described as follows.
Suppose that
the following scenario occurs:
.TP
.B 1
The file descriptor that represents the read side of a pipe
.RB ( RFD )
is added to the
.B epoll
set.
.TP
.B 2
A pipe writer writes 2 kB of data on the write side of the pipe.
.TP
.B 3
A call to
.BR epoll_wait (2)
is done that will return
.B RFD
as a ready file descriptor.
.TP
.B 4
The pipe reader reads 1 kB of data from
.BR RFD .
.TP
.B 5
A call to
.BR epoll_wait (2)
is done.
.PP
If the
.B RFD
file descriptor has been added to the
.B epoll
interface using the
.B EPOLLET
flag, the call to
.BR epoll_wait (2)
done in step
.B 5
will probably hang despite the available data still present in the file
input buffer; meanwhile the remote peer might be expecting a response
based on the data it already sent.
The reason for this is that Edge Triggered event
distribution delivers events only when changes occur on the monitored
file descriptor.
So, in step
.B 5
the caller might end up waiting for some data that is already present inside
the input buffer.
In the above example, an event on
.B RFD
will be generated because of the write done in
.B 2
and the event is consumed in
.BR 3 .
Since the read operation done in
.B 4
does not consume the whole buffer data, the call to
.BR epoll_wait (2)
done in step
.B 5
might block indefinitely.
The
.B epoll
interface, when used with the
.B EPOLLET
flag (Edge Triggered),
should use non-blocking file descriptors to avoid having a blocking
read or write starve the task that is handling multiple file descriptors.
The suggested way to use
.B epoll
as an Edge Triggered
.RB ( EPOLLET )
interface is as follows (possible pitfalls to avoid are described later):
.RS
.TP
.B i
with non-blocking file descriptors; and
.TP
.B ii
by waiting for an event only after
.BR read (2)
or
.BR write (2)
returns EAGAIN.
.RE
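.PP
The five-step pipe scenario above can be reproduced with a short test
function.
This is a sketch under stated assumptions: the name second_wait_count()
is made up for this example, and a timeout of 0 is used in step 5 so that
the Edge Triggered case returns instead of blocking.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: runs steps 1 to 5 and returns the result of the
   epoll_wait() call in step 5, or -1 if an earlier step fails. */
int second_wait_count(int edge_triggered)
{
    char buf[2048];
    struct epoll_event ev, out;
    int p[2], epfd, n = -1;

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto out_pipe;

    /* Step 1: register RFD (p[0]), optionally with EPOLLET. */
    ev.events = EPOLLIN | (edge_triggered ? EPOLLET : 0);
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto out;

    /* Step 2: the writer writes 2 kB. */
    memset(buf, 'x', sizeof(buf));
    if (write(p[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf))
        goto out;

    /* Step 3: the first wait reports RFD as ready. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto out;
    assert(out.data.fd == p[0]);

    /* Step 4: the reader consumes only 1 kB. */
    if (read(p[0], buf, 1024) != 1024)
        goto out;

    /* Step 5: wait again without blocking. */
    n = epoll_wait(epfd, &out, 1, 0);
out:
    close(epfd);
out_pipe:
    close(p[0]);
    close(p[1]);
    return n;
}
```

.PP
With Level Triggered registration the second wait is expected to report
the file descriptor again, because 1 kB is still unread.
With
.B EPOLLET
it is expected to report nothing, because no new write has occurred since
the first event was consumed.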
.PP
By contrast, when used as a Level Triggered interface,
.B epoll
is simply a faster
.BR poll (2),
and can be used wherever the latter is used since it shares the
same semantics.
Since even with Edge Triggered
.B epoll
multiple events can be generated upon receipt of multiple chunks of data,
the caller has the option to specify the
.B EPOLLONESHOT
flag, to tell
.B epoll
to disable the associated file descriptor after the receipt of an event with
.BR epoll_wait (2).
When the
.B EPOLLONESHOT
flag is specified,
it is the caller's responsibility to rearm the file descriptor using
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
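.PP
The disable and rearm cycle of
.B EPOLLONESHOT
can be observed with a pipe.
The function name oneshot_demo() and the counts[] output array are
assumptions made for this sketch; only the flags and system calls come
from the text above.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: stores the results of three non-blocking
   epoll_wait() calls in counts[] and returns 0, or returns -1 on error. */
int oneshot_demo(int counts[3])
{
    struct epoll_event ev, out;
    int p[2], epfd, ret = -1;

    assert(counts);
    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto out_pipe;

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto out;
    if (write(p[1], "x", 1) != 1)
        goto out;

    /* First wait: the event is delivered and the fd is then disabled. */
    counts[0] = epoll_wait(epfd, &out, 1, 0);

    /* Second wait: the byte is still unread, but the fd is disabled. */
    counts[1] = epoll_wait(epfd, &out, 1, 0);

    /* Rearm with EPOLL_CTL_MOD; the pending data is reported again. */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, p[0], &ev) < 0)
        goto out;
    counts[2] = epoll_wait(epfd, &out, 1, 0);
    ret = 0;
out:
    close(epfd);
out_pipe:
    close(p[0]);
    close(p[1]);
    return ret;
}
```

.PP
The three waits are expected to report one event, no event, and one event
again, in that order.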
.SS Example for Suggested Usage
While the usage of
.B epoll
when employed as a Level Triggered interface has the same
semantics as
.BR poll (2),
Edge Triggered usage requires more clarification to avoid stalls
in the application event loop.
In this example, listener is a
non-blocking socket on which
.BR listen (2)
has been called.
The function do_use_fd() uses the new ready
file descriptor until EAGAIN is returned by either
.BR read (2)
or
.BR write (2).
An event-driven state machine application should, after having received
EAGAIN, record its current state so that at the next call to do_use_fd()
it will continue to
.BR read (2)
or
.BR write (2)
from where it stopped before.
.PP
.nf
/* Assumed to be set up earlier: kdpfd is the epoll fd, listener is
   already registered in it, and events points to an array of
   maxevents entries. */
struct epoll_event ev, *events;
struct sockaddr_storage local;
socklen_t addrlen;
int nfds, n, client;

for (;;) {
    nfds = epoll_wait(kdpfd, events, maxevents, \-1);
    if (nfds < 0) {
        perror("epoll_wait");
        return \-1;
    }

    for (n = 0; n < nfds; ++n) {
        if (events[n].data.fd == listener) {
            /* addrlen is a value-result argument: reset it each time */
            addrlen = sizeof(local);
            client = accept(listener, (struct sockaddr *) &local,
                            &addrlen);
            if (client < 0) {
                perror("accept");
                continue;
            }
            setnonblocking(client);
            ev.events = EPOLLIN | EPOLLET;
            ev.data.fd = client;
            if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) {
                fprintf(stderr, "epoll set insertion error: fd=%d\\n",
                        client);
                return \-1;
            }
        } else {
            do_use_fd(events[n].data.fd);
        }
    }
}
.fi
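.PP
The example above calls setnonblocking() and do_use_fd() without showing
them.
One possible shape for the two helpers is sketched below.
The bodies, return types, and buffer size are assumptions made for
illustration; in particular, a real do_use_fd() would process the data it
reads and might also write.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Assumed implementation: set O_NONBLOCK on fd.
   Returns 0 on success and -1 on error. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Assumed implementation: read from a non-blocking fd until EAGAIN
   (or end of file) and return the number of bytes consumed,
   or -1 on any other error. */
long do_use_fd(int fd)
{
    char buf[512];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;         /* a real handler would process buf here */
            continue;
        }
        if (n == 0)
            break;              /* end of file */
        if (errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;              /* drained: safe to wait for a new event */
        return -1;
    }
    return total;
}
```

.PP
Returning only after EAGAIN is what makes it safe to go back to
.BR epoll_wait (2)
in Edge Triggered mode: no unread data is left behind.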
.PP
When used as an Edge Triggered interface, for performance reasons, it is
possible to add the file descriptor to the epoll interface
.RB ( EPOLL_CTL_ADD )
once by specifying
.RB ( EPOLLIN | EPOLLOUT ).
This allows you to avoid
continuously switching between
.B EPOLLIN
and
.B EPOLLOUT
by calling
.BR epoll_ctl (2)
with
.BR EPOLL_CTL_MOD .
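.PP
A single Edge Triggered registration for both directions might look like
the following sketch.
It uses a
.BR socketpair (2)
so that it is self-contained; the function name rw_once_demo() and the
seen[] output array are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: registers one end of a socket pair once for
   EPOLLIN | EPOLLOUT | EPOLLET, then records the event masks reported
   before and after the peer sends data. Returns 0, or -1 on error. */
int rw_once_demo(uint32_t seen[2])
{
    struct epoll_event ev, out;
    int sv[2], epfd, ret = -1;

    assert(seen);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto out_sock;

    /* One EPOLL_CTL_ADD covers both readiness directions. */
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto out;

    /* A fresh socket is writable but has nothing to read. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto out;
    seen[0] = out.events;

    /* The peer sends data; no EPOLL_CTL_MOD is needed to hear of it. */
    if (write(sv[1], "x", 1) != 1)
        goto out;
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto out;
    seen[1] = out.events;
    ret = 0;
out:
    close(epfd);
out_sock:
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

.PP
The first mask is expected to contain
.B EPOLLOUT
only, and the second to contain
.B EPOLLIN
once the peer has written.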
.SS Questions and Answers
.TP
.B Q1
What happens if you add the same fd to an epoll set twice?
.TP
.B A1
You will probably get EEXIST.
However, it is possible that two
threads may add the same fd twice.
This is a harmless condition.
.TP
.B Q2
Can two
.B epoll
sets wait for the same fd?
If so, are events reported to both
.B epoll
sets?
.TP
.B A2
Yes, and events would be reported to both.
However, it is not recommended.
.TP
.B Q3
Is the
.B epoll
fd itself poll/epoll/selectable?
.TP
.B A3
Yes.
.TP
.B Q4
What happens if the
.B epoll
fd is put into its own fd set?
.TP
.B A4
It will fail: the
.BR epoll_ctl (2)
call returns EINVAL.
However, you can add an
.B epoll
fd inside another epoll fd set.
.TP
.B Q5
Can I send the
.B epoll
fd over a UNIX domain socket to another process?
.TP
.B A5
Yes, but it does not make sense to do this, since the receiving process
would not have copies of the file descriptors in the
.B epoll
set.
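.TP
.B Q6
Will the close of an fd cause it to be removed from all
.B epoll
sets automatically?
.TP
.B A6
Yes, but only once every file descriptor that refers to the same open
file description has been closed.
Duplicates created with
.BR dup (2)
or inherited across
.BR fork (2)
keep the registration alive.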
.TP
.B Q7
If more than one event comes in between
.BR epoll_wait (2)
calls, are they combined or reported separately?
.TP
.B A7
They will be combined.
.TP
.B Q8
Does an operation on an fd affect the already collected but not yet reported
events?
.TP
.B A8
You can do two operations on an existing fd: remove and modify.
Remove would be meaningless in
this case.
Modify will re-read available I/O.
.TP
.B Q9
Do I need to continuously read/write an fd until EAGAIN when using the
.B EPOLLET
flag (Edge Triggered behavior)?
.TP
.B A9
No, you don't.
Receiving an event from
.BR epoll_wait (2)
should suggest to you that the file descriptor is ready
for the requested I/O operation.
You simply have to consider it ready until you receive the
next EAGAIN.
When and how you use the file descriptor is entirely up
to you.
Also, the condition that the read/write I/O space is exhausted can
be detected by checking the amount of data read from or written to the target
file descriptor.
For example, if you call
.BR read (2)
on a stream-oriented file descriptor such as a pipe or stream socket,
asking to read a certain amount of data, and
.BR read (2)
returns a lower number of bytes, you can be sure to have exhausted the read
I/O space for that file descriptor.
The same is valid when writing using the
.BR write (2)
function.
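.PP
Answers A1 and A4 above can be checked directly.
The helper name try_add() is an assumption made for this sketch; it
attempts an
.B EPOLL_CTL_ADD
and returns the resulting errno value.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>

/* Hypothetical helper: tries to add fd to the epoll set epfd.
   Returns 0 on success, or the errno value set by epoll_ctl(). */
int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    assert(epfd >= 0);
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return errno;
    return 0;
}
```

.PP
Given two epoll fds named outer and inner, try_add(outer, inner) succeeds
the first time (an epoll fd can be nested inside another set), returns
EEXIST the second time, and try_add(outer, outer) returns EINVAL.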
.SS Possible Pitfalls and Ways to Avoid Them
.TP
.B o Starvation (Edge Triggered)
.PP
If there is a large amount of I/O space,
it is possible that by trying to drain
it the other files will not get processed, causing starvation.
This is not specific to
.BR epoll .
.PP
The solution is to maintain a ready list
and mark the file descriptor as ready
in its associated data structure, thereby allowing the application to
remember which files need to be processed but still round robin amongst
all the ready files.
This also supports ignoring subsequent events you
receive for fds that are already ready.
.TP
.B o If using an event cache...
.PP
If you use an event cache or store all the fds returned from
.BR epoll_wait (2),
then make sure to provide a way to mark
the closure of an fd dynamically (i.e., caused by
a previous event's processing).
Suppose you receive 100 events from
.BR epoll_wait (2),
and in event #47 a condition causes event #13 to be closed.
If you remove the structure and
.BR close (2)
the fd for event #13, then your
event cache might still say there are events waiting for that fd, causing
confusion.
.PP
One solution for this is to call, during the processing of event #47,
.BR epoll_ctl ( EPOLL_CTL_DEL )
to delete fd 13 and
.BR close (2)
it, then mark its associated
data structure as removed and link it to a cleanup list.
If you find another
event for fd 13 in your batch processing, you will discover the fd had been
previously removed and there will be no confusion.
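.PP
The "mark as removed" approach for an event cache can be sketched as
follows.
Everything here is illustrative: struct conn, its victim field (standing
in for "a condition that causes another fd to be closed"), and the
function names are assumptions, and the cleanup list is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-fd state kept by the application. */
struct conn {
    int fd;
    int removed;            /* set once fd was deleted and closed */
    struct conn *victim;    /* connection this handler decides to close */
};

/* Delete the fd from the epoll set, close it, and mark the structure.
   Real code would also link c onto a cleanup list to free it later. */
static void conn_close(int epfd, struct conn *c)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
}

/* Process one batch of cached events; batch[i] is the structure stored
   for event i. Returns the number of events actually handled. */
int process_batch(int epfd, struct conn **batch, int n)
{
    int i, handled = 0;

    assert(batch != NULL);
    for (i = 0; i < n; i++) {
        struct conn *c = batch[i];

        if (c->removed)
            continue;       /* stale event for an fd closed earlier */
        if (c->victim != NULL)
            conn_close(epfd, c->victim);
        handled++;
    }
    return handled;
}
```

.PP
Because the structure outlives the file descriptor until the batch has
been processed, a later event in the same batch finds the removed flag
set instead of acting on a closed (or reused) fd.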
.SH VERSIONS
.BR epoll (7)
is a new API introduced in Linux kernel 2.5.44.
Its interface should be finalized in Linux kernel 2.5.66.
.SH CONFORMING TO
The epoll API is Linux-specific.
Some other systems provide similar
mechanisms, e.g., FreeBSD has
.IR kqueue ,
and Solaris has
.IR /dev/poll .
.SH "SEE ALSO"
.BR epoll_create (2),
.BR epoll_ctl (2),
.BR epoll_wait (2)