Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" |
2 | .\" epoll by Davide Libenzi ( efficient event notification retrieval ) | |
3 | .\" Copyright (C) 2003 Davide Libenzi | |
4 | .\" | |
5 | .\" This program is free software; you can redistribute it and/or modify | |
6 | .\" it under the terms of the GNU General Public License as published by | |
7 | .\" the Free Software Foundation; either version 2 of the License, or | |
8 | .\" (at your option) any later version. | |
9 | .\" | |
10 | .\" This program is distributed in the hope that it will be useful, | |
11 | .\" but WITHOUT ANY WARRANTY; without even the implied warranty of | |
12 | .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
13 | .\" GNU General Public License for more details. | |
14 | .\" | |
15 | .\" You should have received a copy of the GNU General Public License | |
16 | .\" along with this program; if not, write to the Free Software | |
17 | .\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | |
18 | .\" | |
19 | .\" Davide Libenzi <davidel@xmailserver.org> | |
20 | .\" | |
21 | .\" | |
22 | .TH EPOLL 4 "2002-10-23" Linux "Linux Programmer's Manual" | |
23 | .SH NAME | |
24 | epoll \- I/O event notification facility | |
25 | .SH SYNOPSIS | |
26 | .B #include <sys/epoll.h> | |
27 | .SH DESCRIPTION | |
28 | .B epoll | |
29 | is a variant of | |
30 | .BR poll (2) | |
31 | that can be used either as an Edge Triggered or a Level Triggered interface and scales | |
32 | well to large numbers of watched fds. Three system calls are provided to | |
33 | set up and control an | |
34 | .B epoll | |
35 | set: | |
36 | .BR epoll_create (2), | |
37 | .BR epoll_ctl (2), | |
38 | .BR epoll_wait (2). | |
39 | ||
40 | An | |
41 | .B epoll | |
42 | set is connected to a file descriptor created by | |
43 | .BR epoll_create (2). | |
44 | Interest for certain file descriptors is then registered via | |
45 | .BR epoll_ctl (2). | |
46 | Finally, the actual wait is started by | |
47 | .BR epoll_wait (2). | |
48 | ||
49 | .SH NOTES | |
50 | The | |
51 | .B epoll | |
52 | event distribution interface is able to behave both as Edge Triggered | |
53 | (ET) and Level Triggered (LT). The difference between the ET and LT | |
54 | event distribution mechanisms can be described as follows. Suppose that | |
55 | this scenario happens: | |
56 | .TP | |
57 | .B 1 | |
58 | The file descriptor that represents the read side of a pipe ( | |
59 | .B RFD | |
60 | ) is added inside the | |
61 | .B epoll | |
62 | device. | |
63 | .TP | |
64 | .B 2 | |
65 | A pipe writer writes 2 kB of data on the write side of the pipe. | |
66 | .TP | |
67 | .B 3 | |
68 | A call to | |
69 | .BR epoll_wait (2) | |
70 | is done that will return | |
71 | .B RFD | |
72 | as ready file descriptor. | |
73 | .TP | |
74 | .B 4 | |
75 | The pipe reader reads 1 kB of data from | |
76 | .BR RFD . | |
77 | .TP | |
78 | .B 5 | |
79 | A call to | |
80 | .BR epoll_wait (2) | |
81 | is done. | |
82 | .PP | |
83 | ||
84 | If the | |
85 | .B RFD | |
86 | file descriptor has been added to the | |
87 | .B epoll | |
88 | interface using the | |
89 | .B EPOLLET | |
90 | flag, the call to | |
91 | .BR epoll_wait (2) | |
92 | done in step | |
93 | .B 5 | |
94 | will probably hang despite the available data still present in the file | |
95 | input buffer; meanwhile the remote peer might be expecting a response based on the | |
96 | data it already sent. The reason for this is that Edge Triggered event | |
97 | distribution delivers events only when events happen on the monitored file. | |
98 | So, in step | |
99 | .B 5 | |
100 | the caller might end up waiting for some data that is already present inside | |
101 | the input buffer. In the above example, an event on | |
102 | .B RFD | |
103 | will be generated because of the write done in | |
104 | .BR 2 , | |
105 | and the event is consumed in | |
106 | .BR 3 . | |
107 | Since the read operation done in | |
108 | .B 4 | |
109 | does not consume the whole buffer data, the call to | |
110 | .BR epoll_wait (2) | |
111 | done in step | |
112 | .B 5 | |
113 | might block indefinitely. The | |
114 | .B epoll | |
115 | interface, when used with the | |
116 | .B EPOLLET | |
117 | flag ( Edge Triggered ) | |
118 | should use non-blocking file descriptors to avoid having a blocking | |
119 | read or write starve the task that is handling multiple file descriptors. | |
120 | The suggested way to use | |
121 | .B epoll | |
122 | as an Edge Triggered ( | |
123 | .B EPOLLET | |
124 | ) interface is below, and possible pitfalls to avoid follow. | |
125 | .RS | |
126 | .TP | |
127 | .B i | |
128 | with non-blocking file descriptors | |
129 | .TP | |
130 | .B ii | |
131 | by waiting for an event only after | |
132 | .BR read (2) | |
133 | or | |
134 | .BR write (2) | |
135 | returns EAGAIN | |
136 | .RE | |
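The five-step pipe scenario above can be replayed in a few lines. The following is a minimal sketch, not part of the original page: the function name second_wait_ready() is hypothetical, and a zero timeout is passed to epoll_wait(2) so that the demonstration reports "not ready" instead of hanging.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Replays steps 1-5 on a fresh pipe and returns what the second
 * epoll_wait() reports: 1 if RFD is still flagged ready, 0 if it is
 * not, -1 on error.  Pass EPOLLET for Edge Triggered behaviour or 0
 * for the default Level Triggered behaviour.
 */
int second_wait_ready(unsigned int trigger_flag)
{
    int pfd[2], epfd, n = -1;
    char buf[2048];
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);              /* size hint, must be > 0 */
    if (epfd < 0)
        goto close_pipe;

    /* Step 1: register the read side of the pipe (RFD). */
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | trigger_flag;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_all;

    /* Step 2: the writer puts 2 kB into the pipe. */
    memset(buf, 'x', sizeof(buf));
    if (write(pfd[1], buf, 2048) != 2048)
        goto close_all;

    /* Step 3: the first wait reports RFD as ready. */
    if (epoll_wait(epfd, &out, 1, 0) != 1)
        goto close_all;

    /* Step 4: the reader consumes only 1 kB. */
    if (read(pfd[0], buf, 1024) != 1024)
        goto close_all;

    /* Step 5: wait again; timeout 0 means "poll, do not block". */
    n = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

Calling second_wait_ready(EPOLLET) shows the Edge Triggered case, where the remaining 1 kB does not produce a new event; calling second_wait_ready(0) shows the Level Triggered case, where the unread data keeps the descriptor ready.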
137 | .PP | |
138 | By contrast, when used as a Level Triggered interface, | |
139 | .B epoll | |
140 | is simply a faster | |
141 | .BR poll (2), | |
142 | and can be used wherever the latter is used since it shares the | |
143 | same semantics. Since even with the Edge Triggered | |
144 | .B epoll | |
145 | multiple events can be generated upon receipt of multiple chunks of data, | |
146 | the caller has the option to specify the | |
147 | .B EPOLLONESHOT | |
148 | flag, to tell | |
149 | .B epoll | |
150 | to disable the associated file descriptor after the receipt of an event with | |
151 | .BR epoll_wait (2). | |
152 | When the | |
153 | .B EPOLLONESHOT | |
154 | flag is specified, it is the caller's responsibility to rearm the file descriptor using | |
155 | .BR epoll_ctl (2) | |
156 | with | |
157 | .BR EPOLL_CTL_MOD . | |
158 | ||
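The EPOLLONESHOT rearm cycle described above might look like the sketch below. The helper names rearm_oneshot() and oneshot_demo() are hypothetical; the demo uses a pipe and zero timeouts purely so that each step can be observed without blocking.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Re-enable a one-shot descriptor for read events via EPOLL_CTL_MOD. */
int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/*
 * Stores the return value of three successive epoll_wait() calls in
 * results[]: after the first write, after a second write while the
 * descriptor is disabled, and after rearming.  Returns 0 when all
 * steps ran, -1 on a setup error.
 */
int oneshot_demo(int results[3])
{
    int pfd[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_all;

    /* First event: delivered once, then the descriptor is disabled. */
    if (write(pfd[1], "a", 1) != 1)
        goto close_all;
    results[0] = epoll_wait(epfd, &out, 1, 0);

    /* More data arrives while the descriptor is disabled. */
    if (write(pfd[1], "b", 1) != 1)
        goto close_all;
    results[1] = epoll_wait(epfd, &out, 1, 0);

    /* Rearm; the data still sitting in the pipe is reported again. */
    if (rearm_oneshot(epfd, pfd[0]) < 0)
        goto close_all;
    results[2] = epoll_wait(epfd, &out, 1, 0);
    rc = 0;

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

A multithreaded server could call rearm_oneshot() once a worker has finished with the descriptor, so that no second worker is woken for the same descriptor in the meantime.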
159 | .SH EXAMPLE FOR SUGGESTED USAGE | |
160 | ||
161 | While the usage of | |
162 | .B epoll | |
163 | when employed as a Level Triggered interface has the same | |
164 | semantics as | |
165 | .BR poll (2), | |
9fdfa163 | 166 | an Edge Triggered usage requires more clarification to avoid stalls |
fea681da MK |
167 | in the application event loop. In this example, listener is a |
168 | non-blocking socket on which | |
169 | .BR listen (2) | |
170 | has been called. The function do_use_fd() uses the new ready | |
171 | file descriptor until EAGAIN is returned by either | |
172 | .BR read (2) | |
173 | or | |
174 | .BR write (2). | |
175 | An event driven state machine application should, after having received | |
176 | EAGAIN, record its current state so that at the next call to do_use_fd() | |
177 | it will continue to | |
178 | .BR read (2) | |
179 | or | |
180 | .BR write (2) | |
181 | from where it stopped before. | |
182 | ||
183 | .nf | |
184 | struct epoll_event ev, *events; | |
185 | ||
186 | for(;;) { | |
2bc2f479 | 187 | nfds = epoll_wait(kdpfd, events, maxevents, \-1); |
fea681da MK |
188 | |
189 | for(n = 0; n < nfds; ++n) { | |
190 | if(events[n].data.fd == listener) { | |
191 | client = accept(listener, (struct sockaddr *) &local, | |
192 | &addrlen); | |
193 | if(client < 0){ | |
194 | perror("accept"); | |
195 | continue; | |
196 | } | |
197 | setnonblocking(client); | |
198 | ev.events = EPOLLIN | EPOLLET; | |
199 | ev.data.fd = client; | |
200 | if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) { | |
201 | fprintf(stderr, "epoll set insertion error: fd=%d\n", | |
202 | client); | |
2bc2f479 | 203 | return \-1; |
fea681da MK |
204 | } |
205 | } | |
206 | else | |
207 | do_use_fd(events[n].data.fd); | |
208 | } | |
209 | } | |
210 | .fi | |
211 | ||
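The loop above leaves setnonblocking() and do_use_fd() undefined. As an illustration only, here is one way the non-blocking setup and the "use the descriptor until EAGAIN" rule could be written for the read side. The body of setnonblocking() and the helper drain_fd() are assumptions, not the author's code; a real do_use_fd() would also process the data and save its state before returning.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* One possible setnonblocking(): add O_NONBLOCK to the file status flags. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Read from a non-blocking fd until read() fails with EAGAIN (or end
 * of file is reached).  Returns the number of bytes consumed, or -1
 * on any other error.  The data is discarded here; an application
 * would hand each chunk to its state machine instead.
 */
long drain_fd(int fd)
{
    char buf[512];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)                     /* end of file */
            return total;
        if (errno == EINTR)             /* interrupted, just retry */
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* drained: safe to wait again */
        return -1;
    }
}
```

Only after drain_fd() has returned because of EAGAIN is it safe, in Edge Triggered mode, to go back to epoll_wait(2) for that descriptor.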
212 | When used as an Edge Triggered interface, for performance reasons, it is | |
213 | possible to add the file descriptor inside the epoll interface ( | |
214 | .B EPOLL_CTL_ADD | |
215 | ) once by specifying ( | |
216 | .BR EPOLLIN | EPOLLOUT | |
217 | ). This allows you to avoid | |
218 | continuously switching between | |
219 | .B EPOLLIN | |
220 | and | |
221 | .B EPOLLOUT | |
222 | by calling | |
223 | .BR epoll_ctl (2) | |
224 | with | |
225 | .BR EPOLL_CTL_MOD . | |
226 | ||
227 | .SH QUESTIONS AND ANSWERS (from linux-kernel) | |
228 | ||
229 | .RS | |
230 | .TP | |
231 | .B Q1 | |
232 | What happens if you add the same fd to an epoll set twice? | |
233 | .TP | |
234 | .B A1 | |
235 | You will probably get EEXIST. However, it is possible that two | |
236 | threads may add the same fd twice. This is a harmless condition. | |
237 | .TP | |
238 | .B Q2 | |
239 | Can two | |
240 | .B epoll | |
241 | sets wait for the same fd? If so, are events reported | |
242 | to both | |
243 | .B epoll | |
244 | sets? | |
245 | .TP | |
246 | .B A2 | |
247 | Yes, and events would be reported to both. However, doing this is not recommended. | |
248 | .TP | |
249 | .B Q3 | |
250 | Is the | |
251 | .B epoll | |
252 | fd itself poll/epoll/selectable? | |
253 | .TP | |
254 | .B A3 | |
255 | Yes. | |
256 | .TP | |
257 | .B Q4 | |
258 | What happens if the | |
259 | .B epoll | |
260 | fd is put into its own fd set? | |
261 | .TP | |
262 | .B A4 | |
263 | It will fail. However, you can add an | |
264 | .B epoll | |
265 | fd inside another epoll fd set. | |
266 | .TP | |
267 | .B Q5 | |
268 | Can I send the | |
269 | .B epoll | |
270 | fd over a unix-socket to another process? | |
271 | .TP | |
272 | .B A5 | |
273 | No. | |
274 | .TP | |
275 | .B Q6 | |
276 | Will the close of an fd cause it to be removed from all | |
277 | .B epoll | |
278 | sets automatically? | |
279 | .TP | |
280 | .B A6 | |
281 | Yes. | |
282 | .TP | |
283 | .B Q7 | |
284 | If more than one event comes in between | |
285 | .BR epoll_wait (2) | |
286 | calls, are they combined or reported separately? | |
287 | .TP | |
288 | .B A7 | |
289 | They will be combined. | |
290 | .TP | |
291 | .B Q8 | |
292 | Does an operation on an fd affect the already collected but not yet reported | |
293 | events? | |
294 | .TP | |
295 | .B A8 | |
296 | You can do two operations on an existing fd. Remove would be meaningless for | |
297 | this case. Modify will re-read available I/O. | |
298 | .TP | |
299 | .B Q9 | |
300 | Do I need to continuously read/write an fd until EAGAIN when using the | |
301 | .B EPOLLET | |
302 | flag (Edge Triggered behaviour)? | |
303 | .TP | |
304 | .B A9 | |
305 | No, you don't. Receiving an event from | |
306 | .BR epoll_wait (2) | |
307 | should suggest to you that the file descriptor is ready for the requested I/O | |
308 | operation. You simply have to consider it ready until you receive the | |
309 | next EAGAIN. When and how you use the file descriptor is entirely up | |
310 | to you. Also, the condition that the read/write I/O space is exhausted can | |
311 | be detected by checking the amount of data read from or written to the target | |
312 | file descriptor. For example, if you call | |
313 | .BR read (2) | |
314 | by asking to read a certain amount of data and | |
315 | .BR read (2) | |
316 | returns a lower number of bytes, you can be sure that you have exhausted the read | |
317 | I/O space for that file descriptor. The same is true when writing using the | |
318 | .BR write (2) | |
319 | function. | |
320 | .RE | |
321 | ||
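Answers A1 and A4 are easy to check from a small program. The helper below is a hypothetical sketch that returns the errno value of a failed EPOLL_CTL_ADD. A1 names EEXIST for a repeated add; A4 only says that adding an epoll fd to its own set fails, and the EINVAL mentioned in the comment is what current Linux kernels document in epoll_ctl(2) for that case.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Try to register fd for read events in the epoll set epfd.
 * Returns 0 on success, or the errno value reported by epoll_ctl():
 * EEXIST when fd is already registered in this set, EINVAL when fd
 * is the same descriptor as epfd.
 */
int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return errno;
    return 0;
}
```

Adding one epoll fd to a different epoll set with try_add() succeeds, which matches the second half of A4.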
322 | .SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM | |
323 | .RS | |
324 | .TP | |
325 | .B o Starvation ( Edge Triggered ) | |
326 | .PP | |
327 | If there is a large amount of I/O space, it is possible that by trying to drain | |
328 | it the other files will not get processed, causing starvation. This | |
329 | is not specific to | |
330 | .BR epoll . | |
331 | .PP | |
332 | .PP | |
333 | The solution is to maintain a ready list and mark the file descriptor as ready | |
334 | in its associated data structure, thereby allowing the application to | |
335 | remember which files need to be processed but still round robin amongst | |
336 | all the ready files. This also supports ignoring subsequent events you | |
337 | receive for fd's that are already ready. | |
338 | .PP | |
339 | ||
340 | .TP | |
341 | .B o If using an event cache... | |
342 | .PP | |
343 | If you use an event cache or store all the fd's returned from | |
344 | .BR epoll_wait (2), | |
345 | then make sure to provide a way to mark its closure dynamically (i.e., caused by | |
346 | a previous event's processing). Suppose you receive 100 events from | |
347 | .BR epoll_wait (2), | |
9fdfa163 | 348 | and in event #47 a condition causes event #13 to be closed. |
fea681da MK |
349 | If you remove the structure and close() the fd for event #13, then your |
350 | event cache might still say there are events waiting for that fd, causing | |
351 | confusion. | |
352 | .PP | |
353 | .PP | |
354 | One solution for this is to call, during the processing of event 47, | |
355 | .BR epoll_ctl ( EPOLL_CTL_DEL ) | |
356 | to delete fd 13 and close() it, then mark its associated | |
357 | data structure as removed and link it to a cleanup list. If you find another | |
358 | event for fd 13 in your batch processing, you will discover the fd had been | |
359 | previously removed and there will be no confusion. | |
360 | .PP | |
361 | ||
362 | .RE | |
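The event-cache remedy described above can be sketched as follows. Everything here is illustrative: struct conn, its fields, and the helper names are assumptions, and the demo uses two pipes whose handlers close each other, standing in for "event #47 closes event #13".

```c
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record kept in the application's cache. */
struct conn {
    int fd;
    int closed;              /* set once fd was deleted and closed */
    struct conn *peer;       /* demo only: the record this one shuts down */
    struct conn *next_dead;  /* link in the cleanup list */
};

static struct conn *dead_list;

/* Delete c from the epoll set, close its fd, park it on the cleanup list. */
static void conn_close(int epfd, struct conn *c)
{
    struct epoll_event unused;

    if (c->closed)
        return;
    memset(&unused, 0, sizeof(unused));
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &unused);
    close(c->fd);
    c->closed = 1;
    c->next_dead = dead_list;
    dead_list = c;
}

/* Free the records only after the whole batch has been walked. */
static void reap_dead(void)
{
    while (dead_list) {
        struct conn *c = dead_list;

        dead_list = c->next_dead;
        free(c);
    }
}

static struct conn *conn_add(int epfd, int fd)
{
    struct epoll_event ev;
    struct conn *c = calloc(1, sizeof(*c));

    if (!c)
        return NULL;
    c->fd = fd;
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.ptr = c;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        free(c);
        return NULL;
    }
    return c;
}

/*
 * Both pipes are readable in one batch.  Handling whichever event
 * comes first closes the other record, so the second entry in the
 * batch is stale and must be skipped.  Returns the number of events
 * actually handled, or -1 on a setup error (error paths do not clean
 * up, to keep the sketch short).
 */
int event_cache_demo(void)
{
    int a[2], b[2], epfd, nfds, handled = 0;
    struct epoll_event events[2];
    struct conn *ca, *cb;

    if (pipe(a) < 0 || pipe(b) < 0 || (epfd = epoll_create(2)) < 0)
        return -1;
    ca = conn_add(epfd, a[0]);
    cb = conn_add(epfd, b[0]);
    if (!ca || !cb)
        return -1;
    ca->peer = cb;
    cb->peer = ca;
    if (write(a[1], "x", 1) != 1 || write(b[1], "x", 1) != 1)
        return -1;

    nfds = epoll_wait(epfd, events, 2, 0);
    for (int n = 0; n < nfds; ++n) {
        struct conn *c = events[n].data.ptr;

        if (c->closed)       /* closed earlier in this same batch */
            continue;
        conn_close(epfd, c->peer);
        handled++;
    }

    conn_close(epfd, ca);    /* no-op for the record already closed */
    conn_close(epfd, cb);
    reap_dead();
    close(a[1]);
    close(b[1]);
    close(epfd);
    return handled;
}
```

The important design choice is that conn_close() never frees the record directly. Freeing is deferred to reap_dead(), which runs after the batch, so a stale entry later in the events array still points at valid memory whose closed flag can be tested.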
363 | .SH CONFORMING TO | |
364 | .BR epoll (4) | |
365 | is a new API introduced in Linux kernel 2.5.44. | |
366 | Its interface should be finalized in Linux kernel 2.5.66. | |
367 | .SH "SEE ALSO" | |
368 | .BR epoll_create (2), | |
369 | .BR epoll_ctl (2), | |
370 | .BR epoll_wait (2) |