Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" |
2 | .\" epoll by Davide Libenzi ( efficient event notification retrieval ) | |
3 | .\" Copyright (C) 2003 Davide Libenzi | |
4 | .\" | |
5 | .\" This program is free software; you can redistribute it and/or modify | |
6 | .\" it under the terms of the GNU General Public License as published by | |
7 | .\" the Free Software Foundation; either version 2 of the License, or | |
8 | .\" (at your option) any later version. | |
9 | .\" | |
10 | .\" This program is distributed in the hope that it will be useful, | |
11 | .\" but WITHOUT ANY WARRANTY; without even the implied warranty of | |
12 | .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
13 | .\" GNU General Public License for more details. | |
14 | .\" | |
15 | .\" You should have received a copy of the GNU General Public License | |
16 | .\" along with this program; if not, write to the Free Software | |
17 | .\" Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | |
18 | .\" | |
19 | .\" Davide Libenzi <davidel@xmailserver.org> | |
20 | .\" | |
21 | .\" | |
6d86c355 | 22 | .TH EPOLL 7 "2002-10-23" Linux "Linux Programmer's Manual" |
fea681da MK |
23 | .SH NAME |
24 | epoll \- I/O event notification facility | |
25 | .SH SYNOPSIS | |
26 | .B #include <sys/epoll.h> | |
27 | .SH DESCRIPTION | |
28 | .B epoll | |
29 | is a variant of | |
30 | .BR poll (2) | |
31 | that can be used as either an Edge Triggered or a Level Triggered interface and | |
32 | scales well to large numbers of watched fds. Three system calls are provided to | |
33 | set up and control an | |
34 | .B epoll | |
35 | set: | |
36 | .BR epoll_create (2), | |
37 | .BR epoll_ctl (2), | |
38 | .BR epoll_wait (2). | |
39 | ||
40 | An | |
41 | .B epoll | |
42 | set is connected to a file descriptor created by | |
43 | .BR epoll_create (2). | |
44 | Interest in particular file descriptors is then registered via | |
45 | .BR epoll_ctl (2). | |
46 | Finally, the actual wait is started by | |
47 | .BR epoll_wait (2). | |
fea681da MK |
48 | .SH NOTES |
49 | The | |
50 | .B epoll | |
51 | event distribution interface can behave as either Edge Triggered | |
52 | (ET) or Level Triggered (LT). The difference between the ET and LT | |
53 | event distribution mechanisms can be described as follows. Suppose that | |
54 | the following scenario occurs: | |
55 | .TP | |
56 | .B 1 | |
57 | The file descriptor that represents the read side of a pipe | |
58 | .RB ( RFD ) | |
59 | is added to the | |
60 | .B epoll | |
61 | set. | |
62 | .TP | |
63 | .B 2 | |
64 | A pipe writer writes 2 kB of data on the write side of the pipe. | |
65 | .TP | |
66 | .B 3 | |
67 | A call to | |
68 | .BR epoll_wait (2) | |
69 | is done that will return | |
70 | .B RFD | |
71 | as a ready file descriptor. | |
72 | .TP | |
73 | .B 4 | |
74 | The pipe reader reads 1 kB of data from | |
75 | .BR RFD . | |
76 | .TP | |
77 | .B 5 | |
78 | A call to | |
79 | .BR epoll_wait (2) | |
80 | is done. | |
81 | .PP | |
82 | ||
83 | If the | |
84 | .B RFD | |
85 | file descriptor has been added to the | |
86 | .B epoll | |
87 | interface using the | |
88 | .B EPOLLET | |
89 | flag, the call to | |
90 | .BR epoll_wait (2) | |
91 | done in step | |
92 | .B 5 | |
93 | will probably hang despite the available data still present in the file | |
94 | input buffer; meanwhile the remote peer might be expecting a response based on the | |
95 | data it already sent. The reason for this is that Edge Triggered event | |
96 | distribution delivers events only when changes occur on the monitored file. | |
97 | So, in step | |
98 | .B 5 | |
99 | the caller might end up waiting for some data that is already present inside | |
100 | the input buffer. In the above example, an event on | |
101 | .B RFD | |
102 | will be generated because of the write done in | |
66eca51e MK |
103 | .BR 2 |
104 | and the event is consumed in | |
fea681da MK |
105 | .BR 3 . |
106 | Since the read operation done in | |
107 | .B 4 | |
108 | does not consume the whole buffer data, the call to | |
109 | .BR epoll_wait (2) | |
110 | done in step | |
111 | .B 5 | |
112 | might block indefinitely. An application that uses the | |
113 | .B epoll | |
114 | interface with the | |
115 | .B EPOLLET | |
116 | flag (Edge Triggered) | |
117 | should use non-blocking file descriptors to avoid having a blocking | |
118 | read or write starve the task that is handling multiple file descriptors. | |
119 | The suggested way to use | |
120 | .B epoll | |
66eca51e MK |
121 | as an Edge Triggered |
122 | .RB ( EPOLLET ) | |
123 | interface is as follows (possible pitfalls are described later): | |
fea681da MK |
124 | .RS |
125 | .TP | |
126 | .B i | |
127 | with non-blocking file descriptors | |
128 | .TP | |
129 | .B ii | |
130 | by waiting for an event only after | |
131 | .BR read (2) | |
132 | or | |
133 | .BR write (2) | |
134 | returns EAGAIN | |
135 | .RE | |
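The five-step scenario above can be replayed on a real pipe. The following is a minimal sketch and not part of the original page: the function name ready_after_partial_read() is hypothetical, and a timeout of 0 is passed to the final epoll_wait(2) so that the Edge Triggered case returns immediately instead of hanging.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Replays steps 1-5 on a pipe and returns the number of ready
 * descriptors reported by the second epoll_wait(), or -1 on error.
 * Pass EPOLLET for Edge Triggered, 0 for Level Triggered. */
int ready_after_partial_read(uint32_t trigger_flag)
{
    int pfd[2], epfd, n = -1;
    char buf[2048];
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | trigger_flag;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)  /* step 1 */
        goto close_all;

    memset(buf, 'x', sizeof(buf));
    if (write(pfd[1], buf, 2048) != 2048)                 /* step 2 */
        goto close_all;
    if (epoll_wait(epfd, &out, 1, 0) != 1)                /* step 3 */
        goto close_all;
    if (read(pfd[0], buf, 1024) != 1024)                  /* step 4 */
        goto close_all;
    n = epoll_wait(epfd, &out, 1, 0);                     /* step 5 */

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

With EPOLLET the second wait is expected to report no ready descriptors even though 1 kB is still buffered; with 0 (Level Triggered) it is expected to report RFD again.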
136 | .PP | |
137 | By contrast, when used as a Level Triggered interface, | |
138 | .B epoll | |
139 | is simply a faster | |
140 | .BR poll (2), | |
141 | and can be used wherever the latter is used since it shares the | |
142 | same semantics. Since even with the Edge Triggered | |
143 | .B epoll | |
3f1c1b0a | 144 | multiple events can be generated upon receipt of multiple chunks of data, |
fea681da MK |
145 | the caller has the option to specify the |
146 | .B EPOLLONESHOT | |
147 | flag, to tell | |
148 | .B epoll | |
3f1c1b0a | 149 | to disable the associated file descriptor after the receipt of an event with |
fea681da MK |
150 | .BR epoll_wait (2). |
151 | When the | |
152 | .B EPOLLONESHOT | |
153 | flag is specified, it is the caller's responsibility to rearm the file descriptor using | |
154 | .BR epoll_ctl (2) | |
155 | with | |
156 | .BR EPOLL_CTL_MOD . | |
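The rearm step can be sketched as follows. This is an illustration under stated assumptions, not code from the original page: rearm_oneshot() and oneshot_demo() are hypothetical names, and the demo uses a pipe with zero-timeout waits so that it never blocks.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Re-enable a descriptor that EPOLLONESHOT disabled after its last event. */
int rearm_oneshot(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = events | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical demo on a pipe. Fills counts[] with the results of three
 * zero-timeout epoll_wait() calls: after a first write, after a second
 * write while the descriptor is disabled, and after re-arming.
 * Returns 0 on success, -1 on a setup error. */
int oneshot_demo(int counts[3])
{
    int pfd[2], epfd, rc = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_all;

    if (write(pfd[1], "a", 1) != 1)
        goto close_all;
    counts[0] = epoll_wait(epfd, &out, 1, 0);  /* event is delivered once */

    if (write(pfd[1], "b", 1) != 1)
        goto close_all;
    counts[1] = epoll_wait(epfd, &out, 1, 0);  /* descriptor is disabled */

    if (rearm_oneshot(epfd, pfd[0], EPOLLIN) < 0)
        goto close_all;
    counts[2] = epoll_wait(epfd, &out, 1, 0);  /* reported again */
    rc = 0;

close_all:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return rc;
}
```

The middle wait is the interesting one: new data arrives, but nothing is reported until the descriptor has been rearmed.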
fea681da | 157 | .SH EXAMPLE FOR SUGGESTED USAGE |
fea681da MK |
158 | While the usage of |
159 | .B epoll | |
160 | when employed as a Level Triggered interface has the same | |
161 | semantics as | |
162 | .BR poll (2), | |
9fdfa163 | 163 | an Edge Triggered usage requires more clarification to avoid stalls |
fea681da MK |
164 | in the application event loop. In this example, listener is a |
165 | non-blocking socket on which | |
166 | .BR listen (2) | |
167 | has been called. The function do_use_fd() uses the new ready | |
168 | file descriptor until EAGAIN is returned by either | |
169 | .BR read (2) | |
170 | or | |
171 | .BR write (2). | |
172 | An event driven state machine application should, after having received | |
173 | EAGAIN, record its current state so that at the next call to do_use_fd() | |
174 | it will continue to | |
175 | .BR read (2) | |
176 | or | |
177 | .BR write (2) | |
178 | from where it stopped before. | |
179 | ||
180 | .nf | |
181 | struct epoll_event ev, *events; | |
182 | ||
183 | for(;;) { | |
2bc2f479 | 184 | nfds = epoll_wait(kdpfd, events, maxevents, \-1); |
fea681da MK |
185 | |
186 | for(n = 0; n < nfds; ++n) { | |
187 | if(events[n].data.fd == listener) { | |
188 | client = accept(listener, (struct sockaddr *) &local, | |
189 | &addrlen); | |
190 | if(client < 0){ | |
191 | perror("accept"); | |
192 | continue; | |
193 | } | |
194 | setnonblocking(client); | |
195 | ev.events = EPOLLIN | EPOLLET; | |
196 | ev.data.fd = client; | |
197 | if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &ev) < 0) { | |
7dfefab8 | 198 | fprintf(stderr, "epoll set insertion error: fd=%d\\n", |
fea681da | 199 | client); |
2bc2f479 | 200 | return \-1; |
fea681da MK |
201 | } |
202 | } | |
203 | else | |
204 | do_use_fd(events[n].data.fd); | |
205 | } | |
206 | } | |
207 | .fi | |
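The example above calls setnonblocking() and do_use_fd() without defining them. One possible shape for these helpers is sketched here; both bodies are assumptions, and drain_fd() is a hypothetical stand-in for the read half of do_use_fd() that simply discards the data it reads.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success, -1 on error. */
int setnonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read from a non-blocking descriptor until EAGAIN or end of file.
 * Returns the number of bytes consumed, or -1 on a real error.
 * A real application would process buf instead of discarding it. */
ssize_t drain_fd(int fd)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;               /* end of file */
        if (errno == EINTR)
            continue;                   /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* drained; safe to wait again */
        return -1;
    }
}
```

Only after drain_fd() has seen EAGAIN is it safe, in Edge Triggered mode, to go back to epoll_wait(2) for that descriptor.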
208 | ||
209 | When used as an Edge Triggered interface, for performance reasons, it is | |
210 | possible to add the file descriptor to the epoll interface ( | |
211 | .B EPOLL_CTL_ADD | |
212 | ) once by specifying ( | |
213 | .BR EPOLLIN | EPOLLOUT | |
214 | ). This allows you to avoid | |
215 | continuously switching between | |
216 | .B EPOLLIN | |
217 | and | |
218 | .B EPOLLOUT | |
219 | by calling | |
220 | .BR epoll_ctl (2) | |
221 | with | |
222 | .BR EPOLL_CTL_MOD . | |
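A single registration covering both directions might look like the sketch below. The helper name add_edge_rw() is hypothetical; the flags and the epoll_ctl(2) call are the ones named in the text.

```c
#include <string.h>
#include <sys/epoll.h>

/* Register fd once for both input and output in Edge Triggered mode,
 * so that no later EPOLL_CTL_MOD is needed to switch direction.
 * Returns 0 on success, -1 on error (errno set by epoll_ctl). */
int add_edge_rw(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

The caller then inspects the events field of each returned struct epoll_event to see which direction became ready.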
223 | ||
1de2d7cd | 224 | .SH QUESTIONS AND ANSWERS |
fea681da | 225 | |
fea681da MK |
226 | .TP |
227 | .B Q1 | |
228 | What happens if you add the same fd to an epoll set twice? | |
229 | .TP | |
230 | .B A1 | |
231 | You will probably get EEXIST. However, it is possible that two | |
232 | threads may add the same fd twice. This is a harmless condition. | |
233 | .TP | |
234 | .B Q2 | |
235 | Can two | |
236 | .B epoll | |
237 | sets wait for the same fd? If so, are events reported | |
238 | to both | |
239 | .B epoll | |
240 | fds? | |
241 | .TP | |
242 | .B A2 | |
243 | Yes, and events would be reported to both. However, this is not recommended. | |
244 | .TP | |
245 | .B Q3 | |
246 | Is the | |
247 | .B epoll | |
248 | fd itself poll/epoll/selectable? | |
249 | .TP | |
250 | .B A3 | |
251 | Yes. | |
252 | .TP | |
253 | .B Q4 | |
254 | What happens if the | |
255 | .B epoll | |
256 | fd is put into its own fd set? | |
257 | .TP | |
258 | .B A4 | |
259 | It will fail. However, you can add an | |
260 | .B epoll | |
261 | fd inside another epoll fd set. | |
262 | .TP | |
263 | .B Q5 | |
264 | Can I send the | |
265 | .B epoll | |
266 | fd over a UNIX domain socket to another process? | |
267 | .TP | |
268 | .B A5 | |
269 | No. | |
270 | .TP | |
271 | .B Q6 | |
272 | Will the close of an fd cause it to be removed from all | |
273 | .B epoll | |
274 | sets automatically? | |
275 | .TP | |
276 | .B A6 | |
277 | Yes. | |
278 | .TP | |
279 | .B Q7 | |
280 | If more than one event comes in between | |
281 | .BR epoll_wait (2) | |
282 | calls, are they combined or reported separately? | |
283 | .TP | |
284 | .B A7 | |
285 | They will be combined. | |
286 | .TP | |
287 | .B Q8 | |
288 | Does an operation on an fd affect the already collected but not yet reported | |
289 | events? | |
290 | .TP | |
291 | .B A8 | |
292 | Two operations are possible on an existing fd: remove and modify. Remove would be meaningless in | |
293 | this case. Modify will re-read available I/O. | |
294 | .TP | |
295 | .B Q9 | |
296 | Do I need to continuously read/write an fd until EAGAIN when using the | |
297 | .B EPOLLET | |
298 | flag (Edge Triggered behaviour)? | |
299 | .TP | |
300 | .B A9 | |
301 | No, you don't. Receiving an event from | |
302 | .BR epoll_wait (2) | |
303 | indicates that the file descriptor is ready for the requested I/O | |
304 | operation. You simply consider it ready until you receive the | |
305 | next EAGAIN. When and how you use the file descriptor is entirely up | |
306 | to you. Also, exhaustion of the read/write I/O space can | |
307 | be detected by checking the amount of data read from or written to the target | |
308 | file descriptor. For example, if you call | |
309 | .BR read (2) | |
310 | by asking to read a certain amount of data and | |
311 | .BR read (2) | |
312 | returns fewer bytes, you can be sure that you have exhausted the read | |
313 | I/O space for that file descriptor. The same holds when writing using the | |
314 | .BR write (2) | |
315 | function. | |
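Answers A1, A2 and A4 can be probed directly. The helper below is a minimal sketch with a hypothetical name, try_add(); it returns the errno value from epoll_ctl(2) so the outcome of each case can be inspected.

```c
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>

/* Try to add fd to the epoll set epfd for EPOLLIN.
 * Returns 0 on success, or the errno value reported by epoll_ctl(). */
int try_add(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return errno;
    return 0;
}
```

Adding the same fd to one set a second time is expected to yield EEXIST (A1), adding it to a second set is expected to succeed (A2), and adding an epoll fd to its own set is expected to fail with EINVAL while adding it to a different epoll set succeeds (A4).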
fea681da | 316 | .SH POSSIBLE PITFALLS AND WAYS TO AVOID THEM |
fea681da MK |
317 | .TP |
318 | .B o Starvation (Edge Triggered) | |
319 | .PP | |
320 | If there is a large amount of I/O space, it is possible that by trying to drain | |
321 | it the other files will not get processed, causing starvation. This | |
322 | is not specific to | |
323 | .BR epoll . | |
324 | .PP | |
325 | .PP | |
326 | The solution is to maintain a ready list and mark the file descriptor as ready | |
327 | in its associated data structure, thereby allowing the application to | |
328 | remember which files need to be processed but still round robin amongst | |
329 | all the ready files. This also supports ignoring subsequent events you | |
330 | receive for fds that are already ready. | |
fea681da MK |
331 | .TP |
332 | .B o If using an event cache... | |
333 | .PP | |
334 | If you use an event cache or store all the fds returned from | |
335 | .BR epoll_wait (2), | |
336 | then make sure to provide a way to mark its closure dynamically (i.e., caused by | |
337 | a previous event's processing). Suppose you receive 100 events from | |
338 | .BR epoll_wait (2), | |
9fdfa163 | 339 | and in event #47 a condition causes event #13 to be closed. |
b5cc2ffb | 340 | If you remove the structure and |
4d52e8f8 | 341 | .BR close () |
b5cc2ffb | 342 | the fd for event #13, then your |
fea681da MK |
343 | event cache might still say there are events waiting for that fd, causing | |
344 | confusion. | |
fea681da MK |
345 | .PP |
346 | One solution for this is to call, during the processing of event 47, | |
347 | .BR epoll_ctl ( EPOLL_CTL_DEL ) | |
b5cc2ffb | 348 | to delete fd 13 and |
f87925c6 MK |
349 | .BR close (), |
350 | then mark its associated | |
fea681da MK |
351 | data structure as removed and link it to a cleanup list. If you find another |
352 | event for fd 13 in your batch processing, you will discover the fd had been | |
353 | previously removed and there will be no confusion. | |
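The cleanup-list approach can be sketched as follows. Everything here is illustrative: struct conn, conn_remove() and count_live_events() are hypothetical names, and the sketch assumes each registration stored a pointer to its struct conn in data.ptr.

```c
#include <stddef.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor record, registered via ev.data.ptr. */
struct conn {
    int fd;
    int removed;                 /* set once the fd has been closed */
    struct conn *next_cleanup;   /* link in the deferred cleanup list */
};

/* Delete c from the epoll set, close its fd, mark it removed, and
 * link it to the cleanup list. The structure itself is not freed
 * here; later events in the same batch may still point at it. */
void conn_remove(int epfd, struct conn *c, struct conn **cleanup)
{
    struct epoll_event dummy;

    if (c->removed)
        return;
    /* A non-NULL event pointer is passed for the sake of old kernels. */
    memset(&dummy, 0, sizeof(dummy));
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &dummy);
    close(c->fd);
    c->fd = -1;
    c->removed = 1;
    c->next_cleanup = *cleanup;
    *cleanup = c;
}

/* Walk a batch returned by epoll_wait() and count the events that
 * still refer to live connections. A real loop would do its I/O
 * where the counter is incremented and skip removed entries. */
int count_live_events(const struct epoll_event *events, int nfds)
{
    int n, live = 0;

    for (n = 0; n < nfds; ++n) {
        const struct conn *c = events[n].data.ptr;

        if (c->removed)
            continue;            /* stale event for a closed fd */
        ++live;
    }
    return live;
}
```

Structures on the cleanup list would be released only after the whole batch has been processed, so a stale event never dereferences freed memory.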
fea681da | 354 | .SH CONFORMING TO |
6d86c355 | 355 | .BR epoll (7) |
fea681da MK |
356 | is a new API introduced in Linux kernel 2.5.44. |
357 | Its interface should be finalized in Linux kernel 2.5.66. | |
358 | .SH "SEE ALSO" | |
359 | .BR epoll_create (2), | |
360 | .BR epoll_ctl (2), | |
361 | .BR epoll_wait (2) |