]>
Commit | Line | Data |
---|---|---|
8f0aff2a | 1 | .\" Page by b.hubert |
1abce893 MK |
2 | .\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de> |
3 | .\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com> | |
2297bf0e | 4 | .\" |
2e46a6e7 | 5 | .\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE) |
8f0aff2a | 6 | .\" may be freely modified and distributed |
8ff7380d | 7 | .\" %%%LICENSE_END |
fea681da MK |
8 | .\" |
9 | .\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com) | |
10 | .\" added ERRORS section. | |
11 | .\" | |
12 | .\" Modified 2004-06-17 mtk | |
13 | .\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE | |
14 | .\" | |
47f5c4ba MK |
15 | .\" FIXME Still to integrate are some points from Torvald Riegel's mail of |
16 | .\" 2015-01-23: | |
17 | .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977 | |
18 | .\" | |
78e85692 | 19 | .\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail |
02182e7c MK |
20 | .\" at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242 |
21 | .\" | |
3d155313 | 22 | .TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual" |
fea681da | 23 | .SH NAME |
ce154705 | 24 | futex \- fast user-space locking |
fea681da | 25 | .SH SYNOPSIS |
9d9dc1e8 | 26 | .nf |
fea681da MK |
27 | .sp |
28 | .B "#include <linux/futex.h>" | |
fea681da MK |
29 | .B "#include <sys/time.h>" |
30 | .sp | |
d33602c4 | 31 | .BI "int futex(int *" uaddr ", int " futex_op ", int " val , |
768d3c23 | 32 | .BI " const struct timespec *" timeout , \ |
c6dc40a2 | 33 | " \fR /* or: \fBuint32_t \fIval2\fP */ |
9d9dc1e8 | 34 | .BI " int *" uaddr2 ", int " val3 ); |
9d9dc1e8 | 35 | .fi |
409f08b0 | 36 | |
b939d6e4 MK |
37 | .IR Note : |
38 | There is no glibc wrapper for this system call; see NOTES. | |
47297adb | 39 | .SH DESCRIPTION |
fea681da MK |
40 | .PP |
41 | The | |
e511ffb6 | 42 | .BR futex () |
4b35dc5d | 43 | system call provides a method for waiting until a certain condition becomes |
077981d4 MK |
44 | true. |
45 | It is typically used as a blocking construct in the context of | |
d45f244c MK |
46 | shared-memory synchronization. |
47 | When using futexes, the majority of | |
48 | the synchronization operations are performed in user space. | |
49 | The user-space program employs | |
50 | .BR futex () | |
51 | operations only when it is likely that the program has to block for | |
4c8cb0ff | 52 | a longer time until the condition becomes true. |
d45f244c MK |
53 | The program uses another |
54 | .BR futex () | |
55 | operation to wake anyone waiting for a particular condition. | |
4b35dc5d | 56 | |
7e8dcabc MK |
57 | A futex is a 32-bit value\(emreferred to below as a |
58 | .IR "futex word" \(emwhose | |
59 | address is supplied to the | |
4b35dc5d | 60 | .BR futex () |
7e8dcabc MK |
61 | system call. |
62 | (Futexes are 32-bits in size on all platforms, including 64-bit systems.) | |
63 | All futex operations are governed by this value. | |
64 | In order to share a futex between processes, | |
65 | the futex is placed in a region of shared memory, | |
66 | created using (for example) | |
67 | .BR mmap (2) | |
68 | or | |
69 | .BR shmat (2). | |
70 | (Thus the futex word may have different | |
71 | virtual addresses in different processes, | |
72 | but these addresses all refer to the same location in physical memory.) | |
809ca3ae | 73 | |
0c3ec26b MK |
74 | When executing a futex operation that requests to block a thread, |
75 | the kernel will block only if the futex word has the value that the | |
55f9e85e MK |
76 | calling thread supplied (as one of the arguments of the |
77 | .BR futex () | |
78 | call) as the expected value of the futex word. | |
9d32a39b MK |
79 | The loading of the futex word's value, |
80 | the comparison of that value with the expected value, | |
077981d4 | 81 | and the actual blocking will happen atomically and totally |
0c3ec26b | 82 | ordered with respect to concurrently executing futex |
b80daba2 HS |
83 | operations on the same futex word. |
84 | Thus, the futex word is used to connect the synchronization in user space | |
9d32a39b | 85 | with the implementation of blocking by the kernel. |
55f9e85e | 86 | Analogously to an atomic |
4b35dc5d | 87 | compare-and-exchange operation that potentially changes shared memory, |
077981d4 | 88 | blocking via a futex is an atomic compare-and-block operation. |
d6bb5a38 MK |
89 | .\" FIXME(Torvald Riegel): |
90 | .\" Eventually we want to have some text in NOTES to satisfy | |
91 | .\" the reference in the following sentence | |
92 | .\" See NOTES for | |
93 | .\" a detailed specification of the synchronization semantics. | |
4b35dc5d | 94 | |
55f9e85e | 95 | One example use of futexes is for implementing locks. |
c0dc758e MK |
96 | The state of the lock (i.e., acquired or not acquired) |
97 | can be represented as an atomically accessed flag in shared memory. | |
4c8cb0ff | 98 | In the uncontended case, |
55f9e85e | 99 | a thread can access or modify the lock state with atomic instructions |
4c8cb0ff MK |
100 | for example atomically changing it from not acquired to acquired |
101 | using an atomic compare-and-exchange instruction. | |
55f9e85e MK |
102 | (Such instructions are performed entirely in user mode, |
103 | and the kernel maintains no information about the lock state.) | |
104 | On the other hand, a thread may be unable to acquire a lock because | |
8e754e12 | 105 | it is already acquired by another thread. |
55f9e85e | 106 | It then may pass the lock's flag as a futex word and the value |
0c3ec26b | 107 | representing the acquired state as the expected value to a |
8e754e12 HS |
108 | .BR futex () |
109 | wait operation. | |
55f9e85e | 110 | This |
8e754e12 | 111 | .BR futex () |
55f9e85e | 112 | call will block if and only if the lock is still acquired. |
077981d4 | 113 | When releasing the lock, a thread has to first reset the |
0c3ec26b | 114 | lock state to not acquired and then execute a futex |
55f9e85e | 115 | operation that wakes threads blocked on the lock flag used as a futex word |
4c8cb0ff | 116 | (this can be be further optimized to avoid unnecessary wake-ups). |
077981d4 | 117 | See |
4b35dc5d TR |
118 | .BR futex (7) |
119 | for more detail on how to use futexes. | |
120 | ||
121 | Besides the basic wait and wake-up futex functionality, there are further | |
077981d4 MK |
122 | futex operations aimed at supporting more complex use cases. |
123 | Also note that | |
4c8cb0ff MK |
124 | no explicit initialization or destruction are necessary to use futexes; |
125 | the kernel maintains a futex | |
126 | (i.e., the kernel-internal implementation artifact) | |
4b35dc5d TR |
127 | only while operations such as |
128 | .BR FUTEX_WAIT , | |
129 | described below, are being performed on a particular futex word. | |
a663ca5a MK |
130 | .\" |
131 | .SS Arguments | |
fea681da MK |
132 | The |
133 | .I uaddr | |
077981d4 MK |
134 | argument points to the futex word. |
135 | On all platforms, futexes are four-byte | |
4b35dc5d | 136 | integers that must be aligned on a four-byte boundary. |
f388ba70 MK |
137 | The operation to perform on the futex is specified in the |
138 | .I futex_op | |
139 | argument; | |
140 | .IR val | |
141 | is a value whose meaning and purpose depends on | |
142 | .IR futex_op . | |
36ab2074 MK |
143 | |
144 | The remaining arguments | |
145 | .RI ( timeout , | |
146 | .IR uaddr2 , | |
147 | and | |
148 | .IR val3 ) | |
149 | are required only for certain of the futex operations described below. | |
150 | Where one of these arguments is not required, it is ignored. | |
768d3c23 | 151 | |
36ab2074 MK |
152 | For several blocking operations, the |
153 | .I timeout | |
154 | argument is a pointer to a | |
155 | .IR timespec | |
156 | structure that specifies a timeout for the operation. | |
157 | However, notwithstanding the prototype shown above, for some operations, | |
10022b8e | 158 | the least significant four bytes are used as an integer whose meaning |
36ab2074 | 159 | is determined by the operation. |
768d3c23 MK |
160 | For these operations, the kernel casts the |
161 | .I timeout | |
10022b8e HS |
162 | value first to |
163 | .IR "unsigned long", | |
164 | then to | |
c6dc40a2 | 165 | .IR uint32_t , |
768d3c23 MK |
166 | and in the remainder of this page, this argument is referred to as |
167 | .I val2 | |
168 | when interpreted in this fashion. | |
169 | ||
de5a3bb4 | 170 | Where it is required, the |
36ab2074 | 171 | .IR uaddr2 |
4c8cb0ff MK |
172 | argument is a pointer to a second futex word that is employed |
173 | by the operation. | |
36ab2074 MK |
174 | The interpretation of the final integer argument, |
175 | .IR val3 , | |
176 | depends on the operation. | |
a663ca5a MK |
177 | .\" |
178 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
179 | .\" | |
180 | .SS Futex operations | |
6be4bad7 | 181 | The |
d33602c4 | 182 | .I futex_op |
6be4bad7 MK |
183 | argument consists of two parts: |
184 | a command that specifies the operation to be performed, | |
185 | bit-wise ORed with zero or or more options that | |
186 | modify the behaviour of the operation. | |
fc30eb79 | 187 | The options that may be included in |
d33602c4 | 188 | .I futex_op |
fc30eb79 TG |
189 | are as follows: |
190 | .TP | |
191 | .BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)" | |
192 | .\" commit 34f01cc1f512fa783302982776895c73714ebbc2 | |
193 | This option bit can be employed with all futex operations. | |
e45f9735 | 194 | It tells the kernel that the futex is process-private and not shared |
0c3ec26b MK |
195 | with another process (i.e., it is being used for synchronization |
196 | only between threads of the same process). | |
943ccc52 MK |
197 | This allows the kernel to make some additional performance optimizations. |
198 | .\" I.e., It allows the kernel choose the fast path for validating | |
199 | .\" the user-space address and avoids expensive VMA lookups, | |
200 | .\" taking reference counts on file backing store, and so on. | |
ae2c1774 MK |
201 | |
202 | As a convenience, | |
203 | .IR <linux/futex.h> | |
204 | defines a set of constants with the suffix | |
205 | .BR _PRIVATE | |
206 | that are equivalents of all of the operations listed below, | |
dcdfde26 | 207 | .\" except the obsolete FUTEX_FD, for which the "private" flag was |
ae2c1774 MK |
208 | .\" meaningless |
209 | but with the | |
210 | .BR FUTEX_PRIVATE_FLAG | |
211 | ORed into the constant value. | |
212 | Thus, there are | |
213 | .BR FUTEX_WAIT_PRIVATE , | |
214 | .BR FUTEX_WAKE_PRIVATE , | |
215 | and so on. | |
2e98bbc2 TG |
216 | .TP |
217 | .BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)" | |
218 | .\" commit 1acdac104668a0834cfa267de9946fac7764d486 | |
4a7e5b05 | 219 | This option bit can be employed only with the |
2e98bbc2 TG |
220 | .BR FUTEX_WAIT_BITSET |
221 | and | |
222 | .BR FUTEX_WAIT_REQUEUE_PI | |
c84cf68c | 223 | operations. |
2e98bbc2 | 224 | |
f2103b26 MK |
225 | If this option is set, the kernel treats |
226 | .I timeout | |
227 | as an absolute time based on | |
2e98bbc2 TG |
228 | .BR CLOCK_REALTIME . |
229 | ||
f2103b26 MK |
230 | If this option is not set, the kernel treats |
231 | .I timeout | |
232 | as relative time, | |
1c952cf5 MK |
233 | measured against the |
234 | .BR CLOCK_MONOTONIC | |
235 | clock. | |
6be4bad7 MK |
236 | .PP |
237 | The operation specified in | |
d33602c4 | 238 | .I futex_op |
6be4bad7 | 239 | is one of the following: |
70b06b90 MK |
240 | .\" |
241 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
242 | .\" | |
fea681da | 243 | .TP |
81c9d87e MK |
244 | .BR FUTEX_WAIT " (since Linux 2.6.0)" |
245 | .\" Strictly speaking, since some time in 2.5.x | |
f065673c | 246 | This operation tests that the value at the |
4b35dc5d | 247 | futex word pointed to by the address |
fea681da | 248 | .I uaddr |
4b35dc5d | 249 | still contains the expected value |
fea681da | 250 | .IR val , |
fd105614 | 251 | and if so, then sleeps waiting for a |
682edefb | 252 | .B FUTEX_WAKE |
fd105614 | 253 | operation on the futex word. |
077981d4 | 254 | The load of the value of the futex word is an atomic memory |
4b35dc5d | 255 | access (i.e., using atomic machine instructions of the respective |
077981d4 MK |
256 | architecture). |
257 | This load, the comparison with the expected value, and | |
fd105614 | 258 | starting to sleep are performed atomically |
da56650a MK |
259 | .\" FIXME: Torvald, I think we may need to add some explanation of |
260 | .\" "totally odered" here. | |
fd105614 MK |
261 | and totally ordered |
262 | with respect to other futex operations on the same futex word. | |
c0dc758e MK |
263 | If the thread starts to sleep, |
264 | it is considered a waiter on this futex word. | |
f065673c MK |
265 | If the futex value does not match |
266 | .IR val , | |
4710334a | 267 | then the call fails immediately with the error |
badbf70c | 268 | .BR EAGAIN . |
4b35dc5d TR |
269 | |
270 | The purpose of the comparison with the expected value is to prevent lost | |
fd105614 MK |
271 | wake-ups. |
272 | If another thread changed the value of the futex word after the | |
c0dc758e MK |
273 | calling thread decided to block based on the prior value, |
274 | and if the other thread executed a | |
4b35dc5d TR |
275 | .BR FUTEX_WAKE |
276 | operation (or similar wake-up) after the value change and before this | |
f065673c | 277 | .BR FUTEX_WAIT |
4b35dc5d TR |
278 | operation, then the latter will observe the value change and will not start |
279 | to sleep. | |
1909e523 | 280 | |
c13182ef | 281 | If the |
fea681da | 282 | .I timeout |
53ba4030 | 283 | argument is non-NULL, its contents specify a relative timeout for the wait, |
1c952cf5 MK |
284 | measured according to the |
285 | .BR CLOCK_MONOTONIC | |
286 | clock. | |
82a6092b | 287 | (This interval will be rounded up to the system clock granularity, |
8297383e | 288 | and is guaranteed not to expire early.) |
82a6092b MK |
289 | If |
290 | .I timeout | |
291 | is NULL, the call blocks indefinitely. | |
4798a7f3 | 292 | |
c13182ef | 293 | The arguments |
fea681da MK |
294 | .I uaddr2 |
295 | and | |
296 | .I val3 | |
297 | are ignored. | |
298 | ||
74f58a64 MK |
299 | .\" FIXME(Torvald) I think we should remove this. Or maybe adapt to a |
300 | .\" different example. | |
4b35dc5d TR |
301 | .\" For |
302 | .\" .BR futex (7), | |
303 | .\" this call is executed if decrementing the count gave a negative value | |
304 | .\" (indicating contention), | |
305 | .\" and will sleep until another process or thread releases | |
306 | .\" the futex and executes the | |
307 | .\" .B FUTEX_WAKE | |
308 | .\" operation. | |
70b06b90 MK |
309 | .\" |
310 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
311 | .\" | |
fea681da | 312 | .TP |
81c9d87e MK |
313 | .BR FUTEX_WAKE " (since Linux 2.6.0)" |
314 | .\" Strictly speaking, since Linux 2.5.x | |
f065673c MK |
315 | This operation wakes at most |
316 | .I val | |
4b35dc5d | 317 | of the waiters that are waiting (e.g., inside |
f065673c | 318 | .BR FUTEX_WAIT ) |
4b35dc5d | 319 | on the futex word at the address |
f065673c MK |
320 | .IR uaddr . |
321 | Most commonly, | |
322 | .I val | |
323 | is specified as either 1 (wake up a single waiter) or | |
324 | .BR INT_MAX | |
325 | (wake up all waiters). | |
730bfbda MK |
326 | No guarantee is provided about which waiters are awoken |
327 | (e.g., a waiter with a higher scheduling priority is not guaranteed | |
328 | to be awoken in preference to a waiter with a lower priority). | |
4798a7f3 | 329 | |
fea681da MK |
330 | The arguments |
331 | .IR timeout , | |
c8b921bd | 332 | .IR uaddr2 , |
fea681da MK |
333 | and |
334 | .I val3 | |
335 | are ignored. | |
336 | ||
74f58a64 MK |
337 | .\" FIXME(Torvald) I think we should remove this. Or maybe adapt to |
338 | .\" a different example. | |
4c8cb0ff MK |
339 | .\" For |
340 | .\" .BR futex (7), | |
341 | .\" this is executed if incrementing the count showed that | |
342 | .\" there were waiters, | |
343 | .\" once the futex value has been set to 1 | |
344 | .\" (indicating that it is available). | |
345 | .\" | |
346 | .\" FIXME How does "incrementing the count show that there were waiters"? | |
70b06b90 MK |
347 | .\" |
348 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
349 | .\" | |
a7c2bf45 MK |
350 | .TP |
351 | .BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)" | |
352 | .\" Strictly speaking, from Linux 2.5.x to 2.6.25 | |
4c8cb0ff MK |
353 | This operation creates a file descriptor that is associated with |
354 | the futex at | |
a7c2bf45 | 355 | .IR uaddr . |
bdc5957a MK |
356 | The caller must close the returned file descriptor after use. |
357 | When another process or thread performs a | |
a7c2bf45 | 358 | .BR FUTEX_WAKE |
4b35dc5d | 359 | on the futex word, the file descriptor indicates as being readable with |
a7c2bf45 MK |
360 | .BR select (2), |
361 | .BR poll (2), | |
362 | and | |
363 | .BR epoll (7) | |
364 | ||
f1d2171d | 365 | The file descriptor can be used to obtain asynchronous notifications: if |
a7c2bf45 | 366 | .I val |
bdc5957a | 367 | is nonzero, then when another process or thread executes a |
a7c2bf45 MK |
368 | .BR FUTEX_WAKE , |
369 | the caller will receive the signal number that was passed in | |
370 | .IR val . | |
371 | ||
372 | The arguments | |
373 | .IR timeout , | |
374 | .I uaddr2 | |
375 | and | |
376 | .I val3 | |
377 | are ignored. | |
378 | ||
a7c2bf45 MK |
379 | Because it was inherently racy, |
380 | .B FUTEX_FD | |
381 | has been removed | |
382 | .\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80 | |
383 | from Linux 2.6.26 onward. | |
70b06b90 MK |
384 | .\" |
385 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
386 | .\" | |
a7c2bf45 MK |
387 | .TP |
388 | .BR FUTEX_REQUEUE " (since Linux 2.6.0)" | |
a7c2bf45 | 389 | This operation performs the same task as |
27dd3a6e MK |
390 | .BR FUTEX_CMP_REQUEUE |
391 | (see below), except that no check is made using the value in | |
a7c2bf45 MK |
392 | .IR val3 . |
393 | (The argument | |
394 | .I val3 | |
395 | is ignored.) | |
70b06b90 MK |
396 | .\" |
397 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
398 | .\" | |
a7c2bf45 MK |
399 | .TP |
400 | .BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)" | |
4b35dc5d | 401 | This operation first checks whether the location |
a7c2bf45 MK |
402 | .I uaddr |
403 | still contains the value | |
404 | .IR val3 . | |
405 | If not, the operation fails with the error | |
406 | .BR EAGAIN . | |
4b35dc5d | 407 | Otherwise, the operation wakes up a maximum of |
a7c2bf45 MK |
408 | .I val |
409 | waiters that are waiting on the futex at | |
410 | .IR uaddr . | |
411 | If there are more than | |
412 | .I val | |
413 | waiters, then the remaining waiters are removed | |
414 | from the wait queue of the source futex at | |
415 | .I uaddr | |
416 | and added to the wait queue of the target futex at | |
417 | .IR uaddr2 . | |
418 | The | |
768d3c23 | 419 | .I val2 |
936876a9 | 420 | argument specifies an upper limit on the number of waiters |
a7c2bf45 | 421 | that are requeued to the futex at |
768d3c23 | 422 | .IR uaddr2 . |
a7c2bf45 | 423 | |
d6bb5a38 MK |
424 | .\" FIXME(Torvald) Is the following correct? Or is just the decision |
425 | .\" which threads to wake or requeue part of the atomic operation? | |
4b35dc5d TR |
426 | The load from |
427 | .I uaddr | |
4c8cb0ff MK |
428 | is an atomic memory access (i.e., using atomic machine instructions of |
429 | the respective architecture). | |
077981d4 | 430 | This load, the comparison with |
4b35dc5d | 431 | .IR val3 , |
4c8cb0ff MK |
432 | and the requeueing of any waiters are performed atomically and totally |
433 | ordered with respect to other operations on the same futex word. | |
4b35dc5d | 434 | |
a7c2bf45 MK |
435 | Typical values to specify for |
436 | .I val | |
437 | are 0 or or 1. | |
438 | (Specifying | |
439 | .BR INT_MAX | |
440 | is not useful, because it would make the | |
441 | .BR FUTEX_CMP_REQUEUE | |
442 | operation equivalent to | |
443 | .BR FUTEX_WAKE .) | |
936876a9 | 444 | The limit value specified via |
768d3c23 MK |
445 | .I val2 |
446 | is typically either 1 or | |
a7c2bf45 MK |
447 | .BR INT_MAX . |
448 | (Specifying the argument as 0 is not useful, because it would make the | |
449 | .BR FUTEX_CMP_REQUEUE | |
450 | operation equivalent to | |
451 | .BR FUTEX_WAIT .) | |
627b50ce MK |
452 | |
453 | The | |
454 | .B FUTEX_CMP_REQUEUE | |
455 | operation was added as a replacement for the earlier | |
456 | .BR FUTEX_REQUEUE . | |
457 | The difference is that the check of the value at | |
458 | .I uaddr | |
459 | can be used to ensure that requeueing happens only under certain | |
460 | conditions, which allows race conditions to be avoided in certain use cases. | |
dcb410c3 | 461 | .\" But, as Rich Felker points out, there remain valid use cases for |
627b50ce MK |
462 | .\" FUTEX_REQUEUE, for example, when the calling thread is requeuing |
463 | .\" the target(s) to a lock that the calling thread owns | |
464 | .\" From: Rich Felker <dalias@libc.org> | |
465 | .\" Date: Wed, 29 Oct 2014 22:43:17 -0400 | |
466 | .\" To: Darren Hart <dvhart@infradead.org> | |
467 | .\" CC: libc-alpha@sourceware.org, ... | |
468 | .\" Subject: Re: Add futex wrapper to glibc? | |
469 | ||
470 | Both | |
471 | .BR FUTEX_REQUEUE | |
472 | and | |
473 | .BR FUTEX_CMP_REQUEUE | |
474 | can be used to avoid "thundering herd" wake-ups that could occur when using | |
475 | .B FUTEX_WAKE | |
476 | in cases where all of the waiters that are woken need to acquire | |
477 | another futex. | |
478 | Consider the following scenario, | |
479 | where multiple waiter threads are waiting on B, | |
480 | a wait queue implemented using a futex: | |
481 | ||
482 | .in +4n | |
483 | .nf | |
484 | lock(A) | |
485 | while (!check_value(V)) { | |
486 | unlock(A); | |
487 | block_on(B); | |
488 | lock(A); | |
489 | }; | |
490 | unlock(A); | |
491 | .fi | |
492 | .in | |
493 | ||
494 | If a waker thread used | |
495 | .BR FUTEX_WAKE , | |
496 | then all waiters waiting on B would be woken up, | |
497 | and they would would all try to acquire lock A. | |
498 | However, waking all of the threads in this manner would be pointless because | |
499 | all except one of the threads would immediately block on lock A again. | |
500 | By contrast, a requeue operation wakes just one waiter and moves | |
501 | the other waiters to lock A, | |
502 | and when the woken waiter unlocks A then the next waiter can proceed. | |
43d16602 | 503 | .\" |
70b06b90 MK |
504 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
505 | .\" | |
a5956430 MK |
506 | .\" FIXME I added a lengthy piece of text on FUTEX_WAKE_OP text, |
507 | .\" and I'd be happy if someone checked it. | |
fea681da | 508 | .TP |
d67e21f5 MK |
509 | .BR FUTEX_WAKE_OP " (since Linux 2.6.14)" |
510 | .\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721 | |
6bac3b85 MK |
511 | .\" Author: Jakub Jelinek <jakub@redhat.com> |
512 | .\" Date: Tue Sep 6 15:16:25 2005 -0700 | |
4c8cb0ff MK |
513 | .\" FIXME(Torvald) The glibc condvar implementation is currently being |
514 | .\" revised (e.g., to not use an internal lock anymore). | |
515 | .\" It is probably more future-proof to remove this paragraph. | |
d6bb5a38 | 516 | .\" [Torvald, do you have an update here?] |
6bac3b85 MK |
517 | This operation was added to support some user-space use cases |
518 | where more than one futex must be handled at the same time. | |
519 | The most notable example is the implementation of | |
520 | .BR pthread_cond_signal (3), | |
521 | which requires operations on two futexes, | |
522 | the one used to implement the mutex and the one used in the implementation | |
523 | of the wait queue associated with the condition variable. | |
524 | .BR FUTEX_WAKE_OP | |
525 | allows such cases to be implemented without leading to | |
526 | high rates of contention and context switching. | |
527 | ||
528 | The | |
529 | .BR FUTEX_WAIT_OP | |
e61abc20 | 530 | operation is equivalent to executing the following code atomically |
4c8cb0ff MK |
531 | and totally ordered with respect to other futex operations on |
532 | any of the two supplied futex words: | |
6bac3b85 MK |
533 | |
534 | .in +4n | |
535 | .nf | |
536 | int oldval = *(int *) uaddr2; | |
537 | *(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP; | |
538 | futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); | |
539 | if (oldval \fIcmp\fP \fIcmparg\fP) | |
768d3c23 | 540 | futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); |
6bac3b85 MK |
541 | .fi |
542 | .in | |
543 | ||
544 | In other words, | |
545 | .BR FUTEX_WAIT_OP | |
546 | does the following: | |
547 | .RS | |
548 | .IP * 3 | |
4b35dc5d TR |
549 | saves the original value of the futex word at |
550 | .IR uaddr2 | |
551 | and performs an operation to modify the value of the futex at | |
6bac3b85 | 552 | .IR uaddr2 ; |
4c8cb0ff MK |
553 | this is an atomic read-modify-write memory access (i.e., using atomic |
554 | machine instructions of the respective architecture) | |
6bac3b85 MK |
555 | .IP * |
556 | wakes up a maximum of | |
557 | .I val | |
4b35dc5d | 558 | waiters on the futex for the futex word at |
6bac3b85 MK |
559 | .IR uaddr ; |
560 | and | |
561 | .IP * | |
4c8cb0ff MK |
562 | dependent on the results of a test of the original value of the |
563 | futex word at | |
6bac3b85 MK |
564 | .IR uaddr2 , |
565 | wakes up a maximum of | |
768d3c23 | 566 | .I val2 |
4b35dc5d | 567 | waiters on the futex for the futex word at |
6bac3b85 MK |
568 | .IR uaddr2 . |
569 | .RE | |
570 | .IP | |
6bac3b85 MK |
571 | The operation and comparison that are to be performed are encoded |
572 | in the bits of the argument | |
573 | .IR val3 . | |
574 | Pictorially, the encoding is: | |
575 | ||
f6af90e7 | 576 | .in +8n |
6bac3b85 | 577 | .nf |
f6af90e7 MK |
578 | +---+---+-----------+-----------+ |
579 | |op |cmp| oparg | cmparg | | |
580 | +---+---+-----------+-----------+ | |
581 | 4 4 12 12 <== # of bits | |
6bac3b85 MK |
582 | .fi |
583 | .in | |
584 | ||
585 | Expressed in code, the encoding is: | |
586 | ||
587 | .in +4n | |
588 | .nf | |
589 | #define FUTEX_OP(op, oparg, cmp, cmparg) \\ | |
590 | (((op & 0xf) << 28) | \\ | |
591 | ((cmp & 0xf) << 24) | \\ | |
592 | ((oparg & 0xfff) << 12) | \\ | |
593 | (cmparg & 0xfff)) | |
594 | .fi | |
595 | .in | |
596 | ||
597 | In the above, | |
598 | .I op | |
599 | and | |
600 | .I cmp | |
601 | are each one of the codes listed below. | |
602 | The | |
603 | .I oparg | |
604 | and | |
605 | .I cmparg | |
606 | components are literal numeric values, except as noted below. | |
607 | ||
608 | The | |
609 | .I op | |
610 | component has one of the following values: | |
611 | ||
612 | .in +4n | |
613 | .nf | |
614 | FUTEX_OP_SET 0 /* uaddr2 = oparg; */ | |
615 | FUTEX_OP_ADD 1 /* uaddr2 += oparg; */ | |
616 | FUTEX_OP_OR 2 /* uaddr2 |= oparg; */ | |
617 | FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */ | |
618 | FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */ | |
619 | .fi | |
620 | .in | |
621 | ||
622 | In addition, bit-wise ORing the following value into | |
623 | .I op | |
624 | causes | |
625 | .IR "(1\ <<\ oparg)" | |
626 | to be used as the operand: | |
627 | ||
628 | .in +4n | |
629 | .nf | |
630 | FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */ | |
631 | .fi | |
632 | .in | |
633 | ||
634 | The | |
635 | .I cmp | |
636 | field is one of the following: | |
637 | ||
638 | .in +4n | |
639 | .nf | |
640 | FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */ | |
641 | FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */ | |
642 | FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */ | |
643 | FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */ | |
644 | FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */ | |
645 | FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */ | |
646 | .fi | |
647 | .in | |
648 | ||
649 | The return value of | |
650 | .BR FUTEX_WAKE_OP | |
651 | is the sum of the number of waiters woken on the futex | |
652 | .IR uaddr | |
653 | plus the number of waiters woken on the futex | |
654 | .IR uaddr2 . | |
70b06b90 MK |
655 | .\" |
656 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
657 | .\" | |
d67e21f5 | 658 | .TP |
79c9b436 TG |
659 | .BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)" |
660 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
fd9e59d4 | 661 | This operation is like |
79c9b436 TG |
662 | .BR FUTEX_WAIT |
663 | except that | |
664 | .I val3 | |
665 | is used to provide a 32-bit bitset to the kernel. | |
666 | This bitset is stored in the kernel-internal state of the waiter. | |
667 | See the description of | |
668 | .BR FUTEX_WAKE_BITSET | |
669 | for further details. | |
670 | ||
fd9e59d4 MK |
671 | The |
672 | .BR FUTEX_WAIT_BITSET | |
9732dd8b | 673 | operation also interprets the |
fd9e59d4 MK |
674 | .I timeout |
675 | argument differently from | |
676 | .BR FUTEX_WAIT . | |
677 | See the discussion of | |
678 | .BR FUTEX_CLOCK_REALTIME , | |
679 | above. | |
680 | ||
79c9b436 TG |
681 | The |
682 | .I uaddr2 | |
683 | argument is ignored. | |
70b06b90 MK |
684 | .\" |
685 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
686 | .\" | |
79c9b436 | 687 | .TP |
d67e21f5 MK |
688 | .BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)" |
689 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
55cc422d TG |
690 | This operation is the same as |
691 | .BR FUTEX_WAKE | |
692 | except that the | |
693 | .I val3 | |
694 | argument is used to provide a 32-bit bitset to the kernel. | |
98d769c0 MK |
695 | This bitset is used to select which waiters should be woken up. |
696 | The selection is done by a bit-wise AND of the "wake" bitset | |
697 | (i.e., the value in | |
698 | .IR val3 ) | |
699 | and the bitset which is stored in the kernel-internal | |
09cb4ce7 | 700 | state of the waiter (the "wait" bitset that is set using |
98d769c0 MK |
701 | .BR FUTEX_WAIT_BITSET ). |
702 | All of the waiters for which the result of the AND is nonzero are woken up; | |
703 | the remaining waiters are left sleeping. | |
704 | ||
e9d4496b MK |
705 | The effect of |
706 | .BR FUTEX_WAIT_BITSET | |
707 | and | |
708 | .BR FUTEX_WAKE_BITSET | |
9732dd8b MK |
709 | is to allow selective wake-ups among multiple waiters that are blocked |
710 | on the same futex. | |
ac894879 MK |
711 | However, note that, depending on the use case, |
712 | employing this bitset multiplexing feature on a | |
713 | futex can be less efficient than simply using multiple futexes, | |
e9d4496b MK |
714 | because employing bitset multiplexing requires the kernel |
715 | to check all waiters on a futex, | |
716 | including those that are not interested in being woken up | |
717 | (i.e., they do not have the relevant bit set in their "wait" bitset). | |
718 | .\" According to http://locklessinc.com/articles/futex_cheat_sheet/: | |
719 | .\" | |
720 | .\" "The original reason for the addition of these extensions | |
721 | .\" was to improve the performance of pthread read-write locks | |
722 | .\" in glibc. However, the pthreads library no longer uses the | |
723 | .\" same locking algorithm, and these extensions are not used | |
724 | .\" without the bitset parameter being all ones. | |
725 | .\" | |
726 | .\" The page goes on to note that the FUTEX_WAIT_BITSET operation | |
727 | .\" is nevertheless used (with a bitset of all ones) in order to | |
728 | .\" obtain the absolute timeout functionality that is useful | |
729 | .\" for efficiently implementing Pthreads APIs (which use absolute | |
730 | .\" timeouts); FUTEX_WAIT provides only relative timeouts. | |
731 | ||
98d769c0 MK |
732 | The |
733 | .I uaddr2 | |
734 | and | |
735 | .I timeout | |
736 | arguments are ignored. | |
9732dd8b MK |
737 | |
738 | The | |
739 | .BR FUTEX_WAIT | |
740 | and | |
741 | .BR FUTEX_WAKE | |
742 | operations correspond to | |
743 | .BR FUTEX_WAIT_BITSET | |
744 | and | |
745 | .BR FUTEX_WAKE_BITSET | |
746 | operations where the bitsets are all ones. | |
bd90a5f9 | 747 | .\" |
70b06b90 | 748 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
bd90a5f9 MK |
749 | .\" |
750 | .SS Priority-inheritance futexes | |
b52e1cd4 MK |
751 | Linux supports priority-inheritance (PI) futexes in order to handle |
752 | priority-inversion problems that can be encountered with | |
753 | normal futex locks. | |
b565548b | 754 | Priority inversion is the problem that occurs when a high-priority |
bdc5957a MK |
755 | task is blocked waiting to acquire a lock held by a low-priority task, |
756 | while tasks at an intermediate priority continuously preempt | |
757 | the low-priority task from the CPU. | |
758 | Consequently, the low-priority task makes no progress toward | |
759 | releasing the lock, and the high-priority task remains blocked. | |
7f315ae3 | 760 | |
7d20efd7 MK |
761 | Priority inheritance is a mechanism for dealing with |
762 | the priority-inversion problem. | |
bdc5957a MK |
763 | With this mechanism, when a high-priority task becomes blocked |
764 | by a lock held by a low-priority task, | |
9cee832c MK |
765 | the priority of the low-priority task is temporarily raised |
766 | to that of the high-priority task, | |
bdc5957a | 767 | so that it is not preempted by any intermediate level tasks, |
7d20efd7 MK |
768 | and can thus make progress toward releasing the lock. |
769 | To be effective, priority inheritance must be transitive, | |
bdc5957a MK |
770 | meaning that if a high-priority task blocks on a lock |
771 | held by a lower-priority task that is itself blocked by lock | |
772 | held by another intermediate-priority task | |
7d20efd7 | 773 | (and so on, for chains of arbitrary length), |
b0f35fbb | 774 | then both of those tasks |
bdc5957a MK |
775 | (or more generally, all of the tasks in a lock chain) |
776 | have their priorities raised to be the same as the high-priority task. | |
7d20efd7 | 777 | |
9e2b90ee | 778 | From a user-space perspective, |
39e9b2e1 MK |
779 | what makes a futex PI-aware is a policy agreement (described below) |
780 | between user space and the kernel about the value of the futex word, | |
f0a9e8f4 | 781 | coupled with the use of the PI-futex operations described below |
9e2b90ee MK |
782 | (in particular, |
783 | .BR FUTEX_LOCK_PI , | |
784 | .BR FUTEX_TRYLOCK_PI , | |
785 | and | |
786 | .BR FUTEX_CMP_REQUEUE_PI ). | |
787 | .\" Quoting Darren Hart: | |
788 | .\" These opcodes paired with the PI futex value policy (described below) | |
789 | .\" defines a "futex" as PI aware. These were created very specifically | |
790 | .\" in support of PI pthread_mutexes, so it makes a lot more sense to | |
791 | .\" talk about a PI aware pthread_mutex, than a PI aware futex, since | |
792 | .\" there is a lot of policy and scaffolding that has to be built up | |
793 | .\" around it to use it properly (this is what a PI pthread_mutex is). | |
794 | ||
ac894879 | 795 | .\" mtk: The following text is drawn from the Hart/Guniguntala paper |
1af427a4 | 796 | .\" (listed in SEE ALSO), but I have reworded some pieces |
f0a9e8f4 | 797 | .\" significntly. |
79d918c7 | 798 | .\" |
f0a9e8f4 | 799 | The PI-futex operations described below differ from the other |
4b35dc5d TR |
800 | futex operations in that they impose policy on the use of the value of the |
801 | futex word: | |
79d918c7 | 802 | .IP * 3 |
4b35dc5d | 803 | If the lock is not acquired, the futex word's value shall be 0. |
79d918c7 | 804 | .IP * |
4c8cb0ff MK |
805 | If the lock is acquired, the futex word's value shall |
806 | be the thread ID (TID; | |
4b35dc5d | 807 | see |
79d918c7 MK |
808 | .BR gettid (2)) |
809 | of the owning thread. | |
810 | .IP * | |
79d918c7 MK |
811 | If the lock is owned and there are threads contending for the lock, |
812 | then the | |
813 | .B FUTEX_WAITERS | |
4b35dc5d | 814 | bit shall be set in the futex word's value; in other words, this value is: |
79d918c7 MK |
815 | |
816 | FUTEX_WAITERS | TID | |
9e2b90ee | 817 | |
79d918c7 | 818 | .PP |
aab5ca4b | 819 | Note that a PI futex word never has just the value |
9e2b90ee MK |
820 | .BR FUTEX_WAITERS , |
821 | which is a permissible state for non-PI futexes. | |
822 | ||
79d918c7 | 823 | With this policy in place, |
fd105614 | 824 | a user-space application can acquire an unacquired |
4b35dc5d | 825 | lock or release a lock that no other threads try to acquire using atomic |
fd105614 MK |
826 | instructions executed in user mode |
827 | (e.g., a compare-and-swap operation such as | |
b52e1cd4 MK |
828 | .I cmpxchg |
829 | on the x86 architecture). | |
4c8cb0ff MK |
830 | Acquiring a lock simply consists of using compare-and-swap to atomically |
831 | set the futex word's value to the caller's TID if its previous value was 0. | |
4b35dc5d TR |
832 | Releasing a lock requires using compare-and-swap to set the futex word's |
833 | value to 0 if the previous value was the expected TID. | |
b52e1cd4 | 834 | |
4b35dc5d | 835 | If a futex is already acquired (i.e., has a nonzero value), |
b52e1cd4 | 836 | waiters must employ the |
79d918c7 MK |
837 | .B FUTEX_LOCK_PI |
838 | operation to acquire the lock. | |
4b35dc5d | 839 | If other threads are waiting for the lock, then the |
79d918c7 | 840 | .B FUTEX_WAITERS |
4c8cb0ff MK |
841 | bit is set in the futex value; |
842 | in this case, the lock owner must employ the | |
79d918c7 | 843 | .B FUTEX_UNLOCK_PI |
b52e1cd4 MK |
844 | operation to release the lock. |
845 | ||
79d918c7 MK |
846 | In the cases where callers are forced into the kernel |
847 | (i.e., required to perform a | |
848 | .BR futex () | |
0c3ec26b | 849 | call), |
79d918c7 MK |
850 | they then deal directly with a so-called RT-mutex, |
851 | a kernel locking mechanism which implements the required | |
852 | priority-inheritance semantics. | |
853 | After the RT-mutex is acquired, the futex value is updated accordingly, | |
854 | before the calling thread returns to user space. | |
79d918c7 | 855 | |
a59fca75 | 856 | It is important to note |
ac894879 | 857 | .\" tglx (July 2015): |
30239c10 MK |
858 | .\" If there are multiple waiters on a pi futex then a wake pi operation |
859 | .\" will wake the first waiter and hand over the lock to this waiter. This | |
860 | .\" includes handing over the rtmutex which represents the futex in the | |
861 | .\" kernel. The strict requirement is that the futex owner and the rtmutex | |
862 | .\" owner must be the same, except for the update period which is | |
863 | .\" serialized by the futex internal locking. That means the kernel must | |
864 | .\" update the user space value prior to returning to user space | |
4b35dc5d | 865 | that the kernel will update the futex word's value prior |
79d918c7 MK |
866 | to returning to user space. |
867 | Unlike the other futex operations described above, | |
f0a9e8f4 | 868 | the PI-futex operations are designed |
d9d5be6b | 869 | for the implementation of very specific IPC mechanisms. |
fc57e6bb | 870 | .\" |
7bd3ffbc | 871 | .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart |
99c0ac69 MK |
872 | .\" made the observation that "EINVAL is returned if the non-pi |
873 | .\" to pi or op pairing semantics are violated." | |
874 | .\" Probably there needs to be a general statement about this | |
875 | .\" requirement, probably located at about this point in the page. | |
d6bb5a38 | 876 | .\" Darren (or someone else), care to take a shot at this? |
30239c10 MK |
877 | .\" tglx (July 2015): |
878 | .\" Well, that's hard to describe because the kernel only has a limited | |
879 | .\" way of detecting such mismatches. It only can detect it when there are | |
880 | .\" non PI waiters on a futex and a PI function is called or vice versa. | |
881 | ||
882 | If a futex has an associated RT-mutex in the kernel | |
883 | (i.e., there are blocked waiters) | |
884 | and the owner of the futex/RT-mutex dies unexpectedly, | |
885 | then the kernel cleans up the RT-mutex and hands it over to the next waiter. | |
886 | This in turn requires that the user-space value is updated accordingly. | |
887 | To indicate that this is required, the kernel sets the | |
888 | .B FUTEX_OWNER_DIED | |
889 | bit in the futex word along with the thread ID of the new owner. | |
890 | User space is then responsible for cleaning this up | |
891 | (though there are cases where the kernel does the cleanup). | |
892 | .\" tglx (July 2015): | |
893 | .\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where | |
894 | .\" the kernel has no state associated. This happens via the robust futex | |
895 | .\" mechanism. In that case the futex value will be set to | |
896 | .\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non | |
897 | .\" PI futexes. | |
bd90a5f9 MK |
898 | |
899 | PI futexes are operated on by specifying one of the following values in | |
900 | .IR futex_op : | |
70b06b90 MK |
901 | .\" |
902 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
903 | .\" | |
d67e21f5 MK |
904 | .TP |
905 | .BR FUTEX_LOCK_PI " (since Linux 2.6.18)" | |
906 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc | |
67833bec | 907 | This operation is used after after an attempt to acquire |
fd105614 | 908 | the lock via an atomic user-mode instruction failed |
4b35dc5d | 909 | because the futex word has a nonzero value\(emspecifically, |
8297383e | 910 | because it contained the (PID-namespace-specific) TID of the lock owner. |
67833bec | 911 | |
4b35dc5d | 912 | The operation checks the value of the futex word at the address |
67833bec | 913 | .IR uaddr . |
70b06b90 MK |
914 | If the value is 0, then the kernel tries to atomically set |
915 | the futex value to the caller's TID. | |
c3875d1d | 916 | If the futex word's value is nonzero, |
67833bec | 917 | the kernel atomically sets the |
e0547e70 | 918 | .B FUTEX_WAITERS |
67833bec MK |
919 | bit, which signals the futex owner that it cannot unlock the futex in |
920 | user space atomically by setting the futex value to 0. | |
c3875d1d MK |
921 | .\" tglx (July 2015): |
922 | .\" The operation here is similar to the FUTEX_WAIT logic. When the user | |
923 | .\" space atomic acquire does not succeed because the futex value was non | |
924 | .\" zero, then the waiter goes into the kernel, takes the kernel internal | |
925 | .\" lock and retries the acquisition under the lock. If the acquisition | |
926 | .\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal | |
927 | .\" the lock owner that it needs to go into the kernel. Here is the pseudo | |
928 | .\" code: | |
929 | .\" | |
930 | .\" lock(kernel_lock); | |
931 | .\" retry: | |
932 | .\" | |
933 | .\" /* | |
934 | .\" * Owner might have unlocked in userspace before we | |
935 | .\" * were able to set the waiter bit. | |
936 | .\" */ | |
937 | .\" if (atomic_acquire(futex) == SUCCESS) { | |
938 | .\" unlock(kernel_lock()); | |
939 | .\" return 0; | |
940 | .\" } | |
941 | .\" | |
942 | .\" /* | |
943 | .\" * Owner might have unlocked after the above atomic_acquire() | |
944 | .\" * attempt. | |
945 | .\" */ | |
946 | .\" if (atomic_set_waiters_bit(futex) != SUCCESS) | |
947 | .\" goto retry; | |
948 | .\" | |
949 | .\" queue_waiter(); | |
950 | .\" unlock(kernel_lock); | |
951 | .\" block(); | |
952 | .\" | |
953 | After that, the kernel: | |
954 | .RS | |
955 | .IP 1. 3 | |
956 | Tries to find the thread which is associated with the owner TID. | |
957 | .IP 2. | |
958 | Creates or reuses kernel state on behalf of the owner. | |
959 | (If this is the first waiter, there is no kernel state for this | |
960 | futex, so kernel state is created by locking the RT-mutex | |
961 | and the futex owner is made the owner of the RT-mutex. | |
962 | If there are existing waiters, then the existing state is reused.) | |
963 | .IP 3. | |
964 | Attaches the waiter to it | |
965 | (i.e., the waiter is enqueued on the RT-mutex waiter list). | |
966 | .RE | |
967 | .IP | |
ac894879 MK |
968 | If more than one waiter exists, |
969 | the enqueueing of the waiter is in descending priority order. | |
970 | (For information on priority ordering, see the discussion of the | |
971 | .BR SCHED_DEADLINE , | |
972 | .BR SCHED_FIFO , | |
973 | and | |
974 | .BR SCHED_RR | |
975 | scheduling policies in | |
976 | .BR sched (7).) | |
977 | The owner inherits either the waiter's CPU bandwidth | |
978 | (if the waiter is scheduled under the | |
979 | .BR SCHED_DEADLINE | |
980 | policy) or the waiter's priority (if the waiter is scheduled under the | |
981 | .BR SCHED_RR | |
982 | or | |
983 | .BR SCHED_FIFO | |
984 | policy). | |
c3875d1d MK |
985 | This inheritance follows the lock chain in the case of nested locking |
986 | (i.e., task 1 blocks on lock A, held by task 2, | |
987 | while task 2 blocks on lock B, held by task 3) | |
988 | and performs deadlock detection. | |
989 | ||
e0547e70 TG |
990 | The |
991 | .I timeout | |
9ce19cf1 MK |
992 | argument provides a timeout for the lock attempt. |
993 | It is interpreted as an absolute time, measured against the | |
994 | .BR CLOCK_REALTIME | |
995 | clock. | |
996 | If | |
997 | .I timeout | |
998 | is NULL, the operation will block indefinitely. | |
e0547e70 | 999 | |
a449c634 | 1000 | The |
e0547e70 TG |
1001 | .IR uaddr2 , |
1002 | .IR val , | |
1003 | and | |
1004 | .IR val3 | |
a449c634 | 1005 | arguments are ignored. |
67833bec | 1006 | .\" |
70b06b90 MK |
1007 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
1008 | .\" | |
d67e21f5 | 1009 | .TP |
12fdbe23 | 1010 | .BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 1011 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
12fdbe23 MK |
1012 | This operation tries to acquire the futex at |
1013 | .IR uaddr . | |
c3875d1d MK |
1014 | It is invoked when a user-space atomic acquire did not |
1015 | succeed because the futex word was not 0. | |
1016 | ||
41b6c6a0 MK |
1017 | .\" FIXME The wording "The trylock in kernel" below needs clarification. |
1018 | .\" Suggestions? | |
1019 | .\" | |
c3875d1d MK |
1020 | The trylock in kernel might succeed because the futex word |
1021 | contains stale state | |
1022 | .RB ( FUTEX_WAITERS | |
1023 | and/or | |
1024 | .BR FUTEX_OWNER_DIED ). | |
1025 | This can happen when the owner of the futex died. | |
0b761826 | 1026 | .\" FIXME I think it would be helpful here to say a few more words about |
70b06b90 MK |
1027 | .\" the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI. |
1028 | .\" Can someone propose something? | |
1029 | .\" | |
a282e5b0 | 1030 | User space cannot handle this condition in a race-free manner |
084744ef MK |
1031 | |
1032 | The | |
1033 | .IR uaddr2 , | |
1034 | .IR val , | |
1035 | .IR timeout , | |
1036 | and | |
1037 | .IR val3 | |
1038 | arguments are ignored. | |
70b06b90 MK |
1039 | .\" |
1040 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1041 | .\" | |
d67e21f5 | 1042 | .TP |
12fdbe23 | 1043 | .BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 1044 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
d4ba4328 | 1045 | This operation wakes the top priority waiter that is waiting in |
ecae2099 TG |
1046 | .B FUTEX_LOCK_PI |
1047 | on the futex address provided by the | |
1048 | .I uaddr | |
1049 | argument. | |
1050 | ||
1051 | This is called when the user space value at | |
1052 | .I uaddr | |
1053 | cannot be changed atomically from a TID (of the owner) to 0. | |
1054 | ||
1055 | The | |
1056 | .IR uaddr2 , | |
1057 | .IR val , | |
1058 | .IR timeout , | |
1059 | and | |
1060 | .IR val3 | |
11a194bf | 1061 | arguments are ignored. |
70b06b90 MK |
1062 | .\" |
1063 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1064 | .\" | |
d67e21f5 | 1065 | .TP |
d67e21f5 MK |
1066 | .BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)" |
1067 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
f812a08b DH |
1068 | This operation is a PI-aware variant of |
1069 | .BR FUTEX_CMP_REQUEUE . | |
1070 | It requeues waiters that are blocked via | |
1071 | .B FUTEX_WAIT_REQUEUE_PI | |
1072 | on | |
1073 | .I uaddr | |
1074 | from a non-PI source futex | |
1075 | .RI ( uaddr ) | |
1076 | to a PI target futex | |
1077 | .RI ( uaddr2 ). | |
1078 | ||
9e54d26d MK |
1079 | As with |
1080 | .BR FUTEX_CMP_REQUEUE , | |
1081 | this operation wakes up a maximum of | |
1082 | .I val | |
1083 | waiters that are waiting on the futex at | |
1084 | .IR uaddr . | |
1085 | However, for | |
1086 | .BR FUTEX_CMP_REQUEUE_PI , | |
1087 | .I val | |
6fbeb8f4 | 1088 | is required to be 1 |
939ca89f | 1089 | (since the main point is to avoid a thundering herd). |
9e54d26d MK |
1090 | The remaining waiters are removed from the wait queue of the source futex at |
1091 | .I uaddr | |
1092 | and added to the wait queue of the target futex at | |
1093 | .IR uaddr2 . | |
f812a08b | 1094 | |
9e54d26d | 1095 | The |
768d3c23 | 1096 | .I val2 |
c6d8cf21 MK |
1097 | .\" val2 is the cap on the number of requeued waiters. |
1098 | .\" In the glibc pthread_cond_broadcast() implementation, this argument | |
1099 | .\" is specified as INT_MAX, and for pthread_cond_signal() it is 0. | |
9e54d26d | 1100 | and |
768d3c23 | 1101 | .I val3 |
9e54d26d MK |
1102 | arguments serve the same purposes as for |
1103 | .BR FUTEX_CMP_REQUEUE . | |
70b06b90 | 1104 | .\" |
8297383e | 1105 | .\" The page at http://locklessinc.com/articles/futex_cheat_sheet/ |
be376673 | 1106 | .\" notes that "priority-inheritance Futex to priority-inheritance |
8297383e MK |
1107 | .\" Futex requeues are currently unsupported". However, probably |
1108 | .\" the page does not need to say nothing about this, since | |
1109 | .\" Thomas Gleixner commented (July 2015): "they never will be | |
1110 | .\" supported because they make no sense at all" | |
70b06b90 MK |
1111 | .\" |
1112 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1113 | .\" | |
d67e21f5 MK |
1114 | .TP |
1115 | .BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)" | |
1116 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
70b06b90 | 1117 | .\" |
c3875d1d | 1118 | Wait on a non-PI futex at |
6ff1b4c0 | 1119 | .I uaddr |
c3875d1d MK |
1120 | and potentially be requeued (via a |
1121 | .BR FUTEX_CMP_REQUEUE_PI | |
1122 | operation in another task) onto a PI futex at | |
6ff1b4c0 TG |
1123 | .IR uaddr2 . |
1124 | The wait operation on | |
1125 | .I uaddr | |
c3875d1d | 1126 | is the same as for |
6ff1b4c0 | 1127 | .BR FUTEX_WAIT . |
c3875d1d | 1128 | |
6ff1b4c0 TG |
1129 | The waiter can be removed from the wait on |
1130 | .I uaddr | |
6ff1b4c0 | 1131 | without requeueing on |
c3875d1d MK |
1132 | .IR uaddr2 |
1133 | via a | |
1134 | .BR FUTEX_WAIT | |
1135 | operation in another task. | |
1136 | In this case, the | |
1137 | .BR FUTEX_WAIT_REQUEUE_PI | |
1138 | operation returns with the error | |
1139 | .BR EWOULDBLOCK . | |
a4e69912 | 1140 | |
63bea7dc MK |
1141 | If |
1142 | .I timeout | |
1143 | is not NULL, it specifies a timeout for the wait operation; | |
1144 | this timeout is interpreted as outlined above in the description of the | |
1145 | .BR FUTEX_CLOCK_REALTIME | |
1146 | option. | |
1147 | If | |
1148 | .I timeout | |
1149 | is NULL, the operation can block indefinitely. | |
1150 | ||
a4e69912 MK |
1151 | The |
1152 | .I val3 | |
1153 | argument is ignored. | |
abb571e8 MK |
1154 | |
1155 | The | |
1156 | .BR FUTEX_WAIT_REQUEUE_PI | |
1157 | and | |
1158 | .BR FUTEX_CMP_REQUEUE_PI | |
1159 | were added to support a fairly specific use case: | |
1160 | support for priority-inheritance-aware POSIX threads condition variables. | |
1161 | The idea is that these operations should always be paired, | |
1162 | in order to ensure that user space and the kernel remain in sync. | |
1163 | Thus, in the | |
1164 | .BR FUTEX_WAIT_REQUEUE_PI | |
1165 | operation, the user-space application pre-specifies the target | |
1166 | of the requeue that takes place in the | |
1167 | .BR FUTEX_CMP_REQUEUE_PI | |
1168 | operation. | |
1169 | .\" | |
1170 | .\" Darren Hart notes that a patch to allow glibc to fully support | |
1af427a4 | 1171 | .\" PI-aware pthreads condition variables has not yet been accepted into |
abb571e8 MK |
1172 | .\" glibc. The story is complex, and can be found at |
1173 | .\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588 | |
1174 | .\" Darren notes that in the meantime, the patch is shipped with various | |
1af427a4 | 1175 | .\" PREEMPT_RT-enabled Linux systems. |
abb571e8 MK |
1176 | .\" |
1177 | .\" Related to the preceding, Darren proposed that somewhere, man-pages | |
1178 | .\" should document the following point: | |
1af427a4 | 1179 | .\" |
4c8cb0ff MK |
1180 | .\" While the Linux kernel, since 2.6.31, supports requeueing of |
1181 | .\" priority-inheritance (PI) aware mutexes via the | |
1182 | .\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations, | |
1183 | .\" the glibc implementation does not yet take full advantage of this. | |
1184 | .\" Specifically, the condvar internal data lock remains a non-PI aware | |
1185 | .\" mutex, regardless of the type of the pthread_mutex associated with | |
1186 | .\" the condvar. This can lead to an unbounded priority inversion on | |
1187 | .\" the internal data lock even when associating a PI aware | |
1188 | .\" pthread_mutex with a condvar during a pthread_cond*_wait | |
1189 | .\" operation. For this reason, it is not recommended to rely on | |
1190 | .\" priority inheritance when using pthread condition variables. | |
1af427a4 MK |
1191 | .\" |
1192 | .\" The problem is that the obvious location for this text is | |
1193 | .\" the pthread_cond*wait(3) man page. However, such a man page | |
abb571e8 | 1194 | .\" does not currently exist. |
70b06b90 | 1195 | .\" |
6700de24 | 1196 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
70b06b90 | 1197 | .\" |
47297adb | 1198 | .SH RETURN VALUE |
fea681da | 1199 | .PP |
a5c5a06a MK |
1200 | In the event of an error (and assuming that |
1201 | .BR futex () | |
1202 | was invoked via | |
1203 | .BR syscall (2)), | |
1204 | all operations return \-1 and set | |
e808bba0 | 1205 | .I errno |
6f147f79 | 1206 | to indicate the cause of the error. |
adc89a97 | 1207 | |
e808bba0 MK |
1208 | The return value on success depends on the operation, |
1209 | as described in the following list: | |
fea681da MK |
1210 | .TP |
1211 | .B FUTEX_WAIT | |
077981d4 | 1212 | Returns 0 if the caller was woken up. |
4c8cb0ff MK |
1213 | Note that a wake-up can also be caused by common futex usage patterns |
1214 | in unrelated code that happened to have previously used the futex word's | |
1215 | memory location (e.g., typical futex-based implementations of | |
1216 | Pthreads mutexes can cause this under some conditions). | |
1217 | Therefore, callers should always conservatively assume that a return | |
1218 | value of 0 can mean a spurious wake-up, and use the futex word's value | |
1219 | (i.e., the user space synchronization scheme) | |
1220 | to decide whether to continue to block or not. | |
fea681da MK |
1221 | .TP |
1222 | .B FUTEX_WAKE | |
bdc5957a | 1223 | Returns the number of waiters that were woken up. |
fea681da MK |
1224 | .TP |
1225 | .B FUTEX_FD | |
1226 | Returns the new file descriptor associated with the futex. | |
1227 | .TP | |
1228 | .B FUTEX_REQUEUE | |
bdc5957a | 1229 | Returns the number of waiters that were woken up. |
fea681da MK |
1230 | .TP |
1231 | .B FUTEX_CMP_REQUEUE | |
bdc5957a | 1232 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1233 | requeued to the futex for the futex word at |
3dfcc11d MK |
1234 | .IR uaddr2 . |
1235 | If this value is greater than | |
1236 | .IR val , | |
fd105614 | 1237 | then the difference is the number of waiters requeued to the futex for the |
4c8cb0ff | 1238 | futex word at |
3dfcc11d | 1239 | .IR uaddr2 . |
dcad19c0 MK |
1240 | .TP |
1241 | .B FUTEX_WAKE_OP | |
a8b5b324 | 1242 | Returns the total number of waiters that were woken up. |
4c8cb0ff MK |
1243 | This is the sum of the woken waiters on the two futexes for |
1244 | the futex words at | |
a8b5b324 MK |
1245 | .I uaddr |
1246 | and | |
1247 | .IR uaddr2 . | |
dcad19c0 MK |
1248 | .TP |
1249 | .B FUTEX_WAIT_BITSET | |
077981d4 MK |
1250 | Returns 0 if the caller was woken up. |
1251 | See | |
4b35dc5d TR |
1252 | .B FUTEX_WAIT |
1253 | for how to interpret this correctly in practice. | |
dcad19c0 MK |
1254 | .TP |
1255 | .B FUTEX_WAKE_BITSET | |
bdc5957a | 1256 | Returns the number of waiters that were woken up. |
dcad19c0 MK |
1257 | .TP |
1258 | .B FUTEX_LOCK_PI | |
bf02a260 | 1259 | Returns 0 if the futex was successfully locked. |
dcad19c0 MK |
1260 | .TP |
1261 | .B FUTEX_TRYLOCK_PI | |
5c716eef | 1262 | Returns 0 if the futex was successfully locked. |
dcad19c0 MK |
1263 | .TP |
1264 | .B FUTEX_UNLOCK_PI | |
52bb928f | 1265 | Returns 0 if the futex was successfully unlocked. |
dcad19c0 MK |
1266 | .TP |
1267 | .B FUTEX_CMP_REQUEUE_PI | |
bdc5957a | 1268 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1269 | requeued to the futex for the futex word at |
dddd395a MK |
1270 | .IR uaddr2 . |
1271 | If this value is greater than | |
1272 | .IR val , | |
4c8cb0ff MK |
1273 | then difference is the number of waiters requeued to the futex for |
1274 | the futex word at | |
dddd395a | 1275 | .IR uaddr2 . |
dcad19c0 MK |
1276 | .TP |
1277 | .B FUTEX_WAIT_REQUEUE_PI | |
4c8cb0ff MK |
1278 | Returns 0 if the caller was successfully requeued to the futex for |
1279 | the futex word at | |
22c15de9 | 1280 | .IR uaddr2 . |
70b06b90 MK |
1281 | .\" |
1282 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1283 | .\" | |
fea681da MK |
1284 | .SH ERRORS |
1285 | .TP | |
1286 | .B EACCES | |
4b35dc5d | 1287 | No read access to the memory of a futex word. |
fea681da MK |
1288 | .TP |
1289 | .B EAGAIN | |
f48516d1 | 1290 | .RB ( FUTEX_WAIT , |
4b35dc5d | 1291 | .BR FUTEX_WAIT_BITSET , |
f48516d1 | 1292 | .BR FUTEX_WAIT_REQUEUE_PI ) |
badbf70c MK |
1293 | The value pointed to by |
1294 | .I uaddr | |
1295 | was not equal to the expected value | |
1296 | .I val | |
1297 | at the time of the call. | |
9732dd8b MK |
1298 | |
1299 | .BR Note : | |
1300 | on Linux, the symbolic names | |
1301 | .B EAGAIN | |
1302 | and | |
1303 | .B EWOULDBLOCK | |
77da5feb | 1304 | (both of which appear in different parts of the kernel futex code) |
9732dd8b | 1305 | have the same value. |
badbf70c MK |
1306 | .TP |
1307 | .B EAGAIN | |
8f2068bb MK |
1308 | .RB ( FUTEX_CMP_REQUEUE , |
1309 | .BR FUTEX_CMP_REQUEUE_PI ) | |
ce5602fd | 1310 | The value pointed to by |
9f6c40c0 МК |
1311 | .I uaddr |
1312 | is not equal to the expected value | |
1313 | .IR val3 . | |
fea681da | 1314 | .TP |
5662f56a MK |
1315 | .BR EAGAIN |
1316 | .RB ( FUTEX_LOCK_PI , | |
aaec9032 MK |
1317 | .BR FUTEX_TRYLOCK_PI , |
1318 | .BR FUTEX_CMP_REQUEUE_PI ) | |
1319 | The futex owner thread ID of | |
1320 | .I uaddr | |
1321 | (for | |
1322 | .BR FUTEX_CMP_REQUEUE_PI : | |
1323 | .IR uaddr2 ) | |
1324 | is about to exit, | |
5662f56a MK |
1325 | but has not yet handled the internal state cleanup. |
1326 | Try again. | |
1327 | .TP | |
7a39e745 MK |
1328 | .BR EDEADLK |
1329 | .RB ( FUTEX_LOCK_PI , | |
9732dd8b MK |
1330 | .BR FUTEX_TRYLOCK_PI , |
1331 | .BR FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1332 | The futex word at |
7a39e745 MK |
1333 | .I uaddr |
1334 | is already locked by the caller. | |
1335 | .TP | |
662c0da8 | 1336 | .BR EDEADLK |
c3875d1d | 1337 | .\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some |
d6bb5a38 | 1338 | .\" places, and EDEADLOCK in others. On almost all architectures |
4c8cb0ff MK |
1339 | .\" these constants are synonymous. Is there a reason that both |
1340 | .\" names are used? | |
8297383e MK |
1341 | .\" |
1342 | .\" tglx (July 2015): "No. We should probably fix that." | |
1343 | .\" | |
662c0da8 | 1344 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
4b35dc5d | 1345 | While requeueing a waiter to the PI futex for the futex word at |
662c0da8 MK |
1346 | .IR uaddr2 , |
1347 | the kernel detected a deadlock. | |
1348 | .TP | |
fea681da | 1349 | .B EFAULT |
1ea901e8 MK |
1350 | A required pointer argument (i.e., |
1351 | .IR uaddr , | |
1352 | .IR uaddr2 , | |
1353 | or | |
1354 | .IR timeout ) | |
496df304 | 1355 | did not point to a valid user-space address. |
fea681da | 1356 | .TP |
9f6c40c0 | 1357 | .B EINTR |
e808bba0 | 1358 | A |
9f6c40c0 | 1359 | .B FUTEX_WAIT |
2674f781 MK |
1360 | or |
1361 | .B FUTEX_WAIT_BITSET | |
e808bba0 | 1362 | operation was interrupted by a signal (see |
f529fd20 MK |
1363 | .BR signal (7)). |
1364 | In kernels before Linux 2.6.22, this error could also be returned for | |
1365 | on a spurious wakeup; since Linux 2.6.22, this no longer happens. | |
9f6c40c0 | 1366 | .TP |
fea681da | 1367 | .B EINVAL |
180f97b7 MK |
1368 | The operation in |
1369 | .IR futex_op | |
1370 | is one of those that employs a timeout, but the supplied | |
fb2f4c27 MK |
1371 | .I timeout |
1372 | argument was invalid | |
1373 | .RI ( tv_sec | |
1374 | was less than zero, or | |
1375 | .IR tv_nsec | |
cabee29d | 1376 | was not less than 1,000,000,000). |
fb2f4c27 MK |
1377 | .TP |
1378 | .B EINVAL | |
0c74df0b | 1379 | The operation specified in |
025e1374 | 1380 | .IR futex_op |
0c74df0b | 1381 | employs one or both of the pointers |
51ee94be | 1382 | .I uaddr |
a1f47699 | 1383 | and |
0c74df0b MK |
1384 | .IR uaddr2 , |
1385 | but one of these does not point to a valid object\(emthat is, | |
1386 | the address is not four-byte-aligned. | |
51ee94be MK |
1387 | .TP |
1388 | .B EINVAL | |
55cc422d TG |
1389 | .RB ( FUTEX_WAIT_BITSET , |
1390 | .BR FUTEX_WAKE_BITSET ) | |
79c9b436 TG |
1391 | The bitset supplied in |
1392 | .IR val3 | |
1393 | is zero. | |
1394 | .TP | |
1395 | .B EINVAL | |
2abcba67 | 1396 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
add875c0 MK |
1397 | .I uaddr |
1398 | equals | |
1399 | .IR uaddr2 | |
1400 | (i.e., an attempt was made to requeue to the same futex). | |
1401 | .TP | |
ff597681 MK |
1402 | .BR EINVAL |
1403 | .RB ( FUTEX_FD ) | |
1404 | The signal number supplied in | |
1405 | .I val | |
1406 | is invalid. | |
1407 | .TP | |
6bac3b85 | 1408 | .B EINVAL |
476debd7 MK |
1409 | .RB ( FUTEX_WAKE , |
1410 | .BR FUTEX_WAKE_OP , | |
1411 | .BR FUTEX_WAKE_BITSET , | |
1412 | .BR FUTEX_REQUEUE , | |
1413 | .BR FUTEX_CMP_REQUEUE ) | |
1414 | The kernel detected an inconsistency between the user-space state at | |
1415 | .I uaddr | |
1416 | and the kernel state\(emthat is, it detected a waiter which waits in | |
1417 | .BR FUTEX_LOCK_PI | |
1418 | on | |
1419 | .IR uaddr . | |
1420 | .TP | |
1421 | .B EINVAL | |
a218ef20 | 1422 | .RB ( FUTEX_LOCK_PI , |
ce022f18 MK |
1423 | .BR FUTEX_TRYLOCK_PI , |
1424 | .BR FUTEX_UNLOCK_PI ) | |
a218ef20 MK |
1425 | The kernel detected an inconsistency between the user-space state at |
1426 | .I uaddr | |
1427 | and the kernel state. | |
ce022f18 | 1428 | This indicates either state corruption |
ce022f18 | 1429 | or that the kernel found a waiter on |
a218ef20 MK |
1430 | .I uaddr |
1431 | which is waiting via | |
1432 | .BR FUTEX_WAIT | |
1433 | or | |
1434 | .BR FUTEX_WAIT_BITSET . | |
1435 | .TP | |
1436 | .B EINVAL | |
f9250b1a MK |
1437 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1438 | The kernel detected an inconsistency between the user-space state at | |
99c0041d MK |
1439 | .I uaddr2 |
1440 | and the kernel state; | |
1441 | that is, the kernel detected a waiter which waits via | |
1442 | .BR FUTEX_WAIT | |
8297383e MK |
1443 | or |
1444 | .BR FUTEX_WAIT_BITSET | |
99c0041d MK |
1445 | on |
1446 | .IR uaddr2 . | |
1447 | .TP | |
1448 | .B EINVAL | |
1449 | .RB ( FUTEX_CMP_REQUEUE_PI ) | |
1450 | The kernel detected an inconsistency between the user-space state at | |
f9250b1a MK |
1451 | .I uaddr |
1452 | and the kernel state; | |
1453 | that is, the kernel detected a waiter which waits via | |
75299c8d | 1454 | .BR FUTEX_WAIT |
99c0041d | 1455 | or |
75299c8d | 1456 | .BR FUTEX_WAIT_BITESET |
f9250b1a MK |
1457 | on |
1458 | .IR uaddr . | |
1459 | .TP | |
1460 | .B EINVAL | |
99c0041d | 1461 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
75299c8d MK |
1462 | The kernel detected an inconsistency between the user-space state at |
1463 | .I uaddr | |
1464 | and the kernel state; | |
1465 | that is, the kernel detected a waiter which waits on | |
1466 | .I uaddr | |
1467 | via | |
1468 | .BR FUTEX_LOCK_PI | |
1469 | (instead of | |
1470 | .BR FUTEX_WAIT_REQUEUE_PI ). | |
99c0041d MK |
1471 | .TP |
1472 | .B EINVAL | |
9786b3ca | 1473 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
8297383e MK |
1474 | .\" This deals with the case: |
1475 | .\" wait_requeue_pi(A, B); | |
1476 | .\" requeue_pi(A, C); | |
9786b3ca MK |
1477 | An attempt was made to requeue a waiter to a futex other than that |
1478 | specified by the matching | |
1479 | .B FUTEX_WAIT_REQUEUE_PI | |
1480 | call for that waiter. | |
1481 | .TP | |
1482 | .B EINVAL | |
f0c0d61c MK |
1483 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1484 | The | |
1485 | .I val | |
1486 | argument is not 1. | |
1487 | .TP | |
1488 | .B EINVAL | |
4832b48a | 1489 | Invalid argument. |
fea681da | 1490 | .TP |
a449c634 MK |
1491 | .BR ENOMEM |
1492 | .RB ( FUTEX_LOCK_PI , | |
e34a8fb6 MK |
1493 | .BR FUTEX_TRYLOCK_PI , |
1494 | .BR FUTEX_CMP_REQUEUE_PI ) | |
a449c634 MK |
1495 | The kernel could not allocate memory to hold state information. |
1496 | .TP | |
fea681da | 1497 | .B ENFILE |
ff597681 | 1498 | .RB ( FUTEX_FD ) |
fea681da | 1499 | The system limit on the total number of open files has been reached. |
4701fc28 MK |
1500 | .TP |
1501 | .B ENOSYS | |
1502 | Invalid operation specified in | |
d33602c4 | 1503 | .IR futex_op . |
9f6c40c0 | 1504 | .TP |
4a7e5b05 MK |
1505 | .B ENOSYS |
1506 | The | |
1507 | .BR FUTEX_CLOCK_REALTIME | |
1508 | option was specified in | |
1afcee7c | 1509 | .IR futex_op , |
4a7e5b05 MK |
1510 | but the accompanying operation was neither |
1511 | .BR FUTEX_WAIT_BITSET | |
1512 | nor | |
1513 | .BR FUTEX_WAIT_REQUEUE_PI . | |
1514 | .TP | |
a9dcb4d1 MK |
1515 | .BR ENOSYS |
1516 | .RB ( FUTEX_LOCK_PI , | |
f2424fae | 1517 | .BR FUTEX_TRYLOCK_PI , |
4945ff19 | 1518 | .BR FUTEX_UNLOCK_PI , |
4cf92894 | 1519 | .BR FUTEX_CMP_REQUEUE_PI , |
794bb106 | 1520 | .BR FUTEX_WAIT_REQUEUE_PI ) |
4b35dc5d | 1521 | A run-time check determined that the operation is not available. |
f0a9e8f4 | 1522 | The PI-futex operations are not implemented on all architectures and |
077981d4 | 1523 | are not supported on some CPU variants. |
a9dcb4d1 | 1524 | .TP |
c7589177 MK |
1525 | .BR EPERM |
1526 | .RB ( FUTEX_LOCK_PI , | |
dc2742a8 MK |
1527 | .BR FUTEX_TRYLOCK_PI , |
1528 | .BR FUTEX_CMP_REQUEUE_PI ) | |
04331c3f | 1529 | The caller is not allowed to attach itself to the futex at |
dc2742a8 MK |
1530 | .I uaddr |
1531 | (for | |
1532 | .BR FUTEX_CMP_REQUEUE_PI : | |
1533 | the futex at | |
1534 | .IR uaddr2 ). | |
c7589177 MK |
1535 | (This may be caused by a state corruption in user space.) |
1536 | .TP | |
76f347ba | 1537 | .BR EPERM |
87276709 | 1538 | .RB ( FUTEX_UNLOCK_PI ) |
4b35dc5d | 1539 | The caller does not own the lock represented by the futex word. |
76f347ba | 1540 | .TP |
0b0e4934 MK |
1541 | .BR ESRCH |
1542 | .RB ( FUTEX_LOCK_PI , | |
9732dd8b MK |
1543 | .BR FUTEX_TRYLOCK_PI , |
1544 | .BR FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1545 | The thread ID in the futex word at |
0b0e4934 MK |
1546 | .I uaddr |
1547 | does not exist. | |
1548 | .TP | |
360f773c MK |
1549 | .BR ESRCH |
1550 | .RB ( FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1551 | The thread ID in the futex word at |
360f773c MK |
1552 | .I uaddr2 |
1553 | does not exist. | |
1554 | .TP | |
9f6c40c0 | 1555 | .B ETIMEDOUT |
4d85047f MK |
1556 | The operation in |
1557 | .IR futex_op | |
1558 | employed the timeout specified in | |
1559 | .IR timeout , | |
1560 | and the timeout expired before the operation completed. | |
70b06b90 MK |
1561 | .\" |
1562 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1563 | .\" | |
47297adb | 1564 | .SH VERSIONS |
a1d5f77c | 1565 | .PP |
81c9d87e MK |
1566 | Futexes were first made available in a stable kernel release |
1567 | with Linux 2.6.0. | |
1568 | ||
4c8cb0ff MK |
1569 | Initial futex support was merged in Linux 2.5.7 but with different |
1570 | semantics from what was described above. | |
52dee70e | 1571 | A four-argument system call with the semantics |
fd3fa7ef | 1572 | described in this page was introduced in Linux 2.5.40. |
11b520ed | 1573 | In Linux 2.5.70, one argument |
a1d5f77c | 1574 | was added. |
11b520ed | 1575 | In Linux 2.6.7, a sixth argument was added\(emmessy, especially |
a1d5f77c | 1576 | on the s390 architecture. |
47297adb | 1577 | .SH CONFORMING TO |
8382f16d | 1578 | This system call is Linux-specific. |
47297adb | 1579 | .SH NOTES |
baf0f1f4 MK |
1580 | Glibc does not provide a wrapper for this system call; call it using |
1581 | .BR syscall (2). | |
cf44281c | 1582 | |
02f7b623 MK |
1583 | Several higher-level programming abstractions are implemented via futexes, |
1584 | including POSIX semaphores and | |
1585 | various POSIX threads synchronization mechanisms | |
1586 | (mutexes, condition variables, read-write locks, and barriers). | |
74f58a64 MK |
1587 | .\" TODO FIXME(Torvald) Above, we cite this section and claim it contains |
1588 | .\" details on the synchronization semantics; add the C11 equivalents | |
1589 | .\" here (or whatever we find consensus for). | |
305cc415 MK |
1590 | .\" |
1591 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1592 | .\" | |
1593 | .SH EXAMPLE | |
1594 | .\" FIXME Is it worth having an example program? | |
1595 | .\" FIXME Anything obviously broken in the example program? | |
1596 | .\" | |
77da5feb | 1597 | The program below demonstrates use of futexes in a program |
305cc415 MK |
1598 | where parent and child use a pair of futexes located inside a |
1599 | shared anonymous mapping to synchronize access to a shared resource: | |
1600 | the terminal. | |
1601 | The two processes each write | |
1602 | .IR nloops | |
1603 | (a command-line argument that defaults to 5 if omitted) | |
1604 | messages to the terminal and employ a synchronization protocol | |
1605 | that ensures that they alternate in writing messages. | |
1606 | Upon running this program we see output such as the following: | |
1607 | ||
1608 | .in +4n | |
1609 | .nf | |
1610 | $ \fB./futex_demo\fP | |
1611 | Parent (18534) 0 | |
1612 | Child (18535) 0 | |
1613 | Parent (18534) 1 | |
1614 | Child (18535) 1 | |
1615 | Parent (18534) 2 | |
1616 | Child (18535) 2 | |
1617 | Parent (18534) 3 | |
1618 | Child (18535) 3 | |
1619 | Parent (18534) 4 | |
1620 | Child (18535) 4 | |
1621 | .fi | |
1622 | .in | |
1623 | .SS Program source | |
1624 | \& | |
1625 | .nf | |
1626 | /* futex_demo.c | |
1627 | ||
1628 | Usage: futex_demo [nloops] | |
1629 | (Default: 5) | |
1630 | ||
1631 | Demonstrate the use of futexes in a program where parent and child | |
1632 | use a pair of futexes located inside a shared anonymous mapping to | |
1633 | synchronize access to a shared resource: the terminal. The two | |
1634 | processes each write \(aqnum\-loops\(aq messages to the terminal and employ | |
1635 | a synchronization protocol that ensures that they alternate in | |
1636 | writing messages. | |
1637 | */ | |
1638 | #define _GNU_SOURCE | |
1639 | #include <stdio.h> | |
1640 | #include <errno.h> | |
1641 | #include <stdlib.h> | |
1642 | #include <unistd.h> | |
1643 | #include <sys/wait.h> | |
1644 | #include <sys/mman.h> | |
1645 | #include <sys/syscall.h> | |
1646 | #include <linux/futex.h> | |
1647 | #include <sys/time.h> | |
1648 | ||
1649 | #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \\ | |
1650 | } while (0) | |
1651 | ||
1652 | static int *futex1, *futex2, *iaddr; | |
1653 | ||
1654 | static int | |
1655 | futex(int *uaddr, int futex_op, int val, | |
1656 | const struct timespec *timeout, int *uaddr2, int val3) | |
1657 | { | |
1658 | return syscall(SYS_futex, uaddr, futex_op, val, | |
1659 | timeout, uaddr, val3); | |
1660 | } | |
1661 | ||
1662 | /* Acquire the futex pointed to by \(aqfutexp\(aq: wait for its value to | |
1663 | become 1, and then set the value to 0. */ | |
1664 | ||
1665 | static void | |
1666 | fwait(int *futexp) | |
1667 | { | |
1668 | int s; | |
1669 | ||
1670 | /* __sync_bool_compare_and_swap(ptr, oldval, newval) is a gcc | |
1671 | built\-in function. It atomically performs the equivalent of: | |
1672 | ||
1673 | if (*ptr == oldval) | |
1674 | *ptr = newval; | |
1675 | ||
1676 | It returns true if the test yielded true and *ptr was updated. | |
1677 | The alternative here would be to employ the equivalent atomic | |
1678 | machine\-language instructions. For further information, see | |
1679 | the GCC Manual. */ | |
1680 | ||
305cc415 | 1681 | while (1) { |
83e80dda | 1682 | |
63ad44cb | 1683 | /* Is the futex available? */ |
83e80dda | 1684 | |
305cc415 MK |
1685 | if (__sync_bool_compare_and_swap(futexp, 1, 0)) |
1686 | break; /* Yes */ | |
1687 | ||
63ad44cb | 1688 | /* Futex is not available; wait */ |
83e80dda | 1689 | |
63ad44cb HS |
1690 | s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); |
1691 | if (s == \-1 && errno != EAGAIN) | |
1692 | errExit("futex\-FUTEX_WAIT"); | |
305cc415 MK |
1693 | } |
1694 | } | |
1695 | ||
1696 | /* Release the futex pointed to by \(aqfutexp\(aq: if the futex currently | |
1697 | has the value 0, set its value to 1 and the wake any futex waiters, | |
1698 | so that if the peer is blocked in fpost(), it can proceed. */ | |
1699 | ||
1700 | static void | |
1701 | fpost(int *futexp) | |
1702 | { | |
1703 | int s; | |
1704 | ||
1705 | /* __sync_bool_compare_and_swap() was described in comments above */ | |
1706 | ||
1707 | if (__sync_bool_compare_and_swap(futexp, 0, 1)) { | |
1708 | ||
1709 | s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); | |
1710 | if (s == \-1) | |
1711 | errExit("futex\-FUTEX_WAKE"); | |
1712 | } | |
1713 | } | |
1714 | ||
1715 | int | |
1716 | main(int argc, char *argv[]) | |
1717 | { | |
1718 | pid_t childPid; | |
1719 | int j, nloops; | |
1720 | ||
1721 | setbuf(stdout, NULL); | |
1722 | ||
1723 | nloops = (argc > 1) ? atoi(argv[1]) : 5; | |
1724 | ||
1725 | /* Create a shared anonymous mapping that will hold the futexes. | |
1726 | Since the futexes are being shared between processes, we | |
1727 | subsequently use the "shared" futex operations (i.e., not the | |
1728 | ones suffixed "_PRIVATE") */ | |
1729 | ||
1730 | iaddr = mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE, | |
1731 | MAP_ANONYMOUS | MAP_SHARED, \-1, 0); | |
1732 | if (iaddr == MAP_FAILED) | |
1733 | errExit("mmap"); | |
1734 | ||
1735 | futex1 = &iaddr[0]; | |
1736 | futex2 = &iaddr[1]; | |
1737 | ||
1738 | *futex1 = 0; /* State: unavailable */ | |
1739 | *futex2 = 1; /* State: available */ | |
1740 | ||
1741 | /* Create a child process that inherits the shared anonymous | |
35764662 | 1742 | mapping */ |
305cc415 MK |
1743 | |
1744 | childPid = fork(); | |
92a46690 | 1745 | if (childPid == \-1) |
305cc415 MK |
1746 | errExit("fork"); |
1747 | ||
1748 | if (childPid == 0) { /* Child */ | |
1749 | for (j = 0; j < nloops; j++) { | |
1750 | fwait(futex1); | |
1751 | printf("Child (%ld) %d\\n", (long) getpid(), j); | |
1752 | fpost(futex2); | |
1753 | } | |
1754 | ||
1755 | exit(EXIT_SUCCESS); | |
1756 | } | |
1757 | ||
1758 | /* Parent falls through to here */ | |
1759 | ||
1760 | for (j = 0; j < nloops; j++) { | |
1761 | fwait(futex2); | |
1762 | printf("Parent (%ld) %d\\n", (long) getpid(), j); | |
1763 | fpost(futex1); | |
1764 | } | |
1765 | ||
1766 | wait(NULL); | |
1767 | ||
1768 | exit(EXIT_SUCCESS); | |
1769 | } | |
1770 | .fi | |
47297adb | 1771 | .SH SEE ALSO |
4c222281 | 1772 | .ad l |
9913033c | 1773 | .BR get_robust_list (2), |
d806bc05 | 1774 | .BR restart_syscall (2), |
e0074751 | 1775 | .BR pthread_mutexattr_getprotocol (3), |
ac894879 MK |
1776 | .BR futex (7), |
1777 | .BR sched (7) | |
fea681da | 1778 | .PP |
f5ad572f MK |
1779 | The following kernel source files: |
1780 | .IP * 2 | |
1781 | .I Documentation/pi-futex.txt | |
1782 | .IP * | |
1783 | .I Documentation/futex-requeue-pi.txt | |
1784 | .IP * | |
1785 | .I Documentation/locking/rt-mutex.txt | |
1786 | .IP * | |
1787 | .I Documentation/locking/rt-mutex-design.txt | |
8fe019c7 MK |
1788 | .IP * |
1789 | .I Documentation/robust-futex-ABI.txt | |
43b99089 | 1790 | .PP |
4c222281 | 1791 | Franke, H., Russell, R., and Kirwood, M., 2002. |
52087dd3 | 1792 | \fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP |
4c222281 | 1793 | (from proceedings of the Ottawa Linux Symposium 2002), |
9b936e9e | 1794 | .br |
608bf950 SK |
1795 | .UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002-pages-479-495.pdf |
1796 | .UE | |
f42eb21b | 1797 | |
4c222281 | 1798 | Hart, D., 2009. \fIA futex overview and update\fP, |
2ed26199 MK |
1799 | .UR http://lwn.net/Articles/360699/ |
1800 | .UE | |
1801 | ||
4c222281 | 1802 | Hart, D. and Guniguntala, D., 2009. |
0483b6cc | 1803 | \fIRequeue-PI: Making Glibc Condvars PI-Aware\fP |
4c222281 | 1804 | (from proceedings of the 2009 Real-Time Linux Workshop), |
0483b6cc MK |
1805 | .UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf |
1806 | .UE | |
1807 | ||
4c222281 | 1808 | Drepper, U., 2011. \fIFutexes Are Tricky\fP, |
f42eb21b MK |
1809 | .UR http://www.akkadia.org/drepper/futex.pdf |
1810 | .UE | |
9b936e9e MK |
1811 | .PP |
1812 | Futex example library, futex-*.tar.bz2 at | |
1813 | .br | |
a605264d | 1814 | .UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/ |
608bf950 | 1815 | .UE |
34f14794 MK |
1816 | .\" |
1817 | .\" FIXME Are there any other resources that should be listed | |
1818 | .\" in the SEE ALSO section? | |
74f58a64 | 1819 | .\" FIXME(Torvald) We should probably refer to the glibc code here, in |
4c8cb0ff MK |
1820 | .\" particular the glibc-internal futex wrapper functions that are |
1821 | .\" WIP, and the generic pthread_mutex_t and perhaps condvar | |
1822 | .\" implementations. |