Commit | Line | Data |
---|---|---|
8f0aff2a | 1 | .\" Page by b.hubert |
1abce893 MK |
2 | .\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de> |
3 | .\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com> | |
2297bf0e | 4 | .\" |
2e46a6e7 | 5 | .\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE) |
8f0aff2a | 6 | .\" may be freely modified and distributed |
8ff7380d | 7 | .\" %%%LICENSE_END |
fea681da MK |
8 | .\" |
9 | .\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com) | |
10 | .\" added ERRORS section. | |
11 | .\" | |
12 | .\" Modified 2004-06-17 mtk | |
13 | .\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE | |
14 | .\" | |
47f5c4ba MK |
15 | .\" FIXME Still to integrate are some points from Torvald Riegel's mail of |
16 | .\" 2015-01-23: | |
17 | .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977 | |
18 | .\" | |
02182e7c MK |
19 | .\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail |
20 | .\" at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242 | |
21 | .\" | |
3d155313 | 22 | .TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual" |
fea681da | 23 | .SH NAME |
ce154705 | 24 | futex \- fast user-space locking |
fea681da | 25 | .SH SYNOPSIS |
9d9dc1e8 | 26 | .nf |
fea681da MK |
27 | .sp |
28 | .B "#include <linux/futex.h>" | |
fea681da MK |
29 | .B "#include <sys/time.h>" |
30 | .sp | |
d33602c4 | 31 | .BI "int futex(int *" uaddr ", int " futex_op ", int " val , |
768d3c23 MK |
32 | .BI " const struct timespec *" timeout , \ |
33 | " \fR /* or: \fBu32 \fIval2\fP */ | |
9d9dc1e8 | 34 | .BI " int *" uaddr2 ", int " val3 ); |
9d9dc1e8 | 35 | .fi |
409f08b0 | 36 | |
b939d6e4 MK |
37 | .IR Note : |
38 | There is no glibc wrapper for this system call; see NOTES. | |
47297adb | 39 | .SH DESCRIPTION |
fea681da MK |
40 | .PP |
41 | The | |
e511ffb6 | 42 | .BR futex () |
4b35dc5d TR |
43 | system call provides a method for waiting until a certain condition becomes |
44 | true. It is typically used as a blocking construct in the context of | |
45 | shared-memory synchronization: The program implements the majority of the | |
46 | synchronization in user space, and uses one of the operations of the system call | |
47 | when it is likely that it has to block for a longer time until the condition | |
48 | becomes true. The program uses another operation of the system call to wake | |
49 | anyone waiting for a particular condition. | |
50 | ||
51 | The condition is represented by the futex word, which is an address in memory | |
52 | supplied to the | |
53 | .BR futex () | |
54 | system call, and the value at this memory location. | |
a5956430 MK |
55 | (While the virtual addresses for the same memory in separate |
56 | processes may not be equal, | |
57 | the kernel maps them internally so that the same memory mapped | |
58 | in different locations will correspond for | |
e511ffb6 | 59 | .BR futex () |
f19904c0 | 60 | calls.) |
809ca3ae | 61 | |
4b35dc5d TR |
62 | When executing a futex operation that requests to block a thread, the kernel |
63 | will only block if the futex word has the value that the calling thread | |
64 | supplied as expected value. The load from the futex word, the comparison with | |
65 | the expected value, and the actual blocking will happen atomically and totally | |
66 | ordered with respect to concurrently executing futex operations on the same | |
67 | futex word, such as operations that wake threads blocked on this futex word. | |
68 | Thus, the futex word is used to connect the synchronization in user space with | |
69 | the implementation of blocking by the kernel; similar to an atomic | |
70 | compare-and-exchange operation that potentially changes shared memory, | |
71 | blocking via a futex is an atomic compare-and-block operation. See NOTES for | |
72 | a detailed specification of the synchronization semantics. | |
73 | ||
74 | One example use of futexes is implementing locks. The state of the lock (i.e., | |
75 | acquired or not acquired) can be represented as an atomically accessed flag | |
76 | in shared memory. In the uncontended case, a thread can access or modify the | |
77 | lock state with atomic instructions, for example atomically changing it from | |
78 | not acquired to acquired using an atomic compare-and-exchange instruction. If | |
79 | a thread cannot acquire a lock because it is already acquired by another | |
80 | thread, it can request to block if and only if the lock is still acquired by | |
81 | using the lock's flag as futex word and expecting a value that represents the | |
82 | acquired state. When releasing the lock, a thread has to first reset the | |
83 | lock state to not acquired and then execute the futex operation that wakes | |
84 | one thread blocked on the futex word that is the lock's flag (this can be | |
85 | further optimized to avoid unnecessary wake-ups). See | |
86 | .BR futex (7) | |
87 | for more detail on how to use futexes. | |
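The lock described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the implementation used by any library: the names `futex_call`, `sketch_lock`, and `sketch_unlock` are hypothetical, the flag values 0 (not acquired) and 1 (acquired) are chosen only for this sketch, and the unlock path always issues a wake-up rather than applying the optimization mentioned above.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper; glibc provides no futex() function, so the
   system call is made through syscall(2). */
static long futex_call(uint32_t *uaddr, int futex_op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, futex_op, val, NULL, NULL, 0);
}

/* Flag values assumed by this sketch: 0 = not acquired, 1 = acquired. */
static void sketch_lock(_Atomic uint32_t *flag)
{
    for (;;) {
        uint32_t expected = 0;

        /* Uncontended case: atomically change 0 -> 1 in user space. */
        if (atomic_compare_exchange_strong(flag, &expected, 1))
            return;

        /* Contended case: block only if the flag still reads 1.
           Whatever the result (woken, EAGAIN, EINTR), retry. */
        futex_call((uint32_t *) flag, FUTEX_WAIT, 1);
    }
}

static void sketch_unlock(_Atomic uint32_t *flag)
{
    /* First reset the lock state, then wake one blocked thread. */
    atomic_store(flag, 0);
    futex_call((uint32_t *) flag, FUTEX_WAKE, 1);
}
```

A caller would declare the flag as `_Atomic uint32_t flag = 0;` in memory shared by the contending threads and bracket the critical section with `sketch_lock(&flag)` and `sketch_unlock(&flag)`.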
88 | ||
89 | Besides the basic wait and wake-up futex functionality, there are further | |
90 | futex operations aimed at supporting more complex use cases. Also note that | |
91 | no explicit initialization or destruction is necessary to use futexes; the | |
92 | kernel maintains a futex (i.e., the kernel-internal implementation artifact) | |
93 | only while operations such as | |
94 | .BR FUTEX_WAIT , | |
95 | described below, are being performed on a particular futex word. | |
a663ca5a MK |
96 | .\" |
97 | .SS Arguments | |
fea681da MK |
98 | The |
99 | .I uaddr | |
4b35dc5d TR |
100 | argument points to the futex word. On all platforms, futexes are four-byte |
101 | integers that must be aligned on a four-byte boundary. | |
f388ba70 MK |
102 | The operation to perform on the futex is specified in the |
103 | .I futex_op | |
104 | argument; | |
105 | .IR val | |
106 | is a value whose meaning and purpose depend on | |
107 | .IR futex_op . | |
36ab2074 MK |
108 | |
109 | The remaining arguments | |
110 | .RI ( timeout , | |
111 | .IR uaddr2 , | |
112 | and | |
113 | .IR val3 ) | |
114 | are required only for certain of the futex operations described below. | |
115 | Where one of these arguments is not required, it is ignored. | |
768d3c23 | 116 | |
36ab2074 MK |
117 | For several blocking operations, the |
118 | .I timeout | |
119 | argument is a pointer to a | |
120 | .IR timespec | |
121 | structure that specifies a timeout for the operation. | |
122 | However, notwithstanding the prototype shown above, for some operations, | |
123 | this argument is instead a four-byte integer whose meaning | |
124 | is determined by the operation. | |
768d3c23 MK |
125 | For these operations, the kernel casts the |
126 | .I timeout | |
127 | value to | |
128 | .IR u32 , | |
129 | and in the remainder of this page, this argument is referred to as | |
130 | .I val2 | |
131 | when interpreted in this fashion. | |
132 | ||
de5a3bb4 | 133 | Where it is required, the |
36ab2074 | 134 | .IR uaddr2 |
4b35dc5d | 135 | argument is a pointer to a second futex word that is employed by the operation. |
36ab2074 MK |
136 | The interpretation of the final integer argument, |
137 | .IR val3 , | |
138 | depends on the operation. | |
a663ca5a MK |
139 | .\" |
140 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
141 | .\" | |
142 | .SS Futex operations | |
6be4bad7 | 143 | The |
d33602c4 | 144 | .I futex_op |
6be4bad7 MK |
145 | argument consists of two parts: |
146 | a command that specifies the operation to be performed, | |
147 | bit-wise ORed with zero or more options that | |
148 | modify the behaviour of the operation. | |
fc30eb79 | 149 | The options that may be included in |
d33602c4 | 150 | .I futex_op |
fc30eb79 TG |
151 | are as follows: |
152 | .TP | |
153 | .BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)" | |
154 | .\" commit 34f01cc1f512fa783302982776895c73714ebbc2 | |
155 | This option bit can be employed with all futex operations. | |
e45f9735 | 156 | It tells the kernel that the futex is process-private and not shared |
4b35dc5d TR |
157 | with another process (i.e., it is only being used for synchronization between |
158 | threads of the same process). | |
fc30eb79 TG |
159 | This allows the kernel to choose the fast path for validating |
160 | the user-space address and avoids expensive VMA lookups, | |
161 | taking reference counts on file backing store, and so on. | |
ae2c1774 MK |
162 | |
163 | As a convenience, | |
164 | .IR <linux/futex.h> | |
165 | defines a set of constants with the suffix | |
166 | .BR _PRIVATE | |
167 | that are equivalents of all of the operations listed below, | |
dcdfde26 | 168 | .\" except the obsolete FUTEX_FD, for which the "private" flag was |
ae2c1774 MK |
169 | .\" meaningless |
170 | but with the | |
171 | .BR FUTEX_PRIVATE_FLAG | |
172 | ORed into the constant value. | |
173 | Thus, there are | |
174 | .BR FUTEX_WAIT_PRIVATE , | |
175 | .BR FUTEX_WAKE_PRIVATE , | |
176 | and so on. | |
2e98bbc2 TG |
177 | .TP |
178 | .BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)" | |
179 | .\" commit 1acdac104668a0834cfa267de9946fac7764d486 | |
4a7e5b05 | 180 | This option bit can be employed only with the |
2e98bbc2 TG |
181 | .BR FUTEX_WAIT_BITSET |
182 | and | |
183 | .BR FUTEX_WAIT_REQUEUE_PI | |
c84cf68c | 184 | operations. |
2e98bbc2 | 185 | |
f2103b26 MK |
186 | If this option is set, the kernel treats |
187 | .I timeout | |
188 | as an absolute time based on | |
2e98bbc2 TG |
189 | .BR CLOCK_REALTIME . |
190 | ||
f2103b26 MK |
191 | If this option is not set, the kernel treats |
192 | .I timeout | |
193 | as a relative time, | |
f1d2171d | 194 | .\" FIXME XXX I added CLOCK_MONOTONIC here. Okay? |
1c952cf5 MK |
195 | measured against the |
196 | .BR CLOCK_MONOTONIC | |
197 | clock. | |
6be4bad7 MK |
198 | .PP |
199 | The operation specified in | |
d33602c4 | 200 | .I futex_op |
6be4bad7 | 201 | is one of the following: |
70b06b90 MK |
202 | .\" |
203 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
204 | .\" | |
fea681da | 205 | .TP |
81c9d87e MK |
206 | .BR FUTEX_WAIT " (since Linux 2.6.0)" |
207 | .\" Strictly speaking, since some time in 2.5.x | |
f065673c | 208 | This operation tests that the value at the |
4b35dc5d | 209 | futex word pointed to by the address |
fea681da | 210 | .I uaddr |
4b35dc5d | 211 | still contains the expected value |
fea681da | 212 | .IR val , |
4b35dc5d | 213 | and if so, then sleeps awaiting |
682edefb | 214 | .B FUTEX_WAKE |
4b35dc5d TR |
215 | on the futex word. The load of the value of the futex word is an atomic memory |
216 | access (i.e., using atomic machine instructions of the respective | |
217 | architecture). This load, the comparison with the expected value, and | |
218 | starting to sleep are performed atomically and totally ordered with respect | |
219 | to other futex operations on the same futex word. If the thread starts to | |
220 | sleep, it is considered a waiter on this futex word. | |
f065673c MK |
221 | If the futex value does not match |
222 | .IR val , | |
4710334a | 223 | then the call fails immediately with the error |
badbf70c | 224 | .BR EAGAIN . |
4b35dc5d TR |
225 | |
226 | The purpose of the comparison with the expected value is to prevent lost | |
227 | wake-ups: If another thread changed the value of the futex word after the | |
228 | calling thread decided to block based on the prior value, and if the other | |
229 | thread executed a | |
230 | .BR FUTEX_WAKE | |
231 | operation (or similar wake-up) after the value change and before this | |
f065673c | 232 | .BR FUTEX_WAIT |
4b35dc5d TR |
233 | operation, then the latter will observe the value change and will not start |
234 | to sleep. | |
1909e523 | 235 | |
c13182ef | 236 | If the |
fea681da | 237 | .I timeout |
53ba4030 | 238 | argument is non-NULL, its contents specify a relative timeout for the wait, |
f1d2171d | 239 | .\" FIXME XXX I added CLOCK_MONOTONIC here. Okay? |
1c952cf5 MK |
240 | measured according to the |
241 | .BR CLOCK_MONOTONIC | |
242 | clock. | |
82a6092b MK |
243 | (This interval will be rounded up to the system clock granularity, |
244 | and kernel scheduling delays mean that the | |
245 | blocking interval may overrun by a small amount.) | |
246 | If | |
247 | .I timeout | |
248 | is NULL, the call blocks indefinitely. | |
4798a7f3 | 249 | |
c13182ef | 250 | The arguments |
fea681da MK |
251 | .I uaddr2 |
252 | and | |
253 | .I val3 | |
254 | are ignored. | |
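The two non-blocking outcomes of a wait can be observed from a single thread. The sketch below is illustrative only: `wait_if_equal` is a hypothetical helper, and it relies on the documented behavior that a wait whose relative timeout expires fails with `ETIMEDOUT`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep on *uaddr if it still holds 'expected',
   for at most timeout_ms milliseconds (a relative timeout).
   Returns 0 if the call returned after a wake-up, else the errno value. */
static int wait_if_equal(uint32_t *uaddr, uint32_t expected, long timeout_ms)
{
    struct timespec ts = {
        .tv_sec  = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    /* uaddr2 and val3 are ignored by FUTEX_WAIT. */
    if (syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &ts, NULL, 0) == -1)
        return errno;
    return 0;
}
```

With a futex word holding 0, `wait_if_equal(&word, 1, 10)` fails immediately with `EAGAIN` because the expected value does not match, whereas `wait_if_equal(&word, 0, 10)` sleeps and, if no other thread wakes it, fails with `ETIMEDOUT` after roughly 10 milliseconds.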
255 | ||
4b35dc5d TR |
256 | .\" XXX I think we should remove this. Or maybe adapt to a different example. |
257 | .\" For | |
258 | .\" .BR futex (7), | |
259 | .\" this call is executed if decrementing the count gave a negative value | |
260 | .\" (indicating contention), | |
261 | .\" and will sleep until another process or thread releases | |
262 | .\" the futex and executes the | |
263 | .\" .B FUTEX_WAKE | |
264 | .\" operation. | |
70b06b90 MK |
265 | .\" |
266 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
267 | .\" | |
fea681da | 268 | .TP |
81c9d87e MK |
269 | .BR FUTEX_WAKE " (since Linux 2.6.0)" |
270 | .\" Strictly speaking, since Linux 2.5.x | |
f065673c MK |
271 | This operation wakes at most |
272 | .I val | |
4b35dc5d TR |
273 | .\" XXX I believe FUTEX_WAIT_BITSET waiters, for example, could also be woken |
274 | .\" (therefore, make it e.g. instead of i.e.)? | |
275 | of the waiters that are waiting (e.g., inside | |
f065673c | 276 | .BR FUTEX_WAIT ) |
4b35dc5d | 277 | on the futex word at the address |
f065673c MK |
278 | .IR uaddr . |
279 | Most commonly, | |
280 | .I val | |
281 | is specified as either 1 (wake up a single waiter) or | |
282 | .BR INT_MAX | |
283 | (wake up all waiters). | |
730bfbda MK |
284 | .\" FIXME Please confirm that the following is correct: |
285 | No guarantee is provided about which waiters are awoken | |
286 | (e.g., a waiter with a higher scheduling priority is not guaranteed | |
287 | to be awoken in preference to a waiter with a lower priority). | |
4798a7f3 | 288 | |
fea681da MK |
289 | The arguments |
290 | .IR timeout , | |
c8b921bd | 291 | .IR uaddr2 , |
fea681da MK |
292 | and |
293 | .I val3 | |
294 | are ignored. | |
295 | ||
4b35dc5d TR |
296 | .\" XXX I think we should remove this. Or maybe adapt to a different example. |
297 | .\" For | |
298 | .\" .BR futex (7), | |
299 | .\" this is executed if incrementing the count showed that there were waiters, | |
64191e8f | 300 | .\" FIXME How does "incrementing the count showed that there were waiters"? |
4b35dc5d | 301 | .\" once the futex value has been set to 1 (indicating that it is available). |
70b06b90 MK |
302 | .\" |
303 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
304 | .\" | |
a7c2bf45 MK |
305 | .TP |
306 | .BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)" | |
307 | .\" Strictly speaking, from Linux 2.5.x to 2.6.25 | |
308 | This operation creates a file descriptor that is associated with the futex at | |
309 | .IR uaddr . | |
bdc5957a MK |
310 | The caller must close the returned file descriptor after use. |
311 | When another process or thread performs a | |
a7c2bf45 | 312 | .BR FUTEX_WAKE |
4b35dc5d | 313 | on the futex word, the file descriptor is indicated as being readable with |
a7c2bf45 MK |
314 | .BR select (2), |
315 | .BR poll (2), | |
316 | and | |
317 | .BR epoll (7). | |
318 | ||
f1d2171d | 319 | The file descriptor can be used to obtain asynchronous notifications: if |
a7c2bf45 | 320 | .I val |
bdc5957a | 321 | is nonzero, then when another process or thread executes a |
a7c2bf45 MK |
322 | .BR FUTEX_WAKE , |
323 | the caller will receive the signal number that was passed in | |
324 | .IR val . | |
325 | ||
326 | The arguments | |
327 | .IR timeout , | |
328 | .I uaddr2 | |
329 | and | |
330 | .I val3 | |
331 | are ignored. | |
332 | ||
4b35dc5d | 333 | .\" FIXME We never define "upped". Maybe just remove that sentence? |
a7c2bf45 MK |
334 | To prevent race conditions, the caller should test if the futex has |
335 | been upped after | |
336 | .B FUTEX_FD | |
337 | returns. | |
338 | ||
339 | Because it was inherently racy, | |
340 | .B FUTEX_FD | |
341 | has been removed | |
342 | .\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80 | |
343 | from Linux 2.6.26 onward. | |
70b06b90 MK |
344 | .\" |
345 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
346 | .\" | |
a7c2bf45 MK |
347 | .TP |
348 | .BR FUTEX_REQUEUE " (since Linux 2.6.0)" | |
349 | .\" Strictly speaking: from Linux 2.5.70 | |
4b35dc5d TR |
350 | .\" FIXME Is there some indication that it is broken in general, or is this |
351 | .\" comment implicitly speaking about the condvar (?) use case? If the latter | |
352 | .\" we might want to weaken the advice a little. | |
a7c2bf45 | 353 | .IR "Avoid using this operation" . |
4b35dc5d | 354 | It is broken for its intended purpose. |
a7c2bf45 MK |
355 | Use |
356 | .BR FUTEX_CMP_REQUEUE | |
357 | instead. | |
358 | ||
359 | This operation performs the same task as | |
360 | .BR FUTEX_CMP_REQUEUE , | |
361 | except that no check is made using the value in | |
362 | .IR val3 . | |
363 | (The argument | |
364 | .I val3 | |
365 | is ignored.) | |
70b06b90 MK |
366 | .\" |
367 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
368 | .\" | |
a7c2bf45 MK |
369 | .TP |
370 | .BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)" | |
4b35dc5d | 371 | This operation first checks whether the location |
a7c2bf45 MK |
372 | .I uaddr |
373 | still contains the value | |
374 | .IR val3 . | |
375 | If not, the operation fails with the error | |
376 | .BR EAGAIN . | |
4b35dc5d | 377 | Otherwise, the operation wakes up a maximum of |
a7c2bf45 MK |
378 | .I val |
379 | waiters that are waiting on the futex at | |
380 | .IR uaddr . | |
381 | If there are more than | |
382 | .I val | |
383 | waiters, then the remaining waiters are removed | |
384 | from the wait queue of the source futex at | |
385 | .I uaddr | |
386 | and added to the wait queue of the target futex at | |
387 | .IR uaddr2 . | |
388 | The | |
768d3c23 | 389 | .I val2 |
936876a9 | 390 | argument specifies an upper limit on the number of waiters |
a7c2bf45 | 391 | that are requeued to the futex at |
768d3c23 | 392 | .IR uaddr2 . |
a7c2bf45 | 393 | |
4b35dc5d TR |
394 | .\" FIXME Is this correct? Or is just the decision which threads to wake or |
395 | .\" requeue part of the atomic operation? | |
396 | The load from | |
397 | .I uaddr | |
398 | is an atomic memory access (i.e., using atomic machine instructions of the | |
399 | respective architecture). This load, the comparison with | |
400 | .IR val3 , | |
401 | and the requeueing of any waiters are performed atomically and totally ordered | |
402 | with respect to other operations on the same futex word. | |
403 | ||
404 | This operation was added as a replacement for the earlier | |
405 | .BR FUTEX_REQUEUE . | |
406 | The difference is that the check of the value at | |
407 | .I uaddr | |
408 | can be used to ensure that requeueing only happens under certain conditions. | |
409 | Both operations can be used to avoid a "thundering herd" effect when | |
410 | .B FUTEX_WAKE | |
411 | is used and all of the waiters that are woken need to acquire another futex. | |
412 | ||
a7c2bf45 MK |
413 | .\" FIXME Please review the following new paragraph to see if it is |
414 | .\" accurate. | |
415 | Typical values to specify for | |
416 | .I val | |
417 | are 0 or 1. | |
418 | (Specifying | |
419 | .BR INT_MAX | |
420 | is not useful, because it would make the | |
421 | .BR FUTEX_CMP_REQUEUE | |
422 | operation equivalent to | |
423 | .BR FUTEX_WAKE .) | |
936876a9 | 424 | The limit value specified via |
768d3c23 MK |
425 | .I val2 |
426 | is typically either 1 or | |
a7c2bf45 MK |
427 | .BR INT_MAX . |
428 | (Specifying the argument as 0 is not useful, because it would make the | |
429 | .BR FUTEX_CMP_REQUEUE | |
430 | operation equivalent to | |
431 | .BR FUTEX_WAKE .) | |
6bac3b85 | 432 | .\" |
43d16602 MK |
433 | .\" FIXME Here, it would be helpful to have an example of how |
434 | .\" FUTEX_CMP_REQUEUE might be used, at the same time illustrating | |
435 | .\" why FUTEX_WAKE is unsuitable for the same use case. | |
436 | .\" | |
70b06b90 MK |
437 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
438 | .\" | |
a5956430 MK |
439 | .\" FIXME I added a lengthy piece of text on FUTEX_WAKE_OP, |
440 | .\" and I'd be happy if someone checked it. | |
fea681da | 441 | .TP |
d67e21f5 MK |
442 | .BR FUTEX_WAKE_OP " (since Linux 2.6.14)" |
443 | .\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721 | |
6bac3b85 MK |
444 | .\" Author: Jakub Jelinek <jakub@redhat.com> |
445 | .\" Date: Tue Sep 6 15:16:25 2005 -0700 | |
4b35dc5d TR |
446 | .\" FIXME The glibc condvar implementation is currently being revised (e.g., |
447 | .\" to not use an internal lock anymore). | |
448 | .\" It is probably more future-proof to remove this paragraph. | |
6bac3b85 MK |
449 | This operation was added to support some user-space use cases |
450 | where more than one futex must be handled at the same time. | |
451 | The most notable example is the implementation of | |
452 | .BR pthread_cond_signal (3), | |
453 | which requires operations on two futexes, | |
454 | the one used to implement the mutex and the one used in the implementation | |
455 | of the wait queue associated with the condition variable. | |
456 | .BR FUTEX_WAKE_OP | |
457 | allows such cases to be implemented without leading to | |
458 | high rates of contention and context switching. | |
459 | ||
460 | The | |
461 | .BR FUTEX_WAKE_OP | |
4b35dc5d TR |
462 | operation is equivalent to executing the following code atomically and totally |
463 | ordered with respect to other futex operations on either of the two supplied | |
464 | futex words: | |
6bac3b85 MK |
465 | |
466 | .in +4n | |
467 | .nf | |
468 | int oldval = *(int *) uaddr2; | |
469 | *(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP; | |
470 | futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); | |
471 | if (oldval \fIcmp\fP \fIcmparg\fP) | |
768d3c23 | 472 | futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); |
6bac3b85 MK |
473 | .fi |
474 | .in | |
475 | ||
476 | In other words, | |
477 | .BR FUTEX_WAKE_OP | |
478 | does the following: | |
479 | .RS | |
480 | .IP * 3 | |
4b35dc5d TR |
481 | saves the original value of the futex word at |
482 | .IR uaddr2 | |
483 | and performs an operation to modify the value of the futex at | |
6bac3b85 | 484 | .IR uaddr2 ; |
4b35dc5d TR |
485 | this is an atomic read-modify-write memory access (i.e., using atomic machine |
486 | instructions of the respective architecture) | |
6bac3b85 MK |
487 | .IP * |
488 | wakes up a maximum of | |
489 | .I val | |
4b35dc5d | 490 | waiters on the futex for the futex word at |
6bac3b85 MK |
491 | .IR uaddr ; |
492 | and | |
493 | .IP * | |
4b35dc5d | 494 | dependent on the results of a test of the original value of the futex word at |
6bac3b85 MK |
495 | .IR uaddr2 , |
496 | wakes up a maximum of | |
768d3c23 | 497 | .I val2 |
4b35dc5d | 498 | waiters on the futex for the futex word at |
6bac3b85 MK |
499 | .IR uaddr2 . |
500 | .RE | |
501 | .IP | |
6bac3b85 MK |
502 | The operation and comparison that are to be performed are encoded |
503 | in the bits of the argument | |
504 | .IR val3 . | |
505 | Pictorially, the encoding is: | |
506 | ||
f6af90e7 | 507 | .in +8n |
6bac3b85 | 508 | .nf |
f6af90e7 MK |
509 | +---+---+-----------+-----------+ |
510 | |op |cmp| oparg | cmparg | | |
511 | +---+---+-----------+-----------+ | |
512 | 4 4 12 12 <== # of bits | |
6bac3b85 MK |
513 | .fi |
514 | .in | |
515 | ||
516 | Expressed in code, the encoding is: | |
517 | ||
518 | .in +4n | |
519 | .nf | |
520 | #define FUTEX_OP(op, oparg, cmp, cmparg) \\ | |
521 | (((op & 0xf) << 28) | \\ | |
522 | ((cmp & 0xf) << 24) | \\ | |
523 | ((oparg & 0xfff) << 12) | \\ | |
524 | (cmparg & 0xfff)) | |
525 | .fi | |
526 | .in | |
527 | ||
528 | In the above, | |
529 | .I op | |
530 | and | |
531 | .I cmp | |
532 | are each one of the codes listed below. | |
533 | The | |
534 | .I oparg | |
535 | and | |
536 | .I cmparg | |
537 | components are literal numeric values, except as noted below. | |
538 | ||
539 | The | |
540 | .I op | |
541 | component has one of the following values: | |
542 | ||
543 | .in +4n | |
544 | .nf | |
545 | FUTEX_OP_SET 0 /* uaddr2 = oparg; */ | |
546 | FUTEX_OP_ADD 1 /* uaddr2 += oparg; */ | |
547 | FUTEX_OP_OR 2 /* uaddr2 |= oparg; */ | |
548 | FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */ | |
549 | FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */ | |
550 | .fi | |
551 | .in | |
552 | ||
553 | In addition, bit-wise ORing the following value into | |
554 | .I op | |
555 | causes | |
556 | .IR "(1\ <<\ oparg)" | |
557 | to be used as the operand: | |
558 | ||
559 | .in +4n | |
560 | .nf | |
561 | FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */ | |
562 | .fi | |
563 | .in | |
564 | ||
565 | The | |
566 | .I cmp | |
567 | field is one of the following: | |
568 | ||
569 | .in +4n | |
570 | .nf | |
571 | FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */ | |
572 | FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */ | |
573 | FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */ | |
574 | FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */ | |
575 | FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */ | |
576 | FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */ | |
577 | .fi | |
578 | .in | |
579 | ||
580 | The return value of | |
581 | .BR FUTEX_WAKE_OP | |
582 | is the sum of the number of waiters woken on the futex | |
583 | .IR uaddr | |
584 | plus the number of waiters woken on the futex | |
585 | .IR uaddr2 . | |
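The encoding of `val3` and its effect on the word at `uaddr2` can be modelled in plain C. The sketch below is a non-atomic, user-space model and makes no system call. The `SK_`-prefixed constants are local copies of the codes listed above, `sk_encode` and `sk_apply` are hypothetical names, and the model assumes small non-negative `oparg` and `cmparg` values; it does not attempt to reproduce how the kernel treats other values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the op and cmp codes listed above (SK_ prefix is
   hypothetical, used to avoid clashing with <linux/futex.h>). */
enum { SK_OP_SET = 0, SK_OP_ADD = 1, SK_OP_OR = 2, SK_OP_ANDN = 3,
       SK_OP_XOR = 4, SK_OP_ARG_SHIFT = 8 };
enum { SK_CMP_EQ = 0, SK_CMP_NE = 1, SK_CMP_LT = 2,
       SK_CMP_LE = 3, SK_CMP_GT = 4, SK_CMP_GE = 5 };

/* Same bit layout as the FUTEX_OP() macro: 4 + 4 + 12 + 12 bits. */
static uint32_t sk_encode(uint32_t op, uint32_t oparg,
                          uint32_t cmp, uint32_t cmparg)
{
    return ((op & 0xf) << 28) | ((cmp & 0xf) << 24) |
           ((oparg & 0xfff) << 12) | (cmparg & 0xfff);
}

/* Model of the uaddr2 half of the operation: modify *word according to
   op/oparg and report whether the comparison on the OLD value would
   permit the second wake-up. Not atomic; for illustration only. */
static bool sk_apply(int *word, uint32_t val3)
{
    uint32_t op  = (val3 >> 28) & 0xf;
    uint32_t cmp = (val3 >> 24) & 0xf;
    int oparg    = (int) ((val3 >> 12) & 0xfff);
    int cmparg   = (int) (val3 & 0xfff);
    int oldval   = *word;

    /* Shift count is masked here only to keep the sketch well defined. */
    if (op & SK_OP_ARG_SHIFT)
        oparg = (int) (1u << (oparg & 31));

    switch (op & 0x7) {
    case SK_OP_SET:  *word = oparg;           break;
    case SK_OP_ADD:  *word = oldval + oparg;  break;
    case SK_OP_OR:   *word = oldval | oparg;  break;
    case SK_OP_ANDN: *word = oldval & ~oparg; break;
    case SK_OP_XOR:  *word = oldval ^ oparg;  break;
    default:         break;
    }

    switch (cmp) {
    case SK_CMP_EQ: return oldval == cmparg;
    case SK_CMP_NE: return oldval != cmparg;
    case SK_CMP_LT: return oldval <  cmparg;
    case SK_CMP_LE: return oldval <= cmparg;
    case SK_CMP_GT: return oldval >  cmparg;
    case SK_CMP_GE: return oldval >= cmparg;
    default:        return false;
    }
}
```

For example, `sk_encode(SK_OP_ADD, 1, SK_CMP_GT, 0)` yields `0x14001000`: "add 1 to the word at `uaddr2`, and wake waiters there only if the old value was greater than 0".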
70b06b90 MK |
586 | .\" |
587 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
588 | .\" | |
d67e21f5 | 589 | .TP |
79c9b436 TG |
590 | .BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)" |
591 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
fd9e59d4 | 592 | This operation is like |
79c9b436 TG |
593 | .BR FUTEX_WAIT |
594 | except that | |
595 | .I val3 | |
596 | is used to provide a 32-bit bitset to the kernel. | |
597 | This bitset is stored in the kernel-internal state of the waiter. | |
598 | See the description of | |
599 | .BR FUTEX_WAKE_BITSET | |
600 | for further details. | |
601 | ||
fd9e59d4 MK |
602 | The |
603 | .BR FUTEX_WAIT_BITSET | |
9732dd8b | 604 | operation also interprets the |
fd9e59d4 MK |
605 | .I timeout |
606 | argument differently from | |
607 | .BR FUTEX_WAIT . | |
608 | See the discussion of | |
609 | .BR FUTEX_CLOCK_REALTIME , | |
610 | above. | |
611 | ||
79c9b436 TG |
612 | The |
613 | .I uaddr2 | |
614 | argument is ignored. | |
70b06b90 MK |
615 | .\" |
616 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
617 | .\" | |
79c9b436 | 618 | .TP |
d67e21f5 MK |
619 | .BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)" |
620 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
55cc422d TG |
621 | This operation is the same as |
622 | .BR FUTEX_WAKE | |
623 | except that the | |
624 | .I val3 | |
625 | argument is used to provide a 32-bit bitset to the kernel. | |
98d769c0 MK |
626 | This bitset is used to select which waiters should be woken up. |
627 | The selection is done by a bit-wise AND of the "wake" bitset | |
628 | (i.e., the value in | |
629 | .IR val3 ) | |
630 | and the bitset which is stored in the kernel-internal | |
09cb4ce7 | 631 | state of the waiter (the "wait" bitset that is set using |
98d769c0 MK |
632 | .BR FUTEX_WAIT_BITSET ). |
633 | All of the waiters for which the result of the AND is nonzero are woken up; | |
634 | the remaining waiters are left sleeping. | |
635 | ||
f1d2171d | 636 | .\" FIXME XXX Is this paragraph that I added okay? |
e9d4496b MK |
637 | The effect of |
638 | .BR FUTEX_WAIT_BITSET | |
639 | and | |
640 | .BR FUTEX_WAKE_BITSET | |
9732dd8b MK |
641 | is to allow selective wake-ups among multiple waiters that are blocked |
642 | on the same futex. | |
09cb4ce7 | 643 | Note, however, that using this bitset multiplexing feature on a |
e9d4496b MK |
644 | futex is less efficient than simply using multiple futexes, |
645 | because employing bitset multiplexing requires the kernel | |
646 | to check all waiters on a futex, | |
647 | including those that are not interested in being woken up | |
648 | (i.e., they do not have the relevant bit set in their "wait" bitset). | |
649 | .\" According to http://locklessinc.com/articles/futex_cheat_sheet/: | |
650 | .\" | |
651 | .\" "The original reason for the addition of these extensions | |
652 | .\" was to improve the performance of pthread read-write locks | |
653 | .\" in glibc. However, the pthreads library no longer uses the | |
654 | .\" same locking algorithm, and these extensions are not used | |
655 | .\" without the bitset parameter being all ones. | |
656 | .\" | |
657 | .\" The page goes on to note that the FUTEX_WAIT_BITSET operation | |
658 | .\" is nevertheless used (with a bitset of all ones) in order to | |
659 | .\" obtain the absolute timeout functionality that is useful | |
660 | .\" for efficiently implementing Pthreads APIs (which use absolute | |
661 | .\" timeouts); FUTEX_WAIT provides only relative timeouts. | |
662 | ||
98d769c0 MK |
663 | The |
664 | .I uaddr2 | |
665 | and | |
666 | .I timeout | |
667 | arguments are ignored. | |
9732dd8b MK |
668 | |
669 | The | |
670 | .BR FUTEX_WAIT | |
671 | and | |
672 | .BR FUTEX_WAKE | |
673 | operations correspond to | |
674 | .BR FUTEX_WAIT_BITSET | |
675 | and | |
676 | .BR FUTEX_WAKE_BITSET | |
677 | operations where the bitsets are all ones. | |
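The selection rule can be illustrated with a small user-space model. This is only a sketch of the bit-wise AND test described above: `sk_select_waiters` is a hypothetical name, no system call is made, and scanning the array in index order is an assumption of the model, since no guarantee is given about which waiters the kernel wakes first.

```c
#include <stddef.h>
#include <stdint.h>

/* Model only: given the "wait" bitset of each of n waiters, mark those
   that a wake-up with 'wake_bits' would select, waking at most 'val'.
   woken[i] is set to 1 for selected waiters and 0 otherwise.
   Returns the number of waiters selected. */
static int sk_select_waiters(const uint32_t *wait_bits, size_t n,
                             uint32_t wake_bits, int val, int *woken)
{
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        woken[i] = 0;
        /* A waiter is eligible when the AND of the bitsets is nonzero. */
        if (count < val && (wait_bits[i] & wake_bits) != 0) {
            woken[i] = 1;
            count++;
        }
    }
    return count;
}
```

With wait bitsets `{0x1, 0x2, 0x3, 0xffffffff}` and a wake bitset of `0x2`, the first waiter is left sleeping and the other three are selected. A wake bitset of all ones selects every waiter, which matches the behavior of a plain wake-up.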
bd90a5f9 | 678 | .\" |
70b06b90 | 679 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
bd90a5f9 MK |
680 | .\" |
681 | .SS Priority-inheritance futexes | |
b52e1cd4 MK |
682 | Linux supports priority-inheritance (PI) futexes in order to handle |
683 | priority-inversion problems that can be encountered with | |
684 | normal futex locks. | |
b565548b | 685 | Priority inversion is the problem that occurs when a high-priority |
bdc5957a MK |
686 | task is blocked waiting to acquire a lock held by a low-priority task, |
687 | while tasks at an intermediate priority continuously preempt | |
688 | the low-priority task from the CPU. | |
689 | Consequently, the low-priority task makes no progress toward | |
690 | releasing the lock, and the high-priority task remains blocked. | |
7f315ae3 | 691 | |
7d20efd7 MK |
692 | Priority inheritance is a mechanism for dealing with |
693 | the priority-inversion problem. | |
bdc5957a MK |
694 | With this mechanism, when a high-priority task becomes blocked |
695 | by a lock held by a low-priority task, | |
7d20efd7 | 696 | the latter's priority is temporarily raised to that of the former, |
bdc5957a | 697 | so that it is not preempted by any intermediate-level tasks, |
7d20efd7 MK |
698 | and can thus make progress toward releasing the lock. |
699 | To be effective, priority inheritance must be transitive, | |
bdc5957a MK |
700 | meaning that if a high-priority task blocks on a lock |
701 | held by a lower-priority task that is itself blocked by a lock | |
702 | held by another intermediate-priority task | |
7d20efd7 | 703 | (and so on, for chains of arbitrary length), |
bdc5957a MK |
704 | then both of those tasks |
705 | (or more generally, all of the tasks in a lock chain) | |
706 | have their priorities raised to be the same as the high-priority task. | |
7d20efd7 | 707 | |
9e2b90ee MK |
708 | .\" FIXME XXX The following is my attempt at a definition of PI futexes, |
709 | .\" based on mail discussions with Darren Hart. Does it seem okay? | |
710 | From a user-space perspective, | |
711 | what makes a futex PI-aware is a policy agreement between user space | |
4b35dc5d | 712 | and the kernel about the value of the futex word (described in a moment), |
9e2b90ee MK |
713 | coupled with the use of the PI futex operations described below |
714 | (in particular, | |
715 | .BR FUTEX_LOCK_PI , | |
716 | .BR FUTEX_TRYLOCK_PI , | |
717 | and | |
718 | .BR FUTEX_CMP_REQUEUE_PI ). | |
719 | .\" Quoting Darren Hart: | |
720 | .\" These opcodes paired with the PI futex value policy (described below) | |
721 | .\" defines a "futex" as PI aware. These were created very specifically | |
722 | .\" in support of PI pthread_mutexes, so it makes a lot more sense to | |
723 | .\" talk about a PI aware pthread_mutex, than a PI aware futex, since | |
724 | .\" there is a lot of policy and scaffolding that has to be built up | |
725 | .\" around it to use it properly (this is what a PI pthread_mutex is). | |
726 | ||
f1d2171d | 727 | .\" FIXME XXX ===== Start of adapted Hart/Guniguntala text ===== |
1af427a4 MK |
728 | .\" The following text is drawn from the Hart/Guniguntala paper |
729 | .\" (listed in SEE ALSO), but I have reworded some pieces | |
730 | .\" significantly. Please check it. | |
79d918c7 MK |
731 | .\" |
732 | The PI futex operations described below differ from the other | |
4b35dc5d TR |
733 | futex operations in that they impose policy on the use of the value of the |
734 | futex word: | |
79d918c7 | 735 | .IP * 3 |
4b35dc5d | 736 | If the lock is not acquired, the futex word's value shall be 0. |
79d918c7 | 737 | .IP * |
4b35dc5d TR |
738 | If the lock is acquired, the futex word's value shall be the thread ID (TID; |
739 | see | |
79d918c7 MK |
740 | .BR gettid (2)) |
741 | of the owning thread. | |
742 | .IP * | |
f1d2171d | 743 | .\" FIXME XXX In the following line, I added "the lock is owned and". Okay? |
79d918c7 MK |
744 | If the lock is owned and there are threads contending for the lock, |
745 | then the | |
746 | .B FUTEX_WAITERS | |
4b35dc5d | 747 | bit shall be set in the futex word's value; in other words, this value is: |
79d918c7 MK |
748 | |
749 | FUTEX_WAITERS | TID | |
9e2b90ee | 750 | |
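The value policy above can be sketched as a few helpers that pick a PI futex word apart.
This is only an illustration: the helper names are hypothetical, the futex word is assumed to be a 32-bit unsigned integer, and FUTEX_TID_MASK is a constant taken from <linux/futex.h> that this page does not otherwise discuss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/futex.h>   /* FUTEX_WAITERS, FUTEX_TID_MASK */

/* Hypothetical helper: TID of the owner, or 0 if the lock is not acquired. */
static inline uint32_t pi_word_owner(uint32_t word)
{
    return word & FUTEX_TID_MASK;
}

/* Hypothetical helper: true if the FUTEX_WAITERS bit is set. */
static inline bool pi_word_has_waiters(uint32_t word)
{
    return (word & FUTEX_WAITERS) != 0;
}

/* Hypothetical helper: build the word for an owned lock.
 * The policy never allows FUTEX_WAITERS without an owner TID. */
static inline uint32_t pi_word_make(uint32_t tid, bool waiters)
{
    assert(tid != 0 && (tid & ~FUTEX_TID_MASK) == 0);
    return tid | (waiters ? FUTEX_WAITERS : 0);
}
```

For an owned, contended lock, pi_word_make(tid, true) yields FUTEX_WAITERS | TID, the third state in the list above.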
79d918c7 | 751 | .PP |
4b35dc5d | 752 | Note that a PI futex word never just has the value |
9e2b90ee MK |
753 | .BR FUTEX_WAITERS , |
754 | which is a permissible state for non-PI futexes. | |
755 | ||
79d918c7 | 756 | With this policy in place, |
4b35dc5d TR |
757 | a user-space application can acquire a not-acquired |
758 | lock or release a lock that no other threads try to acquire using atomic | |
759 | instructions executed in user space (e.g., a compare-and-swap operation such | |
760 | as | |
b52e1cd4 MK |
761 | .I cmpxchg |
762 | on the x86 architecture). | |
4b35dc5d TR |
763 | Acquiring a lock simply consists of using compare-and-swap to atomically set |
764 | the futex word's value to the caller's TID if its previous value was 0. | |
765 | Releasing a lock requires using compare-and-swap to set the futex word's | |
766 | value to 0 if the previous value was the expected TID. | |
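The two user-space fast paths just described can be sketched with C11 atomics.
This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the futex word is assumed to be an _Atomic 32-bit integer, and the caller is assumed to pass its own TID.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to acquire a lock that is not acquired: 0 -> tid.
 * Returns false if the word was nonzero (someone owns the lock). */
static bool pi_fast_lock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;

    assert(tid != 0);
    return atomic_compare_exchange_strong_explicit(
        word, &expected, tid,
        memory_order_acquire, memory_order_relaxed);
}

/* Try to release an uncontended lock: tid -> 0.
 * Returns false if the word is not exactly tid, for example
 * because the FUTEX_WAITERS bit has been set in the meantime. */
static bool pi_fast_unlock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;

    return atomic_compare_exchange_strong_explicit(
        word, &expected, 0,
        memory_order_release, memory_order_relaxed);
}
```

When either function returns false, the caller has to fall back to the corresponding PI futex operation in the kernel.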
b52e1cd4 | 767 | |
4b35dc5d | 768 | If a futex is already acquired (i.e., has a nonzero value), |
b52e1cd4 | 769 | waiters must employ the |
79d918c7 MK |
770 | .B FUTEX_LOCK_PI |
771 | operation to acquire the lock. | |
4b35dc5d | 772 | If other threads are waiting for the lock, then the |
79d918c7 | 773 | .B FUTEX_WAITERS |
4b35dc5d | 774 | bit is set in the futex value; in this case, the lock owner must employ the |
79d918c7 | 775 | .B FUTEX_UNLOCK_PI |
b52e1cd4 MK |
776 | operation to release the lock. |
777 | ||
79d918c7 MK |
778 | In the cases where callers are forced into the kernel |
779 | (i.e., required to perform a | |
780 | .BR futex () | |
781 | operation), | |
782 | they then deal directly with a so-called RT-mutex, | |
783 | a kernel locking mechanism which implements the required | |
784 | priority-inheritance semantics. | |
785 | After the RT-mutex is acquired, the futex value is updated accordingly, | |
786 | before the calling thread returns to user space. | |
787 | .\" FIXME ===== End of adapted Hart/Guniguntala text ===== | |
788 | ||
a59fca75 MK |
789 | It is important to note |
790 | .\" FIXME We need some explanation here of *why* it is important to | |
1af427a4 | 791 | .\" note this. Can someone explain? |
4b35dc5d | 792 | that the kernel will update the futex word's value prior |
79d918c7 MK |
793 | to returning to user space. |
794 | Unlike the other futex operations described above, | |
795 | the PI futex operations are designed | |
d9d5be6b | 796 | for the implementation of very specific IPC mechanisms. |
fc57e6bb | 797 | .\" |
7bd3ffbc | 798 | .\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart |
99c0ac69 MK |
799 | .\" made the observation that "EINVAL is returned if the non-pi |
800 | .\" to pi or op pairing semantics are violated." | |
801 | .\" Probably there needs to be a general statement about this | |
802 | .\" requirement, probably located at about this point in the page. | |
7bd3ffbc | 803 | .\" Darren, care to take a shot at this? |
dd003bef MK |
804 | .\" |
805 | .\" FIXME Somewhere on this page (I guess under the discussion of PI | |
806 | .\" futexes) we need a discussion of the FUTEX_OWNER_DIED bit. | |
807 | .\" Can someone propose a text? | |
bd90a5f9 MK |
808 | |
809 | PI futexes are operated on by specifying one of the following values in | |
810 | .IR futex_op : | |
70b06b90 MK |
811 | .\" |
812 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
813 | .\" | |
d67e21f5 MK |
814 | .TP |
815 | .BR FUTEX_LOCK_PI " (since Linux 2.6.18)" | |
816 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc | |
67833bec MK |
817 | .\" |
818 | .\" FIXME I did some significant rewording of tglx's text. | |
819 | .\" Please check, in case I injected errors. | |
820 | .\" | |
821 | This operation is used after an attempt to acquire |
4b35dc5d TR |
822 | the lock via an atomic user-space instruction failed |
823 | because the futex word has a nonzero value\(emspecifically, | |
67833bec | 824 | because it contained the namespace-specific TID of the lock owner. |
67259526 | 825 | .\" FIXME In the preceding line, what does "namespace-specific" mean? |
67833bec | 826 | .\" (I kept those words from tglx.) |
67259526 | 827 | .\" That is, what kind of namespace are we talking about? |
67833bec MK |
828 | .\" (I suppose we are talking PID namespaces here, but I want to |
829 | .\" be sure.) | |
830 | ||
4b35dc5d | 831 | The operation checks the value of the futex word at the address |
67833bec | 832 | .IR uaddr . |
70b06b90 MK |
833 | If the value is 0, then the kernel tries to atomically set |
834 | the futex value to the caller's TID. | |
67833bec MK |
835 | If that fails, |
836 | .\" FIXME What would be the cause of failure? | |
4b35dc5d | 837 | or the futex word's value is nonzero, |
67833bec | 838 | the kernel atomically sets the |
e0547e70 | 839 | .B FUTEX_WAITERS |
67833bec MK |
840 | bit, which signals the futex owner that it cannot unlock the futex in |
841 | user space atomically by setting the futex value to 0. | |
842 | After that, the kernel tries to find the thread which is | |
843 | associated with the owner TID, | |
844 | .\" FIXME Could I get a bit more detail on the next two lines? | |
845 | .\" What is "creates or reuses kernel state" about? | |
846 | creates or reuses kernel state on behalf of the owner | |
847 | and attaches the waiter to it. | |
67259526 MK |
848 | .\" FIXME In the next line, what type of "priority" are we talking about? |
849 | .\" Realtime priorities for SCHED_FIFO and SCHED_RR? | |
850 | .\" Or something else? | |
1f043693 | 851 | The enqueueing of the waiter is in descending priority order if more |
e0547e70 | 852 | than one waiter exists. |
67259526 | 853 | .\" FIXME What does "bandwidth" refer to in the next line? |
e0547e70 | 854 | The owner inherits either the priority or the bandwidth of the waiter. |
67259526 MK |
855 | .\" FIXME In the preceding line, what determines whether the |
856 | .\" owner inherits the priority versus the bandwidth? | |
67833bec MK |
857 | .\" |
858 | .\" FIXME Could I get some help translating the next sentence into | |
859 | .\" something that user-space developers (and I) can understand? | |
70b06b90 | 860 | .\" In particular, what are "nested locks" in this context? |
e0547e70 TG |
861 | This inheritance follows the lock chain in the case of |
862 | nested locking and performs deadlock detection. | |
863 | ||
9ce19cf1 MK |
864 | .\" FIXME tglx says "The timeout argument is handled as described in |
865 | .\" FUTEX_WAIT." However, it appears to me that this is not right. | |
70b06b90 | 866 | .\" Is the following formulation correct? |
e0547e70 TG |
867 | The |
868 | .I timeout | |
9ce19cf1 MK |
869 | argument provides a timeout for the lock attempt. |
870 | It is interpreted as an absolute time, measured against the | |
871 | .BR CLOCK_REALTIME | |
872 | clock. | |
873 | If | |
874 | .I timeout | |
875 | is NULL, the operation will block indefinitely. | |
e0547e70 | 876 | |
a449c634 | 877 | The |
e0547e70 TG |
878 | .IR uaddr2 , |
879 | .IR val , | |
880 | and | |
881 | .IR val3 | |
a449c634 | 882 | arguments are ignored. |
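Because the FUTEX_LOCK_PI timeout is an absolute CLOCK_REALTIME value, a caller that thinks in relative terms has to convert first.
The following sketch shows one way to do that; the helper names deadline_after_ms and pi_lock_until are hypothetical, and the futex word is assumed to be a 32-bit integer.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical helper: absolute CLOCK_REALTIME deadline
 * that lies rel_ms milliseconds in the future. */
static struct timespec deadline_after_ms(long rel_ms)
{
    struct timespec ts;

    assert(rel_ms >= 0);
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += rel_ms / 1000;
    ts.tv_nsec += (rel_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {   /* keep tv_nsec below one second */
        ts.tv_sec  += 1;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

/* Hypothetical wrapper: block in FUTEX_LOCK_PI until the lock is
 * acquired or the absolute deadline passes. A NULL deadline blocks
 * indefinitely. val, uaddr2, and val3 are ignored by the kernel. */
static inline long pi_lock_until(uint32_t *uaddr,
                                 const struct timespec *abs_deadline)
{
    return syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0,
                   abs_deadline, NULL, 0);
}
```

A caller would typically compute the deadline once, before the first lock attempt, and reuse it if the call has to be repeated.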
67833bec | 883 | .\" |
70b06b90 MK |
884 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
885 | .\" | |
d67e21f5 | 886 | .TP |
12fdbe23 | 887 | .BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 888 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
12fdbe23 MK |
889 | This operation tries to acquire the futex at |
890 | .IR uaddr . | |
0b761826 | 891 | .\" FIXME I think it would be helpful here to say a few more words about |
70b06b90 MK |
892 | .\" the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI. |
893 | .\" Can someone propose something? | |
894 | .\" | |
4b35dc5d TR |
895 | .\" FIXME Additionally, we claim above that just FUTEX_WAITERS is never an |
896 | .\" allowed state. | |
fa0388c3 | 897 | It deals with the situation where the TID value at |
12fdbe23 MK |
898 | .I uaddr |
899 | is 0, but the | |
b52e1cd4 | 900 | .B FUTEX_WAITERS |
12fdbe23 | 901 | bit is set. |
fa0388c3 MK |
902 | .\" FIXME How does the situation in the previous sentence come about? |
903 | .\" Probably it would be helpful to say something about that in | |
904 | .\" the man page. | |
badbf70c | 905 | .\" FIXME And *how* does FUTEX_TRYLOCK_PI deal with this situation? |
a282e5b0 | 906 | User space cannot handle this condition in a race-free manner. |
084744ef MK |
907 | |
908 | The | |
909 | .IR uaddr2 , | |
910 | .IR val , | |
911 | .IR timeout , | |
912 | and | |
913 | .IR val3 | |
914 | arguments are ignored. | |
70b06b90 MK |
915 | .\" |
916 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
917 | .\" | |
d67e21f5 | 918 | .TP |
12fdbe23 | 919 | .BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 920 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
d4ba4328 | 921 | This operation wakes the highest-priority waiter that is waiting in |
ecae2099 TG |
922 | .B FUTEX_LOCK_PI |
923 | on the futex address provided by the | |
924 | .I uaddr | |
925 | argument. | |
926 | ||
927 | This is called when the user space value at | |
928 | .I uaddr | |
929 | cannot be changed atomically from a TID (of the owner) to 0. | |
930 | ||
931 | The | |
932 | .IR uaddr2 , | |
933 | .IR val , | |
934 | .IR timeout , | |
935 | and | |
936 | .IR val3 | |
11a194bf | 937 | arguments are ignored. |
70b06b90 MK |
938 | .\" |
939 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
940 | .\" | |
d67e21f5 | 941 | .TP |
d67e21f5 MK |
942 | .BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)" |
943 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
f812a08b DH |
944 | This operation is a PI-aware variant of |
945 | .BR FUTEX_CMP_REQUEUE . | |
946 | It requeues waiters that are blocked via | |
947 | .B FUTEX_WAIT_REQUEUE_PI | |
948 | on | |
949 | .I uaddr | |
950 | from a non-PI source futex | |
951 | .RI ( uaddr ) | |
952 | to a PI target futex | |
953 | .RI ( uaddr2 ). | |
954 | ||
9e54d26d MK |
955 | As with |
956 | .BR FUTEX_CMP_REQUEUE , | |
957 | this operation wakes up a maximum of | |
958 | .I val | |
959 | waiters that are waiting on the futex at | |
960 | .IR uaddr . | |
961 | However, for | |
962 | .BR FUTEX_CMP_REQUEUE_PI , | |
963 | .I val | |
6fbeb8f4 | 964 | is required to be 1 |
939ca89f | 965 | (since the main point is to avoid a thundering herd). |
9e54d26d MK |
966 | The remaining waiters are removed from the wait queue of the source futex at |
967 | .I uaddr | |
968 | and added to the wait queue of the target futex at | |
969 | .IR uaddr2 . | |
f812a08b | 970 | |
9e54d26d | 971 | The |
768d3c23 | 972 | .I val2 |
c6d8cf21 MK |
973 | .\" val2 is the cap on the number of requeued waiters. |
974 | .\" In the glibc pthread_cond_broadcast() implementation, this argument | |
975 | .\" is specified as INT_MAX, and for pthread_cond_signal() it is 0. | |
9e54d26d | 976 | and |
768d3c23 | 977 | .I val3 |
9e54d26d MK |
978 | arguments serve the same purposes as for |
979 | .BR FUTEX_CMP_REQUEUE . | |
70b06b90 | 980 | .\" |
be376673 MK |
981 | .\" FIXME The page at http://locklessinc.com/articles/futex_cheat_sheet/ |
982 | .\" notes that "priority-inheritance Futex to priority-inheritance | |
983 | .\" Futex requeues are currently unsupported". Do we need to say | |
984 | .\" something in the man page about that? | |
70b06b90 MK |
985 | .\" |
986 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
987 | .\" | |
d67e21f5 MK |
988 | .TP |
989 | .BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)" | |
990 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
70b06b90 MK |
991 | .\" |
992 | .\" FIXME I find the next sentence (from tglx) pretty hard to grok. | |
1af427a4 | 993 | .\" Could someone explain it a bit more? |
6ff1b4c0 TG |
994 | Wait on a non-PI futex at |
995 | .I uaddr | |
996 | and potentially be requeued onto a PI futex at | |
997 | .IR uaddr2 . | |
998 | The wait operation on | |
999 | .I uaddr | |
1000 | is the same as | |
1001 | .BR FUTEX_WAIT . | |
70b06b90 | 1002 | .\" |
f1d2171d MK |
1003 | .\" FIXME I'm not quite clear on the meaning of the following sentence. |
1004 | .\" Is this trying to say that while blocked in a | |
1005 | .\" FUTEX_WAIT_REQUEUE_PI, it could happen that another | |
1006 | .\" task does a FUTEX_WAKE on uaddr that simply causes | |
1007 | .\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI | |
1008 | .\" does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI | |
1009 | .\" operation? Does it remain blocked, or does it unblock? | |
1010 | .\" In which case, what does user space see? | |
6ff1b4c0 TG |
1011 | The waiter can be removed from the wait on |
1012 | .I uaddr | |
1013 | via | |
1014 | .BR FUTEX_WAKE | |
1015 | without requeueing on | |
1016 | .IR uaddr2 . | |
a4e69912 | 1017 | |
63bea7dc MK |
1018 | .\" FIXME Please check the following. tglx said "The timeout argument |
1019 | .\" is handled as described in FUTEX_WAIT.", but the truth is | |
1020 | .\" as below, AFAICS | |
1021 | If | |
1022 | .I timeout | |
1023 | is not NULL, it specifies a timeout for the wait operation; | |
1024 | this timeout is interpreted as outlined above in the description of the | |
1025 | .BR FUTEX_CLOCK_REALTIME | |
1026 | option. | |
1027 | If | |
1028 | .I timeout | |
1029 | is NULL, the operation can block indefinitely. | |
1030 | ||
a4e69912 MK |
1031 | The |
1032 | .I val3 | |
1033 | argument is ignored. | |
70b06b90 | 1034 | .\" FIXME Re the preceding sentence... Actually 'val3' is internally set to |
a4e69912 MK |
1035 | .\" FUTEX_BITSET_MATCH_ANY before calling futex_wait_requeue_pi(). |
1036 | .\" I'm not sure we need to say anything about this though. | |
1037 | .\" Comments? | |
abb571e8 MK |
1038 | |
1039 | The | |
1040 | .BR FUTEX_WAIT_REQUEUE_PI | |
1041 | and | |
1042 | .BR FUTEX_CMP_REQUEUE_PI | |
1043 | operations were added to support a fairly specific use case: | |
1044 | support for priority-inheritance-aware POSIX threads condition variables. | |
1045 | The idea is that these operations should always be paired, | |
1046 | in order to ensure that user space and the kernel remain in sync. | |
1047 | Thus, in the | |
1048 | .BR FUTEX_WAIT_REQUEUE_PI | |
1049 | operation, the user-space application pre-specifies the target | |
1050 | of the requeue that takes place in the | |
1051 | .BR FUTEX_CMP_REQUEUE_PI | |
1052 | operation. | |
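The pairing described above shows up in the argument layout of the two calls: the waiter names the PI futex it expects to be requeued to, and the waker must name the same one.
The following wrappers are a sketch only; the function names are hypothetical, the futex words are assumed to be 32-bit integers, and the surrounding condition-variable logic (sequence counters, mutex handling) is deliberately left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Waiter side: block on the non-PI word at cond while *cond == expected,
 * pre-specifying pi_mutex as the target of a later requeue. */
static inline long cond_wait_requeue_pi(uint32_t *cond, uint32_t expected,
                                        const struct timespec *timeout,
                                        uint32_t *pi_mutex)
{
    assert(cond != pi_mutex);   /* requeueing to the same futex is invalid */
    return syscall(SYS_futex, cond, FUTEX_WAIT_REQUEUE_PI, expected,
                   timeout, pi_mutex, 0);
}

/* Waker side: wake one waiter (val must be 1) and requeue up to
 * nr_requeue others from cond to pi_mutex, provided that
 * *cond == expected. val2 travels in the timeout argument slot. */
static inline long cond_cmp_requeue_pi(uint32_t *cond, uint32_t expected,
                                       int nr_requeue, uint32_t *pi_mutex)
{
    return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PI, 1,
                   (void *) (intptr_t) nr_requeue, pi_mutex, expected);
}
```

If the value at cond no longer matches expected, the waiter call is expected to fail immediately with EAGAIN rather than block.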
1053 | .\" | |
1054 | .\" Darren Hart notes that a patch to allow glibc to fully support | |
1af427a4 | 1055 | .\" PI-aware pthreads condition variables has not yet been accepted into |
abb571e8 MK |
1056 | .\" glibc. The story is complex, and can be found at |
1057 | .\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588 | |
1058 | .\" Darren notes that in the meantime, the patch is shipped with various | |
1af427a4 | 1059 | .\" PREEMPT_RT-enabled Linux systems. |
abb571e8 MK |
1060 | .\" |
1061 | .\" Related to the preceding, Darren proposed that somewhere, man-pages | |
1062 | .\" should document the following point: | |
1af427a4 | 1063 | .\" |
abb571e8 MK |
1064 | .\" While the Linux kernel, since 2.6.31, supports requeueing of |
1065 | .\" priority-inheritance (PI) aware mutexes via the | |
1066 | .\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations, | |
1067 | .\" the glibc implementation does not yet take full advantage of this. | |
1068 | .\" Specifically, the condvar internal data lock remains a non-PI aware | |
1069 | .\" mutex, regardless of the type of the pthread_mutex associated with | |
1070 | .\" the condvar. This can lead to an unbounded priority inversion on | |
1071 | .\" the internal data lock even when associating a PI aware | |
1072 | .\" pthread_mutex with a condvar during a pthread_cond*_wait | |
1073 | .\" operation. For this reason, it is not recommended to rely on | |
1074 | .\" priority inheritance when using pthread condition variables. | |
1af427a4 MK |
1075 | .\" |
1076 | .\" The problem is that the obvious location for this text is | |
1077 | .\" the pthread_cond*wait(3) man page. However, such a man page | |
abb571e8 | 1078 | .\" does not currently exist. |
70b06b90 | 1079 | .\" |
6700de24 | 1080 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
70b06b90 | 1081 | .\" |
47297adb | 1082 | .SH RETURN VALUE |
fea681da | 1083 | .PP |
6f147f79 | 1084 | In the event of an error, all operations return \-1 and set |
e808bba0 | 1085 | .I errno |
6f147f79 | 1086 | to indicate the cause of the error. |
e808bba0 MK |
1087 | The return value on success depends on the operation, |
1088 | as described in the following list: | |
fea681da MK |
1089 | .TP |
1090 | .B FUTEX_WAIT | |
4b35dc5d TR |
1091 | Returns 0 if the caller was woken up. Note that a wake-up can also be |
1092 | caused by common futex usage patterns in unrelated code that happened to have | |
1093 | previously used the futex word's memory location (e.g., typical futex-based | |
1094 | implementations of Pthreads mutexes can cause this under some conditions). | |
1095 | Therefore, callers should always conservatively assume that a return value of | |
1096 | 0 can mean a spurious wake-up, and use the futex word's value (i.e., the user | |
1097 | space synchronization scheme) to decide whether to continue to block or not. | |
fea681da MK |
1098 | .TP |
1099 | .B FUTEX_WAKE | |
bdc5957a | 1100 | Returns the number of waiters that were woken up. |
fea681da MK |
1101 | .TP |
1102 | .B FUTEX_FD | |
1103 | Returns the new file descriptor associated with the futex. | |
1104 | .TP | |
1105 | .B FUTEX_REQUEUE | |
bdc5957a | 1106 | Returns the number of waiters that were woken up. |
fea681da MK |
1107 | .TP |
1108 | .B FUTEX_CMP_REQUEUE | |
bdc5957a | 1109 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1110 | requeued to the futex for the futex word at |
3dfcc11d MK |
1111 | .IR uaddr2 . |
1112 | If this value is greater than | |
1113 | .IR val , | |
4b35dc5d TR |
1114 | then the difference is the number of waiters requeued to the futex for the futex |
1115 | word at | |
3dfcc11d | 1116 | .IR uaddr2 . |
dcad19c0 MK |
1117 | .TP |
1118 | .B FUTEX_WAKE_OP | |
a8b5b324 | 1119 | Returns the total number of waiters that were woken up. |
4b35dc5d | 1120 | This is the sum of the woken waiters on the two futexes for the futex words at |
a8b5b324 MK |
1121 | .I uaddr |
1122 | and | |
1123 | .IR uaddr2 . | |
dcad19c0 MK |
1124 | .TP |
1125 | .B FUTEX_WAIT_BITSET | |
4b35dc5d TR |
1126 | Returns 0 if the caller was woken up. See |
1127 | .B FUTEX_WAIT | |
1128 | for how to interpret this correctly in practice. | |
dcad19c0 MK |
1129 | .TP |
1130 | .B FUTEX_WAKE_BITSET | |
bdc5957a | 1131 | Returns the number of waiters that were woken up. |
dcad19c0 MK |
1132 | .TP |
1133 | .B FUTEX_LOCK_PI | |
bf02a260 | 1134 | Returns 0 if the futex was successfully locked. |
dcad19c0 MK |
1135 | .TP |
1136 | .B FUTEX_TRYLOCK_PI | |
5c716eef | 1137 | Returns 0 if the futex was successfully locked. |
dcad19c0 MK |
1138 | .TP |
1139 | .B FUTEX_UNLOCK_PI | |
52bb928f | 1140 | Returns 0 if the futex was successfully unlocked. |
dcad19c0 MK |
1141 | .TP |
1142 | .B FUTEX_CMP_REQUEUE_PI | |
bdc5957a | 1143 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1144 | requeued to the futex for the futex word at |
dddd395a MK |
1145 | .IR uaddr2 . |
1146 | If this value is greater than | |
1147 | .IR val , | |
4b35dc5d TR |
1148 | then the difference is the number of waiters requeued to the futex for the futex |
1149 | word at | |
dddd395a | 1150 | .IR uaddr2 . |
dcad19c0 MK |
1151 | .TP |
1152 | .B FUTEX_WAIT_REQUEUE_PI | |
4b35dc5d TR |
1153 | Returns 0 if the caller was successfully requeued to the futex for the futex |
1154 | word at | |
22c15de9 | 1155 | .IR uaddr2 . |
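The rule given above for FUTEX_CMP_REQUEUE and FUTEX_CMP_REQUEUE_PI can be unpacked with a little arithmetic.
The sketch below is one reading of that rule, with a hypothetical helper name and struct; it assumes that a return value no greater than val means that no waiter was requeued.

```c
#include <assert.h>

struct requeue_counts {
    long woken;      /* waiters that were woken up */
    long requeued;   /* waiters moved to the futex at uaddr2 */
};

/* Hypothetical helper: split a successful return value 'ret' of
 * FUTEX_CMP_REQUEUE or FUTEX_CMP_REQUEUE_PI, given the 'val'
 * argument that was passed to the call. */
static struct requeue_counts split_requeue_result(long ret, long val)
{
    struct requeue_counts c;

    assert(ret >= 0 && val >= 0);
    if (ret > val) {
        c.woken = val;             /* the wake limit was reached */
        c.requeued = ret - val;    /* the difference was requeued */
    } else {
        c.woken = ret;
        c.requeued = 0;
    }
    return c;
}
```

For example, a return value of 5 from a call with val equal to 1 would mean one waiter woken and four requeued.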
70b06b90 MK |
1156 | .\" |
1157 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1158 | .\" | |
fea681da MK |
1159 | .SH ERRORS |
1160 | .TP | |
1161 | .B EACCES | |
4b35dc5d | 1162 | No read access to the memory of a futex word. |
fea681da MK |
1163 | .TP |
1164 | .B EAGAIN | |
f48516d1 | 1165 | .RB ( FUTEX_WAIT , |
4b35dc5d | 1166 | .BR FUTEX_WAIT_BITSET , |
f48516d1 | 1167 | .BR FUTEX_WAIT_REQUEUE_PI ) |
badbf70c MK |
1168 | The value pointed to by |
1169 | .I uaddr | |
1170 | was not equal to the expected value | |
1171 | .I val | |
1172 | at the time of the call. | |
9732dd8b MK |
1173 | |
1174 | .BR Note : | |
1175 | on Linux, the symbolic names | |
1176 | .B EAGAIN | |
1177 | and | |
1178 | .B EWOULDBLOCK | |
77da5feb | 1179 | (both of which appear in different parts of the kernel futex code) |
9732dd8b | 1180 | have the same value. |
badbf70c MK |
1181 | .TP |
1182 | .B EAGAIN | |
8f2068bb MK |
1183 | .RB ( FUTEX_CMP_REQUEUE , |
1184 | .BR FUTEX_CMP_REQUEUE_PI ) | |
ce5602fd | 1185 | The value pointed to by |
9f6c40c0 MK |
1186 | .I uaddr |
1187 | is not equal to the expected value | |
1188 | .IR val3 . | |
fd1dc4c2 | 1189 | .\" FIXME: Is the following sentence correct? |
4b35dc5d | 1190 | .\" I would prefer to remove this sentence. --triegel@redhat.com |
fea681da | 1191 | (This probably indicates a race; |
682edefb MK |
1192 | use the safe |
1193 | .B FUTEX_WAKE | |
1194 | now.) | |
c0091dd3 | 1195 | .\" |
f1d2171d | 1196 | .\" FIXME XXX Should there be an EAGAIN case for FUTEX_TRYLOCK_PI? |
c0091dd3 MK |
1197 | .\" It seems so, looking at the handling of the rt_mutex_trylock() |
1198 | .\" call in futex_lock_pi() | |
9732dd8b | 1199 | .\" (Davidlohr also thinks so.) |
c0091dd3 | 1200 | .\" |
fea681da | 1201 | .TP |
5662f56a MK |
1202 | .BR EAGAIN |
1203 | .RB ( FUTEX_LOCK_PI , | |
aaec9032 MK |
1204 | .BR FUTEX_TRYLOCK_PI , |
1205 | .BR FUTEX_CMP_REQUEUE_PI ) | |
1206 | The futex owner thread ID of | |
1207 | .I uaddr | |
1208 | (for | |
1209 | .BR FUTEX_CMP_REQUEUE_PI : | |
1210 | .IR uaddr2 ) | |
1211 | is about to exit, | |
5662f56a MK |
1212 | but has not yet handled the internal state cleanup. |
1213 | Try again. | |
1214 | .TP | |
7a39e745 MK |
1215 | .BR EDEADLK |
1216 | .RB ( FUTEX_LOCK_PI , | |
9732dd8b MK |
1217 | .BR FUTEX_TRYLOCK_PI , |
1218 | .BR FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1219 | The futex word at |
7a39e745 MK |
1220 | .I uaddr |
1221 | is already locked by the caller. | |
1222 | .TP | |
662c0da8 MK |
1223 | .BR EDEADLK |
1224 | .\" FIXME I reworded tglx's text somewhat; is the following okay? | |
f1d2171d MK |
1225 | .\" FIXME XXX I see that kernel/locking/rtmutex.c uses EDEADLK in some places, |
1226 | .\" and EDEADLOCK in others. On almost all architectures these | |
1227 | .\" constants are synonymous. Is there a reason that both names | |
1228 | .\" are used? | |
662c0da8 | 1229 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
4b35dc5d | 1230 | While requeueing a waiter to the PI futex for the futex word at |
662c0da8 MK |
1231 | .IR uaddr2 , |
1232 | the kernel detected a deadlock. | |
1233 | .TP | |
fea681da | 1234 | .B EFAULT |
1ea901e8 MK |
1235 | A required pointer argument (i.e., |
1236 | .IR uaddr , | |
1237 | .IR uaddr2 , | |
1238 | or | |
1239 | .IR timeout ) | |
496df304 | 1240 | did not point to a valid user-space address. |
fea681da | 1241 | .TP |
9f6c40c0 | 1242 | .B EINTR |
e808bba0 | 1243 | A |
9f6c40c0 | 1244 | .B FUTEX_WAIT |
2674f781 MK |
1245 | or |
1246 | .B FUTEX_WAIT_BITSET | |
e808bba0 | 1247 | operation was interrupted by a signal (see |
f529fd20 MK |
1248 | .BR signal (7)). |
1249 | In kernels before Linux 2.6.22, this error could also be returned | |
1250 | on a spurious wakeup; since Linux 2.6.22, this no longer happens. | |
9f6c40c0 | 1251 | .TP |
fea681da | 1252 | .B EINVAL |
180f97b7 MK |
1253 | The operation in |
1254 | .IR futex_op | |
1255 | is one of those that employs a timeout, but the supplied | |
fb2f4c27 MK |
1256 | .I timeout |
1257 | argument was invalid | |
1258 | .RI ( tv_sec | |
1259 | was less than zero, or | |
1260 | .IR tv_nsec | |
1261 | was not less than 1,000,000,000). | |
1262 | .TP | |
1263 | .B EINVAL | |
0c74df0b | 1264 | The operation specified in |
025e1374 | 1265 | .IR futex_op |
0c74df0b | 1266 | employs one or both of the pointers |
51ee94be | 1267 | .I uaddr |
a1f47699 | 1268 | and |
0c74df0b MK |
1269 | .IR uaddr2 , |
1270 | but one of these does not point to a valid object\(emthat is, | |
1271 | the address is not four-byte-aligned. | |
51ee94be MK |
1272 | .TP |
1273 | .B EINVAL | |
55cc422d TG |
1274 | .RB ( FUTEX_WAIT_BITSET , |
1275 | .BR FUTEX_WAKE_BITSET ) | |
79c9b436 TG |
1276 | The bitset supplied in |
1277 | .IR val3 | |
1278 | is zero. | |
1279 | .TP | |
1280 | .B EINVAL | |
2abcba67 | 1281 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
add875c0 MK |
1282 | .I uaddr |
1283 | equals | |
1284 | .IR uaddr2 | |
1285 | (i.e., an attempt was made to requeue to the same futex). | |
1286 | .TP | |
ff597681 MK |
1287 | .BR EINVAL |
1288 | .RB ( FUTEX_FD ) | |
1289 | The signal number supplied in | |
1290 | .I val | |
1291 | is invalid. | |
1292 | .TP | |
6bac3b85 | 1293 | .B EINVAL |
476debd7 MK |
1294 | .RB ( FUTEX_WAKE , |
1295 | .BR FUTEX_WAKE_OP , | |
1296 | .BR FUTEX_WAKE_BITSET , | |
1297 | .BR FUTEX_REQUEUE , | |
1298 | .BR FUTEX_CMP_REQUEUE ) | |
1299 | The kernel detected an inconsistency between the user-space state at | |
1300 | .I uaddr | |
1301 | and the kernel state\(emthat is, it detected a waiter which waits in | |
1302 | .BR FUTEX_LOCK_PI | |
1303 | on | |
1304 | .IR uaddr . | |
1305 | .TP | |
1306 | .B EINVAL | |
a218ef20 | 1307 | .RB ( FUTEX_LOCK_PI , |
ce022f18 MK |
1308 | .BR FUTEX_TRYLOCK_PI , |
1309 | .BR FUTEX_UNLOCK_PI ) | |
a218ef20 MK |
1310 | The kernel detected an inconsistency between the user-space state at |
1311 | .I uaddr | |
1312 | and the kernel state. | |
ce022f18 MK |
1313 | This indicates either state corruption |
1314 | .\" FIXME tglx did not mention the "state corruption" for FUTEX_UNLOCK_PI. | |
1315 | .\" Does that case also apply for FUTEX_UNLOCK_PI? | |
1316 | or that the kernel found a waiter on | |
a218ef20 MK |
1317 | .I uaddr |
1318 | which is waiting via | |
1319 | .BR FUTEX_WAIT | |
1320 | or | |
1321 | .BR FUTEX_WAIT_BITSET . | |
1322 | .TP | |
1323 | .B EINVAL | |
f9250b1a MK |
1324 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1325 | The kernel detected an inconsistency between the user-space state at | |
99c0041d MK |
1326 | .I uaddr2 |
1327 | and the kernel state; | |
1328 | that is, the kernel detected a waiter which waits via | |
1329 | .BR FUTEX_WAIT | |
1330 | .\" FIXME tglx did not mention FUTEX_WAIT_BITSET here, | |
1331 | .\" but should that not also be included here? | |
1332 | on | |
1333 | .IR uaddr2 . | |
1334 | .TP | |
1335 | .B EINVAL | |
1336 | .RB ( FUTEX_CMP_REQUEUE_PI ) | |
1337 | The kernel detected an inconsistency between the user-space state at | |
f9250b1a MK |
1338 | .I uaddr |
1339 | and the kernel state; | |
1340 | that is, the kernel detected a waiter which waits via | |
75299c8d | 1341 | .BR FUTEX_WAIT |
99c0041d | 1342 | or |
75299c8d | 1343 | .BR FUTEX_WAIT_BITSET |
f9250b1a MK |
1344 | on |
1345 | .IR uaddr . | |
1346 | .TP | |
1347 | .B EINVAL | |
99c0041d | 1348 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
75299c8d MK |
1349 | The kernel detected an inconsistency between the user-space state at |
1350 | .I uaddr | |
1351 | and the kernel state; | |
1352 | that is, the kernel detected a waiter which waits on | |
1353 | .I uaddr | |
1354 | via | |
1355 | .BR FUTEX_LOCK_PI | |
1356 | (instead of | |
1357 | .BR FUTEX_WAIT_REQUEUE_PI ). | |
99c0041d MK |
1358 | .TP |
1359 | .B EINVAL | |
9786b3ca | 1360 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
f1d2171d | 1361 | .\" FIXME XXX The following is a reworded version of Darren Hart's text. |
9786b3ca MK |
1362 | .\" Please check that I did not introduce any errors. |
1363 | An attempt was made to requeue a waiter to a futex other than that | |
1364 | specified by the matching | |
1365 | .B FUTEX_WAIT_REQUEUE_PI | |
1366 | call for that waiter. | |
1367 | .TP | |
1368 | .B EINVAL | |
f0c0d61c MK |
1369 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1370 | The | |
1371 | .I val | |
1372 | argument is not 1. | |
1373 | .TP | |
1374 | .B EINVAL | |
4832b48a | 1375 | Invalid argument. |
fea681da | 1376 | .TP |
a449c634 MK |
1377 | .BR ENOMEM |
1378 | .RB ( FUTEX_LOCK_PI , | |
e34a8fb6 MK |
1379 | .BR FUTEX_TRYLOCK_PI , |
1380 | .BR FUTEX_CMP_REQUEUE_PI ) | |
a449c634 MK |
1381 | The kernel could not allocate memory to hold state information. |
1382 | .TP | |
fea681da | 1383 | .B ENFILE |
ff597681 | 1384 | .RB ( FUTEX_FD ) |
fea681da | 1385 | The system limit on the total number of open files has been reached. |
4701fc28 MK |
1386 | .TP |
1387 | .B ENOSYS | |
1388 | Invalid operation specified in | |
d33602c4 | 1389 | .IR futex_op . |
9f6c40c0 | 1390 | .TP |
4a7e5b05 MK |
1391 | .B ENOSYS |
1392 | The | |
1393 | .BR FUTEX_CLOCK_REALTIME | |
1394 | option was specified in | |
1afcee7c | 1395 | .IR futex_op , |
4a7e5b05 MK |
1396 | but the accompanying operation was neither |
1397 | .BR FUTEX_WAIT_BITSET | |
1398 | nor | |
1399 | .BR FUTEX_WAIT_REQUEUE_PI . | |
1400 | .TP | |
a9dcb4d1 MK |
1401 | .BR ENOSYS |
1402 | .RB ( FUTEX_LOCK_PI , | |
f2424fae | 1403 | .BR FUTEX_TRYLOCK_PI , |
4945ff19 | 1404 | .BR FUTEX_UNLOCK_PI , |
4cf92894 | 1405 | .BR FUTEX_CMP_REQUEUE_PI , |
794bb106 | 1406 | .BR FUTEX_WAIT_REQUEUE_PI ) |
4b35dc5d | 1407 | A run-time check determined that the operation is not available. |
a2ebebcd MK |
1408 | The PI futex operations are not implemented on all architectures and |
1409 | are not supported on some CPU variants. | |
a9dcb4d1 | 1410 | .TP |
c7589177 MK |
1411 | .BR EPERM |
1412 | .RB ( FUTEX_LOCK_PI , | |
dc2742a8 MK |
1413 | .BR FUTEX_TRYLOCK_PI , |
1414 | .BR FUTEX_CMP_REQUEUE_PI ) | |
04331c3f | 1415 | The caller is not allowed to attach itself to the futex at |
dc2742a8 MK |
1416 | .I uaddr |
1417 | (for | |
1418 | .BR FUTEX_CMP_REQUEUE_PI : | |
1419 | the futex at | |
1420 | .IR uaddr2 ). | |
c7589177 MK |
1421 | (This may be caused by a state corruption in user space.) |
1422 | .TP | |
76f347ba | 1423 | .BR EPERM |
87276709 | 1424 | .RB ( FUTEX_UNLOCK_PI ) |
4b35dc5d | 1425 | The caller does not own the lock represented by the futex word. |
76f347ba | 1426 | .TP |
0b0e4934 MK |
1427 | .BR ESRCH |
1428 | .RB ( FUTEX_LOCK_PI , | |
9732dd8b MK |
1429 | .BR FUTEX_TRYLOCK_PI , |
1430 | .BR FUTEX_CMP_REQUEUE_PI ) | |
0b0e4934 MK |
1431 | .\" FIXME I reworded the following sentence a bit differently from |
1432 | .\" tglx's formulation. Is it okay? | |
4b35dc5d | 1433 | The thread ID in the futex word at |
0b0e4934 MK |
1434 | .I uaddr |
1435 | does not exist. | |
1436 | .TP | |
360f773c MK |
1437 | .BR ESRCH |
1438 | .RB ( FUTEX_CMP_REQUEUE_PI ) | |
1439 | .\" FIXME I reworded the following sentence a bit differently from | |
1440 | .\" tglx's formulation. Is it okay? | |
4b35dc5d | 1441 | The thread ID in the futex word at |
360f773c MK |
1442 | .I uaddr2 |
1443 | does not exist. | |
1444 | .TP | |
9f6c40c0 | 1445 | .B ETIMEDOUT |
4d85047f MK |
1446 | The operation in |
1447 | .IR futex_op | |
1448 | employed the timeout specified in | |
1449 | .IR timeout , | |
1450 | and the timeout expired before the operation completed. | |
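As a minimal sketch of how a caller might tell apart some of the outcomes listed above for a
.BR FUTEX_WAIT
with a relative timeout, consider the helper below.
The name futex_wait_ms() and its return-value convention are assumptions made purely for this illustration; they are not part of any kernel or library API.
Restarting with the full timeout after
.B EINTR
is a simplification, and a return value of 0 may also result from a spurious wake-up, so a real caller would recheck the futex word.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical helper (not a real API): block while *uaddr == expected,
   for at most 'ms' milliseconds (a relative timeout).

   Returns: 0  woken by FUTEX_WAKE (or spuriously)
            1  the timeout expired (ETIMEDOUT)
            2  *uaddr did not contain 'expected' (EAGAIN)
           -1  any other error; errno is left set by the kernel */
static int
futex_wait_ms(int *uaddr, int expected, long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;

    for (;;) {
        long s = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                         &ts, NULL, 0);
        if (s == 0)
            return 0;
        if (errno == ETIMEDOUT)
            return 1;
        if (errno == EAGAIN)
            return 2;
        if (errno != EINTR)
            return -1;

        /* Interrupted by a signal handler: wait again.
           Simplification: the full timeout is restarted. */
    }
}
```

A caller would typically invoke futex_wait_ms() inside a loop that rechecks the futex word after every return, treating only the value 1 as a deadline miss.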
70b06b90 MK |
1451 | .\" |
1452 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1453 | .\" | |
47297adb | 1454 | .SH VERSIONS |
a1d5f77c | 1455 | .PP |
81c9d87e MK |
1456 | Futexes were first made available in a stable kernel release |
1457 | with Linux 2.6.0. | |
1458 | ||
a1d5f77c MK |
1459 | Initial futex support was merged in Linux 2.5.7 but with different semantics |
1460 | from those described above. | |
52dee70e | 1461 | A four-argument system call with the semantics |
fd3fa7ef | 1462 | described in this page was introduced in Linux 2.5.40. |
11b520ed | 1463 | In Linux 2.5.70, a fifth argument
a1d5f77c | 1464 | was added. |
11b520ed | 1465 | In Linux 2.6.7, a sixth argument was added; this was messy, especially
a1d5f77c | 1466 | on the s390 architecture. |
47297adb | 1467 | .SH CONFORMING TO |
8382f16d | 1468 | This system call is Linux-specific. |
47297adb | 1469 | .SH NOTES |
baf0f1f4 MK |
1470 | Glibc does not provide a wrapper for this system call; call it using |
1471 | .BR syscall (2). | |
4b35dc5d TR |
1472 | .\" TODO FIXME Above, we cite this section and claim it contains details on |
1473 | .\" the synchronization semantics; add the C11 equivalents here (or whatever | |
1474 | .\" we find consensus for). | |
305cc415 MK |
1475 | .\" |
1476 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1477 | .\" | |
1478 | .SH EXAMPLE | |
1479 | .\" FIXME Is it worth having an example program? | |
1480 | .\" FIXME Anything obviously broken in the example program? | |
1481 | .\" | |
77da5feb | 1482 | The program below demonstrates use of futexes in a program |
305cc415 MK |
1483 | where parent and child use a pair of futexes located inside a |
1484 | shared anonymous mapping to synchronize access to a shared resource: | |
1485 | the terminal. | |
1486 | The two processes each write | |
1487 | .IR nloops | |
1488 | (a command-line argument that defaults to 5 if omitted) | |
1489 | messages to the terminal and employ a synchronization protocol | |
1490 | that ensures that they alternate in writing messages. | |
1491 | Upon running this program we see output such as the following: | |
1492 | ||
1493 | .in +4n | |
1494 | .nf | |
1495 | $ \fB./futex_demo\fP | |
1496 | Parent (18534) 0 | |
1497 | Child (18535) 0 | |
1498 | Parent (18534) 1 | |
1499 | Child (18535) 1 | |
1500 | Parent (18534) 2 | |
1501 | Child (18535) 2 | |
1502 | Parent (18534) 3 | |
1503 | Child (18535) 3 | |
1504 | Parent (18534) 4 | |
1505 | Child (18535) 4 | |
1506 | .fi | |
1507 | .in | |
1508 | .SS Program source | |
1509 | \& | |
1510 | .nf | |
1511 | /* futex_demo.c | |
1512 | ||
1513 | Usage: futex_demo [nloops] | |
1514 | (Default: 5) | |
1515 | ||
1516 | Demonstrate the use of futexes in a program where parent and child | |
1517 | use a pair of futexes located inside a shared anonymous mapping to | |
1518 | synchronize access to a shared resource: the terminal. The two | |
1519 | processes each write \(aqnloops\(aq messages to the terminal and employ | |
1520 | a synchronization protocol that ensures that they alternate in | |
1521 | writing messages. | |
1522 | */ | |
1523 | #define _GNU_SOURCE | |
1524 | #include <stdio.h> | |
1525 | #include <errno.h> | |
1526 | #include <stdlib.h> | |
1527 | #include <unistd.h> | |
1528 | #include <sys/wait.h> | |
1529 | #include <sys/mman.h> | |
1530 | #include <sys/syscall.h> | |
1531 | #include <linux/futex.h> | |
1532 | #include <sys/time.h> | |
1533 | ||
1534 | #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \\ | |
1535 | } while (0) | |
1536 | ||
1537 | static int *futex1, *futex2, *iaddr; | |
1538 | ||
1539 | static int | |
1540 | futex(int *uaddr, int futex_op, int val, | |
1541 | const struct timespec *timeout, int *uaddr2, int val3) | |
1542 | { | |
1543 | return syscall(SYS_futex, uaddr, futex_op, val, | |
1544 | timeout, uaddr2, val3); | |
1545 | } | |
1546 | ||
1547 | /* Acquire the futex pointed to by \(aqfutexp\(aq: wait for its value to | |
1548 | become 1, and then set the value to 0. */ | |
1549 | ||
1550 | static void | |
1551 | fwait(int *futexp) | |
1552 | { | |
1553 | int s; | |
1554 | ||
1555 | /* __sync_bool_compare_and_swap(ptr, oldval, newval) is a gcc | |
1556 | built\-in function. It atomically performs the equivalent of: | |
1557 | ||
1558 | if (*ptr == oldval) | |
1559 | *ptr = newval; | |
1560 | ||
1561 | It returns true if the test yielded true and *ptr was updated. | |
1562 | The alternative here would be to employ the equivalent atomic | |
1563 | machine\-language instructions. For further information, see | |
1564 | the GCC Manual. */ | |
1565 | ||
1566 | /* Maybe the futex is already available: */ | |
1567 | ||
1568 | if (__sync_bool_compare_and_swap(futexp, 1, 0)) | |
1569 | return; | |
1570 | ||
1571 | /* No; we must wait for the futex value to be changed */ | |
1572 | ||
1573 | while (1) { | |
1574 | s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); | |
1575 | if (s == \-1 && errno != EAGAIN) | |
1576 | errExit("futex\-FUTEX_WAIT"); | |
1577 | ||
1578 | /* Is the futex now available? */ | |
1579 | ||
1580 | if (__sync_bool_compare_and_swap(futexp, 1, 0)) | |
1581 | break; /* Yes */ | |
1582 | ||
1583 | /* Futex is still not available; wait again */ | |
1584 | } | |
1585 | } | |
1586 | ||
1587 | /* Release the futex pointed to by \(aqfutexp\(aq: if the futex currently | |
1588 | has the value 0, set its value to 1 and then wake any futex waiters, | |
1589 | so that if the peer is blocked in fwait(), it can proceed. */ | |
1590 | ||
1591 | static void | |
1592 | fpost(int *futexp) | |
1593 | { | |
1594 | int s; | |
1595 | ||
1596 | /* __sync_bool_compare_and_swap() was described in comments above */ | |
1597 | ||
1598 | if (__sync_bool_compare_and_swap(futexp, 0, 1)) { | |
1599 | ||
1600 | s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); | |
1601 | if (s == \-1) | |
1602 | errExit("futex\-FUTEX_WAKE"); | |
1603 | } | |
1604 | } | |
1605 | ||
1606 | int | |
1607 | main(int argc, char *argv[]) | |
1608 | { | |
1609 | pid_t childPid; | |
1610 | int j, nloops; | |
1611 | ||
1612 | setbuf(stdout, NULL); | |
1613 | ||
1614 | nloops = (argc > 1) ? atoi(argv[1]) : 5; | |
1615 | ||
1616 | /* Create a shared anonymous mapping that will hold the futexes. | |
1617 | Since the futexes are being shared between processes, we | |
1618 | subsequently use the "shared" futex operations (i.e., not the | |
1619 | ones suffixed "_PRIVATE") */ | |
1620 | ||
1621 | iaddr = mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE, | |
1622 | MAP_ANONYMOUS | MAP_SHARED, \-1, 0); | |
1623 | if (iaddr == MAP_FAILED) | |
1624 | errExit("mmap"); | |
1625 | ||
1626 | futex1 = &iaddr[0]; | |
1627 | futex2 = &iaddr[1]; | |
1628 | ||
1629 | *futex1 = 0; /* State: unavailable */ | |
1630 | *futex2 = 1; /* State: available */ | |
1631 | ||
1632 | /* Create a child process that inherits the shared anonymous | |
1633 | mapping */ | |
1634 | ||
1635 | childPid = fork(); | |
1636 | if (childPid == \-1) | |
1637 | errExit("fork"); | |
1638 | ||
1639 | if (childPid == 0) { /* Child */ | |
1640 | for (j = 0; j < nloops; j++) { | |
1641 | fwait(futex1); | |
1642 | printf("Child (%ld) %d\\n", (long) getpid(), j); | |
1643 | fpost(futex2); | |
1644 | } | |
1645 | ||
1646 | exit(EXIT_SUCCESS); | |
1647 | } | |
1648 | ||
1649 | /* Parent falls through to here */ | |
1650 | ||
1651 | for (j = 0; j < nloops; j++) { | |
1652 | fwait(futex2); | |
1653 | printf("Parent (%ld) %d\\n", (long) getpid(), j); | |
1654 | fpost(futex1); | |
1655 | } | |
1656 | ||
1657 | wait(NULL); | |
1658 | ||
1659 | exit(EXIT_SUCCESS); | |
1660 | } | |
1661 | .fi | |
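The comments in fwait() mention that there are alternatives to the gcc built-in used above.
As a hedged sketch of one such alternative, the acquire and release steps could be written with the C11
.I <stdatomic.h>
interface instead.
The names fwait_c11() and fpost_c11() are hypothetical, and the code assumes that
.I atomic_int
has the same size and representation as
.I int
(which holds for GCC on Linux, but is not guaranteed by the C standard).
Unlike the program above, these functions report errors through their return value rather than terminating.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Hypothetical C11 variant of fwait(): wait for the futex value to
   become 1, then atomically set it to 0.
   Returns 0 on success, -1 on an unexpected futex() error. */
static int
fwait_c11(atomic_int *futexp)
{
    for (;;) {
        int expected = 1;

        /* Succeeds only if *futexp was 1; it is then set to 0 */
        if (atomic_compare_exchange_strong(futexp, &expected, 0))
            return 0;

        /* Not available: sleep for as long as the value remains 0.
           EAGAIN (value changed) and EINTR (signal) mean "retry". */
        long s = syscall(SYS_futex, futexp, FUTEX_WAIT, 0,
                         NULL, NULL, 0);
        if (s == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }
}

/* Hypothetical C11 variant of fpost(): if the futex value is 0,
   set it to 1 and wake one waiter.
   Returns 0 on success, -1 if FUTEX_WAKE failed. */
static int
fpost_c11(atomic_int *futexp)
{
    int expected = 0;

    if (atomic_compare_exchange_strong(futexp, &expected, 1)) {
        long s = syscall(SYS_futex, futexp, FUTEX_WAKE, 1,
                         NULL, NULL, 0);
        if (s == -1)
            return -1;
    }
    return 0;
}
```

With both futex words declared as
.I atomic_int
inside the shared mapping, these functions could stand in for fwait() and fpost() in the loops of main().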
47297adb | 1662 | .SH SEE ALSO |
4c222281 | 1663 | .ad l |
9913033c | 1664 | .BR get_robust_list (2), |
d806bc05 | 1665 | .BR restart_syscall (2), |
14d8dd3b | 1666 | .BR futex (7) |
fea681da | 1667 | .PP |
f5ad572f MK |
1668 | The following kernel source files: |
1669 | .IP * 2 | |
1670 | .I Documentation/pi-futex.txt | |
1671 | .IP * | |
1672 | .I Documentation/futex-requeue-pi.txt | |
1673 | .IP * | |
1674 | .I Documentation/locking/rt-mutex.txt | |
1675 | .IP * | |
1676 | .I Documentation/locking/rt-mutex-design.txt | |
8fe019c7 MK |
1677 | .IP * |
1678 | .I Documentation/robust-futex-ABI.txt | |
43b99089 | 1679 | .PP |
4c222281 | 1680 | Franke, H., Russell, R., and Kirkwood, M., 2002.
52087dd3 | 1681 | \fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP |
4c222281 | 1682 | (from proceedings of the Ottawa Linux Symposium 2002), |
9b936e9e | 1683 | .br |
608bf950 SK |
1684 | .UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002-pages-479-495.pdf |
1685 | .UE | |
f42eb21b | 1686 | |
4c222281 | 1687 | Hart, D., 2009. \fIA futex overview and update\fP, |
2ed26199 MK |
1688 | .UR http://lwn.net/Articles/360699/ |
1689 | .UE | |
1690 | ||
4c222281 | 1691 | Hart, D. and Guniguntala, D., 2009. |
0483b6cc | 1692 | \fIRequeue-PI: Making Glibc Condvars PI-Aware\fP |
4c222281 | 1693 | (from proceedings of the 2009 Real-Time Linux Workshop), |
0483b6cc MK |
1694 | .UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf |
1695 | .UE | |
1696 | ||
4c222281 | 1697 | Drepper, U., 2011. \fIFutexes Are Tricky\fP, |
f42eb21b MK |
1698 | .UR http://www.akkadia.org/drepper/futex.pdf |
1699 | .UE | |
9b936e9e MK |
1700 | .PP |
1701 | Futex example library, futex-*.tar.bz2 at | |
1702 | .br | |
a605264d | 1703 | .UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/ |
608bf950 | 1704 | .UE |
34f14794 MK |
1705 | .\" |
1706 | .\" FIXME Are there any other resources that should be listed | |
1707 | .\" in the SEE ALSO section? | |
4b35dc5d TR |
1708 | .\" FIXME We should probably refer to the glibc code here, in particular the |
1709 | .\" glibc-internal futex wrapper functions that are WIP, and the | |
1710 | .\" generic pthread_mutex_t and perhaps condvar implementations. | |
1711 |