]> git.ipfire.org Git - thirdparty/man-pages.git/blob - man2/futex.2
sched_setattr.2: tfix
[thirdparty/man-pages.git] / man2 / futex.2
1 .\" Page by b.hubert
2 .\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de>
3 .\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com>
4 .\"
5 .\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
6 .\" may be freely modified and distributed
7 .\" %%%LICENSE_END
8 .\"
9 .\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com)
10 .\" added ERRORS section.
11 .\"
12 .\" Modified 2004-06-17 mtk
13 .\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
14 .\"
15 .\" FIXME Still to integrate are some points from Torvald Riegel's mail of
16 .\" 2015-01-23:
17 .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
18 .\"
19 .\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail
20 .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
21 .\"
22 .TH FUTEX 2 2020-02-09 "Linux" "Linux Programmer's Manual"
23 .SH NAME
24 futex \- fast user-space locking
25 .SH SYNOPSIS
26 .nf
27 .PP
28 .B "#include <linux/futex.h>"
29 .B "#include <sys/time.h>"
30 .PP
31 .BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
32 .BI " const struct timespec *" timeout , \
33 " \fR /* or: \fBuint32_t \fIval2\fP */
34 .BI " int *" uaddr2 ", int " val3 );
35 .fi
36 .PP
37 .IR Note :
38 There is no glibc wrapper for this system call; see NOTES.
39 .SH DESCRIPTION
40 .PP
41 The
42 .BR futex ()
43 system call provides a method for waiting until a certain condition becomes
44 true.
45 It is typically used as a blocking construct in the context of
46 shared-memory synchronization.
47 When using futexes, the majority of
48 the synchronization operations are performed in user space.
49 A user-space program employs the
50 .BR futex ()
51 system call only when it is likely that the program has to block for
52 a longer time until the condition becomes true.
53 Other
54 .BR futex ()
55 operations can be used to wake any processes or threads waiting
56 for a particular condition.
57 .PP
58 A futex is a 32-bit value\(emreferred to below as a
59 .IR "futex word" \(emwhose
60 address is supplied to the
61 .BR futex ()
62 system call.
63 (Futexes are 32 bits in size on all platforms, including 64-bit systems.)
64 All futex operations are governed by this value.
65 In order to share a futex between processes,
66 the futex is placed in a region of shared memory,
67 created using (for example)
68 .BR mmap (2)
69 or
70 .BR shmat (2).
71 (Thus, the futex word may have different
72 virtual addresses in different processes,
73 but these addresses all refer to the same location in physical memory.)
74 In a multithreaded program, it is sufficient to place the futex word
75 in a global variable shared by all threads.
76 .PP
77 When executing a futex operation that requests to block a thread,
78 the kernel will block only if the futex word has the value that the
79 calling thread supplied (as one of the arguments of the
80 .BR futex ()
81 call) as the expected value of the futex word.
82 The loading of the futex word's value,
83 the comparison of that value with the expected value,
84 and the actual blocking will happen atomically and will be totally ordered
85 with respect to concurrent operations performed by other threads
86 on the same futex word.
87 .\" Notes from Darren Hart (Dec 2015):
88 .\" Totally ordered with respect futex operations refers to semantics
89 .\" of the ACQUIRE/RELEASE operations and how they impact ordering of
90 .\" memory reads and writes. The kernel futex operations are protected
91 .\" by spinlocks, which ensure that all operations are serialized
92 .\" with respect to one another.
93 .\"
94 .\" This is a lot to attempt to define in this document. Perhaps a
95 .\" reference to linux/Documentation/memory-barriers.txt as a footnote
96 .\" would be sufficient? Or perhaps for this manual, "serialized" would
97 .\" be sufficient, with a footnote regarding "totally ordered" and a
98 .\" pointer to the memory-barrier documentation?
99 Thus, the futex word is used to connect the synchronization in user space
100 with the implementation of blocking by the kernel.
101 Analogously to an atomic
102 compare-and-exchange operation that potentially changes shared memory,
103 blocking via a futex is an atomic compare-and-block operation.
104 .\" FIXME(Torvald Riegel):
105 .\" Eventually we want to have some text in NOTES to satisfy
106 .\" the reference in the following sentence
107 .\" See NOTES for a detailed specification of
108 .\" the synchronization semantics.
109 .PP
110 One use of futexes is for implementing locks.
111 The state of the lock (i.e., acquired or not acquired)
112 can be represented as an atomically accessed flag in shared memory.
113 In the uncontended case,
114 a thread can access or modify the lock state with atomic instructions,
115 for example atomically changing it from not acquired to acquired
116 using an atomic compare-and-exchange instruction.
117 (Such instructions are performed entirely in user mode,
118 and the kernel maintains no information about the lock state.)
119 On the other hand, a thread may be unable to acquire a lock because
120 it is already acquired by another thread.
121 It then may pass the lock's flag as a futex word and the value
122 representing the acquired state as the expected value to a
123 .BR futex ()
124 wait operation.
125 This
126 .BR futex ()
127 operation will block if and only if the lock is still acquired
128 (i.e., the value in the futex word still matches the "acquired state").
129 When releasing the lock, a thread has to first reset the
130 lock state to not acquired and then execute a futex
131 operation that wakes threads blocked on the lock flag used as a futex word
132 (this can be further optimized to avoid unnecessary wake-ups).
133 See
134 .BR futex (7)
135 for more detail on how to use futexes.
136 .PP
137 Besides the basic wait and wake-up futex functionality, there are further
138 futex operations aimed at supporting more complex use cases.
139 .PP
140 Note that
141 no explicit initialization or destruction is necessary to use futexes;
142 the kernel maintains a futex
143 (i.e., the kernel-internal implementation artifact)
144 only while operations such as
145 .BR FUTEX_WAIT ,
146 described below, are being performed on a particular futex word.
147 .\"
148 .SS Arguments
149 The
150 .I uaddr
151 argument points to the futex word.
152 On all platforms, futexes are four-byte
153 integers that must be aligned on a four-byte boundary.
154 The operation to perform on the futex is specified in the
155 .I futex_op
156 argument;
157 .IR val
158 is a value whose meaning and purpose depends on
159 .IR futex_op .
160 .PP
161 The remaining arguments
162 .RI ( timeout ,
163 .IR uaddr2 ,
164 and
165 .IR val3 )
166 are required only for certain of the futex operations described below.
167 Where one of these arguments is not required, it is ignored.
168 .PP
169 For several blocking operations, the
170 .I timeout
171 argument is a pointer to a
172 .IR timespec
173 structure that specifies a timeout for the operation.
174 However, notwithstanding the prototype shown above, for some operations,
175 the least significant four bytes of this argument are instead
176 used as an integer whose meaning is determined by the operation.
177 For these operations, the kernel casts the
178 .I timeout
179 value first to
180 .IR "unsigned long",
181 then to
182 .IR uint32_t ,
183 and in the remainder of this page, this argument is referred to as
184 .I val2
185 when interpreted in this fashion.
186 .PP
187 Where it is required, the
188 .IR uaddr2
189 argument is a pointer to a second futex word that is employed
190 by the operation.
191 .PP
192 The interpretation of the final integer argument,
193 .IR val3 ,
194 depends on the operation.
195 .\"
196 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
197 .\"
198 .SS Futex operations
199 The
200 .I futex_op
201 argument consists of two parts:
202 a command that specifies the operation to be performed,
203 bit-wise ORed with zero or more options that
204 modify the behaviour of the operation.
205 The options that may be included in
206 .I futex_op
207 are as follows:
208 .TP
209 .BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)"
210 .\" commit 34f01cc1f512fa783302982776895c73714ebbc2
211 This option bit can be employed with all futex operations.
212 It tells the kernel that the futex is process-private and not shared
213 with another process (i.e., it is being used for synchronization
214 only between threads of the same process).
215 This allows the kernel to make some additional performance optimizations.
216 .\" I.e., It allows the kernel choose the fast path for validating
217 .\" the user-space address and avoids expensive VMA lookups,
218 .\" taking reference counts on file backing store, and so on.
219 .IP
220 As a convenience,
221 .IR <linux/futex.h>
222 defines a set of constants with the suffix
223 .BR _PRIVATE
224 that are equivalents of all of the operations listed below,
225 .\" except the obsolete FUTEX_FD, for which the "private" flag was
226 .\" meaningless
227 but with the
228 .BR FUTEX_PRIVATE_FLAG
229 ORed into the constant value.
230 Thus, there are
231 .BR FUTEX_WAIT_PRIVATE ,
232 .BR FUTEX_WAKE_PRIVATE ,
233 and so on.
234 .TP
235 .BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)"
236 .\" commit 1acdac104668a0834cfa267de9946fac7764d486
237 This option bit can be employed only with the
238 .BR FUTEX_WAIT_BITSET ,
239 .BR FUTEX_WAIT_REQUEUE_PI ,
240 and
241 (since Linux 4.5)
242 .\" commit 337f13046ff03717a9e99675284a817527440a49
243 .BR FUTEX_WAIT
244 operations.
245 .IP
246 If this option is set, the kernel measures the
247 .I timeout
248 against the
249 .BR CLOCK_REALTIME
250 clock.
251 .IP
252 If this option is not set, the kernel measures the
253 .I timeout
254 against the
255 .BR CLOCK_MONOTONIC
256 clock.
257 .PP
258 The operation specified in
259 .I futex_op
260 is one of the following:
261 .\"
262 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
263 .\"
264 .TP
265 .BR FUTEX_WAIT " (since Linux 2.6.0)"
266 .\" Strictly speaking, since some time in 2.5.x
267 This operation tests that the value at the
268 futex word pointed to by the address
269 .I uaddr
270 still contains the expected value
271 .IR val ,
272 and if so, then sleeps waiting for a
273 .B FUTEX_WAKE
274 operation on the futex word.
275 The load of the value of the futex word is an atomic memory
276 access (i.e., using atomic machine instructions of the respective
277 architecture).
278 This load, the comparison with the expected value, and
279 starting to sleep are performed atomically
280 .\" FIXME: Torvald, I think we may need to add some explanation of
281 .\" "totally ordered" here.
282 and totally ordered
283 with respect to other futex operations on the same futex word.
284 If the thread starts to sleep,
285 it is considered a waiter on this futex word.
286 If the futex value does not match
287 .IR val ,
288 then the call fails immediately with the error
289 .BR EAGAIN .
290 .IP
291 The purpose of the comparison with the expected value is to prevent lost
292 wake-ups.
293 If another thread changed the value of the futex word after the
294 calling thread decided to block based on the prior value,
295 and if the other thread executed a
296 .BR FUTEX_WAKE
297 operation (or similar wake-up) after the value change and before this
298 .BR FUTEX_WAIT
299 operation, then the calling thread will observe the
300 value change and will not start to sleep.
301 .IP
302 If the
303 .I timeout
304 is not NULL, the structure it points to specifies a
305 timeout for the wait.
306 (This interval will be rounded up to the system clock granularity,
307 and is guaranteed not to expire early.)
308 The timeout is by default measured according to the
309 .BR CLOCK_MONOTONIC
310 clock, but, since Linux 4.5, the
311 .BR CLOCK_REALTIME
312 clock can be selected by specifying
313 .BR FUTEX_CLOCK_REALTIME
314 in
315 .IR futex_op .
316 If
317 .I timeout
318 is NULL, the call blocks indefinitely.
319 .IP
320 .IR Note :
321 for
322 .BR FUTEX_WAIT ,
323 .IR timeout
324 is interpreted as a
325 .IR relative
326 value.
327 This differs from other futex operations, where
328 .I timeout
329 is interpreted as an absolute value.
330 To obtain the equivalent of
331 .BR FUTEX_WAIT
332 with an absolute timeout, employ
333 .BR FUTEX_WAIT_BITSET
334 with
335 .IR val3
336 specified as
337 .BR FUTEX_BITSET_MATCH_ANY .
338 .IP
339 The arguments
340 .I uaddr2
341 and
342 .I val3
343 are ignored.
344 .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a
345 .\" different example.
346 .\"
347 .\" For
348 .\" .BR futex (7),
349 .\" this call is executed if decrementing the count gave a negative value
350 .\" (indicating contention),
351 .\" and will sleep until another process or thread releases
352 .\" the futex and executes the
353 .\" .B FUTEX_WAKE
354 .\" operation.
355 .\"
356 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
357 .\"
358 .TP
359 .BR FUTEX_WAKE " (since Linux 2.6.0)"
360 .\" Strictly speaking, since Linux 2.5.x
361 This operation wakes at most
362 .I val
363 of the waiters that are waiting (e.g., inside
364 .BR FUTEX_WAIT )
365 on the futex word at the address
366 .IR uaddr .
367 Most commonly,
368 .I val
369 is specified as either 1 (wake up a single waiter) or
370 .BR INT_MAX
371 (wake up all waiters).
372 No guarantee is provided about which waiters are awoken
373 (e.g., a waiter with a higher scheduling priority is not guaranteed
374 to be awoken in preference to a waiter with a lower priority).
375 .IP
376 The arguments
377 .IR timeout ,
378 .IR uaddr2 ,
379 and
380 .I val3
381 are ignored.
382 .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to
383 .\" a different example.
384 .\"
385 .\" For
386 .\" .BR futex (7),
387 .\" this is executed if incrementing the count showed that
388 .\" there were waiters,
389 .\" once the futex value has been set to 1
390 .\" (indicating that it is available).
391 .\"
392 .\" How does "incrementing the count show that there were waiters"?
393 .\"
394 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
395 .\"
396 .TP
397 .BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)"
398 .\" Strictly speaking, from Linux 2.5.x to 2.6.25
399 This operation creates a file descriptor that is associated with
400 the futex at
401 .IR uaddr .
402 The caller must close the returned file descriptor after use.
403 When another process or thread performs a
404 .BR FUTEX_WAKE
405 on the futex word, the file descriptor indicates as being readable with
406 .BR select (2),
407 .BR poll (2),
408 and
409 .BR epoll (7)
410 .IP
411 The file descriptor can be used to obtain asynchronous notifications: if
412 .I val
413 is nonzero, then, when another process or thread executes a
414 .BR FUTEX_WAKE ,
415 the caller will receive the signal number that was passed in
416 .IR val .
417 .IP
418 The arguments
419 .IR timeout ,
420 .I uaddr2
421 and
422 .I val3
423 are ignored.
424 .IP
425 Because it was inherently racy,
426 .B FUTEX_FD
427 has been removed
428 .\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80
429 from Linux 2.6.26 onward.
430 .\"
431 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
432 .\"
433 .TP
434 .BR FUTEX_REQUEUE " (since Linux 2.6.0)"
435 This operation performs the same task as
436 .BR FUTEX_CMP_REQUEUE
437 (see below), except that no check is made using the value in
438 .IR val3 .
439 (The argument
440 .I val3
441 is ignored.)
442 .\"
443 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
444 .\"
445 .TP
446 .BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
447 This operation first checks whether the location
448 .I uaddr
449 still contains the value
450 .IR val3 .
451 If not, the operation fails with the error
452 .BR EAGAIN .
453 Otherwise, the operation wakes up a maximum of
454 .I val
455 waiters that are waiting on the futex at
456 .IR uaddr .
457 If there are more than
458 .I val
459 waiters, then the remaining waiters are removed
460 from the wait queue of the source futex at
461 .I uaddr
462 and added to the wait queue of the target futex at
463 .IR uaddr2 .
464 The
465 .I val2
466 argument specifies an upper limit on the number of waiters
467 that are requeued to the futex at
468 .IR uaddr2 .
469 .IP
470 .\" FIXME(Torvald) Is the following correct? Or is just the decision
471 .\" which threads to wake or requeue part of the atomic operation?
472 The load from
473 .I uaddr
474 is an atomic memory access (i.e., using atomic machine instructions of
475 the respective architecture).
476 This load, the comparison with
477 .IR val3 ,
478 and the requeueing of any waiters are performed atomically and totally
479 ordered with respect to other operations on the same futex word.
480 .\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ###
481 .\" The operation is serialized with respect to operations on both
482 .\" source and target futex. No other waiter can enqueue itself
483 .\" for waiting and no other waiter can dequeue itself because of
484 .\" a timeout or signal.
485 .IP
486 Typical values to specify for
487 .I val
488 are 0 or 1.
489 (Specifying
490 .BR INT_MAX
491 is not useful, because it would make the
492 .BR FUTEX_CMP_REQUEUE
493 operation equivalent to
494 .BR FUTEX_WAKE .)
495 The limit value specified via
496 .I val2
497 is typically either 1 or
498 .BR INT_MAX .
499 (Specifying the argument as 0 is not useful, because it would make the
500 .BR FUTEX_CMP_REQUEUE
501 operation equivalent to
502 .BR FUTEX_WAIT .)
503 .IP
504 The
505 .B FUTEX_CMP_REQUEUE
506 operation was added as a replacement for the earlier
507 .BR FUTEX_REQUEUE .
508 The difference is that the check of the value at
509 .I uaddr
510 can be used to ensure that requeueing happens only under certain
511 conditions, which allows race conditions to be avoided in certain use cases.
512 .\" But, as Rich Felker points out, there remain valid use cases for
513 .\" FUTEX_REQUEUE, for example, when the calling thread is requeuing
514 .\" the target(s) to a lock that the calling thread owns
515 .\" From: Rich Felker <dalias@libc.org>
516 .\" Date: Wed, 29 Oct 2014 22:43:17 -0400
517 .\" To: Darren Hart <dvhart@infradead.org>
518 .\" CC: libc-alpha@sourceware.org, ...
519 .\" Subject: Re: Add futex wrapper to glibc?
520 .IP
521 Both
522 .BR FUTEX_REQUEUE
523 and
524 .BR FUTEX_CMP_REQUEUE
525 can be used to avoid "thundering herd" wake-ups that could occur when using
526 .B FUTEX_WAKE
527 in cases where all of the waiters that are woken need to acquire
528 another futex.
529 Consider the following scenario,
530 where multiple waiter threads are waiting on B,
531 a wait queue implemented using a futex:
532 .IP
533 .in +4n
534 .EX
535 lock(A)
536 while (!check_value(V)) {
537 unlock(A);
538 block_on(B);
539 lock(A);
540 };
541 unlock(A);
542 .EE
543 .in
544 .IP
545 If a waker thread used
546 .BR FUTEX_WAKE ,
547 then all waiters waiting on B would be woken up,
548 and they would all try to acquire lock A.
549 However, waking all of the threads in this manner would be pointless because
550 all except one of the threads would immediately block on lock A again.
551 By contrast, a requeue operation wakes just one waiter and moves
552 the other waiters to lock A,
553 and when the woken waiter unlocks A then the next waiter can proceed.
554 .\"
555 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
556 .\"
557 .TP
558 .BR FUTEX_WAKE_OP " (since Linux 2.6.14)"
559 .\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
560 .\" Author: Jakub Jelinek <jakub@redhat.com>
561 .\" Date: Tue Sep 6 15:16:25 2005 -0700
562 .\" FIXME. (Torvald) The glibc condvar implementation is currently being
563 .\" revised (e.g., to not use an internal lock anymore).
564 .\" It is probably more future-proof to remove this paragraph.
565 .\" [Torvald, do you have an update here?]
566 This operation was added to support some user-space use cases
567 where more than one futex must be handled at the same time.
568 The most notable example is the implementation of
569 .BR pthread_cond_signal (3),
570 which requires operations on two futexes,
571 the one used to implement the mutex and the one used in the implementation
572 of the wait queue associated with the condition variable.
573 .BR FUTEX_WAKE_OP
574 allows such cases to be implemented without leading to
575 high rates of contention and context switching.
576 .IP
577 The
578 .BR FUTEX_WAKE_OP
579 operation is equivalent to executing the following code atomically
580 and totally ordered with respect to other futex operations on
581 any of the two supplied futex words:
582 .IP
583 .in +4n
584 .EX
585 int oldval = *(int *) uaddr2;
586 *(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
587 futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
588 if (oldval \fIcmp\fP \fIcmparg\fP)
589 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
590 .EE
591 .in
592 .IP
593 In other words,
594 .BR FUTEX_WAKE_OP
595 does the following:
596 .RS
597 .IP * 3
598 saves the original value of the futex word at
599 .IR uaddr2
600 and performs an operation to modify the value of the futex at
601 .IR uaddr2 ;
602 this is an atomic read-modify-write memory access (i.e., using atomic
603 machine instructions of the respective architecture)
604 .IP *
605 wakes up a maximum of
606 .I val
607 waiters on the futex for the futex word at
608 .IR uaddr ;
609 and
610 .IP *
611 dependent on the results of a test of the original value of the
612 futex word at
613 .IR uaddr2 ,
614 wakes up a maximum of
615 .I val2
616 waiters on the futex for the futex word at
617 .IR uaddr2 .
618 .RE
619 .IP
620 The operation and comparison that are to be performed are encoded
621 in the bits of the argument
622 .IR val3 .
623 Pictorially, the encoding is:
624 .IP
625 .in +8n
626 .EX
627 +---+---+-----------+-----------+
628 |op |cmp| oparg | cmparg |
629 +---+---+-----------+-----------+
630 4 4 12 12 <== # of bits
631 .EE
632 .in
633 .IP
634 Expressed in code, the encoding is:
635 .IP
636 .in +4n
637 .EX
638 #define FUTEX_OP(op, oparg, cmp, cmparg) \e
639 (((op & 0xf) << 28) | \e
640 ((cmp & 0xf) << 24) | \e
641 ((oparg & 0xfff) << 12) | \e
642 (cmparg & 0xfff))
643 .EE
644 .in
645 .IP
646 In the above,
647 .I op
648 and
649 .I cmp
650 are each one of the codes listed below.
651 The
652 .I oparg
653 and
654 .I cmparg
655 components are literal numeric values, except as noted below.
656 .IP
657 The
658 .I op
659 component has one of the following values:
660 .IP
661 .in +4n
662 .EX
663 FUTEX_OP_SET 0 /* uaddr2 = oparg; */
664 FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
665 FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
666 FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
667 FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
668 .EE
669 .in
670 .IP
671 In addition, bit-wise ORing the following value into
672 .I op
673 causes
674 .IR "(1\ <<\ oparg)"
675 to be used as the operand:
676 .IP
677 .in +4n
678 .EX
679 FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
680 .EE
681 .in
682 .IP
683 The
684 .I cmp
685 field is one of the following:
686 .IP
687 .in +4n
688 .EX
689 FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
690 FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
691 FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
692 FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
693 FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
694 FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
695 .EE
696 .in
697 .IP
698 The return value of
699 .BR FUTEX_WAKE_OP
700 is the sum of the number of waiters woken on the futex
701 .IR uaddr
702 plus the number of waiters woken on the futex
703 .IR uaddr2 .
704 .\"
705 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
706 .\"
707 .TP
708 .BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)"
709 .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
710 This operation is like
711 .BR FUTEX_WAIT
712 except that
713 .I val3
714 is used to provide a 32-bit bit mask to the kernel.
715 This bit mask, in which at least one bit must be set,
716 is stored in the kernel-internal state of the waiter.
717 See the description of
718 .BR FUTEX_WAKE_BITSET
719 for further details.
720 .IP
721 If
722 .I timeout
723 is not NULL, the structure it points to specifies
724 an absolute timeout for the wait operation.
725 If
726 .I timeout
727 is NULL, the operation can block indefinitely.
728 .IP
729 .IP
730 The
731 .I uaddr2
732 argument is ignored.
733 .\"
734 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
735 .\"
736 .TP
737 .BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)"
738 .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
739 This operation is the same as
740 .BR FUTEX_WAKE
741 except that the
742 .I val3
743 argument is used to provide a 32-bit bit mask to the kernel.
744 This bit mask, in which at least one bit must be set,
745 is used to select which waiters should be woken up.
746 The selection is done by a bit-wise AND of the "wake" bit mask
747 (i.e., the value in
748 .IR val3 )
749 and the bit mask which is stored in the kernel-internal
750 state of the waiter (the "wait" bit mask that is set using
751 .BR FUTEX_WAIT_BITSET ).
752 All of the waiters for which the result of the AND is nonzero are woken up;
753 the remaining waiters are left sleeping.
754 .IP
755 The effect of
756 .BR FUTEX_WAIT_BITSET
757 and
758 .BR FUTEX_WAKE_BITSET
759 is to allow selective wake-ups among multiple waiters that are blocked
760 on the same futex.
761 However, note that, depending on the use case,
762 employing this bit-mask multiplexing feature on a
763 futex can be less efficient than simply using multiple futexes,
764 because employing bit-mask multiplexing requires the kernel
765 to check all waiters on a futex,
766 including those that are not interested in being woken up
767 (i.e., they do not have the relevant bit set in their "wait" bit mask).
768 .\" According to http://locklessinc.com/articles/futex_cheat_sheet/:
769 .\"
770 .\" "The original reason for the addition of these extensions
771 .\" was to improve the performance of pthread read-write locks
772 .\" in glibc. However, the pthreads library no longer uses the
773 .\" same locking algorithm, and these extensions are not used
774 .\" without the bitset parameter being all ones.
775 .\"
776 .\" The page goes on to note that the FUTEX_WAIT_BITSET operation
777 .\" is nevertheless used (with a bit mask of all ones) in order to
778 .\" obtain the absolute timeout functionality that is useful
779 .\" for efficiently implementing Pthreads APIs (which use absolute
780 .\" timeouts); FUTEX_WAIT provides only relative timeouts.
781 .IP
782 The constant
783 .BR FUTEX_BITSET_MATCH_ANY ,
784 which corresponds to all 32 bits set in the bit mask, can be used as the
785 .I val3
786 argument for
787 .BR FUTEX_WAIT_BITSET
788 and
789 .BR FUTEX_WAKE_BITSET .
790 Other than differences in the handling of the
791 .I timeout
792 argument, the
793 .BR FUTEX_WAIT
794 operation is equivalent to
795 .BR FUTEX_WAIT_BITSET
796 with
797 .IR val3
798 specified as
799 .BR FUTEX_BITSET_MATCH_ANY ;
800 that is, allow a wake-up by any waker.
801 The
802 .BR FUTEX_WAKE
803 operation is equivalent to
804 .BR FUTEX_WAKE_BITSET
805 with
806 .IR val3
807 specified as
808 .BR FUTEX_BITSET_MATCH_ANY ;
809 that is, wake up any waiter(s).
810 .IP
811 The
812 .I uaddr2
813 and
814 .I timeout
815 arguments are ignored.
816 .\"
817 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
818 .\"
819 .SS Priority-inheritance futexes
820 Linux supports priority-inheritance (PI) futexes in order to handle
821 priority-inversion problems that can be encountered with
822 normal futex locks.
823 Priority inversion is the problem that occurs when a high-priority
824 task is blocked waiting to acquire a lock held by a low-priority task,
825 while tasks at an intermediate priority continuously preempt
826 the low-priority task from the CPU.
827 Consequently, the low-priority task makes no progress toward
828 releasing the lock, and the high-priority task remains blocked.
829 .PP
830 Priority inheritance is a mechanism for dealing with
831 the priority-inversion problem.
832 With this mechanism, when a high-priority task becomes blocked
833 by a lock held by a low-priority task,
834 the priority of the low-priority task is temporarily raised
835 to that of the high-priority task,
836 so that it is not preempted by any intermediate level tasks,
837 and can thus make progress toward releasing the lock.
838 To be effective, priority inheritance must be transitive,
839 meaning that if a high-priority task blocks on a lock
840 held by a lower-priority task that is itself blocked by a lock
841 held by another intermediate-priority task
842 (and so on, for chains of arbitrary length),
843 then both of those tasks
844 (or more generally, all of the tasks in a lock chain)
845 have their priorities raised to be the same as the high-priority task.
846 .PP
847 From a user-space perspective,
848 what makes a futex PI-aware is a policy agreement (described below)
849 between user space and the kernel about the value of the futex word,
850 coupled with the use of the PI-futex operations described below.
851 (Unlike the other futex operations described above,
852 the PI-futex operations are designed
853 for the implementation of very specific IPC mechanisms.)
854 .\"
855 .\" Quoting Darren Hart:
856 .\" These opcodes paired with the PI futex value policy (described below)
857 .\" defines a "futex" as PI aware. These were created very specifically
858 .\" in support of PI pthread_mutexes, so it makes a lot more sense to
859 .\" talk about a PI aware pthread_mutex, than a PI aware futex, since
860 .\" there is a lot of policy and scaffolding that has to be built up
861 .\" around it to use it properly (this is what a PI pthread_mutex is).
862 .PP
863 .\" mtk: The following text is drawn from the Hart/Guniguntala paper
864 .\" (listed in SEE ALSO), but I have reworded some pieces
865 .\" significantly.
866 .\"
867 The PI-futex operations described below differ from the other
868 futex operations in that they impose policy on the use of the value of the
869 futex word:
870 .IP * 3
871 If the lock is not acquired, the futex word's value shall be 0.
872 .IP *
873 If the lock is acquired, the futex word's value shall
874 be the thread ID (TID;
875 see
876 .BR gettid (2))
877 of the owning thread.
878 .IP *
879 If the lock is owned and there are threads contending for the lock,
880 then the
881 .B FUTEX_WAITERS
882 bit shall be set in the futex word's value; in other words, this value is:
883 .IP
884 FUTEX_WAITERS | TID
885 .IP
886 (Note that is invalid for a PI futex word to have no owner and
887 .BR FUTEX_WAITERS
888 set.)
889 .PP
890 With this policy in place,
891 a user-space application can acquire an unacquired
892 lock or release a lock using atomic instructions executed in user mode
893 (e.g., a compare-and-swap operation such as
894 .I cmpxchg
895 on the x86 architecture).
896 Acquiring a lock simply consists of using compare-and-swap to atomically
897 set the futex word's value to the caller's TID if its previous value was 0.
898 Releasing a lock requires using compare-and-swap to set the futex word's
899 value to 0 if the previous value was the expected TID.
900 .PP
901 If a futex is already acquired (i.e., has a nonzero value),
902 waiters must employ the
903 .B FUTEX_LOCK_PI
904 operation to acquire the lock.
905 If other threads are waiting for the lock, then the
906 .B FUTEX_WAITERS
907 bit is set in the futex value;
908 in this case, the lock owner must employ the
909 .B FUTEX_UNLOCK_PI
910 operation to release the lock.
911 .PP
912 In the cases where callers are forced into the kernel
913 (i.e., required to perform a
914 .BR futex ()
915 call),
916 they then deal directly with a so-called RT-mutex,
917 a kernel locking mechanism which implements the required
918 priority-inheritance semantics.
919 After the RT-mutex is acquired, the futex value is updated accordingly,
920 before the calling thread returns to user space.
921 .PP
922 It is important to note
923 .\" tglx (July 2015):
924 .\" If there are multiple waiters on a pi futex then a wake pi operation
925 .\" will wake the first waiter and hand over the lock to this waiter. This
926 .\" includes handing over the rtmutex which represents the futex in the
927 .\" kernel. The strict requirement is that the futex owner and the rtmutex
928 .\" owner must be the same, except for the update period which is
929 .\" serialized by the futex internal locking. That means the kernel must
930 .\" update the user-space value prior to returning to user space
931 that the kernel will update the futex word's value prior
932 to returning to user space.
933 (This prevents the possibility of the futex word's value ending
934 up in an invalid state, such as having an owner but the value being 0,
935 or having waiters but not having the
936 .B FUTEX_WAITERS
937 bit set.)
938 .PP
939 If a futex has an associated RT-mutex in the kernel
940 (i.e., there are blocked waiters)
941 and the owner of the futex/RT-mutex dies unexpectedly,
942 then the kernel cleans up the RT-mutex and hands it over to the next waiter.
943 This in turn requires that the user-space value is updated accordingly.
944 To indicate that this is required, the kernel sets the
945 .B FUTEX_OWNER_DIED
946 bit in the futex word along with the thread ID of the new owner.
947 User space can detect this situation via the presence of the
948 .B FUTEX_OWNER_DIED
949 bit and is then responsible for cleaning up the stale state left over by
950 the dead owner.
951 .\" tglx (July 2015):
952 .\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where
953 .\" the kernel has no state associated. This happens via the robust futex
954 .\" mechanism. In that case the futex value will be set to
955 .\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non
956 .\" PI futexes.
957 .PP
958 PI futexes are operated on by specifying one of the values listed below in
959 .IR futex_op .
960 Note that the PI futex operations must be used as paired operations
961 and are subject to some additional requirements:
962 .IP * 3
963 .B FUTEX_LOCK_PI
964 and
965 .B FUTEX_TRYLOCK_PI
966 pair with
967 .BR FUTEX_UNLOCK_PI .
968 .B FUTEX_UNLOCK_PI
969 must be called only on a futex owned by the calling thread,
970 as defined by the value policy, otherwise the error
971 .B EPERM
972 results.
973 .IP *
974 .B FUTEX_WAIT_REQUEUE_PI
975 pairs with
976 .BR FUTEX_CMP_REQUEUE_PI .
977 This must be performed from a non-PI futex to a distinct PI futex
978 (or the error
979 .B EINVAL
980 results).
981 Additionally,
982 .I val
983 (the number of waiters to be woken) must be 1
984 (or the error
985 .B EINVAL
986 results).
987 .PP
988 The PI futex operations are as follows:
989 .\"
990 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
991 .\"
992 .TP
993 .BR FUTEX_LOCK_PI " (since Linux 2.6.18)"
994 .\" commit c87e2837be82df479a6bae9f155c43516d2feebc
995 This operation is used after an attempt to acquire
996 the lock via an atomic user-mode instruction failed
997 because the futex word has a nonzero value\(emspecifically,
998 because it contained the (PID-namespace-specific) TID of the lock owner.
999 .IP
1000 The operation checks the value of the futex word at the address
1001 .IR uaddr .
1002 If the value is 0, then the kernel tries to atomically set
1003 the futex value to the caller's TID.
1004 If the futex word's value is nonzero,
1005 the kernel atomically sets the
1006 .B FUTEX_WAITERS
1007 bit, which signals the futex owner that it cannot unlock the futex in
1008 user space atomically by setting the futex value to 0.
1009 .\" tglx (July 2015):
1010 .\" The operation here is similar to the FUTEX_WAIT logic. When the user
1011 .\" space atomic acquire does not succeed because the futex value was non
1012 .\" zero, then the waiter goes into the kernel, takes the kernel internal
1013 .\" lock and retries the acquisition under the lock. If the acquisition
1014 .\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal
1015 .\" the lock owner that it needs to go into the kernel. Here is the pseudo
1016 .\" code:
1017 .\"
1018 .\" lock(kernel_lock);
1019 .\" retry:
1020 .\"
1021 .\" /*
1022 .\" * Owner might have unlocked in userspace before we
1023 .\" * were able to set the waiter bit.
1024 .\" */
1025 .\" if (atomic_acquire(futex) == SUCCESS) {
1026 .\" unlock(kernel_lock());
1027 .\" return 0;
1028 .\" }
1029 .\"
1030 .\" /*
1031 .\" * Owner might have unlocked after the above atomic_acquire()
1032 .\" * attempt.
1033 .\" */
1034 .\" if (atomic_set_waiters_bit(futex) != SUCCESS)
1035 .\" goto retry;
1036 .\"
1037 .\" queue_waiter();
1038 .\" unlock(kernel_lock);
1039 .\" block();
1040 .\"
1041 After that, the kernel:
1042 .RS
1043 .IP 1. 3
1044 Tries to find the thread which is associated with the owner TID.
1045 .IP 2.
1046 Creates or reuses kernel state on behalf of the owner.
1047 (If this is the first waiter, there is no kernel state for this
1048 futex, so kernel state is created by locking the RT-mutex
1049 and the futex owner is made the owner of the RT-mutex.
1050 If there are existing waiters, then the existing state is reused.)
1051 .IP 3.
1052 Attaches the waiter to the futex
1053 (i.e., the waiter is enqueued on the RT-mutex waiter list).
1054 .RE
1055 .IP
1056 If more than one waiter exists,
1057 the enqueueing of the waiter is in descending priority order.
1058 (For information on priority ordering, see the discussion of the
1059 .BR SCHED_DEADLINE ,
1060 .BR SCHED_FIFO ,
1061 and
1062 .BR SCHED_RR
1063 scheduling policies in
1064 .BR sched (7).)
1065 The owner inherits either the waiter's CPU bandwidth
1066 (if the waiter is scheduled under the
1067 .BR SCHED_DEADLINE
1068 policy) or the waiter's priority (if the waiter is scheduled under the
1069 .BR SCHED_RR
1070 or
1071 .BR SCHED_FIFO
1072 policy).
1073 .\" August 2015:
1074 .\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL)
1075 .\" processes, does the nice value come into play also?
1076 .\"
1077 .\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order
1078 This inheritance follows the lock chain in the case of nested locking
1079 .\" (i.e., task 1 blocks on lock A, held by task 2,
1080 .\" while task 2 blocks on lock B, held by task 3)
1081 and performs deadlock detection.
1082 .IP
1083 The
1084 .I timeout
1085 argument provides a timeout for the lock attempt.
1086 If
1087 .I timeout
1088 is not NULL, the structure it points to specifies
1089 an absolute timeout, measured against the
1090 .BR CLOCK_REALTIME
1091 clock.
1092 .\" 2016-07-07 response from Thomas Gleixner on LKML:
1093 .\" From: Thomas Gleixner <tglx@linutronix.de>
1094 .\" Date: 6 July 2016 at 20:57
1095 .\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
1096 .\"
1097 .\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:
1098 .\" > On 06/23/2016 08:28 PM, Darren Hart wrote:
1099 .\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using
1100 .\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a
1101 .\" > > specific amount of time, regardless of wall time.
1102 .\" >
1103 .\" > Yes, that's another weird inconsistency.
1104 .\"
1105 .\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on
1106 .\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back
1107 .\" then when we added LOCK_PI.
1108 If
1109 .I timeout
1110 is NULL, the operation will block indefinitely.
1111 .IP
1112 The
1113 .IR uaddr2 ,
1114 .IR val ,
1115 and
1116 .IR val3
1117 arguments are ignored.
1118 .\"
1119 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1120 .\"
1121 .TP
1122 .BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)"
1123 .\" commit c87e2837be82df479a6bae9f155c43516d2feebc
1124 This operation tries to acquire the lock at
1125 .IR uaddr .
1126 It is invoked when a user-space atomic acquire did not
1127 succeed because the futex word was not 0.
1128 .IP
1129 Because the kernel has access to more state information than user space,
1130 acquisition of the lock might succeed if performed by the
1131 kernel in cases where the futex word
1132 (i.e., the state information accessible to use-space) contains stale state
1133 .RB ( FUTEX_WAITERS
1134 and/or
1135 .BR FUTEX_OWNER_DIED ).
1136 This can happen when the owner of the futex died.
1137 User space cannot handle this condition in a race-free manner,
1138 but the kernel can fix this up and acquire the futex.
1139 .\" Paraphrasing a f2f conversation with Thomas Gleixner about the
1140 .\" above point (Aug 2015): ###
1141 .\" There is a rare possibility of a race condition involving an
1142 .\" uncontended futex with no owner, but with waiters. The
1143 .\" kernel-user-space contract is that if a futex is nonzero, you must
1144 .\" go into kernel. The futex was owned by a task, and that task dies
1145 .\" but there are no waiters, so the futex value is non zero.
1146 .\" Therefore, the next locker has to go into the kernel,
1147 .\" so that the kernel has a chance to clean up. (CMXCH on zero
1148 .\" in user space would fail, so kernel has to clean up.)
1149 .\" Darren Hart (Oct 2015):
1150 .\" The trylock in the kernel has more state, so it can independently
1151 .\" verify the flags that userspace must trust implicitly.
1152 .IP
1153 The
1154 .IR uaddr2 ,
1155 .IR val ,
1156 .IR timeout ,
1157 and
1158 .IR val3
1159 arguments are ignored.
1160 .\"
1161 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1162 .\"
1163 .TP
1164 .BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)"
1165 .\" commit c87e2837be82df479a6bae9f155c43516d2feebc
1166 This operation wakes the top priority waiter that is waiting in
1167 .B FUTEX_LOCK_PI
1168 on the futex address provided by the
1169 .I uaddr
1170 argument.
1171 .IP
1172 This is called when the user-space value at
1173 .I uaddr
1174 cannot be changed atomically from a TID (of the owner) to 0.
1175 .IP
1176 The
1177 .IR uaddr2 ,
1178 .IR val ,
1179 .IR timeout ,
1180 and
1181 .IR val3
1182 arguments are ignored.
1183 .\"
1184 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1185 .\"
1186 .TP
1187 .BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)"
1188 .\" commit 52400ba946759af28442dee6265c5c0180ac7122
1189 This operation is a PI-aware variant of
1190 .BR FUTEX_CMP_REQUEUE .
1191 It requeues waiters that are blocked via
1192 .B FUTEX_WAIT_REQUEUE_PI
1193 on
1194 .I uaddr
1195 from a non-PI source futex
1196 .RI ( uaddr )
1197 to a PI target futex
1198 .RI ( uaddr2 ).
1199 .IP
1200 As with
1201 .BR FUTEX_CMP_REQUEUE ,
1202 this operation wakes up a maximum of
1203 .I val
1204 waiters that are waiting on the futex at
1205 .IR uaddr .
1206 However, for
1207 .BR FUTEX_CMP_REQUEUE_PI ,
1208 .I val
1209 is required to be 1
1210 (since the main point is to avoid a thundering herd).
1211 The remaining waiters are removed from the wait queue of the source futex at
1212 .I uaddr
1213 and added to the wait queue of the target futex at
1214 .IR uaddr2 .
1215 .IP
1216 The
1217 .I val2
1218 .\" val2 is the cap on the number of requeued waiters.
1219 .\" In the glibc pthread_cond_broadcast() implementation, this argument
1220 .\" is specified as INT_MAX, and for pthread_cond_signal() it is 0.
1221 and
1222 .I val3
1223 arguments serve the same purposes as for
1224 .BR FUTEX_CMP_REQUEUE .
1225 .\"
1226 .\" The page at http://locklessinc.com/articles/futex_cheat_sheet/
1227 .\" notes that "priority-inheritance Futex to priority-inheritance
1228 .\" Futex requeues are currently unsupported". However, probably
1229 .\" the page does not need to say nothing about this, since
1230 .\" Thomas Gleixner commented (July 2015): "they never will be
1231 .\" supported because they make no sense at all"
1232 .\"
1233 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1234 .\"
1235 .TP
1236 .BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)"
1237 .\" commit 52400ba946759af28442dee6265c5c0180ac7122
1238 .\"
1239 Wait on a non-PI futex at
1240 .I uaddr
1241 and potentially be requeued (via a
1242 .BR FUTEX_CMP_REQUEUE_PI
1243 operation in another task) onto a PI futex at
1244 .IR uaddr2 .
1245 The wait operation on
1246 .I uaddr
1247 is the same as for
1248 .BR FUTEX_WAIT .
1249 .IP
1250 The waiter can be removed from the wait on
1251 .I uaddr
1252 without requeueing on
1253 .IR uaddr2
1254 via a
1255 .BR FUTEX_WAKE
1256 operation in another task.
1257 In this case, the
1258 .BR FUTEX_WAIT_REQUEUE_PI
1259 operation fails with the error
1260 .BR EAGAIN .
1261 .IP
1262 If
1263 .I timeout
1264 is not NULL, the structure it points to specifies
1265 an absolute timeout for the wait operation.
1266 If
1267 .I timeout
1268 is NULL, the operation can block indefinitely.
1269 .IP
1270 The
1271 .I val3
1272 argument is ignored.
1273 .IP
1274 The
1275 .BR FUTEX_WAIT_REQUEUE_PI
1276 and
1277 .BR FUTEX_CMP_REQUEUE_PI
1278 were added to support a fairly specific use case:
1279 support for priority-inheritance-aware POSIX threads condition variables.
1280 The idea is that these operations should always be paired,
1281 in order to ensure that user space and the kernel remain in sync.
1282 Thus, in the
1283 .BR FUTEX_WAIT_REQUEUE_PI
1284 operation, the user-space application pre-specifies the target
1285 of the requeue that takes place in the
1286 .BR FUTEX_CMP_REQUEUE_PI
1287 operation.
1288 .\"
1289 .\" Darren Hart notes that a patch to allow glibc to fully support
1290 .\" PI-aware pthreads condition variables has not yet been accepted into
1291 .\" glibc. The story is complex, and can be found at
1292 .\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588
1293 .\" Darren notes that in the meantime, the patch is shipped with various
1294 .\" PREEMPT_RT-enabled Linux systems.
1295 .\"
1296 .\" Related to the preceding, Darren proposed that somewhere, man-pages
1297 .\" should document the following point:
1298 .\"
1299 .\" While the Linux kernel, since 2.6.31, supports requeueing of
1300 .\" priority-inheritance (PI) aware mutexes via the
1301 .\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations,
1302 .\" the glibc implementation does not yet take full advantage of this.
1303 .\" Specifically, the condvar internal data lock remains a non-PI aware
1304 .\" mutex, regardless of the type of the pthread_mutex associated with
1305 .\" the condvar. This can lead to an unbounded priority inversion on
1306 .\" the internal data lock even when associating a PI aware
1307 .\" pthread_mutex with a condvar during a pthread_cond*_wait
1308 .\" operation. For this reason, it is not recommended to rely on
1309 .\" priority inheritance when using pthread condition variables.
1310 .\"
1311 .\" The problem is that the obvious location for this text is
1312 .\" the pthread_cond*wait(3) man page. However, such a man page
1313 .\" does not currently exist.
1314 .\"
1315 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1316 .\"
1317 .SH RETURN VALUE
1318 .PP
1319 In the event of an error (and assuming that
1320 .BR futex ()
1321 was invoked via
1322 .BR syscall (2)),
1323 all operations return \-1 and set
1324 .I errno
1325 to indicate the cause of the error.
1326 .PP
1327 The return value on success depends on the operation,
1328 as described in the following list:
1329 .TP
1330 .B FUTEX_WAIT
1331 Returns 0 if the caller was woken up.
1332 Note that a wake-up can also be caused by common futex usage patterns
1333 in unrelated code that happened to have previously used the futex word's
1334 memory location (e.g., typical futex-based implementations of
1335 Pthreads mutexes can cause this under some conditions).
1336 Therefore, callers should always conservatively assume that a return
1337 value of 0 can mean a spurious wake-up, and use the futex word's value
1338 (i.e., the user-space synchronization scheme)
1339 to decide whether to continue to block or not.
1340 .TP
1341 .B FUTEX_WAKE
1342 Returns the number of waiters that were woken up.
1343 .TP
1344 .B FUTEX_FD
1345 Returns the new file descriptor associated with the futex.
1346 .TP
1347 .B FUTEX_REQUEUE
1348 Returns the number of waiters that were woken up.
1349 .TP
1350 .B FUTEX_CMP_REQUEUE
1351 Returns the total number of waiters that were woken up or
1352 requeued to the futex for the futex word at
1353 .IR uaddr2 .
1354 If this value is greater than
1355 .IR val ,
1356 then the difference is the number of waiters requeued to the futex for the
1357 futex word at
1358 .IR uaddr2 .
1359 .TP
1360 .B FUTEX_WAKE_OP
1361 Returns the total number of waiters that were woken up.
1362 This is the sum of the woken waiters on the two futexes for
1363 the futex words at
1364 .I uaddr
1365 and
1366 .IR uaddr2 .
1367 .TP
1368 .B FUTEX_WAIT_BITSET
1369 Returns 0 if the caller was woken up.
1370 See
1371 .B FUTEX_WAIT
1372 for how to interpret this correctly in practice.
1373 .TP
1374 .B FUTEX_WAKE_BITSET
1375 Returns the number of waiters that were woken up.
1376 .TP
1377 .B FUTEX_LOCK_PI
1378 Returns 0 if the futex was successfully locked.
1379 .TP
1380 .B FUTEX_TRYLOCK_PI
1381 Returns 0 if the futex was successfully locked.
1382 .TP
1383 .B FUTEX_UNLOCK_PI
1384 Returns 0 if the futex was successfully unlocked.
1385 .TP
1386 .B FUTEX_CMP_REQUEUE_PI
1387 Returns the total number of waiters that were woken up or
1388 requeued to the futex for the futex word at
1389 .IR uaddr2 .
1390 If this value is greater than
1391 .IR val ,
1392 then difference is the number of waiters requeued to the futex for
1393 the futex word at
1394 .IR uaddr2 .
1395 .TP
1396 .B FUTEX_WAIT_REQUEUE_PI
1397 Returns 0 if the caller was successfully requeued to the futex for
1398 the futex word at
1399 .IR uaddr2 .
1400 .\"
1401 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1402 .\"
1403 .SH ERRORS
1404 .TP
1405 .B EACCES
1406 No read access to the memory of a futex word.
1407 .TP
1408 .B EAGAIN
1409 .RB ( FUTEX_WAIT ,
1410 .BR FUTEX_WAIT_BITSET ,
1411 .BR FUTEX_WAIT_REQUEUE_PI )
1412 The value pointed to by
1413 .I uaddr
1414 was not equal to the expected value
1415 .I val
1416 at the time of the call.
1417 .IP
1418 .BR Note :
1419 on Linux, the symbolic names
1420 .B EAGAIN
1421 and
1422 .B EWOULDBLOCK
1423 (both of which appear in different parts of the kernel futex code)
1424 have the same value.
1425 .TP
1426 .B EAGAIN
1427 .RB ( FUTEX_CMP_REQUEUE ,
1428 .BR FUTEX_CMP_REQUEUE_PI )
1429 The value pointed to by
1430 .I uaddr
1431 is not equal to the expected value
1432 .IR val3 .
1433 .TP
1434 .BR EAGAIN
1435 .RB ( FUTEX_LOCK_PI ,
1436 .BR FUTEX_TRYLOCK_PI ,
1437 .BR FUTEX_CMP_REQUEUE_PI )
1438 The futex owner thread ID of
1439 .I uaddr
1440 (for
1441 .BR FUTEX_CMP_REQUEUE_PI :
1442 .IR uaddr2 )
1443 is about to exit,
1444 but has not yet handled the internal state cleanup.
1445 Try again.
1446 .TP
1447 .BR EDEADLK
1448 .RB ( FUTEX_LOCK_PI ,
1449 .BR FUTEX_TRYLOCK_PI ,
1450 .BR FUTEX_CMP_REQUEUE_PI )
1451 The futex word at
1452 .I uaddr
1453 is already locked by the caller.
1454 .TP
1455 .BR EDEADLK
1456 .\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some
1457 .\" places, and EDEADLOCK in others. On almost all architectures
1458 .\" these constants are synonymous. Is there a reason that both
1459 .\" names are used?
1460 .\"
1461 .\" tglx (July 2015): "No. We should probably fix that."
1462 .\"
1463 .RB ( FUTEX_CMP_REQUEUE_PI )
1464 While requeueing a waiter to the PI futex for the futex word at
1465 .IR uaddr2 ,
1466 the kernel detected a deadlock.
1467 .TP
1468 .B EFAULT
1469 A required pointer argument (i.e.,
1470 .IR uaddr ,
1471 .IR uaddr2 ,
1472 or
1473 .IR timeout )
1474 did not point to a valid user-space address.
1475 .TP
1476 .B EINTR
1477 A
1478 .B FUTEX_WAIT
1479 or
1480 .B FUTEX_WAIT_BITSET
1481 operation was interrupted by a signal (see
1482 .BR signal (7)).
1483 In kernels before Linux 2.6.22, this error could also be returned for
1484 a spurious wakeup; since Linux 2.6.22, this no longer happens.
1485 .TP
1486 .B EINVAL
1487 The operation in
1488 .IR futex_op
1489 is one of those that employs a timeout, but the supplied
1490 .I timeout
1491 argument was invalid
1492 .RI ( tv_sec
1493 was less than zero, or
1494 .IR tv_nsec
1495 was not less than 1,000,000,000).
1496 .TP
1497 .B EINVAL
1498 The operation specified in
1499 .IR futex_op
1500 employs one or both of the pointers
1501 .I uaddr
1502 and
1503 .IR uaddr2 ,
1504 but one of these does not point to a valid object\(emthat is,
1505 the address is not four-byte-aligned.
1506 .TP
1507 .B EINVAL
1508 .RB ( FUTEX_WAIT_BITSET ,
1509 .BR FUTEX_WAKE_BITSET )
1510 The bit mask supplied in
1511 .IR val3
1512 is zero.
1513 .TP
1514 .B EINVAL
1515 .RB ( FUTEX_CMP_REQUEUE_PI )
1516 .I uaddr
1517 equals
1518 .IR uaddr2
1519 (i.e., an attempt was made to requeue to the same futex).
1520 .TP
1521 .BR EINVAL
1522 .RB ( FUTEX_FD )
1523 The signal number supplied in
1524 .I val
1525 is invalid.
1526 .TP
1527 .B EINVAL
1528 .RB ( FUTEX_WAKE ,
1529 .BR FUTEX_WAKE_OP ,
1530 .BR FUTEX_WAKE_BITSET ,
1531 .BR FUTEX_REQUEUE ,
1532 .BR FUTEX_CMP_REQUEUE )
1533 The kernel detected an inconsistency between the user-space state at
1534 .I uaddr
1535 and the kernel state\(emthat is, it detected a waiter which waits in
1536 .BR FUTEX_LOCK_PI
1537 on
1538 .IR uaddr .
1539 .TP
1540 .B EINVAL
1541 .RB ( FUTEX_LOCK_PI ,
1542 .BR FUTEX_TRYLOCK_PI ,
1543 .BR FUTEX_UNLOCK_PI )
1544 The kernel detected an inconsistency between the user-space state at
1545 .I uaddr
1546 and the kernel state.
1547 This indicates either state corruption
1548 or that the kernel found a waiter on
1549 .I uaddr
1550 which is waiting via
1551 .BR FUTEX_WAIT
1552 or
1553 .BR FUTEX_WAIT_BITSET .
1554 .TP
1555 .B EINVAL
1556 .RB ( FUTEX_CMP_REQUEUE_PI )
1557 The kernel detected an inconsistency between the user-space state at
1558 .I uaddr2
1559 and the kernel state;
1560 .\" From a conversation with Thomas Gleixner (Aug 2015): ###
1561 .\" The kernel sees: I have non PI state for a futex you tried to
1562 .\" tell me was PI
1563 that is, the kernel detected a waiter which waits via
1564 .BR FUTEX_WAIT
1565 or
1566 .BR FUTEX_WAIT_BITSET
1567 on
1568 .IR uaddr2 .
1569 .TP
1570 .B EINVAL
1571 .RB ( FUTEX_CMP_REQUEUE_PI )
1572 The kernel detected an inconsistency between the user-space state at
1573 .I uaddr
1574 and the kernel state;
1575 that is, the kernel detected a waiter which waits via
1576 .BR FUTEX_WAIT
1577 or
1578 .BR FUTEX_WAIT_BITESET
1579 on
1580 .IR uaddr .
1581 .TP
1582 .B EINVAL
1583 .RB ( FUTEX_CMP_REQUEUE_PI )
1584 The kernel detected an inconsistency between the user-space state at
1585 .I uaddr
1586 and the kernel state;
1587 that is, the kernel detected a waiter which waits on
1588 .I uaddr
1589 via
1590 .BR FUTEX_LOCK_PI
1591 (instead of
1592 .BR FUTEX_WAIT_REQUEUE_PI ).
1593 .TP
1594 .B EINVAL
1595 .RB ( FUTEX_CMP_REQUEUE_PI )
1596 .\" This deals with the case:
1597 .\" wait_requeue_pi(A, B);
1598 .\" requeue_pi(A, C);
1599 An attempt was made to requeue a waiter to a futex other than that
1600 specified by the matching
1601 .B FUTEX_WAIT_REQUEUE_PI
1602 call for that waiter.
1603 .TP
1604 .B EINVAL
1605 .RB ( FUTEX_CMP_REQUEUE_PI )
1606 The
1607 .I val
1608 argument is not 1.
1609 .TP
1610 .B EINVAL
1611 Invalid argument.
1612 .TP
1613 .B ENFILE
1614 .RB ( FUTEX_FD )
1615 The system-wide limit on the total number of open files has been reached.
1616 .TP
1617 .BR ENOMEM
1618 .RB ( FUTEX_LOCK_PI ,
1619 .BR FUTEX_TRYLOCK_PI ,
1620 .BR FUTEX_CMP_REQUEUE_PI )
1621 The kernel could not allocate memory to hold state information.
1622 .TP
1623 .B ENOSYS
1624 Invalid operation specified in
1625 .IR futex_op .
1626 .TP
1627 .B ENOSYS
1628 The
1629 .BR FUTEX_CLOCK_REALTIME
1630 option was specified in
1631 .IR futex_op ,
1632 but the accompanying operation was neither
1633 .BR FUTEX_WAIT ,
1634 .BR FUTEX_WAIT_BITSET ,
1635 nor
1636 .BR FUTEX_WAIT_REQUEUE_PI .
1637 .TP
1638 .BR ENOSYS
1639 .RB ( FUTEX_LOCK_PI ,
1640 .BR FUTEX_TRYLOCK_PI ,
1641 .BR FUTEX_UNLOCK_PI ,
1642 .BR FUTEX_CMP_REQUEUE_PI ,
1643 .BR FUTEX_WAIT_REQUEUE_PI )
1644 A run-time check determined that the operation is not available.
1645 The PI-futex operations are not implemented on all architectures and
1646 are not supported on some CPU variants.
1647 .TP
1648 .BR EPERM
1649 .RB ( FUTEX_LOCK_PI ,
1650 .BR FUTEX_TRYLOCK_PI ,
1651 .BR FUTEX_CMP_REQUEUE_PI )
1652 The caller is not allowed to attach itself to the futex at
1653 .I uaddr
1654 (for
1655 .BR FUTEX_CMP_REQUEUE_PI :
1656 the futex at
1657 .IR uaddr2 ).
1658 (This may be caused by a state corruption in user space.)
1659 .TP
1660 .BR EPERM
1661 .RB ( FUTEX_UNLOCK_PI )
1662 The caller does not own the lock represented by the futex word.
1663 .TP
1664 .BR ESRCH
1665 .RB ( FUTEX_LOCK_PI ,
1666 .BR FUTEX_TRYLOCK_PI ,
1667 .BR FUTEX_CMP_REQUEUE_PI )
1668 The thread ID in the futex word at
1669 .I uaddr
1670 does not exist.
1671 .TP
1672 .BR ESRCH
1673 .RB ( FUTEX_CMP_REQUEUE_PI )
1674 The thread ID in the futex word at
1675 .I uaddr2
1676 does not exist.
1677 .TP
1678 .B ETIMEDOUT
1679 The operation in
1680 .IR futex_op
1681 employed the timeout specified in
1682 .IR timeout ,
1683 and the timeout expired before the operation completed.
1684 .\"
1685 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1686 .\"
1687 .SH VERSIONS
1688 .PP
1689 Futexes were first made available in a stable kernel release
1690 with Linux 2.6.0.
1691 .PP
1692 Initial futex support was merged in Linux 2.5.7 but with different
1693 semantics from what was described above.
1694 A four-argument system call with the semantics
1695 described in this page was introduced in Linux 2.5.40.
1696 A fifth argument was added in Linux 2.5.70,
1697 and a sixth argument was added in Linux 2.6.7.
1698 .SH CONFORMING TO
1699 This system call is Linux-specific.
1700 .SH NOTES
1701 Glibc does not provide a wrapper for this system call; call it using
1702 .BR syscall (2).
1703 .PP
1704 Several higher-level programming abstractions are implemented via futexes,
1705 including POSIX semaphores and
1706 various POSIX threads synchronization mechanisms
1707 (mutexes, condition variables, read-write locks, and barriers).
1708 .\" TODO FIXME(Torvald) Above, we cite this section and claim it contains
1709 .\" details on the synchronization semantics; add the C11 equivalents
1710 .\" here (or whatever we find consensus for).
1711 .\"
1712 .\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1713 .\"
1714 .SH EXAMPLE
1715 The program below demonstrates use of futexes in a program where a parent
1716 process and a child process use a pair of futexes located inside a
1717 shared anonymous mapping to synchronize access to a shared resource:
1718 the terminal.
1719 The two processes each write
1720 .IR nloops
1721 (a command-line argument that defaults to 5 if omitted)
1722 messages to the terminal and employ a synchronization protocol
1723 that ensures that they alternate in writing messages.
1724 Upon running this program we see output such as the following:
1725 .PP
1726 .in +4n
1727 .EX
1728 $ \fB./futex_demo\fP
1729 Parent (18534) 0
1730 Child (18535) 0
1731 Parent (18534) 1
1732 Child (18535) 1
1733 Parent (18534) 2
1734 Child (18535) 2
1735 Parent (18534) 3
1736 Child (18535) 3
1737 Parent (18534) 4
1738 Child (18535) 4
1739 .EE
1740 .in
1741 .SS Program source
1742 \&
1743 .EX
1744 /* futex_demo.c
1745
1746 Usage: futex_demo [nloops]
1747 (Default: 5)
1748
1749 Demonstrate the use of futexes in a program where parent and child
1750 use a pair of futexes located inside a shared anonymous mapping to
1751 synchronize access to a shared resource: the terminal. The two
1752 processes each write \(aqnum\-loops\(aq messages to the terminal and employ
1753 a synchronization protocol that ensures that they alternate in
1754 writing messages.
1755 */
1756 #define _GNU_SOURCE
1757 #include <stdio.h>
1758 #include <errno.h>
1759 #include <stdatomic.h>
1760 #include <stdlib.h>
1761 #include <unistd.h>
1762 #include <sys/wait.h>
1763 #include <sys/mman.h>
1764 #include <sys/syscall.h>
1765 #include <linux/futex.h>
1766 #include <sys/time.h>
1767
1768 #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \e
1769 } while (0)
1770
1771 static int *futex1, *futex2, *iaddr;
1772
1773 static int
1774 futex(int *uaddr, int futex_op, int val,
1775 const struct timespec *timeout, int *uaddr2, int val3)
1776 {
1777 return syscall(SYS_futex, uaddr, futex_op, val,
1778 timeout, uaddr2, val3);
1779 }
1780
1781 /* Acquire the futex pointed to by \(aqfutexp\(aq: wait for its value to
1782 become 1, and then set the value to 0. */
1783
1784 static void
1785 fwait(int *futexp)
1786 {
1787 int s;
1788
1789 /* atomic_compare_exchange_strong(ptr, oldval, newval)
1790 atomically performs the equivalent of:
1791
1792 if (*ptr == *oldval)
1793 *ptr = newval;
1794
1795 It returns true if the test yielded true and *ptr was updated. */
1796
1797 while (1) {
1798
1799 /* Is the futex available? */
1800 const int one = 1;
1801 if (atomic_compare_exchange_strong(futexp, &one, 0))
1802 break; /* Yes */
1803
1804 /* Futex is not available; wait */
1805
1806 s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
1807 if (s == \-1 && errno != EAGAIN)
1808 errExit("futex\-FUTEX_WAIT");
1809 }
1810 }
1811
1812 /* Release the futex pointed to by \(aqfutexp\(aq: if the futex currently
1813 has the value 0, set its value to 1 and the wake any futex waiters,
1814 so that if the peer is blocked in fpost(), it can proceed. */
1815
1816 static void
1817 fpost(int *futexp)
1818 {
1819 int s;
1820
1821 /* atomic_compare_exchange_strong() was described in comments above */
1822
1823 const int zero = 0;
1824 if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
1825 s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
1826 if (s == \-1)
1827 errExit("futex\-FUTEX_WAKE");
1828 }
1829 }
1830
1831 int
1832 main(int argc, char *argv[])
1833 {
1834 pid_t childPid;
1835 int j, nloops;
1836
1837 setbuf(stdout, NULL);
1838
1839 nloops = (argc > 1) ? atoi(argv[1]) : 5;
1840
1841 /* Create a shared anonymous mapping that will hold the futexes.
1842 Since the futexes are being shared between processes, we
1843 subsequently use the "shared" futex operations (i.e., not the
1844 ones suffixed "_PRIVATE") */
1845
1846 iaddr = mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE,
1847 MAP_ANONYMOUS | MAP_SHARED, \-1, 0);
1848 if (iaddr == MAP_FAILED)
1849 errExit("mmap");
1850
1851 futex1 = &iaddr[0];
1852 futex2 = &iaddr[1];
1853
1854 *futex1 = 0; /* State: unavailable */
1855 *futex2 = 1; /* State: available */
1856
1857 /* Create a child process that inherits the shared anonymous
1858 mapping */
1859
1860 childPid = fork();
1861 if (childPid == \-1)
1862 errExit("fork");
1863
1864 if (childPid == 0) { /* Child */
1865 for (j = 0; j < nloops; j++) {
1866 fwait(futex1);
1867 printf("Child (%ld) %d\en", (long) getpid(), j);
1868 fpost(futex2);
1869 }
1870
1871 exit(EXIT_SUCCESS);
1872 }
1873
1874 /* Parent falls through to here */
1875
1876 for (j = 0; j < nloops; j++) {
1877 fwait(futex2);
1878 printf("Parent (%ld) %d\en", (long) getpid(), j);
1879 fpost(futex1);
1880 }
1881
1882 wait(NULL);
1883
1884 exit(EXIT_SUCCESS);
1885 }
1886 .EE
1887 .SH SEE ALSO
1888 .ad l
1889 .BR get_robust_list (2),
1890 .BR restart_syscall (2),
1891 .BR pthread_mutexattr_getprotocol (3),
1892 .BR futex (7),
1893 .BR sched (7)
1894 .PP
1895 The following kernel source files:
1896 .IP * 2
1897 .I Documentation/pi-futex.txt
1898 .IP *
1899 .I Documentation/futex-requeue-pi.txt
1900 .IP *
1901 .I Documentation/locking/rt-mutex.txt
1902 .IP *
1903 .I Documentation/locking/rt-mutex-design.txt
1904 .IP *
1905 .I Documentation/robust-futex-ABI.txt
1906 .PP
1907 Franke, H., Russell, R., and Kirwood, M., 2002.
1908 \fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP
1909 (from proceedings of the Ottawa Linux Symposium 2002),
1910 .br
1911 .UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf
1912 .UE
1913 .PP
1914 Hart, D., 2009. \fIA futex overview and update\fP,
1915 .UR http://lwn.net/Articles/360699/
1916 .UE
1917 .PP
1918 Hart, D.\& and Guniguntala, D., 2009.
1919 \fIRequeue-PI: Making Glibc Condvars PI-Aware\fP
1920 (from proceedings of the 2009 Real-Time Linux Workshop),
1921 .UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf
1922 .UE
1923 .PP
1924 Drepper, U., 2011. \fIFutexes Are Tricky\fP,
1925 .UR http://www.akkadia.org/drepper/futex.pdf
1926 .UE
1927 .PP
1928 Futex example library, futex-*.tar.bz2 at
1929 .br
1930 .UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/
1931 .UE
1932 .\"
1933 .\" FIXME(Torvald) We should probably refer to the glibc code here, in
1934 .\" particular the glibc-internal futex wrapper functions that are
1935 .\" WIP, and the generic pthread_mutex_t and perhaps condvar
1936 .\" implementations.