git.ipfire.org Git - thirdparty/man-pages.git/blame - man2/futex.2
futex.2: Terminology fixes
Commit Line Data
8f0aff2a 1.\" Page by b.hubert
1abce893
MK
2.\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de>
3.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com>
2297bf0e 4.\"
2e46a6e7 5.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
8f0aff2a 6.\" may be freely modified and distributed
8ff7380d 7.\" %%%LICENSE_END
fea681da
MK
8.\"
9.\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com)
10.\" added ERRORS section.
11.\"
12.\" Modified 2004-06-17 mtk
13.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
14.\"
47f5c4ba
MK
15.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
16.\" 2015-01-23:
17.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
18.\"
02182e7c
MK
19.\" FIXME Do we need add some text regarding Torvald Riegel's 2015-01-24 mail
20.\" at http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
21.\"
3d155313 22.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
fea681da 23.SH NAME
ce154705 24futex \- fast user-space locking
fea681da 25.SH SYNOPSIS
9d9dc1e8 26.nf
fea681da
MK
27.sp
28.B "#include <linux/futex.h>"
fea681da
MK
29.B "#include <sys/time.h>"
30.sp
d33602c4 31.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
768d3c23
MK
32.BI " const struct timespec *" timeout , \
33" \fR /* or: \fBu32 \fIval2\fP */
9d9dc1e8 34.BI " int *" uaddr2 ", int " val3 );
9d9dc1e8 35.fi
409f08b0 36
b939d6e4
MK
37.IR Note :
38There is no glibc wrapper for this system call; see NOTES.
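Because glibc provides no wrapper, programs usually reach this system call through
.BR syscall (2).
The following is a minimal sketch of such a wrapper, not a definitive interface: the function name futex and the use of uint32_t for the futex word are choices made for this example only.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical convenience wrapper; the argument order follows the
   prototype shown in the SYNOPSIS. Returns -1 and sets errno on error. */
long futex(uint32_t *uaddr, int futex_op, uint32_t val,
           const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}
```

With this wrapper, a call such as futex(&word, FUTEX_WAKE, 1, NULL, NULL, 0) on a word that has no waiters simply returns 0.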
47297adb 39.SH DESCRIPTION
fea681da
MK
40.PP
41The
e511ffb6 42.BR futex ()
4b35dc5d
TR
43system call provides a method for waiting until a certain condition becomes
44true. It is typically used as a blocking construct in the context of
45shared-memory synchronization: The program implements the majority of the
46synchronization in user space, and uses one of the operations of the system call
47when it is likely that it has to block for a longer time until the condition
48becomes true. The program uses another operation of the system call to wake
49anyone waiting for a particular condition.
50
51The condition is represented by the futex word, which is an address in memory
52supplied to the
53.BR futex ()
54system call, and the value at this memory location.
a5956430
MK
55(While the virtual addresses for the same memory in separate
56processes may not be equal,
57the kernel maps them internally so that the same memory mapped
58in different locations will correspond for
e511ffb6 59.BR futex ()
f19904c0 60calls.)
809ca3ae 61
4b35dc5d
TR
62When executing a futex operation that requests to block a thread, the kernel
63will only block if the futex word has the value that the calling thread
64supplied as expected value. The load from the futex word, the comparison with
65the expected value, and the actual blocking will happen atomically and totally
66ordered with respect to concurrently executing futex operations on the same
67futex word, such as operations that wake threads blocked on this futex word.
68Thus, the futex word is used to connect the synchronization in user space with
69the implementation of blocking by the kernel; similar to an atomic
70compare-and-exchange operation that potentially changes shared memory,
71blocking via a futex is an atomic compare-and-block operation. See NOTES for
72a detailed specification of the synchronization semantics.
73
74One example use of futexes is implementing locks. The state of the lock (i.e.,
75acquired or not acquired) can be represented as an atomically accessed flag
76in shared memory. In the uncontended case, a thread can access or modify the
77lock state with atomic instructions, for example atomically changing it from
78not acquired to acquired using an atomic compare-and-exchange instruction. If
79a thread cannot acquire a lock because it is already acquired by another
80thread, it can request to block if and only if the lock is still acquired by
81using the lock's flag as futex word and expecting a value that represents the
82acquired state. When releasing the lock, a thread has to first reset the
83lock state to not acquired and then execute the futex operation that wakes
84one thread blocked on the futex word that is the lock's flag (this can be
85further optimized to avoid unnecessary wake-ups). See
86.BR futex (7)
87for more detail on how to use futexes.
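The lock protocol described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production lock: the type simple_lock, the function names, and the encoding 0 = not acquired, 1 = acquired are all invented for this example, and the release path always issues a wake-up rather than applying the optimization mentioned above. The sketch also assumes that an _Atomic uint32_t has the same size and alignment as a plain four-byte integer, as is the case on common Linux platforms.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical lock type: word is 0 when not acquired, 1 when acquired. */
typedef struct { _Atomic uint32_t word; } simple_lock;

static long lock_futex(_Atomic uint32_t *uaddr, int op, uint32_t val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Try the uncontended path only: returns 1 on success, 0 otherwise. */
int simple_lock_try(simple_lock *l)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(&l->word, &expected, 1);
}

void simple_lock_acquire(simple_lock *l)
{
    /* Retry after every return from FUTEX_WAIT: the call may have
       failed with EAGAIN, been interrupted, or been woken. */
    while (!simple_lock_try(l))
        lock_futex(&l->word, FUTEX_WAIT, 1);  /* block only if still 1 */
}

void simple_lock_release(simple_lock *l)
{
    /* First reset the state, then wake one waiter (if there is one). */
    atomic_store(&l->word, 0);
    lock_futex(&l->word, FUTEX_WAKE, 1);
}
```

The expected value 1 passed to FUTEX_WAIT is what prevents a thread from going to sleep after the owner has already released the lock.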
88
89Besides the basic wait and wake-up futex functionality, there are further
90futex operations aimed at supporting more complex use cases. Also note that
91no explicit initialization or destruction is necessary to use futexes; the
92kernel maintains a futex (i.e., the kernel-internal implementation artifact)
93only while operations such as
94.BR FUTEX_WAIT ,
95described below, are being performed on a particular futex word.
a663ca5a
MK
96.\"
97.SS Arguments
fea681da
MK
98The
99.I uaddr
4b35dc5d
TR
100argument points to the futex word. On all platforms, futexes are four-byte
101integers that must be aligned on a four-byte boundary.
f388ba70
MK
102The operation to perform on the futex is specified in the
103.I futex_op
104argument;
105.IR val
106is a value whose meaning and purpose depend on
107.IR futex_op .
36ab2074
MK
108
109The remaining arguments
110.RI ( timeout ,
111.IR uaddr2 ,
112and
113.IR val3 )
114are required only for certain of the futex operations described below.
115Where one of these arguments is not required, it is ignored.
768d3c23 116
36ab2074
MK
117For several blocking operations, the
118.I timeout
119argument is a pointer to a
120.IR timespec
121structure that specifies a timeout for the operation.
122However, notwithstanding the prototype shown above, for some operations,
123this argument is instead a four-byte integer whose meaning
124is determined by the operation.
768d3c23
MK
125For these operations, the kernel casts the
126.I timeout
127value to
128.IR u32 ,
129and in the remainder of this page, this argument is referred to as
130.I val2
131when interpreted in this fashion.
132
de5a3bb4 133Where it is required, the
36ab2074 134.IR uaddr2
4b35dc5d 135argument is a pointer to a second futex word that is employed by the operation.
36ab2074
MK
136The interpretation of the final integer argument,
137.IR val3 ,
138depends on the operation.
a663ca5a
MK
139.\"
140.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
141.\"
142.SS Futex operations
6be4bad7 143The
d33602c4 144.I futex_op
6be4bad7
MK
145argument consists of two parts:
146a command that specifies the operation to be performed,
147bit-wise ORed with zero or more options that
148modify the behaviour of the operation.
fc30eb79 149The options that may be included in
d33602c4 150.I futex_op
fc30eb79
TG
151are as follows:
152.TP
153.BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)"
154.\" commit 34f01cc1f512fa783302982776895c73714ebbc2
155This option bit can be employed with all futex operations.
e45f9735 156It tells the kernel that the futex is process-private and not shared
4b35dc5d
TR
157with another process (i.e., it is only being used for synchronization between
158threads of the same process).
fc30eb79
TG
159This allows the kernel to choose the fast path for validating
160the user-space address and avoids expensive VMA lookups,
161taking reference counts on file backing store, and so on.
ae2c1774
MK
162
163As a convenience,
164.IR <linux/futex.h>
165defines a set of constants with the suffix
166.BR _PRIVATE
167that are equivalents of all of the operations listed below,
dcdfde26 168.\" except the obsolete FUTEX_FD, for which the "private" flag was
ae2c1774
MK
169.\" meaningless
170but with the
171.BR FUTEX_PRIVATE_FLAG
172ORed into the constant value.
173Thus, there are
174.BR FUTEX_WAIT_PRIVATE ,
175.BR FUTEX_WAKE_PRIVATE ,
176and so on.
2e98bbc2
TG
177.TP
178.BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)"
179.\" commit 1acdac104668a0834cfa267de9946fac7764d486
4a7e5b05 180This option bit can be employed only with the
2e98bbc2
TG
181.BR FUTEX_WAIT_BITSET
182and
183.BR FUTEX_WAIT_REQUEUE_PI
c84cf68c 184operations.
2e98bbc2 185
f2103b26
MK
186If this option is set, the kernel treats
187.I timeout
188as an absolute time based on
2e98bbc2
TG
189.BR CLOCK_REALTIME .
190
f2103b26
MK
191If this option is not set, the kernel treats
192.I timeout
193as an absolute time,
f1d2171d 194.\" FIXME XXX I added CLOCK_MONOTONIC here. Okay?
1c952cf5
MK
195measured against the
196.BR CLOCK_MONOTONIC
197clock.
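As an illustration of how the option bits combine with a command, the sketch below ORs both options into FUTEX_WAIT_BITSET and waits until an absolute CLOCK_REALTIME deadline. The helper name wait_until_realtime and its millisecond parameter are assumptions made for this example; FUTEX_BITSET_MATCH_ANY (all bits set) is taken from
.IR <linux/futex.h> .

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: block while *uaddr == expected, but no later than
   delay_ms milliseconds from now on the CLOCK_REALTIME clock.
   Returns 0 if woken, otherwise the errno value of the failed call. */
int wait_until_realtime(uint32_t *uaddr, uint32_t expected, long delay_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);

    /* Turn "now + delay" into a normalized absolute timestamp. */
    deadline.tv_nsec += (delay_ms % 1000) * 1000000L;
    deadline.tv_sec += delay_ms / 1000 + deadline.tv_nsec / 1000000000L;
    deadline.tv_nsec %= 1000000000L;

    /* Command ORed with two option bits. */
    int op = FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME;

    long r = syscall(SYS_futex, uaddr, op, expected, &deadline, NULL,
                     FUTEX_BITSET_MATCH_ANY);
    return r == -1 ? errno : 0;
}
```

The predefined _PRIVATE constants are the same kind of combination; for example, FUTEX_WAIT_PRIVATE has the value FUTEX_WAIT | FUTEX_PRIVATE_FLAG.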
6be4bad7
MK
198.PP
199The operation specified in
d33602c4 200.I futex_op
6be4bad7 201is one of the following:
70b06b90
MK
202.\"
203.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
204.\"
fea681da 205.TP
81c9d87e
MK
206.BR FUTEX_WAIT " (since Linux 2.6.0)"
207.\" Strictly speaking, since some time in 2.5.x
f065673c 208This operation tests that the value at the
4b35dc5d 209futex word pointed to by the address
fea681da 210.I uaddr
4b35dc5d 211still contains the expected value
fea681da 212.IR val ,
4b35dc5d 213and if so, then sleeps awaiting
682edefb 214.B FUTEX_WAKE
4b35dc5d
TR
215on the futex word. The load of the value of the futex word is an atomic memory
216access (i.e., using atomic machine instructions of the respective
217architecture). This load, the comparison with the expected value, and
218starting to sleep are performed atomically and totally ordered with respect
219to other futex operations on the same futex word. If the thread starts to
220sleep, it is considered a waiter on this futex word.
f065673c
MK
221If the futex value does not match
222.IR val ,
4710334a 223then the call fails immediately with the error
badbf70c 224.BR EAGAIN .
4b35dc5d
TR
225
226The purpose of the comparison with the expected value is to prevent lost
227wake-ups: If another thread changed the value of the futex word after the
228calling thread decided to block based on the prior value, and if the other
229thread executed a
230.BR FUTEX_WAKE
231operation (or similar wake-up) after the value change and before this
f065673c 232.BR FUTEX_WAIT
4b35dc5d
TR
233operation, then the latter will observe the value change and will not start
234to sleep.
1909e523 235
c13182ef 236If the
fea681da 237.I timeout
53ba4030 238argument is non-NULL, its contents specify a relative timeout for the wait,
f1d2171d 239.\" FIXME XXX I added CLOCK_MONOTONIC here. Okay?
1c952cf5
MK
240measured according to the
241.BR CLOCK_MONOTONIC
242clock.
82a6092b
MK
243(This interval will be rounded up to the system clock granularity,
244and kernel scheduling delays mean that the
245blocking interval may overrun by a small amount.)
246If
247.I timeout
248is NULL, the call blocks indefinitely.
4798a7f3 249
c13182ef 250The arguments
fea681da
MK
251.I uaddr2
252and
253.I val3
254are ignored.
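The two ways in which a FUTEX_WAIT call can return without a wake-up can be observed from a single thread. The sketch below is only an illustration; the helper name futex_wait_errno and its millisecond parameter are assumptions of this example. It relies on the call failing with ETIMEDOUT when the relative timeout expires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait on *uaddr with a relative timeout.
   Returns 0 if woken, otherwise the errno value of the failed call. */
int futex_wait_errno(uint32_t *uaddr, uint32_t expected, long timeout_ms)
{
    struct timespec rel = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };

    /* uaddr2 and val3 are ignored by FUTEX_WAIT. */
    long r = syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, &rel, NULL, 0);
    return r == -1 ? errno : 0;
}
```

If the word does not hold the expected value, the helper returns EAGAIN immediately, regardless of the timeout; if the word matches and nobody wakes the caller, it returns ETIMEDOUT after roughly the requested interval.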
255
4b35dc5d
TR
256.\" XXX I think we should remove this. Or maybe adapt to a different example.
257.\" For
258.\" .BR futex (7),
259.\" this call is executed if decrementing the count gave a negative value
260.\" (indicating contention),
261.\" and will sleep until another process or thread releases
262.\" the futex and executes the
263.\" .B FUTEX_WAKE
264.\" operation.
70b06b90
MK
265.\"
266.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
267.\"
fea681da 268.TP
81c9d87e
MK
269.BR FUTEX_WAKE " (since Linux 2.6.0)"
270.\" Strictly speaking, since Linux 2.5.x
f065673c
MK
271This operation wakes at most
272.I val
4b35dc5d
TR
273.\" XXX I believe FUTEX_WAIT_BITSET waiters, for example, could also be woken
274.\" (therefore, make it e.g. instead of i.e.)?
275of the waiters that are waiting (e.g., inside
f065673c 276.BR FUTEX_WAIT )
4b35dc5d 277on the futex word at the address
f065673c
MK
278.IR uaddr .
279Most commonly,
280.I val
281is specified as either 1 (wake up a single waiter) or
282.BR INT_MAX
283(wake up all waiters).
730bfbda
MK
284.\" FIXME Please confirm that the following is correct:
285No guarantee is provided about which waiters are awoken
286(e.g., a waiter with a higher scheduling priority is not guaranteed
287to be awoken in preference to a waiter with a lower priority).
4798a7f3 288
fea681da
MK
289The arguments
290.IR timeout ,
c8b921bd 291.IR uaddr2 ,
fea681da
MK
292and
293.I val3
294are ignored.
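The interplay of FUTEX_WAIT and FUTEX_WAKE can be sketched with two processes that share one futex word. This is a minimal illustration, assuming an anonymous shared mapping as the shared memory; the function name handoff_demo is invented for this example.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 if the child saw the value change and exited, -1 on error. */
int handoff_demo(void)
{
    _Atomic uint32_t *word = mmap(NULL, sizeof(*word),
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (word == MAP_FAILED)
        return -1;
    atomic_store(word, 0);

    pid_t pid = fork();
    if (pid == -1) {
        munmap((void *) word, sizeof(*word));
        return -1;
    }

    if (pid == 0) {
        /* Child: sleep while the word is still 0. The value is
           re-checked after every return from FUTEX_WAIT. */
        while (atomic_load(word) == 0)
            syscall(SYS_futex, word, FUTEX_WAIT, 0, NULL, NULL, 0);
        _exit(0);
    }

    /* Parent: change the value first, then wake all waiters. */
    atomic_store(word, 1);
    syscall(SYS_futex, word, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);

    int status;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    munmap((void *) word, sizeof(*word));
    return ok ? 0 : -1;
}
```

Whichever process runs first, the child cannot sleep forever: if the parent has already stored 1, the child's FUTEX_WAIT fails with EAGAIN instead of blocking.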
295
4b35dc5d
TR
296.\" XXX I think we should remove this. Or maybe adapt to a different example.
297.\" For
298.\" .BR futex (7),
299.\" this is executed if incrementing the count showed that there were waiters,
64191e8f 300.\" FIXME How does "incrementing the count showed that there were waiters"?
4b35dc5d 301.\" once the futex value has been set to 1 (indicating that it is available).
70b06b90
MK
302.\"
303.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
304.\"
a7c2bf45
MK
305.TP
306.BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)"
307.\" Strictly speaking, from Linux 2.5.x to 2.6.25
308This operation creates a file descriptor that is associated with the futex at
309.IR uaddr .
bdc5957a
MK
310The caller must close the returned file descriptor after use.
311When another process or thread performs a
a7c2bf45 312.BR FUTEX_WAKE
4b35dc5d 313on the futex word, the file descriptor is reported as readable by
a7c2bf45
MK
314.BR select (2),
315.BR poll (2),
316and
317.BR epoll (7).
318
f1d2171d 319The file descriptor can be used to obtain asynchronous notifications: if
a7c2bf45 320.I val
bdc5957a 321is nonzero, then when another process or thread executes a
a7c2bf45
MK
322.BR FUTEX_WAKE ,
323the caller will receive the signal number that was passed in
324.IR val .
325
326The arguments
327.IR timeout ,
328.IR uaddr2 ,
329and
330.I val3
331are ignored.
332
4b35dc5d 333.\" FIXME We never define "upped". Maybe just remove that sentence?
a7c2bf45
MK
334To prevent race conditions, the caller should test if the futex has
335been upped after
336.B FUTEX_FD
337returns.
338
339Because it was inherently racy,
340.B FUTEX_FD
341has been removed
342.\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80
343from Linux 2.6.26 onward.
70b06b90
MK
344.\"
345.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
346.\"
a7c2bf45
MK
347.TP
348.BR FUTEX_REQUEUE " (since Linux 2.6.0)"
349.\" Strictly speaking: from Linux 2.5.70
4b35dc5d
TR
350.\" FIXME Is there some indication that it is broken in general, or is this
351.\" comment implicitly speaking about the condvar (?) use case? If the latter
352.\" we might want to weaken the advice a little.
a7c2bf45 353.IR "Avoid using this operation" .
4b35dc5d 354It is broken for its intended purpose.
a7c2bf45
MK
355Use
356.BR FUTEX_CMP_REQUEUE
357instead.
358
359This operation performs the same task as
360.BR FUTEX_CMP_REQUEUE ,
361except that no check is made using the value in
362.IR val3 .
363(The argument
364.I val3
365is ignored.)
70b06b90
MK
366.\"
367.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
368.\"
a7c2bf45
MK
369.TP
370.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
4b35dc5d 371This operation first checks whether the location
a7c2bf45
MK
372.I uaddr
373still contains the value
374.IR val3 .
375If not, the operation fails with the error
376.BR EAGAIN .
4b35dc5d 377Otherwise, the operation wakes up a maximum of
a7c2bf45
MK
378.I val
379waiters that are waiting on the futex at
380.IR uaddr .
381If there are more than
382.I val
383waiters, then the remaining waiters are removed
384from the wait queue of the source futex at
385.I uaddr
386and added to the wait queue of the target futex at
387.IR uaddr2 .
388The
768d3c23 389.I val2
936876a9 390argument specifies an upper limit on the number of waiters
a7c2bf45 391that are requeued to the futex at
768d3c23 392.IR uaddr2 .
a7c2bf45 393
4b35dc5d
TR
394.\" FIXME Is this correct? Or is just the decision which threads to wake or
395.\" requeue part of the atomic operation?
396The load from
397.I uaddr
398is an atomic memory access (i.e., using atomic machine instructions of the
399respective architecture). This load, the comparison with
400.IR val3 ,
401and the requeueing of any waiters are performed atomically and totally ordered
402with respect to other operations on the same futex word.
403
404This operation was added as a replacement for the earlier
405.BR FUTEX_REQUEUE .
406The difference is that the check of the value at
407.I uaddr
408can be used to ensure that requeueing only happens under certain conditions.
409Both operations can be used to avoid a "thundering herd" effect when
410.B FUTEX_WAKE
411is used and all of the waiters that are woken need to acquire another futex.
412
a7c2bf45
MK
413.\" FIXME Please review the following new paragraph to see if it is
414.\" accurate.
415Typical values to specify for
416.I val
417are 0 or 1.
418(Specifying
419.BR INT_MAX
420is not useful, because it would make the
421.BR FUTEX_CMP_REQUEUE
422operation equivalent to
423.BR FUTEX_WAKE .)
936876a9 424The limit value specified via
768d3c23
MK
425.I val2
426is typically either 1 or
a7c2bf45
MK
427.BR INT_MAX .
428(Specifying the argument as 0 is not useful, because it would make the
429.BR FUTEX_CMP_REQUEUE
430operation equivalent to
431.BR FUTEX_WAKE .)
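The argument passing for FUTEX_CMP_REQUEUE can be sketched as follows. The helper name cmp_requeue and its parameter names are assumptions of this example. Note how the val2 limit is passed in the position of the timeout argument.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: if *src still equals expected, wake at most
   nr_wake waiters on src and move at most nr_requeue of the remaining
   waiters to dst. Returns -1 with errno set on failure. */
long cmp_requeue(uint32_t *src, uint32_t *dst, uint32_t expected,
                 int nr_wake, uint32_t nr_requeue)
{
    /* val2 occupies the timeout slot; the kernel casts it to u32. */
    return syscall(SYS_futex, src, FUTEX_CMP_REQUEUE, nr_wake,
                   (void *) (uintptr_t) nr_requeue, dst, expected);
}
```

A condition-variable style broadcast might, for instance, call cmp_requeue(&cond, &mutex, seen_value, 1, INT_MAX) so that one waiter is woken and the rest are moved to the mutex word; cond, mutex, and seen_value are placeholders here. If the value at the source address has changed in the meantime, the call fails with EAGAIN.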
6bac3b85 432.\"
43d16602
MK
433.\" FIXME Here, it would be helpful to have an example of how
434.\" FUTEX_CMP_REQUEUE might be used, at the same time illustrating
435.\" why FUTEX_WAKE is unsuitable for the same use case.
436.\"
70b06b90
MK
437.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
438.\"
a5956430
MK
439.\" FIXME I added a lengthy piece of text on FUTEX_WAKE_OP text,
440.\" and I'd be happy if someone checked it.
fea681da 441.TP
d67e21f5
MK
442.BR FUTEX_WAKE_OP " (since Linux 2.6.14)"
443.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
6bac3b85
MK
444.\" Author: Jakub Jelinek <jakub@redhat.com>
445.\" Date: Tue Sep 6 15:16:25 2005 -0700
4b35dc5d
TR
446.\" FIXME The glibc condvar implementation is currently being revised (e.g.,
447.\" to not use an internal lock anymore).
448.\" It is probably more future-proof to remove this paragraph.
6bac3b85
MK
449This operation was added to support some user-space use cases
450where more than one futex must be handled at the same time.
451The most notable example is the implementation of
452.BR pthread_cond_signal (3),
453which requires operations on two futexes,
454the one used to implement the mutex and the one used in the implementation
455of the wait queue associated with the condition variable.
456.BR FUTEX_WAKE_OP
457allows such cases to be implemented without leading to
458high rates of contention and context switching.
459
460The
461.BR FUTEX_WAKE_OP
4b35dc5d
TR
462operation is equivalent to executing the following code atomically and totally
463ordered with respect to other futex operations on either of the two supplied
464futex words:
6bac3b85
MK
465
466.in +4n
467.nf
468int oldval = *(int *) uaddr2;
469*(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
470futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
471if (oldval \fIcmp\fP \fIcmparg\fP)
768d3c23 472 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
6bac3b85
MK
473.fi
474.in
475
476In other words,
477.BR FUTEX_WAKE_OP
478does the following:
479.RS
480.IP * 3
4b35dc5d
TR
481saves the original value of the futex word at
482.IR uaddr2
483and performs an operation to modify the value of the futex at
6bac3b85 484.IR uaddr2 ;
4b35dc5d
TR
485this is an atomic read-modify-write memory access (i.e., using atomic machine
486instructions of the respective architecture)
6bac3b85
MK
487.IP *
488wakes up a maximum of
489.I val
4b35dc5d 490waiters on the futex for the futex word at
6bac3b85
MK
491.IR uaddr ;
492and
493.IP *
4b35dc5d 494dependent on the results of a test of the original value of the futex word at
6bac3b85
MK
495.IR uaddr2 ,
496wakes up a maximum of
768d3c23 497.I val2
4b35dc5d 498waiters on the futex for the futex word at
6bac3b85
MK
499.IR uaddr2 .
500.RE
501.IP
6bac3b85
MK
502The operation and comparison that are to be performed are encoded
503in the bits of the argument
504.IR val3 .
505Pictorially, the encoding is:
506
f6af90e7 507.in +8n
6bac3b85 508.nf
f6af90e7
MK
509+---+---+-----------+-----------+
510|op |cmp| oparg | cmparg |
511+---+---+-----------+-----------+
512 4 4 12 12 <== # of bits
6bac3b85
MK
513.fi
514.in
515
516Expressed in code, the encoding is:
517
518.in +4n
519.nf
520#define FUTEX_OP(op, oparg, cmp, cmparg) \\
521 (((op & 0xf) << 28) | \\
522 ((cmp & 0xf) << 24) | \\
523 ((oparg & 0xfff) << 12) | \\
524 (cmparg & 0xfff))
525.fi
526.in
527
528In the above,
529.I op
530and
531.I cmp
532are each one of the codes listed below.
533The
534.I oparg
535and
536.I cmparg
537components are literal numeric values, except as noted below.
538
539The
540.I op
541component has one of the following values:
542
543.in +4n
544.nf
545FUTEX_OP_SET 0 /* uaddr2 = oparg; */
546FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
547FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
548FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */
549FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */
550.fi
551.in
552
553In addition, bit-wise ORing the following value into
554.I op
555causes
556.IR "(1\ <<\ oparg)"
557to be used as the operand:
558
559.in +4n
560.nf
561FUTEX_OP_OPARG_SHIFT 8 /* Use (1 << oparg) as operand */
562.fi
563.in
564
565The
566.I cmp
567field is one of the following:
568
569.in +4n
570.nf
571FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
572FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
573FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
574FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
575FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
576FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
577.fi
578.in
579
580The return value of
581.BR FUTEX_WAKE_OP
582is the sum of the number of waiters woken on the futex
583.IR uaddr
584plus the number of waiters woken on the futex
585.IR uaddr2 .
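As a worked example of the encoding, the sketch below builds a val3 value that adds to the word at uaddr2 and wakes waiters there only if the old value was 0. The helper name wake_and_add is an assumption of this example; the FUTEX_OP macro and the FUTEX_OP_* constants are those provided by
.IR <linux/futex.h> .

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: atomically add addend to *uaddr2, wake at most
   one waiter on uaddr and, if the old value of *uaddr2 was 0, at most
   one waiter on uaddr2. Returns the total number of waiters woken. */
long wake_and_add(uint32_t *uaddr, uint32_t *uaddr2, int addend)
{
    /* op = ADD, oparg = addend, cmp = EQ, cmparg = 0 */
    uint32_t val3 = FUTEX_OP(FUTEX_OP_ADD, addend, FUTEX_OP_CMP_EQ, 0);

    /* val2 (the wake limit for uaddr2) travels in the timeout slot. */
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP, 1,
                   (void *) (uintptr_t) 1, uaddr2, val3);
}
```

For addend 5 the encoded value is 0x10005000: op 1 in the top four bits, cmp 0 in the next four, oparg 5 in the following twelve bits, and cmparg 0 in the low twelve bits.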
70b06b90
MK
586.\"
587.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
588.\"
d67e21f5 589.TP
79c9b436
TG
590.BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)"
591.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
fd9e59d4 592This operation is like
79c9b436
TG
593.BR FUTEX_WAIT
594except that
595.I val3
596is used to provide a 32-bit bitset to the kernel.
597This bitset is stored in the kernel-internal state of the waiter.
598See the description of
599.BR FUTEX_WAKE_BITSET
600for further details.
601
fd9e59d4
MK
602The
603.BR FUTEX_WAIT_BITSET
9732dd8b 604operation also interprets the
fd9e59d4
MK
605.I timeout
606argument differently from
607.BR FUTEX_WAIT .
608See the discussion of
609.BR FUTEX_CLOCK_REALTIME ,
610above.
611
79c9b436
TG
612The
613.I uaddr2
614argument is ignored.
70b06b90
MK
615.\"
616.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
617.\"
79c9b436 618.TP
d67e21f5
MK
619.BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)"
620.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
55cc422d
TG
621This operation is the same as
622.BR FUTEX_WAKE
623except that the
624.I val3
625argument is used to provide a 32-bit bitset to the kernel.
98d769c0
MK
626This bitset is used to select which waiters should be woken up.
627The selection is done by a bit-wise AND of the "wake" bitset
628(i.e., the value in
629.IR val3 )
630and the bitset which is stored in the kernel-internal
09cb4ce7 631state of the waiter (the "wait" bitset that is set using
98d769c0
MK
632.BR FUTEX_WAIT_BITSET ).
633All of the waiters for which the result of the AND is nonzero are woken up;
634the remaining waiters are left sleeping.
635
f1d2171d 636.\" FIXME XXX Is this paragraph that I added okay?
e9d4496b
MK
637The effect of
638.BR FUTEX_WAIT_BITSET
639and
640.BR FUTEX_WAKE_BITSET
9732dd8b
MK
641is to allow selective wake-ups among multiple waiters that are blocked
642on the same futex.
09cb4ce7 643Note, however, that using this bitset multiplexing feature on a
e9d4496b
MK
644futex is less efficient than simply using multiple futexes,
645because employing bitset multiplexing requires the kernel
646to check all waiters on a futex,
647including those that are not interested in being woken up
648(i.e., they do not have the relevant bit set in their "wait" bitset).
649.\" According to http://locklessinc.com/articles/futex_cheat_sheet/:
650.\"
651.\" "The original reason for the addition of these extensions
652.\" was to improve the performance of pthread read-write locks
653.\" in glibc. However, the pthreads library no longer uses the
654.\" same locking algorithm, and these extensions are not used
655.\" without the bitset parameter being all ones.
656.\"
657.\" The page goes on to note that the FUTEX_WAIT_BITSET operation
658.\" is nevertheless used (with a bitset of all ones) in order to
659.\" obtain the absolute timeout functionality that is useful
660.\" for efficiently implementing Pthreads APIs (which use absolute
661.\" timeouts); FUTEX_WAIT provides only relative timeouts.
662
98d769c0
MK
663The
664.I uaddr2
665and
666.I timeout
667arguments are ignored.
9732dd8b
MK
668
669The
670.BR FUTEX_WAIT
671and
672.BR FUTEX_WAKE
673operations correspond to
674.BR FUTEX_WAIT_BITSET
675and
676.BR FUTEX_WAKE_BITSET
677operations where the bitsets are all ones.
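The selection rule for bitset wake-ups can be modelled in a few lines. In the sketch below, bitset_selects is a user-space model of the test described above, not kernel code, and both function names are assumptions of this example.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Model of the selection rule: a waiter is woken if its "wait" bitset
   and the "wake" bitset have at least one bit in common. */
int bitset_selects(uint32_t wait_bitset, uint32_t wake_bitset)
{
    return (wait_bitset & wake_bitset) != 0;
}

/* Hypothetical helper: wake at most nr_wake waiters on uaddr whose
   "wait" bitset intersects bitset. timeout and uaddr2 are ignored. */
long wake_bitset(uint32_t *uaddr, int nr_wake, uint32_t bitset)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET, nr_wake,
                   NULL, NULL, bitset);
}
```

A waiter that registered bit 0x4 is thus skipped by a wake-up with bitset 0x3, but selected by one with FUTEX_BITSET_MATCH_ANY (all ones), which is the case that corresponds to plain FUTEX_WAKE.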
bd90a5f9 678.\"
70b06b90 679.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
bd90a5f9
MK
680.\"
681.SS Priority-inheritance futexes
b52e1cd4
MK
682Linux supports priority-inheritance (PI) futexes in order to handle
683priority-inversion problems that can be encountered with
684normal futex locks.
b565548b 685Priority inversion is the problem that occurs when a high-priority
bdc5957a
MK
686task is blocked waiting to acquire a lock held by a low-priority task,
687while tasks at an intermediate priority continuously preempt
688the low-priority task from the CPU.
689Consequently, the low-priority task makes no progress toward
690releasing the lock, and the high-priority task remains blocked.
7f315ae3 691
7d20efd7
MK
692Priority inheritance is a mechanism for dealing with
693the priority-inversion problem.
bdc5957a
MK
694With this mechanism, when a high-priority task becomes blocked
695by a lock held by a low-priority task,
7d20efd7 696the latter's priority is temporarily raised to that of the former,
bdc5957a 697so that it is not preempted by any intermediate-priority tasks,
7d20efd7
MK
698and can thus make progress toward releasing the lock.
699To be effective, priority inheritance must be transitive,
bdc5957a
MK
700meaning that if a high-priority task blocks on a lock
701held by a lower-priority task that is itself blocked by a lock
702held by another intermediate-priority task
7d20efd7 703(and so on, for chains of arbitrary length),
bdc5957a
MK
704then both of those tasks
705(or more generally, all of the tasks in a lock chain)
706have their priorities raised to be the same as the high-priority task.
7d20efd7 707
9e2b90ee
MK
708.\" FIXME XXX The following is my attempt at a definition of PI futexes,
709.\" based on mail discussions with Darren Hart. Does it seem okay?
710From a user-space perspective,
711what makes a futex PI-aware is a policy agreement between user space
4b35dc5d 712and the kernel about the value of the futex word (described in a moment),
9e2b90ee
MK
713coupled with the use of the PI futex operations described below
714(in particular,
715.BR FUTEX_LOCK_PI ,
716.BR FUTEX_TRYLOCK_PI ,
717and
718.BR FUTEX_CMP_REQUEUE_PI ).
719.\" Quoting Darren Hart:
720.\" These opcodes paired with the PI futex value policy (described below)
721.\" defines a "futex" as PI aware. These were created very specifically
722.\" in support of PI pthread_mutexes, so it makes a lot more sense to
723.\" talk about a PI aware pthread_mutex, than a PI aware futex, since
724.\" there is a lot of policy and scaffolding that has to be built up
725.\" around it to use it properly (this is what a PI pthread_mutex is).
726
f1d2171d 727.\" FIXME XXX ===== Start of adapted Hart/Guniguntala text =====
1af427a4
MK
728.\" The following text is drawn from the Hart/Guniguntala paper
729.\" (listed in SEE ALSO), but I have reworded some pieces
730.\" significantly. Please check it.
79d918c7
MK
731.\"
732The PI futex operations described below differ from the other
4b35dc5d
TR
733futex operations in that they impose policy on the use of the value of the
734futex word:
79d918c7 735.IP * 3
4b35dc5d 736If the lock is not acquired, the futex word's value shall be 0.
79d918c7 737.IP *
4b35dc5d
TR
738If the lock is acquired, the futex word's value shall be the thread ID (TID;
739see
79d918c7
MK
740.BR gettid (2))
741of the owning thread.
742.IP *
f1d2171d 743.\" FIXME XXX In the following line, I added "the lock is owned and". Okay?
79d918c7
MK
744If the lock is owned and there are threads contending for the lock,
745then the
746.B FUTEX_WAITERS
4b35dc5d 747bit shall be set in the futex word's value; in other words, this value is:
79d918c7
MK
748
749 FUTEX_WAITERS | TID
9e2b90ee 750
79d918c7 751.PP
4b35dc5d 752Note that a PI futex word never just has the value
9e2b90ee
MK
753.BR FUTEX_WAITERS ,
754which is a permissible state for non-PI futexes.
755
79d918c7 756With this policy in place,
4b35dc5d
TR
757a user-space application can acquire a not-acquired
758lock or release a lock that no other threads try to acquire using atomic
759instructions executed in user space (e.g., a compare-and-swap operation such
760as
b52e1cd4
MK
761.I cmpxchg
762on the x86 architecture).
4b35dc5d
TR
763Acquiring a lock simply consists of using compare-and-swap to atomically set
764the futex word's value to the caller's TID if its previous value was 0.
765Releasing a lock requires using compare-and-swap to set the futex word's
766value to 0 if the previous value was the expected TID.
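The user-space fast paths just described can be sketched with C11 atomics. The function names below are invented for this example, and the caller is assumed to pass its own TID as obtained from
.BR gettid (2);
the fallbacks into the kernel are deliberately left out.

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Fast-path acquire: succeeds only if the word was 0 (not acquired).
   On failure the caller would have to fall back to FUTEX_LOCK_PI. */
bool pi_fast_lock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = 0;
    return atomic_compare_exchange_strong(word, &expected, tid);
}

/* Fast-path release: succeeds only if the word holds exactly the
   caller's TID. If FUTEX_WAITERS is set, the exchange fails and the
   caller would have to use FUTEX_UNLOCK_PI instead. */
bool pi_fast_unlock(_Atomic uint32_t *word, uint32_t tid)
{
    uint32_t expected = tid;
    return atomic_compare_exchange_strong(word, &expected, 0);
}

/* True if the kernel has marked the lock as contended. */
bool pi_has_waiters(_Atomic uint32_t *word)
{
    return (atomic_load(word) & FUTEX_WAITERS) != 0;
}
```

Because the release compares against the bare TID, a word of the form FUTEX_WAITERS | TID can never be cleared by the fast path, which is what forces a contended release through the kernel.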
b52e1cd4 767
4b35dc5d 768If a futex is already acquired (i.e., has a nonzero value),
b52e1cd4 769waiters must employ the
79d918c7
MK
770.B FUTEX_LOCK_PI
771operation to acquire the lock.
4b35dc5d 772If other threads are waiting for the lock, then the
79d918c7 773.B FUTEX_WAITERS
4b35dc5d 774bit is set in the futex value; in this case, the lock owner must employ the
79d918c7 775.B FUTEX_UNLOCK_PI
b52e1cd4
MK
776operation to release the lock.
777
79d918c7
MK
778In the cases where callers are forced into the kernel
779(i.e., required to perform a
780.BR futex ()
781operation),
782they then deal directly with a so-called RT-mutex,
783a kernel locking mechanism which implements the required
784priority-inheritance semantics.
785After the RT-mutex is acquired, the futex value is updated accordingly,
786before the calling thread returns to user space.
787.\" FIXME ===== End of adapted Hart/Guniguntala text =====
788
a59fca75
MK
789It is important to note
790.\" FIXME We need some explanation here of *why* it is important to
1af427a4 791.\" note this. Can someone explain?
4b35dc5d 792that the kernel will update the futex word's value prior
79d918c7
MK
793to returning to user space.
794Unlike the other futex operations described above,
795the PI futex operations are designed
d9d5be6b 796for the implementation of very specific IPC mechanisms.
fc57e6bb 797.\"
7bd3ffbc 798.\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
99c0ac69
MK
799.\" made the observation that "EINVAL is returned if the non-pi
800.\" to pi or op pairing semantics are violated."
801.\" Probably there needs to be a general statement about this
802.\" requirement, probably located at about this point in the page.
7bd3ffbc 803.\" Darren, care to take a shot at this?
dd003bef
MK
804.\"
805.\" FIXME Somewhere on this page (I guess under the discussion of PI
806.\" futexes) we need a discussion of the FUTEX_OWNER_DIED bit.
807.\" Can someone propose a text?
bd90a5f9
MK
808
809PI futexes are operated on by specifying one of the following values in
810.IR futex_op :
70b06b90
MK
811.\"
812.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
813.\"
d67e21f5
MK
814.TP
815.BR FUTEX_LOCK_PI " (since Linux 2.6.18)"
816.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
67833bec
MK
817.\"
818.\" FIXME I did some significant rewording of tglx's text.
819.\" Please check, in case I injected errors.
820.\"
821This operation is used after an attempt to acquire
4b35dc5d
TR
822the lock via an atomic user-space instruction failed
823because the futex word has a nonzero value\(emspecifically,
67833bec 824because it contained the namespace-specific TID of the lock owner.
67259526 825.\" FIXME In the preceding line, what does "namespace-specific" mean?
67833bec 826.\" (I kept those words from tglx.)
67259526 827.\" That is, what kind of namespace are we talking about?
67833bec
MK
828.\" (I suppose we are talking PID namespaces here, but I want to
829.\" be sure.)
830
4b35dc5d 831The operation checks the value of the futex word at the address
67833bec 832.IR uaddr .
70b06b90
MK
833If the value is 0, then the kernel tries to atomically set
834the futex value to the caller's TID.
67833bec
MK
835If that fails,
836.\" FIXME What would be the cause of failure?
4b35dc5d 837or the futex word's value is nonzero,
67833bec 838the kernel atomically sets the
e0547e70 839.B FUTEX_WAITERS
67833bec
MK
840bit, which signals the futex owner that it cannot unlock the futex in
841user space atomically by setting the futex value to 0.
842After that, the kernel tries to find the thread which is
843associated with the owner TID,
844.\" FIXME Could I get a bit more detail on the next two lines?
845.\" What is "creates or reuses kernel state" about?
846creates or reuses kernel state on behalf of the owner
847and attaches the waiter to it.
67259526
MK
848.\" FIXME In the next line, what type of "priority" are we talking about?
849.\" Realtime priorities for SCHED_FIFO and SCHED_RR?
850.\" Or something else?
1f043693 851The enqueueing of the waiter is in descending priority order if more
e0547e70 852than one waiter exists.
67259526 853.\" FIXME What does "bandwidth" refer to in the next line?
e0547e70 854The owner inherits either the priority or the bandwidth of the waiter.
67259526
MK
855.\" FIXME In the preceding line, what determines whether the
856.\" owner inherits the priority versus the bandwidth?
67833bec
MK
857.\"
858.\" FIXME Could I get some help translating the next sentence into
859.\" something that user-space developers (and I) can understand?
70b06b90 860.\" In particular, what are "nested locks" in this context?
e0547e70
TG
861This inheritance follows the lock chain in the case of
862nested locking and performs deadlock detection.
863
9ce19cf1
MK
864.\" FIXME tglx says "The timeout argument is handled as described in
865.\" FUTEX_WAIT." However, it appears to me that this is not right.
70b06b90 866.\" Is the following formulation correct?
e0547e70
TG
867The
868.I timeout
9ce19cf1
MK
869argument provides a timeout for the lock attempt.
870It is interpreted as an absolute time, measured against the
871.BR CLOCK_REALTIME
872clock.
873If
874.I timeout
875is NULL, the operation will block indefinitely.
e0547e70 876
a449c634 877The
e0547e70
TG
878.IR uaddr2 ,
879.IR val ,
880and
881.IR val3
a449c634 882arguments are ignored.
67833bec 883.\"
70b06b90
MK
884.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
885.\"
d67e21f5 886.TP
12fdbe23 887.BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)"
d67e21f5 888.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
12fdbe23
MK
889This operation tries to acquire the futex at
890.IR uaddr .
0b761826 891.\" FIXME I think it would be helpful here to say a few more words about
70b06b90
MK
892.\" the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI.
893.\" Can someone propose something?
894.\"
4b35dc5d
TR
895.\" FIXME Additionally, we claim above that just FUTEX_WAITERS is never an
896.\" allowed state.
fa0388c3 897It deals with the situation where the TID value at
12fdbe23
MK
898.I uaddr
899is 0, but the
b52e1cd4 900.B FUTEX_WAITERS
12fdbe23 901bit is set.
fa0388c3
MK
902.\" FIXME How does the situation in the previous sentence come about?
903.\" Probably it would be helpful to say something about that in
904.\" the man page.
badbf70c 905.\" FIXME And *how* does FUTEX_TRYLOCK_PI deal with this situation?
a282e5b0 906User space cannot handle this condition in a race-free manner.
084744ef
MK
907
908The
909.IR uaddr2 ,
910.IR val ,
911.IR timeout ,
912and
913.IR val3
914arguments are ignored.
70b06b90
MK
915.\"
916.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
917.\"
d67e21f5 918.TP
12fdbe23 919.BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)"
d67e21f5 920.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
d4ba4328 921This operation wakes the top priority waiter that is waiting in
ecae2099
TG
922.B FUTEX_LOCK_PI
923on the futex address provided by the
924.I uaddr
925argument.
926
927This is called when the user-space value at
928.I uaddr
929cannot be changed atomically from a TID (of the owner) to 0.
930
931The
932.IR uaddr2 ,
933.IR val ,
934.IR timeout ,
935and
936.IR val3
11a194bf 937arguments are ignored.
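
The division of labor between user space and the kernel for PI futexes might look roughly like the following. This is a minimal sketch, not a complete lock implementation: the names pi_lock() and pi_unlock() are hypothetical, the futex word is assumed to be 0 when unlocked and to hold the owner's TID otherwise, and error handling is reduced to retrying on EAGAIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: acquire a PI futex. Returns 0 on success. */
static int
pi_lock(uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t) syscall(SYS_gettid);

    /* Fast path: atomically change 0 to our TID in user space */
    if (__atomic_compare_exchange_n(word, &expected, tid, 0,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return 0;

    /* Slow path: the word is nonzero, so let the kernel block us
       (NULL timeout: wait indefinitely). EAGAIN means "try again". */
    while (syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0) == -1) {
        if (errno != EAGAIN)
            return -1;
    }

    return 0;
}

/* Hypothetical helper: release a PI futex. Returns 0 on success. */
static int
pi_unlock(uint32_t *word)
{
    uint32_t expected = (uint32_t) syscall(SYS_gettid);

    /* Fast path: TID to 0 succeeds only if FUTEX_WAITERS is not set */
    if (__atomic_compare_exchange_n(word, &expected, 0, 0,
                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
        return 0;

    /* Slow path: waiters exist, so the kernel must hand over the lock */
    return (int) syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

In the uncontended case neither function enters the kernel, which is the point of the design: the system call is needed only when the atomic user-space update fails.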
70b06b90
MK
938.\"
939.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
940.\"
d67e21f5 941.TP
d67e21f5
MK
942.BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)"
943.\" commit 52400ba946759af28442dee6265c5c0180ac7122
f812a08b
DH
944This operation is a PI-aware variant of
945.BR FUTEX_CMP_REQUEUE .
946It requeues waiters that are blocked via
947.B FUTEX_WAIT_REQUEUE_PI
948on
949.I uaddr
950from a non-PI source futex
951.RI ( uaddr )
952to a PI target futex
953.RI ( uaddr2 ).
954
9e54d26d
MK
955As with
956.BR FUTEX_CMP_REQUEUE ,
957this operation wakes up a maximum of
958.I val
959waiters that are waiting on the futex at
960.IR uaddr .
961However, for
962.BR FUTEX_CMP_REQUEUE_PI ,
963.I val
6fbeb8f4 964is required to be 1
939ca89f 965(since the main point is to avoid a thundering herd).
9e54d26d
MK
966The remaining waiters are removed from the wait queue of the source futex at
967.I uaddr
968and added to the wait queue of the target futex at
969.IR uaddr2 .
f812a08b 970
9e54d26d 971The
768d3c23 972.I val2
c6d8cf21
MK
973.\" val2 is the cap on the number of requeued waiters.
974.\" In the glibc pthread_cond_broadcast() implementation, this argument
975.\" is specified as INT_MAX, and for pthread_cond_signal() it is 0.
9e54d26d 976and
768d3c23 977.I val3
9e54d26d
MK
978arguments serve the same purposes as for
979.BR FUTEX_CMP_REQUEUE .
70b06b90 980.\"
be376673
MK
981.\" FIXME The page at http://locklessinc.com/articles/futex_cheat_sheet/
982.\" notes that "priority-inheritance Futex to priority-inheritance
983.\" Futex requeues are currently unsupported". Do we need to say
984.\" something in the man page about that?
70b06b90
MK
985.\"
986.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
987.\"
d67e21f5
MK
988.TP
989.BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)"
990.\" commit 52400ba946759af28442dee6265c5c0180ac7122
70b06b90
MK
991.\"
992.\" FIXME I find the next sentence (from tglx) pretty hard to grok.
1af427a4 993.\" Could someone explain it a bit more?
6ff1b4c0
TG
994Wait on a non-PI futex at
995.I uaddr
996and potentially be requeued onto a PI futex at
997.IR uaddr2 .
998The wait operation on
999.I uaddr
1000is the same as
1001.BR FUTEX_WAIT .
70b06b90 1002.\"
f1d2171d
MK
1003.\" FIXME I'm not quite clear on the meaning of the following sentence.
1004.\" Is this trying to say that while blocked in a
1005.\" FUTEX_WAIT_REQUEUE_PI, it could happen that another
1006.\" task does a FUTEX_WAKE on uaddr that simply causes
1007.\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
1008.\" does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
1009.\" operation? Does it remain blocked, or does it unblock?
1010.\" In which case, what does user space see?
6ff1b4c0
TG
1011The waiter can be removed from the wait on
1012.I uaddr
1013via
1014.BR FUTEX_WAKE
1015without requeueing on
1016.IR uaddr2 .
a4e69912 1017
63bea7dc
MK
1018.\" FIXME Please check the following. tglx said "The timeout argument
1019.\" is handled as described in FUTEX_WAIT.", but the truth is
1020.\" as below, AFAICS
1021If
1022.I timeout
1023is not NULL, it specifies a timeout for the wait operation;
1024this timeout is interpreted as outlined above in the description of the
1025.BR FUTEX_CLOCK_REALTIME
1026option.
1027If
1028.I timeout
1029is NULL, the operation can block indefinitely.
1030
a4e69912
MK
1031The
1032.I val3
1033argument is ignored.
70b06b90 1034.\" FIXME Re the preceding sentence... Actually 'val3' is internally set to
a4e69912
MK
1035.\" FUTEX_BITSET_MATCH_ANY before calling futex_wait_requeue_pi().
1036.\" I'm not sure we need to say anything about this though.
1037.\" Comments?
abb571e8
MK
1038
1039The
1040.BR FUTEX_WAIT_REQUEUE_PI
1041and
1042.BR FUTEX_CMP_REQUEUE_PI
1043operations were added to support a fairly specific use case:
1044support for priority-inheritance-aware POSIX threads condition variables.
1045The idea is that these operations should always be paired,
1046in order to ensure that user space and the kernel remain in sync.
1047Thus, in the
1048.BR FUTEX_WAIT_REQUEUE_PI
1049operation, the user-space application pre-specifies the target
1050of the requeue that takes place in the
1051.BR FUTEX_CMP_REQUEUE_PI
1052operation.
1053.\"
1054.\" Darren Hart notes that a patch to allow glibc to fully support
1af427a4 1055.\" PI-aware pthreads condition variables has not yet been accepted into
abb571e8
MK
1056.\" glibc. The story is complex, and can be found at
1057.\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588
1058.\" Darren notes that in the meantime, the patch is shipped with various
1af427a4 1059.\" PREEMPT_RT-enabled Linux systems.
abb571e8
MK
1060.\"
1061.\" Related to the preceding, Darren proposed that somewhere, man-pages
1062.\" should document the following point:
1af427a4 1063.\"
abb571e8
MK
1064.\" While the Linux kernel, since 2.6.31, supports requeueing of
1065.\" priority-inheritance (PI) aware mutexes via the
1066.\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations,
1067.\" the glibc implementation does not yet take full advantage of this.
1068.\" Specifically, the condvar internal data lock remains a non-PI aware
1069.\" mutex, regardless of the type of the pthread_mutex associated with
1070.\" the condvar. This can lead to an unbounded priority inversion on
1071.\" the internal data lock even when associating a PI aware
1072.\" pthread_mutex with a condvar during a pthread_cond*_wait
1073.\" operation. For this reason, it is not recommended to rely on
1074.\" priority inheritance when using pthread condition variables.
1af427a4
MK
1075.\"
1076.\" The problem is that the obvious location for this text is
1077.\" the pthread_cond*wait(3) man page. However, such a man page
abb571e8 1078.\" does not currently exist.
70b06b90 1079.\"
6700de24 1080.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
70b06b90 1081.\"
47297adb 1082.SH RETURN VALUE
fea681da 1083.PP
6f147f79 1084In the event of an error, all operations return \-1 and set
e808bba0 1085.I errno
6f147f79 1086to indicate the cause of the error.
e808bba0
MK
1087The return value on success depends on the operation,
1088as described in the following list:
fea681da
MK
1089.TP
1090.B FUTEX_WAIT
4b35dc5d
TR
1091Returns 0 if the caller was woken up. Note that a wake-up can also be
1092caused by common futex usage patterns in unrelated code that happened to have
1093previously used the futex word's memory location (e.g., typical futex-based
1094implementations of Pthreads mutexes can cause this under some conditions).
1095Therefore, callers should always conservatively assume that a return value of
10960 can mean a spurious wake-up, and use the futex word's value (i.e., the
1097user-space synchronization scheme) to decide whether to continue to block or not.
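
The advice about spurious wake-ups amounts to always rechecking the futex word in a loop. A minimal sketch, assuming a hypothetical helper named wait_while_equal() that blocks for as long as the word still holds the value val:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: block until *word no longer equals val.
   Returns 0 once the value has changed, or -1 on an unexpected error. */
static int
wait_while_equal(uint32_t *word, uint32_t val)
{
    while (__atomic_load_n(word, __ATOMIC_ACQUIRE) == val) {
        long s = syscall(SYS_futex, word, FUTEX_WAIT, val, NULL, NULL, 0);

        /* EAGAIN: the word changed before we slept.
           EINTR: a signal interrupted the wait.
           In both cases, and after a return of 0 (which may be
           spurious), the loop condition rechecks the word. */
        if (s == -1 && errno != EAGAIN && errno != EINTR)
            return -1;
    }

    return 0;
}
```

The decision to keep waiting is based only on the futex word, never on the return value of the system call.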
fea681da
MK
1098.TP
1099.B FUTEX_WAKE
bdc5957a 1100Returns the number of waiters that were woken up.
fea681da
MK
1101.TP
1102.B FUTEX_FD
1103Returns the new file descriptor associated with the futex.
1104.TP
1105.B FUTEX_REQUEUE
bdc5957a 1106Returns the number of waiters that were woken up.
fea681da
MK
1107.TP
1108.B FUTEX_CMP_REQUEUE
bdc5957a 1109Returns the total number of waiters that were woken up or
4b35dc5d 1110requeued to the futex for the futex word at
3dfcc11d
MK
1111.IR uaddr2 .
1112If this value is greater than
1113.IR val ,
4b35dc5d
TR
1114then the difference is the number of waiters requeued to the futex for the futex
1115word at
3dfcc11d 1116.IR uaddr2 .
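
The arithmetic described for FUTEX_CMP_REQUEUE can be written out as a small helper. This is a sketch that follows the rule stated above literally; the function name requeued_count() is hypothetical.

```c
/* Hypothetical helper: given the return value of a successful
   FUTEX_CMP_REQUEUE call and the val argument that was passed to it,
   estimate how many waiters were requeued rather than woken,
   following the rule stated in the text: if the return value exceeds
   val, the difference is the number of requeued waiters. */
static int
requeued_count(int ret, int val)
{
    return (ret > val) ? ret - val : 0;
}
```

For instance, a return value of 5 from a call made with val set to 1 would indicate that 4 waiters were requeued.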
dcad19c0
MK
1117.TP
1118.B FUTEX_WAKE_OP
a8b5b324 1119Returns the total number of waiters that were woken up.
4b35dc5d 1120This is the sum of the woken waiters on the two futexes for the futex words at
a8b5b324
MK
1121.I uaddr
1122and
1123.IR uaddr2 .
dcad19c0
MK
1124.TP
1125.B FUTEX_WAIT_BITSET
4b35dc5d
TR
1126Returns 0 if the caller was woken up. See
1127.B FUTEX_WAIT
1128for how to interpret this correctly in practice.
dcad19c0
MK
1129.TP
1130.B FUTEX_WAKE_BITSET
bdc5957a 1131Returns the number of waiters that were woken up.
dcad19c0
MK
1132.TP
1133.B FUTEX_LOCK_PI
bf02a260 1134Returns 0 if the futex was successfully locked.
dcad19c0
MK
1135.TP
1136.B FUTEX_TRYLOCK_PI
5c716eef 1137Returns 0 if the futex was successfully locked.
dcad19c0
MK
1138.TP
1139.B FUTEX_UNLOCK_PI
52bb928f 1140Returns 0 if the futex was successfully unlocked.
dcad19c0
MK
1141.TP
1142.B FUTEX_CMP_REQUEUE_PI
bdc5957a 1143Returns the total number of waiters that were woken up or
4b35dc5d 1144requeued to the futex for the futex word at
dddd395a
MK
1145.IR uaddr2 .
1146If this value is greater than
1147.IR val ,
4b35dc5d
TR
1148then the difference is the number of waiters requeued to the futex for the futex
1149word at
dddd395a 1150.IR uaddr2 .
dcad19c0
MK
1151.TP
1152.B FUTEX_WAIT_REQUEUE_PI
4b35dc5d
TR
1153Returns 0 if the caller was successfully requeued to the futex for the futex
1154word at
22c15de9 1155.IR uaddr2 .
70b06b90
MK
1156.\"
1157.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1158.\"
fea681da
MK
1159.SH ERRORS
1160.TP
1161.B EACCES
4b35dc5d 1162No read access to the memory of a futex word.
fea681da
MK
1163.TP
1164.B EAGAIN
f48516d1 1165.RB ( FUTEX_WAIT ,
4b35dc5d 1166.BR FUTEX_WAIT_BITSET ,
f48516d1 1167.BR FUTEX_WAIT_REQUEUE_PI )
badbf70c
MK
1168The value pointed to by
1169.I uaddr
1170was not equal to the expected value
1171.I val
1172at the time of the call.
9732dd8b
MK
1173
1174.BR Note :
1175on Linux, the symbolic names
1176.B EAGAIN
1177and
1178.B EWOULDBLOCK
77da5feb 1179(both of which appear in different parts of the kernel futex code)
9732dd8b 1180have the same value.
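
Because the two names have the same value on Linux, a caller needs to test only one of them. A minimal sketch; the helper name futex_value_mismatch() is hypothetical, and the compile-time check simply documents the assumption that the code is built on Linux.

```c
#include <errno.h>

/* Fail the build on any platform where the two names differ */
_Static_assert(EAGAIN == EWOULDBLOCK,
               "EAGAIN and EWOULDBLOCK are expected to be equal on Linux");

/* Hypothetical helper: report whether an errno value from a failed
   wait operation means that the futex word did not hold the
   expected value at the time of the call. */
static int
futex_value_mismatch(int err)
{
    return err == EAGAIN;
}
```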
badbf70c
MK
1181.TP
1182.B EAGAIN
8f2068bb
MK
1183.RB ( FUTEX_CMP_REQUEUE ,
1184.BR FUTEX_CMP_REQUEUE_PI )
ce5602fd 1185The value pointed to by
9f6c40c0
MK
1186.I uaddr
1187is not equal to the expected value
1188.IR val3 .
fd1dc4c2 1189.\" FIXME: Is the following sentence correct?
4b35dc5d 1190.\" I would prefer to remove this sentence. --triegel@redhat.com
fea681da 1191(This probably indicates a race;
682edefb
MK
1192use the safe
1193.B FUTEX_WAKE
1194now.)
c0091dd3 1195.\"
f1d2171d 1196.\" FIXME XXX Should there be an EAGAIN case for FUTEX_TRYLOCK_PI?
c0091dd3
MK
1197.\" It seems so, looking at the handling of the rt_mutex_trylock()
1198.\" call in futex_lock_pi()
9732dd8b 1199.\" (Davidlohr also thinks so.)
c0091dd3 1200.\"
fea681da 1201.TP
5662f56a
MK
1202.BR EAGAIN
1203.RB ( FUTEX_LOCK_PI ,
aaec9032
MK
1204.BR FUTEX_TRYLOCK_PI ,
1205.BR FUTEX_CMP_REQUEUE_PI )
1206The futex owner thread ID of
1207.I uaddr
1208(for
1209.BR FUTEX_CMP_REQUEUE_PI :
1210.IR uaddr2 )
1211is about to exit,
5662f56a
MK
1212but has not yet handled the internal state cleanup.
1213Try again.
1214.TP
7a39e745
MK
1215.BR EDEADLK
1216.RB ( FUTEX_LOCK_PI ,
9732dd8b
MK
1217.BR FUTEX_TRYLOCK_PI ,
1218.BR FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1219The futex word at
7a39e745
MK
1220.I uaddr
1221is already locked by the caller.
1222.TP
662c0da8
MK
1223.BR EDEADLK
1224.\" FIXME I reworded tglx's text somewhat; is the following okay?
f1d2171d
MK
1225.\" FIXME XXX I see that kernel/locking/rtmutex.c uses EDEADLK in some places,
1226.\" and EDEADLOCK in others. On almost all architectures these
1227.\" constants are synonymous. Is there a reason that both names
1228.\" are used?
662c0da8 1229.RB ( FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1230While requeueing a waiter to the PI futex for the futex word at
662c0da8
MK
1231.IR uaddr2 ,
1232the kernel detected a deadlock.
1233.TP
fea681da 1234.B EFAULT
1ea901e8
MK
1235A required pointer argument (i.e.,
1236.IR uaddr ,
1237.IR uaddr2 ,
1238or
1239.IR timeout )
496df304 1240did not point to a valid user-space address.
fea681da 1241.TP
9f6c40c0 1242.B EINTR
e808bba0 1243A
9f6c40c0 1244.B FUTEX_WAIT
2674f781
MK
1245or
1246.B FUTEX_WAIT_BITSET
e808bba0 1247operation was interrupted by a signal (see
f529fd20
MK
1248.BR signal (7)).
1249In kernels before Linux 2.6.22, this error could also be returned
1250on a spurious wakeup; since Linux 2.6.22, this no longer happens.
9f6c40c0 1251.TP
fea681da 1252.B EINVAL
180f97b7
MK
1253The operation in
1254.IR futex_op
1255is one of those that employs a timeout, but the supplied
fb2f4c27
MK
1256.I timeout
1257argument was invalid
1258.RI ( tv_sec
1259was less than zero, or
1260.IR tv_nsec
1261was not less than 1,000,000,000).
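
A caller can check a timeout against these rules before making the system call. The following is a sketch with a hypothetical helper name, timeout_is_valid(); it also rejects a negative tv_nsec, which the text above does not mention, so treat that part as an assumption.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: return true if ts satisfies the validity rules
   described above (tv_sec not negative, tv_nsec below one billion).
   Rejecting a negative tv_nsec is an extra assumption of this sketch. */
static bool
timeout_is_valid(const struct timespec *ts)
{
    return ts->tv_sec >= 0 &&
           ts->tv_nsec >= 0 &&
           ts->tv_nsec < 1000000000L;
}
```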
1262.TP
1263.B EINVAL
0c74df0b 1264The operation specified in
025e1374 1265.IR futex_op
0c74df0b 1266employs one or both of the pointers
51ee94be 1267.I uaddr
a1f47699 1268and
0c74df0b
MK
1269.IR uaddr2 ,
1270but one of these does not point to a valid object\(emthat is,
1271the address is not four-byte-aligned.
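
The alignment requirement can be checked in user space before the call. A minimal sketch, assuming a 32-bit futex word; the helper name futex_addr_is_aligned() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: return true if p is aligned on a four-byte
   boundary, as required for the address of a futex word. */
static bool
futex_addr_is_aligned(const void *p)
{
    return ((uintptr_t) p % sizeof(uint32_t)) == 0;
}
```

An ordinary int or uint32_t object always passes this check; the error is mostly a concern when a futex word is carved out of a byte buffer or a packed structure.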
51ee94be
MK
1272.TP
1273.B EINVAL
55cc422d
TG
1274.RB ( FUTEX_WAIT_BITSET ,
1275.BR FUTEX_WAKE_BITSET )
79c9b436
TG
1276The bitset supplied in
1277.IR val3
1278is zero.
1279.TP
1280.B EINVAL
2abcba67 1281.RB ( FUTEX_CMP_REQUEUE_PI )
add875c0
MK
1282.I uaddr
1283equals
1284.IR uaddr2
1285(i.e., an attempt was made to requeue to the same futex).
1286.TP
ff597681
MK
1287.BR EINVAL
1288.RB ( FUTEX_FD )
1289The signal number supplied in
1290.I val
1291is invalid.
1292.TP
6bac3b85 1293.B EINVAL
476debd7
MK
1294.RB ( FUTEX_WAKE ,
1295.BR FUTEX_WAKE_OP ,
1296.BR FUTEX_WAKE_BITSET ,
1297.BR FUTEX_REQUEUE ,
1298.BR FUTEX_CMP_REQUEUE )
1299The kernel detected an inconsistency between the user-space state at
1300.I uaddr
1301and the kernel state\(emthat is, it detected a waiter which waits in
1302.BR FUTEX_LOCK_PI
1303on
1304.IR uaddr .
1305.TP
1306.B EINVAL
a218ef20 1307.RB ( FUTEX_LOCK_PI ,
ce022f18
MK
1308.BR FUTEX_TRYLOCK_PI ,
1309.BR FUTEX_UNLOCK_PI )
a218ef20
MK
1310The kernel detected an inconsistency between the user-space state at
1311.I uaddr
1312and the kernel state.
ce022f18
MK
1313This indicates either state corruption
1314.\" FIXME tglx did not mention the "state corruption" for FUTEX_UNLOCK_PI.
1315.\" Does that case also apply for FUTEX_UNLOCK_PI?
1316or that the kernel found a waiter on
a218ef20
MK
1317.I uaddr
1318which is waiting via
1319.BR FUTEX_WAIT
1320or
1321.BR FUTEX_WAIT_BITSET .
1322.TP
1323.B EINVAL
f9250b1a
MK
1324.RB ( FUTEX_CMP_REQUEUE_PI )
1325The kernel detected an inconsistency between the user-space state at
99c0041d
MK
1326.I uaddr2
1327and the kernel state;
1328that is, the kernel detected a waiter which waits via
1329.BR FUTEX_WAIT
1330.\" FIXME tglx did not mention FUTEX_WAIT_BITSET here,
1331.\" but should that not also be included here?
1332on
1333.IR uaddr2 .
1334.TP
1335.B EINVAL
1336.RB ( FUTEX_CMP_REQUEUE_PI )
1337The kernel detected an inconsistency between the user-space state at
f9250b1a
MK
1338.I uaddr
1339and the kernel state;
1340that is, the kernel detected a waiter which waits via
75299c8d 1341.BR FUTEX_WAIT
99c0041d 1342or
75299c8d 1343.BR FUTEX_WAIT_BITSET
f9250b1a
MK
1344on
1345.IR uaddr .
1346.TP
1347.B EINVAL
99c0041d 1348.RB ( FUTEX_CMP_REQUEUE_PI )
75299c8d
MK
1349The kernel detected an inconsistency between the user-space state at
1350.I uaddr
1351and the kernel state;
1352that is, the kernel detected a waiter which waits on
1353.I uaddr
1354via
1355.BR FUTEX_LOCK_PI
1356(instead of
1357.BR FUTEX_WAIT_REQUEUE_PI ).
99c0041d
MK
1358.TP
1359.B EINVAL
9786b3ca 1360.RB ( FUTEX_CMP_REQUEUE_PI )
f1d2171d 1361.\" FIXME XXX The following is a reworded version of Darren Hart's text.
9786b3ca
MK
1362.\" Please check that I did not introduce any errors.
1363An attempt was made to requeue a waiter to a futex other than that
1364specified by the matching
1365.B FUTEX_WAIT_REQUEUE_PI
1366call for that waiter.
1367.TP
1368.B EINVAL
f0c0d61c
MK
1369.RB ( FUTEX_CMP_REQUEUE_PI )
1370The
1371.I val
1372argument is not 1.
1373.TP
1374.B EINVAL
4832b48a 1375Invalid argument.
fea681da 1376.TP
a449c634
MK
1377.BR ENOMEM
1378.RB ( FUTEX_LOCK_PI ,
e34a8fb6
MK
1379.BR FUTEX_TRYLOCK_PI ,
1380.BR FUTEX_CMP_REQUEUE_PI )
a449c634
MK
1381The kernel could not allocate memory to hold state information.
1382.TP
fea681da 1383.B ENFILE
ff597681 1384.RB ( FUTEX_FD )
fea681da 1385The system limit on the total number of open files has been reached.
4701fc28
MK
1386.TP
1387.B ENOSYS
1388Invalid operation specified in
d33602c4 1389.IR futex_op .
9f6c40c0 1390.TP
4a7e5b05
MK
1391.B ENOSYS
1392The
1393.BR FUTEX_CLOCK_REALTIME
1394option was specified in
1afcee7c 1395.IR futex_op ,
4a7e5b05
MK
1396but the accompanying operation was neither
1397.BR FUTEX_WAIT_BITSET
1398nor
1399.BR FUTEX_WAIT_REQUEUE_PI .
1400.TP
a9dcb4d1
MK
1401.BR ENOSYS
1402.RB ( FUTEX_LOCK_PI ,
f2424fae 1403.BR FUTEX_TRYLOCK_PI ,
4945ff19 1404.BR FUTEX_UNLOCK_PI ,
4cf92894 1405.BR FUTEX_CMP_REQUEUE_PI ,
794bb106 1406.BR FUTEX_WAIT_REQUEUE_PI )
4b35dc5d 1407A run-time check determined that the operation is not available.
a2ebebcd
MK
1408The PI futex operations are not implemented on all architectures and
1409are not supported on some CPU variants.
a9dcb4d1 1410.TP
c7589177
MK
1411.BR EPERM
1412.RB ( FUTEX_LOCK_PI ,
dc2742a8
MK
1413.BR FUTEX_TRYLOCK_PI ,
1414.BR FUTEX_CMP_REQUEUE_PI )
04331c3f 1415The caller is not allowed to attach itself to the futex at
dc2742a8
MK
1416.I uaddr
1417(for
1418.BR FUTEX_CMP_REQUEUE_PI :
1419the futex at
1420.IR uaddr2 ).
c7589177
MK
1421(This may be caused by a state corruption in user space.)
1422.TP
76f347ba 1423.BR EPERM
87276709 1424.RB ( FUTEX_UNLOCK_PI )
4b35dc5d 1425The caller does not own the lock represented by the futex word.
76f347ba 1426.TP
0b0e4934
MK
1427.BR ESRCH
1428.RB ( FUTEX_LOCK_PI ,
9732dd8b
MK
1429.BR FUTEX_TRYLOCK_PI ,
1430.BR FUTEX_CMP_REQUEUE_PI )
0b0e4934
MK
1431.\" FIXME I reworded the following sentence a bit differently from
1432.\" tglx's formulation. Is it okay?
4b35dc5d 1433The thread ID in the futex word at
0b0e4934
MK
1434.I uaddr
1435does not exist.
1436.TP
360f773c
MK
1437.BR ESRCH
1438.RB ( FUTEX_CMP_REQUEUE_PI )
1439.\" FIXME I reworded the following sentence a bit differently from
1440.\" tglx's formulation. Is it okay?
4b35dc5d 1441The thread ID in the futex word at
360f773c
MK
1442.I uaddr2
1443does not exist.
1444.TP
9f6c40c0 1445.B ETIMEDOUT
4d85047f
MK
1446The operation in
1447.IR futex_op
1448employed the timeout specified in
1449.IR timeout ,
1450and the timeout expired before the operation completed.
70b06b90
MK
1451.\"
1452.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1453.\"
47297adb 1454.SH VERSIONS
a1d5f77c 1455.PP
81c9d87e
MK
1456Futexes were first made available in a stable kernel release
1457with Linux 2.6.0.
1458
a1d5f77c
MK
1459Initial futex support was merged in Linux 2.5.7 but with different semantics
1460from what was described above.
52dee70e 1461A four-argument system call with the semantics
fd3fa7ef 1462described in this page was introduced in Linux 2.5.40.
11b520ed 1463In Linux 2.5.70, a fifth argument
a1d5f77c 1464was added.
11b520ed 1465In Linux 2.6.7, a sixth argument was added\(emmessy, especially
a1d5f77c 1466on the s390 architecture.
47297adb 1467.SH CONFORMING TO
8382f16d 1468This system call is Linux-specific.
47297adb 1469.SH NOTES
baf0f1f4
MK
1470Glibc does not provide a wrapper for this system call; call it using
1471.BR syscall (2).
4b35dc5d
TR
1472.\" TODO FIXME Above, we cite this section and claim it contains details on
1473.\" the synchronization semantics; add the C11 equivalents here (or whatever
1474.\" we find consensus for).
305cc415
MK
1475.\"
1476.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1477.\"
1478.SH EXAMPLE
1479.\" FIXME Is it worth having an example program?
1480.\" FIXME Anything obviously broken in the example program?
1481.\"
77da5feb 1482The program below demonstrates use of futexes in a program
305cc415
MK
1483where parent and child use a pair of futexes located inside a
1484shared anonymous mapping to synchronize access to a shared resource:
1485the terminal.
1486The two processes each write
1487.IR nloops
1488(a command-line argument that defaults to 5 if omitted)
1489messages to the terminal and employ a synchronization protocol
1490that ensures that they alternate in writing messages.
1491Upon running this program we see output such as the following:
1492
1493.in +4n
1494.nf
1495$ \fB./futex_demo\fP
1496Parent (18534) 0
1497Child (18535) 0
1498Parent (18534) 1
1499Child (18535) 1
1500Parent (18534) 2
1501Child (18535) 2
1502Parent (18534) 3
1503Child (18535) 3
1504Parent (18534) 4
1505Child (18535) 4
1506.fi
1507.in
1508.SS Program source
1509\&
1510.nf
/* futex_demo.c

   Usage: futex_demo [nloops]
                    (Default: 5)

   Demonstrate the use of futexes in a program where parent and child
   use a pair of futexes located inside a shared anonymous mapping to
   synchronize access to a shared resource: the terminal. The two
   processes each write \(aqnloops\(aq messages to the terminal and employ
   a synchronization protocol that ensures that they alternate in
   writing messages.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/futex.h>
#include <sys/time.h>

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \\
                        } while (0)

static int *futex1, *futex2, *iaddr;

/* Thin wrapper, since glibc provides no futex() function */

static int
futex(int *uaddr, int futex_op, int val,
      const struct timespec *timeout, int *uaddr2, int val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val,
                   timeout, uaddr2, val3);
}

/* Acquire the futex pointed to by \(aqfutexp\(aq: wait for its value to
   become 1, and then set the value to 0. */

static void
fwait(int *futexp)
{
    int s;

    /* __sync_bool_compare_and_swap(ptr, oldval, newval) is a gcc
       built\-in function. It atomically performs the equivalent of:

           if (*ptr == oldval)
               *ptr = newval;

       It returns true if the test yielded true and *ptr was updated.
       The alternative here would be to employ the equivalent atomic
       machine\-language instructions. For further information, see
       the GCC Manual. */

    /* Maybe the futex is already available: */

    if (__sync_bool_compare_and_swap(futexp, 1, 0))
        return;

    /* No; we must wait for the futex value to be changed */

    while (1) {
        s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
        if (s == \-1 && errno != EAGAIN)
            errExit("futex\-FUTEX_WAIT");

        /* Is the futex now available? */

        if (__sync_bool_compare_and_swap(futexp, 1, 0))
            break;      /* Yes */

        /* Futex is still not available; wait again */
    }
}

/* Release the futex pointed to by \(aqfutexp\(aq: if the futex currently
   has the value 0, set its value to 1 and then wake any futex waiters,
   so that if the peer is blocked in fwait(), it can proceed. */

static void
fpost(int *futexp)
{
    int s;

    /* __sync_bool_compare_and_swap() was described in comments above */

    if (__sync_bool_compare_and_swap(futexp, 0, 1)) {

        s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
        if (s == \-1)
            errExit("futex\-FUTEX_WAKE");
    }
}

int
main(int argc, char *argv[])
{
    pid_t childPid;
    int j, nloops;

    setbuf(stdout, NULL);

    nloops = (argc > 1) ? atoi(argv[1]) : 5;

    /* Create a shared anonymous mapping that will hold the futexes.
       Since the futexes are being shared between processes, we
       subsequently use the "shared" futex operations (i.e., not the
       ones suffixed "_PRIVATE") */

    iaddr = mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, \-1, 0);
    if (iaddr == MAP_FAILED)
        errExit("mmap");

    futex1 = &iaddr[0];
    futex2 = &iaddr[1];

    *futex1 = 0;        /* State: unavailable */
    *futex2 = 1;        /* State: available */

    /* Create a child process that inherits the shared anonymous
       mapping */

    childPid = fork();
    if (childPid == \-1)
        errExit("fork");

    if (childPid == 0) {        /* Child */
        for (j = 0; j < nloops; j++) {
            fwait(futex1);
            printf("Child (%ld) %d\\n", (long) getpid(), j);
            fpost(futex2);
        }

        exit(EXIT_SUCCESS);
    }

    /* Parent falls through to here */

    for (j = 0; j < nloops; j++) {
        fwait(futex2);
        printf("Parent (%ld) %d\\n", (long) getpid(), j);
        fpost(futex1);
    }

    wait(NULL);

    exit(EXIT_SUCCESS);
}
1661.fi
47297adb 1662.SH SEE ALSO
4c222281 1663.ad l
9913033c 1664.BR get_robust_list (2),
d806bc05 1665.BR restart_syscall (2),
14d8dd3b 1666.BR futex (7)
fea681da 1667.PP
f5ad572f
MK
1668The following kernel source files:
1669.IP * 2
1670.I Documentation/pi-futex.txt
1671.IP *
1672.I Documentation/futex-requeue-pi.txt
1673.IP *
1674.I Documentation/locking/rt-mutex.txt
1675.IP *
1676.I Documentation/locking/rt-mutex-design.txt
8fe019c7
MK
1677.IP *
1678.I Documentation/robust-futex-ABI.txt
43b99089 1679.PP
4c222281 1680Franke, H., Russell, R., and Kirkwood, M., 2002.
52087dd3 1681\fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP
4c222281 1682(from proceedings of the Ottawa Linux Symposium 2002),
9b936e9e 1683.br
608bf950
SK
1684.UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002-pages-479-495.pdf
1685.UE
f42eb21b 1686
4c222281 1687Hart, D., 2009. \fIA futex overview and update\fP,
2ed26199
MK
1688.UR http://lwn.net/Articles/360699/
1689.UE
1690
4c222281 1691Hart, D. and Guniguntala, D., 2009.
0483b6cc 1692\fIRequeue-PI: Making Glibc Condvars PI-Aware\fP
4c222281 1693(from proceedings of the 2009 Real-Time Linux Workshop),
0483b6cc
MK
1694.UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf
1695.UE
1696
4c222281 1697Drepper, U., 2011. \fIFutexes Are Tricky\fP,
f42eb21b
MK
1698.UR http://www.akkadia.org/drepper/futex.pdf
1699.UE
9b936e9e
MK
1700.PP
1701Futex example library, futex-*.tar.bz2 at
1702.br
a605264d 1703.UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/
608bf950 1704.UE
34f14794
MK
1705.\"
1706.\" FIXME Are there any other resources that should be listed
1707.\" in the SEE ALSO section?
4b35dc5d
TR
1708.\" FIXME We should probably refer to the glibc code here, in particular the
1709.\" glibc-internal futex wrapper functions that are WIP, and the
1710.\" generic pthread_mutex_t and perhaps condvar implementations.
1711