]> git.ipfire.org Git - thirdparty/man-pages.git/blame - man/man2/futex.2
man/, share/mk/: Move man*/ to man/
[thirdparty/man-pages.git] / man / man2 / futex.2
CommitLineData
8f0aff2a 1.\" Page by b.hubert
1abce893
MK
2.\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de>
3.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com>
2297bf0e 4.\"
2e46a6e7 5.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
8f0aff2a 6.\" may be freely modified and distributed
8ff7380d 7.\" %%%LICENSE_END
fea681da
MK
8.\"
9.\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com)
10.\" added ERRORS section.
11.\"
12.\" Modified 2004-06-17 mtk
13.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
14.\"
47f5c4ba 15.\" FIXME Still to integrate are some points from Torvald Riegel's mail of
9915ea23 16.\" 2015-01-23:
47f5c4ba
MK
17.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977
18.\"
78e85692 19.\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail
9915ea23 20.\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242
02182e7c 21.\"
4c1c5274 22.TH futex 2 (date) "Linux man-pages (unreleased)"
fea681da 23.SH NAME
ce154705 24futex \- fast user-space locking
504866e4
AC
25.SH LIBRARY
26Standard C library
8fc3b2cf 27.RI ( libc ", " \-lc )
fea681da 28.SH SYNOPSIS
9d9dc1e8 29.nf
c6d039a3 30.P
0aa385fe
AC
31.BR "#include <linux/futex.h>" " /* Definition of " FUTEX_* " constants */"
32.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
33.B #include <unistd.h>
c6d039a3 34.P
0aa385fe
AC
35.BI "long syscall(SYS_futex, uint32_t *" uaddr ", int " futex_op \
36", uint32_t " val ,
37.BI " const struct timespec *" timeout , \
9bfc9cb1 38" \fR /* or: \fBuint32_t \fIval2\fP */"
0aa385fe 39.BI " uint32_t *" uaddr2 ", uint32_t " val3 );
9d9dc1e8 40.fi
c6d039a3 41.P
b939d6e4 42.IR Note :
0aa385fe
AC
43glibc provides no wrapper for
44.BR futex (),
45necessitating the use of
46.BR syscall (2).
47297adb 47.SH DESCRIPTION
fea681da 48The
e511ffb6 49.BR futex ()
4b35dc5d 50system call provides a method for waiting until a certain condition becomes
077981d4
MK
51true.
52It is typically used as a blocking construct in the context of
d45f244c
MK
53shared-memory synchronization.
54When using futexes, the majority of
55the synchronization operations are performed in user space.
bc54ed38 56A user-space program employs the
d45f244c 57.BR futex ()
ca4e5b2b 58system call only when it is likely that the program has to block for
4c8cb0ff 59a longer time until the condition becomes true.
bc54ed38 60Other
d45f244c 61.BR futex ()
bc54ed38
MK
62operations can be used to wake any processes or threads waiting
63for a particular condition.
c6d039a3 64.P
36546c38
AC
65A futex is a 32-bit value\[em]referred to below as a
66.IR "futex word" \[em]whose
7e8dcabc 67address is supplied to the
4b35dc5d 68.BR futex ()
7e8dcabc 69system call.
c3f4c019 70(Futexes are 32 bits in size on all platforms, including 64-bit systems.)
7e8dcabc
MK
71All futex operations are governed by this value.
72In order to share a futex between processes,
73the futex is placed in a region of shared memory,
74created using (for example)
75.BR mmap (2)
76or
77.BR shmat (2).
c3f4c019 78(Thus, the futex word may have different
7e8dcabc
MK
79virtual addresses in different processes,
80but these addresses all refer to the same location in physical memory.)
ca4e5b2b
MK
81In a multithreaded program, it is sufficient to place the futex word
82in a global variable shared by all threads.
c6d039a3 83.P
0c3ec26b
MK
84When executing a futex operation that requests to block a thread,
85the kernel will block only if the futex word has the value that the
55f9e85e
MK
86calling thread supplied (as one of the arguments of the
87.BR futex ()
88call) as the expected value of the futex word.
9d32a39b
MK
89The loading of the futex word's value,
90the comparison of that value with the expected value,
bc54ed38 91and the actual blocking will happen atomically and will be totally ordered
da894b18 92with respect to concurrent operations performed by other threads
0fb87d16 93on the same futex word.
da894b18
MK
94.\" Notes from Darren Hart (Dec 2015):
95.\" Totally ordered with respect futex operations refers to semantics
96.\" of the ACQUIRE/RELEASE operations and how they impact ordering of
97.\" memory reads and writes. The kernel futex operations are protected
f6615c42 98.\" by spinlocks, which ensure that all operations are serialized
da894b18
MK
99.\" with respect to one another.
100.\"
101.\" This is a lot to attempt to define in this document. Perhaps a
102.\" reference to linux/Documentation/memory-barriers.txt as a footnote
103.\" would be sufficient? Or perhaps for this manual, "serialized" would
104.\" be sufficient, with a footnote regarding "totally ordered" and a
105.\" pointer to the memory-barrier documentation?
b80daba2 106Thus, the futex word is used to connect the synchronization in user space
9d32a39b 107with the implementation of blocking by the kernel.
55f9e85e 108Analogously to an atomic
4b35dc5d 109compare-and-exchange operation that potentially changes shared memory,
077981d4 110blocking via a futex is an atomic compare-and-block operation.
d6bb5a38 111.\" FIXME(Torvald Riegel):
61066e14
MK
112.\" Eventually we want to have some text in NOTES to satisfy
113.\" the reference in the following sentence
114.\" See NOTES for a detailed specification of
115.\" the synchronization semantics.
c6d039a3 116.P
ca4e5b2b 117One use of futexes is for implementing locks.
c0dc758e
MK
118The state of the lock (i.e., acquired or not acquired)
119can be represented as an atomically accessed flag in shared memory.
4c8cb0ff 120In the uncontended case,
c3f4c019 121a thread can access or modify the lock state with atomic instructions,
4c8cb0ff
MK
122for example atomically changing it from not acquired to acquired
123using an atomic compare-and-exchange instruction.
55f9e85e
MK
124(Such instructions are performed entirely in user mode,
125and the kernel maintains no information about the lock state.)
126On the other hand, a thread may be unable to acquire a lock because
8e754e12 127it is already acquired by another thread.
55f9e85e 128It then may pass the lock's flag as a futex word and the value
0c3ec26b 129representing the acquired state as the expected value to a
8e754e12
HS
130.BR futex ()
131wait operation.
55f9e85e 132This
8e754e12 133.BR futex ()
bc54ed38 134operation will block if and only if the lock is still acquired
f6615c42 135(i.e., the value in the futex word still matches the "acquired state").
077981d4 136When releasing the lock, a thread has to first reset the
0c3ec26b 137lock state to not acquired and then execute a futex
55f9e85e 138operation that wakes threads blocked on the lock flag used as a futex word
f6615c42 139(this can be further optimized to avoid unnecessary wake-ups).
077981d4 140See
4b35dc5d
TR
141.BR futex (7)
142for more detail on how to use futexes.
c6d039a3 143.P
4b35dc5d 144Besides the basic wait and wake-up futex functionality, there are further
077981d4 145futex operations aimed at supporting more complex use cases.
c6d039a3 146.P
ca4e5b2b 147Note that
2af84f99 148no explicit initialization or destruction is necessary to use futexes;
4c8cb0ff
MK
149the kernel maintains a futex
150(i.e., the kernel-internal implementation artifact)
4b35dc5d
TR
151only while operations such as
152.BR FUTEX_WAIT ,
153described below, are being performed on a particular futex word.
a663ca5a
MK
154.\"
155.SS Arguments
fea681da
MK
156The
157.I uaddr
077981d4
MK
158argument points to the futex word.
159On all platforms, futexes are four-byte
4b35dc5d 160integers that must be aligned on a four-byte boundary.
f388ba70
MK
161The operation to perform on the futex is specified in the
162.I futex_op
163argument;
1ae6b2c7 164.I val
f388ba70
MK
165is a value whose meaning and purpose depends on
166.IR futex_op .
c6d039a3 167.P
36ab2074
MK
168The remaining arguments
169.RI ( timeout ,
170.IR uaddr2 ,
171and
172.IR val3 )
173are required only for certain of the futex operations described below.
174Where one of these arguments is not required, it is ignored.
c6d039a3 175.P
36ab2074
MK
176For several blocking operations, the
177.I timeout
178argument is a pointer to a
1ae6b2c7 179.I timespec
36ab2074
MK
180structure that specifies a timeout for the operation.
181However, notwithstanding the prototype shown above, for some operations,
eb4aa521
MK
182the least significant four bytes of this argument are instead
183used as an integer whose meaning is determined by the operation.
768d3c23
MK
184For these operations, the kernel casts the
185.I timeout
10022b8e
HS
186value first to
187.IR "unsigned long",
188then to
c6dc40a2 189.IR uint32_t ,
768d3c23
MK
190and in the remainder of this page, this argument is referred to as
191.I val2
192when interpreted in this fashion.
c6d039a3 193.P
de5a3bb4 194Where it is required, the
1ae6b2c7 195.I uaddr2
4c8cb0ff
MK
196argument is a pointer to a second futex word that is employed
197by the operation.
c6d039a3 198.P
36ab2074
MK
199The interpretation of the final integer argument,
200.IR val3 ,
201depends on the operation.
a663ca5a
MK
202.\"
203.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
204.\"
205.SS Futex operations
6be4bad7 206The
d33602c4 207.I futex_op
6be4bad7
MK
208argument consists of two parts:
209a command that specifies the operation to be performed,
5d771a4a 210bitwise ORed with zero or more options that
6be4bad7 211modify the behaviour of the operation.
fc30eb79 212The options that may be included in
d33602c4 213.I futex_op
fc30eb79
TG
214are as follows:
215.TP
216.BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)"
217.\" commit 34f01cc1f512fa783302982776895c73714ebbc2
218This option bit can be employed with all futex operations.
e45f9735 219It tells the kernel that the futex is process-private and not shared
0c3ec26b
MK
220with another process (i.e., it is being used for synchronization
221only between threads of the same process).
943ccc52
MK
222This allows the kernel to make some additional performance optimizations.
223.\" I.e., It allows the kernel choose the fast path for validating
224.\" the user-space address and avoids expensive VMA lookups,
225.\" taking reference counts on file backing store, and so on.
efeece04 226.IP
ae2c1774 227As a convenience,
eeeee811 228.I <linux/futex.h>
ae2c1774 229defines a set of constants with the suffix
eeeee811 230.B _PRIVATE
ae2c1774 231that are equivalents of all of the operations listed below,
dcdfde26 232.\" except the obsolete FUTEX_FD, for which the "private" flag was
ae2c1774
MK
233.\" meaningless
234but with the
1ae6b2c7 235.B FUTEX_PRIVATE_FLAG
ae2c1774
MK
236ORed into the constant value.
237Thus, there are
238.BR FUTEX_WAIT_PRIVATE ,
239.BR FUTEX_WAKE_PRIVATE ,
240and so on.
2e98bbc2
TG
241.TP
242.BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)"
243.\" commit 1acdac104668a0834cfa267de9946fac7764d486
4a7e5b05 244This option bit can be employed only with the
949ceae3
MK
245.BR FUTEX_WAIT_BITSET ,
246.BR FUTEX_WAIT_REQUEUE_PI ,
949ceae3
MK
247(since Linux 4.5)
248.\" commit 337f13046ff03717a9e99675284a817527440a49
e79977ae
KK
249.BR FUTEX_WAIT ,
250and
eeeee811 251(since Linux 5.14)
e79977ae 252.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae
eeeee811 253.B FUTEX_LOCK_PI2
c84cf68c 254operations.
efeece04 255.IP
8064bfa5 256If this option is set, the kernel measures the
f2103b26 257.I timeout
8064bfa5 258against the
eeeee811 259.B CLOCK_REALTIME
8064bfa5 260clock.
efeece04 261.IP
8064bfa5 262If this option is not set, the kernel measures the
f2103b26 263.I timeout
8064bfa5 264against the
eeeee811 265.B CLOCK_MONOTONIC
1c952cf5 266clock.
c6d039a3 267.P
6be4bad7 268The operation specified in
d33602c4 269.I futex_op
6be4bad7 270is one of the following:
70b06b90
MK
271.\"
272.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
273.\"
fea681da 274.TP
81c9d87e 275.BR FUTEX_WAIT " (since Linux 2.6.0)"
b324e17d 276.\" Strictly speaking, since some time in Linux 2.5.x
f065673c 277This operation tests that the value at the
4b35dc5d 278futex word pointed to by the address
fea681da 279.I uaddr
4b35dc5d 280still contains the expected value
fea681da 281.IR val ,
fd105614 282and if so, then sleeps waiting for a
682edefb 283.B FUTEX_WAKE
fd105614 284operation on the futex word.
077981d4 285The load of the value of the futex word is an atomic memory
4b35dc5d 286access (i.e., using atomic machine instructions of the respective
077981d4
MK
287architecture).
288This load, the comparison with the expected value, and
fd105614 289starting to sleep are performed atomically
da56650a 290.\" FIXME: Torvald, I think we may need to add some explanation of
61066e14 291.\" "totally ordered" here.
fd105614
MK
292and totally ordered
293with respect to other futex operations on the same futex word.
c0dc758e
MK
294If the thread starts to sleep,
295it is considered a waiter on this futex word.
f065673c
MK
296If the futex value does not match
297.IR val ,
4710334a 298then the call fails immediately with the error
badbf70c 299.BR EAGAIN .
efeece04 300.IP
4b35dc5d 301The purpose of the comparison with the expected value is to prevent lost
fd105614
MK
302wake-ups.
303If another thread changed the value of the futex word after the
c0dc758e
MK
304calling thread decided to block based on the prior value,
305and if the other thread executed a
1ae6b2c7 306.B FUTEX_WAKE
4b35dc5d 307operation (or similar wake-up) after the value change and before this
1ae6b2c7 308.B FUTEX_WAIT
bc54ed38
MK
309operation, then the calling thread will observe the
310value change and will not start to sleep.
efeece04 311.IP
c13182ef 312If the
fea681da 313.I timeout
40d2dab9 314is not NULL, the structure it points to specifies a
40d2dab9 315timeout for the wait.
ac991a11
MK
316(This interval will be rounded up to the system clock granularity,
317and is guaranteed not to expire early.)
a6918f1d 318The timeout is by default measured according to the
1ae6b2c7 319.B CLOCK_MONOTONIC
a01c3098 320clock, but, since Linux 4.5, the
1ae6b2c7 321.B CLOCK_REALTIME
a01c3098 322clock can be selected by specifying
1ae6b2c7 323.B FUTEX_CLOCK_REALTIME
a01c3098
MK
324in
325.IR futex_op .
82a6092b
MK
326If
327.I timeout
328is NULL, the call blocks indefinitely.
efeece04 329.IP
4100abc5
MK
330.IR Note :
331for
332.BR FUTEX_WAIT ,
1ae6b2c7 333.I timeout
4100abc5 334is interpreted as a
1ae6b2c7 335.I relative
4100abc5
MK
336value.
337This differs from other futex operations, where
338.I timeout
339is interpreted as an absolute value.
340To obtain the equivalent of
1ae6b2c7 341.B FUTEX_WAIT
4100abc5 342with an absolute timeout, employ
1ae6b2c7 343.B FUTEX_WAIT_BITSET
4100abc5 344with
1ae6b2c7 345.I val3
4100abc5
MK
346specified as
347.BR FUTEX_BITSET_MATCH_ANY .
efeece04 348.IP
c13182ef 349The arguments
fea681da
MK
350.I uaddr2
351and
352.I val3
353are ignored.
9915ea23
MK
354.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a
355.\" different example.
356.\"
357.\" For
358.\" .BR futex (7),
359.\" this call is executed if decrementing the count gave a negative value
360.\" (indicating contention),
361.\" and will sleep until another process or thread releases
362.\" the futex and executes the
363.\" .B FUTEX_WAKE
364.\" operation.
70b06b90
MK
365.\"
366.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
367.\"
fea681da 368.TP
81c9d87e
MK
369.BR FUTEX_WAKE " (since Linux 2.6.0)"
370.\" Strictly speaking, since Linux 2.5.x
f065673c
MK
371This operation wakes at most
372.I val
4b35dc5d 373of the waiters that are waiting (e.g., inside
f065673c 374.BR FUTEX_WAIT )
4b35dc5d 375on the futex word at the address
f065673c
MK
376.IR uaddr .
377Most commonly,
378.I val
379is specified as either 1 (wake up a single waiter) or
1ae6b2c7 380.B INT_MAX
f065673c 381(wake up all waiters).
730bfbda
MK
382No guarantee is provided about which waiters are awoken
383(e.g., a waiter with a higher scheduling priority is not guaranteed
384to be awoken in preference to a waiter with a lower priority).
efeece04 385.IP
fea681da
MK
386The arguments
387.IR timeout ,
c8b921bd 388.IR uaddr2 ,
fea681da
MK
389and
390.I val3
391are ignored.
9915ea23
MK
392.\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to
393.\" a different example.
394.\"
4c8cb0ff
MK
395.\" For
396.\" .BR futex (7),
397.\" this is executed if incrementing the count showed that
398.\" there were waiters,
399.\" once the futex value has been set to 1
400.\" (indicating that it is available).
401.\"
9915ea23 402.\" How does "incrementing the count show that there were waiters"?
70b06b90
MK
403.\"
404.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
405.\"
a7c2bf45
MK
406.TP
407.BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)"
b324e17d 408.\" Strictly speaking, from Linux 2.5.x to Linux 2.6.25
4c8cb0ff
MK
409This operation creates a file descriptor that is associated with
410the futex at
a7c2bf45 411.IR uaddr .
bdc5957a
MK
412The caller must close the returned file descriptor after use.
413When another process or thread performs a
1ae6b2c7 414.B FUTEX_WAKE
4b35dc5d 415on the futex word, the file descriptor indicates as being readable with
a7c2bf45
MK
416.BR select (2),
417.BR poll (2),
418and
419.BR epoll (7)
efeece04 420.IP
f1d2171d 421The file descriptor can be used to obtain asynchronous notifications: if
a7c2bf45 422.I val
ca4e5b2b 423is nonzero, then, when another process or thread executes a
a7c2bf45
MK
424.BR FUTEX_WAKE ,
425the caller will receive the signal number that was passed in
426.IR val .
efeece04 427.IP
a7c2bf45
MK
428The arguments
429.IR timeout ,
d556548b 430.IR uaddr2 ,
a7c2bf45
MK
431and
432.I val3
433are ignored.
efeece04 434.IP
a7c2bf45
MK
435Because it was inherently racy,
436.B FUTEX_FD
437has been removed
438.\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80
439from Linux 2.6.26 onward.
70b06b90
MK
440.\"
441.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
442.\"
a7c2bf45
MK
443.TP
444.BR FUTEX_REQUEUE " (since Linux 2.6.0)"
a7c2bf45 445This operation performs the same task as
1ae6b2c7 446.B FUTEX_CMP_REQUEUE
27dd3a6e 447(see below), except that no check is made using the value in
a7c2bf45
MK
448.IR val3 .
449(The argument
450.I val3
451is ignored.)
70b06b90
MK
452.\"
453.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
454.\"
a7c2bf45
MK
455.TP
456.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
4b35dc5d 457This operation first checks whether the location
a7c2bf45
MK
458.I uaddr
459still contains the value
460.IR val3 .
461If not, the operation fails with the error
462.BR EAGAIN .
4b35dc5d 463Otherwise, the operation wakes up a maximum of
a7c2bf45
MK
464.I val
465waiters that are waiting on the futex at
466.IR uaddr .
467If there are more than
468.I val
469waiters, then the remaining waiters are removed
470from the wait queue of the source futex at
471.I uaddr
472and added to the wait queue of the target futex at
473.IR uaddr2 .
474The
768d3c23 475.I val2
936876a9 476argument specifies an upper limit on the number of waiters
a7c2bf45 477that are requeued to the futex at
768d3c23 478.IR uaddr2 .
efeece04 479.IP
d6bb5a38
MK
480.\" FIXME(Torvald) Is the following correct? Or is just the decision
481.\" which threads to wake or requeue part of the atomic operation?
4b35dc5d
TR
482The load from
483.I uaddr
4c8cb0ff
MK
484is an atomic memory access (i.e., using atomic machine instructions of
485the respective architecture).
077981d4 486This load, the comparison with
4b35dc5d 487.IR val3 ,
4c8cb0ff
MK
488and the requeueing of any waiters are performed atomically and totally
489ordered with respect to other operations on the same futex word.
ee65b0e8
MK
490.\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ###
491.\" The operation is serialized with respect to operations on both
492.\" source and target futex. No other waiter can enqueue itself
493.\" for waiting and no other waiter can dequeue itself because of
494.\" a timeout or signal.
efeece04 495.IP
a7c2bf45
MK
496Typical values to specify for
497.I val
ed1819cf 498are 0 or 1.
a7c2bf45 499(Specifying
1ae6b2c7 500.B INT_MAX
a7c2bf45 501is not useful, because it would make the
1ae6b2c7 502.B FUTEX_CMP_REQUEUE
a7c2bf45
MK
503operation equivalent to
504.BR FUTEX_WAKE .)
936876a9 505The limit value specified via
768d3c23
MK
506.I val2
507is typically either 1 or
a7c2bf45
MK
508.BR INT_MAX .
509(Specifying the argument as 0 is not useful, because it would make the
1ae6b2c7 510.B FUTEX_CMP_REQUEUE
a7c2bf45
MK
511operation equivalent to
512.BR FUTEX_WAIT .)
efeece04 513.IP
627b50ce
MK
514The
515.B FUTEX_CMP_REQUEUE
516operation was added as a replacement for the earlier
517.BR FUTEX_REQUEUE .
518The difference is that the check of the value at
519.I uaddr
520can be used to ensure that requeueing happens only under certain
521conditions, which allows race conditions to be avoided in certain use cases.
dcb410c3 522.\" But, as Rich Felker points out, there remain valid use cases for
627b50ce
MK
523.\" FUTEX_REQUEUE, for example, when the calling thread is requeuing
524.\" the target(s) to a lock that the calling thread owns
525.\" From: Rich Felker <dalias@libc.org>
526.\" Date: Wed, 29 Oct 2014 22:43:17 -0400
527.\" To: Darren Hart <dvhart@infradead.org>
528.\" CC: libc-alpha@sourceware.org, ...
529.\" Subject: Re: Add futex wrapper to glibc?
efeece04 530.IP
627b50ce 531Both
1ae6b2c7 532.B FUTEX_REQUEUE
627b50ce 533and
1ae6b2c7 534.B FUTEX_CMP_REQUEUE
627b50ce
MK
535can be used to avoid "thundering herd" wake-ups that could occur when using
536.B FUTEX_WAKE
537in cases where all of the waiters that are woken need to acquire
538another futex.
539Consider the following scenario,
540where multiple waiter threads are waiting on B,
541a wait queue implemented using a futex:
efeece04 542.IP
627b50ce 543.in +4n
b76974c1 544.EX
627b50ce
MK
545lock(A)
546while (!check_value(V)) {
547 unlock(A);
548 block_on(B);
549 lock(A);
550};
551unlock(A);
b76974c1 552.EE
627b50ce 553.in
efeece04 554.IP
627b50ce
MK
555If a waker thread used
556.BR FUTEX_WAKE ,
557then all waiters waiting on B would be woken up,
67c67ff2 558and they would all try to acquire lock A.
627b50ce
MK
559However, waking all of the threads in this manner would be pointless because
560all except one of the threads would immediately block on lock A again.
561By contrast, a requeue operation wakes just one waiter and moves
562the other waiters to lock A,
563and when the woken waiter unlocks A then the next waiter can proceed.
43d16602 564.\"
70b06b90
MK
565.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
566.\"
fea681da 567.TP
d67e21f5
MK
568.BR FUTEX_WAKE_OP " (since Linux 2.6.14)"
569.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
6bac3b85
MK
570.\" Author: Jakub Jelinek <jakub@redhat.com>
571.\" Date: Tue Sep 6 15:16:25 2005 -0700
9915ea23 572.\" FIXME. (Torvald) The glibc condvar implementation is currently being
4c8cb0ff
MK
573.\" revised (e.g., to not use an internal lock anymore).
574.\" It is probably more future-proof to remove this paragraph.
d6bb5a38 575.\" [Torvald, do you have an update here?]
6bac3b85
MK
576This operation was added to support some user-space use cases
577where more than one futex must be handled at the same time.
578The most notable example is the implementation of
579.BR pthread_cond_signal (3),
580which requires operations on two futexes,
581the one used to implement the mutex and the one used in the implementation
582of the wait queue associated with the condition variable.
1ae6b2c7 583.B FUTEX_WAKE_OP
6bac3b85
MK
584allows such cases to be implemented without leading to
585high rates of contention and context switching.
efeece04 586.IP
6bac3b85 587The
1ae6b2c7 588.B FUTEX_WAKE_OP
e61abc20 589operation is equivalent to executing the following code atomically
4c8cb0ff
MK
590and totally ordered with respect to other futex operations on
591any of the two supplied futex words:
efeece04 592.IP
6bac3b85 593.in +4n
b76974c1 594.EX
2253ecf0
AC
595uint32_t oldval = *(uint32_t *) uaddr2;
596*(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
6bac3b85
MK
597futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
598if (oldval \fIcmp\fP \fIcmparg\fP)
768d3c23 599 futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
b76974c1 600.EE
6bac3b85 601.in
efeece04 602.IP
6bac3b85 603In other words,
1ae6b2c7 604.B FUTEX_WAKE_OP
6bac3b85
MK
605does the following:
606.RS
cdede5cd 607.IP \[bu] 3
4b35dc5d 608saves the original value of the futex word at
1ae6b2c7 609.I uaddr2
4b35dc5d 610and performs an operation to modify the value of the futex at
6bac3b85 611.IR uaddr2 ;
4c8cb0ff
MK
612this is an atomic read-modify-write memory access (i.e., using atomic
613machine instructions of the respective architecture)
cdede5cd 614.IP \[bu]
6bac3b85
MK
615wakes up a maximum of
616.I val
4b35dc5d 617waiters on the futex for the futex word at
6bac3b85
MK
618.IR uaddr ;
619and
cdede5cd 620.IP \[bu]
4c8cb0ff
MK
621dependent on the results of a test of the original value of the
622futex word at
6bac3b85
MK
623.IR uaddr2 ,
624wakes up a maximum of
768d3c23 625.I val2
4b35dc5d 626waiters on the futex for the futex word at
6bac3b85
MK
627.IR uaddr2 .
628.RE
629.IP
6bac3b85
MK
630The operation and comparison that are to be performed are encoded
631in the bits of the argument
632.IR val3 .
633Pictorially, the encoding is:
efeece04 634.IP
3cf61490 635.in +4n
b76974c1 636.EX
f6af90e7
MK
637+---+---+-----------+-----------+
638|op |cmp| oparg | cmparg |
639+---+---+-----------+-----------+
640 4 4 12 12 <== # of bits
b76974c1 641.EE
6bac3b85 642.in
efeece04 643.IP
6bac3b85 644Expressed in code, the encoding is:
efeece04 645.IP
6bac3b85 646.in +4n
b76974c1 647.EX
d1a71985
MK
648#define FUTEX_OP(op, oparg, cmp, cmparg) \e
649 (((op & 0xf) << 28) | \e
650 ((cmp & 0xf) << 24) | \e
651 ((oparg & 0xfff) << 12) | \e
6bac3b85 652 (cmparg & 0xfff))
b76974c1 653.EE
6bac3b85 654.in
efeece04 655.IP
6bac3b85
MK
656In the above,
657.I op
658and
659.I cmp
660are each one of the codes listed below.
661The
662.I oparg
663and
664.I cmparg
665components are literal numeric values, except as noted below.
efeece04 666.IP
6bac3b85
MK
667The
668.I op
669component has one of the following values:
efeece04 670.IP
6bac3b85 671.in +4n
b76974c1 672.EX
6bac3b85
MK
673FUTEX_OP_SET 0 /* uaddr2 = oparg; */
674FUTEX_OP_ADD 1 /* uaddr2 += oparg; */
675FUTEX_OP_OR 2 /* uaddr2 |= oparg; */
3f029bc9 676FUTEX_OP_ANDN 3 /* uaddr2 &= \[ti]oparg; */
a1e9245d 677FUTEX_OP_XOR 4 /* uaddr2 \[ha]= oparg; */
b76974c1 678.EE
6bac3b85 679.in
efeece04 680.IP
5d771a4a 681In addition, bitwise ORing the following value into
6bac3b85
MK
682.I op
683causes
1ae6b2c7 684.I (1\~<<\~oparg)
6bac3b85 685to be used as the operand:
efeece04 686.IP
6bac3b85 687.in +4n
b76974c1 688.EX
6bac3b85 689FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */
b76974c1 690.EE
6bac3b85 691.in
efeece04 692.IP
6bac3b85
MK
693The
694.I cmp
695field is one of the following:
efeece04 696.IP
6bac3b85 697.in +4n
b76974c1 698.EX
6bac3b85
MK
699FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */
700FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */
701FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */
702FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */
703FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */
704FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */
b76974c1 705.EE
6bac3b85 706.in
efeece04 707.IP
6bac3b85 708The return value of
1ae6b2c7 709.B FUTEX_WAKE_OP
6bac3b85 710is the sum of the number of waiters woken on the futex
1ae6b2c7 711.I uaddr
6bac3b85
MK
712plus the number of waiters woken on the futex
713.IR uaddr2 .
70b06b90
MK
714.\"
715.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
716.\"
d67e21f5 717.TP
79c9b436
TG
718.BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)"
719.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
fd9e59d4 720This operation is like
1ae6b2c7 721.B FUTEX_WAIT
79c9b436
TG
722except that
723.I val3
84abf4ba 724is used to provide a 32-bit bit mask to the kernel.
2ae96e8a 725This bit mask, in which at least one bit must be set,
6c38ce7f 726is stored in the kernel-internal state of the waiter.
79c9b436 727See the description of
1ae6b2c7 728.B FUTEX_WAKE_BITSET
79c9b436 729for further details.
efeece04 730.IP
8064bfa5
MK
731If
732.I timeout
733is not NULL, the structure it points to specifies
734an absolute timeout for the wait operation.
735If
736.I timeout
737is NULL, the operation can block indefinitely.
efeece04 738.IP
79c9b436
TG
739The
740.I uaddr2
741argument is ignored.
70b06b90
MK
742.\"
743.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
744.\"
79c9b436 745.TP
d67e21f5
MK
746.BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)"
747.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
55cc422d 748This operation is the same as
1ae6b2c7 749.B FUTEX_WAKE
55cc422d 750except that the
e24fbf10 751.I val3
5e1456d4 752argument is used to provide a 32-bit bit mask to the kernel.
6c38ce7f
MK
753This bit mask, in which at least one bit must be set,
754is used to select which waiters should be woken up.
5d771a4a 755The selection is done by a bitwise AND of the "wake" bit mask
98d769c0
MK
756(i.e., the value in
757.IR val3 )
5e1456d4
MK
758and the bit mask which is stored in the kernel-internal
759state of the waiter (the "wait" bit mask that is set using
98d769c0
MK
760.BR FUTEX_WAIT_BITSET ).
761All of the waiters for which the result of the AND is nonzero are woken up;
762the remaining waiters are left sleeping.
efeece04 763.IP
e9d4496b 764The effect of
1ae6b2c7 765.B FUTEX_WAIT_BITSET
e9d4496b 766and
1ae6b2c7 767.B FUTEX_WAKE_BITSET
9732dd8b
MK
768is to allow selective wake-ups among multiple waiters that are blocked
769on the same futex.
ac894879 770However, note that, depending on the use case,
5e1456d4 771employing this bit-mask multiplexing feature on a
ac894879 772futex can be less efficient than simply using multiple futexes,
5e1456d4 773because employing bit-mask multiplexing requires the kernel
e9d4496b
MK
774to check all waiters on a futex,
775including those that are not interested in being woken up
5e1456d4 776(i.e., they do not have the relevant bit set in their "wait" bit mask).
e9d4496b
MK
777.\" According to http://locklessinc.com/articles/futex_cheat_sheet/:
778.\"
779.\" "The original reason for the addition of these extensions
780.\" was to improve the performance of pthread read-write locks
781.\" in glibc. However, the pthreads library no longer uses the
782.\" same locking algorithm, and these extensions are not used
783.\" without the bitset parameter being all ones.
e24fbf10 784.\"
e9d4496b 785.\" The page goes on to note that the FUTEX_WAIT_BITSET operation
5e1456d4 786.\" is nevertheless used (with a bit mask of all ones) in order to
e9d4496b
MK
787.\" obtain the absolute timeout functionality that is useful
788.\" for efficiently implementing Pthreads APIs (which use absolute
789.\" timeouts); FUTEX_WAIT provides only relative timeouts.
efeece04 790.IP
678c9986
MK
791The constant
792.BR FUTEX_BITSET_MATCH_ANY ,
793which corresponds to all 32 bits set in the bit mask, can be used as the
794.I val3
795argument for
1ae6b2c7 796.B FUTEX_WAIT_BITSET
98d769c0 797and
678c9986
MK
798.BR FUTEX_WAKE_BITSET .
799Other than differences in the handling of the
98d769c0 800.I timeout
678c9986 801argument, the
1ae6b2c7 802.B FUTEX_WAIT
678c9986 803operation is equivalent to
1ae6b2c7 804.B FUTEX_WAIT_BITSET
678c9986 805with
1ae6b2c7 806.I val3
678c9986
MK
807specified as
808.BR FUTEX_BITSET_MATCH_ANY ;
809that is, allow a wake-up by any waker.
810The
1ae6b2c7 811.B FUTEX_WAKE
678c9986 812operation is equivalent to
1ae6b2c7 813.B FUTEX_WAKE_BITSET
678c9986 814with
1ae6b2c7 815.I val3
678c9986
MK
816specified as
817.BR FUTEX_BITSET_MATCH_ANY ;
818that is, wake up any waiter(s).
efeece04 819.IP
678c9986
MK
820The
821.I uaddr2
822and
823.I timeout
824arguments are ignored.
bd90a5f9 825.\"
70b06b90 826.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
bd90a5f9
MK
827.\"
828.SS Priority-inheritance futexes
b52e1cd4
MK
829Linux supports priority-inheritance (PI) futexes in order to handle
830priority-inversion problems that can be encountered with
831normal futex locks.
b565548b 832Priority inversion is the problem that occurs when a high-priority
bdc5957a
MK
833task is blocked waiting to acquire a lock held by a low-priority task,
834while tasks at an intermediate priority continuously preempt
835the low-priority task from the CPU.
836Consequently, the low-priority task makes no progress toward
837releasing the lock, and the high-priority task remains blocked.
c6d039a3 838.P
7d20efd7
MK
839Priority inheritance is a mechanism for dealing with
840the priority-inversion problem.
bdc5957a
MK
841With this mechanism, when a high-priority task becomes blocked
842by a lock held by a low-priority task,
9cee832c
MK
843the priority of the low-priority task is temporarily raised
844to that of the high-priority task,
bdc5957a 845so that it is not preempted by any intermediate level tasks,
7d20efd7
MK
846and can thus make progress toward releasing the lock.
847To be effective, priority inheritance must be transitive,
bdc5957a 848meaning that if a high-priority task blocks on a lock
ca4e5b2b 849held by a lower-priority task that is itself blocked by a lock
bdc5957a 850held by another intermediate-priority task
7d20efd7 851(and so on, for chains of arbitrary length),
b0f35fbb 852then both of those tasks
bdc5957a
MK
853(or more generally, all of the tasks in a lock chain)
854have their priorities raised to be the same as the high-priority task.
c6d039a3 855.P
9e2b90ee 856From a user-space perspective,
39e9b2e1
MK
857what makes a futex PI-aware is a policy agreement (described below)
858between user space and the kernel about the value of the futex word,
601399f3
MK
859coupled with the use of the PI-futex operations described below.
860(Unlike the other futex operations described above,
861the PI-futex operations are designed
862for the implementation of very specific IPC mechanisms.)
863.\"
9e2b90ee
MK
864.\" Quoting Darren Hart:
865.\" These opcodes paired with the PI futex value policy (described below)
866.\" defines a "futex" as PI aware. These were created very specifically
867.\" in support of PI pthread_mutexes, so it makes a lot more sense to
868.\" talk about a PI aware pthread_mutex, than a PI aware futex, since
869.\" there is a lot of policy and scaffolding that has to be built up
870.\" around it to use it properly (this is what a PI pthread_mutex is).
c6d039a3 871.P
ac894879 872.\" mtk: The following text is drawn from the Hart/Guniguntala paper
1af427a4 873.\" (listed in SEE ALSO), but I have reworded some pieces
8d825152 874.\" significantly.
79d918c7 875.\"
f0a9e8f4 876The PI-futex operations described below differ from the other
4b35dc5d
TR
877futex operations in that they impose policy on the use of the value of the
878futex word:
cdede5cd 879.IP \[bu] 3
4b35dc5d 880If the lock is not acquired, the futex word's value shall be 0.
cdede5cd 881.IP \[bu]
4c8cb0ff
MK
882If the lock is acquired, the futex word's value shall
883be the thread ID (TID;
4b35dc5d 884see
79d918c7
MK
885.BR gettid (2))
886of the owning thread.
cdede5cd 887.IP \[bu]
79d918c7
MK
888If the lock is owned and there are threads contending for the lock,
889then the
890.B FUTEX_WAITERS
4b35dc5d 891bit shall be set in the futex word's value; in other words, this value is:
efeece04 892.IP
1ae6b2c7
AC
893.in +4n
894.EX
895FUTEX_WAITERS | TID
896.EE
897.in
601399f3
MK
898.IP
899(Note that is invalid for a PI futex word to have no owner and
1ae6b2c7 900.B FUTEX_WAITERS
601399f3 901set.)
c6d039a3 902.P
79d918c7 903With this policy in place,
fd105614 904a user-space application can acquire an unacquired
601399f3 905lock or release a lock using atomic instructions executed in user mode
fd105614 906(e.g., a compare-and-swap operation such as
b52e1cd4
MK
907.I cmpxchg
908on the x86 architecture).
4c8cb0ff
MK
909Acquiring a lock simply consists of using compare-and-swap to atomically
910set the futex word's value to the caller's TID if its previous value was 0.
4b35dc5d
TR
911Releasing a lock requires using compare-and-swap to set the futex word's
912value to 0 if the previous value was the expected TID.
c6d039a3 913.P
4b35dc5d 914If a futex is already acquired (i.e., has a nonzero value),
b52e1cd4 915waiters must employ the
79d918c7 916.B FUTEX_LOCK_PI
e79977ae
KK
917or
918.B FUTEX_LOCK_PI2
919operations to acquire the lock.
4b35dc5d 920If other threads are waiting for the lock, then the
79d918c7 921.B FUTEX_WAITERS
4c8cb0ff
MK
922bit is set in the futex value;
923in this case, the lock owner must employ the
79d918c7 924.B FUTEX_UNLOCK_PI
b52e1cd4 925operation to release the lock.
c6d039a3 926.P
79d918c7
MK
927In the cases where callers are forced into the kernel
928(i.e., required to perform a
929.BR futex ()
0c3ec26b 930call),
79d918c7
MK
931they then deal directly with a so-called RT-mutex,
932a kernel locking mechanism which implements the required
933priority-inheritance semantics.
934After the RT-mutex is acquired, the futex value is updated accordingly,
935before the calling thread returns to user space.
c6d039a3 936.P
a59fca75 937It is important to note
ac894879 938.\" tglx (July 2015):
30239c10
MK
939.\" If there are multiple waiters on a pi futex then a wake pi operation
940.\" will wake the first waiter and hand over the lock to this waiter. This
941.\" includes handing over the rtmutex which represents the futex in the
942.\" kernel. The strict requirement is that the futex owner and the rtmutex
943.\" owner must be the same, except for the update period which is
944.\" serialized by the futex internal locking. That means the kernel must
1d09c150 945.\" update the user-space value prior to returning to user space
4b35dc5d 946that the kernel will update the futex word's value prior
79d918c7 947to returning to user space.
601399f3
MK
948(This prevents the possibility of the futex word's value ending
949up in an invalid state, such as having an owner but the value being 0,
950or having waiters but not having the
951.B FUTEX_WAITERS
952bit set.)
c6d039a3 953.P
30239c10
MK
954If a futex has an associated RT-mutex in the kernel
955(i.e., there are blocked waiters)
956and the owner of the futex/RT-mutex dies unexpectedly,
957then the kernel cleans up the RT-mutex and hands it over to the next waiter.
958This in turn requires that the user-space value is updated accordingly.
959To indicate that this is required, the kernel sets the
960.B FUTEX_OWNER_DIED
961bit in the futex word along with the thread ID of the new owner.
8adaf0a7
MK
962User space can detect this situation via the presence of the
963.B FUTEX_OWNER_DIED
964bit and is then responsible for cleaning up the stale state left over by
1d09c150 965the dead owner.
30239c10
MK
966.\" tglx (July 2015):
967.\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where
968.\" the kernel has no state associated. This happens via the robust futex
969.\" mechanism. In that case the futex value will be set to
970.\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non
971.\" PI futexes.
c6d039a3 972.P
601399f3
MK
973PI futexes are operated on by specifying one of the values listed below in
974.IR futex_op .
975Note that the PI futex operations must be used as paired operations
976and are subject to some additional requirements:
cdede5cd 977.IP \[bu] 3
eeeee811
AC
978.BR FUTEX_LOCK_PI ,
979.BR FUTEX_LOCK_PI2 ,
601399f3
MK
980and
981.B FUTEX_TRYLOCK_PI
982pair with
d8012462 983.BR FUTEX_UNLOCK_PI .
601399f3
MK
984.B FUTEX_UNLOCK_PI
985must be called only on a futex owned by the calling thread,
986as defined by the value policy, otherwise the error
987.B EPERM
988results.
cdede5cd 989.IP \[bu]
601399f3
MK
990.B FUTEX_WAIT_REQUEUE_PI
991pairs with
992.BR FUTEX_CMP_REQUEUE_PI .
993This must be performed from a non-PI futex to a distinct PI futex
994(or the error
995.B EINVAL
996results).
997Additionally,
998.I val
999(the number of waiters to be woken) must be 1
1000(or the error
1001.B EINVAL
1002results).
c6d039a3 1003.P
601399f3 1004The PI futex operations are as follows:
70b06b90
MK
1005.\"
1006.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1007.\"
d67e21f5
MK
1008.TP
1009.BR FUTEX_LOCK_PI " (since Linux 2.6.18)"
1010.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
bc54ed38 1011This operation is used after an attempt to acquire
fd105614 1012the lock via an atomic user-mode instruction failed
36546c38 1013because the futex word has a nonzero value\[em]specifically,
8297383e 1014because it contained the (PID-namespace-specific) TID of the lock owner.
efeece04 1015.IP
4b35dc5d 1016The operation checks the value of the futex word at the address
67833bec 1017.IR uaddr .
70b06b90
MK
1018If the value is 0, then the kernel tries to atomically set
1019the futex value to the caller's TID.
c3875d1d 1020If the futex word's value is nonzero,
67833bec 1021the kernel atomically sets the
e0547e70 1022.B FUTEX_WAITERS
67833bec
MK
1023bit, which signals the futex owner that it cannot unlock the futex in
1024user space atomically by setting the futex value to 0.
c3875d1d
MK
1025.\" tglx (July 2015):
1026.\" The operation here is similar to the FUTEX_WAIT logic. When the user
1027.\" space atomic acquire does not succeed because the futex value was non
1028.\" zero, then the waiter goes into the kernel, takes the kernel internal
1029.\" lock and retries the acquisition under the lock. If the acquisition
1030.\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal
1031.\" the lock owner that it needs to go into the kernel. Here is the pseudo
1032.\" code:
1033.\"
1034.\" lock(kernel_lock);
1035.\" retry:
9bfc9cb1 1036.\"
c3875d1d 1037.\" /*
ba2c4752 1038.\" * Owner might have unlocked in user space before we
c3875d1d
MK
1039.\" * were able to set the waiter bit.
1040.\" */
1041.\" if (atomic_acquire(futex) == SUCCESS) {
1042.\" unlock(kernel_lock());
1043.\" return 0;
1044.\" }
1045.\"
1046.\" /*
1047.\" * Owner might have unlocked after the above atomic_acquire()
1048.\" * attempt.
1049.\" */
1050.\" if (atomic_set_waiters_bit(futex) != SUCCESS)
1051.\" goto retry;
1052.\"
1053.\" queue_waiter();
1054.\" unlock(kernel_lock);
1055.\" block();
1056.\"
1057After that, the kernel:
1058.RS
22356d97 1059.IP (1) 5
c3875d1d 1060Tries to find the thread which is associated with the owner TID.
22356d97 1061.IP (2)
c3875d1d
MK
1062Creates or reuses kernel state on behalf of the owner.
1063(If this is the first waiter, there is no kernel state for this
1064futex, so kernel state is created by locking the RT-mutex
1065and the futex owner is made the owner of the RT-mutex.
1066If there are existing waiters, then the existing state is reused.)
22356d97 1067.IP (3)
ca4e5b2b 1068Attaches the waiter to the futex
c3875d1d
MK
1069(i.e., the waiter is enqueued on the RT-mutex waiter list).
1070.RE
1071.IP
ac894879
MK
1072If more than one waiter exists,
1073the enqueueing of the waiter is in descending priority order.
1074(For information on priority ordering, see the discussion of the
1075.BR SCHED_DEADLINE ,
1076.BR SCHED_FIFO ,
1077and
1ae6b2c7 1078.B SCHED_RR
ac894879
MK
1079scheduling policies in
1080.BR sched (7).)
1081The owner inherits either the waiter's CPU bandwidth
1082(if the waiter is scheduled under the
1ae6b2c7 1083.B SCHED_DEADLINE
ac894879 1084policy) or the waiter's priority (if the waiter is scheduled under the
1ae6b2c7 1085.B SCHED_RR
ac894879 1086or
1ae6b2c7 1087.B SCHED_FIFO
ac894879 1088policy).
1d09c150
MK
1089.\" August 2015:
1090.\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL)
1091.\" processes, does the nice value come into play also?
1092.\"
1093.\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order
c3875d1d 1094This inheritance follows the lock chain in the case of nested locking
ca4e5b2b
MK
1095.\" (i.e., task 1 blocks on lock A, held by task 2,
1096.\" while task 2 blocks on lock B, held by task 3)
c3875d1d 1097and performs deadlock detection.
efeece04 1098.IP
e0547e70
TG
1099The
1100.I timeout
9ce19cf1 1101argument provides a timeout for the lock attempt.
8064bfa5
MK
1102If
1103.I timeout
1104is not NULL, the structure it points to specifies
1105an absolute timeout, measured against the
1ae6b2c7 1106.B CLOCK_REALTIME
9ce19cf1 1107clock.
c082f385
MK
1108.\" 2016-07-07 response from Thomas Gleixner on LKML:
1109.\" From: Thomas Gleixner <tglx@linutronix.de>
1110.\" Date: 6 July 2016 at 20:57
1111.\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
2ae96e8a 1112.\"
c082f385
MK
1113.\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote:
1114.\" > On 06/23/2016 08:28 PM, Darren Hart wrote:
1115.\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using
1116.\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a
1117.\" > > specific amount of time, regardless of wall time.
1118.\" >
1119.\" > Yes, that's another weird inconsistency.
2ae96e8a 1120.\"
c082f385
MK
1121.\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on
1122.\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back
1123.\" then when we added LOCK_PI.
9ce19cf1
MK
1124If
1125.I timeout
1126is NULL, the operation will block indefinitely.
efeece04 1127.IP
a449c634 1128The
e0547e70
TG
1129.IR uaddr2 ,
1130.IR val ,
1131and
eeeee811 1132.I val3
a449c634 1133arguments are ignored.
67833bec 1134.\"
70b06b90
MK
1135.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1136.\"
d67e21f5 1137.TP
eeeee811 1138.BR FUTEX_LOCK_PI2 " (since Linux 5.14)"
e79977ae 1139.\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae
69bc3836
MK
1140This operation is the same as
1141.BR FUTEX_LOCK_PI ,
1142except that the clock against which
eeeee811 1143.I timeout
69bc3836
MK
1144is measured is selectable.
1145By default, the (absolute) timeout specified in
e79977ae 1146.I timeout
caaa2efa 1147is measured against the
69bc3836
MK
1148.B CLOCK_MONOTONIC
1149clock, but if the
1150.B FUTEX_CLOCK_REALTIME
1151flag is specified in
1152.IR futex_op ,
1153then the timeout is measured against the
1154.B CLOCK_REALTIME
1155clock.
e79977ae
KK
1156.\"
1157.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1158.\"
1159.TP
12fdbe23 1160.BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)"
d67e21f5 1161.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
3fbb1be1 1162This operation tries to acquire the lock at
12fdbe23 1163.IR uaddr .
c3875d1d
MK
1164It is invoked when a user-space atomic acquire did not
1165succeed because the futex word was not 0.
efeece04 1166.IP
8adaf0a7
MK
1167Because the kernel has access to more state information than user space,
1168acquisition of the lock might succeed if performed by the
1169kernel in cases where the futex word
1170(i.e., the state information accessible to use-space) contains stale state
c3875d1d
MK
1171.RB ( FUTEX_WAITERS
1172and/or
1173.BR FUTEX_OWNER_DIED ).
1174This can happen when the owner of the futex died.
1d09c150
MK
1175User space cannot handle this condition in a race-free manner,
1176but the kernel can fix this up and acquire the futex.
ee65b0e8
MK
1177.\" Paraphrasing a f2f conversation with Thomas Gleixner about the
1178.\" above point (Aug 2015): ###
1179.\" There is a rare possibility of a race condition involving an
1180.\" uncontended futex with no owner, but with waiters. The
1181.\" kernel-user-space contract is that if a futex is nonzero, you must
1182.\" go into kernel. The futex was owned by a task, and that task dies
1183.\" but there are no waiters, so the futex value is non zero.
1184.\" Therefore, the next locker has to go into the kernel,
1185.\" so that the kernel has a chance to clean up. (CMXCH on zero
1186.\" in user space would fail, so kernel has to clean up.)
8adaf0a7
MK
1187.\" Darren Hart (Oct 2015):
1188.\" The trylock in the kernel has more state, so it can independently
ba2c4752 1189.\" verify the flags that user space must trust implicitly.
efeece04 1190.IP
084744ef
MK
1191The
1192.IR uaddr2 ,
1193.IR val ,
1194.IR timeout ,
1195and
1ae6b2c7 1196.I val3
084744ef 1197arguments are ignored.
70b06b90
MK
1198.\"
1199.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1200.\"
d67e21f5 1201.TP
12fdbe23 1202.BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)"
d67e21f5 1203.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
d4ba4328 1204This operation wakes the top priority waiter that is waiting in
ecae2099 1205.B FUTEX_LOCK_PI
e79977ae
KK
1206or
1207.B FUTEX_LOCK_PI2
ecae2099
TG
1208on the futex address provided by the
1209.I uaddr
1210argument.
efeece04 1211.IP
1d09c150 1212This is called when the user-space value at
ecae2099
TG
1213.I uaddr
1214cannot be changed atomically from a TID (of the owner) to 0.
efeece04 1215.IP
ecae2099
TG
1216The
1217.IR uaddr2 ,
1218.IR val ,
1219.IR timeout ,
1220and
1ae6b2c7 1221.I val3
11a194bf 1222arguments are ignored.
70b06b90
MK
1223.\"
1224.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1225.\"
d67e21f5 1226.TP
d67e21f5
MK
1227.BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)"
1228.\" commit 52400ba946759af28442dee6265c5c0180ac7122
f812a08b
DH
1229This operation is a PI-aware variant of
1230.BR FUTEX_CMP_REQUEUE .
1231It requeues waiters that are blocked via
1232.B FUTEX_WAIT_REQUEUE_PI
1233on
1234.I uaddr
1235from a non-PI source futex
1236.RI ( uaddr )
1237to a PI target futex
1238.RI ( uaddr2 ).
efeece04 1239.IP
9e54d26d
MK
1240As with
1241.BR FUTEX_CMP_REQUEUE ,
1242this operation wakes up a maximum of
1243.I val
1244waiters that are waiting on the futex at
1245.IR uaddr .
1246However, for
1247.BR FUTEX_CMP_REQUEUE_PI ,
1248.I val
6fbeb8f4 1249is required to be 1
939ca89f 1250(since the main point is to avoid a thundering herd).
9e54d26d
MK
1251The remaining waiters are removed from the wait queue of the source futex at
1252.I uaddr
1253and added to the wait queue of the target futex at
1254.IR uaddr2 .
efeece04 1255.IP
9e54d26d 1256The
768d3c23 1257.I val2
c6d8cf21
MK
1258.\" val2 is the cap on the number of requeued waiters.
1259.\" In the glibc pthread_cond_broadcast() implementation, this argument
1260.\" is specified as INT_MAX, and for pthread_cond_signal() it is 0.
9e54d26d 1261and
768d3c23 1262.I val3
9e54d26d
MK
1263arguments serve the same purposes as for
1264.BR FUTEX_CMP_REQUEUE .
70b06b90 1265.\"
8297383e 1266.\" The page at http://locklessinc.com/articles/futex_cheat_sheet/
be376673 1267.\" notes that "priority-inheritance Futex to priority-inheritance
8297383e
MK
1268.\" Futex requeues are currently unsupported". However, probably
1269.\" the page does not need to say nothing about this, since
1270.\" Thomas Gleixner commented (July 2015): "they never will be
1271.\" supported because they make no sense at all"
70b06b90
MK
1272.\"
1273.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1274.\"
d67e21f5
MK
1275.TP
1276.BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)"
1277.\" commit 52400ba946759af28442dee6265c5c0180ac7122
70b06b90 1278.\"
c3875d1d 1279Wait on a non-PI futex at
6ff1b4c0 1280.I uaddr
c3875d1d 1281and potentially be requeued (via a
1ae6b2c7 1282.B FUTEX_CMP_REQUEUE_PI
c3875d1d 1283operation in another task) onto a PI futex at
6ff1b4c0
TG
1284.IR uaddr2 .
1285The wait operation on
1286.I uaddr
c3875d1d 1287is the same as for
6ff1b4c0 1288.BR FUTEX_WAIT .
efeece04 1289.IP
6ff1b4c0
TG
1290The waiter can be removed from the wait on
1291.I uaddr
6ff1b4c0 1292without requeueing on
1ae6b2c7 1293.I uaddr2
c3875d1d 1294via a
1ae6b2c7 1295.B FUTEX_WAKE
c3875d1d
MK
1296operation in another task.
1297In this case, the
1ae6b2c7 1298.B FUTEX_WAIT_REQUEUE_PI
3fbb1be1
MK
1299operation fails with the error
1300.BR EAGAIN .
efeece04 1301.IP
63bea7dc
MK
1302If
1303.I timeout
8064bfa5
MK
1304is not NULL, the structure it points to specifies
1305an absolute timeout for the wait operation.
63bea7dc
MK
1306If
1307.I timeout
1308is NULL, the operation can block indefinitely.
efeece04 1309.IP
a4e69912
MK
1310The
1311.I val3
1312argument is ignored.
efeece04 1313.IP
abb571e8 1314The
1ae6b2c7 1315.B FUTEX_WAIT_REQUEUE_PI
abb571e8 1316and
1ae6b2c7 1317.B FUTEX_CMP_REQUEUE_PI
abb571e8
MK
1318were added to support a fairly specific use case:
1319support for priority-inheritance-aware POSIX threads condition variables.
1320The idea is that these operations should always be paired,
1321in order to ensure that user space and the kernel remain in sync.
1322Thus, in the
1ae6b2c7 1323.B FUTEX_WAIT_REQUEUE_PI
abb571e8
MK
1324operation, the user-space application pre-specifies the target
1325of the requeue that takes place in the
1ae6b2c7 1326.B FUTEX_CMP_REQUEUE_PI
abb571e8
MK
1327operation.
1328.\"
1329.\" Darren Hart notes that a patch to allow glibc to fully support
1af427a4 1330.\" PI-aware pthreads condition variables has not yet been accepted into
abb571e8
MK
1331.\" glibc. The story is complex, and can be found at
1332.\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588
1333.\" Darren notes that in the meantime, the patch is shipped with various
1af427a4 1334.\" PREEMPT_RT-enabled Linux systems.
abb571e8
MK
1335.\"
1336.\" Related to the preceding, Darren proposed that somewhere, man-pages
1337.\" should document the following point:
1af427a4 1338.\"
b324e17d 1339.\" While the Linux kernel, since Linux 2.6.31, supports requeueing of
4c8cb0ff
MK
1340.\" priority-inheritance (PI) aware mutexes via the
1341.\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations,
1342.\" the glibc implementation does not yet take full advantage of this.
1343.\" Specifically, the condvar internal data lock remains a non-PI aware
1344.\" mutex, regardless of the type of the pthread_mutex associated with
1345.\" the condvar. This can lead to an unbounded priority inversion on
1346.\" the internal data lock even when associating a PI aware
1347.\" pthread_mutex with a condvar during a pthread_cond*_wait
1348.\" operation. For this reason, it is not recommended to rely on
1349.\" priority inheritance when using pthread condition variables.
1af427a4
MK
1350.\"
1351.\" The problem is that the obvious location for this text is
1352.\" the pthread_cond*wait(3) man page. However, such a man page
abb571e8 1353.\" does not currently exist.
70b06b90 1354.\"
6700de24 1355.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
70b06b90 1356.\"
47297adb 1357.SH RETURN VALUE
a5c5a06a
MK
1358In the event of an error (and assuming that
1359.BR futex ()
1360was invoked via
1361.BR syscall (2)),
1362all operations return \-1 and set
e808bba0 1363.I errno
855d489a 1364to indicate the error.
c6d039a3 1365.P
e808bba0
MK
1366The return value on success depends on the operation,
1367as described in the following list:
fea681da
MK
1368.TP
1369.B FUTEX_WAIT
077981d4 1370Returns 0 if the caller was woken up.
4c8cb0ff
MK
1371Note that a wake-up can also be caused by common futex usage patterns
1372in unrelated code that happened to have previously used the futex word's
1373memory location (e.g., typical futex-based implementations of
1374Pthreads mutexes can cause this under some conditions).
1375Therefore, callers should always conservatively assume that a return
1376value of 0 can mean a spurious wake-up, and use the futex word's value
bc54ed38
MK
1377(i.e., the user-space synchronization scheme)
1378to decide whether to continue to block or not.
fea681da
MK
1379.TP
1380.B FUTEX_WAKE
bdc5957a 1381Returns the number of waiters that were woken up.
fea681da
MK
1382.TP
1383.B FUTEX_FD
1384Returns the new file descriptor associated with the futex.
1385.TP
1386.B FUTEX_REQUEUE
bdc5957a 1387Returns the number of waiters that were woken up.
fea681da
MK
1388.TP
1389.B FUTEX_CMP_REQUEUE
bdc5957a 1390Returns the total number of waiters that were woken up or
4b35dc5d 1391requeued to the futex for the futex word at
3dfcc11d
MK
1392.IR uaddr2 .
1393If this value is greater than
1394.IR val ,
fd105614 1395then the difference is the number of waiters requeued to the futex for the
4c8cb0ff 1396futex word at
3dfcc11d 1397.IR uaddr2 .
dcad19c0
MK
1398.TP
1399.B FUTEX_WAKE_OP
a8b5b324 1400Returns the total number of waiters that were woken up.
4c8cb0ff
MK
1401This is the sum of the woken waiters on the two futexes for
1402the futex words at
a8b5b324
MK
1403.I uaddr
1404and
1405.IR uaddr2 .
dcad19c0
MK
1406.TP
1407.B FUTEX_WAIT_BITSET
077981d4
MK
1408Returns 0 if the caller was woken up.
1409See
4b35dc5d
TR
1410.B FUTEX_WAIT
1411for how to interpret this correctly in practice.
dcad19c0
MK
1412.TP
1413.B FUTEX_WAKE_BITSET
bdc5957a 1414Returns the number of waiters that were woken up.
dcad19c0
MK
1415.TP
1416.B FUTEX_LOCK_PI
bf02a260 1417Returns 0 if the futex was successfully locked.
dcad19c0 1418.TP
e79977ae
KK
1419.B FUTEX_LOCK_PI2
1420Returns 0 if the futex was successfully locked.
1421.TP
dcad19c0 1422.B FUTEX_TRYLOCK_PI
5c716eef 1423Returns 0 if the futex was successfully locked.
dcad19c0
MK
1424.TP
1425.B FUTEX_UNLOCK_PI
52bb928f 1426Returns 0 if the futex was successfully unlocked.
dcad19c0
MK
1427.TP
1428.B FUTEX_CMP_REQUEUE_PI
bdc5957a 1429Returns the total number of waiters that were woken up or
4b35dc5d 1430requeued to the futex for the futex word at
dddd395a
MK
1431.IR uaddr2 .
1432If this value is greater than
1433.IR val ,
4c8cb0ff
MK
1434then difference is the number of waiters requeued to the futex for
1435the futex word at
dddd395a 1436.IR uaddr2 .
dcad19c0
MK
1437.TP
1438.B FUTEX_WAIT_REQUEUE_PI
4c8cb0ff
MK
1439Returns 0 if the caller was successfully requeued to the futex for
1440the futex word at
22c15de9 1441.IR uaddr2 .
70b06b90
MK
1442.\"
1443.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1444.\"
fea681da
MK
1445.SH ERRORS
1446.TP
1447.B EACCES
4b35dc5d 1448No read access to the memory of a futex word.
fea681da
MK
1449.TP
1450.B EAGAIN
f48516d1 1451.RB ( FUTEX_WAIT ,
4b35dc5d 1452.BR FUTEX_WAIT_BITSET ,
f48516d1 1453.BR FUTEX_WAIT_REQUEUE_PI )
badbf70c
MK
1454The value pointed to by
1455.I uaddr
1456was not equal to the expected value
1457.I val
1458at the time of the call.
efeece04 1459.IP
9732dd8b
MK
1460.BR Note :
1461on Linux, the symbolic names
1462.B EAGAIN
1463and
1464.B EWOULDBLOCK
77da5feb 1465(both of which appear in different parts of the kernel futex code)
9732dd8b 1466have the same value.
badbf70c
MK
1467.TP
1468.B EAGAIN
8f2068bb
MK
1469.RB ( FUTEX_CMP_REQUEUE ,
1470.BR FUTEX_CMP_REQUEUE_PI )
ce5602fd 1471The value pointed to by
9f6c40c0
МК
1472.I uaddr
1473is not equal to the expected value
1474.IR val3 .
fea681da 1475.TP
eeeee811 1476.B EAGAIN
5662f56a 1477.RB ( FUTEX_LOCK_PI ,
e79977ae 1478.BR FUTEX_LOCK_PI2 ,
aaec9032
MK
1479.BR FUTEX_TRYLOCK_PI ,
1480.BR FUTEX_CMP_REQUEUE_PI )
1481The futex owner thread ID of
1482.I uaddr
1483(for
1484.BR FUTEX_CMP_REQUEUE_PI :
1485.IR uaddr2 )
1486is about to exit,
5662f56a
MK
1487but has not yet handled the internal state cleanup.
1488Try again.
1489.TP
eeeee811 1490.B EDEADLK
7a39e745 1491.RB ( FUTEX_LOCK_PI ,
e79977ae 1492.BR FUTEX_LOCK_PI2 ,
9732dd8b
MK
1493.BR FUTEX_TRYLOCK_PI ,
1494.BR FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1495The futex word at
7a39e745
MK
1496.I uaddr
1497is already locked by the caller.
1498.TP
1ae6b2c7 1499.B EDEADLK
c3875d1d 1500.\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some
d6bb5a38 1501.\" places, and EDEADLOCK in others. On almost all architectures
4c8cb0ff
MK
1502.\" these constants are synonymous. Is there a reason that both
1503.\" names are used?
8297383e
MK
1504.\"
1505.\" tglx (July 2015): "No. We should probably fix that."
1506.\"
662c0da8 1507.RB ( FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1508While requeueing a waiter to the PI futex for the futex word at
662c0da8
MK
1509.IR uaddr2 ,
1510the kernel detected a deadlock.
1511.TP
fea681da 1512.B EFAULT
1ea901e8
MK
1513A required pointer argument (i.e.,
1514.IR uaddr ,
1515.IR uaddr2 ,
1516or
1517.IR timeout )
496df304 1518did not point to a valid user-space address.
fea681da 1519.TP
9f6c40c0 1520.B EINTR
e808bba0 1521A
9f6c40c0 1522.B FUTEX_WAIT
2674f781
MK
1523or
1524.B FUTEX_WAIT_BITSET
e808bba0 1525operation was interrupted by a signal (see
f529fd20 1526.BR signal (7)).
b324e17d 1527Before Linux 2.6.22, this error could also be returned for
b5fff4ea 1528a spurious wakeup; since Linux 2.6.22, this no longer happens.
9f6c40c0 1529.TP
fea681da 1530.B EINVAL
180f97b7 1531The operation in
eeeee811 1532.I futex_op
180f97b7 1533is one of those that employs a timeout, but the supplied
fb2f4c27
MK
1534.I timeout
1535argument was invalid
1536.RI ( tv_sec
1537was less than zero, or
eeeee811 1538.I tv_nsec
cabee29d 1539was not less than 1,000,000,000).
fb2f4c27
MK
1540.TP
1541.B EINVAL
0c74df0b 1542The operation specified in
eeeee811 1543.I futex_op
0c74df0b 1544employs one or both of the pointers
51ee94be 1545.I uaddr
a1f47699 1546and
0c74df0b 1547.IR uaddr2 ,
36546c38 1548but one of these does not point to a valid object\[em]that is,
0c74df0b 1549the address is not four-byte-aligned.
51ee94be
MK
1550.TP
1551.B EINVAL
55cc422d
TG
1552.RB ( FUTEX_WAIT_BITSET ,
1553.BR FUTEX_WAKE_BITSET )
5e1456d4 1554The bit mask supplied in
eeeee811 1555.I val3
79c9b436
TG
1556is zero.
1557.TP
1558.B EINVAL
2abcba67 1559.RB ( FUTEX_CMP_REQUEUE_PI )
add875c0
MK
1560.I uaddr
1561equals
eeeee811 1562.I uaddr2
add875c0
MK
1563(i.e., an attempt was made to requeue to the same futex).
1564.TP
eeeee811 1565.B EINVAL
ff597681
MK
1566.RB ( FUTEX_FD )
1567The signal number supplied in
1568.I val
1569is invalid.
1570.TP
6bac3b85 1571.B EINVAL
476debd7
MK
1572.RB ( FUTEX_WAKE ,
1573.BR FUTEX_WAKE_OP ,
1574.BR FUTEX_WAKE_BITSET ,
1575.BR FUTEX_REQUEUE ,
1576.BR FUTEX_CMP_REQUEUE )
1577The kernel detected an inconsistency between the user-space state at
1578.I uaddr
36546c38 1579and the kernel state\[em]that is, it detected a waiter which waits in
eeeee811 1580.B FUTEX_LOCK_PI
e79977ae 1581or
eeeee811 1582.B FUTEX_LOCK_PI2
476debd7
MK
1583on
1584.IR uaddr .
1585.TP
1586.B EINVAL
a218ef20 1587.RB ( FUTEX_LOCK_PI ,
e79977ae 1588.BR FUTEX_LOCK_PI2 ,
ce022f18
MK
1589.BR FUTEX_TRYLOCK_PI ,
1590.BR FUTEX_UNLOCK_PI )
a218ef20
MK
1591The kernel detected an inconsistency between the user-space state at
1592.I uaddr
1593and the kernel state.
ce022f18 1594This indicates either state corruption
ce022f18 1595or that the kernel found a waiter on
a218ef20
MK
1596.I uaddr
1597which is waiting via
eeeee811 1598.B FUTEX_WAIT
a218ef20
MK
1599or
1600.BR FUTEX_WAIT_BITSET .
1601.TP
1602.B EINVAL
f9250b1a
MK
1603.RB ( FUTEX_CMP_REQUEUE_PI )
1604The kernel detected an inconsistency between the user-space state at
99c0041d
MK
1605.I uaddr2
1606and the kernel state;
ee65b0e8
MK
1607.\" From a conversation with Thomas Gleixner (Aug 2015): ###
1608.\" The kernel sees: I have non PI state for a futex you tried to
1609.\" tell me was PI
99c0041d 1610that is, the kernel detected a waiter which waits via
eeeee811 1611.B FUTEX_WAIT
8297383e 1612or
eeeee811 1613.B FUTEX_WAIT_BITSET
99c0041d
MK
1614on
1615.IR uaddr2 .
1616.TP
1617.B EINVAL
1618.RB ( FUTEX_CMP_REQUEUE_PI )
1619The kernel detected an inconsistency between the user-space state at
f9250b1a
MK
1620.I uaddr
1621and the kernel state;
1622that is, the kernel detected a waiter which waits via
eeeee811 1623.B FUTEX_WAIT
99c0041d 1624or
eeeee811 1625.B FUTEX_WAIT_BITSET
f9250b1a
MK
1626on
1627.IR uaddr .
1628.TP
1629.B EINVAL
99c0041d 1630.RB ( FUTEX_CMP_REQUEUE_PI )
75299c8d
MK
1631The kernel detected an inconsistency between the user-space state at
1632.I uaddr
1633and the kernel state;
1634that is, the kernel detected a waiter which waits on
1635.I uaddr
1636via
eeeee811 1637.B FUTEX_LOCK_PI
e79977ae 1638or
eeeee811 1639.B FUTEX_LOCK_PI2
75299c8d
MK
1640(instead of
1641.BR FUTEX_WAIT_REQUEUE_PI ).
99c0041d
MK
1642.TP
1643.B EINVAL
9786b3ca 1644.RB ( FUTEX_CMP_REQUEUE_PI )
8297383e
MK
1645.\" This deals with the case:
1646.\" wait_requeue_pi(A, B);
1647.\" requeue_pi(A, C);
9786b3ca
MK
1648An attempt was made to requeue a waiter to a futex other than that
1649specified by the matching
1650.B FUTEX_WAIT_REQUEUE_PI
1651call for that waiter.
1652.TP
1653.B EINVAL
f0c0d61c
MK
1654.RB ( FUTEX_CMP_REQUEUE_PI )
1655The
1656.I val
1657argument is not 1.
1658.TP
1659.B EINVAL
4832b48a 1660Invalid argument.
fea681da 1661.TP
d07d4ef3
MK
1662.B ENFILE
1663.RB ( FUTEX_FD )
1664The system-wide limit on the total number of open files has been reached.
1665.TP
eeeee811 1666.B ENOMEM
a449c634 1667.RB ( FUTEX_LOCK_PI ,
e79977ae 1668.BR FUTEX_LOCK_PI2 ,
e34a8fb6
MK
1669.BR FUTEX_TRYLOCK_PI ,
1670.BR FUTEX_CMP_REQUEUE_PI )
a449c634
MK
1671The kernel could not allocate memory to hold state information.
1672.TP
4701fc28
MK
1673.B ENOSYS
1674Invalid operation specified in
d33602c4 1675.IR futex_op .
9f6c40c0 1676.TP
4a7e5b05
MK
1677.B ENOSYS
1678The
eeeee811 1679.B FUTEX_CLOCK_REALTIME
4a7e5b05 1680option was specified in
1afcee7c 1681.IR futex_op ,
4a7e5b05 1682but the accompanying operation was neither
017d194b
MK
1683.BR FUTEX_WAIT ,
1684.BR FUTEX_WAIT_BITSET ,
e79977ae 1685.BR FUTEX_WAIT_REQUEUE_PI ,
4a7e5b05 1686nor
e79977ae 1687.BR FUTEX_LOCK_PI2 .
4a7e5b05 1688.TP
eeeee811 1689.B ENOSYS
a9dcb4d1 1690.RB ( FUTEX_LOCK_PI ,
e79977ae 1691.BR FUTEX_LOCK_PI2 ,
f2424fae 1692.BR FUTEX_TRYLOCK_PI ,
4945ff19 1693.BR FUTEX_UNLOCK_PI ,
4cf92894 1694.BR FUTEX_CMP_REQUEUE_PI ,
794bb106 1695.BR FUTEX_WAIT_REQUEUE_PI )
4b35dc5d 1696A run-time check determined that the operation is not available.
f0a9e8f4 1697The PI-futex operations are not implemented on all architectures and
077981d4 1698are not supported on some CPU variants.
a9dcb4d1 1699.TP
eeeee811 1700.B EPERM
c7589177 1701.RB ( FUTEX_LOCK_PI ,
e79977ae 1702.BR FUTEX_LOCK_PI2 ,
dc2742a8
MK
1703.BR FUTEX_TRYLOCK_PI ,
1704.BR FUTEX_CMP_REQUEUE_PI )
04331c3f 1705The caller is not allowed to attach itself to the futex at
dc2742a8
MK
1706.I uaddr
1707(for
1708.BR FUTEX_CMP_REQUEUE_PI :
1709the futex at
1710.IR uaddr2 ).
c7589177
MK
1711(This may be caused by a state corruption in user space.)
1712.TP
eeeee811 1713.B EPERM
87276709 1714.RB ( FUTEX_UNLOCK_PI )
4b35dc5d 1715The caller does not own the lock represented by the futex word.
76f347ba 1716.TP
eeeee811 1717.B ESRCH
0b0e4934 1718.RB ( FUTEX_LOCK_PI ,
e79977ae 1719.BR FUTEX_LOCK_PI2 ,
9732dd8b
MK
1720.BR FUTEX_TRYLOCK_PI ,
1721.BR FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1722The thread ID in the futex word at
0b0e4934
MK
1723.I uaddr
1724does not exist.
1725.TP
eeeee811 1726.B ESRCH
360f773c 1727.RB ( FUTEX_CMP_REQUEUE_PI )
4b35dc5d 1728The thread ID in the futex word at
360f773c
MK
1729.I uaddr2
1730does not exist.
1731.TP
9f6c40c0 1732.B ETIMEDOUT
4d85047f 1733The operation in
eeeee811 1734.I futex_op
4d85047f
MK
1735employed the timeout specified in
1736.IR timeout ,
1737and the timeout expired before the operation completed.
70b06b90
MK
1738.\"
1739.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1740.\"
4131356c
AC
1741.SH STANDARDS
1742Linux.
1743.SH HISTORY
1744Linux 2.6.0.
c6d039a3 1745.P
4c8cb0ff
MK
1746Initial futex support was merged in Linux 2.5.7 but with different
1747semantics from what was described above.
52dee70e 1748A four-argument system call with the semantics
fd3fa7ef 1749described in this page was introduced in Linux 2.5.40.
d0442d14
MK
1750A fifth argument was added in Linux 2.5.70,
1751and a sixth argument was added in Linux 2.6.7.
a14af333 1752.SH EXAMPLES
bc54ed38
MK
1753The program below demonstrates use of futexes in a program where a parent
1754process and a child process use a pair of futexes located inside a
305cc415
MK
1755shared anonymous mapping to synchronize access to a shared resource:
1756the terminal.
1757The two processes each write
1ae6b2c7 1758.I nloops
305cc415
MK
1759(a command-line argument that defaults to 5 if omitted)
1760messages to the terminal and employ a synchronization protocol
1761that ensures that they alternate in writing messages.
1762Upon running this program we see output such as the following:
c6d039a3 1763.P
305cc415 1764.in +4n
b76974c1 1765.EX
305cc415
MK
1766$ \fB./futex_demo\fP
1767Parent (18534) 0
1768Child (18535) 0
1769Parent (18534) 1
1770Child (18535) 1
1771Parent (18534) 2
1772Child (18535) 2
1773Parent (18534) 3
1774Child (18535) 3
1775Parent (18534) 4
1776Child (18535) 4
b76974c1 1777.EE
305cc415
MK
1778.in
1779.SS Program source
1780\&
33857069 1781.\" SRC BEGIN (futex.c)
e7d0bb47 1782.EX
305cc415 1783/* futex_demo.c
fe5dba13 1784\&
305cc415
MK
1785 Usage: futex_demo [nloops]
1786 (Default: 5)
fe5dba13 1787\&
305cc415
MK
1788 Demonstrate the use of futexes in a program where parent and child
1789 use a pair of futexes located inside a shared anonymous mapping to
1790 synchronize access to a shared resource: the terminal. The two
b957f81f 1791 processes each write \[aq]num\-loops\[aq] messages to the terminal and employ
305cc415
MK
1792 a synchronization protocol that ensures that they alternate in
1793 writing messages.
1794*/
1795#define _GNU_SOURCE
5a5208c1 1796#include <err.h>
305cc415 1797#include <errno.h>
4ae706b0 1798#include <linux/futex.h>
915c4ba3 1799#include <stdatomic.h>
8eb90116 1800#include <stdint.h>
4ae706b0 1801#include <stdio.h>
305cc415 1802#include <stdlib.h>
305cc415
MK
1803#include <sys/mman.h>
1804#include <sys/syscall.h>
305cc415 1805#include <sys/time.h>
4ae706b0
AC
1806#include <sys/wait.h>
1807#include <unistd.h>
fe5dba13 1808\&
2253ecf0 1809static uint32_t *futex1, *futex2, *iaddr;
fe5dba13 1810\&
305cc415 1811static int
2253ecf0
AC
1812futex(uint32_t *uaddr, int futex_op, uint32_t val,
1813 const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3)
305cc415
MK
1814{
1815 return syscall(SYS_futex, uaddr, futex_op, val,
c1e04f01 1816 timeout, uaddr2, val3);
305cc415 1817}
fe5dba13 1818\&
b957f81f 1819/* Acquire the futex pointed to by \[aq]futexp\[aq]: wait for its value to
305cc415 1820 become 1, and then set the value to 0. */
fe5dba13 1821\&
305cc415 1822static void
2253ecf0 1823fwait(uint32_t *futexp)
305cc415 1824{
041a155c
AC
1825 long s;
1826 const uint32_t one = 1;
fe5dba13 1827\&
915c4ba3
BP
1828 /* atomic_compare_exchange_strong(ptr, oldval, newval)
1829 atomically performs the equivalent of:
fe5dba13 1830\&
915c4ba3 1831 if (*ptr == *oldval)
305cc415 1832 *ptr = newval;
fe5dba13 1833\&
915c4ba3 1834 It returns true if the test yielded true and *ptr was updated. */
fe5dba13 1835\&
305cc415 1836 while (1) {
fe5dba13 1837\&
63ad44cb 1838 /* Is the futex available? */
09e456c2 1839 if (atomic_compare_exchange_strong(futexp, &one, 0))
305cc415 1840 break; /* Yes */
fe5dba13 1841\&
c6beb8a1 1842 /* Futex is not available; wait. */
fe5dba13 1843\&
63ad44cb
HS
1844 s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
1845 if (s == \-1 && errno != EAGAIN)
5a5208c1 1846 err(EXIT_FAILURE, "futex\-FUTEX_WAIT");
305cc415
MK
1847 }
1848}
fe5dba13 1849\&
b957f81f 1850/* Release the futex pointed to by \[aq]futexp\[aq]: if the futex currently
a00c4698 1851 has the value 0, set its value to 1 and then wake any futex waiters,
e2c75104 1852 so that if the peer is blocked in fwait(), it can proceed. */
fe5dba13 1853\&
305cc415 1854static void
2253ecf0 1855fpost(uint32_t *futexp)
305cc415 1856{
041a155c
AC
1857 long s;
1858 const uint32_t zero = 0;
fe5dba13 1859\&
68219aba 1860 /* atomic_compare_exchange_strong() was described
c6beb8a1 1861 in comments above. */
fe5dba13 1862\&
09e456c2 1863 if (atomic_compare_exchange_strong(futexp, &zero, 1)) {
305cc415
MK
1864 s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0);
1865 if (s == \-1)
5a5208c1 1866 err(EXIT_FAILURE, "futex\-FUTEX_WAKE");
305cc415
MK
1867 }
1868}
fe5dba13 1869\&
305cc415
MK
1870int
1871main(int argc, char *argv[])
1872{
b42296e4
AC
1873 pid_t childPid;
1874 unsigned int nloops;
fe5dba13 1875\&
305cc415 1876 setbuf(stdout, NULL);
fe5dba13 1877\&
305cc415 1878 nloops = (argc > 1) ? atoi(argv[1]) : 5;
fe5dba13 1879\&
305cc415
MK
1880 /* Create a shared anonymous mapping that will hold the futexes.
1881 Since the futexes are being shared between processes, we
1882 subsequently use the "shared" futex operations (i.e., not the
c6beb8a1 1883 ones suffixed "_PRIVATE"). */
fe5dba13 1884\&
d60a7a9a 1885 iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE,
4687ab0e 1886 MAP_ANONYMOUS | MAP_SHARED, \-1, 0);
305cc415 1887 if (iaddr == MAP_FAILED)
5a5208c1 1888 err(EXIT_FAILURE, "mmap");
fe5dba13 1889\&
305cc415
MK
1890 futex1 = &iaddr[0];
1891 futex2 = &iaddr[1];
fe5dba13 1892\&
305cc415
MK
1893 *futex1 = 0; /* State: unavailable */
1894 *futex2 = 1; /* State: available */
fe5dba13 1895\&
305cc415 1896 /* Create a child process that inherits the shared anonymous
c6beb8a1 1897 mapping. */
fe5dba13 1898\&
305cc415 1899 childPid = fork();
92a46690 1900 if (childPid == \-1)
5a5208c1 1901 err(EXIT_FAILURE, "fork");
fe5dba13 1902\&
305cc415 1903 if (childPid == 0) { /* Child */
b42296e4 1904 for (unsigned int j = 0; j < nloops; j++) {
305cc415 1905 fwait(futex1);
b42296e4 1906 printf("Child (%jd) %u\en", (intmax_t) getpid(), j);
305cc415
MK
1907 fpost(futex2);
1908 }
fe5dba13 1909\&
305cc415
MK
1910 exit(EXIT_SUCCESS);
1911 }
fe5dba13 1912\&
c6beb8a1 1913 /* Parent falls through to here. */
fe5dba13 1914\&
b42296e4 1915 for (unsigned int j = 0; j < nloops; j++) {
305cc415 1916 fwait(futex2);
b42296e4 1917 printf("Parent (%jd) %u\en", (intmax_t) getpid(), j);
305cc415
MK
1918 fpost(futex1);
1919 }
fe5dba13 1920\&
305cc415 1921 wait(NULL);
fe5dba13 1922\&
305cc415
MK
1923 exit(EXIT_SUCCESS);
1924}
e7d0bb47 1925.EE
33857069 1926.\" SRC END
47297adb 1927.SH SEE ALSO
4c222281 1928.ad l
9913033c 1929.BR get_robust_list (2),
d806bc05 1930.BR restart_syscall (2),
e0074751 1931.BR pthread_mutexattr_getprotocol (3),
ac894879
MK
1932.BR futex (7),
1933.BR sched (7)
c6d039a3 1934.P
f5ad572f 1935The following kernel source files:
cdede5cd 1936.IP \[bu] 3
b49c2acb 1937.I Documentation/pi\-futex.txt
cdede5cd 1938.IP \[bu]
b49c2acb 1939.I Documentation/futex\-requeue\-pi.txt
cdede5cd 1940.IP \[bu]
b49c2acb 1941.I Documentation/locking/rt\-mutex.txt
cdede5cd 1942.IP \[bu]
b49c2acb 1943.I Documentation/locking/rt\-mutex\-design.txt
cdede5cd 1944.IP \[bu]
b49c2acb 1945.I Documentation/robust\-futex\-ABI.txt
c6d039a3 1946.P
4c222281 1947Franke, H., Russell, R., and Kirwood, M., 2002.
52087dd3 1948\fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP
4c222281 1949(from proceedings of the Ottawa Linux Symposium 2002),
9b936e9e 1950.br
5465ae95 1951.UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf
608bf950 1952.UE
c6d039a3 1953.P
4c222281 1954Hart, D., 2009. \fIA futex overview and update\fP,
2ed26199
MK
1955.UR http://lwn.net/Articles/360699/
1956.UE
c6d039a3 1957.P
8fb01fde 1958Hart, D.\& and Guniguntala, D., 2009.
75c018a1 1959\fIRequeue-PI: Making glibc Condvars PI-Aware\fP
4c222281 1960(from proceedings of the 2009 Real-Time Linux Workshop),
0483b6cc
MK
1961.UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf
1962.UE
c6d039a3 1963.P
4c222281 1964Drepper, U., 2011. \fIFutexes Are Tricky\fP,
f42eb21b
MK
1965.UR http://www.akkadia.org/drepper/futex.pdf
1966.UE
c6d039a3 1967.P
6693a21c 1968Futex example library, futex\-*.tar.bz2 at
9b936e9e 1969.br
99ef5aaf 1970.UR https://mirrors.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/
608bf950 1971.UE
34f14794 1972.\"
74f58a64 1973.\" FIXME(Torvald) We should probably refer to the glibc code here, in
9915ea23
MK
1974.\" particular the glibc-internal futex wrapper functions that are
1975.\" WIP, and the generic pthread_mutex_t and perhaps condvar
1976.\" implementations.