.\" Page by b.hubert
.\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de>
.\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com>
.\"
.\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE)
.\" may be freely modified and distributed
.\" %%%LICENSE_END
.\"
.\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com)
.\" added ERRORS section.
.\"
.\" Modified 2004-06-17 mtk
.\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE
.\"
.TH FUTEX 2 2014-05-21 "Linux" "Linux Programmer's Manual"
.SH NAME
futex \- fast user-space locking
.SH SYNOPSIS
.nf
.sp
.B "#include <linux/futex.h>"
.B "#include <sys/time.h>"
.sp
.BI "int futex(int *" uaddr ", int " futex_op ", int " val ,
.BI "          const struct timespec *" timeout , \
"   \fR  /* or: \fBu32 \fIval2\fP */
.BI "          int *" uaddr2 ", int " val3 );
.fi

.IR Note :
There is no glibc wrapper for this system call; see NOTES.
.SH DESCRIPTION
.PP
The
.BR futex ()
system call provides a method for
a program to wait for a value at a given address to change, and a
method to wake up anyone waiting on a particular address.
(While the virtual addresses for the same memory in separate
processes may not be equal,
the kernel maps them internally so that the same memory mapped
in different locations will correspond for
.BR futex ()
calls.)
This system call is typically used to
implement the contended case of a lock in shared memory, as
described in
.BR futex (7).

In the uncontended case,
all operations on the futex memory location are performed
in user space using atomic machine-language instructions,
and the kernel maintains no information about the futex.
The kernel allocates state information for the futex only
in the contended case, when operations such as
.BR FUTEX_WAIT ,
described below, are performed.

When a futex operation did not finish uncontended in user space, a
.BR futex ()
call needs to be made to the kernel to arbitrate.
Arbitration can either mean putting the caller
to sleep or, conversely, waking a waiting process or thread.
.PP
Callers of
.BR futex ()
are expected to adhere to the semantics described in
.BR futex (7).
As these semantics involve writing nonportable assembly instructions
(see the example library referred to in SEE ALSO),
this in turn probably means that most users will in fact be
library authors and not general application developers.
.\"
.SS Arguments
The
.I uaddr
argument points to an integer which stores the counter (futex).
On all platforms, futexes are four-byte integers that
must be aligned on a four-byte boundary.
The operation to perform on the futex is specified in the
.I futex_op
argument;
.IR val
is a value whose meaning and purpose depend on
.IR futex_op .

The remaining arguments
.RI ( timeout ,
.IR uaddr2 ,
and
.IR val3 )
are required only for certain of the futex operations described below.
Where one of these arguments is not required, it is ignored.

For several blocking operations, the
.I timeout
argument is a pointer to a
.IR timespec
structure that specifies a timeout for the operation.
However, notwithstanding the prototype shown above, for some operations,
this argument is instead a four-byte integer whose meaning
is determined by the operation.
For these operations, the kernel casts the
.I timeout
value to
.IR u32 ,
and in the remainder of this page, this argument is referred to as
.I val2
when interpreted in this fashion.

Where it is required, the
.IR uaddr2
argument is a pointer to a second futex that is employed by the operation.
The interpretation of the final integer argument,
.IR val3 ,
depends on the operation.
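
Because there is no glibc wrapper, callers typically reach the system call through
.BR syscall (2).
The following is a minimal sketch of that approach, not a definitive implementation.
The helper names futex_call and futex_call_val2 are hypothetical;
the second helper covers operations that take val2 in the timeout slot.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper around the raw system call. */
long futex_call(uint32_t *uaddr, int futex_op, uint32_t val,
                const struct timespec *timeout,
                uint32_t *uaddr2, uint32_t val3)
{
    /* Futex words must be aligned on a four-byte boundary. */
    assert(((uintptr_t) uaddr & 3) == 0);
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Hypothetical variant for operations that interpret the fourth
   argument as the integer val2 rather than as a timespec pointer. */
long futex_call_val2(uint32_t *uaddr, int futex_op, uint32_t val,
                     uint32_t val2, uint32_t *uaddr2, uint32_t val3)
{
    assert(((uintptr_t) uaddr & 3) == 0);
    return syscall(SYS_futex, uaddr, futex_op, val,
                   (unsigned long) val2, uaddr2, val3);
}
```

On failure both helpers return \-1 with errno set, following the usual syscall(2) convention.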
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.SS Futex operations
The
.I futex_op
argument consists of two parts:
a command that specifies the operation to be performed,
bit-wise ORed with zero or more options that
modify the behavior of the operation.
The options that may be included in
.I futex_op
are as follows:
.TP
.BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)"
.\" commit 34f01cc1f512fa783302982776895c73714ebbc2
This option bit can be employed with all futex operations.
It tells the kernel that the futex is process-private and not shared
with another process
(i.e., it is being used for synchronization between threads).
This allows the kernel to choose the fast path for validating
the user-space address and avoids expensive VMA lookups,
taking reference counts on file backing store, and so on.

As a convenience,
.IR <linux/futex.h>
defines a set of constants with the suffix
.BR _PRIVATE
that are equivalents of all of the operations listed below,
.\" except the obsolete FUTEX_FD, for which the "private" flag was
.\" meaningless
but with the
.BR FUTEX_PRIVATE_FLAG
ORed into the constant value.
Thus, there are
.BR FUTEX_WAIT_PRIVATE ,
.BR FUTEX_WAKE_PRIVATE ,
and so on.
.TP
.BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)"
.\" commit 1acdac104668a0834cfa267de9946fac7764d486
This option bit can be employed only with the
.BR FUTEX_WAIT_BITSET
and
.BR FUTEX_WAIT_REQUEUE_PI
operations.

If this option is set, the kernel treats
.I timeout
as an absolute time based on
.BR CLOCK_REALTIME .

If this option is not set, the kernel measures the
.I timeout
.\" FIXME XXX I added CLOCK_MONOTONIC here. Okay?
against the
.BR CLOCK_MONOTONIC
clock.
.PP
The operation specified in
.I futex_op
is one of the following:
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_WAIT " (since Linux 2.6.0)"
.\" Strictly speaking, since some time in 2.5.x
This operation tests that the value at the
location pointed to by the futex address
.I uaddr
still contains the value
.IR val ,
and then sleeps awaiting
.B FUTEX_WAKE
on the futex address.
The test and sleep steps are performed atomically.
If the futex value does not match
.IR val ,
then the call fails immediately with the error
.BR EAGAIN .
.\" FIXME I added the following sentence. Please confirm that it is correct.
The purpose of the test step is to detect races where
another process or thread changes the value of the futex between
the time it was last checked and the time of the
.BR FUTEX_WAIT
operation.

If the
.I timeout
argument is non-NULL, its contents specify a relative timeout for the wait,
.\" FIXME XXX I added CLOCK_MONOTONIC here. Okay?
measured according to the
.BR CLOCK_MONOTONIC
clock.
(This interval will be rounded up to the system clock granularity,
and kernel scheduling delays mean that the
blocking interval may overrun by a small amount.)
If
.I timeout
is NULL, the call blocks indefinitely.

The arguments
.I uaddr2
and
.I val3
are ignored.

For
.BR futex (7),
this call is executed if decrementing the count gave a negative value
(indicating contention),
and will sleep until another process or thread releases
the futex and executes the
.B FUTEX_WAKE
operation.
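
The test-and-sleep behavior described above can be exercised from a single thread.
The following sketch assumes a hypothetical helper, futex_wait_ms, that converts
a millisecond count into the relative timespec expected by FUTEX_WAIT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: sleep while *uaddr == expected, for at most
 * timeout_ms milliseconds (a negative value means no timeout).
 * Returns 0 if woken, or -1 with errno set (EAGAIN if the value did
 * not match, ETIMEDOUT if the relative timeout expired, and so on).
 */
int futex_wait_ms(uint32_t *uaddr, uint32_t expected, long timeout_ms)
{
    struct timespec ts;
    struct timespec *tsp = NULL;

    assert(uaddr != NULL);
    if (timeout_ms >= 0) {
        ts.tv_sec = timeout_ms / 1000;
        ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
        tsp = &ts;
    }
    return (int) syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                         tsp, NULL, 0);
}
```

A caller that sees EAGAIN should normally reread the futex word and retry its
user-space protocol rather than treat the result as a failure.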
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_WAKE " (since Linux 2.6.0)"
.\" Strictly speaking, since Linux 2.5.x
This operation wakes at most
.I val
of the waiters that are waiting (i.e., inside
.BR FUTEX_WAIT )
on the futex at the address
.IR uaddr .
Most commonly,
.I val
is specified as either 1 (wake up a single waiter) or
.BR INT_MAX
(wake up all waiters).
.\" FIXME Please confirm that the following is correct:
No guarantee is provided about which waiters are awoken
(e.g., a waiter with a higher scheduling priority is not guaranteed
to be awoken in preference to a waiter with a lower priority).

The arguments
.IR timeout ,
.IR uaddr2 ,
and
.I val3
are ignored.

For
.BR futex (7),
this is executed if incrementing the count showed that there were waiters,
.\" FIXME How does "incrementing the count showed that there were waiters"?
once the futex value has been set to 1 (indicating that it is available).
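
To show how FUTEX_WAIT and FUTEX_WAKE pair up around a user-space fast path,
here is a minimal sketch of a lock built on C11 atomics.
It uses a three-state word (0 unlocked, 1 locked, 2 locked with possible waiters),
which is an assumption of this sketch and not the counter protocol of
.BR futex (7).
The names sketch_lock and sketch_unlock are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* States: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */

void sketch_lock(atomic_uint *m)
{
    unsigned int c = 0;

    assert(m != NULL);
    /* Uncontended case: 0 -> 1 entirely in user space. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;
    /* Contended case: mark the lock as having waiters, then sleep. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        /* Sleeps only if *m is still 2; otherwise fails with EAGAIN. */
        syscall(SYS_futex, m, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(m, 2);
    }
}

void sketch_unlock(atomic_uint *m)
{
    assert(m != NULL);
    /* If the old state was 2, there may be waiters: wake one of them. */
    if (atomic_fetch_sub(m, 1) != 1) {
        atomic_store(m, 0);
        syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}
```

In the uncontended case neither function enters the kernel,
which matches the division of labor described at the start of this page.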
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)"
.\" Strictly speaking, from Linux 2.5.x to 2.6.25
This operation creates a file descriptor that is associated with the futex at
.IR uaddr .
The caller must close the returned file descriptor after use.
When another process or thread performs a
.BR FUTEX_WAKE
on the futex, the file descriptor is indicated as being readable by
.BR select (2),
.BR poll (2),
and
.BR epoll (7).

The file descriptor can be used to obtain asynchronous notifications: if
.I val
is nonzero, then when another process or thread executes a
.BR FUTEX_WAKE ,
the caller will receive the signal number that was passed in
.IR val .

The arguments
.IR timeout ,
.I uaddr2
and
.I val3
are ignored.

To prevent race conditions, the caller should test if the futex has
been upped after
.B FUTEX_FD
returns.

Because it was inherently racy,
.B FUTEX_FD
has been removed
.\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80
from Linux 2.6.26 onward.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_REQUEUE " (since Linux 2.6.0)"
.\" Strictly speaking: from Linux 2.5.70
.\"
.\" FIXME XXX I added this warning. Okay?
.IR "Avoid using this operation" .
It is broken (unavoidably racy) for its intended purpose.
Use
.BR FUTEX_CMP_REQUEUE
instead.

This operation performs the same task as
.BR FUTEX_CMP_REQUEUE ,
except that no check is made using the value in
.IR val3 .
(The argument
.I val3
is ignored.)
MK
330.\"
331.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
332.\"
a7c2bf45
MK
333.TP
334.BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)"
335This operation was added as a replacement for the earlier
336.BR FUTEX_REQUEUE ,
337because that operation was racy for its intended use.
338
339As with
340.BR FUTEX_REQUEUE ,
341the
342.BR FUTEX_CMP_REQUEUE
343operation is used to avoid a "thundering herd" effect when
344.B FUTEX_WAKE
bdc5957a
MK
345is used and all of the waiters that are woken up
346need to acquire another futex.
a7c2bf45
MK
347It differs from
348.BR FUTEX_REQUEUE
349in that it first checks whether the location
350.I uaddr
351still contains the value
352.IR val3 .
353If not, the operation fails with the error
354.BR EAGAIN .
70b06b90
MK
355.\" FIXME I added the following sentence on the rationale for
356.\" FUTEX_CMP_REQUEUE. Is it correct? Should it be expanded?
a7c2bf45
MK
357This additional feature of
358.BR FUTEX_CMP_REQUEUE
359can be used by the caller to (atomically) detect changes
360in the value of the target futex at
361.IR uaddr2 .
362
363The operation wakes up a maximum of
364.I val
365waiters that are waiting on the futex at
366.IR uaddr .
367If there are more than
368.I val
369waiters, then the remaining waiters are removed
370from the wait queue of the source futex at
371.I uaddr
372and added to the wait queue of the target futex at
373.IR uaddr2 .
936876a9 374
a7c2bf45 375The
768d3c23 376.I val2
936876a9 377argument specifies an upper limit on the number of waiters
a7c2bf45 378that are requeued to the futex at
768d3c23 379.IR uaddr2 .
a7c2bf45
MK
380
381.\" FIXME Please review the following new paragraph to see if it is
382.\" accurate.
383Typical values to specify for
384.I val
385are 0 or or 1.
386(Specifying
387.BR INT_MAX
388is not useful, because it would make the
389.BR FUTEX_CMP_REQUEUE
390operation equivalent to
391.BR FUTEX_WAKE .)
936876a9 392The limit value specified via
768d3c23
MK
393.I val2
394is typically either 1 or
a7c2bf45
MK
395.BR INT_MAX .
396(Specifying the argument as 0 is not useful, because it would make the
397.BR FUTEX_CMP_REQUEUE
398operation equivalent to
399.BR FUTEX_WAIT .)
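
As an illustration of the argument layout, the sketch below wakes one waiter
on a source futex and requeues the rest onto a target futex.
The helper name requeue_waiters and the roles given to the two words
(a condition word and a lock word) are assumptions made for the example.
Waking every waiter with FUTEX_WAKE instead would leave all of them
contending for the lock word at once.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: wake one waiter on *cond and move the others to
 * the wait queue of *lock, provided *cond still holds 'expected'.
 * Returns -1 with errno set to EAGAIN if *cond has changed.
 */
long requeue_waiters(uint32_t *cond, uint32_t *lock, uint32_t expected)
{
    assert(cond != NULL && lock != NULL);
    return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE,
                   1,                        /* val: wake at most one */
                   (unsigned long) INT_MAX,  /* val2: requeue the rest */
                   lock,                     /* uaddr2: target futex */
                   expected);                /* val3: expected *cond */
}
```

Note that val2 travels in the argument slot that the prototype calls timeout,
which is why it is passed as an integer here.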
.\"
.\" FIXME Here, it would be helpful to have an example of how
.\" FUTEX_CMP_REQUEUE might be used, at the same time illustrating
.\" why FUTEX_WAKE is unsuitable for the same use case.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.\" FIXME I added a lengthy piece of text on FUTEX_WAKE_OP text,
.\" and I'd be happy if someone checked it.
.TP
.BR FUTEX_WAKE_OP " (since Linux 2.6.14)"
.\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721
.\" Author: Jakub Jelinek <jakub@redhat.com>
.\" Date: Tue Sep 6 15:16:25 2005 -0700
This operation was added to support some user-space use cases
where more than one futex must be handled at the same time.
The most notable example is the implementation of
.BR pthread_cond_signal (3),
which requires operations on two futexes,
the one used to implement the mutex and the one used in the implementation
of the wait queue associated with the condition variable.
.BR FUTEX_WAKE_OP
allows such cases to be implemented without leading to
high rates of contention and context switching.

The
.BR FUTEX_WAKE_OP
operation is equivalent to atomically executing the following code:

.in +4n
.nf
int oldval = *(int *) uaddr2;
*(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP;
futex(uaddr, FUTEX_WAKE, val, 0, 0, 0);
if (oldval \fIcmp\fP \fIcmparg\fP)
    futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0);
.fi
.in

In other words,
.BR FUTEX_WAKE_OP
does the following:
.RS
.IP * 3
saves the original value of the futex at
.IR uaddr2 ;
.IP *
performs an operation to modify the value of the futex at
.IR uaddr2 ;
.IP *
wakes up a maximum of
.I val
waiters on the futex
.IR uaddr ;
and
.IP *
dependent on the results of a test of the original value of the futex at
.IR uaddr2 ,
wakes up a maximum of
.I val2
waiters on the futex
.IR uaddr2 .
.RE
.IP
The operation and comparison that are to be performed are encoded
in the bits of the argument
.IR val3 .
Pictorially, the encoding is:

.in +8n
.nf
+---+---+-----------+-----------+
|op |cmp|   oparg   |  cmparg   |
+---+---+-----------+-----------+
  4   4       12          12    <== # of bits
.fi
.in

Expressed in code, the encoding is:

.in +4n
.nf
#define FUTEX_OP(op, oparg, cmp, cmparg) \\
                (((op & 0xf) << 28) | \\
                ((cmp & 0xf) << 24) | \\
                ((oparg & 0xfff) << 12) | \\
                (cmparg & 0xfff))
.fi
.in

In the above,
.I op
and
.I cmp
are each one of the codes listed below.
The
.I oparg
and
.I cmparg
components are literal numeric values, except as noted below.

The
.I op
component has one of the following values:

.in +4n
.nf
FUTEX_OP_SET        0  /* uaddr2 = oparg; */
FUTEX_OP_ADD        1  /* uaddr2 += oparg; */
FUTEX_OP_OR         2  /* uaddr2 |= oparg; */
FUTEX_OP_ANDN       3  /* uaddr2 &= ~oparg; */
FUTEX_OP_XOR        4  /* uaddr2 ^= oparg; */
.fi
.in

In addition, bit-wise ORing the following value into
.I op
causes
.IR "(1\ <<\ oparg)"
to be used as the operand:

.in +4n
.nf
FUTEX_OP_ARG_SHIFT  8  /* Use (1 << oparg) as operand */
.fi
.in

The
.I cmp
field is one of the following:

.in +4n
.nf
FUTEX_OP_CMP_EQ     0  /* if (oldval == cmparg) wake */
FUTEX_OP_CMP_NE     1  /* if (oldval != cmparg) wake */
FUTEX_OP_CMP_LT     2  /* if (oldval < cmparg) wake */
FUTEX_OP_CMP_LE     3  /* if (oldval <= cmparg) wake */
FUTEX_OP_CMP_GT     4  /* if (oldval > cmparg) wake */
FUTEX_OP_CMP_GE     5  /* if (oldval >= cmparg) wake */
.fi
.in

The return value of
.BR FUTEX_WAKE_OP
is the sum of the number of waiters woken on the futex
.IR uaddr
plus the number of waiters woken on the futex
.IR uaddr2 .
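
The bit layout can be checked with a small worked example.
The sketch below packs the four fields with a hypothetical function,
encode_wake_op, that mirrors the macro shown above,
and passes the result to the kernel through a hypothetical wrapper, wake_op.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: pack op (4 bits), cmp (4 bits), oparg (12 bits)
   and cmparg (12 bits) into the val3 argument. */
uint32_t encode_wake_op(uint32_t op, uint32_t oparg,
                        uint32_t cmp, uint32_t cmparg)
{
    return ((op & 0xf) << 28) | ((cmp & 0xf) << 24) |
           ((oparg & 0xfff) << 12) | (cmparg & 0xfff);
}

/* Hypothetical wrapper: wake up to nwake waiters on uaddr, modify
   *uaddr2 as encoded in val3, and conditionally wake up to nwake2
   waiters on uaddr2.  Returns the total number of waiters woken. */
long wake_op(uint32_t *uaddr, uint32_t nwake,
             uint32_t *uaddr2, uint32_t nwake2, uint32_t val3)
{
    assert(uaddr != NULL && uaddr2 != NULL);
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP, nwake,
                   (unsigned long) nwake2, uaddr2, val3);
}
```

For example, encode_wake_op(FUTEX_OP_ADD, 1, FUTEX_OP_CMP_EQ, 0) yields 0x10001000:
the operation code 1 lands in the top four bits and the operand 1 in bits 12 to 23.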
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)"
.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
This operation is like
.BR FUTEX_WAIT
except that
.I val3
is used to provide a 32-bit bitset to the kernel.
This bitset is stored in the kernel-internal state of the waiter.
See the description of
.BR FUTEX_WAKE_BITSET
for further details.

The
.BR FUTEX_WAIT_BITSET
operation also interprets the
.I timeout
argument differently from
.BR FUTEX_WAIT .
See the discussion of
.BR FUTEX_CLOCK_REALTIME ,
above.

The
.I uaddr2
argument is ignored.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)"
.\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d
This operation is the same as
.BR FUTEX_WAKE
except that the
.I val3
argument is used to provide a 32-bit bitset to the kernel.
This bitset is used to select which waiters should be woken up.
The selection is done by a bit-wise AND of the "wake" bitset
(i.e., the value in
.IR val3 )
and the bitset which is stored in the kernel-internal
state of the waiter (the "wait" bitset that is set using
.BR FUTEX_WAIT_BITSET ).
All of the waiters for which the result of the AND is nonzero are woken up;
the remaining waiters are left sleeping.

.\" FIXME XXX Is this paragraph that I added okay?
The effect of
.BR FUTEX_WAIT_BITSET
and
.BR FUTEX_WAKE_BITSET
is to allow selective wake-ups among multiple waiters that are waiting
on the same futex;
since a futex has a size of 32 bits,
these operations provide 32 wakeup "channels".
(The
.BR FUTEX_WAIT
and
.BR FUTEX_WAKE
operations correspond to
.BR FUTEX_WAIT_BITSET
and
.BR FUTEX_WAKE_BITSET
operations where the bitsets are all ones.)
Note, however, that using this bitset multiplexing feature on a
futex is less efficient than simply using multiple futexes,
because employing bitset multiplexing requires the kernel
to check all waiters on a futex,
including those that are not interested in being woken up
(i.e., they do not have the relevant bit set in their "wait" bitset).
.\" According to http://locklessinc.com/articles/futex_cheat_sheet/:
.\"
.\" "The original reason for the addition of these extensions
.\" was to improve the performance of pthread read-write locks
.\" in glibc. However, the pthreads library no longer uses the
.\" same locking algorithm, and these extensions are not used
.\" without the bitset parameter being all ones.
.\"
.\" The page goes on to note that the FUTEX_WAIT_BITSET operation
.\" is nevertheless used (with a bitset of all ones) in order to
.\" obtain the absolute timeout functionality that is useful
.\" for efficiently implementing Pthreads APIs (which use absolute
.\" timeouts); FUTEX_WAIT provides only relative timeouts.

The
.I uaddr2
and
.I timeout
arguments are ignored.
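
The following sketch shows how the two bitset operations might be wrapped.
The helper names wait_channel and wake_channel are hypothetical,
and the choice to always set FUTEX_CLOCK_REALTIME on the wait side
is an assumption of the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleep while *uaddr == expected, registering the
   "wait" bitset 'channels'.  A non-NULL abs_deadline is an absolute
   CLOCK_REALTIME time; NULL means wait indefinitely. */
long wait_channel(uint32_t *uaddr, uint32_t expected, uint32_t channels,
                  const struct timespec *abs_deadline)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr,
                   FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME,
                   expected, abs_deadline, NULL, channels);
}

/* Hypothetical helper: wake at most max_waiters waiters whose "wait"
   bitset has at least one bit in common with 'channels'. */
long wake_channel(uint32_t *uaddr, int max_waiters, uint32_t channels)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET,
                   max_waiters, NULL, NULL, channels);
}
```

A waiter registered with the bitset 1 << 3 is woken by wake_channel(uaddr, 1, 1 << 3)
but not by wake_channel(uaddr, 1, 1 << 4).
The kernel rejects a bitset of zero with the error EINVAL.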
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.SS Priority-inheritance futexes
Linux supports priority-inheritance (PI) futexes in order to handle
priority-inversion problems that can be encountered with
normal futex locks.
Priority inversion is the problem that occurs when a high-priority
task is blocked waiting to acquire a lock held by a low-priority task,
while tasks at an intermediate priority continuously preempt
the low-priority task from the CPU.
Consequently, the low-priority task makes no progress toward
releasing the lock, and the high-priority task remains blocked.

Priority inheritance is a mechanism for dealing with
the priority-inversion problem.
With this mechanism, when a high-priority task becomes blocked
by a lock held by a low-priority task,
the latter's priority is temporarily raised to that of the former,
so that it is not preempted by any intermediate-level tasks,
and can thus make progress toward releasing the lock.
To be effective, priority inheritance must be transitive,
meaning that if a high-priority task blocks on a lock
held by a lower-priority task that is itself blocked by a lock
held by another intermediate-priority task
(and so on, for chains of arbitrary length),
then both of those tasks
(or more generally, all of the tasks in a lock chain)
have their priorities raised to be the same as the high-priority task.

.\" FIXME XXX The following is my attempt at a definition of PI futexes,
.\" based on mail discussions with Darren Hart. Does it seem okay?
From a user-space perspective,
what makes a futex PI-aware is a policy agreement between user space
and the kernel about the value of the futex (described in a moment),
coupled with the use of the PI futex operations described below
(in particular,
.BR FUTEX_LOCK_PI ,
.BR FUTEX_TRYLOCK_PI ,
and
.BR FUTEX_CMP_REQUEUE_PI ).
.\" Quoting Darren Hart:
.\" These opcodes paired with the PI futex value policy (described below)
.\" defines a "futex" as PI aware. These were created very specifically
.\" in support of PI pthread_mutexes, so it makes a lot more sense to
.\" talk about a PI aware pthread_mutex, than a PI aware futex, since
.\" there is a lot of policy and scaffolding that has to be built up
.\" around it to use it properly (this is what a PI pthread_mutex is).

.\" FIXME XXX ===== Start of adapted Hart/Guniguntala text =====
.\" The following text is drawn from the Hart/Guniguntala paper,
.\" but I have reworded some pieces significantly. Please check it.
.\"
The PI futex operations described below differ from the other
futex operations in that they impose policy on the use of the futex value:
.IP * 3
If the lock is unowned, the futex value shall be 0.
.IP *
If the lock is owned, the futex value shall be the thread ID (TID; see
.BR gettid (2))
of the owning thread.
.IP *
.\" FIXME XXX In the following line, I added "the lock is owned and". Okay?
If the lock is owned and there are threads contending for the lock,
then the
.B FUTEX_WAITERS
bit shall be set in the futex value; in other words, the futex value is:

    FUTEX_WAITERS | TID

.PP
Note that a PI futex never just has the value
.BR FUTEX_WAITERS ,
which is a permissible state for non-PI futexes.

With this policy in place,
a user-space application can acquire an unowned
lock or release an uncontended lock using atomic
instructions executed in user space (e.g.,
.I cmpxchg
on the x86 architecture).
Locking an unowned lock simply consists of setting
the futex value to the caller's TID.
Releasing an uncontended lock simply requires setting the futex value to 0.

If a futex is currently owned (i.e., has a nonzero value),
waiters must employ the
.B FUTEX_LOCK_PI
operation to acquire the lock.
If a lock is contended (i.e., the
.B FUTEX_WAITERS
bit is set in the futex value), the lock owner must employ the
.B FUTEX_UNLOCK_PI
operation to release the lock.

In the cases where callers are forced into the kernel
(i.e., required to perform a
.BR futex ()
operation),
they then deal directly with a so-called RT-mutex,
a kernel locking mechanism which implements the required
priority-inheritance semantics.
After the RT-mutex is acquired, the futex value is updated accordingly,
before the calling thread returns to user space.
.\" FIXME ===== End of adapted Hart/Guniguntala text =====
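
The value policy above translates into a pair of compare-and-swap fast paths,
with the kernel entered only on contention.
The following is a minimal sketch under that policy, using C11 atomics;
the function names pi_lock and pi_unlock are hypothetical,
and robust-futex handling is deliberately left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: acquire a PI futex lock. */
int pi_lock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t tid = (uint32_t) syscall(SYS_gettid);

    assert(word != NULL);
    /* Fast path: unowned (0) -> owned by the caller (TID). */
    if (atomic_compare_exchange_strong(word, &expected, tid))
        return 0;
    /* Slow path: the lock is owned, so ask the kernel to block the
       caller on the RT-mutex and update the futex value. */
    return (int) syscall(SYS_futex, word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

/* Hypothetical helper: release a PI futex lock held by the caller. */
int pi_unlock(_Atomic uint32_t *word)
{
    uint32_t expected = (uint32_t) syscall(SYS_gettid);

    assert(word != NULL);
    /* Fast path: owned without waiters (TID) -> unowned (0). */
    if (atomic_compare_exchange_strong(word, &expected, 0))
        return 0;
    /* Slow path: FUTEX_WAITERS is set, so the kernel must hand the
       lock over to the top-priority waiter. */
    return (int) syscall(SYS_futex, word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}
```

The unlock fast path fails exactly when the value is no longer the bare TID,
which is how the owner learns that it may not release the lock in user space.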

It is important to note
.\" FIXME We need some explanation here of *why* it is important to
.\" note this
that the kernel will update the futex value prior
to returning to user space.
Unlike the other futex operations described above,
the PI futex operations are designed
for the implementation of very specific IPC mechanisms.
.\"
.\" FIXME XXX In discussing errors for FUTEX_CMP_REQUEUE_PI, Darren Hart
.\" made the observation that "EINVAL is returned if the non-pi
.\" to pi or op pairing semantics are violated."
.\" Probably there needs to be a general statement about this
.\" requirement, probably located at about this point in the page.
.\" Darren, care to take a shot at this?
.\"
.\" FIXME Somewhere on this page (I guess under the discussion of PI
.\" futexes) we need a discussion of the FUTEX_OWNER_DIED bit.
.\" Can someone propose a text?

PI futexes are operated on by specifying one of the following values in
.IR futex_op :
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_LOCK_PI " (since Linux 2.6.18)"
.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
.\"
.\" FIXME I did some significant rewording of tglx's text.
.\" Please check, in case I injected errors.
.\"
This operation is used after an attempt to acquire
the futex lock via an atomic user-space instruction failed
because the futex has a nonzero value; specifically,
because it contained the namespace-specific TID of the lock owner.
.\" FIXME In the preceding line, what does "namespace-specific" mean?
.\" (I kept those words from tglx.)
.\" That is, what kind of namespace are we talking about?
.\" (I suppose we are talking PID namespaces here, but I want to
.\" be sure.)

The operation checks the value of the futex at the address
.IR uaddr .
If the value is 0, then the kernel tries to atomically set
the futex value to the caller's TID.
If that fails,
.\" FIXME What would be the cause of failure?
or the futex value is nonzero,
the kernel atomically sets the
.B FUTEX_WAITERS
bit, which signals the futex owner that it cannot unlock the futex in
user space atomically by setting the futex value to 0.
After that, the kernel tries to find the thread which is
associated with the owner TID,
.\" FIXME Could I get a bit more detail on the next two lines?
.\" What is "creates or reuses kernel state" about?
creates or reuses kernel state on behalf of the owner
and attaches the waiter to it.
.\" FIXME In the next line, what type of "priority" are we talking about?
.\" Realtime priorities for SCHED_FIFO and SCHED_RR?
.\" Or something else?
The enqueueing of the waiter is in descending priority order if more
than one waiter exists.
.\" FIXME What does "bandwidth" refer to in the next line?
The owner inherits either the priority or the bandwidth of the waiter.
.\" FIXME In the preceding line, what determines whether the
.\" owner inherits the priority versus the bandwidth?
.\"
.\" FIXME Could I get some help translating the next sentence into
.\" something that user-space developers (and I) can understand?
.\" In particular, what are "nested locks" in this context?
This inheritance follows the lock chain in the case of
nested locking and performs deadlock detection.

.\" FIXME tglx says "The timeout argument is handled as described in
.\" FUTEX_WAIT." However, it appears to me that this is not right.
.\" Is the following formulation correct?
The
.I timeout
argument provides a timeout for the lock attempt.
It is interpreted as an absolute time, measured against the
.BR CLOCK_REALTIME
clock.
If
.I timeout
is NULL, the operation will block indefinitely.

The
.IR uaddr2 ,
.IR val ,
and
.IR val3
arguments are ignored.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)"
.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
This operation tries to acquire the futex at
.IR uaddr .
.\" FIXME I think it would be helpful here to say a few more words about
.\" the difference(s) between FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI.
.\" Can someone propose something?
.\"
It deals with the situation where the TID value at
.I uaddr
is 0, but the
.B FUTEX_WAITERS
bit is set.
.\" FIXME How does the situation in the previous sentence come about?
.\" Probably it would be helpful to say something about that in
.\" the man page.
.\" FIXME And *how* does FUTEX_TRYLOCK_PI deal with this situation?
User space cannot handle this condition in a race-free manner.

The
.IR uaddr2 ,
.IR val ,
.IR timeout ,
and
.IR val3
arguments are ignored.
.\"
.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
.\"
.TP
.BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)"
.\" commit c87e2837be82df479a6bae9f155c43516d2feebc
This operation wakes the top-priority waiter that is waiting in
.B FUTEX_LOCK_PI
on the futex address provided by the
.I uaddr
argument.

This is called when the user-space value at
.I uaddr
cannot be changed atomically from a TID (of the owner) to 0.

The
.IR uaddr2 ,
.IR val ,
.IR timeout ,
and
.IR val3
arguments are ignored.
70b06b90
MK
894.\"
895.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
896.\"
d67e21f5 897.TP
d67e21f5
MK
898.BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)"
899.\" commit 52400ba946759af28442dee6265c5c0180ac7122
f812a08b
DH
900This operation is a PI-aware variant of
901.BR FUTEX_CMP_REQUEUE .
902It requeues waiters that are blocked via
903.B FUTEX_WAIT_REQUEUE_PI
904on
905.I uaddr
906from a non-PI source futex
907.RI ( uaddr )
908to a PI target futex
909.RI ( uaddr2 ).
910
9e54d26d
MK
911As with
912.BR FUTEX_CMP_REQUEUE ,
913this operation wakes up a maximum of
914.I val
915waiters that are waiting on the futex at
916.IR uaddr .
917However, for
918.BR FUTEX_CMP_REQUEUE_PI ,
919.I val
6fbeb8f4 920is required to be 1
939ca89f 921(since the main point is to avoid a thundering herd).
9e54d26d
MK
922The remaining waiters are removed from the wait queue of the source futex at
923.I uaddr
924and added to the wait queue of the target futex at
925.IR uaddr2 .
f812a08b 926
9e54d26d 927The
768d3c23 928.I val2
c6d8cf21
MK
929.\" val2 is the cap on the number of requeued waiters.
930.\" In the glibc pthread_cond_broadcast() implementation, this argument
931.\" is specified as INT_MAX, and for pthread_cond_signal() it is 0.
9e54d26d 932and
768d3c23 933.I val3
9e54d26d
MK
934arguments serve the same purposes as for
935.BR FUTEX_CMP_REQUEUE .
70b06b90 936.\"
be376673
MK
937.\" FIXME The page at http://locklessinc.com/articles/futex_cheat_sheet/
938.\" notes that "priority-inheritance Futex to priority-inheritance
939.\" Futex requeues are currently unsupported". Do we need to say
940.\" something in the man page about that?
70b06b90
MK
941.\"
942.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
943.\"
d67e21f5
MK
944.TP
945.BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)"
946.\" commit 52400ba946759af28442dee6265c5c0180ac7122
70b06b90
MK
947.\"
948.\" FIXME I find the next sentence (from tglx) pretty hard to grok.
949.\" Could someone explain it a bit more.
6ff1b4c0
TG
950This operation waits on a non-PI futex at
951.I uaddr
952and may subsequently be requeued onto a PI futex at
953.IR uaddr2 .
954The wait operation on
955.I uaddr
956is the same as
957.BR FUTEX_WAIT .
70b06b90 958.\"
f1d2171d
MK
959.\" FIXME I'm not quite clear on the meaning of the following sentence.
960.\" Is this trying to say that while blocked in a
961.\" FUTEX_WAIT_REQUEUE_PI, it could happen that another
962.\" task does a FUTEX_WAKE on uaddr that simply causes
963.\" a normal wake, with the result that the FUTEX_WAIT_REQUEUE_PI
964.\" does not complete? What happens then to the FUTEX_WAIT_REQUEUE_PI
965.\" operation? Does it remain blocked, or does it unblock
966.\" In which case, what does user space see?
6ff1b4c0
TG
967The waiter can be removed from the wait on
968.I uaddr
969via
970.BR FUTEX_WAKE
971without requeueing on
972.IR uaddr2 .
a4e69912 973
63bea7dc
MK
974.\" FIXME Please check the following. tglx said "The timeout argument
975.\" is handled as described in FUTEX_WAIT.", but the truth is
976.\" as below, AFAICS
977If
978.I timeout
979is not NULL, it specifies a timeout for the wait operation;
980this timeout is interpreted as outlined above in the description of the
981.BR FUTEX_CLOCK_REALTIME
982option.
983If
984.I timeout
985is NULL, the operation can block indefinitely.
986
a4e69912
MK
987The
988.I val3
989argument is ignored.
70b06b90 990.\" FIXME Re the preceding sentence... Actually 'val3' is internally set to
a4e69912
MK
991.\" FUTEX_BITSET_MATCH_ANY before calling futex_wait_requeue_pi().
992.\" I'm not sure we need to say anything about this though.
993.\" Comments?
abb571e8
MK
994
995The
996.BR FUTEX_WAIT_REQUEUE_PI
997and
998.BR FUTEX_CMP_REQUEUE_PI
999operations were added to support a fairly specific use case:
1000support for priority-inheritance-aware POSIX threads condition variables.
1001The idea is that these operations should always be paired,
1002in order to ensure that user space and the kernel remain in sync.
1003Thus, in the
1004.BR FUTEX_WAIT_REQUEUE_PI
1005operation, the user-space application pre-specifies the target
1006of the requeue that takes place in the
1007.BR FUTEX_CMP_REQUEUE_PI
1008operation.
1009.\"
1010.\" Darren Hart notes that a patch to allow glibc to fully support
1011.\" PI-aware pthreads condition variables has not yet been accepted into
1012.\" glibc. The story is complex, and can be found at
1013.\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588
1014.\" Darren notes that in the meantime, the patch is shipped with various
1015.\" PREEMPT_RT enabled Linux systems.
1016.\"
1017.\" Related to the preceding, Darren proposed that somewhere, man-pages
1018.\" should document the following point:
1019.\" While the Linux kernel, since 2.6.31, supports requeueing of
1020.\" priority-inheritance (PI) aware mutexes via the
1021.\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations,
1022.\" the glibc implementation does not yet take full advantage of this.
1023.\" Specifically, the condvar internal data lock remains a non-PI aware
1024.\" mutex, regardless of the type of the pthread_mutex associated with
1025.\" the condvar. This can lead to an unbounded priority inversion on
1026.\" the internal data lock even when associating a PI aware
1027.\" pthread_mutex with a condvar during a pthread_cond*_wait
1028.\" operation. For this reason, it is not recommended to rely on
1029.\" priority inheritance when using pthread condition variables.
1030.\" The problem is that the obvious somewhere to place this text
1031.\" is the pthread_cond*wait(3) man page. However, such a man page
1032.\" does not currently exist.
70b06b90 1033.\"
6700de24 1034.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
70b06b90 1035.\"
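The PI locking operations described above can be combined into a small
user-space lock.
The following is a minimal sketch, not a definitive implementation.
It assumes the usual PI futex convention that the futex word holds 0 when
unlocked and the owner's thread ID when locked, and it assumes the GCC/Clang
__atomic builtins.
The names futex(), pi_lock(), pi_trylock(), and pi_unlock() are hypothetical
helpers introduced for illustration only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper: glibc provides none, so go through syscall(2). */
static long futex(uint32_t *uaddr, int futex_op, uint32_t val,
                  const struct timespec *timeout, uint32_t *uaddr2,
                  uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Try to move the futex word from 'from' to 'to' in one atomic step. */
static int cas_word(uint32_t *word, uint32_t from, uint32_t to)
{
    return __atomic_compare_exchange_n(word, &from, to, 0,
                                       __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);
}

/* Lock; abs_timeout is an absolute CLOCK_REALTIME time, or NULL to block. */
static int pi_lock(uint32_t *word, const struct timespec *abs_timeout)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (cas_word(word, 0, tid))
        return 0;                   /* uncontended: no system call needed */

    for (;;) {
        if (futex(word, FUTEX_LOCK_PI, 0, abs_timeout, NULL, 0) == 0)
            return 0;
        if (errno != EAGAIN)        /* EAGAIN: owner is exiting, try again */
            return -1;
    }
}

/* Non-blocking attempt; the kernel handles the TID==0 plus waiters case. */
static int pi_trylock(uint32_t *word)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (cas_word(word, 0, tid))
        return 0;
    return futex(word, FUTEX_TRYLOCK_PI, 0, NULL, NULL, 0) == 0 ? 0 : -1;
}

/* Unlock; fall back to the kernel when TID -> 0 cannot be done atomically
   (for example, because the FUTEX_WAITERS bit is set). */
static int pi_unlock(uint32_t *word)
{
    uint32_t tid = (uint32_t)syscall(SYS_gettid);

    if (cas_word(word, tid, 0))
        return 0;
    return futex(word, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0) == 0 ? 0 : -1;
}
```

In the uncontended case, none of these helpers makes a futex() call;
the kernel is involved only when the atomic update of the futex word fails.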
47297adb 1036.SH RETURN VALUE
fea681da 1037.PP
6f147f79 1038In the event of an error, all operations return \-1 and set
e808bba0 1039.I errno
6f147f79 1040to indicate the cause of the error.
e808bba0
MK
1041The return value on success depends on the operation,
1042as described in the following list:
fea681da
MK
1043.TP
1044.B FUTEX_WAIT
bdc5957a 1045Returns 0 if the caller was woken by a
682edefb 1046.B FUTEX_WAKE
7446a837
MK
1047or
1048.B FUTEX_WAKE_BITSET
682edefb 1049call.
fea681da
MK
1050.TP
1051.B FUTEX_WAKE
bdc5957a 1052Returns the number of waiters that were woken up.
fea681da
MK
1053.TP
1054.B FUTEX_FD
1055Returns the new file descriptor associated with the futex.
1056.TP
1057.B FUTEX_REQUEUE
bdc5957a 1058Returns the number of waiters that were woken up.
fea681da
MK
1059.TP
1060.B FUTEX_CMP_REQUEUE
bdc5957a
MK
1061Returns the total number of waiters that were woken up or
1062requeued to the futex at
3dfcc11d
MK
1063.IR uaddr2 .
1064If this value is greater than
1065.IR val ,
1066then the difference is the number of waiters requeued to the futex at
1067.IR uaddr2 .
dcad19c0
MK
1068.TP
1069.B FUTEX_WAKE_OP
f1d2171d 1070.\" FIXME XXX Is the following correct?
a8b5b324
MK
1071Returns the total number of waiters that were woken up.
1072This is the sum of the woken waiters on the two futexes at
1073.I uaddr
1074and
1075.IR uaddr2 .
dcad19c0
MK
1076.TP
1077.B FUTEX_WAIT_BITSET
f1d2171d 1078.\" FIXME XXX Is the following correct?
bdc5957a 1079Returns 0 if the caller was woken by a
7bcc5351
MK
1080.B FUTEX_WAKE
1081or
1082.B FUTEX_WAKE_BITSET
1083call.
dcad19c0
MK
1084.TP
1085.B FUTEX_WAKE_BITSET
f1d2171d 1086.\" FIXME XXX Is the following correct?
bdc5957a 1087Returns the number of waiters that were woken up.
dcad19c0
MK
1088.TP
1089.B FUTEX_LOCK_PI
f1d2171d 1090.\" FIXME XXX Is the following correct?
bf02a260 1091Returns 0 if the futex was successfully locked.
dcad19c0
MK
1092.TP
1093.B FUTEX_TRYLOCK_PI
f1d2171d 1094.\" FIXME XXX Is the following correct?
5c716eef 1095Returns 0 if the futex was successfully locked.
dcad19c0
MK
1096.TP
1097.B FUTEX_UNLOCK_PI
f1d2171d 1098.\" FIXME XXX Is the following correct?
52bb928f 1099Returns 0 if the futex was successfully unlocked.
dcad19c0
MK
1100.TP
1101.B FUTEX_CMP_REQUEUE_PI
f1d2171d 1102.\" FIXME XXX Is the following correct?
bdc5957a
MK
1103Returns the total number of waiters that were woken up or
1104requeued to the futex at
dddd395a
MK
1105.IR uaddr2 .
1106If this value is greater than
1107.IR val ,
1108then the difference is the number of waiters requeued to the futex at
1109.IR uaddr2 .
dcad19c0
MK
1110.TP
1111.B FUTEX_WAIT_REQUEUE_PI
f1d2171d 1112.\" FIXME XXX Is the following correct?
22c15de9
MK
1113Returns 0 if the caller was successfully requeued to the futex at
1114.IR uaddr2 .
70b06b90
MK
1115.\"
1116.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1117.\"
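The rule given above for the return value of
FUTEX_CMP_REQUEUE
can be expressed as a short calculation.
The following sketch assumes that the call succeeded (that is, the return
value is not \-1);
the type and function names are hypothetical and exist only for illustration.

```c
/* Hypothetical helper type: how a successful requeue result breaks down. */
struct requeue_counts {
    long woken;     /* waiters woken at uaddr */
    long requeued;  /* waiters moved to the futex at uaddr2 */
};

/* Split a successful return value 'ret' using the 'val' that was passed
   as the wake limit. If ret exceeds val, the excess was requeued. */
static struct requeue_counts split_requeue_result(long ret, long val)
{
    struct requeue_counts c;

    if (ret > val) {
        c.woken = val;
        c.requeued = ret - val;
    } else {
        c.woken = ret;
        c.requeued = 0;
    }
    return c;
}
```

For example, a return value of 5 with
val
equal to 1 means that one waiter was woken and four were requeued.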
fea681da
MK
1118.SH ERRORS
1119.TP
1120.B EACCES
1121No read access to futex memory.
1122.TP
1123.B EAGAIN
f48516d1
MK
1124.RB ( FUTEX_WAIT ,
1125.BR FUTEX_WAIT_REQUEUE_PI )
badbf70c
MK
1126The value pointed to by
1127.I uaddr
1128was not equal to the expected value
1129.I val
1130at the time of the call.
1131.TP
1132.B EAGAIN
8f2068bb
MK
1133.RB ( FUTEX_CMP_REQUEUE ,
1134.BR FUTEX_CMP_REQUEUE_PI )
ce5602fd 1135The value pointed to by
9f6c40c0
MK
1136.I uaddr
1137is not equal to the expected value
1138.IR val3 .
fd1dc4c2 1139.\" FIXME: Is the following sentence correct?
fea681da 1140(This probably indicates a race;
682edefb
MK
1141use the safe
1142.B FUTEX_WAKE
1143now.)
c0091dd3 1144.\"
f1d2171d 1145.\" FIXME XXX Should there be an EAGAIN case for FUTEX_TRYLOCK_PI?
c0091dd3
MK
1146.\" It seems so, looking at the handling of the rt_mutex_trylock()
1147.\" call in futex_lock_pi()
1148.\"
fea681da 1149.TP
5662f56a
MK
1150.BR EAGAIN
1151.RB ( FUTEX_LOCK_PI ,
aaec9032
MK
1152.BR FUTEX_TRYLOCK_PI ,
1153.BR FUTEX_CMP_REQUEUE_PI )
1154The futex owner thread ID of
1155.I uaddr
1156(for
1157.BR FUTEX_CMP_REQUEUE_PI :
1158.IR uaddr2 )
1159is about to exit,
5662f56a
MK
1160but has not yet handled the internal state cleanup.
1161Try again.
61f8c1d1 1162.\"
f1d2171d 1163.\" FIXME XXX Is there not also an EAGAIN error case on 'uaddr2' for
61f8c1d1
MK
1164.\" FUTEX_REQUEUE and FUTEX_CMP_REQUEUE via
1165.\" futex_requeue() ==> futex_proxy_trylock_atomic() ==>
1166.\" futex_lock_pi_atomic() ==> attach_to_pi_owner() ==> EAGAIN?
5662f56a 1167.TP
7a39e745
MK
1168.BR EDEADLK
1169.RB ( FUTEX_LOCK_PI ,
1170.BR FUTEX_TRYLOCK_PI )
1171The futex at
1172.I uaddr
1173is already locked by the caller.
d08ce5dd 1174.\"
f1d2171d 1175.\" FIXME XXX Is there not also an EDEADLK error case on 'uaddr2' for
d08ce5dd
MK
1176.\" FUTEX_REQUEUE and FUTEX_CMP_REQUEUE via
1177.\" futex_requeue() ==> futex_proxy_trylock_atomic() ==>
1178.\" futex_lock_pi_atomic() ==> attach_to_pi_owner() ==> EDEADLK?
7a39e745 1179.TP
662c0da8
MK
1180.BR EDEADLK
1181.\" FIXME I reworded tglx's text somewhat; is the following okay?
f1d2171d
MK
1182.\" FIXME XXX I see that kernel/locking/rtmutex.c uses EDEADLK in some places,
1183.\" and EDEADLOCK in others. On almost all architectures these
1184.\" constants are synonymous. Is there a reason that both names
1185.\" are used?
662c0da8
MK
1186.RB ( FUTEX_CMP_REQUEUE_PI )
1187While requeueing a waiter to the PI futex at
1188.IR uaddr2 ,
1189the kernel detected a deadlock.
1190.TP
fea681da 1191.B EFAULT
1ea901e8
MK
1192A required pointer argument (i.e.,
1193.IR uaddr ,
1194.IR uaddr2 ,
1195or
1196.IR timeout )
496df304 1197did not point to a valid user-space address.
fea681da 1198.TP
9f6c40c0 1199.B EINTR
e808bba0 1200A
9f6c40c0 1201.B FUTEX_WAIT
2674f781
MK
1202or
1203.B FUTEX_WAIT_BITSET
e808bba0
MK
1204operation was interrupted by a signal (see
1205.BR signal (7))
1206or a spurious wakeup.
5eeca856
MK
1207.\" FIXME
1208.\" Regarding the words "spurious wakeup" above, I received this
1209.\" bug report from Rich Felker:
1210.\"
1211.\" I see no code in the kernel whereby a "spurious wakeup", or anything
1212.\" other than interruption by a signal handler that's not SA_RESTART,
1213.\" can cause futex to fail with EINTR. In general, overloading of EINTR
1214.\" and/or spurious EINTRs from a syscall make it impossible to use that
1215.\" syscall for implementing any function where EINTR is a mandatory
1216.\" failure on interruption-by-signal, since there is no way for
1217.\" userspace to distinguish whether the EINTR occurred as a result of
1218.\" an interrupting signal or some other reason. The kernel folks have
1219.\" gone to great lengths to fix spurious EINTRs (see signal(7) for
1220.\" history), especially by non-interrupting signal handlers, including
1221.\" in futex, and allowing EINTR here would be contrary to that goal.
1222.\"
1223.\" It's my belief that the "or a spurious wakeup" text should simply be
1224.\" removed.
1225.\"
1226.\" The reason I'm raising this topic is its relevance to a thread on
1227.\" libc-alpha:
1228.\" [RFC] mutex destruction (#13690): problem description and workarounds
1229.\"
1230.\" The bug and mailing list discussions to which Rich refers are:
1231.\" https://sourceware.org/bugzilla/show_bug.cgi?id=13690
1232.\" https://sourceware.org/ml/libc-alpha/2014-12/threads.html#0001
1233.\"
1234.\" Can anyone comment on whether the words "spurious wakeup" are correct?
1235.\"
9f6c40c0 1236.TP
fea681da 1237.B EINVAL
180f97b7
MK
1238The operation in
1239.IR futex_op
1240is one of those that employs a timeout, but the supplied
fb2f4c27
MK
1241.I timeout
1242argument was invalid
1243.RI ( tv_sec
1244was less than zero, or
1245.IR tv_nsec
1246was not less than 1,000,000,000).
1247.TP
1248.B EINVAL
0c74df0b 1249The operation specified in
025e1374 1250.IR futex_op
0c74df0b 1251employs one or both of the pointers
51ee94be 1252.I uaddr
a1f47699 1253and
0c74df0b
MK
1254.IR uaddr2 ,
1255but one of these does not point to a valid object\(emthat is,
1256the address is not four-byte-aligned.
51ee94be
MK
1257.TP
1258.B EINVAL
55cc422d
TG
1259.RB ( FUTEX_WAIT_BITSET ,
1260.BR FUTEX_WAKE_BITSET )
79c9b436
TG
1261The bitset supplied in
1262.IR val3
1263is zero.
1264.TP
1265.B EINVAL
2043f2c1 1266.RB ( FUTEX_REQUEUE ,
f1d2171d 1267.\" FIXME XXX tglx suggested adding this, but does this error really occur for
2043f2c1
MK
1268.\" FUTEX_REQUEUE? (The case where it occurs for FUTEX_CMP_REQUEUE_PI
1269.\" is obvious at the start of futex_requeue().)
f1d2171d
MK
1270.\" Darren Hart seems to agree with me that it does not occur for
1271.\" FUTEX_REQUEUE. If Darren and I turn out to be wrong, then
1272.\" FUTEX_CMP_REQUEUE probably also needs to be added here.
2043f2c1 1273.BR FUTEX_CMP_REQUEUE_PI )
add875c0
MK
1274.I uaddr
1275equals
1276.IR uaddr2
1277(i.e., an attempt was made to requeue to the same futex).
1278.TP
ff597681
MK
1279.BR EINVAL
1280.RB ( FUTEX_FD )
1281The signal number supplied in
1282.I val
1283is invalid.
1284.TP
6bac3b85 1285.B EINVAL
476debd7
MK
1286.RB ( FUTEX_WAKE ,
1287.BR FUTEX_WAKE_OP ,
1288.BR FUTEX_WAKE_BITSET ,
1289.BR FUTEX_REQUEUE ,
1290.BR FUTEX_CMP_REQUEUE )
1291The kernel detected an inconsistency between the user-space state at
1292.I uaddr
1293and the kernel state\(emthat is, it detected a waiter which waits in
1294.BR FUTEX_LOCK_PI
1295on
1296.IR uaddr .
1297.TP
1298.B EINVAL
a218ef20 1299.RB ( FUTEX_LOCK_PI ,
ce022f18
MK
1300.BR FUTEX_TRYLOCK_PI ,
1301.BR FUTEX_UNLOCK_PI )
a218ef20
MK
1302The kernel detected an inconsistency between the user-space state at
1303.I uaddr
1304and the kernel state.
ce022f18
MK
1305This indicates either state corruption
1306.\" FIXME tglx did not mention the "state corruption" for FUTEX_UNLOCK_PI.
1307.\" Does that case also apply for FUTEX_UNLOCK_PI?
1308or that the kernel found a waiter on
a218ef20
MK
1309.I uaddr
1310which is waiting via
1311.BR FUTEX_WAIT
1312or
1313.BR FUTEX_WAIT_BITSET .
1314.TP
1315.B EINVAL
f9250b1a
MK
1316.RB ( FUTEX_CMP_REQUEUE_PI )
1317The kernel detected an inconsistency between the user-space state at
99c0041d
MK
1318.I uaddr2
1319and the kernel state;
1320that is, the kernel detected a waiter which waits via
1321.BR FUTEX_WAIT
1322.\" FIXME tglx did not mention FUTEX_WAIT_BITSET here,
1323.\" but should that not also be included here?
1324on
1325.IR uaddr2 .
1326.TP
1327.B EINVAL
1328.RB ( FUTEX_CMP_REQUEUE_PI )
1329The kernel detected an inconsistency between the user-space state at
f9250b1a
MK
1330.I uaddr
1331and the kernel state;
1332that is, the kernel detected a waiter which waits via
75299c8d 1333.BR FUTEX_WAIT
99c0041d 1334or
75299c8d 1335.BR FUTEX_WAIT_BITSET
f9250b1a
MK
1336on
1337.IR uaddr .
1338.TP
1339.B EINVAL
99c0041d 1340.RB ( FUTEX_CMP_REQUEUE_PI )
75299c8d
MK
1341The kernel detected an inconsistency between the user-space state at
1342.I uaddr
1343and the kernel state;
1344that is, the kernel detected a waiter which waits on
1345.I uaddr
1346via
1347.BR FUTEX_LOCK_PI
1348(instead of
1349.BR FUTEX_WAIT_REQUEUE_PI ).
99c0041d
MK
1350.TP
1351.B EINVAL
9786b3ca 1352.RB ( FUTEX_CMP_REQUEUE_PI )
f1d2171d 1353.\" FIXME XXX The following is a reworded version of Darren Hart's text.
9786b3ca
MK
1354.\" Please check that I did not introduce any errors.
1355An attempt was made to requeue a waiter to a futex other than that
1356specified by the matching
1357.B FUTEX_WAIT_REQUEUE_PI
1358call for that waiter.
1359.TP
1360.B EINVAL
f0c0d61c
MK
1361.RB ( FUTEX_CMP_REQUEUE_PI )
1362The
1363.I val
1364argument is not 1.
1365.TP
1366.B EINVAL
4832b48a 1367Invalid argument.
fea681da 1368.TP
a449c634
MK
1369.BR ENOMEM
1370.RB ( FUTEX_LOCK_PI ,
e34a8fb6
MK
1371.BR FUTEX_TRYLOCK_PI ,
1372.BR FUTEX_CMP_REQUEUE_PI )
a449c634
MK
1373The kernel could not allocate memory to hold state information.
1374.TP
fea681da 1375.B ENFILE
ff597681 1376.RB ( FUTEX_FD )
fea681da 1377The system limit on the total number of open files has been reached.
4701fc28
MK
1378.TP
1379.B ENOSYS
1380Invalid operation specified in
d33602c4 1381.IR futex_op .
9f6c40c0 1382.TP
4a7e5b05
MK
1383.B ENOSYS
1384The
1385.BR FUTEX_CLOCK_REALTIME
1386option was specified in
1afcee7c 1387.IR futex_op ,
4a7e5b05
MK
1388but the accompanying operation was neither
1389.BR FUTEX_WAIT_BITSET
1390nor
1391.BR FUTEX_WAIT_REQUEUE_PI .
1392.TP
a9dcb4d1
MK
1393.BR ENOSYS
1394.RB ( FUTEX_LOCK_PI ,
f2424fae 1395.BR FUTEX_TRYLOCK_PI ,
4945ff19 1396.BR FUTEX_UNLOCK_PI ,
4cf92894 1397.BR FUTEX_CMP_REQUEUE_PI ,
794bb106 1398.BR FUTEX_WAIT_REQUEUE_PI )
a9dcb4d1 1399A run-time check determined that the operation is not available.
a2ebebcd
MK
1400The PI futex operations are not implemented on all architectures and
1401are not supported on some CPU variants.
a9dcb4d1 1402.TP
c7589177
MK
1403.BR EPERM
1404.RB ( FUTEX_LOCK_PI ,
dc2742a8
MK
1405.BR FUTEX_TRYLOCK_PI ,
1406.BR FUTEX_CMP_REQUEUE_PI )
04331c3f 1407The caller is not allowed to attach itself to the futex at
dc2742a8
MK
1408.I uaddr
1409(for
1410.BR FUTEX_CMP_REQUEUE_PI :
1411the futex at
1412.IR uaddr2 ).
c7589177 1413(This may be caused by a state corruption in user space.)
61f8c1d1 1414.\"
f1d2171d 1415.\" FIXME XXX Is there not also an EPERM error case on 'uaddr2' for
61f8c1d1
MK
1416.\" FUTEX_REQUEUE and FUTEX_CMP_REQUEUE via
1417.\" futex_requeue() ==> futex_proxy_trylock_atomic() ==>
1418.\" futex_lock_pi_atomic() ==> attach_to_pi_owner() ==> EPERM?
c7589177 1419.TP
76f347ba 1420.BR EPERM
87276709 1421.RB ( FUTEX_UNLOCK_PI )
76f347ba
MK
1422The caller does not own the futex.
1423.TP
0b0e4934
MK
1424.BR ESRCH
1425.RB ( FUTEX_LOCK_PI ,
1426.BR FUTEX_TRYLOCK_PI )
1427.\" FIXME I reworded the following sentence a bit differently from
1428.\" tglx's formulation. Is it okay?
1429The thread ID in the futex at
1430.I uaddr
1431does not exist.
61f8c1d1 1432.\"
f1d2171d 1433.\" FIXME XXX Is there not also an ESRCH error case on 'uaddr2' for
61f8c1d1
MK
1434.\" FUTEX_REQUEUE and FUTEX_CMP_REQUEUE via
1435.\" futex_requeue() ==> futex_proxy_trylock_atomic() ==>
1436.\" futex_lock_pi_atomic() ==> attach_to_pi_owner() ==> ESRCH?
0b0e4934 1437.TP
360f773c
MK
1438.BR ESRCH
1439.RB ( FUTEX_CMP_REQUEUE_PI )
1440.\" FIXME I reworded the following sentence a bit differently from
1441.\" tglx's formulation. Is it okay?
1442The thread ID in the futex at
1443.I uaddr2
1444does not exist.
1445.TP
9f6c40c0 1446.B ETIMEDOUT
4d85047f
MK
1447The operation in
1448.IR futex_op
1449employed the timeout specified in
1450.IR timeout ,
1451and the timeout expired before the operation completed.
70b06b90
MK
1452.\"
1453.\""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
1454.\"
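Several of the errors above concern the
timeout
argument.
The following sketch shows one way a caller might check a timeout before
passing it to the kernel, and build an absolute
CLOCK_REALTIME
deadline of the kind used by
FUTEX_LOCK_PI.
The helper names are hypothetical.
Rejecting a negative
tv_nsec
and requiring a non-negative millisecond count are assumptions of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Mirrors the EINVAL rule: tv_sec must not be negative and tv_nsec must be
   less than 1,000,000,000 (a negative tv_nsec is also rejected here). */
static bool futex_timeout_is_valid(const struct timespec *ts)
{
    return ts->tv_sec >= 0 && ts->tv_nsec >= 0 && ts->tv_nsec < NSEC_PER_SEC;
}

/* Add 'ms' milliseconds (assumed >= 0) to 'base', keeping tv_nsec in range. */
static struct timespec timespec_add_ms(struct timespec base, long ms)
{
    base.tv_sec += ms / 1000;
    base.tv_nsec += (ms % 1000) * 1000000L;
    if (base.tv_nsec >= NSEC_PER_SEC) {
        base.tv_sec += 1;
        base.tv_nsec -= NSEC_PER_SEC;
    }
    return base;
}

/* Compute an absolute deadline 'ms' milliseconds from now, measured
   against CLOCK_REALTIME. Returns 0 on success, -1 on failure. */
static int realtime_deadline(long ms, struct timespec *out)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) == -1)
        return -1;
    *out = timespec_add_ms(now, ms);
    return 0;
}
```

A caller that receives
ETIMEDOUT
after using such a deadline knows that the deadline passed before the
operation completed.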
47297adb 1455.SH VERSIONS
a1d5f77c 1456.PP
81c9d87e
MK
1457Futexes were first made available in a stable kernel release
1458with Linux 2.6.0.
1459
a1d5f77c
MK
1460Initial futex support was merged in Linux 2.5.7 but with different semantics
1461from what was described above.
52dee70e 1462A four-argument system call with the semantics
fd3fa7ef 1463described in this page was introduced in Linux 2.5.40.
11b520ed 1464In Linux 2.5.70, one argument
a1d5f77c 1465was added.
11b520ed 1466In Linux 2.6.7, a sixth argument was added\(emmessy, especially
a1d5f77c 1467on the s390 architecture.
47297adb 1468.SH CONFORMING TO
8382f16d 1469This system call is Linux-specific.
47297adb 1470.SH NOTES
baf0f1f4
MK
1471Glibc does not provide a wrapper for this system call; call it using
1472.BR syscall (2).
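
Because there is no glibc wrapper, programs usually define their own.
The following is a minimal sketch of such a wrapper, together with a wait
loop that handles the
EAGAIN
and
EINTR
errors described under ERRORS.
The names futex() and wait_while_equal() are hypothetical,
and the use of the GCC/Clang __atomic_load_n builtin is an assumption.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical wrapper around the raw system call. */
static long futex(uint32_t *uaddr, int futex_op, uint32_t val,
                  const struct timespec *timeout, uint32_t *uaddr2,
                  uint32_t val3)
{
    return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
}

/* Block for as long as *addr still holds 'val'.
   EAGAIN means the value had already changed; EINTR means a signal
   interrupted the wait. In both cases the value is simply rechecked.
   Returns 0 once *addr differs from 'val', or -1 on any other error. */
static int wait_while_equal(uint32_t *addr, uint32_t val)
{
    while (__atomic_load_n(addr, __ATOMIC_ACQUIRE) == val) {
        if (futex(addr, FUTEX_WAIT, val, NULL, NULL, 0) == -1 &&
            errno != EAGAIN && errno != EINTR)
            return -1;
    }
    return 0;
}
```

A waker would change the value at the address and then call
futex() with
FUTEX_WAKE
on the same address.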
47297adb 1473.SH SEE ALSO
4c222281 1474.ad l
9913033c 1475.BR get_robust_list (2),
d806bc05 1476.BR restart_syscall (2),
14d8dd3b 1477.BR futex (7)
fea681da 1478.PP
f5ad572f
MK
1479The following kernel source files:
1480.IP * 2
1481.I Documentation/pi-futex.txt
1482.IP *
1483.I Documentation/futex-requeue-pi.txt
1484.IP *
1485.I Documentation/locking/rt-mutex.txt
1486.IP *
1487.I Documentation/locking/rt-mutex-design.txt
8fe019c7
MK
1488.IP *
1489.I Documentation/robust-futex-ABI.txt
43b99089 1490.PP
4c222281 1491Franke, H., Russell, R., and Kirkwood, M., 2002.
52087dd3 1492\fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP
4c222281 1493(from proceedings of the Ottawa Linux Symposium 2002),
9b936e9e 1494.br
608bf950
SK
1495.UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002-pages-479-495.pdf
1496.UE
f42eb21b 1497
4c222281 1498Hart, D., 2009. \fIA futex overview and update\fP,
2ed26199
MK
1499.UR http://lwn.net/Articles/360699/
1500.UE
1501
4c222281 1502Hart, D. and Guniguntala, D., 2009.
0483b6cc 1503\fIRequeue-PI: Making Glibc Condvars PI-Aware\fP
4c222281 1504(from proceedings of the 2009 Real-Time Linux Workshop),
0483b6cc
MK
1505.UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf
1506.UE
1507
4c222281 1508Drepper, U., 2011. \fIFutexes Are Tricky\fP,
f42eb21b
MK
1509.UR http://www.akkadia.org/drepper/futex.pdf
1510.UE
9b936e9e
MK
1511.PP
1512Futex example library, futex-*.tar.bz2 at
1513.br
a605264d 1514.UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/
608bf950 1515.UE
34f14794
MK
1516.\"
1517.\" FIXME Are there any other resources that should be listed
1518.\" in the SEE ALSO section?