]>
Commit | Line | Data |
---|---|---|
8f0aff2a | 1 | .\" Page by b.hubert |
1abce893 MK |
2 | .\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de> |
3 | .\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com> | |
2297bf0e | 4 | .\" |
2e46a6e7 | 5 | .\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE) |
8f0aff2a | 6 | .\" may be freely modified and distributed |
8ff7380d | 7 | .\" %%%LICENSE_END |
fea681da MK |
8 | .\" |
9 | .\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com) | |
10 | .\" added ERRORS section. | |
11 | .\" | |
12 | .\" Modified 2004-06-17 mtk | |
13 | .\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE | |
14 | .\" | |
47f5c4ba | 15 | .\" FIXME Still to integrate are some points from Torvald Riegel's mail of |
9915ea23 | 16 | .\" 2015-01-23: |
47f5c4ba MK |
17 | .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977 |
18 | .\" | |
78e85692 | 19 | .\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail |
9915ea23 | 20 | .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242 |
02182e7c | 21 | .\" |
4c1c5274 | 22 | .TH futex 2 (date) "Linux man-pages (unreleased)" |
fea681da | 23 | .SH NAME |
ce154705 | 24 | futex \- fast user-space locking |
504866e4 AC |
25 | .SH LIBRARY |
26 | Standard C library | |
8fc3b2cf | 27 | .RI ( libc ", " \-lc ) |
fea681da | 28 | .SH SYNOPSIS |
9d9dc1e8 | 29 | .nf |
c6d039a3 | 30 | .P |
0aa385fe AC |
31 | .BR "#include <linux/futex.h>" " /* Definition of " FUTEX_* " constants */" |
32 | .BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */" | |
33 | .B #include <unistd.h> | |
c6d039a3 | 34 | .P |
0aa385fe AC |
35 | .BI "long syscall(SYS_futex, uint32_t *" uaddr ", int " futex_op \ |
36 | ", uint32_t " val , | |
37 | .BI " const struct timespec *" timeout , \ | |
9bfc9cb1 | 38 | " \fR /* or: \fBuint32_t \fIval2\fP */" |
0aa385fe | 39 | .BI " uint32_t *" uaddr2 ", uint32_t " val3 ); |
9d9dc1e8 | 40 | .fi |
c6d039a3 | 41 | .P |
b939d6e4 | 42 | .IR Note : |
0aa385fe AC |
43 | glibc provides no wrapper for |
44 | .BR futex (), | |
45 | necessitating the use of | |
46 | .BR syscall (2). | |
47297adb | 47 | .SH DESCRIPTION |
fea681da | 48 | The |
e511ffb6 | 49 | .BR futex () |
4b35dc5d | 50 | system call provides a method for waiting until a certain condition becomes |
077981d4 MK |
51 | true. |
52 | It is typically used as a blocking construct in the context of | |
d45f244c MK |
53 | shared-memory synchronization. |
54 | When using futexes, the majority of | |
55 | the synchronization operations are performed in user space. | |
bc54ed38 | 56 | A user-space program employs the |
d45f244c | 57 | .BR futex () |
ca4e5b2b | 58 | system call only when it is likely that the program has to block for |
4c8cb0ff | 59 | a longer time until the condition becomes true. |
bc54ed38 | 60 | Other |
d45f244c | 61 | .BR futex () |
bc54ed38 MK |
62 | operations can be used to wake any processes or threads waiting |
63 | for a particular condition. | |
c6d039a3 | 64 | .P |
36546c38 AC |
65 | A futex is a 32-bit value\[em]referred to below as a |
66 | .IR "futex word" \[em]whose | |
7e8dcabc | 67 | address is supplied to the |
4b35dc5d | 68 | .BR futex () |
7e8dcabc | 69 | system call. |
c3f4c019 | 70 | (Futexes are 32 bits in size on all platforms, including 64-bit systems.) |
7e8dcabc MK |
71 | All futex operations are governed by this value. |
72 | In order to share a futex between processes, | |
73 | the futex is placed in a region of shared memory, | |
74 | created using (for example) | |
75 | .BR mmap (2) | |
76 | or | |
77 | .BR shmat (2). | |
c3f4c019 | 78 | (Thus, the futex word may have different |
7e8dcabc MK |
79 | virtual addresses in different processes, |
80 | but these addresses all refer to the same location in physical memory.) | |
ca4e5b2b MK |
81 | In a multithreaded program, it is sufficient to place the futex word |
82 | in a global variable shared by all threads. | |
c6d039a3 | 83 | .P |
0c3ec26b MK |
84 | When executing a futex operation that requests to block a thread, |
85 | the kernel will block only if the futex word has the value that the | |
55f9e85e MK |
86 | calling thread supplied (as one of the arguments of the |
87 | .BR futex () | |
88 | call) as the expected value of the futex word. | |
9d32a39b MK |
89 | The loading of the futex word's value, |
90 | the comparison of that value with the expected value, | |
bc54ed38 | 91 | and the actual blocking will happen atomically and will be totally ordered |
da894b18 | 92 | with respect to concurrent operations performed by other threads |
0fb87d16 | 93 | on the same futex word. |
da894b18 MK |
94 | .\" Notes from Darren Hart (Dec 2015): |
95 | .\" Totally ordered with respect futex operations refers to semantics | |
96 | .\" of the ACQUIRE/RELEASE operations and how they impact ordering of | |
97 | .\" memory reads and writes. The kernel futex operations are protected | |
f6615c42 | 98 | .\" by spinlocks, which ensure that all operations are serialized |
da894b18 MK |
99 | .\" with respect to one another. |
100 | .\" | |
101 | .\" This is a lot to attempt to define in this document. Perhaps a | |
102 | .\" reference to linux/Documentation/memory-barriers.txt as a footnote | |
103 | .\" would be sufficient? Or perhaps for this manual, "serialized" would | |
104 | .\" be sufficient, with a footnote regarding "totally ordered" and a | |
105 | .\" pointer to the memory-barrier documentation? | |
b80daba2 | 106 | Thus, the futex word is used to connect the synchronization in user space |
9d32a39b | 107 | with the implementation of blocking by the kernel. |
55f9e85e | 108 | Analogously to an atomic |
4b35dc5d | 109 | compare-and-exchange operation that potentially changes shared memory, |
077981d4 | 110 | blocking via a futex is an atomic compare-and-block operation. |
d6bb5a38 | 111 | .\" FIXME(Torvald Riegel): |
61066e14 MK |
112 | .\" Eventually we want to have some text in NOTES to satisfy |
113 | .\" the reference in the following sentence | |
114 | .\" See NOTES for a detailed specification of | |
115 | .\" the synchronization semantics. | |
c6d039a3 | 116 | .P |
ca4e5b2b | 117 | One use of futexes is for implementing locks. |
c0dc758e MK |
118 | The state of the lock (i.e., acquired or not acquired) |
119 | can be represented as an atomically accessed flag in shared memory. | |
4c8cb0ff | 120 | In the uncontended case, |
c3f4c019 | 121 | a thread can access or modify the lock state with atomic instructions, |
4c8cb0ff MK |
122 | for example atomically changing it from not acquired to acquired |
123 | using an atomic compare-and-exchange instruction. | |
55f9e85e MK |
124 | (Such instructions are performed entirely in user mode, |
125 | and the kernel maintains no information about the lock state.) | |
126 | On the other hand, a thread may be unable to acquire a lock because | |
8e754e12 | 127 | it is already acquired by another thread. |
55f9e85e | 128 | It then may pass the lock's flag as a futex word and the value |
0c3ec26b | 129 | representing the acquired state as the expected value to a |
8e754e12 HS |
130 | .BR futex () |
131 | wait operation. | |
55f9e85e | 132 | This |
8e754e12 | 133 | .BR futex () |
bc54ed38 | 134 | operation will block if and only if the lock is still acquired |
f6615c42 | 135 | (i.e., the value in the futex word still matches the "acquired state"). |
077981d4 | 136 | When releasing the lock, a thread has to first reset the |
0c3ec26b | 137 | lock state to not acquired and then execute a futex |
55f9e85e | 138 | operation that wakes threads blocked on the lock flag used as a futex word |
f6615c42 | 139 | (this can be further optimized to avoid unnecessary wake-ups). |
077981d4 | 140 | See |
4b35dc5d TR |
141 | .BR futex (7) |
142 | for more detail on how to use futexes. | |
c6d039a3 | 143 | .P |
4b35dc5d | 144 | Besides the basic wait and wake-up futex functionality, there are further |
077981d4 | 145 | futex operations aimed at supporting more complex use cases. |
c6d039a3 | 146 | .P |
ca4e5b2b | 147 | Note that |
2af84f99 | 148 | no explicit initialization or destruction is necessary to use futexes; |
4c8cb0ff MK |
149 | the kernel maintains a futex |
150 | (i.e., the kernel-internal implementation artifact) | |
4b35dc5d TR |
151 | only while operations such as |
152 | .BR FUTEX_WAIT , | |
153 | described below, are being performed on a particular futex word. | |
a663ca5a MK |
154 | .\" |
155 | .SS Arguments | |
fea681da MK |
156 | The |
157 | .I uaddr | |
077981d4 MK |
158 | argument points to the futex word. |
159 | On all platforms, futexes are four-byte | |
4b35dc5d | 160 | integers that must be aligned on a four-byte boundary. |
f388ba70 MK |
161 | The operation to perform on the futex is specified in the |
162 | .I futex_op | |
163 | argument; | |
1ae6b2c7 | 164 | .I val |
f388ba70 MK |
165 | is a value whose meaning and purpose depends on |
166 | .IR futex_op . | |
c6d039a3 | 167 | .P |
36ab2074 MK |
168 | The remaining arguments |
169 | .RI ( timeout , | |
170 | .IR uaddr2 , | |
171 | and | |
172 | .IR val3 ) | |
173 | are required only for certain of the futex operations described below. | |
174 | Where one of these arguments is not required, it is ignored. | |
c6d039a3 | 175 | .P |
36ab2074 MK |
176 | For several blocking operations, the |
177 | .I timeout | |
178 | argument is a pointer to a | |
1ae6b2c7 | 179 | .I timespec |
36ab2074 MK |
180 | structure that specifies a timeout for the operation. |
181 | However, notwithstanding the prototype shown above, for some operations, | |
eb4aa521 MK |
182 | the least significant four bytes of this argument are instead |
183 | used as an integer whose meaning is determined by the operation. | |
768d3c23 MK |
184 | For these operations, the kernel casts the |
185 | .I timeout | |
10022b8e HS |
186 | value first to |
187 | .IR "unsigned long", | |
188 | then to | |
c6dc40a2 | 189 | .IR uint32_t , |
768d3c23 MK |
190 | and in the remainder of this page, this argument is referred to as |
191 | .I val2 | |
192 | when interpreted in this fashion. | |
c6d039a3 | 193 | .P |
de5a3bb4 | 194 | Where it is required, the |
1ae6b2c7 | 195 | .I uaddr2 |
4c8cb0ff MK |
196 | argument is a pointer to a second futex word that is employed |
197 | by the operation. | |
c6d039a3 | 198 | .P |
36ab2074 MK |
199 | The interpretation of the final integer argument, |
200 | .IR val3 , | |
201 | depends on the operation. | |
a663ca5a MK |
202 | .\" |
203 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
204 | .\" | |
205 | .SS Futex operations | |
6be4bad7 | 206 | The |
d33602c4 | 207 | .I futex_op |
6be4bad7 MK |
208 | argument consists of two parts: |
209 | a command that specifies the operation to be performed, | |
5d771a4a | 210 | bitwise ORed with zero or more options that |
6be4bad7 | 211 | modify the behaviour of the operation. |
fc30eb79 | 212 | The options that may be included in |
d33602c4 | 213 | .I futex_op |
fc30eb79 TG |
214 | are as follows: |
215 | .TP | |
216 | .BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)" | |
217 | .\" commit 34f01cc1f512fa783302982776895c73714ebbc2 | |
218 | This option bit can be employed with all futex operations. | |
e45f9735 | 219 | It tells the kernel that the futex is process-private and not shared |
0c3ec26b MK |
220 | with another process (i.e., it is being used for synchronization |
221 | only between threads of the same process). | |
943ccc52 MK |
222 | This allows the kernel to make some additional performance optimizations. |
223 | .\" I.e., It allows the kernel choose the fast path for validating | |
224 | .\" the user-space address and avoids expensive VMA lookups, | |
225 | .\" taking reference counts on file backing store, and so on. | |
efeece04 | 226 | .IP |
ae2c1774 | 227 | As a convenience, |
eeeee811 | 228 | .I <linux/futex.h> |
ae2c1774 | 229 | defines a set of constants with the suffix |
eeeee811 | 230 | .B _PRIVATE |
ae2c1774 | 231 | that are equivalents of all of the operations listed below, |
dcdfde26 | 232 | .\" except the obsolete FUTEX_FD, for which the "private" flag was |
ae2c1774 MK |
233 | .\" meaningless |
234 | but with the | |
1ae6b2c7 | 235 | .B FUTEX_PRIVATE_FLAG |
ae2c1774 MK |
236 | ORed into the constant value. |
237 | Thus, there are | |
238 | .BR FUTEX_WAIT_PRIVATE , | |
239 | .BR FUTEX_WAKE_PRIVATE , | |
240 | and so on. | |
2e98bbc2 TG |
241 | .TP |
242 | .BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)" | |
243 | .\" commit 1acdac104668a0834cfa267de9946fac7764d486 | |
4a7e5b05 | 244 | This option bit can be employed only with the |
949ceae3 MK |
245 | .BR FUTEX_WAIT_BITSET , |
246 | .BR FUTEX_WAIT_REQUEUE_PI , | |
949ceae3 MK |
247 | (since Linux 4.5) |
248 | .\" commit 337f13046ff03717a9e99675284a817527440a49 | |
e79977ae KK |
249 | .BR FUTEX_WAIT , |
250 | and | |
eeeee811 | 251 | (since Linux 5.14) |
e79977ae | 252 | .\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae |
eeeee811 | 253 | .B FUTEX_LOCK_PI2 |
c84cf68c | 254 | operations. |
efeece04 | 255 | .IP |
8064bfa5 | 256 | If this option is set, the kernel measures the |
f2103b26 | 257 | .I timeout |
8064bfa5 | 258 | against the |
eeeee811 | 259 | .B CLOCK_REALTIME |
8064bfa5 | 260 | clock. |
efeece04 | 261 | .IP |
8064bfa5 | 262 | If this option is not set, the kernel measures the |
f2103b26 | 263 | .I timeout |
8064bfa5 | 264 | against the |
eeeee811 | 265 | .B CLOCK_MONOTONIC |
1c952cf5 | 266 | clock. |
c6d039a3 | 267 | .P |
6be4bad7 | 268 | The operation specified in |
d33602c4 | 269 | .I futex_op |
6be4bad7 | 270 | is one of the following: |
70b06b90 MK |
271 | .\" |
272 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
273 | .\" | |
fea681da | 274 | .TP |
81c9d87e | 275 | .BR FUTEX_WAIT " (since Linux 2.6.0)" |
b324e17d | 276 | .\" Strictly speaking, since some time in Linux 2.5.x |
f065673c | 277 | This operation tests that the value at the |
4b35dc5d | 278 | futex word pointed to by the address |
fea681da | 279 | .I uaddr |
4b35dc5d | 280 | still contains the expected value |
fea681da | 281 | .IR val , |
fd105614 | 282 | and if so, then sleeps waiting for a |
682edefb | 283 | .B FUTEX_WAKE |
fd105614 | 284 | operation on the futex word. |
077981d4 | 285 | The load of the value of the futex word is an atomic memory |
4b35dc5d | 286 | access (i.e., using atomic machine instructions of the respective |
077981d4 MK |
287 | architecture). |
288 | This load, the comparison with the expected value, and | |
fd105614 | 289 | starting to sleep are performed atomically |
da56650a | 290 | .\" FIXME: Torvald, I think we may need to add some explanation of |
61066e14 | 291 | .\" "totally ordered" here. |
fd105614 MK |
292 | and totally ordered |
293 | with respect to other futex operations on the same futex word. | |
c0dc758e MK |
294 | If the thread starts to sleep, |
295 | it is considered a waiter on this futex word. | |
f065673c MK |
296 | If the futex value does not match |
297 | .IR val , | |
4710334a | 298 | then the call fails immediately with the error |
badbf70c | 299 | .BR EAGAIN . |
efeece04 | 300 | .IP |
4b35dc5d | 301 | The purpose of the comparison with the expected value is to prevent lost |
fd105614 MK |
302 | wake-ups. |
303 | If another thread changed the value of the futex word after the | |
c0dc758e MK |
304 | calling thread decided to block based on the prior value, |
305 | and if the other thread executed a | |
1ae6b2c7 | 306 | .B FUTEX_WAKE |
4b35dc5d | 307 | operation (or similar wake-up) after the value change and before this |
1ae6b2c7 | 308 | .B FUTEX_WAIT |
bc54ed38 MK |
309 | operation, then the calling thread will observe the |
310 | value change and will not start to sleep. | |
efeece04 | 311 | .IP |
c13182ef | 312 | If the |
fea681da | 313 | .I timeout |
40d2dab9 | 314 | is not NULL, the structure it points to specifies a |
40d2dab9 | 315 | timeout for the wait. |
ac991a11 MK |
316 | (This interval will be rounded up to the system clock granularity, |
317 | and is guaranteed not to expire early.) | |
a6918f1d | 318 | The timeout is by default measured according to the |
1ae6b2c7 | 319 | .B CLOCK_MONOTONIC |
a01c3098 | 320 | clock, but, since Linux 4.5, the |
1ae6b2c7 | 321 | .B CLOCK_REALTIME |
a01c3098 | 322 | clock can be selected by specifying |
1ae6b2c7 | 323 | .B FUTEX_CLOCK_REALTIME |
a01c3098 MK |
324 | in |
325 | .IR futex_op . | |
82a6092b MK |
326 | If |
327 | .I timeout | |
328 | is NULL, the call blocks indefinitely. | |
efeece04 | 329 | .IP |
4100abc5 MK |
330 | .IR Note : |
331 | for | |
332 | .BR FUTEX_WAIT , | |
1ae6b2c7 | 333 | .I timeout |
4100abc5 | 334 | is interpreted as a |
1ae6b2c7 | 335 | .I relative |
4100abc5 MK |
336 | value. |
337 | This differs from other futex operations, where | |
338 | .I timeout | |
339 | is interpreted as an absolute value. | |
340 | To obtain the equivalent of | |
1ae6b2c7 | 341 | .B FUTEX_WAIT |
4100abc5 | 342 | with an absolute timeout, employ |
1ae6b2c7 | 343 | .B FUTEX_WAIT_BITSET |
4100abc5 | 344 | with |
1ae6b2c7 | 345 | .I val3 |
4100abc5 MK |
346 | specified as |
347 | .BR FUTEX_BITSET_MATCH_ANY . | |
efeece04 | 348 | .IP |
c13182ef | 349 | The arguments |
fea681da MK |
350 | .I uaddr2 |
351 | and | |
352 | .I val3 | |
353 | are ignored. | |
9915ea23 MK |
354 | .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a |
355 | .\" different example. | |
356 | .\" | |
357 | .\" For | |
358 | .\" .BR futex (7), | |
359 | .\" this call is executed if decrementing the count gave a negative value | |
360 | .\" (indicating contention), | |
361 | .\" and will sleep until another process or thread releases | |
362 | .\" the futex and executes the | |
363 | .\" .B FUTEX_WAKE | |
364 | .\" operation. | |
70b06b90 MK |
365 | .\" |
366 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
367 | .\" | |
fea681da | 368 | .TP |
81c9d87e MK |
369 | .BR FUTEX_WAKE " (since Linux 2.6.0)" |
370 | .\" Strictly speaking, since Linux 2.5.x | |
f065673c MK |
371 | This operation wakes at most |
372 | .I val | |
4b35dc5d | 373 | of the waiters that are waiting (e.g., inside |
f065673c | 374 | .BR FUTEX_WAIT ) |
4b35dc5d | 375 | on the futex word at the address |
f065673c MK |
376 | .IR uaddr . |
377 | Most commonly, | |
378 | .I val | |
379 | is specified as either 1 (wake up a single waiter) or | |
1ae6b2c7 | 380 | .B INT_MAX |
f065673c | 381 | (wake up all waiters). |
730bfbda MK |
382 | No guarantee is provided about which waiters are awoken |
383 | (e.g., a waiter with a higher scheduling priority is not guaranteed | |
384 | to be awoken in preference to a waiter with a lower priority). | |
efeece04 | 385 | .IP |
fea681da MK |
386 | The arguments |
387 | .IR timeout , | |
c8b921bd | 388 | .IR uaddr2 , |
fea681da MK |
389 | and |
390 | .I val3 | |
391 | are ignored. | |
9915ea23 MK |
392 | .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to |
393 | .\" a different example. | |
394 | .\" | |
4c8cb0ff MK |
395 | .\" For |
396 | .\" .BR futex (7), | |
397 | .\" this is executed if incrementing the count showed that | |
398 | .\" there were waiters, | |
399 | .\" once the futex value has been set to 1 | |
400 | .\" (indicating that it is available). | |
401 | .\" | |
9915ea23 | 402 | .\" How does "incrementing the count show that there were waiters"? |
70b06b90 MK |
403 | .\" |
404 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
405 | .\" | |
a7c2bf45 MK |
406 | .TP |
407 | .BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)" | |
b324e17d | 408 | .\" Strictly speaking, from Linux 2.5.x to Linux 2.6.25 |
4c8cb0ff MK |
409 | This operation creates a file descriptor that is associated with |
410 | the futex at | |
a7c2bf45 | 411 | .IR uaddr . |
bdc5957a MK |
412 | The caller must close the returned file descriptor after use. |
413 | When another process or thread performs a | |
1ae6b2c7 | 414 | .B FUTEX_WAKE |
4b35dc5d | 415 | on the futex word, the file descriptor indicates as being readable with |
a7c2bf45 MK |
416 | .BR select (2), |
417 | .BR poll (2), | |
418 | and | |
419 | .BR epoll (7) | |
efeece04 | 420 | .IP |
f1d2171d | 421 | The file descriptor can be used to obtain asynchronous notifications: if |
a7c2bf45 | 422 | .I val |
ca4e5b2b | 423 | is nonzero, then, when another process or thread executes a |
a7c2bf45 MK |
424 | .BR FUTEX_WAKE , |
425 | the caller will receive the signal number that was passed in | |
426 | .IR val . | |
efeece04 | 427 | .IP |
a7c2bf45 MK |
428 | The arguments |
429 | .IR timeout , | |
d556548b | 430 | .IR uaddr2 , |
a7c2bf45 MK |
431 | and |
432 | .I val3 | |
433 | are ignored. | |
efeece04 | 434 | .IP |
a7c2bf45 MK |
435 | Because it was inherently racy, |
436 | .B FUTEX_FD | |
437 | has been removed | |
438 | .\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80 | |
439 | from Linux 2.6.26 onward. | |
70b06b90 MK |
440 | .\" |
441 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
442 | .\" | |
a7c2bf45 MK |
443 | .TP |
444 | .BR FUTEX_REQUEUE " (since Linux 2.6.0)" | |
a7c2bf45 | 445 | This operation performs the same task as |
1ae6b2c7 | 446 | .B FUTEX_CMP_REQUEUE |
27dd3a6e | 447 | (see below), except that no check is made using the value in |
a7c2bf45 MK |
448 | .IR val3 . |
449 | (The argument | |
450 | .I val3 | |
451 | is ignored.) | |
70b06b90 MK |
452 | .\" |
453 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
454 | .\" | |
a7c2bf45 MK |
455 | .TP |
456 | .BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)" | |
4b35dc5d | 457 | This operation first checks whether the location |
a7c2bf45 MK |
458 | .I uaddr |
459 | still contains the value | |
460 | .IR val3 . | |
461 | If not, the operation fails with the error | |
462 | .BR EAGAIN . | |
4b35dc5d | 463 | Otherwise, the operation wakes up a maximum of |
a7c2bf45 MK |
464 | .I val |
465 | waiters that are waiting on the futex at | |
466 | .IR uaddr . | |
467 | If there are more than | |
468 | .I val | |
469 | waiters, then the remaining waiters are removed | |
470 | from the wait queue of the source futex at | |
471 | .I uaddr | |
472 | and added to the wait queue of the target futex at | |
473 | .IR uaddr2 . | |
474 | The | |
768d3c23 | 475 | .I val2 |
936876a9 | 476 | argument specifies an upper limit on the number of waiters |
a7c2bf45 | 477 | that are requeued to the futex at |
768d3c23 | 478 | .IR uaddr2 . |
efeece04 | 479 | .IP |
d6bb5a38 MK |
480 | .\" FIXME(Torvald) Is the following correct? Or is just the decision |
481 | .\" which threads to wake or requeue part of the atomic operation? | |
4b35dc5d TR |
482 | The load from |
483 | .I uaddr | |
4c8cb0ff MK |
484 | is an atomic memory access (i.e., using atomic machine instructions of |
485 | the respective architecture). | |
077981d4 | 486 | This load, the comparison with |
4b35dc5d | 487 | .IR val3 , |
4c8cb0ff MK |
488 | and the requeueing of any waiters are performed atomically and totally |
489 | ordered with respect to other operations on the same futex word. | |
ee65b0e8 MK |
490 | .\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ### |
491 | .\" The operation is serialized with respect to operations on both | |
492 | .\" source and target futex. No other waiter can enqueue itself | |
493 | .\" for waiting and no other waiter can dequeue itself because of | |
494 | .\" a timeout or signal. | |
efeece04 | 495 | .IP |
a7c2bf45 MK |
496 | Typical values to specify for |
497 | .I val | |
ed1819cf | 498 | are 0 or 1. |
a7c2bf45 | 499 | (Specifying |
1ae6b2c7 | 500 | .B INT_MAX |
a7c2bf45 | 501 | is not useful, because it would make the |
1ae6b2c7 | 502 | .B FUTEX_CMP_REQUEUE |
a7c2bf45 MK |
503 | operation equivalent to |
504 | .BR FUTEX_WAKE .) | |
936876a9 | 505 | The limit value specified via |
768d3c23 MK |
506 | .I val2 |
507 | is typically either 1 or | |
a7c2bf45 MK |
508 | .BR INT_MAX . |
509 | (Specifying the argument as 0 is not useful, because it would make the | |
1ae6b2c7 | 510 | .B FUTEX_CMP_REQUEUE |
a7c2bf45 MK |
511 | operation equivalent to |
512 | .BR FUTEX_WAIT .) | |
efeece04 | 513 | .IP |
627b50ce MK |
514 | The |
515 | .B FUTEX_CMP_REQUEUE | |
516 | operation was added as a replacement for the earlier | |
517 | .BR FUTEX_REQUEUE . | |
518 | The difference is that the check of the value at | |
519 | .I uaddr | |
520 | can be used to ensure that requeueing happens only under certain | |
521 | conditions, which allows race conditions to be avoided in certain use cases. | |
dcb410c3 | 522 | .\" But, as Rich Felker points out, there remain valid use cases for |
627b50ce MK |
523 | .\" FUTEX_REQUEUE, for example, when the calling thread is requeuing |
524 | .\" the target(s) to a lock that the calling thread owns | |
525 | .\" From: Rich Felker <dalias@libc.org> | |
526 | .\" Date: Wed, 29 Oct 2014 22:43:17 -0400 | |
527 | .\" To: Darren Hart <dvhart@infradead.org> | |
528 | .\" CC: libc-alpha@sourceware.org, ... | |
529 | .\" Subject: Re: Add futex wrapper to glibc? | |
efeece04 | 530 | .IP |
627b50ce | 531 | Both |
1ae6b2c7 | 532 | .B FUTEX_REQUEUE |
627b50ce | 533 | and |
1ae6b2c7 | 534 | .B FUTEX_CMP_REQUEUE |
627b50ce MK |
535 | can be used to avoid "thundering herd" wake-ups that could occur when using |
536 | .B FUTEX_WAKE | |
537 | in cases where all of the waiters that are woken need to acquire | |
538 | another futex. | |
539 | Consider the following scenario, | |
540 | where multiple waiter threads are waiting on B, | |
541 | a wait queue implemented using a futex: | |
efeece04 | 542 | .IP |
627b50ce | 543 | .in +4n |
b76974c1 | 544 | .EX |
627b50ce MK |
545 | lock(A) |
546 | while (!check_value(V)) { | |
547 | unlock(A); | |
548 | block_on(B); | |
549 | lock(A); | |
550 | }; | |
551 | unlock(A); | |
b76974c1 | 552 | .EE |
627b50ce | 553 | .in |
efeece04 | 554 | .IP |
627b50ce MK |
555 | If a waker thread used |
556 | .BR FUTEX_WAKE , | |
557 | then all waiters waiting on B would be woken up, | |
67c67ff2 | 558 | and they would all try to acquire lock A. |
627b50ce MK |
559 | However, waking all of the threads in this manner would be pointless because |
560 | all except one of the threads would immediately block on lock A again. | |
561 | By contrast, a requeue operation wakes just one waiter and moves | |
562 | the other waiters to lock A, | |
563 | and when the woken waiter unlocks A then the next waiter can proceed. | |
43d16602 | 564 | .\" |
70b06b90 MK |
565 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
566 | .\" | |
fea681da | 567 | .TP |
d67e21f5 MK |
568 | .BR FUTEX_WAKE_OP " (since Linux 2.6.14)" |
569 | .\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721 | |
6bac3b85 MK |
570 | .\" Author: Jakub Jelinek <jakub@redhat.com> |
571 | .\" Date: Tue Sep 6 15:16:25 2005 -0700 | |
9915ea23 | 572 | .\" FIXME. (Torvald) The glibc condvar implementation is currently being |
4c8cb0ff MK |
573 | .\" revised (e.g., to not use an internal lock anymore). |
574 | .\" It is probably more future-proof to remove this paragraph. | |
d6bb5a38 | 575 | .\" [Torvald, do you have an update here?] |
6bac3b85 MK |
576 | This operation was added to support some user-space use cases |
577 | where more than one futex must be handled at the same time. | |
578 | The most notable example is the implementation of | |
579 | .BR pthread_cond_signal (3), | |
580 | which requires operations on two futexes, | |
581 | the one used to implement the mutex and the one used in the implementation | |
582 | of the wait queue associated with the condition variable. | |
1ae6b2c7 | 583 | .B FUTEX_WAKE_OP |
6bac3b85 MK |
584 | allows such cases to be implemented without leading to |
585 | high rates of contention and context switching. | |
efeece04 | 586 | .IP |
6bac3b85 | 587 | The |
1ae6b2c7 | 588 | .B FUTEX_WAKE_OP |
e61abc20 | 589 | operation is equivalent to executing the following code atomically |
4c8cb0ff MK |
590 | and totally ordered with respect to other futex operations on |
591 | any of the two supplied futex words: | |
efeece04 | 592 | .IP |
6bac3b85 | 593 | .in +4n |
b76974c1 | 594 | .EX |
2253ecf0 AC |
595 | uint32_t oldval = *(uint32_t *) uaddr2; |
596 | *(uint32_t *) uaddr2 = oldval \fIop\fP \fIoparg\fP; | |
6bac3b85 MK |
597 | futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); |
598 | if (oldval \fIcmp\fP \fIcmparg\fP) | |
768d3c23 | 599 | futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); |
b76974c1 | 600 | .EE |
6bac3b85 | 601 | .in |
efeece04 | 602 | .IP |
6bac3b85 | 603 | In other words, |
1ae6b2c7 | 604 | .B FUTEX_WAKE_OP |
6bac3b85 MK |
605 | does the following: |
606 | .RS | |
cdede5cd | 607 | .IP \[bu] 3 |
4b35dc5d | 608 | saves the original value of the futex word at |
1ae6b2c7 | 609 | .I uaddr2 |
4b35dc5d | 610 | and performs an operation to modify the value of the futex at |
6bac3b85 | 611 | .IR uaddr2 ; |
4c8cb0ff MK |
612 | this is an atomic read-modify-write memory access (i.e., using atomic |
613 | machine instructions of the respective architecture) | |
cdede5cd | 614 | .IP \[bu] |
6bac3b85 MK |
615 | wakes up a maximum of |
616 | .I val | |
4b35dc5d | 617 | waiters on the futex for the futex word at |
6bac3b85 MK |
618 | .IR uaddr ; |
619 | and | |
cdede5cd | 620 | .IP \[bu] |
4c8cb0ff MK |
621 | dependent on the results of a test of the original value of the |
622 | futex word at | |
6bac3b85 MK |
623 | .IR uaddr2 , |
624 | wakes up a maximum of | |
768d3c23 | 625 | .I val2 |
4b35dc5d | 626 | waiters on the futex for the futex word at |
6bac3b85 MK |
627 | .IR uaddr2 . |
628 | .RE | |
629 | .IP | |
6bac3b85 MK |
630 | The operation and comparison that are to be performed are encoded |
631 | in the bits of the argument | |
632 | .IR val3 . | |
633 | Pictorially, the encoding is: | |
efeece04 | 634 | .IP |
3cf61490 | 635 | .in +4n |
b76974c1 | 636 | .EX |
f6af90e7 MK |
637 | +---+---+-----------+-----------+ |
638 | |op |cmp| oparg | cmparg | | |
639 | +---+---+-----------+-----------+ | |
640 | 4 4 12 12 <== # of bits | |
b76974c1 | 641 | .EE |
6bac3b85 | 642 | .in |
efeece04 | 643 | .IP |
6bac3b85 | 644 | Expressed in code, the encoding is: |
efeece04 | 645 | .IP |
6bac3b85 | 646 | .in +4n |
b76974c1 | 647 | .EX |
d1a71985 MK |
648 | #define FUTEX_OP(op, oparg, cmp, cmparg) \e |
649 | (((op & 0xf) << 28) | \e | |
650 | ((cmp & 0xf) << 24) | \e | |
651 | ((oparg & 0xfff) << 12) | \e | |
6bac3b85 | 652 | (cmparg & 0xfff)) |
b76974c1 | 653 | .EE |
6bac3b85 | 654 | .in |
efeece04 | 655 | .IP |
6bac3b85 MK |
656 | In the above, |
657 | .I op | |
658 | and | |
659 | .I cmp | |
660 | are each one of the codes listed below. | |
661 | The | |
662 | .I oparg | |
663 | and | |
664 | .I cmparg | |
665 | components are literal numeric values, except as noted below. | |
efeece04 | 666 | .IP |
6bac3b85 MK |
667 | The |
668 | .I op | |
669 | component has one of the following values: | |
efeece04 | 670 | .IP |
6bac3b85 | 671 | .in +4n |
b76974c1 | 672 | .EX |
6bac3b85 MK |
673 | FUTEX_OP_SET 0 /* uaddr2 = oparg; */ |
674 | FUTEX_OP_ADD 1 /* uaddr2 += oparg; */ | |
675 | FUTEX_OP_OR 2 /* uaddr2 |= oparg; */ | |
3f029bc9 | 676 | FUTEX_OP_ANDN 3 /* uaddr2 &= \[ti]oparg; */ |
a1e9245d | 677 | FUTEX_OP_XOR 4 /* uaddr2 \[ha]= oparg; */ |
b76974c1 | 678 | .EE |
6bac3b85 | 679 | .in |
efeece04 | 680 | .IP |
5d771a4a | 681 | In addition, bitwise ORing the following value into |
6bac3b85 MK |
682 | .I op |
683 | causes | |
1ae6b2c7 | 684 | .I (1\~<<\~oparg) |
6bac3b85 | 685 | to be used as the operand: |
efeece04 | 686 | .IP |
6bac3b85 | 687 | .in +4n |
b76974c1 | 688 | .EX |
6bac3b85 | 689 | FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */ |
b76974c1 | 690 | .EE |
6bac3b85 | 691 | .in |
efeece04 | 692 | .IP |
6bac3b85 MK |
693 | The |
694 | .I cmp | |
695 | field is one of the following: | |
efeece04 | 696 | .IP |
6bac3b85 | 697 | .in +4n |
b76974c1 | 698 | .EX |
6bac3b85 MK |
699 | FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */ |
700 | FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */ | |
701 | FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */ | |
702 | FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */ | |
703 | FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */ | |
704 | FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */ | |
b76974c1 | 705 | .EE |
6bac3b85 | 706 | .in |
efeece04 | 707 | .IP |
6bac3b85 | 708 | The return value of |
1ae6b2c7 | 709 | .B FUTEX_WAKE_OP |
6bac3b85 | 710 | is the sum of the number of waiters woken on the futex |
1ae6b2c7 | 711 | .I uaddr |
6bac3b85 MK |
712 | plus the number of waiters woken on the futex |
713 | .IR uaddr2 . | |
70b06b90 MK |
714 | .\" |
715 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
716 | .\" | |
d67e21f5 | 717 | .TP |
79c9b436 TG |
718 | .BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)" |
719 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
fd9e59d4 | 720 | This operation is like |
1ae6b2c7 | 721 | .B FUTEX_WAIT |
79c9b436 TG |
722 | except that |
723 | .I val3 | |
84abf4ba | 724 | is used to provide a 32-bit bit mask to the kernel. |
2ae96e8a | 725 | This bit mask, in which at least one bit must be set, |
6c38ce7f | 726 | is stored in the kernel-internal state of the waiter. |
79c9b436 | 727 | See the description of |
1ae6b2c7 | 728 | .B FUTEX_WAKE_BITSET |
79c9b436 | 729 | for further details. |
efeece04 | 730 | .IP |
8064bfa5 MK |
731 | If |
732 | .I timeout | |
733 | is not NULL, the structure it points to specifies | |
734 | an absolute timeout for the wait operation. | |
735 | If | |
736 | .I timeout | |
737 | is NULL, the operation can block indefinitely. | |
efeece04 | 738 | .IP |
79c9b436 TG |
739 | The |
740 | .I uaddr2 | |
741 | argument is ignored. | |
70b06b90 MK |
742 | .\" |
743 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
744 | .\" | |
79c9b436 | 745 | .TP |
d67e21f5 MK |
746 | .BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)" |
747 | .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d | |
55cc422d | 748 | This operation is the same as |
1ae6b2c7 | 749 | .B FUTEX_WAKE |
55cc422d | 750 | except that the |
e24fbf10 | 751 | .I val3 |
5e1456d4 | 752 | argument is used to provide a 32-bit bit mask to the kernel. |
6c38ce7f MK |
753 | This bit mask, in which at least one bit must be set, |
754 | is used to select which waiters should be woken up. | |
5d771a4a | 755 | The selection is done by a bitwise AND of the "wake" bit mask |
98d769c0 MK |
756 | (i.e., the value in |
757 | .IR val3 ) | |
5e1456d4 MK |
758 | and the bit mask which is stored in the kernel-internal |
759 | state of the waiter (the "wait" bit mask that is set using | |
98d769c0 MK |
760 | .BR FUTEX_WAIT_BITSET ). |
761 | All of the waiters for which the result of the AND is nonzero are woken up; | |
762 | the remaining waiters are left sleeping. | |
efeece04 | 763 | .IP |
e9d4496b | 764 | The effect of |
1ae6b2c7 | 765 | .B FUTEX_WAIT_BITSET |
e9d4496b | 766 | and |
1ae6b2c7 | 767 | .B FUTEX_WAKE_BITSET |
9732dd8b MK |
768 | is to allow selective wake-ups among multiple waiters that are blocked |
769 | on the same futex. | |
ac894879 | 770 | However, note that, depending on the use case, |
5e1456d4 | 771 | employing this bit-mask multiplexing feature on a |
ac894879 | 772 | futex can be less efficient than simply using multiple futexes, |
5e1456d4 | 773 | because employing bit-mask multiplexing requires the kernel |
e9d4496b MK |
774 | to check all waiters on a futex, |
775 | including those that are not interested in being woken up | |
5e1456d4 | 776 | (i.e., they do not have the relevant bit set in their "wait" bit mask). |
e9d4496b MK |
777 | .\" According to http://locklessinc.com/articles/futex_cheat_sheet/: |
778 | .\" | |
779 | .\" "The original reason for the addition of these extensions | |
780 | .\" was to improve the performance of pthread read-write locks | |
781 | .\" in glibc. However, the pthreads library no longer uses the | |
782 | .\" same locking algorithm, and these extensions are not used | |
783 | .\" without the bitset parameter being all ones. | |
e24fbf10 | 784 | .\" |
e9d4496b | 785 | .\" The page goes on to note that the FUTEX_WAIT_BITSET operation |
5e1456d4 | 786 | .\" is nevertheless used (with a bit mask of all ones) in order to |
e9d4496b MK |
787 | .\" obtain the absolute timeout functionality that is useful |
788 | .\" for efficiently implementing Pthreads APIs (which use absolute | |
789 | .\" timeouts); FUTEX_WAIT provides only relative timeouts. | |
efeece04 | 790 | .IP |
678c9986 MK |
791 | The constant |
792 | .BR FUTEX_BITSET_MATCH_ANY , | |
793 | which corresponds to all 32 bits set in the bit mask, can be used as the | |
794 | .I val3 | |
795 | argument for | |
1ae6b2c7 | 796 | .B FUTEX_WAIT_BITSET |
98d769c0 | 797 | and |
678c9986 MK |
798 | .BR FUTEX_WAKE_BITSET . |
799 | Other than differences in the handling of the | |
98d769c0 | 800 | .I timeout |
678c9986 | 801 | argument, the |
1ae6b2c7 | 802 | .B FUTEX_WAIT |
678c9986 | 803 | operation is equivalent to |
1ae6b2c7 | 804 | .B FUTEX_WAIT_BITSET |
678c9986 | 805 | with |
1ae6b2c7 | 806 | .I val3 |
678c9986 MK |
807 | specified as |
808 | .BR FUTEX_BITSET_MATCH_ANY ; | |
809 | that is, allow a wake-up by any waker. | |
810 | The | |
1ae6b2c7 | 811 | .B FUTEX_WAKE |
678c9986 | 812 | operation is equivalent to |
1ae6b2c7 | 813 | .B FUTEX_WAKE_BITSET |
678c9986 | 814 | with |
1ae6b2c7 | 815 | .I val3 |
678c9986 MK |
816 | specified as |
817 | .BR FUTEX_BITSET_MATCH_ANY ; | |
818 | that is, wake up any waiter(s). | |
efeece04 | 819 | .IP |
678c9986 MK |
820 | The |
821 | .I uaddr2 | |
822 | and | |
823 | .I timeout | |
824 | arguments are ignored. | |
bd90a5f9 | 825 | .\" |
70b06b90 | 826 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
bd90a5f9 MK |
827 | .\" |
828 | .SS Priority-inheritance futexes | |
b52e1cd4 MK |
829 | Linux supports priority-inheritance (PI) futexes in order to handle |
830 | priority-inversion problems that can be encountered with | |
831 | normal futex locks. | |
b565548b | 832 | Priority inversion is the problem that occurs when a high-priority |
bdc5957a MK |
833 | task is blocked waiting to acquire a lock held by a low-priority task, |
834 | while tasks at an intermediate priority continuously preempt | |
835 | the low-priority task from the CPU. | |
836 | Consequently, the low-priority task makes no progress toward | |
837 | releasing the lock, and the high-priority task remains blocked. | |
c6d039a3 | 838 | .P |
7d20efd7 MK |
839 | Priority inheritance is a mechanism for dealing with |
840 | the priority-inversion problem. | |
bdc5957a MK |
841 | With this mechanism, when a high-priority task becomes blocked |
842 | by a lock held by a low-priority task, | |
9cee832c MK |
843 | the priority of the low-priority task is temporarily raised |
844 | to that of the high-priority task, | |
bdc5957a | 845 | so that it is not preempted by any intermediate level tasks, |
7d20efd7 MK |
846 | and can thus make progress toward releasing the lock. |
847 | To be effective, priority inheritance must be transitive, | |
bdc5957a | 848 | meaning that if a high-priority task blocks on a lock |
ca4e5b2b | 849 | held by a lower-priority task that is itself blocked by a lock |
bdc5957a | 850 | held by another intermediate-priority task |
7d20efd7 | 851 | (and so on, for chains of arbitrary length), |
b0f35fbb | 852 | then both of those tasks |
bdc5957a MK |
853 | (or more generally, all of the tasks in a lock chain) |
854 | have their priorities raised to be the same as the high-priority task. | |
c6d039a3 | 855 | .P |
9e2b90ee | 856 | From a user-space perspective, |
39e9b2e1 MK |
857 | what makes a futex PI-aware is a policy agreement (described below) |
858 | between user space and the kernel about the value of the futex word, | |
601399f3 MK |
859 | coupled with the use of the PI-futex operations described below. |
860 | (Unlike the other futex operations described above, | |
861 | the PI-futex operations are designed | |
862 | for the implementation of very specific IPC mechanisms.) | |
863 | .\" | |
9e2b90ee MK |
864 | .\" Quoting Darren Hart: |
865 | .\" These opcodes paired with the PI futex value policy (described below) | |
866 | .\" defines a "futex" as PI aware. These were created very specifically | |
867 | .\" in support of PI pthread_mutexes, so it makes a lot more sense to | |
868 | .\" talk about a PI aware pthread_mutex, than a PI aware futex, since | |
869 | .\" there is a lot of policy and scaffolding that has to be built up | |
870 | .\" around it to use it properly (this is what a PI pthread_mutex is). | |
c6d039a3 | 871 | .P |
ac894879 | 872 | .\" mtk: The following text is drawn from the Hart/Guniguntala paper |
1af427a4 | 873 | .\" (listed in SEE ALSO), but I have reworded some pieces |
8d825152 | 874 | .\" significantly. |
79d918c7 | 875 | .\" |
f0a9e8f4 | 876 | The PI-futex operations described below differ from the other |
4b35dc5d TR |
877 | futex operations in that they impose policy on the use of the value of the |
878 | futex word: | |
cdede5cd | 879 | .IP \[bu] 3 |
4b35dc5d | 880 | If the lock is not acquired, the futex word's value shall be 0. |
cdede5cd | 881 | .IP \[bu] |
4c8cb0ff MK |
882 | If the lock is acquired, the futex word's value shall |
883 | be the thread ID (TID; | |
4b35dc5d | 884 | see |
79d918c7 MK |
885 | .BR gettid (2)) |
886 | of the owning thread. | |
cdede5cd | 887 | .IP \[bu] |
79d918c7 MK |
888 | If the lock is owned and there are threads contending for the lock, |
889 | then the | |
890 | .B FUTEX_WAITERS | |
4b35dc5d | 891 | bit shall be set in the futex word's value; in other words, this value is: |
efeece04 | 892 | .IP |
1ae6b2c7 AC |
893 | .in +4n |
894 | .EX | |
895 | FUTEX_WAITERS | TID | |
896 | .EE | |
897 | .in | |
601399f3 MK |
898 | .IP |
899 | (Note that is invalid for a PI futex word to have no owner and | |
1ae6b2c7 | 900 | .B FUTEX_WAITERS |
601399f3 | 901 | set.) |
c6d039a3 | 902 | .P |
79d918c7 | 903 | With this policy in place, |
fd105614 | 904 | a user-space application can acquire an unacquired |
601399f3 | 905 | lock or release a lock using atomic instructions executed in user mode |
fd105614 | 906 | (e.g., a compare-and-swap operation such as |
b52e1cd4 MK |
907 | .I cmpxchg |
908 | on the x86 architecture). | |
4c8cb0ff MK |
909 | Acquiring a lock simply consists of using compare-and-swap to atomically |
910 | set the futex word's value to the caller's TID if its previous value was 0. | |
4b35dc5d TR |
911 | Releasing a lock requires using compare-and-swap to set the futex word's |
912 | value to 0 if the previous value was the expected TID. | |
c6d039a3 | 913 | .P |
4b35dc5d | 914 | If a futex is already acquired (i.e., has a nonzero value), |
b52e1cd4 | 915 | waiters must employ the |
79d918c7 | 916 | .B FUTEX_LOCK_PI |
e79977ae KK |
917 | or |
918 | .B FUTEX_LOCK_PI2 | |
919 | operations to acquire the lock. | |
4b35dc5d | 920 | If other threads are waiting for the lock, then the |
79d918c7 | 921 | .B FUTEX_WAITERS |
4c8cb0ff MK |
922 | bit is set in the futex value; |
923 | in this case, the lock owner must employ the | |
79d918c7 | 924 | .B FUTEX_UNLOCK_PI |
b52e1cd4 | 925 | operation to release the lock. |
c6d039a3 | 926 | .P |
79d918c7 MK |
927 | In the cases where callers are forced into the kernel |
928 | (i.e., required to perform a | |
929 | .BR futex () | |
0c3ec26b | 930 | call), |
79d918c7 MK |
931 | they then deal directly with a so-called RT-mutex, |
932 | a kernel locking mechanism which implements the required | |
933 | priority-inheritance semantics. | |
934 | After the RT-mutex is acquired, the futex value is updated accordingly, | |
935 | before the calling thread returns to user space. | |
c6d039a3 | 936 | .P |
a59fca75 | 937 | It is important to note |
ac894879 | 938 | .\" tglx (July 2015): |
30239c10 MK |
939 | .\" If there are multiple waiters on a pi futex then a wake pi operation |
940 | .\" will wake the first waiter and hand over the lock to this waiter. This | |
941 | .\" includes handing over the rtmutex which represents the futex in the | |
942 | .\" kernel. The strict requirement is that the futex owner and the rtmutex | |
943 | .\" owner must be the same, except for the update period which is | |
944 | .\" serialized by the futex internal locking. That means the kernel must | |
1d09c150 | 945 | .\" update the user-space value prior to returning to user space |
4b35dc5d | 946 | that the kernel will update the futex word's value prior |
79d918c7 | 947 | to returning to user space. |
601399f3 MK |
948 | (This prevents the possibility of the futex word's value ending |
949 | up in an invalid state, such as having an owner but the value being 0, | |
950 | or having waiters but not having the | |
951 | .B FUTEX_WAITERS | |
952 | bit set.) | |
c6d039a3 | 953 | .P |
30239c10 MK |
954 | If a futex has an associated RT-mutex in the kernel |
955 | (i.e., there are blocked waiters) | |
956 | and the owner of the futex/RT-mutex dies unexpectedly, | |
957 | then the kernel cleans up the RT-mutex and hands it over to the next waiter. | |
958 | This in turn requires that the user-space value is updated accordingly. | |
959 | To indicate that this is required, the kernel sets the | |
960 | .B FUTEX_OWNER_DIED | |
961 | bit in the futex word along with the thread ID of the new owner. | |
8adaf0a7 MK |
962 | User space can detect this situation via the presence of the |
963 | .B FUTEX_OWNER_DIED | |
964 | bit and is then responsible for cleaning up the stale state left over by | |
1d09c150 | 965 | the dead owner. |
30239c10 MK |
966 | .\" tglx (July 2015): |
967 | .\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where | |
968 | .\" the kernel has no state associated. This happens via the robust futex | |
969 | .\" mechanism. In that case the futex value will be set to | |
970 | .\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non | |
971 | .\" PI futexes. | |
c6d039a3 | 972 | .P |
601399f3 MK |
973 | PI futexes are operated on by specifying one of the values listed below in |
974 | .IR futex_op . | |
975 | Note that the PI futex operations must be used as paired operations | |
976 | and are subject to some additional requirements: | |
cdede5cd | 977 | .IP \[bu] 3 |
eeeee811 AC |
978 | .BR FUTEX_LOCK_PI , |
979 | .BR FUTEX_LOCK_PI2 , | |
601399f3 MK |
980 | and |
981 | .B FUTEX_TRYLOCK_PI | |
982 | pair with | |
d8012462 | 983 | .BR FUTEX_UNLOCK_PI . |
601399f3 MK |
984 | .B FUTEX_UNLOCK_PI |
985 | must be called only on a futex owned by the calling thread, | |
986 | as defined by the value policy, otherwise the error | |
987 | .B EPERM | |
988 | results. | |
cdede5cd | 989 | .IP \[bu] |
601399f3 MK |
990 | .B FUTEX_WAIT_REQUEUE_PI |
991 | pairs with | |
992 | .BR FUTEX_CMP_REQUEUE_PI . | |
993 | This must be performed from a non-PI futex to a distinct PI futex | |
994 | (or the error | |
995 | .B EINVAL | |
996 | results). | |
997 | Additionally, | |
998 | .I val | |
999 | (the number of waiters to be woken) must be 1 | |
1000 | (or the error | |
1001 | .B EINVAL | |
1002 | results). | |
c6d039a3 | 1003 | .P |
601399f3 | 1004 | The PI futex operations are as follows: |
70b06b90 MK |
1005 | .\" |
1006 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1007 | .\" | |
d67e21f5 MK |
1008 | .TP |
1009 | .BR FUTEX_LOCK_PI " (since Linux 2.6.18)" | |
1010 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc | |
bc54ed38 | 1011 | This operation is used after an attempt to acquire |
fd105614 | 1012 | the lock via an atomic user-mode instruction failed |
36546c38 | 1013 | because the futex word has a nonzero value\[em]specifically, |
8297383e | 1014 | because it contained the (PID-namespace-specific) TID of the lock owner. |
efeece04 | 1015 | .IP |
4b35dc5d | 1016 | The operation checks the value of the futex word at the address |
67833bec | 1017 | .IR uaddr . |
70b06b90 MK |
1018 | If the value is 0, then the kernel tries to atomically set |
1019 | the futex value to the caller's TID. | |
c3875d1d | 1020 | If the futex word's value is nonzero, |
67833bec | 1021 | the kernel atomically sets the |
e0547e70 | 1022 | .B FUTEX_WAITERS |
67833bec MK |
1023 | bit, which signals the futex owner that it cannot unlock the futex in |
1024 | user space atomically by setting the futex value to 0. | |
c3875d1d MK |
1025 | .\" tglx (July 2015): |
1026 | .\" The operation here is similar to the FUTEX_WAIT logic. When the user | |
1027 | .\" space atomic acquire does not succeed because the futex value was non | |
1028 | .\" zero, then the waiter goes into the kernel, takes the kernel internal | |
1029 | .\" lock and retries the acquisition under the lock. If the acquisition | |
1030 | .\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal | |
1031 | .\" the lock owner that it needs to go into the kernel. Here is the pseudo | |
1032 | .\" code: | |
1033 | .\" | |
1034 | .\" lock(kernel_lock); | |
1035 | .\" retry: | |
9bfc9cb1 | 1036 | .\" |
c3875d1d | 1037 | .\" /* |
ba2c4752 | 1038 | .\" * Owner might have unlocked in user space before we |
c3875d1d MK |
1039 | .\" * were able to set the waiter bit. |
1040 | .\" */ | |
1041 | .\" if (atomic_acquire(futex) == SUCCESS) { | |
1042 | .\" unlock(kernel_lock()); | |
1043 | .\" return 0; | |
1044 | .\" } | |
1045 | .\" | |
1046 | .\" /* | |
1047 | .\" * Owner might have unlocked after the above atomic_acquire() | |
1048 | .\" * attempt. | |
1049 | .\" */ | |
1050 | .\" if (atomic_set_waiters_bit(futex) != SUCCESS) | |
1051 | .\" goto retry; | |
1052 | .\" | |
1053 | .\" queue_waiter(); | |
1054 | .\" unlock(kernel_lock); | |
1055 | .\" block(); | |
1056 | .\" | |
1057 | After that, the kernel: | |
1058 | .RS | |
22356d97 | 1059 | .IP (1) 5 |
c3875d1d | 1060 | Tries to find the thread which is associated with the owner TID. |
22356d97 | 1061 | .IP (2) |
c3875d1d MK |
1062 | Creates or reuses kernel state on behalf of the owner. |
1063 | (If this is the first waiter, there is no kernel state for this | |
1064 | futex, so kernel state is created by locking the RT-mutex | |
1065 | and the futex owner is made the owner of the RT-mutex. | |
1066 | If there are existing waiters, then the existing state is reused.) | |
22356d97 | 1067 | .IP (3) |
ca4e5b2b | 1068 | Attaches the waiter to the futex |
c3875d1d MK |
1069 | (i.e., the waiter is enqueued on the RT-mutex waiter list). |
1070 | .RE | |
1071 | .IP | |
ac894879 MK |
1072 | If more than one waiter exists, |
1073 | the enqueueing of the waiter is in descending priority order. | |
1074 | (For information on priority ordering, see the discussion of the | |
1075 | .BR SCHED_DEADLINE , | |
1076 | .BR SCHED_FIFO , | |
1077 | and | |
1ae6b2c7 | 1078 | .B SCHED_RR |
ac894879 MK |
1079 | scheduling policies in |
1080 | .BR sched (7).) | |
1081 | The owner inherits either the waiter's CPU bandwidth | |
1082 | (if the waiter is scheduled under the | |
1ae6b2c7 | 1083 | .B SCHED_DEADLINE |
ac894879 | 1084 | policy) or the waiter's priority (if the waiter is scheduled under the |
1ae6b2c7 | 1085 | .B SCHED_RR |
ac894879 | 1086 | or |
1ae6b2c7 | 1087 | .B SCHED_FIFO |
ac894879 | 1088 | policy). |
1d09c150 MK |
1089 | .\" August 2015: |
1090 | .\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL) | |
1091 | .\" processes, does the nice value come into play also? | |
1092 | .\" | |
1093 | .\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order | |
c3875d1d | 1094 | This inheritance follows the lock chain in the case of nested locking |
ca4e5b2b MK |
1095 | .\" (i.e., task 1 blocks on lock A, held by task 2, |
1096 | .\" while task 2 blocks on lock B, held by task 3) | |
c3875d1d | 1097 | and performs deadlock detection. |
efeece04 | 1098 | .IP |
e0547e70 TG |
1099 | The |
1100 | .I timeout | |
9ce19cf1 | 1101 | argument provides a timeout for the lock attempt. |
8064bfa5 MK |
1102 | If |
1103 | .I timeout | |
1104 | is not NULL, the structure it points to specifies | |
1105 | an absolute timeout, measured against the | |
1ae6b2c7 | 1106 | .B CLOCK_REALTIME |
9ce19cf1 | 1107 | clock. |
c082f385 MK |
1108 | .\" 2016-07-07 response from Thomas Gleixner on LKML: |
1109 | .\" From: Thomas Gleixner <tglx@linutronix.de> | |
1110 | .\" Date: 6 July 2016 at 20:57 | |
1111 | .\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op | |
2ae96e8a | 1112 | .\" |
c082f385 MK |
1113 | .\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote: |
1114 | .\" > On 06/23/2016 08:28 PM, Darren Hart wrote: | |
1115 | .\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using | |
1116 | .\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a | |
1117 | .\" > > specific amount of time, regardless of wall time. | |
1118 | .\" > | |
1119 | .\" > Yes, that's another weird inconsistency. | |
2ae96e8a | 1120 | .\" |
c082f385 MK |
1121 | .\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on |
1122 | .\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back | |
1123 | .\" then when we added LOCK_PI. | |
9ce19cf1 MK |
1124 | If |
1125 | .I timeout | |
1126 | is NULL, the operation will block indefinitely. | |
efeece04 | 1127 | .IP |
a449c634 | 1128 | The |
e0547e70 TG |
1129 | .IR uaddr2 , |
1130 | .IR val , | |
1131 | and | |
eeeee811 | 1132 | .I val3 |
a449c634 | 1133 | arguments are ignored. |
67833bec | 1134 | .\" |
70b06b90 MK |
1135 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
1136 | .\" | |
d67e21f5 | 1137 | .TP |
eeeee811 | 1138 | .BR FUTEX_LOCK_PI2 " (since Linux 5.14)" |
e79977ae | 1139 | .\" commit bf22a6976897977b0a3f1aeba6823c959fc4fdae |
69bc3836 MK |
1140 | This operation is the same as |
1141 | .BR FUTEX_LOCK_PI , | |
1142 | except that the clock against which | |
eeeee811 | 1143 | .I timeout |
69bc3836 MK |
1144 | is measured is selectable. |
1145 | By default, the (absolute) timeout specified in | |
e79977ae | 1146 | .I timeout |
caaa2efa | 1147 | is measured against the |
69bc3836 MK |
1148 | .B CLOCK_MONOTONIC |
1149 | clock, but if the | |
1150 | .B FUTEX_CLOCK_REALTIME | |
1151 | flag is specified in | |
1152 | .IR futex_op , | |
1153 | then the timeout is measured against the | |
1154 | .B CLOCK_REALTIME | |
1155 | clock. | |
e79977ae KK |
1156 | .\" |
1157 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1158 | .\" | |
1159 | .TP | |
12fdbe23 | 1160 | .BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 1161 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
3fbb1be1 | 1162 | This operation tries to acquire the lock at |
12fdbe23 | 1163 | .IR uaddr . |
c3875d1d MK |
1164 | It is invoked when a user-space atomic acquire did not |
1165 | succeed because the futex word was not 0. | |
efeece04 | 1166 | .IP |
8adaf0a7 MK |
1167 | Because the kernel has access to more state information than user space, |
1168 | acquisition of the lock might succeed if performed by the | |
1169 | kernel in cases where the futex word | |
1170 | (i.e., the state information accessible to use-space) contains stale state | |
c3875d1d MK |
1171 | .RB ( FUTEX_WAITERS |
1172 | and/or | |
1173 | .BR FUTEX_OWNER_DIED ). | |
1174 | This can happen when the owner of the futex died. | |
1d09c150 MK |
1175 | User space cannot handle this condition in a race-free manner, |
1176 | but the kernel can fix this up and acquire the futex. | |
ee65b0e8 MK |
1177 | .\" Paraphrasing a f2f conversation with Thomas Gleixner about the |
1178 | .\" above point (Aug 2015): ### | |
1179 | .\" There is a rare possibility of a race condition involving an | |
1180 | .\" uncontended futex with no owner, but with waiters. The | |
1181 | .\" kernel-user-space contract is that if a futex is nonzero, you must | |
1182 | .\" go into kernel. The futex was owned by a task, and that task dies | |
1183 | .\" but there are no waiters, so the futex value is non zero. | |
1184 | .\" Therefore, the next locker has to go into the kernel, | |
1185 | .\" so that the kernel has a chance to clean up. (CMXCH on zero | |
1186 | .\" in user space would fail, so kernel has to clean up.) | |
8adaf0a7 MK |
1187 | .\" Darren Hart (Oct 2015): |
1188 | .\" The trylock in the kernel has more state, so it can independently | |
ba2c4752 | 1189 | .\" verify the flags that user space must trust implicitly. |
efeece04 | 1190 | .IP |
084744ef MK |
1191 | The |
1192 | .IR uaddr2 , | |
1193 | .IR val , | |
1194 | .IR timeout , | |
1195 | and | |
1ae6b2c7 | 1196 | .I val3 |
084744ef | 1197 | arguments are ignored. |
70b06b90 MK |
1198 | .\" |
1199 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1200 | .\" | |
d67e21f5 | 1201 | .TP |
12fdbe23 | 1202 | .BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)" |
d67e21f5 | 1203 | .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
d4ba4328 | 1204 | This operation wakes the top priority waiter that is waiting in |
ecae2099 | 1205 | .B FUTEX_LOCK_PI |
e79977ae KK |
1206 | or |
1207 | .B FUTEX_LOCK_PI2 | |
ecae2099 TG |
1208 | on the futex address provided by the |
1209 | .I uaddr | |
1210 | argument. | |
efeece04 | 1211 | .IP |
1d09c150 | 1212 | This is called when the user-space value at |
ecae2099 TG |
1213 | .I uaddr |
1214 | cannot be changed atomically from a TID (of the owner) to 0. | |
efeece04 | 1215 | .IP |
ecae2099 TG |
1216 | The |
1217 | .IR uaddr2 , | |
1218 | .IR val , | |
1219 | .IR timeout , | |
1220 | and | |
1ae6b2c7 | 1221 | .I val3 |
11a194bf | 1222 | arguments are ignored. |
70b06b90 MK |
1223 | .\" |
1224 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1225 | .\" | |
d67e21f5 | 1226 | .TP |
d67e21f5 MK |
1227 | .BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)" |
1228 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
f812a08b DH |
1229 | This operation is a PI-aware variant of |
1230 | .BR FUTEX_CMP_REQUEUE . | |
1231 | It requeues waiters that are blocked via | |
1232 | .B FUTEX_WAIT_REQUEUE_PI | |
1233 | on | |
1234 | .I uaddr | |
1235 | from a non-PI source futex | |
1236 | .RI ( uaddr ) | |
1237 | to a PI target futex | |
1238 | .RI ( uaddr2 ). | |
efeece04 | 1239 | .IP |
9e54d26d MK |
1240 | As with |
1241 | .BR FUTEX_CMP_REQUEUE , | |
1242 | this operation wakes up a maximum of | |
1243 | .I val | |
1244 | waiters that are waiting on the futex at | |
1245 | .IR uaddr . | |
1246 | However, for | |
1247 | .BR FUTEX_CMP_REQUEUE_PI , | |
1248 | .I val | |
6fbeb8f4 | 1249 | is required to be 1 |
939ca89f | 1250 | (since the main point is to avoid a thundering herd). |
9e54d26d MK |
1251 | The remaining waiters are removed from the wait queue of the source futex at |
1252 | .I uaddr | |
1253 | and added to the wait queue of the target futex at | |
1254 | .IR uaddr2 . | |
efeece04 | 1255 | .IP |
9e54d26d | 1256 | The |
768d3c23 | 1257 | .I val2 |
c6d8cf21 MK |
1258 | .\" val2 is the cap on the number of requeued waiters. |
1259 | .\" In the glibc pthread_cond_broadcast() implementation, this argument | |
1260 | .\" is specified as INT_MAX, and for pthread_cond_signal() it is 0. | |
9e54d26d | 1261 | and |
768d3c23 | 1262 | .I val3 |
9e54d26d MK |
1263 | arguments serve the same purposes as for |
1264 | .BR FUTEX_CMP_REQUEUE . | |
70b06b90 | 1265 | .\" |
8297383e | 1266 | .\" The page at http://locklessinc.com/articles/futex_cheat_sheet/ |
be376673 | 1267 | .\" notes that "priority-inheritance Futex to priority-inheritance |
8297383e MK |
1268 | .\" Futex requeues are currently unsupported". However, probably |
1269 | .\" the page does not need to say nothing about this, since | |
1270 | .\" Thomas Gleixner commented (July 2015): "they never will be | |
1271 | .\" supported because they make no sense at all" | |
70b06b90 MK |
1272 | .\" |
1273 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1274 | .\" | |
d67e21f5 MK |
1275 | .TP |
1276 | .BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)" | |
1277 | .\" commit 52400ba946759af28442dee6265c5c0180ac7122 | |
70b06b90 | 1278 | .\" |
c3875d1d | 1279 | Wait on a non-PI futex at |
6ff1b4c0 | 1280 | .I uaddr |
c3875d1d | 1281 | and potentially be requeued (via a |
1ae6b2c7 | 1282 | .B FUTEX_CMP_REQUEUE_PI |
c3875d1d | 1283 | operation in another task) onto a PI futex at |
6ff1b4c0 TG |
1284 | .IR uaddr2 . |
1285 | The wait operation on | |
1286 | .I uaddr | |
c3875d1d | 1287 | is the same as for |
6ff1b4c0 | 1288 | .BR FUTEX_WAIT . |
efeece04 | 1289 | .IP |
6ff1b4c0 TG |
1290 | The waiter can be removed from the wait on |
1291 | .I uaddr | |
6ff1b4c0 | 1292 | without requeueing on |
1ae6b2c7 | 1293 | .I uaddr2 |
c3875d1d | 1294 | via a |
1ae6b2c7 | 1295 | .B FUTEX_WAKE |
c3875d1d MK |
1296 | operation in another task. |
1297 | In this case, the | |
1ae6b2c7 | 1298 | .B FUTEX_WAIT_REQUEUE_PI |
3fbb1be1 MK |
1299 | operation fails with the error |
1300 | .BR EAGAIN . | |
efeece04 | 1301 | .IP |
63bea7dc MK |
1302 | If |
1303 | .I timeout | |
8064bfa5 MK |
1304 | is not NULL, the structure it points to specifies |
1305 | an absolute timeout for the wait operation. | |
63bea7dc MK |
1306 | If |
1307 | .I timeout | |
1308 | is NULL, the operation can block indefinitely. | |
efeece04 | 1309 | .IP |
a4e69912 MK |
1310 | The |
1311 | .I val3 | |
1312 | argument is ignored. | |
efeece04 | 1313 | .IP |
abb571e8 | 1314 | The |
1ae6b2c7 | 1315 | .B FUTEX_WAIT_REQUEUE_PI |
abb571e8 | 1316 | and |
1ae6b2c7 | 1317 | .B FUTEX_CMP_REQUEUE_PI |
abb571e8 MK |
1318 | were added to support a fairly specific use case: |
1319 | support for priority-inheritance-aware POSIX threads condition variables. | |
1320 | The idea is that these operations should always be paired, | |
1321 | in order to ensure that user space and the kernel remain in sync. | |
1322 | Thus, in the | |
1ae6b2c7 | 1323 | .B FUTEX_WAIT_REQUEUE_PI |
abb571e8 MK |
1324 | operation, the user-space application pre-specifies the target |
1325 | of the requeue that takes place in the | |
1ae6b2c7 | 1326 | .B FUTEX_CMP_REQUEUE_PI |
abb571e8 MK |
1327 | operation. |
1328 | .\" | |
1329 | .\" Darren Hart notes that a patch to allow glibc to fully support | |
1af427a4 | 1330 | .\" PI-aware pthreads condition variables has not yet been accepted into |
abb571e8 MK |
1331 | .\" glibc. The story is complex, and can be found at |
1332 | .\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588 | |
1333 | .\" Darren notes that in the meantime, the patch is shipped with various | |
1af427a4 | 1334 | .\" PREEMPT_RT-enabled Linux systems. |
abb571e8 MK |
1335 | .\" |
1336 | .\" Related to the preceding, Darren proposed that somewhere, man-pages | |
1337 | .\" should document the following point: | |
1af427a4 | 1338 | .\" |
b324e17d | 1339 | .\" While the Linux kernel, since Linux 2.6.31, supports requeueing of |
4c8cb0ff MK |
1340 | .\" priority-inheritance (PI) aware mutexes via the |
1341 | .\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations, | |
1342 | .\" the glibc implementation does not yet take full advantage of this. | |
1343 | .\" Specifically, the condvar internal data lock remains a non-PI aware | |
1344 | .\" mutex, regardless of the type of the pthread_mutex associated with | |
1345 | .\" the condvar. This can lead to an unbounded priority inversion on | |
1346 | .\" the internal data lock even when associating a PI aware | |
1347 | .\" pthread_mutex with a condvar during a pthread_cond*_wait | |
1348 | .\" operation. For this reason, it is not recommended to rely on | |
1349 | .\" priority inheritance when using pthread condition variables. | |
1af427a4 MK |
1350 | .\" |
1351 | .\" The problem is that the obvious location for this text is | |
1352 | .\" the pthread_cond*wait(3) man page. However, such a man page | |
abb571e8 | 1353 | .\" does not currently exist. |
70b06b90 | 1354 | .\" |
6700de24 | 1355 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
70b06b90 | 1356 | .\" |
47297adb | 1357 | .SH RETURN VALUE |
a5c5a06a MK |
1358 | In the event of an error (and assuming that |
1359 | .BR futex () | |
1360 | was invoked via | |
1361 | .BR syscall (2)), | |
1362 | all operations return \-1 and set | |
e808bba0 | 1363 | .I errno |
855d489a | 1364 | to indicate the error. |
c6d039a3 | 1365 | .P |
e808bba0 MK |
1366 | The return value on success depends on the operation, |
1367 | as described in the following list: | |
fea681da MK |
1368 | .TP |
1369 | .B FUTEX_WAIT | |
077981d4 | 1370 | Returns 0 if the caller was woken up. |
4c8cb0ff MK |
1371 | Note that a wake-up can also be caused by common futex usage patterns |
1372 | in unrelated code that happened to have previously used the futex word's | |
1373 | memory location (e.g., typical futex-based implementations of | |
1374 | Pthreads mutexes can cause this under some conditions). | |
1375 | Therefore, callers should always conservatively assume that a return | |
1376 | value of 0 can mean a spurious wake-up, and use the futex word's value | |
bc54ed38 MK |
1377 | (i.e., the user-space synchronization scheme) |
1378 | to decide whether to continue to block or not. | |
fea681da MK |
1379 | .TP |
1380 | .B FUTEX_WAKE | |
bdc5957a | 1381 | Returns the number of waiters that were woken up. |
fea681da MK |
1382 | .TP |
1383 | .B FUTEX_FD | |
1384 | Returns the new file descriptor associated with the futex. | |
1385 | .TP | |
1386 | .B FUTEX_REQUEUE | |
bdc5957a | 1387 | Returns the number of waiters that were woken up. |
fea681da MK |
1388 | .TP |
1389 | .B FUTEX_CMP_REQUEUE | |
bdc5957a | 1390 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1391 | requeued to the futex for the futex word at |
3dfcc11d MK |
1392 | .IR uaddr2 . |
1393 | If this value is greater than | |
1394 | .IR val , | |
fd105614 | 1395 | then the difference is the number of waiters requeued to the futex for the |
4c8cb0ff | 1396 | futex word at |
3dfcc11d | 1397 | .IR uaddr2 . |
dcad19c0 MK |
1398 | .TP |
1399 | .B FUTEX_WAKE_OP | |
a8b5b324 | 1400 | Returns the total number of waiters that were woken up. |
4c8cb0ff MK |
1401 | This is the sum of the woken waiters on the two futexes for |
1402 | the futex words at | |
a8b5b324 MK |
1403 | .I uaddr |
1404 | and | |
1405 | .IR uaddr2 . | |
dcad19c0 MK |
1406 | .TP |
1407 | .B FUTEX_WAIT_BITSET | |
077981d4 MK |
1408 | Returns 0 if the caller was woken up. |
1409 | See | |
4b35dc5d TR |
1410 | .B FUTEX_WAIT |
1411 | for how to interpret this correctly in practice. | |
dcad19c0 MK |
1412 | .TP |
1413 | .B FUTEX_WAKE_BITSET | |
bdc5957a | 1414 | Returns the number of waiters that were woken up. |
dcad19c0 MK |
1415 | .TP |
1416 | .B FUTEX_LOCK_PI | |
bf02a260 | 1417 | Returns 0 if the futex was successfully locked. |
dcad19c0 | 1418 | .TP |
e79977ae KK |
1419 | .B FUTEX_LOCK_PI2 |
1420 | Returns 0 if the futex was successfully locked. | |
1421 | .TP | |
dcad19c0 | 1422 | .B FUTEX_TRYLOCK_PI |
5c716eef | 1423 | Returns 0 if the futex was successfully locked. |
dcad19c0 MK |
1424 | .TP |
1425 | .B FUTEX_UNLOCK_PI | |
52bb928f | 1426 | Returns 0 if the futex was successfully unlocked. |
dcad19c0 MK |
1427 | .TP |
1428 | .B FUTEX_CMP_REQUEUE_PI | |
bdc5957a | 1429 | Returns the total number of waiters that were woken up or |
4b35dc5d | 1430 | requeued to the futex for the futex word at |
dddd395a MK |
1431 | .IR uaddr2 . |
1432 | If this value is greater than | |
1433 | .IR val , | |
4c8cb0ff MK |
1434 | then difference is the number of waiters requeued to the futex for |
1435 | the futex word at | |
dddd395a | 1436 | .IR uaddr2 . |
dcad19c0 MK |
1437 | .TP |
1438 | .B FUTEX_WAIT_REQUEUE_PI | |
4c8cb0ff MK |
1439 | Returns 0 if the caller was successfully requeued to the futex for |
1440 | the futex word at | |
22c15de9 | 1441 | .IR uaddr2 . |
70b06b90 MK |
1442 | .\" |
1443 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1444 | .\" | |
fea681da MK |
1445 | .SH ERRORS |
1446 | .TP | |
1447 | .B EACCES | |
4b35dc5d | 1448 | No read access to the memory of a futex word. |
fea681da MK |
1449 | .TP |
1450 | .B EAGAIN | |
f48516d1 | 1451 | .RB ( FUTEX_WAIT , |
4b35dc5d | 1452 | .BR FUTEX_WAIT_BITSET , |
f48516d1 | 1453 | .BR FUTEX_WAIT_REQUEUE_PI ) |
badbf70c MK |
1454 | The value pointed to by |
1455 | .I uaddr | |
1456 | was not equal to the expected value | |
1457 | .I val | |
1458 | at the time of the call. | |
efeece04 | 1459 | .IP |
9732dd8b MK |
1460 | .BR Note : |
1461 | on Linux, the symbolic names | |
1462 | .B EAGAIN | |
1463 | and | |
1464 | .B EWOULDBLOCK | |
77da5feb | 1465 | (both of which appear in different parts of the kernel futex code) |
9732dd8b | 1466 | have the same value. |
badbf70c MK |
1467 | .TP |
1468 | .B EAGAIN | |
8f2068bb MK |
1469 | .RB ( FUTEX_CMP_REQUEUE , |
1470 | .BR FUTEX_CMP_REQUEUE_PI ) | |
ce5602fd | 1471 | The value pointed to by |
9f6c40c0 МК |
1472 | .I uaddr |
1473 | is not equal to the expected value | |
1474 | .IR val3 . | |
fea681da | 1475 | .TP |
eeeee811 | 1476 | .B EAGAIN |
5662f56a | 1477 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1478 | .BR FUTEX_LOCK_PI2 , |
aaec9032 MK |
1479 | .BR FUTEX_TRYLOCK_PI , |
1480 | .BR FUTEX_CMP_REQUEUE_PI ) | |
1481 | The futex owner thread ID of | |
1482 | .I uaddr | |
1483 | (for | |
1484 | .BR FUTEX_CMP_REQUEUE_PI : | |
1485 | .IR uaddr2 ) | |
1486 | is about to exit, | |
5662f56a MK |
1487 | but has not yet handled the internal state cleanup. |
1488 | Try again. | |
1489 | .TP | |
eeeee811 | 1490 | .B EDEADLK |
7a39e745 | 1491 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1492 | .BR FUTEX_LOCK_PI2 , |
9732dd8b MK |
1493 | .BR FUTEX_TRYLOCK_PI , |
1494 | .BR FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1495 | The futex word at |
7a39e745 MK |
1496 | .I uaddr |
1497 | is already locked by the caller. | |
1498 | .TP | |
1ae6b2c7 | 1499 | .B EDEADLK |
c3875d1d | 1500 | .\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some |
d6bb5a38 | 1501 | .\" places, and EDEADLOCK in others. On almost all architectures |
4c8cb0ff MK |
1502 | .\" these constants are synonymous. Is there a reason that both |
1503 | .\" names are used? | |
8297383e MK |
1504 | .\" |
1505 | .\" tglx (July 2015): "No. We should probably fix that." | |
1506 | .\" | |
662c0da8 | 1507 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
4b35dc5d | 1508 | While requeueing a waiter to the PI futex for the futex word at |
662c0da8 MK |
1509 | .IR uaddr2 , |
1510 | the kernel detected a deadlock. | |
1511 | .TP | |
fea681da | 1512 | .B EFAULT |
1ea901e8 MK |
1513 | A required pointer argument (i.e., |
1514 | .IR uaddr , | |
1515 | .IR uaddr2 , | |
1516 | or | |
1517 | .IR timeout ) | |
496df304 | 1518 | did not point to a valid user-space address. |
fea681da | 1519 | .TP |
9f6c40c0 | 1520 | .B EINTR |
e808bba0 | 1521 | A |
9f6c40c0 | 1522 | .B FUTEX_WAIT |
2674f781 MK |
1523 | or |
1524 | .B FUTEX_WAIT_BITSET | |
e808bba0 | 1525 | operation was interrupted by a signal (see |
f529fd20 | 1526 | .BR signal (7)). |
b324e17d | 1527 | Before Linux 2.6.22, this error could also be returned for |
b5fff4ea | 1528 | a spurious wakeup; since Linux 2.6.22, this no longer happens. |
9f6c40c0 | 1529 | .TP |
fea681da | 1530 | .B EINVAL |
180f97b7 | 1531 | The operation in |
eeeee811 | 1532 | .I futex_op |
180f97b7 | 1533 | is one of those that employs a timeout, but the supplied |
fb2f4c27 MK |
1534 | .I timeout |
1535 | argument was invalid | |
1536 | .RI ( tv_sec | |
1537 | was less than zero, or | |
eeeee811 | 1538 | .I tv_nsec |
cabee29d | 1539 | was not less than 1,000,000,000). |
fb2f4c27 MK |
1540 | .TP |
1541 | .B EINVAL | |
0c74df0b | 1542 | The operation specified in |
eeeee811 | 1543 | .I futex_op |
0c74df0b | 1544 | employs one or both of the pointers |
51ee94be | 1545 | .I uaddr |
a1f47699 | 1546 | and |
0c74df0b | 1547 | .IR uaddr2 , |
36546c38 | 1548 | but one of these does not point to a valid object\[em]that is, |
0c74df0b | 1549 | the address is not four-byte-aligned. |
51ee94be MK |
1550 | .TP |
1551 | .B EINVAL | |
55cc422d TG |
1552 | .RB ( FUTEX_WAIT_BITSET , |
1553 | .BR FUTEX_WAKE_BITSET ) | |
5e1456d4 | 1554 | The bit mask supplied in |
eeeee811 | 1555 | .I val3 |
79c9b436 TG |
1556 | is zero. |
1557 | .TP | |
1558 | .B EINVAL | |
2abcba67 | 1559 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
add875c0 MK |
1560 | .I uaddr |
1561 | equals | |
eeeee811 | 1562 | .I uaddr2 |
add875c0 MK |
1563 | (i.e., an attempt was made to requeue to the same futex). |
1564 | .TP | |
eeeee811 | 1565 | .B EINVAL |
ff597681 MK |
1566 | .RB ( FUTEX_FD ) |
1567 | The signal number supplied in | |
1568 | .I val | |
1569 | is invalid. | |
1570 | .TP | |
6bac3b85 | 1571 | .B EINVAL |
476debd7 MK |
1572 | .RB ( FUTEX_WAKE , |
1573 | .BR FUTEX_WAKE_OP , | |
1574 | .BR FUTEX_WAKE_BITSET , | |
1575 | .BR FUTEX_REQUEUE , | |
1576 | .BR FUTEX_CMP_REQUEUE ) | |
1577 | The kernel detected an inconsistency between the user-space state at | |
1578 | .I uaddr | |
36546c38 | 1579 | and the kernel state\[em]that is, it detected a waiter which waits in |
eeeee811 | 1580 | .B FUTEX_LOCK_PI |
e79977ae | 1581 | or |
eeeee811 | 1582 | .B FUTEX_LOCK_PI2 |
476debd7 MK |
1583 | on |
1584 | .IR uaddr . | |
1585 | .TP | |
1586 | .B EINVAL | |
a218ef20 | 1587 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1588 | .BR FUTEX_LOCK_PI2 , |
ce022f18 MK |
1589 | .BR FUTEX_TRYLOCK_PI , |
1590 | .BR FUTEX_UNLOCK_PI ) | |
a218ef20 MK |
1591 | The kernel detected an inconsistency between the user-space state at |
1592 | .I uaddr | |
1593 | and the kernel state. | |
ce022f18 | 1594 | This indicates either state corruption |
ce022f18 | 1595 | or that the kernel found a waiter on |
a218ef20 MK |
1596 | .I uaddr |
1597 | which is waiting via | |
eeeee811 | 1598 | .B FUTEX_WAIT |
a218ef20 MK |
1599 | or |
1600 | .BR FUTEX_WAIT_BITSET . | |
1601 | .TP | |
1602 | .B EINVAL | |
f9250b1a MK |
1603 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1604 | The kernel detected an inconsistency between the user-space state at | |
99c0041d MK |
1605 | .I uaddr2 |
1606 | and the kernel state; | |
ee65b0e8 MK |
1607 | .\" From a conversation with Thomas Gleixner (Aug 2015): ### |
1608 | .\" The kernel sees: I have non PI state for a futex you tried to | |
1609 | .\" tell me was PI | |
99c0041d | 1610 | that is, the kernel detected a waiter which waits via |
eeeee811 | 1611 | .B FUTEX_WAIT |
8297383e | 1612 | or |
eeeee811 | 1613 | .B FUTEX_WAIT_BITSET |
99c0041d MK |
1614 | on |
1615 | .IR uaddr2 . | |
1616 | .TP | |
1617 | .B EINVAL | |
1618 | .RB ( FUTEX_CMP_REQUEUE_PI ) | |
1619 | The kernel detected an inconsistency between the user-space state at | |
f9250b1a MK |
1620 | .I uaddr |
1621 | and the kernel state; | |
1622 | that is, the kernel detected a waiter which waits via | |
eeeee811 | 1623 | .B FUTEX_WAIT |
99c0041d | 1624 | or |
eeeee811 | 1625 | .B FUTEX_WAIT_BITSET |
f9250b1a MK |
1626 | on |
1627 | .IR uaddr . | |
1628 | .TP | |
1629 | .B EINVAL | |
99c0041d | 1630 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
75299c8d MK |
1631 | The kernel detected an inconsistency between the user-space state at |
1632 | .I uaddr | |
1633 | and the kernel state; | |
1634 | that is, the kernel detected a waiter which waits on | |
1635 | .I uaddr | |
1636 | via | |
eeeee811 | 1637 | .B FUTEX_LOCK_PI |
e79977ae | 1638 | or |
eeeee811 | 1639 | .B FUTEX_LOCK_PI2 |
75299c8d MK |
1640 | (instead of |
1641 | .BR FUTEX_WAIT_REQUEUE_PI ). | |
99c0041d MK |
1642 | .TP |
1643 | .B EINVAL | |
9786b3ca | 1644 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
8297383e MK |
1645 | .\" This deals with the case: |
1646 | .\" wait_requeue_pi(A, B); | |
1647 | .\" requeue_pi(A, C); | |
9786b3ca MK |
1648 | An attempt was made to requeue a waiter to a futex other than that |
1649 | specified by the matching | |
1650 | .B FUTEX_WAIT_REQUEUE_PI | |
1651 | call for that waiter. | |
1652 | .TP | |
1653 | .B EINVAL | |
f0c0d61c MK |
1654 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
1655 | The | |
1656 | .I val | |
1657 | argument is not 1. | |
1658 | .TP | |
1659 | .B EINVAL | |
4832b48a | 1660 | Invalid argument. |
fea681da | 1661 | .TP |
d07d4ef3 MK |
1662 | .B ENFILE |
1663 | .RB ( FUTEX_FD ) | |
1664 | The system-wide limit on the total number of open files has been reached. | |
1665 | .TP | |
eeeee811 | 1666 | .B ENOMEM |
a449c634 | 1667 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1668 | .BR FUTEX_LOCK_PI2 , |
e34a8fb6 MK |
1669 | .BR FUTEX_TRYLOCK_PI , |
1670 | .BR FUTEX_CMP_REQUEUE_PI ) | |
a449c634 MK |
1671 | The kernel could not allocate memory to hold state information. |
1672 | .TP | |
4701fc28 MK |
1673 | .B ENOSYS |
1674 | Invalid operation specified in | |
d33602c4 | 1675 | .IR futex_op . |
9f6c40c0 | 1676 | .TP |
4a7e5b05 MK |
1677 | .B ENOSYS |
1678 | The | |
eeeee811 | 1679 | .B FUTEX_CLOCK_REALTIME |
4a7e5b05 | 1680 | option was specified in |
1afcee7c | 1681 | .IR futex_op , |
4a7e5b05 | 1682 | but the accompanying operation was neither |
017d194b MK |
1683 | .BR FUTEX_WAIT , |
1684 | .BR FUTEX_WAIT_BITSET , | |
e79977ae | 1685 | .BR FUTEX_WAIT_REQUEUE_PI , |
4a7e5b05 | 1686 | nor |
e79977ae | 1687 | .BR FUTEX_LOCK_PI2 . |
4a7e5b05 | 1688 | .TP |
eeeee811 | 1689 | .B ENOSYS |
a9dcb4d1 | 1690 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1691 | .BR FUTEX_LOCK_PI2 , |
f2424fae | 1692 | .BR FUTEX_TRYLOCK_PI , |
4945ff19 | 1693 | .BR FUTEX_UNLOCK_PI , |
4cf92894 | 1694 | .BR FUTEX_CMP_REQUEUE_PI , |
794bb106 | 1695 | .BR FUTEX_WAIT_REQUEUE_PI ) |
4b35dc5d | 1696 | A run-time check determined that the operation is not available. |
f0a9e8f4 | 1697 | The PI-futex operations are not implemented on all architectures and |
077981d4 | 1698 | are not supported on some CPU variants. |
a9dcb4d1 | 1699 | .TP |
eeeee811 | 1700 | .B EPERM |
c7589177 | 1701 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1702 | .BR FUTEX_LOCK_PI2 , |
dc2742a8 MK |
1703 | .BR FUTEX_TRYLOCK_PI , |
1704 | .BR FUTEX_CMP_REQUEUE_PI ) | |
04331c3f | 1705 | The caller is not allowed to attach itself to the futex at |
dc2742a8 MK |
1706 | .I uaddr |
1707 | (for | |
1708 | .BR FUTEX_CMP_REQUEUE_PI : | |
1709 | the futex at | |
1710 | .IR uaddr2 ). | |
c7589177 MK |
1711 | (This may be caused by a state corruption in user space.) |
1712 | .TP | |
eeeee811 | 1713 | .B EPERM |
87276709 | 1714 | .RB ( FUTEX_UNLOCK_PI ) |
4b35dc5d | 1715 | The caller does not own the lock represented by the futex word. |
76f347ba | 1716 | .TP |
eeeee811 | 1717 | .B ESRCH |
0b0e4934 | 1718 | .RB ( FUTEX_LOCK_PI , |
e79977ae | 1719 | .BR FUTEX_LOCK_PI2 , |
9732dd8b MK |
1720 | .BR FUTEX_TRYLOCK_PI , |
1721 | .BR FUTEX_CMP_REQUEUE_PI ) | |
4b35dc5d | 1722 | The thread ID in the futex word at |
0b0e4934 MK |
1723 | .I uaddr |
1724 | does not exist. | |
1725 | .TP | |
eeeee811 | 1726 | .B ESRCH |
360f773c | 1727 | .RB ( FUTEX_CMP_REQUEUE_PI ) |
4b35dc5d | 1728 | The thread ID in the futex word at |
360f773c MK |
1729 | .I uaddr2 |
1730 | does not exist. | |
1731 | .TP | |
9f6c40c0 | 1732 | .B ETIMEDOUT |
4d85047f | 1733 | The operation in |
eeeee811 | 1734 | .I futex_op |
4d85047f MK |
1735 | employed the timeout specified in |
1736 | .IR timeout , | |
1737 | and the timeout expired before the operation completed. | |
70b06b90 MK |
1738 | .\" |
1739 | .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" | |
1740 | .\" | |
4131356c AC |
1741 | .SH STANDARDS |
1742 | Linux. | |
1743 | .SH HISTORY | |
1744 | Linux 2.6.0. | |
c6d039a3 | 1745 | .P |
4c8cb0ff MK |
1746 | Initial futex support was merged in Linux 2.5.7 but with different |
1747 | semantics from what was described above. | |
52dee70e | 1748 | A four-argument system call with the semantics |
fd3fa7ef | 1749 | described in this page was introduced in Linux 2.5.40. |
d0442d14 MK |
1750 | A fifth argument was added in Linux 2.5.70, |
1751 | and a sixth argument was added in Linux 2.6.7. | |
a14af333 | 1752 | .SH EXAMPLES |
bc54ed38 MK |
1753 | The program below demonstrates use of futexes in a program where a parent |
1754 | process and a child process use a pair of futexes located inside a | |
305cc415 MK |
1755 | shared anonymous mapping to synchronize access to a shared resource: |
1756 | the terminal. | |
1757 | The two processes each write | |
1ae6b2c7 | 1758 | .I nloops |
305cc415 MK |
1759 | (a command-line argument that defaults to 5 if omitted) |
1760 | messages to the terminal and employ a synchronization protocol | |
1761 | that ensures that they alternate in writing messages. | |
1762 | Upon running this program we see output such as the following: | |
c6d039a3 | 1763 | .P |
305cc415 | 1764 | .in +4n |
b76974c1 | 1765 | .EX |
305cc415 MK |
1766 | $ \fB./futex_demo\fP |
1767 | Parent (18534) 0 | |
1768 | Child (18535) 0 | |
1769 | Parent (18534) 1 | |
1770 | Child (18535) 1 | |
1771 | Parent (18534) 2 | |
1772 | Child (18535) 2 | |
1773 | Parent (18534) 3 | |
1774 | Child (18535) 3 | |
1775 | Parent (18534) 4 | |
1776 | Child (18535) 4 | |
b76974c1 | 1777 | .EE |
305cc415 MK |
1778 | .in |
1779 | .SS Program source | |
1780 | \& | |
33857069 | 1781 | .\" SRC BEGIN (futex.c) |
e7d0bb47 | 1782 | .EX |
305cc415 | 1783 | /* futex_demo.c |
fe5dba13 | 1784 | \& |
305cc415 MK |
1785 | Usage: futex_demo [nloops] |
1786 | (Default: 5) | |
fe5dba13 | 1787 | \& |
305cc415 MK |
1788 | Demonstrate the use of futexes in a program where parent and child |
1789 | use a pair of futexes located inside a shared anonymous mapping to | |
1790 | synchronize access to a shared resource: the terminal. The two | |
b957f81f | 1791 | processes each write \[aq]num\-loops\[aq] messages to the terminal and employ |
305cc415 MK |
1792 | a synchronization protocol that ensures that they alternate in |
1793 | writing messages. | |
1794 | */ | |
1795 | #define _GNU_SOURCE | |
5a5208c1 | 1796 | #include <err.h> |
305cc415 | 1797 | #include <errno.h> |
4ae706b0 | 1798 | #include <linux/futex.h> |
915c4ba3 | 1799 | #include <stdatomic.h> |
8eb90116 | 1800 | #include <stdint.h> |
4ae706b0 | 1801 | #include <stdio.h> |
305cc415 | 1802 | #include <stdlib.h> |
305cc415 MK |
1803 | #include <sys/mman.h> |
1804 | #include <sys/syscall.h> | |
305cc415 | 1805 | #include <sys/time.h> |
4ae706b0 AC |
1806 | #include <sys/wait.h> |
1807 | #include <unistd.h> | |
fe5dba13 | 1808 | \& |
2253ecf0 | 1809 | static uint32_t *futex1, *futex2, *iaddr; |
fe5dba13 | 1810 | \& |
305cc415 | 1811 | static int |
2253ecf0 AC |
1812 | futex(uint32_t *uaddr, int futex_op, uint32_t val, |
1813 | const struct timespec *timeout, uint32_t *uaddr2, uint32_t val3) | |
305cc415 MK |
1814 | { |
1815 | return syscall(SYS_futex, uaddr, futex_op, val, | |
c1e04f01 | 1816 | timeout, uaddr2, val3); |
305cc415 | 1817 | } |
fe5dba13 | 1818 | \& |
b957f81f | 1819 | /* Acquire the futex pointed to by \[aq]futexp\[aq]: wait for its value to |
305cc415 | 1820 | become 1, and then set the value to 0. */ |
fe5dba13 | 1821 | \& |
305cc415 | 1822 | static void |
2253ecf0 | 1823 | fwait(uint32_t *futexp) |
305cc415 | 1824 | { |
041a155c AC |
1825 | long s; |
1826 | const uint32_t one = 1; | |
fe5dba13 | 1827 | \& |
915c4ba3 BP |
1828 | /* atomic_compare_exchange_strong(ptr, oldval, newval) |
1829 | atomically performs the equivalent of: | |
fe5dba13 | 1830 | \& |
915c4ba3 | 1831 | if (*ptr == *oldval) |
305cc415 | 1832 | *ptr = newval; |
fe5dba13 | 1833 | \& |
915c4ba3 | 1834 | It returns true if the test yielded true and *ptr was updated. */ |
fe5dba13 | 1835 | \& |
305cc415 | 1836 | while (1) { |
fe5dba13 | 1837 | \& |
63ad44cb | 1838 | /* Is the futex available? */ |
09e456c2 | 1839 | if (atomic_compare_exchange_strong(futexp, &one, 0)) |
305cc415 | 1840 | break; /* Yes */ |
fe5dba13 | 1841 | \& |
c6beb8a1 | 1842 | /* Futex is not available; wait. */ |
fe5dba13 | 1843 | \& |
63ad44cb HS |
1844 | s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); |
1845 | if (s == \-1 && errno != EAGAIN) | |
5a5208c1 | 1846 | err(EXIT_FAILURE, "futex\-FUTEX_WAIT"); |
305cc415 MK |
1847 | } |
1848 | } | |
fe5dba13 | 1849 | \& |
b957f81f | 1850 | /* Release the futex pointed to by \[aq]futexp\[aq]: if the futex currently |
a00c4698 | 1851 | has the value 0, set its value to 1 and then wake any futex waiters, |
e2c75104 | 1852 | so that if the peer is blocked in fwait(), it can proceed. */ |
fe5dba13 | 1853 | \& |
305cc415 | 1854 | static void |
2253ecf0 | 1855 | fpost(uint32_t *futexp) |
305cc415 | 1856 | { |
041a155c AC |
1857 | long s; |
1858 | const uint32_t zero = 0; | |
fe5dba13 | 1859 | \& |
68219aba | 1860 | /* atomic_compare_exchange_strong() was described |
c6beb8a1 | 1861 | in comments above. */ |
fe5dba13 | 1862 | \& |
09e456c2 | 1863 | if (atomic_compare_exchange_strong(futexp, &zero, 1)) { |
305cc415 MK |
1864 | s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); |
1865 | if (s == \-1) | |
5a5208c1 | 1866 | err(EXIT_FAILURE, "futex\-FUTEX_WAKE"); |
305cc415 MK |
1867 | } |
1868 | } | |
fe5dba13 | 1869 | \& |
305cc415 MK |
1870 | int |
1871 | main(int argc, char *argv[]) | |
1872 | { | |
b42296e4 AC |
1873 | pid_t childPid; |
1874 | unsigned int nloops; | |
fe5dba13 | 1875 | \& |
305cc415 | 1876 | setbuf(stdout, NULL); |
fe5dba13 | 1877 | \& |
305cc415 | 1878 | nloops = (argc > 1) ? atoi(argv[1]) : 5; |
fe5dba13 | 1879 | \& |
305cc415 MK |
1880 | /* Create a shared anonymous mapping that will hold the futexes. |
1881 | Since the futexes are being shared between processes, we | |
1882 | subsequently use the "shared" futex operations (i.e., not the | |
c6beb8a1 | 1883 | ones suffixed "_PRIVATE"). */ |
fe5dba13 | 1884 | \& |
d60a7a9a | 1885 | iaddr = mmap(NULL, sizeof(*iaddr) * 2, PROT_READ | PROT_WRITE, |
4687ab0e | 1886 | MAP_ANONYMOUS | MAP_SHARED, \-1, 0); |
305cc415 | 1887 | if (iaddr == MAP_FAILED) |
5a5208c1 | 1888 | err(EXIT_FAILURE, "mmap"); |
fe5dba13 | 1889 | \& |
305cc415 MK |
1890 | futex1 = &iaddr[0]; |
1891 | futex2 = &iaddr[1]; | |
fe5dba13 | 1892 | \& |
305cc415 MK |
1893 | *futex1 = 0; /* State: unavailable */ |
1894 | *futex2 = 1; /* State: available */ | |
fe5dba13 | 1895 | \& |
305cc415 | 1896 | /* Create a child process that inherits the shared anonymous |
c6beb8a1 | 1897 | mapping. */ |
fe5dba13 | 1898 | \& |
305cc415 | 1899 | childPid = fork(); |
92a46690 | 1900 | if (childPid == \-1) |
5a5208c1 | 1901 | err(EXIT_FAILURE, "fork"); |
fe5dba13 | 1902 | \& |
305cc415 | 1903 | if (childPid == 0) { /* Child */ |
b42296e4 | 1904 | for (unsigned int j = 0; j < nloops; j++) { |
305cc415 | 1905 | fwait(futex1); |
b42296e4 | 1906 | printf("Child (%jd) %u\en", (intmax_t) getpid(), j); |
305cc415 MK |
1907 | fpost(futex2); |
1908 | } | |
fe5dba13 | 1909 | \& |
305cc415 MK |
1910 | exit(EXIT_SUCCESS); |
1911 | } | |
fe5dba13 | 1912 | \& |
c6beb8a1 | 1913 | /* Parent falls through to here. */ |
fe5dba13 | 1914 | \& |
b42296e4 | 1915 | for (unsigned int j = 0; j < nloops; j++) { |
305cc415 | 1916 | fwait(futex2); |
b42296e4 | 1917 | printf("Parent (%jd) %u\en", (intmax_t) getpid(), j); |
305cc415 MK |
1918 | fpost(futex1); |
1919 | } | |
fe5dba13 | 1920 | \& |
305cc415 | 1921 | wait(NULL); |
fe5dba13 | 1922 | \& |
305cc415 MK |
1923 | exit(EXIT_SUCCESS); |
1924 | } | |
e7d0bb47 | 1925 | .EE |
33857069 | 1926 | .\" SRC END |
47297adb | 1927 | .SH SEE ALSO |
4c222281 | 1928 | .ad l |
9913033c | 1929 | .BR get_robust_list (2), |
d806bc05 | 1930 | .BR restart_syscall (2), |
e0074751 | 1931 | .BR pthread_mutexattr_getprotocol (3), |
ac894879 MK |
1932 | .BR futex (7), |
1933 | .BR sched (7) | |
c6d039a3 | 1934 | .P |
f5ad572f | 1935 | The following kernel source files: |
cdede5cd | 1936 | .IP \[bu] 3 |
b49c2acb | 1937 | .I Documentation/pi\-futex.txt |
cdede5cd | 1938 | .IP \[bu] |
b49c2acb | 1939 | .I Documentation/futex\-requeue\-pi.txt |
cdede5cd | 1940 | .IP \[bu] |
b49c2acb | 1941 | .I Documentation/locking/rt\-mutex.txt |
cdede5cd | 1942 | .IP \[bu] |
b49c2acb | 1943 | .I Documentation/locking/rt\-mutex\-design.txt |
cdede5cd | 1944 | .IP \[bu] |
b49c2acb | 1945 | .I Documentation/robust\-futex\-ABI.txt |
c6d039a3 | 1946 | .P |
4c222281 | 1947 | Franke, H., Russell, R., and Kirwood, M., 2002. |
52087dd3 | 1948 | \fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP |
4c222281 | 1949 | (from proceedings of the Ottawa Linux Symposium 2002), |
9b936e9e | 1950 | .br |
5465ae95 | 1951 | .UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf |
608bf950 | 1952 | .UE |
c6d039a3 | 1953 | .P |
4c222281 | 1954 | Hart, D., 2009. \fIA futex overview and update\fP, |
2ed26199 MK |
1955 | .UR http://lwn.net/Articles/360699/ |
1956 | .UE | |
c6d039a3 | 1957 | .P |
8fb01fde | 1958 | Hart, D.\& and Guniguntala, D., 2009. |
75c018a1 | 1959 | \fIRequeue-PI: Making glibc Condvars PI-Aware\fP |
4c222281 | 1960 | (from proceedings of the 2009 Real-Time Linux Workshop), |
0483b6cc MK |
1961 | .UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf |
1962 | .UE | |
c6d039a3 | 1963 | .P |
4c222281 | 1964 | Drepper, U., 2011. \fIFutexes Are Tricky\fP, |
f42eb21b MK |
1965 | .UR http://www.akkadia.org/drepper/futex.pdf |
1966 | .UE | |
c6d039a3 | 1967 | .P |
6693a21c | 1968 | Futex example library, futex\-*.tar.bz2 at |
9b936e9e | 1969 | .br |
99ef5aaf | 1970 | .UR https://mirrors.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/ |
608bf950 | 1971 | .UE |
34f14794 | 1972 | .\" |
74f58a64 | 1973 | .\" FIXME(Torvald) We should probably refer to the glibc code here, in |
9915ea23 MK |
1974 | .\" particular the glibc-internal futex wrapper functions that are |
1975 | .\" WIP, and the generic pthread_mutex_t and perhaps condvar | |
1976 | .\" implementations. |