.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

- Sleeping locks
- CPU local locks
- Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock(). Furthermore, it is also necessary to evaluate the debugging
versions of these primitives. In short, don't acquire sleeping locks from
other contexts unless there is no other option.
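
As an illustration, a minimal sketch of sleeping-lock usage in
preemptible task context (foo_mutex is a made-up name)::

  static DEFINE_MUTEX(foo_mutex);

  mutex_lock(&foo_mutex);
  /* Preemptible critical section; sleeping is allowed here. */
  mutex_unlock(&foo_mutex);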

Sleeping lock types:

- mutex
- rt_mutex
- semaphore
- rw_semaphore
- ww_mutex
- percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

- local_lock
- spinlock_t
- rwlock_t


CPU local locks
---------------

- local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts is a purely CPU-local
concurrency control mechanism and is not suited for inter-CPU concurrency
control.


Spinning locks
--------------

- raw_spinlock_t
- bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

- spinlock_t
- rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================
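
As an illustration, a minimal sketch of the suffix variants on a
non-PREEMPT_RT kernel (the lock name is made up)::

  static DEFINE_SPINLOCK(foo_lock);
  unsigned long flags;

  spin_lock_bh(&foo_lock);	/* also disables bottom halves */
  spin_unlock_bh(&foo_lock);

  spin_lock_irq(&foo_lock);	/* also disables interrupts */
  spin_unlock_irq(&foo_lock);

  /* saves the interrupt state, then disables interrupts */
  spin_lock_irqsave(&foo_lock, flags);
  spin_unlock_irqrestore(&foo_lock, flags);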


Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.
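
As an illustration, a minimal sketch of the non-owner reader release
interface (the rwsem name is made up); the read side acquired by one
task may be released by another::

  static DECLARE_RWSEM(foo_rwsem);

  /* Task A acquires the read side ... */
  down_read_non_owner(&foo_rwsem);

  /* ... and task B may legitimately release it. */
  up_read_non_owner(&foo_rwsem);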


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts. This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
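
As an illustration, a minimal sketch of direct rt_mutex usage (the lock
name is made up); while a higher-priority task waits on the lock, the
owner's priority is boosted until it releases the lock::

  static DEFINE_RT_MUTEX(foo_rtmutex);

  rt_mutex_lock(&foo_rtmutex);
  /* Critical section; the owner may be priority-boosted by waiters. */
  rt_mutex_unlock(&foo_rtmutex);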


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.
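
As an illustration, a minimal sketch of the preferred split (all names
are made up): a mutex for serialization and a completion for waiting,
instead of pressing one semaphore into both roles::

  static DEFINE_MUTEX(foo_mutex);	/* serialization */
  static DECLARE_COMPLETION(foo_done);	/* waiting */

  /* The waiting side blocks until the event is signaled. */
  wait_for_completion(&foo_done);

  /* The signaling side wakes the waiter. */
  complete(&foo_done);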

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores. After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

Because an rw_semaphore writer cannot grant its priority to multiple
readers, a preempted low-priority reader will continue holding its lock,
thus starving even high-priority writers. In contrast, because readers
can grant their priority to a writer, a preempted low-priority writer will
have its priority boosted until it releases the lock, thus preventing that
writer from starving readers.


local_lock
==========

local_lock provides a named scope to critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

 ===============================  ======================
 local_lock(&llock)               preempt_disable()
 local_unlock(&llock)             preempt_enable()
 local_lock_irq(&llock)           local_irq_disable()
 local_unlock_irq(&llock)         local_irq_enable()
 local_lock_irqsave(&llock)       local_irq_save()
 local_unlock_irqrestore(&llock)  local_irq_restore()
 ===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives:

- The lock name allows static analysis and also clearly documents the
  protection scope, while the regular primitives are scopeless and
  opaque.

- If lockdep is enabled, the local_lock gains a lockmap which allows the
  correctness of the protection to be validated. This can detect cases
  where e.g. a function using preempt_disable() as the protection
  mechanism is invoked from interrupt or soft-interrupt context. Aside
  from that, lockdep_assert_held(&llock) works as with any other locking
  primitive.
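
As an illustration, a minimal sketch of local_lock (from
<linux/local_lock.h>) protecting per-CPU data; the structure and names
are made up::

  struct foo_pcpu {
	local_lock_t lock;
	int counter;
  };

  static DEFINE_PER_CPU(struct foo_pcpu, foo_pcpu) = {
	.lock = INIT_LOCAL_LOCK(lock),
  };

  void foo_count(void)
  {
	/* Named scope instead of a bare preempt_disable(). */
	local_lock(&foo_pcpu.lock);
	this_cpu_inc(foo_pcpu.counter);
	local_unlock(&foo_pcpu.lock);
  }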

local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

- All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT-specific spinlock_t semantics.


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
core code, low-level interrupt handling and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state. raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

- Preemption is not disabled.

- The hard interrupt related suffixes for spin_lock / spin_unlock
  operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
  interrupt disabled state.

- The soft interrupt related suffix (_bh()) still disables softirq
  handlers.

  Non-PREEMPT_RT kernels disable preemption to get this effect.

  PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
  preemption enabled. The lock disables softirq handlers and also
  prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

- Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
  avoid migration by disabling preemption. PREEMPT_RT kernels instead
  disable migration, which ensures that pointers to per-CPU variables
  remain valid even if the task is preempted.

- Task state is preserved across spinlock acquisition, ensuring that the
  task-state rules apply to all kernel configurations. Non-PREEMPT_RT
  kernels leave task state untouched. However, PREEMPT_RT must change
  task state if the task blocks during acquisition. Therefore, it saves
  the current task state before blocking and the corresponding lock wakeup
  restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
    lock()
      block()
        task->saved_state = task->state
        task->state = TASK_UNINTERRUPTIBLE
        schedule()
        lock wakeup
          task->state = task->saved_state

  Other types of wakeups would normally unconditionally set the task state
  to RUNNING, but that does not work here because the task must remain
  blocked until the lock becomes available. Therefore, when a non-lock
  wakeup attempts to awaken a task blocked waiting for a spinlock, it
  instead sets the saved state to RUNNING. Then, when the lock
  acquisition completes, the lock wakeup sets the task state to the saved
  state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
    lock()
      block()
        task->saved_state = task->state
        task->state = TASK_UNINTERRUPTIBLE
        schedule()
        non lock wakeup
          task->saved_state = TASK_RUNNING

        lock wakeup
          task->state = task->saved_state

  This ensures that the real wakeup cannot be lost.


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.
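
As an illustration, a minimal sketch of the read and write sides (the
lock name is made up)::

  static DEFINE_RWLOCK(foo_rwlock);

  /* Multiple readers may hold the lock concurrently. */
  read_lock(&foo_rwlock);
  read_unlock(&foo_rwlock);

  /* A writer excludes readers and other writers. */
  write_lock(&foo_rwlock);
  write_unlock(&foo_rwlock);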

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

- All the spinlock_t changes also apply to rwlock_t.

- Because an rwlock_t writer cannot grant its priority to multiple
  readers, a preempted low-priority reader will continue holding its lock,
  thus starving even high-priority writers. In contrast, because readers
  can grant their priority to a writer, a preempted low-priority writer
  will have its priority boosted until it releases the lock, thus
  preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications. For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

  raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which neither disables interrupts nor
preemption. The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope. So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }


spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);
  spin_unlock(&p->lock);
  put_cpu_ptr(&var1);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);
  spin_unlock(&p->lock);
  migrate_enable();

migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on
the same CPU while the task remains preemptible.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();
    migrate_enable();
  }

This breaks because migrate_disable() does not protect against reentrancy
from a preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();
    local_unlock(&foo_lock);
  }

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t, for example, the critical section must avoid
allocating memory. Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.
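
As an illustration, a minimal sketch of the bit spinlock interface from
<linux/bit_spinlock.h> (the flags word and bit number are made up)::

  unsigned long flags_word;

  /* Spins until bit 0 is acquired; preemption is disabled meanwhile. */
  bit_spin_lock(0, &flags_word);
  bit_spin_unlock(0, &flags_word);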

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

- Lock types of the same lock category (sleeping, CPU local, spinning)
  can nest arbitrarily as long as they respect the general lock ordering
  rules to prevent deadlocks.

- Sleeping lock types cannot nest inside CPU local and spinning lock types.

- CPU local and spinning lock types can nest inside sleeping lock types.

- Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock. This results in the following nesting ordering:

1) Sleeping locks
2) spinlock_t, rwlock_t, local_lock
3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
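
As an illustration, a minimal sketch of a correctly ordered nesting (all
lock names are made up); outer locks must belong to the same or a higher
category than inner ones::

  static DEFINE_MUTEX(m);		/* 1) sleeping lock  */
  static DEFINE_SPINLOCK(s);		/* 2) spinlock_t     */
  static DEFINE_RAW_SPINLOCK(r);	/* 3) raw_spinlock_t */

  mutex_lock(&m);
  spin_lock(&s);
  raw_spin_lock(&r);
  /* ... */
  raw_spin_unlock(&r);
  spin_unlock(&s);
  mutex_unlock(&m);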