Commit | Line | Data |
---|---|---|
7dbf15c5 TH |
1 | ========= |
2 | Workqueue | |
3 | ========= | |
c54fce6e | 4 | |
e7f08ffb SF |
5 | :Date: September, 2010 |
6 | :Author: Tejun Heo <tj@kernel.org> | |
7 | :Author: Florian Mickler <florian@mickler.org> | |
c54fce6e TH |
8 | |
9 | ||
e7f08ffb SF |
10 | Introduction |
11 | ============ | |
c54fce6e TH |
12 | |
13 | There are many cases where an asynchronous process execution context | |
14 | is needed and the workqueue (wq) API is the most commonly used | |
15 | mechanism for such cases. | |
16 | ||
17 | When such an asynchronous execution context is needed, a work item | |
18 | describing which function to execute is put on a queue. An | |
19 | independent thread serves as the asynchronous execution context. The | |
20 | queue is called workqueue and the thread is called worker. | |
21 | ||
22 | While there are work items on the workqueue the worker executes the | |
23 | functions associated with the work items one after the other. When | |
24 | there is no work item left on the workqueue the worker becomes idle. | |
25 | When a new work item gets queued, the worker begins executing again. | |
26 | ||
27 | ||
7dbf15c5 TH |
28 | Why Concurrency Managed Workqueue? |
29 | ================================== | |
c54fce6e TH |
30 | |
31 | In the original wq implementation, a multi threaded (MT) wq had one | |
32 | worker thread per CPU and a single threaded (ST) wq had one worker | |
33 | thread system-wide. A single MT wq needed to keep around the same | |
34 | number of workers as the number of CPUs. The kernel grew a lot of MT | |
35 | wq users over the years and with the number of CPU cores continuously | |
36 | rising, some systems saturated the default 32k PID space just booting | |
37 | up. | |
38 | ||
39 | Although MT wq wasted a lot of resource, the level of concurrency | |
40 | provided was unsatisfactory. The limitation was common to both ST and | |
41 | MT wq albeit less severe on MT. Each wq maintained its own separate | |
47684e11 RD |
42 | worker pool. An MT wq could provide only one execution context per CPU |
43 | while an ST wq one for the whole system. Work items had to compete for | |
c54fce6e TH |
44 | those very limited execution contexts leading to various problems |
45 | including proneness to deadlocks around the single execution context. | |
46 | ||
47 | The tension between the provided level of concurrency and resource | |
48 | usage also forced its users to make unnecessary tradeoffs like libata | |
49 | choosing to use ST wq for polling PIOs and accepting an unnecessary | |
50 | limitation that no two polling PIOs can progress at the same time. As | |
51 | MT wq don't provide much better concurrency, users which require | |
52 | a higher level of concurrency, like async or fscache, had to implement | |
53 | their own thread pool. | |
54 | ||
55 | Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with | |
56 | focus on the following goals. | |
57 | ||
58 | * Maintain compatibility with the original workqueue API. | |
59 | ||
60 | * Use per-CPU unified worker pools shared by all wq to provide | |
61 | flexible level of concurrency on demand without wasting a lot of | |
62 | resource. | |
63 | ||
64 | * Automatically regulate worker pool and level of concurrency so that | |
65 | the API users don't need to worry about such details. | |
66 | ||
67 | ||
e7f08ffb SF |
68 | The Design |
69 | ========== | |
c54fce6e TH |
70 | |
71 | In order to ease the asynchronous execution of functions a new | |
72 | abstraction, the work item, is introduced. | |
73 | ||
74 | A work item is a simple struct that holds a pointer to the function | |
75 | that is to be executed asynchronously. Whenever a driver or subsystem | |
76 | wants a function to be executed asynchronously it has to set up a work | |
77 | item pointing to that function and queue that work item on a | |
78 | workqueue. | |
79 | ||
4cb1ef64 TH |
80 | A work item can be executed in either a thread or the BH (softirq) context. |
81 | ||
82 | For threaded workqueues, special purpose threads, called [k]workers, execute | |
83 | the functions off of the queue, one after the other. If no work is queued, | |
84 | the worker threads become idle. These worker threads are managed in | |
85 | worker-pools. | |
c54fce6e TH |
86 | |
87 | The cmwq design differentiates between the user-facing workqueues that | |
88 | subsystems and drivers queue work items on and the backend mechanism | |
546d30c4 | 89 | which manages worker-pools and processes the queued work items. |
c54fce6e | 90 | |
546d30c4 L |
91 | There are two worker-pools, one for normal work items and the other |
92 | for high priority ones, for each possible CPU and some extra | |
93 | worker-pools to serve work items queued on unbound workqueues - the | |
94 | number of these backing pools is dynamic. | |
c54fce6e | 95 | |
4cb1ef64 TH |
96 | BH workqueues use the same framework. However, as there can only be one |
97 | concurrent execution context, there's no need to worry about concurrency. | |
98 | Each per-CPU BH worker pool contains only one pseudo worker which represents | |
99 | the BH execution context. A BH workqueue can be considered a convenience | |
100 | interface to softirq. | |
101 | ||
c54fce6e TH |
102 | Subsystems and drivers can create and queue work items through special |
103 | workqueue API functions as they see fit. They can influence some | |
104 | aspects of the way the work items are executed by setting flags on the | |
105 | workqueue they are putting the work item on. These flags include | |
12076373 TH |
106 | things like CPU locality, concurrency limits, priority and more. To |
107 | get a detailed overview refer to the API description of | |
e7f08ffb | 108 | ``alloc_workqueue()`` below. |
c54fce6e | 109 | |
546d30c4 L |
110 | When a work item is queued to a workqueue, the target worker-pool is |
111 | determined according to the queue parameters and workqueue attributes | |
112 | and appended on the shared worklist of the worker-pool. For example, | |
113 | unless specifically overridden, a work item of a bound workqueue will | |
114 | be queued on the worklist of either normal or highpri worker-pool that | |
115 | is associated to the CPU the issuer is running on. | |
c54fce6e | 116 | |
4cb1ef64 | 117 | For any thread pool implementation, managing the concurrency level |
c54fce6e TH |
118 | (how many execution contexts are active) is an important issue. cmwq |
119 | tries to keep the concurrency at a minimal but sufficient level. | |
120 | Minimal to save resources and sufficient in that the system is used at | |
121 | its full capacity. | |
122 | ||
546d30c4 L |
123 | Each worker-pool bound to an actual CPU implements concurrency |
124 | management by hooking into the scheduler. The worker-pool is notified | |
3270476a TH |
125 | whenever an active worker wakes up or sleeps and keeps track of the |
126 | number of the currently runnable workers. Generally, work items are | |
127 | not expected to hog a CPU and consume many cycles. That means | |
128 | maintaining just enough concurrency to prevent work processing from | |
129 | stalling should be optimal. As long as there are one or more runnable | |
546d30c4 | 130 | workers on the CPU, the worker-pool doesn't start execution of a new |
3270476a TH |
131 | work, but, when the last running worker goes to sleep, it immediately |
132 | schedules a new worker so that the CPU doesn't sit idle while there | |
133 | are pending work items. This allows using a minimal number of workers | |
134 | without losing execution bandwidth. | |
c54fce6e TH |
135 | |
136 | Keeping idle workers around costs nothing other than the memory space | |
137 | for kthreads, so cmwq holds onto idle ones for a while before killing | |
138 | them. | |
139 | ||
546d30c4 L |
140 | For unbound workqueues, the number of backing pools is dynamic. |
141 | Unbound workqueue can be assigned custom attributes using | |
e7f08ffb | 142 | ``apply_workqueue_attrs()`` and workqueue will automatically create |
546d30c4 L |
143 | backing worker pools matching the attributes. The responsibility of |
144 | regulating concurrency level is on the users. There is also a flag to | |
145 | mark a bound wq to ignore the concurrency management. Please refer to | |
146 | the API section for details. | |
c54fce6e TH |
147 | |
148 | The forward progress guarantee relies on workers being created when | |
149 | more execution contexts are necessary, which in turn is guaranteed | |
150 | through the use of rescue workers. All work items which might be used | |
151 | on code paths that handle memory reclaim are required to be queued on | |
152 | wq's that have a rescue-worker reserved for execution under memory | |
546d30c4 | 153 | pressure. Else it is possible that the worker-pool deadlocks waiting |
c54fce6e TH |
154 | for execution contexts to free up. |
155 | ||
156 | ||
e7f08ffb SF |
157 | Application Programming Interface (API) |
158 | ======================================= | |
c54fce6e | 159 | |
e7f08ffb SF |
160 | ``alloc_workqueue()`` allocates a wq. The original |
161 | ``create_*workqueue()`` functions are deprecated and scheduled for | |
47684e11 | 162 | removal. ``alloc_workqueue()`` takes three arguments - ``@name``, |
e7f08ffb SF |
163 | ``@flags`` and ``@max_active``. ``@name`` is the name of the wq and |
164 | also used as the name of the rescuer thread if there is one. | |
c54fce6e TH |
165 | |
166 | A wq no longer manages execution resources but serves as a domain for | |
e7f08ffb SF |
167 | forward progress guarantee, flush and work item attributes. ``@flags`` |
168 | and ``@max_active`` control how work items are assigned execution | |
c54fce6e TH |
169 | resources, scheduled and executed. |
170 | ||
c54fce6e | 171 | |
e7f08ffb SF |
172 | ``flags`` |
173 | --------- | |
174 | ||
4cb1ef64 TH |
175 | ``WQ_BH`` |
176 | BH workqueues can be considered a convenience interface to softirq. BH | |
177 | workqueues are always per-CPU and all BH work items are executed in the | |
178 | queueing CPU's softirq context in the queueing order. | |
179 | ||
180 | All BH workqueues must have 0 ``max_active`` and ``WQ_HIGHPRI`` is the | |
181 | only allowed additional flag. | |
182 | ||
183 | BH work items cannot sleep. All other features such as delayed queueing, | |
184 | flushing and canceling are supported. | |
185 | ||
e7f08ffb SF |
186 | ``WQ_UNBOUND`` |
187 | Work items queued to an unbound wq are served by the special | |
188 | worker-pools which host workers which are not bound to any | |
189 | specific CPU. This makes the wq behave as a simple execution | |
190 | context provider without concurrency management. The unbound | |
191 | worker-pools try to start execution of work items as soon as | |
192 | possible. Unbound wq sacrifices locality but is useful for | |
193 | the following cases. | |
194 | ||
195 | * Wide fluctuation in the concurrency level requirement is | |
196 | expected and using bound wq may end up creating large number | |
197 | of mostly unused workers across different CPUs as the issuer | |
198 | hops through different CPUs. | |
199 | ||
200 | * Long running CPU intensive workloads which can be better | |
201 | managed by the system scheduler. | |
202 | ||
203 | ``WQ_FREEZABLE`` | |
204 | A freezable wq participates in the freeze phase of the system | |
205 | suspend operations. Work items on the wq are drained and no | |
206 | new work item starts execution until thawed. | |
207 | ||
208 | ``WQ_MEM_RECLAIM`` | |
209 | All wq which might be used in the memory reclaim paths **MUST** | |
210 | have this flag set. The wq is guaranteed to have at least one | |
211 | execution context regardless of memory pressure. | |
212 | ||
213 | ``WQ_HIGHPRI`` | |
214 | Work items of a highpri wq are queued to the highpri | |
215 | worker-pool of the target cpu. Highpri worker-pools are | |
216 | served by worker threads with elevated nice level. | |
217 | ||
218 | Note that normal and highpri worker-pools don't interact with | |
47684e11 | 219 | each other. Each maintains its separate pool of workers and |
e7f08ffb SF |
220 | implements concurrency management among its workers. |
221 | ||
222 | ``WQ_CPU_INTENSIVE`` | |
223 | Work items of a CPU intensive wq do not contribute to the | |
224 | concurrency level. In other words, runnable CPU intensive | |
225 | work items will not prevent other work items in the same | |
226 | worker-pool from starting execution. This is useful for bound | |
227 | work items which are expected to hog CPU cycles so that their | |
228 | execution is regulated by the system scheduler. | |
229 | ||
230 | Although CPU intensive work items don't contribute to the | |
231 | concurrency level, start of their executions is still | |
232 | regulated by the concurrency management and runnable | |
233 | non-CPU-intensive work items can delay execution of CPU | |
234 | intensive work items. | |
235 | ||
236 | This flag is meaningless for unbound wq. | |
237 | ||
e7f08ffb SF |
238 | |
239 | ``max_active`` | |
240 | -------------- | |
241 | ||
636b927e TH |
242 | ``@max_active`` determines the maximum number of execution contexts per |
243 | CPU which can be assigned to the work items of a wq. For example, with | |
244 | ``@max_active`` of 16, at most 16 work items of the wq can be executing | |
245 | at the same time per CPU. This is always a per-CPU attribute, even for | |
246 | unbound workqueues. | |
247 | ||
248 | The maximum limit for ``@max_active`` is 512 and the default value used | |
249 | when 0 is specified is 256. These values are chosen sufficiently high | |
250 | such that they are not the limiting factor while providing protection in | |
251 | runaway cases. | |
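
As a rough illustration of the two limits above, the following Python sketch models how a requested ``@max_active`` might map to an effective value. The constants carry the numbers quoted in the text. Clamping out-of-range requests into the valid range, and the helper name ``effective_max_active``, are assumptions of this sketch and not something this document specifies.

```python
WQ_MAX_ACTIVE = 512   # upper limit quoted in the text
WQ_DFL_ACTIVE = 256   # value used when 0 is specified


def effective_max_active(requested):
    """Model of the limits described above (hypothetical helper)."""
    if requested == 0:
        return WQ_DFL_ACTIVE
    # Assumption: values outside 1..WQ_MAX_ACTIVE are clamped into range.
    return max(1, min(requested, WQ_MAX_ACTIVE))


if __name__ == "__main__":
    for value in (0, 16, 1000):
        print(value, "->", effective_max_active(value))
```
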
c54fce6e TH |
252 | |
253 | The number of active work items of a wq is usually regulated by the | |
254 | users of the wq, more specifically, by how many work items the users | |
255 | may queue at the same time. Unless there is a specific need for | |
256 | throttling the number of active work items, specifying '0' is | |
257 | recommended. | |
258 | ||
3bc1e711 TH |
259 | Some users depend on strict execution ordering where only one work item |
260 | is in flight at any given time and the work items are processed in | |
261 | queueing order. While the combination of ``@max_active`` of 1 and | |
262 | ``WQ_UNBOUND`` used to achieve this behavior, this is no longer the | |
263 | case. Use ``alloc_ordered_workqueue()`` instead. | |
0e0cafcd | 264 | |
c54fce6e | 265 | |
e7f08ffb SF |
266 | Example Execution Scenarios |
267 | =========================== | |
c54fce6e TH |
268 | |
269 | The following example execution scenarios try to illustrate how cmwq | |
270 | behaves under different configurations. | |
271 | ||
272 | Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. | |
273 | w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms | |
274 | again before finishing. w1 and w2 burn CPU for 5ms then sleep for | |
275 | 10ms. | |
276 | ||
277 | Ignoring all other tasks, works and processing overhead, and assuming | |
278 | simple FIFO scheduling, the following is one highly simplified version | |
e7f08ffb | 279 | of possible sequences of events with the original wq. :: |
c54fce6e TH |
280 | |
281 | TIME IN MSECS EVENT | |
282 | 0 w0 starts and burns CPU | |
283 | 5 w0 sleeps | |
284 | 15 w0 wakes up and burns CPU | |
285 | 20 w0 finishes | |
286 | 20 w1 starts and burns CPU | |
287 | 25 w1 sleeps | |
288 | 35 w1 wakes up and finishes | |
289 | 35 w2 starts and burns CPU | |
290 | 40 w2 sleeps | |
291 | 50 w2 wakes up and finishes | |
292 | ||
e7f08ffb | 293 | And with cmwq with ``@max_active`` >= 3, :: |
c54fce6e TH |
294 | |
295 | TIME IN MSECS EVENT | |
296 | 0 w0 starts and burns CPU | |
297 | 5 w0 sleeps | |
298 | 5 w1 starts and burns CPU | |
299 | 10 w1 sleeps | |
300 | 10 w2 starts and burns CPU | |
301 | 15 w2 sleeps | |
302 | 15 w0 wakes up and burns CPU | |
303 | 20 w0 finishes | |
304 | 20 w1 wakes up and finishes | |
305 | 25 w2 wakes up and finishes | |
306 | ||
e7f08ffb | 307 | If ``@max_active`` == 2, :: |
c54fce6e TH |
308 | |
309 | TIME IN MSECS EVENT | |
310 | 0 w0 starts and burns CPU | |
311 | 5 w0 sleeps | |
312 | 5 w1 starts and burns CPU | |
313 | 10 w1 sleeps | |
314 | 15 w0 wakes up and burns CPU | |
315 | 20 w0 finishes | |
316 | 20 w1 wakes up and finishes | |
317 | 20 w2 starts and burns CPU | |
318 | 25 w2 sleeps | |
319 | 35 w2 wakes up and finishes | |
320 | ||
321 | Now, let's assume w1 and w2 are queued to a different wq q1 which has | |
e7f08ffb | 322 | ``WQ_CPU_INTENSIVE`` set, :: |
c54fce6e TH |
323 | |
324 | TIME IN MSECS EVENT | |
325 | 0 w0 starts and burns CPU | |
326 | 5 w0 sleeps | |
327 | 5 w1 and w2 start and burn CPU | |
328 | 10 w1 sleeps | |
329 | 15 w2 sleeps | |
330 | 15 w0 wakes up and burns CPU | |
331 | 20 w0 finishes | |
332 | 20 w1 wakes up and finishes | |
333 | 25 w2 wakes up and finishes | |
334 | ||
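
The first three timelines can be reproduced with a small user-space model. The sketch below is a deliberately simplified simulation, not the kernel's implementation: it assumes a single CPU, non-preemptive FIFO execution, and that a new work item starts only when no worker is runnable and fewer than ``max_active`` items are in flight. The original single-worker wq is approximated by a limit of 1. The ``WQ_CPU_INTENSIVE`` case is not modelled, and all names in the code are hypothetical.

```python
import heapq
from collections import deque


def simulate(works, max_active):
    """Simulate one CPU's concurrency-managed pool.

    works: dict mapping a work item name to a list of (kind, msecs) phases,
           where kind is "cpu" or "sleep". Items are queued in dict order.
    Returns (log, finish): log is a list of (time, name, event) tuples and
    finish maps each name to its completion time.
    """
    pending = deque(works)                      # queued but not yet started
    phases = {n: deque(p) for n, p in works.items()}
    ready = deque()                             # woken workers waiting for the CPU
    timers = []                                 # heap of (time, seq, name, kind)
    log, finish = [], {}
    state = {"active": 0, "running": None, "seq": 0}

    def arm(when, name, kind):
        heapq.heappush(timers, (when, state["seq"], name, kind))
        state["seq"] += 1

    def step(name, now, on_cpu):
        """Advance name to its next phase; return True if it now holds the CPU."""
        if not phases[name]:
            finish[name] = now
            state["active"] -= 1
            log.append((now, name, "finishes"))
            return False
        kind, msecs = phases[name][0]
        if kind == "sleep":
            phases[name].popleft()
            arm(now + msecs, name, "wake")
            log.append((now, name, "sleeps"))
            return False
        if not on_cpu:
            ready.append(name)                  # runnable, but must wait for the CPU
            return False
        phases[name].popleft()
        arm(now + msecs, name, "cpu_done")
        log.append((now, name, "burns CPU"))
        return True

    def dispatch(now):
        # Hand out the CPU only while no worker is running on it.
        while state["running"] is None:
            if ready:
                name = ready.popleft()
            elif pending and state["active"] < max_active:
                name = pending.popleft()
                state["active"] += 1
            else:
                return
            if step(name, now, on_cpu=True):
                state["running"] = name

    dispatch(0)
    while timers:
        now, _, name, kind = heapq.heappop(timers)
        if kind == "cpu_done":
            state["running"] = None
        step(name, now, on_cpu=False)
        dispatch(now)
    return log, finish


# w0, w1 and w2 as described at the start of this section.
WORKS = {
    "w0": [("cpu", 5), ("sleep", 10), ("cpu", 5)],
    "w1": [("cpu", 5), ("sleep", 10)],
    "w2": [("cpu", 5), ("sleep", 10)],
}

if __name__ == "__main__":
    for limit in (1, 2, 3):
        _, finish = simulate(WORKS, limit)
        print(f"max_active={limit}: {finish}")
```

Under these assumptions the last work item completes at 50, 35 and 25 msecs for limits of 1, 2 and 3, matching the final lines of the corresponding tables.
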
335 | ||
e7f08ffb SF |
336 | Guidelines |
337 | ========== | |
c54fce6e | 338 | |
e7f08ffb SF |
339 | * Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work |
340 | items which are used during memory reclaim. Each wq with | |
341 | ``WQ_MEM_RECLAIM`` set has an execution context reserved for it. If | |
342 | there is dependency among multiple work items used during memory | |
343 | reclaim, they should be queued to separate wq each with | |
344 | ``WQ_MEM_RECLAIM``. | |
c54fce6e TH |
345 | |
346 | * Unless strict ordering is required, there is no need to use ST wq. | |
347 | ||
348 | * Unless there is a specific need, using 0 for ``@max_active`` is | |
349 | recommended. In most use cases, concurrency level usually stays | |
350 | well under the default limit. | |
351 | ||
6370a6ad | 352 | * A wq serves as a domain for forward progress guarantee |
e7f08ffb SF |
353 | (``WQ_MEM_RECLAIM``), flush and work item attributes. Work items |
354 | which are not involved in memory reclaim and don't need to be | |
355 | flushed as a part of a group of work items, and don't require any | |
356 | special attribute, can use one of the system wq. There is no | |
357 | difference in execution characteristics between using a dedicated wq | |
358 | and a system wq. | |
c54fce6e TH |
359 | |
360 | * Unless work items are expected to consume a huge amount of CPU | |
361 | cycles, using a bound wq is usually beneficial due to the increased | |
362 | level of locality in wq operations and work item execution. | |
e2de9e08 FM |
363 | |
364 | ||
63c5484e TH |
365 | Affinity Scopes |
366 | =============== | |
367 | ||
368 | An unbound workqueue groups CPUs according to its affinity scope to improve | |
369 | cache locality. For example, if a workqueue is using the default affinity | |
370 | scope of "cache", it will group CPUs according to last level cache | |
8639eceb TH |
371 | boundaries. A work item queued on the workqueue will be assigned to a worker |
372 | on one of the CPUs which share the last level cache with the issuing CPU. | |
373 | Once started, the worker may or may not be allowed to move outside the scope | |
374 | depending on the ``affinity_strict`` setting of the scope. | |
63c5484e | 375 | |
523a301e TH |
376 | Workqueue currently supports the following affinity scopes. |
377 | ||
378 | ``default`` | |
379 | Use the scope in module parameter ``workqueue.default_affinity_scope`` | |
380 | which is always set to one of the scopes below. | |
63c5484e TH |
381 | |
382 | ``cpu`` | |
383 | CPUs are not grouped. A work item issued on one CPU is processed by a | |
384 | worker on the same CPU. This makes unbound workqueues behave as per-cpu | |
385 | workqueues without concurrency management. | |
386 | ||
387 | ``smt`` | |
388 | CPUs are grouped according to SMT boundaries. This usually means that the | |
389 | logical threads of each physical CPU core are grouped together. | |
390 | ||
391 | ``cache`` | |
392 | CPUs are grouped according to cache boundaries. Which specific cache | |
393 | boundary is used is determined by the arch code. L3 is used in a lot of | |
394 | cases. This is the default affinity scope. | |
395 | ||
396 | ``numa`` | |
89405db5 | 397 | CPUs are grouped according to NUMA boundaries. |
63c5484e TH |
398 | |
399 | ``system`` | |
400 | All CPUs are put in the same group. Workqueue makes no effort to process a | |
401 | work item on a CPU close to the issuing CPU. | |
402 | ||
403 | The default affinity scope can be changed with the module parameter | |
404 | ``workqueue.default_affinity_scope`` and a specific workqueue's affinity | |
405 | scope can be changed using ``apply_workqueue_attrs()``. | |
406 | ||
407 | If ``WQ_SYSFS`` is set, the workqueue will have the following affinity scope | |
bd9e7326 | 408 | related interface files under its ``/sys/devices/virtual/workqueue/WQ_NAME/`` |
63c5484e TH |
409 | directory. |
410 | ||
411 | ``affinity_scope`` | |
412 | Read to see the current affinity scope. Write to change. | |
413 | ||
523a301e TH |
414 | When default is the current scope, reading this file will also show the |
415 | current effective scope in parentheses, for example, ``default (cache)``. | |
416 | ||
8639eceb TH |
417 | ``affinity_strict`` |
418 | 0 by default indicating that affinity scopes are not strict. When a work | |
419 | item starts execution, workqueue makes a best-effort attempt to ensure | |
420 | that the worker is inside its affinity scope, which is called | |
421 | repatriation. Once started, the scheduler is free to move the worker | |
422 | anywhere in the system as it sees fit. This enables benefiting from scope | |
423 | locality while still being able to utilize other CPUs if necessary and | |
424 | available. | |
425 | ||
426 | If set to 1, all workers of the scope are guaranteed always to be in the | |
427 | scope. This may be useful when crossing affinity scopes has other | |
428 | implications, for example, in terms of power consumption or workload | |
429 | isolation. Strict NUMA scope can also be used to match the workqueue | |
430 | behavior of older kernels. | |
431 | ||
63c5484e | 432 | |
7dbf15c5 TH |
433 | Affinity Scopes and Performance |
434 | =============================== | |
435 | ||
436 | It'd be ideal if an unbound workqueue's behavior is optimal for the vast | |
437 | majority of use cases without further tuning. Unfortunately, in the current | |
438 | kernel, there exists a pronounced trade-off between locality and utilization | |
439 | necessitating explicit configurations when workqueues are heavily used. | |
440 | ||
441 | Higher locality leads to higher efficiency where more work is performed for | |
442 | the same number of consumed CPU cycles. However, higher locality may also | |
443 | cause lower overall system utilization if the work items are not spread | |
444 | enough across the affinity scopes by the issuers. The following performance | |
445 | testing with dm-crypt clearly illustrates this trade-off. | |
446 | ||
447 | The tests are run on a CPU with 12-cores/24-threads split across four L3 | |
448 | caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. | |
449 | ``/dev/dm-0`` is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and | |
450 | opened with ``cryptsetup`` with default settings. | |
451 | ||
452 | ||
453 | Scenario 1: Enough issuers and work spread across the machine | |
454 | ------------------------------------------------------------- | |
455 | ||
456 | The command used: :: | |
457 | ||
458 | $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \ | |
459 | --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \ | |
460 | --name=iops-test-job --verify=sha512 | |
461 | ||
462 | There are 24 issuers, each issuing 64 IOs concurrently. ``--verify=sha512`` | |
463 | makes ``fio`` generate and read back the content each time which makes | |
22160b08 | 464 | execution locality matter between the issuer and ``kcryptd``. The following |
7dbf15c5 TH |
465 | are the read bandwidths and CPU utilizations depending on different affinity |
466 | scope settings on ``kcryptd`` measured over five runs. Bandwidths are in | |
467 | MiBps, and CPU util in percents. | |
468 | ||
469 | .. list-table:: | |
470 | :widths: 16 20 20 | |
471 | :header-rows: 1 | |
472 | ||
473 | * - Affinity | |
474 | - Bandwidth (MiBps) | |
475 | - CPU util (%) | |
476 | ||
477 | * - system | |
478 | - 1159.40 ±1.34 | |
479 | - 99.31 ±0.02 | |
480 | ||
481 | * - cache | |
482 | - 1166.40 ±0.89 | |
483 | - 99.34 ±0.01 | |
484 | ||
485 | * - cache (strict) | |
486 | - 1166.00 ±0.71 | |
487 | - 99.35 ±0.01 | |
488 | ||
489 | With enough issuers spread across the system, there is no downside to | |
490 | "cache", strict or otherwise. All three configurations saturate the whole | |
491 | machine but the cache-affine ones outperform by 0.6% thanks to improved | |
492 | locality. | |
493 | ||
494 | ||
495 | Scenario 2: Fewer issuers, enough work for saturation | |
496 | ----------------------------------------------------- | |
497 | ||
498 | The command used: :: | |
499 | ||
500 | $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ | |
501 | --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \ | |
502 | --time_based --group_reporting --name=iops-test-job --verify=sha512 | |
503 | ||
504 | The only difference from the previous scenario is ``--numjobs=8``. There are | |
505 | a third of the issuers but there is still enough total work to saturate the | |
506 | system. | |
507 | ||
508 | .. list-table:: | |
509 | :widths: 16 20 20 | |
510 | :header-rows: 1 | |
511 | ||
512 | * - Affinity | |
513 | - Bandwidth (MiBps) | |
514 | - CPU util (%) | |
515 | ||
516 | * - system | |
517 | - 1155.40 ±0.89 | |
518 | - 97.41 ±0.05 | |
519 | ||
520 | * - cache | |
521 | - 1154.40 ±1.14 | |
522 | - 96.15 ±0.09 | |
523 | ||
524 | * - cache (strict) | |
525 | - 1112.00 ±4.64 | |
526 | - 93.26 ±0.35 | |
527 | ||
528 | This is more than enough work to saturate the system. Both "system" and | |
529 | "cache" are nearly saturating the machine but not fully. "cache" is using | |
530 | less CPU but the better efficiency puts it at the same bandwidth as | |
531 | "system". | |
532 | ||
533 | Eight issuers moving around over four L3 cache scopes still allow "cache | |
534 | (strict)" to mostly saturate the machine but the loss of work conservation | |
535 | is now starting to hurt with 3.7% bandwidth loss. | |
536 | ||
537 | ||
538 | Scenario 3: Even fewer issuers, not enough work to saturate | |
539 | ----------------------------------------------------------- | |
540 | ||
541 | The command used: :: | |
542 | ||
543 | $ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \ | |
544 | --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \ | |
545 | --time_based --group_reporting --name=iops-test-job --verify=sha512 | |
546 | ||
547 | Again, the only difference is ``--numjobs=4``. With the number of issuers | |
548 | reduced to four, there now isn't enough work to saturate the whole system | |
549 | and the bandwidth becomes dependent on completion latencies. | |
550 | ||
551 | .. list-table:: | |
552 | :widths: 16 20 20 | |
553 | :header-rows: 1 | |
554 | ||
555 | * - Affinity | |
556 | - Bandwidth (MiBps) | |
557 | - CPU util (%) | |
558 | ||
559 | * - system | |
560 | - 993.60 ±1.82 | |
561 | - 75.49 ±0.06 | |
562 | ||
563 | * - cache | |
564 | - 973.40 ±1.52 | |
565 | - 74.90 ±0.07 | |
566 | ||
567 | * - cache (strict) | |
568 | - 828.20 ±4.49 | |
569 | - 66.84 ±0.29 | |
570 | ||
571 | Now, the tradeoff between locality and utilization is clearer. "cache" shows | |
572 | 2% bandwidth loss compared to "system" and "cache (strict)" a whopping 20%. | |
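
The percentages quoted in the three scenarios can be rechecked from the table means. The sketch below only redoes that arithmetic. The helper name and the choice of baseline for each comparison are assumptions, since the text does not say which configuration each figure is measured against.

```python
def pct_diff(value, baseline):
    """Return how far value is above (+) or below (-) baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Mean bandwidths in MiBps, copied from the three tables.
BANDWIDTH = {
    1: {"system": 1159.40, "cache": 1166.40, "strict": 1166.00},
    2: {"system": 1155.40, "cache": 1154.40, "strict": 1112.00},
    3: {"system": 993.60, "cache": 973.40, "strict": 828.20},
}

if __name__ == "__main__":
    s1, s2, s3 = BANDWIDTH[1], BANDWIDTH[2], BANDWIDTH[3]
    print(f"Scenario 1, cache vs system:  {pct_diff(s1['cache'], s1['system']):+.1f}%")
    print(f"Scenario 2, strict vs cache:  {pct_diff(s2['strict'], s2['cache']):+.1f}%")
    print(f"Scenario 2, strict vs system: {pct_diff(s2['strict'], s2['system']):+.1f}%")
    print(f"Scenario 3, cache vs system:  {pct_diff(s3['cache'], s3['system']):+.1f}%")
    print(f"Scenario 3, strict vs system: {pct_diff(s3['strict'], s3['system']):+.1f}%")
    print(f"Scenario 3, system vs strict: {pct_diff(s3['system'], s3['strict']):+.1f}%")
```

With these inputs, the 0.6% and 2% figures follow directly when "system" is the baseline, and the Scenario 2 loss comes out at about 3.7% against "cache" or 3.8% against "system". The 20% figure in Scenario 3 is reproduced when "system" is expressed relative to "cache (strict)"; taken as a loss relative to "system", the same means give roughly 16.6%.
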
573 | ||
574 | ||
575 | Conclusion and Recommendations | |
576 | ------------------------------ | |
577 | ||
578 | In the above experiments, the efficiency advantage of the "cache" affinity | |
579 | scope over "system" is, while consistent and noticeable, small. However, the | |
580 | impact is dependent on the distances between the scopes and may be more | |
581 | pronounced in processors with more complex topologies. | |
582 | ||
583 | While the loss of work-conservation in certain scenarios hurts, it is a lot | |
584 | better than "cache (strict)" and maximizing workqueue utilization is | |
585 | unlikely to be the common case anyway. As such, "cache" is the default | |
586 | affinity scope for unbound pools. | |
587 | ||
588 | * As there is no one option which is great for most cases, workqueue usages | |
589 | that may consume a significant amount of CPU are recommended to configure | |
590 | the workqueues using ``apply_workqueue_attrs()`` and/or enable | |
591 | ``WQ_SYSFS``. | |
592 | ||
593 | * An unbound workqueue with strict "cpu" affinity scope behaves the same as | |
594 | ``WQ_CPU_INTENSIVE`` per-cpu workqueue. There is no real advantage to the | |
595 | latter and an unbound workqueue provides a lot more flexibility. | |
596 | ||
597 | * Affinity scopes are introduced in Linux v6.5. To emulate the previous | |
598 | behavior, use strict "numa" affinity scope. | |
599 | ||
600 | * The loss of work-conservation in non-strict affinity scopes is likely | |
601 | originating from the scheduler. There is no theoretical reason why the | |
602 | kernel wouldn't be able to do the right thing and maintain | |
603 | work-conservation in most cases. As such, it is possible that future | |
604 | scheduler improvements may make most of these tunables unnecessary. | |
605 | ||
606 | ||
7f7dc377 TH |
607 | Examining Configuration |
608 | ======================= | |
609 | ||
610 | Use tools/workqueue/wq_dump.py to examine unbound CPU affinity | |
611 | configuration, worker pools and how workqueues map to the pools: :: | |
612 | ||
613 | $ tools/workqueue/wq_dump.py | |
614 | Affinity Scopes | |
615 | =============== | |
616 | wq_unbound_cpumask=0000000f | |
617 | ||
63c5484e TH |
618 | CPU |
619 | nr_pods 4 | |
620 | pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 | |
621 | pod_node [0]=0 [1]=0 [2]=1 [3]=1 | |
622 | cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 | |
623 | ||
624 | SMT | |
625 | nr_pods 4 | |
626 | pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008 | |
627 | pod_node [0]=0 [1]=0 [2]=1 [3]=1 | |
628 | cpu_pod [0]=0 [1]=1 [2]=2 [3]=3 | |
629 | ||
630 | CACHE (default) | |
631 | nr_pods 2 | |
632 | pod_cpus [0]=00000003 [1]=0000000c | |
633 | pod_node [0]=0 [1]=1 | |
634 | cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 | |
635 | ||
7f7dc377 TH |
636 | NUMA |
637 | nr_pods 2 | |
638 | pod_cpus [0]=00000003 [1]=0000000c | |
639 | pod_node [0]=0 [1]=1 | |
640 | cpu_pod [0]=0 [1]=0 [2]=1 [3]=1 | |
641 | ||
642 | SYSTEM | |
643 | nr_pods 1 | |
644 | pod_cpus [0]=0000000f | |
645 | pod_node [0]=-1 | |
646 | cpu_pod [0]=0 [1]=0 [2]=0 [3]=0 | |
647 | ||
648 | Worker Pools | |
649 | ============ | |
650 | pool[00] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 0 | |
651 | pool[01] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 0 | |
652 | pool[02] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 1 | |
653 | pool[03] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 1 | |
654 | pool[04] ref= 1 nice= 0 idle/workers= 4/ 4 cpu= 2 | |
655 | pool[05] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 2 | |
656 | pool[06] ref= 1 nice= 0 idle/workers= 3/ 3 cpu= 3 | |
657 | pool[07] ref= 1 nice=-20 idle/workers= 2/ 2 cpu= 3 | |
658 | pool[08] ref=42 nice= 0 idle/workers= 6/ 6 cpus=0000000f | |
659 | pool[09] ref=28 nice= 0 idle/workers= 3/ 3 cpus=00000003 | |
660 | pool[10] ref=28 nice= 0 idle/workers= 17/ 17 cpus=0000000c | |
661 | pool[11] ref= 1 nice=-20 idle/workers= 1/ 1 cpus=0000000f | |
662 | pool[12] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=00000003 | |
663 | pool[13] ref= 2 nice=-20 idle/workers= 1/ 1 cpus=0000000c | |
664 | ||
665 | Workqueue CPU -> pool | |
666 | ===================== | |
667 | [ workqueue \ CPU 0 1 2 3 dfl] | |
668 | events percpu 0 2 4 6 | |
669 | events_highpri percpu 1 3 5 7 | |
670 | events_long percpu 0 2 4 6 | |
671 | events_unbound unbound 9 9 10 10 8 | |
672 | events_freezable percpu 0 2 4 6 | |
673 | events_power_efficient percpu 0 2 4 6 | |
2c534f2f | 674 | events_freezable_pwr_ef percpu 0 2 4 6 |
7f7dc377 TH |
675 | rcu_gp percpu 0 2 4 6 |
676 | rcu_par_gp percpu 0 2 4 6 | |
677 | slub_flushwq percpu 0 2 4 6 | |
678 | netns ordered 8 8 8 8 8 | |
679 | ... | |
680 | ||
681 | See the command's help message for more info. | |
682 | ||
683 | ||
725e8ec5 TH |
684 | Monitoring |
685 | ========== | |
686 | ||
687 | Use tools/workqueue/wq_monitor.py to monitor workqueue operations: :: | |
688 | ||
689 | $ tools/workqueue/wq_monitor.py events | |
8639eceb | 690 | total infl CPUtime CPUhog CMW/RPR mayday rescued |
8a1dd1e5 TH |
691 | events 18545 0 6.1 0 5 - - |
692 | events_highpri 8 0 0.0 0 0 - - | |
693 | events_long 3 0 0.0 0 0 - - | |
8639eceb | 694 | events_unbound 38306 0 0.1 - 7 - - |
8a1dd1e5 TH |
695 | events_freezable 0 0 0.0 0 0 - - |
696 | events_power_efficient 29598 0 0.2 0 0 - - | |
2c534f2f | 697 | events_freezable_pwr_ef 10 0 0.0 0 0 - - |
8a1dd1e5 TH |
698 | sock_diag_events 0 0 0.0 0 0 - - |
699 | ||
8639eceb | 700 | total infl CPUtime CPUhog CMW/RPR mayday rescued |
8a1dd1e5 TH |
701 | events 18548 0 6.1 0 5 - - |
702 | events_highpri 8 0 0.0 0 0 - - | |
703 | events_long 3 0 0.0 0 0 - - | |
8639eceb | 704 | events_unbound 38322 0 0.1 - 7 - - |
8a1dd1e5 TH |
705 | events_freezable 0 0 0.0 0 0 - - |
706 | events_power_efficient 29603 0 0.2 0 0 - - | |
2c534f2f | 707 | events_freezable_pwr_ef 10 0 0.0 0 0 - - |
8a1dd1e5 | 708 | sock_diag_events 0 0 0.0 0 0 - - |
725e8ec5 TH |
709 | |
710 | ... | |
711 | ||
712 | See the command's help message for more info. | |
713 | ||
714 | ||
e7f08ffb SF |
715 | Debugging |
716 | ========= | |
e2de9e08 FM |
717 | |
718 | Because work functions are executed by generic worker threads, | |
719 | a few tricks are needed to shed some light on misbehaving | |
720 | workqueue users. | |
721 | ||
e7f08ffb | 722 | Worker threads show up in the process list as: :: |
e2de9e08 | 723 | |
e7f08ffb SF |
724 | root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1] |
725 | root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2] | |
726 | root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0] | |
727 | root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0] | |
e2de9e08 FM |
728 | |
729 | If kworkers are using too much CPU, there are two types | |
730 | of possible problems: | |
731 | ||
6888c6f2 | 732 | 1. Work items being queued in rapid succession |
e2de9e08 FM |
733 | 2. A single work item that consumes lots of CPU cycles | |
734 | ||
e7f08ffb | 735 | The first one can be tracked using tracing: :: |
e2de9e08 | 736 | |
2abfcd29 RZ |
737 | $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event |
738 | $ cat /sys/kernel/tracing/trace_pipe > out.txt | |
e2de9e08 FM |
739 | (wait a few secs) |
740 | ^C | |
741 | ||
742 | If something is busy looping on work queueing, it will dominate | |
743 | the output, and the offender can be identified by the work item | |
744 | function. | |
745 | ||
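As a rough aid for reading the captured trace, the tally described above can be sketched in Python. This is only an illustration, not part of the kernel tree: the helper names ``count_work_functions`` and ``top_offenders`` are hypothetical, and the sketch assumes each ``workqueue_queue_work`` event line carries a ``function=<symbol>`` field, which may differ between kernel versions.

```python
import re
from collections import Counter

# Assumption: each workqueue_queue_work event line contains a
# "function=<symbol>" field. Lines without it are ignored.
FUNC_RE = re.compile(r"\bfunction=(\S+)")


def count_work_functions(lines):
    """Count how many times each work function appears in the trace lines."""
    counts = Counter()
    for line in lines:
        match = FUNC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_offenders(lines, n=5):
    """Return the n most frequently queued work functions as (name, hits)."""
    return count_work_functions(lines).most_common(n)
```

With the capture saved as ``out.txt``, calling ``top_offenders(open("out.txt"))`` lists the work functions that were queued most often. A function that accounts for most of the events is the likely offender.
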
746 | For the second type of problem, it should be possible to just check | |
e7f08ffb | 747 | the stack trace of the offending worker thread: :: |
e2de9e08 FM |
748 | |
749 | $ cat /proc/THE_OFFENDING_KWORKER/stack | |
750 | ||
751 | The work item's function should be trivially visible in the stack | |
752 | trace. | |
e7f08ffb | 753 | |
725e8ec5 | 754 | |
f9eaaa82 BF |
755 | Non-reentrance Conditions |
756 | ========================= | |
757 | ||
758 | Workqueue guarantees that a work item cannot be re-entrant if the following | |
759 | conditions hold after the work item gets queued: | |
760 | ||
761 | 1. The work function hasn't been changed. | |
762 | 2. No one queues the work item to another workqueue. | |
763 | 3. The work item hasn't been reinitialized. | |
764 | ||
765 | In other words, if the above conditions hold, the work item is guaranteed to be | |
766 | executed by at most one worker system-wide at any given time. | |
767 | ||
768 | Note that requeuing the work item (to the same queue) from within its own | |
769 | work function doesn't break these conditions, so it's safe to do. Otherwise, | |
770 | caution is required when breaking the conditions inside a work function. | |
771 | ||
e7f08ffb SF |
772 | |
773 | Kernel Inline Documentation Reference | |
774 | ===================================== | |
775 | ||
776 | .. kernel-doc:: include/linux/workqueue.h | |
c9e3d519 MCC |
777 | |
778 | .. kernel-doc:: kernel/workqueue.c |