gdb/dwarf: use dynamic partitioning for DWARF CU indexing
The DWARF indexer splits the work statically based on the unit sizes,
attempting to give each worker thread about the same number of bytes to
process. This works relatively well with standard compilation, but not
as well when compiling with DWO files (-gsplit-dwarf). I see this when
loading a relatively big program (telegram-desktop, which includes a
lot of static dependencies) compiled with -gsplit-dwarf:
Time for "DWARF indexing worker": wall 0.000, user 0.000, sys 0.000, user+sys 0.000, -nan % CPU
Time for "DWARF indexing worker": wall 0.001, user 0.000, sys 0.000, user+sys 0.000, 0.0 % CPU
Time for "DWARF indexing worker": wall 0.001, user 0.001, sys 0.000, user+sys 0.001, 100.0 % CPU
Time for "DWARF indexing worker": wall 0.748, user 0.284, sys 0.297, user+sys 0.581, 77.7 % CPU
Time for "DWARF indexing worker": wall 0.818, user 0.408, sys 0.262, user+sys 0.670, 81.9 % CPU
Time for "DWARF indexing worker": wall 1.196, user 0.580, sys 0.402, user+sys 0.982, 82.1 % CPU
Time for "DWARF indexing worker": wall 1.250, user 0.511, sys 0.500, user+sys 1.011, 80.9 % CPU
Time for "DWARF indexing worker": wall 7.730, user 5.891, sys 1.729, user+sys 7.620, 98.6 % CPU
Note how the wall times vary from 0 to 7.7 seconds. This is
undesirable, because the indexing step as a whole takes as long as the
slowest worker thread.
The imbalance in this step also causes imbalance in the following
"finalize" step:
Time for "DWARF finalize worker": wall 0.007, user 0.004, sys 0.002, user+sys 0.006, 85.7 % CPU
Time for "DWARF finalize worker": wall 0.012, user 0.005, sys 0.005, user+sys 0.010, 83.3 % CPU
Time for "DWARF finalize worker": wall 0.015, user 0.010, sys 0.004, user+sys 0.014, 93.3 % CPU
Time for "DWARF finalize worker": wall 0.389, user 0.359, sys 0.029, user+sys 0.388, 99.7 % CPU
Time for "DWARF finalize worker": wall 0.680, user 0.644, sys 0.035, user+sys 0.679, 99.9 % CPU
Time for "DWARF finalize worker": wall 0.929, user 0.907, sys 0.020, user+sys 0.927, 99.8 % CPU
Time for "DWARF finalize worker": wall 1.093, user 1.055, sys 0.037, user+sys 1.092, 99.9 % CPU
Time for "DWARF finalize worker": wall 2.016, user 1.934, sys 0.082, user+sys 2.016, 100.0 % CPU
Time for "DWARF finalize worker": wall 25.882, user 25.471, sys 0.404, user+sys 25.875, 100.0 % CPU
With DWO files, splitting the workload by size doesn't work, because
the split is done using the size of the skeleton units in the main
file, which is not representative of how much DWARF each DWO file
contains. I haven't tried it, but a similar problem could occur with
cross-unit imports, which can happen with dwz or LTO: a small unit
could import a lot from other units, in which case the size of the
units is not representative of the work to be done.
To try to improve this situation, change the DWARF indexer to use
dynamic partitioning, using gdb::parallel_for_each_async. With this,
each worker thread pops one unit at a time from a shared work queue and
processes it, until the queue is empty. That should result in a more
balanced workload split. I chose 1 as the minimum batch size here,
because I judged that indexing one CU is a big enough piece of work
compared to the synchronization overhead of the queue. That can always
be tweaked later if someone wants to do more tests.
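To illustrate the idea only (this is a self-contained sketch of dynamic
partitioning with batch size 1 using plain standard C++, not gdb's
actual gdb::parallel_for_each_async interface), each worker claims the
next unprocessed unit from a shared atomic counter until all units are
claimed:

  /* Illustrative sketch of dynamic partitioning: threads that finish
     their unit early simply pick up another one, so no thread ends up
     stuck with a disproportionate share of the work.  */

  #include <algorithm>
  #include <atomic>
  #include <cstdio>
  #include <thread>
  #include <vector>

  /* Stand-in for indexing one compilation unit.  */
  static void index_one_unit (int unit)
  {
    std::printf ("indexed unit %d\n", unit);
  }

  int main ()
  {
    const int n_units = 100;
    std::atomic<int> next_unit {0};

    auto worker = [&] ()
      {
        /* Claim units one at a time (batch size 1) until none remain.  */
        for (int unit = next_unit.fetch_add (1); unit < n_units;
             unit = next_unit.fetch_add (1))
          index_one_unit (unit);
      };

    unsigned n_threads = std::max (1u, std::thread::hardware_concurrency ());
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i)
      threads.emplace_back (worker);
    for (auto &t : threads)
      t.join ();
  }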
As a result, the timings are much more balanced:
Time for "DWARF indexing worker": wall 2.325, user 1.033, sys 0.573, user+sys 1.606, 69.1 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.028, sys 0.568, user+sys 1.596, 68.6 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.068, sys 0.513, user+sys 1.581, 68.0 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.005, sys 0.579, user+sys 1.584, 68.1 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.070, sys 0.516, user+sys 1.586, 68.2 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.063, sys 0.584, user+sys 1.647, 70.8 % CPU
Time for "DWARF indexing worker": wall 2.326, user 1.049, sys 0.550, user+sys 1.599, 68.7 % CPU
Time for "DWARF indexing worker": wall 2.328, user 1.058, sys 0.541, user+sys 1.599, 68.7 % CPU
...
Time for "DWARF finalize worker": wall 2.833, user 2.791, sys 0.040, user+sys 2.831, 99.9 % CPU
Time for "DWARF finalize worker": wall 2.939, user 2.896, sys 0.043, user+sys 2.939, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.016, user 2.969, sys 0.046, user+sys 3.015, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.076, user 2.957, sys 0.118, user+sys 3.075, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.159, user 3.054, sys 0.104, user+sys 3.158, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.198, user 3.082, sys 0.114, user+sys 3.196, 99.9 % CPU
Time for "DWARF finalize worker": wall 3.197, user 3.076, sys 0.121, user+sys 3.197, 100.0 % CPU
Time for "DWARF finalize worker": wall 3.268, user 3.136, sys 0.131, user+sys 3.267, 100.0 % CPU
Time for "DWARF finalize worker": wall 1.907, user 1.810, sys 0.096, user+sys 1.906, 99.9 % CPU
In absolute terms, the total time for GDB to load the file and exit goes
down from about 42 seconds to 17 seconds.
Some implementation notes:
- The state previously kept as local variables in
cooked_index_worker_debug_info::process_units becomes fields of the
new parallel worker object.
- The work previously done for each unit in
cooked_index_worker_debug_info::process_units becomes the operator()
of the new parallel worker object.
- The work previously done at the end of
cooked_index_worker_debug_info::process_units (including calling
bfd_thread_cleanup) becomes the destructor of the new parallel worker
object.
- The "done" callback of gdb::task_group becomes the "done" callback of
gdb::parallel_for_each_async.
- I placed the parallel_indexing_worker struct inside
cooked_index_worker_debug_info, so that it has access to
cooked_index_worker_debug_info's private fields (e.g. m_results, to
push the results). It will also be possible to re-use it for
skeletonless type units in a later patch. (A rough sketch of this
structure follows below.)
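For illustration, here is a self-contained sketch of the worker-object
pattern the notes above describe; all names (indexer, worker,
m_local_results, etc.) are made up for the example and are not gdb's
actual declarations:

  /* Sketch: per-thread state lives in fields, per-unit work in
     operator(), and the end-of-work step in the destructor.  Nesting
     the worker inside the owning class gives it access to the owner's
     private fields.  */

  #include <iostream>
  #include <mutex>
  #include <vector>

  class indexer
  {
  public:
    /* Nested worker: being a member type, it can access the enclosing
       class's private fields (m_results, m_results_mutex below).  */
    struct worker
    {
      explicit worker (indexer *owner) : m_owner (owner) {}

      /* Per-unit work (formerly a loop body): accumulate into state
         kept as fields instead of locals.  */
      void operator() (int unit)
      {
        m_local_results.push_back (unit * unit);
      }

      /* End-of-work step (in gdb, this is also where
         bfd_thread_cleanup is called): publish this worker's results
         into the owner's shared container.  */
      ~worker ()
      {
        std::lock_guard<std::mutex> lock (m_owner->m_results_mutex);
        m_owner->m_results.insert (m_owner->m_results.end (),
                                   m_local_results.begin (),
                                   m_local_results.end ());
      }

    private:
      indexer *m_owner;
      std::vector<int> m_local_results;
    };

    const std::vector<int> &results () const { return m_results; }

  private:
    std::mutex m_results_mutex;
    std::vector<int> m_results;
  };

  int main ()
  {
    indexer idx;
    {
      /* Each worker thread would own one of these; here a single one
         processes a few fake "units" and flushes on destruction.  */
      indexer::worker w (&idx);
      for (int unit = 0; unit < 5; ++unit)
        w (unit);
    }
    std::cout << idx.results ().size () << " results\n";
  }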
Change-Id: I5dc5cf8793abe9ebe2659e78da38ffc94289e5f2
Approved-By: Tom Tromey <tom@tromey.com>