gdb/dwarf: use dynamic partitioning for DWARF CU indexing

The DWARF indexer splits the work statically based on the unit sizes,
attempting to give each worker thread about the same amount of bytes to
process.  This works relatively well with standard compilation, but
much less well when compiling with DWO files (-gsplit-dwarf).  I see
this when loading a relatively big program (telegram-desktop, which
includes a lot of static dependencies) compiled with -gsplit-dwarf:

    Time for "DWARF indexing worker": wall 0.000, user 0.000, sys 0.000, user+sys 0.000, -nan % CPU
    Time for "DWARF indexing worker": wall 0.001, user 0.000, sys 0.000, user+sys 0.000, 0.0 % CPU
    Time for "DWARF indexing worker": wall 0.001, user 0.001, sys 0.000, user+sys 0.001, 100.0 % CPU
    Time for "DWARF indexing worker": wall 0.748, user 0.284, sys 0.297, user+sys 0.581, 77.7 % CPU
    Time for "DWARF indexing worker": wall 0.818, user 0.408, sys 0.262, user+sys 0.670, 81.9 % CPU
    Time for "DWARF indexing worker": wall 1.196, user 0.580, sys 0.402, user+sys 0.982, 82.1 % CPU
    Time for "DWARF indexing worker": wall 1.250, user 0.511, sys 0.500, user+sys 1.011, 80.9 % CPU
    Time for "DWARF indexing worker": wall 7.730, user 5.891, sys 1.729, user+sys 7.620, 98.6 % CPU

Note how the wall times vary from 0 to 7.7 seconds.  This is
undesirable, because the indexing step takes as long as the slowest
worker thread.

The imbalance in this step also causes imbalance in the following
"finalize" step:

    Time for "DWARF finalize worker": wall 0.007, user 0.004, sys 0.002, user+sys 0.006, 85.7 % CPU
    Time for "DWARF finalize worker": wall 0.012, user 0.005, sys 0.005, user+sys 0.010, 83.3 % CPU
    Time for "DWARF finalize worker": wall 0.015, user 0.010, sys 0.004, user+sys 0.014, 93.3 % CPU
    Time for "DWARF finalize worker": wall 0.389, user 0.359, sys 0.029, user+sys 0.388, 99.7 % CPU
    Time for "DWARF finalize worker": wall 0.680, user 0.644, sys 0.035, user+sys 0.679, 99.9 % CPU
    Time for "DWARF finalize worker": wall 0.929, user 0.907, sys 0.020, user+sys 0.927, 99.8 % CPU
    Time for "DWARF finalize worker": wall 1.093, user 1.055, sys 0.037, user+sys 1.092, 99.9 % CPU
    Time for "DWARF finalize worker": wall 2.016, user 1.934, sys 0.082, user+sys 2.016, 100.0 % CPU
    Time for "DWARF finalize worker": wall 25.882, user 25.471, sys 0.404, user+sys 25.875, 100.0 % CPU

With DWO files, the split of the workload by size doesn't work, because
it is done using the size of the skeleton units in the main file, which
is not representative of how much DWARF is contained in each DWO file.

I haven't tried it, but a similar problem could occur with cross-unit
imports, which can happen with dwz or LTO.  You could have a small unit
that imports a lot from other units, in which case the size of the units
is not representative of the work to accomplish.
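To make the mismatch between unit size and actual work concrete, here
is a toy model (not GDB code).  The `unit` struct, both function names
and all the numbers are invented for illustration.  It assumes a static
split that cuts contiguous chunks of roughly equal skeleton size, and
compares it with handing units out one at a time to whichever worker is
free first:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a unit: the size visible in the main file and
// the (unknown in advance) cost of indexing what it refers to.
struct unit
{
  std::size_t skeleton_size;
  double cost;
};

// Static split: walk the units in order and start a new chunk each time
// the current one has reached its share of the total skeleton size.
// Returns the busy time of the slowest worker.
double
static_makespan (const std::vector<unit> &units, std::size_t n_workers)
{
  std::size_t total = 0;
  for (const unit &u : units)
    total += u.skeleton_size;
  std::size_t per_worker = (total + n_workers - 1) / n_workers;

  std::vector<double> busy (n_workers, 0.0);
  std::size_t w = 0, bytes = 0;
  for (const unit &u : units)
    {
      if (bytes >= per_worker && w + 1 < n_workers)
        {
          ++w;
          bytes = 0;
        }
      busy[w] += u.cost;
      bytes += u.skeleton_size;
    }
  return *std::max_element (busy.begin (), busy.end ());
}

// One unit at a time: the next unit always goes to the worker that
// becomes free first.  Returns the busy time of the slowest worker.
double
queue_makespan (const std::vector<unit> &units, std::size_t n_workers)
{
  std::vector<double> busy (n_workers, 0.0);
  for (const unit &u : units)
    *std::min_element (busy.begin (), busy.end ()) += u.cost;
  return *std::max_element (busy.begin (), busy.end ());
}
```

With six units of identical skeleton size 100, costs 5, 4, 3, 1, 1, 1
and two workers, the static split makes the slowest worker busy for 12
cost units, against 8 when units are handed out one at a time.  This is
only a model: it ignores queue overhead, and other inputs can narrow or
remove the gap.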

To try to improve this situation, change the DWARF indexer to use
dynamic partitioning, using gdb::parallel_for_each_async.  With this,
each worker thread pops one unit at a time from a shared work queue to
process it, until the queue is empty.  That should result in a more
balanced workload split.  I chose 1 as the minimum batch size here,
because I judged that indexing one CU was a big enough piece of work
compared to the synchronization overhead of the queue.  That can always
be tweaked later if someone wants to do more tests.
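The general technique can be sketched as follows.  This is not the
implementation of gdb::parallel_for_each_async; `dynamic_for_each` is a
hypothetical helper that uses plain std::thread and an atomic counter
as the shared "queue", with a batch size of 1:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not a GDB API): each worker thread repeatedly
// claims the next unprocessed index from a shared atomic counter until
// none remain.  Returns how many items each worker ended up processing.
std::vector<std::size_t>
dynamic_for_each (std::size_t n_items, unsigned n_threads,
                  const std::function<void (std::size_t)> &process)
{
  std::atomic<std::size_t> next {0};
  std::vector<std::size_t> counts (n_threads, 0);
  std::vector<std::thread> threads;

  for (unsigned t = 0; t < n_threads; ++t)
    threads.emplace_back ([&, t] ()
      {
        for (;;)
          {
            // Claim exactly one item (batch size of 1).
            std::size_t i = next.fetch_add (1);
            if (i >= n_items)
              break;

            process (i);
            ++counts[t];	// Only thread T writes counts[T].
          }
      });

  for (std::thread &th : threads)
    th.join ();

  return counts;
}
```

A caller would pass the number of units and a callback that indexes
unit number `i`.  A thread that draws an expensive unit simply claims
fewer units overall, which is what evens out the wall times.  A larger
batch size would mean adding more than 1 to the counter per claim,
trading balance for fewer atomic operations.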

As a result, the timings are much more balanced:

    Time for "DWARF indexing worker": wall 2.325, user 1.033, sys 0.573, user+sys 1.606, 69.1 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.028, sys 0.568, user+sys 1.596, 68.6 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.068, sys 0.513, user+sys 1.581, 68.0 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.005, sys 0.579, user+sys 1.584, 68.1 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.070, sys 0.516, user+sys 1.586, 68.2 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.063, sys 0.584, user+sys 1.647, 70.8 % CPU
    Time for "DWARF indexing worker": wall 2.326, user 1.049, sys 0.550, user+sys 1.599, 68.7 % CPU
    Time for "DWARF indexing worker": wall 2.328, user 1.058, sys 0.541, user+sys 1.599, 68.7 % CPU
    ...
    Time for "DWARF finalize worker": wall 2.833, user 2.791, sys 0.040, user+sys 2.831, 99.9 % CPU
    Time for "DWARF finalize worker": wall 2.939, user 2.896, sys 0.043, user+sys 2.939, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.016, user 2.969, sys 0.046, user+sys 3.015, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.076, user 2.957, sys 0.118, user+sys 3.075, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.159, user 3.054, sys 0.104, user+sys 3.158, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.198, user 3.082, sys 0.114, user+sys 3.196, 99.9 % CPU
    Time for "DWARF finalize worker": wall 3.197, user 3.076, sys 0.121, user+sys 3.197, 100.0 % CPU
    Time for "DWARF finalize worker": wall 3.268, user 3.136, sys 0.131, user+sys 3.267, 100.0 % CPU
    Time for "DWARF finalize worker": wall 1.907, user 1.810, sys 0.096, user+sys 1.906, 99.9 % CPU

In absolute terms, the total time for GDB to load the file and exit goes
down from about 42 seconds to 17 seconds.

Some implementation notes:

- The state previously kept as local variables in
   cooked_index_worker_debug_info::process_units becomes fields of the
   new parallel worker object.

- The work previously done for each unit in
   cooked_index_worker_debug_info::process_units becomes the operator()
   of the new parallel worker object.

- The work previously done at the end of
   cooked_index_worker_debug_info::process_units (including calling
   bfd_thread_cleanup) becomes the destructor of the new parallel worker
   object.

- The "done" callback of gdb::task_group becomes the "done" callback of
   gdb::parallel_for_each_async.

- I placed the parallel_indexing_worker struct inside
   cooked_index_worker_debug_info, so that it has access to
   cooked_index_worker_debug_info's private fields (e.g. m_results, to push
   the results).  It will also be possible to re-use it for skeletonless
   type units in a later patch.
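
The shape described in the notes above can be sketched like this.  All
names here (`indexer`, `worker`, `fake_unit`, `results`) are
hypothetical stand-ins, not the actual GDB classes, and the "work" is
reduced to summing sizes.  It shows per-thread state as fields,
per-unit work in operator(), end-of-thread work in the destructor, and
a nested struct reaching the enclosing class's private members:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a unit to index.
struct fake_unit
{
  std::size_t size;
};

class indexer
{
public:
  // Nested, so it may access indexer's private members.
  struct worker
  {
    explicit worker (indexer *parent)
      : m_parent (parent)
    {}

    // Copying would publish the same result twice.
    worker (const worker &) = delete;
    worker &operator= (const worker &) = delete;

    // Work done for each unit.
    void operator() (const fake_unit &u)
    { m_local_total += u.size; }

    // Work done once, when the thread has no more units to process:
    // publish the accumulated result to the parent.
    ~worker ()
    {
      std::lock_guard<std::mutex> guard (m_parent->m_lock);
      m_parent->m_results.push_back (m_local_total);
    }

  private:
    // State that would otherwise be local variables of a per-thread
    // function.
    indexer *m_parent;
    std::size_t m_local_total = 0;
  };

  std::vector<std::size_t> results ()
  {
    std::lock_guard<std::mutex> guard (m_lock);
    return m_results;
  }

private:
  std::mutex m_lock;
  std::vector<std::size_t> m_results;
};
```

Creating one `worker` per thread and destroying it when the thread runs
out of units gives a single, lock-protected publish per thread rather
than one per unit.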

Change-Id: I5dc5cf8793abe9ebe2659e78da38ffc94289e5f2
Approved-By: Tom Tromey <tom@tromey.com>

author	Simon Marchi <simon.marchi@efficios.com>
	Fri, 19 Sep 2025 20:27:05 +0000 (16:27 -0400)
committer	Simon Marchi <simon.marchi@efficios.com>
	Tue, 30 Sep 2025 19:37:20 +0000 (19:37 +0000)
commit	dad36cf91992ba78a8c3e51eaaf5a95bf19fefa8
tree	831b47057d8db5c0587e529a43e214b548c94bc5
parent	08a48dff02328ecae81250df39e0e9940afd4e67

gdb/dwarf2/cooked-index-worker.h
gdb/dwarf2/read.c