From a01cb764bd34d3bb11b12a3262ffaffe2a99fd22 Mon Sep 17 00:00:00 2001
From: Simon Marchi
Date: Fri, 19 Sep 2025 16:27:00 -0400
Subject: [PATCH] gdbsupport: use dynamic partitioning in
 gdb::parallel_for_each
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

gdb::parallel_for_each uses static partitioning of the workload,
meaning that each worker thread receives a similar number of work
items.  Change it to use dynamic partitioning, where worker threads
pull work items from a shared work queue when they need to.

Note that gdb::parallel_for_each is currently only used for processing
minimal symbols in GDB.  I am looking at improving the startup
performance of GDB, where minimal symbol processing is one step.

With static partitioning, there is a risk of workload imbalance if some
threads receive "easier" work than others: some threads sit idle while
others are still working through their share of the work.  This is not
desirable, because the gdb::parallel_for_each takes as long as the
slowest thread takes.  When loading a file with a lot of minimal
symbols (~600k) in GDB, with "maint set per-command time on", I observe
some imbalance:

  Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
  Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
  Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
  Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
  Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
  Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
  Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
  Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU

The fastest thread took 0.732 seconds to complete its work (and then
sat idle), while the slowest took 3.927 seconds.  This means the
parallel_for_each took a bit less than 4 seconds.  Even if the number
of minimal symbols assigned to each worker is the same, I suppose that
some symbols (e.g. those that need demangling) take longer to process,
which could explain the imbalance.

With this patch, things are much more balanced:

  Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
  Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
  Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
  Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
  Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
  Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
  Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
  Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU

In this version, the parallel_for_each took about 2.8 seconds, a
reduction of ~1.2 seconds for this step.  Not life-changing, but still
a nice win, I think.

Note that this patch helps when loading big programs.  My go-to test
program for this is telegram-desktop, which I built from source.  For
small programs (including loading gdb itself), it makes no perceptible
difference.
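In a nutshell, the dynamic partitioning boils down to a shared atomic
cursor that each worker advances to claim its next batch.  Here is a
simplified sketch of the claim loop, distilled from the parallel-for.h
change below (the names "item", "process", "first", "last" and
"batch_size" are placeholders for this example, not code from the
patch):

  std::atomic<item *> next = first;

  /* Each worker thread repeats this until the queue is drained.  */
  for (;;)
    {
      item *batch_first = next.load ();
      size_t n_remaining = last - batch_first;
      if (n_remaining == 0)
        break;

      item *batch_last = batch_first + std::min (batch_size, n_remaining);

      /* Claim [batch_first, batch_last).  If another worker moved NEXT
         in the meantime, the compare-exchange fails and we retry.  */
      if (next.compare_exchange_weak (batch_first, batch_last))
        process (batch_first, batch_last);  /* placeholder */
    }

A thread that draws a batch of "hard" items simply comes back for less
work later, which is what keeps the workers balanced.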
Now the technical bits:

- One impact that this change has on minimal symbol processing
  specifically is that the calls to compute_and_set_names (a critical
  region guarded by a mutex) are no longer all done at the end of each
  worker thread's task.  Before this patch, each thread would compute
  the names and hash values for all the minimal symbols it had been
  assigned, and would then call compute_and_set_names for all of them
  while holding the mutex (thus preventing other threads from doing
  this same step).  With the shared work queue approach, each thread
  grabs a batch of minimal symbols, computes the names and hash values
  for them, and then calls compute_and_set_names (with the mutex held)
  for this batch only.  It repeats that until the work queue is empty.
  There are therefore more, smaller, spread-out compute_and_set_names
  critical sections, instead of just one per worker thread at the end.

  Given that before this patch the work was not well balanced among
  worker threads, I guess that threads would enter that critical
  region at different times, causing little contention.  In the "with
  this patch" results, the CPU utilization numbers are not as good,
  suggesting that there is some contention.  But I don't know if it's
  contention on the compute_and_set_names critical section or on the
  shared work queue.  That can be investigated later.  In any case,
  what ultimately counts is the wall time, which improves.

- One choice I had to make was how many work items (in this case
  minimal symbols) each worker should pop when getting work from the
  shared queue.  The general wisdom is that:

  - pop too few items, and the synchronization overhead becomes
    significant, increasing the total processing time;

  - pop too many items, and we get some imbalance back, again
    increasing the total processing time.

  I experimented with a dynamic batch size proportional to the number
  of remaining work items.  It worked well in some cases but not
  always, so I decided to keep it simple, with a fixed batch size.
  That can always be tweaked later.

- I still want to be able to use scoped_time_it to measure the time
  each worker thread spends working on the task.  I find it really
  handy when measuring the performance impact of changes.
  Unfortunately, the current interface of gdb::parallel_for_each,
  which receives a simple callback, is not well-suited for that once I
  introduce dynamic partitioning.  The callback would get called once
  per batch of work items (multiple times per worker thread), so it
  would not be possible to maintain a per-worker-thread object for the
  duration of the parallel for.  To allow this, I changed
  gdb::parallel_for_each to receive a worker type as a template
  parameter.  Each worker thread creates one local instance of that
  type and calls operator() on it for each batch of work items (a
  sketch of the resulting usage follows this list).  By making a
  scoped_time_it object a field of that worker, we get timings per
  worker thread.  The drawback of this approach is that we must now
  define the parallel task in a separate class and manually capture
  any needed context as fields of that class.
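To illustrate the new interface, here is a minimal sketch of a caller
(the worker type "my_worker", the element type int, the function
"process_item" and the values passed are invented for this example;
the real usage is minimal_symbol_install_worker in the minsyms.c hunk
below):

  /* Per-worker-thread state lives in the fields; the object is
     constructed once per worker thread, not once per batch.  */
  struct my_worker
  {
    /* CONTEXT is forwarded from the parallel_for_each call site.  */
    explicit my_worker (int context)
      : m_context (context)
    {}

    /* Called once per batch popped from the shared work queue.  */
    void operator() (int *first, int *last)
    {
      for (int *it = first; it != last; ++it)
        process_item (*it, m_context);  /* placeholder */
    }

  private:
    int m_context;
  };

  /* Pop batches of up to 1000 items; 42 is forwarded to each worker's
     constructor.  */
  gdb::parallel_for_each<1000, int *, my_worker> (data, data + size, 42);

A scoped_time_it member would be added as just another field of the
worker to time each worker thread, exactly as
minimal_symbol_install_worker does below.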
Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
Approved-By: Tom Tromey
---
 gdb/minsyms.c                          | 140 ++++++++++++--------
 gdb/unittests/parallel-for-selftests.c |  27 +++-
 gdbsupport/parallel-for.h              | 169 ++++++++++++-------------
 3 files changed, 195 insertions(+), 141 deletions(-)

diff --git a/gdb/minsyms.c b/gdb/minsyms.c
index c4a0fa87376..cc50febba6c 100644
--- a/gdb/minsyms.c
+++ b/gdb/minsyms.c
@@ -1390,6 +1390,88 @@ build_minimal_symbol_hash_tables
     }
 }
 
+/* gdb::parallel_for_each worker to compute minimal symbol names and
+   hashes.  */
+
+class minimal_symbol_install_worker
+{
+public:
+  minimal_symbol_install_worker
+    (minimal_symbol *msymbols,
+     gdb::array_view<computed_hash_values> hash_values,
+     objfile_per_bfd_storage *per_bfd
+#if CXX_STD_THREAD
+     , std::mutex &demangled_mutex
+#endif
+     )
+    : m_time_it ("minsym install worker"),
+      m_msymbols (msymbols),
+      m_hash_values (hash_values),
+      m_per_bfd (per_bfd)
+#if CXX_STD_THREAD
+      , m_demangled_mutex (demangled_mutex)
+#endif
+  {}
+
+  void operator() (minimal_symbol *start, minimal_symbol *end) noexcept
+  {
+    for (minimal_symbol *msym = start; msym < end; ++msym)
+      {
+        size_t idx = msym - m_msymbols;
+        m_hash_values[idx].name_length = strlen (msym->linkage_name ());
+
+        if (!msym->name_set)
+          {
+            /* This will be freed later, by compute_and_set_names.  */
+            gdb::unique_xmalloc_ptr<char> demangled_name
+              = symbol_find_demangled_name (msym, msym->linkage_name ());
+            msym->set_demangled_name (demangled_name.release (),
+                                      &m_per_bfd->storage_obstack);
+            msym->name_set = 1;
+          }
+
+        /* This mangled_name_hash computation has to be outside of
+           the name_set check, or compute_and_set_names below will
+           be called with an invalid hash value.  */
+        m_hash_values[idx].mangled_name_hash
+          = fast_hash (msym->linkage_name (), m_hash_values[idx].name_length);
+        m_hash_values[idx].minsym_hash = msymbol_hash (msym->linkage_name ());
+
+        /* We only use this hash code if the search name differs
+           from the linkage name.  See the code in
+           build_minimal_symbol_hash_tables.  */
+        if (msym->search_name () != msym->linkage_name ())
+          m_hash_values[idx].minsym_demangled_hash
+            = search_name_hash (msym->language (), msym->search_name ());
+      }
+
+    {
+      /* To limit how long we hold the lock, we only acquire it here
+         and not while we demangle the names above.  */
+#if CXX_STD_THREAD
+      std::lock_guard<std::mutex> guard (m_demangled_mutex);
+#endif
+      for (minimal_symbol *msym = start; msym < end; ++msym)
+        {
+          size_t idx = msym - m_msymbols;
+          std::string_view name (msym->linkage_name (),
+                                 m_hash_values[idx].name_length);
+          hashval_t hashval = m_hash_values[idx].mangled_name_hash;
+
+          msym->compute_and_set_names (name, false, m_per_bfd, hashval);
+        }
+    }
+  }
+
+private:
+  scoped_time_it m_time_it;
+  minimal_symbol *m_msymbols;
+  gdb::array_view<computed_hash_values> m_hash_values;
+  objfile_per_bfd_storage *m_per_bfd;
+#if CXX_STD_THREAD
+  std::mutex &m_demangled_mutex;
+#endif
+};
+
 /* Add the minimal symbols in the existing bunches to the objfile's official
    minimal symbol table.  In most cases there is no minimal symbol table yet
    for this objfile, and the existing bunches are used to create one.  Once
@@ -1476,59 +1558,15 @@ minimal_symbol_reader::install ()
       std::vector<computed_hash_values> hash_values (mcount);
 
       msymbols = m_objfile->per_bfd->msymbols.get ();
-      /* Arbitrarily require at least 10 elements in a thread.  */
-      gdb::parallel_for_each<10> (&msymbols[0], &msymbols[mcount],
-         [&] (minimal_symbol *start, minimal_symbol *end)
-         {
-           scoped_time_it time_it ("minsyms install worker");
-
-           for (minimal_symbol *msym = start; msym < end; ++msym)
-             {
-               size_t idx = msym - msymbols;
-               hash_values[idx].name_length = strlen (msym->linkage_name ());
-               if (!msym->name_set)
-                 {
-                   /* This will be freed later, by compute_and_set_names.  */
-                   gdb::unique_xmalloc_ptr<char> demangled_name
-                     = symbol_find_demangled_name (msym, msym->linkage_name ());
-                   msym->set_demangled_name
-                     (demangled_name.release (),
-                      &m_objfile->per_bfd->storage_obstack);
-                   msym->name_set = 1;
-                 }
-               /* This mangled_name_hash computation has to be outside of
-                  the name_set check, or compute_and_set_names below will
-                  be called with an invalid hash value.  */
-               hash_values[idx].mangled_name_hash
-                 = fast_hash (msym->linkage_name (),
-                              hash_values[idx].name_length);
-               hash_values[idx].minsym_hash
-                 = msymbol_hash (msym->linkage_name ());
-               /* We only use this hash code if the search name differs
-                  from the linkage name.  See the code in
-                  build_minimal_symbol_hash_tables.  */
-               if (msym->search_name () != msym->linkage_name ())
-                 hash_values[idx].minsym_demangled_hash
-                   = search_name_hash (msym->language (), msym->search_name ());
-             }
-           {
-             /* To limit how long we hold the lock, we only acquire it here
-                and not while we demangle the names above.  */
+
+      gdb::parallel_for_each<1000, minimal_symbol *, minimal_symbol_install_worker>
+        (&msymbols[0], &msymbols[mcount], msymbols,
+         gdb::array_view<computed_hash_values> (hash_values),
+         m_objfile->per_bfd
 #if CXX_STD_THREAD
-             std::lock_guard<std::mutex> guard (demangled_mutex);
+         , demangled_mutex
 #endif
-             for (minimal_symbol *msym = start; msym < end; ++msym)
-               {
-                 size_t idx = msym - msymbols;
-                 msym->compute_and_set_names
-                   (std::string_view (msym->linkage_name (),
-                                      hash_values[idx].name_length),
-                    false,
-                    m_objfile->per_bfd,
-                    hash_values[idx].mangled_name_hash);
-               }
-           }
-         });
+         );
 
       build_minimal_symbol_hash_tables (m_objfile, hash_values);
     }
diff --git a/gdb/unittests/parallel-for-selftests.c b/gdb/unittests/parallel-for-selftests.c
index 559446deef4..9d86aa67b10 100644
--- a/gdb/unittests/parallel-for-selftests.c
+++ b/gdb/unittests/parallel-for-selftests.c
@@ -60,6 +60,9 @@ test_one (do_foreach_t do_foreach, int upper_bound)
   std::vector<int> output;
   std::mutex mtx;
 
+  /* The (unfortunate) reason why we don't use std::vector<int>::iterator as
+     the parallel-for-each iterator type is that std::atomic won't work with
+     that type when building with -D_GLIBCXX_DEBUG.  */
   do_foreach (input.data (), input.data () + input.size (),
               [&] (int *start, int *end)
               {
@@ -97,12 +100,32 @@ test_one_function (int n_threads, do_foreach_t do_foreach)
 static void
 test_parallel_for_each ()
 {
+  struct test_worker
+  {
+    /* DUMMY is there to test passing multiple arguments to the worker
+       constructor.  */
+    test_worker (foreach_callback_t callback, int dummy)
+      : m_callback (callback)
+    {
+    }
+
+    void operator() (int *first, int *last)
+    {
+      return m_callback (first, last);
+    }
+
+  private:
+    foreach_callback_t m_callback;
+  };
+
   const std::vector<do_foreach_t> for_each_functions
   {
     [] (int *start, int *end, foreach_callback_t callback)
-    { gdb::parallel_for_each<1> (start, end, callback); },
+    { gdb::parallel_for_each<1, int *, test_worker> (start, end, callback,
+                                                     0); },
     [] (int *start, int *end, foreach_callback_t callback)
-    { gdb::sequential_for_each (start, end, callback);}
+    { gdb::sequential_for_each<int *, test_worker> (start, end, callback,
+                                                    0); },
   };
 
   int default_thread_count = gdb::thread_pool::g_thread_pool->thread_count ();
diff --git a/gdbsupport/parallel-for.h b/gdbsupport/parallel-for.h
index 50406763ded..8d085df42fb 100644
--- a/gdbsupport/parallel-for.h
+++ b/gdbsupport/parallel-for.h
@@ -21,117 +21,108 @@
 #define GDBSUPPORT_PARALLEL_FOR_H
 
 #include <algorithm>
-#include <type_traits>
+#include <atomic>
+#include <tuple>
 #include "gdbsupport/thread-pool.h"
-#include "gdbsupport/function-view.h"
 
 namespace gdb
 {
 
-/* A very simple "parallel for".  This splits the range of iterators
-   into subranges, and then passes each subrange to the callback.  The
-   work may or may not be done in separate threads.
+/* A "parallel-for" implementation using a shared work queue.  Work items get
+   popped in batches of size up to BATCH_SIZE from the queue and handed out to
+   worker threads.
 
-   This approach was chosen over having the callback work on single
-   items because it makes it simple for the caller to do
-   once-per-subrange initialization and destruction.
+   Each worker thread instantiates an object of type Worker, forwarding ARGS
+   to its constructor.  The Worker object can be used to keep some per-worker
+   thread state.
 
-   The parameter N says how batching ought to be done -- there will be
-   at least N elements processed per thread.  Setting N to 0 is not
-   allowed.  */
+   Worker threads call Worker::operator() repeatedly until the queue is
+   empty.  */
 
-template<std::size_t n, class RandomIt, class RangeFunction>
+template<std::size_t batch_size, typename RandomIt, typename Worker,
+         typename... WorkerArgs>
 void
-parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
+parallel_for_each (const RandomIt first, const RandomIt last,
+                   WorkerArgs &&...worker_args)
 {
   /* If enabled, print debug info about how the work is distributed across
      the threads.  */
   const bool parallel_for_each_debug = false;
 
-  size_t n_worker_threads = thread_pool::g_thread_pool->thread_count ();
-  size_t n_threads = n_worker_threads;
-  size_t n_elements = last - first;
-  size_t elts_per_thread = 0;
-  size_t elts_left_over = 0;
+  gdb_assert (first <= last);
 
-  if (n_threads > 1)
+  if (parallel_for_each_debug)
     {
-      /* Require that there should be at least N elements in a
-         thread.  */
-      gdb_assert (n > 0);
-      if (n_elements / n_threads < n)
-        n_threads = std::max (n_elements / n, (size_t) 1);
-      elts_per_thread = n_elements / n_threads;
-      elts_left_over = n_elements % n_threads;
-      /* n_elements == n_threads * elts_per_thread + elts_left_over.  */
+      debug_printf ("Parallel for: n elements: %zu\n",
+                    static_cast<size_t> (last - first));
+      debug_printf ("Parallel for: batch size: %zu\n", batch_size);
     }
 
-  size_t count = n_threads == 0 ? 0 : n_threads - 1;
+  const size_t n_worker_threads
+    = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
 
   std::vector<gdb::future<void>> results;
 
-  if (parallel_for_each_debug)
-    {
-      debug_printf (_("Parallel for: n_elements: %zu\n"), n_elements);
-      debug_printf (_("Parallel for: minimum elements per thread: %zu\n"), n);
-      debug_printf (_("Parallel for: elts_per_thread: %zu\n"), elts_per_thread);
-    }
+  /* The next item to hand out.  */
+  std::atomic<RandomIt> next = first;
 
-  for (int i = 0; i < count; ++i)
-    {
-      RandomIt end;
-      end = first + elts_per_thread;
-      if (i < elts_left_over)
-        /* Distribute the leftovers over the worker threads, to avoid having
-           to handle all of them in a single thread.  */
-        end++;
-
-      /* This case means we don't have enough elements to really
-         distribute them.  Rather than ever submit a task that does
-         nothing, we short-circuit here.  */
-      if (first == end)
-        end = last;
-
-      if (end == last)
-        {
-          /* We're about to dispatch the last batch of elements, which
-             we normally process in the main thread.  So just truncate
-             the result list here.  This avoids submitting empty tasks
-             to the thread pool.  */
-          count = i;
-          break;
-        }
-
-      if (parallel_for_each_debug)
-        {
-          debug_printf (_("Parallel for: elements on worker thread %i\t: %zu"),
-                        i, (size_t)(end - first));
-          debug_printf (_("\n"));
-        }
-      results.push_back (gdb::thread_pool::g_thread_pool->post_task ([=] ()
-        {
-          return callback (first, end);
-        }));
-      first = end;
-    }
-
-  for (int i = count; i < n_worker_threads; ++i)
-    if (parallel_for_each_debug)
-      {
-        debug_printf (_("Parallel for: elements on worker thread %i\t: 0"), i);
-        debug_printf (_("\n"));
-      }
-
-  /* Process all the remaining elements in the main thread.  */
-  if (parallel_for_each_debug)
-    {
-      debug_printf (_("Parallel for: elements on main thread\t\t: %zu"),
-                    (size_t)(last - first));
-      debug_printf (_("\n"));
-    }
-
-  if (first != last)
-    callback (first, last);
+  /* The worker thread task.
+
+     We need to capture args as a tuple, because it's not possible to capture
+     the parameter pack directly in C++17.  Once we migrate to C++20, the
+     capture can be simplified to:
+
+       ... args = std::forward<WorkerArgs> (args)
+
+     and `args` can be used as-is in the lambda.  */
+  auto args_tuple
+    = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
+  auto task = [&next, first, last, n_worker_threads, &args_tuple] ()
+    {
+      /* Instantiate the user-defined worker.  */
+      auto worker = std::make_from_tuple<Worker> (args_tuple);
+
+      for (;;)
+        {
+          /* Grab a snapshot of NEXT.  */
+          auto local_next = next.load ();
+          gdb_assert (local_next <= last);
+
+          /* Number of remaining items.  */
+          auto n_remaining = last - local_next;
+          gdb_assert (n_remaining >= 0);
+
+          /* Are we done?  */
+          if (n_remaining == 0)
+            break;
+
+          const auto this_batch_size
+            = std::min<size_t> (batch_size, n_remaining);
+
+          /* The range to process in this iteration.  */
+          const auto this_batch_first = local_next;
+          const auto this_batch_last = local_next + this_batch_size;
+
+          /* Update NEXT.  If the current value of NEXT doesn't match
+             LOCAL_NEXT, it means another thread updated it concurrently,
+             restart.  */
+          if (!next.compare_exchange_weak (local_next, this_batch_last))
+            continue;
+
+          if (parallel_for_each_debug)
+            debug_printf ("Processing %zu items, range [%zu, %zu[\n",
+                          this_batch_size,
+                          static_cast<size_t> (this_batch_first - first),
+                          static_cast<size_t> (this_batch_last - first));
+
+          worker (this_batch_first, this_batch_last);
+        }
+    };
+
+  /* Start N_WORKER_THREADS tasks.  */
+  for (int i = 0; i < n_worker_threads; ++i)
+    results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));
 
+  /* Wait for all of them to be finished.  */
   for (auto &fut : results)
     fut.get ();
 }
@@ -140,12 +131,14 @@ parallel_for_each (RandomIt first, RandomIt last, RangeFunction callback)
    when debugging multi-threading behavior, and you want to limit
    multi-threading in a fine-grained way.  */
 
-template<class RandomIt, class RangeFunction>
+template<typename RandomIt, typename Worker, typename... WorkerArgs>
 void
-sequential_for_each (RandomIt first, RandomIt last, RangeFunction callback)
+sequential_for_each (RandomIt first, RandomIt last,
+                     WorkerArgs &&...worker_args)
 {
-  if (first != last)
-    callback (first, last);
+  if (first == last)
+    return;
+
+  Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
 }
 
 }
-- 
2.47.3