This patch implements an algorithm to lay out local data-share (LDS)
space. It currently works for AMD GCN. At the moment, LDS is used for
three things:

  1. Gang-private variables
  2. Reduction temporaries
  3. Broadcasting for worker partitioning

After the patch is applied, (2) and (3) are placed at preallocated
locations in LDS, and (1) continues to be handled by the backend, as it
was before this patch. LDS now looks like this:

  +--------------+ (gang local size + 1024, = 1536)
  |  free space  |
  |     ...      |
  | - - - - - - -|
  | worker bcast |
  +--------------+
  |  reductions  |
  +--------------+ <<< -mgang-local-size=<number> (def. 512)
  | gang private |
  |     vars     |
  +--------------+ (32)
  | low LDS vars |
  +--------------+ LDS base

So, gang-private space is fixed at a constant size at compile time
(which can be increased with the -mgang-local-size= switch if a given
piece of code needs more). The layout algorithm takes a slice of the
remaining usable space for reduction vars, and uses the rest for
worker partitioning.
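
As a rough illustration of the arithmetic implied by the diagram, the
region bounds could be computed as in the sketch below. The struct,
function and constant names are hypothetical and are not taken from the
patch; the values 32 and 1024 are read off the diagram, and alignment
is ignored.

```cpp
#include <cassert>

// Hypothetical names; only the numbers come from the diagram above.
constexpr unsigned kLowLdsVars = 32;     // space reserved at the LDS base
constexpr unsigned kSharedExtra = 1024;  // space above the gang-private area

struct LdsLayout {
  unsigned gang_private_lo, gang_private_hi;  // [lo, hi)
  unsigned reduction_lo, reduction_hi;        // [lo, hi)
  unsigned bcast_lo, bcast_hi;                // [lo, hi)
};

// Split LDS into gang-private, reduction and broadcast regions, given
// the -mgang-local-size value and the bytes needed for reduction vars.
LdsLayout layout_lds(unsigned gang_local_size, unsigned reduction_size)
{
  assert(gang_local_size >= kLowLdsVars);
  assert(reduction_size <= kSharedExtra);

  LdsLayout l;
  l.gang_private_lo = kLowLdsVars;
  l.gang_private_hi = gang_local_size;
  l.reduction_lo = gang_local_size;
  l.reduction_hi = gang_local_size + reduction_size;
  l.bcast_lo = l.reduction_hi;  // whatever is left goes to broadcasts
  l.bcast_hi = gang_local_size + kSharedExtra;
  return l;
}
```

With the default gang local size of 512 and, say, 64 bytes of reduction
temporaries, this gives reductions in [512, 576) and broadcast space in
[576, 1536).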
The partitioning algorithm works as follows.
1. An "adjacency" set is built up for each basic block that might
do a broadcast. This is calculated by starting at each such block,
and doing a recursive DFS walk over successors to find the next
block (or blocks) that *also* does a broadcast
(dfs_broadcast_reachable_1).
2. The adjacency set is inverted to get adjacent predecessor blocks also.
3. Blocks that will perform a broadcast are sorted by size of that
broadcast: the biggest blocks are handled first.
4. A splay tree structure is used to calculate the spans of LDS memory
that are already allocated by the blocks adjacent to this one
(merge_ranges{,_1}).
5. The current block's broadcast space is allocated from the first free
span not allocated in the splay tree structure calculated above
(first_fit_range). This seems to work quite nicely and efficiently
with the splay tree structure.
6. Continue with the next-biggest broadcast block until we're done.
In this way, "adjacent" broadcasts will not use the same piece of
LDS memory.
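
The steps above can be sketched roughly as follows. This is a
simplified model, not the omp-sese.c code: all names are hypothetical,
the CFG is a plain vector of blocks, and a sorted vector of ranges
stands in for the splay tree. A result of -1 means no space was found;
what the pass does in that case is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

struct Block {
  std::vector<int> succs;   // indices of CFG successors
  unsigned bcast_size = 0;  // bytes broadcast by this block, 0 if none
};

using Range = std::pair<unsigned, unsigned>;  // half-open [lo, hi)

// Step 1: DFS over successors of FROM, stopping on each path at the
// first block that also does a broadcast.
static void reach(const std::vector<Block> &cfg, int from,
                  std::vector<bool> &seen, std::set<int> &out)
{
  for (int s : cfg[from].succs) {
    if (seen[s])
      continue;
    seen[s] = true;
    if (cfg[s].bcast_size > 0)
      out.insert(s);
    else
      reach(cfg, s, seen, out);
  }
}

// Step 5: lowest offset in [lo, hi) where SIZE bytes fit clear of USED.
static long first_fit(std::vector<Range> used, unsigned lo, unsigned hi,
                      unsigned size)
{
  std::sort(used.begin(), used.end());
  unsigned pos = lo;
  for (const Range &r : used) {
    if (r.first >= pos && r.first - pos >= size)
      break;                        // the gap before R is big enough
    pos = std::max(pos, r.second);  // otherwise skip past R
  }
  return (pos <= hi && hi - pos >= size) ? (long) pos : -1;
}

// Returns one offset per block; -1 for non-broadcasting blocks or failure.
std::vector<long> assign_offsets(const std::vector<Block> &cfg,
                                 unsigned lo, unsigned hi)
{
  assert(lo <= hi);
  const int n = (int) cfg.size();

  std::vector<std::set<int>> fwd(n);
  for (int b = 0; b < n; b++)
    if (cfg[b].bcast_size > 0) {
      std::vector<bool> seen(n, false);
      reach(cfg, b, seen, fwd[b]);
    }

  // Step 2: add the inverted edges so predecessors count as adjacent too.
  std::vector<std::set<int>> adj = fwd;
  for (int b = 0; b < n; b++)
    for (int s : fwd[b])
      adj[s].insert(b);

  // Step 3: biggest broadcasts first.
  std::vector<int> order;
  for (int b = 0; b < n; b++)
    if (cfg[b].bcast_size > 0)
      order.push_back(b);
  std::stable_sort(order.begin(), order.end(), [&](int x, int y) {
    return cfg[x].bcast_size > cfg[y].bcast_size;
  });

  // Steps 4-6: gather ranges used by adjacent blocks, then first-fit.
  std::vector<long> offset(n, -1);
  for (int b : order) {
    std::vector<Range> used;
    for (int a : adj[b])
      if (offset[a] >= 0)
        used.push_back({(unsigned) offset[a],
                        (unsigned) offset[a] + cfg[a].bcast_size});
    offset[b] = first_fit(used, lo, hi, cfg[b].bcast_size);
  }
  return offset;
}
```

For a straight-line CFG 0 -> 1 -> 2 -> 3 where blocks 0, 2 and 3
broadcast 64, 32 and 64 bytes and block 1 does not broadcast, blocks 0
and 3 are not adjacent to each other and can share an offset, while
block 2 is adjacent to both and is placed after them.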
2020-07-15 Julian Brown <julian@codesourcery.com>
gcc/
* config/gcn/gcn-protos.h (gcn_goacc_adjust_private_decl): Update
prototype.
* config/gcn/gcn-tree.c (gcn_goacc_get_worker_red_decl): Use
preallocated block of LDS memory.
(gcn_goacc_create_propagation_record): Add OFFSET parameter, and return
temporary LDS space at that offset. Return pointer in "sender" case.
(gcn_goacc_adjust_private_decl): Return var.
* config/gcn/gcn.c (acc_lds_size, gangprivate_hwm, lds_allocs): New
global vars.
(ACC_LDS_SIZE): Define as acc_lds_size.
(gcn_init_machine_status): Don't initialise lds_allocated and
lds_allocs fields of machine function struct.
(gcn_option_override): Handle default size for gang-private variables
and -mgang-local-size option.
(gcn_expand_prologue): Use LDS_SIZE instead of LDS_SIZE-1 when
initialising M0_REG.
(gcn_shared_mem_layout): New function.
(gcn_print_lds_decl): Update comment. Use global lds_allocs map and
gangprivate_hwm variable.
(TARGET_GOACC_SHARED_MEM_LAYOUT): Define target hook.
* config/gcn/gcn.h (machine_function): Remove lds_allocated,
lds_allocs. Add reduction_base, reduction_limit.
* config/gcn/gcn.opt (gang_local_size_opt): New global.
(mgang-local-size=): New option.
* config/nvptx/nvptx.c (nvptx_goacc_adjust_private_decl): Return var.
* doc/tm.texi.in (TARGET_GOACC_SHARED_MEM_LAYOUT): Place documentation
hook.
* doc/tm.texi: Regenerate.
* omp-offload.c (addr_expr_rewrite_info): Change adjusted_vars to a
hash_map.
(rewrite_addr_expr): Rewrite VAR_DECLs also.
(default_goacc_create_propagation_record): Add OFFSET parameter.
(execute_oacc_gimple_workers): Calculate per-function reduction
temporary and private-variable size. Call OpenACC shared_mem_layout
hook. Move num_workers==1 handling here.
(execute_oacc_device_lower): Fix for adjusted_vars being a hash_map
rather than a hash_set.
(pass_oacc_gimple_workers::gate): Remove num_workers==1 handling from
here. Enable pass for all OpenACC routines in order to call shared
memory-layout hook.
* omp-sese.c (targhooks.h, diagnostic-core.h): Add includes.
(build_sender_ref): Handle sender_decl being pointer.
(worker_single_copy): Add PLACEMENT and ISOLATE_BROADCASTS parameters.
Pass placement argument to create_propagation_record hook invocations.
Handle sender_decl being pointer and isolate_broadcasts inserting extra
barriers.
(blk_offset_map_t): Add typedef.
(neuter_worker_single): Add BLK_OFFSET_MAP parameter. Pass
preallocated range to worker_single_copy call.
(dfs_broadcast_reachable_1): New function.
(idx_decl_pair_t, used_range_vec_t): New typedefs.
(sort_size_descending): New function.
(addr_range): New class.
(splay_tree_compare_addr_range, splay_tree_free_key, first_fit_range,
merge_ranges_1, merge_ranges): New functions.
(oacc_do_neutering): Add BOUNDS_LO, BOUNDS_HI parameters. Arrange
layout of shared memory for broadcast operations.
* omp-sese.h (oacc_do_neutering): Update prototype.
* target.def (adjust_private_decl): Change return type to tree.
(create_propagation_record): Add OFFSET parameter.
(shared_mem_layout): New hook.
* targhooks.h (default_goacc_create_propagation_record): Update
prototype.
libgomp/
* testsuite/libgomp.oacc-c-c++-common/broadcast-many.c: New test.