The pass is local (it only considers one BB at a time). In theory, it should
be possible to extend it to run over EBBs, at least in the case of pure
(MEM_READONLY_P) loads, but this is left for future work.
The pass works by identifying two kinds of bases: tree decls obtained
via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
If a candidate memory access has a MEM_EXPR base, then we track it via
that base; otherwise, if it is of a simple reg + <imm> form, we track
it via the RTL-SSA def_info for the base register.
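
To illustrate, the key for a candidate access might be modelled roughly as
below. This is only a simplified sketch, not the pass's actual types: the
handle structs are hypothetical stand-ins for GCC's tree decl and
rtl_ssa::def_info pointers.

  // Simplified sketch of how a candidate access is keyed.  The handle
  // types are placeholders; the real pass uses tree decls (from MEM_EXPR)
  // and RTL-SSA def_infos directly.
  #include <cstdint>

  struct tree_decl_handle {};  // stand-in for a tree decl (MEM_EXPR base)
  struct reg_def_handle {};    // stand-in for an RTL-SSA def_info

  enum class base_kind { tree_decl, reg_def };

  struct access_base
  {
    base_kind kind;
    tree_decl_handle *decl;  // used when kind == base_kind::tree_decl
    reg_def_handle *def;     // used when kind == base_kind::reg_def
  };

  // A candidate access is then (base, offset), with the offset taken
  // relative to the chosen base (the <imm> in a reg + <imm> address).
  struct candidate_access
  {
    access_base base;
    int64_t offset;
  };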
For each BB, for a given kind of base, we build up a hash table mapping
the base to an access_group. The access_group data structure holds a
list of accesses at each offset relative to the same base. It uses a
splay tree to support efficient insertion (while walking the BB), and
the nodes are chained using a linked list to support efficient
iteration (while doing the transformation).
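
The shape of that structure might look roughly as follows. Again, this is
a hedged sketch with placeholder types (insn_handle and the exact node and
group layout are illustrative); the real pass has its own splay tree and
access group implementation.

  // Simplified sketch of the per-base group described above.  Each node
  // collects the insns that access a given offset from the base.  Nodes
  // live in a splay tree keyed on the offset (cheap insertion while
  // walking the BB) and are also threaded onto a list in offset order
  // (cheap in-order iteration when forming pairs).
  #include <cstdint>
  #include <vector>

  struct insn_handle {};  // stand-in for an RTL-SSA insn_info

  struct access_record
  {
    int64_t offset;                    // offset from the common base
    std::vector<insn_handle *> insns;  // accesses at this offset
    access_record *left, *right;       // splay-tree children
    access_record *next;               // next record in offset order
  };

  struct access_group
  {
    access_record *root;   // root of the splay tree
    access_record *first;  // head of the offset-ordered list
  };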
For each base, we then iterate over the access_group to identify
adjacent accesses, and try to form load/store pairs for those insns that
access adjacent memory.
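
Conceptually, this is a linear walk over the offset-ordered list.
Continuing the sketch above (try_fuse_pair and access_size stand in for
the pass's real pairing logic and access-size handling):

  // Hedged sketch of the pairing walk: whenever two consecutive records
  // are exactly one access apart, attempt to fuse an insn from each into
  // a load/store pair.  try_fuse_pair is a placeholder for the real
  // legality/profitability checks and the actual transformation.
  extern bool try_fuse_pair (insn_handle *, insn_handle *);

  static void
  form_pairs (access_group &group, int64_t access_size)
  {
    for (access_record *r = group.first; r && r->next; r = r->next)
      if (r->next->offset - r->offset == access_size
          && !r->insns.empty () && !r->next->insns.empty ())
        try_fuse_pair (r->insns.front (), r->next->insns.front ());
  }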
The pass is currently run twice, both before and after register
allocation. The first copy of the pass is run late in the pre-RA RTL
pipeline, immediately after sched1, since it was found that sched1 was
increasing register pressure when the pass was run before. The second
copy of the pass runs immediately before peephole2, so as to get any
opportunities that the existing ldp/stp peepholes can handle.
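
In aarch64-passes.def terms, that corresponds to two insertions along
roughly these lines (a sketch only: pass_sched here is sched1's internal
name, and the instance numbers assume a single prior instance of each
reference pass):

  INSERT_PASS_AFTER (pass_sched, 1, pass_ldp_fusion);
  INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);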
There are some cases that we punt on before RA, e.g.
accesses relative to eliminable regs (such as the soft frame pointer).
We do this since we can't know the elimination offset before RA, and we
want to avoid the RA having to reload the offset (should it turn out to
be out of ldp/stp immediate range), as this can generate worse code.
The post-RA copy of the pass is there to pick up the crumbs that were
left behind / things we punted on in the pre-RA pass. Among other
things, it's needed to handle accesses relative to the stack pointer.
It can also handle code that didn't exist at the time the pre-RA pass
was run (spill code, prologue/epilogue code).
This is an initial implementation; among other possible improvements,
the following notable caveats / missing features are left for future
work and could give further gains:
- Moving accesses between BBs within an EBB, see above.
- Out-of-range opportunities: currently the pass refuses to form pairs
if there isn't a suitable base register with an immediate in range
for ldp/stp, but it can be profitable to emit anchor addresses in the
case that there are four or more out-of-range nearby accesses that can
be formed into pairs. This is handled by the current ldp/stp
peepholes, so it would be good to support this in the future.
- Discovery: currently we prioritize MEM_EXPR bases over RTL bases, which can
lead to missed opportunities when two accesses have distinct MEM_EXPR bases
(i.e. different DECLs) but are still adjacent in memory (e.g. adjacent
variables on the stack). I hope to address this for GCC 15, ideally getting
to the point where we can remove the ldp/stp peepholes and scheduling hooks.
Furthermore, it would be nice to make the pass aware of section anchors
(adding these as a third kind of base), allowing merging of accesses to
adjacent variables within the same section.
gcc/ChangeLog:
* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
before and after RA.
* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
(-mlate-ldp-fusion): New.
(--param=aarch64-ldp-alias-check-limit): New.
(--param=aarch64-ldp-writeback): New.
* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
* config/aarch64/aarch64-ldp-fusion.cc: New file.
* doc/invoke.texi (AArch64 Options): Document new
-m{early,late}-ldp-fusion options.