At the moment, early-ra ducks out of allocating any region
that contains a register with both a strong FPR affinity and
a strong GPR affinity. The proper allocators are much better
at handling that situation.
But this means that early-ra tends not to allocate a region
of vector code that ends in a reduction to a scalar integer
if any later arithmetic is done on the scalar integer result.
Currently, if a block acts as an isolated allocation region, the pass
will try to split the block into subregions *between* instructions if
there are no live FPRs or FPR allocnos. In the reduction case described
above, it's convenient to try the same thing *within* instructions.
If a block of vector code ends in a reduction, all FPRs and FPR
allocnos will be dead between the "use phase" and the "def phase"
of the reduction: the vector input will then have died, but the
scalar result will not yet have been born.
If we split the block that way, the problematic reduction result
will be part of the second region, which we can skip allocating,
but the vector work will be part of a separate region, which we
might be able to allocate.
This avoids a MOV in the testcase and also helps a small amount
with x264.
gcc/
* config/aarch64/aarch64-early-ra.cc
(early_ra::IGNORE_REG): New flag.
(early_ra::fpr_preference): Handle it.
(early_ra::record_constraints): Fail the allocation if an
IGNORE_REG output operand is not independent of the inputs.
(defines_multi_def_pseudo): New function.
(early_ra::could_split_region_here): New member function, split
out from...
(early_ra::process_block): ...here. Try splitting a block into
multiple regions between the definition and use phases of an
instruction. Set IGNORE_REG on the output registers if we do so.
gcc/testsuite/
* gcc.target/aarch64/early_ra_1.c: New test.