This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.
The pass currently has a single objective: remove definitions by
substituting into all uses. The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.
The patch fixes PR106594. It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.
This is just a first step. I'm hoping that the pass could be
used for other combine-related optimisations in future. In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure. If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.
On most targets, the pass is enabled by default at -O2 and above.
However, it has a tendency to undo x86's STV and RPAD passes,
by folding the more complex post-STV/RPAD form back into the
simpler pre-pass form.
Also, running a pass after register allocation means that we can
now match define_insn_and_splits that were previously only matched
before register allocation. This trips things like:
(define_insn_and_split "..."
[...pattern...]
"...cond..."
"#"
"&& 1"
[...pattern...]
{
...unconditional use of gen_reg_rtx ()...;
}
because matching and splitting after RA will call gen_reg_rtx when
pseudos are no longer allowed. rs6000 has several instances of this.
xtensa has a variation in which the split condition is:
"&& can_create_pseudo_p ()"
The failure then is that, if we match after RA, we'll never be
able to split the instruction.
The patch therefore disables the pass by default on i386, rs6000
and xtensa. Hopefully we can fix those ports later (if their
maintainers want). It seems better to add the pass first, though,
to make it easier to test any such fixes.
gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
quite a few updates for the late-combine output. That might be
worth doing, but it seems too complex to do as part of this patch.
I tried compiling at least one target per CPU directory and comparing
the assembly output for parts of the GCC testsuite. This is just a way
of getting a flavour of how the pass performs; it obviously isn't a
meaningful benchmark. All targets seemed to improve on average: