Make some changes in the reassoc pass to make it more friendly to the later
FMA pass. Using FMA instead of separate mult + add instructions reduces
register pressure and the number of instructions retired.
There are two main changes:
1. Put no-mult ops and mult ops alternately at the end of the ops queue,
which helps generate more FMAs and reduces the loss of FMA opportunities
when breaking the chain.
2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
chains according to the given association width, keeping the FMA chances as
much as possible.
With the patch applied:
On ICX:
507.cactuBSSN_r: Improved by 1.70% for multi-copy.
503.bwaves_r:    Improved by 0.60% for single-copy.
507.cactuBSSN_r: Improved by 1.10% for single-copy.
519.lbm_r:       Improved by 2.21% for single-copy.
No measurable changes for other benchmarks.
On aarch64:
507.cactuBSSN_r: Improved by 1.70% for multi-copy.
503.bwaves_r:    Improved by 6.00% for single-copy.
No measurable changes for other benchmarks.
TEST1:
float
foo (float a, float b, float c, float d, float *e)
{
return *e + a * b + c * d;
}
For "-Ofast -mfpmath=sse -mfma" GCC generates:
vmulss %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss (%rdi), %xmm0, %xmm0
ret
With this patch GCC generates:
vfmadd213ss (%rdi), %xmm1, %xmm0
vfmadd231ss %xmm2, %xmm3, %xmm0
ret