aarch64: Take into account when VF is higher than known scalar iters
Consider low overhead loops like:
void
foo (char *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < 9; i++)
    {
      int res = c[i];
      int t = b[i];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}
For such loops we use latency-only costing, since the loop bounds are known
and small.

The current costing, however, does not consider the case where niters < VF.
When comparing the scalar and vector costs, it scales the scalar cost by the
full VF even though the scalar loop can never execute VF iterations.  This
overestimates the cost of the scalar loop, and we incorrectly vectorize.

This patch uses the minimum of the VF and niters in such cases.
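The capping can be illustrated with a small standalone sketch.  This is not
the code in aarch64.cc; the helper names are hypothetical and only model the
arithmetic described above, assuming the scalar cost is scaled linearly by
the number of scalar iterations that one vector iteration replaces.

```c
#include <assert.h>

/* Hypothetical helper (not a GCC function): the number of scalar
   iterations that one vector iteration can actually replace when the
   iteration count is known.  */
static unsigned int
effective_vf (unsigned int vf, unsigned int known_niters)
{
  assert (vf > 0 && known_niters > 0);
  return vf < known_niters ? vf : known_niters;
}

/* Hypothetical helper: cycles the scalar loop needs to do the work of
   one vector iteration, using the capped VF.  */
static double
scalar_cycles_per_vector_iter (double scalar_cycles_per_iter,
                               unsigned int vf, unsigned int known_niters)
{
  return scalar_cycles_per_iter * effective_vf (vf, known_niters);
}

/* Nonzero if the scalar code is estimated to issue in fewer cycles
   than the vector code.  */
static int
scalar_issues_faster (double scalar_cycles, double vector_cycles)
{
  return scalar_cycles < vector_cycles;
}
```

With the figures quoted in this message (1.0 scalar cycles per iteration,
VF 32, 9 known iterations, 12.0 cycles per SVE iteration), the uncapped
scalar estimate is 1.0 * 32 = 32 cycles, which makes the vector loop look
cheaper.  The capped estimate is 1.0 * 9 = 9 cycles, which is below 12, so
the scalar loop is correctly seen as cheaper.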
Before the patch we generate:
note: Original vector body cost = 46
note: Vector loop iterates at most 1 times
note: Scalar issue estimate:
note: load operations = 2
note: store operations = 1
note: general operations = 1
note: reduction latency = 0
note: estimated min cycles per iteration = 1.000000
note: estimated cycles per vector iteration (for VF 32) = 32.000000
note: SVE issue estimate:
note: load operations = 5
note: store operations = 4
note: general operations = 11
note: predicate operations = 12
note: reduction latency = 0
note: estimated min cycles per iteration without predication = 5.500000
note: estimated min cycles per iteration for predication = 12.000000
note: estimated min cycles per iteration = 12.000000
note: Low iteration count, so using pure latency costs
note: Cost model analysis:
After the patch we generate:
note: Original vector body cost = 46
note: Known loop bounds, capping VF to 9 for analysis
note: Vector loop iterates at most 1 times
note: Scalar issue estimate:
note: load operations = 2
note: store operations = 1
note: general operations = 1
note: reduction latency = 0
note: estimated min cycles per iteration = 1.000000
note: estimated cycles per vector iteration (for VF 9) = 9.000000
note: SVE issue estimate:
note: load operations = 5
note: store operations = 4
note: general operations = 11
note: predicate operations = 12
note: reduction latency = 0
note: estimated min cycles per iteration without predication = 5.500000
note: estimated min cycles per iteration for predication = 12.000000
note: estimated min cycles per iteration = 12.000000
note: Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
note: Low iteration count, so using pure latency costs
note: Cost model analysis:
gcc/ChangeLog:
* config/aarch64/aarch64.cc (adjust_body_cost):
Cap VF for low iteration loops.