AArch64: Add if-conversion target cost model [PR123017]
Since g:b219cbeda72d23b7ad6ff12cd159784b7ef00667 the following
void f(const int *restrict in,
       int *restrict out,
       int n, int threshold)
{
  for (int i = 0; i < n; ++i) {
    int v = in[i];
    if (v > threshold) {
      int t = v * 3;
      t += 7;
      t ^= 0x55;
      t *= 0x55;
      t -= 0x5;
      t &= 0xFE;
      t ^= 0x55;
      out[i] = t;
    } else {
      out[i] = v;
    }
  }
}
compiled at -O2
results in aggressive if-conversion, which increases the number of dynamic
instructions and the latency of the loop, since every iteration now has to
wait for t to be calculated.
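For illustration, the effect of the conversion can be written by hand in C.
This is only a sketch of what the transformation amounts to, not the
compiler's actual output; f_ifcvt is a hypothetical name, and unsigned
arithmetic is used so that evaluating the formerly guarded arm unconditionally
stays well defined in C.

```c
/* Sketch: source-level equivalent of the if-converted loop.
   f_ifcvt is a hypothetical name used only for this illustration.  */
void f_ifcvt(const int *restrict in, int *restrict out,
             int n, int threshold)
{
  for (int i = 0; i < n; ++i) {
    int v = in[i];
    /* The former "then" arm is now evaluated on every iteration.  */
    unsigned t = (unsigned) v * 3u;
    t += 7u;
    t ^= 0x55u;
    t *= 0x55u;
    t -= 0x5u;
    t &= 0xFEu;
    t ^= 0x55u;
    /* The select (csel on AArch64) needs both t and v before it can
       produce a result, so the whole chain above sits on its critical
       path.  t is in [0, 255] here, so the cast back to int is exact.  */
    out[i] = (v > threshold) ? (int) t : v;
  }
}
```

When v > threshold is rarely true, the branchy version skips the seven
dependent operations almost every iteration, while this form always pays for
them.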
This has led to big performance losses in packages like zstd [1], which in turn
affects packaging and LTO speed.
The default cost model for if-conversion is overly permissive and allows
if-conversions on the assumption that branches are very expensive.
This patch implements an if-conversion cost model for AArch64. AArch64 has a
number of conditional instructions that need to be accounted for; however, this
initial version keeps things simple and is only really concerned with csel.
The issue specifically with csel is that it may have to wait for two arguments
to be evaluated before it can be executed. This means it has a direct
correlation to increases in dynamic instructions.
To fix this I add a new tuning parameter, br_mispredict_factor, that gives a
rough estimate of the misprediction cost of a branch. We then accept
if-conversion as long as the converted sequence is cheaper than this factor
multiplied by the cost of branches.
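As a rough sketch of this acceptance test, assuming the budget is simply the
product of the two tuning values and that a sequence exactly at the budget is
still accepted. The function and parameter names are hypothetical and do not
mirror the actual hooks in aarch64.cc.

```c
#include <stdbool.h>

/* Hypothetical model of the acceptance test; not the code in aarch64.cc.
   seq_cost             estimated cost of the if-converted sequence
   branch_cost          tuned cost of a branch
   br_mispredict_factor rough multiplier for a mispredicted branch  */
static bool
ifcvt_within_budget (unsigned seq_cost, unsigned branch_cost,
                     unsigned br_mispredict_factor)
{
  /* Assumption: the budget is the plain product of the two values.  */
  unsigned budget = branch_cost * br_mispredict_factor;
  return seq_cost <= budget;
}
```

With a branch cost of 2 and a factor of 3, for example, a sequence costing 4
would be converted and one costing 7 would keep its branch.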
There is basic detection of CINC and CSET because these are usually fine. We
also accept all if-conversion when not inside a loop. Because CE is not an RTL
SSA pass we can't do more extensive checks, such as checking whether the csel
is a loop-carried dependency. As such this is a best-effort heuristic intended
to catch the most egregious cases like the above.
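The exemptions described above can be sketched as a small decision function.
This is an assumed shape for illustration only: the enum, the function name,
and the order of the checks are hypothetical, and max_cost stands for whatever
budget the tuning parameters yield.

```c
#include <stdbool.h>

/* Hypothetical classification of the conditional instruction that the
   conversion would produce.  */
enum cond_op { COND_CSET, COND_CINC, COND_CSEL };

/* Sketch of the policy: not the actual hook implementation.  */
static bool
accept_ifcvt (bool in_loop, enum cond_op op,
              unsigned seq_cost, unsigned max_cost)
{
  /* Outside a loop every conversion is accepted.  */
  if (!in_loop)
    return true;

  /* CSET and CINC forms are treated as usually harmless.  */
  if (op == COND_CSET || op == COND_CINC)
    return true;

  /* Otherwise (csel inside a loop) the sequence must fit the budget.  */
  return seq_cost <= max_cost;
}
```

Only the last case can reject a conversion, which matches the stated goal of
catching expensive csel sequences in loops while leaving everything else
alone.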
This recovers the ~25% performance loss in zstd decoding and gives better
results than GCC 14, which predates the regression.
Additionally, I've benchmarked all the attached examples on a number of cores
and checked various cases. On average the patch gives an improvement of
between 20% and 40%.
[1] https://github.com/facebook/zstd/pull/4418#issuecomment-3004606000
gcc/ChangeLog:
PR target/123017
* config/aarch64/aarch64-json-schema.h: Add br_mispredict_factor.
* config/aarch64/aarch64-json-tunings-parser-generated.inc
(parse_branch_costs): Add br_mispredict_factor.
* config/aarch64/aarch64-json-tunings-printer-generated.inc
(serialize_branch_costs): Add br_mispredict_factor.
* config/aarch64/aarch64-protos.h (struct cpu_branch_cost): Add
br_mispredict_factor.
* config/aarch64/aarch64.cc (aarch64_max_noce_ifcvt_seq_cost,
aarch64_noce_conversion_profitable_p,
TARGET_MAX_NOCE_IFCVT_SEQ_COST,
TARGET_NOCE_CONVERSION_PROFITABLE_P): New.
* config/aarch64/tuning_models/generic.h (generic_branch_cost): Add
br_mispredict_factor.
* config/aarch64/tuning_models/generic_armv8_a.h: Remove
generic_armv8_a_branch_cost and use generic_branch_cost.
gcc/testsuite/ChangeLog:
PR target/123017
* gcc.target/aarch64/pr123017_1.c: New test.
* gcc.target/aarch64/pr123017_2.c: New test.
* gcc.target/aarch64/pr123017_3.c: New test.
* gcc.target/aarch64/pr123017_4.c: New test.
* gcc.target/aarch64/pr123017_5.c: New test.
* gcc.target/aarch64/pr123017_6.c: New test.
* gcc.target/aarch64/pr123017_7.c: New test.