aarch64: disable LDP via tuning structure for -mcpu=ampere1
AmpereOne (-mcpu=ampere1) breaks LDP instructions into two uops.
Given the chance that this causes instructions to slip into the next
decoding cycle and the additional overheads when handling
cacheline-crossing LDP instructions, we disable the generation of LDP
isntructions through the tuning structure from instruction combining
(such as in peephole2).
Given the code-density benefits in builtins and prologue/epilogue
expansion, we allow LDPs there.
This commit:
* adds a new tuning option AARCH64_EXTRA_TUNE_NO_LDP_COMBINE
* allows -moverride=tune=... to override this
These changes are benchmark-driven, yielding the following changes
(with a net-overall improvement):
503.bwaves_r. -0.88%
507.cactuBSSN_r 0.35%
508.namd_r 3.09%
510.parest_r -2.99%
511.povray_r 5.54%
519.lbm_r 15.83%
521.wrf_r 0.56%
526.blender_r 2.47%
527.cam4_r 0.70%
538.imagick_r 0.00%
544.nab_r -0.33%
549.fotonik3d_r. -0.42%
554.roms_r 0.00%
-------------------------
= total 1.79%
Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu> Co-Authored-By: Di Zhao <di.zhao@amperecomputing.com>
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AARCH64_EXTRA_TUNE_NO_LDP_COMBINE.
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp):
Check for the above tuning option when processing loads.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ampere1-no_ldp_combine.c: New test.