After some more benchmarking and evaluation I'd like to increase the alignments
used for -mcpu=olympus. All three of loop_align, function_align and
jump_align need adjusting to get a good mix of improvements. I've added skip
amounts to limit the code size growth. There is still some growth, but the
performance benefit on -mcpu=olympus binaries is worth it (cpython from
SPEC2026 in particular benefits consistently).

As an aside, I do think we'll want a more fine-grained description of alignment
in our CPU tuning structs for -O3 binaries. -O3 may be willing to pay the
extra padding cost for speed-tuned functions and loops, and may want to drop
the skip amount if the CPU is good enough at skipping past the padding, whereas
the -O2 settings may still want to use a skip amount to avoid excessive
distro binary bloat.
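
For reference, here is a minimal sketch of how an "n:m[:n2[:m2]]" alignment
string behaves, assuming the tuning strings follow the semantics documented for
-falign-functions and friends (align to n if that skips at most m-1 bytes,
otherwise fall back to n2 with at most m2-1 bytes, where m2 defaults to n2).
The helper names pad_to and padding_for are hypothetical and are not part of
GCC; this models the documented behaviour only and is not how the compiler
implements it.

```c
#include <assert.h>

/* Bytes of padding needed to bring ADDR up to the next multiple of ALIGN.
   ALIGN must be a power of two.  */
static unsigned
pad_to (unsigned addr, unsigned align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (align - (addr & (align - 1))) & (align - 1);
}

/* Padding inserted at ADDR for the spec N:M:N2:M2.
   Pass N2 == 0 when there is no secondary alignment.
   Hypothetical model of the documented -falign-* semantics.  */
static unsigned
padding_for (unsigned addr, unsigned n, unsigned m,
	     unsigned n2, unsigned m2)
{
  /* Primary alignment: only taken if it skips at most M-1 bytes.  */
  unsigned pad = pad_to (addr, n);
  if (pad < m)
    return pad;

  /* Secondary alignment: tried when the primary one is too expensive.  */
  if (n2 != 0)
    {
      pad = pad_to (addr, n2);
      if (pad < m2)
	return pad;
    }

  /* Neither alignment is cheap enough, so emit no padding.  */
  return 0;
}
```

Under this model, a function that would start at offset 8 gets 24 bytes of
padding with "32:25" (padding_for (8, 32, 25, 0, 0)) but none with the old
"32:16", and a loop at offset 20 with "64:33:32" falls back to the 32-byte
boundary with 12 bytes of padding (padding_for (20, 64, 33, 32, 32)).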

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/ChangeLog:

	* config/aarch64/tuning_models/olympus.h (olympus_tunings):
	Adjust loop_align, function_align, jump_align.

}, /* memmov_cost. */
10, /* issue_rate */
AARCH64_FUSE_NEOVERSE_BASE, /* fusible_ops */
- "32:16", /* function_align. */
- "4", /* jump_align. */
- "64:16", /* loop_align. */
+ "32:25", /* function_align. */
+ "16:9", /* jump_align. */
+ "64:33:32", /* loop_align. */
8, /* int_reassoc_width. */
6, /* fp_reassoc_width. */
4, /* fma_reassoc_width. */