]> git.ipfire.org Git - thirdparty/gcc.git/commit
aarch64: Extend HVLA permutations to big-endian
authorRichard Sandiford <richard.sandiford@arm.com>
Thu, 10 Jul 2025 09:57:28 +0000 (10:57 +0100)
committerRichard Sandiford <richard.sandiford@arm.com>
Thu, 10 Jul 2025 09:57:28 +0000 (10:57 +0100)
commit3b870131487d786a74f27a89d0415c8207770f14
tree1af39cd33df5dc488187c053543b1666948b3a09
parent18c48295afb424bfc5c1fbb812e68119e9eb4ccb
aarch64: Extend HVLA permutations to big-endian

TARGET_VECTORIZE_VEC_PERM_CONST has code to match the SVE2.1
"hybrid VLA" DUPQ, EXTQ, UZPQ{1,2}, and ZIPQ{1,2} instructions.
This matching was conditional on !BYTES_BIG_ENDIAN.

The ACLE code also lowered the associated SVE2.1 intrinsics into
suitable VEC_PERM_EXPRs.  This lowering was not conditional on
!BYTES_BIG_ENDIAN.

The mismatch led to lots of ICEs in the ACLE tests on big-endian
targets: we lowered to VEC_PERM_EXPRs that are not supported.

I think the !BYTES_BIG_ENDIAN restriction was unnecessary.
SVE maps the first memory element to the least significant end of
the register for both endiannesses, so no endian correction or lane
number adjustment is necessary.

This is in some ways a bit counterintuitive.  ZIPQ1 is conceptually
"apply Advanced SIMD ZIP1 to each 128-bit block" and endianness does
matter when choosing between Advanced SIMD ZIP1 and ZIP2.  For example,
the V4SI permute selector { 0, 4, 1, 5 } corresponds to ZIP1 for little-
endian and ZIP2 for big-endian.  But the difference between the hybrid
VLA and Advanced SIMD permute selectors is a consequence of the
difference between the SVE and Advanced SIMD element orders.

The same thing applies to ACLE intrinsics.  The current lowering of
svzipq1 etc. is correct for both endiannesses.  If ACLE code does:

  2x svld1_s32 + svzipq1_s32 + svst1_s32

then the byte-for-byte result is the same for both endiannesses.
On big-endian targets, this is different from using the Advanced SIMD
sequence below for each 128-bit block:

  2x LDR + ZIP1 + STR

In contrast, the byte-for-byte result of:

  2x svld1q_gather_s32 + svzipq1_s32 + svst11_scatter_s32

depends on endianness, since the quadword gathers and scatters use
Advanced SIMD byte ordering for each 128-bit block.  This gather/scatter
sequence behaves in the same way as the Advanced SIMD LDR+ZIP1+STR
sequence for both endiannesses.

Programmers writing ACLE code have to be aware of this difference
if they want to support both endiannesses.

The patch includes some new execution tests to verify the expansion
of the VEC_PERM_EXPRs.

gcc/
* doc/sourcebuild.texi (aarch64_sve2_hw, aarch64_sve2p1_hw): Document.
* config/aarch64/aarch64.cc (aarch64_evpc_hvla): Extend to
BYTES_BIG_ENDIAN.

gcc/testsuite/
* lib/target-supports.exp (check_effective_target_aarch64_sve2p1_hw):
New proc.
* gcc.target/aarch64/sve2/dupq_1.c: Extend to big-endian.  Add
noipa attributes.
* gcc.target/aarch64/sve2/extq_1.c: Likewise.
* gcc.target/aarch64/sve2/uzpq_1.c: Likewise.
* gcc.target/aarch64/sve2/zipq_1.c: Likewise.
* gcc.target/aarch64/sve2/dupq_1_run.c: New test.
* gcc.target/aarch64/sve2/extq_1_run.c: Likewise.
* gcc.target/aarch64/sve2/uzpq_1_run.c: Likewise.
* gcc.target/aarch64/sve2/zipq_1_run.c: Likewise.
gcc/config/aarch64/aarch64.cc
gcc/doc/sourcebuild.texi
gcc/testsuite/gcc.target/aarch64/sve2/dupq_1.c
gcc/testsuite/gcc.target/aarch64/sve2/dupq_1_run.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve2/extq_1.c
gcc/testsuite/gcc.target/aarch64/sve2/extq_1_run.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve2/uzpq_1.c
gcc/testsuite/gcc.target/aarch64/sve2/uzpq_1_run.c [new file with mode: 0644]
gcc/testsuite/gcc.target/aarch64/sve2/zipq_1.c
gcc/testsuite/gcc.target/aarch64/sve2/zipq_1_run.c [new file with mode: 0644]
gcc/testsuite/lib/target-supports.exp