With the recent patch to improve detection of vector rotates at the RTL
level, combine now tries to match a V8HImode rotate by 8 in the testcase.
We can teach AArch64 to emit a REV16 instruction for such a rotate
but really this operation corresponds to the RTL code BSWAP, for which we
already have the right patterns. BSWAP is arguably a simpler representation
than ROTATE here because it has only one operand, so let's teach simplify-rtx
to generate it.
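
As an illustration only (not part of the patch), the scalar identity behind
this can be sketched in C. The helper names below are hypothetical, and the
sketch assumes a compiler that provides the GCC builtins __builtin_bswap16
and __builtin_bswap32. It checks that a 16-bit rotate by 8 matches a byte
swap for every input, and shows a 32-bit rotate by 16 giving a different
result from a 32-bit byte swap, which is why only the 16-bit element case
is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rotate a 16-bit value by 8 bits, written the same
   way as the loop body in the testcase.  */
static uint16_t
rot16_by_8 (uint16_t x)
{
  return (uint16_t) ((x >> 8) | (x << 8));
}

/* Hypothetical helper: rotate a 32-bit value by 16 bits.  This swaps the
   two 16-bit halves but leaves the byte order inside each half alone.  */
static uint32_t
rot32_by_16 (uint32_t x)
{
  return (x >> 16) | (x << 16);
}

/* Hypothetical helper: return 1 if rot16_by_8 agrees with the GCC builtin
   __builtin_bswap16 on all 65536 possible inputs, 0 otherwise.  */
static int
rot16_matches_bswap16 (void)
{
  for (uint32_t i = 0; i <= 0xffff; i++)
    if (rot16_by_8 ((uint16_t) i) != __builtin_bswap16 ((uint16_t) i))
      return 0;
  return 1;
}
```

For example, rot32_by_16 (0x11223344) yields 0x33441122, whereas
__builtin_bswap32 (0x11223344) yields 0x44332211.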
With this patch the testcase now generates the simplest form:
.L2:
	ldr	q31, [x1, x0]
	rev16	v31.16b, v31.16b
	str	q31, [x0, x2]
	add	x0, x0, 16
	cmp	x0, 2048
	bne	.L2
instead of the previous:
.L2:
	ldr	q31, [x1, x0]
	shl	v30.8h, v31.8h, 8
	usra	v30.8h, v31.8h, 8
	str	q30, [x0, x2]
	add	x0, x0, 16
	cmp	x0, 2048
	bne	.L2
IMO the bswap detection would ideally have been done during vectorisation,
using the expanders for that, but teaching simplify-rtx to do this
transformation is fairly straightforward and, unlike at tree level, we have
the native RTL BSWAP code. This change is not enough to generate the
equivalent sequence in SVE, but that is something that should be tackled
separately.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Simplify (rotate:HI x:HI, 8) -> (bswap:HI x:HI).
gcc/testsuite/
* gcc.target/aarch64/rot_to_bswap.c: New test.
mode, op0, new_amount_rtx);
}
#endif
+ /* ROTATE/ROTATERT:HI (X:HI, 8) is BSWAP:HI (X). Other combinations
+ such as SImode with a count of 16 do not correspond to RTL BSWAP
+ semantics. */
+ tem = unwrap_const_vec_duplicate (trueop1);
+ if (GET_MODE_UNIT_BITSIZE (mode) == (2 * BITS_PER_UNIT)
+ && CONST_INT_P (tem) && INTVAL (tem) == BITS_PER_UNIT)
+ return simplify_gen_unary (BSWAP, mode, op0, mode);
+
/* FALLTHRU */
case ASHIFTRT:
if (trueop1 == CONST0_RTX (mode))
--- /dev/null
+/* { dg-do compile } */
+/* { dg-options "-O2 --param aarch64-autovec-preference=asimd-only" } */
+
+#pragma GCC target "+nosve"
+
+
+#define N 1024
+
+unsigned short in_s[N];
+unsigned short out_s[N];
+
+void
+foo16 (void)
+{
+ for (unsigned i = 0; i < N; i++)
+ {
+ unsigned short x = in_s[i];
+ out_s[i] = (x >> 8) | (x << 8);
+ }
+}
+
+/* { dg-final { scan-assembler {\trev16\tv([123])?[0-9]\.16b, v([123])?[0-9]\.16b} } } */
+