For the test case
#include <arm_sve.h>

int32_t foo (svint32_t x)
{
  svbool_t pg = svpfalse ();
  return svlastb_s32 (pg, x);
}
compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
foo:
	pfalse	p3.b
	lastb	w0, p3, z0.s
	ret
when it could use a Neon lane extract instead:
foo:
	umov	w0, v0.s[3]
	ret
Similar optimizations can be made for VLS with other vector widths.
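As an illustration (a hypothetical example, not taken from the patch
or its testsuite; the function name foo256 is made up): with
-msve-vector-bits=256, a .s vector has 8 lanes, so the same svlastb
call with an all-false predicate amounts to extracting lane 7 and can
likewise be handled by a constant-index extract instead of
pfalse+lastb.

#include <arm_sve.h>

/* Hypothetical analogue of the test case above, assuming
   -O3 -msve-vector-bits=256: svlastb with an all-false predicate
   returns the last element, here lane 7 of the .s vector.  */
int32_t foo256 (svint32_t x)
{
  svbool_t pg = svpfalse ();
  return svlastb_s32 (pg, x);
}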
We implemented this optimization by guarding the emission of
pfalse+lastb in the pattern vec_extract<mode><Vel> with
!val.is_constant (), so that pfalse+lastb is only emitted when the
index of the last element is not a compile-time constant.
Thus, for extraction of the last element with VLS, the patterns
*vec_extract<mode><Vel>_v128, *vec_extract<mode><Vel>_dup, or
*vec_extract<mode><Vel>_ext are used instead.
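For reference, a rough sketch of where the guard sits in the expander
body (simplified and paraphrased, not the verbatim patch; the exact
surrounding code in aarch64-sve.md may differ):

  poly_int64 val;
  if (poly_int_rtx_p (operands[2], &val)
      && known_eq (val, GET_MODE_NUNITS (<MODE>mode) - 1)
      /* New guard: take this path only when the index of the last
         element is not a compile-time constant, i.e. for VLA code.  */
      && !val.is_constant ())
    {
      /* ... emit the PFALSE + LASTB sequence as before ...  */
    }
  /* Otherwise fall through so that the constant-index patterns
     (*vec_extract<mode><Vel>_v128, _dup, _ext) handle the extraction.  */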
We added tests for 128-bit VLS and adjusted the tests for the other vector
widths.
The patch was bootstrapped and tested on aarch64-linux-gnu, with no regressions.
OK for mainline?
Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>
gcc/
	* config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
	Prevent the emission of pfalse+lastb for VLS.