SVE intrinsics: Fold svsra with op1 all zeros to svlsr/svasr.
author     Jennifer Schmitz <jschmitz@nvidia.com>
           Thu, 17 Oct 2024 09:31:47 +0000 (02:31 -0700)
committer  Jennifer Schmitz <jschmitz@nvidia.com>
           Thu, 24 Oct 2024 09:54:27 +0000 (11:54 +0200)
commit     f6fbc0d2422ce9bea6a23226f4a13a76ffd1784b
tree       024aec1f3e54777a5ece46017dcf9f8bd6d2e30c
parent     3e7549ece7c6b90b9e961778361ee2b65bf104a9
SVE intrinsics: Fold svsra with op1 all zeros to svlsr/svasr.

A common idiom in intrinsics code is an unrolled loop of accumulate
intrinsics whose accumulator is initialized to zero before the loop.
Propagating that initial zero accumulator into the first iteration of
the loop and simplifying the first accumulate instruction is a
desirable transformation that we should teach GCC.
Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
producing the lower-latency instructions LSR/ASR instead of USRA/SSRA.
We implemented this optimization in svsra_impl::fold.
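
For illustration, the source-level effect of the fold is roughly the
following (a sketch, not part of the patch; the function names and the
shift amount are made up for this example):

    #include <arm_sve.h>

    /* With the fold, accumulating into an all-zeros op1 degenerates to a
       plain shift: svsra (0, x, 2) == 0 + (x >> 2).  */
    svuint32_t
    shift_only_u32 (svuint32_t x)
    {
      return svsra (svdup_u32 (0), x, 2);   /* expected: LSR, not USRA  */
    }

    svint32_t
    shift_only_s32 (svint32_t x)
    {
      return svsra (svdup_s32 (0), x, 2);   /* expected: ASR, not SSRA  */
    }

Compiled with SVE2 enabled (e.g. -O2 -march=armv8.2-a+sve2), each
function should now emit a single shift instruction.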

Tests were added to check the produced assembly for use of LSR/ASR.
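A representative check, in the style of the existing ACLE asm harness
used in these files (the exact macro usage and register allocation
below are assumptions modeled on the other sra_*.c cases):

    /*
    ** sra_2_u32_zeroop1:
    **	lsr	z0\.s, z1\.s, #2
    **	ret
    */
    TEST_UNIFORM_Z (sra_2_u32_zeroop1, svuint32_t,
		    z0 = svsra_n_u32 (svdup_u32 (0), z1, 2),
		    z0 = svsra (svdup_u32 (0), z1, 2))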

The patch was bootstrapped and regtested on aarch64-linux-gnu with no
regressions.
OK for mainline?

Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>
gcc/
	* config/aarch64/aarch64-sve-builtins-sve2.cc
	(svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.

gcc/testsuite/
	* gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
	* gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
	* gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.