aarch64: Avoid INS-(W|X)ZR instructions when optimising for speed
For inserting zero into a vector lane we usually use an instruction like:
ins v0.h[2], wzr
This, however, has not-so-great performance on some CPUs.
On Grace, for example it has a latency of 5 and throughput 1.
The alternative sequence:
movi v31.8b, #0
ins v0.h[2], v31.h[0]
is prefereble bcause the MOVI-0 is often a zero-latency operation that is
eliminated by the CPU frontend and the lane-to-lane INS has a latency of 2 and
throughput of 4.
We can avoid the merging of the two instructions into the aarch64_simd_vec_set_zero<mode>
by disabling that pattern when optimizing for speed.
Thanks to wider benchmarking from Tamar, it makes sense to make this change for
all tunings, so no RTX costs or tuning flags are introduced to control this
in a more fine-grained manner. They can be easily added in the future if needed
for a particular CPU.
Bootstrapped and tested on aarch64-none-linux-gnu.