256-bit wide SVE vectors allow some simplification of truffle, giving
up to a 40% speedup on Graviton3. On the microbenchmark, throughput
goes from 12500 MB/s to 17000 MB/s.
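A scalar model may help show where the simplification could come from.
This is a sketch under stated assumptions, not the code in this change.
Classic truffle uses two 16-byte tables, one for bytes with the high
bit clear and one for bytes with it set. The low nibble picks an entry
and bits 4..6 of the byte pick a bit within it. The assumption here is
that the wide variant concatenates both tables into a single 32-entry
table indexed by the low nibble plus the high bit. A 256-bit vector
holds all 32 entries, so one lookup could replace the two. All type and
function names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Classic truffle layout: one 16-byte table per half of the byte range. */
typedef struct {
    uint8_t lo[16]; /* bytes 0x00..0x7f */
    uint8_t hi[16]; /* bytes 0x80..0xff */
} truffle_masks;

/* Hypothetical wide layout: both halves in one 32-entry table,
 * small enough to sit in a single 256-bit vector register. */
typedef struct {
    uint8_t tbl[32];
} truffle_wide_masks;

/* Build the two classic tables from a 256-entry membership set. */
static void truffle_build(const bool set[256], truffle_masks *m) {
    memset(m, 0, sizeof *m);
    for (int c = 0; c < 256; c++) {
        if (!set[c])
            continue;
        /* Bits 4..6 of the byte select one of 8 bits in the entry. */
        uint8_t bit = (uint8_t)(1u << ((c >> 4) & 7));
        if (c & 0x80)
            m->hi[c & 0xf] |= bit;
        else
            m->lo[c & 0xf] |= bit;
    }
}

/* Concatenate the two tables: entries 0..15 for high bit clear,
 * entries 16..31 for high bit set. */
static void truffle_widen(const truffle_masks *m, truffle_wide_masks *w) {
    memcpy(w->tbl, m->lo, 16);
    memcpy(w->tbl + 16, m->hi, 16);
}

/* Classic test: choose a table by the high bit, then index it. */
static bool truffle_match(const truffle_masks *m, uint8_t c) {
    uint8_t entry = (c & 0x80) ? m->hi[c & 0xf] : m->lo[c & 0xf];
    return (entry >> ((c >> 4) & 7)) & 1;
}

/* Wide test: a single lookup indexed by the low nibble, with the high
 * bit moved into bit 4 of the index. */
static bool truffle_wide_match(const truffle_wide_masks *w, uint8_t c) {
    uint8_t idx = (uint8_t)((c & 0xf) | ((c >> 3) & 0x10));
    return (w->tbl[idx] >> ((c >> 4) & 7)) & 1;
}
```

In vector form the classic test needs two table lookups and a select,
while the wide form needs one. That is presumably where the win comes
from on 256-bit SVE. With 128-bit vectors, a 32-entry lookup would need
a two-register table lookup instead.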
SVE2 also offers this capability for 128-bit vectors, with a speedup
of around 25% compared to the regular SVE implementation.
Add unit tests and a benchmark for this wide variant.