Generate vmovsh instead of vpblendw for specific vec_merge.
On SPR, vmovsh can be execute on 3 ports, vpblendw can only be
executed on 2 ports.
On znver4, vpblendw can be executed on 4 ports, if vmovsh is similar
as vmovss, then it can also be executed on 4 ports.
So there's no difference for znver? but vmovsh is more optimized on
SPR.
gcc/ChangeLog:
* config/i386/sse.md: (V8BFH_128): Renamed to ..
(VHFBF_128): .. this.
(V16BFH_256): Renamed to ..
(VHFBF_256): .. this.
(avx512f_mov<mode>): Extend to V_128.
(vcvtnee<bf16_ph>2ps_<mode>): Changed to VHFBF_128.
(vcvtneo<bf16_ph>2ps_<mode>): Ditto.
(vcvtnee<bf16_ph>2ps_<mode>): Changed to VHFBF_256.
(vcvtneo<bf16_ph>2ps_<mode>): Ditto.
* config/i386/i386-expand.cc (expand_vec_perm_blend):
Canonicalize vec_merge.