x86: Optimize and shrink st{r|p}{n}{cat|cpy}-evex functions
Optimizations are:
1. Use more overlapping stores to avoid branches.
2. Reduce how unrolled the aligning copies are (this is more of a
code-size save; it's a negative for some sizes in terms of
perf).
3. Improve the loop a bit (similar to what we do in strlen with
2x vpminu + kortest instead of 3x vpminu + kmov + test).
4. For st{r|p}n{cat|cpy} re-order the branches to minimize the
number that are taken.
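
As an illustration of item 1 only, here is a minimal C sketch of the
overlapping-store idea. It is not code from this patch: the helper name
`copy_8_to_16` and the 8..16 byte size class are assumptions chosen for
the example, and the real implementations do this with vector
registers in assembly. Two fixed-width moves, one anchored at the start
and one at the end, cover every length in the class without branching
on the exact size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): copy N bytes from SRC
   to DST, where 8 <= N <= 16, using two 8-byte moves.  The second
   move is anchored at the end of the buffer, so for N < 16 it
   overlaps the first one instead of needing a per-size branch.  */
static void
copy_8_to_16 (char *dst, const char *src, size_t n)
{
  uint64_t head, tail;

  /* Do both loads before either store so that the stores may
     overlap each other freely.  */
  memcpy (&head, src, sizeof head);
  memcpy (&tail, src + n - sizeof tail, sizeof tail);

  memcpy (dst, &head, sizeof head);
  memcpy (dst + n - sizeof tail, &tail, sizeof tail);
}
```

For N == 8 both moves write the same 8 bytes; for N == 16 they do not
overlap at all. Compilers typically lower the fixed-size memcpy calls
to single loads and stores, though that is not guaranteed.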
Performance Changes:
Times are from N = 10 runs of the benchmark suite and are
reported as geometric mean of all ratios of
New Implementation / Old Implementation.
I couldn't find a way to merge them without making the
ifdefs incredibly difficult to follow.
2. All implementations can be made evex512 by including
"x86-evex512-vecs.h" at the top.
3. All implementations have an optional define:
`USE_EVEX_MASKED_STORE`
Setting to one uses evex-masked stores for handling short
strings. This saves code size and branches. It's disabled
for all implementations at the moment as there are some
serious drawbacks to masked stores in certain cases, but
that may be fixed on future architectures.
Full check passes on x86-64 and build succeeds for all ISA levels w/
and w/o multiarch.