Enable x86-64 SHA-512 family optimizations with SHA512 ISA extension
The SHA-256 (SZ=4) and SHA-512 (SZ=8) dispatcher paths have been
separated while keeping the SHA-256 path unmodified.
Due to early constraints in register availability, two 32-bit
`OPENSSL_ia32cap_P` reads have been coalesced into one. As a
consequence, several bit positions used in feature checks have gained a
32 bits offset.
Replaced `test` with `bt` to allow use of 64-bit immediate indices in
CPUID feature checks.
Split the SHA512 BMI2+AVX2+BMI1 dispatcher branch into:
- AVX2+SHA512: high priority, with SHA512 ISA extension
- AVX2+BMI2+BMI1: fallback
The added implementation has its own copy of `K512` without duplicated
elements every 16 bytes. Shuffle indices have been reused from `K512`.
Added binary translators for `vsha512msg1`, `vsha512msg2`,
`vsha512rnds2` for older assemblers.
Reviewed-by: Neil Horman <nhorman@openssl.org> Reviewed-by: Paul Dale <ppzgs1@gmail.com>
(Merged from https://github.com/openssl/openssl/pull/26147)