powerpc64: Add optimized assembly for sha256-compress-n
This patch introduces an optimized powerpc64 assembly implementation of
sha256-compress-n. It takes advantage of the vshasigmaw instruction and
unrolls loops so that independent instructions can execute in parallel.
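For reference, vshasigmaw evaluates one of the four SHA-256 sigma
functions from FIPS 180-4 on every 32-bit lane of a vector register in a
single instruction. A scalar C implementation needs two or three
rotates/shifts and two XORs per word for the same result. The sketch
below is a plain C model of those functions and of the message-schedule
step that uses them. The helper names (rotr32, vshasigmaw_model,
sha256_expand) are hypothetical and not taken from nettle, and the
mapping of the st/sel arguments onto the instruction's ST and SIX fields
is an assumption based on a reading of the Power ISA, covering only the
case where every lane uses the same function (SIX = 0 or SIX = 15).

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits; n must be in 1..31 to avoid
   an undefined 32-bit shift. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
  assert(n > 0 && n < 32);
  return (x >> n) | (x << (32 - n));
}

/* The four SHA-256 sigma functions (FIPS 180-4, section 4.1.2). */
static uint32_t big_sigma0(uint32_t x)
{
  return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

static uint32_t big_sigma1(uint32_t x)
{
  return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

static uint32_t small_sigma0(uint32_t x)
{
  return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static uint32_t small_sigma1(uint32_t x)
{
  return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}

/* Hypothetical scalar model of "vshasigmaw vrt, vra, st, six" with the
   same function selected in all four lanes (assumed mapping):
   st = 0 picks the message-schedule sigma0/sigma1,
   st = 1 picks the round-function Sigma0/Sigma1,
   sel = 0 picks the "0" variant, sel = 1 the "1" variant. */
static void vshasigmaw_model(uint32_t out[4], const uint32_t in[4],
                             int st, int sel)
{
  for (int i = 0; i < 4; i++) {
    if (st == 0)
      out[i] = sel ? small_sigma1(in[i]) : small_sigma0(in[i]);
    else
      out[i] = sel ? big_sigma1(in[i]) : big_sigma0(in[i]);
  }
}

/* Expand the 16 message words in w[0..15] into the full 64-entry
   SHA-256 message schedule. */
static void sha256_expand(uint32_t w[64])
{
  for (int t = 16; t < 64; t++)
    w[t] = small_sigma1(w[t - 2]) + w[t - 7]
         + small_sigma0(w[t - 15]) + w[t - 16];
}
```

Usage: fill w[0..15] with the block's big-endian message words and call
sha256_expand(w). A vector implementation can compute several schedule
words per step with one vshasigmaw per sigma, although W[t] depends on
W[t-2], so at most two new words are independent at any point.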
The following data was captured on a POWER10 LPAR at ~3.896 GHz:
Current C implementation:
  Algorithm     mode        Mbyte/s
  sha256        update       280.97
  hmac-sha256   64 bytes      80.81
  hmac-sha256   256 bytes    170.50
  hmac-sha256   1024 bytes   241.92
  hmac-sha256   4096 bytes   268.54
  hmac-sha256   single msg   276.16