[S390x] Optimize SHA256 and SHA512 compress functions
This patch optimizes the SHA256 and SHA512 compress functions for the s390x architecture using hardware acceleration. The testsuite passes. Benchmark on IBM z15:
| Algorithm | C implementation | Hardware-accelerated |
| ------ | ------ | ------ |
| SHA256 | 242.76 Mbyte/s | 869.00 Mbyte/s |
| SHA512 | 373.18 Mbyte/s | 1555.21 Mbyte/s |
See merge request nettle/nettle!35