This yielded across-the-board performance wins, though the gains are
highly data dependent and hinge on what the copy runs look like. On the
less-than-realistic data in benchmark_zlib_apps, the decode test saw
some of the larger gains, ranging from 6 to 11% when compiled for AVX2
on a Cascade Lake CPU (with only AVX2 enabled). Decode on realistic
imagery saw smaller gains, between 2 and 4%.
Interestingly, there was one outlier on encode, at compression level 5.
The best theory is that the copy runs at that particular level were such
that glibc's ERMS-aware memmove implementation managed to marginally
outpace the copy-during-checksum, thanks to the rep movsb sequence and
clever microcoding on Intel's part. It's hard to say for sure, but the
most striking difference between the two perf profiles was more time
spent in memmove (which is expected, since that path calls memcpy
instead of copying the bytes during the checksum).
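For context, the trade-off can be sketched in scalar C (hypothetical helper names; the real paths are SIMD on one side and glibc's memmove on the other): the fused path folds each byte into the checksum as it is copied, while the split path moves the bytes first and checksums in a second pass.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u /* largest prime < 2^16 */

/* Fused path (scalar sketch of copy-during-checksum): each byte is
 * written to dst and folded into the running Adler-32 in one pass. */
static uint32_t adler32_copy(uint32_t adler, uint8_t *dst,
                             const uint8_t *src, size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
        s1 = (s1 + src[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}

/* Split path: the bytes are moved separately (by memmove/memcpy, where
 * an ERMS rep movsb can be very fast) and the checksum is a second pass. */
static uint32_t adler32_plain(uint32_t adler, const uint8_t *buf,
                              size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

Both produce identical checksums; the level-5 outlier suggests that for some copy-run distributions the split path's memmove wins back more than the extra pass over the data costs.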
There's a distinct possibility that the AVX2 checksums could be
marginally improved by one level of unrolling (as is done in the SSE3
implementation). The AVX512 implementations certainly benefit from
this, but that optimization doesn't belong in this series of commits.
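To illustrate what "one level of unrolling" means here in scalar terms (the actual change would be to the AVX2 intrinsics, processing two vectors per loop iteration; function names below are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u /* largest prime < 2^16 */

/* Baseline: one byte per iteration, modulo every step. */
static uint32_t adler32_scalar(uint32_t adler, const uint8_t *buf,
                               size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}

/* One level of unrolling: two bytes per iteration with the modulo
 * deferred to the end of the pair. The intermediate sums cannot
 * overflow 32 bits for just two bytes, so the result is unchanged.
 * The SIMD versions do the analogous thing with whole vectors. */
static uint32_t adler32_unrolled(uint32_t adler, const uint8_t *buf,
                                 size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = adler >> 16;
    size_t i = 0;
    for (; i + 2 <= len; i += 2) {
        s1 += buf[i];     s2 += s1;
        s1 += buf[i + 1]; s2 += s1;
        s1 %= ADLER_MOD;  s2 %= ADLER_MOD;
    }
    for (; i < len; i++) { /* tail byte, if any */
        s1 = (s1 + buf[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}
```

The win comes from exposing more independent work per iteration to the out-of-order core, at the cost of a larger loop body; whether it pays off for AVX2 on this data is exactly the open question above.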