From: Adam Stylinski
Date: Fri, 21 Nov 2025 15:02:14 +0000 (-0500)
Subject: Conditionally shortcut via the chorba polynomial based on compile flags
X-Git-Tag: 2.3.1~1
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=fe179585e7c25234c2c224116ccfed8b0a78dbd9;p=thirdparty%2Fzlib-ng.git

Conditionally shortcut via the chorba polynomial based on compile flags

As it turns out, the copying CRC32 variant _is_ slower when compiled
with generic flags. The reason for this is mainly extra stack spills
and the lack of operations we can overlap with the moves. However,
when compiling for an architecture with more registers, such as
AVX-512, we no longer have to eat all of these costly stack spills and
we can overlap the moves with a 3-operand XOR. Conditionally guarding
this means that if a Linux distribution wants to compile with
-march=x86-64-v4, it gets all of the upsides.

Notably, this code is not used at all if the CPU supports 512-bit-wide
CLMUL, so this helps a somewhat narrow range of targets (mostly the
earlier AVX-512 implementations, pre-Ice Lake).

We must also guard on AVX512VL: specifying just AVX512F makes GCC
generate vpternlog instructions at 512-bit width only, so a bunch of
packing and unpacking between 512-bit and 256-bit registers has to
occur, absolutely killing runtime. Only AVX512VL provides a
128-bit-wide vpternlog.
---

diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 803a8774a..0a22a4abe 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -111,6 +111,7 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
      * the stream at the following offsets: 6, 9, 10, 16, 20, 22,
      * 24, 25, 27, 28, 30, 31, 32 - this is detailed in the paper
      * as "generator_64_bits_unrolled_8" */
+#if !defined(COPY) || defined(__AVX512VL__)
     while (len >= 512 + 64 + 16*8) {
         __m128i chorba8 = _mm_load_si128((__m128i *)src);
         __m128i chorba7 = _mm_load_si128((__m128i *)src + 1);
@@ -322,6 +323,7 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
         len -= 512;
         src += 512;
     }
+#endif
     while (len >= 64) {
         len -= 64;