Dougall Johnson [Mon, 8 Dec 2025 04:11:52 +0000 (20:11 -0800)]
Reorder code struct fields for better access patterns
Place bits field before op field in code struct to optimize memory
access. The bits field is accessed first in the hot path, so placing
it at offset 0 may improve code generation on some architectures.
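
A minimal sketch of the reordered layout, assuming the three-field table entry used by zlib-style inflate code (the type name `code_sketch` and the field widths are assumptions for illustration, not copied from the tree):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout after the reorder; field names follow the commit text. */
typedef struct {
    uint8_t  bits;  /* read first in the hot path, now at offset 0 */
    uint8_t  op;    /* operation / extra-bits indicator */
    uint16_t val;   /* symbol value or table offset */
} code_sketch;

/* Compile-time checks that the layout is what the sketch claims. */
_Static_assert(offsetof(code_sketch, bits) == 0, "bits must be first");
_Static_assert(offsetof(code_sketch, op) == 1, "op follows bits");
_Static_assert(sizeof(code_sketch) == 4, "entry stays 4 bytes");
```

The entry size is unchanged, so only the position of the two byte-wide fields differs.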
[configure] Fix detecting -fno-lto support
* Previously, -fno-lto was assumed to be supported on non-gcc-compatible or unsupported compilers.
Support for it was never tested in those cases. Set the default to not supported.
Inline all uses of quick_insert_string*/quick_insert_value*.
Inline all uses of update_hash*.
Inline insert_string into deflate_quick, deflate_fast and deflate_medium.
Remove insert_string from deflate_state
Use local function pointer for insert_string.
Fix level check to actually check level and not `s->max_chain_length <= 1024`.
There are no folding techniques in adler32 implementations. It is simply hashing while copying.
- Rename adler32_fold_copy to adler32_copy.
- Remove unnecessary adler32_fold.c file.
- Reorder adler32_copy functions last in source file for consistency.
- Rename adler32_rvv_impl to adler32_copy_impl for consistency.
- Replace dst != NULL with 1 in adler32_copy_neon to remove branching.
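
As an illustration of "hashing while copying", here is a scalar sketch of an Adler-32 update fused with a copy. The function name `adler32_copy_sketch` and its signature are assumptions for this example; the real implementations are vectorized and defer the modulo.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime below 2^16 */

/* Scalar sketch: update an Adler-32 value while copying src to dst. */
static uint32_t adler32_copy_sketch(uint32_t adler, uint8_t *dst,
                                    const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffffu;          /* low half: running byte sum */
    uint32_t b = (adler >> 16) & 0xffffu;  /* high half: sum of the sums */
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];                   /* the copy */
        a = (a + src[i]) % ADLER_BASE;     /* the hash */
        b = (b + a) % ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Calling it with an initial value of 1 and the nine bytes of "Wikipedia" yields 0x11E60398 and leaves a copy of the input in `dst`.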
Adam Stylinski [Fri, 21 Nov 2025 15:02:14 +0000 (10:02 -0500)]
Conditionally shortcut via the chorba polynomial based on compile flags
As it turns out, the copying CRC32 variant _is_ slower when compiled
with generic flags. The reason for this is mainly extra stack spills and
the lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as avx512, we no
longer have to eat all these costly stack spills and we can overlap with
a 3 operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86_64-v4 they get all the
upsides to this.
This code notably is not actually used if you happen to have something
that supports 512 bit wide clmul, so this does help a somewhat narrow
range of targets (most of the earlier avx512 implementations pre ice
lake).
We also must guard with AVX512VL, as just specifying AVX512F makes GCC
generate vpternlogic instructions of 512 bit widths only, so a bunch of
packing and unpacking of 512 bit to 256 bit registers and vice versa has
to occur, absolutely killing runtime. It's only AVX512VL where there's a
128 bit wide vpternlogic.
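
A sketch of what such a compile-time guard could look like. `__AVX512F__` and `__AVX512VL__` are the compiler's predefined feature macros; the macro `USE_CHORBA_COPY_SHORTCUT` and the helper function are hypothetical names for this example only.

```c
/* Hypothetical macro name; the condition mirrors the guard described above. */
#if defined(__AVX512F__) && defined(__AVX512VL__)
#  define USE_CHORBA_COPY_SHORTCUT 1
#else
#  define USE_CHORBA_COPY_SHORTCUT 0
#endif

/* Reports which path this translation unit was compiled for. */
static int chorba_copy_shortcut_enabled(void) {
    return USE_CHORBA_COPY_SHORTCUT;
}
```

Because the decision is made by the preprocessor, a generic build pays nothing for the shortcut it does not use.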
Adam Stylinski [Fri, 21 Nov 2025 14:45:48 +0000 (09:45 -0500)]
Use aligned loads in the chorba portions of the clmul crc routines
Since we go to the trouble of doing aligned loads, we may as well let the
compiler know the alignment is guaranteed. We can't guarantee an aligned
store but at least with an aligned load the compiler can elide a load
with a subsequent xor multiplication when not copying.
Mika Lindqvist [Mon, 17 Nov 2025 17:15:03 +0000 (19:15 +0200)]
Fix build using configure
* "\i" is not a valid escape code in BSD sed
* Some x86 shared sources were missing -fPIC due to using wrong variable in build rule
Brad Smith [Mon, 17 Nov 2025 05:50:47 +0000 (00:50 -0500)]
configure: Determine system architecture properly on *BSD systems
uname -m on a BSD system will provide the architecture port, e.g.
arm64, macppc, octeon, instead of the machine architecture, e.g.
aarch64, powerpc, mips64. uname -p will provide the machine
architecture. NetBSD uses x86_64, OpenBSD uses amd64, FreeBSD
is a mix between uname -p and the compiler output.
Mika Lindqvist [Mon, 17 Nov 2025 10:28:21 +0000 (12:28 +0200)]
[CI] Downgrade "Windows GCC Native Instructions (AVX)" workflow
* Windows Server 2025 runner has a broken GCC, so use the Windows Server 2022 runner instead until the fix is propagated to all runners
Use CTest to simplify testing options
Add CMake variable TEST_STOCK_ZLIB to disable some tests if attempting
to run our testsuite on stock zlib.
PR depends on CMP0077, introduced by CMake 3.13.
Upped minimum compatible CMake version to 3.13, the same version we have
actually been telling people was the minimum for years on the wiki.
Upped upper compatible CMake version to 3.31, my current version.
- Unify crc32_chorba, chorba_sse2 and chorba_sse41 dispatch functions.
- Fixed alignment diff calculation in crc32_chorba.
- Fixed length check to happen early, avoiding extra branches for too short lengths,
this also allows removing one function call to crc32_braid_internal to handle those.
Gbench shows ~0.15-0.25ns saved per call for lengths shorter than CHORBA_SMALL_THRESHOLD.
- Avoid calculating aligned len if buffer is already aligned
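
The early length check and the alignment-difference calculation can be sketched as follows. This is an illustration only: the 16-byte alignment target, the threshold value, and both function names are assumptions, and the real value of CHORBA_SMALL_THRESHOLD is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_SKETCH 16u            /* assumed alignment target */
#define SMALL_THRESHOLD_SKETCH 72u  /* placeholder, not the real CHORBA_SMALL_THRESHOLD */

/* Bytes to skip so that ptr reaches the next ALIGN_SKETCH boundary; 0 if already aligned. */
static size_t align_diff_sketch(const void *ptr) {
    uintptr_t misalign = (uintptr_t)ptr & (ALIGN_SKETCH - 1);
    return misalign ? (size_t)(ALIGN_SKETCH - misalign) : 0;
}

/* Returns 0 when the short-input path should be taken, 1 for the aligned bulk path.
 * On the bulk path, *head is the unaligned prefix length and *bulk the remainder. */
static int plan_crc_sketch(const uint8_t *buf, size_t len, size_t *head, size_t *bulk) {
    *head = 0;
    *bulk = 0;
    if (len < SMALL_THRESHOLD_SKETCH)
        return 0;                   /* decided before any alignment math */
    size_t diff = align_diff_sketch(buf);
    if (diff != 0)
        *head = diff;               /* only adjust when the buffer is unaligned */
    *bulk = len - *head;
    return 1;
}
```

Checking the length first means short inputs never reach the alignment arithmetic at all.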
Reorganize Chorba activation.
Now WITHOUT_CHORBA will only disable the crc32_chorba C fallback.
SSE2, SSE41 and pclmul variants will still be able to use their Chorba-algorithm based code,
but their fallback to the generic crc32_chorba C code in SSE2 and SSE41 will be disabled,
reducing their performance on really big input buffers (not used during deflate/inflate,
only when calling crc32 directly).
Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable
routing to select between crc32_braid and crc32_chorba.
Disable sse2 and sse4.1 variants of Chorba-crc32 on MSVC older than 2022 due to a code
generation bug in 2019 causing segfaults.
Compile either crc32_chorba_small_nondestructive or crc32_chorba_small_nondestructive_32bit,
not both. Don't compile crc32_chorba_32768_nondestructive on 32bit arch.
Icenowy Zheng [Tue, 11 Nov 2025 14:47:55 +0000 (22:47 +0800)]
riscv: features: test HWCAP regardless of kernel versions
The HWCAP facility has been available since day 1 of Linux RISC-V support
(dating back to 4.15); only the V bit definition was added in 6.5 (because
proper vector support was added in that version too).
There should be no need to test the kernel version number before accessing
hwcap; the V bit will simply never be present on kernels older than 6.5
(except dirty patched downstream ones).
For Xtheadvector systems that bogusly announce V bit in HWCAP, the
assembly code should be able to factor them out. This is tested on
a Sophgo SG2042 machine with 6.1 kernel.
Update README.md, add a lot of missing info, and reorder some of it.
Add missing parameter to configure help text.
Update descriptions and reorganize some options in CMake
Improve resilience of the functable initialization; during functable init,
make sure none of the function pointers are nullpointers.
Up until now, zlib-ng and the application would have segfaulted either at the start
of processing, or at some point later depending on when a nullpointer call would happen
in the processing. In any case most likely after accepting data from the application.
Now, the deflateinit/inflateinit functions will error with Z_VERSION_ERROR, and
gzopen will return Z_STREAM_ERROR before actually processing any data.
Direct calls to functions like adler32 or crc32 will however print an error message
and call abort(), as these functions have no actual way of reporting errors.
Note: This should never happen with default builds of zlib-ng, only if it is run on
a cpu that is missing both the matching optimized and the generic fallback functions.
This can currently only happen if zlib-ng is compiled using custom cflags or by
editing the code.
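
A minimal sketch of such an init-time check, assuming a table with only two entries. The struct, the stub, and the function names are hypothetical; the error constant is defined locally with the same value as zlib's Z_VERSION_ERROR.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_VERSION_ERROR (-6)  /* same value as zlib's Z_VERSION_ERROR */

typedef uint32_t (*checksum_fn)(uint32_t, const uint8_t *, size_t);

/* Hypothetical two-entry stand-in for the real function table. */
struct functable_sketch {
    checksum_fn adler32;
    checksum_fn crc32;
};

/* Stand-in for a generic fallback; it only exists so the table can be filled. */
static uint32_t generic_stub(uint32_t value, const uint8_t *buf, size_t len) {
    (void)buf;
    (void)len;
    return value;
}

/* Returns 0 if every pointer is set, otherwise an error before any data is processed. */
static int init_check_sketch(const struct functable_sketch *ft) {
    if (ft->adler32 == NULL || ft->crc32 == NULL)
        return SKETCH_VERSION_ERROR;
    return 0;
}
```

An init function can propagate this result to the caller, whereas a direct checksum call has no error channel and can only abort.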
Remove force-sse2 config option from x86 builds.
Due to major refactoring done long ago, this option no longer avoids a branch
in a hot path, it currently only removes a single if check during init.
Split out gz_read_init() from gzlook(), and rename gz_init() to gz_write_init().
This makes gzread.c more like gzwrite.c, and fits in with the new code in gzlib.c.
Reorganize initialization and use a single malloc call for both the
in and out buffers in gzopen/gzread/gzwrite.
Also start aligning the allocation to 64 bytes (on a cacheline border).
Adam Stylinski [Sat, 16 Aug 2025 20:04:30 +0000 (16:04 -0400)]
Unroll some of the adler checksum for avx2
Similar to what's done for vmx, avx512, and sse4, let's unroll some
of this checksum since it's a commutative checksum. We take advantage
of ILP and do more intermediate sums before rolling them back together
for the finalization of the checksum.
Adam Stylinski [Sat, 16 Aug 2025 15:35:33 +0000 (11:35 -0400)]
Check the proper bit for BMI2
We were actually checking for BMI1 support here. This is unlikely to have
caused any issues because to date there have not been any x86 CPUs with
AVX2 support but no BMI2 support.
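
For reference, the feature bits involved live in EBX of CPUID leaf 7, subleaf 0: BMI1 is bit 3, AVX2 is bit 5, and BMI2 is bit 8. A sketch of the corrected test, with hypothetical macro and function names:

```c
#include <stdint.h>

/* CPUID leaf 7, subleaf 0, EBX feature bits. */
#define CPUID7_EBX_BMI1 (1u << 3)
#define CPUID7_EBX_AVX2 (1u << 5)
#define CPUID7_EBX_BMI2 (1u << 8)

/* ebx7 is the EBX value returned by CPUID leaf 7; obtaining it is left out here. */
static int has_bmi2_sketch(uint32_t ebx7) {
    return (ebx7 & CPUID7_EBX_BMI2) != 0;  /* bit 8, not the BMI1 bit 3 */
}
```

A value with only the BMI1 bit set is reported as lacking BMI2, which is the case the old check got wrong.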
On RHEL9 the GCC is new enough to support AVX512-VNNI, but its assembler
(binutils) is not and errors with
```
Error: unsupported instruction vpdpbusd
```
This was already addressed earlier in
https://github.com/zlib-ng/zlib-ng/pull/1562 to some extent, except that
a check for `_mm256_dpbusd_epi32` was not added, which is what the
assembler errors over.
Remove usage of aligned alloc implementations and instead use malloc
and handle alignment internally. We already always have to do those
checks because we have to support external alloc implementations.
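
One common way to handle alignment on top of plain malloc is to over-allocate and stash the original pointer just below the aligned address. The sketch below shows that general technique under assumed names and a 64-byte alignment; it is not taken from the zlib-ng sources.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALLOC_ALIGN_SKETCH 64u  /* assumed alignment (one cache line) */

/* Allocate size bytes aligned to ALLOC_ALIGN_SKETCH using only malloc. */
static void *alloc_aligned_sketch(size_t size) {
    size_t extra = ALLOC_ALIGN_SKETCH + sizeof(void *);
    if (size > SIZE_MAX - extra)
        return NULL;                       /* request would overflow */
    void *raw = malloc(size + extra);
    if (raw == NULL)
        return NULL;
    /* Leave room for the saved pointer, then round up to the boundary. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + (ALLOC_ALIGN_SKETCH - 1)) &
                        ~(uintptr_t)(ALLOC_ALIGN_SKETCH - 1);
    ((void **)aligned)[-1] = raw;          /* remember what malloc returned */
    return (void *)aligned;
}

/* Free a pointer obtained from alloc_aligned_sketch. */
static void free_aligned_sketch(void *ptr) {
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

The same wrapper works with any external allocator that only promises malloc-like alignment, which is why the check has to exist regardless.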