Adam Stylinski [Fri, 12 Dec 2025 21:23:27 +0000 (16:23 -0500)]
Force purely aligned loads in inflate_table code length counting
At the expense of some extra stack space and eating about 4 more cache
lines, let's make these loads purely aligned. On potato CPUs such as the
Core 2, unaligned loads in a loop are not ideal. Additionally, some SBC
based ARM chips (usually the LITTLE cores in big.LITTLE variants) suffer a
penalty for unaligned loads. This also paves the way for a trivial
AltiVec implementation, for which unaligned loads don't exist and need
to be synthesized with permutation vectors.
Fix initial crc value loading in crc32_(v)pclmulqdq
In the main function, alignment diff processing was getting in the way of XORing
the initial CRC, because it does not guarantee that at least 16 bytes have been
loaded.
In fold_16, src data was modified by the initial CRC XOR before being stored to dst.
Adam Stylinski [Tue, 23 Dec 2025 23:58:10 +0000 (18:58 -0500)]
Small optimization in 256 bit wide chunkset
It turns out Intel only parses the bottom 4 bits of the shuffle vector.
This makes it already a sufficient permutation vector and saves us a
small bit of latency.
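
As an illustration of the point above, here is a scalar model of one 128-bit lane of a PSHUFB-style byte shuffle. It is a sketch, not the chunkset code itself; the name shuffle_lane_model is hypothetical. It shows why an index vector with values above 15 still selects the same bytes: only bits 3:0 are used for indexing, while bit 7 zeroes the output byte.

```c
#include <stdint.h>

/* Scalar model of one 128-bit lane of a byte shuffle (PSHUFB-style).
 * Hypothetical helper for illustration only.
 * Bits 3:0 of each index byte select the source byte within the lane,
 * bit 7 forces the output byte to zero, bits 6:4 are ignored. */
static void shuffle_lane_model(const uint8_t src[16], const uint8_t idx[16],
                               uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        if (idx[i] & 0x80)
            out[i] = 0;                   /* high bit set: zero the byte */
        else
            out[i] = src[idx[i] & 0x0F];  /* only the low 4 bits index */
    }
}
```

With this model, an index of 0x13 picks the same source byte as an index of 0x03, so an index vector that counts past 15 needs no extra masking before use.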
Improve cmake/detect-arch.cmake to also provide bitness.
Rewrite checks in CMakeLists.txt and cmake/detect-intrinsics.cmake
to utilize the new variables.
- Add local window pointer to:
deflate_quick, deflate_fast, deflate_medium and fill_window.
- Add local strm pointer in fill_window.
- Fix missed change to use local lookahead variable in match_tpl
Deflate_state changes:
- Reduce opt_len/static_len sizes.
- Move matches/insert closer to their related variables.
These now fill an 8-byte hole in the struct on 64-bit platforms.
- Exclude compressed_len and bits_sent if ZLIB_DEBUG is
not enabled. Also move them to the end.
- Remove x86 MSVC-specific padding
- Minor inlining changes in trees_emit.h:
- Inline the small bi_windup function
- Don't attempt inlining for the big zng_emit_dist
- Don't check for too long match in deflate_quick, it cannot happen.
- Move GOTO_NEXT_CHAIN macro outside of LONGEST_MATCH function to
improve readability.
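
The Deflate_state item above about filling a hole in the struct can be pictured with a small, generic example. The two structs below are hypothetical and unrelated to the real deflate_state fields; they only show how moving small members next to each other removes padding on typical 64-bit ABIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: on typical LP64 ABIs, 4 bytes of padding follow `a`
 * (to align the pointer) and 4 more follow `b` (to round up the size). */
struct with_hole {
    uint32_t a;
    void    *p;
    uint32_t b;
};

/* Same members, reordered so the two 32-bit fields share one 8-byte slot. */
struct hole_filled {
    void    *p;
    uint32_t a;
    uint32_t b;
};
```

On a typical 64-bit target this shrinks the struct from 24 to 16 bytes; the exact numbers depend on the ABI.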
Dougall Johnson [Mon, 8 Dec 2025 04:11:52 +0000 (20:11 -0800)]
Reorder code struct fields for better access patterns
Place bits field before op field in code struct to optimize memory
access. The bits field is accessed first in the hot path, so placing
it at offset 0 may improve code generation on some architectures.
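
A minimal sketch of the reordered layout, assuming 8-bit bits/op fields and a 16-bit value field. The type name code_sketch and the helper are hypothetical; the real struct may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inflate table entry. Field names follow
 * the commit message; the widths are an assumption. */
typedef struct {
    uint8_t  bits;  /* bits in this part of the code; read first */
    uint8_t  op;    /* operation, extra bits, or table link */
    uint16_t val;   /* literal, base value, or table offset */
} code_sketch;

/* With bits at offset 0, this load needs no displacement from the
 * entry's base address. */
static unsigned entry_bits(const code_sketch *e) {
    return e->bits;
}
```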
[configure] Fix detecting -fno-lto support
* Previously, -fno-lto was assumed to be supported on non-GCC-compatible or unsupported compilers.
Support for it was never tested in those cases. Set the default to not supported.
Inline all uses of quick_insert_string*/quick_insert_value*.
Inline all uses of update_hash*.
Inline insert_string into deflate_quick, deflate_fast and deflate_medium.
Remove insert_string from deflate_state
Use local function pointer for insert_string.
Fix level check to actually check level and not `s->max_chain_length <= 1024`.
There are no folding techniques in adler32 implementations. It is simply hashing while copying.
- Rename adler32_fold_copy to adler32_copy.
- Remove unnecessary adler32_fold.c file.
- Reorder adler32_copy functions last in source file for consistency.
- Rename adler32_rvv_impl to adler32_copy_impl for consistency.
- Replace dst != NULL with 1 in adler32_copy_neon to remove branching.
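
To make "hashing while copying" concrete, here is a plain scalar sketch. It is not one of the optimized implementations; the name adler32_copy_sketch and the argument order are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE_SKETCH 65521u  /* largest prime below 65536 */

/* Hypothetical scalar reference: update an Adler-32 checksum over src
 * while copying the same bytes to dst in a single pass. */
static uint32_t adler32_copy_sketch(uint32_t adler, uint8_t *dst,
                                    const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffff;          /* running byte sum */
    uint32_t b = (adler >> 16) & 0xffff;  /* running sum of sums */
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        dst[i] = c;                       /* copy */
        a = (a + c) % ADLER_BASE_SKETCH;  /* hash */
        b = (b + a) % ADLER_BASE_SKETCH;
    }
    return (b << 16) | a;
}
```

Starting from the usual initial value of 1, the nine bytes of "Wikipedia" give 0x11E60398 and dst ends up as an exact copy of src.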
Adam Stylinski [Fri, 21 Nov 2025 15:02:14 +0000 (10:02 -0500)]
Conditionally shortcut via the chorba polynomial based on compile flags
As it turns out, the copying CRC32 variant _is_ slower when compiled
with generic flags. The reason for this is mainly extra stack spills and
the lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as avx512, we no
longer have to eat all these costly stack spills and we can overlap with
a 3 operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86-64-v4 they get all the
upsides to this.
This code notably is not actually used if you happen to have something
that supports 512 bit wide clmul, so this does help a somewhat narrow
range of targets (most of the earlier avx512 implementations pre Ice
Lake).
We also must guard with AVX512VL, as just specifying AVX512F makes GCC
generate vpternlog instructions of 512 bit widths only, so a bunch of
packing and unpacking of 512 bit to 256 bit registers and vice versa has
to occur, absolutely killing runtime. It's only with AVX512VL that there's a
128 bit wide vpternlog.
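
A hedged sketch of what such a compile-time guard can look like. The function name xor3_128 is hypothetical and this is not the zlib-ng source; it only shows a 128-bit three-operand XOR being selected when both AVX512F and AVX512VL are enabled, with a portable fallback otherwise.

```c
#include <stdint.h>

#if defined(__AVX512F__) && defined(__AVX512VL__)
#include <immintrin.h>
/* 128-bit three-way XOR in one instruction.
 * imm8 0x96 is the truth table of a ^ b ^ c. */
static void xor3_128(uint32_t out[4], const uint32_t a[4],
                     const uint32_t b[4], const uint32_t c[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_loadu_si128((const __m128i *)c);
    _mm_storeu_si128((__m128i *)out,
                     _mm_ternarylogic_epi32(va, vb, vc, 0x96));
}
#else
/* Portable fallback used when the build flags do not enable AVX512VL. */
static void xor3_128(uint32_t out[4], const uint32_t a[4],
                     const uint32_t b[4], const uint32_t c[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = a[i] ^ b[i] ^ c[i];
}
#endif
```

Both branches compute the same result, so the guard changes which instructions are generated, not the behavior.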
Adam Stylinski [Fri, 21 Nov 2025 14:45:48 +0000 (09:45 -0500)]
Use aligned loads in the chorba portions of the clmul crc routines
We go through the trouble of doing aligned loads, so we may as well let the
compiler know the alignment is guaranteed. We can't guarantee an aligned
store but at least with an aligned load the compiler can elide a load
with a subsequent xor multiplication when not copying.
Mika Lindqvist [Mon, 17 Nov 2025 17:15:03 +0000 (19:15 +0200)]
Fix build using configure
* "\i" is not valid escape code in BSD sed
* Some x86 shared sources were missing -fPIC due to using wrong variable in build rule
Brad Smith [Mon, 17 Nov 2025 05:50:47 +0000 (00:50 -0500)]
configure: Determine system architecture properly on *BSD systems
uname -m on a BSD system will provide the architecture port, e.g.
arm64, macppc, octeon, instead of the machine architecture, e.g.
aarch64, powerpc, mips64. uname -p will provide the machine
architecture. NetBSD uses x86_64, OpenBSD uses amd64, FreeBSD
is a mix between uname -p and the compiler output.
Mika Lindqvist [Mon, 17 Nov 2025 10:28:21 +0000 (12:28 +0200)]
[CI] Downgrade "Windows GCC Native Instructions (AVX)" workflow
* The Windows Server 2025 runner has a broken GCC, so use the Windows Server 2022 runner instead until the fix is propagated to all runners
Use CTest to simplify testing options
Add CMake variable TEST_STOCK_ZLIB to disable some tests if attempting
to run our testsuite on stock zlib.
PR depends on CMP0077, introduced by CMake 3.13.
Upped minimum compatible CMake version to 3.13, same as we have
actually been telling people was the minimum for years on the wiki.
Upped upper compatible CMake version to 3.31, my current version.
- Unify crc32_chorba, chorba_sse2 and chorba_sse41 dispatch functions.
- Fixed alignment diff calculation in crc32_chorba.
- Fixed length check to happen early, avoiding extra branches for too short lengths.
This also allows removing one function call to crc32_braid_internal to handle those.
Gbench shows ~0.15-0.25ns saved per call for lengths shorter than CHORBA_SMALL_THRESHOLD.
- Avoid calculating aligned len if buffer is already aligned
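
The ordering described in the items above can be sketched as follows. All names here (align_diff_sketch, choose_path_sketch, the enum and the threshold parameter) are hypothetical; the real dispatch code and the value of CHORBA_SMALL_THRESHOLD are not reproduced.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to advance from buf to reach the next align-byte boundary.
 * align must be a power of two. Returns 0 if buf is already aligned. */
static size_t align_diff_sketch(const void *buf, size_t align) {
    return (size_t)(((uintptr_t)0 - (uintptr_t)buf) & (align - 1));
}

typedef enum {
    PATH_SHORT,             /* too short: go straight to the small routine */
    PATH_ALIGNED,           /* already aligned: no head processing needed */
    PATH_HEAD_THEN_ALIGNED  /* process unaligned head, then aligned body */
} path_sketch;

/* The length check comes first, so short inputs never pay for the
 * alignment math; the aligned length is only needed on the last path. */
static path_sketch choose_path_sketch(const void *buf, size_t len,
                                      size_t small_threshold) {
    if (len < small_threshold)
        return PATH_SHORT;
    if (align_diff_sketch(buf, 16) == 0)
        return PATH_ALIGNED;
    return PATH_HEAD_THEN_ALIGNED;
}
```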
Reorganize Chorba activation.
Now WITHOUT_CHORBA will only disable the crc32_chorba C fallback.
SSE2, SSE41 and pclmul variants will still be able to use their Chorba-algorithm based code,
but their fallback to the generic crc32_chorba C code in SSE2 and SSE41 will be disabled,
reducing their performance on really big input buffers (not used during deflate/inflate,
only when calling crc32 directly).
Remove the crc32_c function (and its file crc32_c.c), instead use the normal functable
routing to select between crc32_braid and crc32_chorba.
Disable sse2 and sse4.1 variants of Chorba-crc32 on MSVC older than 2022 due to a code
generation bug in 2019 causing segfaults.
Compile either crc32_chorba_small_nondestructive or crc32_chorba_small_nondestructive_32bit,
not both. Don't compile crc32_chorba_32768_nondestructive on 32bit arch.
Icenowy Zheng [Tue, 11 Nov 2025 14:47:55 +0000 (22:47 +0800)]
riscv: features: test HWCAP regardless of kernel versions
The HWCAP facility has been there since day 1 of Linux RISC-V support (dating back to
4.15); only the V bit definition was added in 6.5 (because proper vector
support was added in that version too).
There should be no need to test the kernel version number before accessing
hwcap; the V bit will simply never be present on kernels older than 6.5
(except dirty patched downstream ones).
For Xtheadvector systems that bogusly announce V bit in HWCAP, the
assembly code should be able to factor them out. This is tested on
a Sophgo SG2042 machine with 6.1 kernel.
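
A minimal sketch of the check described above, with no kernel version test. The helper names are hypothetical. It relies on the Linux RISC-V convention that the HWCAP word carries one bit per single-letter ISA extension, at position letter - 'A'.

```c
#include <stdint.h>

/* One bit per single-letter extension: bit n corresponds to 'A' + n. */
#define HWCAP_ISA_BIT_SKETCH(letter) (1UL << ((letter) - 'A'))

/* Pure helper: does this HWCAP word announce the V extension? */
static int hwcap_has_vector(unsigned long hwcap) {
    return (hwcap & HWCAP_ISA_BIT_SKETCH('V')) != 0;
}

#if defined(__linux__) && defined(__riscv)
#include <sys/auxv.h>
/* Query the running kernel directly. On kernels older than 6.5 the
 * V bit is simply never set, so no version check is required. */
static int cpu_has_vector_sketch(void) {
    return hwcap_has_vector(getauxval(AT_HWCAP));
}
#endif
```

Filtering out Xtheadvector systems that announce the V bit is a separate step and is not shown here.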
Update README.md, add a lot of missing info, and reorder some of it.
Add missing parameter to configure help text.
Update descriptions and reorganize some options in CMake
Improve resilience of the functable initialization; during functable init,
make sure none of the function pointers are null pointers.
Up until now, zlib-ng and the application would have segfaulted either at the start
of processing, or at some point later depending on when a null pointer call would happen
in the processing. In any case, most likely after accepting data from the application.
Now, the deflateInit/inflateInit functions will error with Z_VERSION_ERROR, and
gzopen will return Z_STREAM_ERROR before actually processing any data.
Direct calls to functions like adler32 or crc32 will however print an error message
and call abort(), as these functions have no actual way of reporting errors.
Note: This should never happen with default builds of zlib-ng, only if it is run on
a CPU that is missing both the matching optimized and the generic fallback functions.
This can currently only happen if zlib-ng is compiled using custom cflags or by
editing the code.
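
The idea of validating the table once at init can be sketched like this. The struct, its two members and the return convention are hypothetical simplifications; the real functable has more entries and reports the failure through the init functions as described above.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn_sketch)(uint32_t, const uint8_t *, size_t);

/* Hypothetical, cut-down function table. */
struct functable_sketch {
    checksum_fn_sketch adler32;
    checksum_fn_sketch crc32;
};

/* Placeholder implementation so a slot can be populated in examples. */
static uint32_t dummy_checksum_sketch(uint32_t v, const uint8_t *buf,
                                      size_t len) {
    (void)buf;
    (void)len;
    return v;
}

/* Returns 0 when every slot is populated, -1 if any pointer is NULL.
 * Checking once at init turns a later crash into an early error. */
static int functable_validate_sketch(const struct functable_sketch *ft) {
    if (ft->adler32 == NULL || ft->crc32 == NULL)
        return -1;
    return 0;
}
```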
Remove force-sse2 config option from x86 builds.
Due to major refactoring done long ago, this option no longer avoids a branch
in a hot path; it currently only removes a single if check during init.