Fix scan_endstr offset in longest_match slow path.
LONGEST_MATCH_SLOW was using len - (STD_MIN_MATCH+1) instead of
len - (STD_MIN_MATCH-1) for the end-of-string hash probe, hashing a
window inside the already-matched region instead of ending one byte
past the current match. The slow path was missing match extensions
it should have been finding. The comment and the upstream fast_zlib
source both specify the correct offset.
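A minimal sketch of the two probes, assuming STD_MIN_MATCH == 3 and a
3-byte hash; HASH3 and the helper are illustrative stand-ins, not the
zlib-ng code:

```
#include <stdint.h>

#define STD_MIN_MATCH 3
/* stand-in 3-byte hash; the real hash update differs */
#define HASH3(p) (((uint32_t)(p)[0] << 16) ^ ((uint32_t)(p)[1] << 8) ^ (p)[2])

static uint32_t probe_endstr(const unsigned char *scan, uint32_t len) {
    /* wrong: hashes bytes len-4..len-2, entirely inside the match */
    /* return HASH3(scan + len - (STD_MIN_MATCH + 1)); */

    /* right: hashes bytes len-2..len, ending one byte past the match,
     * so the lookup can find candidates that extend it */
    return HASH3(scan + len - (STD_MIN_MATCH - 1));
}
```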
Closes #2248.
Reported-by: Sergey "Shnatsel" Davidoff <291257+Shnatsel@users.noreply.github.com>
Reported-by: Folkert de Vries <7949978+folkertdev@users.noreply.github.com>
[CI] Cache Ubuntu .deb packages to speed up installing dependencies.
* Purge old packages and unneeded dependencies before copying the remaining packages to the cached directory
Add early return when prev_length already exceeds lookahead
Near end-of-input the caller's prev_length can exceed the
current lookahead, making the chain walk pointless since no
match can be longer than the available input. The non-slow
path never clamped this case (break_matching was slow-path
only), leaving the output contract (result never exceeds
lookahead) unguarded.
Together with the existing `if (len >= lookahead)` early
return in the update block, which stops the chain walk as
soon as a match reaches lookahead, this ensures no
unnecessary chain steps are taken. madler/zlib instead does
the full chain walk when prev_length exceeds lookahead and
clamps best_len at the function exit, resulting in extra work.
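A minimal sketch of the added guard; the struct is a stand-in for the
relevant deflate_state fields:

```
#include <stdint.h>

typedef struct {                  /* stand-in for deflate_state */
    uint32_t prev_length;
    uint32_t lookahead;
} deflate_state;

static uint32_t chain_walk(deflate_state *s) { return s->prev_length; }

static uint32_t longest_match(deflate_state *s) {
    /* near end-of-input prev_length can exceed lookahead, and no
     * match can be longer than the remaining input, so walking the
     * hash chain would be pointless */
    if (s->prev_length >= s->lookahead)
        return s->lookahead;
    return chain_walk(s);         /* normal hash-chain walk */
}
```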
Remove dead break_matching label from longest_match
The lookahead guard at break_matching is unreachable because
the early return `if (len >= lookahead) return lookahead` fires
before best_len is ever assigned, keeping best_len < lookahead
as a loop invariant. Replace the three goto sites with direct
returns and delete the label entirely.
The author_association gate already restricts triggers to OWNER, MEMBER,
or COLLABORATOR, so a maintainer running /delta on a fork PR carries the
same trust as checking the PR out locally. Drop the fork rejection and
the unused base/head repo id parsing.
Benchmark libpng row-by-row decoding where avail_out falls
below the 260-byte inflate_fast threshold. Uses a synthetic
gradient-with-noise pixel generator that produces deflate
token distributions representative of real photographs.
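A hypothetical sketch of such a generator; the benchmark's actual
generator and noise source may differ:

```
#include <stdint.h>

/* smooth gradients produce long matches, the low-bit LCG noise breaks
 * them up the way sensor noise does in real photographs */
static void fill_row_rgb(uint8_t *row, int width, int y, uint32_t *seed) {
    for (int x = 0; x < width; x++) {
        *seed = *seed * 1664525u + 1013904223u;
        uint8_t noise = (uint8_t)(*seed >> 24) & 0x0F;
        row[3 * x + 0] = (uint8_t)x + noise;             /* horizontal */
        row[3 * x + 1] = (uint8_t)y + noise;             /* vertical */
        row[3 * x + 2] = (uint8_t)((x + y) / 2) + noise; /* diagonal */
    }
}
```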
Also fix encode_png to use the passed width and height
instead of the hardcoded IMWIDTH and IMHEIGHT constants.
Fix libpng linking and include paths for benchmark apps
The FetchContent path was missing the binary directory from
PNG_INCLUDE_DIR, causing pnglibconf.h not to be found. The
link target was hardcoded to libpng.a which does not resolve
when libpng is built via FetchContent. Use png_static as the
CMake target when fetched, and normalize both paths through
PNG_STATIC_LIBRARY and PNG_INCLUDE_DIR variables.
Call adler32_c directly in adler32_copy_c scalar fallback
The generic copy function was calling through the function
table, which dispatched to the best SIMD implementation
instead of the scalar path. This led to incorrect benchmarks
for adler32_copy/c since it measured the SIMD path rather
than the scalar fallback. Call adler32_c directly so the
scalar copy variant actually exercises the scalar checksum.
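A sketch of the fix; the real zlib-ng signatures differ in detail:

```
#include <stdint.h>
#include <string.h>

extern uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len);

static uint32_t adler32_copy_c(uint32_t adler, uint8_t *dst,
                               const uint8_t *src, size_t len) {
    memcpy(dst, src, len);
    /* call the scalar checksum directly: routing through the function
     * table would dispatch to the best SIMD variant instead */
    return adler32_c(adler, src, len);
}
```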
Add SWAR scalar adler32 for 64-bit platforms with unaligned access
Borrows the SWAR (SIMD Within A Register) technique from FFmpeg's
libavutil/adler32.c by Michael Niedermayer. The original splits each
8-byte load into even/odd byte lanes packed as 4x16-bit accumulators
in a uint64_t, with a running prefix sum for the s2 contribution, and
a final reduction using multiply-and-shift with positional weight
constants. The chunk size is capped at 23 iterations of 8 bytes (184
bytes) to keep the 16-bit accumulators from overflowing.
Our improvements over the original FFmpeg implementation:
- Process 16 bytes per iteration (two 64-bit loads) instead of 8,
halving loop overhead while staying within the 23-iteration limit.
- Handle an 8-byte remainder after the 16-byte loop so no bytes
fall through to the slow scalar path unnecessarily.
- Applied to both the NMAX inner loop (adler32_c) and the combined
copy+checksum tail path (adler32_copy_tail) for all callers.
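A simplified sketch of the SWAR inner loop under these constraints
(little-endian load, k <= 23 chunks of 8 bytes); the real code reduces
the lanes with multiply-and-shift rather than the explicit shifts used
here, and the names are illustrative:

```
#include <stdint.h>
#include <string.h>

#define LANES 0x00FF00FF00FF00FFULL      /* four 16-bit byte lanes */

/* processes 8*k bytes, k <= 23; the caller does the mod-65521 reductions */
static void adler32_swar(uint32_t *s1, uint32_t *s2,
                         const uint8_t *buf, size_t k) {
    uint64_t sum_e = 0, sum_o = 0;  /* per-lane byte sums (even/odd) */
    uint64_t pre_e = 0, pre_o = 0;  /* per-lane prefix sums for s2 */

    *s2 += *s1 * (uint32_t)(8 * k); /* the old s1 weights every byte */

    for (size_t j = 0; j < k; j++) {
        uint64_t v;
        memcpy(&v, buf + 8 * j, 8);
        pre_e += sum_e;             /* chunk j ends up counted k-1-j times */
        pre_o += sum_o;
        sum_e += v & LANES;         /* bytes 0,2,4,6 */
        sum_o += (v >> 8) & LANES;  /* bytes 1,3,5,7 */
    }
    /* the byte at in-chunk position p of chunk j carries s2 weight
     * 8*(k-j) - p = 8*(k-1-j) (prefix part) + (8 - p) (lane part) */
    for (int t = 0; t < 4; t++) {
        uint32_t e  = (sum_e >> (16 * t)) & 0xFFFF;
        uint32_t o  = (sum_o >> (16 * t)) & 0xFFFF;
        uint32_t pe = (pre_e >> (16 * t)) & 0xFFFF;
        uint32_t po = (pre_o >> (16 * t)) & 0xFFFF;
        *s1 += e + o;
        *s2 += 8 * (pe + po) + (8 - 2 * t) * e + (7 - 2 * t) * o;
    }
}
```

The 23-iteration cap drops out of the prefix lanes: at k = 23 they peak
at 255 * (22 * 23 / 2) = 64515, just under the 16-bit limit.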
Simplify safe-mode copy path selection in inflate_fast
The branch structure now tests safe_mode directly, which is clearer and
produces the same code on all platforms. No functional change to the copy
operations used.
Improve inflate_fast performance for small output buffers
Lowers the inflate_fast entry threshold from 260 to 3 bytes of
available output by adding a safe_mode parameter that uses
bounds-checked copies and bails to the MATCH state when output
space is insufficient. This eliminates the performance cliff
where libpng-style row-by-row decompression falls back to the
slow inflate path for the last 260 bytes of each row.
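A conceptual sketch of the bail-out; the real state handling in
inflate_fast is considerably more involved and these names are
stand-ins:

```
#include <stddef.h>

typedef enum { LEN, MATCH } inflate_mode;       /* stand-ins */
typedef struct { inflate_mode mode; } inflate_state;

/* shape of the safe-mode check inside the match-copy path */
static int emit_match(inflate_state *state, int safe_mode,
                      unsigned char *out, unsigned char *end, size_t len) {
    if (safe_mode && len > (size_t)(end - out)) {
        /* not enough output space for this match: fall back to
         * inflate()'s MATCH state, which copies with bounds checks */
        state->mode = MATCH;
        return 0;
    }
    /* ... fast unchecked copy of len bytes ... */
    return 1;
}
```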
Replace small/large buffer tests with parameterized test_chunked
test_large_buffers reset d_stream.next_out on every inflate iteration, so the
decompressed output was never compared against the source. test_chunked keeps
the input, compressed, and decompressed buffers separate and checks them with
memcmp.
New avail_out values (3, 64, 128, 256, 259) exercise inflate_fast()'s safe-mode
MATCH-state bailout around the 258-byte maximum match length.
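The decode loop follows the standard chunked-inflate pattern; this
sketch is illustrative rather than the literal test code (zlib-ng
builds spell the calls with PREFIX()):

```
#include <assert.h>
#include <string.h>
#include "zlib.h"

static void check_chunked(const unsigned char *source, size_t source_len,
                          unsigned char *compressed, size_t compressed_len,
                          unsigned char *decompressed, /* sized with slack */
                          unsigned chunk) {
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    int err = inflateInit(&strm);
    assert(err == Z_OK);
    strm.next_in  = compressed;
    strm.avail_in = (uInt)compressed_len;
    do {
        /* advance next_out instead of resetting it, so the full output
         * accumulates and can be compared against the source */
        strm.next_out  = decompressed + strm.total_out;
        strm.avail_out = chunk;          /* 3, 64, 128, 256, or 259 */
        err = inflate(&strm, Z_NO_FLUSH);
    } while (err == Z_OK);
    assert(err == Z_STREAM_END);
    assert(memcmp(decompressed, source, source_len) == 0);
    inflateEnd(&strm);
}
```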
Bump Google Benchmark to v1.9.5
* Google Benchmark v1.9.4 fails to compile with recent versions of clang and Visual C++ if warnings are treated as errors
Adds benchmark_corpora.cc which dynamically discovers and benchmarks
all files from the zlib-ng/corpora repository (silesia, calgary,
canterbury, large, snappy, etc.).
Benchmarks are registered at startup using RegisterBenchmark. If the
corpora directory is not present, no benchmarks are registered.
Deflate is tested at levels 1, 6, and 9 per file. Inflate is tested
once per file using data pre-compressed at level 9.
Add --benchmark_cooldown flag to mitigate thermal throttling
Adds a --benchmark_cooldown=<seconds> flag that inserts a sleep between
benchmark families. This helps produce consistent results on systems
where sustained workloads cause thermal throttling and CPU frequency
scaling.
Uses a wrapping BenchmarkReporter that sleeps before forwarding results
to the default display reporter.
Add /delta workflow for per-PR binary size comparison
On a /delta PR comment the job builds the PR head and base with
RelWithDebInfo, splits the DWARF into sibling .debug companions, and
runs several tools against both stripped libraries:
- binutils size for text/data/bss totals plus a Δ row
- bloaty for sections, top 30 compile units, and top 30 symbols
- nm --defined-only --dynamic to diff the exported symbol set
- abidiff for C ABI changes (honouring test/abi/ignore)
- minigzip at levels 1-9 over silesia-small.tar and, on native
builds, the full silesia.tar
Results come back as a "## Delta Report" PR comment with a details
block per section, reporting both head and base SHAs so runs against
offset commits are unambiguous.
Comment syntax is /delta [arch] [-N]. Arch defaults to x86_64 and
accepts aarch64, powerpc64le, riscv64, and s390x. -N selects the Nth
commit back from the PR head so a regression can be bisected without
force-pushing. Cross-compile builds reuse cmake/toolchain-*.cmake
and run the stripped binaries under qemu-user.
Gate scalar and SSE Chorba uniformly on CRC32_CHORBA_FALLBACK and
CRC32_CHORBA_SSE_FALLBACK across prototypes, dispatch, sources, tests
and benchmarks instead of spot-checking WITHOUT_CHORBA /
WITHOUT_CHORBA_SSE directly at each site.
Also move crc32_chorba_c.c into ZLIB_GENERIC_SRCS and align Makefile.in
to match so the CMake and autotools builds stay bit-identical.
The 'Ubuntu GCC No Chorba' matrix entry was passing -DWITH_CHORBA=OFF
since its introduction in 9d4af458, but the actual CMake option is
named WITH_CRC32_CHORBA, so the entry never actually disabled Chorba.
The MSVC and GCC 32-bit polyfills for _mm_cvtsi64_si128 /
_mm_cvtsi128_si64 had identical bodies. Merge them into a single
block guarded by !__clang__ && ARCH_32BIT, with the MSVC-only
#include <intrin.h> nested inside.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix MSVC v142 miscompile of _mm_cvtsi64_si128 polyfill on 32-bit
MSVC v142 (Visual Studio 2019, and VS 2022 pre-17.11) miscompiles
_mm_set_epi64x(0, a) on 32-bit Windows by routing part of the synthesis
through a GPR, clobbering live register data and causing stack corruption
in the chorba SSE2/SSE4.1 CRC32 code paths.
Replace the _mm_set_epi64x(0, a) polyfill with _mm_loadl_epi64 which
compiles to a single MOVQ xmm,m64 that bypasses the buggy synthesis
path. Also convert the GCC 32-bit _mm_cvtsi64_si128 macro to a static
inline for consistency, and drop the redundant ARCH_X86 guard since
x86_intrins.h is only reachable from x86 code.
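The resulting polyfill looks roughly like this; the guard macros are
illustrative:

```
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

#if !defined(__clang__) && defined(ARCH_32BIT)
static inline __m128i _mm_cvtsi64_si128(int64_t a) {
    /* MOVQ xmm, m64 loads the low quadword and zeroes the upper lane,
     * sidestepping the buggy _mm_set_epi64x(0, a) synthesis */
    return _mm_loadl_epi64((const __m128i *)&a);
}
#endif
```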
The "slow" variant of longest_match uses a 3-byte rolling hash to seed
its offset-search lookups after a match has been found. Rename the
template gate, the functable entry, and all arch-specific instantiations
from *_slow to *_roll to reflect what the variant actually uses, so a
separate integer-hash offset-search variant can coexist under its own
name.
Benchmarks the inflate fast path with constrained output
buffers ranging from 64 to 16384 bytes per call, reproducing
the libpng decompression pattern described in the "running
off a cliff" analysis.
Fix VPCLMULQDQ CRC32 build with partial AVX-512 baselines
The 512-bit path in crc32_pclmulqdq_tpl.h assumed AVX-512F was
enough, but some of the intrinsics it used actually require
AVX-512DQ. Pick the correct variants based on the available
features.
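As a hypothetical illustration of the kind of substitution involved
(not necessarily the intrinsics the file actually changes), broadcasting
a 128-bit fold constant to all four lanes can be spelled either way:

```
#include <immintrin.h>

static __m512i broadcast_fold(__m128i fold128) {
#if defined(__AVX512DQ__)
    return _mm512_broadcast_i64x2(fold128);   /* requires AVX-512DQ */
#else
    return _mm512_broadcast_i32x4(fold128);   /* AVX-512F is enough */
#endif
}
```

Both forms produce the same bit pattern; only the required CPU feature
differs.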
Add fallback defines to skip generic C code when native intrinsics exist
Each arch header now sets *_FALLBACK defines (ADLER32_FALLBACK,
CHUNKSET_FALLBACK, COMPARE256_FALLBACK, CRC32_BRAID_FALLBACK,
SLIDE_HASH_FALLBACK) when no native SIMD implementation exists.
Generic C source files, declarations, functable entries, tests,
and benchmarks are guarded by these defines.
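A sketch of the gating shape; the feature macros here are hypothetical
stand-ins:

```
#include <stddef.h>
#include <stdint.h>

/* arch header: a native adler32 exists, so no generic fallback needed */
#if !defined(X86_SSSE3_ADLER32) && !defined(X86_AVX2_ADLER32)
#  define ADLER32_FALLBACK
#endif

/* generic source, declaration, functable entry, test and benchmark are
 * all compiled only when the fallback is actually reachable */
#ifdef ADLER32_FALLBACK
uint32_t adler32_c(uint32_t adler, const unsigned char *buf, size_t len);
#endif
```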
The `vector` keyword requires -fzvector which is not available on all
GCC versions (e.g. EL10). Use __attribute__((vector_size(16))) typedefs
instead, matching the existing style in crc32_vx.c.
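For example, with an illustrative typedef name:

```
/* instead of `vector unsigned int`, which needs -fzvector: */
typedef unsigned int uv4si __attribute__((vector_size(16)));

static uv4si add4(uv4si a, uv4si b) {
    return a + b;   /* element-wise, same as the vector keyword form */
}
```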
```
C:/build/git/zlib-ng/test/gh1235.c: In function 'main':
C:/build/git/zlib-ng/test/gh1235.c:34:43: error: passing argument 2 of 'compress2' from incompatible pointer type [-Wincompatible-pointer-types]
34 | if (PREFIX(compress2)(compressed, &bytes, plain, i, 1) != Z_OK) return -1;
| ^~~~~~
| |
| z_size_t * {aka unsigned int *}
In file included from C:/build/git/zlib-ng/zutil.h:15,
from C:/build/git/zlib-ng/test/gh1235.c:4:
../zlib.h:1261:69: note: expected 'long unsigned int *' but argument is of type 'z_size_t *' {aka 'unsigned int *'}
1261 | Z_EXTERN int Z_EXPORT compress2(unsigned char *dest, unsigned long *destLen, const unsigned char *source,
| ~~~~~~~~~~~~~~~^~~~~~~
```
- Add local variables match_len and strstart in insert_match to avoid
  extra lookups from the struct.
- Move the check for enough lookahead outside of the function, avoiding
  a call that would immediately return.
- Add a local variable match_len in emit_match to avoid extra lookups
  from the struct.
- Move the s->lookahead decrement to the top of the function; both
  branches of the function do it and neither cares when it is done.
Process 64 bytes per iteration using 8x uint64_t loads
with interleaved memcpy stores and __crc32d calls.
RPi5 benchmarks show 30-51% improvement over the
separate crc32 + memcpy baseline.
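A minimal sketch of the fused loop, assuming the ARMv8 CRC extension
(<arm_acle.h>); names and the remainder handling are illustrative:

```
#include <arm_acle.h>    /* __crc32d; needs __ARM_FEATURE_CRC32 */
#include <stdint.h>
#include <string.h>

/* whole 64-byte blocks only; the tail is handled elsewhere */
static uint32_t crc32_copy_blocks(uint32_t crc, unsigned char *dst,
                                  const unsigned char *src, size_t blocks) {
    while (blocks--) {
        uint64_t v[8];
        memcpy(v, src, sizeof(v));   /* 8x uint64_t loads */
        /* interleave the stores with the serial CRC dependency chain
         * so the copy overlaps the __crc32d latencies */
        for (int i = 0; i < 8; i++) {
            memcpy(dst + 8 * i, &v[i], 8);
            crc = __crc32d(crc, v[i]);
        }
        src += 64; dst += 64;
    }
    return crc;
}
```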
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the OSB workflow as an initial test before queueing all the other
tests; this makes sure we don't spend a lot of CI time testing
something that won't even build.
When 56d3d985 was reverted in b85cfdf9, it restored dead
stores to match.strstart and match.match_length that have
no effect, since match is passed by value. The compiler
already eliminated them; remove them from the source.
Use uintptr_t for ASan function signatures and macro variables
The ASan runtime ABI expects uptr (pointer-sized unsigned) for both
parameters of __asan_loadN/__asan_storeN. On LLP64 targets like
Windows x64, long is 32-bit while pointers are 64-bit, truncating
size values. Use uintptr_t to match the ABI correctly.
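The corrected shapes, using the public ASan interface names; the macro
wiring is illustrative:

```
#include <stdint.h>

/* the runtime expects uptr (pointer-sized unsigned) for both
 * parameters; uintptr_t matches on LP64 and LLP64 alike */
void __asan_loadN(uintptr_t addr, uintptr_t size);
void __asan_storeN(uintptr_t addr, uintptr_t size);

#define ASAN_LOAD(p, n)  __asan_loadN((uintptr_t)(p), (uintptr_t)(n))
#define ASAN_STORE(p, n) __asan_storeN((uintptr_t)(p), (uintptr_t)(n))
```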