- Add local variables match_len and strstart in insert_match, to avoid extra lookups from struct.
- Move the check for enough lookahead outside of the function; this avoids a
function call that would otherwise return immediately.
- Add local variable match_len in emit_match to avoid extra lookups from struct.
- Move the s->lookahead decrement to the top of the function; both branches of
the function do it and neither cares when it is done.
Process 64 bytes per iteration using 8x uint64_t loads
with interleaved memcpy stores and __crc32d calls.
RPi5 benchmarks show 30-51% improvement over the
separate crc32 + memcpy baseline.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the OSB workflow as an initial test before queueing all the other tests;
this makes sure we don't spend a lot of CI time testing something that
won't even build.
When 56d3d985 was reverted in b85cfdf9, it restored dead
stores to match.strstart and match.match_length that
have no effect since match is passed by value. The
compiler already eliminates them; remove them from the source.
Use uintptr_t for ASan function signatures and macro variables
The ASan runtime ABI expects uptr (pointer-sized unsigned) for both
parameters of __asan_loadN/__asan_storeN. On LLP64 targets like
Windows x64, long is 32-bit while pointers are 64-bit, truncating
size values. Use uintptr_t to match the ABI correctly.
Create zsanitizer.h with all sanitizer detection, declaration
stubs, and instrument_read/write/read_write macros. Include it
only in the chunkset, inflate, and dfltcc files that perform
deliberate out-of-bounds reads for performance.
Add 256-bit VPCLMULQDQ CRC32 path for systems without AVX-512.
Split VPCLMULQDQ CRC32 into separate AVX2 and AVX-512 compilation
units. Compute fold-by-8 constants for the AVX2 path using
bitreverse(x^d mod G(x), 33) with d=992 and d=1056.
Add MSAN to AArch64.
Change tests so we run UBSAN on neon/armv8 code; testing without
our optimizations is less important.
Fix Windows ARM test-skipping check.
Define NMAX_ALIGNED32 as NMAX rounded down to a multiple of 32 (5536)
and use it in the NEON adler32 implementation to ensure that src stays
32-byte aligned throughout the main SIMD loop. Since NMAX (5552)
is not a multiple of 32, the first iteration after the alignment
preamble could previously process a non-32-aligned number of bytes,
causing src to lose 32-byte alignment for all subsequent iterations.
The first iteration's budget is rounded down with ALIGN_DOWN after
subtracting align_diff, ensuring k is always a multiple of 32.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove culling of workflows after subsequent pushes.
Doing so breaks coveralls uploads, the workarounds cancel workflows
when they should not, and the approach is generally fragile.
It was a useful workaround when CI took ~2 hours; now that it takes
20 minutes, I think we can afford to complete them.
Add coveralls to pigz and make sure coveralls uploads are not finalized until
all jobs are successful, since finalizing early blocks further uploads from retried builds.
CI: Stop trying to use GCC on macOS, it is apparently deprecated and
keeps breaking every time github actions releases new images.
Convert to using Clang instead.
Simplify adler32 alignment loops to advance pointers
Replace done-offset tracking with direct pointer advancement in NEON,
VMX, and SSSE3 adler32 implementations. Use ALIGN_DIFF consistently
across all architectures for the initial alignment step.
Keep bi_buf/bi_valid in registers across compress_block loop
Refactor the emit functions to take bi_buf and bi_valid by reference,
allowing compress_block() to keep these values in CPU registers for the
entire duration of the main compression loop instead of reloading them
from memory on every iteration.
This eliminates two memory loads (s->bi_buf, s->bi_valid) and two memory
stores per symbol in the hot path.
Refactor and unify adler32 short length processing.
We have one function for aligning and one for tail processing. When
processing the tail, we only need to rebase if there is data left to
process; checking for this condition lets us avoid a rebase, which
is beneficial on slower machines.
Use at most a DO4 loop for the inlined tail under GCC/-O2 to limit
register pressure on x86.
For tails where MAX_LEN can be larger, we support using DO16, similar
to the default loop used in the scalar C version of adler32.
Z_RESTRICT is necessary to let the compiler know that src and dst
won't overlap and that it doesn't have to account for that case.
Sergey [Tue, 17 Feb 2026 03:42:02 +0000 (20:42 -0700)]
cmake: Fix empty ARCH in detect-arch
Both `CMAKE_C_COMPILER_TARGET` and `CMAKE_SYSTEM_PROCESSOR` are undefined
when configuring a UWP/WinRT build with Clang:
`-G Ninja -D CMAKE_SYSTEM_NAME=WindowsStore`.
These variables are undefined because `-m` is not passed to Clang.
`CMAKE_C_COMPILER_ARCHITECTURE_ID` could be used, but it would cause a more
significant change of the cmake script.