Ignore benchmarks in codecov coverage reports.
We already avoid collecting coverage when running benchmarks because the
benchmarks skip most error checking; even though they might increase code
coverage, they won't detect most bugs unless one actually crashes the whole
benchmark.
Added separate codecov components.
Wait for CI completion before posting the status report; this avoids emailing an initial report with very low coverage based on the pigz tests only.
Make the report informational; low coverage will not be a CI failure.
Disable GitHub Annotations; these are deprecated due to API limits.
Improve benchmark_compress and benchmark_uncompress.
- These now use the same generated data as benchmark_inflate.
- benchmark_uncompress now also uses level 9 for compression, so that
we also get 3-byte matches to uncompress.
- Improve error checking.
- Unify code with benchmark_inflate.
Add new benchmark inflate_nocrc. This lets us benchmark just the
inflate process more accurately. Also adds a new shared function for
generating highly compressible data that avoids very long matches.
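As a rough sketch of one such generator (the helper name and constants here are assumptions, not the shared function actually added):

    #include <stdint.h>
    #include <stddef.h>

    /* Fill buf with a short repeating alphabet, which compresses very well,
     * but drop in a pseudo-random byte every 32 positions so LZ77 matches
     * cannot grow into the hundreds of bytes. */
    static void gen_compressible_data(uint8_t *buf, size_t len) {
        for (size_t i = 0; i < len; i++) {
            buf[i] = (uint8_t)('A' + (i % 8));
            if (i % 32 == 31)
                buf[i] = (uint8_t)((i * 2654435761u) >> 24);  /* match breaker */
        }
    }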
Adam Stylinski [Fri, 12 Dec 2025 21:23:27 +0000 (16:23 -0500)]
Force purely aligned loads in inflate_table code length counting
At the expense of some extra stack space and eating about 4 more cache
lines, let's make these loads purely aligned. On potato CPUs such as the
Core 2, unaligned loads in a loop are not ideal. Additionally, some SBC-based
ARM chips (usually the little cores in big.LITTLE designs) suffer a
penalty for unaligned loads. This also paves the way for a trivial
AltiVec implementation, where unaligned loads don't exist and need
to be synthesized with permutation vectors.
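A rough sketch of the idea (buffer size, alignment, and names are assumptions, not the actual inflate_table code):

    #include <stdint.h>
    #include <string.h>

    /* Copy the code lengths into an over-aligned stack buffer so the counting
     * loop only ever issues aligned loads, at the cost of a few extra cache
     * lines of stack.  n is at most 288 + 32 here. */
    static void count_code_lengths(const uint16_t *lens, unsigned n,
                                   uint16_t count[16]) {
        _Alignas(64) uint16_t aligned_lens[288 + 32];

        memcpy(aligned_lens, lens, n * sizeof(uint16_t));
        memset(count, 0, 16 * sizeof(uint16_t));

        /* The real code does the counting with SIMD compares; reading from the
         * aligned copy lets those vector loads be purely aligned. */
        for (unsigned i = 0; i < n; i++)
            count[aligned_lens[i]]++;
    }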
Fix initial crc value loading in crc32_(v)pclmulqdq
In the main function, the alignment-diff processing was getting in the way of
XORing in the initial CRC, because it does not guarantee that at least 16 bytes
have been loaded.
In fold_16, the src data was modified by the initial-CRC XOR before being stored to dst.
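A sketch of the intended order of operations for the non-copying path (not the actual crc32_pclmulqdq code); the copying fold_16 must additionally store the original src bytes to dst, not the CRC-seeded block:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Only XOR the initial CRC into the data stream once a full 16-byte block
     * has actually been loaded, and do it before any folding uses that block. */
    static __m128i seed_first_block(const uint8_t *src, uint32_t init_crc) {
        __m128i block = _mm_loadu_si128((const __m128i *)src);  /* full 16 bytes */
        __m128i crc   = _mm_cvtsi32_si128((int)init_crc);       /* crc in low 32 bits */
        return _mm_xor_si128(block, crc);                       /* seed before folding */
    }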
Adam Stylinski [Tue, 23 Dec 2025 23:58:10 +0000 (18:58 -0500)]
Small optimization in the 256-bit wide chunkset
It turns out Intel only parses the bottom 4 bits of each byte in the shuffle
vector. This makes it already a sufficient permutation vector and saves us a
small bit of latency.
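A standalone demonstration of the property relied on here (not the chunkset code itself): vpshufb ignores bits 4..6 of each control byte, so control values 16..31 shuffle exactly like 0..15 within a 128-bit lane. Build with -mavx2:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint8_t src[32], ctl_lo[32], ctl_hi[32], a[32], b[32];
        for (int i = 0; i < 32; i++) {
            src[i]    = (uint8_t)(i * 3 + 1);
            ctl_lo[i] = (uint8_t)(i & 15);         /* indices 0..15        */
            ctl_hi[i] = (uint8_t)((i & 15) + 16);  /* same low 4 bits, +16 */
        }
        __m256i v = _mm256_loadu_si256((const __m256i *)src);
        _mm256_storeu_si256((__m256i *)a,
            _mm256_shuffle_epi8(v, _mm256_loadu_si256((const __m256i *)ctl_lo)));
        _mm256_storeu_si256((__m256i *)b,
            _mm256_shuffle_epi8(v, _mm256_loadu_si256((const __m256i *)ctl_hi)));
        printf("identical: %d\n", memcmp(a, b, 32) == 0);  /* prints 1 */
        return 0;
    }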
Improve cmake/detect-arch.cmake to also provide bitness.
Rewrite checks in CMakeLists.txt and cmake/detect-intrinsics.cmake
to utilize the new variables.
- Add a local window pointer to deflate_quick, deflate_fast, deflate_medium
  and fill_window (see the sketch below).
- Add a local strm pointer in fill_window.
- Fix a missed change to use the local lookahead variable in match_tpl.
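A minimal sketch of the local-pointer pattern, using a reduced stand-in struct rather than the real deflate_state:

    /* Sketch only: the struct is reduced to the fields needed to show the
     * pattern, and the loop body stands in for the real match-search code. */
    typedef struct {
        unsigned char *window;
        unsigned int   strstart;
        unsigned int   lookahead;
        unsigned int   hash;
    } state_sketch;

    static void scan_block(state_sketch *s) {
        unsigned char *window   = s->window;  /* loaded once, then kept in a register */
        unsigned int   strstart = s->strstart;

        while (s->lookahead > 0) {
            /* Reads go through the local pointer, so the compiler does not have
             * to reload s->window after every store through s. */
            s->hash = (s->hash << 5) ^ window[strstart];
            strstart++;
            s->lookahead--;
        }
        s->strstart = strstart;               /* write back once at the end */
    }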
Deflate_state changes:
- Reduce opt_len/static_len sizes.
- Move matches/insert closer to their related variables.
  These now fill an 8-byte hole in the struct on 64-bit platforms (see the
  sketch after this list).
- Exclude compressed_len and bits_sent if ZLIB_DEBUG is not enabled.
  Also move them to the end.
- Remove x86 MSVC-specific padding.
- Minor inlining changes in trees_emit.h:
  - Inline the small bi_windup function.
  - Don't attempt inlining for the big zng_emit_dist.
- Don't check for a too-long match in deflate_quick; it cannot happen.
- Move the GOTO_NEXT_CHAIN macro outside of the LONGEST_MATCH function to
  improve readability.
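A generic illustration of the 8-byte-hole point above; the layouts are assumptions for demonstration, not the real deflate_state:

    #include <stdint.h>
    #include <stdio.h>

    /* On a typical LP64 ABI, 'holey' is 24 bytes: the lone uint32_t before the
     * pointer leaves 4 bytes of padding, and 4 bytes of tail padding follow the
     * last field.  Grouping the two 4-byte fields lets them share one 8-byte
     * slot, shrinking the struct to 16 bytes. */
    struct holey {
        uint32_t matches;   /* 4 bytes + 4 bytes of padding */
        void    *head;      /* 8 bytes, 8-byte aligned      */
        uint32_t insert;    /* 4 bytes + 4 bytes tail pad   */
    };

    struct packed_better {
        uint32_t matches;   /* the two 4-byte fields now    */
        uint32_t insert;    /* fill a single 8-byte slot    */
        void    *head;
    };

    int main(void) {
        printf("%zu %zu\n", sizeof(struct holey), sizeof(struct packed_better));
        return 0;           /* prints "24 16" on LP64 targets */
    }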
Dougall Johnson [Mon, 8 Dec 2025 04:11:52 +0000 (20:11 -0800)]
Reorder code struct fields for better access patterns
Place the bits field before the op field in the code struct to optimize memory
access. The bits field is accessed first in the hot path, so placing
it at offset 0 may improve code generation on some architectures.
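A sketch of the reordered struct; the field set follows zlib's classic code struct (op/bits/val), though zlib-ng's exact comments and any further differences are not reproduced here:

    typedef struct {
        unsigned char  bits;   /* bits in this part of the code -- now at offset 0 */
        unsigned char  op;     /* operation, extra bits, table bits                */
        unsigned short val;    /* offset in table or code value                    */
    } code;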
[configure] Fix detecting -fno-lto support
* Previously, -fno-lto was assumed to be supported on non-GCC-compatible or unsupported compilers,
  even though support was never tested in those cases. Set the default to not supported.
Inline all uses of quick_insert_string*/quick_insert_value*.
Inline all uses of update_hash*.
Inline insert_string into deflate_quick, deflate_fast and deflate_medium.
Remove insert_string from deflate_state.
Use a local function pointer for insert_string (see the sketch below).
Fix the level check to actually check the level and not `s->max_chain_length <= 1024`.
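A minimal sketch of the local-function-pointer pattern; the callback type, names, and signature are illustrative, not zlib-ng's actual insert_string interface:

    /* Holding the dispatch target in a local means the indirect call target is
     * resolved once per block, rather than reloaded from a struct field inside
     * the hot loop. */
    typedef void (*insert_string_cb)(void *state, unsigned pos, unsigned count);

    static void deflate_body(void *state, unsigned start, unsigned end,
                             insert_string_cb insert_string) {
        for (unsigned pos = start; pos < end; pos++)
            insert_string(state, pos, 1);  /* local pointer, not a struct member */
    }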
There are no folding techniques in the adler32 implementations; they simply
hash while copying.
- Rename adler32_fold_copy to adler32_copy.
- Remove the unnecessary adler32_fold.c file.
- Reorder the adler32_copy functions to be last in the source file for consistency.
- Rename adler32_rvv_impl to adler32_copy_impl for consistency.
- Replace `dst != NULL` with 1 in adler32_copy_neon to remove branching.
Adam Stylinski [Fri, 21 Nov 2025 15:02:14 +0000 (10:02 -0500)]
Conditionally shortcut via the chorba polynomial based on compile flags
As it turns out, the copying CRC32 variant _is_ slower when compiled
with generic flags. The reason for this is mainly extra stack spills and
the lack of operations we can overlap with the moves. However, when
compiling for an architecture with more registers, such as AVX-512, we no
longer have to eat all these costly stack spills and we can overlap with
a 3-operand XOR. Conditionally guarding this means that if a Linux
distribution wants to compile with -march=x86-64-v4 they get all the
upsides of this.
This code notably is not actually used if you happen to have something
that supports 512-bit wide CLMUL, so this does help a somewhat narrow
range of targets (most of the earlier AVX-512 implementations, pre-Ice
Lake).
We also must guard with AVX512VL, as just specifying AVX512F makes GCC
generate vpternlog instructions of 512-bit width only, so a bunch of
packing and unpacking between 512-bit and 256-bit registers has to
occur, absolutely killing runtime. It's only with AVX512VL that a
128-bit wide vpternlog is available.
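A sketch of the kind of compile-time guard described; the macro name is illustrative, while __AVX512F__/__AVX512VL__ are the standard GCC/Clang predefined macros:

    #if defined(__AVX512F__) && defined(__AVX512VL__)
       /* Built with e.g. -march=x86-64-v4: enough registers to hide the copy,
        * and 128/256-bit vpternlog is available, so take the chorba shortcut. */
    #  define USE_CHORBA_SHORTCUT 1
    #else
       /* Generic builds: the copying CRC32 variant eats stack spills here,
        * so keep the plain path. */
    #  define USE_CHORBA_SHORTCUT 0
    #endif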
Adam Stylinski [Fri, 21 Nov 2025 14:45:48 +0000 (09:45 -0500)]
Use aligned loads in the chorba portions of the clmul crc routines
We already go through the trouble of aligning the data, so we may as well let
the compiler know the loads are definitely aligned. We can't guarantee an
aligned store, but at least with an aligned load the compiler can fold the load
into a subsequent XOR or carry-less multiplication when not copying.
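A sketch of the codegen difference, not the actual chorba/clmul routines:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* With an aligned load the compiler may fold the load straight into the
     * memory operand of the following pxor (SSE memory operands must be
     * 16-byte aligned); an unaligned load forces a separate movdqu first. */
    static __m128i xor_block_aligned(const __m128i *aligned_src, __m128i acc) {
        return _mm_xor_si128(acc, _mm_load_si128(aligned_src));
    }

    static __m128i xor_block_unaligned(const unsigned char *src, __m128i acc) {
        return _mm_xor_si128(acc, _mm_loadu_si128((const __m128i *)src));
    }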
Mika Lindqvist [Mon, 17 Nov 2025 17:15:03 +0000 (19:15 +0200)]
Fix build using configure
* "\i" is not valid escape code in BSD sed
* Some x86 shared sources were missing -fPIC due to using wrong variable in build rule
Brad Smith [Mon, 17 Nov 2025 05:50:47 +0000 (00:50 -0500)]
configure: Determine system architecture properly on *BSD systems
uname -m on a BSD system will provide the port architecture, e.g.
arm64, macppc, octeon, instead of the machine architecture, e.g.
aarch64, powerpc, mips64. uname -p will provide the machine
architecture. NetBSD uses x86_64, OpenBSD uses amd64, and FreeBSD
is a mix between uname -p and the compiler output.
Mika Lindqvist [Mon, 17 Nov 2025 10:28:21 +0000 (12:28 +0200)]
[CI] Downgrade "Windows GCC Native Instructions (AVX)" workflow
* The Windows Server 2025 runner has a broken GCC, so use the Windows Server 2022 runner instead until the fix is propagated to all runners