Adam Stylinski [Wed, 27 Apr 2022 01:53:30 +0000 (21:53 -0400)]
Fixed regression introduced by inlining CRC + copy
Pretty much every call to updatewindow implicitly performed a checksum, except on s/390 or when (state->wrap & 4) == 0. The inflateSetDictionary function instead computes this checksum separately before invoking updatewindow, and compares it against the initial checksum (a value obtained while parsing the DICTID section of the header).
Instead, we can give updatewindow a "copy" parameter carrying the state->wrap value that is being checked anyway, and move the 3rd-bit check from the callee to the caller.
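A toy, self-contained sketch of that shape (the struct, the stand-in checksum, and all names here are illustrative, not zlib-ng's real code):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct toy_state {
        int wrap;                     /* bit 2 (value 4): validate the check value */
        uint32_t check;
        unsigned char window[32768];
        size_t have;
    };

    static uint32_t toy_checksum(uint32_t chk, const unsigned char *buf, size_t len) {
        while (len-- > 0)
            chk = chk * 31u + *buf++; /* stand-in for crc32()/adler32() */
        return chk;
    }

    /* "copy" carries the caller's wrap & 4 decision, so the checksum rides
     * along with the window update instead of being a separate pass. */
    static void toy_updatewindow(struct toy_state *s, const unsigned char *end,
                                 size_t len, int copy) {
        if (copy)
            s->check = toy_checksum(s->check, end - len, len);
        size_t n = len < sizeof(s->window) ? len : sizeof(s->window);
        memcpy(s->window, end - n, n);
        s->have = n;
    }

    /* inflate-style caller:        toy_updatewindow(s, p, n, s->wrap & 4); */
    /* inflateSetDictionary-style:  toy_updatewindow(s, dict, dictlen, 0);  */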
Currently deflate and inflate both use a common state struct. There are several variables in this struct that we don't need for inflate, and more may be coming in the future. Therefore, split it into two separate structs. This in turn requires splitting the ZALLOC_STATE and ZCOPY_STATE macros.
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
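A hedged illustration of the approach (the real macro lives in zbuild.h and may differ in detail): internal symbols are routed through PREFIX() so the namespace can be switched in one place instead of hardcoding zng_ at every site.

    /* Assumed shape of the prefixing macro, shown for illustration only. */
    #if defined(ZLIB_COMPAT)
    #  define PREFIX(x) x
    #else
    #  define PREFIX(x) zng_ ## x
    #endif

    /* A definition written as PREFIX(inflate_ensure_window) then compiles to
     * inflate_ensure_window() in compat builds and zng_inflate_ensure_window()
     * otherwise. */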
Mika Lindqvist [Tue, 12 Apr 2022 22:22:29 +0000 (01:22 +0300)]
Check that sys/auxv.h exists at configure time and add preprocessor define for it.
* Protect the inclusion of sys/auxv.h in all relevant files with the new preprocessor define
* Test for the existence of both sys/auxv.h and getauxval() with both cmake and configure
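A minimal sketch of the guarded usage, assuming the build systems emit defines along the lines of HAVE_SYS_AUXV_H and HAVE_GETAUXVAL (the exact names are an assumption):

    #ifdef HAVE_SYS_AUXV_H
    #  include <sys/auxv.h>
    #endif

    static unsigned long hwcaps(void) {
    #if defined(HAVE_SYS_AUXV_H) && defined(HAVE_GETAUXVAL) && defined(AT_HWCAP)
        return getauxval(AT_HWCAP);   /* runtime CPU feature bits from the kernel */
    #else
        return 0;                     /* no auxv available: report no optional features */
    #endif
    }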
Mika Lindqvist [Tue, 5 Apr 2022 21:04:45 +0000 (00:04 +0300)]
Add one extra byte to the return value of compressBound and deflateBound for small lengths, because the shift terms there return 0.
* Treat 0-byte input as 1-byte input when calculating compressBound and deflateBound
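The arithmetic behind the fix, sketched with a zlib-style bound formula (shown here for illustration only; zlib-ng's exact formula may differ):

    /* With sourceLen == 0 every shift term is 0, so tiny inputs are bumped
     * up to 1 byte to keep the bound from coming out short. */
    static size_t bound_sketch(size_t sourceLen) {
        if (sourceLen == 0)
            sourceLen = 1;
        return sourceLen + (sourceLen >> 12) + (sourceLen >> 14) +
               (sourceLen >> 25) + 13;
    }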
Rename the memory alignment functions: they take a custom allocator as their first parameter, so having calloc and cfree (c = custom) in the names is confusing.
Adam Stylinski [Wed, 6 Apr 2022 22:15:57 +0000 (18:15 -0400)]
Fix the custom PNG image based benchmark
The height parameter was using a fixed macro, written at a time when the test imagery was fully synthetic. Because of this, images smaller than our in-memory generated imagery would artificially throw a CRC error.
Remove sanitizer support from configure since it is better supported in cmake. Anybody who still needs it can use cmake or manually set CFLAGS and LDFLAGS.
Adam Stylinski [Sun, 27 Mar 2022 23:20:08 +0000 (19:20 -0400)]
Use size_t types for len arithmetic, matching signature
This suppresses a warning and keeps everything safely the same type.
While it's unlikely that the input for any of this will exceed the size of an unsigned 32-bit integer, this approach is cleaner than casting and should not result in a performance degradation.
Adam Stylinski [Sat, 12 Mar 2022 21:09:02 +0000 (16:09 -0500)]
Leverage inline CRC + copy
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
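A minimal scalar sketch of the fused idea, assuming a standard reflected CRC-32 table (the function and table names are hypothetical, not zlib-ng's actual CRC-folding code):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy and checksum each byte in one pass so the data is only pulled
     * through the cache once. */
    static uint32_t crc32_copy_sketch(uint32_t crc, unsigned char *dst,
                                      const unsigned char *src, size_t len,
                                      const uint32_t crc_table[256]) {
        crc = ~crc;
        for (size_t i = 0; i < len; i++) {
            dst[i] = src[i];                                      /* copy */
            crc = (crc >> 8) ^ crc_table[(crc ^ src[i]) & 0xff];  /* checksum */
        }
        return ~crc;
    }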
Fixed signed comparison warning in zng_calloc_aligned.
zutil.c: In function ‘zng_calloc_aligned’:
zutil.c:133:20: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘long unsigned int’ [-Wsign-compare]
Fixed operator precedence warnings in slide_hash_sse2.
slide_hash_sse2.c(58,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
slide_hash_sse2.c(59,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
Fixed signed/unsigned mismatch and conversion warnings reported against inflate_p.h:
inffast.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inflate.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
Adam Stylinski [Fri, 18 Mar 2022 23:18:10 +0000 (19:18 -0400)]
Fix an issue with the ubsan for overflow
While this didn't _actually_ cause any issues for us, technically the _mm512_reduce_add_epi32() intrinsic returns a signed integer, and it does the very last summation in scalar GPRs as signed integers. While the ALU still did the math properly (the negative representation is the same addition in hardware, just interpreted differently), the sanitizer caught a window of inputs here that was definitely outside the range of a signed integer for this intermediate operation.
The solution, as silly as it may seem, is to implement our own 32-bit horizontal sum function that does all of the work in vector registers. This allows us to implicitly keep things in the vector register domain and convert at the very end, after the summation is complete. The compiler's sanitizer is none the wiser and the result is still correct.
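A minimal sketch of that kind of in-register horizontal sum, assuming AVX512F and AVX2 are available (the function name and exact reduction sequence are illustrative):

    #include <immintrin.h>
    #include <stdint.h>

    /* Reduce a __m512i of 32-bit lanes entirely in vector registers, then
     * move the single result out at the very end. */
    static inline uint32_t hsum512_epu32(__m512i v) {
        __m256i lo   = _mm512_castsi512_si256(v);
        __m256i hi   = _mm512_extracti64x4_epi64(v, 1);
        __m256i s256 = _mm256_add_epi32(lo, hi);
        __m128i s128 = _mm_add_epi32(_mm256_castsi256_si128(s256),
                                     _mm256_extracti128_si256(s256, 1));
        s128 = _mm_add_epi32(s128, _mm_shuffle_epi32(s128, _MM_SHUFFLE(1, 0, 3, 2)));
        s128 = _mm_add_epi32(s128, _mm_shuffle_epi32(s128, _MM_SHUFFLE(2, 3, 0, 1)));
        return (uint32_t)_mm_cvtsi128_si32(s128);
    }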
Adam Stylinski [Sun, 20 Mar 2022 15:44:32 +0000 (11:44 -0400)]
Rename adler32_sse41 to adler32_ssse3
As it turns out, the sum of absolute differences instruction _did_ exist in SSSE3 all along. SSE41 introduced a stranger, less commonly used variation of it. Knowing this, the old SSSE3 method can be dropped entirely and the SSE41 method can now be used on CPUs that only have SSSE3.
Removing this extra functable entry shrinks the code and allows for a
simpler planned refactor later for the adler checksum and copy elision.
Adam Stylinski [Fri, 18 Mar 2022 00:22:56 +0000 (20:22 -0400)]
Fix a latent issue with chunkmemset
It would seem that on some platforms, namely those which are !UNALIGNED64_OK, there was a likelihood of chunkmemset_safe_c copying all the bytes before passing control flow to chunkcopy, a function that is explicitly unsafe to call with a zero-length copy.
Adam Stylinski [Thu, 17 Mar 2022 02:52:44 +0000 (22:52 -0400)]
Fix UBSAN's cry afoul
Technically, we weren't actually doing this the way C legally wants us to. The zmemcpy calls turn into NOPs at pretty much every optimization level above 0, and this gets us defined behavior under the sanitizer, putting the optimized load from an arbitrarily aligned address into the compiler's hands instead of ours.
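A minimal sketch of the pattern being described (the helper name is illustrative): read through memcpy instead of a pointer cast, and let the optimizer lower it to a single unaligned load.

    #include <stdint.h>
    #include <string.h>

    static inline uint64_t load64_sketch(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof(v));  /* defined behavior regardless of alignment */
        return v;
        /* vs. the undefined-behavior version: return *(const uint64_t *)p; */
    }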
Mika Lindqvist [Sun, 13 Mar 2022 15:12:42 +0000 (17:12 +0200)]
Allow bypassing runtime feature check of TZCNT instructions.
* This avoids a conditional branch when it is known at build time that TZCNT instructions are always supported
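A hedged sketch of the build-time bypass (the macro and function names are illustrative, not zlib-ng's actual ones):

    /* __BMI__ is defined by GCC/Clang when the target has BMI1, which
     * includes TZCNT. */
    #if defined(__BMI__)
    #  define TZCNT_ALWAYS_AVAILABLE 1
    #else
    #  define TZCNT_ALWAYS_AVAILABLE 0
    #endif

    static int can_use_tzcnt(int runtime_cpu_has_tzcnt) {
    #if TZCNT_ALWAYS_AVAILABLE
        (void)runtime_cpu_has_tzcnt;  /* guaranteed at build time: no branch needed */
        return 1;
    #else
        return runtime_cpu_has_tzcnt; /* fall back to the runtime feature check */
    #endif
    }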
Adam Stylinski [Mon, 21 Feb 2022 21:52:17 +0000 (16:52 -0500)]
Speed up chunkcopy and memset
This was found to have a significant impact on a highly compressible PNG for both encode and decode. Some deltas show performance improving by as much as 60%+.
For scenarios where "dist" does not evenly divide our chunk size, we simply repeat the bytes as many times as will fit into our vector registers. We then store the entire vector and advance by only the bytes covered by those whole repeats (chunk size / dist of them).
If dist happens to be 1, there's no reason not to just call memset from libc (it is likely to be just as fast, if not faster). A sketch of the idea follows.
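An illustrative SSE2 sketch of the technique described above (not the library's actual chunkmemset code):

    #include <emmintrin.h>
    #include <stddef.h>
    #include <string.h>

    /* Pack as many whole repeats of the dist-byte pattern at out - dist as
     * fit into one 16-byte register, store full vectors, and advance only by
     * the bytes those whole repeats cover. */
    static void chunk_repeat_sketch(unsigned char *out, size_t dist, size_t len) {
        if (dist == 1) {                          /* libc memset is at least as fast */
            memset(out, out[-1], len);
            return;
        }
        unsigned char pattern[16];
        size_t fill = 0;
        if (dist < sizeof(pattern)) {
            while (fill + dist <= sizeof(pattern)) {   /* whole repeats only */
                memcpy(pattern + fill, out - dist, dist);
                fill += dist;
            }
            __m128i v = _mm_loadu_si128((const __m128i *)pattern);
            while (len >= sizeof(pattern)) {           /* room for a full 16-byte store */
                _mm_storeu_si128((__m128i *)out, v);   /* store the whole vector... */
                out += fill;                           /* ...advance by whole repeats */
                len -= fill;
            }
        }
        while (len > 0) {                              /* byte-at-a-time tail */
            *out = *(out - dist);
            out++;
            len--;
        }
    }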
Adam Stylinski [Mon, 24 Jan 2022 04:32:46 +0000 (23:32 -0500)]
Improve SSE2 slide hash performance
At least on pre-Nehalem CPUs, we get a > 50% improvement. This is mostly due to the fact that we're opportunistically doing aligned loads instead of unaligned loads. This is something that is very likely to be possible, given that the deflate stream initialization uses the zalloc function, which most libraries don't override. Our allocator aligns to 64-byte boundaries, meaning we can do aligned loads even with AVX512 on the zstream->prev and zstream->head pointers. However, only pre-Nehalem CPUs _actually_ benefit from explicitly aligned load instructions.
The other thing being done here is unrolling the loop by a factor of 2 so that we can get a tiny bit more ILP. This improved performance by another 5%-7%. A sketch of the approach is below.
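A minimal sketch of that approach, assuming 16-bit hash entries and a 16-byte-aligned table whose length is a multiple of 16 entries (names and the exact unroll are illustrative, not the library's slide_hash_sse2):

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Subtract wsize from every table entry with unsigned saturation, using
     * aligned loads and a 2x unroll for a bit more ILP. */
    static void slide_hash_sse2_sketch(uint16_t *table, size_t entries, uint16_t wsize) {
        const __m128i vwsize = _mm_set1_epi16((short)wsize);
        for (size_t i = 0; i < entries; i += 16) {
            __m128i a = _mm_load_si128((__m128i *)(table + i));
            __m128i b = _mm_load_si128((__m128i *)(table + i + 8));
            a = _mm_subs_epu16(a, vwsize);   /* entries below wsize saturate to 0 */
            b = _mm_subs_epu16(b, vwsize);
            _mm_store_si128((__m128i *)(table + i), a);
            _mm_store_si128((__m128i *)(table + i + 8), b);
        }
    }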