Ilya Leoshkevich [Tue, 14 Jun 2022 09:19:29 +0000 (11:19 +0200)]
IBM Z DFLTCC: Test with MSan
* Add a CI job.
* Do not collect coverage: LLVM's gcov support (part of compiler-rt)
cannot be built with the MSan instrumentation, which means that
whenever it's called (in particular, in order to write the results to
a file at the end), there is a risk of false positives.
* Add __msan_unpoison() calls to DFLTCC inline assembly (see the sketch after this list).
* Make parameter block sizes symbolic constants.
* Move dfltcc() definition after struct dfltcc_param_v0 definition.
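To illustrate the __msan_unpoison() part: MSan cannot see stores performed by inline assembly, so memory written by DFLTCC has to be marked as initialized by hand. A minimal sketch, assuming a hypothetical UNPOISON helper and a stand-in parameter block (not the actual DFLTCC code):

    #include <stddef.h>

    #if defined(__has_feature)
    #  if __has_feature(memory_sanitizer)
    #    include <sanitizer/msan_interface.h>
    #    define UNPOISON(addr, size) __msan_unpoison((addr), (size))
    #  endif
    #endif
    #ifndef UNPOISON
    #  define UNPOISON(addr, size) ((void)0)   /* no-op outside MSan builds */
    #endif

    /* Illustrative stand-in for the DFLTCC parameter block. */
    struct param_block { unsigned char data[64]; };

    static void run_hw_op(struct param_block *p) {
        /* ... inline asm issues the DFLTCC instruction and writes *p
         *     behind MSan's back ... */
        (void)p;
        /* MSan did not see those stores, so mark the block as initialized: */
        UNPOISON(p, sizeof(*p));
    }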
Add a test for concurrently modifying deflate() input
The test simulates what one of the QEMU live migration tests is doing:
increments each buffer byte by 1 while deflate()ing it.
The test tries to produce a race condition and is therefore
probabilistic. The longer it runs, the better the chances of catching
an issue. The scenario in question is known to be broken on IBM Z
with DFLTCC, where it is caught within 100 ms most of the time. The
run time is therefore set to 1 second to balance usability and
reliability.
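Roughly the shape of such a test (an illustrative sketch, not the actual test code; error handling omitted):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <string.h>
    #include "zlib-ng.h"

    #define IN_SIZE 65536

    static unsigned char inbuf[IN_SIZE];
    static unsigned char outbuf[IN_SIZE * 2];
    static atomic_int stop;

    /* Keep incrementing every input byte while the main thread deflate()s it. */
    static void *mutator(void *arg) {
        (void)arg;
        while (!atomic_load(&stop))
            for (size_t i = 0; i < IN_SIZE; i++)
                inbuf[i]++;
        return NULL;
    }

    int main(void) {
        pthread_t t;
        zng_stream strm;
        memset(&strm, 0, sizeof(strm));
        zng_deflateInit(&strm, 1);

        pthread_create(&t, NULL, mutator, NULL);
        strm.next_in = inbuf;
        strm.avail_in = IN_SIZE;
        strm.next_out = outbuf;
        strm.avail_out = (uint32_t)sizeof(outbuf);
        /* The output is not expected to round-trip; the point is that
         * deflate() must not crash or corrupt its internal state. */
        zng_deflate(&strm, Z_FINISH);

        atomic_store(&stop, 1);
        pthread_join(t, NULL);
        return zng_deflateEnd(&strm) == Z_OK ? 0 : 1;
    }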
Add a public compile definition for the zlib-ng API so that other projects that use CMake and link against the zlib project can easily determine whether to include "zlib-ng.h" or "zlib.h".
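A consumer would then branch on that definition at the include site; the macro name below is a hypothetical placeholder, not necessarily the exact definition exported by the CMake target:

    /* ZLIBNG_NATIVE_API is a stand-in for the propagated compile definition. */
    #ifdef ZLIBNG_NATIVE_API
    #  include "zlib-ng.h"
    #else
    #  include "zlib.h"
    #endif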
Negative windowBits arguments are eventually turned positive in
deflateInit2_ and inflateInit2_ (more precisely in inflateReset2).
Such values are used to indicate that raw deflate/inflate should
be performed.
If a user supplies INT32_MIN for windowBits, the code will compute
-INT32_MIN, which does not fit into int32_t. This is
undefined behavior in C and should be avoided.
Clearly this is a user error, but given the careful validation of
input arguments a few lines later in deflateInit2_, I think this
might be of interest.
Proof of Concept:
- Compile zlib-ng with gcc -ftrapv or -fsanitize=undefined
- Compile and run this program:
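A minimal sketch along these lines (using the public deflateInit2 wrapper; not the original reproducer):

    #include <stdint.h>
    #include <string.h>
    #include "zlib.h"

    int main(void) {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));
        /* Negating INT32_MIN to "turn it positive" overflows int32_t,
         * which is undefined behavior and trips -ftrapv / UBSan. */
        return deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                            INT32_MIN, 8, Z_DEFAULT_STRATEGY);
    }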
If gzip support has been disabled during compilation, then also
consider gzip-related states as invalid in deflateStateCheck.
The gzip state definitions can also be removed.
This change leads to a failure in test/example, and I am not sure
what the GZIP conditional is trying to achieve. All gzip-related
functions are still defined in zlib.h.
An alternative approach is to remove the GZIP define.
Fixed conversion warnings for wsize in slide_hash_c.
slide_hash.c(50,44): warning C4244: 'function': conversion from 'unsigned int' to 'uint16_t', possible loss of data
slide_hash.c(51,40): warning C4244: 'function': conversion from 'unsigned int' to 'uint16_t', possible loss of data
Remove zng_gzgetc_ function from zlib-ng native API.
It exists in zlib for backwards compatibility, but has never been
documented or advertised for use in zlib-ng's native API.
Fabrice Fontaine [Fri, 27 May 2022 21:25:21 +0000 (23:25 +0200)]
CMakeLists.txt: fix version in zlib.pc when building statically
When building statically (i.e. with BUILD_SHARED_LIBS=OFF),
ZLIB_FULL_VERSION is not set resulting in an empty version in zlib.pc
and the following build failure with transmission:
checking for ZLIB... configure: error: Package requirements (zlib >= 1.2.3) were not met:
Package dependency requirement 'zlib >= 1.2.3' could not be satisfied.
Package 'zlib' has version '', required version is '>= 1.2.3'
Mark Adler [Tue, 10 May 2022 15:11:32 +0000 (08:11 -0700)]
Correct incorrect inputs provided to the CRC functions.
The previous releases of zlib were not sensitive to incorrect CRC
inputs with bits set above the low 32. This commit restores that
behavior, so that applications with such bugs will continue to
operate as before.
Speed up software CRC-32 computation by a factor of 1.5 to 3.
Use the interleaved method of Kadatch and Jenkins in order to make
use of pipelined instructions through multiple ALUs in a single
core. This also speeds up and simplifies the combination of CRCs,
and updates the functions to pre-calculate and use an operator for
CRC combination.
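For reference, CRC combination merges the CRCs of two adjacent buffers without rescanning the data; with the classic zlib API (zlib-ng's native API prefixes these names with zng_):

    #include "zlib.h"

    /* crc_a covers the first buffer, crc_b covers the second (len_b bytes);
     * the result equals crc32 computed over the concatenation of the two. */
    static unsigned long crc_of_concat(unsigned long crc_a, unsigned long crc_b,
                                       long len_b) {
        return crc32_combine(crc_a, crc_b, len_b);
    }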
Adam Stylinski [Fri, 8 Apr 2022 17:24:21 +0000 (13:24 -0400)]
Adding avx512_vnni inline + copy elision
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse. That will be an interesting function to attempt to
optimize.
Right now though, we're basically beating Google for all PNG decode and
encode benchmarks. There are some variations of flags that can
have us trading blows, but we're as much as 14% faster
than Chromium's zlib patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
Adam Stylinski [Fri, 8 Apr 2022 02:57:09 +0000 (22:57 -0400)]
Added inlined AVX512 adler checksum + copy
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
Adam Stylinski [Wed, 6 Apr 2022 19:38:20 +0000 (15:38 -0400)]
Add AVX2 inline copy + adler implementation
This was pretty much an across-the-board win for performance, but the wins
are very data dependent and it sort of depends on what copy runs look
like. On our less than realistic data in benchmark_zlib_apps, the
decode test saw some of the bigger gains, ranging anywhere from 6 to 11%
when compiled with AVX2 on a Cascade Lake CPU (and with only AVX2
enabled). The decode on realistic imagery enjoyed smaller gains,
somewhere between 2 and 4%.
Interestingly, there was one outlier on encode, at level 5. The best
theory for this is that the copy runs for that particular compression
level were such that glibc's ERMS-aware memmove implementation managed
to marginally outpace the copy during the checksum with the rep movs
string sequence, thanks to clever microcoding on Intel's part. It's hard to say
for sure, but the most standout difference between the two perf profiles
was more time spent in memmove (which is expected, as it's calling
memcpy instead of copying the bytes during the checksum).
There's the distinct possibility that the AVX2 checksums could be
marginally improved by one level of unrolling (like what's done in the
SSE3 implementation). The AVX512 implementations are certainly getting
gains from this, but it's not appropriate to append this optimization to
this series of commits.
Adam Stylinski [Sun, 3 Apr 2022 16:18:12 +0000 (12:18 -0400)]
Adding an SSE42 optimized copy + adler checksum implementation
Its usage is guarded behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.
When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine. The reason for this is the
particularly tricky case of dealing with unaligned buffers.
Implementations targeting CPUs that do not pay a huge penalty for
unaligned loads can be made much faster.
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that ended up either spilling to the stack or trampolining
out of a loop and back in it instead of just jumping over the first load
and store.
Revisiting this for AVX512, where registers are more abundant and more
advanced loads exist, may be prudent.
Adam Stylinski [Fri, 1 Apr 2022 23:02:05 +0000 (19:02 -0400)]
Create adler32_fold_c* functions
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.
This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
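A sketch of what such a baseline wrapper amounts to (the name and signature below are inferred from the description, not copied from the patch):

    #include <stdint.h>
    #include <string.h>
    #include "zlib-ng.h"

    /* Baseline "fold" wrapper: copy the input and checksum it in one call.
     * Vectorized variants fuse the two passes; this fallback simply chains them. */
    static uint32_t adler32_fold_copy_c(uint32_t adler, uint8_t *dst,
                                        const uint8_t *src, size_t len) {
        memcpy(dst, src, len);
        return zng_adler32(adler, src, len);
    }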
Adam Stylinski [Sun, 10 Apr 2022 17:01:22 +0000 (13:01 -0400)]
Improved chunkset substantially where it's heavily used
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.
We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.
In essence, where this helps, it helps a lot. Where it doesn't, it does
no measurable damage to the runtime.
This commit also simplifies a chunkcopy_safe call for determining a
distance. Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
builtin_memcpy will always be in range.
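The permutation-table idea mentioned above, sketched for SSSE3 (illustrative only; the names are made up, and the real chunkset code covers all distances and uses the project's chunk abstractions):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Row d-1 repeats the byte pattern 0..d-1 across 16 lanes; only
     * distances 1..4 are shown here, the real table covers more. */
    static const uint8_t perm_tbl[4][16] = {
        {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
        {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1},
        {0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0},
        {0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3},
    };

    /* Replicate a short run of 'dist' bytes starting at 'from' over 'len'
     * bytes at 'out'. Assumes dist <= 4, at least 16 readable bytes at
     * 'from', and 16 bytes of writable slack past 'out + len' (as the
     * window provides). */
    static void splat_run(uint8_t *out, const uint8_t *from,
                          unsigned dist, size_t len) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)from);
        __m128i mask  = _mm_loadu_si128((const __m128i *)perm_tbl[dist - 1]);
        __m128i pat   = _mm_shuffle_epi8(chunk, mask); /* swizzle into a repeating pattern */
        size_t step = 16 - (16 % dist);                /* whole patterns per 16-byte store */
        while (len > 0) {
            _mm_storeu_si128((__m128i *)out, pat);
            size_t adv = step < len ? step : len;
            out += adv;
            len -= adv;
        }
    }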
Adam Stylinski [Wed, 27 Apr 2022 01:53:30 +0000 (21:53 -0400)]
Fixed regression introduced by inlining CRC + copy
Pretty much every time updatewindow has been called, a checksum was
implicitly performed, unless running on s/390 or (state->wrap & 4) == 0. The
inflateSetDictionary function instead separately computes this checksum
before invoking updatewindow and checks whether it
matches the initial checksum (which comes from parsing the
DICTID section of the header).
Instead, we can give updatewindow a "copy" parameter, which is the
state->wrap value that is being checked anyway, and move the check of
the 3rd bit to the caller rather than the callee.
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore split them into two separate
structs. This in turn requires splitting the ZALLOC_STATE and ZCOPY_STATE
macros.
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
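The convention in question, roughly (a simplified sketch of the idea; the project's actual macro definitions handle more cases):

    /* Internal symbols are declared through PREFIX() instead of hard-coding
     * zng_, so the prefix stays a single point of configuration. */
    #ifdef ZLIB_COMPAT
    #  define PREFIX(name) name            /* classic zlib symbol names */
    #else
    #  define PREFIX(name) zng_ ## name    /* zlib-ng native symbol names */
    #endif

    /* PREFIX(flush_pending) then expands to zng_flush_pending in native
     * builds and to flush_pending in zlib-compat builds. */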
Mika Lindqvist [Tue, 12 Apr 2022 22:22:29 +0000 (01:22 +0300)]
Check that sys/auxv.h exists at configure time and add a preprocessor define for it.
* Guard inclusion of sys/auxv.h in all relevant files with the new preprocessor define (see the sketch below)
* Test for the existence of both sys/auxv.h and getauxval() in both cmake and configure
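The resulting guard looks along these lines (the macro names HAVE_SYS_AUXV_H and HAVE_GETAUXVAL are assumed for illustration; the build system defines the actual names):

    #ifdef HAVE_SYS_AUXV_H
    #  include <sys/auxv.h>
    #endif

    /* Query CPU feature bits only when both the header and getauxval() exist. */
    static unsigned long query_hwcap(void) {
    #if defined(HAVE_SYS_AUXV_H) && defined(HAVE_GETAUXVAL)
        return getauxval(AT_HWCAP);
    #else
        return 0;   /* caller falls back to other detection methods */
    #endif
    }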
Mika Lindqvist [Tue, 5 Apr 2022 21:04:45 +0000 (00:04 +0300)]
Add one extra byte to the return value of compressBound and deflateBound for small lengths, where the shift terms return 0.
* Treat a 0-byte input as a 1-byte input when calculating compressBound and deflateBound (see the illustration below)
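For context, zlib's classic formula shows why small inputs are a corner case (zlib-ng's own bound differs in details; this only illustrates the shift terms going to zero):

    /* zlib's classic bound: for tiny sourceLen every shift term is 0, so all
     * slack comes from the constant. The commit compensates by adding one
     * extra byte and treating a 0-byte input as a 1-byte input. */
    static unsigned long classic_compress_bound(unsigned long sourceLen) {
        return sourceLen + (sourceLen >> 12) + (sourceLen >> 14) +
               (sourceLen >> 25) + 13;
    }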
Rename the memory alignment functions: they take the custom allocator as their first parameter, so having calloc and cfree (c = custom) in their names is confusing.
Adam Stylinski [Wed, 6 Apr 2022 22:15:57 +0000 (18:15 -0400)]
Fix the custom PNG image based benchmark
The height parameter was using a fixed macro, written at a time when the
test imagery was fully synthetic. Because of this, images smaller than
our in-memory generated imagery will artificially throw a CRC
error.
Remove sanitizer support from configure since it is better supported in cmake. Anybody who still needs it can use cmake or manually set CFLAGS and LDFLAGS.
Adam Stylinski [Sun, 27 Mar 2022 23:20:08 +0000 (19:20 -0400)]
Use size_t types for len arithmetic, matching the signature
This suppresses a warning and keeps everything safely the same type.
While it's unlikely that the input for any of this will exceed the size
of an unsigned 32-bit integer, this approach is cleaner than casting and
should not result in a performance degradation.
Adam Stylinski [Sat, 12 Mar 2022 21:09:02 +0000 (16:09 -0500)]
Leverage inline CRC + copy
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
Fixed signed comparison warning in zng_calloc_aligned.
zutil.c: In function ‘zng_calloc_aligned’:
zutil.c:133:20: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘long unsigned int’ [-Wsign-compare]
Fixed operator precedence warnings in slide_hash_sse2.
slide_hash_sse2.c(58,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
slide_hash_sse2.c(59,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inffast.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inflate.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data