Adam Stylinski [Sun, 27 Mar 2022 23:20:08 +0000 (19:20 -0400)]
Use size_t types for len arithmetic, matching signature
This suppresses a warning and keeps everything safely the same type.
While it's unlikely that the input for any of this will exceed the range
of an unsigned 32-bit integer, this approach is cleaner than casting and
should not result in any performance degradation.
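A tiny illustration of the idea (not the project's exact code): keeping intermediate length arithmetic in size_t, matching the parameter type, avoids both the sign-compare warning and any narrowing casts.

    #include <stddef.h>

    /* Same type end to end: no -Wsign-compare warning, no casts. */
    static size_t bytes_remaining(size_t len, size_t consumed) {
        size_t remaining = len - consumed;
        return remaining;
    }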
Adam Stylinski [Sat, 12 Mar 2022 21:09:02 +0000 (16:09 -0500)]
Leverage inline CRC + copy
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
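A scalar sketch of the fused shape, assuming crc_table is a standard reflected CRC-32 lookup table (the real code dispatches to a vectorized CRC fold): one pass over the data produces both the window copy and the running checksum.

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t crc32_copy_sketch(uint32_t crc, unsigned char *dst,
                                      const unsigned char *src, size_t len,
                                      const uint32_t crc_table[256]) {
        crc = ~crc;
        for (size_t i = 0; i < len; i++) {
            dst[i] = src[i];                                      /* the copy we need anyway */
            crc = (crc >> 8) ^ crc_table[(crc ^ src[i]) & 0xff];  /* CRC of the same byte */
        }
        return ~crc;
    }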
Fixed signed comparison warning in zng_calloc_aligned.
zutil.c: In function ‘zng_calloc_aligned’:
zutil.c:133:20: warning: comparison of integer expressions of different signedness: ‘int32_t’ {aka ‘int’} and ‘long unsigned int’ [-Wsign-compare]
Fixed operator precedence warnings in slide_hash_sse2.
slide_hash_sse2.c(58,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
slide_hash_sse2.c(59,5): warning C4554: '&': check operator precedence for possible error; use parentheses to clarify precedence
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inffast.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inflate.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
Adam Stylinski [Fri, 18 Mar 2022 23:18:10 +0000 (19:18 -0400)]
Fix an issue with the ubsan for overflow
While this didn't _actually_ cause any issues for us, technically the
_mm512_reduce_add_epi32() intrinsic returns a signed integer and it
does the very last summation in scalar GPRs as signed integers. While
the ALU still did the math properly (the negative representation is the
same addition in hardware, just interpreted differently), the sanitizer
caught a window of inputs here that is definitely outside the range of a
signed integer for this intermediate operation.
The solution, as silly as it may seem, is to implement our own 32-bit
horizontal sum function that does all of the work in vector registers.
This lets us implicitly keep things in the vector register domain and
convert to a scalar only at the very end, after the summation is
complete. The compiler's sanitizer is none the wiser, and the result is
still correct.
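A minimal sketch of such a horizontal sum, assuming AVX-512F is available (an illustration of the technique, not necessarily the code as merged): the sixteen 32-bit lanes are reduced entirely in vector registers and only the final lane is moved to a scalar.

    #include <immintrin.h>
    #include <stdint.h>

    /* Requires AVX-512F; no intermediate value is ever summed as a signed scalar. */
    static inline uint32_t hsum512_u32(__m512i v) {
        __m256i lo = _mm512_castsi512_si256(v);
        __m256i hi = _mm512_extracti64x4_epi64(v, 1);
        __m256i sum256 = _mm256_add_epi32(lo, hi);
        __m128i sum128 = _mm_add_epi32(_mm256_castsi256_si128(sum256),
                                       _mm256_extracti128_si256(sum256, 1));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(1, 0, 3, 2)));
        sum128 = _mm_add_epi32(sum128, _mm_shuffle_epi32(sum128, _MM_SHUFFLE(2, 3, 0, 1)));
        return (uint32_t)_mm_cvtsi128_si32(sum128);
    }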
Adam Stylinski [Sun, 20 Mar 2022 15:44:32 +0000 (11:44 -0400)]
Rename adler32_sse41 to adler32_ssse3
As it turns out, the sum of absolute differences instruction _did_ exist
on SSSE3-class hardware all along. SSE4.1 introduced a different, less
commonly used variation of the sum of absolute differences instruction.
Knowing this, the old SSSE3 method can be axed entirely and the SSE41
method can now be used on CPUs that only have SSSE3.
Removing this extra functable entry shrinks the code and allows for a
simpler planned refactor later for the adler checksum and copy elision.
Adam Stylinski [Fri, 18 Mar 2022 00:22:56 +0000 (20:22 -0400)]
Fix a latent issue with chunkmemset
It would seem that on some platforms, namely those which are
!UNALIGNED64_OK, chunkmemset_safe_c was likely to copy all of the bytes
before passing control flow to chunkcopy, a function which is explicitly
unsafe to call with a zero-length copy.
Adam Stylinski [Thu, 17 Mar 2022 02:52:44 +0000 (22:52 -0400)]
Fix UBSAN complaint about unaligned loads
Technically, we weren't doing this the way C legally wants us to. The
zmemcpy calls compile down to plain loads at pretty much all optimization
levels above 0, and this gets us defined behavior that satisfies the
sanitizer, putting the optimized load from an arbitrarily aligned address
into the compiler's hands instead of ours.
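A minimal sketch of the idiom: instead of dereferencing a potentially misaligned pointer (undefined behavior), copy the bytes through memcpy and let the compiler pick the best unaligned load. At -O1 and above this compiles to a single load instruction, not an actual call.

    #include <stdint.h>
    #include <string.h>

    static inline uint64_t load_64_bits(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof(v));   /* defined behavior regardless of alignment */
        return v;
    }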
Mika Lindqvist [Sun, 13 Mar 2022 15:12:42 +0000 (17:12 +0200)]
Allow bypassing runtime feature check of TZCNT instructions.
* This avoids a conditional branch when it is known at build time that TZCNT instructions are always supported
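A hedged sketch of the build-time bypass (the macro and flag names below are illustrative, not the project's actual symbols): GCC and Clang define __BMI__ when compiling with -mbmi, so the runtime CPUID-backed check can collapse to a constant the compiler folds away.

    #if defined(__BMI__)
    #  define CPU_HAS_TZCNT 1          /* known at build time: no conditional branch */
    #else
    extern int cpu_has_tzcnt;          /* hypothetical flag set once from CPUID */
    #  define CPU_HAS_TZCNT cpu_has_tzcnt
    #endif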
Adam Stylinski [Mon, 21 Feb 2022 21:52:17 +0000 (16:52 -0500)]
Speed up chunkcopy and memset
This was found to have a significant impact on a highly compressible PNG
for both encode and decode. Some deltas show performance improving by as
much as 60%+.
For the scenarios where the "dist" does not evenly divide our chunk size,
we simply repeat the bytes as many times as will fit into our vector
registers. We then store the entire vector and advance by as many whole
copies of dist as fit in a chunk (dist times the quotient of the chunk
size divided by dist).
If dist happens to be 1, there's no reason not to just call memset from
libc (this is likely to be just as fast, if not faster).
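A simplified SSE2 sketch of the idea for distances up to one chunk (an illustration, not the project's chunkset code; it assumes dist bytes of already-written output precede out and that 1 <= dist <= 16):

    #include <emmintrin.h>
    #include <stddef.h>
    #include <string.h>

    /* Fill `len` bytes at `out` by repeating the `dist` bytes preceding it. */
    static unsigned char *broadcast_fill(unsigned char *out, size_t dist, size_t len) {
        const unsigned char *pattern = out - dist;
        if (dist == 1) {                       /* plain byte fill: libc memset is as fast */
            memset(out, pattern[0], len);
            return out + len;
        }
        unsigned char chunk[16];
        for (size_t i = 0; i < sizeof(chunk); i++)
            chunk[i] = pattern[i % dist];      /* repeated pattern; last copy may be partial */
        size_t advance = (sizeof(chunk) / dist) * dist;  /* bytes consumed per store */
        __m128i v = _mm_loadu_si128((const __m128i *)chunk);
        while (len >= sizeof(chunk)) {         /* store 16 bytes, consume `advance` of them */
            _mm_storeu_si128((__m128i *)out, v);
            out += advance;
            len -= advance;
        }
        while (len--) {                        /* finish any tail byte by byte */
            *out = *(out - dist);
            out++;
        }
        return out;
    }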
Adam Stylinski [Mon, 24 Jan 2022 04:32:46 +0000 (23:32 -0500)]
Improve SSE2 slide hash performance
At least on pre-Nehalem CPUs, we get a > 50% improvement. This is
mostly due to the fact that we're opportunistically doing aligned loads
instead of unaligned loads. This is very likely to be possible, given
that the deflate stream initialization uses the zalloc function, which
most applications don't override. Our allocator aligns to 64-byte
boundaries, meaning we can do aligned loads even for AVX512 on the
zstream->prev and zstream->head pointers. However, only pre-Nehalem
CPUs _actually_ benefit from explicitly aligned load instructions.
The other thing being done here is unrolling the loop by a factor of 2
so that we can get a tiny bit more ILP. This improved performance by
another 5%-7%.
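A minimal sketch of the kernel, assuming the table holds a multiple of 16 entries and is at least 16-byte aligned (illustrative, not the exact shipped code): subtract wsize from every 16-bit hash entry with unsigned saturation so entries older than the window clamp to zero, using aligned loads and a 2x unroll for a little extra ILP.

    #include <emmintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static void slide_hash_chain(uint16_t *table, size_t entries, uint16_t wsize) {
        const __m128i vwsize = _mm_set1_epi16((short)wsize);
        for (size_t i = 0; i < entries; i += 16) {          /* entries % 16 == 0 assumed */
            __m128i a = _mm_load_si128((__m128i *)(table + i));
            __m128i b = _mm_load_si128((__m128i *)(table + i + 8));
            a = _mm_subs_epu16(a, vwsize);                  /* saturating subtract */
            b = _mm_subs_epu16(b, vwsize);
            _mm_store_si128((__m128i *)(table + i), a);
            _mm_store_si128((__m128i *)(table + i + 8), b);
        }
    }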
Ilya Leoshkevich [Tue, 15 Mar 2022 12:09:04 +0000 (08:09 -0400)]
IBM Z: Delete stale self-hosted builder containers
Due to events such as a power outage, ExecStop may not run, resulting in
a stale actions-runner container. This would prevent ExecStart from
succeeding, so try deleting any such stale containers in ExecStartPre.
Adam Stylinski [Mon, 21 Feb 2022 05:17:07 +0000 (00:17 -0500)]
Adding some application-specific benchmarks
So far, only PNG encode and decode with predictably compressible bytes
have been added. This gives us a rough idea of the more holistic impact
of performance improvements (and regressions).
An interesting thing found with this: when compared with stock zlib,
we're slower at PNG decoding for levels 8 & 9. When we are slower, we
are spending a fair amount of time in the chunk copy function. This
probably merits a closer look.
This code optionally creates an alternative benchmark binary that links
with an alternative static zlib implementation. This can be used to
quickly compare different forks.
Adam Stylinski [Tue, 8 Feb 2022 22:09:30 +0000 (17:09 -0500)]
Use pclmulqdq accelerated CRC for exported function
We were already using this internally for our CRC calculations; however,
the exported function for CRC checksumming an arbitrary stream of bytes
was still using a generic, table-driven C version. The accelerated
version is now called when len is at least 64 bytes.
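A hedged sketch of the dispatch decision (the function and flag names below are placeholders, not zlib-ng's actual symbols): short inputs stay on the table-driven path, and the PCLMULQDQ fold is only used once the buffer is large enough to amortize its setup cost.

    #include <stddef.h>
    #include <stdint.h>

    extern int cpu_has_pclmulqdq;                                           /* hypothetical CPUID flag */
    extern uint32_t crc32_pclmul(uint32_t crc, const unsigned char *buf, size_t len);      /* hypothetical */
    extern uint32_t crc32_table_based(uint32_t crc, const unsigned char *buf, size_t len); /* hypothetical */

    uint32_t crc32_dispatch(uint32_t crc, const unsigned char *buf, size_t len) {
        if (len >= 64 && cpu_has_pclmulqdq)
            return crc32_pclmul(crc, buf, len);     /* vectorized fold for long buffers */
        return crc32_table_based(crc, buf, len);    /* generic path for short buffers */
    }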
Adam Stylinski [Sat, 12 Feb 2022 15:26:50 +0000 (10:26 -0500)]
Improved adler32 NEON performance by 30-47%
We unlocked some ILP by allowing for independent sums in the loop and
reducing these sums outside of the loop. Additionally, the multiplication
by 32 (now 64) is moved outside of this loop. Similar to the chromium
implementation, this code does straight 8-bit -> 16-bit additions and defers
the fused multiply-accumulate until outside the loop. However, by unrolling by
another factor of 2, the code is measurably faster. The code does fused multiply-
accumulates back into as many scratch registers as we have room for in order to maximize
ILP for the 16 integer FMAs that need to occur. The compiler seems to order them
such that the destination register is the same register as the previous instruction,
so perhaps it's not actually able to overlap, or maybe the A73's pipeline is reordering
these instructions anyway.
On the Odroid-N2, the Cortex-A73 cores are ~30-44% faster on the adler32 benchmark,
and the Cortex-A53 cores are anywhere from 34-47% faster.
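The ILP idea itself is independent of NEON; a tiny scalar illustration of the principle (not the NEON kernel), splitting one dependent accumulation into independent partial sums that are only reduced after the loop:

    #include <stddef.h>
    #include <stdint.h>

    /* Two independent accumulators let the CPU overlap additions that a single
     * running sum would serialize; they are combined only after the loop. */
    static uint32_t sum_bytes_unrolled(const unsigned char *buf, size_t len) {
        uint32_t sum_a = 0, sum_b = 0;
        size_t i = 0;
        for (; i + 2 <= len; i += 2) {   /* unroll by two: independent dependency chains */
            sum_a += buf[i];
            sum_b += buf[i + 1];
        }
        if (i < len)
            sum_a += buf[i];
        return sum_a + sum_b;            /* reduce the partial sums outside the loop */
    }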
Adam Stylinski [Wed, 16 Feb 2022 14:42:40 +0000 (09:42 -0500)]
Unlocked more ILP in SSE variant of adler checksum
This helps uarchs such as Sandy Bridge more than Yorkfield, but there
were some measurable gains on a Core 2 Quad Q9650 as well. We can sum
into two separate vs2 variables and add them back together at the end,
allowing for some overlapping multiply-adds. This was only about a 9-12%
gain on the Q9650, but it nearly doubled performance on Cascade Lake and
is likely to have appreciable gains on everything in between those two.
Adam Stylinski [Sat, 5 Feb 2022 21:15:46 +0000 (16:15 -0500)]
Improve sse41 adler32 performance
Rather than doing opportunistic aligned loads, we can do scalar
unaligned loads into our two halves of the checksum until we hit
alignment. Then, we can subtract from the max number of sums for the
first run through the loop.
This allows us to force aligned loads for unaligned buffers (likely a
common case for arbitrary runs of memory). This is not meaningful after
Nehalem, but on pre-Nehalem architectures it makes a substantial
difference to performance and is more foolproof than hoping for an
aligned buffer.
Improvement is around 44-50% for unaligned worst case scenarios.
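A simplified sketch of the alignment prologue (modulo reductions omitted for brevity; not the shipped kernel): consume bytes one at a time until the pointer hits a 16-byte boundary so the vector loop can use aligned loads, with the scalar iterations charged against the first block's budget of sums.

    #include <stddef.h>
    #include <stdint.h>

    static const unsigned char *adler_align_prologue(const unsigned char *buf, size_t *len,
                                                     uint32_t *s1, uint32_t *s2) {
        while (((uintptr_t)buf & 15) != 0 && *len > 0) {
            *s1 += *buf++;     /* running byte sum */
            *s2 += *s1;        /* running sum of sums */
            (*len)--;
        }
        return buf;            /* 16-byte aligned here, or the input is exhausted */
    }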
Adam Stylinski [Mon, 21 Feb 2022 21:46:18 +0000 (16:46 -0500)]
Prevent stale stub functions from being called in deflate_slow
Just in case this is the very first call to longest_match, we should
assign the function pointer rather than the function itself. This way,
by the time control leaves the stub, the function pointer has been
reassigned. This was found incidentally while debugging something else.
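The functable stub pattern in play here might be sketched as follows (names and signatures simplified; not the project's actual table): the first call lands in a stub that resolves the implementation, stores it in the dispatch table, and calls through the table, so any code still holding the old entry would keep hitting the stub.

    #include <stdint.h>

    typedef uint32_t (*longest_match_fn)(const unsigned char *a, const unsigned char *b);

    static uint32_t longest_match_c(const unsigned char *a, const unsigned char *b) {
        uint32_t len = 0;
        while (len < 258 && a[len] == b[len])   /* trivial stand-in implementation */
            len++;
        return len;
    }

    static uint32_t longest_match_stub(const unsigned char *a, const unsigned char *b);

    static struct { longest_match_fn longest_match; } functable = { longest_match_stub };

    static uint32_t longest_match_stub(const unsigned char *a, const unsigned char *b) {
        functable.longest_match = longest_match_c;      /* resolve once, for all later calls */
        return functable.longest_match(a, b);           /* call through the updated table */
    }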
Adam Stylinski [Sun, 23 Jan 2022 03:49:04 +0000 (22:49 -0500)]
Write an SSE2 optimized compare256
The SSE4 variant uses the unfortunate string comparison instructions from
SSE4.2, which not only don't work on as many CPUs but are often slower
than their SSE2 counterparts, except in very specific circumstances.
This version should be ~2x faster than unaligned_64 for larger strings
and about half the performance of AVX2 comparisons on identical
hardware.
This version is meant to supplement pre-AVX hardware. Because of this,
we're performing 1 extra load + compare at the beginning. In the event
that we're doing a full 256-byte comparison (completely equal strings),
this will result in 2 extra SIMD comparisons if the inputs are unaligned.
Given that the loads will be absorbed by L1, this isn't super likely to
be a giant penalty, but for something like a first- or second-generation
Core i CPU, where unaligned loads aren't nearly as expensive, this is
going to be _marginally_ slower in the worst case. This allows us to have
half the loads be aligned, so that the compiler can fold the load into
the compare by using a register-relative pcmpeqb.
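A simplified sketch of the comparison loop (not the shipped compare256): compare 16 bytes at a time with pcmpeqb, turn the result into a bitmask with pmovmskb, and derive the match length from the first clear bit.

    #include <emmintrin.h>
    #include <stdint.h>

    /* Returns the number of matching bytes between src0 and src1, up to 256. */
    static uint32_t compare256_sse2_sketch(const unsigned char *src0, const unsigned char *src1) {
        uint32_t len = 0;
        do {
            __m128i a = _mm_loadu_si128((const __m128i *)(src0 + len));
            __m128i b = _mm_loadu_si128((const __m128i *)(src1 + len));
            unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
            if (mask != 0xFFFF) {
                unsigned idx = 0;
                while (mask & 1) {     /* count matching bytes before the mismatch */
                    mask >>= 1;
                    idx++;
                }
                return len + idx;
            }
            len += 16;
        } while (len < 256);
        return 256;
    }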
Only define CPU variants that require deflate_state when deflate.h has previously been included. This allows us to include cpu_features.h without pulling in zlib.h or its name mangling.