Only define CPU variants that require deflate_state when deflate.h has previously been included. This allows us to include cpu_features.h without including zlib.h or name mangling.
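A minimal sketch of the guard shape (the guard macro and the prototypes below are illustrative assumptions, not copied from the header):

    /* cpu_features.h (sketch) */
    #include <stdint.h>
    #include <stddef.h>

    #ifdef DEFLATE_H_
    /* Only declared when deflate.h was included first, since it needs deflate_state. */
    void slide_hash_avx2(deflate_state *s);
    #endif

    /* Variants that do not touch deflate_state stay visible unconditionally,
     * so cpu_features.h can be included without zlib.h. */
    uint32_t adler32_avx2(uint32_t adler, const unsigned char *buf, size_t len);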
Adam Stylinski [Thu, 3 Feb 2022 23:23:45 +0000 (18:23 -0500)]
Obtained more ILP with VMX by breaking a data dependency
By unrolling and finding the equivalent recurrence relation here, we can
do more independent sums, maximizing ILP. When the data size fits into
cache, we get a sizable return. When it doesn't, the gain is minor but
still measurable.
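As a scalar illustration of the rewritten recurrence (the real change is in the VMX code, and the modulo reduction is omitted here), unrolling by two turns the serial s1/s2 chain into sums that can issue independently:

    #include <stddef.h>
    #include <stdint.h>

    static void adler_unrolled2(const uint8_t *buf, size_t len,
                                uint32_t *s1, uint32_t *s2) {
        size_t i = 0;
        for (; i + 2 <= len; i += 2) {
            uint32_t b0 = buf[i], b1 = buf[i + 1];
            /* Equivalent to two serial "s1 += b; s2 += s1;" steps, but the
             * two loads and partial sums no longer depend on each other. */
            *s2 += 2 * *s1 + 2 * b0 + b1;
            *s1 += b0 + b1;
        }
        for (; i < len; i++) {     /* odd tail byte */
            *s1 += buf[i];
            *s2 += *s1;
        }
    }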
Adam Stylinski [Fri, 28 Jan 2022 15:00:07 +0000 (10:00 -0500)]
More than double adler32 performance with altivec
Bits of low-hanging and high-hanging fruit in this round of
optimization. Altivec has an instruction that sums characters into 4
integer lanes (the vec_sum4s intrinsic) that seems basically made for this
algorithm. Additionally, there's a similar multiply-accumulate instruction
that takes two character vectors as input and outputs a vector of 4
ints holding the sums of their adjacent products. This alone accounted
for a good amount of the performance gain.
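A rough sketch of how those two intrinsics slot into the per-16-byte step, assuming a 16-byte aligned buffer and big-endian lane order; names and structure here are illustrative, not the committed routine:

    #include <altivec.h>
    #include <stddef.h>
    #include <stdint.h>

    static void adler_blocks_vmx(const uint8_t *buf, size_t blocks,
                                 vector unsigned int *vs1,
                                 vector unsigned int *vs2,
                                 vector unsigned int *vs1_sums) {
        /* Positional weights 16..1 so vec_msum produces the "sum2" contribution. */
        const vector unsigned char vweights = {16, 15, 14, 13, 12, 11, 10, 9,
                                               8, 7, 6, 5, 4, 3, 2, 1};
        while (blocks--) {
            vector unsigned char bytes = vec_ld(0, buf);
            *vs1_sums = vec_add(*vs1_sums, *vs1);    /* deferred *16, applied after the loop */
            *vs1 = vec_sum4s(bytes, *vs1);           /* byte sums into 4 int lanes           */
            *vs2 = vec_msum(bytes, vweights, *vs2);  /* adjacent weighted byte products      */
            buf += 16;
        }
    }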
Additionally, the shift by 4 was still being done inside the loop when it
was easy to hoist it outside the loop and do it only once. This removed
some latency waiting for a dependent operand to be ready. We also unrolled
the loop with independent sums, though this only seems to help for much
larger input sizes.
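Continuing the sketch above, the deferred multiply-by-16 becomes a single shift by 4 applied once after the loop:

    #include <altivec.h>

    static vector unsigned int fold_deferred_shift(vector unsigned int vs2,
                                                   vector unsigned int vs1_sums) {
        return vec_add(vs2, vec_sl(vs1_sums, vec_splat_u32(4)));
    }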
Additionally, we simplified feeding in the two 16-bit halves of the sum by
packing them next to each other in an aligned allocation on the stack.
Then, once loaded, we permute and shift the values into two separate
vector registers from the same input register. The separation of these
scalars probably could have been done in vector registers through some
tricks, but we need them in scalar GPRs every time they leave the loop
anyhow, so it was naturally better to keep them separate before hitting
the vectorized code.
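One possible shape for that prologue, again assuming big-endian lane order; the helper and the exact mask/shift choices are assumptions, not the committed code:

    #include <altivec.h>
    #include <stdint.h>

    /* Pack the two 16-bit halves of the adler value side by side in one aligned
     * stack slot, load it once, then derive both starting vectors from that
     * single register with a mask and a 4-byte shift. */
    static void load_adler_halves(uint32_t adler,
                                  vector unsigned int *vs1,
                                  vector unsigned int *vs2) {
        uint32_t __attribute__((aligned(16))) packed[4] = {
            adler & 0xffff, (adler >> 16) & 0xffff, 0, 0
        };
        const vector unsigned int vmask = {0xffffffff, 0, 0, 0};
        vector unsigned int v = vec_ld(0, packed);
        *vs1 = vec_and(v, vmask);                  /* { s1, 0, 0, 0 } */
        *vs2 = vec_and(vec_sld(v, v, 4), vmask);   /* { s2, 0, 0, 0 } */
    }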
For the horizontal addition, the code was modified to use a sequence of
shifts and adds to produce a vector sum in the first lane. Then, the
much cheaper vec_ste was used to store the value into a general purpose
register rather than vec_extract.
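The reduction at the end then looks roughly like this: after two rotate-and-add steps every lane holds the full sum, so vec_ste can drop lane 0 straight into an aligned scalar:

    #include <altivec.h>

    static unsigned int hsum_u32_vmx(vector unsigned int v) {
        unsigned int __attribute__((aligned(16))) out;
        v = vec_add(v, vec_sld(v, v, 8));   /* rotate by 8 bytes and add  */
        v = vec_add(v, vec_sld(v, v, 4));   /* rotate by 4 bytes and add  */
        vec_ste(v, 0, &out);                /* cheap single-element store */
        return out;
    }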
Lastly, instead of doing the relatively expensive modulus in GPRs after
the scalar operations that align all of the loads for the loop, we reduce
"n" for the first round to n minus the alignment offset.
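In other words, roughly the following, where NMAX is the usual adler32 block limit and the helper name is made up:

    #include <stddef.h>
    #include <stdint.h>

    #define NMAX 5552  /* max bytes before s1/s2 must be reduced mod BASE */

    /* Shrink the first round's budget by the number of scalar bytes spent
     * reaching 16-byte alignment, so one modulus at the end of the round
     * covers the scalar head and the vector body together. */
    static size_t first_round_len(const unsigned char *buf) {
        size_t misalign = (uintptr_t)buf & 15;
        size_t head = misalign ? 16 - misalign : 0;
        return NMAX - head;
    }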
[ARM] rename cmake/configure macros check_{acle,neon}_intrinsics to check_{acle,neon}_compiler_flag
* Currently these macros only check that the compiler flag(s) are supported, not that the compiler supports the actual intrinsics
Michael Hirsch [Tue, 25 Jan 2022 00:22:01 +0000 (19:22 -0500)]
Intel compilers: update deprecated -wn to -Wall style
This removes warnings on every single target like:
icx: command line warning #10430: Unsupported command line options encountered
These options as listed are not supported.
For more information, use '-qnextgen-diag'.
option list:
-w3
Signed-off-by: Michael Hirsch <michael@scivision.dev>
Adam Stylinski [Tue, 25 Jan 2022 05:16:37 +0000 (00:16 -0500)]
Make cmake and configure release flags consistent
CMake already appends -DNDEBUG to the preprocessor definitions when not
building a debug configuration. This turns off debug-level assertions and
has some other side effects. As such, we should append the same define
to the configure script's CFLAGS.
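For reference, NDEBUG is the standard C mechanism that compiles assertions out:

    /* Built with -DNDEBUG (CMake's non-debug configs, and now the configure
     * script's release CFLAGS as well), assert() expands to nothing. */
    #include <assert.h>

    int main(void) {
        assert(1 + 1 == 2);   /* present in debug builds, removed under NDEBUG */
        return 0;
    }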
Adam Stylinski [Tue, 18 Jan 2022 14:47:45 +0000 (09:47 -0500)]
Remove the "avx512_well_suited" cpu flag
Now that we have confirmation that the AVX512 variants so far have been
universally better on every capable CPU we've tested them on, there's no
sense in trying to maintain a whitelist.
Adam Stylinski [Mon, 17 Jan 2022 14:27:32 +0000 (09:27 -0500)]
Improvements to avx512 adler32 implementations
Now that better benchmarks are in place, it became apparent that a masked
broadcast was _not_ faster and that it's actually faster to use vmovd, as
suspected. Additionally, for the VNNI variant, we've unlocked some
additional ILP by doing a second dot product in the loop into a different
running sum that gets recombined later. This broke a data dependency
chain and allowed the IPC to reach ~2.75. The result is about a 40-50%
improvement in runtime.
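A minimal sketch of the dependency-breaking shape, assuming AVX512-VNNI and illustrative names; it shows only the two-accumulator dot products (here against a vector of ones), not the full adler32 arithmetic:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compile with -mavx512f -mavx512vnni.  Two independent vpdpbusd running
     * sums halve the loop-carried dependency chain; they are recombined once
     * after the loop. */
    static __m512i byte_sums_2way(const uint8_t *buf, size_t blocks) {
        const __m512i ones = _mm512_set1_epi8(1);
        __m512i acc_a = _mm512_setzero_si512();
        __m512i acc_b = _mm512_setzero_si512();
        size_t i = 0;
        for (; i + 2 <= blocks; i += 2) {   /* two 64-byte blocks per pass */
            acc_a = _mm512_dpbusd_epi32(acc_a, _mm512_loadu_si512(buf + 64 * i), ones);
            acc_b = _mm512_dpbusd_epi32(acc_b, _mm512_loadu_si512(buf + 64 * (i + 1)), ones);
        }
        if (i < blocks)                     /* odd trailing block */
            acc_a = _mm512_dpbusd_epi32(acc_a, _mm512_loadu_si512(buf + 64 * i), ones);
        return _mm512_add_epi32(acc_a, acc_b);  /* recombine the running sums */
    }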
Additionally, we now call the smaller-width SIMD variants if the input
is too small and they happen to be compiled in. This helps for the
impossibly small input that is still at least one vector length long.
For size 16 and 32 inputs I was seeing something like sub-10 ns instead
of 50 ns.
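The dispatch shape looks roughly like this; the function names are hypothetical stand-ins for whichever narrower variants were actually built:

    #include <stddef.h>
    #include <stdint.h>

    uint32_t adler32_avx2(uint32_t adler, const uint8_t *buf, size_t len);
    uint32_t adler32_avx512_vnni_body(uint32_t adler, const uint8_t *buf, size_t len);

    uint32_t adler32_avx512_vnni(uint32_t adler, const uint8_t *buf, size_t len) {
        /* Below one 512-bit vector of data, the narrower variant wins:
         * roughly sub-10 ns instead of ~50 ns for 16- and 32-byte inputs. */
        if (len < 64)
            return adler32_avx2(adler, buf, len);
        return adler32_avx512_vnni_body(adler, buf, len);
    }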
Adam Stylinski [Sun, 9 Jan 2022 16:57:24 +0000 (11:57 -0500)]
Improved AVX2 adler32 performance
Did this by simply doing 32-bit horizontal sums and using the same
sum-of-absolute-differences instruction as the SSE4 and AVX512_VNNI
versions.
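A sketch of that pattern under AVX2, with illustrative names; it shows only the byte-sum half with the SAD trick plus the 32-bit horizontal reduction, and assumes blocks stays NMAX-scale so the 32-bit adds cannot carry into the zero lanes:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static uint32_t byte_sum_avx2(const uint8_t *buf, size_t blocks) {
        const __m256i zero = _mm256_setzero_si256();
        __m256i acc = zero;
        for (size_t i = 0; i < blocks; i++) {
            __m256i b = _mm256_loadu_si256((const __m256i *)(buf + 32 * i));
            acc = _mm256_add_epi32(acc, _mm256_sad_epu8(b, zero)); /* 4 x 64-bit byte sums */
        }
        /* 32-bit horizontal sum of the four partials. */
        __m128i lo = _mm256_castsi256_si128(acc);
        __m128i hi = _mm256_extracti128_si256(acc, 1);
        __m128i s  = _mm_add_epi32(lo, hi);
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        return (uint32_t)_mm_cvtsi128_si32(s);
    }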
Adam Stylinski [Tue, 4 Jan 2022 15:38:39 +0000 (10:38 -0500)]
Added an SSE4 optimized adler32 checksum
This variant uses the cheaper (in cycles) psadbw instruction in place
of pmaddubsw for the running sum that does not need multiplication.
This allows that sum to be done independently, partially overlapping the
running "sum2" half of the checksum. We also moved the shift outside of
the loop, breaking a small data dependency chain. The code also now does
a vectorized horizontal sum without having to rebase to the adler32 base,
as NMAX is defined as the maximum number of scalar sums that can be
performed before overflow, so we're actually safe in doing this without
upgrading to higher precision. We can do a partial horizontal sum
because psadbw only accumulates 16-bit words in 2 of the vector lanes;
the other two can safely be assumed to be 0.
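A sketch of one 16-byte step along those lines; structure and names are illustrative, and the intrinsics shown all sit within the routine's SSE4 baseline:

    #include <immintrin.h>
    #include <stdint.h>

    static void adler_step_sse(const uint8_t *buf, __m128i *vs1, __m128i *vs2,
                               __m128i *vs1_sums) {
        const __m128i zero = _mm_setzero_si128();
        const __m128i weights = _mm_setr_epi8(16, 15, 14, 13, 12, 11, 10, 9,
                                              8, 7, 6, 5, 4, 3, 2, 1);
        const __m128i ones16 = _mm_set1_epi16(1);
        __m128i bytes = _mm_loadu_si128((const __m128i *)buf);

        *vs1_sums = _mm_add_epi32(*vs1_sums, *vs1);              /* deferred *16          */
        *vs1 = _mm_add_epi32(*vs1, _mm_sad_epu8(bytes, zero));   /* psadbw: s1 byte sums  */
        __m128i w = _mm_maddubs_epi16(bytes, weights);           /* weighted, 8 x 16-bit  */
        *vs2 = _mm_add_epi32(*vs2, _mm_madd_epi16(w, ones16));   /* widen pairs to 32-bit */
    }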
Adam Stylinski [Fri, 7 Jan 2022 20:51:09 +0000 (15:51 -0500)]
Have functioning avx512{,_vnni} adler32
The new adler32 checksum uses the VNNI instructions with appreciable
gains when possible. Otherwise, a pure avx512f variant exists which
still gives appreciable gains.
Fix deflateBound and compressBound returning very small size estimates.
Remove workaround in switchlevels.c, so we do actual testing of this.
Use named defines instead of magic numbers where we can.
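A small check of the fixed behavior using the zlib-compatible API (sizes and level here are arbitrary): compressBound() must return a destination size that compress2() can always live within.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    int main(void) {
        static unsigned char src[1 << 16];
        memset(src, 'a', sizeof(src));

        uLong bound = compressBound(sizeof(src));   /* must be a safe upper bound */
        unsigned char *dst = malloc(bound);
        if (dst == NULL)
            return 1;

        uLongf dlen = bound;
        int err = compress2(dst, &dlen, src, sizeof(src), 1);
        printf("bound=%lu used=%lu err=%d\n",
               (unsigned long)bound, (unsigned long)dlen, err);
        free(dst);
        return err != Z_OK;
    }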