Michael Hirsch [Tue, 25 Jan 2022 00:22:01 +0000 (19:22 -0500)]
Intel compilers: update deprecated -wn to -Wall style
This removes warnings like the following on every target:
icx: command line warning #10430: Unsupported command line options encountered
These options as listed are not supported.
For more information, use '-qnextgen-diag'.
option list:
-w3
Signed-off-by: Michael Hirsch <michael@scivision.dev>
Adam Stylinski [Tue, 25 Jan 2022 05:16:37 +0000 (00:16 -0500)]
Make cmake and configure release flags consistent
CMake already appends -DNDEBUG to the preprocessor definitions when not
compiling with debug symbols. This turns off debug-level assertions and
has some other side effects. As such, we should append this define to
the configure script's CFLAGS as well.
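As a minimal illustration (not repository code) of what the define changes, assert() compiles away entirely once NDEBUG is defined:
```
/* With -DNDEBUG on the command line, assert() expands to nothing,
 * so the check below disappears from release builds entirely. */
#include <assert.h>
#include <stdio.h>

int main(void) {
    int window_bits = 15;
    assert(window_bits >= 8 && window_bits <= 15); /* compiled out under -DNDEBUG */
    printf("windowBits = %d\n", window_bits);
    return 0;
}
```
Building with `cc -O2 -DNDEBUG` strips the check; building without it keeps the runtime abort.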
Adam Stylinski [Tue, 18 Jan 2022 14:47:45 +0000 (09:47 -0500)]
Remove the "avx512_well_suited" cpu flag
Now that we have confirmation that the AVX512 variants have been faster
on every capable CPU we've tested them on, there's no sense in trying to
maintain a whitelist.
Adam Stylinski [Mon, 17 Jan 2022 14:27:32 +0000 (09:27 -0500)]
Improvements to avx512 adler32 implementations
Now that better benchmarks are in place, it became apparent that masked
broadcast was _not_ faster; it's actually faster to use vmovd, as
suspected. Additionally, for the VNNI variant, we've unlocked some
additional ILP by doing a second dot product in the loop into a different
running sum that gets recombined later. This broke a data dependency
chain and allowed the IPC to reach ~2.75. The result is about a 40-50%
improvement in runtime.
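A plain-C sketch of the ILP trick, illustrative only (the real code uses AVX512/VNNI intrinsics): two independent accumulators break the serial dependency chain and get recombined after the loop:
```
#include <stddef.h>
#include <stdint.h>

uint64_t sum_two_chains(const uint32_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0;    /* two independent running sums */
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += v[i];             /* chain 1 */
        s1 += v[i + 1];         /* chain 2, no dependency on s0 */
    }
    if (i < n)
        s0 += v[i];             /* odd-length tail */
    return s0 + s1;             /* recombine once, after the loop */
}
```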
Additionally, we now call the smaller SIMD variants if the input is too
small for the wide version and they happen to be compiled in. This helps
for tiny inputs that are still at least one vector length long. For size
16 and 32 inputs I was seeing something like sub-10 ns instead of 50 ns.
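A hypothetical sketch of that dispatch; the function names, the scalar stand-in, and the 64-byte threshold are illustrative, not zlib-ng's actual symbols:
```
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u

/* Narrow fallback: the plain scalar checksum, standing in for the
 * smaller SIMD variants the real code would dispatch to. */
static uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len) {
    uint32_t s1 = adler & 0xffff, s2 = adler >> 16;
    while (len--) {
        s1 = (s1 + *buf++) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

/* Skip the wide path when the input barely fills one 64-byte zmm
 * vector, avoiding its setup cost on tiny buffers. */
uint32_t adler32_wide(uint32_t adler, const unsigned char *buf, size_t len) {
    if (len < 64)
        return adler32_scalar(adler, buf, len);
    /* ... wide AVX512 main loop would go here ... */
    return adler32_scalar(adler, buf, len); /* placeholder for the sketch */
}
```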
Adam Stylinski [Sun, 9 Jan 2022 16:57:24 +0000 (11:57 -0500)]
Improved AVX2 adler32 performance
This was done by simply doing 32-bit horizontal sums and using the same
sum-of-absolute-differences instruction as in the SSE4 and AVX512_VNNI
versions.
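A self-contained example of the building block, assuming an AVX2 toolchain (compile with -mavx2): vpsadbw against a zero vector sums each group of 8 bytes into a 64-bit lane:
```
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned char buf[32];
    for (int i = 0; i < 32; i++) buf[i] = (unsigned char)i;

    __m256i v   = _mm256_loadu_si256((const __m256i *)buf);
    __m256i sad = _mm256_sad_epu8(v, _mm256_setzero_si256()); /* 4 x 64-bit partial sums */

    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, sad);
    /* 0 + 1 + ... + 31 = 496 */
    printf("sum = %llu\n", (unsigned long long)(lanes[0] + lanes[1] + lanes[2] + lanes[3]));
    return 0;
}
```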
Adam Stylinski [Tue, 4 Jan 2022 15:38:39 +0000 (10:38 -0500)]
Added an SSE4 optimized adler32 checksum
This variant uses the lower-cycle-count psadbw instruction in place
of pmaddubsw for the running sum that does not need multiplication.
This allows that sum to be done independently, partially overlapping the
running "sum2" half of the checksum. We also moved the shift
outside of the loop, breaking a small data dependency chain. The code
also now does a vectorized horizontal sum without having to rebase to
the adler32 base: since NMAX is defined as the maximum number of scalar
sums that can be performed without overflow, we're actually safe doing
this without upgrading to higher precision. We can do a partial horizontal
sum because psadbw only accumulates 16-bit sums in 2 of the four 32-bit
vector lanes; the other two can safely be assumed to be 0.
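A minimal demonstration of that lane behavior (compile with -msse2); the all-ones buffer is arbitrary:
```
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned char buf[16];
    for (int i = 0; i < 16; i++) buf[i] = 1;   /* each 8-byte half sums to 8 */

    __m128i v   = _mm_loadu_si128((const __m128i *)buf);
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128()); /* psadbw vs. zero */

    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, sad);
    /* Viewed as four 32-bit lanes: 8 0 8 0 -- lanes 1 and 3 are always zero */
    printf("%u %u %u %u\n", lanes[0], lanes[1], lanes[2], lanes[3]);
    return 0;
}
```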
Adam Stylinski [Fri, 7 Jan 2022 20:51:09 +0000 (15:51 -0500)]
Have functioning avx512{,_vnni} adler32
The new adler32 checksum uses the VNNI instructions with appreciable
gains when possible. Otherwise, a pure avx512f variant exists which
still gives appreciable gains.
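A tiny sketch of the VNNI building block, assuming -mavx512bw -mavx512vnni and a capable CPU: vpdpbusd multiply-accumulates unsigned bytes against signed bytes into 32-bit sums:
```
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned char buf[64];
    for (int i = 0; i < 64; i++) buf[i] = (unsigned char)(i + 1);

    __m512i data = _mm512_loadu_si512(buf);
    __m512i ones = _mm512_set1_epi8(1);
    /* Dot product against all-ones: sum of all 64 bytes, spread
     * across sixteen 32-bit lanes. */
    __m512i acc  = _mm512_dpbusd_epi32(_mm512_setzero_si512(), data, ones);

    printf("sum = %d\n", _mm512_reduce_add_epi32(acc)); /* 1+2+...+64 = 2080 */
    return 0;
}
```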
Fix deflateBound and compressBound returning very small size estimates.
Remove workaround in switchlevels.c, so we do actual testing of this.
Use named defines instead of magic numbers where we can.
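A usage sketch of the repaired contract, using only the standard zlib API (buffer size and level are arbitrary): allocating deflateBound() bytes must suffice for a one-shot deflate even of incompressible input:
```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    unsigned char src[4096];
    memset(src, 0xA5, sizeof(src));

    z_stream strm = {0};
    if (deflateInit(&strm, Z_BEST_SPEED) != Z_OK) return 1;

    uLong bound = deflateBound(&strm, sizeof(src)); /* worst-case output size */
    unsigned char *dst = malloc(bound);
    if (!dst) return 1;

    strm.next_in  = src;  strm.avail_in  = sizeof(src);
    strm.next_out = dst;  strm.avail_out = (uInt)bound;
    int ret = deflate(&strm, Z_FINISH);   /* must reach Z_STREAM_END */
    printf("ret=%d, compressed %lu -> %lu bytes\n", ret, (uLong)sizeof(src), strm.total_out);

    deflateEnd(&strm);
    free(dst);
    return ret == Z_STREAM_END ? 0 : 1;
}
```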
Adam Stylinski [Thu, 2 Dec 2021 22:05:55 +0000 (17:05 -0500)]
Made this work on 32 bit compilations
For some reason the movq instruction from a 128 bit register to a 64 bit
GPR is not supported in 32 bit code. A simple workaround seems to be to
invoke movl if compiling with -m32.
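A sketch of the issue and the workaround; the helper name low64 is ours, and the #else branch shows the two 32-bit transfers a -m32 build must use instead:
```
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t low64(__m128i v) {
#if defined(__x86_64__) || defined(_M_X64)
    return (uint64_t)_mm_cvtsi128_si64(v);            /* movq xmm -> r64 */
#else
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(v);     /* movd/movl, low half */
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4)); /* high half */
    return ((uint64_t)hi << 32) | lo;
#endif
}

int main(void) {
    __m128i v = _mm_set_epi32(0, 0, 0x12345678, (int)0x9abcdef0);
    printf("%llx\n", (unsigned long long)low64(v)); /* 123456789abcdef0 */
    return 0;
}
```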
Adam Stylinski [Sun, 24 Oct 2021 23:24:53 +0000 (19:24 -0400)]
Minor efficiency improvement
This now leverages the broadcasting intrinsics with an AND mask
to load up the registers. Additionally, there's a minor efficiency
boost here by casting up to 64 bit precision (by means of register
aliasing) so that the modulo can be safely deferred until the write
back to the full sums.
The "write" back to the stack here is actually optimized out by GCC
and turned into a write directly to a 32 bit GPR for each of the 8
elements. This much is not new, but now, since we don't have to do a
modulus with the BASE value, we can bypass 8 64 bit multiplications,
shifts, and subtractions while in those registers.
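A plain-C sketch of the deferral (the real code does this in vector registers): at 64-bit precision a single modulo at write-back is safe, so the per-element reduction disappears:
```
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u

/* Each 64-bit accumulator can absorb a full block of 32-bit lane values
 * without overflowing, so the modulo moves out of the loop entirely. */
uint32_t sum_deferred_mod(const uint32_t *lanes, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += lanes[i];            /* no per-element mul/shift/sub reduction */
    return (uint32_t)(acc % BASE);  /* single modulo at write-back */
}
```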
I tried to do a horizontal reduction sum on the 8 64-bit elements, since
the vpextract* set of instructions aren't exactly low latency. However,
doing this safely (no overflow) requires 2 128-bit register extractions,
8 vpmovsxdq instructions to bring things up to 64-bit precision, some
shuffles, more 128-bit extractions to get around the 128-bit lane
requirement of the shuffles, and finally a trip to a GPR and back to do
the modulus on the scalar value. This method could have been more efficient
if there were an inexpensive 64-bit horizontal addition instruction for
AVX, but there isn't.
To test this, I wrote a pretty basic benchmark using Python's zlib bindings
on a huge set of random data, carefully timing only the checksum bits.
Invoking perf stat from within the Python process after the random data is
generated shows a lower average number of cycles to complete and a shorter
runtime.
Adam Stylinski [Sat, 23 Oct 2021 16:38:12 +0000 (12:38 -0400)]
Use immediate variant of shift instruction
Since the shift count is constant anyway, we may as well use the variant
that doesn't add vector register pressure, has better ILP opportunities,
and has shorter instruction latency.
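A side-by-side sketch of the two forms, assuming AVX2 (compile with -mavx2); the count of 5 is arbitrary:
```
#include <immintrin.h>

__m256i shift_variable(__m256i v) {
    __m128i count = _mm_cvtsi32_si128(5);  /* count occupies an xmm register */
    return _mm256_srl_epi32(v, count);     /* vpsrld ymm, ymm, xmm */
}

__m256i shift_immediate(__m256i v) {
    return _mm256_srli_epi32(v, 5);        /* vpsrld ymm, ymm, imm8 */
}
```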
Ilya Leoshkevich [Mon, 25 Oct 2021 22:50:26 +0000 (18:50 -0400)]
DFLTCC update for window optimization from Jim & Nathan
Stop relying on software and hardware inflate window formats being the
same and act the way we already do for deflate: provide and implement
window-related hooks.
Another possibility would be to use an in-line history buffer (by not
setting HBT_CIRCULAR), but this would require an extra memmove().
Also fix a couple corner cases in the software implementation of
inflateGetDictionary() and inflateSetDictionary().
Jim Kukunas [Wed, 30 Jun 2021 23:36:08 +0000 (19:36 -0400)]
Reorganize inflate window layout
This commit significantly improves inflate performance by reorganizing the window buffer into a contiguous window and pending output buffer. The goal of this layout is to reduce branching, improve cache locality, and enable the use of crc folding with gzip input.
The window buffer is allocated as a multiple of the user-selected window size. In this commit, a factor of 2 is utilized.
The layout of the window buffer is divided into two sections. The first section, window offset [0, wsize), is reserved for history that has already been output. The second section, window offset [wsize, 2 * wsize), is reserved for buffering pending output that hasn't been flushed to the user's output buffer yet.
The history section grows downwards, towards the window offset of 0. The pending output section grows upwards, towards the end of the buffer. As a result, all of the possible distance/length data that may need to be copied is contiguous. This removes the need to stitch together output from 2 separate buffers.
In the case of gzip input, crc folding is used to copy the pending output to the user's buffers.
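An illustrative sketch of that invariant; the struct and field names are hypothetical, not the actual inflate state:
```
#include <stdlib.h>

struct inf_window {
    unsigned char *buf;   /* 2 * wsize bytes */
    size_t wsize;         /* user-selected window size */
    size_t have;          /* valid history bytes, ending at offset wsize */
    size_t pending;       /* pending output bytes, starting at offset wsize */
};

int inf_window_init(struct inf_window *w, size_t wsize) {
    w->buf = malloc(2 * wsize);   /* [0, wsize): history, grows down;
                                     [wsize, 2*wsize): pending output, grows up */
    if (w->buf == NULL) return -1;
    w->wsize = wsize;
    w->have = 0;
    w->pending = 0;
    return 0;
}

/* A back-reference `dist` bytes behind the write head is one contiguous
 * read: no stitching of output from two separate buffers. */
const unsigned char *inf_window_ref(const struct inf_window *w, size_t dist) {
    return w->buf + w->wsize + w->pending - dist;
}
```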
IBM Z: Run DFLTCC tests on the self-hosted builder
* Use the self-hosted builder instead of ubuntu-latest.
* Drop qemu-related settings from DFLTCC configurations.
* Install codecov only for the current user, since the self-hosted
builder runs under a restricted non-root account.
* Use actions/checkout@v2 for configure checks, since for some reason
actions/checkout@v1 cannot find git on the self-hosted builder.
* Update the testing section of the DFLTCC README.
* Add the infrastructure code for the self-hosted builder.
ENH: Transition to Ubuntu 18.04 in `GitHub` actions workflows
Fixes:
```
Ubuntu 16.04 Clang
This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.
Ubuntu 16.04 GCC
This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.
```
reported for example at:
https://github.com/zlib-ng/zlib-ng/actions/runs/1326434358
Official `GitHub` notice related to the removal of the 16.04 virtual
environments:
https://github.blog/changelog/2021-04-29-github-actions-ubuntu-16-04-lts-virtual-environment-will-be-removed-on-september-20-2021/
Fixes:
```
itkzlib-ng/inflate.c(1209,24): warning C4267: '=': conversion from 'size_t' to 'unsigned long', possible loss of data
itkzlib-ng/inflate.c(1210,26): warning C4267: '=': conversion from 'size_t' to 'unsigned long', possible loss of data
```
Ilya Leoshkevich [Mon, 11 Oct 2021 10:24:20 +0000 (12:24 +0200)]
IBM Z: Adjust compressBound() for DFLTCC
When DFLTCC was introduced, deflateBound() was adjusted, but
compressBound() was not, leading to compression failures when using
compressBound() + compress() with poorly compressible data.
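A usage sketch of the failing pattern, using only the standard zlib API (rand() stands in for poorly compressible data); the fix makes this succeed again:
```
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void) {
    uLong srcLen = 1 << 20;
    unsigned char *src = malloc(srcLen);
    if (!src) return 1;
    for (uLong i = 0; i < srcLen; i++)
        src[i] = (unsigned char)rand();      /* poorly compressible input */

    uLong dstLen = compressBound(srcLen);    /* worst-case output size */
    unsigned char *dst = malloc(dstLen);
    if (!dst) return 1;

    int ret = compress(dst, &dstLen, src, srcLen);
    printf("ret=%d (%s), %lu -> %lu bytes\n", ret,
           ret == Z_OK ? "Z_OK" : "error", srcLen, dstLen);

    free(src);
    free(dst);
    return ret == Z_OK ? 0 : 1;
}
```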
Ilya Leoshkevich [Mon, 11 Oct 2021 11:47:20 +0000 (13:47 +0200)]
IBM Z: Do not check inflateGetDictionary() with DFLTCC
The zlib manual does not specify a strict contract for
inflateGetDictionary(); it merely says that it "Returns the sliding
dictionary being maintained by inflate", which is an implementation
detail. IBM Z inflate's behavior differs from that of software, and
may change in the future to boot.
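For reference, a caller-side sketch against the standard prototype; the helper is hypothetical:
```
#include <zlib.h>

/* dict must have room for a full 32 KiB window. Returns the number of
 * bytes written; the contents and length are implementation details
 * of whichever inflater is in use. */
unsigned int get_window(z_stream *strm, unsigned char *dict) {
    unsigned int len = 0;
    if (inflateGetDictionary(strm, dict, &len) != Z_OK)
        return 0;
    return len;
}
```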
Ilya Leoshkevich [Mon, 11 Oct 2021 11:12:42 +0000 (13:12 +0200)]
IBM Z: Fix building outside of a source directory
Do not use relative includes, since they are valid only within the
source directory. Rely on the build system to pass the necessary
include flags instead.