Jim Kukunas [Wed, 30 Jun 2021 23:36:08 +0000 (19:36 -0400)]
Reorganize inflate window layout
This commit significantly improves inflate performance by reorganizing the window buffer into a contiguous window and pending output buffer. The goal of this layout is to reduce branching, improve cache locality, and enable the use of crc folding with gzip input.
The window buffer is allocated as a multiple of the user-selected window size. In this commit, a factor of 2 is used.
The layout of the window buffer is divided into two sections. The first section, window offset [0, wsize), is reserved for history that has already been output. The second section, window offset [wsize, 2 * wsize), is reserved for buffering pending output that hasn't been flushed to the user's output buffer yet.
The history section grows downwards, towards the window offset of 0. The pending output section grows upwards, towards the end of the buffer. As a result, all of the possible distance/length data that may need to be copied is contiguous. This removes the need to stitch together output from 2 separate buffers.
In the case of gzip input, crc folding is used to copy the pending output to the user's buffers.
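The layout described above can be illustrated with a small model. This is only a sketch under stated assumptions: the names (toy_window, WSIZE, toy_copy_match, toy_flush) are hypothetical and are not taken from the zlib-ng sources, the window is tiny for readability, and crc folding is left out. It shows why a distance/length copy never has to be stitched from two buffers: history ends where pending output begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified model of the layout; not zlib-ng's actual code. */
#define WSIZE 8u /* tiny window so the example is easy to follow */

typedef struct {
    uint8_t buf[2 * WSIZE]; /* [0, WSIZE) history, [WSIZE, 2*WSIZE) pending */
    size_t hist_len;        /* history bytes, stored in [WSIZE - hist_len, WSIZE) */
    size_t pend_len;        /* pending bytes, stored in [WSIZE, WSIZE + pend_len) */
} toy_window;

static void toy_init(toy_window *w) {
    memset(w, 0, sizeof *w);
}

/* Append one literal byte to the pending output section. */
static int toy_put_literal(toy_window *w, uint8_t c) {
    if (w->pend_len == WSIZE) return -1; /* pending section is full */
    w->buf[WSIZE + w->pend_len++] = c;
    return 0;
}

/* Copy a match. Because history sits directly below the pending output,
 * the source is always one contiguous range ending at the write position. */
static int toy_copy_match(toy_window *w, size_t dist, size_t len) {
    if (dist == 0 || dist > w->hist_len + w->pend_len) return -1;
    if (len > WSIZE - w->pend_len) return -1;
    uint8_t *dst = w->buf + WSIZE + w->pend_len;
    const uint8_t *src = dst - dist;
    for (size_t i = 0; i < len; i++) dst[i] = src[i]; /* overlap is allowed */
    w->pend_len += len;
    return 0;
}

/* Flush pending output to the caller and slide the newest bytes into the
 * history section so that history again ends at offset WSIZE. */
static size_t toy_flush(toy_window *w, uint8_t *out) {
    size_t n = w->pend_len;
    memcpy(out, w->buf + WSIZE, n);
    size_t total = w->hist_len + n;
    size_t keep = total < WSIZE ? total : WSIZE;
    memmove(w->buf + WSIZE - keep, w->buf + WSIZE + n - keep, keep);
    w->hist_len = keep;
    w->pend_len = 0;
    return n;
}
```

In this model, a match whose distance reaches back into already-flushed history is served by the same loop as one that references only pending bytes.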
IBM Z: Run DFLTCC tests on the self-hosted builder
* Use the self-hosted builder instead of ubuntu-latest.
* Drop qemu-related settings from DFLTCC configurations.
* Install codecov only for the current user, since the self-hosted
builder runs under a restricted non-root account.
* Use actions/checkout@v2 for configure checks, since for some reason
actions/checkout@v1 cannot find git on the self-hosted builder.
* Update the testing section of the DFLTCC README.
* Add the infrastructure code for the self-hosted builder.
ENH: Transition to Ubuntu 18.04 in `GitHub` actions workflows
Transition to Ubuntu 18.04 in `GitHub` actions workflows.
Fixes:
```
Ubuntu 16.04 Clang
This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.
Ubuntu 16.04 GCC
This request was automatically failed because there were no enabled runners online to process the request for more than 1 days.
```
reported for example at:
https://github.com/zlib-ng/zlib-ng/actions/runs/1326434358
Official `GitHub` notice related to the removal of the 16.04 virtual
environments:
https://github.blog/changelog/2021-04-29-github-actions-ubuntu-16-04-lts-virtual-environment-will-be-removed-on-september-20-2021/
Fixes:
```
itkzlib-ng/inflate.c(1209,24): warning C4267: '=': conversion from 'size_t' to 'unsigned long', possible loss of data
itkzlib-ng/inflate.c(1210,26): warning C4267: '=': conversion from 'size_t' to 'unsigned long', possible loss of data
```
Ilya Leoshkevich [Mon, 11 Oct 2021 10:24:20 +0000 (12:24 +0200)]
IBM Z: Adjust compressBound() for DFLTCC
When DFLTCC was introduced, deflateBound() was adjusted, but
compressBound() was not, leading to compression failures when using
compressBound() + compress() with poorly compressible data.
Ilya Leoshkevich [Mon, 11 Oct 2021 11:47:20 +0000 (13:47 +0200)]
IBM Z: Do not check inflateGetDictionary() with DFLTCC
The zlib manual does not specify a strict contract for
inflateGetDictionary(); it merely says that it "Returns the sliding
dictionary being maintained by inflate", which is an implementation
detail. IBM Z inflate's behavior differs from that of software, and
may change in the future to boot.
Ilya Leoshkevich [Mon, 11 Oct 2021 11:12:42 +0000 (13:12 +0200)]
IBM Z: Fix building outside of a source directory
Do not use relative includes, since they are valid only within the
source directory. Rely on the build system to pass the necessary
include flags instead.
Matheus Castanho [Wed, 16 Jun 2021 17:36:24 +0000 (14:36 -0300)]
Add optimized crc32 for POWER8 and later processors
This commit adds an optimized version of the crc32 function based
on crc32-vpmsum from https://github.com/antonblanchard/crc32-vpmsum/ .
The code has been relicensed to the zlib license.
This is the C implementation created by Rogerio Alves <rogealve@br.ibm.com>.
It makes use of vector instructions to speed up the CRC32 algorithm. Decompression
times improved by +30% in tests.
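For reference, any optimized crc32 has to produce the same values as a plain bit-at-a-time computation. The following is a minimal sketch of such a reference, assuming the standard reflected CRC-32 polynomial 0xEDB88320 used by zlib and gzip. It is not the vpmsum code from this commit, and the name crc32_bitwise is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow, but handy as a
 * reference when checking a vectorized implementation. The running crc is
 * inverted on entry and exit, so calls can be chained like zlib's crc32(). */
static uint32_t crc32_bitwise(uint32_t crc, const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++) {
            /* If the low bit is set, shift and xor in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

A vectorized implementation can then be checked by comparing its output with this function over buffers of varying length and alignment.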
Based on Daniel Black's work for the original zlib (madler/zlib#478).
Mika Lindqvist [Wed, 21 Jul 2021 16:26:43 +0000 (19:26 +0300)]
[arm] Disable ACLE, UNALIGNED_OK and UNALIGNED64_OK on armv7 and earlier.
* armv7 has partial support for unaligned reads, but the compiler might use instructions that do not support unaligned accesses
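With the unaligned-access shortcuts disabled, a portable way to read a multi-byte value from an arbitrary address is to go through memcpy and let the compiler choose instructions that are valid for the target. A minimal sketch follows; the helper name load_u32 is hypothetical and not taken from the zlib-ng sources.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly unaligned address. The memcpy makes
 * no alignment assumption, unlike dereferencing a cast uint32_t pointer. */
static uint32_t load_u32(const void *p) {
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

On targets that do support unaligned loads, compilers commonly reduce this to a single load instruction, so the portable form usually costs nothing there.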
Mika Lindqvist [Tue, 22 Jun 2021 19:19:13 +0000 (22:19 +0300)]
[PowerPC] Use templatized code for slide_hash as code for VMX and VSX is very similar
* Any differences can be handled using compiler options or added as macros before including the template header
While DFLTCC takes care of accelerating compression on level 1, other
levels can be sped up too by computing CRC32 using various vector
instructions.
Take the Linux kernel assembly code that does that - its original
author (Hendrik Brueckner) works for IBM at the time of writing and has
allowed reusing the code under the zlib license. Rewrite it in C for
better maintainability, but keep the original structure, variable names
and comments.
Mika Lindqvist [Fri, 18 Jun 2021 21:10:44 +0000 (00:10 +0300)]
[chunkset_neon] Use vdupq_n_u64.
* Using vdupq_n_u64 duplicates the unsigned 64-bit integer to two consecutive aligned memory locations on the stack so the compiler can use wider load instructions.
All different-sized general-purpose registers overlay on ARM/AArch64, so any vector cast is a no-op in assembly.
Mika Lindqvist [Fri, 18 Jun 2021 20:15:28 +0000 (23:15 +0300)]
[chunkset_neon] Don't use signed vector types.
* There is no need to convert between unsigned and signed vector types. All relevant intrinsics have versions for all unsigned vector types.
Must use safe chunk copies because inflateBack uses the same allocation for output and window. In this case, if too many bytes are written, matches with distances close to the window size will not be written correctly.
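The idea of a safe chunk copy can be sketched as follows: whole chunks are used only while a full chunk still fits in the remaining length, and the tail is copied byte by byte, so nothing is written past the requested length. This is an illustrative sketch, not the zlib-ng implementation; the name safe_chunk_copy and the 8-byte CHUNK size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 8u /* assumed chunk width for this sketch */

/* Copy len bytes of a match located dist bytes behind out.
 * Never writes beyond out + len. Returns the new output position. */
static uint8_t *safe_chunk_copy(uint8_t *out, size_t dist, size_t len) {
    assert(dist > 0);
    const uint8_t *from = out - dist;
    /* Chunked copies only when a full chunk fits in the remaining length
     * and the source and destination chunks cannot overlap. */
    while (len >= CHUNK && dist >= CHUNK) {
        memcpy(out, from, CHUNK);
        out += CHUNK;
        from += CHUNK;
        len -= CHUNK;
    }
    /* Tail, and overlapping matches, are copied one byte at a time. */
    for (; len > 0; len--) *out++ = *from++;
    return out;
}
```

A faster variant would round the last copy up to a full chunk, which is exactly the over-write that becomes a problem when output and window share one allocation.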
Rebalance levels 1-4.
- Deflate_quick (level 1), no longer limit window, improves compression.
- Deflate_medium, don't check next position for levels below 5.
- Use deflate_medium instead of deflate_fast for level 3.
- Tweak level 4 to give a more predictable speed/compression tradeoff curve.