nmlgc [Tue, 20 Jun 2017 20:12:42 +0000 (22:12 +0200)]
configure: For Windows builds, add the CROSS_PREFIX to $RC and $STRIP.
zlib's original win32/Makefile.gcc did the same, but this was removed in 7d17132436431d5f62cf5089623073d72d07deb0. It is kind of essential for
cross-compiling a Win32 build on Linux, since `windres` most certainly
doesn't exist, and the regular `strip` may not be able to handle DLLs.
It should probably actually be something like
RC="${RC-${CROSS_PREFIX}windres}"
and
STRIP="${STRIP-${CROSS_PREFIX}strip}"
to be consistent with the assignments of $AR, $RANLIB and $NM, but this
didn't work for some reason.
R.J.V. Bertin [Tue, 23 May 2017 17:46:55 +0000 (19:46 +0200)]
ZLIB_COMPAT: add an extra 32 bits of padding in z_stream
zlib "stock" uses an "uLong" for zstream::adler, meaning 4 bytes in 64
bit bits. The padding makes zlib-ng a drop-in replacement for libz; without,
the deflateInit2_() function returns a version error when called from
dependents that were built against "stock" zlib.
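For context, the "version error" comes from the size check that stock zlib performs at init time: the deflateInit2() macro passes the caller's sizeof(z_stream), and the library compares it with its own. A hedged sketch of a caller hitting that check (the comment paraphrases zlib's deflate.c; the function name is illustrative):

    #include <stdio.h>
    #include <zlib.h>

    int main(void) {
        z_stream strm = {0};
        /* deflateInit2() expands to deflateInit2_(..., ZLIB_VERSION, (int)sizeof(z_stream));
         * inside the library the check is essentially:
         *     if (stream_size != (int)sizeof(z_stream)) return Z_VERSION_ERROR;
         * so a caller built against a z_stream of a different size gets
         * Z_VERSION_ERROR even though the version strings match. */
        int ret = deflateInit2(&strm, 6, Z_DEFLATED, 15, 8, Z_DEFAULT_STRATEGY);
        printf("deflateInit2: %d\n", ret);
        if (ret == Z_OK)
            deflateEnd(&strm);
        return 0;
    }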
R.J.V. Bertin [Tue, 23 May 2017 17:32:53 +0000 (19:32 +0200)]
various CMake fixes:
- on Mac, builds can target 1 or more architectures that are not the host
architecture. Pick the first from the list and ignore the others.
A more complete implementation would warn if i386 and x86_64 builds are
mixed via the compiler options.
- use CMake's compiler IDs to detect GCC and Clang (should be applied to
icc too but I can't test)
- disable PCLMUL optimisation in 32-bit Mac builds. It crashes and provides
very little gain (and such builds are probably increasingly rare anyway)
Mat [Mon, 29 May 2017 09:06:26 +0000 (11:06 +0200)]
Fix: wrong register for BMI1 bit (#112)
The BMI1 bit is in the ebx register and not in ecx.
See reference: https://software.intel.com/sites/default/files/article/405250/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family.pdf
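For reference, a hedged sketch of the check (GCC/Clang on x86 only; not zlib-ng's actual detection code): CPUID leaf 7, sub-leaf 0 reports BMI1 in bit 3 of EBX, so the result has to be read from EBX rather than ECX.

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* CPUID leaf 7, sub-leaf 0: structured extended feature flags. */
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        /* BMI1 is bit 3 of EBX (not ECX). */
        printf("BMI1: %s\n", (ebx & (1u << 3)) ? "yes" : "no");
        return 0;
    }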
Mika Lindqvist [Wed, 3 May 2017 17:14:57 +0000 (20:14 +0300)]
Lazily initialize functable members. (#108)
- Split the functableInit() function into separate functions for each functable member, so we don't need to initialize the full functable in multiple places in the zlib-ng code, or check for NULL on every invocation.
- The optimized function for each functable member is detected on the first invocation, and the functable entry is updated for subsequent invocations.
- Remove the NULL check in adler32() and adler32_z() as it is no longer needed.
- Add adler32 to the functable.
- Add missing call to functableInit from inflateInit.
- Fix external direct calls to the adler32 functions made without calling functableInit.
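A hedged sketch of the lazy-init pattern described above (names, signatures and the placeholder bodies are illustrative, not zlib-ng's actual functable): the table entry starts out pointing at a stub that detects the best implementation, rewrites the entry, and forwards the call, so every later invocation is a plain indirect call with no NULL check.

    #include <stddef.h>
    #include <stdint.h>

    /* Candidate implementations; real bodies live elsewhere in zlib-ng. */
    static uint32_t adler32_c(uint32_t adler, const unsigned char *buf, size_t len) {
        (void)buf; (void)len; return adler;               /* placeholder body */
    }
    static uint32_t adler32_neon(uint32_t adler, const unsigned char *buf, size_t len) {
        (void)buf; (void)len; return adler;               /* placeholder body */
    }
    static int cpu_has_neon(void) { return 0; }           /* runtime detection stub */

    static uint32_t adler32_stub(uint32_t adler, const unsigned char *buf, size_t len);

    /* One functable member; it initially points at the stub. */
    static struct {
        uint32_t (*adler32)(uint32_t, const unsigned char *, size_t);
    } functable = { adler32_stub };

    /* First call only: pick an implementation, patch the table, forward the call. */
    static uint32_t adler32_stub(uint32_t adler, const unsigned char *buf, size_t len) {
        functable.adler32 = cpu_has_neon() ? adler32_neon : adler32_c;
        return functable.adler32(adler, buf, len);
    }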
Mika Lindqvist [Mon, 24 Apr 2017 09:22:11 +0000 (12:22 +0300)]
ARM optimizations part 2 (#107)
* add adler32_neon to main dependency checking and ARM/Windows Makefile
* split non-optimized adler32 into adler32_c so we can test/compare both without recompiling.
* add detection of the default floating point ABI in gcc
NOTE: This should avoid a build error when gcc supports both ABIs but the header for just one ABI is installed.
Add a struct func_table and function functableInit.
The struct contains pointers to select functions to be used by the
rest of zlib, and the init function selects what functions will be
used depending on which optimizations have been compiled in and what
instruction sets are available at runtime.
Tests done on a Haswell CPU running minigzip -6 compression of a
40M file show a 2.5% decrease in branches and a 25-30% reduction
in iTLB-loads. The reduction in iTLB-loads is likely mostly due to
the inability to inline functions. This also causes a slight
performance regression of around 1%, but it is probably still worth it
to make it much easier to implement new optimized functions for
various architectures and instruction sets.
The performance penalty will get smaller for functions that get more
alternative implementations to choose from, since there is no need
to add more branches to every call of the function.
Today insert_string has one branch to choose between insert_string_sse
and insert_string_c, but if we also added, for example, insert_string_sse4,
that would need another branch, and it would probably hinder effective
inlining at some point too.
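To illustrate the branch count being discussed, a hedged sketch with simplified types and placeholder bodies (not the real zlib-ng declarations): per-call dispatch needs one test per variant on every invocation, while the functable keeps a single indirect call whose target functableInit() chooses once.

    #include <stdbool.h>

    typedef struct deflate_state deflate_state;        /* opaque here */

    /* Variant implementations; placeholder bodies for the sketch. */
    static void insert_string_c(deflate_state *s, unsigned str)   { (void)s; (void)str; }
    static void insert_string_sse(deflate_state *s, unsigned str) { (void)s; (void)str; }

    static bool x86_cpu_has_sse42;                     /* filled in by CPU detection */

    /* Per-call dispatch: one branch today, another for each new variant added. */
    static void insert_string_branchy(deflate_state *s, unsigned str) {
        if (x86_cpu_has_sse42)
            insert_string_sse(s, str);
        else
            insert_string_c(s, str);
    }

    /* Functable dispatch: the decision is made once, at init time. */
    static struct func_table {
        void (*insert_string)(deflate_state *s, unsigned str);
    } functable;

    static void functableInit(void) {
        functable.insert_string = x86_cpu_has_sse42 ? insert_string_sse : insert_string_c;
    }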
The checksum is calculated on the uncompressed PNG data and can be
made much faster by using SIMD. Tests on ARMv8 yielded an improvement
of about 3x (e.g. walltime went from 350 ms to 125 ms for a 4096x4096-byte
buffer processed 30 times).
This yields an improvement in image decoding in Chromium around 18%
(see https://bugs.chromium.org/p/chromium/issues/detail?id=688601).
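For reference, a hedged sketch of the scalar checksum being accelerated (the NEON version referenced here computes the same two sums with vector arithmetic and defers the expensive modulo): Adler-32 keeps a running byte sum and a sum-of-sums, both modulo 65521, over the uncompressed data, which is what makes it amenable to SIMD.

    #include <stddef.h>
    #include <stdint.h>

    #define ADLER_MOD 65521u   /* largest prime below 2^16 */

    /* Straightforward scalar Adler-32; optimized code batches bytes and
     * reduces modulo ADLER_MOD only when the sums are about to overflow. */
    static uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len) {
        uint32_t a = adler & 0xffff;          /* running byte sum */
        uint32_t b = (adler >> 16) & 0xffff;  /* running sum of 'a' */
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % ADLER_MOD;
            b = (b + a) % ADLER_MOD;
        }
        return (b << 16) | a;
    }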
Sebastian Pop [Thu, 16 Mar 2017 15:43:36 +0000 (10:43 -0500)]
inflate: improve performance of memory copy operations
When memory copy operations happen byte by byte, the processors are unable to
fuse the loads and stores together because of aliasing issues. This patch
clusters some of the memory copy operations in chunks of 16 and 8 bytes.
For byte memset, the compiler knows how to prepare the chunk to be stored.
When the memset pattern is larger than a byte, this patch builds the pattern for
chunk memset using the same technique as in Simon Hosie's patch
https://codereview.chromium.org/2722063002
This patch improves the performance of zlib decompression of a 50K PNG by 50%
on aarch64-linux and x86_64-linux when compiled with gcc-7 or llvm-5.
The number of executed instructions reported by valgrind --tool=cachegrind
on the decompression of a 50K PNG file on aarch64-linux:
- before the patch:
I refs: 3,783,757,451
D refs: 1,574,572,882 (869,116,630 rd + 705,456,252 wr)
- with the patch:
I refs: 2,391,899,214
D refs: 899,359,836 (516,666,051 rd + 382,693,785 wr)
Compressing a 260MB directory containing the llvm code into a 35MB tar.gz
and then decompressing that with minigzip -d:
on i7-4790K x86_64-linux, it takes 0.533s before the patch and 0.493s with the patch,
on Juno-r0 aarch64-linux A57, it takes 2.796s before the patch and 2.467s with the patch,
on Juno-r0 aarch64-linux A53, it takes 4.055s before the patch and 3.604s with the patch.
Simon Hosie [Wed, 22 Mar 2017 17:48:39 +0000 (10:48 -0700)]
Inflate using wider loads and stores and a minimum of branches. (#95)
* Inflate using wider loads and stores.
In inflate_fast() the output pointer always has plenty of room to write. This
means that so long as the target is capable, wide un-aligned loads and stores
can be used to transfer several bytes at once.
When the reference distance is too short, simply unroll the data a little to
increase the distance.
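A hedged sketch of the resulting copy strategy (illustrative, not the actual inflate_fast code; it relies on the output slack described above and on a match distance of at least 1): short distances are first widened by replicating the pattern, after which the bulk of the match is moved with plain 8-byte memcpy chunks that may overshoot slightly into the slack.

    #include <stdint.h>
    #include <string.h>

    /* Copy a 'len'-byte match that starts 'dist' bytes behind 'out'.
     * Assumes dist >= 1 and that the caller guarantees a few bytes of
     * writable slack past out + len, as inflate_fast() does. */
    static unsigned char *chunk_copy(unsigned char *out, unsigned dist, unsigned len) {
        unsigned char *end = out + len;        /* true end; stores may overshoot */
        unsigned char *from = out - dist;
        int remaining = (int)len;

        /* Widen short distances: each replication doubles the usable pattern,
         * so 'from' stays exactly out - dist after the doubling. */
        while (dist < 8) {
            memcpy(out, from, dist);
            out += dist;
            remaining -= (int)dist;
            dist *= 2;
        }

        /* Bulk copy in 8-byte chunks; source and destination no longer
         * overlap within a single chunk because dist >= 8 here. */
        while (remaining > 0) {
            memcpy(out, from, 8);
            out += 8;
            from += 8;
            remaining -= 8;
        }
        return end;
    }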
Don't pass the unnecessary stream argument to fold_[1-4] and partial_fold.
Also fix some whitespace to make the code easier to read, and
better match the rest of the zlib-ng codebase.
Sebastian Pop [Mon, 27 Feb 2017 17:21:59 +0000 (11:21 -0600)]
call memset for read after write dependences at distance 1
On a benchmark using zlib to decompress a PNG image, this change shows a 20%
speedup. It makes sense to special-case distance = 1 of the read-after-write
dependences because the loop kernel can be replaced with a memset, which is
usually implemented in assembly in the libc, and because of the frequency at
which distance = 1 appears during PNG decompression.
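A hedged sketch of the special case (simplified; the actual patch works inside inflate's window handling and builds wider patterns as described in the later chunked-copy work): with a match distance of 1, every output byte equals the byte just written, so the byte-by-byte loop collapses into a single memset.

    #include <string.h>

    /* Distance-1 match: the byte at out[-1] repeats 'len' times.
     * Replacing the loop with memset lets libc use its optimized
     * assembly implementation instead of byte-by-byte stores. */
    static unsigned char *copy_match_dist1(unsigned char *out, size_t len) {
        memset(out, out[-1], len);
        return out + len;
    }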
Let all platforms defining UNALIGNED_OK use the optimized put_short
implementation. Also change from pre-increment to post-increment to
prevent a double-store on non-x86 platforms.
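A hedged sketch of the two put_short flavours (function form with a stand-in state struct; the real code uses macros on deflate_state, and the unaligned path additionally assumes a little-endian target): the portable path emits two single-byte stores, while the UNALIGNED_OK path stores one 16-bit value at the current offset and only then advances the pending counter.

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        unsigned char *pending_buf;   /* output pending buffer */
        unsigned long  pending;       /* bytes already queued */
    } state_sketch;                   /* stand-in for deflate_state */

    /* Portable variant: two byte stores, low byte first. */
    static void put_short_c(state_sketch *s, uint16_t w) {
        s->pending_buf[s->pending++] = (unsigned char)(w & 0xff);
        s->pending_buf[s->pending++] = (unsigned char)(w >> 8);
    }

    /* UNALIGNED_OK variant: one 16-bit store at the current offset,
     * counter advanced afterwards (the post-increment form). */
    static void put_short_unaligned(state_sketch *s, uint16_t w) {
        memcpy(s->pending_buf + s->pending, &w, sizeof(w));
        s->pending += 2;
    }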
Let all x86 and x86_64 archs use the new UPDATE_HASH implementation;
this improves compression performance and can often provide slightly
better compression.
Mark Adler [Fri, 28 Oct 2016 05:50:43 +0000 (22:50 -0700)]
Fix bug when level 0 used with Z_HUFFMAN or Z_RLE.
Compression level 0 requests no compression, using only stored
blocks. When Z_HUFFMAN or Z_RLE was used with level 0 (granted,
an odd choice, but permitted), the resulting blocks were mostly
fixed or dynamic. The reason is that deflate_stored() was not
being called in that case. The compressed data was valid, but it
was not what the application requested. This commit assures that
only stored blocks are emitted for compression level 0, regardless
of the strategy selected.
Mika Lindqvist [Tue, 14 Feb 2017 09:40:52 +0000 (11:40 +0200)]
Avoid hashing the same memory location twice by truncating overlapping byte ranges.
It's a speed optimization, as the inner code also checks that the previous hash value
is not the same as the new hash value. Essentially those two checks together make the
compression a little more efficient, as it can remember matches further apart.
As far as I remember from my tests, the secondary path was triggered only twice
in a very long uncompressed file, but the gain in compression ratio was still noticeable.
Fix a bug where only one half of a macro was executed in the correct side of the
conditional, causing potential hash corruption on calls to deflateParams() that
change the level from 0 to something else.
Mika Lindqvist [Thu, 28 Apr 2016 19:48:15 +0000 (22:48 +0300)]
Add support for internal attribute
The advantage of this over hidden is, for example, that the compiler can
safely assume that pointers to functions declared internal can never be
passed externally. This allows the compiler to consider optimizations that
would otherwise be impossible.
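A hedged sketch of how such an attribute is typically wired up (the macro name and guard are illustrative, not zlib-ng's actual ones): with GCC/Clang the attribute is spelled visibility("internal"), and because an internal function can never be called or have its address used from another module, the compiler gets more freedom than with "hidden".

    /* Illustrative macro; zlib-ng's real name and feature test may differ. */
    #if defined(__GNUC__) || defined(__clang__)
    #  define Z_INTERNAL_ATTR __attribute__((visibility("internal")))
    #else
    #  define Z_INTERNAL_ATTR
    #endif

    /* The compiler may assume this symbol is never referenced from another
     * module, enabling inlining/cloning that "hidden" alone would not allow. */
    Z_INTERNAL_ATTR void fill_window_helper(void);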
René J.V. Bertin [Thu, 11 Jun 2015 20:08:19 +0000 (22:08 +0200)]
CMakeLists.txt: use check_c_source_runs instead of check_c_source_compiles
to try to avoid using intrinsics and an instruction set the compiler
knows but the host CPU doesn't support.
René J.V. Bertin [Fri, 12 Jun 2015 13:13:49 +0000 (15:13 +0200)]
CMakeLists.txt : preliminary support for MSVC and ICC
- select the CMAKE_BUILD_TYPE "Release" by default if none has been set,
to ensure maximum generic optimisation possible on the host platform
- add WITH_NATIVE_INSTRUCTIONS to build with -march=native or its equivalent
option with other compilers (when we identify those alternatives)
- NATIVEFLAG (-march=native) will be used instead of -msseN/-mpclmul when
defined/requested
TODO: discuss whether -msseN/-mpclmul should be used only for the files that
need them instead of globally, while NATIVEFLAG can (is supposed to) be used
globally.
René J.V. Bertin [Thu, 11 Jun 2015 17:20:41 +0000 (19:20 +0200)]
CMakeLists.txt: better checking for Intel intrinsics.
The checks currently assume that code which compiles will also execute.
This is not necessarily true: building with -msse4 on an AMD CPU (a C60)
that only has SSE4a leads to a crash in deflateInit2 when the compiler
apparently uses an unsupported instruction to set
s->hash_bits = memLevel + 7;
Phil Vachon [Mon, 30 Jan 2017 14:28:25 +0000 (15:28 +0100)]
Add block_open state for deflate_quick
By storing whether or not a block has been opened (or terminated), the
static trees used for the block and the end block markers can be emitted
appropriately.
Phil Vachon [Mon, 30 Jan 2017 14:20:20 +0000 (15:20 +0100)]
Fix Partial Symbol Generation for QUICK deflate
When using deflate_quick() in a streaming fashion and the output buffer
runs out of space while the input buffer still has data, deflate_quick()
would emit partial symbols. Force the deflate_quick() loop to terminate
for a flush before any further processing is done, returning to the main
deflate() routine to do its thing.
Mark Adler [Sun, 15 Jan 2017 16:22:16 +0000 (08:22 -0800)]
Permit immediate deflateParams changes before any deflate input.
This permits deflateParams to change the strategy and level right
after deflateInit, without having to wait until a header has been
written. The parameters can be changed immediately up until the
first deflate call that consumes any input data.
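A hedged usage sketch of what this enables (one-shot compression, error handling trimmed; the function name and buffer handling are illustrative): the level and strategy can now be changed right after deflateInit(), before the first deflate() call that consumes input.

    #include <string.h>
    #include <zlib.h>

    int compress_with_tuned_params(unsigned char *dst, unsigned long *dst_len,
                                   const unsigned char *src, unsigned long src_len) {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));
        if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
            return -1;

        /* Now permitted immediately after init, before any input is consumed. */
        deflateParams(&strm, 9, Z_FILTERED);

        strm.next_in   = (unsigned char *)src;
        strm.avail_in  = (unsigned int)src_len;
        strm.next_out  = dst;
        strm.avail_out = (unsigned int)*dst_len;

        int ret = deflate(&strm, Z_FINISH);
        *dst_len = strm.total_out;
        deflateEnd(&strm);
        return ret == Z_STREAM_END ? 0 : -1;
    }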
Mark Adler [Sun, 15 Jan 2017 16:15:55 +0000 (08:15 -0800)]
Update high water mark in deflate_stored.
This avoids unnecessary filling of bytes in the sliding window
buffer when switching from level zero to a non-zero level. This
also provides a consistent indication of deflate having taken
input for a later commit ...