Daniel Axtens [Fri, 1 May 2015 05:56:21 +0000 (15:56 +1000)]
x86: Do not try X86_QUICK_STRATEGY without HAVE_SSE2_INTRIN
QUICK depends on fill_window_sse, and fails to link without it.
Therefore, disable QUICK_STRATEGY if we lack SSE2 support.
This could easily be worked around by making the QUICK code
fall back to regular fill_window, but it's probably not important:
if you care about speed you probably have SSE2.
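A sketch of that fallback, for reference (quick_fill_window is a hypothetical name; fill_window_sse and fill_window_c are the existing SSE and plain C implementations):

    #ifdef HAVE_SSE2_INTRIN
    #  define quick_fill_window(s) fill_window_sse(s)
    #else
    #  define quick_fill_window(s) fill_window_c(s)   /* regular fill_window */
    #endif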
Daniel Axtens [Fri, 1 May 2015 05:30:05 +0000 (15:30 +1000)]
Remove unneeded and confusing alwaysinline
alwaysinline expands to __attribute__((always_inline)).
This does not force gcc to inline the function. Instead, it allows gcc to
inline the function when compiled without optimisations. (Normally, inline
functions are only inlined when compiled with optimisations.)[0]
alwaysinline was only used for bulk_insert_str, seemingly in an attempt to
force the function to be inlined. That won't work.
Furthermore, bulk_insert_str wasn't even declared inline, causing warnings.
Remove alwaysinline and replace with inline.
Remove the #defines, as they're no longer used.
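For reference, a sketch of the distinction (the prototype is illustrative and not copied from the source):

    /* The attribute only takes effect when it rides on an 'inline'
     * declaration, and even then it permits rather than forces inlining. */
    static inline void bulk_insert_str(deflate_state *s, unsigned startpos,
                                       unsigned count)
        __attribute__((always_inline));

    /* What this commit keeps: a plain inline declaration, leaving the
     * decision to the optimizer. */
    static inline void bulk_insert_str(deflate_state *s, unsigned startpos,
                                       unsigned count);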
Mika Lindqvist [Wed, 29 Apr 2015 13:49:43 +0000 (16:49 +0300)]
* Remove assembler targets and OBJA from Visual Studio makefile
* Fix creating manifest files in Visual Studio makefile
* Add missing dependency information for match.obj in Visual Studio
makefile
hansr [Thu, 6 Nov 2014 19:40:56 +0000 (20:40 +0100)]
Drop support for old systems in configure. The remaining ones should
ideally be tested by someone familiar with them and a decision made
whether to keep/remove/update the detection and settings for them.
hansr [Wed, 5 Nov 2014 12:54:07 +0000 (13:54 +0100)]
Remove support for ASMV and ASMINF defines and clean up match.c handling.
This makes it easier to implement support for ASM replacements using
configure parameters if needed later. Also since zlib-ng uses
compiler intrinsics, this needed a cleanup in any case.
Testing on a Raspberry Pi shows that -DUNALIGNED_OK and -DCRC32_UNROLL_LESS
both give a consistent performance gain, so enable these on the armv6 arch.
Also enable -DADLER32_UNROLL_LESS on the untested assumption that it will
also be faster.
hansr [Tue, 14 Oct 2014 08:01:18 +0000 (10:01 +0200)]
Merge x86 and x86_64 handling in configure.
Add parameter to disable new strategies.
Add parameter to disable arch-specific optimizations.
(This is just the first few steps, more changes needed)
Shuxin Yang [Sun, 20 Apr 2014 22:50:33 +0000 (15:50 -0700)]
Minor enhancement to the put_short() macro. This change saw a marginal
speedup (about 0% to 3%, depending on the compression level and input).
I guess the speedup arises from the following facts:
1) "s->pending" is now loaded once and stored once. In the original
implementation, it had to be loaded and stored twice, because the
compiler is not able to disambiguate "s->pending" and
"s->pending_buf[]".
2) better code generation:
2.1) no instructions are needed to extract the two bytes from a short.
2.2) fewer registers are needed.
2.3) stores to adjacent bytes are merged into a single store, albeit
at the cost of a potentially unaligned access.
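For illustration, one possible shape of the change (a sketch only, assuming little-endian byte order and tolerating unaligned 16-bit stores; put_short_old/put_short_new are illustrative names, and the real macro may differ):

    /* Before: two byte-wide stores, each going through put_byte() and thus
     * re-reading and re-writing s->pending. */
    #define put_short_old(s, w) { \
        put_byte(s, (unsigned char)((w) & 0xff)); \
        put_byte(s, (unsigned char)((w) >> 8)); \
    }

    /* After (sketch): one 16-bit store into pending_buf and a single
     * update of s->pending. */
    #define put_short_new(s, w) { \
        *(unsigned short *)(&(s)->pending_buf[(s)->pending]) = (unsigned short)(w); \
        (s)->pending += 2; \
    }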
Shuxin Yang [Tue, 18 Mar 2014 01:17:23 +0000 (18:17 -0700)]
Restructure the loop, and see about a 3% speedup in run time. I believe the
speedup arises from:
o. Removing the conditional branch in the loop.
o. Removing some indirect memory accesses:
the accesses to "s->prev_length" and "s->strstart" cannot be promoted
to registers because the compiler is not able to disambiguate them from
the store operation in INSERT_STRING().
o. Converting the non-countable loop to a countable loop (see the sketch
below). I'm not sure whether this change really contributes; in general,
a countable loop is much easier to optimize than a non-countable one.
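A generic illustration of that last point (not the actual deflate loop; sum_until_zero and sum_n are made-up helpers):

    #include <stddef.h>

    /* Non-countable: the trip count depends on data tested inside the loop. */
    size_t sum_until_zero(const unsigned short *p) {
        size_t sum = 0;
        while (*p != 0)
            sum += *p++;
        return sum;
    }

    /* Countable: the trip count n is known before the first iteration,
     * which makes it far easier for the compiler to unroll or vectorize. */
    size_t sum_n(const unsigned short *p, size_t n) {
        size_t sum = 0, i;
        for (i = 0; i < n; i++)
            sum += p[i];
        return sum;
    }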
shuxinyang [Mon, 10 Mar 2014 00:20:02 +0000 (17:20 -0700)]
Rewrite the loops such that gcc can vectorize them using saturated subtraction
on the x86-64 architecture. This speeds up performance by some 7% on my Linux
box with a Core i7.
The original loops are legal to vectorize; gcc 4.7.* and 4.8.* somehow fail
to catch this case. There is still room to squeeze more out of the vectorized
code, but since these loops now account for only about 1.5% of execution time,
it is not worthwhile to chase that performance with hand-written assembly.
The original loops are guarded with "#ifdef NOT_TWEAK_COMPILER". By default,
the modified version is picked up unless the code is compiled explicitly with
-DNOT_TWEAK_COMPILER.
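A sketch of the pattern being described (Pos and wsize follow the zlib naming, but slide_table is illustrative and not the exact loop in the source):

    #include <stddef.h>

    typedef unsigned short Pos;

    /* Slide every entry down by wsize, clamping at zero.  The ternary is
     * exactly unsigned saturating subtraction, which gcc can map to
     * psubusw on x86-64. */
    void slide_table(Pos *table, size_t entries, unsigned wsize) {
        size_t i;
        for (i = 0; i < entries; i++) {
            Pos m = table[i];
            table[i] = (Pos)(m >= wsize ? m - wsize : 0);
        }
    }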
Jim Kukunas [Thu, 18 Jul 2013 22:45:18 +0000 (15:45 -0700)]
deflate: add new deflate_medium strategy
From: Arjan van de Ven <arjan@linux.intel.com>
As the name suggests, the deflate_medium deflate strategy is designed
to provide an intermediate strategy between deflate_fast and deflate_slow.
After finding two adjacent matches, deflate_medium scans left from
the second match in order to determine whether a better match can be
formed.
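A rough sketch of the left scan (illustrative only; the struct fields and helper are assumptions, and the real deflate_medium code differs in detail):

    #define MAX_MATCH 258

    struct match {                 /* illustrative fields */
        unsigned strstart;         /* where the match begins in the input */
        unsigned match_start;      /* where the earlier matching data begins */
        unsigned match_length;
    };

    /* Grow the second match backwards while the preceding bytes still
     * agree, then trim the first match so the two no longer overlap. */
    static void scan_left(struct match *cur, struct match *nxt,
                          const unsigned char *window) {
        while (nxt->match_length < MAX_MATCH &&
               nxt->strstart > cur->strstart + 1 &&
               nxt->match_start > 0 &&
               window[nxt->strstart - 1] == window[nxt->match_start - 1]) {
            nxt->strstart--;
            nxt->match_start--;
            nxt->match_length++;
        }
        if (nxt->strstart < cur->strstart + cur->match_length)
            cur->match_length = nxt->strstart - cur->strstart;
    }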
Jim Kukunas [Thu, 18 Jul 2013 20:19:05 +0000 (13:19 -0700)]
deflate: add new deflate_quick strategy for level 1
The deflate_quick strategy is designed to provide maximum
deflate performance.
deflate_quick achieves this through:
- only checking the first hash match
- using a small inline SSE4.2-optimized longest_match
- forcing a window size of 8K, and using a precomputed dist/len
table
- forcing the static Huffman tree and emitting codes immediately
instead of tallying
This patch changes the scope of flush_pending, bi_windup, and
static_ltree to ZLIB_INTERNAL and moves END_BLOCK, send_code,
put_short, and send_bits to deflate.h.
Updates the configure script to enable this by default on x86. On systems
without SSE4.2, the fallback is the deflate_fast strategy.
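A sketch of the "first hash match only" part (quick_check is a hypothetical helper; it presumes the usual declarations from deflate.h, and the actual deflate_quick code differs):

    static unsigned quick_check(deflate_state *s) {
        Pos hash_head = s->head[s->ins_h];      /* newest candidate only */
        s->head[s->ins_h] = (Pos)s->strstart;   /* insert the current string */

        /* Try just this one candidate; s->prev[] is never walked. */
        if (hash_head != 0 && s->strstart - hash_head <= MAX_DIST(s))
            return longest_match(s, hash_head);
        return 0;
    }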
Jim Kukunas [Thu, 11 Jul 2013 20:49:05 +0000 (13:49 -0700)]
add PCLMULQDQ optimized CRC folding
Rather than copy the input data from strm->next_in into the window and
then compute the CRC, this patch combines these two steps into one. It
performs an SSE memory copy while folding the data down in the SSE
registers. A final step is added, when we write the gzip trailer,
to reduce the four SSE registers to a 32-bit CRC.
Adds some extra padding bytes to the window to allow for SSE partial
writes.
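The heart of the folding is a pair of carry-less multiplies per 16-byte block, roughly as below (fold_16 is an illustrative helper; the fold constants in k, and the copy loop and final reduction around it, are not shown):

    #include <emmintrin.h>
    #include <wmmintrin.h>   /* PCLMULQDQ intrinsics, compile with -mpclmul */

    /* One fold step: multiply the running state by the two 64-bit fold
     * constants held in k and xor in the next 16 bytes of input. */
    static __m128i fold_16(__m128i state, __m128i data, __m128i k) {
        __m128i lo = _mm_clmulepi64_si128(state, k, 0x00); /* low halves  */
        __m128i hi = _mm_clmulepi64_si128(state, k, 0x11); /* high halves */
        return _mm_xor_si128(_mm_xor_si128(lo, hi), data);
    }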
Jim Kukunas [Thu, 18 Jul 2013 18:40:09 +0000 (11:40 -0700)]
add SSE4.2 optimized hash function
For systems supporting SSE4.2, use the crc32 instruction as a fast
hash function. Also, provide a better fallback hash.
For both new hash functions, we hash 4 bytes, instead of 3, for certain
levels. This shortens the hash chains, and also improves the quality
of each hash entry.
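With SSE4.2 the hash itself is essentially one instruction (a sketch; crc32_hash and HASH_MASK are illustrative names, not the ones in the source):

    #include <string.h>
    #include <nmmintrin.h>   /* SSE4.2 intrinsics, compile with -msse4.2 */

    #define HASH_MASK 0x7fff /* illustrative */

    /* Hash 4 input bytes with the crc32 instruction. */
    static unsigned crc32_hash(const unsigned char *str) {
        unsigned val;
        memcpy(&val, str, sizeof(val));   /* alignment-safe 4-byte read */
        return _mm_crc32_u32(0, val) & HASH_MASK;
    }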
Jim Kukunas [Tue, 2 Jul 2013 19:09:37 +0000 (12:09 -0700)]
Adds SSE2 optimized hash shifting to fill_window.
Uses SSE2 subtraction with saturation to shift the hash in
16B chunks. Renames the old fill_window implementation to
fill_window_c(), and adds a new fill_window_sse() implementation
in fill_window_sse.c.
Moves UPDATE_HASH into deflate.h and changes the scope of
read_buf from local to ZLIB_INTERNAL for sharing between
the two implementations.
Updates the configure script to check for SSE2 intrinsics and enables
this optimization by default on x86. The runtime check for SSE2 support
only occurs on 32-bit, as x86_64 requires SSE2. Adds an explicit
rule in Makefile.in to build fill_window_sse.c with the -msse2 compiler
flag, which is required for SSE2 intrinsics.
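The core of the shift is a saturating subtract on 8 entries at a time, roughly (slide_hash_sse2 is an illustrative name; it assumes the entry count is a multiple of 8):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    typedef unsigned short Pos;

    /* Shift every hash position down by wsize; values below wsize clamp
     * to zero thanks to the unsigned saturating subtraction. */
    static void slide_hash_sse2(Pos *table, unsigned entries, unsigned wsize) {
        const __m128i ws = _mm_set1_epi16((short)wsize);
        unsigned i;
        for (i = 0; i < entries; i += 8) {
            __m128i v = _mm_loadu_si128((__m128i *)(table + i));
            v = _mm_subs_epu16(v, ws);
            _mm_storeu_si128((__m128i *)(table + i), v);
        }
    }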
Jim Kukunas [Mon, 1 Jul 2013 18:18:26 +0000 (11:18 -0700)]
Tune longest_match implementation
Separates the byte-by-byte and short-by-short longest_match
implementations into two separately tweakable versions and
splits all of the longest match functions into a separate file.
Split the end-chain and early-chain scans and provide likely/unlikely
hints to improve branch prediction.
Add an early termination condition for levels 5 and under to stop
iterating the hash chain when the match length for the current
entry is less than the current best match.
Also adjust variable types and scopes to provide better optimization
hints to the compiler.
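The hints are the usual __builtin_expect wrappers, roughly (assuming GCC; the exact macro names and conditions in the source may differ):

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Illustrative early-termination check inside the chain walk:
     *     if (s->level <= 5 && unlikely(entry_len < best_len)) break;
     */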
Jim Kukunas [Wed, 17 Jul 2013 17:34:56 +0000 (10:34 -0700)]
Add preprocessor define to tune Adler32 loop unrolling.
Excessive loop unrolling is detrimental to performance. This patch
adds a preprocessor define, ADLER32_UNROLL_LESS, to reduce the unrolling
factor from 16 to 8.
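Wiring this up typically looks like the following (DO8/DO16 are the existing unrolling helpers in adler32.c; UNROLL is an illustrative name):

    #ifdef ADLER32_UNROLL_LESS
    #  define UNROLL(buf)  DO8(buf, 0)   /* unroll by 8 */
    #  define UNROLL_BYTES 8
    #else
    #  define UNROLL(buf)  DO16(buf)     /* original: unroll by 16 */
    #  define UNROLL_BYTES 16
    #endif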