Simon Hosie [Wed, 22 Mar 2017 17:48:39 +0000 (10:48 -0700)]
Inflate using wider loads and stores and a minimum of branches. (#95)
* Inflate using wider loads and stores.
In inflate_fast() the output pointer always has plenty of room to write. This
means that so long as the target is capable, wide un-aligned loads and stores
can be used to transfer several bytes at once.
When the reference distance is too short, simply unroll the data a little to
increase the distance.
Don't pass unnecessary stream to fold_[1-4] and partial_fold.
Also fix some whitespace to make the code easier to read, and to
better match the rest of the zlib-ng codebase.
Sebastian Pop [Mon, 27 Feb 2017 17:21:59 +0000 (11:21 -0600)]
Call memset for read-after-write dependences at distance 1
On a benchmark using zlib to decompress a PNG image, this change shows a 20%
speedup. It makes sense to special-case read-after-write dependences at
distance = 1, both because the loop kernel can be replaced with a memset,
which is usually implemented in assembly in the libc, and because of how
frequently distance = 1 appears during PNG decompression.
Let all platforms defining UNALIGNED_OK use the optimized put_short
implementation. Also change from pre-increment to post-increment to
prevent a double-store on non-x86 platforms.
Let all x86 and x86_64 archs use the new UPDATE_HASH implementation;
this improves compression performance and can often yield a slightly
better compression ratio.
Mark Adler [Fri, 28 Oct 2016 05:50:43 +0000 (22:50 -0700)]
Fix bug when level 0 used with Z_HUFFMAN or Z_RLE.
Compression level 0 requests no compression, using only stored
blocks. When Z_HUFFMAN or Z_RLE was used with level 0 (granted,
an odd choice, but permitted), the resulting blocks were mostly
fixed or dynamic. The reason is that deflate_stored() was not
being called in that case. The compressed data was valid, but it
was not what the application requested. This commit assures that
only stored blocks are emitted for compression level 0, regardless
of the strategy selected.
Mika Lindqvist [Tue, 14 Feb 2017 09:40:52 +0000 (11:40 +0200)]
Avoid hashing the same memory location twice by truncating overlapping byte
ranges. This is a speed optimization, since the inner code also checks that
the previous hash value is not the same as the new one. Together the two
checks make compression a little more efficient, as it can remember matches
that are further apart. As far as I remember from my tests, the secondary
path was triggered only twice in a very long uncompressed file, but the gain
in compression ratio was still noticeable.
Fix a bug where only one half of a macro was executed on the correct side of
the conditional, which could corrupt the hash when deflateParams() is called
to change the level from 0 to something else.
Mika Lindqvist [Thu, 28 Apr 2016 19:48:15 +0000 (22:48 +0300)]
Add support for internal attribute
The advantage of this over hidden is for example that the compiler can
safely assume that pointers to functions declared internal can never be
passed externally. This allows the compiler to consider optimizations
otherwise impossible.
CMakeLists.txt: use check_c_source_runs instead of check_c_source_compiles
to try to avoid using intrinsics and an instruction set the compiler
knows but the host CPU doesn't support.
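The difference can be sketched in CMake terms; the variable name and probe body are illustrative:

```cmake
include(CheckCSourceRuns)
set(CMAKE_REQUIRED_FLAGS "-msse4")
# check_c_source_compiles() would pass on any compiler that knows SSE4;
# check_c_source_runs() also executes the probe binary, so it fails on a
# host CPU (e.g. an AMD C60 with only SSE4a) that cannot actually run
# the instructions, even though they compiled cleanly.
check_c_source_runs("
    #include <smmintrin.h>
    int main(void) {
        __m128i a = _mm_setzero_si128();
        a = _mm_max_epi32(a, a);       /* SSE4.1 instruction */
        return _mm_extract_epi32(a, 0);
    }" HAVE_SSE4_INTRIN)
unset(CMAKE_REQUIRED_FLAGS)
```

Note the caveat implied by the later WITH_NATIVE_INSTRUCTIONS work: run-based checks probe the build host, which is wrong for cross-compilation, so this is a trade-off rather than a universal fix.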
CMakeLists.txt: preliminary support for MSVC and ICC
- select the CMAKE_BUILD_TYPE "Release" by default if none has been set,
to ensure maximum generic optimisation possible on the host platform
- add WITH_NATIVE_INSTRUCTIONS to build with -march=native or its equivalent
option with other compilers (when we identify those alternatives)
- NATIVEFLAG (-march=native) will be used instead of -msseN/-mpclmul when
defined/requested
TODO: discuss whether -msseN/-mpclmul should be used only for the files that
need them instead of globally, while NATIVEFLAG can (is supposed to) be used
globally.
CMakeLists.txt: better checking for Intel intrinsics.
The checks currently assume that code which builds will also run.
This is not necessarily true: building with -msse4 on an AMD CPU (a C60)
that only has SSE4a leads to a crash in deflateInit2 when the compiler
apparently uses an unsupported instruction to set
s->hash_bits = memLevel + 7;
Phil Vachon [Mon, 30 Jan 2017 14:28:25 +0000 (15:28 +0100)]
Add block_open state for deflate_quick
By storing whether or not a block has been opened (or terminated), the
static trees used for the block and the end block markers can be emitted
appropriately.
Phil Vachon [Mon, 30 Jan 2017 14:20:20 +0000 (15:20 +0100)]
Fix Partial Symbol Generation for QUICK deflate
When using deflate_quick() in a streaming fashion and the output buffer
runs out of space while the input buffer still has data, deflate_quick()
would emit partial symbols. Force the deflate_quick() loop to terminate
for a flush before any further processing is done, returning to the main
deflate() routine to do its thing.
Mark Adler [Sun, 15 Jan 2017 16:22:16 +0000 (08:22 -0800)]
Permit immediate deflateParams changes before any deflate input.
This permits deflateParams to change the strategy and level right
after deflateInit, without having to wait until a header has been
written. The parameters can be changed immediately up until the
first deflate call that consumes any input data.
Mark Adler [Sun, 15 Jan 2017 16:15:55 +0000 (08:15 -0800)]
Update high water mark in deflate_stored.
This avoids unnecessary filling of bytes in the sliding window
buffer when switching from level zero to a non-zero level. This
also provides a consistent indication of deflate having taken
input for a later commit ...
Mark Adler [Sat, 3 Dec 2016 16:29:57 +0000 (08:29 -0800)]
Don't need to emit an empty fixed block when changing parameters.
gzsetparams() was using Z_PARTIAL_FLUSH when it could use Z_BLOCK
instead. This commit uses Z_BLOCK, which avoids emitting an
unnecessary ten bits into the stream.
Mark Adler [Sat, 3 Dec 2016 16:18:56 +0000 (08:18 -0800)]
Clean up gz* function return values.
In some cases the return values did not match the documentation,
or the documentation did not document all of the return values.
gzprintf() now consistently returns negative values on error,
which matches the behavior of the stdio fprintf() function.
Mark Adler [Sat, 5 Nov 2016 15:43:29 +0000 (08:43 -0700)]
Speed up deflation for level 0 (storing).
The previous code slid the window and the hash table and copied
every input byte three times in order to just write the data as
stored blocks with no compression. This commit minimizes sliding
and copying, especially for large input and output buffers.
Level 0 compression is now more than 20 times faster than before
the commit.
Most of the speedup is due to deferring hash table slides until
deflateParams() is called to change the compression level away
from 0. More speedup is due to copying directly from next_in to
next_out when the amounts of available input data and output space
permit it, avoiding the intermediate pending buffer. Additionally,
only the last 32K of the used input data is copied back to the
sliding window when large input buffers are provided.
Mark Adler [Wed, 23 Nov 2016 07:29:19 +0000 (23:29 -0800)]
Assure that deflateParams() will not switch functions mid-block.
This alters the specification in zlib.h, so that deflateParams()
will not change any parameters if there is not enough output space
in the event that a block is emitted in order to allow switching
the compression function.
Mark Adler [Sun, 30 Oct 2016 16:25:32 +0000 (09:25 -0700)]
Use memcpy for stored blocks.
This speeds up level 0 by about a factor of three, as compared to
the previous byte-at-a-time loop. We can do much better though. A
later commit avoids this copy for level 0 with large buffers,
instead copying directly from the input to the output. This commit
still speeds up storing incompressible data found when compressing
normally.
Original patch notes:
This updates the OS_CODE determination at compile time to match as
closely as possible the operating system mappings documented in
the PKWare APPNOTE.TXT version 6.3.4, section 4.4.2.2. That byte
in the gzip header is used by nobody for anything, as far as I can
tell. However we might as well try to set it appropriately.
Mark Adler [Tue, 25 Oct 2016 03:11:41 +0000 (20:11 -0700)]
Do a more thorough check of the state for every stream call.
This verifies that the state has been initialized, that it is the
expected type of state, deflate or inflate, and that at least the
first several bytes of the internal state have not been clobbered.
Mark Adler [Mon, 24 Oct 2016 22:52:19 +0000 (15:52 -0700)]
Reject a window size of 256 bytes if not using the zlib wrapper.
There is a bug in deflate for windowBits == 8 (256-byte window).
As a result, zlib silently changes a request for 8 to a request
for 9 (512-byte window), and sets the zlib header accordingly so
that the decompressor knows to use a 512-byte window. However if
deflateInit2() is used for raw deflate or gzip streams, then there
is no indication that the request was not honored, and the
application might assume that it can use a 256-byte window when
decompressing. This commit returns an error if the user requests
a 256-byte window when using raw deflate or gzip encoding.