Yann Collet [Sat, 26 May 2018 00:41:16 +0000 (17:41 -0700)]
changed dynamic fse threshold for offset
recent experiments showed that
the default distribution table for offsets
can become a poor fit fairly quickly as the number of symbols grows,
while it remains a reasonable choice much longer for length symbols.
Changed the formula,
so that the dynamic threshold is now 32 symbols for offsets.
It remains at 64 symbols for lengths.
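A minimal sketch of the resulting rule, assuming a simple per-code-type
threshold check; the names here are illustrative, not zstd's internal code:

```c
/* Below the per-type symbol-count threshold, keep the predefined FSE
 * distribution table; at or above it, build a dynamic one. */
typedef enum { kOffsetCodes, kLengthCodes } codeType_e;

static int useDynamicFSE(codeType_e type, unsigned nbSymbols)
{
    const unsigned threshold = (type == kOffsetCodes) ? 32 : 64;
    return nbSymbols >= threshold;
}
```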
Nick Terrell [Wed, 23 May 2018 19:16:00 +0000 (12:16 -0700)]
[zstd] Fix decompression edge case
This edge case is only possible with the new optimal encoding selector,
since zstd previously always chose `set_basic` for small numbers of
sequences.
Fix `FSE_readNCount()` to support buffers < 4 bytes.
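A hedged sketch of one way to support such buffers: copy them into a
zero-padded scratch buffer, parse that, and reject results that claim to
have consumed more bytes than really existed. The wrapper below is for
illustration only; the actual patch changes `FSE_readNCount()` itself,
and the `(size_t)-1` error sentinel stands in for zstd's internal error
macros.

```c
#include <string.h>
#include "fse.h"   /* FSE_readNCount(), FSE_isError() */

static size_t readNCount_anySize(short* normalizedCounter,
                                 unsigned* maxSVPtr, unsigned* tableLogPtr,
                                 const void* rBuffer, size_t rBuffSize)
{
    if (rBuffSize < 4) {
        char scratch[4] = { 0 };              /* zero padding */
        memcpy(scratch, rBuffer, rBuffSize);
        {   size_t const r = FSE_readNCount(normalizedCounter, maxSVPtr,
                                            tableLogPtr, scratch, sizeof(scratch));
            if (FSE_isError(r)) return r;
            if (r > rBuffSize) return (size_t)-1;  /* header needed bytes we never had */
            return r;
        }
    }
    return FSE_readNCount(normalizedCounter, maxSVPtr, tableLogPtr,
                          rBuffer, rBuffSize);
}
```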
Nick Terrell [Mon, 16 Apr 2018 22:37:27 +0000 (15:37 -0700)]
Approximate FSE encoding costs for selection
Estimate the cost of using the FSE modes `set_basic`, `set_compressed`,
and `set_repeat`, and select the one with the lowest cost (see the sketch
after this list).
* The cost of `set_basic` is computed with the cross-entropy cost
function `ZSTD_crossEntropyCost()`, applied to the normalized default
count and the actual count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
`ZSTD_entropyCost()`, together with the cost of writing the normalized
count `ZSTD_NCountCost()`.
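A hedged sketch of the comparison, using simplified floating-point
stand-ins for zstd's fixed-point internals; `crossEntropyBits()`
approximates the role of `ZSTD_crossEntropyCost()`, and all signatures
here are assumptions:

```c
#include <math.h>

/* Bits needed to encode histogram `count` (symbols 0..maxSymbol) with a
 * table normalized to `norm`, whose entries sum to 1<<accuracyLog. */
static double crossEntropyBits(const short* norm, unsigned accuracyLog,
                               const unsigned* count, unsigned maxSymbol)
{
    const double total = (double)(1u << accuracyLog);
    double bits = 0.0;
    unsigned s;
    for (s = 0; s <= maxSymbol; s++) {
        double p;
        if (count[s] == 0) continue;
        /* a normalized count of -1 marks a low-probability symbol; price it as 1 */
        p = (norm[s] <= 0 ? 1.0 : (double)norm[s]) / total;
        bits += (double)count[s] * -log2(p);
    }
    return bits;
}

typedef enum { set_basic, set_compressed, set_repeat } encType_e;

/* Pick the cheapest mode; set_compressed must also pay for writing its
 * normalized count header (the ZSTD_NCountCost() part). */
static encType_e selectCheapest(double basicBits, double repeatBits,
                                double compressedBits, double ncountBits)
{
    encType_e best = set_basic;
    double bestBits = basicBits;
    if (repeatBits < bestBits) { best = set_repeat; bestBits = repeatBits; }
    if (compressedBits + ncountBits < bestBits) best = set_compressed;
    return best;
}
```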
Nick Terrell [Fri, 18 May 2018 22:25:10 +0000 (15:25 -0700)]
[cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the
epoch size such that `epochs * segmentSize ~= dictSize`. Selecting fewer
epochs gives the algorithm more candidates to choose from for each
segment it selects; it then loops back to the first epoch when it hits
the last one.
The trade-off is that each segment now takes longer to select, since the
algorithm has to look at more data before making a choice.
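A hedged sketch of that schedule, assuming the reduction factor of 4 that
the tests below ended up favoring; the names and structure are
illustrative, not the actual COVER code, and `selectSegment()` is a stub:

```c
#include <stddef.h>

typedef struct { size_t begin, end; } Segment;

/* Stub for COVER's real scoring of the best segment within one epoch's
 * slice of dmers; here it simply returns the whole slice. */
static Segment selectSegment(size_t begin, size_t end)
{
    Segment s; s.begin = begin; s.end = end;
    return s;
}

static size_t buildDict(size_t nbDmers, size_t dictCapacity, size_t k)
{
    size_t epochs = dictCapacity / (k * 4);   /* previously ~dictCapacity / k */
    size_t epochSize, tail = dictCapacity, epoch;
    if (epochs == 0) epochs = 1;
    epochSize = nbDmers / epochs;
    if (epochSize == 0) return 0;             /* not enough data */
    /* one segment per epoch, wrapping back to epoch 0 after the last */
    for (epoch = 0; tail > 0; epoch = (epoch + 1) % epochs) {
        Segment const s = selectSegment(epoch * epochSize, (epoch + 1) * epochSize);
        size_t segSize = s.end - s.begin;
        if (segSize > tail) segSize = tail;
        tail -= segSize;                      /* copying segment bytes omitted */
    }
    return dictCapacity - tail;               /* bytes of dictionary built */
}
```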
I benchmarked on the following data sets using this command:
There is some noise in the measurements, since small changes to `k` can
make a large difference, which is why I'm using `steps=256` to try to
minimize the noise. However, the GitHub data set still has some noise.
If I run the GitHub data set on my Mac, which presumably lists directory
entries in a different order (so the dictionary builder sees the files in
a different order), or if I use `steps=1024`, I see these results:
| Run | Before | After | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 | -0.50% |
| MacBook | 738451 | 737132 | -0.18% |
Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to trade a much longer
dictionary-building time for a slightly better dictionary.
I tested `2`, `4`, and `16`; `4` got most of the benefit of `16`
with a faster running time.
Yann Collet [Mon, 14 May 2018 22:32:28 +0000 (15:32 -0700)]
decompress: changed error code when input is too large
ZSTD_decompress() can decompress multiple frames sent as a single input.
But the input size must be the exact sum of all compressed frames, no more.
If srcSize is mistakenly larger than required,
ZSTD_decompress() will try to decompress another frame after the current one, and fail.
As a consequence, it will issue the error code ERROR(prefix_unknown).
While that error is technically correct
(the decoder could not recognise the header of the _next_ frame),
it's confusing: users will believe the header of the _first_ frame is wrong,
which is not the case (it's correct).
That makes it harder to understand that the error lies in the source size, which is too large.
This patch changes the error code provided in such a scenario.
If (at least) a first frame was successfully decoded,
and the following bytes are garbage values,
the decoder assumes the provided input size is wrong (too large),
and issues the error code ERROR(srcSize_wrong).
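A hedged caller-side illustration of the scenario, using only zstd's
public API; the buffer sizes and the 8 trailing garbage bytes are
arbitrary:

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    const char* msg = "hello zstd";
    char cBuf[256], dBuf[256];
    size_t const cSize = ZSTD_compress(cBuf, sizeof(cBuf), msg, strlen(msg), 1);
    if (ZSTD_isError(cSize)) return 1;

    memset(cBuf + cSize, 0xAA, 8);   /* garbage after the only frame */
    {   size_t const r = ZSTD_decompress(dBuf, sizeof(dBuf), cBuf, cSize + 8);
        if (ZSTD_isError(r))
            /* previously reported as prefix_unknown; with this patch,
             * the code is srcSize_wrong */
            printf("error: %s\n", ZSTD_getErrorName(r));
    }
    return 0;
}
```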
Yann Collet [Sat, 12 May 2018 16:40:04 +0000 (09:40 -0700)]
paramgrill: subtle change in level spacing
distance between levels is slightly increased,
to compensate for level 1 speed improvements
and the desire for a stronger level 19,
extending the range of speeds to cover.