Yann Collet [Sun, 16 May 2021 06:09:42 +0000 (23:09 -0700)]
improve tar compatibility
This patch is intended to improve compatibility with less-featured tar variants,
"when the tar program used does not support historical options (without hyphen) nor the '-z' option."
Nick Terrell [Thu, 13 May 2021 23:16:47 +0000 (16:16 -0700)]
[fuzz] Add determinism fuzzing to simple & dictionary round trip
Compress the input twice in the `simple_round_trip` and
`dictionary_round_trip` fuzzers with exactly the same parameters, but
reusing the context. Then ensure that the compressed output is
identical.
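The check described above can be sketched as follows. This is a minimal, self-contained illustration and not the fuzzer's actual code: `compress_fn`, `is_deterministic`, `toy_ctx`, `toy_copy` and `toy_leaky` are hypothetical names, and the toy "compressors" only stand in for a real compression call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical compressor signature: returns the output size, or 0 on error. */
typedef size_t (*compress_fn)(void* ctx, void* dst, size_t dstCapacity,
                              const void* src, size_t srcSize);

/* Compress `src` twice while reusing the same context, then compare.
 * Returns 1 if both runs produced identical bytes, 0 otherwise. */
static int is_deterministic(compress_fn compress, void* ctx,
                            const void* src, size_t srcSize, size_t dstCapacity)
{
    unsigned char* const first  = (unsigned char*)malloc(dstCapacity);
    unsigned char* const second = (unsigned char*)malloc(dstCapacity);
    int same = 0;
    if (first != NULL && second != NULL) {
        size_t const size1 = compress(ctx, first,  dstCapacity, src, srcSize);
        size_t const size2 = compress(ctx, second, dstCapacity, src, srcSize);
        same = (size1 != 0) && (size1 == size2)
            && (memcmp(first, second, size1) == 0);
    }
    free(first);
    free(second);
    return same;
}

/* Toy context (assumption for illustration): counts how often it was used. */
typedef struct { unsigned calls; } toy_ctx;

/* Well-behaved toy "compressor": the output depends only on the input. */
static size_t toy_copy(void* ctx, void* dst, size_t dstCapacity,
                       const void* src, size_t srcSize)
{
    ((toy_ctx*)ctx)->calls++;
    if (srcSize == 0 || srcSize > dstCapacity) return 0;
    memcpy(dst, src, srcSize);
    return srcSize;
}

/* Buggy toy "compressor": leftover context state leaks into the output. */
static size_t toy_leaky(void* ctx, void* dst, size_t dstCapacity,
                        const void* src, size_t srcSize)
{
    toy_ctx* const c = (toy_ctx*)ctx;
    if (srcSize == 0 || srcSize + 1 > dstCapacity) return 0;
    ((unsigned char*)dst)[0] = (unsigned char)c->calls++;
    memcpy((unsigned char*)dst + 1, src, srcSize);
    return srcSize + 1;
}
```

Reusing the context is the point of the check: `toy_leaky` would pass if each run used a fresh context, but fails here because state from the first run changes the second output.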
Nick Terrell [Thu, 13 May 2021 23:13:29 +0000 (16:13 -0700)]
[lib] Fix dictionary invalidation logic
Call `ZSTD_enforceMaxDist()` before each block with the beginning of the
block. This ensures that `lowLimit` is updated to `dictLimit` whenever
the ext-dict is out of range, so we can use prefix mode for speed.
Without this update, compression can become non-deterministic, because
prefix-mode and ext-dict-mode match finders can return different results.
It can also hurt speed, because ext-dict match finders are slower.
The scenario is:
1. Compress large data with a dictionary.
2. The dictionary goes out of bounds, so we invalidate it.
3. However, we still have `lowLimit < dictLimit`, since it is
never updated.
4. We will call the ext-dict match finder instead of the prefix one.
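The scenario can be illustrated with a simplified model of the window limits. This is only a sketch under assumptions: `window_model`, `uses_ext_dict` and `enforce_max_dist` are hypothetical stand-ins, not the library's `ZSTD_window_t` or `ZSTD_enforceMaxDist()`, and the model ignores everything except the two limits.

```c
#include <stdint.h>

/* Simplified, hypothetical model of a match-finder window.
 * Indices in [lowLimit, dictLimit) belong to the ext-dict segment;
 * indices from dictLimit onwards belong to the prefix. */
typedef struct {
    uint32_t lowLimit;
    uint32_t dictLimit;
} window_model;

/* Ext-dict mode is only needed while the ext-dict segment is non-empty. */
static int uses_ext_dict(const window_model* w)
{
    return w->lowLimit < w->dictLimit;
}

/* Invalidate everything further than maxDist behind blockStart.
 * Once the whole ext-dict is out of range the two limits meet,
 * so the cheaper prefix mode can be selected. */
static void enforce_max_dist(window_model* w, uint32_t blockStart, uint32_t maxDist)
{
    if (blockStart > maxDist) {
        uint32_t const newLowLimit = blockStart - maxDist;
        if (w->lowLimit < newLowLimit) w->lowLimit = newLowLimit;
        if (w->dictLimit < w->lowLimit) w->dictLimit = w->lowLimit;
    }
}
```

In this model, skipping the `enforce_max_dist()` call before a block leaves `lowLimit < dictLimit` even though the dictionary is no longer reachable, which is the stale state from steps 3 and 4.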
Nick Terrell [Thu, 13 May 2021 22:51:15 +0000 (15:51 -0700)]
[lib] Fix off-by-one error in repcode checks
The repcode checks disallowed repcodes that are equal to `windowLow`.
This is slightly inefficient, but isn't a problem on its own. Together
with the next commit, it causes non-determinism.
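As a rough illustration of the off-by-one, the two predicates below differ only at the boundary. The function names are hypothetical and the real checks in the library are written differently; this only models "is the repcode's match index inside the window".

```c
#include <stdint.h>

/* Old behaviour (modelled): an index equal to windowLow is rejected. */
static int rep_allowed_old(uint32_t repIndex, uint32_t windowLow)
{
    return repIndex > windowLow;
}

/* Fixed behaviour (modelled): windowLow itself is still a valid index. */
static int rep_allowed_new(uint32_t repIndex, uint32_t windowLow)
{
    return repIndex >= windowLow;
}
```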
This optimization is based on the length of the longest match found. However,
when indices are reset, we only ensure that we can reference the whole
window starting from `ip`. If the previous block ended with a long match
then `nextToUpdate` could be much less than `ip`. It might be far enough
back that `nextToUpdate < maxDist`, so it doesn't have a full window of
data to reference. This can cause non-determinism bugs, because we may
find a match that is beyond `ip - maxDist` and is therefore sometimes
unreferenceable, and that match triggers the speed optimization.
The fix is to base the `windowLow` off of the `target` of
`ZSTD_updateTree_internal()`, because anything below that value will be
obsolete by the time `ZSTD_updateTree_internal()` completes.
Olivier Perret [Wed, 12 May 2021 20:11:15 +0000 (22:11 +0200)]
fileio: clamp value of windowLog in patch-mode (#2637)
With small enough input files, the inferred value of fileWindowLog could
be smaller than ZSTD_WINDOWLOG_MIN.
This can be reproduced like so:
$ echo abc > small
$ echo abcdef > small2
$ zstd --patch-from small small2 -o patch
Previously, this would fail with the error "zstd: error 11 : Parameter is out of bound".
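A clamp of this kind can be sketched as below. The bounds are assumptions chosen for illustration (the real limits are the `ZSTD_WINDOWLOG_MIN` and `ZSTD_WINDOWLOG_MAX` macros from zstd.h), and `clamp_window_log` is a hypothetical helper, not the function used in fileio.

```c
/* Assumed bounds, for illustration only; the real values come from zstd.h. */
#define WINDOWLOG_MIN_ASSUMED 10u
#define WINDOWLOG_MAX_ASSUMED 31u

/* Force an inferred window log into the accepted range. */
static unsigned clamp_window_log(unsigned fileWindowLog)
{
    if (fileWindowLog < WINDOWLOG_MIN_ASSUMED) return WINDOWLOG_MIN_ASSUMED;
    if (fileWindowLog > WINDOWLOG_MAX_ASSUMED) return WINDOWLOG_MAX_ASSUMED;
    return fileWindowLog;
}
```

With tiny inputs such as the files above, the inferred value falls below the minimum and is raised to it instead of being rejected as out of bounds.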
When running armv6 userspace on armv8 hardware with a 64-bit Linux kernel,
mode 2 caused SIGBUS (unaligned memory access).
Running all our arm builds in the build farm
only on armv8 simplifies administration a lot.
Depending on compiler and environment, this change might slow down
memory accesses (did not benchmark it). The original analysis is 6 years old.
Nick Terrell [Fri, 7 May 2021 04:56:51 +0000 (21:56 -0700)]
[lib] Fix fuzzer timeouts by backing off overflow correction
Linearly back off the frequency of overflow correction based on the
number of times the `ZSTD_window_t` has been overflow corrected. This
will still allow the fuzzer to quickly find overflow correction bugs,
while also keeping good speed for larger inputs.
Additionally, the `nbOverflowCorrections` variable can be useful for
debugging coredumps, since we can inspect the `ZSTD_CCtx` to see if
overflow correction has happened yet.
I've verified this fixes the timeouts in OSS-Fuzz (176 seconds -> 6
seconds). I've also verified that the fuzzers, as well as `fuzzer` and
`zstreamtest`, still catch the row-hash overflow correction bug.
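One way to picture the linear back-off is a threshold that grows with the number of corrections already performed. This is a sketch under assumptions, not the library's code: `should_overflow_correct` and `baseThreshold` are hypothetical, and only `nbOverflowCorrections` is a name taken from the description above.

```c
#include <stdint.h>

/* Returns 1 if overflow correction should run at currentIndex.
 * The threshold scales linearly with the number of corrections already
 * performed, so each later correction needs a proportionally larger index.
 * 64-bit arithmetic avoids wrap-around in this illustration. */
static int should_overflow_correct(uint32_t currentIndex,
                                   uint32_t baseThreshold,
                                   uint32_t nbOverflowCorrections)
{
    uint64_t const threshold =
        (uint64_t)baseThreshold * ((uint64_t)nbOverflowCorrections + 1);
    return (uint64_t)currentIndex > threshold;
}
```

The first correction still triggers early, which is what lets a fuzzer reach the overflow-correction path quickly, while long inputs pay for it less and less often.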
Nick Terrell [Thu, 6 May 2021 02:44:24 +0000 (19:44 -0700)]
[zdict] Add a FAQ to the top of zdict.h
The FAQ covers the questions asked in Issue #2566. It first covers why
you would want to use a dictionary, then what a dictionary is, and
finally it tells you how to train a dictionary, and clarifies some of
the parameters.
There is definitely more that could be said about some of the advanced
trainers, but this should be a good start.
Nick Terrell [Wed, 5 May 2021 19:18:47 +0000 (12:18 -0700)]
[lib] Add ZSTD_c_deterministicRefPrefix
This flag forces zstd to always load the prefix in ext-dict mode, even
if it happens to be contiguous, to force determinism. It also applies to
dictionaries that are re-processed.
A determinism test case is also added, which fails without
`ZSTD_c_deterministicRefPrefix` and passes with it set.
Question: Should this be the default behavior? It isn't in this PR.