Yann Collet [Wed, 3 Aug 2022 19:39:35 +0000 (21:39 +0200)]
fileio_types.h : avoid dependency on mem.h
fileio_types.h cannot be parsed by itself
because it relies on basic types defined in `lib/common/mem.h`.
As with #3231, this likely went undetected because `mem.h` was already included earlier in the files that use this header.
But this is not proper.
An "easy" solution would be to add the missing include,
but each dependency should be considered "bad" by default,
and only allowed if it brings some tangible value.
In this case, since these types are only used to declare internal structure variables
which are effectively only flags,
I believe it's really not valuable to add a dependency on `mem.h` for this purpose
while the standard `int` type can do the same job.
I was expecting some compiler warnings following this change,
but it turns out we don't use `-Wconversion` by default on `zstd` source code,
so there are none.
Nevertheless, I enabled `-Wconversion` locally and proceeded to fix a few conversion warnings in the process.
Adding `-Wconversion` to the list of flags used for `zstd` is something I would favor over the long term,
but it cannot be done overnight,
because the number of places where this warning is triggered is daunting.
Better to progressively reduce the number of triggered `-Wconversion` warnings before enabling this flag by default.
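A minimal sketch of the idea, not the actual `fileio_types.h` content: flag-like members are declared as plain `int`, so the header parses without any project typedefs, and a narrowing conversion of the kind `-Wconversion` reports is made explicit after a range check. `DemoPrefs`, `clamp_to_int` and `make_prefs` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical preferences struct: members that are effectively flags
 * use plain int, so no project-specific integer typedef is required. */
typedef struct {
    int overwrite;
    int nbWorkers;
} DemoPrefs;

/* size_t -> int is a narrowing conversion that -Wconversion reports
 * when left implicit. Check the range, then cast explicitly. */
static int clamp_to_int(size_t value)
{
    return (value > (size_t)INT_MAX) ? INT_MAX : (int)value;
}

static DemoPrefs make_prefs(size_t requestedWorkers)
{
    DemoPrefs prefs = { 0, 0 };
    prefs.nbWorkers = clamp_to_int(requestedWorkers);
    assert(prefs.nbWorkers >= 0);
    return prefs;
}
```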
Tom Wang [Fri, 29 Jul 2022 19:51:58 +0000 (12:51 -0700)]
Add warning when multi-thread decompression is requested (#3208)
When the user passes arguments for both decompression and multi-threading, print a warning message
to indicate that multi-threaded decompression is not supported.
* Add warning when multi-thread decompression is requested
* add test case for multi-threaded decoding warning
The expectation is that -d -T0 does not produce any warning,
while any other -d -T(>1) input does.
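The expectation above can be sketched as a small predicate. This is only an illustration under stated assumptions: the enum, the function names and the warning text are hypothetical, and `nbWorkers == 0` is assumed to model `-T0`.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { demo_compress, demo_decompress } demo_operation;

/* Returns 1 when a warning is due: decompression combined with more than
 * one worker. nbWorkers == 0 (assumed to model -T0) stays silent. */
static int should_warn_mt_decompression(demo_operation op, int nbWorkers)
{
    assert(nbWorkers >= 0);
    return (op == demo_decompress) && (nbWorkers > 1);
}

/* Prints the (hypothetical) warning text and reports whether it did so. */
static int report_mt_decompression(demo_operation op, int nbWorkers)
{
    if (!should_warn_mt_decompression(op, nbWorkers)) return 0;
    fprintf(stderr, "Warning : multi-threaded decompression is not supported\n");
    return 1;
}
```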
In zlib 1.2.12 the OF macro was changed to _Z_OF, breaking any
project that used zlibWrapper. To fix this, OF has been
changed to _Z_OF everywhere, and _Z_OF is defined as OF when
it is not already defined (zlib 1.2.11 and older).
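The fallback described above can be sketched as follows. The `OF` definition here is a stand-in so the fragment compiles on its own (a real build gets it from zlib's headers), and `add_one` is a hypothetical function used only to show a prototype written with the macro.

```c
#include <assert.h>

/* Stand-in for the prototype macro of older zlib versions. */
#ifndef OF
#  define OF(args) args
#endif

/* Fallback: when the zlib headers did not provide _Z_OF, map it to OF. */
#ifndef _Z_OF
#  define _Z_OF OF
#endif

/* A prototype written with the new macro name works in both cases. */
static int add_one _Z_OF((int value));

static int add_one(int value)
{
    assert(value < 1000000);  /* arbitrary demo bound, keeps the sum in range */
    return value + 1;
}
```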
Jun He [Fri, 29 Jul 2022 17:28:04 +0000 (01:28 +0800)]
lib: add hint to generate more pipeline friendly code (#3138)
Statistics gathered on the silesia test files show that
the chance of a position falling beyond highThreshold is
very low (~1.3%@L8 in most cases, all <2.5%), which puts
it in the "lowprob area". Add a branch hint so the
compiler can generate more pipeline-friendly code.
With this change, an uplift of ~1% is observed on mozilla
and xml, and a slight (0.3%~0.8%) but consistent uplift
on other files, on Arm N1.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Id9ba1d5c767e975290b5c1bf0ecce906544f4ade
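A hedged sketch of what such a branch hint can look like. `DEMO_UNLIKELY` and `next_position` are hypothetical names; the loop shape (advance by an odd step, wrap with a mask, skip positions above `highThreshold`) is an assumption based on the description in the commit above, not a copy of the library code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#  define DEMO_UNLIKELY(x) (x)
#endif

/* Advances to the next table position, skipping the rarely hit area
 * above highThreshold. The hint tells the compiler that the skip loop
 * is the cold path. */
static size_t next_position(size_t position, size_t step,
                            size_t tableMask, size_t highThreshold)
{
    assert((step & 1) == 1);  /* odd step on a power-of-2 table visits every slot */
    position = (position + step) & tableMask;
    while (DEMO_UNLIKELY(position > highThreshold))
        position = (position + step) & tableMask;
    return position;
}
```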
Jun He [Fri, 29 Jul 2022 17:27:20 +0000 (01:27 +0800)]
decomp: add prefetch for matched seq on aarch64 (#3164)
`match` is used for the following sequence copy. It is
only updated when extDict is needed, which is a
low-probability case, so it can be prefetched to
reduce cache misses.
Benchmarks on various Arm platforms showed an
uplift of 1% ~ 14% with gcc-11/clang-14.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: If201af4799d2455d74c79f8387404439d7f684ae
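The mechanics of issuing such a prefetch can be sketched as below. This is a simplified illustration, not the decoder's code: `DEMO_PREFETCH`, `execute_matches` and `demo_expand` are hypothetical, and the hint is issued right before the copy, whereas a real decoder would issue it earlier so the memory access has time to complete.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#  define DEMO_PREFETCH(ptr) __builtin_prefetch((ptr), 0 /* read */, 3 /* high locality */)
#else
#  define DEMO_PREFETCH(ptr) ((void)(ptr))
#endif

/* Appends nbSeq matches to buf, starting at pos. Each match copies
 * lengths[i] bytes from offsets[i] bytes before the write position. */
static size_t execute_matches(unsigned char* buf, size_t pos,
                              const size_t* offsets, const size_t* lengths,
                              size_t nbSeq)
{
    size_t i, j;
    for (i = 0; i < nbSeq; i++) {
        const unsigned char* match;
        assert(offsets[i] > 0 && offsets[i] <= pos);
        match = buf + pos - offsets[i];
        DEMO_PREFETCH(match);            /* hint only, never changes results */
        for (j = 0; j < lengths[i]; j++) /* forward byte copy handles overlap */
            buf[pos + j] = match[j];
        pos += lengths[i];
    }
    return pos;
}

/* Expands "abc" with two overlapping matches and checks the result. */
static int demo_expand(void)
{
    unsigned char buf[16] = "abc";
    size_t const offsets[2] = { 3, 1 };
    size_t const lengths[2] = { 6, 2 };
    size_t const end = execute_matches(buf, 3, offsets, lengths, 2);
    return end == 11 && memcmp(buf, "abcabcabccc", 11) == 0;
}
```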
Han Zhu [Wed, 20 Jul 2022 23:01:32 +0000 (16:01 -0700)]
[largeNbDicts] Second try at fixing decompression segfault to always create compressInstructions
Summary:
Freeing an uninitialized pointer is undefined behavior. This caused a segfault
when compiling the benchmark with Clang -O3 and benching decompression.
V2: always create compressInstructions but check if cctxParams is NULL before
setting CCtx params to avoid segfault.
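A minimal sketch of the V2 approach under stated assumptions: the instructions object is always created, and the parameters are applied only when `cctxParams` is non-NULL. `DemoParams`, `DemoInstructions`, `create_instructions` and `run_bench` are hypothetical stand-ins, not the benchmark's real types or functions.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int level; } DemoParams;                        /* stand-in for CCtx params */
typedef struct { int level; int paramsApplied; } DemoInstructions;

/* Always creates the object; parameters are optional. */
static DemoInstructions* create_instructions(const DemoParams* cctxParams)
{
    DemoInstructions* const ci = (DemoInstructions*)calloc(1, sizeof(*ci));
    if (ci == NULL) return NULL;
    if (cctxParams != NULL) {            /* the NULL check that avoids the segfault */
        ci->level = cctxParams->level;
        ci->paramsApplied = 1;
    }
    return ci;
}

/* Returns 1 if params were applied, 0 if not, -1 on allocation failure.
 * The pointer is always initialised before free() sees it. */
static int run_bench(const DemoParams* cctxParams)
{
    DemoInstructions* const ci = create_instructions(cctxParams);
    int applied;
    if (ci == NULL) return -1;
    assert(ci->paramsApplied == (cctxParams != NULL));
    applied = ci->paramsApplied;
    free(ci);
    return applied;
}

static const DemoParams kDemoParams = { 3 };
```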
Han Zhu [Wed, 20 Jul 2022 18:14:51 +0000 (11:14 -0700)]
[largeNbDicts] Add an option to print out median speed
Summary:
Added an option -p# where -p0 (default) sets the aggregation method to fastest
speed while -p1 sets the aggregation method to median. Also added a new column
in the csv file to report this option's value.
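The two aggregation methods can be sketched as follows. This is an illustration, not the benchmark's code: the names are hypothetical, the sample values are made up, and averaging the two middle values for an even number of runs is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_RUNS 64

typedef enum { demo_agg_fastest = 0, demo_agg_median = 1 } demo_aggregation;

static int compare_doubles(const void* a, const void* b)
{
    double const x = *(const double*)a;
    double const y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Aggregates per-run speeds: 0 selects the fastest run, 1 the median. */
static double aggregate_speed(const double* speeds, size_t n, demo_aggregation method)
{
    double sorted[DEMO_MAX_RUNS];
    assert(speeds != NULL && n > 0 && n <= DEMO_MAX_RUNS);
    memcpy(sorted, speeds, n * sizeof(double));
    qsort(sorted, n, sizeof(double), compare_doubles);
    if (method == demo_agg_fastest) return sorted[n - 1];
    if (n % 2 == 1) return sorted[n / 2];
    return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;  /* assumed even-count rule */
}

/* Made-up sample speeds in MB/s. */
static const double kDemoSpeeds[3] = { 100.0, 400.0, 200.0 };
```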
Test Plan:
```
$ ./largeNbDicts -1 --nbDicts=1 -D ~/benchmarks/html/html_8_16K.32K.dict
~/benchmarks/html/html_8_16K/*
loading 7450 files...
created src buffer of size 83.4 MB
split input into 7450 blocks
loading dictionary /home/zhuhan/benchmarks/html/html_8_16K.32K.dict
compressing at level 1 without dictionary : Ratio=3.03 (28827863 bytes)
compressed using a 32768 bytes dictionary : Ratio=4.28 (20410262 bytes)
generating 1 dictionaries, using 0.1 MB of memory
Compression Speed : 306.0 MB/s
Fastest Speed : 310.6 MB/s
$ ./largeNbDicts -1 --nbDicts=1 -p1 -D ~/benchmarks/html/html_8_16K.32K.dict
~/benchmarks/html/html_8_16K/*
loading 7450 files...
created src buffer of size 83.4 MB
split input into 7450 blocks
loading dictionary /home/zhuhan/benchmarks/html/html_8_16K.32K.dict
compressing at level 1 without dictionary : Ratio=3.03 (28827863 bytes)
compressed using a 32768 bytes dictionary : Ratio=4.28 (20410262 bytes)
generating 1 dictionaries, using 0.1 MB of memory
Compression Speed : 306.9 MB/s
Median Speed : 298.4 MB/s
```
Han Zhu [Tue, 19 Jul 2022 23:50:28 +0000 (16:50 -0700)]
[largeNbDicts] Print more metrics into csv file
Summary:
Add column headers and data for whether it's a compression or a decompression
run, compression level, nbDicts and dictAttachPref in addition to
compr/decompr speed.
Han Zhu [Tue, 19 Jul 2022 20:55:48 +0000 (13:55 -0700)]
[largeNbDicts] Fix decompression segfault in createCompressInstructions
Benchmarking decompression results in a segfault in `createCompressInstructions`
because `cctxParams` is NULL. Skip running that function if we are not benching
compression.
Yann Collet [Wed, 22 Jun 2022 01:14:11 +0000 (18:14 -0700)]
Streaming decompression can detect incorrect header ID sooner
Streaming decompression used to wait for a minimum of 5 bytes before attempting decoding.
This meant that, in the case that only a few bytes (<5) were provided,
and assuming these bytes are incorrect,
there would be no error reported.
The streaming API would simply request more data, waiting for at least 5 bytes.
This PR makes it possible to detect incorrect Frame IDs as soon as the first byte is provided.
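The idea of validating a partial header can be sketched as below. This is a simplified illustration and not the streaming API's code: `prefix_may_be_zstd_frame` is a hypothetical helper, and it only considers the standard frame magic number (0xFD2FB528, stored little-endian), ignoring skippable frames.

```c
#include <assert.h>
#include <stddef.h>

/* Standard zstd frame magic number 0xFD2FB528 in little-endian byte order. */
static const unsigned char kZstdMagicLE[4] = { 0x28, 0xB5, 0x2F, 0xFD };

/* Returns 0 as soon as any provided byte contradicts the magic number,
 * even when fewer than 4 bytes are available. Returns 1 otherwise. */
static int prefix_may_be_zstd_frame(const unsigned char* src, size_t size)
{
    size_t const toCheck = (size < sizeof(kZstdMagicLE)) ? size : sizeof(kZstdMagicLE);
    size_t i;
    assert(src != NULL || size == 0);
    for (i = 0; i < toCheck; i++) {
        if (src[i] != kZstdMagicLE[i]) return 0;
    }
    return 1;
}
```

With a single wrong first byte the helper already reports a mismatch, instead of asking for more input.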
Nick Terrell [Mon, 6 Jun 2022 18:56:13 +0000 (11:56 -0700)]
Remove expensive assert in --rsyncable hot loop
This assert slows the loop down by 10x. We can get similar
coverage by asserting at the beginning & end of the loop.
We need this fix because Debian compiles zstd with asserts
enabled. Separately, we should ask them why, and whether they
would consider disabling asserts in their builds, since we don't
optimize for assert-enabled builds.
Jun He [Mon, 23 May 2022 06:25:10 +0000 (14:25 +0800)]
dec: adjust seqSymbol load on aarch64
ZSTD_seqSymbol is a structure that is 64 bits wide in
total, so on aarch64 it can be loaded in one operation
and its fields extracted by simple shift or extract
instructions.
GCC doesn't recognize this and generates unnecessary
ldr/ldrb/ldrh operations that cause a performance
drop.
With this change, an uplift of 2~4% is observed on
silesia and 2.5~6% on cantrbry @L8 on Arm N1.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: I7748909204cf78a17eb9d4f2333692d53239daa8
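One way to encourage a single 64-bit load is to copy the whole entry into a local variable before reading its fields. The sketch below illustrates that pattern only: `DemoSeqSymbol` is a stand-in whose field names and layout are assumptions, and whether the compiler actually emits one load depends on the target and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a 64-bit wide table entry (layout is an assumption). */
typedef struct {
    uint16_t nextState;
    uint8_t  nbAdditionalBits;
    uint8_t  nbBits;
    uint32_t baseValue;
} DemoSeqSymbol;

/* Compile-time check that the entry really is 8 bytes. */
typedef char demo_seqsymbol_is_8_bytes[(sizeof(DemoSeqSymbol) == 8) ? 1 : -1];

/* Copies the whole entry at once; fields are then read from the local
 * copy instead of through several narrow loads from the table. */
static DemoSeqSymbol load_symbol(const DemoSeqSymbol* table, size_t state)
{
    DemoSeqSymbol local;
    assert(table != NULL);
    memcpy(&local, table + state, sizeof(local));
    return local;
}

static const DemoSeqSymbol kDemoTable[2] = {
    { 4, 0, 5, 3 },
    { 8, 2, 6, 35 }
};
```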
Jun He [Wed, 25 May 2022 14:26:41 +0000 (22:26 +0800)]
common: apply two stage copy to aarch64
On aarch64 ZSTD_wildcopy uses a simple loop to do
16B based memory copy. There is an existing optimized
two-stage copy that achieves better performance.
Applying it to aarch64 also yields a ~1% uplift on
the silesia corpus.
Signed-off-by: Jun He <jun.he@arm.com>
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be
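A two-stage copy can be sketched as below. This is an illustration with hypothetical names, not the library function: the first 16 bytes are copied unconditionally so short copies return early, and the remainder is copied 32 bytes per iteration. Like any "wild" copy it may write up to 31 bytes past the requested length, so both buffers are assumed to have at least 32 bytes of slack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_WILDCOPY_SLACK 32

static void copy16(unsigned char* dst, const unsigned char* src)
{
    memcpy(dst, src, 16);
}

/* Non-overlapping copy; may overwrite up to 31 bytes beyond dst + length. */
static void wildcopy_two_stage(unsigned char* dst, const unsigned char* src, size_t length)
{
    unsigned char* const end = dst + length;
    copy16(dst, src);                 /* stage 1: covers the common short case */
    if (length <= 16) return;
    dst += 16; src += 16;
    do {                              /* stage 2: 32 bytes per iteration */
        copy16(dst, src);
        copy16(dst + 16, src + 16);
        dst += 32; src += 32;
    } while (dst < end);
}

/* Copies `length` bytes between padded buffers and checks the result. */
static int wildcopy_roundtrip_ok(size_t length)
{
    unsigned char src[256 + DEMO_WILDCOPY_SLACK];
    unsigned char dst[256 + DEMO_WILDCOPY_SLACK];
    size_t i;
    assert(length <= 256);
    for (i = 0; i < sizeof(src); i++) src[i] = (unsigned char)(i * 7 + 1);
    memset(dst, 0, sizeof(dst));
    wildcopy_two_stage(dst, src, length);
    return memcmp(dst, src, length) == 0;
}
```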
Danila Kutenin [Sun, 22 May 2022 10:34:33 +0000 (10:34 +0000)]
[lazy] Optimize ZSTD_row_getMatchMask for level 8-10
We found that movemask is not used properly or consumes too much CPU.
This effort helps to optimize the movemask emulation on ARM.
For level 8-9 we saw 3-5% improvements. For level 10 we saw a 1.5%
improvement.
The key idea is not to use pure movemasks but to have groups of bits.
For rowEntries == 16, 32 we are going to have groups of size 4 and 2
respectively. It means that each bit will be duplicated within the group.
Then we do an AND to keep only one bit set per group, so that iterating
by clearing the lowest set bit, `a &= (a - 1)`, works as well.
Also, aarch64 does not have rotate instructions for 16 bit, only for 32
and 64, that's why we see more improvements for level 8-9.
vshrn_n_u16 instruction is used to achieve that: vshrn_n_u16 shifts by
4 every u16 and narrows to 8 lower bits. See the picture below. It's
also used in
[Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446).
It also uses 2 cycles according to Neoverse-N{1,2} guidelines.
64 bit movemask is already well optimized. We have ongoing experiments
but have not been able to validate that other implementations are reliably faster.
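The grouped-mask iteration described in this commit can be sketched in portable scalar C. This is an illustration for rowEntries == 16 only: the NEON compare and `vshrn_n_u16` step is replaced by a scalar loop, all names are hypothetical, and keeping the lowest bit of each group (mask `0x1111111111111111`) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ROW_ENTRIES 16
#define DEMO_GROUP_WIDTH 4   /* 64 bits / 16 entries */

/* Scalar stand-in for the vector step: entry i equal to `tag`
 * sets all 4 bits of group i. */
static uint64_t grouped_match_mask(const uint8_t* row, uint8_t tag)
{
    uint64_t mask = 0;
    unsigned i;
    assert(row != NULL);
    for (i = 0; i < DEMO_ROW_ENTRIES; i++) {
        if (row[i] == tag) mask |= (uint64_t)0xF << (i * DEMO_GROUP_WIDTH);
    }
    return mask;
}

/* Portable index of the lowest set bit. */
static unsigned lowest_set_bit(uint64_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while ((v & 1) == 0) { v >>= 1; n++; }
    return n;
}

/* Returns a plain 16-bit mask with bit i set when entry i matches. */
static uint16_t matching_entries(const uint8_t* row, uint8_t tag)
{
    /* AND keeps one bit per group, so `a &= (a - 1)` drops one entry per step. */
    uint64_t a = grouped_match_mask(row, tag) & 0x1111111111111111ULL;
    uint16_t entries = 0;
    while (a != 0) {
        unsigned const entry = lowest_set_bit(a) / DEMO_GROUP_WIDTH;
        entries = (uint16_t)(entries | (1u << entry));
        a &= (a - 1);
    }
    return entries;
}

static const uint8_t kDemoRow[DEMO_ROW_ENTRIES] = {
    7, 1, 2, 7, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 7
};
```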
W. Felix Handte [Tue, 10 May 2022 21:29:39 +0000 (14:29 -0700)]
ZSTD_fast_noDict: Minimize Checks When Writing Hash Table for ip1
This commit avoids checking whether a hashtable write is safe in two of the
three match-found paths in `ZSTD_compressBlock_fast_noDict_generic`. This
produces a ~0.5% speed-up in compression.
A comment in the code describes why we can skip this check in the other two
paths (the repcode check and the first match check in the unrolled loop).
A downside is that in the new position where we make this check, we have not
yet computed `mLength`. We therefore have to avoid writing *possibly* dangerous
positions, rather than the old check which only avoids writing *actually*
dangerous positions. This leads to a minuscule loss in ratio (remember that
this scenario can only be triggered in very negative levels or under
incompressibility acceleration).
Eli Schwartz [Thu, 28 Apr 2022 22:22:55 +0000 (18:22 -0400)]
meson: for internal linkage, link to both libzstd and a static copy of it
Partial, Meson-only implementation of #2976 for non-MSVC builds.
Due to the prevalence of private symbol reuse, linking to a shared
library is simply utterly unreliable, but we still want to defer to the
shared library for installable applications. By linking to both, we can
share symbols where possible, and statically link where needed.
This means we no longer need to manually track every file that needs to
be extracted and reused.
The flip side is that MSVC does not support this at all, so for MSVC
builds we just link to a full static copy even when
-Ddefault_library=shared is set.
As a side benefit, by using library inclusion rather than including
extra explicit object files, the zstd program shrinks in size slightly
(~4kb).