* perf improvements for zstd decode
tldr: 7.5% average decode speedup on the silesia corpus at compression levels 1-3 (Sandy Bridge)
Background: while investigating zstd perf differences between clang and gcc, I noticed that even though gcc vectorizes the loop in wildcopy, it does not do so as well as hand-written code. The sites where wildcopy is invoked have an interesting distribution of lengths to be copied: the loop trip count is rarely above 1, yet long copies are common enough to make their performance important. The code in zstd_decompress.c that invokes wildcopy handles the latter well, but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization.
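To make the trade-off concrete, here is a standalone, hypothetical sketch of the copy strategy this diff adopts (names and the 16-byte threshold mirror the patch below, but this is an illustration, not the zstd code): short or closely-overlapping copies proceed 8 bytes at a time, everything else 16 bytes at a time, with over-copy bounded at 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative wildcopy: may write up to 8 bytes past dst+length,
 * so callers must reserve slack after the destination. */
static void copy8 (void* d, const void* s) { memcpy(d, s,  8); }
static void copy16(void* d, const void* s) { memcpy(d, s, 16); }

static void wildcopy_sketch(void* dst, const void* src, ptrdiff_t length)
{
    const uint8_t* ip = (const uint8_t*)src;
    uint8_t* op = (uint8_t*)dst;
    uint8_t* const oend = op + length;
    ptrdiff_t const diff = op - ip;

    if (length < 16 || (diff > 0 && diff < 16)) {
        /* short or closely-overlapping copy: 8 bytes per iteration */
        do { copy8(op, ip); op += 8; ip += 8; } while (op < oend);
    } else {
        /* when bit 3 of length is clear, one extra 8-byte copy keeps
         * the over-copy of the 16-byte loop bounded by 8 bytes */
        if ((length & 8) == 0) { copy8(op, ip); op += 8; ip += 8; }
        do { copy16(op, ip); op += 16; ip += 16; } while (op < oend);
    }
}
```

Because the trip count is rarely above 1, the 8-byte branch usually runs once; the 16-byte branch keeps long copies fast without the vectorizer's startup checks.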
See how GCC autovectorizes the loop here:
https://godbolt.org/z/apr0x0
Here is the code after this diff has been applied (left-hand side is the optimized version, right is with the vectorizer on):
After: https://godbolt.org/z/OwO4F8
Note that autovectorization still does not do a good job on the optimized version, so it is turned off via attribute and flag. I found that neither the attribute nor the command-line flag alone was entirely successful in turning off vectorization, which is why both are used.
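The attribute half is the DONT_VECTORIZE macro added in this diff; a minimal standalone demo (the summing function is illustrative, not zstd code):

```c
#include <stddef.h>

/* Same guard as in the patch: clang defines __GNUC__ too, but rejects
 * GCC's optimize attribute, so clang must be excluded explicitly.
 * The Makefile additionally passes -fno-tree-vectorize for the one
 * object file, since neither mechanism alone proved sufficient. */
#if !defined(__clang__) && defined(__GNUC__)
#  define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#  define DONT_VECTORIZE
#endif

/* Illustrative loop that GCC would normally auto-vectorize at -O3;
 * with the attribute it stays scalar. */
DONT_VECTORIZE
static long sum_bytes(const unsigned char* p, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++) s += p[i];
    return s;
}
```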
silesia benchmark data - second triad of each file is with the original code:

level#file      orig ->  compressed (ratio)   encode       decode       change
1#dickens   10192446 ->  4268865 (2.388),  198.9 MB/s,  709.6 MB/s
2#dickens   10192446 ->  3876126 (2.630),  128.7 MB/s,  552.5 MB/s
3#dickens   10192446 ->  3682956 (2.767),  104.6 MB/s,  537   MB/s
1#dickens   10192446 ->  4268865 (2.388),  195.4 MB/s,  659.5 MB/s,  7.60%
2#dickens   10192446 ->  3876126 (2.630),  127   MB/s,  516.3 MB/s,  7.01%
3#dickens   10192446 ->  3682956 (2.767),  105   MB/s,  479.5 MB/s, 11.99%
1#mozilla   51220480 -> 20117517 (2.546),  285.4 MB/s,  734.9 MB/s
2#mozilla   51220480 -> 19067018 (2.686),  220.8 MB/s,  686.3 MB/s
3#mozilla   51220480 -> 18508283 (2.767),  152.2 MB/s,  669.4 MB/s
1#mozilla   51220480 -> 20117517 (2.546),  283.4 MB/s,  697.9 MB/s,  5.30%
2#mozilla   51220480 -> 19067018 (2.686),  225.9 MB/s,  665   MB/s,  3.20%
3#mozilla   51220480 -> 18508283 (2.767),  154.5 MB/s,  640.6 MB/s,  4.50%
1#mr         9970564 ->  3840242 (2.596),  262.4 MB/s,  899.8 MB/s
2#mr         9970564 ->  3600976 (2.769),  181.2 MB/s,  717.9 MB/s
3#mr         9970564 ->  3563987 (2.798),  116.3 MB/s,  620   MB/s
1#mr         9970564 ->  3840242 (2.596),  253.2 MB/s,  827.3 MB/s,  8.76%
2#mr         9970564 ->  3600976 (2.769),  177.4 MB/s,  655.4 MB/s,  9.54%
3#mr         9970564 ->  3563987 (2.798),  111.2 MB/s,  564.2 MB/s,  9.89%
1#nci       33553445 ->  2849306 (11.78),  575.2 MB/s, 1335.8 MB/s
2#nci       33553445 ->  2890166 (11.61),  509.3 MB/s, 1238.1 MB/s
3#nci       33553445 ->  2857408 (11.74),  431   MB/s, 1210.7 MB/s
1#nci       33553445 ->  2849306 (11.78),  565.4 MB/s, 1220.2 MB/s,  9.47%
2#nci       33553445 ->  2890166 (11.61),  508.2 MB/s, 1128.4 MB/s,  9.72%
3#nci       33553445 ->  2857408 (11.74),  429.1 MB/s, 1097.7 MB/s, 10.29%
1#ooffice    6152192 ->  3590954 (1.713),  231.4 MB/s,  662.6 MB/s
2#ooffice    6152192 ->  3323931 (1.851),  162.8 MB/s,  592.6 MB/s
3#ooffice    6152192 ->  3145625 (1.956),   99.9 MB/s,  549.6 MB/s
1#ooffice    6152192 ->  3590954 (1.713),  224.7 MB/s,  624.2 MB/s,  6.15%
2#ooffice    6152192 ->  3323931 (1.851),  155   MB/s,  564.5 MB/s,  4.98%
3#ooffice    6152192 ->  3145625 (1.956),  101.1 MB/s,  521.2 MB/s,  5.45%
1#osdb      10085684 ->  3739042 (2.697),  271.9 MB/s,  876.4 MB/s
2#osdb      10085684 ->  3493875 (2.887),  208.2 MB/s,  857   MB/s
3#osdb      10085684 ->  3515831 (2.869),  135.3 MB/s,  805.4 MB/s
1#osdb      10085684 ->  3739042 (2.697),  257.4 MB/s,  793.8 MB/s, 10.41%
2#osdb      10085684 ->  3493875 (2.887),  209.7 MB/s,  776.1 MB/s, 10.42%
3#osdb      10085684 ->  3515831 (2.869),  130.6 MB/s,  727.7 MB/s, 10.68%
1#reymont    6627202 ->  2152771 (3.078),  198.9 MB/s,  696.2 MB/s
2#reymont    6627202 ->  2071140 (3.200),  170   MB/s,  595.2 MB/s
3#reymont    6627202 ->  1953597 (3.392),  128.5 MB/s,  609.7 MB/s
1#reymont    6627202 ->  2152771 (3.078),  199.6 MB/s,  655.2 MB/s,  6.26%
2#reymont    6627202 ->  2071140 (3.200),  168.2 MB/s,  554.4 MB/s,  7.36%
3#reymont    6627202 ->  1953597 (3.392),  128.7 MB/s,  557.4 MB/s,  9.38%
1#samba     21606400 ->  5510994 (3.921),  338.1 MB/s, 1066   MB/s
2#samba     21606400 ->  5240208 (4.123),  258.7 MB/s,  992.3 MB/s
3#samba     21606400 ->  5003358 (4.318),  200.2 MB/s,  991.1 MB/s
1#samba     21606400 ->  5510994 (3.921),  330.8 MB/s,  974   MB/s,  9.45%
2#samba     21606400 ->  5240208 (4.123),  257.9 MB/s,  919.4 MB/s,  7.93%
3#samba     21606400 ->  5003358 (4.318),  198.5 MB/s,  908.9 MB/s,  9.04%
1#sao        7251944 ->  6256401 (1.159),  194.6 MB/s,  602.2 MB/s
2#sao        7251944 ->  5808761 (1.248),  128.2 MB/s,  532.1 MB/s
3#sao        7251944 ->  5556318 (1.305),   73   MB/s,  509.4 MB/s
1#sao        7251944 ->  6256401 (1.159),  198.7 MB/s,  580.7 MB/s,  3.70%
2#sao        7251944 ->  5808761 (1.248),  129.1 MB/s,  502.7 MB/s,  5.85%
3#sao        7251944 ->  5556318 (1.305),   74.6 MB/s,  493.1 MB/s,  3.31%
1#webster   41458703 -> 13692222 (3.028),  222.3 MB/s,  752   MB/s
2#webster   41458703 -> 12842646 (3.228),  157.6 MB/s,  532.2 MB/s
3#webster   41458703 -> 12191964 (3.400),  124   MB/s,  468.5 MB/s
1#webster   41458703 -> 13692222 (3.028),  219.7 MB/s,  697   MB/s,  7.89%
2#webster   41458703 -> 12842646 (3.228),  153.9 MB/s,  495.4 MB/s,  7.43%
3#webster   41458703 -> 12191964 (3.400),  124.8 MB/s,  444.8 MB/s,  5.33%
1#xml        5345280 ->   696652 (7.673),  485   MB/s, 1333.9 MB/s
2#xml        5345280 ->   681492 (7.843),  405.2 MB/s, 1237.5 MB/s
3#xml        5345280 ->   639057 (8.364),  328.5 MB/s, 1281.3 MB/s
1#xml        5345280 ->   696652 (7.673),  473.1 MB/s, 1232.4 MB/s,  8.24%
2#xml        5345280 ->   681492 (7.843),  398.6 MB/s, 1145.9 MB/s,  7.99%
3#xml        5345280 ->   639057 (8.364),  327.1 MB/s, 1175   MB/s,  9.05%
1#x-ray      8474240 ->  6772557 (1.251),  521.3 MB/s,  762.6 MB/s
2#x-ray      8474240 ->  6684531 (1.268),  230.5 MB/s,  688.5 MB/s
3#x-ray      8474240 ->  6166679 (1.374),   68.7 MB/s,  478.8 MB/s
1#x-ray      8474240 ->  6772557 (1.251),  502.8 MB/s,  736.7 MB/s,  3.52%
2#x-ray      8474240 ->  6684531 (1.268),  224.4 MB/s,  662   MB/s,  4.00%
3#x-ray      8474240 ->  6166679 (1.374),   67.3 MB/s,  437.8 MB/s,  9.37%

average decode speedup: 7.51%
* makefile changed to only pass -fno-tree-vectorize to gcc
* Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__)
* fix for warning/error with subtraction of void* pointers
* fix c90 conformance issue - ISO C90 forbids mixed declarations and code
* Fix assert for negative diff, only when there is no overlap
* fix overflow revealed in fuzzing tests
* tweak for small speed increase
LIBVER_PATCH := $(shell echo $(LIBVER_PATCH_SCRIPT))
LIBVER := $(shell echo $(LIBVER_SCRIPT))
VERSION?= $(LIBVER)
+CCVER := $(shell $(CC) --version)
CPPFLAGS+= -I. -I./common -DXXH_NAMESPACE=ZSTD_
ifeq ($(OS),Windows_NT) # MinGW assumed
ZDEPR_FILES := $(sort $(wildcard deprecated/*.c))
ZSTD_FILES := $(ZSTDCOMMON_FILES)
+ifeq ($(findstring GCC,$(CCVER)),GCC)
+decompress/zstd_decompress_block.o : CFLAGS+=-fno-tree-vectorize
+endif
+
ZSTD_LEGACY_SUPPORT ?= 5
ZSTD_LIB_COMPRESSION ?= 1
ZSTD_LIB_DECOMPRESSION ?= 1
} \
}
+/* vectorization */
+#if !defined(__clang__) && defined(__GNUC__)
+# define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
+#else
+# define DONT_VECTORIZE
+#endif
+
/* disable warnings */
#ifdef _MSC_VER /* Visual Studio */
# include <intrin.h> /* For Visual 2005 */
#endif
#include "xxhash.h" /* XXH_reset, update, digest */
-
#if defined (__cplusplus)
extern "C" {
#endif
* Shared functions to include for inlining
*********************************************/
static void ZSTD_copy8(void* dst, const void* src) { memcpy(dst, src, 8); }
+
#define COPY8(d,s) { ZSTD_copy8(d,s); d+=8; s+=8; }
+static void ZSTD_copy16(void* dst, const void* src) { memcpy(dst, src, 16); }
+#define COPY16(d,s) { ZSTD_copy16(d,s); d+=16; s+=16; }
+
+#define WILDCOPY_OVERLENGTH 8
+#define VECLEN 16
+
+typedef enum {
+ ZSTD_no_overlap,
+ ZSTD_overlap_src_before_dst,
+ /* ZSTD_overlap_dst_before_src, */
+} ZSTD_overlap_e;
/*! ZSTD_wildcopy() :
* custom version of memcpy(), can overwrite up to WILDCOPY_OVERLENGTH bytes (if length==0) */
-#define WILDCOPY_OVERLENGTH 8
-MEM_STATIC void ZSTD_wildcopy(void* dst, const void* src, ptrdiff_t length)
+MEM_STATIC FORCE_INLINE_ATTR DONT_VECTORIZE
+void ZSTD_wildcopy(void* dst, const void* src, ptrdiff_t length, ZSTD_overlap_e ovtype)
{
+ ptrdiff_t diff = (BYTE*)dst - (const BYTE*)src;
const BYTE* ip = (const BYTE*)src;
BYTE* op = (BYTE*)dst;
BYTE* const oend = op + length;
- do
- COPY8(op, ip)
- while (op < oend);
+
+ assert(diff >= 8 || (ovtype == ZSTD_no_overlap && diff < -8));
+ if (length < VECLEN || (ovtype == ZSTD_overlap_src_before_dst && diff < VECLEN)) {
+ do
+ COPY8(op, ip)
+ while (op < oend);
+ }
+ else {
+ if ((length & 8) == 0)
+ COPY8(op, ip);
+ do {
+ COPY16(op, ip);
+ }
+ while (op < oend);
+ }
+}
+
+/*! ZSTD_wildcopy_16min() :
+ * same semantics as ZSTD_wildcopy() except guaranteed to be able to copy 16 bytes at the start */
+MEM_STATIC FORCE_INLINE_ATTR DONT_VECTORIZE
+void ZSTD_wildcopy_16min(void* dst, const void* src, ptrdiff_t length, ZSTD_overlap_e ovtype)
+{
+ ptrdiff_t diff = (BYTE*)dst - (const BYTE*)src;
+ const BYTE* ip = (const BYTE*)src;
+ BYTE* op = (BYTE*)dst;
+ BYTE* const oend = op + length;
+
+ assert(length >= 8);
+ assert(diff >= 8 || (ovtype == ZSTD_no_overlap && diff < -8));
+
+ if (ovtype == ZSTD_overlap_src_before_dst && diff < VECLEN) {
+ do
+ COPY8(op, ip)
+ while (op < oend);
+ }
+ else {
+ if ((length & 8) == 0)
+ COPY8(op, ip);
+ do {
+ COPY16(op, ip);
+ }
+ while (op < oend);
+ }
}
MEM_STATIC void ZSTD_wildcopy_e(void* dst, const void* src, void* dstEnd) /* should be faster for decoding, but strangely, not verified on all platforms */
/* copy Literals */
assert(seqStorePtr->maxNbLit <= 128 KB);
assert(seqStorePtr->lit + litLength <= seqStorePtr->litStart + seqStorePtr->maxNbLit);
- ZSTD_wildcopy(seqStorePtr->lit, literals, litLength);
+ ZSTD_wildcopy(seqStorePtr->lit, literals, litLength, ZSTD_no_overlap);
seqStorePtr->lit += litLength;
/* literal Length */
if (oLitEnd>oend_w) return ZSTD_execSequenceLast7(op, oend, sequence, litPtr, litLimit, prefixStart, virtualStart, dictEnd);
/* copy Literals */
- ZSTD_copy8(op, *litPtr);
if (sequence.litLength > 8)
- ZSTD_wildcopy(op+8, (*litPtr)+8, sequence.litLength - 8); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */
+ ZSTD_wildcopy_16min(op, (*litPtr), sequence.litLength, ZSTD_no_overlap); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */
+ else
+ ZSTD_copy8(op, *litPtr);
op = oLitEnd;
*litPtr = iLitEnd; /* update for next sequence */
if (oMatchEnd > oend-(16-MINMATCH)) {
if (op < oend_w) {
- ZSTD_wildcopy(op, match, oend_w - op);
+ ZSTD_wildcopy(op, match, oend_w - op, ZSTD_overlap_src_before_dst);
match += oend_w - op;
op = oend_w;
}
while (op < oMatchEnd) *op++ = *match++;
} else {
- ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength-8); /* works even if matchLength < 8 */
+ ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength-8, ZSTD_overlap_src_before_dst); /* works even if matchLength < 8 */
}
return sequenceLength;
}
if (oLitEnd > oend_w) return ZSTD_execSequenceLast7(op, oend, sequence, litPtr, litLimit, prefixStart, dictStart, dictEnd);
/* copy Literals */
- ZSTD_copy8(op, *litPtr); /* note : op <= oLitEnd <= oend_w == oend - 8 */
if (sequence.litLength > 8)
- ZSTD_wildcopy(op+8, (*litPtr)+8, sequence.litLength - 8); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */
+ ZSTD_wildcopy_16min(op, *litPtr, sequence.litLength, ZSTD_no_overlap); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */
+ else
+ ZSTD_copy8(op, *litPtr); /* note : op <= oLitEnd <= oend_w == oend - 8 */
+
op = oLitEnd;
*litPtr = iLitEnd; /* update for next sequence */
if (oMatchEnd > oend-(16-MINMATCH)) {
if (op < oend_w) {
- ZSTD_wildcopy(op, match, oend_w - op);
+ ZSTD_wildcopy(op, match, oend_w - op, ZSTD_overlap_src_before_dst);
match += oend_w - op;
op = oend_w;
}
while (op < oMatchEnd) *op++ = *match++;
} else {
- ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength-8); /* works even if matchLength < 8 */
+ ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength-8, ZSTD_overlap_src_before_dst); /* works even if matchLength < 8 */
}
return sequenceLength;
}
}
FORCE_INLINE_TEMPLATE size_t
+DONT_VECTORIZE
ZSTD_decompressSequences_body( ZSTD_DCtx* dctx,
void* dst, size_t maxDstSize,
const void* seqStart, size_t seqSize, int nbSeq,
#ifndef ZSTD_FORCE_DECOMPRESS_SEQUENCES_LONG
static TARGET_ATTRIBUTE("bmi2") size_t
+DONT_VECTORIZE
ZSTD_decompressSequences_bmi2(ZSTD_DCtx* dctx,
void* dst, size_t maxDstSize,
const void* seqStart, size_t seqSize, int nbSeq,