git.ipfire.org Git - thirdparty/zstd.git/commit

author	Nick Terrell <terrelln@meta.com>
	Mon, 20 Nov 2023 19:33:57 +0000 (11:33 -0800)
committer	Nick Terrell <nickrterrell@gmail.com>
	Mon, 20 Nov 2023 22:13:01 +0000 (17:13 -0500)
commit	5ab78c0418dd2b77e76e8350a563b9771a424b27
tree	5af6a23988a4b888574b6c72ac5b3e44ad49b23f	tree \| snapshot
parent	c7269add7eaf028ed828d9af41e732cf01993aad	commit \| diff

[huf] Improve fast C & ASM performance on small data

* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
is 0. Initially, I thought that when it got towards the end, the
computation of how many iterations of safe might become expensive. But
it ends up being slower to have to decode each of the 4 streams
individually, which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in #3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |

lib/decompress/huf_decompress.c		diff \| blob \| blame \| history
lib/decompress/huf_decompress_amd64.S		diff \| blob \| blame \| history