git.ipfire.org Git - thirdparty/zstd.git/commit
Improve ZSTD_get1BlockSummary
author Arpad Panyik <Arpad.Panyik@arm.com>
Tue, 8 Jul 2025 17:05:45 +0000 (17:05 +0000)
committer Arpad Panyik <Arpad.Panyik@arm.com>
Thu, 10 Jul 2025 18:20:49 +0000 (18:20 +0000)
commit 8e4400463adc7bc7633641d6a485cfef4f28bc31
tree 56649b11ebbab9b170c7075d39e279df4c986768
parent 1dbc2e09084e843f6c0dcc2d0791610015c50979

Add a faster scalar implementation of ZSTD_get1BlockSummary that
removes the data dependency between the accumulators in the hot loop,
so that recent out-of-order superscalar CPUs can execute more of the
loop in parallel.
The new algorithm uses SWAR (SIMD Within A Register) on 64-bit
architectures: it packs two 32-bit data elements into a single 64-bit
register and operates on both halves at once, with the 32-bit lane
boundaries keeping one half from overflowing into the other.
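
The SWAR idea described above can be illustrated with a minimal sketch.
This is not the code from this commit: the `Seq` and `Sums` types and the
`sum_scalar`, `pack2x32` and `sum_swar` helpers are hypothetical names
chosen for illustration, and the sketch assumes the sum of the low-lane
values fits in 32 bits so that no carry crosses into the high lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record with two 32-bit lengths (illustrative only). */
typedef struct {
    uint32_t litLength;
    uint32_t matchLength;
} Seq;

typedef struct {
    uint32_t litSum;
    uint32_t matchSum;
} Sums;

/* Baseline: two separate 32-bit accumulators, one add per field. */
static Sums sum_scalar(const Seq* s, size_t n)
{
    Sums r = { 0, 0 };
    size_t i;
    for (i = 0; i < n; i++) {
        r.litSum += s[i].litLength;
        r.matchSum += s[i].matchLength;
    }
    return r;
}

/* Pack two 32-bit values into one 64-bit word:
 * lo occupies bits 0..31, hi occupies bits 32..63. */
static uint64_t pack2x32(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* SWAR variant: a single 64-bit add updates both sums at once.
 * Two accumulators (acc0, acc1) are kept so that consecutive adds
 * do not depend on each other.
 * ASSUMPTION: the total of all litLength values fits in 32 bits,
 * so the low lane never carries into the high lane. */
static Sums sum_swar(const Seq* s, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;
    uint64_t acc;
    Sums r;
    size_t i = 0;

    assert(s != NULL || n == 0);

    for (; i + 2 <= n; i += 2) {
        acc0 += pack2x32(s[i].litLength, s[i].matchLength);
        acc1 += pack2x32(s[i + 1].litLength, s[i + 1].matchLength);
    }
    if (i < n) {  /* odd element left over */
        acc0 += pack2x32(s[i].litLength, s[i].matchLength);
    }

    acc = acc0 + acc1;                   /* merge the two chains */
    r.litSum = (uint32_t)acc;            /* low lane */
    r.matchSum = (uint32_t)(acc >> 32);  /* high lane */
    return r;
}
```

Under the stated assumption, `sum_swar` returns the same totals as
`sum_scalar` for the same input.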

Corresponding unit tests are included.

Performance relative to the GCC-13 "before" build on each CPU,
measured with: `./fullbench -b19 -l5 enwik5`

Neoverse-V2   before     after
GCC-13:      100.000%  290.527%
GCC-14:      100.000%  291.714%
GCC-15:       99.914%  291.495%
Clang-18:    148.072%  264.524%
Clang-19:    148.075%  264.512%
Clang-20:    148.062%  264.490%

Cortex-A720   before     after
GCC-13:      100.000%  235.261%
GCC-14:      101.064%  234.903%
GCC-15:      112.977%  218.547%
Clang-18:    127.135%  180.359%
Clang-19:    127.149%  180.297%
Clang-20:    127.154%  180.260%

Co-authored-by: Thomas Daubney <Thomas.Daubney@arm.com>
lib/compress/zstd_compress.c
tests/fuzzer.c