AArch64: Enhance struct access in Huffman decode 2X
In the multi-stream multi-symbol Huffman decoder, GCC generates
suboptimal code, emitting more loads than necessary for HUF_DEltX2
struct member accesses. Forcing it to use 32-bit loads and bit
arithmetic (UBFX) to extract the necessary fields improves the overall
decode speed.
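As a rough illustration of the idea (not the actual patch), the sketch
below reads a whole 4-byte table entry with one 32-bit load and pulls the
fields out with shifts and masks, which a compiler may lower to UBFX on
AArch64. The struct DEltX2 is a hypothetical stand-in whose layout
(16-bit sequence, 8-bit nbBits, 8-bit length) is assumed here, as are the
helper names; the field extraction assumes a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for HUF_DEltX2; the field layout is an assumption. */
typedef struct {
    uint16_t sequence;  /* up to two decoded bytes */
    uint8_t  nbBits;    /* bits consumed from the stream */
    uint8_t  length;    /* number of decoded symbols (1 or 2) */
} DEltX2;

/* Read the whole entry with a single 32-bit load instead of
 * one load per struct member. */
static uint32_t load_entry(const DEltX2 *table, size_t idx)
{
    uint32_t entry;
    assert(sizeof(DEltX2) == sizeof(entry));
    memcpy(&entry, &table[idx], sizeof(entry));
    return entry;
}

/* Little-endian only: extract each field from the loaded word. */
static uint16_t entry_sequence(uint32_t e) { return (uint16_t)(e & 0xFFFFu); }
static uint8_t  entry_nbBits(uint32_t e)   { return (uint8_t)((e >> 16) & 0xFFu); }
static uint8_t  entry_length(uint32_t e)   { return (uint8_t)(e >> 24); }
```

The memcpy keeps the load free of aliasing problems; compilers typically
turn it into a single load instruction at -O2 and above.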
Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection in table lookup accesses.
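The text does not say which conversions were removed, so the following
is only a generic sketch of the kind of difference involved. Both lookups
return the same value; the first narrows the table index to int, which
may require a sign extension before the address computation, while the
second keeps the index at pointer width. All names here (peek_index,
lookup_narrow, lookup_wide) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: take the top tableLog bits of a 64-bit bit
 * container as the table index. Requires 1 <= tableLog <= 31. */
static size_t peek_index(uint64_t container, unsigned tableLog)
{
    return (size_t)(container >> (64 - tableLog));
}

/* Index narrowed to int: the compiler may have to extend it back to
 * 64 bits before it can be used in the address computation. */
static uint8_t lookup_narrow(const uint8_t *table, uint64_t container,
                             unsigned tableLog)
{
    int idx = (int)peek_index(container, tableLog);
    return table[idx];
}

/* Index kept at pointer width: it can feed the load directly. */
static uint8_t lookup_wide(const uint8_t *table, uint64_t container,
                           unsigned tableLog)
{
    size_t idx = peek_index(container, tableLog);
    return table[idx];
}
```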
On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
                Clang-20  Clang-*   GCC-13    GCC-14    GCC-15
1#silesia.tar:  +0.820%   +1.365%   +2.480%   +1.348%   +0.987%
2#silesia.tar:  +0.426%   +0.784%   +1.218%   +0.665%   +0.554%
3#silesia.tar:  +0.112%   +0.389%   +0.508%   +0.188%   +0.261%
* Requires Clang-21 support from LLVM commit hash
  `a53003fe23cb6c871e72d70ff2d3a075a7490da2`
  (Clang-21 hasn't been released as of this writing)