replaced by one vs2025 runner,
which is badly named since it is still running MSVC 2022,
but it's a good test that shows that the matrix is able to handle multiple MSVC versions.
AArch64: Add Neon path for convertSequences_noRepcodes
Add a 4-way Neon implementation for the convertSequences_noRepcodes
function. Remove 'static' keywords from all of its implementations to
be able to add unit tests.
Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`
Add a faster scalar implementation of ZSTD_get1BlockSummary which
removes the data dependency of the accumulators in the hot loop to
leverage the superscalar potential of recent out-of-order CPUs.
The new algorithm leverages SWAR (SIMD Within A Register) methodology
to exploit the capabilities of 64-bit architectures. It achieves this
by packing two 32-bit data elements into a single 64-bit register,
enabling parallel operations on these subcomponents while ensuring
that the 32-bit boundaries prevent overflow, thereby optimizing
computational efficiency.
Corresponding unit tests are included.
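A minimal sketch of the SWAR packing described above. The Seq and Sums types and the function name are hypothetical and are not the actual ZSTD_get1BlockSummary code; the sketch only shows how two 32-bit lanes can share one 64-bit accumulator without carrying into each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sequence record; only the two length fields matter here. */
typedef struct {
    uint16_t litLength;
    uint16_t matchLength;
} Seq;

typedef struct {
    uint32_t litSum;
    uint32_t matchSum;
} Sums;

/* Sums both length fields with a single 64-bit add per sequence.
 * Low 32 bits accumulate litLength, high 32 bits accumulate matchLength. */
static Sums sumLengthsSwar(const Seq* seqs, size_t n)
{
    uint64_t acc = 0;
    size_t i;
    Sums out;

    /* With 16-bit inputs, neither 32-bit lane can overflow for n <= 65536,
     * so no carry ever crosses the lane boundary. */
    assert(n <= 65536);

    for (i = 0; i < n; i++) {
        uint64_t const packed = (uint64_t)seqs[i].litLength
                              | ((uint64_t)seqs[i].matchLength << 32);
        acc += packed;  /* one add updates both lanes */
    }

    out.litSum   = (uint32_t)acc;          /* low lane  */
    out.matchSum = (uint32_t)(acc >> 32);  /* high lane */
    return out;
}
```

The bound on n is what keeps the lanes independent; a real implementation would need an equivalent guarantee from the format's own limits.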
Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`
Arpad Panyik [Tue, 24 Jun 2025 11:26:58 +0000 (11:26 +0000)]
AArch64: Improve ZSTD_decodeSequence performance
LLVM's alias-analysis sometimes fails to see that a static-array member
of a struct cannot alias other members. This patch:
- Reduces array accesses via struct indirection to aid load/store alias
analysis under Clang.
- Converts dynamic array indexing into conditional-move arithmetic,
eliminating branches and extra loads/stores on out-of-order CPUs.
- Reloads the bitstream only when match-length bits are consumed
(assuming each reload only needs to happen once per match-length
read), improving branch-prediction rates.
- Removes the UNLIKELY() hint, which recent compilers already handle
well without cost.
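As a rough illustration of the second item, the sketch below replaces a dynamically indexed load from a small array with a chain of selects. The function names are hypothetical and this is not the code from ZSTD_decodeSequence; whether a compiler actually emits conditional selects (CSEL on AArch64) for the second form depends on the compiler and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Indexed form: the value is loaded from memory at a run-time index. */
static uint32_t selectByLoad(const uint32_t vals[3], unsigned idx)
{
    assert(idx < 3);
    return vals[idx];
}

/* Select form: the three candidates are already in registers and the
 * result is chosen with comparisons, which compilers can typically
 * lower to conditional-select instructions instead of a load. */
static uint32_t selectByCompare(uint32_t v0, uint32_t v1, uint32_t v2,
                                unsigned idx)
{
    uint32_t r = v0;
    assert(idx < 3);
    r = (idx == 1) ? v1 : r;
    r = (idx == 2) ? v2 : r;
    return r;
}
```

Both functions return the same value for every valid index; only the way the value is produced differs.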
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
Arpad Panyik [Fri, 20 Jun 2025 15:29:17 +0000 (15:29 +0000)]
AArch64: Enhance struct access in Huffman decode 2X
In the multi-stream multi-symbol Huffman decoder GCC generates
suboptimal code - emitting more loads for HUF_DEltX2 struct member
accesses. Forcing it to use 32-bit loads and bit arithmetic to extract
the necessary parts (UBFX) improves the overall decode speed.
Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection in table lookup accesses.
On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.
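The single-load idea can be sketched as follows. The DEltX2 struct here is an assumed stand-in for HUF_DEltX2 (a 16-bit sequence followed by two 8-bit fields), the helper names are hypothetical, and the shifts assume a little-endian target; this is not the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout, modelled on a 4-byte table entry. */
typedef struct {
    uint16_t sequence;
    uint8_t  nbBits;
    uint8_t  length;
} DEltX2;

/* Returns 1 on little-endian targets; the extractors below assume this. */
static int isLittleEndian(void)
{
    uint32_t const one = 1;
    uint8_t first;
    memcpy(&first, &one, 1);
    return first == 1;
}

/* One 32-bit load of the whole entry instead of three member loads. */
static uint32_t loadEntry(const DEltX2* e)
{
    uint32_t w;
    assert(sizeof(DEltX2) == sizeof(w));
    memcpy(&w, e, sizeof(w));
    return w;
}

/* Field extraction by shift and mask (little-endian assumption).
 * On AArch64 each of these can map to a single UBFX. */
static uint16_t entrySequence(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
static uint8_t  entryNbBits(uint32_t w)   { return (uint8_t)((w >> 16) & 0xFFu); }
static uint8_t  entryLength(uint32_t w)   { return (uint8_t)(w >> 24); }
```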
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
Arpad Panyik [Wed, 11 Jun 2025 12:19:42 +0000 (12:19 +0000)]
Add unit tests for HIST_count_wksp
The following tests are included:
- Empty input scenario test.
- Workspace size and alignment tests.
- Symbol out-of-range tests.
- Cover multiple input sizes, vary permitted maximum symbol
values, and include diverse symbol distributions.
These tests verify count table correctness, maxSymbolValuePtr
updates, and error-handling paths. They also enable automated
regression testing of the core histogram logic.
jinyaoguo [Thu, 12 Jun 2025 23:52:58 +0000 (19:52 -0400)]
Ensure BMK_timedFnState is always freed in benchMem
When an error occurs in BMK_isSuccessful_runOutcome, the code
previously skipped the call to BMK_freeTimedFnState(tfs),
leaking the allocated tfs object.
Fixed by calling BMK_freeTimedFnState(tfs) before goto _cleanOut.
Arpad Panyik [Wed, 11 Jun 2025 12:14:22 +0000 (12:14 +0000)]
AArch64: Add SVE2 implementation of histogram computation
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions -
which compute a histogram for 16 byte chunks in a vector register.
On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram for the whole symbol space (0..255) of 16
bytes of input. However, we can only accumulate 15 such 16-byte strips
before possible overflow. So we need to extend and save the 8-bit
histogram accumulators to 16-bit after every 240-byte chunk of input.
To store all in registers we would need 32 128-bit registers. Longer
SVE2 vectors could help here, if such machines become available.
The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough. However, an LZ pass precedes the histogram
calculation, so it is impossible (my assumption) to overflow the 16-bit
accumulators.
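The accumulator arithmetic above can be modelled in scalar C. This is only a sketch of the widening schedule, not the SVE2 code: the function name is hypothetical, the 8-bit counters are a plain array rather than vector registers, and the input length is capped so that the 16-bit counters cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP_BYTES      16   /* bytes counted per strip */
#define STRIPS_PER_FLUSH 15   /* 15 * 16 = 240 <= 255, 16 * 16 = 256 would overflow */
#define FLUSH_BYTES      (STRIP_BYTES * STRIPS_PER_FLUSH)

/* Counts symbols with 8-bit counters and widens them into 16-bit
 * counters after every 240 bytes of input. */
static void histogramWidening(const uint8_t* src, size_t n, uint16_t wide[256])
{
    uint8_t narrow[256];
    size_t pos = 0;

    assert(n <= 65535);  /* keeps every 16-bit counter in range in this sketch */
    memset(wide, 0, 256 * sizeof(wide[0]));

    while (pos < n) {
        size_t chunk = n - pos;
        size_t i;
        unsigned s;

        if (chunk > FLUSH_BYTES) chunk = FLUSH_BYTES;

        /* At most 240 increments land on any one 8-bit counter. */
        memset(narrow, 0, sizeof(narrow));
        for (i = 0; i < chunk; i++) narrow[src[pos + i]]++;

        /* Extend and save: 8-bit partial counts into 16-bit totals. */
        for (s = 0; s < 256; s++) wide[s] = (uint16_t)(wide[s] + narrow[s]);

        pos += chunk;
    }
}
```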
The symbol distribution is also not uniform, the lower values are more
common, so we used a 3 pass algorithm to prevent stack spilling. In the
first pass we only compute histograms for 64 symbols (4-way SIMD) while
also computing the maximum symbol value. If we have symbol values
larger than 64 we start the second pass to compute the next 96 elements
of the histogram. The final pass calculates the remaining part of the
histogram (256 symbols in total) if needed. This split of histogram
generation gave the best overall results for performance.
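A scalar sketch of the three-pass split, for illustration only. The function name is hypothetical, the range boundaries (0..63, 64..159, 160..255) are read from the description above, and the real implementation uses SVE2 vectors rather than per-byte loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills count[256] in up to three passes over src and returns the
 * maximum symbol value seen (0 for empty input). */
static unsigned histogramThreePass(const uint8_t* src, size_t n,
                                   uint32_t count[256])
{
    unsigned maxSym = 0;
    size_t i;

    assert(src != NULL || n == 0);
    memset(count, 0, 256 * sizeof(count[0]));

    /* Pass 1: first 64 symbols, plus the maximum symbol value. */
    for (i = 0; i < n; i++) {
        unsigned const s = src[i];
        if (s > maxSym) maxSym = s;
        if (s < 64) count[s]++;
    }

    /* Pass 2: next 96 symbols, only if any symbol falls outside pass 1. */
    if (maxSym >= 64) {
        for (i = 0; i < n; i++) {
            unsigned const s = src[i];
            if (s >= 64 && s < 160) count[s]++;
        }
    }

    /* Pass 3: remaining symbols, only if needed. */
    if (maxSym >= 160) {
        for (i = 0; i < n; i++) {
            unsigned const s = src[i];
            if (s >= 160) count[s]++;
        }
    }

    return maxSym;
}
```

Inputs dominated by low symbol values finish after the first pass, which is the case the split is meant to favor.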
This implementation is the best performing of a number of different
cache blocking schemes tested.
Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":
Yann Collet [Sun, 8 Jun 2025 20:25:25 +0000 (20:25 +0000)]
ci: separate cmake tests into dedicated workflow file
- Create new .github/workflows/cmake-tests.yml with all cmake-related jobs
- Move cmake-build-and-test-check, cmake-source-directory-with-spaces, and cmake-visual-2022 jobs
- Remove cmake tests from dev-short-tests.yml to improve organization
- Maintain same trigger conditions and test configurations
- Add dedicated concurrency group for cmake tests
This separation allows cmake tests to run independently and makes
the CI configuration more modular and easier to maintain.
Dominik Loidolt [Thu, 5 Jun 2025 13:36:29 +0000 (15:36 +0200)]
fuzz: Fix FUZZ_malloc_rand() to return non-NULL for zero-size allocations
The FUZZ_malloc_rand() function was incorrectly always returning NULL for
zero-size allocations. The random offset generated by
FUZZ_dataProducer_int32Range() was not being added to the pointer variable,
causing the function to always return (void *)0.
jinyaoguo [Wed, 4 Jun 2025 22:08:11 +0000 (18:08 -0400)]
Release resources in error paths via cleanup
Replace direct returns in error-handling branches with a unified
cleanup block that frees allocated resources before returning,
improving code quality and robustness.
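The pattern can be sketched as below. All names are hypothetical and the allocation counter exists only so the example can be checked; the point is that every exit, success or failure, passes through the same cleanup label.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Instrumentation for this sketch only: number of live allocations. */
static int g_liveAllocs = 0;

static void* trackedMalloc(size_t n)
{
    void* const p = malloc(n);
    if (p != NULL) g_liveAllocs++;
    return p;
}

static void trackedFree(void* p)
{
    if (p != NULL) { g_liveAllocs--; free(p); }
}

/* Returns 0 on success, 1 on failure. failMidway simulates an error
 * that occurs after both buffers have been allocated. */
static int processBuffers(size_t size, int failMidway)
{
    int ret = 1;
    void* src = NULL;
    void* dst = NULL;

    assert(size > 0);

    src = trackedMalloc(size);
    if (src == NULL) goto cleanup;
    dst = trackedMalloc(size);
    if (dst == NULL) goto cleanup;

    if (failMidway) goto cleanup;  /* error path: no direct return */

    memset(src, 0, size);
    memcpy(dst, src, size);
    ret = 0;

cleanup:
    /* Single exit point; freeing NULL is a no-op here. */
    trackedFree(dst);
    trackedFree(src);
    return ret;
}
```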
jinyaoguo [Tue, 3 Jun 2025 19:28:11 +0000 (15:28 -0400)]
Release resources before returning
In main, resources were freed on the success path but not in the error path.
This change ensures all allocated resources are released before returning.
Dave Vasilevsky [Wed, 7 May 2025 04:10:10 +0000 (00:10 -0400)]
seekable_format: Fix race in parallel_processing
There was no memory barrier between writing and reading `done`, which
would allow reordering to cause races. With so little data to handle
after each job completes, we might as well just join.
Dave Vasilevsky [Wed, 7 May 2025 03:26:32 +0000 (23:26 -0400)]
seekable_format: Make parallel_compression use memory properly
Previously, parallel_compression would only handle each job's results
after ALL jobs were successfully queued. This caused all src/dst
buffers to remain in memory until then!
It also polled to check whether a job completed, which is racy without
any memory barrier.
Now, we flush results as a side effect of completing a job. Completed
frames are placed in an ordered linked-list, and any eligible frames
are flushed. This may be zero or multiple frames, depending on the
order in which jobs finish.
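A minimal single-threaded sketch of the ordered flush, under the assumption that frames carry consecutive ids starting at 0. The Frame and FlushQueue types and the function names are hypothetical; the real code would hold a lock around this logic and write the frame's compressed bytes where the sketch records the id.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Frame {
    unsigned id;           /* position of the frame in the output */
    struct Frame* next;
} Frame;

typedef struct {
    Frame* head;           /* completed frames, sorted by id */
    unsigned nextToFlush;  /* id of the next frame allowed to be written */
} FlushQueue;

/* Inserts a completed frame, keeping the list ordered by id. */
static void insertSorted(FlushQueue* q, Frame* f)
{
    Frame** link = &q->head;
    while (*link != NULL && (*link)->id < f->id) link = &(*link)->next;
    f->next = *link;
    *link = f;
}

/* Called when a job completes. Flushes every frame that is now
 * eligible and returns how many were flushed (possibly zero). */
static unsigned completeFrame(FlushQueue* q, Frame* f,
                              unsigned* flushedIds, unsigned capacity)
{
    unsigned n = 0;
    assert(q != NULL && f != NULL);
    insertSorted(q, f);
    while (q->head != NULL && q->head->id == q->nextToFlush && n < capacity) {
        Frame* const done = q->head;
        q->head = done->next;
        flushedIds[n++] = done->id;  /* stand-in for writing the frame */
        q->nextToFlush++;
    }
    return n;
}
```

If jobs finish in the order 2, 0, 1, the three calls flush zero, one, and two frames respectively, and the output order is still 0, 1, 2.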
This design also makes it simple to support streaming input, so that
is now available. Just pass `-` as the filename, and stdin/stdout will
be used for I/O.
After the update to macOS 15.4, the dynamic loader dyld treats duplicated LC_RPATH as an error.
The `FLAGS` variable already contains `LDFLAGS`, thus using both `FLAGS` and `LDFLAGS`
duplicates all `LDFLAGS`, including `-Wl,rpath` parameters.
The duplicate LC_RPATH causes errors of this kind: