Remove usage of aligned alloc implementations and instead use malloc
and handle alignment internally. We already always have to do those
checks because we have to support external alloc implementations.
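A minimal sketch of the internal-alignment approach, assuming the usual over-allocate-and-stash-pointer technique (not necessarily zlib-ng's exact code; aligned_alloc_fallback and aligned_free_fallback are illustrative names):
```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate with plain malloc(), round the pointer up to the requested
 * (power-of-two) alignment, and stash the original pointer just before the
 * aligned block so it can be freed later. */
static void *aligned_alloc_fallback(size_t alignment, size_t size) {
    void *raw = malloc(size + alignment + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)addr)[-1] = raw;            /* remember the malloc() pointer */
    return (void *)addr;
}

static void aligned_free_fallback(void *ptr) {
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```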
LFF [Sat, 7 Jun 2025 08:23:29 +0000 (16:23 +0800)]
Optimize chunkcopy_rvv:
1. Skip the aligning memcpy when dist >= len.
The aligning memcpy is redundant when dist >= len and only adds very
slow extra load & store instructions. By adding a printf in
chunkcopy_rvv and running apt install (a narrow case, but one that
makes sense), I noticed that dist is much larger than len in most
cases. So the comparison is moved before the aligning memcpy, since
that memcpy is only needed in the overlapping case.
2. Make the largest copy while len > dist.
Chunkcopy_rvv only copies as much memory as possible once after the
aligning memcpy, then uses sizeof(chunk_t)-sized copies to finish the
rest. Instead, we should keep doing the largest safe copy for as long
as len > dist (see the sketch below).
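A minimal sketch of both points, using a plain-C stand-in for the RVV chunk copy (overlap_copy is an illustrative name, not the actual chunkcopy_rvv code):
```c
#include <stddef.h>
#include <string.h>

/* Assumes out > from (dist >= 1). The output of an overlapping LZ77
 * back-reference is periodic with period dist, so copying dist bytes from
 * the fixed source doubles the non-overlapping span each iteration. */
static unsigned char *overlap_copy(unsigned char *out, const unsigned char *from, size_t len) {
    size_t dist = (size_t)(out - from);
    while (len > dist) {          /* overlapping case: do the largest safe copy */
        memcpy(out, from, dist);
        out += dist;
        len -= dist;
        dist *= 2;
    }
    memcpy(out, from, len);       /* dist >= len: no overlap, one plain copy */
    return out + len;
}
```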
Adam Stylinski [Tue, 11 Mar 2025 01:17:25 +0000 (21:17 -0400)]
SSE4.1 optimized chorba
This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main reason
is that, while this approach incurs far fewer shifts, it has to manage an
entirely separate stack buffer that is the size of the L1 cache on most CPUs.
This was one of the main reasons the 32k specialized function was slower for
the scalar counterpart, despite auto-vectorizing: the auto-vectorized loop was
setting up the stack buffer at unaligned offsets, which is detrimental to
performance pre-Nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.
There are a ton of loads and stores happening, and for sure we are bound
on the fill buffer + store forwarding. An SSE2 version of this code is
probably possible by simply replacing the shifts with unpacks with zero
and the palignr's with shufpd's. I'm just not sure it'll be all that worth
it, though. We are gating against SSE4.1 not because we are using specifically
a 4.1 instruction but because that marks when Wolfdale came out and palignr
became a lot faster.
Improve the speed of sub-16 byte matches by first using a 128-bit
intrinsic; after that, use only 512-bit intrinsics.
This requires us to overlap on the last run, but that is cheaper than
processing the tail with a 256-bit and then a 128-bit run (see the sketch below).
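A minimal sketch of the overlap-the-last-run idea, using memcmp over 64-byte blocks as a stand-in for the 512-bit intrinsic runs (blocks_equal is an illustrative name, not the actual match code):
```c
#include <stddef.h>
#include <string.h>

static int blocks_equal(const unsigned char *a, const unsigned char *b, size_t n) {
    if (n < 64)                       /* short inputs: stand-in for the 128-bit path */
        return memcmp(a, b, n) == 0;
    size_t i = 0;
    while (i + 64 <= n) {             /* full "512-bit" runs */
        if (memcmp(a + i, b + i, 64) != 0)
            return 0;
        i += 64;
    }
    /* Overlap the final run with the previous one instead of finishing the
     * tail with smaller 256-bit and 128-bit runs. */
    return i == n || memcmp(a + n - 64, b + n - 64, 64) == 0;
}
```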
Change benchmark steps to avoid hitting the chunk boundaries of one function
or the other as much; this gives fairer benchmarks.
Speed up benchmarks when run as part of gtest: since that run does not check
data for correctness, run each benchmark for only 1 iteration instead of
thousands or hundreds of thousands.
Add a separate CI step to crashtest benchmarks without collecting any coverage data.
Activate benchmarks in more arches.
Disable some warnings to avoid errors when compiling Google Benchmark.
Remove separate benchmark CI job, now included in other jobs instead.
Reduce development burden by getting rid of NMake files that are manually
kept up to date. For continued NMake support please generate NMake project
files using CMake.
Pass POSIX_C_SOURCE for std::aligned_alloc try_compile checks
On FreeBSD 11, defining POSIX_C_SOURCE to a lower level has the effect of limiting the language level (__ISO_C_VISIBLE) to below C11, even in the presence of -std=c11.
Since check_symbol_exists runs without setting POSIX_C_SOURCE, we would spuriously define HAVE_ALIGNED_ALLOC, while in the actual build it is not going to be defined.
Adam Stylinski [Sun, 16 Feb 2025 17:13:00 +0000 (12:13 -0500)]
Explicit SSE2 vectorization of Chorba CRC method
The version that's currently in the generic implementation for 32768
byte buffers leverages the stack. It manages to autovectorize but
unfortunately the trips to the stack hurt its performance for CPUs which
need this the most. This version is explicitly SIMD vectorized and
doesn't use trips to the stack. In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
Icenowy Zheng [Mon, 24 Mar 2025 08:50:37 +0000 (16:50 +0800)]
riscv: chunkset_rvv: fix SIGSEGV in CHUNKCOPY
The chunkset_tpl comment allows negative dist (out - from) as long as
the length is smaller than the absolute value of dist (i.e. memory does
not overlap). However this case is currently broken in the RVV override
of CHUNKCOPY -- it compares dist (a ptrdiff_t, a value the same size as
size_t but signed) with the result of sizeof (a size_t), which triggers an
implicit conversion from signed to unsigned and thus loses negative values.
Since the memory is guaranteed not to overlap when dist is negative, just use
one giant memcpy() call to copy everything (see the sketch of the pitfall below).
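A minimal, self-contained sketch of the pitfall (the values and the sizeof operand are illustrative, not the actual RVV code):
```c
#include <stdio.h>
#include <stddef.h>

int main(void) {
    ptrdiff_t dist = -8;                      /* negative: "from" lies after "out" */

    if (dist < (ptrdiff_t)sizeof(double))     /* signed compare: true, as expected */
        printf("signed compare sees the negative dist\n");

    if (dist < sizeof(double))                /* buggy: dist converts to size_t, so false */
        printf("unsigned compare never fires for negative dist\n");

    return 0;
}
```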
Adam Stylinski [Tue, 25 Mar 2025 21:58:19 +0000 (17:58 -0400)]
Fix a bug on the 32k and greater chorba specializations
In testing a SIMD vectorization for this, I wrote a gtest which stumbled
onto the fact that this had a bug on big endian: the initial CRC needed to be
byte swapped before being mixed in.
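A minimal sketch of the missing step, assuming the initial CRC is XORed into a little-endian word layout (byteswap32 and mix_initial_crc are illustrative, not the actual chorba code):
```c
#include <stdint.h>

static inline uint32_t byteswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

static void mix_initial_crc(uint64_t *buf, uint32_t crc) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    crc = byteswap32(crc);   /* the swap that was missing on big endian */
#endif
    buf[0] ^= crc;
}
```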
Icenowy Zheng [Tue, 25 Mar 2025 08:23:31 +0000 (16:23 +0800)]
ci: drop RISC-V Clang test
The SiFive GitHub organization now deploys an IP allowlist that blocks
GitHub Actions, which makes this test always fail. In addition, this is
quite a different test from the other non-x86 tests.
Disable MSVC optimizations for AVX512 GET_CHUNK_MAG #1883
Older versions of the MSVC compiler (VS 17.11.x) incorrectly optimize the
GET_CHUNK_MAG code; this appears to be resolved in VS 17.13.2. The compiler
would optimize the code in such a way that it caused a decompression failure.
It only happens when the /Os flag is set (see the sketch below).
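One common way to disable MSVC optimizations locally is the optimize pragma; a sketch under that assumption (the function body and guard are illustrative, not the exact zlib-ng change):
```c
#if defined(_MSC_VER) && !defined(__clang__)
#  pragma optimize("", off)   /* keep /Os from rearranging this function */
#endif

static int get_chunk_mag_stub(int x) {
    /* ...code that older MSVC would mis-optimize under /Os... */
    return x;
}

#if defined(_MSC_VER) && !defined(__clang__)
#  pragma optimize("", on)
#endif
```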
ports: Use memalign or _aligned_malloc when available. Fall back to malloc
Using "_WIN32" to decide whether the MSVC extensions _aligned_malloc /
_aligned_free are available is a bug that breaks other compilers on Windows
(OpenWatcom, for example).
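A minimal sketch of selecting the allocator by compiler/libc capability rather than by _WIN32 (HAVE_MEMALIGN is a hypothetical config macro and port_aligned_alloc an illustrative name):
```c
#include <stdlib.h>
#if defined(_MSC_VER) || defined(HAVE_MEMALIGN)
#  include <malloc.h>
#endif

static void *port_aligned_alloc(size_t alignment, size_t size) {
#if defined(_MSC_VER)
    return _aligned_malloc(size, alignment);   /* MSVC/UCRT extension */
#elif defined(HAVE_MEMALIGN)
    return memalign(alignment, size);
#else
    (void)alignment;
    return malloc(size);                       /* fallback: align manually if needed */
#endif
}
```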
Adam Stylinski [Sat, 30 Nov 2024 17:01:28 +0000 (12:01 -0500)]
Fold a copy into the adler32 function for UPDATEWINDOW for neon
A lot of alterations had to be made just to keep this from being worse, and
so far it's not really better, either. I had to force inlining for the adler
routine, remove the x4 load instruction because it otherwise stalled
pipelining, and use restrict pointers with a copy idiom so that GCC would
inline a copy routine for the tail (see the sketch below).
Still, we see a small benefit in benchmarks, particularly with buffers the
size of our window or larger. There's also an added benefit that
this will fix #1824.
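A minimal sketch of the restrict copy idiom mentioned above (copy_tail is an illustrative name):
```c
#include <stddef.h>

/* With both pointers restrict-qualified, the compiler can treat the loop as
 * a plain non-overlapping copy and inline or vectorize it for the short tail. */
static inline void copy_tail(unsigned char *restrict dst,
                             const unsigned char *restrict src, size_t len) {
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}
```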
Clean up internal crc32 function handling.
Mark crc32_c and crc32_braid functions as internal, and remove prefix.
Reorder contents of generic_functions, and remove Z_INTERNAL hints from declarations.
Add test/benchmark output to indicate whether Chorba is used.
Clean up crc32_braid.
- Rename N and W to BRAID_N and BRAID_W
- Remove override capabilities for BRAID_N and BRAID_W
- Fix formatting in crc32_braid_tbl.h
- Make makecrct not rely on crc32_braid_p.h
Adam Stylinski [Mon, 3 Feb 2025 02:05:37 +0000 (21:05 -0500)]
Fix an unfortunate bug with Visual Studio 2015
Evidently this instruction, despite the intrinsic having a register operand,
is a memory-register instruction. There seems to be no alignment requirement
for the source operand. Because of this, compilers, when not optimizing, do
the unaligned load and then dump it back to the stack in order to do the
broadcasting load.
In doing this, MSVC seems to be dumping to the stack with an aligned move at an
unaligned address, causing a segfault. GCC does not seem to make this mistake, as
it stashes to an aligned address.
If we're on Visual Studio 2015, let's just do the longer 9-cycle sequence of a
128-bit load followed by a vinserti128. This _should_ fix this (issue #1861);
see the sketch below.
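A sketch of the workaround, assuming the problematic intrinsic is the 128-bit broadcast (_mm256_broadcastsi128_si256) and using _MSC_VER < 1910 as a rough stand-in for "Visual Studio 2015":
```c
#include <immintrin.h>

static inline __m256i broadcast128(const void *src) {
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
#if defined(_MSC_VER) && _MSC_VER < 1910
    /* VS2015: explicit 128-bit load followed by vinserti128 */
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), lo, 1);
#else
    return _mm256_broadcastsi128_si256(lo);
#endif
}
```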
Eduard Stefes [Tue, 21 Jan 2025 09:48:07 +0000 (10:48 +0100)]
Disable CRC32-VX Extension for some Clang versions
We have to disable the CRC32-VX implementation for some Clang versions
(18 <= version < 19.1.2) that generate bad code for the IBM S390 VGFMA intrinsics.
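A sketch of the version gate expressed as a preprocessor check (the actual gating may live in the build system, and NO_CRC32_VX is a hypothetical macro name):
```c
/* Disable CRC32-VX when 18 <= Clang version < 19.1.2 */
#if defined(__clang__) && __clang_major__ >= 18 && \
    (__clang_major__ < 19 || \
     (__clang_major__ == 19 && \
      (__clang_minor__ < 1 || (__clang_minor__ == 1 && __clang_patchlevel__ < 2))))
#  define NO_CRC32_VX 1
#endif
```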
Dmitry Kurtaev [Wed, 15 Jan 2025 17:28:44 +0000 (20:28 +0300)]
Work around error G6E97C40B
Warning treated as an error with GCC from Ubuntu 24.04:
```
/home/runner/work/dotnet_riscv/dotnet_riscv/runtime/src/native/external/zlib-ng/arch/riscv/riscv_features.c(25,33): error G6E97C40B: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses] [/home/runner/work/dotnet_riscv/dotnet_riscv/runtime/src/native/libs/build-native.proj]
```
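A minimal sketch of the kind of change that silences -Wparentheses (the condition is illustrative, not the actual riscv_features.c expression):
```c
static int has_feature(int a, int b, int c) {
    /* before: `return a && b || c;` triggers -Wparentheses */
    return (a && b) || c;   /* make the precedence explicit */
}
```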
Sam James [Thu, 9 Jan 2025 11:36:40 +0000 (11:36 +0000)]
cmake: disable LTO for some configure checks
Some of zlib-ng's configure tests define a function expecting it to be compiled but
don't call that function, or don't use its return value. This is risky with
LTO where the whole thing may be optimised out, which has happened before:
* https://github.com/zlib-ng/zlib-ng/issues/1616
* https://github.com/zlib-ng/zlib-ng/pull/1622
* https://gitlab.kitware.com/cmake/cmake/-/issues/26103