Since we long ago made unaligned reads safe (by using memcpy or intrinsics),
it is time to replace the UNALIGNED_OK checks, which have since really only been
used to select the optimal comparison sizes for the arch.
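A minimal sketch of the memcpy idiom mentioned above, assuming an 8-byte comparison width. The helper names `load_u64_unaligned` and `match_len` are hypothetical and are not taken from zlib-ng.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes from any address, aligned or not.
 * Going through memcpy avoids undefined behavior from a misaligned
 * pointer cast; compilers commonly lower it to a single load. */
static inline uint64_t load_u64_unaligned(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Count how many leading bytes of a and b are equal, up to max.
 * Compares 8 bytes at a time, then finishes byte by byte. */
static size_t match_len(const unsigned char *a, const unsigned char *b, size_t max) {
    size_t n = 0;
    while (n + 8 <= max && load_u64_unaligned(a + n) == load_u64_unaligned(b + n))
        n += 8;
    while (n < max && a[n] == b[n])
        n++;
    return n;
}
```

Because the load is safe at any alignment, the comparison width can be chosen per architecture purely for speed.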
Adam Stylinski [Sat, 30 Nov 2024 14:23:28 +0000 (09:23 -0500)]
Improve pipelining for AVX512 chunking
For reasons that aren't quite so clear, using the masked writes here
did not pipeline very well. Either setting up the mask stalled things
or masked moves have issues overlapping regular moves. Simply putting
the masked moves behind a branch that is rarely taken seemed to do the
trick in improving the ILP. While here, put masked loads behind the same
branch in case there were ever a hazard for overreading.
Adam Stylinski [Thu, 28 Nov 2024 00:00:52 +0000 (19:00 -0500)]
Enable AVX2 functions to be built with BMI2 instructions
While these are technically different instructions, no such CPU exists
that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to
eliminate several flag stalls by having flagless versions of shifts, and
allows us to not clobber and move around GPRs so much in scalar code.
There's usually a sizeable benefit for enabling it. Since we're building
with BMI2 for AVX2 functions, let's also just make sure the CPU claims
to support it (just to cover our bases).
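The extra check can be sketched as follows, assuming the caller has already read EBX from CPUID leaf 7, subleaf 0 (AVX2 is bit 5 and BMI2 is bit 8 there). The function name `avx2_path_usable` is hypothetical, and the OS-level check that YMM state is enabled is deliberately left out.

```c
#include <stdint.h>

/* Feature bits in EBX of CPUID leaf 7, subleaf 0. */
#define CPUID7_EBX_AVX2 (1u << 5)
#define CPUID7_EBX_BMI2 (1u << 8)

/* Hypothetical helper: since the AVX2 functions are compiled with BMI2
 * enabled, only report them as usable when the CPU claims both. */
static int avx2_path_usable(uint32_t cpuid7_ebx) {
    const uint32_t need = CPUID7_EBX_AVX2 | CPUID7_EBX_BMI2;
    return (cpuid7_ebx & need) == need;
}
```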
Adam Stylinski [Thu, 28 Nov 2024 19:05:32 +0000 (14:05 -0500)]
Fix native detection of CRC instruction
It's unclear if raspberry pi OS's shipped GCC doesn't properly detect
ACLE or not (/proc/cpuinfo claims to support AES), but in any case, the
preprocessor macro for that flag is not defined with -march=native on a
raspberry pi 5. Unfortunately that means when built "WITH_NATIVE", we do
not get a fast CRC function. The CRC32 preprocessor macro _IS_ defined,
and the auto detection when built without NATIVE support does properly
get dispatched to. Since we only need the scalar CRC32 and not the polynomial
stuff anyhow, let's make it be an || condition and not a && one.
Pavel P [Wed, 27 Nov 2024 21:13:34 +0000 (23:13 +0200)]
Fix casting warning/error in test_compress_bound.cc
Fixes the following error when building with msvc compiler
```
test_compress_bound.cc
D:\zlib-ng\test\test_compress_bound.cc(41,50): error C2220: the following warning is treated as an error
D:\zlib-ng\test\test_compress_bound.cc(41,50): warning C4267: 'argument': conversion from 'size_t' to 'unsigned long', possible loss of data
D:\zlib-ng\test\test_compress_bound.cc(43,68): warning C4267: 'argument': conversion from 'size_t' to 'unsigned long', possible loss of data
```
Adam Stylinski [Wed, 25 Sep 2024 21:56:36 +0000 (17:56 -0400)]
Make an AVX512 inflate fast with low cost masked writes
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the branching
needed here by simply controlling the bounds of our writes with a mask.
Adam Stylinski [Thu, 12 Sep 2024 21:47:30 +0000 (17:47 -0400)]
Make chunkset_avx2 half chunk aware
This gives us appreciable gains on a number of fronts. The first being
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
distance that is smaller, so CHUNKCOPY gets its teeth back here for
smaller sizes, without having to do another dispatch to a function.
We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.
Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
Adam Stylinski [Wed, 11 Sep 2024 22:34:54 +0000 (18:34 -0400)]
Simplify avx2 chunkset a bit
Put length 16 in the length checking ladder and take care of it there
since it's also a simple case to handle. We kind of went out of our way
to pretend 128 bit vectors didn't exist when using avx2 but this can be
handled in a single instruction. Strangely the intrinsic uses vector
register operands but the instruction itself assumes a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.
Adam Stylinski [Thu, 3 Oct 2024 21:17:44 +0000 (17:17 -0400)]
Compute the "safe" distance properly
The safe pointer that is computed is an exclusive, not inclusive, bound.
While we were probably rarely bitten by this, if ever, it still makes
sense to apply the limit properly.
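To illustrate the off-by-one at stake, here is a small sketch assuming the bound points one past the last writable byte. The function `copy_bounded` and its parameters are hypothetical and do not mirror the actual zlib-ng code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most len bytes to out without writing to
 * out_end or anything after it. out_end is an exclusive bound, so the
 * room left is (out_end - out), with no "+ 1". */
static size_t copy_bounded(unsigned char *out, const unsigned char *from,
                           size_t len, const unsigned char *out_end) {
    size_t room = (size_t)(out_end - out);
    if (len > room)
        len = room;
    memcpy(out, from, len);
    return len;
}
```

Treating the same pointer as inclusive would allow one extra byte to be written at `out_end` itself.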
Adam Stylinski [Sun, 15 Sep 2024 16:23:50 +0000 (12:23 -0400)]
Simplify chunking in the copy ladder here
As it turns out, trying to peel off the remainder with so many branches
inflated the code size so much that this function wouldn't inline
without some fairly aggressive optimization flags. Only catching vector
sized chunks here makes the loop body small enough, and having the byte
by byte copy idiom at the bottom gives the compiler some flexibility to
do something sensible there.
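The shape described above might look roughly like this, assuming non-overlapping buffers and using a 16-byte memcpy as a stand-in for a vector load/store pair. `CHUNK` and `copy_chunks_then_bytes` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 16 /* assumed stand-in for the vector width */

/* Hypothetical helper: copy whole chunks in a small loop, then finish
 * with a plain byte loop instead of a ladder of remainder branches.
 * Source and destination must not overlap. Returns the new out pointer. */
static unsigned char *copy_chunks_then_bytes(unsigned char *out,
                                             const unsigned char *from,
                                             size_t len) {
    while (len >= CHUNK) {
        memcpy(out, from, CHUNK);
        out += CHUNK;
        from += CHUNK;
        len -= CHUNK;
    }
    while (len--)
        *out++ = *from++;
    return out;
}
```

Keeping only two loops keeps the body small, which is what makes inlining more likely.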
Fix override of CMAKE_C_STANDARD, CMAKE_C_STANDARD_REQUIRED, CMAKE_C_EXTENSIONS. False value is allowed for CMAKE_C_STANDARD_REQUIRED and CMAKE_C_EXTENSIONS.
When building with CMake toolchain provided by NDK, the ARCH variable is
not "aarch64", but "aarch64-none-linux-android26" (or similar). The
strict string match check causes the WITH_ARMV6 option to be enabled in
such a case. As a result, arch/arm/slide_hash_armv6.c is compiled, which
is not intended to be used on aarch64, and fails.
Relax the check and assume aarch64 if the ARCH variable contains aarch64.
If the output buffer and the window buffer are the same
memory allocation, we cannot make the assumptions that chunkunroll
does, that it is okay to overwrite the output buffer.
Compiling zlib-ng with glibc 2.17 (minimum version still supported by
crosstool-ng) fails due to the lack of HWCAP_S390_VX - it was
introduced in glibc 2.23.
Strictly speaking, this is a problem with the feature detection logic
in cmake. However, it's not worth disabling the s390x vectorized CRC32
if the hwcap constant is missing and the compiler intrinsics are
available.
So fix by hardcoding the constant. It's a part of the kernel ABI,
which does not change.
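A sketch of the fallback, assuming the value 2048, which is what newer glibc headers define for HWCAP_S390_VX as far as I know. The helper `s390_has_vx` is hypothetical; on a real system the argument would come from `getauxval(AT_HWCAP)`.

```c
/* Provide the constant when the libc headers are too old to define it.
 * The value 2048 is an assumption based on newer glibc headers. */
#ifndef HWCAP_S390_VX
#  define HWCAP_S390_VX 2048
#endif

/* Hypothetical helper: test the vector-facility bit in an AT_HWCAP value. */
static int s390_has_vx(unsigned long hwcap) {
    return (hwcap & HWCAP_S390_VX) != 0;
}
```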
Harmen Stoppels [Fri, 28 Jun 2024 11:21:33 +0000 (13:21 +0200)]
don't use zlib-ng's -Wl,--version-script in tests (#1750)
lld 18 errors when a version script assigns a version to a symbol that
is not defined in the object files. Therefore configure scripts should
not use zlib-ng's version script -- all tests will fail.
Also test whether the linker supports the flag instead of assuming.
Rewrite inflate memory allocation.
Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.
- Now, we allocate state and a 32K window during init, allowing the latency cost
of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running init function.
- After init has succeeded, inflate will no longer possibly fail due to a failing malloc.
Rewrite deflate memory allocation.
Deflate used to call alloc 5 times during init.
- 5 calls to external alloc function now become 1
- Handling alignment of allocated buffers is simplified
- Efforts to align the allocated buffer now need to happen only once.
- Individual buffers are ordered so that they have natural sequential alignment.
- Due to reduced losses to alignment, we allocate less memory in total.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running init function.
- Removed need for extra padding after window, chunked reads can now go beyond the window
buffer without causing a segfault.
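The single-allocation scheme described in the two rewrites above can be sketched like this. The struct `bufs`, the 64-byte alignment, and the buffer names are assumptions for illustration and do not reflect the real zlib-ng layout.

```c
#include <stdint.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t);
typedef void (*free_fn)(void *);

/* Hypothetical state: one raw allocation carved into aligned buffers,
 * plus the free function that matches the alloc used at init time. */
typedef struct {
    void *base;            /* pointer returned by alloc, passed to release */
    free_fn release;       /* stored at init so later changes cannot mismatch */
    unsigned char *window;
    uint16_t *head;
} bufs;

/* Round n up to a multiple of 64. */
static size_t round_up64(size_t n) {
    return (n + 63) & ~(size_t)63;
}

static int bufs_init(bufs *b, alloc_fn alloc, free_fn release,
                     size_t window_size, size_t head_entries) {
    size_t head_off = round_up64(window_size);
    /* 63 bytes of slack let us align the start of the block once. */
    size_t total = head_off + head_entries * sizeof(uint16_t) + 63;
    unsigned char *raw = alloc(total);
    if (raw == NULL)
        return -1;
    size_t pad = (64 - (size_t)((uintptr_t)raw & 63)) & 63;
    b->base = raw;
    b->release = release;
    b->window = raw + pad;                        /* 64-byte aligned */
    b->head = (uint16_t *)(b->window + head_off); /* stays aligned */
    return 0;
}

static void bufs_free(bufs *b) {
    b->release(b->base); /* always the free paired with the original alloc */
    b->base = NULL;
}
```

Ordering the buffers so each offset is already a multiple of the alignment means the padding cost is paid only once, at the start of the block.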
Fix illegal instruction usage in Xeon Phi x200 processors
The Xeon Phi x200 family of processors (Knights Landing) supports
AVX512 (F, CD, ER, PF) but does not support AVX512 (VL, DQ, BW).
Because of processors like this, the Intel Software Developer's Manual
suggests the bits AVX512 (DQ,BW,VL) are also tested in EBX together with
AVX512F before deciding to run AVX512 (DQ,BW,VL) instructions.
This also adds a new x86 feature called avx512_common that indicates
that AVX512 (F,DQ,BW,VL) are all available and starts using this for both
adler32_avx512 and crc32_vpclmulqdq implementations because they are
both built with -mavx512dq -mavx512bw -mavx512vl.
This has been reported downstream as
https://bugzilla.redhat.com/show_bug.cgi?id=2280347 .
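A sketch of the combined check, assuming the caller has already read EBX from CPUID leaf 7, subleaf 0, where AVX512F is bit 16, DQ is bit 17, BW is bit 30 and VL is bit 31. The function name `has_avx512_common` is hypothetical, and the OS-level check for ZMM state support is omitted.

```c
#include <stdint.h>

/* Feature bits in EBX of CPUID leaf 7, subleaf 0. */
#define CPUID7_EBX_AVX512F  (1u << 16)
#define CPUID7_EBX_AVX512DQ (1u << 17)
#define CPUID7_EBX_AVX512BW (1u << 30)
#define CPUID7_EBX_AVX512VL (1u << 31)

/* Hypothetical helper: true only when F, DQ, BW and VL are all present,
 * which excludes parts such as Knights Landing that have F but lack
 * DQ, BW and VL. */
static int has_avx512_common(uint32_t cpuid7_ebx) {
    const uint32_t need = CPUID7_EBX_AVX512F | CPUID7_EBX_AVX512DQ |
                          CPUID7_EBX_AVX512BW | CPUID7_EBX_AVX512VL;
    return (cpuid7_ebx & need) == need;
}
```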
Update s390x CI setup.
- New dockerfile
- Using native actions-runner instead of relying on qemu.
- To support s390x, we include patches to actions-runner.
- Using Almalinux 9 instead of Ubuntu, with functional .Net.
- Update CI workflow.
- Update readme guide.
Currently the DFLTCC sanitizer instrumentation is limited to
MSAN-unpoisoning the parameter block. Add ASAN and MSAN checks;
also MSAN-unpoison the window.
Introduce the generic instrument_read(), instrument_write() and
instrument_read_write() macros, that are modeled after the respective
functions in the Linux kernel.
Deniz Bahadir [Wed, 24 Apr 2024 14:37:34 +0000 (16:37 +0200)]
.gitattributes: Enforce LF line-endings on all non-binary files
Although Git is able to automatically modify line-endings during checkin
and checkout, this brings a lot of trouble, especially when trying to
use a repository from different platforms (such as Windows and Linux). This
is due to the fact that Git consults different local gitconfig settings
and uses problematic defaults if not set.
Therefore, Git should enforce one type of line-ending (LF) and not
consult the local config, which is what the change from this commit
does.
Signed-off-by: Deniz Bahadir <deniz@code.bahadir.email>
Matt McCormick [Tue, 9 Apr 2024 14:44:07 +0000 (10:44 -0400)]
Bump max CMake policy version to 3.29.0
Addresses:
3.29.0/share/cmake/Modules/CMakeDependentOption.cmake:89 (message):
Policy CMP0127 is not set: cmake_dependent_option() supports full Condition
Syntax. Run "cmake --help-policy CMP0127" for policy details. Use the
cmake_policy command to set the policy and suppress this warning.
Call Stack (most recent call first):
CMakeLists.txt:107 (cmake_dependent_option)
This warning is for project developers. Use -Wno-dev to suppress it.
/home/autobuild/autobuild/instance-2/output-1/build/zlib-ng-2.1.6/arch/riscv/riscv_features.c:4:10: fatal error: sys/auxv.h: No such file or directory
4 | #include <sys/auxv.h>
| ^~~~~~~~~~~~
Deniz Bahadir [Fri, 5 Apr 2024 20:37:11 +0000 (22:37 +0200)]
CMake: Replace ';' by '$<SEMICOLON>' in generator-expression
Note: CMake generator-expressions should not contain semicolons,
especially if they might end up in a CMake list, because a semicolon
would be interpreted as list-item separator and therefore render the
generator-expression invalid. The generator-expression `$<SEMICOLON>`
should be used instead.
Signed-off-by: Deniz Bahadir <deniz@code.bahadir.email>
Mika Lindqvist [Sun, 25 Feb 2024 14:42:43 +0000 (16:42 +0200)]
[ARM] Override Clang x4 NEON intrinsics for Android
* Clang for Android requires 256-bit alignment for x4 loads and stores, which can't be guaranteed and is unnecessary