git.ipfire.org Git - thirdparty/zlib-ng.git/commit

author	Adam Stylinski <kungfujesus06@gmail.com>
	Tue, 11 Mar 2025 01:17:25 +0000 (21:17 -0400)
committer	Hans Kristian Rosbach <hk-github@circlestorm.org>
	Tue, 15 Apr 2025 12:11:12 +0000 (14:11 +0200)
commit	46fc33f39d0de9b85330c275b26391645a04ffa5
tree	c584a95758b96341d47c64a2e1595bf8eebc2748
parent	5a232688e1bc5ce41bc1f0e11cedaa0c6c643c8b

SSE4.1 optimized chorba

This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main
reason has to do with the stack buffer: while this variant incurs far fewer
shifts, it has to manage an entirely separate stack buffer that is the size
of the L1 cache on most CPUs. This was one of the main reasons the 32k
specialized function was slower for the scalar counterpart, despite
auto-vectorizing. The auto-vectorized loop was setting up the stack buffer
at unaligned offsets, which is detrimental to performance pre-Nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.

There are a ton of loads and stores happening, and we are almost certainly
bound on the fill buffers and store forwarding. An SSE2 version of this code
is probably possible by simply replacing the shifts with unpacks against
zero and the palignr instructions with shufpd. I'm just not sure it would be
worth it, though. We gate on SSE4.1 not because we use any SSE4.1-specific
instruction, but because SSE4.1 marks the arrival of Wolfdale, when palignr
became a lot faster.
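
The palignr-to-shufpd substitution mentioned above can be sketched for the
8-byte case. This is a minimal illustration, not code from this commit. It
assumes the palignr offset is 8 bytes, since a 64-bit-lane shuffle can only
reproduce offsets that are whole lanes. The names vec128, model_alignr8,
shufpd_alignr8, and alignr8_forms_agree are hypothetical. The palignr side is
a scalar reference model so that the sketch builds without SSSE3; the shufpd
side uses the real SSE2 intrinsic when the compiler defines __SSE2__ and
falls back to the model otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Two 64-bit lanes of a 128-bit register; q[0] is the low lane. */
typedef struct { uint64_t q[2]; } vec128;

/* Scalar reference model of palignr with an 8-byte offset, i.e.
 * _mm_alignr_epi8(hi, lo, 8): concatenate hi:lo, shift right by
 * 8 bytes, keep the low 128 bits. */
static vec128 model_alignr8(vec128 hi, vec128 lo) {
    vec128 r;
    r.q[0] = lo.q[1];   /* high lane of lo becomes the low lane  */
    r.q[1] = hi.q[0];   /* low lane of hi becomes the high lane  */
    return r;
}

/* The same lane movement expressed with shufpd (SSE2 only). */
static vec128 shufpd_alignr8(vec128 hi, vec128 lo) {
#if defined(__SSE2__)
    __m128d l = _mm_castsi128_pd(_mm_loadu_si128((const __m128i *)lo.q));
    __m128d h = _mm_castsi128_pd(_mm_loadu_si128((const __m128i *)hi.q));
    /* imm = 1: result low lane = l's high lane,
     *          result high lane = h's low lane. */
    __m128d s = _mm_shuffle_pd(l, h, 1);
    vec128 r;
    _mm_storeu_si128((__m128i *)r.q, _mm_castpd_si128(s));
    return r;
#else
    /* No SSE2 available at compile time: fall back to the model. */
    return model_alignr8(hi, lo);
#endif
}

/* Returns 1 when both forms produce the same 128-bit result. */
static int alignr8_forms_agree(void) {
    vec128 lo = {{0x1111111111111111ULL, 0x2222222222222222ULL}};
    vec128 hi = {{0x3333333333333333ULL, 0x4444444444444444ULL}};
    vec128 a = model_alignr8(hi, lo);
    vec128 b = shufpd_alignr8(hi, lo);
    /* Sanity check on the model itself. */
    assert(a.q[0] == lo.q[1] && a.q[1] == hi.q[0]);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Calling alignr8_forms_agree() from a test harness should return 1 when the
two forms move lanes identically. Offsets that are not a multiple of 8 bytes
would need a different sequence and are not covered by this sketch.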

CMakeLists.txt
arch/x86/Makefile.in
arch/x86/chorba_sse41.c	[new file with mode: 0644]
arch/x86/x86_features.c
arch/x86/x86_features.h
arch/x86/x86_functions.h
cmake/detect-intrinsics.cmake
configure
functable.c
test/benchmarks/benchmark_crc32.cc
test/test_crc32.cc