git.ipfire.org Git - thirdparty/zlib-ng.git/commit

author	Adam Stylinski <kungfujesus06@gmail.com>
	Sat, 12 Feb 2022 15:26:50 +0000 (10:26 -0500)
committer	Hans Kristian Rosbach <hk-github@circlestorm.org>
	Thu, 24 Feb 2022 15:00:51 +0000 (16:00 +0100)
commit	43dbfd6709fb3a8028430ea30f3da88fbeb3ced9
tree	fc81dcb2d4fa40edb7b86f3296e1f92bbf561001
parent	f7d284c68b08560a6903fe88c7bc63fedfbbc421

Improved adler32 NEON performance by 30-47%

We unlocked some ILP by keeping independent sums in the loop and reducing
them outside of the loop. Additionally, the multiplication by 32 (now 64)
is moved outside of this loop. Similar to the chromium implementation, this
code does straight 8 bit -> 16 bit additions and defers the fused multiply
accumulate until after the loop. However, by unrolling by another factor of
2, the code is measurably faster. The code does fused multiply accumulates
back to as many scratch registers as we have room for, in order to maximize
ILP for the 16 integer FMAs that need to occur. The compiler seems to order
them such that the destination register is the same register as in the
previous instruction, so perhaps it's not actually able to overlap them, or
maybe the Cortex-A73's pipeline is reordering these instructions anyway.
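
As a rough illustration of the deferral described above, here is a portable
scalar C sketch. It is not the NEON code from this commit: the names
adler32_ref and adler32_deferred, the 64-entry col[] array standing in for
the 16-bit vector lanes, and the 256-block cap are all assumptions made for
the example. The inner loop only adds bytes into independent per-position
sums; the weights 64..1 and the factor of 64 on the running s1 total are
applied once per chunk, after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u   /* largest prime below 2^16 */
#define BLOCK      64u      /* bytes consumed per loop iteration */
#define MAX_BLOCKS 256u     /* assumed cap: 256 * 255 = 65280 fits in uint16_t */

/* Byte-at-a-time reference, used only for comparison. */
static uint32_t adler32_ref(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t s1 = adler & 0xffffu, s2 = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Hypothetical scalar model of the deferred-multiply scheme. */
static uint32_t adler32_deferred(uint32_t adler, const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    uint32_t s1 = adler & 0xffffu, s2 = (adler >> 16) & 0xffffu;

    while (len >= BLOCK) {
        size_t blocks = len / BLOCK;
        if (blocks > MAX_BLOCKS)
            blocks = MAX_BLOCKS;

        uint16_t col[BLOCK] = {0};  /* independent 8 -> 16 bit sums per position */
        uint32_t s1_sum = 0;        /* sum of s1 as seen at the start of each block */

        for (size_t k = 0; k < blocks; k++) {
            s1_sum += s1;
            for (size_t j = 0; j < BLOCK; j++) {
                col[j] = (uint16_t)(col[j] + buf[j]);  /* plain add, no multiply */
                s1 += buf[j];
            }
            buf += BLOCK;
        }
        len -= blocks * BLOCK;

        /* Deferred work: byte j of a block counts (64 - j) times toward s2. */
        uint32_t weighted = 0;
        for (size_t j = 0; j < BLOCK; j++)
            weighted += (BLOCK - (uint32_t)j) * col[j];

        /* The multiplication by 64 happens once per chunk, outside the loop. */
        s2 = (s2 + BLOCK * (s1_sum % ADLER_BASE) + weighted) % ADLER_BASE;
        s1 %= ADLER_BASE;
    }

    /* Tail shorter than one block: fall back to the scalar recurrence. */
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_BASE;
        s2 = (s2 + s1) % ADLER_BASE;
    }
    return (s2 << 16) | s1;
}
```

The cap of 256 blocks is chosen here only so that the 16-bit sums and the
32-bit totals cannot overflow before the modulo; the real code may pick its
chunk size differently.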

On the Odroid-N2, the Cortex-A73 cores are ~30-44% faster on the adler32 benchmark,
and the Cortex-A53 cores are anywhere from 34-47% faster.

arch/arm/adler32_neon.c
fallback_builtins.h