git.ipfire.org Git - thirdparty/zstd.git/commit

author	Danila Kutenin <kutdanila@yandex.ru>
	Sun, 22 May 2022 10:34:33 +0000 (10:34 +0000)
committer	Danila Kutenin <kutdanila@yandex.ru>
	Sun, 22 May 2022 10:44:24 +0000 (10:44 +0000)
commit	e11783b04d1c49678bb4f95a4ecaa26323bd823d
tree	ba9a78eebaacc4a213d70efac039520a62a1a6a3
parent	fda537b299bfa598e65b47ef77c412dc09313445

[lazy] Optimize ZSTD_row_getMatchMask for level 8-10

We found that movemask is not used properly or consumes too much CPU.
This effort helps to optimize the movemask emulation on ARM.

For levels 8-9 we saw 3-5% improvements. For level 10 we saw a 1.5%
improvement.

The key idea is not to use pure movemasks but to have groups of bits.
For rowEntries == 16 and 32 we use groups of size 4 and 2
respectively, so each match bit is duplicated within its group.

Then we AND the mask so that only one bit stays set in each group, which
keeps iteration by clearing the lowest set bit, `a &= (a - 1)`, working.
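
A minimal scalar sketch of the grouped-mask idea for rowEntries == 16, for
illustration only. The helper names (`grouped_mask16`, `ctz64`,
`collect_matches`) are hypothetical and are not the functions touched by this
commit. The sketch assumes each entry owns a group of 4 bits in a 64-bit mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a grouped match mask for a 16-entry row.
 * Entry i owns bits 4*i .. 4*i+3, so a match first sets all 4 bits of its
 * group (the "duplicated" bits). The final AND keeps only the lowest bit
 * of every group. */
static uint64_t grouped_mask16(const uint8_t tags[16], uint8_t tag)
{
    uint64_t mask = 0;
    for (int i = 0; i < 16; i++) {
        if (tags[i] == tag)
            mask |= (uint64_t)0xF << (4 * i);
    }
    return mask & 0x1111111111111111ULL;
}

/* Portable count-trailing-zeros; the caller guarantees v != 0. */
static unsigned ctz64(uint64_t v)
{
    unsigned n = 0;
    while ((v & 1) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Walk the mask by clearing the lowest set bit on every step.
 * Because one bit survives per group, each match is visited once;
 * dividing the bit position by the group size (4) gives the entry index. */
static size_t collect_matches(uint64_t mask, unsigned out[16])
{
    size_t n = 0;
    for (; mask != 0; mask &= (mask - 1))
        out[n++] = ctz64(mask) >> 2;
    return n;
}
```

For rowEntries == 32 the same scheme would use groups of 2 bits, an AND with
0x5555555555555555, and a shift by 1 instead of 2.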

Also, aarch64 has no 16-bit rotate instructions, only 32-bit and 64-bit
ones, which is why we see larger improvements for levels 8-9.

The vshrn_n_u16 instruction is used to achieve that: it shifts every u16
right by 4 and narrows the result to its lower 8 bits. See the picture
below. It's also used in
[Folly](https://github.com/facebook/folly/blob/c5702590080aa5d0e8d666d91861d64634065132/folly/container/detail/F14Table.h#L446).
The instruction also takes 2 cycles according to the Neoverse-N{1,2} guidelines.
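
A hedged sketch of how the shift-and-narrow step can produce a 4-bits-per-entry
mask for 16 one-byte tags. It is not the code from this commit: the function
name `nibble_mask16` is hypothetical, and a little-endian lane layout is
assumed. The NEON path uses only standard `arm_neon.h` intrinsics; the scalar
path emulates the same arithmetic so the sketch also builds on other targets.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: return a 64-bit mask in which entry i owns bits
 * 4*i .. 4*i+3, all set when src[i] == tag and all clear otherwise.
 * Assumes little-endian lane order. */
static uint64_t nibble_mask16(const uint8_t src[16], uint8_t tag)
{
#if defined(__ARM_NEON)
    /* Byte-wise compare: each lane becomes 0xFF (match) or 0x00. */
    const uint8x16_t chunk = vld1q_u8(src);
    const uint16x8_t equal =
        vreinterpretq_u16_u8(vceqq_u8(chunk, vdupq_n_u8(tag)));
    /* Shift every u16 right by 4 and keep the low 8 bits: the high nibble
     * of the even byte and the low nibble of the odd byte survive. */
    const uint8x8_t narrowed = vshrn_n_u16(equal, 4);
    return vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
#else
    /* Scalar emulation of the same compare, shift and narrow. */
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        const uint16_t lo = (src[2 * i] == tag) ? 0xFF : 0x00;
        const uint16_t hi = (src[2 * i + 1] == tag) ? 0xFF : 0x00;
        const uint16_t lane = (uint16_t)(lo | (hi << 8));
        const uint8_t narrowed = (uint8_t)(lane >> 4);
        mask |= (uint64_t)narrowed << (8 * i);
    }
    return mask;
#endif
}
```

ANDing the result with 0x1111111111111111 leaves one bit per matching entry,
which is the form the `a &= (a - 1)` iteration expects.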

The 64-bit movemask is already well optimized. We have ongoing experiments
but have not been able to confirm that other implementations are reliably
faster.