git.ipfire.org Git - thirdparty/haproxy.git/commit

author	Willy Tarreau <w@1wt.eu>
	Mon, 30 Nov 2020 17:58:16 +0000 (18:58 +0100)
committer	Willy Tarreau <w@1wt.eu>
	Fri, 5 Mar 2021 07:30:08 +0000 (08:30 +0100)
commit	46cca8690029dcdd13c86894aefdf2ef2d6f6b07
tree	139a716bc5fba5798ccfd8199be99d20ff037c24
parent	168fc5332c7b3f43c8841a999fc40a3acef85223

MINOR: atomic: add armv8.1-a atomics variant for cas-dw

This variant uses the CASP instruction available on armv8.1-a CPU cores,
which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7,
mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS'
Graviton2 CPUs.
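
As an illustration of the detection described above, here is a minimal
sketch of a compile-time check. The macro HAVE_CASP_SKETCH and the helper
cas_dw_variant() are hypothetical names chosen for this example, not
identifiers from the patch. The compiler is expected to define
__ARM_FEATURE_ATOMICS when the target supports these atomics, for
example when building for armv8.1-a.

```c
#include <stddef.h>

/* Hypothetical macro for this sketch: 1 when the CASP-based variant
 * can be compiled in, 0 when another implementation must be used.
 */
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
#define HAVE_CASP_SKETCH 1
#else
#define HAVE_CASP_SKETCH 0
#endif

/* Hypothetical helper reporting which variant was selected at build time. */
static inline const char *cas_dw_variant(void)
{
	return HAVE_CASP_SKETCH ? "casp" : "other";
}
```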

The instruction performs much better at high thread counts since it
guarantees some forward progress under extreme contention, whereas the
original LL/SC approach is lighter at low thread counts but does not
guarantee progress.

The implementation is not optimal. In particular, the instruction
operates on register pairs and there doesn't seem to be a way to make
gcc allocate register pairs, so we force the use of the pair (x0,x1) to
hold the old value and (x2,x3) to hold the new one, which necessarily
involves some extra moves. But at least it does improve the situation
with 16 threads and more. See issue #958 for more context.
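
The register pinning described above can be sketched as follows. This is
a simplified illustration under stated assumptions, not the code from
the patch: the name cas_dw_sketch() is hypothetical, the acquire/release
form "caspal" is an assumed choice of ordering, and the target is
assumed to be 16-byte aligned. On builds without __ARM_FEATURE_ATOMICS
the sketch falls back to a plain, NON-atomic stand-in that only mirrors
the return convention so that the example still compiles.

```c
#include <stdint.h>

/* Returns 1 if the 16 bytes at <target> matched <compare> and were
 * replaced by <set>, otherwise 0 with <compare> updated to the value
 * that was found. Hypothetical helper for illustration only.
 */
static inline __attribute__((always_inline))
int cas_dw_sketch(void *target, void *compare, const void *set)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	/* CASP reports no status, so keep a copy of the expected value
	 * and compare it with what the instruction returns.
	 */
	uint64_t saved0 = ((uint64_t *)compare)[0];
	uint64_t saved1 = ((uint64_t *)compare)[1];

	/* Pin each operand to a fixed register so that the old value
	 * sits in the pair (x0,x1) and the new one in (x2,x3).
	 */
	register uint64_t cmp0 __asm__("x0") = saved0;
	register uint64_t cmp1 __asm__("x1") = saved1;
	register uint64_t set0 __asm__("x2") = ((const uint64_t *)set)[0];
	register uint64_t set1 __asm__("x3") = ((const uint64_t *)set)[1];

	__asm__ __volatile__("caspal %0, %1, %3, %4, [%2]"
	                     : "+r"(cmp0), "+r"(cmp1)
	                     : "r"(target), "r"(set0), "r"(set1)
	                     : "memory");

	if (cmp0 == saved0 && cmp1 == saved1)
		return 1;

	/* Failure: report the value actually found in memory. */
	((uint64_t *)compare)[0] = cmp0;
	((uint64_t *)compare)[1] = cmp1;
	return 0;
#else
	/* NON-ATOMIC stand-in, only here to keep the sketch buildable
	 * on other targets. Do not use it for real synchronization.
	 */
	uint64_t *t = (uint64_t *)target;
	uint64_t *c = (uint64_t *)compare;
	const uint64_t *s = (const uint64_t *)set;

	if (t[0] == c[0] && t[1] == c[1]) {
		t[0] = s[0];
		t[1] = s[1];
		return 1;
	}
	c[0] = t[0];
	c[1] = t[1];
	return 0;
#endif
}
```

The extra moves mentioned above come from loading the four pinned
registers before the instruction and copying the result back on failure.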

Note: a first implementation of this function used an input/output
constraint passed as "+Q"(*(void**)target), which resulted in smaller
overall code than passing "target" as an input register only. It turned
out that the difference was directly related to whether or not the
function was inlined, hence the "forceinline" attribute. Any change to
this code should still pay attention to this important factor.