math: Use lgamma from CORE-MATH
The current implementation shows the following accuracy on two ranges
([-20, 20] and [20, 0x5.d53649e2d4674p+1012]), with 10e9 uniformly
distributed random inputs for each range (the first column is the
error in ULP, with '0' being correctly rounded; the second is the
number of samples with that error):
* Range [-20, 20]
* FE_TONEAREST
    0:  6701254075  67.01%
    1:  3230897408  32.31%
    2:    63986940   0.64%
    3:     3605417   0.04%
    4:      233189   0.00%
    5:       20973   0.00%
    6:        1869   0.00%
    7:         125   0.00%
    8:           4   0.00%
* FE_UPWARD
    0:  4207428861  42.07%
    1:  5001137116  50.01%
    2:   740542213   7.41%
    3:    49116304   0.49%
    4:     1715617   0.02%
    5:       54464   0.00%
    6:        4956   0.00%
    7:         451   0.00%
    8:          16   0.00%
    9:           2   0.00%
* FE_DOWNWARD
    0:  4155925193  41.56%
    1:  4989821364  49.90%
    2:   770312796   7.70%
    3:    72014726   0.72%
    4:    11040522   0.11%
    5:      872811   0.01%
    6:       12480   0.00%
    7:         106   0.00%
    8:           2   0.00%
* FE_TOWARDZERO
    0:  4225861532  42.26%
    1:  5027051105  50.27%
    2:   706443411   7.06%
    3:    39877908   0.40%
    4:      713109   0.01%
    5:       47513   0.00%
    6:        4961   0.00%
    7:         438   0.00%
    8:          23   0.00%
* Range [20, 0x5.d53649e2d4674p+1012]
* FE_TONEAREST
    0:  7262241995  72.62%
    1:  2737758005  27.38%
* FE_UPWARD
    0:  4690392401  46.90%
    1:  5143728216  51.44%
    2:   165879383   1.66%
* FE_DOWNWARD
    0:  4690333331  46.90%
    1:  5143794937  51.44%
    2:   165871732   1.66%
* FE_TOWARDZERO
    0:  4690343071  46.90%
    1:  5143786761  51.44%
    2:   165870168   1.66%
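
As an illustration of what the ULP column counts, the sketch below
measures the distance between a computed double and a correctly rounded
reference as the number of representable doubles between them.  The
tool that produced the numbers above is not shown in this message;
ordered_bits and ulp_distance are hypothetical helper names, and the
sketch assumes IEEE 754 binary64 doubles and non-NaN inputs.

```c
#include <stdint.h>
#include <string.h>

/* Map a double onto an integer that increases monotonically with the
   value, so that adjacent doubles map to adjacent integers.
   Assumes IEEE 754 binary64; -0.0 and +0.0 both map to 0.  */
static int64_t
ordered_bits (double x)
{
  int64_t i;
  memcpy (&i, &x, sizeof i);
  return i < 0 ? INT64_MIN - i : i;
}

/* Number of representable doubles between A and B (0 means identical,
   i.e. the result matches the correctly rounded reference).
   Hypothetical helper; inputs must not be NaN.  */
static uint64_t
ulp_distance (double a, double b)
{
  int64_t ia = ordered_bits (a);
  int64_t ib = ordered_bits (b);
  /* Subtract as unsigned to avoid signed overflow for far-apart values.  */
  return ia > ib ? (uint64_t) ia - (uint64_t) ib
		 : (uint64_t) ib - (uint64_t) ia;
}
```

With this definition, a result equal to the reference falls in row '0',
and a result that is the next representable double falls in row '1'.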
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput        master       patched  improvement
x86_64                     112.9740      135.8640      -20.26%
x86_64v2                   111.8910      131.7590      -17.76%
x86_64v3                   108.2800       68.0935       37.11%
aarch64                     61.3759       49.2403       19.77%
power10                     42.4483       24.1943       43.00%
Latency                      master       patched  improvement
x86_64                     144.0090      167.9750      -16.64%
x86_64v2                   139.2690      167.1900      -20.05%
x86_64v3                   130.1320       96.9347       25.51%
aarch64                     66.8538       53.2747       20.31%
power10                     49.5076       29.6917       40.03%
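
The improvement column is consistent with the relative reduction of the
patched timing with respect to master, so a negative value means the
patched code is slower.  A minimal sketch of that arithmetic
(improvement_pct is a hypothetical name, not part of the benchmark
tooling):

```c
/* Relative improvement of PATCHED over MASTER, in percent.
   Lower timings are better, so a positive result means the patched
   code is faster and a negative result means it is slower.  */
static double
improvement_pct (double master, double patched)
{
  return (master - patched) / master * 100.0;
}
```

For instance, the x86_64 reciprocal-throughput row (112.9740 versus
135.8640) yields about -20.26.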
For x86_64/x86_64-v2, most of the performance hit comes from the fma
call through the ifunc mechanism.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>