math: Use erf from CORE-MATH
The current implementation shows the following accuracy, measured on
three ranges ([-DBL_MAX, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with 10e9
uniformly distributed random inputs for each range and rounding mode
(the first column is the error in ULP, where 0 means correctly
rounded; the second is the number of samples with that error).  A
sketch of such a check follows the listing:
* Range [-DBL_MAX, -4.2]
  * FE_TONEAREST
    0: 10000000000 100.00%
  * FE_UPWARD
    0: 10000000000 100.00%
  * FE_DOWNWARD
    0: 10000000000 100.00%
  * FE_TOWARDZERO
    0: 10000000000 100.00%
* Range [-4.2, 4.2]
  * FE_TONEAREST
    0: 9764404513 97.64%
    1:  235595487  2.36%
  * FE_UPWARD
    0: 9468013928 94.68%
    1:  531986072  5.32%
  * FE_DOWNWARD
    0: 9493787693 94.94%
    1:  506212307  5.06%
  * FE_TOWARDZERO
    0: 9585271351 95.85%
    1:  414728649  4.15%
* Range [4.2, DBL_MAX]
  * FE_TONEAREST
    0: 10000000000 100.00%
  * FE_UPWARD
    0: 10000000000 100.00%
  * FE_DOWNWARD
    0: 10000000000 100.00%
  * FE_TOWARDZERO
    0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
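As an illustration of how such an accuracy check can be done (a hedged
sketch, not the harness used for the numbers above; the sample count,
range, mode mapping, and output format are all illustrative), erf can
be compared against MPFR, which is correctly rounded at its working
precision, in each rounding mode.  Build with -frounding-math and link
with -lmpfr -lgmp -lm:

  #include <fenv.h>
  #include <math.h>
  #include <mpfr.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Map a fenv rounding mode to the matching MPFR mode.  */
  static mpfr_rnd_t
  to_mpfr_mode (int mode)
  {
    switch (mode)
      {
      case FE_TONEAREST: return MPFR_RNDN;
      case FE_UPWARD:    return MPFR_RNDU;
      case FE_DOWNWARD:  return MPFR_RNDD;
      default:           return MPFR_RNDZ;  /* FE_TOWARDZERO */
      }
  }

  /* ULP distance counted as nextafter steps, capped at 8.  */
  static unsigned int
  ulp_distance (double got, double want)
  {
    unsigned int n = 0;
    while (got != want && n < 8)
      {
        got = nextafter (got, want);
        n++;
      }
    return n;
  }

  int
  main (void)
  {
    static const int modes[] =
      { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    static const char *const names[] =
      { "FE_TONEAREST", "FE_UPWARD", "FE_DOWNWARD", "FE_TOWARDZERO" };
    mpfr_t ref;
    /* At 53 bits MPFR yields the correctly rounded double directly.  */
    mpfr_init2 (ref, 53);

    for (int m = 0; m < 4; m++)
      {
        unsigned long hist[9] = { 0 };
        fesetround (modes[m]);
        for (long i = 0; i < 10000000; i++)
          {
            double x = -4.2 + 8.4 * ((double) rand () / RAND_MAX);
            double got = erf (x);
            mpfr_set_d (ref, x, MPFR_RNDN);   /* exact, x has 53 bits */
            mpfr_erf (ref, ref, to_mpfr_mode (modes[m]));
            double want = mpfr_get_d (ref, to_mpfr_mode (modes[m]));
            hist[ulp_distance (got, want)]++;
          }
        printf ("%s:", names[m]);
        for (int u = 0; u < 9; u++)
          if (hist[u] != 0)
            printf ("  %d ulp: %lu", u, hist[u]);
        putchar ('\n');
      }
    mpfr_clear (ref);
    return 0;
  }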
The code was adapted to the glibc style and to use the definitions
from math_config.h (to handle errno, overflow, and underflow).
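For reference, the observable effect of that error handling can be
checked with a small standalone program like the one below (an
illustration of the expected behavior for a subnormal argument, not
code from the patch; whether errno is set also depends on
math_errhandling for the toolchain in use).  Build with -lm:

  #include <errno.h>
  #include <fenv.h>
  #include <math.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* For a subnormal x, erf (x) ~= 2*x/sqrt(pi) is itself subnormal
       and inexact, so the underflow exception (and, with POSIX error
       reporting, errno == ERANGE) is the expected outcome.  */
    double x = 0x1p-1050;
    errno = 0;
    feclearexcept (FE_ALL_EXCEPT);
    double y = erf (x);
    printf ("erf (%a) = %a\n", x, y);
    printf ("FE_UNDERFLOW raised: %s\n",
            fetestexcept (FE_UNDERFLOW) ? "yes" : "no");
    printf ("errno == ERANGE:     %s\n", errno == ERANGE ? "yes" : "no");
    return 0;
  }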
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput      master     patched  improvement
x86_64                    38.2754     78.0311     -103.87%
x86_64v2                  38.3325     75.7555      -97.63%
x86_64v3                  34.6604     28.3182       18.30%
aarch64                   23.1499     21.4307        7.43%
power10                   12.3051      9.3766       23.80%

latency                    master     patched  improvement
x86_64                    84.3062    121.3580      -43.95%
x86_64v2                  84.1817    117.4250      -39.49%
x86_64v3                  81.0933     70.6458       12.88%
aarch64                   35.012      29.5012       15.74%
power10                   21.7205     18.4589       15.02%
For x86_64/x86_64-v2, most of the performance hit comes from the fma
call going through the ifunc mechanism.
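As an illustration of that point (not part of the patch): in a trivial
translation unit like the one below, compiled for the x86_64 or
x86_64-v2 baseline, the fma is an out-of-line libm call whose
implementation is selected at run time through an ifunc, while with
-mfma (or -march=x86-64-v3) GCC can emit a single vfmadd instruction,
removing the call overhead on the polynomial-evaluation path:

  #include <math.h>

  /* One Horner step: acc * x + c with a single rounding.  Baseline
     x86_64: a call into libm (ifunc-dispatched).  With FMA enabled at
     compile time: a single vfmadd instruction, no call.  */
  double
  poly_step (double acc, double x, double c)
  {
    return fma (acc, x, c);
  }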
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>