git.ipfire.org Git - thirdparty/glibc.git/log

linux: Add STATX_DOALIGN definition to generic statx

The commit 07937809ac377f8ffb5bad3335194dd9a447922f added
STATX_MNT_ID_UNIQUE on the statx-generic.h without updating the
generic statx struct.

linux: Add STATX_MNT_ID_UNIQUE definition to generic statx

The commit 88a2cf6c4bab6e94a65e9c0db8813709372e9180 added
STATX_MNT_ID_UNIQUE on the statx-generic.h without updating the
generic statx struct.

Update syscall lists for Linux 6.17

Linux 6.16 adds no new syscalls, while Linux 6.17 adds file_getattr
and file_setattr (commit be7efb2d20d67f334a7de2aef77ae6c69367e646).
Update syscall-names.list and regenerate the arch-syscall.h headers
with build-many-glibcs.py update-syscalls.

Update PIDFD_* constants for Linux 6.17

The pidfd interface was extended with:

  * PIDFD_GET_INFO and pidfd_info (along with related extra flags) to
    allow get information about the process without the need to parse
    /proc (commit cdda1f26e74ba, Linux 6.13).

  * PIDFD_SELF_{THREAD,THREAD_GROUP,SELF,SELF_PROCESS} to allow
    pidfd_send_signal refer to the own process or thread lead groups
    without the need of allocating a file descriptor (commit f08d0c3a71114,
    Linux 6.15).

  * PIDFD_INFO_COREDUMP that extends PIDFD_GET_INFO to obtain coredump
    information.

Linux uAPI header defines both PIDFD_SELF_THREAD and
PIDFD_SELF_THREAD_GROUP on linux/fcntl.h (since they reserve part of the
AT_* values), however for glibc I do not see any good reason to add pidfd
definitions on fcntl-linux.h.

The tst-pidfd.c is extended with some PIDFD_SELF_* tests and a new
‘tst-pidfd_getinfo.c’ test is added to check PIDFD_GET_INFO. The
PIDFD_INFO_COREDUMP tests would require very large and complex tests
that are already covered by kernel tests.

Checked on aarch64-linux-gnu and x86_64-linux-gnu on kernels 6.8 and
6.17.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

Update kernel version to 6.17 in header constant tests

There are no new constants covered by tst-mman-consts.py,
tst-mount-consts.py or tst-sched-consts.py in Linux 6.17.

math: Remove the SVID error handling from atan2f

It improves latency for about 3-6% and throughput for about 5-12%.

Tested on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

Add feature test macros for POSIX.1-2024.

* include/features.h (_POSIX_C_SOURCE): Document the value of 202405L
for POSIX.1-2024.  Set it to 202405L when _GNU_SOURCE or _DEFAULT_SOURCE
is defined.
(_XOPEN_SOURCE): Document the value of 800 for POSIX-1.2024.  Set it to
800 when _GNU_SOURCE is defined.
(__USE_XOPEN2K24, __USE_XOPEN2K24XSI): New internal macros.  Set them
when _POSIX_C_SOURCE is 202405L or greater and/or when _XOPEN_SOURCE is
800 or greater.
* manual/creature.texi (Feature Test Macros): Document the new values
for _POSIX_C_SOURCE and _XOPEN_SOURCE.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Collin Funk <collin.funk1@gmail.com>

Rename fromfp files in preparation for changing types for C23

As discussed in bug 28327, the fromfp functions changed type in C23
(compared to the version in TS 18661-1); they now return the same type
as the floating-point argument, instead of intmax_t / uintmax_t.

As with other such incompatible changes compared to the initial TS
18661 versions of interfaces (the types of totalorder functions, in
particular), it seems appropriate to support only the new version as
an API, not the old one (although many programs written for the old
API might in fact work wtih the new one as well). Thus, the existing
implementations should become compat symbols. They are sufficiently
different from how I'd expect to implement the new version that using
separate implementations in separate files is more convenient than
trying to share code, and directly sharing testcases would be
problematic as well.

Rename the existing fromfp implementation and test files to names
reflecting how they're intended to become compat symbols, so freeing
up the existing filenames for a subsequent implementation of the C23
versions of these functions (which is the point at which the existing
implementations would actually become compat symbols).

gen-fromfp-tests.py and gen-fromfp-tests-inputs are not renamed; I
think it will make sense to adapt the test generator to be able to
generate most tests for both versions of the functions (with extra
test inputs added that are only of interest with the C23 version).
The ldbl-opt/nldbl-* files are also not renamed; since those are for a
static only library, no compat versions are needed, and they'll just
have their contents changed when the C23 version is implemented.

Tested for x86_64, and with build-many-glibcs.py.

Add C23 long_double_t, _FloatN_t

C23 Annex H adds <math.h> typedefs long_double_t and _FloatN_t
(originally introduced in TS 18661-3), analogous to float_t and
double_t.  Add these typedefs to glibc.  (There are no _FloatNx_t
typedefs.)

C23 also slightly changes the rules for how such typedef names should
be defined, compared to the definition in TS 18661-3.  In both cases,
<TYPE>_t corresponds to the evaluation format for <TYPE>, as specified
by FLT_EVAL_METHOD (for which <math.h> uses glibc's internal
__GLIBC_FLT_EVAL_METHOD).  Specifically, each FLT_EVAL_METHOD value
corresponds to some type U (for example, 64 corresponds to U =
_Float64), and for types with exactly the same set of values as U, TS
18661-3 says expressions with those types are to be evaluated to the
range and precision of type U (so <TYPE>_t is defined to U), whereas
C23 only does that for types whose values are a strict subset of those
of type U (so <TYPE>_t is defined to <TYPE>).

As with other cases where semantics changed between TS 18661 and C23,
this patch only implements the newer version of the semantics
(including adjusting existing definitions of float_t and double_t as
needed).  The new semantics are contradictory between the main
standard and Annex H for the case of FLT_EVAL_METHOD == 2 and the
choice of double_t when double and long double have the same values
(the main standard says it's defined as long double in that case,
whereas Annex H would define it as double), which I've raised on the
WG14 reflector (but I think setting FLT_EVAL_METHOD == 2 when double
and long double have the same values is a fairly theoretical
combination of features); for now glibc follows the value in the main
standard in that case.

Note that I think all existing GCC targets supported by glibc only use
values -1, 0, 1, 2 or 16 for FLT_EVAL_METHOD (so most of the header
code is somewhat theoretical, though potentially relevant with other
compilers since the choice of FLT_EVAL_METHOD is only an API choice,
not an ABI one; it can vary with compiler options, and these typedefs
should not be used in ABIs).  The testcase (expanded to cover the new
typedefs) is really just repeating the same logic in a second place
(so all it really tests is that __GLIBC_FLT_EVAL_METHOD is consistent
with FLT_EVAL_METHOD).

Tested for x86_64 and x86, and with build-many-glibcs.py.

riscv: Add vector registers to __SYSCALL_CLOBBERS

The Linux kernel ABI specifies that the vector registers are not preserved
across system calls, but the __SYSCALL_CLOBBERS macro doesn't mention them.
This could possibly lead to compilers trying to keep data in the vector
registers across the syscall leading to corruption. Add the vector registers
to __SYSCALL_CLOBBERS when the vector extension is enabled. If the vector
extension is enabled, then require GCC 15 or later and RVV 1.0 or later.

Fixes: 36960f0c76 ("RISC-V: Linux Syscall Interface")
Signed-off-by: Peter Bergner <bergner@tenstorrent.com>

Regenerate charmap-kw.h and locfile-kw.h with gperf 3.3

In commit 970364dac00b38333e5b2d91c90d11e80141d265 we switched some
/*FALLTHROUGH*/ comments to [[fallthrough]] to avoid warnings with
Clang. However, since gperf emitted different output the buildbot
failed. The buildbot has been updated to use gperf 3.3 which will use
__attribute__ ((__fallthrough__)) where needed to avoid warnings [1].
This patch regenerates these files with the same version.

[1] https://sourceware.org/pipermail/libc-testresults/2025q4/014123.html

Reviewed-by: Mark Wielaard <mark@klomp.org>

math: Remove the SVID error handling wrapper from sqrt

i386 and m68k architectures should use math-use-builtins-sqrt.h rather
than relying on architecture-specific or inline assembly implementations.

The PowerPC optimization for PPC 601/603 (30 years old) is removed.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from sinhf

It improves latency for about 3-10% and throughput for about 5-15%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from remainder

The optimized i386 version is faster than the generic one, and
gcc implements it through the builtin. This optimization enables
us to migrate the implementation to a C version.  The performance
on a Zen3 chip is similar to the SVID one.

The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin
(different than i386).

Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):

reciprocal-throughput           input    master   NO-SVID  improvement
x86_64                     subnormals   18.8522   16.2506       13.80%
x86_64                         normal  421.8260  403.9270        4.24%
x86_64                 close-exponent   21.0579   18.7642       10.89%
i686                       subnormals   21.3443   21.4229       -0.37%
i686                           normal  525.8380   538.807       -2.47%
i686                   close-exponent   21.6589   21.7983       -0.64%

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from remainderf

The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin.  This optimization enables us to
migrate the implementation to a C version.  The performance on a Zen3
chip is similar to the SVID one.

The m68k provided an optimized version through __m81_u(remainderf)
(mathimpl.h), and gcc does not implement it through a builtin (different
than i386).

Performance improves a bit on x86_64 (Zen3, gcc 15.2.1):

reciprocal-throughput          input   master  NO-SVID  improvement
x86_64                    subnormals  17.5349  15.6125       10.96%
x86_64                        normal  53.8134  52.5754        2.30%
x86_64                close-exponent  20.0211  18.6656        6.77%
i686                      subnormals  21.8105  20.1856        7.45%
i686                          normal  73.1945  71.2199        2.70%
i686                  close-exponent  22.2141   20.331        8.48%

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

nptl: Remove ATOMIC_EXCHANGE_USES_CAS usage

The only usage was for pthread_spin_lock, introduced by 12d2dd706099aa4,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

Define __HAVE_64B_ATOMICS from compiler support

Now that atomic builtins are used by default, we can rely on the
compiler to define when to use 64-bit atomic operations.

It allows the use of 64-bit atomic operations on some 32-bit ABIs where
they were not previously enabled due to missing pre-processor handling:
hppa, mips64n32, s390, and sparcv9.

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

atomic: Consolidate atomic_write_barrier implementation

All ABIs, except alpha and sparc, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_RELEASE).

For alpha, it uses a 'wmb' which does not map to any of C11
barriers.

For sparc it uses a stronger 'member #LoadStore | #StoreStore',
where the release barrier maps to just 'membar #StoreLoad'. The
patch keeps the sparc definition.

For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).

Tested on aarch64-linux-gnu.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

atomic: Consolidate atomic_read_barrier implementation

All ABIs, except alpha, powerpc, and x86_64, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_SEQ_CST) in most cases, with the
exception of aarch64 (where the acquire fence is generated as
'dmb ishld' instead of 'dmb ish').

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline
synchronization).

For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).

Tested on aarch64-linux-gnu, where the acquire produces a different
instruction that the current code.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

atomic: Consolidate atomic_full_barrier implementation

All ABIs save for sparcv9 and s390 defines it to __sync_synchronize,
which can be mapped to __atomic_thread_fence (__ATOMIC_SEQ_CST).

For Sparc, it uses a stricter #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
instead of the #StoreLoad generated by __sync_synchronize.

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline synchronization).

The barrier is used only in one place (pthread_mutex_setprioceiling),
and using a stricter barrier for s390 is ok performance-wise.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

microblaze: Remove USE_ATOMIC_COMPILER_BUILTINS definition

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

alpha: Remove USE_ATOMIC_COMPILER_BUILTINS definition

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

sh: Move atomic-machine to generic sysdep

There is no Linux specific definitions.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

riscv: Consolidade atomic-machine.h and remove ununsed atomic macros

The resulting definitions are not Linux specific, so move the header
to generic sysdep folder.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

powerpc: Consolidate atomic-machine.h

The __HAVE_64B_ATOMICS can be define based on __WORDSIZE, and
the __ARCH_ACQ_INSTR, MUTEX_HINT_*, and barriers definition are
defined by the target cpu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

loongarch: Consolidate atomic-machine.h and remove ununsed atomic macros

These are already provided by the generic include/atomic.h and
the resulting macros are not Linux specific.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

m68k: Consolidade atomic-machine.h and Remove ununsed atomic macros

Both m68k and m68k-colfire do not support 64 bit atomis. The
atomic_barrier syscall on m68k is a no-op, so it can use the compiler
builtin.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

hppa: Move atomic-machine to generic sysdep

There is no Linux specific definitions.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

arm: Consolidate atomic-machine.h and Remove ununsed atomic macros

The libgcc provides the required support to calling the kernel
auxiliary routines for !__GCC_HAVE_SYNC_COMPARE_AND_SWAP_4.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

x86: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

sparc: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

s390: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h. Also
remove outdated comment from unsupported gcc versions.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

or1k: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

mips: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

csky: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

arc: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

aarch64: Remove ununsed atomic macros

These are already provided by the generic include/atomic.h.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

Build programs in $(others-noinstall) like tests if libgcc_s is available

Build programs in $(others-noinstall) like tests only if libgcc_s is
available. Otherwise, "build-many-glibcs.py compilers" will fail to
build the initial glibc with the initial limited gcc due to the missing
libgcc_s.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

Support assert as a variadic macro for C23

C23 makes assert into a variadic macro to handle cases of an argument
that would be interpreted as a single function argument but more than
one macro argument (in particular, compound literals with an
unparenthesized comma in an initializer list); this change was made by
N2829.  Note that this only applies to assert, not to other macros
specified in the C standard with particular numbers of arguments.

Implement this support in glibc.  This change is only for C; C++ would
need a separate change to its separate assert implementations.  It's
also applied only in C23 mode.  It depends on support for (C99)
variadic macros, and also (in order to detect calls where more than
one expression is passed, via an unevaluated function call) a C99
boolean type.  These requirements are encapsulated in the definition
of __ASSERT_VARIADIC.  Tests with -std=c99 and -std=gnu99 (using
implementations continue to work.

I don't think we have a way in the glibc testsuite to validate that
passing more than one expression as an argument does produce the
desired error.

Tested for x86_64.

docs: Add dynamic linker environment variable docs

The Dynamic Linker chapter now includes a new section detailing
environment variables that influence its behavior.

This new section documents the `LD_DEBUG` environment variable,
explaining how to enable debugging output and listing its various
keywords like `libs`, `reloc`, `files`, `symbols`, `bindings`,
`versions`, `scopes`, `tls`, `all`, `statistics`, `unused`, and `help`.

It also documents `LD_DEBUG_OUTPUT`, which controls where the debug
output is written, allowing redirection to a file with the process ID
appended.

This provides users with essential information for controlling and
debugging the dynamic linker.

Reviewed-by: DJ Delorie <dj@redhat.com>

tls: Add debug logging for TLS and TCB management

Introduce the `DL_DEBUG_TLS` debug mask to enable detailed logging for
Thread-Local Storage (TLS) and Thread Control Block (TCB) management.

This change integrates a new `tls` option into the `LD_DEBUG`
environment variable, allowing developers to trace:
- TCB allocation, deallocation, and reuse events in `dl-tls.c`,
`nptl/allocatestack.c`, and `nptl/nptl-stack.c`.
- Thread startup events, including the TID and TCB address, in
`nptl/pthread_create.c`.

A new test, `tst-dl-debug-tid`, has been added to validate the
functionality of this new debug logging, ensuring that relevant messages
are correctly generated for both main and worker threads.

This enhances the debugging capabilities for diagnosing issues related
to TLS allocation and thread lifecycle within the dynamic linker.

Reviewed-by: DJ Delorie <dj@redhat.com>

riscv: Add Zbkb optimized repeat_bytes helper

Introduce a RISC-V specific string-misc.h to provide an optimized
repeat_bytes implementation when the Zbkb extension is available.
The new version uses packh/packw/pack instruction count and avoiding
high latency instructions. This helper is used by several mem and string
functions, and falls back to the generic implementation when Zbkb is not
present.

Signed-off-by: Pincheng Wang <pincheng.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

math: Remove xfail from pow test [BZ #33563]

Remove xfail from pow testcase since pow and powf have been fixed.
Also check float128 maximum value. See BZ #33563.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

math: Fix pow special case [BZ #33563]

Fix pow (DBL_MAX, 1.0) to return DBL_MAX when rouding upwards without FMA.
This fixes BZ #33563.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

math: Fix powf special case [BZ #33563]

Fix powf (0x1.fffffep+127, 1.0f) to return 0x1.fffffep+127 when
rouding upwards. Cleanup the special case code - performance
improves by ~1.2%. This fixes BZ #33563.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

debug: mark __libc_message_wrapper as always inline

When building with -Og to enable debugging, there is currently a compiler error
because if __libc_message_wrapper() is not inline, the __va_arg_pack_len macro
cannot be used.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

aarch64: fix cfi directives around __libc_arm_za_disable

Incorrect CFI directive corrupted call stack information
and prevented debuggers from correctly displaying call
stack information.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

cdefs: allow __attribute__ on tcc

According to the tcc (tiny C compiler) Changelog, tcc supports
__attribute__ since 0.9.3. Looking at history of tcc at
<https://repo.or.cz/tinycc.git>, __attribute__ support was added
in commit 14658993425878be300aae2e879560698e0c6c4c on 2002-01-03,
which also looks like the release of 0.9.3. While I'm unable to
find release tags for tcc before 0.9.18 (2003-04-14), the next
release (0.9.28) will include __attribute__((cleanup(func)) which
I rely on.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

Cleanup some recently added whitespace.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

riscv: memcpy_noalignment: Reorder to store via a3, then bump a3

Rewrite the copy micro-step from:

    REG_L  a4, 0(a5)
    addi   a3, a3, SZREG
    addi   a5, a5, SZREG
    REG_S  a4, -SZREG(a3)

to:

    REG_L  a4, 0(a5)
    addi   a5, a5, SZREG
    REG_S  a4, 0(a3)
    addi   a3, a3, SZREG

Semantics are unchanged: both read *(a5_old), write *(a3_old), and then
increment a3/a5 by SZREG. memcpy assumes non-overlapping regions, so the
reordering preserves correctness.

No functional change.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

riscv: memcpy_noalignment: Fold SZREG/BLOCK_SIZE alignment to single andi

Simplify the alignment steps for SZREG and BLOCK_SIZE multiples. The previous
three-instruction sequences

    addi    a7, a2, -SZREG
    andi    a7, a7, -SZREG
    addi    a7, a7, SZREG

and

    addi    a7, a2, -BLOCK_SIZE
    andi    a7, a7, -BLOCK_SIZE
    addi    a7, a7, BLOCK_SIZE

are equivalent to a single

    andi    a7, a2, -SZREG
    andi    a7, a2, -BLOCK_SIZE

because SZREG and BLOCK_SIZE are powers of two in this context, making the
surrounding addi steps cancel out. Folding to one instruction reduces code
size with identical semantics.

No functional change.

sysdeps/riscv/multiarch/memcpy_noalignment.S: Remove redundant addi around
alignment; keep a single andi for SZREG/BLOCK_SIZE rounding.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

riscv: memcpy_noalignment: Make register allocation Zca-friendly

Tidy the temporary register allocation to favor registers eligible for
compressed encodings when Zca/Zcb are enabled. This keeps the ABI and
clobber set unchanged and does not alter control flow or memory access
behavior.

No functional change.

sysdeps/riscv/multiarch/memcpy_noalignment.S: Reassign temps to improve
compressed encoding opportunities.

Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

math: Remove the SVID error handling wrapper from yn/jn

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling wrapper from y1/j1

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling wrapper from y0/j0

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from coshf

It improves latency for about 3-10% and throughput for about 5-15%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from atanhf

It improves latency for about 1-10% and throughput for about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from acoshf

It improves latency for about 3-7% and throughput for about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from asinf

It improves latency for about 2% and throughput for about 5%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from acosf

It improves latency for about 2-10% and throughput for about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

math: Remove the SVID error handling from log10f

It improves latency for about 3-10% and throughput for about 5-10%.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

m68k: Remove SVID error handling on fmod

The m68k provided an optimized version through __m81_u(fmod)
(mathimpl.h), and gcc does not implement it through a builtin
(different than i386).

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

m68k: Avoid include e_fmod.c on fmod/remainder implementation

And open-code each implementation. It simplifies SVID error handling
removal.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

m68k: Remove the SVID error handling from fmodf

The m68k provided an optimized version through __m81_u(fmodf)
(mathimpl.h), and gcc does not implement it through a builtin
(different than i386).

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

i386: Remove the SVID error handling from fmodf

The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. It allows us to move the
implementation to a C one.

The performance on a Zen3 chip is slight better:

reciprocal-throughput           input   master  no-SVID  improvement
i686                       subnormals  22.4741  20.1571       10.31%
i686                           normal  74.1631  70.3606        5.13%
i686                   close-exponent  22.5625  20.2435       10.28%

Tested on i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

i386: Remove the SVID error handling from fmod

The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. It allows us to move the
implementation to a C one. The performance on a Zen3 chip is
similar to the SVID one.

Tested on i686-linux-gnu.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

x86: fix wmemset ifunc stray '!' (bug 33542)

The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
      && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
                                      AVX_Fast_Unaligned_Load, !))

This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.

Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.

Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
  falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.

Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirm the selected symbol is no longer SSE2.

Signed-off-by: xiejiamei <xiejiamei@hygon.com>
Signed-off-by: Li jing <lijing@hygon.cn>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Updates struct tcp_zerocopy_receive from 5.11 to netinet/tcp.h.

This patch updates struct tcp_zerocopy_receive to contain filed including
copybuf_address, copybuf_len, and others.

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

aarch64: Fix tst-ifunc-arg-4 on clang-18

It issues:

../sysdeps/aarch64/tst-ifunc-arg-4.c:39:1: error: unused function 'resolver' [-Werror,-Wunused-function]
39 | resolver (uint64_t arg0, const uint64_t arg1[])
| ^~~~~~~~
1 error generated.

clang-19 and onwards do not trigger the warning.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Enable --no-undefined-version by default

Recent lld version default to --no-undefined-version, which triggers
errors when building multiple libraries. For ld.so on x86_64 it fails
with:

ld.lld: error: version script assignment of 'GLIBC_2.4' to symbol '__stack_chk_guard' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '__nptl_set_robust_list_avail' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '__pointer_chk_guard' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_PRIVATE' to symbol '_dl_starting_up' failed: symbol not defined

While for libc.so:

ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_clearerr' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fgetc' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fileno' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_freopen' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fscanf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_fseek' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_peekc_unlocked' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stderr_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stdin_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_stdout_' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_pclose' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_perror' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_rewind' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_scanf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_setlinebuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_wdefault_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '_IO_wfile_setbuf' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '__ctype32_tolower' failed: symbol not defined
ld.lld: error: version script assignment of 'GLIBC_2.17' to symbol '__ctype32_toupper' failed: symbol not defined
ld.lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)

The version script is created with multiple missing symbols to simplify
the build for multiple ABIs, each of which may have different symbols.
For instance, __stack_chk_guard is defined by default. This avoids
requiring each ABI to add this symbol to its version script, depending
on the stack protector ABI it uses.

The libc.so warnings do show unused symbols being defined (like
_IO_clearerr), which might trigger potential errors depending on how
symbols are exported. However, since we already have ABI checks for
missing and extra symbols, the linker's extra checks are not really
necessary.

The --no-undefined-version is the default for ld.bfd.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Supress unused command arguments warning with clang

clang 20 issues an warning for the unused '-c' argument used to create
errlist-data-aux-shared.S, errlist-data-aux.S, siglist-aux-shared.S,
and siglist-aux.S. Filter out the '-c' from the $(compile-command.c).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Annotate swtich fall-through

The clang default to warning for missing fall-through and it does
not support all comment-like annotation that gcc does. Use C23
[[fallthrough]] annotation instead.
proper attribute instead.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

argp: Move attribute_hidden to argp-fmtstream.h

The internal header redefines the some internal argp functions with
attribute_hidden, which triggers clang warning of mismatched attributes.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

argp: Expand argp_usage, _option_is_short, and _option_is_end

The argp code uses macro redefinitions to avoid duplicating static inline
implementations for argp_usage, _option_is_short, and _option_is_end.
However, this causes build issues with clang, as some function prototypes
are redefined to add the hidden attribute with libc_hidden_proto.

To avoid extensive changes to internal headers, just expand the function
implementations and avoid the macro redefine tricks.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

Replace count_leading_zeros with stdc_leading_zeros

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Collin Funk <collin.funk1@gmail.com>

malloc: Remove unused tcache_set_inactive

clang warns that this function is not used.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

include: Sync gnulib intprops

It syncs with gnulib commit 1790ef25d81983d1d25a77d452c0080345df459b.

The main change is to proper support clang by using builtins.  It
fixes a sprof build issue, where previous version uses the generic
code path when building with clang:

sprof.c:682:8: error: result of comparison of constant 288230376151711743 with expression of type 'Elf64_Half' (aka 'unsigned short') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
  682 |           if (INT_MULTIPLY_WRAPV (ehdr2.e_shnum, sizeof (ElfW(Shdr)), &size))
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../include/intprops.h:415:34: note: expanded from macro 'INT_MULTIPLY_WRAPV'
  415 |    _GL_INT_OP_WRAPV (a, b, r, *, _GL_INT_MULTIPLY_RANGE_OVERFLOW)
      |    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../include/intprops.h:504:45: note: expanded from macro '_GL_INT_OP_WRAPV'
  504 |     : _GL_INT_OP_WRAPV_LONGISH(a, b, r, op, overflow))
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
../include/intprops.h:511:41: note: expanded from macro '_GL_INT_OP_WRAPV_LONGISH'
  511 |         : _GL_INT_OP_CALC (a, b, r, op, overflow, unsigned long int, \
      |           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  512 |                            unsigned long int, 0, ULONG_MAX)) \
      |                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../include/intprops.h:533:4: note: expanded from macro '_GL_INT_OP_CALC'
  533 |   (overflow (a, b, tmin, tmax) \
      |    ^~~~~~~~~~~~~~~~~~~~~~~~~~~
../include/intprops.h:608:22: note: expanded from macro '_GL_INT_MULTIPLY_RANGE_OVERFLOW'
  608 |       : (tmax) / (b) < (a)))
      |         ~~~~~~~~~~~~ ^ ~~~
1 error generated.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

i386: Build s_erf_common.c with -fexcess-precision=standard

It is requires to provide correctly rounded results.

Checked on i686-linux-gnu.

Build programs in $(others-noinstall) like tests

Programs in $(others-noinstall) are internal to glibc build and they
aren't installed. They should be treated like programs in $(others),
but linked like tests so that --enable-hardcoded-path-in-tests also
applies to them.

Also replace run-via-rtld-prefix with test-via-rtld-prefix when running
container tests.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: DJ Delorie <dj@redhat.com>

Fix incorrect setrlimit return value checks in tests

The setrlimit(2) function returns 0 on success and -1 on error, but
several test files were incorrectly checking for a return value of 1
to detect errors. This means the error checks would never trigger,
causing tests to continue silently even when setrlimit() failed.

This commit fixes the error checks in five files to correctly test
for -1, matching both the documented behavior and the pattern used
correctly in other parts of the codebase.

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Reviewed-by: Collin Funk <collin.funk1@gmail.com>

Rename uimaxabs to umaxabs (bug 33325)

The C2y function uimaxabs has been renamed to umaxabs. Implement this
change in glibc, keeping a compat symbol under the old name, copying
the test to test the new name and changing the old test to test the
compat symbol. Jakub has done the corresponding change to the
built-in function in GCC.

Tested for x86_64 and x86.

math: Consolidate CORE-MATH double-double routines

For lgamma and tgamma the muldd, mulddd, and polydd are renamed
to muldd2, mulddd2, and polydd2 respectively.

Checked on aarch64-linux-gnu and x86_64-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Consolidate erf/erfc definitions

The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Consolidate internal erf/erfc tables

The shared internal data definitions are consolidated in
s_erf_data.c and the erfc only one are moved to s_erfc_data.c.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use erfc from CORE-MATH

The current implementation precision shows the following accuracy, on
three ranges ([-DBL_MAX,5], [-5,5], [5,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -5]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-5, 5]
* FE_TONEAREST
     0:       8069309665  80.69%
     1:       1882910247  18.83%
     2:         47485296   0.47%
     3:           293749   0.00%
     4:             1043   0.00%
* FE_UPWARD
     0:       5540301026  55.40%
     1:       2026739127  20.27%
     2:       1774882486  17.75%
     3:        567324466   5.67%
     4:         86913847   0.87%
     5:          3820789   0.04%
     6:            18259   0.00%
* FE_DOWNWARD
     0:       5520969586  55.21%
     1:       2057293099  20.57%
     2:       1778334818  17.78%
     3:        557521494   5.58%
     4:         82473927   0.82%
     5:          3393276   0.03%
     6:            13800   0.00%
* FE_TOWARDZERO
     0:       6220287175  62.20%
     1:       2323846149  23.24%
     2:       1251999920  12.52%
     3:        190748245   1.91%
     4:         12996232   0.13%
     5:           122279   0.00%

* Range [5, DBL_MAX]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      49.0980       267.0660      -443.94%
x86_64v2                    49.3220       257.6310      -422.34%
x86_64v3                    42.9539        84.9571       -97.79%
aarch64                     28.7266        52.9096       -84.18%
power10                     14.1673        25.1273       -77.36%

Latency                      master        patched   improvement
x86_64                      95.6640       269.7060      -181.93%
x86_64v2                    95.8296       260.4860      -171.82%
x86_64v3                    91.1658       112.7150       -23.64%
aarch64                     37.0745        58.6791       -58.27%
power10                     23.3197        31.5737       -35.39%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use erf from CORE-MATH

The current implementation precision shows the following accuracy, on
three rangeis ([-DBL_MIN, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with
10e9 uniform randomly generated numbers for each range (first column
is the accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MIN, -4.2]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-4.2, 4.2]
* FE_TONEAREST
     0:       9764404513  97.64%
     1:        235595487   2.36%
* FE_UPWARD
     0:       9468013928  94.68%
     1:        531986072   5.32%
* FE_DOWNWARD
     0:       9493787693  94.94%
     1:        506212307   5.06%
* FE_TOWARDZERO
     0:       9585271351  95.85%
     1:        414728649   4.15%

* Range [4.2, DBL_MAX]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master       patched   improvement
x86_64                      38.2754       78.0311      -103.87%
x86_64v2                    38.3325       75.7555       -97.63%
x86_64v3                    34.6604       28.3182        18.30%
aarch64                     23.1499       21.4307         7.43%
power10                     12.3051       9.3766         23.80%

Latency                      master       patched   improvement
x86_64                      84.3062      121.3580       -43.95%
x86_64v2                    84.1817      117.4250       -39.49%
x86_64v3                    81.0933       70.6458        12.88%
aarch64                      35.012       29.5012        15.74%
power10                     21.7205       18.4589        15.02%

For x86_64/x86_64-v2, most performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use tgamma from CORE-MATH

The current implementation precision shows the following accuracy, on
one range ([-20,20]) with 10e9 uniform randomly generated numbers for
each range (first column is the accuracy in ULP, with '0' being
correctly rounded, second is the number of samples with the
corresponding precision):

* Range [-20,20]
* FE_TONEAREST
     0:       4504877808  45.05%
     1:       4402224940  44.02%
     2:        947652295   9.48%
     3:        131076831   1.31%
     4:         13222216   0.13%
     5:           910045   0.01%
     6:            35253   0.00%
     7:              606   0.00%
     8:                6   0.00%
* FE_UPWARD
     0:       3477307921  34.77%
     1:       4838637866  48.39%
     2:       1413942684  14.14%
     3:        240762564   2.41%
     4:         27113094   0.27%
     5:          2130934   0.02%
     6:           102599   0.00%
     7:             2324   0.00%
     8:               14   0.00%
* FE_DOWNWARD
     0:       3923545410  39.24%
     1:       4745067290  47.45%
     2:       1137899814  11.38%
     3:        171596912   1.72%
     4:         20013805   0.20%
     5:          1773899   0.02%
     6:            99911   0.00%
     7:             2928   0.00%
     8:               31   0.00%
* FE_TOWARDZERO
     0:       3697160741  36.97%
     1:       4731951491  47.32%
     2:       1303092738  13.03%
     3:        231969191   2.32%
     4:         32344517   0.32%
     5:          3283092   0.03%
     6:           193010   0.00%
     7:             5175   0.00%
     8:               45   0.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                     237.7960       175.4090        26.24%
x86_64v2                   232.9320       163.4460        29.83%
x86_64v3                   193.0680        89.7721        53.50%
aarch64                    113.6340        56.7350        50.07%
power10                     92.0617        26.6137        71.09%

Latency                      master        patched   improvement
x86_64                     266.7190       208.0130        22.01%
x86_64v2                   263.6070       200.0280        24.12%
x86_64v3                   214.0260       146.5180        31.54%
aarch64                    114.4760        58.5235        48.88%
power10                     84.3718        35.7473        57.63%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use lgamma from CORE-MATH

The current implementation precision shows the following accuracy, on
one range ([-1,1]) with 10e9 uniform randomly generated numbers for
each range (first column is the accuracy in ULP, with '0' being
correctly rounded, second is the number of samples with the
corresponding precision):

* Range [-20, 20]
* FE_TONEAREST
     0:       6701254075  67.01%
     1:       3230897408  32.31%
     2:         63986940   0.64%
     3:          3605417   0.04%
     4:           233189   0.00%
     5:            20973   0.00%
     6:             1869   0.00%
     7:              125   0.00%
     8:                4   0.00%
* FE_UPWARDA
     0:       4207428861  42.07%
     1:       5001137116  50.01%
     2:        740542213   7.41%
     3:         49116304   0.49%
     4:          1715617   0.02%
     5:            54464   0.00%
     6:             4956   0.00%
     7:              451   0.00%
     8:               16   0.00%
     9:                2   0.00%
* FE_DOWNWARD
     0:       4155925193  41.56%
     1:       4989821364  49.90%
     2:        770312796   7.70%
     3:         72014726   0.72%
     4:         11040522   0.11%
     5:           872811   0.01%
     6:            12480   0.00%
     7:              106   0.00%
     8:                2   0.00%
* FE_TOWARDZERO
     0:       4225861532  42.26%
     1:       5027051105  50.27%
     2:        706443411   7.06%
     3:         39877908   0.40%
     4:           713109   0.01%
     5:            47513   0.00%
     6:             4961   0.00%
     7:              438   0.00%
     8:               23   0.00%

* Range [20, 0x5.d53649e2d4674p+1012]
* FE_TONEAREST
     0:       7262241995  72.62%
     1:       2737758005  27.38%
* FE_UPWARD
     0:       4690392401  46.90%
     1:       5143728216  51.44%
     2:        165879383   1.66%
* FE_DOWNWARD
     0:       4690333331  46.90%
     1:       5143794937  51.44%
     2:        165871732   1.66%
* FE_TOWARDZERO
     0:       4690343071  46.90%
     1:       5143786761  51.44%
     2:        165870168   1.66%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                     112.9740       135.8640       -20.26%
x86_64v2                   111.8910       131.7590       -17.76%
x86_64v3                   108.2800        68.0935        37.11%
aarch64                     61.3759        49.2403        19.77%
power10                     42.4483        24.1943        43.00%

Latency                      master        patched   improvement
x86_64                     144.0090       167.9750       -16.64%
x86_64v2                   139.2690       167.1900       -20.05%
x86_64v3                   130.1320        96.9347        25.51%
aarch64                     66.8538        53.2747        20.31%
power10                     49.5076        29.6917        40.03%

For x86_64/x86_64-v2, most performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Move atanh internal data to separate file

The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Consolidate acosh and asinh internal table

The shared internal data definitions are consolidated in
s_asincosh_data.c.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use atanh from CORE-MATH

The current implementation precision shows the following accuracy, on
one range ([-1,1]) with 10e9 uniform randomly generated numbers for
each range (first column is the accuracy in ULP, with '0' being
correctly rounded, second is the number of samples with the
corresponding precision):

* Range [-1, 1]
* FE_TONEAREST
     0:       8180011860  81.80%
     1:       1819865257  18.20%
     2:           122883   0.00%
* FE_UPWARDA
     0:       3903695744  39.04%
     1:       4992324465  49.92%
     2:       1096319340  10.96%
     3:          7660451   0.08%
* FE_DOWNWARDA
     0:       3904555484  39.05%
     1:       4991970864  49.92%
     2:       1095447471  10.95%
     3:          8026181   0.08%
* FE_TOWARDZERO
     0:       7070209165  70.70%
     1:       2908447434  29.08%
     2:         21343401   0.21%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      26.4969        22.4625       15.23%
x86_64v2                    26.0792        22.9822       11.88%
x86_64v3                    25.6357        22.2147       13.34%
aarch64                     20.2295        19.7001        2.62%
power10                     10.0986         9.3846        7.07%

Latency                      master        patched   improvement
x86_64                      80.2311        59.9745       25.25%
x86_64v2                    79.7010        61.4066       22.95%
x86_64v3                    78.2679        58.5804       25.15%
aarch64                     34.3959        28.1523       18.15%
power10                     23.2417        18.2694       21.39%

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use asinh from CORE-MATH

The current implementation precision shows the following accuracy, on
tthree different ranges ([-DBL_MAX, -10], [-10,10], and [10, DBL_MAX))
with 10e9 uniform randomly generated numbers for each range (first
column is the accuracy in ULP, with '0' being correctly rounded, second
is the number of samples with the corresponding precision):

* range [-DBL_MAX, -10]
* FE_TONEAREST
     0:       5164019099  51.64%
     1:       4835980901  48.36%
* FE_UPWARD
     1:       4836053540  48.36%
     2:       5163946460  51.64%
* FE_DOWNWARD
     1:       5163926134  51.64%
     2:       4836073866  48.36%
* FE_TOWARDZERO
     0:       5163937001  51.64%
     1:       4836062999  48.36%

* Range [-10, 10)
* FE_TONEAREST
     0:       8679029381  86.79%
     1:       1320934581  13.21%
     2:            36038   0.00%
* FE_UPWARD
     0:       3965704277  39.66%
     1:       4993616710  49.94%
     2:       1039680225  10.40%
     3:           998788   0.01%
* FE_DOWNWARD
     0:       3965806523  39.66%
     1:       4993534438  49.94%
     2:       1039601726  10.40%
     3:          1057313   0.01%
* FE_TOWARDZEROA
     0:       7734210130  77.34%
     1:       2261868439  22.62%
     2:          3921431   0.04%

* Range [10, DBL_MAX)
* FE_TONEAREST
     0:       5163973212  51.64%
     1:       4836026788  48.36%
* FE_UPWARD
     0:       4835991071  48.36%
     1:       5164008929  51.64%
* FE_DOWNWARD
     0:       5163983594  51.64%
     1:       4836016406  48.36%
* FE_TOWARDZERO
     0:       5163993394  51.64%
     1:       4836006606  48.36%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master        patched   improvement
x86_64                      26.5178        45.3754       -71.11%
x86_64v2                    26.3167        44.7870       -70.18%
x86_64v3                    25.9109        25.4887         1.63%
aarch64                     18.0555        17.3374         3.98%
power10                     19.8535        20.5586        -3.55%

Latency                      master        patched   improvement
x86_64                      82.6755        91.2026       -10.31%
x86_64v2                    82.4581        90.7152       -10.01%
x86_64v3                    80.7000        71.9454        10.85%
aarch64                     32.8320        28.8565        12.11%
power10                     44.5309        37.0096        16.89%

For x86_64/x86_64-v2, most performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use acosh from CORE-MATH

The current implementation precision shows the following accuracy, on
two different ranges ([1,21) and [21, DBL_MAX)) with 10e9 uniform
randomly generated numbers (first column is the accuracy in ULP, with
'0' being correctly rounded, second is the number of samples with the
corresponding precision):

* range [1,21]

* FE_TONEAREST
    0:       8931139411  89.31%
    1:       1068697545  10.69%
    2:           163044   0.00%
* FE_UPWARD
    0:       7936620731  79.37%
    1:       2062594522  20.63%
    2:           783977   0.01%
    3:              770   0.00%
* FE_DOWNWARD
    0:       7936459794  79.36%
    1:       2062734117  20.63%
    2:           805312   0.01%
    3:              777   0.00%
* FE_TOWARDZERO
    0:       7910345595  79.10%
    1:       2088584522  20.89%
    2:          1069106   0.01%
    3:              777   0.00%

* Range [21, DBL_MAX)
* FE_TONEAREST
    0:       5163888431  51.64%
    1:       4836111569  48.36%
* FE_UPWARD
    0:       4835951885  48.36%
    1:       5164048115  51.64%
* FE_DOWNWARD
    0:       5164048432  51.64%
    1:       4835951568  48.36%
* FE_TOWARDZERO
    0:       5164058042  51.64%
    1:       4835941958  48.36%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:

reciprocal-throughput        master       patched   improvement
x86_64                      20.9131       47.2187      -125.79%
x86_64v2                    20.8823       41.1042       -96.84%
x86_64v3                    19.0282       25.8045       -35.61%
aarch64                     14.7419       18.1535       -23.14%
power10                     8.98341       11.0423       -22.92%

Latency                      master       patched   improvement
x86_64                      75.5494       89.5979      -18.60%
x86_64v2                    74.4443       87.6292      -17.71%
x86_64v3                    71.8558       70.7086        1.60%
aarch64                     30.3361       29.2709        3.51%
power10                     20.9263       19.2482        8.02%

For x86_64/x86_64-v2, most performance hit came from the fma call
through the ifunc mechanism.

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

Linux: fix tst-copy_file_range-large test on 32-bit platforms.

Since SSIZE_MAX is less than UINT_MAX on 32-bit platforms we must AND
the expression with SSIZE_MAX.

Tested on x86_64 and x86.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

x86: Disable AVX Fast Unaligned Load on Hygon 1/2/3

- Performance testing revealed significant memcpy performance degradation
  when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
  memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences primarily in floating-point operations don't justify
  AVX usage for memory operations.

Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

ppc64le: Power 10 rawmemchr clobbers v20 (bug #33091)

Replace non-volatile(v20) by volatile(v17)
since v20 is not restored

Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

malloc: fix large tcache code to check for exact size match

The tcache is used for allocation only if an exact match is found. In the
large tcache code added in commit cbfd7988107b, we currently extract a
chunk of size greater than or equal to the size we need, but don't check
strict equality. This patch fixes that behaviour.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

Fix configure from ab22e5ec37396f6c6f29d3e3306f6fcc2ebe9d49

The "-Wno-unused-command-line-argument" was incorrectly added.

misc: Fix clang -Wstring-plus-int warnings on syslog

clang issues:

syslog.c:193:9: error: adding 'int' to a string does not append to the string [-Werror,-Wstring-plus-int]
  193 |                       SYSLOG_HEADER (pri, timestamp, &msgoff, pid));
      |                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
syslog.c:180:7: note: expanded from macro 'SYSLOG_HEADER'
  180 |   "[" + (pid == 0), pid, "]" + (pid == 0)

Use array indexes instead of string addition (it is simpler than
add a supress warning).

sprof: fix -Wformat warnings on 32-bit hosts

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>