git.ipfire.org Git - thirdparty/glibc.git/log

elf: Run constructors on cyclic recursive dlopen (bug 31986)

This is conceptually similar to the reported bug, but does not
depend on auditing. The fix is simple: just complete execution
of the constructors. This exposed the fact that the link map
for statically linked executables does not have l_init_called
set, even though constructors have run.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9897ced8e78db5d813166a7ccccfd5a42c69ef20)

ldconfig: Move endswithn into a new header file

is_gdb_python_file is doing a similar test, so it can use this helper
function as well.

Signed-off-by: Adam Sampson <ats@offog.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit ed2b8d3a866eb37e069f6a71bdf10421cd4c5e54)

x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212]

When the linker -z mark-plt option is used to add DT_X86_64_PLT,
DT_X86_64_PLTSZ and DT_X86_64_PLTENT, the r_addend field of the
R_X86_64_JUMP_SLOT relocation stores the offset of the indirect
branch instruction.  However, glibc versions without the commit:

commit f8587a61892cbafd98ce599131bf4f103466f084
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 19:21:48 2022 -0700

    x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT

    According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
    and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
    can ignore their r_addends.

Reviewed-by: Fangrui Song <maskray@google.com>
won't ignore the r_addend value in the R_X86_64_JUMP_SLOT relocation.
Such programs and shared libraries will fail at run-time randomly.

Add GLIBC_ABI_DT_X86_64_PLT version to indicate that glibc is compatible
with DT_X86_64_PLT.

The linker can add the glibc GLIBC_ABI_DT_X86_64_PLT version dependency
whenever -z mark-plt is passed to the linker.  The resulting programs and
shared libraries will fail to load at run-time against libc.so without the
GLIBC_ABI_DT_X86_64_PLT version, instead of fail randomly.

This fixes BZ #33212.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 399384e0c8193e31aea014220ccfa24300ae5938)

x86-64: Add GLIBC_ABI_GNU2_TLS version [BZ #33129]

Programs and shared libraries compiled with -mtls-dialect=gnu2 may fail
silently at run-time against glibc without the GNU2 TLS run-time fix
for:

https://sourceware.org/bugzilla/show_bug.cgi?id=31372

Add GLIBC_ABI_GNU2_TLS version to indicate that glibc has the working
GNU2 TLS run-time. Linker can add the GLIBC_ABI_GNU2_TLS version to
binaries which depend on the working GNU2 TLS run-time:

https://sourceware.org/bugzilla/show_bug.cgi?id=33130

so that such programs and shared libraries will fail to load and run at
run-time against libc.so without the GLIBC_ABI_GNU2_TLS version, instead
of fail silently at random.

This fixes BZ #33129.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 9df8fa397d515dc86ff5565f6c45625e672d539e)

libio: Test for fdopen memory leak without SEEK_END support (bug 31840)

The bug report used /dev/mem, but /proc/self/mem works as well
(if available).

(cherry picked from commit d0106b6ae26c8cc046269358a77188105c99d5e3)

Remove memory leak in fdopen (bug 31840)

Deallocate the memory for the FILE structure when seeking to the end fails
in append mode.

Fixes: ea33158c96 ("Fix offset caching for streams and use it for ftell (BZ #16680)")
(cherry picked from commit b2c3ee3724900975deaf5eae57640bb0c2d7315e)

math: Remove no-mathvec flag

More routines are to follow, some of which hit many failures in the
current testsuite due to wrong sign of zero (mathvec routines are not
required to get this right). Instead of disabling a large number of
tests, change the failure condition such that, for vector routines,
tests pass as long as computed == expected == 0.0, regardless of sign.

Affected tests (vector tests for expm1, log1p, sin, tan and tanh) all
still pass.

(cherry picked from commit 939e770e0196ebd763cacc602421b76d62df0798)

Use TLS initial-exec model for __libc_tsd_CTYPE_* thread variables [BZ #33234]

Commit 10a66a8e421b ("Remove <libc-tsd.h>") removed the TLS initial-exec
(IE) model attribute from the __libc_tsd_CTYPE_* thread variable declarations
and definitions. Commit a894f04d8776 ("Optimize __libc_tsd_* thread
variable access") restored it on declarations.

Restore the TLS initial-exec model attribute on __libc_tsd_CTYPE_* thread
variable definitions.

This resolves test tst-locale1 failure on s390 32-bit, when using a
GNU linker without the fix from GNU binutils commit aefebe82dc89
("IBM zSystems: Fix offset relative to static TLS").

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit e5363e6f460c2d58809bf10fc96d70fd1ef8b5b2)

ctype: Fallback initialization of TLS using relocations (bug 19341, bug 32483)

This ensures that the ctype data pointers in TLS are valid
in secondary namespaces even without initialization via
__ctype_init.

Reviewed-by: Frédéric Bérat <fberat@redhat.com>
(cherry picked from commit 2745db8dd3ec31045acd761b612516490085bc20)

Use proper extern declaration for _nl_C_LC_CTYPE_{class,toupper,tolower}

The existing initializers already contain explicit casts. Keep them
due to int/uint32_t mismatch.

Reviewed-by: Frédéric Bérat <fberat@redhat.com>
(cherry picked from commit e0c0f856f58ceb68800a964c36c15c606e7a8c4c)

Remove <libc-tsd.h>

Use __thread variables directly instead.  The macros do not save any
typing.  It seems unlikely that a future port will lack __thread
variable support.

Some of the __libc_tsd_* variables are referenced from assembler
files, so keep their names.  Previously, <libc-tls.h> included
<tls.h>, which in turn included <errno.h>, so a few direct includes
of <errno.h> are now required.

Reviewed-by: Frédéric Bérat <fberat@redhat.com>
(cherry picked from commit 10a66a8e421b09682b774c795ef1da402235dddc)

ctype: Reformat Makefile.

Reflow and sort Makefile.

Code generation changes present due to link order changes.

No regressions on x86_64 and i686.

(cherry picked from commit 12956e0a330e3d90fc196f7d7a047ce613f78920)

elf: Handle ld.so with LOAD segment gaps in _dl_find_object (bug 31943)

Detect if ld.so not contiguous and handle that case in _dl_find_object.
Set l_find_object_processed even for initially loaded link maps,
otherwise dlopen of an initially loaded object adds it to
_dlfo_loaded_mappings (where maps are expected to be contiguous),
in addition to _dlfo_nodelete_mappings.

Test elf/tst-link-map-contiguous-ldso iterates over the loader
image, reading every word to make sure memory is actually mapped.
It only does that if the l_contiguous flag is set for the link map.
Otherwise, it finds gaps with mmap and checks that _dl_find_object
does not return the ld.so mapping for them.

The test elf/tst-link-map-contiguous-main does the same thing for
the libc.so shared object. This only works if the kernel loaded
the main program because the glibc dynamic loader may fill
the gaps with PROT_NONE mappings in some cases, making it contiguous,
but accesses to individual words may still fault.

Test elf/tst-link-map-contiguous-libc is again slightly different
because the dynamic loader always fills the gaps with PROT_NONE
mappings, so a different form of probing has to be used.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 20681be149b9eb1b6c1f4246bf4bd801221c86cd)

elf: Extract rtld_setup_phdr function from dl_main

Remove historic binutils reference from comment and update
how this data is used by applications.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 2cac9559e06044ba520e785c151fbbd25011865f)

elf: Do not add a copy of _dl_find_object to libc.so

This reduces code size and dependencies on ld.so internals from
libc.so.

Fixes commit f4c142bb9fe6b02c0af8cfca8a920091e2dba44b
("arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405)").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 96429bcc91a14f71b177ddc5e716de3069060f2c)

arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405)

Instead of __dl_iterate_phdr. On ARM dlfo_eh_frame/dlfo_eh_count
maps to PT_ARM_EXIDX vaddr start / length.

On a Neoverse N1 machine with 160 cores, the following program:

  $ cat test.c
  #include <stdlib.h>
  #include <pthread.h>
  #include <assert.h>

  enum {
    niter = 1024,
    ntimes = 128,
  };

  static void *
  tf (void *arg)
  {
    int a = (int) arg;

    for (int i = 0; i < niter; i++)
      {
        void *p[ntimes];
        for (int j = 0; j < ntimes; j++)
   p[j] = malloc (a * 128);
        for (int j = 0; j < ntimes; j++)
   free (p[j]);
      }

    return NULL;
  }

  int main (int argc, char *argv[])
  {
    enum { nthreads = 16 };
    pthread_t t[nthreads];

    for (int i = 0; i < nthreads; i ++)
      assert (pthread_create (&t[i], NULL, tf, (void *) i) == 0);

    for (int i = 0; i < nthreads; i++)
      {
        void *r;
        assert (pthread_join (t[i], &r) == 0);
        assert (r == NULL);
      }

    return 0;
  }
  $ arm-linux-gnueabihf-gcc -fsanitize=address test.c -o test

Improves from ~15s to 0.5s.

Checked on arm-linux-gnueabihf.

(cherry picked from commit f4c142bb9fe6b02c0af8cfca8a920091e2dba44b)

AArch64: Improve codegen in SVE log1p

Improves memory access, reformat evaluation scheme to pack coefficients.
5% improvement in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit da196e6134ede64728006518352d75b6c3902fec)

AArch64: Optimize inverse trig functions

Improve performance of Inverse trig functions by altering how coefficients are
loaded.

Performance improvement on Neoverse V1:
SVE     acos   14%
AdvSIMD acos   6%

AdvSIMD asin   6%
SVE     asin   5%
AdvSIMD asinf  2%

AdvSIMD atanf  22%
SVE     atanf  20%
SVE     atan   11%
AdvSIMD atan   5%

SVE     atan2  7%
SVE     atan2f 4%
AdvSIMD atan2f 3%
AdvSIMD atan2  2%

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 1e84509e0041c0a83997aba602a585bb3b8285f0)

AArch64: Avoid memset ifunc in cpu-features.c [BZ #33112]

During early startup memcpy or memset must not be called since many targets
use ifuncs for them which won't be initialized yet. Security hardening may
use -ftrivial-auto-var-init=zero which inserts calls to memset. Redirect
memset to memset_generic by including dl-symbol-redir-ifunc.h in cpu-features.c.
This fixes BZ #33112.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 681a24ae4d0cb8ed92de98b4da660308840b09ba)

posix: Fix double-free after allocation failure in regcomp (bug 33185)

If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.

Reported-by: Anastasia Belova <abelova@astralinux.ru>
Co-authored-by: Paul Eggert <eggert@cs.ucla.edu>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e994093fa0bcca0d0ee2c1db271d8d7885d)

Fix error reporting (false negatives) in SGID tests

And simplify the interface of support_capture_subprogram_self_sgid.

Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion.  Handle all execution
errors within support_capture_subprogram_self_sgid.  In particular,
this includes test failures because the invoked program did not
exit with exit status zero.  Existing tests that expect exit
status 42 are adjusted to use zero instead.

In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).

Fixes commit 35fc356fa3b4f485bd3ba3114c9f774e5df7d3c2
("elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3a3fb2ed83f79100c116c824454095ecfb335ad7)

support: Pick group in support_capture_subprogram_self_sgid if UID == 0

When running as root, it is likely that we can run under any group.
Pick a harmless group from /etc/group in this case.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2f769cec448d84a62b7dd0d4ff56978fe22c0cd6)

ppc64le: Revert "powerpc: Optimized strcmp for power10" (CVE-2025-5702)

This reverts commit 3367d8e180848030d1646f088759f02b8dfe0d6f

Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)

Tested on ppc64le without regression.

(cherry picked from commit 15808c77b35319e67ee0dc8f984a9a1a434701bc)

ppc64le: Revert "powerpc : Add optimized memchr for POWER10" (Bug 33059)

This reverts commit b9182c793caa05df5d697427c0538936e6396d4b

Reason for revert: Power10 memchr clobbers v20 vector register
(Bug 33059)

This is not a security issue, unlike CVE-2025-5745 and
CVE-2025-5702.

Tested on ppc64le without regression.

(cherry picked from commit a7877bb6685300f159fa095c9f50b22b112cddb8)

ppc64le: Revert "powerpc: Fix performance issues of strcmp power10" (CVE-2025-5702)

This reverts commit 90bcc8721ef82b7378d2b080141228660e862d56

This change is in the chain of the final revert that fixes the CVE
i.e. 3367d8e180848030d1646f088759f02b8dfe0d6f

Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)

Tested on ppc64le with no regressions.

(cherry picked from commit c22de63588df7a8a0edceea9bb02534064c9d201)

elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)

This should really move into support_capture_subprogram_self_sgid.

Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 35fc356fa3b4f485bd3ba3114c9f774e5df7d3c2)

x86_64: Fix typo in ifunc-impl-list.c.

Fix wcsncpy and wcpncpy typo in ifunc-impl-list.c.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit f2aeb6ff941dccc4c777b5621e77addea6cc076c)

elf: Test case for bug 32976 (CVE-2025-4802)

Check that LD_LIBRARY_PATH is ignored for AT_SECURE statically
linked binaries, using support_capture_subprogram_self_sgid.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit d8f7a79335b0d861c12c42aec94c04cd5bb181e2)

support: Add support_record_failure_barrier

This can be used to stop execution after a TEST_COMPARE_BLOB
failure, for example.

(cherry picked from commit d0b8aa6de4529231fadfe604ac2c434e559c2d9e)

support: Use const char * argument in support_capture_subprogram_self_sgid

The function does not modify the passed-in string, so make this clear
via the prototype.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f0c09fe61678df6f7f18fe1ebff074e62fa5ca7a)

elf: Keep using minimal malloc after early DTV resize (bug 32412)

If an auditor loads many TLS-using modules during startup, it is
possible to trigger DTV resizing. Previously, the DTV was marked
as allocated by the main malloc afterwards, even if the minimal
malloc was still in use. With this change, _dl_resize_dtv marks
the resized DTV as allocated with the minimal malloc.

The new test reuses TLS-using modules from other auditing tests.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit aa3d7bd5299b33bffc118aa618b59bfa66059bcb)

sysdeps/unix/sysv/linux/x86_64/Makefile: Add the end marker

Add the end marker to tests, tests-container and modules-names.

(cherry picked from commit e6350be7e9cae8f71c96c1f06eab61b9acb227c8)

sysdeps/x86_64/Makefile (tests): Add the end marker

(cherry picked from commit 71d133c500b0d23f6b6a7c6e3595e3fc447bfe91)

sort-makefile-lines.py: Allow '_' in name and "^# name"

'_' is used in Makefile variable names and many variables end with
"^# name". Relax sort-makefile-lines.py to allow '_' in name and
"^# name" as variable end. This fixes BZ #31385.

(cherry picked from commit 6a2512bf1605a4208dd94ef67408488d8acb2409)

libio: Correctly link tst-popen-fork against libpthread

tst-popen-fork failed to build for Hurd due to not being linked with
libpthread. This commit fixes that.

Tested with build-many-glibcs.py for i686-gnu.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 6a290b2895b77be839fcb7c44a6a9879560097ad)

libio: Fix a deadlock after fork in popen

popen modifies its file handler book-keeping under a lock that wasn't
being taken during fork. This meant that a concurrent popen and fork
could end up copying the lock in a "locked" state into the fork child,
where subsequently calling popen would lead to a deadlock due to the
already (spuriously) held lock.

This commit fixes the deadlock by appropriately taking the lock before
fork, and releasing/resetting it in the parent/child after the fork.

A new test for concurrent popen and fork is also added. It consistently
hangs (and therefore fails via timeout) without the fix applied.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 9f0d2c0ee6c728643fcf9a4879e9f20f5e45ce5f)

libio: Sort test variables in Makefile

Sort test variables in libio/Makefile using scripts/sort-makefile-lines.py.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit ddf71c550a5940deca74cc676f1cae134a891717)

Linux: Switch back to assembly syscall wrapper for prctl (bug 29770)

Commit ff026950e280bc3e9487b41b460fb31bc5b57721 ("Add a C wrapper for
prctl [BZ #25896]") replaced the assembler wrapper with a C function.
However, on powerpc64le-linux-gnu, the C variadic function
implementation requires extra work in the caller to set up the
parameter save area.  Calling a function that needs a parameter save
area without one (because the prototype used indicates the function is
not variadic) corrupts the caller's stack.   The Linux manual pages
project documents prctl as a non-variadic function.  This has resulted
in various projects over the years using non-variadic prototypes,
including the sanitizer libraries in LLVm and GCC (GCC PR 113728).

This commit switches back to the assembler implementation on most
targets and only keeps the C implementation for x86-64 x32.

Also add the __prctl_time64 alias from commit
b39ffab860cd743a82c91946619f1b8158b0b65e ("Linux: Add time64 alias for
prctl") to sysdeps/unix/sysv/linux/syscalls.list; it was not yet
present in commit ff026950e280bc3e9487b41b460fb31bc5b57721.

This restores the old ABI on powerpc64le-linux-gnu, thus fixing
bug 29770.

Reviewed-By: Simon Chopin <simon.chopin@canonical.com>
(cherry picked from commit 6a04404521ac4119ae36827eeb288ea84eee7cf6)

nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)

The new initializer and struct layout does not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.

Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit dbc5a50d12eff4cb3f782129029d04b8a76f58e7)

nptl: Use all of g1_start and g_signals

The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:

"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"

I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that, (and even if it did, why would you want to force a
sleep in that case?) and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 91bb902f58264a2fd50fbce8f39a9a290dd23706)

nptl: rename __condvar_quiesce_and_switch_g1

This function no longer waits for threads to leave g1, so rename it to
__condvar_switch_g1

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4b79e27a5073c02f6bff9aa8f4791230a0ab1867)

nptl: Fix indentation

In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ee6c14ed59d480720721aaacc5fb03213dc153da)

nptl: Use a single loop in pthread_cond_wait instaed of a nested loop

The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 929a4764ac90382616b6a21f099192b2475da674)

nptl: Remove g_refs from condition variables

This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit c36fc50781995e6758cae2b6927839d0157f213c)

nptl: Remove unnecessary quadruple check in pthread_cond_wait

pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:

1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
   for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this
3. After incrementing grefs. There is no reason to think that the group would
   close while we do an atomic increment. Obviously it could close at any
   point, but that doesn't mean we have to recheck after every step. This
   check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
   we check that we're not in a closed group. There is no need to check again.
   The check would only have helped in cases where the compare_exchange in the
   next line would also have failed. Relying on the compare_exchange is fine.

Removing the duplicate checks clarifies the code.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4f7b051f8ee3feff1b53b27a906f245afaa9cee1)

nptl: Remove unnecessary catch-all-wake in condvar group switch

This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.

In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit b42cc6af11062c260c7dfa91f1c89891366fed3e)

nptl: Update comments and indentation for new condvar implementation

Some comments were wrong after the most recent commit. This fixes that.

Also fixing indentation where it was using spaces instead of tabs.

Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 0cc973160c23bb67f895bc887dd6942d29f8fee3)

pthreads NPTL: lost wakeup fix 2

This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch.  This greatly reduces the average
and maximum latency for pthread_cond_signal.

The g_signals[] field now contains a signal count that is relative to
the current g1_start value.  Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)

By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same.  This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change.  This removes the race condition and A/B/A
problems that otherwise ocurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.

By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore.  The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit.  If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.

The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters.  Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2.  They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.

Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block.  This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.

Signed-off-by: Frank Barrus <frankbarrus_sw@shaggy.cc>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 1db84775f831a1494993ce9c118deaf9537cc50a)

x86: Detect Intel Diamond Rapids

Detect Intel Diamond Rapids and tune it similar to Intel Granite Rapids.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit de14f1959ee5f9b845a7cae43bee03068b8136f0)

x86: Handle unknown Intel processor with default tuning

Enable default tuning for unknown Intel processor.

Tested on x86, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9f0deff558d1d6b08c425c157f50de85013ada9c)

x86: Add ARL/PTL/CWF model detection support

- Add ARROWLAKE model detection.
- Add PANTHERLAKE model detection.
- Add CLEARWATERFOREST model detection.

Intel® Architecture Instruction Set Extensions Programming Reference
https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.

No regression, validated model detection on SDE.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e53eb952b970ac94c97d74fb447418fb327ca096)

x86: Optimize xstate size calculation

Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.

Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 70b648855185e967e54668b101d24704c3fb869d)

x86: Use `Avoid_Non_Temporal_Memset` to control non-temporal path

This is just a refactor and there should be no behavioral change from
this commit.

The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b93dddfaf440aa12f45d7c356f6ffe9f27d35577)

x86: Tunables may incorrectly set Prefer_PMINUB_for_stringop (bug 32047)

Fixes commit 5bcf6265f215326d14dfacdce8532792c2c7f8f8 ("x86:
Disable non-temporal memset on Skylake Server").

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 7a630f7d3392ca391a399486ce2846f9e4b4ee63)

x86: Disable non-temporal memset on Skylake Server

The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).

Further benchmarks indicate non-temporal stores may in fact by a
regression on Skylake Server.

This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.

Tested using qemu on all x86_64 cpu arch supported by both qemu +
GLIBC.

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bcf6265f215326d14dfacdce8532792c2c7f8f8)

x86: Fix value for `x86_memset_non_temporal_threshold` when it is undesirable

When we don't want to use non-temporal stores for memset, we set
`x86_memset_non_temporal_threshold` to SIZE_MAX.

The current code, however, we using `maximum_non_temporal_threshold`
as the upper bound which is `SIZE_MAX >> 4` so we ended up with a
value of `0`.

Fix is to just use `SIZE_MAX` as the upper bound for when setting the
tunable.
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5b54a33435e5533653a9956728f2de9d16a3b4ee)

x86: Enable non-temporal memset tunable for AMD

In commit 46b5e98ef6f1 ("x86: Add seperate non-temporal tunable for
memset") a tunable threshold for enabling non-temporal memset was added,
but only for Intel hardware.

Since that commit, new benchmark results suggest that non-temporal
memset is beneficial on AMD, as well, so allow this tunable to be set
for AMD.

See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
which has been updated to include data using different stategies for
large memset on AMD Zen2, Zen3, and Zen4.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit bef2a827a55fc759693ccc5b0f614353b8ad712d)

x86: Add seperate non-temporal tunable for memset

The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.

This patch add `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non Intel vendors as the only
benchmarks showing its benefit have been on Intel hardware.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 46b5e98ef6f1b9f4b53851f152ecb8209064b26c)

x86: Link tst-gnu2-tls2-x86-noxsave{,c,xsavec} with libpthread

This fixes a test build failure on Hurd.

Fixes commit 145097dff170507fe73190e8e41194f5b5f7e6bf ("x86: Use separate
variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit c6e2895695118ab59c7b17feb0fcb75a53e3478c)

x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)

Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.

This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.

Fixes commit 9b7091415af47082664717210ac49d51551456ab ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 145097dff170507fe73190e8e41194f5b5f7e6bf)

x86: Skip XSAVE state size reset if ISA level requires XSAVE

If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.

Related to commit befe2d3c4dec8be2cdd01a47132e47bdb7020922
("x86-64: Don't use SSE resolvers for ISA level 3 or above").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 59585ddaa2d44f22af04bb4b8bd4ad1e302c4c02)

x86_64: Add atanh with FMA

On SPR, it improves atanh bench performance by:

Before After Improvement
reciprocal-throughput 15.1715 14.8628 2%
latency 57.1941 56.1883 2%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c7c4a5906f326f1290b1c2413a83c530564ec4b8)

x86_64: Add sinh with FMA

On SPR, it improves sinh bench performance by:

Before After Improvement
reciprocal-throughput 14.2017 11.815 17%
latency 36.4917 35.2114 4%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit dded0d20f67ba1925ccbcb9cf28f0c75febe0dbe)

x86_64: Add tanh with FMA

On Skylake, it improves tanh bench performance by:

Before After Improvement
max 110.89 95.826 14%
min 20.966 20.157 4%
mean 30.9601 29.8431 4%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6352111c72a20b3588ae304dd99b63e25dd6d85)

x86-64: Exclude FMA4 IFUNC functions for -mapxf

When -mapxf is used to build glibc, the resulting glibc will never run
on FMA4 machines. Exclude FMA4 IFUNC functions when -mapxf is used.
This requires GCC which defines __APX_F__ for -mapxf with commit:

1df56719bd8 x86: Define __APX_F__ for -mapxf

Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9e1f4aef865ddeffeb4b5f6578fefab606783120)

nptl: clear the whole rseq area before registration

Due to the extensible nature of the rseq area we can't explictly
initialize fields that are not part of the ABI yet. It was agreed with
upstream that all new fields will be documented as zero initialized by
userspace. Future kernels configured with CONFIG_DEBUG_RSEQ will
validate the content of all fields during registration.

Replace the explicit field initialization with a memset of the whole
rseq area which will cover fields as they are added to future kernels.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 689a62a4217fae78b9ce0db781dc2a421f2b1ab4)

math: Improve layout of exp/exp10 data

GCC aligns global data to 16 bytes if their size is >= 16 bytes.  This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps.  As a result on targets that support them, more load-pair
instructions are used in exp.  Exp10 is improved by moving invlog10_2N later
so that neglog10_2hiN and neglog10_2loN can be loaded using load-pair.

The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2.  Exp10 improves by 1.5%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)

AArch64: Use prefer_sve_ifuncs for SVE memset

Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)

AArch64: Add SVE memset

Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes. Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)

math: Improve layout of expf data

GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080fe6a9539f0d2345b9d2ae37b8ee57a)

AArch64: Remove zva_128 from memset

Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f967531a77e1824c23b5368c6434a72d)

AArch64: Optimize memset

Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef32412779e207f825db0d057ebb4628ae8)

AArch64: Improve generic strlen

Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642dcafdbc11a99f2767e081d086f5fc7)

AArch64: Improve codegen for SVE logs

Reduce memory access by using lanewise MLA and moving constants to struct
and reduce number of MOVPRFXs.
Update maximum ULP error for double log_sve from 1 to 2.
Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).

(cherry picked from commit 32d193a372feb28f9da247bb7283d404b84429c6)

AArch64: Improve codegen in SVE tans

Improves memory access.
Tan: MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return.
Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move mov away from return.

(cherry picked from commit aa6609feb20ebf8653db639dabe2a6afc77b02cc)

AArch64: Improve codegen of AdvSIMD atan(2)(f)

Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
8% improvement in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 6914774b9d3460876d9ad4482782213ec01a752e)

AArch64: Improve codegen of AdvSIMD logf function family

Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
8% improvement in throughput microbenchmark on Neoverse V1 for log2 and log,
and 2% for log10.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit d6e034f5b222a9ed1aeb5de0c0c7d0dda8b63da3)

AArch64: Improve codegen in AdvSIMD logs

Remove spurious ADRP and a few MOVs.
Reduce memory access by using more indexed MLAs in polynomial.
Align notation so that algorithms are easier to compare.
Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%).
Update error threshold in AdvSIMD log (now matches SVE log).

(cherry picked from commit 8eb5ad2ebc94cc5bedbac57c226c02ec254479c7)

AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines

This operation can be simplified to use simpler multiply-round-convert
sequence, which uses fewer instructions and constants.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 16a59571e4e9fd019d3fc23a2e7d73c1df8bb5cb)

aarch64: Avoid redundant MOVs in AdvSIMD F32 logs

Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 8b09af572b208bfde4d31c6abbae047dcc217675)

aarch64: Fix AdvSIMD libmvec routines for big-endian

Previously many routines used * to load from vector types stored
in the data table. This is emitted as ldr, which byte-swaps the
entire vector register, and causes bugs for big-endian when not
all lanes contain the same value. When a vector is to be used
this way, it has been replaced with an array and the load with an
explicit ld1 intrinsic, which byte-swaps only within lanes.

As well, many routines previously used non-standard GCC syntax
for vector operations such as indexing into vectors types with []
and assembling vectors using {}. This syntax should not be mixed
with ACLE, as the former does not respect endianness whereas the
latter does. Such examples have been replaced with, for instance,
vcombine_* and vgetq_lane* intrinsics. Helpers which only use the
GCC syntax, such as the v_call helpers, do not need changing as
they do not use intrinsics.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 90a6ca8b28bf34e361e577e526e1b0f4c39a32a5)

assert: Add test for CVE-2025-0395

Use the __progname symbol to override the program name to induce the
failure that CVE-2025-0395 describes.

This is related to BZ #32582

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cdb9ba84191ce72e86346fb8b1d906e7cd930ea2)

stdlib: Test using setenv with updated environ [BZ #32588]

Add a test for setenv with updated environ. Verify that BZ #32588 is
fixed.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8ab34497de14e35aff09b607222fe1309ef156da)

malloc: obscure calloc use in tst-calloc

Similar to a9944a52c967ce76a5894c30d0274b824df43c7a and
f9493a15ea9cfb63a815c00c23142369ec09d8ce, we need to hide calloc use from
the compiler to accommodate GCC's r15-6566-g804e9d55d9e54c change.

First, include tst-malloc-aux.h, but then use `volatile` variables
for size.

The test passes without the tst-malloc-aux.h change but IMO we want
it there for consistency and to avoid future problems (possibly silent).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c3d1dac96bdd10250aa37bb367d5ef8334a093a1)

Hide all malloc functions from compiler [BZ #32366]

Since -1 isn't a power of two, compiler may reject it, hide memalign from
Clang 19 which issues an error:

tst-memalign.c:86:31: error: requested alignment is not a power of 2 [-Werror,-Wnon-power-of-two-alignment]
   86 |   p = memalign (-1, pagesize);
      |                 ^~
tst-memalign.c:86:31: error: requested alignment must be 4294967296 bytes or smaller; maximum alignment assumed [-Werror,-Wbuiltin-assume-aligned-alignment]
   86 |   p = memalign (-1, pagesize);
      |                 ^~

Update tst-malloc-aux.h to hide all malloc functions and include it in
all malloc tests to prevent compiler from optimizing out any malloc
functions.

Tested with Clang 19.1.5 and GCC 15 20241206 for BZ #32366.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit f9493a15ea9cfb63a815c00c23142369ec09d8ce)

Fix underallocation of abort_msg_s struct (CVE-2025-0395)

Include the space needed to store the length of the message itself, in
addition to the message string. This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704cb81e9ad0a78c644a83e1e9cd2ee578)

x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]

The loop should be aligned to 32-bytes so that it can ideally run out
the DSB. This is particularly important on Skylake-Server where
deficiencies in it's DSB implementation make it prone to not being
able to run loops out of the DSB.

For example running strcmp-evex on 200Mb string:

32-byte aligned loop:
    - 43,399,578,766      idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704       idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).

NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignmnent, might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.

Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.

The values being reported is the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (most on the
page-cross logic). These should be safe to ignore because 1) We
previously only 16-byte aligned the function so this behavior is not
new and was essentially up to chance before this patch and 2) this
type of alignment related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any affect
on realworld performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d3211532903d7e790211af5a1d55fdb1f3)

x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bf0ab80573d66e4ae5d94b094659094336da90f)

x86: Avoid integer truncation with large cache sizes (bug 32470)

Some hypervisors report 1 TiB L3 cache size. This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.

(cherry picked from commit 61c3450db96dce96ad2b24b4f0b548e6a46d68e5)

math: Exclude internal math symbols for tests [BZ #32414]

Since internal tests don't have access to internal symbols in libm,
exclude them for internal tests. Also make tst-strtod5 and tst-strtod5i
depend on $(libm) to support older versions of GCC which can't inline
copysign family functions. This fixes BZ #32414.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 5df09b444835fca6e64b3d4b4a5beb19b3b2ba21)

malloc: add indirection for malloc(-like) functions in tests [BZ #32366]

GCC 15 introduces allocation dead code removal (DCE) for PR117370 in
r15-5255-g7828dc070510f8. This breaks various glibc tests which want
to assert various properties of the allocator without doing anything
obviously useful with the allocated memory.

Alexander Monakov rightly pointed out that we can and should do better
than passing -fno-malloc-dce to paper over the problem. Not least because
GCC 14 already does such DCE where there's no testing of malloc's return
value against NULL, and LLVM has such optimisations too.

Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure that we're calling malloc (et. al) from the
compiler.

Reviewed-by: Paul Eggert <eggert@cs.ucla.edu>
(cherry picked from commit a9944a52c967ce76a5894c30d0274b824df43c7a)

Pass -nostdlib -nostartfiles together with -r [BZ #31753]

Since -r in GCC 6/7/8 doesn't imply -nostdlib -nostartfiles, update the
link-static-libc.out rule to also pass -nostdlib -nostartfiles. This
fixes BZ #31753.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 2be3352f0b1ebaa39596393fffe1062275186669)

nptl: initialize cpu_id_start prior to rseq registration

When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.

While current kernels don't validate the content of this field on
registration, future ones could.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3305d97e30a8cf8724218c42a63680a)

nptl: initialize rseq area prior to registration

Per the rseq syscall documentation, 3 fields are required to be
initialized by userspace prior to registration, they are 'cpu_id',
'rseq_cs' and 'flags'. Since we have no guarantee that 'struct pthread'
is cleared on all architectures, explicitly set those 3 fields prior to
registration.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25628425971f07e9b0e7f8eec0741235)

elf: Change ldconfig auxcache magic number (bug 32231)

In commit c628c2296392ed3bf2cb8d8470668e64fe53389f (elf: Remove
ldconfig kernel version check), the layout of auxcache entries
changed because the osversion field was removed from
struct aux_cache_file_entry. However, AUX_CACHEMAGIC was not
changed, so existing files are still used, potentially leading
to unintended ldconfig behavior. This commit changes AUX_CACHEMAGIC,
so that the file is regenerated.

Reported-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 0a536f6e2f76e3ef581b3fd9af1e5cf4ddc7a5a2)

Make tst-strtod-underflow type-generic

The test tst-strtod-underflow covers various edge cases close to the
underflow threshold for strtod (especially cases where underflow on
architectures with after-rounding tininess detection depends on the
rounding mode). Make it use the type-generic machinery, with
corresponding test inputs for each supported floating-point format, so
that other functions in the strtod family are tested for underflow
edge cases as well.

Tested for x86_64.

(cherry picked from commit 94ca2c0894f0e1b62625c369cc598a2b9236622c)

Add crt1-2.0.o for glibc 2.0 compatibility tests

Starting from glibc 2.1, crt1.o contains _IO_stdin_used which is checked
by _IO_check_libio to provide binary compatibility for glibc 2.0.  Add
crt1-2.0.o for tests against glibc 2.0.  Define tests-2.0 for glibc 2.0
compatibility tests.  Add and update glibc 2.0 compatibility tests for
stderr, matherr and pthread_kill.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 5f245f3bfbe61b2182964dafb94907e38284b806)

Add tests of more strtod special cases

There is very little test coverage of inputs to strtod-family
functions that don't contain anything that can be parsed as a number
(one test of ".y" in tst-strtod2), and none that I can see of skipping
initial whitespace. Add some tests of these things to tst-strtod2.

Tested for x86_64.

(cherry picked from commit 378039ca578c2ea93095a1e710d96f58c68a3997)

Add more tests of strtod end pointer

Although there are some tests in tst-strtod2 and tst-strtod3 for the
end pointer provided by strtod when it doesn't parse the whole string,
they aren't very thorough. Add tests of more such cases to
tst-strtod2.

Tested for x86_64.

(cherry picked from commit b5d3737b305525315e0c7c93ca49eadc868eabd5)

Make tst-strtod2 and tst-strtod5 type-generic

Some of the strtod tests use type-generic machinery in tst-strtod.h to
test the strto* functions for all floating types, while others only
test double even when the tests are in fact meaningful for all
floating types.

Convert tst-strtod2 and tst-strtod5 to use the type-generic machinery
so they test all floating types. I haven't tried to convert them to
use newer test interfaces in other ways, just made the changes
necessary to use the type-generic machinery.

Tested for x86_64.

(cherry picked from commit 8de031bcb9adfa736c0caed2c79d10947b8d8f48)