posix: Fix double-free after allocation failure in regcomp (bug 33185)
If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.
Reported-by: Anastasia Belova <abelova@astralinux.ru>
Co-authored-by: Paul Eggert <eggert@cs.ucla.edu>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e994093fa0bcca0d0ee2c1db271d8d7885d)
Sam James [Mon, 9 Dec 2024 23:11:25 +0000 (23:11 +0000)]
malloc: add indirection for malloc(-like) functions in tests [BZ #32366]
GCC 15 introduces allocation dead code removal (DCE) for PR117370 in r15-5255-g7828dc070510f8. This breaks various glibc tests which want
to assert various properties of the allocator without doing anything
obviously useful with the allocated memory.
Alexander Monakov rightly pointed out that we can and should do better
than passing -fno-malloc-dce to paper over the problem. Not least because
GCC 14 already does such DCE where there's no testing of malloc's return
value against NULL, and LLVM has such optimisations too.
Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure from the compiler that we're calling
malloc (et al.).
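A minimal sketch of the idea (the wrapper and pointer names here are
illustrative, not the actual test-support names):

```
#include <stdlib.h>

/* Calling through a volatile function pointer hides the callee from
   the compiler, so the allocation cannot be proven dead and removed.
   Names are illustrative.  */
static void *(*volatile malloc_indirect) (size_t) = malloc;
static void (*volatile free_indirect) (void *) = free;

static void *
test_malloc (size_t size)
{
  return malloc_indirect (size);
}

static void
test_free (void *ptr)
{
  free_indirect (ptr);
}
```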
nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)
[BZ #25847]
The new initializer and struct layout do not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
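The C rule this relies on, in a reduced sketch (not the full struct
__pthread_cond_s layout; field names abbreviated):

```
/* Members not covered by a static initializer are zero-initialized,
   so keeping the old fields at the end of the struct means existing
   PTHREAD_COND_INITIALIZER-style initializers still zero them.  */
struct cond_sketch
{
  unsigned long long wseq;    /* covered by the initializer */
  unsigned int g_refs[2];     /* trailing fields: implicitly zeroed */
  unsigned int g_size[2];
};

static struct cond_sketch cond = { 0 };  /* g_refs and g_size are 0 */
```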
The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:
"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"
I would either have had to update the comment or get rid of the check;
I chose to get rid of the check. In fact I don't quite know why it was
there. There will never be available signals for group G2, so we didn't
need the special case. Even if there were, this would just be a spurious
wake. This might have caught some cases where the count has wrapped
around, but it wouldn't reliably do that (and even if it did, why would
you want to force a sleep in that case?), and we don't support that many
concurrent waiters anyway. Getting rid of it allows us to use one more
bit, making us more robust to wraparound.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
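For reference, a sketch of the bit layout this works with (illustrative
accessors, not the glibc helpers):

```
#include <stdint.h>

/* g1_start keeps a flag in its LSB indicating which group slot is G2;
   the sequence number occupies the remaining bits.  Illustrative.  */
static inline unsigned int
g2_index (uint64_t g1_start)
{
  return g1_start & 1;
}

static inline uint64_t
g1_start_seq (uint64_t g1_start)
{
  return g1_start >> 1;
}
```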
In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
nptl: Use a single loop in pthread_cond_wait instead of a nested loop
[BZ #25847]
The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This variable used to be needed in group switching, to wait until all
sleepers had confirmed that they had woken. This is no longer needed.
Nothing waits on this variable, so there is no need to track how many
threads are currently asleep in each group.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
nptl: Remove unnecessary quadruple check in pthread_cond_wait
[BZ #25847]
pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:
1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this one.
3. After incrementing grefs. There is no reason to think that the group would
close while we do an atomic increment. Obviously it could close at any
point, but that doesn't mean we have to recheck after every step. This
check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
we check that we're not in a closed group. There is no need to check again.
The check would only have helped in cases where the compare_exchange in the
next line would also have failed. Relying on the compare_exchange is fine.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
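The reasoning behind dropping check 4, as a generic C11 sketch (not
the glibc code, which uses its internal atomic wrappers):

```
#include <stdatomic.h>
#include <stdbool.h>

/* A failed compare-exchange returns the fresh value in `s', so a
   separate re-read just before it adds no safety.  */
static bool
consume_signal (_Atomic unsigned int *g_signals)
{
  unsigned int s = atomic_load_explicit (g_signals, memory_order_acquire);
  do
    {
      if (s == 0)       /* nothing to consume (or group closed) */
        return false;
    }
  while (!atomic_compare_exchange_weak_explicit (g_signals, &s, s - 1,
                                                 memory_order_acquire,
                                                 memory_order_acquire));
  return true;
}
```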
nptl: Remove unnecessary catch-all-wake in condvar group switch
[BZ #25847]
This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.
In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Frank Barrus [Fri, 11 Jul 2025 12:57:41 +0000 (05:57 -0700)]
pthreads NPTL: lost wakeup fix 2
[BZ #25847]
This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch. This greatly reduces the average
and maximum latency for pthread_cond_signal.
The g_signals[] field now contains a signal count that is relative to
the current g1_start value. Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)
By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same. This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change. This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signals, since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.
By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore. The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit. If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.
The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters. Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2. They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.
Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block. This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
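A sketch of the encoding described above (hypothetical helper, not the
glibc source):

```
#include <stdint.h>

/* g_signals holds, in bits 31:1, bits 31:1 of g1_start plus the
   pending signal count; the LSB stays reserved.  A waiter computes
   its expected value from the g1_start it read, so after a G1/G2
   switch the low bits no longer line up and any compare-exchange on
   g_signals fails.  */
static inline uint32_t
signals_value (uint64_t g1_start, uint32_t count)
{
  return ((uint32_t) g1_start & ~1u) + (count << 1);
}
```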
Florian Weimer [Thu, 22 May 2025 12:36:37 +0000 (14:36 +0200)]
Fix error reporting (false negatives) in SGID tests
And simplify the interface of support_capture_subprogram_self_sgid.
Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion. Handle all execution
errors within support_capture_subprogram_self_sgid. In particular,
this includes test failures because the invoked program did not
exit with exit status zero. Existing tests that expect exit
status 42 are adjusted to use zero instead.
In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).
Wilco Dijkstra [Fri, 13 Dec 2024 15:43:07 +0000 (15:43 +0000)]
math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp.
The exp benchmark improves by 2.5%, "144bits" by 7.2%, and "768bits" by
12.7% on Neoverse V2.
Wilco Dijkstra [Tue, 24 Dec 2024 18:01:59 +0000 (18:01 +0000)]
AArch64: Add SVE memset
Add an SVE memset based on the generic memset, with a predicated load for
sizes < 16. Unaligned memsets of 128-1024 bytes are improved by ~20% on
average by using aligned stores for the last 64 bytes. Performance of the
random memset benchmark improves by ~2% on Neoverse V1.
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.
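Illustratively (field names here are stand-ins, not the real exp2f_data
layout), the layout goal looks like:

```
#include <stdalign.h>

/* Keeping two doubles that are loaded together adjacent and on a
   16-byte boundary lets AArch64 fetch both with one ldp instruction.  */
struct table_sketch
{
  alignas (16) double invln2_scaled;   /* loaded as a pair ... */
  double poly_scaled_0;                /* ... with this field */
  double poly[4];
};
```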
Wilco Dijkstra [Mon, 25 Nov 2024 18:43:08 +0000 (18:43 +0000)]
AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.
Wilco Dijkstra [Wed, 7 Aug 2024 13:43:47 +0000 (14:43 +0100)]
AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
elf: Support recursive use of dynamic TLS in interposed malloc
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS). The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases, unfortunately.
This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update. This
is possible because an interposed malloc has a fixed module ID and
TLS slot. (It cannot be unloaded.) If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc. It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.
The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.
All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.
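Conceptually (all names in this sketch are hypothetical), the recursion
guard works like this:

```
/* Hypothetical helpers, declared for the sketch.  */
extern int module_is_initially_loaded (unsigned long module_id);
extern void *lookup_with_outdated_dtv (unsigned long module_id);
extern void *lookup_with_current_dtv (unsigned long module_id);
extern void update_dtv_to_current_generation (void);

/* Global counter, not a per-thread flag, because TLS access inside
   the dynamic linker is exactly what is being avoided here.  */
static unsigned int tls_update_in_progress;

static void *
tls_get_addr_slow_sketch (unsigned long module_id)
{
  if (tls_update_in_progress > 0
      && module_is_initially_loaded (module_id))
    /* Recursive call from malloc during an update: the module's slot
       is fixed, so the outdated DTV entry is still usable.  */
    return lookup_with_outdated_dtv (module_id);

  ++tls_update_in_progress;
  update_dtv_to_current_generation ();   /* may call malloc */
  --tls_update_in_progress;
  return lookup_with_current_dtv (module_id);
}
```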
Florian Weimer [Mon, 3 Jun 2024 08:49:40 +0000 (10:49 +0200)]
elf: Avoid some free (NULL) calls in _dl_update_slotinfo
This has been confirmed to work around some interposed mallocs. Here
is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper
in lttng-tools:
New TLS usage in libgcc_s.so.1, compatibility impact
<https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/>
Reportedly, this patch also papers over a similar issue when tcmalloc
2.9.1 is not compiled with -ftls-model=initial-exec. Of course the
goal really should be to compile mallocs with the initial-exec TLS
model, but this commit appears to be a useful interim workaround.
x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
The loop should be aligned to 32 bytes so that it can ideally run out
of the DSB. This is particularly important on Skylake-Server, where
deficiencies in its DSB implementation make it prone to not being
able to run loops out of the DSB.
This results in a 25% performance degradation for the non-aligned
version.
The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).
NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignment, we might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.
Times are reported as the ratio of Time_With_Patch /
Time_Without_Patch; lower is better.
The values reported are the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.
Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.
The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in the alignment of code handling small sizes (mostly in the
page-cross logic). These should be safe to ignore because 1) we
previously only 16-byte aligned the function, so this behavior is not
new and was essentially up to chance before this patch, and 2) this
type of alignment-related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any effect
on real-world performance.
Noah Goldstein [Fri, 24 May 2024 17:38:50 +0000 (12:38 -0500)]
x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different strategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid the read-for-ownership (RFO) data
transfers of regular stores, saving memory bandwidth.
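A minimal illustration of a non-temporal memset core (not the glibc
implementation, which is hand-written assembly):

```
#include <emmintrin.h>
#include <stddef.h>

/* Streaming stores bypass the cache and avoid RFO traffic on large
   buffers.  Assumes dst is 16-byte aligned and n is a multiple of 16.  */
static void
memset_nt_sketch (void *dst, int c, size_t n)
{
  __m128i v = _mm_set1_epi8 ((char) c);
  char *p = dst;
  for (size_t i = 0; i < n; i += 16)
    _mm_stream_si128 ((__m128i *) (p + i), v);
  _mm_sfence ();  /* order the streaming stores before later accesses */
}
```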
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979
Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bf0ab80573d66e4ae5d94b094659094336da90f)
Gabi Falk [Tue, 7 May 2024 18:25:00 +0000 (18:25 +0000)]
x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4)
This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex. However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.
Reported-by: Kenton Groombridge
Link: https://bugs.gentoo.org/921945
Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family")
Signed-off-by: Gabi Falk <gabifalk@gmx.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit dd5f891c1ad9f1b43b9db93afe2a55cbb7a6194e)
Noah Goldstein [Wed, 1 Nov 2023 20:30:26 +0000 (15:30 -0500)]
x86: Only align destination to 1x VEC_SIZE in memset 4x loop
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no effect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)
Szabolcs Nagy [Tue, 16 Feb 2021 12:55:13 +0000 (12:55 +0000)]
elf: Fix slow tls access after dlopen [BZ #19924]
In short: __tls_get_addr checks the global generation counter, and if
the current dtv is older, then _dl_update_slotinfo updates the dtv up
to the generation of the accessed module. So if the global generation
is newer than the generation of the module, then __tls_get_addr keeps
hitting the slow dtv update path. The dtv update path includes a
number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.
It may be possible to detect up-to-date dtv faster. But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.
This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.
This patch uses acquire/release synchronization when accessing the
generation counter.
Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.
I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.
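The acquire/release pairing described above, as a C11-style sketch
(glibc uses its internal atomic wrappers; all names here are
illustrative):

```
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uintptr_t generation;

/* dlopen path: publish the module's slotinfo first, then bump the
   counter with a release increment.  */
static void
publish_new_module (void)
{
  /* ... write slotinfo entries ... */
  atomic_fetch_add_explicit (&generation, 1, memory_order_release);
}

/* __tls_get_addr path: the acquire load pairs with the release
   increment, making the slotinfo written before the bump visible.  */
static uintptr_t
read_generation (void)
{
  return atomic_load_explicit (&generation, memory_order_acquire);
}
```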
H.J. Lu [Mon, 28 Aug 2023 19:08:14 +0000 (12:08 -0700)]
x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1, which indicated the number of rounds of
CPUID leaf 2 needed to retrieve the complete cache information. The
newer Intel manual has been changed to state that it should always return 1
and be ignored. If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead. If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all. This addresses BZ #30643.
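A sketch of that check (using GCC's <cpuid.h> for brevity; the glibc
code reads the leaf through its own helpers):

```
#include <cpuid.h>

/* Only trust CPUID leaf 2 if the low byte of EAX is 1; otherwise the
   caller falls back to CPUID leaf 4.  */
static int
leaf2_usable (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid (2, &eax, &ebx, &ecx, &edx))
    return 0;
  return (eax & 0xff) == 1;
}
```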
H.J. Lu [Thu, 17 Aug 2023 16:42:29 +0000 (09:42 -0700)]
x86_64: Add log1p with FMA
On Skylake, it changes log1p bench performance by:
        Before    After     Improvement
max     63.349    58.347    8%
min      4.448     5.651    -30%
mean    12.0674   10.336    14%
The minimum code path is

```
  if (hx < 0x3FDA827A)  /* x < 0.41422 */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))  /* x <= -1.0 */
        {
          ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))  /* |x| < 2**-29 */
        {
          math_force_eval (two54 + x);  /* raise inexact */
          if (ax < 0x3c900000)  /* |x| < 2**-54 */
            {
              ...
            }
          else
            return x - x * x * 0.5;
        }
      ...
    }
```
FMA and non-FMA code sequences look similar. Non-FMA version is slightly
faster. Since log1p is called by asinh and atanh, it improves asinh
performance by:
        Before    After     Improvement
max     75.645    63.135    16%
min     10.074    10.071    0%
mean    15.9483   14.9089   6%
and improves atanh performance by:
        Before    After     Improvement
max     91.768    75.081    18%
min     15.548    13.883    10%
mean    18.3713   16.8011   8%
H.J. Lu [Fri, 11 Aug 2023 15:04:08 +0000 (08:04 -0700)]
x86_64: Add expm1 with FMA
On Skylake, it improves expm1 bench performance by:
        Before    After     Improvement
max     70.204    68.054    3%
min     20.709    16.2      22%
mean    22.1221   16.7367   24%
NB: Add
extern long double __expm1l (long double);
extern long double __expm1f128 (long double);
for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined, since __expm1 may be expanded in their declarations, which
causes a build failure.
Florian Weimer [Tue, 17 Dec 2024 17:12:03 +0000 (18:12 +0100)]
x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size. This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.
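An illustration of the failure mode (not the glibc code): a 1 TiB size
does not fit in 32 bits, so a narrowing assignment yields 0.

```
#include <stdio.h>

int
main (void)
{
  unsigned long long l3_size = 1ULL << 40;   /* 1 TiB */
  unsigned int truncated = (unsigned int) l3_size;
  printf ("%u\n", truncated);                /* prints 0 */
  return 0;
}
```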
Michael Jeanson [Wed, 20 Nov 2024 19:15:42 +0000 (14:15 -0500)]
nptl: initialize cpu_id_start prior to rseq registration
When adding explicit initialization of rseq fields prior to
registration, I glossed over the fact that 'cpu_id_start' is also
documented as initialized by user-space.
While current kernels don't validate the content of this field on
registration, future ones could.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
(cherry picked from commit d9f40387d3305d97e30a8cf8724218c42a63680a)
Michael Jeanson [Thu, 7 Nov 2024 21:23:49 +0000 (22:23 +0100)]
nptl: initialize rseq area prior to registration
Per the rseq syscall documentation, 3 fields are required to be
initialized by userspace prior to registration, they are 'cpu_id',
'rseq_cs' and 'flags'. Since we have no guarantee that 'struct pthread'
is cleared on all architectures, explicitly set those 3 fields prior to
registration.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 97f60abd25628425971f07e9b0e7f8eec0741235)
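A sketch of the required initialization (the struct here mirrors the
rseq uapi layout for illustration; the glibc code writes into the rseq
area embedded in struct pthread):

```
#include <stdint.h>

struct rseq_area_sketch
{
  uint32_t cpu_id_start;
  uint32_t cpu_id;
  uint64_t rseq_cs;
  uint32_t flags;
};

#define RSEQ_CPU_ID_UNINITIALIZED ((uint32_t) -1)

static void
rseq_prepare (struct rseq_area_sketch *r)
{
  /* Required by the rseq ABI before registration.  */
  r->cpu_id_start = 0;
  r->cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
  r->rseq_cs = 0;
  r->flags = 0;
}
```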
Florian Weimer [Mon, 28 Oct 2024 13:45:30 +0000 (14:45 +0100)]
elf: Change ldconfig auxcache magic number (bug 32231)
In commit c628c2296392ed3bf2cb8d8470668e64fe53389f (elf: Remove
ldconfig kernel version check), the layout of auxcache entries
changed because the osversion field was removed from
struct aux_cache_file_entry. However, AUX_CACHEMAGIC was not
changed, so existing files are still used, potentially leading
to unintended ldconfig behavior. This commit changes AUX_CACHEMAGIC,
so that the file is regenerated.
Reported-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 0a536f6e2f76e3ef581b3fd9af1e5cf4ddc7a5a2)
H.J. Lu [Tue, 30 Apr 2024 16:57:12 +0000 (09:57 -0700)]
Add crt1-2.0.o for glibc 2.0 compatibility tests
Starting from glibc 2.1, crt1.o contains _IO_stdin_used which is checked
by _IO_check_libio to provide binary compatibility for glibc 2.0. Add
crt1-2.0.o for tests against glibc 2.0. Define tests-2.0 for glibc 2.0
compatibility tests. Add and update glibc 2.0 compatibility tests for
stderr, matherr and pthread_kill.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 5f245f3bfbe61b2182964dafb94907e38284b806)
nptl: Use <support/check.h> facilities in tst-setuid3
Remove the local FAIL macro in favor of FAIL_EXIT1 from <support/check.h>,
which provides equivalent reporting, with the name of the file and the
line number of the failure site additionally included. Remove
FAIL_ERR altogether and include ": %m" explicitly with the format string
supplied to FAIL_EXIT1, as there seems little value in having a separate
macro just for this.
posix: Use <support/check.h> facilities in tst-truncate and tst-truncate64
Remove the local FAIL macro in favor of FAIL_RET from <support/check.h>,
which provides equivalent reporting, with the name of the file of the
failure site additionally included, for the tst-truncate-common core
shared between the tst-truncate and tst-truncate64 tests.
ungetc: Fix backup buffer leak on program exit [BZ #27821]
If a file descriptor is left unclosed and is cleaned up by _IO_cleanup
on exit, its backup buffer remains unfreed, registering as a leak in
valgrind. This is not strictly an issue since (1) the program should
ideally be closing the stream once it's not in use and (2) the program
is about to exit anyway, so keeping the backup buffer around a wee bit
longer isn't a real problem. Free it anyway to keep valgrind happy
when the streams in question are the standard ones, i.e. stdout, stdin
or stderr.
Also, the _IO_have_backup macro checks for _IO_save_base,
which is a roundabout way to check for a backup buffer instead of
directly looking for _IO_backup_base. The roundabout check breaks when
the main get area has not been used and the user pushes a char into the
backup buffer with ungetc. Fix this to use _IO_backup_base
directly.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3e1d8d1d1dca24ae90df2ea826a8916896fc7e77)
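Conceptually, the macro change looks like this (simplified; see the
libio internals for the real definitions):

```
/* Before: infers a backup area from the save-area pointer, which is
   unset when the main get area was never used.  */
#define _IO_have_backup_old(fp) ((fp)->_IO_save_base != NULL)

/* After: test the backup area pointer itself.  */
#define _IO_have_backup_new(fp) ((fp)->_IO_backup_base != NULL)
```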
ungetc: Fix uninitialized read when putting into unused streams [BZ #27821]
When ungetc is called on an unused stream, the backup buffer is
allocated without the main get area being present. This results in
every subsequent ungetc (as the stream remains in the backup area)
checking uninitialized memory in the backup buffer when trying to put a
character back into the stream.
Avoid comparing the input character with buffer contents when in backup
to avoid this uninitialized read. The uninitialized read is harmless in
this context since the location is promptly overwritten with the input
character, thus fulfilling ungetc functionality.
Also adjust wording in the manual to drop the paragraph that says glibc
cannot do multiple ungetc back to back since with this change, ungetc
can actually do this.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit cdf0f88f97b0aaceb894cc02b21159d148d7065c)
stdio-common: Add test for vfscanf with matches longer than INT_MAX [BZ #27650]
Complement commit b03e4d7bd25b ("stdio: fix vfscanf with matches longer
than INT_MAX (bug 27650)") and add a test case for the issue, inspired
by the reproducer provided with the bug report.
This has been verified to succeed as from the commit referred and fail
beforehand.
As the test requires 2GiB of data to be passed around its performance
has been evaluated using a choice of systems and the execution time
determined to be respectively in the range of 9s for POWER9@2.166GHz,
24s for FU740@1.2GHz, and 40s for 74Kf@950MHz. As this is on the verge
of and beyond the default timeout, it has been increased by a factor of
8. Regardless, following recent practice the test has been added to the
standard rather than the extended set.
Add a FAIL test failure helper analogous to FAIL_RET, that does not
cause the current function to return, providing a standardized way to
report a test failure with a message supplied while permitting the
caller to continue executing, for further reporting, cleaning up, etc.
(a usage sketch follows the list below).
Update existing test cases that provide a conflicting definition of FAIL
by removing the local FAIL definition and then as follows:
- tst-fortify-syslog: provide a meaningful message in addition to the
file name already added by <support/check.h>; 'support_record_failure'
is already called by 'support_print_failure_impl' invoked by the new
FAIL test failure helper.
- tst-ctype: no update to FAIL calls required, with the name of the file
and the line number within of the failure site additionally included
by the new FAIL test failure helper, and error counting plus count
reporting upon test program termination also already provided by
'support_record_failure' and 'support_report_failure' respectively,
called by 'support_print_failure_impl' and 'adjust_exit_status' also
respectively. However in a number of places 'printf' is called and
the error count adjusted by hand, so update these places to make use
of FAIL instead. And last but not least adjust the final summary just
to report completion, with any error count following as reported by
the test driver.
- test-tgmath2: no update to FAIL calls required, with the name of the
file of the failure site additionally included by the new FAIL test
failure helper. Also there is no need to track the return status by
hand as any call to FAIL will eventually cause the test case to return
an unsuccessful exit status regardless of the return status from the
test function, via a call to 'adjust_exit_status' made by the test
driver.
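A sketch of the intended use of the new helper:

```
#include <support/check.h>

/* FAIL reports and records the failure but, unlike FAIL_EXIT1 or
   FAIL_RET, lets the caller continue for further reporting or
   cleanup.  */
static void
check_result (int got, int expected)
{
  if (got != expected)
    FAIL ("unexpected result: got %d, expected %d", got, expected);
  /* Execution continues here even after a failure.  */
}
```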
Noah Goldstein [Tue, 13 Aug 2024 15:29:14 +0000 (23:29 +0800)]
x86: Fix bug in strchrnul-evex512 [BZ #32078]
The issue was that we were expecting no matches with CHAR before the
start of the string in the page-cross case.
The check code in the page cross case:
```
and $0xffffffffffffffc0,%rax
vmovdqa64 (%rax),%zmm17
vpcmpneqb %zmm17,%zmm16,%k1
vptestmb %zmm17,%zmm17,%k0{%k1}
kmovq %k0,%rax
inc %rax
shr %cl,%rax
je L(continue)
```
expects that all characters that neither match null nor CHAR will be
1s in `rax` prior to the `inc`. Then the `inc` will overflow all of
the 1s where no relevant match was found.
This is incorrect in the page-cross case, as the
`vmovdqa64 (%rax),%zmm17` loads from before the start of the input
string.
If there are matches with CHAR before the start of the string, `rax`
won't properly overflow.
The arithmetic shift will clear any matches prior to the start of the
string while maintaining the signbit so the 1s can properly overflow
to zero in the case of no matches.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7da08862471dfec6fdae731c2a5f351ad485c71f)
Andreas Schwab [Mon, 5 Aug 2024 08:55:51 +0000 (10:55 +0200)]
Fix name space violation in fortify wrappers (bug 32052)
Rename the identifier sz to __sz everywhere.
Fixes: a643f60c53 ("Make sure that the fortified function conditionals are constant")
(cherry picked from commit 39ca997ab378990d5ac1aadbaa52aaf1db6d526f)
(redone from scratch because of many conflicts)
Joseph Myers [Tue, 12 Sep 2023 14:08:53 +0000 (14:08 +0000)]
Update syscall lists for Linux 6.5
Linux 6.5 has one new syscall, cachestat, and also enables the
cacheflush syscall for hppa. Update syscall-names.list and regenerate
the arch-syscall.h headers with build-many-glibcs.py update-syscalls.
H.J. Lu [Wed, 24 Jul 2024 21:05:13 +0000 (14:05 -0700)]
linux: Update the mremap C implementation [BZ #31968]
Update the mremap C implementation to support the optional argument for
MREMAP_DONTUNMAP added in Linux 5.7 since it may not always be correct
to implement a variadic function as a non-variadic function on all Linux
targets. Return MAP_FAILED and set errno to EINVAL for unknown flag bits.
This fixes BZ #31968.
Note: A test must be added when a new flag bit is introduced.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 6c40cb0e9f893d49dc7caee580a055de53562206)
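A simplified sketch of the argument handling described above (assuming
<sys/mman.h> provides MREMAP_DONTUNMAP, i.e. Linux 5.7+ headers; not
the actual glibc source):

```
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/mman.h>

/* The optional fifth argument is fetched only when a flag that
   requires it is present; unknown flag bits are rejected.  */
static void *
mremap_sketch (void *old_addr, size_t old_len, size_t new_len,
               int flags, ...)
{
  if (flags & ~(MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP))
    {
      errno = EINVAL;
      return MAP_FAILED;
    }
  void *new_addr = NULL;
  if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP))
    {
      va_list ap;
      va_start (ap, flags);
      new_addr = va_arg (ap, void *);
      va_end (ap);
    }
  return mremap (old_addr, old_len, new_len, flags, new_addr);
}
```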
resolv: Do not wait for non-existing second DNS response after error (bug 30081)
In single-request mode, there is no second response after an error
because the second query has not been sent yet. Waiting for it
introduces an unnecessary timeout.
Michael Jeanson [Wed, 3 Jul 2024 16:35:34 +0000 (12:35 -0400)]
nptl: fix potential merge of __rseq_* relro symbols
While working on a patch to add support for the extensible rseq ABI, we
came across an issue where a new 'const' variable would be merged with
the existing '__rseq_size' variable. We tracked this to the use of
'-fmerge-all-constants' which allows the compiler to merge identical
constant variables. This means that all 'const' variables in a compile
unit that are of the same size and are initialized to the same value can
be merged.
In this specific case, on 32 bit systems 'unsigned int' and 'ptrdiff_t'
are both 4 bytes and initialized to 0 which should trigger the merge.
However, for reasons we haven't delved into, when the attribute 'section
(".data.rel.ro")' is added to the mix, only variables of the exact
same type are merged. As far as we know this behavior is not specified
anywhere and could change with a new compiler version, hence this patch.
Move the definitions of these variables into an assembler file and add
hidden writable aliases for internal use. This has the added bonus of
removing the asm workaround to set the values on rseq registration.
Tested on Debian 12 with GCC 12.2.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 2b92982e2369d292560793bee8e730f695f48ff3)
Stefan Liebler [Thu, 11 Jul 2024 09:28:53 +0000 (11:28 +0200)]
s390x: Fix segfault in wcsncmp [BZ #31934]
The z13/vector-optimized wcsncmp implementation segfaults if n=1
and there is only one character (equal in both strings) before
the page end. Then it loads and compares one character and fails
to check n again. The following load fails.
This patch removes the extra load and compare of the first character
and just starts with the loop, which uses vector-load-to-block-boundary.
This code path also checks n.
With this patch both tests are passing:
- the simplified one mentioned in the bugzilla 31934
- the full one in Florian Weimer's patch:
"manual: Document a GNU extension for strncmp/wcsncmp"
(https://patchwork.sourceware.org/project/glibc/patch/874j9eml6y.fsf@oldenburg.str.redhat.com/):
On s390x-linux-gnu (z16), the new wcsncmp test fails due to bug 31934.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 9b7651410375ec8848a1944992d663d514db4ba7)
misc: Add support for Linux uio.h RWF_NOAPPEND flag
In Linux 6.9 a new flag was added to allow per-IO operations to
disable append mode even if a file was opened with the flag O_APPEND.
This is done with the new RWF_NOAPPEND flag.
This caused two test failures as these tests expected the flag 0x00000020
to be unused. Adding the flag definition now fixes these tests on Linux
6.9 (v6.9-rc1).
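A sketch of using the flag (the fallback define uses the value quoted
above; pwritev2 fails with EOPNOTSUPP on kernels that predate it):

```
#define _GNU_SOURCE
#include <sys/uio.h>

#ifndef RWF_NOAPPEND
# define RWF_NOAPPEND 0x00000020
#endif

/* Per-IO override of O_APPEND: write at the given offset even if the
   descriptor was opened with O_APPEND.  */
static ssize_t
write_ignoring_append (int fd, const struct iovec *iov, int iovcnt,
                       off_t offset)
{
  return pwritev2 (fd, iov, iovcnt, offset, RWF_NOAPPEND);
}
```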
H.J. Lu [Fri, 10 May 2024 03:07:01 +0000 (20:07 -0700)]
Force DT_RPATH for --enable-hardcoded-path-in-tests
On Fedora 40/x86-64, the linker enables --enable-new-dtags by default,
which generates DT_RUNPATH instead of DT_RPATH. Unlike DT_RPATH, DT_RUNPATH
only applies to DT_NEEDED entries in the executable and doesn't apply
to DT_NEEDED entries in shared libraries which are loaded via DT_NEEDED
entries in the executable. Some glibc tests have libstdc++.so.6 in
DT_NEEDED, which has libm.so.6 in DT_NEEDED. When DT_RUNPATH is generated,
/lib64/libm.so.6 is loaded for such tests. If the newly built glibc is
older than glibc 2.36, these tests fail with
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_2.36' not found (required by /lib64/libm.so.6)
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /lib64/libm.so.6)
Pass -Wl,--disable-new-dtags to linker when building glibc tests with
--enable-hardcoded-path-in-tests. This fixes BZ #31719.
H.J. Lu [Thu, 25 Apr 2024 15:06:52 +0000 (08:06 -0700)]
elf: Also compile dl-misc.os with $(rtld-early-cflags)
Also compile dl-misc.os with $(rtld-early-cflags) to avoid
Program received signal SIGILL, Illegal instruction.
0x00007ffff7fd36ea in _dl_strtoul (nptr=nptr@entry=0x7fffffffe2c9 "2",
endptr=endptr@entry=0x7fffffffd728) at dl-misc.c:156
156 bool positive = true;
(gdb) bt
#0 0x00007ffff7fd36ea in _dl_strtoul (nptr=nptr@entry=0x7fffffffe2c9 "2",
endptr=endptr@entry=0x7fffffffd728) at dl-misc.c:156
#1 0x00007ffff7fdb1a9 in tunable_initialize (
cur=cur@entry=0x7ffff7ffbc00 <tunable_list+2176>,
strval=strval@entry=0x7fffffffe2c9 "2", len=len@entry=1)
at dl-tunables.c:131
#2 0x00007ffff7fdb3a2 in parse_tunables (valstring=<optimized out>)
at dl-tunables.c:258
#3 0x00007ffff7fdb5d9 in __GI___tunables_init (envp=0x7fffffffdd58)
at dl-tunables.c:288
#4 0x00007ffff7fe44c3 in _dl_sysdep_start (
start_argptr=start_argptr@entry=0x7fffffffdcb0,
dl_main=dl_main@entry=0x7ffff7fe5f80 <dl_main>)
at ../sysdeps/unix/sysv/linux/dl-sysdep.c:110
#5 0x00007ffff7fe5cae in _dl_start_final (arg=0x7fffffffdcb0) at rtld.c:494
#6 _dl_start (arg=0x7fffffffdcb0) at rtld.c:581
#7 0x00007ffff7fe4b38 in _start ()
(gdb)
when setting GLIBC_TUNABLES in glibc compiled with APX.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 049b7684c912dd32b67b1b15b0f43bf07d5f512e)
CVE-2024-33601, CVE-2024-33602: nscd: netgroup: Use two buffers in addgetnetgrentX (bug 31680)
This avoids potential memory corruption when the underlying NSS
callback function does not use the buffer space to store all strings
(e.g., for constant strings).
Instead of custom buffer management, two scratch buffers are used.
This increases stack usage somewhat.
Scratch buffer allocation failure is handled by returning -1
(an invalid timeout value) instead of terminating the process.
This fixes bug 31679.
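A sketch of the two-buffer scheme (using the glibc-internal
<scratch_buffer.h>; simplified, not the addgetnetgrentX source):

```
#include <scratch_buffer.h>

/* Keeping NSS callback output and the response under construction in
   separate buffers means constant strings returned by the callback
   cannot corrupt the response.  */
static int
build_response_sketch (void)
{
  struct scratch_buffer nss_data;   /* strings from the NSS callback */
  struct scratch_buffer response;   /* response being assembled */
  scratch_buffer_init (&nss_data);
  scratch_buffer_init (&response);

  /* ... run the NSS callback against nss_data and copy the results
     into response; on allocation failure, return -1 (an invalid
     timeout) instead of terminating the process ... */

  scratch_buffer_free (&nss_data);
  scratch_buffer_free (&response);
  return 0;
}
```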
The addgetnetgrentX call in addinnetgrX may have failed to produce
a result, so the result variable in addinnetgrX can be NULL.
Use db->negtimeout as the fallback value if there is no result data;
the timeout is also overwritten below.
Also avoid sending a second not-found response. (The client
disconnects after receiving the first response, so the data stream did
not go out of sync even without this fix.) It is still beneficial to
add the negative response to the mapping, so that the client can get
it from there in the future, instead of going through the socket.
These structs describe file formats under /var/log, and should not
depend on the definition of _TIME_BITS. This is achieved by
defining __WORDSIZE_TIME64_COMPAT32 to 1 on 32-bit ports that
support 32-bit time_t values (where __time_t is 32 bits).
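A reduced sketch of the resulting pattern in the utmp/lastlog headers
(field names abbreviated; __WORDSIZE_TIME64_COMPAT32 comes in via the
glibc wordsize headers):

```
#include <stdint.h>
#include <time.h>

struct lastlog_sketch
{
#if __WORDSIZE_TIME64_COMPAT32
  int32_t ll_time;     /* fixed 32-bit on-disk representation */
#else
  time_t ll_time;
#endif
  char ll_line[32];
  char ll_host[256];
};
```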
Charles Fol [Thu, 28 Mar 2024 15:25:38 +0000 (12:25 -0300)]
iconv: ISO-2022-CN-EXT: fix out-of-bound writes when writing escape sequence (CVE-2024-2961)
ISO-2022-CN-EXT uses escape sequences to indicate character set changes
(as specified by RFC 1922). While the SOdesignation has the expected
bounds checks, neither SS2designation nor SS3designation have them,
allowing a write overflow of 1, 2, or 3 bytes with fixed values:
'$+I', '$+J', '$+K', '$+L', '$+M', or '$*H'.
Checked on aarch64-linux-gnu.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f9dc609e06b1136bb0408be9605ce7973a767ada)
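A sketch of the missing bounds check, following the pattern already
present for SOdesignation (simplified from the gconv module internals):

```
#include <stdbool.h>
#include <string.h>

/* Make sure the 4-byte designation escape fits before writing it; the
   caller reports a full output buffer when this returns false.  */
static bool
emit_designation (unsigned char **outptr, unsigned char *outend,
                  const char esc[4])
{
  if (*outptr + 4 > outend)
    return false;
  memcpy (*outptr, esc, 4);
  *outptr += 4;
  return true;
}
```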
powerpc: Fix ld.so address determination for PCREL mode (bug 31640)
This seems to have stopped working with some GCC 14 versions,
which clobber r2. With other compilers, the kernel-provided
r2 value is still available at this point.
Wilco Dijkstra [Thu, 21 Mar 2024 16:48:33 +0000 (16:48 +0000)]
AArch64: Check kernel version for SVE ifuncs
Old Linux kernels disable SVE after every system call. Calling the
SVE-optimized memcpy afterwards will then cause a trap to reenable SVE.
As a result, applications with a high use of syscalls may run slower with
the SVE memcpy. This is true for kernels between 4.15.0 and before 6.2.0,
except for 5.14.0 which was patched. Avoid this by checking the kernel
version and selecting the SVE ifunc on modern kernels.
Parse the kernel version reported by uname() into a 24-bit kernel.major.minor
value without calling any library functions. If uname() is not supported or
if the version format is not recognized, assume the kernel is modern.
Tested-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 2e94e2f5d2bf2de124c8ad7da85463355e54ccb2)
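A simplified sketch of the version parse (the real code runs in an
ifunc resolver and likewise avoids library calls):

```
#include <stdint.h>
#include <sys/utsname.h>

/* Pack kernel.major.minor into 24 bits; unrecognized formats count as
   modern.  Per the commit message, SVE ifuncs are then avoided on
   kernels in [4.15.0, 6.2.0) except 5.14.0.  */
static uint32_t
kernel_version_sketch (void)
{
  struct utsname buf;
  if (uname (&buf) != 0)
    return UINT32_MAX;                  /* assume modern */
  uint32_t version = 0;
  const char *p = buf.release;
  for (int field = 0; field < 3; field++)
    {
      if (*p < '0' || *p > '9')
        return UINT32_MAX;              /* unrecognized: assume modern */
      uint32_t num = 0;
      while (*p >= '0' && *p <= '9')
        num = num * 10 + (*p++ - '0');
      version = (version << 8) | (num > 255 ? 255 : num);
      if (*p == '.')
        p++;
    }
  return version;
}
```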
Szabolcs Nagy [Wed, 13 Mar 2024 14:34:14 +0000 (14:34 +0000)]
aarch64: fix check for SVE support in assembler
Due to GCC bug 110901, -mcpu can override the -march setting when compiling
asm code, and thus a compiler targeting a specific cpu can fail the
configure check even when binutils gas supports SVE.
The workaround is that an explicit .arch directive overrides both -mcpu
and -march, and since that's what the actual SVE memcpy uses, the
configure check should use that too, even if the GCC issue is fixed
independently.
Andreas Schwab [Thu, 23 Nov 2023 17:23:46 +0000 (18:23 +0100)]
aarch64: correct CFI in rawmemchr (bug 31113)
The .cfi_return_column directive changes the return column for the whole
FDE range. But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.
Wilco Dijkstra [Thu, 26 Oct 2023 16:30:36 +0000 (17:30 +0100)]
AArch64: Remove Falkor memcpy
The latest implementations of memcpy are actually faster than the Falkor
implementations [1], so remove the falkor/phecda ifuncs for memcpy and
the now unused IS_FALKOR/IS_PHECDA defines.
Wilco Dijkstra [Thu, 26 Oct 2023 16:07:21 +0000 (17:07 +0100)]
AArch64: Add memset_zva64
Add a specialized memset for the common ZVA size of 64 to avoid the
overhead of reading the ZVA size. Since the code is identical to
__memset_falkor, remove the latter.
Wilco Dijkstra [Tue, 24 Oct 2023 12:51:07 +0000 (13:51 +0100)]
AArch64: Cleanup ifuncs
Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than
ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename
strlen_mte to strlen_generic. Remove rtld-memset.
Wilco Dijkstra [Tue, 17 Oct 2023 15:54:21 +0000 (16:54 +0100)]
AArch64: Add support for MOPS memcpy/memmove/memset
Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for
memcpy, memmove and memset (use .inst for now so it works with all binutils
versions without needing complex configure and conditional compilation).
Florian Weimer [Fri, 15 Mar 2024 18:08:24 +0000 (19:08 +0100)]
linux: Use rseq area unconditionally in sched_getcpu (bug 31479)
Originally, nptl/descr.h included <sys/rseq.h>, but we removed that
in commit 2c6b4b272e6b4d07303af25709051c3e96288f2d ("nptl:
Unconditionally use a 32-byte rseq area"). After that, it was no
longer ensured that the RSEQ_SIG macro was defined during
sched_getcpu.c compilation. This commit always checks
the rseq area for CPU number information before using the other
approaches.
This adds an unnecessary (but well-predictable) branch on
architectures which do not define RSEQ_SIG, but its cost is small
compared to the system call. Most architectures that have vDSO
acceleration for getcpu also have rseq support.
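A sketch of the resulting order (rseq_cpu_id_sketch is a hypothetical
accessor for the thread's rseq area, returning -1 when unset or
unregistered):

```
#define _GNU_SOURCE
#include <sched.h>

extern int rseq_cpu_id_sketch (void);

/* Try the rseq area first, then fall back to getcpu.  */
static int
sched_getcpu_sketch (void)
{
  int cpu = rseq_cpu_id_sketch ();
  if (cpu >= 0)
    return cpu;
  unsigned int c;
  return getcpu (&c, NULL) == 0 ? (int) c : -1;
}
```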