Add GLIBC_ABI_GNU2_TLS version to indicate that glibc has the working
GNU2 TLS run-time. The linker can add the GLIBC_ABI_GNU2_TLS version to
binaries which depend on the working GNU2 TLS run-time, so that such
programs and shared libraries will fail to load and run at run-time
against a libc.so without the GLIBC_ABI_GNU2_TLS version, instead of
failing silently at random.
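For illustration (an assumed workflow, not part of the commit): a shared
object built with GNU2 TLS descriptors is the kind of binary that would
carry the new version dependency.

  /* tlsmod.c -- built with GNU2 TLS descriptors, e.g.:
       gcc -fPIC -shared -mtls-dialect=gnu2 tlsmod.c -o tlsmod.so
     The linker could then record a GLIBC_ABI_GNU2_TLS version
     dependency for this object.  */
  static __thread int counter;

  int
  bump (void)
  {
    return ++counter;
  }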
Joe Ramsay [Fri, 3 Jan 2025 19:13:36 +0000 (19:13 +0000)]
math: Remove no-mathvec flag
More routines are to follow, some of which hit many failures in the
current testsuite due to wrong sign of zero (mathvec routines are not
required to get this right). Instead of disabling a large number of
tests, change the failure condition such that, for vector routines,
tests pass as long as computed == expected == 0.0, regardless of sign.
Affected tests (vector tests for expm1, log1p, sin, tan and tanh) all
still pass.
Use TLS initial-exec model for __libc_tsd_CTYPE_* thread variables [BZ #33234]
Commit 10a66a8e421b ("Remove <libc-tsd.h>") removed the TLS initial-exec
(IE) model attribute from the __libc_tsd_CTYPE_* thread variable declarations
and definitions. Commit a894f04d8776 ("Optimize __libc_tsd_* thread
variable access") restored it on declarations.
Restore the TLS initial-exec model attribute on __libc_tsd_CTYPE_* thread
variable definitions.
This resolves test tst-locale1 failure on s390 32-bit, when using a
GNU linker without the fix from GNU binutils commit aefebe82dc89
("IBM zSystems: Fix offset relative to static TLS").
Florian Weimer [Fri, 16 May 2025 17:53:09 +0000 (19:53 +0200)]
Remove <libc-tsd.h>
Use __thread variables directly instead. The macros do not save any
typing. It seems unlikely that a future port will lack __thread
variable support.
Some of the __libc_tsd_* variables are referenced from assembler
files, so keep their names. Previously, <libc-tsd.h> included
<tls.h>, which in turn included <errno.h>, so a few direct includes
of <errno.h> are now required.
Florian Weimer [Fri, 1 Aug 2025 10:19:49 +0000 (12:19 +0200)]
elf: Handle ld.so with LOAD segment gaps in _dl_find_object (bug 31943)
Detect if ld.so is not contiguous and handle that case in _dl_find_object.
Set l_find_object_processed even for initially loaded link maps,
otherwise dlopen of an initially loaded object adds it to
_dlfo_loaded_mappings (where maps are expected to be contiguous),
in addition to _dlfo_nodelete_mappings.
Test elf/tst-link-map-contiguous-ldso iterates over the loader
image, reading every word to make sure memory is actually mapped.
It only does that if the l_contiguous flag is set for the link map.
Otherwise, it finds gaps with mmap and checks that _dl_find_object
does not return the ld.so mapping for them.
The test elf/tst-link-map-contiguous-main does the same thing for
the main program's link map. This only works if the kernel loaded
the main program because the glibc dynamic loader may fill
the gaps with PROT_NONE mappings in some cases, making it contiguous,
but accesses to individual words may still fault.
Test elf/tst-link-map-contiguous-libc is again slightly different
because the dynamic loader always fills the gaps with PROT_NONE
mappings, so a different form of probing has to be used.
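For reference, a minimal standalone use of the public API these tests
exercise (this is not one of the tests themselves):

  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Ask the run-time linker which mapping contains main.  */
    struct dl_find_object dfo;
    if (_dl_find_object ((void *) &main, &dfo) == 0)
      printf ("main lies in [%p, %p)\n",
              dfo.dlfo_map_start, dfo.dlfo_map_end);
    return 0;
  }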
Pierre Blanchard [Tue, 18 Mar 2025 17:07:31 +0000 (17:07 +0000)]
AArch64: Optimize algorithm in users of SVE expf helper
The polynomial order was unnecessarily high; reducing it unlocked
multiple optimizations.
Max error for new SVE expf is 0.88 +0.5ULP.
Max error for new SVE coshf is 2.56 +0.5ULP.
Performance improvement on Neoverse V1: expf (30%), coshf (26%).
Wilco Dijkstra [Fri, 27 Jun 2025 14:10:55 +0000 (14:10 +0000)]
AArch64: Avoid memset ifunc in cpu-features.c [BZ #33112]
During early startup memcpy or memset must not be called since many targets
use ifuncs for them which won't be initialized yet. Security hardening may
use -ftrivial-auto-var-init=zero which inserts calls to memset. Redirect
memset to memset_generic by including dl-symbol-redir-ifunc.h in cpu-features.c.
This fixes BZ #33112.
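The redirection relies on a toplevel asm alias; a sketch of the
technique (the exact contents of dl-symbol-redir-ifunc.h may differ):

  /* Bind every memset call in this translation unit directly to the
     generic implementation, bypassing the ifunc resolver, which has
     not run yet during early startup.  */
  asm ("memset = __memset_generic");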
posix: Fix double-free after allocation failure in regcomp (bug 33185)
If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.
Reported-by: Anastasia Belova <abelova@astralinux.ru>
Co-authored-by: Paul Eggert <eggert@cs.ucla.edu>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e994093fa0bcca0d0ee2c1db271d8d7885d)
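The usual shape of the fix for this class of bug (illustrative
fragment, not the actual regcomp change):

  /* Clear the pointer after the first free so a later cleanup path
     cannot free it again; free (NULL) is a no-op.  */
  free (sbcset);
  sbcset = NULL;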
Florian Weimer [Thu, 22 May 2025 12:36:37 +0000 (14:36 +0200)]
Fix error reporting (false negatives) in SGID tests
And simplify the interface of support_capture_subprogram_self_sgid.
Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion. Handle all execution
errors within support_capture_subprogram_self_sgid. In particular,
this includes test failures because the invoked program did not
exit with exit status zero. Existing tests that expect exit
status 42 are adjusted to use zero instead.
In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).
Florian Weimer [Thu, 13 Feb 2025 20:56:52 +0000 (21:56 +0100)]
elf: Keep using minimal malloc after early DTV resize (bug 32412)
If an auditor loads many TLS-using modules during startup, it is
possible to trigger DTV resizing. Previously, the DTV was marked
as allocated by the main malloc afterwards, even if the minimal
malloc was still in use. With this change, _dl_resize_dtv marks
the resized DTV as allocated with the minimal malloc.
The new test reuses TLS-using modules from other auditing tests.
Arjun Shankar [Fri, 18 Oct 2024 14:03:25 +0000 (16:03 +0200)]
libio: Fix a deadlock after fork in popen
popen modifies its file handler book-keeping under a lock that wasn't
being taken during fork. This meant that a concurrent popen and fork
could end up copying the lock in a "locked" state into the fork child,
where subsequently calling popen would lead to a deadlock due to the
already (spuriously) held lock.
This commit fixes the deadlock by appropriately taking the lock before
fork, and releasing/resetting it in the parent/child after the fork.
A new test for concurrent popen and fork is also added. It consistently
hangs (and therefore fails via timeout) without the fix applied.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 9f0d2c0ee6c728643fcf9a4879e9f20f5e45ce5f)
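The lock-across-fork protocol described above, as a self-contained
sketch using pthread_atfork (the real fix lives inside fork itself and
uses libio's internal lock; names here are illustrative):

  #include <pthread.h>

  static pthread_mutex_t chain_lock = PTHREAD_MUTEX_INITIALIZER;

  static void
  prepare (void)
  {
    /* Taken before fork so the child cannot inherit it mid-update.  */
    pthread_mutex_lock (&chain_lock);
  }

  static void
  parent (void)
  {
    pthread_mutex_unlock (&chain_lock);
  }

  static void
  child (void)
  {
    /* The child's copy may still be "locked"; reset it instead of
       unlocking.  */
    pthread_mutex_t fresh = PTHREAD_MUTEX_INITIALIZER;
    chain_lock = fresh;
  }

  static void __attribute__ ((constructor))
  register_handlers (void)
  {
    pthread_atfork (prepare, parent, child);
  }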
Florian Weimer [Thu, 13 Mar 2025 05:07:07 +0000 (06:07 +0100)]
nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)
The new initializer and struct layout does not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.
Malte Skarupke [Wed, 4 Dec 2024 13:05:40 +0000 (08:05 -0500)]
nptl: Use all of g1_start and g_signals
The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:
"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"
I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that (and even if it did, why would you want to force
a sleep in that case?), and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 91bb902f58264a2fd50fbce8f39a9a290dd23706)
Malte Skarupke [Wed, 4 Dec 2024 13:04:10 +0000 (08:04 -0500)]
nptl: Fix indentation
In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ee6c14ed59d480720721aaacc5fb03213dc153da)
Malte Skarupke [Wed, 4 Dec 2024 13:03:44 +0000 (08:03 -0500)]
nptl: Use a single loop in pthread_cond_wait instead of a nested loop
The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 929a4764ac90382616b6a21f099192b2475da674)
Malte Skarupke [Wed, 4 Dec 2024 12:56:38 +0000 (07:56 -0500)]
nptl: Remove g_refs from condition variables
This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit c36fc50781995e6758cae2b6927839d0157f213c)
Malte Skarupke [Wed, 4 Dec 2024 12:56:13 +0000 (07:56 -0500)]
nptl: Remove unnecessary quadruple check in pthread_cond_wait
pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:
1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this check.
3. After incrementing grefs. There is no reason to think that the group would
close while we do an atomic increment. Obviously it could close at any
point, but that doesn't mean we have to recheck after every step. This
check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
we check that we're not in a closed group. There is no need to check again.
The check would only have helped in cases where the compare_exchange in the
next line would also have failed. Relying on the compare_exchange is fine.
Removing the duplicate checks clarifies the code.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4f7b051f8ee3feff1b53b27a906f245afaa9cee1)
Malte Skarupke [Wed, 4 Dec 2024 12:55:50 +0000 (07:55 -0500)]
nptl: Remove unnecessary catch-all-wake in condvar group switch
This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.
In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit b42cc6af11062c260c7dfa91f1c89891366fed3e)
Frank Barrus [Wed, 4 Dec 2024 12:55:02 +0000 (07:55 -0500)]
pthreads NPTL: lost wakeup fix 2
This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch. This greatly reduces the average
and maximum latency for pthread_cond_signal.
The g_signals[] field now contains a signal count that is relative to
the current g1_start value. Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)
By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same. This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change. This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.
By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore. The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit. If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.
The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters. Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2. They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.
Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block. This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
Signed-off-by: Frank Barrus <frankbarrus_sw@shaggy.cc>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 1db84775f831a1494993ce9c118deaf9537cc50a)
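Illustrative arithmetic only (not the glibc source): with the LSB
reserved, a stale waiter's expected g_signals value embeds the low bits
of g1_start, so its cmpxchg cannot succeed after a group switch even
when the counts happen to be equal.

  #include <stdint.h>

  /* Bits 31:1 of g_signals = bits 31:1 of g1_start plus the signal
     count; bit 0 stays reserved.  */
  static inline uint32_t
  expected_g_signals (uint64_t g1_start, uint32_t count)
  {
    return ((uint32_t) g1_start & ~1u) + (count << 1);
  }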
H.J. Lu [Sat, 12 Apr 2025 15:37:29 +0000 (08:37 -0700)]
x86: Detect Intel Diamond Rapids
Detect Intel Diamond Rapids and tune it similar to Intel Granite Rapids.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit de14f1959ee5f9b845a7cae43bee03068b8136f0)
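For illustration, the family/model fields this kind of detection
inspects can be read with the compiler's <cpuid.h> (simplified; the
actual Diamond Rapids model numbers are not shown here):

  #include <cpuid.h>
  #include <stdio.h>

  int
  main (void)
  {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
      return 1;
    /* Combine base and extended family/model fields (simplified).  */
    unsigned int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
    unsigned int model = ((eax >> 4) & 0xf) | ((eax >> 12) & 0xf0);
    printf ("cpu family 0x%x model 0x%x\n", family, model);
    return 0;
  }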
Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.
Completed validation on SKL/SKX/SPR/SDE and compared the xsave state size
with the "ld.so --list-diagnostics" option; no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 70b648855185e967e54668b101d24704c3fb869d)
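A rough sketch of the scanning idiom (illustrative, not the glibc
code): CPUID leaf 0xD reports the size and offset of each xstate
component, so the end of the XSAVE area is the maximum of
offset + size over the supported IDs.

  #include <cpuid.h>

  static unsigned int
  xsave_area_end (unsigned int max_xstate_id)
  {
    unsigned int end = 0;
    for (unsigned int i = 2; i <= max_xstate_id; i++)
      {
        unsigned int size, offset, ecx, edx;
        __cpuid_count (0xd, i, size, offset, ecx, edx);
        if (size != 0 && offset + size > end)
          end = offset + size;
      }
    return end;
  }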
Noah Goldstein [Wed, 14 Aug 2024 06:37:30 +0000 (14:37 +0800)]
x86: Use `Avoid_Non_Temporal_Memset` to control non-temporal path
This is just a refactor and there should be no behavioral change from
this commit.
The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b93dddfaf440aa12f45d7c356f6ffe9f27d35577)
Florian Weimer [Fri, 28 Mar 2025 08:26:59 +0000 (09:26 +0100)]
x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
Florian Weimer [Fri, 28 Mar 2025 08:26:06 +0000 (09:26 +0100)]
x86: Skip XSAVE state size reset if ISA level requires XSAVE
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.
Michael Jeanson [Fri, 14 Feb 2025 18:54:22 +0000 (13:54 -0500)]
nptl: clear the whole rseq area before registration
Due to the extensible nature of the rseq area we can't explicitly
initialize fields that are not part of the ABI yet. It was agreed with
upstream that all new fields will be documented as zero initialized by
userspace. Future kernels configured with CONFIG_DEBUG_RSEQ will
validate the content of all fields during registration.
Replace the explicit field initialization with a memset of the whole
rseq area which will cover fields as they are added to future kernels.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 689a62a4217fae78b9ce0db781dc2a421f2b1ab4)
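A minimal sketch of the approach (hypothetical helper, not the nptl
code):

  #include <string.h>
  #include <stddef.h>

  /* Zero the entire rseq area before registration; any field a future
     kernel defines is then covered by the zero-initialization contract
     mentioned above.  */
  static void
  prepare_rseq_area (void *rseq_area, size_t rseq_size)
  {
    memset (rseq_area, 0, rseq_size);
  }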
Wilco Dijkstra [Fri, 13 Dec 2024 15:43:07 +0000 (15:43 +0000)]
math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp. Exp10 is improved by moving invlog10_2N later
so that neglog10_2hiN and neglog10_2loN can be loaded using load-pair.
The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2. Exp10 improves by 1.5%.
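A sketch of the layout idea (struct and field arrangement illustrative,
not the actual exp_data): keeping fields that are loaded together
adjacent, with the struct 16-byte aligned, lets targets with load-pair
instructions fetch both with a single ldp.

  struct exp_like_data
  {
    double invln2N;
    double shift;            /* loaded together with invln2N via ldp */
    double neglog10_2hiN;
    double neglog10_2loN;    /* another ldp-friendly pair */
  } __attribute__ ((aligned (16)));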
Wilco Dijkstra [Tue, 24 Dec 2024 18:01:59 +0000 (18:01 +0000)]
AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 bytes are improved by ~20% on average by using aligned
stores for the last 64 bytes. Performance of random memset benchmark improves
by ~2% on Neoverse V1.
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.
Wilco Dijkstra [Mon, 25 Nov 2024 18:43:08 +0000 (18:43 +0000)]
AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.
Wilco Dijkstra [Wed, 7 Aug 2024 13:43:47 +0000 (14:43 +0100)]
AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
Luna Lamb [Thu, 13 Feb 2025 17:54:46 +0000 (17:54 +0000)]
Aarch64: Improve codegen in SVE exp and users, and update expf_inline
Use unpredicated muls, and improve memory access.
7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1,
for exp, exp2 and cosh respectively.
Luna Lamb [Fri, 3 Jan 2025 20:15:17 +0000 (20:15 +0000)]
AArch64: Improve codegen in SVE expm1f and users
Use unpredicated muls, use absolute compare and improve memory access.
Expm1f, sinhf and tanhf show 7%, 5% and 1% improvement in throughput
microbenchmark on Neoverse V1.
Yat Long Poon [Fri, 3 Jan 2025 19:09:05 +0000 (19:09 +0000)]
AArch64: Improve codegen for SVE log1pf users
Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs.
Move log1pf implementation to inline helper function.
Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
Yat Long Poon [Fri, 3 Jan 2025 19:07:30 +0000 (19:07 +0000)]
AArch64: Improve codegen for SVE logs
Reduce memory access by using lanewise MLA and moving constants to struct
and reduce number of MOVPRFXs.
Update maximum ULP error for double log_sve from 1 to 2.
Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
Joana Cruz [Tue, 17 Dec 2024 14:50:33 +0000 (14:50 +0000)]
AArch64: Improve codegen of AdvSIMD expf family
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
Also use intrinsics instead of native operations.
expf: 3% improvement in throughput microbenchmark on Neoverse V1, exp2f: 5%,
exp10f: 13%, coshf: 14%.
Joana Cruz [Tue, 17 Dec 2024 14:47:31 +0000 (14:47 +0000)]
AArch64: Improve codegen of AdvSIMD logf function family
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs.
8% improvement in throughput microbenchmark on Neoverse V1 for log2 and log,
and 2% for log10.
AArch64: Improve codegen in users of ADVSIMD log1p helper
Add inline helper for log1p and rearrange operations so MOV
is not necessary in reduction or around the special-case handler.
Reduce memory access by using more indexed MLAs in polynomial.
Speedup on Neoverse V1 for log1p (3.5%), acosh (7.5%) and atanh (10%).
Remove spurious ADRP and a few MOVs.
Reduce memory access by using more indexed MLAs in polynomial.
Align notation so that algorithms are easier to compare.
Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%).
Update error threshold in AdvSIMD log (now matches SVE log).
Joe Ramsay [Fri, 1 Nov 2024 15:48:54 +0000 (15:48 +0000)]
AArch64: Remove SVE erf and erfc tables
By using a combination of mask-and-add instead of the shift-based
index calculation the routines can share the same table as other
variants with no performance degradation.
The tables change name because of other changes in downstream AOR.
Joe Ramsay [Mon, 28 Oct 2024 14:58:35 +0000 (14:58 +0000)]
AArch64: Small optimisation in AdvSIMD erf and erfc
In both routines, reduce register pressure such that GCC 14 emits no
spills for erf and fewer spills for erfc. Also use more efficient
comparison for the special-case in erf.
Benchtests show erf improves by 6.4%, erfc by 1.0%.
Joe Ramsay [Mon, 23 Sep 2024 14:32:53 +0000 (15:32 +0100)]
AArch64: Improve codegen in users of ADVSIMD expm1f helper
Rearrange operations so MOV is not necessary in reduction or around
the special-case handler. Reduce memory access by using more indexed
MLAs in polynomial.
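The indexed-MLA idiom referred to here, as an intrinsics sketch
(illustrative, not the glibc source): packing four coefficients into
one vector and selecting them by lane removes separate coefficient
loads.

  #include <arm_neon.h>

  /* acc + x * coeffs[1]; compiles to an FMLA (by element).  */
  static inline float32x4_t
  poly_step (float32x4_t acc, float32x4_t x, float32x4_t coeffs)
  {
    return vfmaq_laneq_f32 (acc, x, coeffs, 1);
  }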
Joe Ramsay [Mon, 23 Sep 2024 14:32:14 +0000 (15:32 +0100)]
AArch64: Improve codegen in users of AdvSIMD log1pf helper
log1pf is quite register-intensive - use fewer registers for the
polynomial, and make various changes to shorten dependency chains in
parent routines. There is now no spilling with GCC 14. Accuracy moves
around a little - comments adjusted accordingly but does not require
regen-ulps.
Use the helper in log1pf as well, instead of having separate
implementations. The more accurate polynomial means special-casing can
be simplified, and the shorter dependency chain avoids the usual dance
around v0, which is otherwise difficult.
There is a small duplication of vectors containing 1.0f (or 0x3f800000) -
GCC is not currently able to efficiently handle values which fit in FMOV
but not MOVI, and are reinterpreted to integer. There may be potential
for more optimisation if this is fixed.
Joe Ramsay [Mon, 23 Sep 2024 14:30:20 +0000 (15:30 +0100)]
AArch64: Improve codegen in SVE F32 logs
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions
where possible. Similar to the recent change to AdvSIMD F32 logs,
adjust special-case arguments and bounds to allow for more optimal
register usage. For all 3 routines one MOVPRFX remains in the
reduction, which cannot be avoided as immediate AND and ASR are both
destructive.
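For illustration of the MOVPRFX point (intrinsics sketch, not the glibc
source): the _x ("don't care") predication lets the compiler pick the
unpredicated, non-destructive form, whereas destructive predicated
forms need a MOVPRFX whenever the destination must be preserved.

  #include <arm_sve.h>

  /* Can compile to a single unpredicated FMUL, with no MOVPRFX.  */
  static inline svfloat32_t
  scale (svfloat32_t x, svfloat32_t y)
  {
    return svmul_f32_x (svptrue_b32 (), x, y);
  }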
Joe Ramsay [Mon, 23 Sep 2024 14:26:12 +0000 (15:26 +0100)]
AArch64: Improve codegen in SVE expf & related routines
Reduce MOV and MOVPRFX by improving special-case handling. Use inline
helper to duplicate the entire computation between the special- and
non-special case branches, removing the contention for z0 between x
and the return value.
Also rearrange some MLAs and MLSs - by making the multiplicand the
destination we can avoid a MOVPRFX in several cases. Also change which
constants go in the vector used for lanewise ops - the last lane is no
longer wasted.
Spotted that shift was incorrect in exp2f and exp10f, w.r.t. the
comment that explains it. Fixed - worst-case ULP for exp2f moves
around but it doesn't change significantly for either routine.
Worst-case error for coshf increases due to passing x to exp rather
than abs(x) - updated the comment, but does not require regen-ulps.
Joe Ramsay [Thu, 19 Sep 2024 16:34:02 +0000 (17:34 +0100)]
AArch64: Add vector logp1 alias for log1p
This enables vectorisation of C23 logp1, which is an alias for log1p.
There are no new tests or ulp entries because the new symbols are simply
aliases.
Joe Ramsay [Mon, 9 Sep 2024 12:00:01 +0000 (13:00 +0100)]
aarch64: Avoid redundant MOVs in AdvSIMD F32 logs
Since the last operation is destructive, the first argument to the FMA
also has to be the first argument to the special-case in order to
avoid unnecessary MOVs. Reorder arguments and adjust special-case
bounds to facilitate this.
math: Add optimization barrier to ensure a1 + u.d is not reused [BZ #30664]
A number of fma tests started to fail on hppa when gcc was changed to
use Ranger rather than EVRP. Eventually I found that the value of
a1 + u.d in this block of code was being computed in FE_TOWARDZERO
mode and not the original rounding mode:
  if (TININESS_AFTER_ROUNDING)
    {
      w.d = a1 + u.d;
      if (w.ieee.exponent == 109)
        return w.d * 0x1p-108;
    }
This caused the exponent value to be wrong and the wrong return path
to be used.
Here we add an optimization barrier after the rounding mode is reset
to ensure that the previous value of a1 + u.d is not reused.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
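The barrier idiom in question, in its generic form (glibc's
math_opt_barrier is defined per architecture; this is a sketch):

  /* Marking __x as read-written memory forces the compiler to keep
     the computation where it is, so a1 + u.d cannot be reused from
     before the rounding-mode switch.  */
  #define opt_barrier(x)                        \
    ({ __typeof (x) __x = (x);                  \
       __asm__ ("" : "+m" (__x));               \
       __x; })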
nptl: Correct stack size attribute when stack grows up [BZ #32574]
Set stack size attribute to the size of the mmap'd region only
when the size of the remaining stack space is less than the size
of the mmap'd region.
This was reversed. As a result, the initial stack size was only
135168 bytes. On architectures where the stack grows down, the
initial stack size is approximately 8384512 bytes with the default
rlimit settings. The small main stack size on hppa broke
applications like ruby that check for stack overflows.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
H.J. Lu [Tue, 17 Dec 2024 10:41:45 +0000 (18:41 +0800)]
Hide all malloc functions from compiler [BZ #32366]
Since -1 isn't a power of two, the compiler may reject it, so hide memalign
from Clang 19, which issues an error:
tst-memalign.c:86:31: error: requested alignment is not a power of 2 [-Werror,-Wnon-power-of-two-alignment]
86 | p = memalign (-1, pagesize);
| ^~
tst-memalign.c:86:31: error: requested alignment must be 4294967296 bytes or smaller; maximum alignment assumed [-Werror,-Wbuiltin-assume-aligned-alignment]
86 | p = memalign (-1, pagesize);
| ^~
Update tst-malloc-aux.h to hide all malloc functions and include it in
all malloc tests to prevent the compiler from optimizing out any malloc
functions.
Tested with Clang 19.1.5 and GCC 15 20241206 for BZ #32366.
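The hiding technique, sketched (not the exact tst-malloc-aux.h
contents): calls go through a volatile function pointer, so the
compiler can no longer treat them as the built-in and reject or fold
the arguments.

  #include <malloc.h>
  #include <stddef.h>

  /* The volatile pointer is opaque to the optimizer, so the invalid
     alignment below reaches the real memalign at run time.  */
  static void *(*volatile memalign_indirect) (size_t, size_t) = memalign;
  #define memalign memalign_indirect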
This change implements vfork.S for direct support of the vfork
syscall. clone.S is revised to correct child support for the
vfork case.
The main bug was creating a frame prior to the clone syscall.
This was done to allow the rp and r4 registers to be saved and
restored from the stack frame. r4 was used to save and restore
the PIC register, r19, across the system call and the call to
set errno. But in the vfork case, it is undefined behavior
for the child to return from the function in which vfork was
called. It is surprising that this usually worked.
Syscalls on hppa save and restore rp and r19, so we don't need
to create a frame prior to the clone syscall. We only need a
frame when __syscall_error is called. We also don't need to
save and restore r19 around the call to $$dyncall as r19 is not
used in the code after $$dyncall.
This considerably simplifies clone.S.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
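For context, the child-side contract that the framed code violated,
shown as a minimal correct use of vfork (generic C, not hppa assembly):

  #include <unistd.h>

  static void
  run_true (void)
  {
    /* The child borrows the parent's stack: it may only call an
       exec-family function or _exit; returning from the function
       that called vfork is undefined.  */
    pid_t pid = vfork ();
    if (pid == 0)
      {
        execl ("/bin/true", "true", (char *) 0);
        _exit (127);
      }
  }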
Florian Weimer [Tue, 17 Dec 2024 17:12:03 +0000 (18:12 +0100)]
x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size. This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.
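A worked example of the truncation (illustrative): 1 TiB is 1 << 40,
which is an exact multiple of 1 << 32, so squeezing it through a 32-bit
variable yields zero, and downstream threshold computations collapse.

  unsigned int truncated = (unsigned int) (1ULL << 40);  /* == 0 */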
H.J. Lu [Thu, 5 Dec 2024 00:39:44 +0000 (08:39 +0800)]
math: Exclude internal math symbols for tests [BZ #32414]
Since internal tests don't have access to internal symbols in libm,
exclude them from internal tests. Also make tst-strtod5 and tst-strtod5i
depend on $(libm) to support older versions of GCC which can't inline
copysign family functions. This fixes BZ #32414.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 5df09b444835fca6e64b3d4b4a5beb19b3b2ba21)
The commit 'sparc: Use Linux kABI for syscall return'
(86c5d2cf0ce046279baddc7faa27da71f1a89fde) did not take into account
a subtle sparc syscall kABI constraint. For syscalls that might block
indefinitely, on an interrupt (like SIGCONT) the kernel will set the
instruction pointer to just before the syscall:
arch/sparc/kernel/signal_64.c
476 static void do_signal(struct pt_regs *regs, unsigned long orig_i0)
477 {
[...]
525         if (restart_syscall) {
526                 switch (regs->u_regs[UREG_I0]) {
527                 case ERESTARTNOHAND:
528                 case ERESTARTSYS:
529                 case ERESTARTNOINTR:
530                         /* replay the system call when we are done */
531                         regs->u_regs[UREG_I0] = orig_i0;
532                         regs->tpc -= 4;
533                         regs->tnpc -= 4;
534                         pt_regs_clear_syscall(regs);
535                         fallthrough;
536                 case ERESTART_RESTARTBLOCK:
537                         regs->u_regs[UREG_G1] = __NR_restart_syscall;
538                         regs->tpc -= 4;
539                         regs->tnpc -= 4;
540                         pt_regs_clear_syscall(regs);
541                 }
However, on a SIGCONT it seems that the 'g1' register is being clobbered
after the syscall returns. Before 86c5d2cf0ce046279, 'g1' was always placed
just before the 'ta' instruction, which then reloads the syscall number and
restarts the syscall.
On master, where 'g1' might not be placed just before the 'ta':
$ cat test.c
#include <unistd.h>
int main ()
{
  pause ();
}
$ gcc test.c -o test
$ strace -f ./test
[...]
ppoll(NULL, 0, NULL, NULL, 0
Just moving the 'g1' setting near the syscall asm does not suffice;
the compiler might optimize it away (as I saw in cancellation.c while
trying this fix). Instead, I have changed the inline asm to put the
'g1' setup inside the asm block. This requires changing the asm
constraint for INTERNAL_SYSCALL_NCS, since the syscall number is not
constant.
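A simplified illustration of the resulting shape (sparc64 inline asm,
names and details assumed, not the actual sysdep macro): the mov into
%g1 lives inside the asm block, immediately before the trap, where the
kernel's restart sequence expects the syscall number to be.

  static inline long
  syscall0_ncs (long number)   /* number need not be a constant */
  {
    register long o0 asm ("o0");
    asm volatile ("mov %1, %%g1\n\t"
                  "ta 0x6d"              /* 64-bit syscall trap */
                  : "=r" (o0)
                  : "r" (number)
                  : "g1", "memory", "cc");
    return o0;
  }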