Noah Goldstein [Sun, 6 Feb 2022 06:54:18 +0000 (00:54 -0600)]
x86: Improve vec generation in memset-vec-unaligned-erms.S
No bug.
Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.
For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.
Results for memset-avx2 small (geomean of N = 20 benchset runs).
size, New Time, Old Time, New / Old
0, 4.100, 3.831, 0.934
1, 5.074, 4.399, 0.867
2, 4.433, 4.411, 0.995
4, 4.487, 4.415, 0.984
8, 4.454, 4.396, 0.987
16, 4.502, 4.443, 0.987
Noah Goldstein [Mon, 10 Jan 2022 21:35:39 +0000 (15:35 -0600)]
x86: Optimize strcmp-evex.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.
The non-page cross cases as well are nearly universally improved.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Noah Goldstein [Mon, 10 Jan 2022 21:35:38 +0000 (15:35 -0600)]
x86: Optimize strcmp-avx2.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.
The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Noah Goldstein [Sat, 25 Dec 2021 00:54:41 +0000 (18:54 -0600)]
x86: Optimize L(less_vec) case in memcmp-evex-movbe.S
No bug.
Optimizations are twofold.
1) Replace page cross and 0/1 checks with masked load instructions in
L(less_vec). In applications this reduces branch-misses in the
hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.
Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range. From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de090ae84e380aff68a98bd94ef704667)
Noah Goldstein [Fri, 3 Dec 2021 23:29:25 +0000 (15:29 -0800)]
x86-64: Use notl in EVEX strcmp [BZ #28646]
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.
Noah Goldstein [Wed, 10 Nov 2021 22:18:56 +0000 (16:18 -0600)]
x86: Shrink memcmp-sse4.S code size
No bug.
This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.
The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.
See https://research.google/pubs/pub48320/ for more details.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d7171850451e6044ef78d91ff8c017b9c0)
Noah Goldstein [Mon, 1 Nov 2021 05:49:52 +0000 (00:49 -0500)]
x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
No bug.
This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium ranged.
H.J. Lu [Fri, 29 Oct 2021 19:56:53 +0000 (12:56 -0700)]
x86-64: Remove Prefer_AVX2_STRCMP
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
H.J. Lu [Fri, 29 Oct 2021 19:40:20 +0000 (12:40 -0700)]
x86-64: Improve EVEX strcmp with masked load
In strcmp-evex.S, to compare 2 32-byte strings, replace
VMOVU (%rdi, %rdx), %YMM0
VMOVU (%rsi, %rdx), %YMM1
/* Each bit in K0 represents a mismatch in YMM0 and YMM1. */
VPCMP $4, %YMM0, %YMM1, %k0
VPCMP $0, %YMMZERO, %YMM0, %k1
VPCMP $0, %YMMZERO, %YMM1, %k2
/* Each bit in K1 represents a NULL in YMM0 or YMM1. */
kord %k1, %k2, %k1
/* Each bit in K1 represents a NULL or a mismatch. */
kord %k0, %k1, %k1
kmovd %k1, %ecx
testl %ecx, %ecx
jne L(last_vector)
with
VMOVU (%rdi, %rdx), %YMM0
VPTESTM %YMM0, %YMM0, %k2
/* Each bit cleared in K1 represents a mismatch or a null CHAR
in YMM0 and 32 bytes at (%rsi, %rdx). */
VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2}
kmovd %k1, %ecx
incl %ecx
jne L(last_vector)
It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.
Noah Goldstein [Sat, 23 Oct 2021 05:26:47 +0000 (01:26 -0400)]
x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.
it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.
It costs 2-extra bytes but the extra bytes should only eat into
alignment padding. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bad852b61b79503fcb3c5fc379c70f768df3e1fb)
1. change control flow for L(more_2x_vec) to fall through to loop and
jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
size and saves jumps for length > 4x VEC_SIZE.
2. For EVEX/AVX512 move L(less_vec) closer to entry.
3. Avoid complex address mode for length > 2x VEC_SIZE
4. Slightly better aligning code for the loop from the perspective of
code size and uops.
5. Align targets so they make full use of their fetch block and if
possible cache line.
6. Try and reduce total number of icache lines that will need to be
pulled in for a given length.
7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
jumping to the stosb target in the sse2 code section will almost
certainly be to a new page. The new version does increase code size
marginally by duplicating the target but should get better iTLB
behavior as a result.
test-memset, test-wmemset, and test-bzero are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e59ced238482fd71f3e493717f14f6507346741e)
x86: Optimize memcmp-evex-movbe.S for frontend behavior and size
No bug.
The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
code size. Smaller / Less hot basic blocks are used for this.
Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.
In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.
The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.
Noah Goldstein [Sun, 9 Jan 2022 22:02:28 +0000 (16:02 -0600)]
x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_evex. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
x86-64: Avoid rep movsb with short distance [BZ #27130]
introduced some regressions on Intel processors without Fast Short REP
MOV (FSRM). Add Avoid_Short_Distance_REP_MOVSB to avoid rep movsb with
short distance only on Intel processors with FSRM. bench-memmove-large
on Skylake server shows that cycles of __memmove_evex_unaligned_erms
improves for the following data size:
Which words independently of s + maxlen overflowing. So the
second overflow check is unnecissary for correctness and
just extra overhead in the common no overflow case.
test-strlen.c, test-wcslen.c, test-strnlen.c and test-wcsnlen.c are
all passing
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 08cbcd4dbc686bb38ec3093aff2f919fbff5ec17)
Noah Goldstein [Sun, 23 May 2021 23:43:24 +0000 (19:43 -0400)]
x86: Improve memmove-vec-unaligned-erms.S
This patch changes the condition for copy 4x VEC so that if length is
exactly equal to 4 * VEC_SIZE it will use the 4x VEC case instead of
8x VEC case.
Results For Skylake memcpy-avx2-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 0 , 9.137 , 6.873 , New , 75.22
128 , 7 , 0 , 12.933 , 7.732 , New , 59.79
128 , 0 , 7 , 11.852 , 6.76 , New , 57.04
128 , 7 , 7 , 12.587 , 6.808 , New , 54.09
Results For Icelake memcpy-evex-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 0 , 9.963 , 5.416 , New , 54.36
128 , 7 , 0 , 16.467 , 8.061 , New , 48.95
128 , 0 , 7 , 14.388 , 7.644 , New , 53.13
128 , 7 , 7 , 14.546 , 7.642 , New , 52.54
Results For Tigerlake memcpy-evex-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 0 , 8.979 , 4.95 , New , 55.13
128 , 7 , 0 , 14.245 , 7.122 , New , 50.0
128 , 0 , 7 , 12.668 , 6.675 , New , 52.69
128 , 7 , 7 , 13.042 , 6.802 , New , 52.15
Results For Skylake memmove-avx2-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 32 , 6.181 , 5.691 , New , 92.07
128 , 32 , 0 , 6.165 , 5.752 , New , 93.3
128 , 0 , 7 , 13.923 , 9.37 , New , 67.3
128 , 7 , 0 , 12.049 , 10.182 , New , 84.5
Results For Icelake memmove-evex-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 32 , 5.479 , 4.889 , New , 89.23
128 , 32 , 0 , 5.127 , 4.911 , New , 95.79
128 , 0 , 7 , 18.885 , 13.547 , New , 71.73
128 , 7 , 0 , 15.565 , 14.436 , New , 92.75
Results For Tigerlake memmove-evex-erms
size, al1 , al2 , Cur T , New T , Win , New / Cur
128 , 0 , 32 , 5.275 , 4.815 , New , 91.28
128 , 32 , 0 , 5.376 , 4.565 , New , 84.91
128 , 0 , 7 , 19.426 , 14.273 , New , 73.47
128 , 7 , 0 , 15.924 , 14.951 , New , 93.89
Noah Goldstein [Thu, 20 May 2021 17:13:51 +0000 (13:13 -0400)]
x86: Improve memset-vec-unaligned-erms.S
No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64
instead of 128. Either alignment will perform equally well in a loop
and 128 just increases the odds of having to do an extra iteration
which can be significant overhead for small values. 2) Align some
targets and the loop. 3) Remove an ALU from the alignment process. 4)
Reorder the last 4x VEC so that they are stored after the loop. 5)
Move the condition for leq 8x VEC to before the alignment
process. test-memset and test-wmemset are both passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6abf27980a947f9b6e514d6b33b83059d39566ae)
Noah Goldstein [Mon, 17 May 2021 17:57:24 +0000 (13:57 -0400)]
x86: Optimize memcmp-evex-movbe.S
No bug. This commit optimizes memcmp-evex.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, removing some unnecissary ALU instructions from the main
loop, and most importantly replacing the heavy use of vpcmp + kand
logic with vpxor + vptern. test-memcmp and test-wmemcmp are both
passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 4ad473e97acdc5f6d811755b67c09f2128a644ce)
Noah Goldstein [Mon, 17 May 2021 17:56:52 +0000 (13:56 -0400)]
x86: Optimize memcmp-avx2-movbe.S
No bug. This commit optimizes memcmp-avx2.S. The optimizations include
adding a new vec compare path for small sizes, reorganizing the entry
control flow, and removing some unnecissary ALU instructions from the
main loop. test-memcmp and test-wmemcmp are both passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 16d12015c57701b08d7bbed6ec536641bcafb428)
Noah Goldstein [Tue, 4 May 2021 23:02:40 +0000 (19:02 -0400)]
x86: Add EVEX optimized memchr family not safe for RTM
No bug.
This commit adds a new implementation for EVEX memchr that is not safe
for RTM because it uses vzeroupper. The benefit is that by using
ymm0-ymm15 it can use vpcmpeq and vpternlogd in the 4x loop which is
faster than the RTM safe version which cannot use vpcmpeq because
there is no EVEX encoding for the instruction. All parts of the
implementation aside from the 4x loop are the same for the two
versions and the optimization is only relevant for large sizes.
No bug. This commit optimizes strchr-evex.S. The optimizations are
mostly small things such as save an ALU in the alignment process,
saving a few instructions in the loop return. The one significant
change is saving 2 instructions in the 4x loop. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
No bug. This commit optimizes strchr-avx2.S. The optimizations are all
small things such as save an ALU in the alignment process, saving a
few instructions in the loop return, saving some bytes in the main
loop, and increasing the ILP in the return cases. test-strchr,
test-strchrnul, test-wcschr, and test-wcschrnul are all passing.
x86: Optimize less_vec evex and avx512 memset-vec-unaligned-erms.S
No bug. This commit adds optimized cased for less_vec memset case that
uses the avx512vl/avx512bw mask store avoiding the excessive
branches. test-memset and test-wmemset are passing.
x86: Update large memcpy case in memmove-vec-unaligned-erms.S
No Bug. This commit updates the large memcpy case (no overlap). The
update is to perform memcpy on either 2 or 4 contiguous pages at
once. This 1) helps to alleviate the affects of false memory aliasing
when destination and source have a close 4k alignment and 2) In most
cases and for most DRAM units is a modestly more efficient access
pattern. These changes are a clear performance improvement for
VEC_SIZE =16/32, though more ambiguous for VEC_SIZE=64. test-memcpy,
test-memccpy, test-mempcpy, test-memmove, and tst-memmove-overflow all
pass.
noah [Wed, 3 Feb 2021 05:38:59 +0000 (00:38 -0500)]
x86-64: Refactor and improve performance of strchr-avx2.S
No bug. Just seemed the performance could be improved a bit. Observed
and expected behavior are unchanged. Optimized body of main
loop. Updated page cross logic and optimized accordingly. Made a few
minor instruction selection modifications. No regressions in test
suite. Both test-strchrnul and test-strchr passed.
x86: Adding an upper bound for Enhanced REP MOVSB.
In the process of optimizing memcpy for AMD machines, we have found the
vector move operations are outperforming enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB:'__x86_rep_movsb_stop_threshold'.
As per large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines and applicable from Zen3 architecture
supporting the ERMS feature.
For architectures other than AMD, it is the computed value of
non-temporal threshold parameter.
Stefan Liebler [Wed, 13 Apr 2022 12:36:09 +0000 (14:36 +0200)]
S390: Add new s390 platform z16.
The new IBM z16 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented.
_dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2
is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active
as those instructions may be replaced or removed in future.
tst-glibc-hwcaps.c is extended in order to test z16 via new marker5.
A fatal glibc error is dumped if glibc was build with architecture
level set for z16, but run on an older machine. (See dl-hwcap-check.h)
This change fixes two warnings from _dl_lookup_address.
The first warning comes from dropping the volatile keyword from
desc in the call to _dl_read_access_allowed. We now have a full
atomic barrier between loading desc[0] and the access check, so
desc no longer needs to be declared as volatile.
The second warning comes from the implicit declaration of
_dl_fix_reloc_arg. This is fixed by including dl-runtime.h and
declaring _dl_fix_reloc_arg in dl-runtime.h.
_STACK_GROWS_DOWN is defined to 0 when the stack grows up. The
code in unwind.c used `#ifdef _STACK_GROWS_DOWN' to selct the
stack grows down define for FRAME_LEFT. As a result, the
_STACK_GROWS_DOWN define was always selected and cleanups were
incorrectly sequenced when the stack grows up.
The current getcontext return trampoline is overly complex and it
unnecessarily clobbers several registers. By saving the context
pointer (r26) in the context, __getcontext_ret can restore any
registers not restored by setcontext. This allows getcontext to
save and restore the entire register context present when getcontext
is entered. We use the unused oR0 context slot for the return
from __getcontext_ret.
While this is not directly useful in C, it can be exploited in
assembly code. Registers r20, r23, r24 and r25 are not clobbered
in the call path to getcontext. This allows a small simplification
of swapcontext.
It also allows saving and restoring the 6-bit SAR register in the
LSB of the oSAR context slot. The getcontext flag value can be
stored in the MSB of the oSAR slot.
This change fixes the failure of stdlib/tst-setcontext2 and
stdlib/tst-setcontext7 on hppa. The implementation of swapcontext
in C is broken. C saves the return pointer (rp) and any non
call-clobbered registers (in this case r3, r4 and r5) on the
stack. However, the setcontext call in swapcontext pops the
stack and subsequent calls clobber the saved registers. When
the context in oucp is restored, both tests fault.
Here we rewrite swapcontext in assembly code to avoid using
the stack for register values that need to be used after
restoration. The getcontext and setcontext routines are
revised to save and restore register ret1 for normal returns.
We copy the oucp pointer to ret1. This allows access to
the old context after calling getcontext and setcontext.
The test elf/tst-audit2 fails on hppa with a segmentation fault in the
long branch stub used to call malloc from calloc. This occurs because
the test is not a PIC executable and calloc is called from the dynamic
linker before the dp register is initialized in _dl_start_user.
The fix is to move the dp register initialization into
elf_machine_runtime_setup. Since the address of $global$ can't be
loaded directly, we continue to use the DT_PLTGOT value from the
the main_map to initialize dp. Since l_main_map is not available
in v2.34 and earlier, we use a new function, elf_machine_main_map,
to find the main map.
Noah Goldstein [Fri, 18 Feb 2022 23:00:25 +0000 (17:00 -0600)]
x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.c
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)
Noah Goldstein [Fri, 18 Feb 2022 20:19:15 +0000 (14:19 -0600)]
x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)
Noah Goldstein [Tue, 15 Feb 2022 14:18:15 +0000 (08:18 -0600)]
x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Samuel Thibault [Sun, 17 Oct 2021 23:39:02 +0000 (01:39 +0200)]
hurd if_index: Explicitly use AF_INET for if index discovery
5bf07e1b3a74 ("Linux: Simplify __opensock and fix race condition [BZ #28353]")
made __opensock try NETLINK then UNIX then INET. On the Hurd, only INET
knows about network interfaces, so better actually specify that in
if_index.
Linux: Simplify __opensock and fix race condition [BZ #28353]
AF_NETLINK support is not quite optional on modern Linux systems
anymore, so it is likely that the first attempt will always succeed.
Consequently, there is no need to cache the result. Keep AF_UNIX
and the Internet address families as a fallback, for the rare case
that AF_NETLINK is missing. The other address families previously
probed are totally obsolete be now, so remove them.
Use this simplified version as the generic implementation, disabling
Netlink support as needed.
added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the
random value in the the RSI register is larger than the wide-character
string length in the existing wcslen test, it didn't trigger the wcslen
test failure. Add a test to force 0 into the RSI register before calling
wcslen.
Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did
not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit
fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to the ifunc
implementation list.
Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
H.J. Lu [Wed, 23 Jun 2021 21:27:58 +0000 (14:27 -0700)]
x86: Copy IBT and SHSTK usable only if CET is enabled
IBT and SHSTK usable bits are copied from CPUID feature bits and later
cleared if kernel doesn't support CET. Copy IBT and SHSTK usable only
if CET is enabled so that they aren't set on CET capable processors
with non-CET enabled glibc.
Noah Goldstein [Wed, 9 Jun 2021 20:17:14 +0000 (16:17 -0400)]
String: Add overflow tests for strnlen, memchr, and strncat [BZ #27974]
This commit adds tests for a bug in the wide char variant of the
functions where the implementation may assume that maxlen for wcsnlen
or n for wmemchr/strncat will not overflow when multiplied by
sizeof(wchar_t).
These tests show the following implementations failing on x86_64:
wcsnlen-sse4_1
wcsnlen-avx2
wmemchr-sse2
wmemchr-avx2
strncat would fail as well if it where on a system that prefered
either of the wcsnlen implementations that failed as it relies on
wcsnlen.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Noah Goldstein [Wed, 23 Jun 2021 05:19:34 +0000 (01:19 -0400)]
x86-64: Add wcslen optimize for sse4.1
No bug. This comment adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)
H.J. Lu [Wed, 23 Jun 2021 03:42:10 +0000 (20:42 -0700)]
x86-64: Move strlen.S to multiarch/strlen-vec.S
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
Noah Goldstein [Mon, 3 May 2021 07:03:19 +0000 (03:03 -0400)]
x86: Optimize memchr-evex.S
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.
Noah Goldstein [Mon, 3 May 2021 07:01:58 +0000 (03:01 -0400)]
x86: Optimize memchr-avx2.S
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
asaving a few instructions the in loop return loop. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)
H.J. Lu [Sun, 7 Mar 2021 17:45:23 +0000 (09:45 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Sun, 7 Mar 2021 17:44:18 +0000 (09:44 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Tue, 23 Feb 2021 14:33:10 +0000 (06:33 -0800)]
x86: Add string/memory function tests in RTM region
At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.
H.J. Lu [Fri, 5 Mar 2021 15:26:42 +0000 (07:26 -0800)]
x86-64: Add AVX optimized string/memory functions for RTM
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with
xtest
jz 1f
vzeroall
ret
1:
vzeroupper
ret
at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
H.J. Lu [Fri, 5 Mar 2021 15:20:28 +0000 (07:20 -0800)]
x86-64: Add memcmp family functions with 256-bit EVEX
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
H.J. Lu [Fri, 5 Mar 2021 15:15:03 +0000 (07:15 -0800)]
x86-64: Add memset family functions with 256-bit EVEX
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Fri, 5 Mar 2021 14:46:08 +0000 (06:46 -0800)]
x86-64: Add memmove family functions with 256-bit EVEX
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:36:50 +0000 (06:36 -0800)]
x86-64: Add strcpy family functions with 256-bit EVEX
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:24:52 +0000 (06:24 -0800)]
x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.
For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
H.J. Lu [Fri, 26 Feb 2021 13:36:59 +0000 (05:36 -0800)]
x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
Noah Goldstein [Sun, 9 Jan 2022 22:02:21 +0000 (16:02 -0600)]
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
getcwd: Set errno to ERANGE for size == 1 (CVE-2021-3999)
No valid path returned by getcwd would fit into 1 byte, so reject the
size early and return NULL with errno set to ERANGE. This change is
prompted by CVE-2021-3999, which describes a single byte buffer
underflow and overflow when all of the following conditions are met:
- The buffer size (i.e. the second argument of getcwd) is 1 byte
- The current working directory is too long
- '/' is also mounted on the current working directory
Sequence of events:
- In sysdeps/unix/sysv/linux/getcwd.c, the syscall returns ENAMETOOLONG
because the linux kernel checks for name length before it checks
buffer size
- The code falls back to the generic getcwd in sysdeps/posix
- In the generic func, the buf[0] is set to '\0' on line 250
- this while loop on line 262 is bypassed:
while (!(thisdev == rootdev && thisino == rootino))
since the rootfs (/) is bind mounted onto the directory and the flow
goes on to line 449, where it puts a '/' in the byte before the
buffer.
- Finally on line 458, it moves 2 bytes (the underflowed byte and the
'\0') to the buf[0] and buf[1], resulting in a 1 byte buffer overflow.
- buf is returned on line 469 and errno is not set.
It is a wrapper for Linux clone syscall, to simplify the call to the
use only the most common arguments and remove architecture specific
handling (such as ia64 different name and signature).
realpath: Set errno to ENAMETOOLONG for result larger than PATH_MAX [BZ #28770]
realpath returns an allocated string when the result exceeds PATH_MAX,
which is unexpected when its second argument is not NULL. This results
in the second argument (resolved) being uninitialized and also results
in a memory leak since the caller expects resolved to be the same as the
returned value.
Return NULL and set errno to ENAMETOOLONG if the result exceeds
PATH_MAX. This fixes [BZ #28770], which is CVE-2021-3998.
support: Add helpers to create paths longer than PATH_MAX
Add new helpers support_create_and_chdir_toolong_temp_directory and
support_chdir_toolong_temp_directory to create and descend into
directory trees longer than PATH_MAX.
Florian Weimer [Fri, 25 Jun 2021 06:02:30 +0000 (08:02 +0200)]
elf: Fix glibc-hwcaps priorities with cache flags mismatches [BZ #27046]
If lib->flags (in the cache) did not match GLRO (dl_correct_cache_id),
searching for further glibc-hwcaps entries did not happen, and it
was possible that the best glibc-hwcaps was not found. By accident,
this causes a test failure for elf/tst-glibc-hwcaps-prepend-cache
on armv7l.
This commit changes the cache lookup logic to continue searching
if (a) no match has been found, (b) a named glibc-hwcaps match
has been found(), or (c) non-glibc-hwcaps match has been found
and the entry flags and cache default flags do not match.
_DL_CACHE_DEFAULT_ID is used instead of GLRO (dl_correct_cache_id)
because the latter is only written once on i386 if loading
of libc.so.5 libraries is selected, so GLRO (dl_correct_cache_id)
should probably removed in a future change.
Paul A. Clarke [Sat, 25 Sep 2021 14:57:15 +0000 (09:57 -0500)]
powerpc: Fix unrecognized instruction errors with recent binutils
Recent versions of binutils (with commit b25f942e18d6ecd7ec3e2d2e9930eb4f996c258a) stopped preserving "sticky"
options across a base `.machine` directive, nullifying the use of
passing "-many" through GCC to the assembler. As a result, some
instructions which were recognized even under older, more stringent
`.machine` directives become unrecognized instructions in that
context.
In `sysdeps/powerpc/tst-set_ppr.c`, the use of the `mfppr32` extended
mnemonic became unrecognized, as the default compilation with GCC for
32bit powerpc adds a `.machine ppc` in the resulting assembly, so the
command line option `-Wa,-many` is essentially ignored, and the ISA 2.06
instructions and mnemonics, like `mfppr32`, are unrecognized.
The compilation of `sysdeps/powerpc/tst-set_ppr.c` fails with:
Error: unrecognized opcode: `mfppr32'
Add appropriate `.machine` directives in the assembly to bracket the
`mfppr32` instruction.
Part of a 2019 fix (commit 9250e6610fdb0f3a6f238d2813e319a41fb7a810) to
the above test's Makefile to add `-many` to the compilation when GCC
itself stopped passing `-many` to the assember no longer has any effect,
so remove that.
Florian Weimer [Tue, 9 Mar 2021 20:07:24 +0000 (21:07 +0100)]
<shlib-compat.h>: Support compat_symbol_reference for _ISOMAC
This is helpful for testing compat symbols in cases where _ISOMAC
is activated implicitly due to -DMODULE_NAME=testsuite and cannot
be disabled easily.