git.ipfire.org Git - thirdparty/glibc.git/log

resolv: Add tst-resolv-aliases

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 87aa98aa80627553a66bdcad2701fd6307723645)

resolv: Add tst-resolv-byaddr for testing reverse lookup

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 0b99828d54e5d1fc8f5ad3edf5ba262ad2e9c5b0)

elf: Implement force_first handling in _dl_sort_maps_dfs (bug 28937)

The implementation in _dl_close_worker requires that the first
element of l_initfini is always this very map (“We are always the
zeroth entry, and since we don't include ourselves in the
dependency analysis start at 1.”). Rather than fixing that
assumption, this commit adds an implementation of the force_first
argument to the new dependency sorting algorithm. This also means
that the directly dlopen'ed shared object is always initialized last,
which is the least surprising behavior in the presence of cycles.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 1df71d32fe5f5905ffd5d100e5e9ca8ad6210891)

elf: Rename _dl_sort_maps parameter from skip to force_first

The new implementation will not be able to skip an arbitrary number
of objects.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit dbb75513f5cf9285c77c9e55777c5c35b653f890)

scripts/dso-ordering-test.py: Generate program run-time dependencies

The main program needs to depend on all shared objects, even objects
that have link-time dependencies among shared objects. Filtering
out shared objects that already have an link-time dependencies is not
necessary here; make will do this automatically.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 183d99737298bb3200f0610fdcd1c7549c8ed560)

elf: Fix hwcaps string size overestimation

Commit dad90d528259b669342757c37dedefa8577e2636 added glibc-hwcaps
support for LD_LIBRARY_PATH and, for this, it adjusted the total
string size required in _dl_important_hwcaps. However, in doing so
it inadvertently altered the calculation of the size required for
the power set strings, as the computation of the power set string
size depended on the first value assigned to the total variable,
which is later shifted, resulting in overallocation of string
space. Fix this now by using a different variable to hold the
string size required for glibc-hwcaps.

Signed-off-by: Javier Pello <devel@otheo.eu>
(cherry picked from commit a23820f6052a740246fdc7dcd9c43ce8eed0c45a)

Use __ehdr_start rather than _begin in _dl_start_final

__ehdr_start is already used in rltld.c:dl_main, and can serve the
same purpose as _begin. Besides tidying the code, using linker
defined section relative symbols rather than "-defsym _begin=0" better
reflects the intent of _dl_start_final use of _begin, which is to
refer to the load address of ld.so rather than absolute address zero.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 6f043e0ee7e477f50a44024ed0cb579d5e3f511d)

elf: Run tst-audit-tlsdesc, tst-audit-tlsdesc-dlopen everywhere

The test is valid for all TLS models, but we want to make a reasonable
effort to test the GNU2 model specifically. For example, aarch64
defaults to GNU2, but does not have -mtls-dialect=gnu2, and the test
was not run there.

Suggested-by: Martin Coufal <mcoufal@redhat.com>
(cherry picked from commit dd2315a866a4ac2b838ea1cb10c5ea1c35d51a2f)

Fixes early backport commit 577c2fc7f34eeaaa2ecec94108bb3f0417383a32
("elf: Call __libc_early_init for reused namespaces (bug 29528)");
it had a wrong conflict resolution.

nscd: Fix netlink cache invalidation if epoll is used [BZ #29415]

Processes cache network interface information such as whether IPv4 or IPv6
are enabled. This is only checked again if the "netlink timestamp" provided
by nscd changed, which is triggered by netlink socket activity.

However, in the epoll handler for the netlink socket, it was missed to
assign the new timestamp to the nscd database. The handler for plain poll
did that properly, copy that over.

This bug caused that e.g. processes which started before network
configuration got unusuable addresses from getaddrinfo, like IPv6 only even
though only IPv4 is available:
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1041

It's a bit hard to reproduce, so I verified this by checking the timestamp
on calls to __check_pf manually. Without this patch it's stuck at 1, now
it's increasing on network changes as expected.

Signed-off-by: Fabian Vogt <fvogt@suse.de>
(cherry picked from commit 02ca25fef2785974011e9c5beecc99b900b69fd7)

Apply asm redirections in wchar.h before first use

Similar to d0fa09a770, but for wchar.h. Fixes [BZ #27087] by applying
all long double related asm redirections before using functions in
bits/wchar2.h.
Moves the function declarations from wcsmbs/bits/wchar2.h to a new file
wcsmbs/bits/wchar2-decl.h that will be included first in wcsmbs/wchar.h.

Tested with build-many-glibcs.py.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit c7509d49c4e8fa494120c5ead21338559dad16f5)

elf: Call __libc_early_init for reused namespaces (bug 29528)

libc_map is never reset to NULL, neither during dlclose nor on a
dlopen call which reuses the namespace structure. As a result, if a
namespace is reused, its libc is not initialized properly. The most
visible result is a crash in the <ctype.h> functions.

To prevent similar bugs on namespace reuse from surfacing,
unconditionally initialize the chosen namespace to zero using memset.

(cherry picked from commit d0e357ff45a75553dee3b17ed7d303bfa544f6fe)

NEWS: Add entry for bug 28846

socket: Check lengths before advancing pointer in CMSG_NXTHDR

The inline and library functions that the CMSG_NXTHDR macro may expand
to increment the pointer to the header before checking the stride of
the increment against available space. Since C only allows incrementing
pointers to one past the end of an array, the increment must be done
after a length check. This commit fixes that and includes a regression
test for CMSG_FIRSTHDR and CMSG_NXTHDR.

The Linux, Hurd, and generic headers are all changed.

Tested on Linux on armv7hl, i686, x86_64, aarch64, ppc64le, and s390x.

[BZ #28846]

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 9c443ac4559a47ed99859bd80d14dc4b6dd220a1)

alpha: Fix generic brk system call emulation in __brk_call (bug 29490)

The kernel special-cases the zero argument for alpha brk, and we can
use that to restore the generic Linux error handling behavior.

Fixes commit b57ab258c1140bc45464b4b9908713e3e0ee35aa ("Linux:
Introduce __brk_call for invoking the brk system call").

(cherry picked from commit e7ad26ee3cb74e61d0637c888f24dd478d77af58)

stdlib: Fixup mbstowcs NULL __dst handling. [BZ #29279]

commit 464d189b9622932a75302290625de84931656ec0 (origin/master, origin/HEAD)
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 22 08:24:21 2022 -0700

stdlib: Remove attr_write from mbstows if dst is NULL [BZ: 29265]

Incorrectly called `__mbstowcs_chk` in the NULL __dst case which is
incorrect as in the NULL __dst case we are explicitly skipping
the objsize checks.

As well, remove the `__always_inline` attribute which exists in
`__fortify_function`.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 220b83d83d32aa9e6f5659e2fa2a63a0024c3e4a)

stdlib: Remove attr_write from mbstows if dst is NULL [BZ: 29265]

mbstows is defined if dst is NULL and is defined to special cased if
dst is NULL so the fortify objsize check if incorrect in that case.

Tested on x86-64 linux.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 464d189b9622932a75302290625de84931656ec0)

Update syscall lists for Linux 5.19

Linux 5.19 has no new syscalls, but enables memfd_secret in the uapi
headers for RISC-V. Update the version number in syscall-names.list
to reflect that it is still current for 5.19 and regenerate the
arch-syscall.h headers with build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit fccadcdf5bed7ee67a6cef4714e0b477d6c8472c)

riscv: Update rv64 libm test ulps

Generated on a Microsemi Polarfire Icicle Kit running Linux version
5.15.32. Same ULPs were also produced on QEMU 5.2.0 running Linux
5.18.0.

(cherry picked from commit 7c5db7931f940a0de9d39b566f6fef41148491c0)

dlfcn: Pass caller pointer to static dlopen implementation (bug 29446)

Fixes commit 0c1c3a771eceec46e66ce1183cf988e2303bd373 ("dlfcn: Move
dlopen into libc").

(cherry picked from commit ed0185e4129130cbe081c221efb758fb400623ce)

malloc: Simplify implementation of __malloc_assert

It is prudent not to run too much code after detecting heap
corruption, and __fxprintf is really complex. The line number
and file name do not carry much information, so it is not included
in the error message. (__libc_message only supports %s formatting.)
The function name and assertion should provide some context.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit ac8047cdf326504f652f7db97ec96c0e0cee052f)

Update syscall-names.list for Linux 5.18

Linux 5.18 has no new syscalls. Update the version number in
syscall-names.list to reflect that it is still current for 5.18.

Tested with build-many-glibcs.py.

(cherry picked from commit 3d9926663cba19f40d26d8a8ab3b2a7cc09ffb13)

Apply asm redirections in stdio.h before first use [BZ #27087]

Compilers may not be able to apply asm redirections to functions after
these functions are used for the first time, e.g. clang 13.
Fix [BZ #27087] by applying all long double-related asm redirections
before using functions in bits/stdio.h.
However, as these asm redirections depend on the declarations provided
by libio/bits/stdio2.h, this header was split in 2:

- libio/bits/stdio2-decl.h contains all function declarations;
- libio/bits/stdio2.h remains with the remaining contents, including
redirections.

This also adds the access attribute to __vsnprintf_chk that was missing.

Tested with build-many-glibcs.py.

Reviewed-by: Paul E. Murphy <murphyp@linux.ibm.com>
(cherry picked from commit d0fa09a7701956036ff36f8ca188e9fff81553d8)

x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S

Was missing to for the multiarch build rtld-strncmp-sse4_2.os was
being built and exporting symbols:

build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42

Introduced in:

commit 11ffcacb64a939c10cfc713746b8ec88837f5c4a
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Wed Jun 21 12:10:50 2017 -0700

x86-64: Implement strcmp family IFUNC selectors in C

(cherry picked from commit 96ac447d915ea5ecef3f9168cc13f4e731349a3b)

x86: Move mem{p}{mov|cpy}_{chk_}erms to its own file

The primary memmove_{impl}_unaligned_erms implementations don't
interact with this function. Putting them in same file both
wastes space and unnecessarily bloats a hot code section.

(cherry picked from commit 21925f64730d52eb7d8b2fb62b412f8ab92b0caf)

x86: Move and slightly improve memset_erms

Implementation wise:
    1. Remove the VZEROUPPER as memset_{impl}_unaligned_erms does not
       use the L(stosb) label that was previously defined.

    2. Don't give the hotpath (fallthrough) to zero size.

Code positioning wise:

Move memset_{chk}_erms to its own file.  Leaving it in between the
memset_{impl}_unaligned both adds unnecessary complexity to the
file and wastes space in a relatively hot cache section.

(cherry picked from commit 4a3f29e7e475dd4e7cce2a24c187e6fb7b5b0a05)

x86: Add definition for __wmemset_chk AVX2 RTM in ifunc impl list

This was simply missing and meant we weren't testing it properly.

(cherry picked from commit 2a1099020cdc1e4c9c928156aa85c8cf9d540291)

x86: Put wcs{n}len-sse4.1 in the sse4.1 text section

Previously was missing but the two implementations shouldn't get in
the sse2 (generic) text section.

(cherry picked from commit afc6e4328ff80973bde50d5401691b4c4b2e522c)

x86: Align entry for memrchr to 64-bytes.

The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.

As well different code boths where explicitly written to touch the
minimum number of cache line i.e sizes <= 32 touch only the entry
cache line.

(cherry picked from commit 227afaa67213efcdce6a870ef5086200f1076438)

x86: Add BMI1/BMI2 checks for ISA_V3 check

BMI1/BMI2 are part of the ISA V3 requirements:
https://en.wikipedia.org/wiki/X86-64

And defined by GCC when building with `-march=x86-64-v3`

(cherry picked from commit 8da9f346cb2051844348785b8a932ec44489e0b7)

x86: Cleanup bounds checking in large memcpy case

1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.

(cherry picked from commit 89a25c6f64746732b87eaf433af0964b564d4a92)

x86: Add bounds `x86_non_temporal_threshold`

The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memset_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
right-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.

(cherry picked from commit b446822b6ae4e8149902a78cdd4a886634ad6321)

x86: Add sse42 implementation to strcmp's ifunc

This has been missing since the the ifuncs where added.

The performance of SSE4.2 is preferable to to SSE2.

Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906

(cherry picked from commit ff439c47173565fbff4f0f78d07b0f14e4a7db05)

x86: Fix misordered logic for setting `rep_movsb_stop_threshold`

Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold)

(cherry picked from commit 035591551400cfc810b07244a015c9411e8bff7c)

x86: Align varshift table to 32-bytes

This ensures the load will never split a cache line.

(cherry picked from commit 0f91811333f23b61cf681cab2704b35a0a073b97)

x86: ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST expect no transactions

Give fall-through path to `vzeroupper` and taken-path to `vzeroall`.

Generally even on machines with RTM the expectation is the
string-library functions will not be called in transactions.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c28db9cb29a7d6cf3ce08fd8445e6b7dea03f35b)

x86: Shrink code size of memchr-evex.S

This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 64 bytes

There are no non-negligible changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 1.000

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 56da3fe1dd075285fa8186d44b3c28e68c687e62)

x86: Shrink code size of memchr-avx2.S

This is not meant as a performance optimization. The previous code was
far to liberal in aligning targets and wasted code size unnecissarily.

The total code size saving is: 59 bytes

There are no major changes in the benchmarks.
Geometric Mean of all benchmarks New / Old: 0.967

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35)

x86: Fix page cross case in rawmemchr-avx2 [BZ #29234]

commit 6dcbb7d95dded20153b12d76d2f4e0ef0cda4f35
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jun 6 21:11:33 2022 -0700

x86: Shrink code size of memchr-avx2.S

Changed how the page cross case aligned string (rdi) in
rawmemchr. This was incompatible with how
`L(cross_page_continue)` expected the pointer to be aligned and
would cause rawmemchr to read data start started before the
beginning of the string. What it would read was in valid memory
but could count CHAR matches resulting in an incorrect return
value.

This commit fixes that issue by essentially reverting the changes to
the L(page_cross) case as they didn't really matter.

Test cases added and all pass with the new code (and where confirmed
to fail with the old code).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2c9af8421d2b4a7fcce163e7bc81a118d22fd346)

x86: Optimize memrchr-avx2.S

The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 306 bytes
Geometric Mean of all benchmarks New / Old: 0.760

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
10-20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 15-45% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit af5306a735eb0966fdc2f8ccdafa8888e2df0c87)

x86: Optimize memrchr-evex.S

The new code:
    1. prioritizes smaller user-arg lengths more.
    2. optimizes target placement more carefully
    3. reuses logic more
    4. fixes up various inefficiencies in the logic. The biggest
       case here is the `lzcnt` logic for checking returns which
       saves either a branch or multiple instructions.

The total code size saving is: 263 bytes
Geometric Mean of all benchmarks New / Old: 0.755

Regressions:
There are some regressions. Particularly where the length (user arg
length) is large but the position of the match char is near the
beginning of the string (in first VEC). This case has roughly a
20% regression.

This is because the new logic gives the hot path for immediate matches
to shorter lengths (the more common input). This case has roughly
a 35% speedup.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b4209615a06b01c974f47b4998b00e4c7b1aa5d9)

x86: Optimize memrchr-sse2.S

The new code:
    1. prioritizes smaller lengths more.
    2. optimizes target placement more carefully.
    3. reuses logic more.
    4. fixes up various inefficiencies in the logic.

The total code size saving is: 394 bytes
Geometric Mean of all benchmarks New / Old: 0.874

Regressions:
    1. The page cross case is now colder, especially re-entry from the
       page cross case if a match is not found in the first VEC
       (roughly 50%). My general opinion with this patch is this is
       acceptable given the "coldness" of this case (less than 4%) and
       generally performance improvement in the other far more common
       cases.

    2. There are some regressions 5-15% for medium/large user-arg
       lengths that have a match in the first VEC. This is because the
       logic was rewritten to optimize finds in the first VEC if the
       user-arg length is shorter (where we see roughly 20-50%
       performance improvements). It is not always the case this is a
       regression. My intuition is some frontend quirk is partially
       explaining the data although I haven't been able to find the
       root cause.

Full xcheck passes on x86_64.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 731feee3869550e93177e604604c1765d81de571)

x86: Add COND_VZEROUPPER that can replace vzeroupper if no `ret`

The RTM vzeroupper mitigation has no way of replacing inline
vzeroupper not before a return.

This can be useful when hoisting a vzeroupper to save code size
for example:

```
L(foo):
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
VZEROUPPER_RETURN

L(bar):
xorl %eax, %eax
VZEROUPPER_RETURN
```

Can become:

```
L(foo):
COND_VZEROUPPER
cmpl %eax, %edx
jz L(bar)
tzcntl %eax, %eax
addq %rdi, %rax
ret

L(bar):
xorl %eax, %eax
ret
```

This code does not change any existing functionality.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit dd5c483b2598f411428df4d8864c15c4b8a3cd68)

x86: Create header for VEC classes in x86 strings library

This patch does not touch any existing code and is only meant to be a
tool for future patches so that simple source files can more easily be
maintained to target multiple VEC classes.

There is no difference in the objdump of libc.so before and after this
patch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8a780a6b910023e71f3173f37f0793834c047554)

x86_64: Add strstr function with 512-bit EVEX

Adding a 512-bit EVEX version of strstr. The algorithm works as follows:

(1) We spend a few cycles at the begining to peek into the needle. We
locate an edge in the needle (first occurance of 2 consequent distinct
characters) and also store the first 64-bytes into a zmm register.

(2) We search for the edge in the haystack by looking into one cache
line of the haystack at a time. This avoids having to read past a page
boundary which can cause a seg fault.

(3) If an edge is found in the haystack we first compare the first
64-bytes of the needle (already stored in a zmm register) before we
proceed with a full string compare performed byte by byte.

Benchmarking results: (old = strstr_sse2_unaligned, new = strstr_avx512)

Geometric mean of all benchmarks: new / old =  0.66

Difficult skiptable(0) : new / old =  0.02
Difficult skiptable(1) : new / old =  0.01
Difficult 2-way : new / old =  0.25
Difficult testing first 2 : new / old =  1.26
Difficult skiptable(0) : new / old =  0.05
Difficult skiptable(1) : new / old =  0.06
Difficult 2-way : new / old =  0.26
Difficult testing first 2 : new / old =  1.05
Difficult skiptable(0) : new / old =  0.42
Difficult skiptable(1) : new / old =  0.24
Difficult 2-way : new / old =  0.21
Difficult testing first 2 : new / old =  1.04
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5082a287d5e9a1f9cb98b7c982a708a3684f1d5c)

x86: Remove __mmask intrinsics in strstr-avx512.c

The intrinsics are not available before GCC7 and using standard
operators generates code of equivalent or better quality.

Removed:
    _cvtmask64_u64
    _kshiftri_mask64
    _kand_mask64

Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958

(cherry picked from commit f2698954ff9c2f9626d4bcb5a30eb5729714e0b0)

x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT

According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.

Reviewed-by: Fangrui Song <maskray@google.com>
(cherry picked from commit f8587a61892cbafd98ce599131bf4f103466f084)

x86_64: Implement evex512 version of strlen, strnlen, wcslen and wcsnlen

This patch implements following evex512 version of string functions.
Perf gain for evex512 version is up to 50% as compared to evex,
depending on length and alignment.

Placeholder function, not used by any processor at the moment.

- String length function using 512 bit vectors.
- String N length using 512 bit vectors.
- Wide string length using 512 bit vectors.
- Wide string N length using 512 bit vectors.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 9c66efb86fe384f77435f7e326333fb2e4e10676)

x86_64: Remove bzero optimization

Both symbols are marked as legacy in POSIX.1-2001 and removed on
POSIX.1-2008, although the prototypes are defined for _GNU_SOURCE
or _DEFAULT_SOURCE.

GCC also replaces bcopy with a memmove and bzero with memset on default
configuration (to actually get a bzero libc call the code requires
to omit string.h inclusion and built with -fno-builtin), so it is
highly unlikely programs are actually calling libc bzero symbol.

On a recent Linux distro (Ubuntu 22.04), there is no bzero calls
by the installed binaries.

  $ cat count_bstring.sh
  #!/bin/bash

  files=`IFS=':';for i in $PATH; do test -d "$i" && find "$i" -maxdepth 1 -executable -type f; done`
  total=0
  for file in $files; do
    symbols=`objdump -R $file 2>&1`
    if [ $? -eq 0 ]; then
      ncalls=`echo $symbols | grep -w $1 | wc -l`
      ((total=total+ncalls))
      if [ $ncalls -gt 0 ]; then
        echo "$file: $ncalls"
      fi
    fi
  done
  echo "TOTAL=$total"
  $ ./count_bstring.sh bzero
  TOTAL=0

Checked on x86_64-linux-gnu.

(cherry picked from commit 9403b71ae97e3f1a91c796ddcbb4e6f044434734)

nptl: Fix ___pthread_unregister_cancel_restore asynchronous restore

This was due a wrong revert done on 404656009b459658.

Checked on x86_64-linux-gnu and i686-linux-gnu.

(cherry picked from commit f27e5e21787abc9f719879af47687221aa1027b3)

linux: Fix mq_timereceive check for 32 bit fallback code (BZ 29304)

On success, mq_receive() and mq_timedreceive() return the number of
bytes in the received message, so it requires to check if the value
is larger than 0.

Checked on i686-linux-gnu.

(cherry picked from commit 71d87d85bf54f6522813aec97c19bdd24997341e)

nss: handle stat failure in check_reload_and_get (BZ #28752)

Skip the chroot test if the database isn't loaded
correctly (because the chroot test uses some
existing DB state).

The __stat64_time64 -> fstatat call can fail if
running under an (aggressive) seccomp filter,
like Firefox seems to use.

This manifested in a crash when using glib built
with FAM support with such a Firefox build.

Suggested-by: DJ Delorie <dj@redhat.com>
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit ace9e3edbca62d978b1e8f392d8a5d78500272d9)

nss: add assert to DB_LOOKUP_FCT (BZ #28752)

It's interesting if we have a null action list,
so an assert is worthwhile.

Suggested-by: DJ Delorie <dj@redhat.com>
Signed-off-by: Sam James <sam@gentoo.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 3fdf0a205b622e40fa7e3c4ed1e4ed4d5c6c5380)

nios2: Remove _dl_skip_args usage (BZ# 29187)

Since ad43cac44a the generic code already shuffles the argv/envp/auxv
on the stack to remove the ld.so own arguments and thus _dl_skip_args
is always 0. So there is no need to adjust the argc or argv.

Checked with qemu-user that arguments are correctly passed on both
constructors and main program.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4868ba5d257a7fb415674e79c4ae5a3af2827f55)

hppa: Remove _dl_skip_args usage (BZ# 29165)

Different than other architectures, hppa creates an unrelated stack
frame where ld.so argc/argv adjustments done by ad43cac44a6860eaefc
is not done on the argc/argv saved/restore by _dl_start_user.

Instead load _dl_argc and _dl_argv directlty instead of adjust them
using _dl_skip_args value.

Checked on hppa-linux-gnu.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 6242602273feb8d68cd51cff0ad21b3c8ee11fc6)

NEWS: Add a bug fix entry for BZ #29225

nptl: Fix __libc_cleanup_pop_restore asynchronous restore (BZ#29214)

This was due a wrong revert done on 404656009b459658.

Checked on x86_64-linux-gnu.

(cherry picked from commit c7d36dcecc08a29825175f65c4ee873ff3177a23)

powerpc: Fix VSX register number on __strncpy_power9 [BZ #29197]

__strncpy_power9 initializes VR 18 with zeroes to be used throughout the
code, including when zero-padding the destination string. However, the
v18 reference was mistakenly being used for stxv and stxvl, which take a
VSX vector as operand. The code ended up using the uninitialized VSR 18
register by mistake.

Both occurrences have been changed to use the proper VSX number for VR 18
(i.e. VSR 50).

Tested on powerpc, powerpc64 and powerpc64le.

Signed-off-by: Kewen Lin <linkw@gcc.gnu.org>
(cherry picked from commit 0218463dd8265ed937622f88ac68c7d984fe0cfc)

socket: Fix mistyped define statement in socket/sys/socket.h (BZ #29225)

(cherry picked from commit 999835533bc60fbd0b0b65d2412a6742e5a54b9d)

iconv: Use 64 bit stat for gconv_parseconfdir (BZ# 29213)

The issue is only when used within libc.so (iconvconfig already builds
with _TIME_SIZE=64).

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit c789e6e40974e2b67bd33a17f29b20dce6ae8822)

catgets: Use 64 bit stat for __open_catalog (BZ# 29211)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 634f566c3e20a8a620dbd869a0089e33c105a3ea)

inet: Use 64 bit stat for ruserpass (BZ# 29210)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 3cd4785ea02cc3878bf21996cf9b61b3a306447e)

socket: Use 64 bit stat for isfdtype (BZ# 29209)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 87f1ec12e79a3895b33801fa816884f0d24ae7ef)

posix: Use 64 bit stat for fpathconf (_PC_ASYNC_IO) (BZ# 29208)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 6e7137f28c9d743d66b5a1cb8fa0d1717b96f853)

posix: Use 64 bit stat for posix_fallocate fallback (BZ# 29207)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 574ba60fc8a7fb35e6216e2fdecc521acab7ffd2)

misc: Use 64 bit stat for getusershell (BZ# 29204)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit ec995fb2152f160f02bf695ff83c45df4a6cd868)

misc: Use 64 bit stat for daemon (BZ# 29203)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 3fbc33010c76721d34f676d8efb45bcc54e0d575)

Fix deadlock when pthread_atfork handler calls pthread_atfork or dlclose

In multi-threaded programs, registering via pthread_atfork,
de-registering implicitly via dlclose, or running pthread_atfork
handlers during fork was protected by an internal lock.  This meant
that a pthread_atfork handler attempting to register another handler or
dlclose a dynamically loaded library would lead to a deadlock.

This commit fixes the deadlock in the following way:

During the execution of handlers at fork time, the atfork lock is
released prior to the execution of each handler and taken again upon its
return.  Any handler registrations or de-registrations that occurred
during the execution of the handler are accounted for before proceeding
with further handler execution.

If a handler that hasn't been executed yet gets de-registered by another
handler during fork, it will not be executed.   If a handler gets
registered by another handler during fork, it will not be executed
during that particular fork.

The possibility that handlers may now be registered or deregistered
during handler execution means that identifying the next handler to be
run after a given handler may register/de-register others requires some
bookkeeping.  The fork_handler struct has an additional field, 'id',
which is assigned sequentially during registration.  Thus, handlers are
executed in ascending order of 'id' during 'prepare', and descending
order of 'id' during parent/child handler execution after the fork.

Two tests are included:

* tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org>
  This test exercises calling dlclose from prepare, parent, and child
  handlers.

* tst-atfork4: This test exercises calling pthread_atfork and dlclose
  from the prepare handler.

[BZ #24595, BZ #27054]

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 52a103e237329b9f88a28513fe7506ffc3bd8ced)

x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #29127]

Re-cherry-pick commit c627209832 for strcmp-avx2.S change which was
omitted in intial cherry pick because at the time this bug was not
present on release branch.

Fixes BZ #29127.

In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.

Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)

string.h: fix __fortified_attr_access macro call [BZ #29162]

commit e938c0274 "Don't add access size hints to fortifiable functions"
converted a few '__attr_access ((...))' into '__fortified_attr_access (...)'
calls.

But one of conversions had double parentheses of '__fortified_attr_access (...)'.

Noticed as a gnat6 build failure:

/<<NIX>>-glibc-2.34-210-dev/include/bits/string_fortified.h:110:50: error: macro "__fortified_attr_access" requires 3 arguments, but only 1 given

The change fixes parentheses.

This is seen when using compilers that do not support
__builtin___stpncpy_chk, e.g. gcc older than 4.7, clang older than 2.6
or some compiler not derived from gcc or clang.

Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 5a5f94af0542f9a35aaa7992c18eb4e2403a29b9)

linux: Add a getauxval test [BZ #23293]

This is for bug 23293 and it relies on the glibc test system running
tests via explicit ld.so invokation by default.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9faf5262c77487c96da8a3e961b88c0b1879e186)

rtld: Use generic argv adjustment in ld.so [BZ #23293]

When an executable is invoked as

  ./ld.so [ld.so-args] ./exe [exe-args]

then the argv is adujusted in ld.so before calling the entry point of
the executable so ld.so args are not visible to it.  On most targets
this requires moving argv, env and auxv on the stack to ensure correct
stack alignment at the entry point.  This had several issues:

- The code for this adjustment on the stack is written in asm as part
  of the target specific ld.so _start code which is hard to maintain.

- The adjustment is done after _dl_start returns, where it's too late
  to update GLRO(dl_auxv), as it is already readonly, so it points to
  memory that was clobbered by the adjustment. This is bug 23293.

- _environ is also wrong in ld.so after the adjustment, but it is
  likely not used after _dl_start returns so this is not user visible.

- _dl_argv was updated, but for this it was moved out of relro, which
  changes security properties across targets unnecessarily.

This patch introduces a generic _dl_start_args_adjust function that
handles the argument adjustments after ld.so processed its own args
and before relro protection is applied.

The same algorithm is used on all targets, _dl_skip_args is now 0, so
existing target specific adjustment code is no longer used.  The bug
affects aarch64, alpha, arc, arm, csky, ia64, nios2, s390-32 and sparc,
other targets don't need the change in principle, only for consistency.

The GNU Hurd start code relied on _dl_skip_args after dl_main returned,
now it checks directly if args were adjusted and fixes the Hurd startup
data accordingly.

Follow up patches can remove _dl_skip_args and DL_ARGV_NOT_RELRO.

Tested on aarch64-linux-gnu and cross tested on i686-gnu.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit ad43cac44a6860eaefcadadfb2acb349921e96bf)

S390: Enable static PIE

This commit enables static PIE on 64bit.  On 31bit, static PIE is
not supported.

A new configure check in sysdeps/s390/s390-64/configure.ac also performs
a minimal test for requirements in ld:
Ensure you also have those patches for:
- binutils (ld)
  - "[PR ld/22263] s390: Avoid dynamic TLS relocs in PIE"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=26b1426577b5dcb32d149c64cca3e603b81948a9
    (Tested by configure check above)
    Otherwise there will be a R_390_TLS_TPOFF relocation, which fails to
    be processed in _dl_relocate_static_pie() as static TLS map is not setup.
  - "s390: Add DT_JMPREL pointing to .rela.[i]plt with static-pie"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d942d8db12adf4c9e5c7d9ed6496a779ece7149e
    (We can't test it in configure as we are not able to link a static PIE
    executable if the system glibc lacks static PIE support)
    Otherwise there won't be DT_JMPREL, DT_PLTRELA, DT_PLTRELASZ entries
    and the IFUNC symbols are not processed, which leads to crashes.

- kernel (the mentioned links to the commits belong to 5.19 merge window):
  - "s390/mmap: increase stack/mmap gap to 128MB"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=f2f47d0ef72c30622e62471903ea19446ea79ee2
  - "s390/vdso: move vdso mapping to its own function"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=57761da4dc5cd60bed2c81ba0edb7495c3c740b8
  - "s390/vdso: map vdso above stack"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=9e37a2e8546f9e48ea76c839116fa5174d14e033
  - "s390/vdso: add vdso randomization"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=41cd81abafdc4e58a93fcb677712a76885e3ca25
  (We can't test the kernel of the target system)
  Otherwise if /proc/sys/kernel/randomize_va_space is turned off (0),
  static PIE executables like ldconfig will crash.  While startup sbrk is
  used to enlarge the HEAP.  Unfortunately the underlying brk syscall fails
  as there is not enough space after the HEAP.  Then the address of the TLS
  image is invalid and the following memcpy in __libc_setup_tls() leads
  to a segfault.
  If /proc/sys/kernel/randomize_va_space is activated (default: 2), there
  is enough space after HEAP.

- glibc
  - "Linux: Define MMAP_CALL_INTERNAL"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=c1b68685d438373efe64e5f076f4215723004dfb
  - "i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=6e5c7a1e262961adb52443ab91bd2c9b72316402
  - "i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=60f0f2130d30cfd008ca39743027f1e200592dff
  - "ia64: Always define IA64_USE_NEW_STUB as a flag macro"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=18bd9c3d3b1b6a9182698c85354578d1d58e9d64
  - "Linux: Implement a useful version of _startup_fatal"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=a2a6bce7d7e52c1c34369a7da62c501cc350bc31
  - "Linux: Introduce __brk_call for invoking the brk system call"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=b57ab258c1140bc45464b4b9908713e3e0ee35aa
  - "csu: Implement and use _dl_early_allocate during static startup"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=f787e138aa0bf677bf74fa2a08595c446292f3d7
  The mentioned patch series by Florian Weimer avoids the mentioned failing
  sbrk syscall by falling back to mmap.

This commit also adjusts startup code in start.S to be ready for static PIE.
We have to add a wrapper function for main as we are not allowed to use
GOT relocations before __libc_start_main is called.
(Compare also to:
- commit 14d886edbd3d80b771e1c42fbd9217f9074de9c6
  "aarch64: fix start code for static pie"
- commit 3d1d79283e6de4f7c434cb67fb53a4fd28359669
  "aarch64: fix static pie enabled libc when main is in a shared library"
)

(cherry picked from commit 728894dba4a19578bd803906de184a8dd51ed13c)

csu: Implement and use _dl_early_allocate during static startup

This implements mmap fallback for a brk failure during TLS
allocation.

scripts/tls-elf-edit.py is updated to support the new patching method.
The script no longer requires that in the input object is of ET_DYN
type.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit f787e138aa0bf677bf74fa2a08595c446292f3d7)

Linux: Introduce __brk_call for invoking the brk system call

Alpha and sparc can now use the generic implementation.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit b57ab258c1140bc45464b4b9908713e3e0ee35aa)

Linux: Implement a useful version of _startup_fatal

On i386 and ia64, the TCB is not available at this point.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit a2a6bce7d7e52c1c34369a7da62c501cc350bc31)

ia64: Always define IA64_USE_NEW_STUB as a flag macro

And keep the previous definition if it exists. This allows
disabling IA64_USE_NEW_STUB while keeping USE_DL_SYSINFO defined.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 18bd9c3d3b1b6a9182698c85354578d1d58e9d64)

Linux: Define MMAP_CALL_INTERNAL

Unlike MMAP_CALL, this avoids a TCB dependency for an errno update
on failure.

<mmap_internal.h> cannot be included as is on several architectures
due to the definition of page_unit, so introduce a separate header
file for the definition of MMAP_CALL and MMAP_CALL_INTERNAL,
<mmap_call.h>.

Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
(cherry picked from commit c1b68685d438373efe64e5f076f4215723004dfb)

i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls

Introduce an int-80h-based version of __libc_do_syscall and use
it if I386_USE_SYSENTER is defined as 0.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 60f0f2130d30cfd008ca39743027f1e200592dff)

i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S

After commit a78e6a10d0b50d0ca80309775980fc99944b1727
("i386: Remove broken CAN_USE_REGISTER_ASM_EBP (bug 28771)"),
it is never defined.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6e5c7a1e262961adb52443ab91bd2c9b72316402)

elf: Remove __libc_init_secure

After 73fc4e28b9464f0e13edc719a5372839970e7ddb,
__libc_enable_secure_decided is always 0 and a statically linked
executable may overwrite __libc_enable_secure without considering
AT_SECURE.

The __libc_enable_secure has been correctly initialized in _dl_aux_init,
so just remove __libc_enable_secure_decided and __libc_init_secure.
This allows us to remove some startup_get*id functions from
22b79ed7f413cd980a7af0cf258da5bf82b6d5e5.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 3e9acce8c50883b6cd8a3fb653363d9fa21e1608)

Linux: Consolidate auxiliary vector parsing (redo)

And optimize it slightly.

This is commit 8c8510ab2790039e58995ef3a22309582413d3ff revised.

In _dl_aux_init in elf/dl-support.c, use an explicit loop
and -fno-tree-loop-distribute-patterns to avoid memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 73fc4e28b9464f0e13edc719a5372839970e7ddb)

Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED

Otherwise, <dl-auxv.h> on POWER ends up being included twice,
once in dl-sysdep.c, once in dl-support.c. That leads to a linker
failure due to multiple definitions of _dl_cache_line_size.

Fixes commit d96d2995c1121d3310102afda2deb1f35761b5e6
("Revert "Linux: Consolidate auxiliary vector parsing").

(cherry picked from commit 098c795e85fbd05c5ef59c2d0ce59529331bea27)

Revert "Linux: Consolidate auxiliary vector parsing"

This reverts commit 8c8510ab2790039e58995ef3a22309582413d3ff. The
revert is not perfect because the commit included a bug fix for
_dl_sysdep_start with an empty argv, introduced in commit
2d47fa68628e831a692cba8fc9050cef435afc5e ("Linux: Remove
DL_FIND_ARG_COMPONENTS"), and this bug fix is kept.

The revert is necessary because the reverted commit introduced an
early memset call on aarch64, which leads to crash due to lack of TCB
initialization.

(cherry picked from commit d96d2995c1121d3310102afda2deb1f35761b5e6)

Linux: Consolidate auxiliary vector parsing

And optimize it slightly.

The large switch statement in _dl_sysdep_start can be replaced with
a large array.  This reduces source code and binary size.  On
i686-linux-gnu:

Before:

   text    data     bss     dec     hex filename
   7791      12       0    7803    1e7b elf/dl-sysdep.os

After:

   text    data     bss     dec     hex filename
   7135      12       0    7147    1beb elf/dl-sysdep.os

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 8c8510ab2790039e58995ef3a22309582413d3ff)

Linux: Assume that NEED_DL_SYSINFO_DSO is always defined

The definition itself is still needed for generic code.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit f19fc997a5754a6c0bb9e43618f0597e878061f7)

Linux: Remove DL_FIND_ARG_COMPONENTS

The generic definition is always used since the Native Client
port has been removed.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 2d47fa68628e831a692cba8fc9050cef435afc5e)

Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE

They are always defined.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit b9c3d3382f6f50e9723002deb2dc8127de720fa6)

elf: Merge dl-sysdep.c into the Linux version

The generic version is the de-facto Linux implementation. It
requires an auxiliary vector, so Hurd does not use it.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 91c0a47ffb66e7cd802de870686465db3b3976a0)

x86: Optimize {str|wcs}rchr-evex

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c966099cdc3e0fdf92f63eac09b22fa7e5f5f02d)

x86: Optimize {str|wcs}rchr-avx2

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit df7e295d18ffa34f629578c0017a9881af7620f6)

x86: Optimize {str|wcs}rchr-sse2

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5307aa9c1800f36a64c183c091c9af392c1fa75c)

x86-64: Fix SSE2 memcmp and SSSE3 memmove for x32

Clear the upper 32 bits in RDX (memory size) for x32 to fix

FAIL: string/tst-size_t-memcmp
FAIL: string/tst-size_t-memcmp-2
FAIL: string/tst-size_t-memcpy
FAIL: wcsmbs/tst-size_t-wmemcmp

on x32 introduced by

8804157ad9 x86: Optimize memcmp SSE2 in memcmp.S
26b2478322 x86: Reduce code size of mem{move|pcpy|cpy}-ssse3

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8ea20ee5f6145de4bff9481d3e09ac36ba9df8f3)

x86: Fix missing __wmemcmp def for disable-multiarch build

commit 8804157ad9da39631703b92315460808eac86b0c
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Fri Apr 15 12:27:59 2022 -0500

x86: Optimize memcmp SSE2 in memcmp.S

Only defined wmemcmp and missed __wmemcmp. This commit fixes that by
defining __wmemcmp and setting wmemcmp as a weak alias to __wmemcmp.

Both multiarch and disable-multiarch builds succeed and full xchecks
pass.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c72a1a062a1ded52719802c07ab459e1fd54d2a6)

x86: Cleanup page cross code in memcmp-avx2-movbe.S

Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 23102686ec67b856a2d4fd25ddaa1c0b8d175c4f)

x86: Remove memcmp-sse4.S

Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
128,      4,      8,   0,             0.858
128,      4,      8,   1,             0.879
128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)

x86: Optimize memcmp SSE2 in memcmp.S

New code save size (-303 bytes) and has significantly better
performance.

geometric_mean(N=20) of page cross cases New / Original: 0.634
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8804157ad9da39631703b92315460808eac86b0c)

x86: Small improvements for wcslen

Just a few QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 244b415d386487521882debb845a040a4758cb18)

x86: Remove AVX str{n}casecmp

The rational is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. AVX2 version covers the majority of targets that previously
   prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all the saving the code size is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 305769b2a15c2e96f9e1b5195d3c4e0d6f0f4b68)

x86: Add EVEX optimized str{n}casecmp

geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 84e7c46df4086873eae28a1fb87d2cf5388b1e16)

x86: Add AVX2 optimized str{n}casecmp

geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bbf81222343fed5cd704001a2ae0d86c71544151)

x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S

Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit d154758e618ec9324f5d339c46db0aa27e8b1226)

x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S

Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 670b54bc585ea4a94f3b2e9272ba44aa6b730b73)