git.ipfire.org Git - thirdparty/glibc.git/log

catgets: Use 64 bit stat for __open_catalog (BZ# 29211)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit c86631de6fa2fb5fa293810c66e53898537a4ddc)

inet: Use 64 bit stat for ruserpass (BZ# 29210)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 3cd4785ea02cc3878bf21996cf9b61b3a306447e)

socket: Use 64 bit stat for isfdtype (BZ# 29209)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 87f1ec12e79a3895b33801fa816884f0d24ae7ef)

posix: Use 64 bit stat for fpathconf (_PC_ASYNC_IO) (BZ# 29208)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 6e7137f28c9d743d66b5a1cb8fa0d1717b96f853)

posix: Use 64 bit stat for posix_fallocate fallback (BZ# 29207)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 574ba60fc8a7fb35e6216e2fdecc521acab7ffd2)

misc: Use 64 bit stat for getusershell (BZ# 29204)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit ec995fb2152f160f02bf695ff83c45df4a6cd868)

misc: Use 64 bit stat for daemon (BZ# 29203)

This is a missing spot initially from 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 3fbc33010c76721d34f676d8efb45bcc54e0d575)

Fix deadlock when pthread_atfork handler calls pthread_atfork or dlclose

In multi-threaded programs, registering via pthread_atfork,
de-registering implicitly via dlclose, or running pthread_atfork
handlers during fork was protected by an internal lock.  This meant
that a pthread_atfork handler attempting to register another handler or
dlclose a dynamically loaded library would lead to a deadlock.

This commit fixes the deadlock in the following way:

During the execution of handlers at fork time, the atfork lock is
released prior to the execution of each handler and taken again upon its
return.  Any handler registrations or de-registrations that occurred
during the execution of the handler are accounted for before proceeding
with further handler execution.

If a handler that hasn't been executed yet gets de-registered by another
handler during fork, it will not be executed.   If a handler gets
registered by another handler during fork, it will not be executed
during that particular fork.

The possibility that handlers may now be registered or deregistered
during handler execution means that identifying the next handler to be
run after a given handler may register/de-register others requires some
bookkeeping.  The fork_handler struct has an additional field, 'id',
which is assigned sequentially during registration.  Thus, handlers are
executed in ascending order of 'id' during 'prepare', and descending
order of 'id' during parent/child handler execution after the fork.

Two tests are included:

* tst-atfork3: Adhemerval Zanella <adhemerval.zanella@linaro.org>
  This test exercises calling dlclose from prepare, parent, and child
  handlers.

* tst-atfork4: This test exercises calling pthread_atfork and dlclose
  from the prepare handler.

[BZ #24595, BZ #27054]

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 52a103e237329b9f88a28513fe7506ffc3bd8ced)

x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #29127]

Re-cherry-pick commit c627209832 for strcmp-avx2.S change which was
omitted in intial cherry pick because at the time this bug was not
present on release branch.

Fixes BZ #29127.

In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.

Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)

string.h: fix __fortified_attr_access macro call [BZ #29162]

commit e938c0274 "Don't add access size hints to fortifiable functions"
converted a few '__attr_access ((...))' into '__fortified_attr_access (...)'
calls.

But one of conversions had double parentheses of '__fortified_attr_access (...)'.

Noticed as a gnat6 build failure:

/<<NIX>>-glibc-2.34-210-dev/include/bits/string_fortified.h:110:50: error: macro "__fortified_attr_access" requires 3 arguments, but only 1 given

The change fixes parentheses.

This is seen when using compilers that do not support
__builtin___stpncpy_chk, e.g. gcc older than 4.7, clang older than 2.6
or some compiler not derived from gcc or clang.

Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 5a5f94af0542f9a35aaa7992c18eb4e2403a29b9)

linux: Add a getauxval test [BZ #23293]

This is for bug 23293 and it relies on the glibc test system running
tests via explicit ld.so invokation by default.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 9faf5262c77487c96da8a3e961b88c0b1879e186)

rtld: Use generic argv adjustment in ld.so [BZ #23293]

When an executable is invoked as

  ./ld.so [ld.so-args] ./exe [exe-args]

then the argv is adujusted in ld.so before calling the entry point of
the executable so ld.so args are not visible to it.  On most targets
this requires moving argv, env and auxv on the stack to ensure correct
stack alignment at the entry point.  This had several issues:

- The code for this adjustment on the stack is written in asm as part
  of the target specific ld.so _start code which is hard to maintain.

- The adjustment is done after _dl_start returns, where it's too late
  to update GLRO(dl_auxv), as it is already readonly, so it points to
  memory that was clobbered by the adjustment. This is bug 23293.

- _environ is also wrong in ld.so after the adjustment, but it is
  likely not used after _dl_start returns so this is not user visible.

- _dl_argv was updated, but for this it was moved out of relro, which
  changes security properties across targets unnecessarily.

This patch introduces a generic _dl_start_args_adjust function that
handles the argument adjustments after ld.so processed its own args
and before relro protection is applied.

The same algorithm is used on all targets, _dl_skip_args is now 0, so
existing target specific adjustment code is no longer used.  The bug
affects aarch64, alpha, arc, arm, csky, ia64, nios2, s390-32 and sparc,
other targets don't need the change in principle, only for consistency.

The GNU Hurd start code relied on _dl_skip_args after dl_main returned,
now it checks directly if args were adjusted and fixes the Hurd startup
data accordingly.

Follow up patches can remove _dl_skip_args and DL_ARGV_NOT_RELRO.

Tested on aarch64-linux-gnu and cross tested on i686-gnu.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit ad43cac44a6860eaefcadadfb2acb349921e96bf)

S390: Enable static PIE

This commit enables static PIE on 64bit.  On 31bit, static PIE is
not supported.

A new configure check in sysdeps/s390/s390-64/configure.ac also performs
a minimal test for requirements in ld:
Ensure you also have those patches for:
- binutils (ld)
  - "[PR ld/22263] s390: Avoid dynamic TLS relocs in PIE"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=26b1426577b5dcb32d149c64cca3e603b81948a9
    (Tested by configure check above)
    Otherwise there will be a R_390_TLS_TPOFF relocation, which fails to
    be processed in _dl_relocate_static_pie() as static TLS map is not setup.
  - "s390: Add DT_JMPREL pointing to .rela.[i]plt with static-pie"
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d942d8db12adf4c9e5c7d9ed6496a779ece7149e
    (We can't test it in configure as we are not able to link a static PIE
    executable if the system glibc lacks static PIE support)
    Otherwise there won't be DT_JMPREL, DT_PLTRELA, DT_PLTRELASZ entries
    and the IFUNC symbols are not processed, which leads to crashes.

- kernel (the mentioned links to the commits belong to 5.19 merge window):
  - "s390/mmap: increase stack/mmap gap to 128MB"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=f2f47d0ef72c30622e62471903ea19446ea79ee2
  - "s390/vdso: move vdso mapping to its own function"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=57761da4dc5cd60bed2c81ba0edb7495c3c740b8
  - "s390/vdso: map vdso above stack"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=9e37a2e8546f9e48ea76c839116fa5174d14e033
  - "s390/vdso: add vdso randomization"
    https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git/commit/?h=features&id=41cd81abafdc4e58a93fcb677712a76885e3ca25
  (We can't test the kernel of the target system)
  Otherwise if /proc/sys/kernel/randomize_va_space is turned off (0),
  static PIE executables like ldconfig will crash.  While startup sbrk is
  used to enlarge the HEAP.  Unfortunately the underlying brk syscall fails
  as there is not enough space after the HEAP.  Then the address of the TLS
  image is invalid and the following memcpy in __libc_setup_tls() leads
  to a segfault.
  If /proc/sys/kernel/randomize_va_space is activated (default: 2), there
  is enough space after HEAP.

- glibc
  - "Linux: Define MMAP_CALL_INTERNAL"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=c1b68685d438373efe64e5f076f4215723004dfb
  - "i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=6e5c7a1e262961adb52443ab91bd2c9b72316402
  - "i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=60f0f2130d30cfd008ca39743027f1e200592dff
  - "ia64: Always define IA64_USE_NEW_STUB as a flag macro"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=18bd9c3d3b1b6a9182698c85354578d1d58e9d64
  - "Linux: Implement a useful version of _startup_fatal"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=a2a6bce7d7e52c1c34369a7da62c501cc350bc31
  - "Linux: Introduce __brk_call for invoking the brk system call"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=b57ab258c1140bc45464b4b9908713e3e0ee35aa
  - "csu: Implement and use _dl_early_allocate during static startup"
    https://sourceware.org/git/?p=glibc.git;a=commit;h=f787e138aa0bf677bf74fa2a08595c446292f3d7
  The mentioned patch series by Florian Weimer avoids the mentioned failing
  sbrk syscall by falling back to mmap.

This commit also adjusts startup code in start.S to be ready for static PIE.
We have to add a wrapper function for main as we are not allowed to use
GOT relocations before __libc_start_main is called.
(Compare also to:
- commit 14d886edbd3d80b771e1c42fbd9217f9074de9c6
  "aarch64: fix start code for static pie"
- commit 3d1d79283e6de4f7c434cb67fb53a4fd28359669
  "aarch64: fix static pie enabled libc when main is in a shared library"
)

(cherry picked from commit 728894dba4a19578bd803906de184a8dd51ed13c)

csu: Implement and use _dl_early_allocate during static startup

This implements mmap fallback for a brk failure during TLS
allocation.

scripts/tls-elf-edit.py is updated to support the new patching method.
The script no longer requires that in the input object is of ET_DYN
type.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit f787e138aa0bf677bf74fa2a08595c446292f3d7)

Linux: Introduce __brk_call for invoking the brk system call

Alpha and sparc can now use the generic implementation.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit b57ab258c1140bc45464b4b9908713e3e0ee35aa)

Linux: Implement a useful version of _startup_fatal

On i386 and ia64, the TCB is not available at this point.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit a2a6bce7d7e52c1c34369a7da62c501cc350bc31)

ia64: Always define IA64_USE_NEW_STUB as a flag macro

And keep the previous definition if it exists. This allows
disabling IA64_USE_NEW_STUB while keeping USE_DL_SYSINFO defined.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 18bd9c3d3b1b6a9182698c85354578d1d58e9d64)

Linux: Define MMAP_CALL_INTERNAL

Unlike MMAP_CALL, this avoids a TCB dependency for an errno update
on failure.

<mmap_internal.h> cannot be included as is on several architectures
due to the definition of page_unit, so introduce a separate header
file for the definition of MMAP_CALL and MMAP_CALL_INTERNAL,
<mmap_call.h>.

Reviewed-by: Stefan Liebler <stli@linux.ibm.com>
(cherry picked from commit c1b68685d438373efe64e5f076f4215723004dfb)

i386: Honor I386_USE_SYSENTER for 6-argument Linux system calls

Introduce an int-80h-based version of __libc_do_syscall and use
it if I386_USE_SYSENTER is defined as 0.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 60f0f2130d30cfd008ca39743027f1e200592dff)

i386: Remove OPTIMIZE_FOR_GCC_5 from Linux libc-do-syscall.S

After commit a78e6a10d0b50d0ca80309775980fc99944b1727
("i386: Remove broken CAN_USE_REGISTER_ASM_EBP (bug 28771)"),
it is never defined.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6e5c7a1e262961adb52443ab91bd2c9b72316402)

elf: Remove __libc_init_secure

After 73fc4e28b9464f0e13edc719a5372839970e7ddb,
__libc_enable_secure_decided is always 0 and a statically linked
executable may overwrite __libc_enable_secure without considering
AT_SECURE.

The __libc_enable_secure has been correctly initialized in _dl_aux_init,
so just remove __libc_enable_secure_decided and __libc_init_secure.
This allows us to remove some startup_get*id functions from
22b79ed7f413cd980a7af0cf258da5bf82b6d5e5.

Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 3e9acce8c50883b6cd8a3fb653363d9fa21e1608)

Linux: Consolidate auxiliary vector parsing (redo)

And optimize it slightly.

This is commit 8c8510ab2790039e58995ef3a22309582413d3ff revised.

In _dl_aux_init in elf/dl-support.c, use an explicit loop
and -fno-tree-loop-distribute-patterns to avoid memset.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 73fc4e28b9464f0e13edc719a5372839970e7ddb)

Linux: Include <dl-auxv.h> in dl-sysdep.c only for SHARED

Otherwise, <dl-auxv.h> on POWER ends up being included twice,
once in dl-sysdep.c, once in dl-support.c. That leads to a linker
failure due to multiple definitions of _dl_cache_line_size.

Fixes commit d96d2995c1121d3310102afda2deb1f35761b5e6
("Revert "Linux: Consolidate auxiliary vector parsing").

(cherry picked from commit 098c795e85fbd05c5ef59c2d0ce59529331bea27)

Revert "Linux: Consolidate auxiliary vector parsing"

This reverts commit 8c8510ab2790039e58995ef3a22309582413d3ff. The
revert is not perfect because the commit included a bug fix for
_dl_sysdep_start with an empty argv, introduced in commit
2d47fa68628e831a692cba8fc9050cef435afc5e ("Linux: Remove
DL_FIND_ARG_COMPONENTS"), and this bug fix is kept.

The revert is necessary because the reverted commit introduced an
early memset call on aarch64, which leads to crash due to lack of TCB
initialization.

(cherry picked from commit d96d2995c1121d3310102afda2deb1f35761b5e6)

Linux: Consolidate auxiliary vector parsing

And optimize it slightly.

The large switch statement in _dl_sysdep_start can be replaced with
a large array.  This reduces source code and binary size.  On
i686-linux-gnu:

Before:

   text    data     bss     dec     hex filename
   7791      12       0    7803    1e7b elf/dl-sysdep.os

After:

   text    data     bss     dec     hex filename
   7135      12       0    7147    1beb elf/dl-sysdep.os

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 8c8510ab2790039e58995ef3a22309582413d3ff)

Linux: Assume that NEED_DL_SYSINFO_DSO is always defined

The definition itself is still needed for generic code.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit f19fc997a5754a6c0bb9e43618f0597e878061f7)

Linux: Remove DL_FIND_ARG_COMPONENTS

The generic definition is always used since the Native Client
port has been removed.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 2d47fa68628e831a692cba8fc9050cef435afc5e)

Linux: Remove HAVE_AUX_SECURE, HAVE_AUX_XID, HAVE_AUX_PAGESIZE

They are always defined.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit b9c3d3382f6f50e9723002deb2dc8127de720fa6)

elf: Merge dl-sysdep.c into the Linux version

The generic version is the de-facto Linux implementation. It
requires an auxiliary vector, so Hurd does not use it.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 91c0a47ffb66e7cd802de870686465db3b3976a0)

elf: Remove unused NEED_DL_BASE_ADDR and _dl_base_addr

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cd0c333d2ea82d0ae14719bdbef86d99615bdb00)

x86: Optimize {str|wcs}rchr-evex

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.755
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c966099cdc3e0fdf92f63eac09b22fa7e5f5f02d)

x86: Optimize {str|wcs}rchr-avx2

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.832
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit df7e295d18ffa34f629578c0017a9881af7620f6)

x86: Optimize {str|wcs}rchr-sse2

The new code unrolls the main loop slightly without adding too much
overhead and minimizes the comparisons for the search CHAR.

Geometric Mean of all benchmarks New / Old: 0.741
See email for all results.

Full xcheck passes on x86_64 with and without multiarch enabled.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5307aa9c1800f36a64c183c091c9af392c1fa75c)

x86: Cleanup page cross code in memcmp-avx2-movbe.S

Old code was both inefficient and wasted code size. New code (-62
bytes) and comparable or better performance in the page cross case.

geometric_mean(N=20) of page cross cases New / Original: 0.960

size, align0, align1, ret, New Time/Old Time
   1,   4095,      0,   0,             1.001
   1,   4095,      0,   1,             0.999
   1,   4095,      0,  -1,               1.0
   2,   4094,      0,   0,               1.0
   2,   4094,      0,   1,               1.0
   2,   4094,      0,  -1,               1.0
   3,   4093,      0,   0,               1.0
   3,   4093,      0,   1,               1.0
   3,   4093,      0,  -1,               1.0
   4,   4092,      0,   0,             0.987
   4,   4092,      0,   1,               1.0
   4,   4092,      0,  -1,               1.0
   5,   4091,      0,   0,             0.984
   5,   4091,      0,   1,             1.002
   5,   4091,      0,  -1,             1.005
   6,   4090,      0,   0,             0.993
   6,   4090,      0,   1,             1.001
   6,   4090,      0,  -1,             1.003
   7,   4089,      0,   0,             0.991
   7,   4089,      0,   1,               1.0
   7,   4089,      0,  -1,             1.001
   8,   4088,      0,   0,             0.875
   8,   4088,      0,   1,             0.881
   8,   4088,      0,  -1,             0.888
   9,   4087,      0,   0,             0.872
   9,   4087,      0,   1,             0.879
   9,   4087,      0,  -1,             0.883
  10,   4086,      0,   0,             0.878
  10,   4086,      0,   1,             0.886
  10,   4086,      0,  -1,             0.873
  11,   4085,      0,   0,             0.878
  11,   4085,      0,   1,             0.881
  11,   4085,      0,  -1,             0.879
  12,   4084,      0,   0,             0.873
  12,   4084,      0,   1,             0.889
  12,   4084,      0,  -1,             0.875
  13,   4083,      0,   0,             0.873
  13,   4083,      0,   1,             0.863
  13,   4083,      0,  -1,             0.863
  14,   4082,      0,   0,             0.838
  14,   4082,      0,   1,             0.869
  14,   4082,      0,  -1,             0.877
  15,   4081,      0,   0,             0.841
  15,   4081,      0,   1,             0.869
  15,   4081,      0,  -1,             0.876
  16,   4080,      0,   0,             0.988
  16,   4080,      0,   1,              0.99
  16,   4080,      0,  -1,             0.989
  17,   4079,      0,   0,             0.978
  17,   4079,      0,   1,             0.981
  17,   4079,      0,  -1,              0.98
  18,   4078,      0,   0,             0.981
  18,   4078,      0,   1,              0.98
  18,   4078,      0,  -1,             0.985
  19,   4077,      0,   0,             0.977
  19,   4077,      0,   1,             0.979
  19,   4077,      0,  -1,             0.986
  20,   4076,      0,   0,             0.977
  20,   4076,      0,   1,             0.986
  20,   4076,      0,  -1,             0.984
  21,   4075,      0,   0,             0.977
  21,   4075,      0,   1,             0.983
  21,   4075,      0,  -1,             0.988
  22,   4074,      0,   0,             0.983
  22,   4074,      0,   1,             0.994
  22,   4074,      0,  -1,             0.993
  23,   4073,      0,   0,              0.98
  23,   4073,      0,   1,             0.992
  23,   4073,      0,  -1,             0.995
  24,   4072,      0,   0,             0.989
  24,   4072,      0,   1,             0.989
  24,   4072,      0,  -1,             0.991
  25,   4071,      0,   0,              0.99
  25,   4071,      0,   1,             0.999
  25,   4071,      0,  -1,             0.996
  26,   4070,      0,   0,             0.993
  26,   4070,      0,   1,             0.995
  26,   4070,      0,  -1,             0.998
  27,   4069,      0,   0,             0.993
  27,   4069,      0,   1,             0.999
  27,   4069,      0,  -1,               1.0
  28,   4068,      0,   0,             0.997
  28,   4068,      0,   1,               1.0
  28,   4068,      0,  -1,             0.999
  29,   4067,      0,   0,             0.996
  29,   4067,      0,   1,             0.999
  29,   4067,      0,  -1,             0.999
  30,   4066,      0,   0,             0.991
  30,   4066,      0,   1,             1.001
  30,   4066,      0,  -1,             0.999
  31,   4065,      0,   0,             0.988
  31,   4065,      0,   1,             0.998
  31,   4065,      0,  -1,             0.998
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 23102686ec67b856a2d4fd25ddaa1c0b8d175c4f)

x86: Remove memcmp-sse4.S

Code didn't actually use any sse4 instructions since `ptest` was
removed in:

commit 2f9062d7171850451e6044ef78d91ff8c017b9c0
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Nov 10 16:18:56 2021 -0600

    x86: Shrink memcmp-sse4.S code size

The new memcmp-sse2 implementation is also faster.

geometric_mean(N=20) of page cross cases SSE2 / SSE4: 0.905

Note there are two regressions preferring SSE2 for Size = 1 and Size =
65.

Size = 1:
size, align0, align1, ret, New Time/Old Time
   1,      1,      1,   0,               1.2
   1,      1,      1,   1,             1.197
   1,      1,      1,  -1,               1.2

This is intentional. Size == 1 is significantly less hot based on
profiles of GCC11 and Python3 than sizes [4, 8] (which is made
hotter).

Python3 Size = 1        -> 13.64%
Python3 Size = [4, 8]   -> 60.92%

GCC11   Size = 1        ->  1.29%
GCC11   Size = [4, 8]   -> 33.86%

size, align0, align1, ret, New Time/Old Time
   4,      4,      4,   0,             0.622
   4,      4,      4,   1,             0.797
   4,      4,      4,  -1,             0.805
   5,      5,      5,   0,             0.623
   5,      5,      5,   1,             0.777
   5,      5,      5,  -1,             0.802
   6,      6,      6,   0,             0.625
   6,      6,      6,   1,             0.813
   6,      6,      6,  -1,             0.788
   7,      7,      7,   0,             0.625
   7,      7,      7,   1,             0.799
   7,      7,      7,  -1,             0.795
   8,      8,      8,   0,             0.625
   8,      8,      8,   1,             0.848
   8,      8,      8,  -1,             0.914
   9,      9,      9,   0,             0.625

Size = 65:
size, align0, align1, ret, New Time/Old Time
  65,      0,      0,   0,             1.103
  65,      0,      0,   1,             1.216
  65,      0,      0,  -1,             1.227
  65,     65,      0,   0,             1.091
  65,      0,     65,   1,              1.19
  65,     65,     65,  -1,             1.215

This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.

size, align0, align1, ret, New Time/Old Time
128,      4,      8,   0,             0.858
128,      4,      8,   1,             0.879
128,      4,      8,  -1,             0.888

As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)

x86: Small improvements for wcslen

Just a few QOL changes.
    1. Prefer `add` > `lea` as it has high execution units it can run
       on.
    2. Don't break macro-fusion between `test` and `jcc`
    3. Reduce code size by removing gratuitous padding bytes (-90
       bytes).

geometric_mean(N=20) of all benchmarks New / Original: 0.959

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 244b415d386487521882debb845a040a4758cb18)

x86: Remove AVX str{n}casecmp

The rational is:

1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
   regression on Tigerlake using SSE42 versus AVX across the
   benchtest suite).
2. AVX2 version covers the majority of targets that previously
   prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
   becoming outdated.

All in all the saving the code size is worth it.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 305769b2a15c2e96f9e1b5195d3c4e0d6f0f4b68)

x86: Add EVEX optimized str{n}casecmp

geometric_mean(N=40) of all benchmarks EVEX / SSE42: .621

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 84e7c46df4086873eae28a1fb87d2cf5388b1e16)

x86: Add AVX2 optimized str{n}casecmp

geometric_mean(N=40) of all benchmarks AVX2 / SSE42: .702

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bbf81222343fed5cd704001a2ae0d86c71544151)

x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S

Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .920

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit d154758e618ec9324f5d339c46db0aa27e8b1226)

x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S

Slightly faster method of doing TOLOWER that saves an
instruction.

Also replace the hard coded 5-byte no with .p2align 4. On builds with
CET enabled this misaligned entry to strcasecmp.

geometric_mean(N=40) of all benchmarks New / Original: .894

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 670b54bc585ea4a94f3b2e9272ba44aa6b730b73)

x86: Remove strspn-sse2.S and use the generic implementation

The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .710

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9c8a6ad620b49a27120ecdd7049c26bf05900397)

x86: Remove strpbrk-sse2.S and use the generic implementation

The generic implementation is faster (see strcspn commit).

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 653358535280a599382cb6c77538a187dac6a87f)

x86: Remove strcspn-sse2.S and use the generic implementation

The generic implementation is faster.

geometric_mean(N=20) of all benchmarks New / Original: .678

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit fe28e7d9d9535ebab4081d195c553b4fbf39d9ae)

x86: Optimize strspn in strspn-c.c

Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2; New / Original: .901

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 412d10343168b05b8cf6c3683457cf9711d28046)

x86: Optimize strcspn and strpbrk in strcspn-c.c

Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
_mm_cmpistri. Also change offset to unsigned to avoid unnecessary
sign extensions.

geometric_mean(N=20) of all benchmarks that dont fallback on
sse2/strlen; New / Original: .928

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 30d627d477d7255345a4b713cf352ac32d644d61)

x86: Code cleanup in strchr-evex and comment justifying branch

Small code cleanup for size: -81 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks New / Original: .985
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit ec285ea90415458225623ddc0492ae3f705af043)

x86: Code cleanup in strchr-avx2 and comment justifying branch

Small code cleanup for size: -53 bytes.

Add comment justifying using a branch to do NULL/non-null return.

All string/memory tests pass and no regressions in benchtests.

geometric_mean(N=20) of all benchmarks Original / New: 1.00
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6fbf4d51e9ba8063c4f8331564892ead9c67344)

x86_64: Remove bcopy optimizations

The symbols is not present in current POSIX specification and compiler
already generates memmove call.

(cherry picked from commit bf92893a14ebc161b08b28acc24fa06ae6be19cb)

x86-64: Remove bzero weak alias in SS2 memset

commit 3d9f171bfb5325bd5f427e9fc386453358c6e840
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Mon Feb 7 05:55:15 2022 -0800

    x86-64: Optimize bzero

added the optimized bzero.  Remove bzero weak alias in SS2 memset to
avoid undefined __bzero in memset-sse2-unaligned-erms.

(cherry picked from commit 0fb8800029d230b3711bf722b2a47db92d0e273f)

x86_64/multiarch: Sort sysdep_routines and put one entry per line

(cherry picked from commit c328d0152d4b14cca58407ec68143894c8863004)

x86: Improve L to support L(XXX_SYMBOL (YYY, ZZZ))

(cherry picked from commit 1283948f236f209b7d3f44b69a42b96806fa6da0)

fortify: Ensure that __glibc_fortify condition is a constant [BZ #29141]

The fix c8ee1c85 introduced a -1 check for object size without also
checking that object size is a constant. Because of this, the tree
optimizer passes in gcc fail to fold away one of the branches in
__glibc_fortify and trips on a spurious Wstringop-overflow. The warning
itself is incorrect and the branch does go away eventually in DCE in the
rtl passes in gcc, but the constant check is a helpful hint to simplify
code early, so add it in.

Resolves: BZ #29141
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 61a87530108ec9181e1b18a9b727ec3cc3ba7532)

dlfcn: Implement the RTLD_DI_PHDR request type for dlinfo

The information is theoretically available via dl_iterate_phdr as
well, but that approach is very slow if there are many shared
objects.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@rehdat.com>
(cherry picked from commit d056c212130280c0a54d9a4f72170ec621b70ce5)

manual: Document the dlinfo function

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@rehdat.com>
(cherry picked from commit 93804a1ee084d4bdc620b2b9f91615c7da0fabe1)

Also includes partial backport of commit 5d28a8962dcb6ec056b81d730e
(the addition of manual/dynlink.texi).

x86: Fix fallback for wcsncmp_avx2 in strcmp-avx2.S [BZ #28896]

Overflow case for __wcsncmp_avx2_rtm should be __wcscmp_avx2_rtm not
__wcscmp_avx2.

commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Jan 9 16:02:21 2022 -0600

x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.

This change will need to be backported.

All string/memory tests pass.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9fef7039a7d04947bc89296ee0d187bc8d89b772)

x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895]

Logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page. To avoid having the result contimated by
these comparisons the `strcmp` variants would mask off these
comparisons. This was missing in the `strncmp` variants causing
the bug. This commit adds the masking to `strncmp` so that out of
range comparisons don't affect the result.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass as
well a full xcheck on x86_64 linux.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e108c02a5e23c8c88ce66d8705d4a24bb6b9a8bf)

x86: Set .text section in memset-vec-unaligned-erms

commit 3d9f171bfb5325bd5f427e9fc386453358c6e840
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Mon Feb 7 05:55:15 2022 -0800

x86-64: Optimize bzero

Remove setting the .text section for the code. This commit
adds that back.

(cherry picked from commit 7912236f4a597deb092650ca79f33504ddb4af28)

x86-64: Optimize bzero

memset with zero as the value to set is by far the majority value (99%+
for Python3 and GCC).

bzero can be slightly more optimized for this case by using a zero-idiom
xor for broadcasting the set value to a register (vector or GPR).

Co-developed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 3d9f171bfb5325bd5f427e9fc386453358c6e840)

x86: Remove SSSE3 instruction for broadcast in memset.S (SSE2 Only)

commit b62ace2740a106222e124cc86956448fa07abf4d
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Sun Feb 6 00:54:18 2022 -0600

x86: Improve vec generation in memset-vec-unaligned-erms.S

Revert usage of 'pshufb' in broadcast logic as it is an SSSE3
instruction and memset.S is restricted to only SSE2 instructions.

(cherry picked from commit 1b0c60f95bbe2eded80b2bb5be75c0e45b11cde1)

x86: Improve vec generation in memset-vec-unaligned-erms.S

No bug.

Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.

For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.

Results for memset-avx2 small (geomean of N = 20 benchset runs).

size, New Time, Old Time, New / Old
   0,    4.100,    3.831,     0.934
   1,    5.074,    4.399,     0.867
   2,    4.433,    4.411,     0.995
   4,    4.487,    4.415,     0.984
   8,    4.454,    4.396,     0.987
  16,    4.502,    4.443,     0.987

All relevant string/wcsmbs tests are passing.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b62ace2740a106222e124cc86956448fa07abf4d)

x86-64: Fix strcmp-evex.S

Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jan 10 15:35:39 2022 -0600

x86: Optimize strcmp-evex.S

(cherry picked from commit 0e0199a9e02ebe42e2b36958964d63f03573c382)

x86-64: Fix strcmp-avx2.S

Change "movl %edx, %rdx" to "movl %edx, %edx" in:

commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Mon Jan 10 15:35:38 2022 -0600

x86: Optimize strcmp-avx2.S

(cherry picked from commit c15efd011cea3d8f0494269eb539583215a1feed)

x86: Optimize strcmp-evex.S

Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases as well are nearly universally improved.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8418eb3ff4b781d31c4ed5dc6c0bd7356bc45db9)

x86: Optimize strcmp-avx2.S

Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.

The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.

The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b77b06e0e296f1a2276c27a67e1d44f2cfa38d45)

manual: Clarify that abbreviations of long options are allowed

The man page and code comments clearly state that abbreviations of long
option names are recognized correctly as long as they are unique.
Document this fact in the glibc manual as well.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Andreas Schwab <schwab@linux-m68k.org>
(cherry picked from commit db1efe02c9f15affc3908d6ae73875b82898a489)

Add HWCAP2_AFP, HWCAP2_RPRES from Linux 5.17 to AArch64 bits/hwcap.h

Add the new HWCAP2_AFP and HWCAP2_RPRES constants from Linux 5.17.
Tested with build-many-glibcs.py for aarch64-linux-gnu.

(cherry picked from commit 866c599182e87f116440b5d854f9e99533c48eb3)

aarch64: Add HWCAP2_ECV from Linux 5.16

Indicates the availability of enhanced counter virtualization extension
of armv8.6-a with self-synchronized virtual counter CNTVCTSS_EL0 usable
in userspace.

(cherry picked from commit 5a1be8ebdf6f02d4efec6e5f12ad06db17511f90)

Add SOL_MPTCP, SOL_MCTP from Linux 5.16 to bits/socket.h

Linux 5.16 adds constants SOL_MPTCP and SOL_MCTP to the getsockopt /
setsockopt levels; add these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit fdc1ae67fef27eea1445bab4bdfe2f0fb3bc7aa1)

Update kernel version to 5.17 in tst-mman-consts.py

This patch updates the kernel version in the test tst-mman-consts.py
to 5.17. (There are no new MAP_* constants covered by this test in
5.17 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 23808a422e6036accaba7236fd3b9a0d7ab7e8ee)

Update kernel version to 5.16 in tst-mman-consts.py

This patch updates the kernel version in the test tst-mman-consts.py
to 5.16. (There are no new MAP_* constants covered by this test in
5.16 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 790a607e234aa10d4b977a1b80aebe8a2acac970)

Update syscall lists for Linux 5.17

Linux 5.17 has one new syscall, set_mempolicy_home_node. Update
syscall-names.list and regenerate the arch-syscall.h headers with
build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit 8ef9196b26793830515402ea95aca2629f7721ec)

Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h

Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along
with ARPHRD_CAN which was added to Linux in version 2.6.25 (commit
cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol
numbers for PF_CAN") but apparently missed for glibc at the time.

Tested for x86_64.

(cherry picked from commit a94d9659cd69dbc70d3494b1cbbbb5a1551675c5)

Update kernel version to 5.15 in tst-mman-consts.py

This patch updates the kernel version in the test tst-mman-consts.py
to 5.15. (There are no new MAP_* constants covered by this test in
5.15 that need any other header changes.)

Tested with build-many-glibcs.py.

(cherry picked from commit 5c3ece451d46a7d8721311609bfcb6faafacb39e)

Add PF_MCTP, AF_MCTP from Linux 5.15 to bits/socket.h

Linux 5.15 adds a new address / protocol family PF_MCTP / AF_MCTP; add
these constants to bits/socket.h.

Tested for x86_64.

(cherry picked from commit bdeb7a8fa9989d18dab6310753d04d908125dc1d)

posix/glob.c: update from gnulib

Copied from gnulib/lib/glob.c in order to fix rhbz 1982608
Also fixes swbz 25659

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 7c477b57a31487eda516db02b9e04f22d1a6e6af)

linux: Fix fchmodat with AT_SYMLINK_NOFOLLOW for 64 bit time_t (BZ#29097)

The AT_SYMLINK_NOFOLLOW emulation ues the default 32 bit stat internal
calls, which fails with EOVERFLOW if the file constains timestamps
beyond 2038.

Checked on i686-linux-gnu.

(cherry picked from commit 118a2aee07f64d605b6668cbe195c1f44eac6be6)

i386: Regenerate ulps

These failures were caught while building glibc master for Fedora
Rawhide which is built with '-mtune=generic -msse2 -mfpmath=sse'
using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon
processor.

(cherry picked from commit e465d97653311c3687aee49de782177353acfe86)

linux: Fix missing internal 64 bit time_t stat usage

These are two missing spots initially done by 52a5fe70a2c77935.

Checked on i686-linux-gnu.

(cherry picked from commit 834ddd0432f68d6dc85b6aac95065721af0d86e9)

x86: Optimize L(less_vec) case in memcmp-evex-movbe.S

No bug.
Optimizations are twofold.

1) Replace page cross and 0/1 checks with masked load instructions in
   L(less_vec). In applications this reduces branch-misses in the
   hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.

Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range.  From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de090ae84e380aff68a98bd94ef704667)

x86: Don't set Prefer_No_AVX512 for processors with AVX512 and AVX-VNNI

Don't set Prefer_No_AVX512 on processors with AVX512 and AVX-VNNI since
they won't lower CPU frequency when ZMM load and store instructions are
used.

(cherry picked from commit ceeffe968c01b1202e482f4855cb6baf5c6cb713)

x86-64: Use notl in EVEX strcmp [BZ #28646]

Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 4df1fa6ddc8925a75f3da644d5da3bb16eb33f02)

x86: Shrink memcmp-sse4.S code size

No bug.

This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.

memcmp-sse4 code size reduction : -3487 bytes
wmemcmp-sse4 code size reduction: -1472 bytes

The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.

See https://research.google/pubs/pub48320/ for more details.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d7171850451e6044ef78d91ff8c017b9c0)

x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h

No bug.

This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium ranged.

On Skylake with ERMS:

Size,   Align1, Align2, dst>src,(rep movsb) / (vec copy)
4096,   0,      0,      0,      0.975
4096,   0,      0,      1,      0.953
4096,   12,     0,      0,      0.969
4096,   12,     0,      1,      0.872
4096,   44,     0,      0,      0.979
4096,   44,     0,      1,      0.83
4096,   0,      12,     0,      1.006
4096,   0,      12,     1,      0.989
4096,   0,      44,     0,      0.739
4096,   0,      44,     1,      0.942
4096,   12,     12,     0,      1.009
4096,   12,     12,     1,      0.973
4096,   44,     44,     0,      0.791
4096,   44,     44,     1,      0.961
4096,   2048,   0,      0,      0.978
4096,   2048,   0,      1,      0.951
4096,   2060,   0,      0,      0.986
4096,   2060,   0,      1,      0.963
4096,   2048,   12,     0,      0.971
4096,   2048,   12,     1,      0.941
4096,   2060,   12,     0,      0.977
4096,   2060,   12,     1,      0.949
8192,   0,      0,      0,      0.85
8192,   0,      0,      1,      0.845
8192,   13,     0,      0,      0.937
8192,   13,     0,      1,      0.939
8192,   45,     0,      0,      0.932
8192,   45,     0,      1,      0.927
8192,   0,      13,     0,      0.621
8192,   0,      13,     1,      0.62
8192,   0,      45,     0,      0.53
8192,   0,      45,     1,      0.516
8192,   13,     13,     0,      0.664
8192,   13,     13,     1,      0.659
8192,   45,     45,     0,      0.593
8192,   45,     45,     1,      0.575
8192,   2048,   0,      0,      0.854
8192,   2048,   0,      1,      0.834
8192,   2061,   0,      0,      0.863
8192,   2061,   0,      1,      0.857
8192,   2048,   13,     0,      0.63
8192,   2048,   13,     1,      0.629
8192,   2061,   13,     0,      0.627
8192,   2061,   13,     1,      0.62

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 475b63702ef38b69558fc3d31a0b66776a70f1d3)

x86: Optimize memmove-vec-unaligned-erms.S

No bug.

The optimizations are as follows:

1) Always align entry to 64 bytes. This makes behavior more
   predictable and makes other frontend optimizations easier.

2) Make the L(more_8x_vec) cases 4k aliasing aware. This can have
   significant benefits in the case that:
        0 < (dst - src) < [256, 512]

3) Align before `rep movsb`. For ERMS this is roughly a [0, 30%]
   improvement and for FSRM [-10%, 25%].

In addition to these primary changes there is general cleanup
throughout to optimize the aligning routines and control flow logic.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6b7502ec0c2da89a7437f43171f160d713e39c6)

x86-64: Replace movzx with movzbl

Clang cannot assemble movzx in the AT&T dialect mode.

../sysdeps/x86_64/strcmp.S:2232:16: error: invalid operand for instruction
movzx (%rsi), %ecx
^~~~

Change movzx to movzbl, which follows the AT&T dialect and is used
elsewhere in the file.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6720d36b6623c5e48c070d86acf61198b33e144e)

x86-64: Remove Prefer_AVX2_STRCMP

Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.

(cherry picked from commit 14dbbf46a007ae5df36646b51ad0c9e5f5259f30)

x86-64: Improve EVEX strcmp with masked load

In strcmp-evex.S, to compare 2 32-byte strings, replace

        VMOVU   (%rdi, %rdx), %YMM0
        VMOVU   (%rsi, %rdx), %YMM1
        /* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
        VPCMP   $4, %YMM0, %YMM1, %k0
        VPCMP   $0, %YMMZERO, %YMM0, %k1
        VPCMP   $0, %YMMZERO, %YMM1, %k2
        /* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
        kord    %k1, %k2, %k1
        /* Each bit in K1 represents a NULL or a mismatch.  */
        kord    %k0, %k1, %k1
        kmovd   %k1, %ecx
        testl   %ecx, %ecx
        jne     L(last_vector)

with

        VMOVU   (%rdi, %rdx), %YMM0
        VPTESTM %YMM0, %YMM0, %k2
        /* Each bit cleared in K1 represents a mismatch or a null CHAR
           in YMM0 and 32 bytes at (%rsi, %rdx).  */
        VPCMP   $0, (%rsi, %rdx), %YMM0, %k1{%k2}
        kmovd   %k1, %ecx
        incl    %ecx
        jne     L(last_vector)

It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.

Co-Authored-By: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit c46e9afb2df5fc9e39ff4d13777e4b4c26e04e55)

x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S

This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.

it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.

It costs 2-extra bytes but the extra bytes should only eat into
alignment padding.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bad852b61b79503fcb3c5fc379c70f768df3e1fb)

x86: Optimize memset-vec-unaligned-erms.S

No bug.

Optimization are

1. change control flow for L(more_2x_vec) to fall through to loop and
   jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
   size and saves jumps for length > 4x VEC_SIZE.

2. For EVEX/AVX512 move L(less_vec) closer to entry.

3. Avoid complex address mode for length > 2x VEC_SIZE

4. Slightly better aligning code for the loop from the perspective of
   code size and uops.

5. Align targets so they make full use of their fetch block and if
   possible cache line.

6. Try and reduce total number of icache lines that will need to be
   pulled in for a given length.

7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
   jumping to the stosb target in the sse2 code section will almost
   certainly be to a new page. The new version does increase code size
   marginally by duplicating the target but should get better iTLB
   behavior as a result.

test-memset, test-wmemset, and test-bzero are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e59ced238482fd71f3e493717f14f6507346741e)

x86: Optimize memcmp-evex-movbe.S for frontend behavior and size

No bug.

The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
code size. Smaller / Less hot basic blocks are used for this.

Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.

In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.

The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.

test-memcmp and test-wmemcmp are both passing.

(cherry picked from commit 1bd8b8d58fc9967cc073d2c13bfb6befefca2faa)

x86: Modify ENTRY in sysdep.h so that p2align can be specified

No bug.

This change adds a new macro ENTRY_P2ALIGN which takes a second
argument, log2 of the desired function alignment.

The old ENTRY(name) macro is just ENTRY_P2ALIGN(name, 4) so this
doesn't affect any existing functionality.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit fc5bd179ef3a953dff8d1655bd530d0e230ffe71)

x86-64: Optimize load of all bits set into ZMM register [BZ #28252]

Optimize loads of all bits set into ZMM register in AVX512 SVML codes
by replacing

vpbroadcastq .L_2il0floatpacket.16(%rip), %zmmX

and

vmovups .L_2il0floatpacket.13(%rip), %zmmX

with
vpternlogd $0xff, %zmmX, %zmmX, %zmmX

This fixes BZ #28252.

(cherry picked from commit 78c9ec9000f873abe7a15a91b87080a2e4308260)

scripts/glibcelf.py: Mark as UNSUPPORTED on Python 3.5 and earlier

enum.IntFlag and enum.EnumMeta._missing_ support are not part of
earlier Python versions.

(cherry picked from commit b571f3adffdcbed23f35ea39b0ca43809dbb4f5b)

dlfcn: Do not use rtld_active () to determine ld.so state (bug 29078)

When audit modules are loaded, ld.so initialization is not yet
complete, and rtld_active () returns false even though ld.so is
mostly working. Instead, the static dlopen hook is used, but that
does not work at all because this is not a static dlopen situation.

Commit 466c1ea15f461edb8e3ffaf5d86d708876343bbf ("dlfcn: Rework
static dlopen hooks") moved the hook pointer into _rtld_global_ro,
which means that separate protection is not needed anymore and the
hook pointer can be checked directly.

The guard for disabling libio vtable hardening in _IO_vtable_check
should stay for now.

Fixes commit 8e1472d2c1e25e6eabc2059170731365f6d5b3d1 ("ld.so:
Examine GLRO to detect inactive loader [BZ #20204]").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 8dcb6d0af07fda3607b541857e4f3970a74ed55b)

INSTALL: Rephrase -with-default-link documentation

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit c935789bdf40ba22b5698da869d3a4789797e09f)

misc: Fix rare fortify crash on wchar funcs. [BZ 29030]

If `__glibc_objsize (__o) == (size_t) -1` (i.e. `__o` is unknown size), fortify
checks should pass, and `__whatever_alias` should be called.

Previously, `__glibc_objsize (__o) == (size_t) -1` was explicitly checked, but
on commit a643f60c53876b, this was moved into `__glibc_safe_or_unknown_len`.

A comment says the -1 case should work as: "The -1 check is redundant because
since it implies that __glibc_safe_len_cond is true.". But this fails when:
* `__s > 1`
* `__osz == -1` (i.e. unknown size at compile time)
* `__l` is big enough
* `__l * __s <= __osz` can be folded to a constant
(I only found this to be true for `mbsrtowcs` and other functions in wchar2.h)

In this case `__l * __s <= __osz` is false, and `__whatever_chk_warn` will be
called by `__glibc_fortify` or `__glibc_fortify_n` and crash the program.

This commit adds the explicit `__osz == -1` check again.
moc crashes on startup due to this, see: https://bugs.archlinux.org/task/74041

Minimal test case (test.c):
    #include <wchar.h>

    int main (void)
    {
        const char *hw = "HelloWorld";
        mbsrtowcs (NULL, &hw, (size_t)-1, NULL);
        return 0;
    }

Build with:
    gcc -O2 -Wp,-D_FORTIFY_SOURCE=2 test.c -o test && ./test

Output:
    *** buffer overflow detected ***: terminated

Fixes: BZ #29030
Signed-off-by: Joan Bruguera <joanbrugueram@gmail.com>
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 33e03f9cd2be4f2cd62f93fda539cc07d9c8130e)

Default to --with-default-link=no (bug 25812)

This is necessary to place the libio vtables into the RELRO segment.
New tests elf/tst-relro-ldso and elf/tst-relro-libc are added to
verify that this is what actually happens.

The new tests fail on ia64 due to lack of (default) RELRO support
inbutils, so they are XFAILed there.

(cherry picked from commit 198abcbb94618730dae1b3f4393efaa49e0ec8c7)

scripts: Add glibcelf.py module

Hopefully, this will lead to tests that are easier to maintain. The
current approach of parsing readelf -W output using regular expressions
is not necessarily easier than parsing the ELF data directly.

This module is still somewhat incomplete (e.g., coverage of relocation
types and versioning information is missing), but it is sufficient to
perform basic symbol analysis or program header analysis.

The EM_* mapping for architecture-specific constant classes (e.g.,
SttX86_64) is not yet implemented. The classes are defined for the
benefit of elf/tst-glibcelf.py.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 30035d67728a846fa39749cd162afd278ac654c4)

nptl: Fix pthread_cancel cancelhandling atomic operations

The 404656009b reversion did not setup the atomic loop to set the
cancel bits correctly. The fix is essentially what pthread_cancel
did prior 26cfbb7162ad.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.

(cherry picked from commit 62be9681677e7ce820db721c126909979382d379)