git.ipfire.org Git - thirdparty/glibc.git/log

x86: fix wmemset ifunc stray '!' (bug 33542)

The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
      && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
                                      AVX_Fast_Unaligned_Load, !))

This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.

Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.

Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
  falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.

Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirm the selected symbol is no longer SSE2.

Signed-off-by: xiejiamei <xiejiamei@hygon.com>
Signed-off-by: Li jing <lijing@hygon.cn>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 4d86b6cdd8132e0410347e07262239750f86dfb4)

x86: Skip XSAVE state size reset if ISA level requires XSAVE

If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.

Related to commit befe2d3c4dec8be2cdd01a47132e47bdb7020922
("x86-64: Don't use SSE resolvers for ISA level 3 or above").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 59585ddaa2d44f22af04bb4b8bd4ad1e302c4c02)

x86-64: Simplify minimum ISA check ifdef conditional with if

Replace minimum ISA check ifdef conditional with if. Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile time constants,
compiler will perform constant folding optimization, getting same
results.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b6e3898194bbae78910bbe9cd086937014961e45)

x86-64: Don't use SSE resolvers for ISA level 3 or above

When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:

ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value

For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor
_dl_tlsdesc_dynamic_fxsave.

This fixes BZ #31429.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit befe2d3c4dec8be2cdd01a47132e47bdb7020922)

i386: Add GLIBC_ABI_GNU_TLS version [BZ #33221]

On i386, programs and shared libraries with __thread usage may fail
silently at run-time against glibc without the TLS run-time fix for:

https://sourceware.org/bugzilla/show_bug.cgi?id=32996

Add GLIBC_ABI_GNU_TLS version to indicate that glibc has the working
GNU TLS run-time. Linker can add the GLIBC_ABI_GNU_TLS version to
binaries which depend on the working TLS run-time so that such programs
and shared libraries will fail to load and run at run-time against
libc.so without the GLIBC_ABI_GNU_TLS version, instead of fail silently
at random.

This fixes BZ #33221.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit ed1b7a5a489ab555a27fad9c101ebe2e1c1ba881)

i386: Also add GLIBC_ABI_GNU2_TLS version [BZ #33129]

Since the GNU2 TLS run-time bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=31372

affects both i386 and x86-64, also add GLIBC_ABI_GNU2_TLS version to i386
to indicate the working GNU2 TLS run-time. For x86-64, the additional
GNU2 TLS run-time bug fix is needed for

https://sourceware.org/bugzilla/show_bug.cgi?id=31501

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit bd4628f3f18ac312408782eea450429c6f044860)

x86-64: Add GLIBC_ABI_GNU2_TLS version [BZ #33129]

Programs and shared libraries compiled with -mtls-dialect=gnu2 may fail
silently at run-time against glibc without the GNU2 TLS run-time fix
for:

https://sourceware.org/bugzilla/show_bug.cgi?id=31372

Add GLIBC_ABI_GNU2_TLS version to indicate that glibc has the working
GNU2 TLS run-time. Linker can add the GLIBC_ABI_GNU2_TLS version to
binaries which depend on the working GNU2 TLS run-time:

https://sourceware.org/bugzilla/show_bug.cgi?id=33130

so that such programs and shared libraries will fail to load and run at
run-time against libc.so without the GLIBC_ABI_GNU2_TLS version, instead
of fail silently at random.

This fixes BZ #33129.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 9df8fa397d515dc86ff5565f6c45625e672d539e)

i386: Update ___tls_get_addr to preserve vector registers

Compiler generates the following instruction sequence for dynamic TLS
access:

leal tls_var@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT

CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL.  But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.

1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:

_dl_tlsdesc_dynamic:
/* Like all TLS resolvers, preserve call-clobbered registers.
   We need two scratch regs anyway.  */
subl $32, %esp
cfi_adjust_cfa_offset (32)

It is wrong to use

movl %ebx, -28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl -28(%esp), %ebx

to preserve EBX on stack.  Fix it with:

movl %ebx, 28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl 28(%esp), %ebx

4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.

This fixes BZ #32996.

Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 848f0e46f03f22404ed9a8aabf3fd5ce8809a1be)

x86: Optimize xstate size calculation

Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.

Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 70b648855185e967e54668b101d24704c3fb869d)

x86: Link tst-gnu2-tls2-x86-noxsave{,c,xsavec} with libpthread

This fixes a test build failure on Hurd.

Fixes commit 145097dff170507fe73190e8e41194f5b5f7e6bf ("x86: Use separate
variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit c6e2895695118ab59c7b17feb0fcb75a53e3478c)
(cherry picked from commit 837a36c371f18a3152d032e8060f4e5120c25e2b)

x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)

Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.

This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.

Fixes commit 9b7091415af47082664717210ac49d51551456ab ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 145097dff170507fe73190e8e41194f5b5f7e6bf)

Apply the Makefile sorting fix

Apply the Makefile sorting fix generated by sort-makefile-lines.py.

(cherry picked from commit ef7f4b1fef67430a8f3cfc77fa6aada2add851d7)

x86-64: Allocate state buffer space for RDI, RSI and RBX

_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11.  Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).

   +==================+<- stack frame start aligned at 8 or 16 bytes
   |                  |<- RDI saved in the red zone
   |                  |<- RSI saved in the red zone
   |                  |<- RBX saved in the red zone
   |                  |<- paddings for stack realignment of 64 bytes
   |------------------|<- xsave buffer end aligned at 64 bytes
   |                  |<-
   |                  |<-
   |                  |<-
   |------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
   |                  |<- 8-byte padding for 64-byte alignment
   |                  |<- 8-byte padding for 64-byte alignment
   |                  |<- R11
   |                  |<- R10
   |                  |<- R9
   |                  |<- R8
   |                  |<- RDX
   |                  |<- RCX
   +==================+<- RSP aligned at 64 bytes

Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
and RBX are saved onto stack without adjusting stack pointer first, using
the red-zone.  This fixes BZ #31501.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 717ebfa85c8240d32d0d19d86a484c31c55c9617)

x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers

_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved.  Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK.  Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec.  This fixes
the AMX part of BZ #31372.  Tested on AMX processor.

AMX test is enabled only for compilers with the fix for

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098

GCC 14 and GCC 11/12/13 branches have the bug fix.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9b7091415af47082664717210ac49d51551456ab)

x86: Update _dl_tlsdesc_dynamic to preserve caller-saved registers

Compiler generates the following instruction sequence for GNU2 dynamic
TLS access:

leaq tls_var@TLSDESC(%rip), %rax
call *tls_var@TLSCALL(%rax)

or

leal tls_var@TLSDESC(%ebx), %eax
call *tls_var@TLSCALL(%eax)

CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS and RAX/EAX, are unchanged after CALL.  When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path.  __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers.  _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers.  Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers.  This
fixes BZ #31372.

Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 0aac205a814a8511e98d02b91a8dc908f1c53cde)

x32/cet: Support shadow stack during startup for Linux 6.10

Use RXX_LP in RTLD_START_ENABLE_X86_FEATURES.  Support shadow stack during
startup for Linux 6.10:

commit 2883f01ec37dd8668e7222dfdb5980c86fdfe277
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 15 07:04:33 2024 -0700

    x86/shstk: Enable shadow stacks for x32

    1. Add shadow stack support to x32 signal.
    2. Use the 64-bit map_shadow_stack syscall for x32.
    3. Set up shadow stack for x32.

Add the map_shadow_stack system call to <fixup-asm-unistd.h> and regenerate
arch-syscall.h.  Tested on Intel Tiger Lake with CET enabled x32.  There
are no regressions with CET enabled x86-64.  There are no changes in CET
enabled x86-64 _dl_start_user.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8344c1f5514b1b5b1c8c6e48f4b802653bd23b71)

x86-64: Remove sysdeps/x86_64/x32/dl-machine.h

Remove sysdeps/x86_64/x32/dl-machine.h by folding x32 ARCH_LA_PLTENTER,
ARCH_LA_PLTEXIT and RTLD_START into sysdeps/x86_64/dl-machine.h.  There
are no regressions on x86-64 nor x32.  There are no changes in x86-64
_dl_start_user.  On x32, _dl_start_user changes are

<_dl_start_user>:
mov    %eax,%r12d
+ mov    %esp,%r13d
mov    (%rsp),%edx
mov    %edx,%esi
- mov    %esp,%r13d
and    $0xfffffff0,%esp
mov    0x0(%rip),%edi        # <_dl_start_user+0x14>
lea    0x8(%r13,%rdx,4),%ecx

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 652c6cf26927352fc0e37e4e60c6fc98ddf6d3b4)

x86/cet: fix shadow stack test scripts

Some shadow stack test scripts use the '==' operator with the 'test'
command to validate exit codes resulting in the following error:

sysdeps/x86_64/tst-shstk-legacy-1e.sh: 31: test: 139: unexpected operator

The '==' operator is invalid for the 'test' command, use '-eq' like the
previous call to 'test'.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 155bb9d036646138348fee0ac045de601811e0c5)

x86-64/cet: Make CET feature check specific to Linux/x86

CET feature bits in TCB, which are Linux specific, are used to check if
CET features are active. Move CET feature check to Linux/x86 directory.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit f2b65a44714e8fa13c7637cd9413169590795879)

i386: Remove CET support bits

1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 874214db624a8e6c5d2dbe47419fab126f330d68)

x86-64/cet: Move check-cet.awk to x86_64

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 7d544dd049a2e3f1480b668f51b72dcc89e376ab)

x86: Move CET infrastructure to x86_64

The CET is only supported for x86_64 and there is no plan to add
kernel support for i386. Move the Makefile rules and files from the
generic x86 folder to x86_64 one.

Checked on x86_64-linux-gnu and i686-linux-gnu.

(cherry picked from commit b7fc4a07f206a640e6d807d72f5c1ee3ea7a25b6)

x86-64/cet: Move dl-cet.[ch] to x86_64 directories

Since CET is only enabled for x86-64, move dl-cet.[ch] to x86_64
directories.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit a1bbee9fd17a84d4b550f8405d5e4d31ff24f87d)

i386: Fail if configured with --enable-cet

Since it is only supported for x86_64.

Checked on i686-linux-gnu.

(cherry picked from commit a0cfc48e8a67506e3f0b2d3ea5e04b45408b3683)

x86-64/cet: Check the restore token in longjmp

setcontext and swapcontext put a restore token on the old shadow stack
which is used to restore the target shadow stack when switching user
contexts.  When longjmp from a user context, the target shadow stack
can be different from the current shadow stack and INCSSP can't be
used to restore the shadow stack pointer to the target shadow stack.

Update longjmp to search for a restore token.  If found, use the token
to restore the shadow stack pointer before using INCSSP to pop the
shadow stack.  Stop the token search and use INCSSP if the shadow stack
entry value is the same as the current shadow stack pointer.

It is a user error if there is a shadow stack switch without leaving a
restore token on the old shadow stack.

The only difference between __longjmp.S and __longjmp_chk.S is that
__longjmp_chk.S has a check for invalid longjmp usages.  Merge
__longjmp.S and __longjmp_chk.S by adding the CHECK_INVALID_LONGJMP
macro.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 35694d3416b273ac19d67ffa49b7969f36684ae1)

i386: Ignore --enable-cet

Since shadow stack is only supported for x86-64, ignore --enable-cet for
i386. Always setting $(enable-cet) for i386 to "no" to support

ifneq ($(enable-cet),no)

in x86 Makefiles. We can't use

ifeq ($(enable-cet),yes)

since $(enable-cet) can be "yes", "no" or "permissive".
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit bbfb54930cdd85269504a34b362e77a3ac2a207a)

x86/cet: Add -fcf-protection=none before -fcf-protection=branch

When shadow stack is enabled, some CET tests failed when compiled with
GCC 14:

FAIL: elf/tst-cet-legacy-4
FAIL: elf/tst-cet-legacy-5a
FAIL: elf/tst-cet-legacy-6a

which are caused by

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113039

These tests use -fcf-protection -fcf-protection=branch and assume that
-fcf-protection=branch will override -fcf-protection. But this GCC 14
commit:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=1c6231c05bdcca

changed the -fcf-protection behavior such that

-fcf-protection -fcf-protection=branch

is treated the same as

-fcf-protection

Use

-fcf-protection -fcf-protection=none -fcf-protection=branch

as the workaround. This fixes BZ #31187.

Tested with GCC 13 and GCC 14 on Intel Tiger Lake.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b5dcccfb12385ee492eb074f6beb9ead56b5e5fd)

x86/cet: Run some CET tests with shadow stack

When CET is disabled by default, run some CET tests with shadow stack
enabled using

$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK

(cherry picked from commit cf9481724bcb86ad4a86cca7befed74bb9cc15eb)

x86/cet: Don't set CET active by default

Not all CET enabled applications and libraries have been properly tested
in CET enabled environments.  Some CET enabled applications or libraries
will crash or misbehave when CET is enabled.  Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not.  Shadow stack can be enabled by

$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK

at run-time if shadow stack can be enabled by kernel.

NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.

(cherry picked from commit 55d63e731253de82e96ed4ddca2e294076cd0bc5)

x86/cet: Check feature_1 in TCB for active IBT and SHSTK

Initially, IBT and SHSTK are marked as active when CPU supports them
and CET are enabled in glibc.  They can be disabled early by tunables
before relocation.  Since after relocation, GLRO(dl_x86_cpu_features)
becomes read-only, we can't update GLRO(dl_x86_cpu_features) to mark
IBT and SHSTK as inactive.  Instead, check the feature_1 field in TCB
to decide if IBT and SHST are active.

(cherry picked from commit d360dcc001cb12504cd3e8dbddee20df6bebb0f8)

x86/cet: Enable shadow stack during startup

Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled.  Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared library are shadow stack enabled.  There
is no need to disable shadow stack at startup.  Shadow stack can only
be enabled in a function which will never return.  Otherwise, shadow
stack will underflow at the function return.

1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable.  Only non-zero
features in GL(dl_x86_feature_1) should be enabled.  After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable.  It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable.  Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.

(cherry picked from commit 541641a3de8d89464151bd879552755e882c832e)

x86/cet: Sync with Linux kernel 6.6 shadow stack interface

Sync with Linux kernel 6.6 shadow stack interface.  Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.

1. When the shadow stack base in TCB is unset, the default shadow stack
is in use.  Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts.  If yes,
INCSSP will be used to unwind shadow stack.  Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall.  Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.

Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.

(cherry picked from commit edb5e0c8f915a798629717b5680a852c8bb3db25)

x86/cet: Don't disable CET if not single threaded

In permissive mode, don't disable IBT nor SHSTK when dlopening a legacy
shared library if not single threaded since IBT and SHSTK may be still
enabled in other threads. Other threads with IBT or SHSTK enabled will
crash when calling functions in the legacy shared library. Instead, an
error will be issued.

(cherry picked from commit 41560a9312ce0ec7203480eef8f865076bff9edb)

x86: Modularize sysdeps/x86/dl-cet.c

Improve readability and make maintenance easier for dl-feature.c by
modularizing sysdeps/x86/dl-cet.c:
1. Support processors with:
   a. Only IBT.  Or
   b. Only SHSTK.  Or
   c. Both IBT and SHSTK.
2. Lock CET features only if IBT or SHSTK are enabled and are not
enabled permissively.

(cherry picked from commit c04035809a393c0c6f1cc523df6b316b05fdb50f)

x86/cet: Update tst-cet-vfork-1

Change tst-cet-vfork-1.c to verify that vfork child return triggers
SIGSEGV due to shadow stack mismatch.

(cherry picked from commit 1a23b39f9d2caeca72dc12adbbcb5d2d632d942a)

x86/cet: Check CPU_FEATURE_ACTIVE in permissive mode

Verify that CPU_FEATURE_ACTIVE works properly in permissive mode.

(cherry picked from commit 4d8a01d2b0963f7c7714ff53c313430599f0722f)

x86/cet: Check legacy shadow stack code in .init_array section

Verify that legacy shadow stack code in .init_array section in application
and shared library, which are marked as shadow stack enabled, will trigger
segfault.

(cherry picked from commit 28bd6f832d4c8ec9a223c153427c1ab6fd19a548)

x86/cet: Add tests for GLIBC_TUNABLES=glibc.cpu.hwcaps=-SHSTK

Verify that GLIBC_TUNABLES=glibc.cpu.hwcaps=-SHSTK turns off shadow
stack properly.

(cherry picked from commit 9424ce80c2a08f4dfc06d5442b770ed5ec798c4b)

x86/cet: Check CPU_FEATURE_ACTIVE when CET is disabled

Verify that CPU_FEATURE_ACTIVE (SHSTK) works properly when CET is
disabled.

(cherry picked from commit 71c0cc3357fe6d72f1dbef1c695e54b117d91b96)

x86/cet: Check legacy shadow stack applications

Add tests to verify that legacy shadow stack applications run properly
when shadow stack is enabled in Linux kernel.

(cherry picked from commit f418fe6f973300c4c61461ed241928cba11017c2)

x86/cet: Don't assume that SHSTK implies IBT

Since shadow stack (SHSTK) is enabled in the Linux kernel without
enabling indirect branch tracking (IBT), don't assume that SHSTK
implies IBT. Use "CPU_FEATURE_ACTIVE (IBT)" to check if IBT is active
and "CPU_FEATURE_ACTIVE (SHSTK)" to check if SHSTK is active.

(cherry picked from commit 442983319ba70de801fc856e8dd4748fba8f7f1b)

x86/cet: Check user_shstk in /proc/cpuinfo

Linux kernel reports CPU shadow stack feature in /proc/cpuinfo as
user_shstk, instead of shstk.

(cherry picked from commit 0b850186fd3177311f10dcb938b668cc750fa3be)

Update syscall lists for Linux 6.7

Linux 6.7 adds the futex_requeue, futex_wait and futex_wake syscalls,
and enables map_shadow_stack for architectures previously missing it.
Update syscall-names.list and regenerate the arch-syscall.h headers
with build-many-glibcs.py update-syscalls.

Tested with build-many-glibcs.py.

(cherry picked from commit df11c05be91fda5ef490c76fd0d4a53821750116)

Update syscall lists for Linux 6.6

Linux 6.6 has one new syscall for all architectures, fchmodat2, and
the map_shadow_stack on x86_64.

(cherry picked from commit 582383b37d95b133c1ee6855ffaa2b1f5cb3d3b8)

Remove installed header rule on $(..)include/%.h

On x86-64 machine with

[hjl@gnu-cfl-3 x86-glibc]$ ls -l /usr/include/asm/prctl.h sysdeps/unix/sysv/linux/x86_64/include/asm/prctl.h
-rw-r--r-- 1 hjl  hjl   825 Jan  9 09:41 sysdeps/unix/sysv/linux/x86_64/include/asm/prctl.h
-rw-r--r-- 1 root root 1170 Nov 27 16:00 /usr/include/asm/prctl.h
[hjl@gnu-cfl-3 x86-glibc]$

glibc configured with --enable-cet build failed:

make[2]: Entering directory '/export/gnu/import/git/gitlab/x86-glibc/iconv'
../Makerules:327: update target
'/export/build/gnu/tools-build/glibc-cet-gitlab/build-x86_64-linux/gnu/lib-names-64.h'
due to: /export/build/gnu/tools-build/glibc-cet-gitlab/build-x86_64-linux/gnu/lib-names-64.stmp
:
../Makeconfig:1216: update target
'/export/build/gnu/tools-build/glibc-cet-gitlab/build-x86_64-linux/libc-modules.h'
due to: /export/build/gnu/tools-build/glibc-cet-gitlab/build-x86_64-linux/libc-modules.stmp
:
../Makerules:1126: update target '/usr/include/asm/prctl.h' due to:
../sysdeps/unix/sysv/linux/x86_64/64/../include/asm/prctl.h
force-install
/usr/bin/install -c -m 644
../sysdeps/unix/sysv/linux/x86_64/64/../include/asm/prctl.h
/usr/include/asm/prctl.h
/usr/bin/install: cannot remove '/usr/include/asm/prctl.h': Permission denied
make[2]: *** [../Makerules:1126: /usr/include/asm/prctl.h] Error 1
make[2]: Leaving directory '/export/gnu/import/git/gitlab/x86-glibc/iconv'
make[1]: *** [Makefile:484: iconv/subdir_lib] Error 2
make[1]: Leaving directory '/export/gnu/import/git/gitlab/x86-glibc'
make: *** [Makefile:9: all] Error 2

This is triggered by the rule in Makerules:

$(inst_includedir)/%.h: $(..)include/%.h $(+force)
  $(do-install)

Since no files under include/ should be installed, remove it from
Makerules.

Tested it on x86-64.  There are no differences in the installed header
files.

(cherry picked from commit 1eae989cb7632760fd6f4008be73549da861b202)

debug: Fix tst-longjmp_chk3 build failure on Hurd

Explicitly include <unistd.h> for _exit and getpid.

(cherry picked from commit 4836a9af89f1b4d482e6c72ff67e36226d36434c)

debug: Wire up tst-longjmp_chk3

The test was added in commit ac8cc9e300a002228eb7e660df3e7b333d9a7414
without all the required Makefile scaffolding. Tweak the test
so that it actually builds (including with dynamic SIGSTKSZ).

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 4b7cfcc3fbfab55a1bbb32a2da69c048060739d6)

debug: Adapt fortify tests to libsupport

Checked on aarch64, armhf, x86_64, and i686.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 9556acd249687ac562deb6309503165d66eb06fa)

x86-64: Save APX registers in ld.so trampoline

Add APX registers to STATE_SAVE_MASK so that APX registers are saved in
ld.so trampoline. This fixes BZ #31371.

Also update STATE_SAVE_OFFSET and STATE_SAVE_MASK for i386 which will
be used by i386 _dl_tlsdesc_dynamic.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit dfb05f8e704edac70db38c4c8ee700769d91a413)

i386: Remove CET support

CET is only support for x86_64, this patch reverts:

  - faaee1f07ed x86: Support shadow stack pointer in setjmp/longjmp.
  - be9ccd27c09 i386: Add _CET_ENDBR to indirect jump targets in
    add_n.S/sub_n.S
  - c02695d7764 x86/CET: Update vfork to prevent child return
  - 5d844e1b725 i386: Enable CET support in ucontext functions
  - 124bcde683 x86: Add _CET_ENDBR to functions in crti.S
  - 562837c002 x86: Add _CET_ENDBR to functions in dl-tlsdesc.S
  - f753fa7dea x86: Support IBT and SHSTK in Intel CET [BZ #21598]
  - 825b58f3fb i386-mcount.S: Add _CET_ENDBR to _mcount and __fentry__
  - 7e119cd582 i386: Use _CET_NOTRACK in i686/memcmp.S
  - 177824e232 i386: Use _CET_NOTRACK in memcmp-sse4.S
  - 0a899af097 i386: Use _CET_NOTRACK in memcpy-ssse3-rep.S
  - 7fb613361c i386: Use _CET_NOTRACK in memcpy-ssse3.S
  - 77a8ae0948 i386: Use _CET_NOTRACK in memset-sse2-rep.S
  - 00e7b76a8f i386: Use _CET_NOTRACK in memset-sse2.S
  - 90d15dc577 i386: Use _CET_NOTRACK in strcat-sse2.S
  - f1574581c7 i386: Use _CET_NOTRACK in strcpy-sse2.S
  - 4031d7484a i386/sub_n.S: Add a missing _CET_ENDBR to indirect jump
  - target
  -
Checked on i686-linux-gnu.

(cherry picked from commit 25f1e16ef03a6a8fb1701c4647d46c564480d88c)

x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212]

When the linker -z mark-plt option is used to add DT_X86_64_PLT,
DT_X86_64_PLTSZ and DT_X86_64_PLTENT, the r_addend field of the
R_X86_64_JUMP_SLOT relocation stores the offset of the indirect
branch instruction.  However, glibc versions without the commit:

commit f8587a61892cbafd98ce599131bf4f103466f084
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 19:21:48 2022 -0700

    x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT

    According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
    and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
    can ignore their r_addends.

Reviewed-by: Fangrui Song <maskray@google.com>
won't ignore the r_addend value in the R_X86_64_JUMP_SLOT relocation.
Such programs and shared libraries will fail at run-time randomly.

Add GLIBC_ABI_DT_X86_64_PLT version to indicate that glibc is compatible
with DT_X86_64_PLT.

The linker can add the glibc GLIBC_ABI_DT_X86_64_PLT version dependency
whenever -z mark-plt is passed to the linker.  The resulting programs and
shared libraries will fail to load at run-time against libc.so without the
GLIBC_ABI_DT_X86_64_PLT version, instead of fail randomly.

This fixes BZ #33212.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 399384e0c8193e31aea014220ccfa24300ae5938)

elf: Handle ld.so with LOAD segment gaps in _dl_find_object (bug 31943)

Detect if ld.so not contiguous and handle that case in _dl_find_object.
Set l_find_object_processed even for initially loaded link maps,
otherwise dlopen of an initially loaded object adds it to
_dlfo_loaded_mappings (where maps are expected to be contiguous),
in addition to _dlfo_nodelete_mappings.

Test elf/tst-link-map-contiguous-ldso iterates over the loader
image, reading every word to make sure memory is actually mapped.
It only does that if the l_contiguous flag is set for the link map.
Otherwise, it finds gaps with mmap and checks that _dl_find_object
does not return the ld.so mapping for them.

The test elf/tst-link-map-contiguous-main does the same thing for
the libc.so shared object. This only works if the kernel loaded
the main program because the glibc dynamic loader may fill
the gaps with PROT_NONE mappings in some cases, making it contiguous,
but accesses to individual words may still fault.

Test elf/tst-link-map-contiguous-libc is again slightly different
because the dynamic loader always fills the gaps with PROT_NONE
mappings, so a different form of probing has to be used.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 20681be149b9eb1b6c1f4246bf4bd801221c86cd)

elf: Extract rtld_setup_phdr function from dl_main

Remove historic binutils reference from comment and update
how this data is used by applications.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 2cac9559e06044ba520e785c151fbbd25011865f)

elf: Do not add a copy of _dl_find_object to libc.so

This reduces code size and dependencies on ld.so internals from
libc.so.

Fixes commit f4c142bb9fe6b02c0af8cfca8a920091e2dba44b
("arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405)").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 96429bcc91a14f71b177ddc5e716de3069060f2c)

arm: Use _dl_find_object on __gnu_Unwind_Find_exidx (BZ 31405)

Instead of __dl_iterate_phdr. On ARM dlfo_eh_frame/dlfo_eh_count
maps to PT_ARM_EXIDX vaddr start / length.

On a Neoverse N1 machine with 160 cores, the following program:

  $ cat test.c
  #include <stdlib.h>
  #include <pthread.h>
  #include <assert.h>

  enum {
    niter = 1024,
    ntimes = 128,
  };

  static void *
  tf (void *arg)
  {
    int a = (int) arg;

    for (int i = 0; i < niter; i++)
      {
        void *p[ntimes];
        for (int j = 0; j < ntimes; j++)
   p[j] = malloc (a * 128);
        for (int j = 0; j < ntimes; j++)
   free (p[j]);
      }

    return NULL;
  }

  int main (int argc, char *argv[])
  {
    enum { nthreads = 16 };
    pthread_t t[nthreads];

    for (int i = 0; i < nthreads; i ++)
      assert (pthread_create (&t[i], NULL, tf, (void *) i) == 0);

    for (int i = 0; i < nthreads; i++)
      {
        void *r;
        assert (pthread_join (t[i], &r) == 0);
        assert (r == NULL);
      }

    return 0;
  }
  $ arm-linux-gnueabihf-gcc -fsanitize=address test.c -o test

Improves from ~15s to 0.5s.

Checked on arm-linux-gnueabihf.

(cherry picked from commit f4c142bb9fe6b02c0af8cfca8a920091e2dba44b)

posix: Fix double-free after allocation failure in regcomp (bug 33185)

If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.

Reported-by: Anastasia Belova <abelova@astralinux.ru>
Co-authored-by: Paul Eggert <eggert@cs.ucla.edu>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e994093fa0bcca0d0ee2c1db271d8d7885d)

malloc: cleanup casts in tst-calloc

Followup to c3d1dac96bdd10250aa37bb367d5ef8334a093a1. As pointed out by
Maciej W. Rozycki, the casts are obviously useless now.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit fc8f253d808ade5e97c93b363bd1932023e770ba)

malloc: obscure calloc use in tst-calloc

Similar to a9944a52c967ce76a5894c30d0274b824df43c7a and
f9493a15ea9cfb63a815c00c23142369ec09d8ce, we need to hide calloc use from
the compiler to accommodate GCC's r15-6566-g804e9d55d9e54c change.

First, include tst-malloc-aux.h, but then use `volatile` variables
for size.

The test passes without the tst-malloc-aux.h change but IMO we want
it there for consistency and to avoid future problems (possibly silent).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c3d1dac96bdd10250aa37bb367d5ef8334a093a1)

malloc: add indirection for malloc(-like) functions in tests [BZ #32366]

GCC 15 introduces allocation dead code removal (DCE) for PR117370 in
r15-5255-g7828dc070510f8. This breaks various glibc tests which want
to assert various properties of the allocator without doing anything
obviously useful with the allocated memory.

Alexander Monakov rightly pointed out that we can and should do better
than passing -fno-malloc-dce to paper over the problem. Not least because
GCC 14 already does such DCE where there's no testing of malloc's return
value against NULL, and LLVM has such optimisations too.

Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure that we're calling malloc (et. al) from the
compiler.

Reviewed-by: Paul Eggert <eggert@cs.ucla.edu>
(cherry picked from commit a9944a52c967ce76a5894c30d0274b824df43c7a)

nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)

[BZ #25847]

The new initializer and struct layout does not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.

(cherry picked from commit dbc5a50d12eff4cb3f782129029d04b8a76f58e7)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Use all of g1_start and g_signals

[BZ #25847]

The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:

"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"

I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that, (and even if it did, why would you want to force a
sleep in that case?) and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.

(cherry picked from commit 91bb902f58264a2fd50fbce8f39a9a290dd23706)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: rename __condvar_quiesce_and_switch_g1

[BZ #25847]

This function no longer waits for threads to leave g1, so rename it to
__condvar_switch_g1

(cherry picked from commit 4b79e27a5073c02f6bff9aa8f4791230a0ab1867)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Fix indentation

[BZ #25847]

In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.

(cherry picked from commit ee6c14ed59d480720721aaacc5fb03213dc153da)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Use a single loop in pthread_cond_wait instaed of a nested loop

[BZ #25847]

The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.

(cherry picked from commit 929a4764ac90382616b6a21f099192b2475da674)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Remove g_refs from condition variables

[BZ #25847]

This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.

(cherry picked from commit c36fc50781995e6758cae2b6927839d0157f213c)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Remove unnecessary quadruple check in pthread_cond_wait

[BZ #25847]

pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:

1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
   for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this
3. After incrementing grefs. There is no reason to think that the group would
   close while we do an atomic increment. Obviously it could close at any
   point, but that doesn't mean we have to recheck after every step. This
   check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
   we check that we're not in a closed group. There is no need to check again.
   The check would only have helped in cases where the compare_exchange in the
   next line would also have failed. Relying on the compare_exchange is fine.

Removing the duplicate checks clarifies the code.

(cherry picked from commit 4f7b051f8ee3feff1b53b27a906f245afaa9cee1)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Remove unnecessary catch-all-wake in condvar group switch

[BZ #25847]

This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.

In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.

(cherry picked from commit b42cc6af11062c260c7dfa91f1c89891366fed3e)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

nptl: Update comments and indentation for new condvar implementation

[BZ #25847]

Some comments were wrong after the most recent commit. This fixes that.

Also fixing indentation where it was using spaces instead of tabs.

(cherry picked from commit 0cc973160c23bb67f895bc887dd6942d29f8fee3)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

pthreads NPTL: lost wakeup fix 2

[BZ #25847]

This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch.  This greatly reduces the average
and maximum latency for pthread_cond_signal.

The g_signals[] field now contains a signal count that is relative to
the current g1_start value.  Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)

By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same.  This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change.  This removes the race condition and A/B/A
problems that otherwise ocurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.

By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore.  The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit.  If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.

The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters.  Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2.  They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.

Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block.  This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.

(cherry picked from commit 1db84775f831a1494993ce9c118deaf9537cc50a)

Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

Fix error reporting (false negatives) in SGID tests

And simplify the interface of support_capture_subprogram_self_sgid.

Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion.  Handle all execution
errors within support_capture_subprogram_self_sgid.  In particular,
this includes test failures because the invoked program did not
exit with exit status zero.  Existing tests that expect exit
status 42 are adjusted to use zero instead.

In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).

Fixes commit 35fc356fa3b4f485bd3ba3114c9f774e5df7d3c2
("elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 3a3fb2ed83f79100c116c824454095ecfb335ad7)

support: Pick group in support_capture_subprogram_self_sgid if UID == 0

When running as root, it is likely that we can run under any group.
Pick a harmless group from /etc/group in this case.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2f769cec448d84a62b7dd0d4ff56978fe22c0cd6)

elf: Fix subprocess status handling for tst-dlopen-sgid (bug 32987)

This should really move into support_capture_subprogram_self_sgid.

Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 35fc356fa3b4f485bd3ba3114c9f774e5df7d3c2)

x86_64: Fix typo in ifunc-impl-list.c.

Fix wcsncpy and wcpncpy typo in ifunc-impl-list.c.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit f2aeb6ff941dccc4c777b5621e77addea6cc076c)

elf: Test case for bug 32976 (CVE-2025-4802)

Check that LD_LIBRARY_PATH is ignored for AT_SECURE statically
linked binaries, using support_capture_subprogram_self_sgid.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit d8f7a79335b0d861c12c42aec94c04cd5bb181e2)

support: Add support_record_failure_barrier

This can be used to stop execution after a TEST_COMPARE_BLOB
failure, for example.

(cherry picked from commit d0b8aa6de4529231fadfe604ac2c434e559c2d9e)

support: Use const char * argument in support_capture_subprogram_self_sgid

The function does not modify the passed-in string, so make this clear
via the prototype.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit f0c09fe61678df6f7f18fe1ebff074e62fa5ca7a)

elf: Ignore LD_LIBRARY_PATH and debug env var for setuid for static

It mimics the ld.so behavior.

Checked on x86_64-linux-gnu.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 5451fa962cd0a90a0e2ec1d8910a559ace02bba0)

Changes:

git/elf/dl-support.c
(missing commit 55f41ef8de4a4d0c5762d78659e11202d3c765d4
("elf: Remove LD_PROFILE for static binaries"))

math: Improve layout of exp/exp10 data

GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp.

The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)

AArch64: Use prefer_sve_ifuncs for SVE memset

Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)

AArch64: Add SVE memset

Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes. Performance of random memset benchmark improves
by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)

math: Improve layout of expf data

GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080fe6a9539f0d2345b9d2ae37b8ee57a)

AArch64: Remove zva_128 from memset

Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit
e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f967531a77e1824c23b5368c6434a72d)

AArch64: Optimize memset

Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef32412779e207f825db0d057ebb4628ae8)

AArch64: Improve generic strlen

Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642dcafdbc11a99f2767e081d086f5fc7)

assert: Add test for CVE-2025-0395

Use the __progname symbol to override the program name to induce the
failure that CVE-2025-0395 describes.

This is related to BZ #32582

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cdb9ba84191ce72e86346fb8b1d906e7cd930ea2)

stdlib: Test using setenv with updated environ [BZ #32588]

Add a test for setenv with updated environ. Verify that BZ #32588 is
fixed.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8ab34497de14e35aff09b607222fe1309ef156da)

Fix underallocation of abort_msg_s struct (CVE-2025-0395)

Include the space needed to store the length of the message itself, in
addition to the message string. This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704cb81e9ad0a78c644a83e1e9cd2ee578)

elf: Support recursive use of dynamic TLS in interposed malloc

It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS).  The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases unfortunatelly.

This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update.  This
is possible because an interposed malloc has a fixed module ID and
TLS slot.  (It cannot be unloaded.)  If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc.  It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.

The bookkeeping is done using a global counter instead of per-thread
flag because TLS access in the dynamic linker is tricky.

All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.

Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 018f0fc3b818d4d1460a4e2384c24802504b1d20)

elf: Avoid some free (NULL) calls in _dl_update_slotinfo

This has been confirmed to work around some interposed mallocs.  Here
is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper
in lttng-tools:

  New TLS usage in libgcc_s.so.1, compatibility impact
  <https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/>

Reportedly, this patch also papers over a similar issue when tcmalloc
2.9.1 is not compiled with -ftls-model=initial-exec.  Of course the
goal really should be to compile mallocs with the initial-exec TLS
model, but this commit appears to be a useful interim workaround.

Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit afe42e935b3ee97bac9a7064157587777259c60e)

x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]

The loop should be aligned to 32-bytes so that it can ideally run out
the DSB. This is particularly important on Skylake-Server where
deficiencies in it's DSB implementation make it prone to not being
able to run loops out of the DSB.

For example running strcmp-evex on 200Mb string:

32-byte aligned loop:
    - 43,399,578,766      idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704       idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).

NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignmnent, might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.

Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.

The values being reported is the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (most on the
page-cross logic). These should be safe to ignore because 1) We
previously only 16-byte aligned the function so this behavior is not
new and was essentially up to chance before this patch and 2) this
type of alignment related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any affect
on realworld performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d3211532903d7e790211af5a1d55fdb1f3)

x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bf0ab80573d66e4ae5d94b094659094336da90f)

x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4)

This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex. However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.

Reported-by: Kenton Groombridge
Link: https://bugs.gentoo.org/921945
Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family")
Signed-off-by: Gabi Falk <gabifalk@gmx.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit dd5f891c1ad9f1b43b9db93afe2a55cbb7a6194e)

sysdeps/x86/Makefile: Split and sort tests

Put each test on a separate line and sort tests.

(cherry picked from commit 7e03e0de7e7c2de975b5c5e18f5a4b0c75816674)

x86: Only align destination to 1x VEC_SIZE in memset 4x loop

Current code aligns to 2x VEC_SIZE. Aligning to 2x has no affect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)

elf: Fix slow tls access after dlopen [BZ #19924]

In short: __tls_get_addr checks the global generation counter and if
the current dtv is older then _dl_update_slotinfo updates dtv up to the
generation of the accessed module. So if the global generation is newer
than generation of the module then __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.

It may be possible to detect up-to-date dtv faster. But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.

This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.

This patch uses acquire/release synchronization when accessing the
generation counter.

Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.

I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit d2123d68275acc0f061e73d5f86ca504e0d5a344)

x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]

The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1 which indicated the number of rounds of
CPUDID leaf 2 was needed to retrieve the complete cache information. The
newer Intel manual has been changed to that it should always return 1
and be ignored.  If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead.  If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all.  This addresses BZ #30643.

(cherry picked from commit 1493622f4f9048ffede3fbedb64695efa49d662a)

x86_64: Add log1p with FMA

On Skylake, it changes log1p bench performance by:

        Before       After     Improvement
max     63.349       58.347       8%
min     4.448        5.651        -30%
mean    12.0674      10.336       14%

The minimum code path is

if (hx < 0x3FDA827A)                          /* x < 0.41422  */
    {
      if (__glibc_unlikely (ax >= 0x3ff00000))           /* x <= -1.0 */
        {
   ...
        }
      if (__glibc_unlikely (ax < 0x3e200000))           /* |x| < 2**-29 */
        {
          math_force_eval (two54 + x);          /* raise inexact */
          if (ax < 0x3c900000)                  /* |x| < 2**-54 */
            {
      ...
            }
          else
            return x - x * x * 0.5;

FMA and non-FMA code sequences look similar.  Non-FMA version is slightly
faster.  Since log1p is called by asinh and atanh, it improves asinh
performance by:

        Before       After     Improvement
max     75.645       63.135       16%
min     10.074       10.071       0%
mean    15.9483      14.9089      6%

and improves atanh performance by:

        Before       After     Improvement
max     91.768       75.081       18%
min     15.548       13.883       10%
mean    18.3713      16.8011      8%

(cherry picked from commit a8ecb126d4c26c52f4ad828c566afe4043a28155)

x86_64: Add expm1 with FMA

On Skylake, it improves expm1 bench performance by:

        Before       After     Improvement
max     70.204       68.054       3%
min     20.709       16.2         22%
mean    22.1221      16.7367      24%

NB: Add

extern long double __expm1l (long double);
extern long double __expm1f128 (long double);

for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.

(cherry picked from commit 1b214630ce6f7e0099b8b6f87246246739b079cf)

x86_64: Add log2 with FMA

On Skylake, it improves log2 bench performance by:

        Before       After     Improvement
max     208.779      63.827       69%
min     9.977        6.55         34%
mean    10.366       6.8191       34%

(cherry picked from commit f6b10ed8e9a00de49d0951e760cc2b5288862b47)

x86_64: Sort fpu/multiarch/Makefile

Sort Makefile variables using scripts/sort-makefile-lines.py.

No code generation changes observed in libm. No regressions on x86_64.

(cherry picked from commit 881546979d0219c18337e1b4f4d00cfacab13c40)