Jiamei Xie [Tue, 14 Oct 2025 12:14:11 +0000 (20:14 +0800)]
x86: fix wmemset ifunc stray '!' (bug 33542)
The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:
if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
&& X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
AVX_Fast_Unaligned_Load, !))
This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.
Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.
Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.
Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirm the selected symbol is no longer SSE2.
Signed-off-by: xiejiamei <xiejiamei@hygon.com> Signed-off-by: Li jing <lijing@hygon.cn> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 4d86b6cdd8132e0410347e07262239750f86dfb4)
Florian Weimer [Fri, 28 Mar 2025 08:26:06 +0000 (09:26 +0100)]
x86: Skip XSAVE state size reset if ISA level requires XSAVE
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.
Sunil K Pandey [Fri, 1 Mar 2024 01:57:02 +0000 (17:57 -0800)]
x86-64: Simplify minimum ISA check ifdef conditional with if
Replace minimum ISA check ifdef conditional with if. Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile time constants,
compiler will perform constant folding optimization, getting same
results.
H.J. Lu [Wed, 28 Feb 2024 17:51:14 +0000 (09:51 -0800)]
x86-64: Don't use SSE resolvers for ISA level 3 or above
When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:
ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value
For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor
_dl_tlsdesc_dynamic_fxsave.
Add GLIBC_ABI_GNU_TLS version to indicate that glibc has the working
GNU TLS run-time. Linker can add the GLIBC_ABI_GNU_TLS version to
binaries which depend on the working TLS run-time so that such programs
and shared libraries will fail to load and run at run-time against
libc.so without the GLIBC_ABI_GNU_TLS version, instead of fail silently
at random.
affects both i386 and x86-64, also add GLIBC_ABI_GNU2_TLS version to i386
to indicate the working GNU2 TLS run-time. For x86-64, the additional
GNU2 TLS run-time bug fix is needed for
Add GLIBC_ABI_GNU2_TLS version to indicate that glibc has the working
GNU2 TLS run-time. Linker can add the GLIBC_ABI_GNU2_TLS version to
binaries which depend on the working GNU2 TLS run-time:
so that such programs and shared libraries will fail to load and run at
run-time against libc.so without the GLIBC_ABI_GNU2_TLS version, instead
of fail silently at random.
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL. But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.
1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:
_dl_tlsdesc_dynamic:
/* Like all TLS resolvers, preserve call-clobbered registers.
We need two scratch regs anyway. */
subl $32, %esp
cfi_adjust_cfa_offset (32)
4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.
This fixes BZ #32996.
Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org> Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 848f0e46f03f22404ed9a8aabf3fd5ce8809a1be)
Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.
Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 70b648855185e967e54668b101d24704c3fb869d)
Florian Weimer [Fri, 28 Mar 2025 08:26:59 +0000 (09:26 +0100)]
x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
H.J. Lu [Mon, 18 Mar 2024 13:40:16 +0000 (06:40 -0700)]
x86-64: Allocate state buffer space for RDI, RSI and RBX
_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11. Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).
+==================+<- stack frame start aligned at 8 or 16 bytes
| |<- RDI saved in the red zone
| |<- RSI saved in the red zone
| |<- RBX saved in the red zone
| |<- paddings for stack realignment of 64 bytes
|------------------|<- xsave buffer end aligned at 64 bytes
| |<-
| |<-
| |<-
|------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
| |<- 8-byte padding for 64-byte alignment
| |<- 8-byte padding for 64-byte alignment
| |<- R11
| |<- R10
| |<- R9
| |<- R8
| |<- RDX
| |<- RCX
+==================+<- RSP aligned at 64 bytes
Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
and RBX are saved onto stack without adjusting stack pointer first, using
the red-zone. This fixes BZ #31501. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 717ebfa85c8240d32d0d19d86a484c31c55c9617)
H.J. Lu [Wed, 28 Feb 2024 20:08:03 +0000 (12:08 -0800)]
x86-64: Update _dl_tlsdesc_dynamic to preserve AMX registers
_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved. Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK. Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec. This fixes
the AMX part of BZ #31372. Tested on AMX processor.
AMX test is enabled only for compilers with the fix for
GCC 14 and GCC 11/12/13 branches have the bug fix. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9b7091415af47082664717210ac49d51551456ab)
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS and RAX/EAX, are unchanged after CALL. When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path. __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers. _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers. Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers. This
fixes BZ #31372.
Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 0aac205a814a8511e98d02b91a8dc908f1c53cde)
1. Add shadow stack support to x32 signal.
2. Use the 64-bit map_shadow_stack syscall for x32.
3. Set up shadow stack for x32.
Add the map_shadow_stack system call to <fixup-asm-unistd.h> and regenerate
arch-syscall.h. Tested on Intel Tiger Lake with CET enabled x32. There
are no regressions with CET enabled x86-64. There are no changes in CET
enabled x86-64 _dl_start_user.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 8344c1f5514b1b5b1c8c6e48f4b802653bd23b71)
H.J. Lu [Tue, 23 Jul 2024 00:47:21 +0000 (17:47 -0700)]
x86-64: Remove sysdeps/x86_64/x32/dl-machine.h
Remove sysdeps/x86_64/x32/dl-machine.h by folding x32 ARCH_LA_PLTENTER,
ARCH_LA_PLTEXIT and RTLD_START into sysdeps/x86_64/dl-machine.h. There
are no regressions on x86-64 nor x32. There are no changes in x86-64
_dl_start_user. On x32, _dl_start_user changes are
The '==' operator is invalid for the 'test' command, use '-eq' like the
previous call to 'test'.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 155bb9d036646138348fee0ac045de601811e0c5)
H.J. Lu [Wed, 10 Jan 2024 16:48:47 +0000 (08:48 -0800)]
x86-64/cet: Make CET feature check specific to Linux/x86
CET feature bits in TCB, which are Linux specific, are used to check if
CET features are active. Move CET feature check to Linux/x86 directory. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit f2b65a44714e8fa13c7637cd9413169590795879)
H.J. Lu [Tue, 9 Jan 2024 20:23:27 +0000 (12:23 -0800)]
i386: Remove CET support bits
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 874214db624a8e6c5d2dbe47419fab126f330d68)
The CET is only supported for x86_64 and there is no plan to add
kernel support for i386. Move the Makefile rules and files from the
generic x86 folder to x86_64 one.
H.J. Lu [Tue, 9 Jan 2024 20:23:25 +0000 (12:23 -0800)]
x86-64/cet: Move dl-cet.[ch] to x86_64 directories
Since CET is only enabled for x86-64, move dl-cet.[ch] to x86_64
directories. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit a1bbee9fd17a84d4b550f8405d5e4d31ff24f87d)
H.J. Lu [Tue, 2 Jan 2024 15:03:29 +0000 (07:03 -0800)]
x86-64/cet: Check the restore token in longjmp
setcontext and swapcontext put a restore token on the old shadow stack
which is used to restore the target shadow stack when switching user
contexts. When longjmp from a user context, the target shadow stack
can be different from the current shadow stack and INCSSP can't be
used to restore the shadow stack pointer to the target shadow stack.
Update longjmp to search for a restore token. If found, use the token
to restore the shadow stack pointer before using INCSSP to pop the
shadow stack. Stop the token search and use INCSSP if the shadow stack
entry value is the same as the current shadow stack pointer.
It is a user error if there is a shadow stack switch without leaving a
restore token on the old shadow stack.
The only difference between __longjmp.S and __longjmp_chk.S is that
__longjmp_chk.S has a check for invalid longjmp usages. Merge
__longjmp.S and __longjmp_chk.S by adding the CHECK_INVALID_LONGJMP
macro. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 35694d3416b273ac19d67ffa49b7969f36684ae1)
H.J. Lu [Wed, 3 Jan 2024 20:09:23 +0000 (12:09 -0800)]
i386: Ignore --enable-cet
Since shadow stack is only supported for x86-64, ignore --enable-cet for
i386. Always setting $(enable-cet) for i386 to "no" to support
ifneq ($(enable-cet),no)
in x86 Makefiles. We can't use
ifeq ($(enable-cet),yes)
since $(enable-cet) can be "yes", "no" or "permissive". Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit bbfb54930cdd85269504a34b362e77a3ac2a207a)
Tested with GCC 13 and GCC 14 on Intel Tiger Lake. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit b5dcccfb12385ee492eb074f6beb9ead56b5e5fd)
H.J. Lu [Fri, 29 Dec 2023 16:43:53 +0000 (08:43 -0800)]
x86/cet: Don't set CET active by default
Not all CET enabled applications and libraries have been properly tested
in CET enabled environments. Some CET enabled applications or libraries
will crash or misbehave when CET is enabled. Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not. Shadow stack can be enabled by
$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK
at run-time if shadow stack can be enabled by kernel.
NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.
H.J. Lu [Fri, 29 Dec 2023 16:43:52 +0000 (08:43 -0800)]
x86/cet: Check feature_1 in TCB for active IBT and SHSTK
Initially, IBT and SHSTK are marked as active when CPU supports them
and CET are enabled in glibc. They can be disabled early by tunables
before relocation. Since after relocation, GLRO(dl_x86_cpu_features)
becomes read-only, we can't update GLRO(dl_x86_cpu_features) to mark
IBT and SHSTK as inactive. Instead, check the feature_1 field in TCB
to decide if IBT and SHST are active.
H.J. Lu [Fri, 29 Dec 2023 16:43:51 +0000 (08:43 -0800)]
x86/cet: Enable shadow stack during startup
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled. Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared library are shadow stack enabled. There
is no need to disable shadow stack at startup. Shadow stack can only
be enabled in a function which will never return. Otherwise, shadow
stack will underflow at the function return.
1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable. Only non-zero
features in GL(dl_x86_feature_1) should be enabled. After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable. It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable. Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
H.J. Lu [Fri, 29 Dec 2023 16:43:49 +0000 (08:43 -0800)]
x86/cet: Sync with Linux kernel 6.6 shadow stack interface
Sync with Linux kernel 6.6 shadow stack interface. Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.
1. When the shadow stack base in TCB is unset, the default shadow stack
is in use. Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts. If yes,
INCSSP will be used to unwind shadow stack. Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall. Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.
Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
H.J. Lu [Fri, 28 Jul 2023 21:06:01 +0000 (14:06 -0700)]
x86/cet: Don't disable CET if not single threaded
In permissive mode, don't disable IBT nor SHSTK when dlopening a legacy
shared library if not single threaded since IBT and SHSTK may be still
enabled in other threads. Other threads with IBT or SHSTK enabled will
crash when calling functions in the legacy shared library. Instead, an
error will be issued.
H.J. Lu [Fri, 24 Mar 2023 20:20:06 +0000 (13:20 -0700)]
x86: Modularize sysdeps/x86/dl-cet.c
Improve readability and make maintenance easier for dl-feature.c by
modularizing sysdeps/x86/dl-cet.c:
1. Support processors with:
a. Only IBT. Or
b. Only SHSTK. Or
c. Both IBT and SHSTK.
2. Lock CET features only if IBT or SHSTK are enabled and are not
enabled permissively.
H.J. Lu [Wed, 22 Mar 2023 20:34:55 +0000 (13:34 -0700)]
x86/cet: Check legacy shadow stack code in .init_array section
Verify that legacy shadow stack code in .init_array section in application
and shared library, which are marked as shadow stack enabled, will trigger
segfault.
H.J. Lu [Sat, 16 Dec 2023 16:53:12 +0000 (08:53 -0800)]
x86/cet: Don't assume that SHSTK implies IBT
Since shadow stack (SHSTK) is enabled in the Linux kernel without
enabling indirect branch tracking (IBT), don't assume that SHSTK
implies IBT. Use "CPU_FEATURE_ACTIVE (IBT)" to check if IBT is active
and "CPU_FEATURE_ACTIVE (SHSTK)" to check if SHSTK is active.
Joseph Myers [Wed, 17 Jan 2024 15:38:54 +0000 (15:38 +0000)]
Update syscall lists for Linux 6.7
Linux 6.7 adds the futex_requeue, futex_wait and futex_wake syscalls,
and enables map_shadow_stack for architectures previously missing it.
Update syscall-names.list and regenerate the arch-syscall.h headers
with build-many-glibcs.py update-syscalls.
Florian Weimer [Mon, 25 Nov 2024 16:32:54 +0000 (17:32 +0100)]
debug: Wire up tst-longjmp_chk3
The test was added in commit ac8cc9e300a002228eb7e660df3e7b333d9a7414
without all the required Makefile scaffolding. Tweak the test
so that it actually builds (including with dynamic SIGSTKSZ).
H.J. Lu [Fri, 16 Feb 2024 15:17:10 +0000 (07:17 -0800)]
x86-64: Save APX registers in ld.so trampoline
Add APX registers to STATE_SAVE_MASK so that APX registers are saved in
ld.so trampoline. This fixes BZ #31371.
Also update STATE_SAVE_OFFSET and STATE_SAVE_MASK for i386 which will
be used by i386 _dl_tlsdesc_dynamic. Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit dfb05f8e704edac70db38c4c8ee700769d91a413)
CET is only support for x86_64, this patch reverts:
- faaee1f07ed x86: Support shadow stack pointer in setjmp/longjmp.
- be9ccd27c09 i386: Add _CET_ENDBR to indirect jump targets in
add_n.S/sub_n.S
- c02695d7764 x86/CET: Update vfork to prevent child return
- 5d844e1b725 i386: Enable CET support in ucontext functions
- 124bcde683 x86: Add _CET_ENDBR to functions in crti.S
- 562837c002 x86: Add _CET_ENDBR to functions in dl-tlsdesc.S
- f753fa7dea x86: Support IBT and SHSTK in Intel CET [BZ #21598]
- 825b58f3fb i386-mcount.S: Add _CET_ENDBR to _mcount and __fentry__
- 7e119cd582 i386: Use _CET_NOTRACK in i686/memcmp.S
- 177824e232 i386: Use _CET_NOTRACK in memcmp-sse4.S
- 0a899af097 i386: Use _CET_NOTRACK in memcpy-ssse3-rep.S
- 7fb613361c i386: Use _CET_NOTRACK in memcpy-ssse3.S
- 77a8ae0948 i386: Use _CET_NOTRACK in memset-sse2-rep.S
- 00e7b76a8f i386: Use _CET_NOTRACK in memset-sse2.S
- 90d15dc577 i386: Use _CET_NOTRACK in strcat-sse2.S
- f1574581c7 i386: Use _CET_NOTRACK in strcpy-sse2.S
- 4031d7484a i386/sub_n.S: Add a missing _CET_ENDBR to indirect jump
- target
-
Checked on i686-linux-gnu.
H.J. Lu [Thu, 14 Aug 2025 14:03:20 +0000 (07:03 -0700)]
x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212]
When the linker -z mark-plt option is used to add DT_X86_64_PLT,
DT_X86_64_PLTSZ and DT_X86_64_PLTENT, the r_addend field of the
R_X86_64_JUMP_SLOT relocation stores the offset of the indirect
branch instruction. However, glibc versions without the commit:
x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT
According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
and R_X86_64_JUMP_SLOT. Since linkers always set their r_addends to 0, we
can ignore their r_addends.
Reviewed-by: Fangrui Song <maskray@google.com>
won't ignore the r_addend value in the R_X86_64_JUMP_SLOT relocation.
Such programs and shared libraries will fail at run-time randomly.
Add GLIBC_ABI_DT_X86_64_PLT version to indicate that glibc is compatible
with DT_X86_64_PLT.
The linker can add the glibc GLIBC_ABI_DT_X86_64_PLT version dependency
whenever -z mark-plt is passed to the linker. The resulting programs and
shared libraries will fail to load at run-time against libc.so without the
GLIBC_ABI_DT_X86_64_PLT version, instead of fail randomly.
Florian Weimer [Fri, 1 Aug 2025 10:19:49 +0000 (12:19 +0200)]
elf: Handle ld.so with LOAD segment gaps in _dl_find_object (bug 31943)
Detect if ld.so not contiguous and handle that case in _dl_find_object.
Set l_find_object_processed even for initially loaded link maps,
otherwise dlopen of an initially loaded object adds it to
_dlfo_loaded_mappings (where maps are expected to be contiguous),
in addition to _dlfo_nodelete_mappings.
Test elf/tst-link-map-contiguous-ldso iterates over the loader
image, reading every word to make sure memory is actually mapped.
It only does that if the l_contiguous flag is set for the link map.
Otherwise, it finds gaps with mmap and checks that _dl_find_object
does not return the ld.so mapping for them.
The test elf/tst-link-map-contiguous-main does the same thing for
the libc.so shared object. This only works if the kernel loaded
the main program because the glibc dynamic loader may fill
the gaps with PROT_NONE mappings in some cases, making it contiguous,
but accesses to individual words may still fault.
Test elf/tst-link-map-contiguous-libc is again slightly different
because the dynamic loader always fills the gaps with PROT_NONE
mappings, so a different form of probing has to be used.
posix: Fix double-free after allocation failure in regcomp (bug 33185)
If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.
Reported-by: Anastasia Belova <abelova@astralinux.ru> Co-authored-by: Paul Eggert <eggert@cs.ucla.edu> Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e994093fa0bcca0d0ee2c1db271d8d7885d)
Sam James [Mon, 9 Dec 2024 23:11:25 +0000 (23:11 +0000)]
malloc: add indirection for malloc(-like) functions in tests [BZ #32366]
GCC 15 introduces allocation dead code removal (DCE) for PR117370 in r15-5255-g7828dc070510f8. This breaks various glibc tests which want
to assert various properties of the allocator without doing anything
obviously useful with the allocated memory.
Alexander Monakov rightly pointed out that we can and should do better
than passing -fno-malloc-dce to paper over the problem. Not least because
GCC 14 already does such DCE where there's no testing of malloc's return
value against NULL, and LLVM has such optimisations too.
Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure that we're calling malloc (et. al) from the
compiler.
nptl: PTHREAD_COND_INITIALIZER compatibility with pre-2.41 versions (bug 32786)
[BZ #25847]
The new initializer and struct layout does not initialize the
__g_signals field in the old struct layout before the change in
commit c36fc50781995e6758cae2b6927839d0157f213c ("nptl: Remove
g_refs from condition variables"). Bring back fields at the end
of struct __pthread_cond_s, so that they are again zero-initialized.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:
"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"
I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that, (and even if it did, why would you want to force a
sleep in that case?) and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
nptl: Use a single loop in pthread_cond_wait instaed of a nested loop
[BZ #25847]
The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
nptl: Remove unnecessary quadruple check in pthread_cond_wait
[BZ #25847]
pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:
1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this
3. After incrementing grefs. There is no reason to think that the group would
close while we do an atomic increment. Obviously it could close at any
point, but that doesn't mean we have to recheck after every step. This
check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
we check that we're not in a closed group. There is no need to check again.
The check would only have helped in cases where the compare_exchange in the
next line would also have failed. Relying on the compare_exchange is fine.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
nptl: Remove unnecessary catch-all-wake in condvar group switch
[BZ #25847]
This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.
In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Frank Barrus [Fri, 11 Jul 2025 12:57:41 +0000 (05:57 -0700)]
pthreads NPTL: lost wakeup fix 2
[BZ #25847]
This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch. This greatly reduces the average
and maximum latency for pthread_cond_signal.
The g_signals[] field now contains a signal count that is relative to
the current g1_start value. Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)
By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same. This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change. This removes the race condition and A/B/A
problems that otherwise ocurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.
By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore. The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit. If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.
The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters. Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2. They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.
Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block. This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
Signed-off-by: Sunil Dora <sunilkumar.dora@windriver.com> Tested-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Florian Weimer [Thu, 22 May 2025 12:36:37 +0000 (14:36 +0200)]
Fix error reporting (false negatives) in SGID tests
And simplify the interface of support_capture_subprogram_self_sgid.
Use the existing framework for temporary directories (now with
mode 0700) and directory/file deletion. Handle all execution
errors within support_capture_subprogram_self_sgid. In particular,
this includes test failures because the invoked program did not
exit with exit status zero. Existing tests that expect exit
status 42 are adjusted to use zero instead.
In addition, fix callers not to call exit (0) with test failures
pending (which may mask them, especially when running with --direct).
Wilco Dijkstra [Fri, 13 Dec 2024 15:43:07 +0000 (15:43 +0000)]
math: Improve layout of exp/exp10 data
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp.
The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2.
Wilco Dijkstra [Tue, 24 Dec 2024 18:01:59 +0000 (18:01 +0000)]
AArch64: Add SVE memset
Add SVE memset based on the generic memset with predicated load for sizes < 16.
Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned
stores for the last 64 bytes. Performance of random memset benchmark improves
by ~2% on Neoverse V1.
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.
Wilco Dijkstra [Mon, 25 Nov 2024 18:43:08 +0000 (18:43 +0000)]
AArch64: Remove zva_128 from memset
Remove ZVA 128 support from memset - the new memset no longer
guarantees count >= 256, which can result in underflow and a
crash if ZVA size is 128 ([1]). Since only one CPU uses a ZVA
size of 128 and its memcpy implementation was removed in commit e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special
case too.
Improve small memsets by avoiding branches and use overlapping stores.
Use DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes
other than 64 and 128. Performance of random memset benchmark improves by 24%
on Neoverse N1.
Wilco Dijkstra [Wed, 7 Aug 2024 13:43:47 +0000 (14:43 +0100)]
AArch64: Improve generic strlen
Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency. On Neoverse V1 performance
of the random strlen benchmark improves by 4.6%.
elf: Support recursive use of dynamic TLS in interposed malloc
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS). The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") does not fix all
encountered cases unfortunatelly.
This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update. This
is possible because an interposed malloc has a fixed module ID and
TLS slot. (It cannot be unloaded.) If an initially-loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc. It's still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.
The bookkeeping is done using a global counter instead of per-thread
flag because TLS access in the dynamic linker is tricky.
All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.
Florian Weimer [Mon, 3 Jun 2024 08:49:40 +0000 (10:49 +0200)]
elf: Avoid some free (NULL) calls in _dl_update_slotinfo
This has been confirmed to work around some interposed mallocs. Here
is a discussion of the impact test ust/libc-wrapper/test_libc-wrapper
in lttng-tools:
New TLS usage in libgcc_s.so.1, compatibility impact
<https://inbox.sourceware.org/libc-alpha/8734v1ieke.fsf@oldenburg.str.redhat.com/>
Reportedly, this patch also papers over a similar issue when tcmalloc
2.9.1 is not compiled with -ftls-model=initial-exec. Of course the
goal really should be to compile mallocs with the initial-exec TLS
model, but this commit appears to be a useful interim workaround.
x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
The loop should be aligned to 32-bytes so that it can ideally run out
the DSB. This is particularly important on Skylake-Server where
deficiencies in it's DSB implementation make it prone to not being
able to run loops out of the DSB.
This results in a 25% performance degradation for the non-aligned
version.
The fix is to just ensure the code layout is such that the loop is
aligned. (Which was previously the case but was accidentally dropped
in 84e7c46df).
NB: The fix was actually 64-byte alignment. This is because 64-byte
alignment generally produces more stable performance than 32-byte
aligned code (cache line crosses can affect perf), so if we are going
past 16-byte alignmnent, might as well go to 64. 64-byte alignment
also matches most other functions we over-align, so it creates a
common point of optimization.
Times are reported as ratio of Time_With_Patch /
Time_Without_Patch. Lower is better.
The values being reported is the geometric mean of the ratio across
all tests in bench-strcmp and bench-strncmp.
Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings. The rest of the numbers are only to test for
regressions.
The 2.6% regression on TGL-strcmp is due to slowdowns caused by
changes in alignment of code handling small sizes (most on the
page-cross logic). These should be safe to ignore because 1) We
previously only 16-byte aligned the function so this behavior is not
new and was essentially up to chance before this patch and 2) this
type of alignment related regression on small sizes really only comes
up in tight micro-benchmark loops and is unlikely to have any affect
on realworld performance.
Noah Goldstein [Fri, 24 May 2024 17:38:50 +0000 (12:38 -0500)]
x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2 : 0.986
Geometric Mean across the suite New / Old SSE2 : 0.979
Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bf0ab80573d66e4ae5d94b094659094336da90f)
Gabi Falk [Tue, 7 May 2024 18:25:00 +0000 (18:25 +0000)]
x86_64: Fix missing wcsncat function definition without multiarch (x86-64-v4)
This code expects the WCSCAT preprocessor macro to be predefined in case
the evex implementation of the function should be defined with a name
different from __wcsncat_evex. However, when glibc is built for
x86-64-v4 without multiarch support, sysdeps/x86_64/wcsncat.S defines
WCSNCAT variable instead of WCSCAT to build it as wcsncat. Rename the
variable to WCSNCAT, as it is actually a better naming choice for the
variable in this case.
Reported-by: Kenton Groombridge Link: https://bugs.gentoo.org/921945 Fixes: 64b8b6516b ("x86: Add evex optimized functions for the wchar_t strcpy family") Signed-off-by: Gabi Falk <gabifalk@gmx.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit dd5f891c1ad9f1b43b9db93afe2a55cbb7a6194e)
Noah Goldstein [Wed, 1 Nov 2023 20:30:26 +0000 (15:30 -0500)]
x86: Only align destination to 1x VEC_SIZE in memset 4x loop
Current code aligns to 2x VEC_SIZE. Aligning to 2x has no affect on
performance other than potentially resulting in an additional
iteration of the loop.
1x maintains aligned stores (the only reason to align in this case)
and doesn't incur any unnecessary loop iterations. Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9469261cf1924d350feeec64d2c80cafbbdcdd4d)
Szabolcs Nagy [Tue, 16 Feb 2021 12:55:13 +0000 (12:55 +0000)]
elf: Fix slow tls access after dlopen [BZ #19924]
In short: __tls_get_addr checks the global generation counter and if
the current dtv is older then _dl_update_slotinfo updates dtv up to the
generation of the accessed module. So if the global generation is newer
than generation of the module then __tls_get_addr keeps hitting the
slow dtv update path. The dtv update path includes a number of checks
to see if any update is needed and this already causes measurable tls
access slow down after dlopen.
It may be possible to detect up-to-date dtv faster. But if there are
many modules loaded (> TLS_SLOTINFO_SURPLUS) then this requires at
least walking the slotinfo list.
This patch tries to update the dtv to the global generation instead, so
after a dlopen the tls access slow path is only hit once. The modules
with larger generation than the accessed one were not necessarily
synchronized before, so additional synchronization is needed.
This patch uses acquire/release synchronization when accessing the
generation counter.
Note: in the x86_64 version of dl-tls.c the generation is only loaded
once, since relaxed mo is not faster than acquire mo load.
I have not benchmarked this. Tested by Adhemerval Zanella on aarch64,
powerpc, sparc, x86 who reported that it fixes the performance issue
of bug 19924.
H.J. Lu [Mon, 28 Aug 2023 19:08:14 +0000 (12:08 -0700)]
x86: Check the lower byte of EAX of CPUID leaf 2 [BZ #30643]
The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned 1 which indicated the number of rounds of
CPUDID leaf 2 was needed to retrieve the complete cache information. The
newer Intel manual has been changed to that it should always return 1
and be ignored. If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead. If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all. This addresses BZ #30643.
H.J. Lu [Thu, 17 Aug 2023 16:42:29 +0000 (09:42 -0700)]
x86_64: Add log1p with FMA
On Skylake, it changes log1p bench performance by:
Before After Improvement
max 63.349 58.347 8%
min 4.448 5.651 -30%
mean 12.0674 10.336 14%
The minimum code path is
if (hx < 0x3FDA827A) /* x < 0.41422 */
{
if (__glibc_unlikely (ax >= 0x3ff00000)) /* x <= -1.0 */
{
...
}
if (__glibc_unlikely (ax < 0x3e200000)) /* |x| < 2**-29 */
{
math_force_eval (two54 + x); /* raise inexact */
if (ax < 0x3c900000) /* |x| < 2**-54 */
{
...
}
else
return x - x * x * 0.5;
FMA and non-FMA code sequences look similar. Non-FMA version is slightly
faster. Since log1p is called by asinh and atanh, it improves asinh
performance by:
Before After Improvement
max 75.645 63.135 16%
min 10.074 10.071 0%
mean 15.9483 14.9089 6%
and improves atanh performance by:
Before After Improvement
max 91.768 75.081 18%
min 15.548 13.883 10%
mean 18.3713 16.8011 8%
H.J. Lu [Fri, 11 Aug 2023 15:04:08 +0000 (08:04 -0700)]
x86_64: Add expm1 with FMA
On Skylake, it improves expm1 bench performance by:
Before After Improvement
max 70.204 68.054 3%
min 20.709 16.2 22%
mean 22.1221 16.7367 24%
NB: Add
extern long double __expm1l (long double);
extern long double __expm1f128 (long double);
for __typeof (__expm1l) and __typeof (__expm1f128) when __expm1 is
defined since __expm1 may be expanded in their declarations which
causes the build failure.