This is because A) the checks in range [65, 96] are now unrolled 2x
and B) because smaller values <= 16 are now given a hotter path. By
contrast the SSE4 version has a branch for Size = 80. The unrolled
version has get better performance for returns which need both
comparisons.
As well, out of microbenchmark environments that are not full
predictable the branch will have a real-cost. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7cbc03d03091d5664060924789afe46d30a5477e)
Noah Goldstein [Fri, 25 Mar 2022 22:13:33 +0000 (17:13 -0500)]
x86: Small improvements for wcslen
Just a few QOL changes.
1. Prefer `add` > `lea` as it has high execution units it can run
on.
2. Don't break macro-fusion between `test` and `jcc`
3. Reduce code size by removing gratuitous padding bytes (-90
bytes).
geometric_mean(N=20) of all benchmarks New / Original: 0.959
Noah Goldstein [Wed, 23 Mar 2022 21:57:46 +0000 (16:57 -0500)]
x86: Remove AVX str{n}casecmp
The rational is:
1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
regression on Tigerlake using SSE42 versus AVX across the
benchtest suite).
2. AVX2 version covers the majority of targets that previously
prefered it.
3. The targets where AVX would still be best (SnB and IVB) are
becoming outdated.
Noah Goldstein [Wed, 23 Mar 2022 21:57:18 +0000 (16:57 -0500)]
x86: Code cleanup in strchr-evex and comment justifying branch
Small code cleanup for size: -81 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks New / Original: .985 Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit ec285ea90415458225623ddc0492ae3f705af043)
Noah Goldstein [Wed, 23 Mar 2022 21:57:16 +0000 (16:57 -0500)]
x86: Code cleanup in strchr-avx2 and comment justifying branch
Small code cleanup for size: -53 bytes.
Add comment justifying using a branch to do NULL/non-null return.
All string/memory tests pass and no regressions in benchtests.
geometric_mean(N=20) of all benchmarks Original / New: 1.00 Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a6fbf4d51e9ba8063c4f8331564892ead9c67344)
fortify: Ensure that __glibc_fortify condition is a constant [BZ #29141]
The fix c8ee1c85 introduced a -1 check for object size without also
checking that object size is a constant. Because of this, the tree
optimizer passes in gcc fail to fold away one of the branches in
__glibc_fortify and trips on a spurious Wstringop-overflow. The warning
itself is incorrect and the branch does go away eventually in DCE in the
rtl passes in gcc, but the constant check is a helpful hint to simplify
code early, so add it in.
dlfcn: Implement the RTLD_DI_PHDR request type for dlinfo
The information is theoretically available via dl_iterate_phdr as
well, but that approach is very slow if there are many shared
objects.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@rehdat.com>
(cherry picked from commit d056c212130280c0a54d9a4f72170ec621b70ce5)
Florian Weimer [Wed, 11 May 2022 18:30:49 +0000 (20:30 +0200)]
manual: Document the dlinfo function
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@rehdat.com>
(cherry picked from commit 93804a1ee084d4bdc620b2b9f91615c7da0fabe1)
Also includes partial backport of commit 5d28a8962dcb6ec056b81d730e
(the addition of manual/dynlink.texi).
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Set the wrong fallback function for `__wcsncmp_avx2_rtm`. It was set
to fallback on to `__wcscmp_avx2` instead of `__wcscmp_avx2_rtm` which
can cause spurious aborts.
Noah Goldstein [Wed, 16 Feb 2022 02:27:21 +0000 (20:27 -0600)]
x86: Fix bug in strncmp-evex and strncmp-avx2 [BZ #28895]
Logic can read before the start of `s1` / `s2` if both `s1` and `s2`
are near the start of a page. To avoid having the result contimated by
these comparisons the `strcmp` variants would mask off these
comparisons. This was missing in the `strncmp` variants causing
the bug. This commit adds the masking to `strncmp` so that out of
range comparisons don't affect the result.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass as
well a full xcheck on x86_64 linux. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e108c02a5e23c8c88ce66d8705d4a24bb6b9a8bf)
Noah Goldstein [Sun, 6 Feb 2022 06:54:18 +0000 (00:54 -0600)]
x86: Improve vec generation in memset-vec-unaligned-erms.S
No bug.
Split vec generation into multiple steps. This allows the
broadcast in AVX2 to use 'xmm' registers for the L(less_vec)
case. This saves an expensive lane-cross instruction and removes
the need for 'vzeroupper'.
For SSE2 replace 2x 'punpck' instructions with zero-idiom 'pxor' for
byte broadcast.
Results for memset-avx2 small (geomean of N = 20 benchset runs).
size, New Time, Old Time, New / Old
0, 4.100, 3.831, 0.934
1, 5.074, 4.399, 0.867
2, 4.433, 4.411, 0.995
4, 4.487, 4.415, 0.984
8, 4.454, 4.396, 0.987
16, 4.502, 4.443, 0.987
Noah Goldstein [Mon, 10 Jan 2022 21:35:39 +0000 (15:35 -0600)]
x86: Optimize strcmp-evex.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.
The non-page cross cases as well are nearly universally improved.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
Noah Goldstein [Mon, 10 Jan 2022 21:35:38 +0000 (15:35 -0600)]
x86: Optimize strcmp-avx2.S
Optimization are primarily to the loop logic and how the page cross
logic interacts with the loop.
The page cross logic is at times more expensive for short strings near
the end of a page but not crossing the page. This is done to retest
the page cross conditions with a non-faulty check and to improve the
logic for entering the loop afterwards. This is only particular cases,
however, and is general made up for by more than 10x improvements on
the transition from the page cross -> loop case.
The non-page cross cases are improved most for smaller sizes [0, 128]
and go about even for (128, 4096]. The loop page cross logic is
improved so some more significant speedup is seen there as well.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
manual: Clarify that abbreviations of long options are allowed
The man page and code comments clearly state that abbreviations of long
option names are recognized correctly as long as they are unique.
Document this fact in the glibc manual as well.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Andreas Schwab <schwab@linux-m68k.org>
(cherry picked from commit db1efe02c9f15affc3908d6ae73875b82898a489)
Szabolcs Nagy [Tue, 14 Dec 2021 11:15:07 +0000 (11:15 +0000)]
aarch64: Add HWCAP2_ECV from Linux 5.16
Indicates the availability of enhanced counter virtualization extension
of armv8.6-a with self-synchronized virtual counter CNTVCTSS_EL0 usable
in userspace.
Joseph Myers [Thu, 24 Mar 2022 15:35:27 +0000 (15:35 +0000)]
Update kernel version to 5.17 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.17. (There are no new MAP_* constants covered by this test in
5.17 that need any other header changes.)
Joseph Myers [Wed, 16 Feb 2022 14:19:24 +0000 (14:19 +0000)]
Update kernel version to 5.16 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.16. (There are no new MAP_* constants covered by this test in
5.16 that need any other header changes.)
Joseph Myers [Wed, 23 Mar 2022 17:11:56 +0000 (17:11 +0000)]
Update syscall lists for Linux 5.17
Linux 5.17 has one new syscall, set_mempolicy_home_node. Update
syscall-names.list and regenerate the arch-syscall.h headers with
build-many-glibcs.py update-syscalls.
Joseph Myers [Mon, 20 Dec 2021 15:38:32 +0000 (15:38 +0000)]
Add ARPHRD_CAN, ARPHRD_MCTP to net/if_arp.h
Add the constant ARPHRD_MCTP, from Linux 5.15, to net/if_arp.h, along
with ARPHRD_CAN which was added to Linux in version 2.6.25 (commit cd05acfe65ed2cf2db683fa9a6adb8d35635263b, "[CAN]: Allocate protocol
numbers for PF_CAN") but apparently missed for glibc at the time.
Joseph Myers [Mon, 22 Nov 2021 15:30:12 +0000 (15:30 +0000)]
Update kernel version to 5.15 in tst-mman-consts.py
This patch updates the kernel version in the test tst-mman-consts.py
to 5.15. (There are no new MAP_* constants covered by this test in
5.15 that need any other header changes.)
DJ Delorie [Wed, 30 Mar 2022 21:44:02 +0000 (17:44 -0400)]
posix/glob.c: update from gnulib
Copied from gnulib/lib/glob.c in order to fix rhbz 1982608
Also fixes swbz 25659
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 7c477b57a31487eda516db02b9e04f22d1a6e6af)
Carlos O'Donell [Tue, 26 Apr 2022 14:52:41 +0000 (10:52 -0400)]
i386: Regenerate ulps
These failures were caught while building glibc master for Fedora
Rawhide which is built with '-mtune=generic -msse2 -mfpmath=sse'
using gcc 11.3 (gcc-11.3.1-2.fc35) on a Cascadelake Intel Xeon
processor.
Noah Goldstein [Sat, 25 Dec 2021 00:54:41 +0000 (18:54 -0600)]
x86: Optimize L(less_vec) case in memcmp-evex-movbe.S
No bug.
Optimizations are twofold.
1) Replace page cross and 0/1 checks with masked load instructions in
L(less_vec). In applications this reduces branch-misses in the
hot [0, 32] case.
2) Change controlflow so that L(less_vec) case gets the fall through.
Change 2) helps copies in the [0, 32] size range but comes at the cost
of copies in the [33, 64] size range. From profiles of GCC and
Python3, 94%+ and 99%+ of calls are in the [0, 32] range so this
appears to the the right tradeoff.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit abddd61de090ae84e380aff68a98bd94ef704667)
Noah Goldstein [Fri, 3 Dec 2021 23:29:25 +0000 (15:29 -0800)]
x86-64: Use notl in EVEX strcmp [BZ #28646]
Must use notl %edi here as lower bits are for CHAR comparisons
potentially out of range thus can be 0 without indicating mismatch.
This fixes BZ #28646.
Noah Goldstein [Wed, 10 Nov 2021 22:18:56 +0000 (16:18 -0600)]
x86: Shrink memcmp-sse4.S code size
No bug.
This implementation refactors memcmp-sse4.S primarily with minimizing
code size in mind. It does this by removing the lookup table logic and
removing the unrolled check from (256, 512] bytes.
The current memcmp-sse4.S implementation has a large code size
cost. This has serious adverse affects on the ICache / ITLB. While
in micro-benchmarks the implementations appears fast, traces of
real-world code have shown that the speed in micro benchmarks does not
translate when the ICache/ITLB are not primed, and that the cost
of the code size has measurable negative affects on overall
application performance.
See https://research.google/pubs/pub48320/ for more details.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2f9062d7171850451e6044ef78d91ff8c017b9c0)
Noah Goldstein [Mon, 1 Nov 2021 05:49:52 +0000 (00:49 -0500)]
x86: Double size of ERMS rep_movsb_threshold in dl-cacheinfo.h
No bug.
This patch doubles the rep_movsb_threshold when using ERMS. Based on
benchmarks the vector copy loop, especially now that it handles 4k
aliasing, is better for these medium ranged.
H.J. Lu [Fri, 29 Oct 2021 19:56:53 +0000 (12:56 -0700)]
x86-64: Remove Prefer_AVX2_STRCMP
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.
H.J. Lu [Fri, 29 Oct 2021 19:40:20 +0000 (12:40 -0700)]
x86-64: Improve EVEX strcmp with masked load
In strcmp-evex.S, to compare 2 32-byte strings, replace
VMOVU (%rdi, %rdx), %YMM0
VMOVU (%rsi, %rdx), %YMM1
/* Each bit in K0 represents a mismatch in YMM0 and YMM1. */
VPCMP $4, %YMM0, %YMM1, %k0
VPCMP $0, %YMMZERO, %YMM0, %k1
VPCMP $0, %YMMZERO, %YMM1, %k2
/* Each bit in K1 represents a NULL in YMM0 or YMM1. */
kord %k1, %k2, %k1
/* Each bit in K1 represents a NULL or a mismatch. */
kord %k0, %k1, %k1
kmovd %k1, %ecx
testl %ecx, %ecx
jne L(last_vector)
with
VMOVU (%rdi, %rdx), %YMM0
VPTESTM %YMM0, %YMM0, %k2
/* Each bit cleared in K1 represents a mismatch or a null CHAR
in YMM0 and 32 bytes at (%rsi, %rdx). */
VPCMP $0, (%rsi, %rdx), %YMM0, %k1{%k2}
kmovd %k1, %ecx
incl %ecx
jne L(last_vector)
It makes EVEX strcmp faster than AVX2 strcmp by up to 40% on Tiger Lake
and Ice Lake.
Noah Goldstein [Sat, 23 Oct 2021 05:26:47 +0000 (01:26 -0400)]
x86: Replace sse2 instructions with avx in memcmp-evex-movbe.S
This commit replaces two usages of SSE2 'movups' with AVX 'vmovdqu'.
it could potentially be dangerous to use SSE2 if this function is ever
called without using 'vzeroupper' beforehand. While compilers appear
to use 'vzeroupper' before function calls if AVX2 has been used, using
SSE2 here is more brittle. Since it is not absolutely necessary it
should be avoided.
It costs 2-extra bytes but the extra bytes should only eat into
alignment padding. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit bad852b61b79503fcb3c5fc379c70f768df3e1fb)
1. change control flow for L(more_2x_vec) to fall through to loop and
jump for L(less_4x_vec) and L(less_8x_vec). This uses less code
size and saves jumps for length > 4x VEC_SIZE.
2. For EVEX/AVX512 move L(less_vec) closer to entry.
3. Avoid complex address mode for length > 2x VEC_SIZE
4. Slightly better aligning code for the loop from the perspective of
code size and uops.
5. Align targets so they make full use of their fetch block and if
possible cache line.
6. Try and reduce total number of icache lines that will need to be
pulled in for a given length.
7. Include "local" version of stosb target. For AVX2/EVEX/AVX512
jumping to the stosb target in the sse2 code section will almost
certainly be to a new page. The new version does increase code size
marginally by duplicating the target but should get better iTLB
behavior as a result.
test-memset, test-wmemset, and test-bzero are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e59ced238482fd71f3e493717f14f6507346741e)
x86: Optimize memcmp-evex-movbe.S for frontend behavior and size
No bug.
The frontend optimizations are to:
1. Reorganize logically connected basic blocks so they are either in
the same cache line or adjacent cache lines.
2. Avoid cases when basic blocks unnecissarily cross cache lines.
3. Try and 32 byte align any basic blocks possible without sacrificing
code size. Smaller / Less hot basic blocks are used for this.
Overall code size shrunk by 168 bytes. This should make up for any
extra costs due to aligning to 64 bytes.
In general performance before deviated a great deal dependending on
whether entry alignment % 64 was 0, 16, 32, or 48. These changes
essentially make it so that the current implementation is at least
equal to the best alignment of the original for any arguments.
The only additional optimization is in the page cross case. Branch on
equals case was removed from the size == [4, 7] case. As well the [4,
7] and [2, 3] case where swapped as [4, 7] is likely a more hot
argument size.
dlfcn: Do not use rtld_active () to determine ld.so state (bug 29078)
When audit modules are loaded, ld.so initialization is not yet
complete, and rtld_active () returns false even though ld.so is
mostly working. Instead, the static dlopen hook is used, but that
does not work at all because this is not a static dlopen situation.
Commit 466c1ea15f461edb8e3ffaf5d86d708876343bbf ("dlfcn: Rework
static dlopen hooks") moved the hook pointer into _rtld_global_ro,
which means that separate protection is not needed anymore and the
hook pointer can be checked directly.
The guard for disabling libio vtable hardening in _IO_vtable_check
should stay for now.
Joan Bruguera [Mon, 11 Apr 2022 17:49:56 +0000 (19:49 +0200)]
misc: Fix rare fortify crash on wchar funcs. [BZ 29030]
If `__glibc_objsize (__o) == (size_t) -1` (i.e. `__o` is unknown size), fortify
checks should pass, and `__whatever_alias` should be called.
Previously, `__glibc_objsize (__o) == (size_t) -1` was explicitly checked, but
on commit a643f60c53876b, this was moved into `__glibc_safe_or_unknown_len`.
A comment says the -1 case should work as: "The -1 check is redundant because
since it implies that __glibc_safe_len_cond is true.". But this fails when:
* `__s > 1`
* `__osz == -1` (i.e. unknown size at compile time)
* `__l` is big enough
* `__l * __s <= __osz` can be folded to a constant
(I only found this to be true for `mbsrtowcs` and other functions in wchar2.h)
In this case `__l * __s <= __osz` is false, and `__whatever_chk_warn` will be
called by `__glibc_fortify` or `__glibc_fortify_n` and crash the program.
This commit adds the explicit `__osz == -1` check again.
moc crashes on startup due to this, see: https://bugs.archlinux.org/task/74041
Minimal test case (test.c):
#include <wchar.h>
int main (void)
{
const char *hw = "HelloWorld";
mbsrtowcs (NULL, &hw, (size_t)-1, NULL);
return 0;
}
Build with:
gcc -O2 -Wp,-D_FORTIFY_SOURCE=2 test.c -o test && ./test
This is necessary to place the libio vtables into the RELRO segment.
New tests elf/tst-relro-ldso and elf/tst-relro-libc are added to
verify that this is what actually happens.
The new tests fail on ia64 due to lack of (default) RELRO support
inbutils, so they are XFAILed there.
Hopefully, this will lead to tests that are easier to maintain. The
current approach of parsing readelf -W output using regular expressions
is not necessarily easier than parsing the ELF data directly.
This module is still somewhat incomplete (e.g., coverage of relocation
types and versioning information is missing), but it is sufficient to
perform basic symbol analysis or program header analysis.
The EM_* mapping for architecture-specific constant classes (e.g.,
SttX86_64) is not yet implemented. The classes are defined for the
benefit of elf/tst-glibcelf.py.
The 404656009b reversion did not setup the atomic loop to set the
cancel bits correctly. The fix is essentially what pthread_cancel
did prior 26cfbb7162ad.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
nptl: Handle spurious EINTR when thread cancellation is disabled (BZ#29029)
Some Linux interfaces never restart after being interrupted by a signal
handler, regardless of the use of SA_RESTART [1]. It means that for
pthread cancellation, if the target thread disables cancellation with
pthread_setcancelstate and calls such interfaces (like poll or select),
it should not see spurious EINTR failures due the internal SIGCANCEL.
However recent changes made pthread_cancel to always sent the internal
signal, regardless of the target thread cancellation status or type.
To fix it, the previous semantic is restored, where the cancel signal
is only sent if the target thread has cancelation enabled in
asynchronous mode.
The cancel state and cancel type is moved back to cancelhandling
and atomic operation are used to synchronize between threads. The
patch essentially revert the following commits:
8c1c0aae20 nptl: Move cancel type out of cancelhandling 2b51742531 nptl: Move cancel state out of cancelhandling 26cfbb7162 nptl: Remove CANCELING_BITMASK
However I changed the atomic operation to follow the internal C11
semantic and removed the MACRO usage, it simplifies a bit the
resulting code (and removes another usage of the old atomic macros).
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and powerpc64-linux-gnu.
Stefan Liebler [Wed, 13 Apr 2022 12:36:09 +0000 (14:36 +0200)]
S390: Add new s390 platform z16.
The new IBM z16 is added to platform string array.
The macro _DL_PLATFORMS_COUNT is incremented.
_dl_hwcaps_subdir is extended by "z16" if HWCAP_S390_VXRS_PDE2
is set. HWCAP_S390_NNPA is not tested in _dl_hwcaps_subdirs_active
as those instructions may be replaced or removed in future.
tst-glibc-hwcaps.c is extended in order to test z16 via new marker5.
A fatal glibc error is dumped if glibc was build with architecture
level set for z16, but run on an older machine. (See dl-hwcap-check.h)
On hppa, a function pointer returned by la_symbind is actually a function
descriptor has the plabel bit set (bit 30). This must be cleared to get
the actual address of the descriptor. If the descriptor has been bound,
the first word of the descriptor is the physical address of theA function,
otherwise, the first word of the descriptor points to a trampoline in the
PLT.
This patch also adds a workaround on tests because on hppa (and it seems
to be the only ABI I have see it), some shared library adds a dynamic PLT
relocation to am empty symbol name:
$ readelf -r elf/tst-audit25mod1.so
[...]
Relocation section '.rela.plt' at offset 0x464 contains 6 entries:
Offset Info Type Sym.Value Sym. Name + Addend 0000200800000081 R_PARISC_IPLT 508
[...]
It breaks some assumptions on the test, where a symbol with an empty
name ("") is passed on la_symbind.
Ben Woodard [Mon, 24 Jan 2022 13:46:18 +0000 (10:46 -0300)]
elf: Fix runtime linker auditing on aarch64 (BZ #26643)
The rtld audit support show two problems on aarch64:
1. _dl_runtime_resolve does not preserve x8, the indirect result
location register, which might generate wrong result calls
depending of the function signature.
2. The NEON Q registers pushed onto the stack by _dl_runtime_resolve
were twice the size of D registers extracted from the stack frame by
_dl_runtime_profile.
While 2. might result in wrong information passed on the PLT tracing,
1. generates wrong runtime behaviour.
The aarch64 rtld audit support is changed to:
* Both La_aarch64_regs and La_aarch64_retval are expanded to include
both x8 and the full sized NEON V registers, as defined by the
ABI.
* dl_runtime_profile needed to extract registers saved by
_dl_runtime_resolve and put them into the new correctly sized
La_aarch64_regs structure.
* The LAV_CURRENT check is change to only accept new audit modules
to avoid the undefined behavior of not save/restore x8.
* Different than other architectures, audit modules older than
LAV_CURRENT are rejected (both La_aarch64_regs and La_aarch64_retval
changed their layout and there are no requirements to support multiple
audit interface with the inherent aarch64 issues).
* A new field is also reserved on both La_aarch64_regs and
La_aarch64_retval to support variant pcs symbols.
Similar to x86, a new La_aarch64_vector type to represent the NEON
register is added on the La_aarch64_regs (so each type can be accessed
directly).
Since LAV_CURRENT was already bumped to support bind-now, there is
no need to increase it again.
Checked on aarch64-linux-gnu.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ce9a68c57c260c8417afc93972849ac9ad243ec4)
The audit symbind callback is not called for binaries built with
-Wl,-z,now or when LD_BIND_NOW=1 is used, nor the PLT tracking callbacks
(plt_enter and plt_exit) since this would change the expected
program semantics (where no PLT is expected) and would have performance
implications (such as for BZ#15533).
LAV_CURRENT is also bumped to indicate the audit ABI change (where
la_symbind flags are set by the loader to indicate no possible PLT
trace).
To handle powerpc64 ELFv1 function descriptor, _dl_audit_symbind
requires to know whether bind-now is used so the symbol value is
updated to function text segment instead of the OPD (for lazy binding
this is done by PPC64_LOAD_FUNCPTR on _dl_runtime_resolve).
Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
powerpc64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 32612615c58b394c3eb09f020f31310797ad3854)
elf: Fix initial-exec TLS access on audit modules (BZ #28096)
For audit modules and dependencies with initial-exec TLS, we can not
set the initial TLS image on default loader initialization because it
would already be set by the audit setup. However, subsequent thread
creation would need to follow the default behaviour.
This patch fixes it by setting l_auditing link_map field not only
for the audit modules, but also for all its dependencies. This is
used on _dl_allocate_tls_init to avoid the static TLS initialization
at load time.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 254d3d5aef2fd8430c469e1938209ac100ebf132)
la_activity is not called during application exit, even though
la_objclose is.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com> Tested-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 5fa11a2bc94c912c3b25860065086902674537ba)
elf: Do not fail for failed dlmopen on audit modules (BZ #28061)
The dl_main sets the LM_ID_BASE to RT_ADD just before starting to
add load new shared objects. The state is set to RT_CONSISTENT just
after all objects are loaded.
However if a audit modules tries to dlmopen an inexistent module,
the _dl_open will assert that the namespace is in an inconsistent
state.
This is different than dlopen, since first it will not use
LM_ID_BASE and second _dl_map_object_from_fd is the sole responsible
to set and reset the r_state value.
So the assert on _dl_open can not really be seen if the state is
consistent, since _dt_main resets it. This patch removes the assert.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
The vDSO is is listed in the link_map chain, but is never the subject of
an la_objopen call. A new internal flag __RTLD_VDSO is added that
acts as __RTLD_OPENEXEC to allocate the required 'struct auditstate'
extra space for the 'struct link_map'.
The return value from the callback is currently ignored, since there
is no PLT call involved by glibc when using the vDSO, neither the vDSO
are exported directly.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
elf: Avoid unnecessary slowdown from profiling with audit (BZ#15533)
The rtld-audit interfaces introduces a slowdown due to enabling
profiling instrumentation (as if LD_AUDIT implied LD_PROFILE).
However, instrumenting is only necessary if one of audit libraries
provides PLT callbacks (la_pltenter or la_pltexit symbols). Otherwise,
the slowdown can be avoided.
The following patch adjusts the logic that enables profiling to iterate
over all audit modules and check if any of those provides a PLT hook.
To keep la_symbind to work even without PLT callbacks, _dl_fixup now
calls the audit callback if the modules implements it.
Co-authored-by: Alexander Monakov <amonakov@ispras.ru>
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
elf: Add _dl_audit_activity_map and _dl_audit_activity_nsid
It consolidates the code required to call la_activity audit
callback.
Also for a new Lmid_t the namespace link_map list are empty, so it
requires to check if before using it. This can happen for when audit
module is used along with dlmopen.
Checked on x86_64-linux-gnu, i686-linux-gnu, and aarch64-linux-gnu.
THe d6d89608ac8c broke powerpc for --enable-bind-now because it turned
out that different than patch assumption rtld elf_get_dynamic_info()
does require to handle RTLD_BOOTSTRAP to avoid DT_FLAGS and
DT_RUNPATH (more specially the GLRO usage which is not reallocate
yet).
This patch fixes by passing two arguments to elf_get_dynamic_info()
to inform that by rtld (bootstrap) or static pie initialization
(static_pie_bootstrap). I think using explicit argument is way more
clear and burried C preprocessor, and compiler should remove the
dead code.
I checked on x86_64 and i686 with default options, --enable-bind-now,
and --enable-bind-now and --enable--static-pie. I also check on
aarch64, armhf, powerpc64, and powerpc with default and
--enable-bind-now.
The 4af6982e4c fix does not fully handle RTLD_BOOTSTRAP usage on
rtld.c due two issues:
1. RTLD_BOOTSTRAP is also used on dl-machine.h on various
architectures and it changes the semantics of various machine
relocation functions.
2. The elf_get_dynamic_info() change was done sideways, previously
to 490e6c62aa get-dynamic-info.h was included by the first
dynamic-link.h include *without* RTLD_BOOTSTRAP being defined.
It means that the code within elf_get_dynamic_info() that uses
RTLD_BOOTSTRAP is in fact unused.
To fix 1. this patch now includes dynamic-link.h only once with
RTLD_BOOTSTRAP defined. The ELF_DYNAMIC_RELOCATE call will now have
the relocation fnctions with the expected semantics for the loader.
And to fix 2. part of 4af6982e4c is reverted (the check argument
elf_get_dynamic_info() is not required) and the RTLD_BOOTSTRAP
pieces are removed.
To reorganize the includes the static TLS definition is moved to
its own header to avoid a circular dependency (it is defined on
dynamic-link.h and dl-machine.h requires it at same time other
dynamic-link.h definition requires dl-machine.h defitions).
Also ELF_MACHINE_NO_REL, ELF_MACHINE_NO_RELA, and ELF_MACHINE_PLT_REL
are moved to its own header. Only ancient ABIs need special values
(arm, i386, and mips), so a generic one is used as default.
The powerpc Elf64_FuncDesc is also moved to its own header, since
csu code required its definition (which would require either include
elf/ folder or add a full path with elf/).
Checked on x86_64, i686, aarch64, armhf, powerpc64, powerpc32,
and powerpc64le.
Before to 490e6c62aa31a8a ('elf: Avoid nested functions in the loader
[BZ #27220]'), elf_get_dynamic_info() was defined twice on rtld.c: on
the first dynamic-link.h include and later within _dl_start(). The
former definition did not define DONT_USE_BOOTSTRAP_MAP and it is used
on setup_vdso() (since it is a global definition), while the former does
define DONT_USE_BOOTSTRAP_MAP and it is used on loader self-relocation.
With the commit change, the function is now included and defined once
instead of defined as a nested function. So rtld.c defines without
defining RTLD_BOOTSTRAP and it brokes at least powerpc32.
This patch fixes by moving the get-dynamic-info.h include out of
dynamic-link.h, which then the caller can corirectly set the expected
semantic by defining STATIC_PIE_BOOTSTRAP, RTLD_BOOTSTRAP, and/or
RESOLVE_MAP.
It also required to enable some asserts only for the loader bootstrap
to avoid issues when called from setup_vdso().
As a side note, this is another issues with nested functions: it is
not clear from pre-processed output (-E -dD) how the function will
be build and its semantic (since nested function will be local and
extra C defines may change it).
I checked on x86_64-linux-gnu (w/o --enable-static-pie),
i686-linux-gnu, powerpc64-linux-gnu, powerpc-linux-gnu-power4,
aarch64-linux-gnu, arm-linux-gnu, sparc64-linux-gnu, and
s390x-linux-gnu.
Fangrui Song [Thu, 7 Oct 2021 18:55:02 +0000 (11:55 -0700)]
elf: Avoid nested functions in the loader [BZ #27220]
dynamic-link.h is included more than once in some elf/ files (rtld.c,
dl-conflict.c, dl-reloc.c, dl-reloc-static-pie.c) and uses GCC nested
functions. This harms readability and the nested functions usage
is the biggest obstacle prevents Clang build (Clang doesn't support GCC
nested functions).
The key idea for unnesting is to add extra parameters (struct link_map
*and struct r_scope_elm *[]) to RESOLVE_MAP,
ELF_MACHINE_BEFORE_RTLD_RELOC, ELF_DYNAMIC_RELOCATE, elf_machine_rel[a],
elf_machine_lazy_rel, and elf_machine_runtime_setup. (This is inspired
by Stan Shebs' ppc64/x86-64 implementation in the
google/grte/v5-2.27/master which uses mixed extra parameters and static
variables.)
Future simplification:
* If mips elf_machine_runtime_setup no longer needs RESOLVE_GOTSYM,
elf_machine_runtime_setup can drop the `scope` parameter.
* If TLSDESC no longer need to be in elf_machine_lazy_rel,
elf_machine_lazy_rel can drop the `scope` parameter.
Tested on aarch64, i386, x86-64, powerpc64le, powerpc64, powerpc32,
sparc64, sparcv9, s390x, s390, hppa, ia64, armhf, alpha, and mips64.
In addition, tested build-many-glibcs.py with {arc,csky,microblaze,nios2}-linux-gnu
and riscv64-linux-gnu-rv64imafdc-lp64d.
debug: Synchronize feature guards in fortified functions [BZ #28746]
Some functions (e.g. stpcpy, pread64, etc.) had moved to POSIX in the
main headers as they got incorporated into the standard, but their
fortified variants remained under __USE_GNU. As a result, these
functions did not get fortified when _GNU_SOURCE was not defined.
Add test wrappers that check all functions tested in tst-chk0 at all
levels with _GNU_SOURCE undefined and then use the failures to (1)
exclude checks for _GNU_SOURCE functions in these tests and (2) Fix
feature macro guards in the fortified function headers so that they're
the same as the ones in the main headers.