elf: Fix elf/tst-pldd with --enable-hardcoded-path-in-tests (BZ#24506)
The elf/tst-pldd (added by 1a4c27355e146 to fix BZ#18035) test does
not expect the hardcoded paths that are output by pldd when the test
is built with --enable-hardcoded-path-in-tests. Instead of showing
the ABI installed library names for loader and libc (such as
ld-linux-x86-64.so.2 and libc.so.6 for x86_64), pldd shows the default
built ld.so and libc.so.
It makes the tests fail with an invalid expected loader/libc name.
This patch fixes the elf-pldd test by adding the canonical ld.so and
libc.so names in the expected list of possible outputs when parsing
the result output from pldd. The test now handles both default
build and --enable-hardcoded-path-in-tests option.
Checked on x86_64-linux-gnu (built with and without
--enable-hardcoded-path-in-tests) and i686-linux-gnu.
* elf/tst-pldd.c (in_str_list): New function.
(do_test): Add default names for ld and libc as one option.
H.J. Lu [Fri, 10 May 2024 03:07:01 +0000 (20:07 -0700)]
Force DT_RPATH for --enable-hardcoded-path-in-tests
On Fedora 40/x86-64, linker enables --enable-new-dtags by default which
generates DT_RUNPATH instead of DT_RPATH. Unlike DT_RPATH, DT_RUNPATH
only applies to DT_NEEDED entries in the executable and doesn't applies
to DT_NEEDED entries in shared libraries which are loaded via DT_NEEDED
entries in the executable. Some glibc tests have libstdc++.so.6 in
DT_NEEDED, which has libm.so.6 in DT_NEEDED. When DT_RUNPATH is generated,
/lib64/libm.so.6 is loaded for such tests. If the newly built glibc is
older than glibc 2.36, these tests fail with
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_2.36' not found (required by /lib64/libm.so.6)
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /lib64/libm.so.6)
Pass -Wl,--disable-new-dtags to linker when building glibc tests with
--enable-hardcoded-path-in-tests. This fixes BZ #31719.
H.J. Lu [Thu, 9 May 2024 20:07:23 +0000 (13:07 -0700)]
Don't make errlist.o[s].d depend on errlist-compat.h
stdio-common/errlist.o.d and stdio-common/errlist.os.d aren't generated
alongside with stdio-common/errlist-compat.h. Don't make them depend on
stdio-common/errlist-compat.h to avoid infinite loop with make-4.4. This
fixes BZ #31330.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Makerules: fix MAKEFLAGS assignment for upcoming make-4.4 [BZ# 29564]
make-4.4 will add long flags to MAKEFLAGS variable:
* WARNING: Backward-incompatibility!
Previously only simple (one-letter) options were added to the MAKEFLAGS
variable that was visible while parsing makefiles. Now, all options
are available in MAKEFLAGS.
This causes locale builds to fail when long options are used:
$ make --shuffle
...
make -C localedata install-locales
make: invalid shuffle mode: '1662724426r'
The change fixes it by passing eash option via whitespace and dashes.
That way option is appended to both single-word form and whitespace
separated form.
While at it fixed --silent mode detection in $(MAKEFLAGS) by filtering
out --long-options. Otherwise options like --shuffle flag enable silent
mode unintentionally. $(silent-make) variable consolidates the checks.
Resolves: BZ# 29564
CC: Paul Smith <psmith@gnu.org> CC: Siddhesh Poyarekar <siddhesh@gotplt.org> Signed-off-by: Sergei Trofimovich <slyich@gmail.com> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 2d7ed98add14f75041499ac189696c9bd3d757fe)
Aurelien Jarno [Tue, 8 Jan 2019 20:10:28 +0000 (21:10 +0100)]
testsuite: stdlib/isomac.c: add missing include
When running the testsuite, building stdlib/isomac.c outputs the
following warning:
gcc -O -D_GNU_SOURCE -DIS_IN_build -include /home/aurel32/glibc-build/config.h isomac.c -o /home/aurel32/glibc-build/stdlib/isomac
isomac.c: In function ‘get_null_defines’:
isomac.c:260:3: warning: implicit declaration of function ‘close’; did you mean ‘pclose’? [-Wimplicit-function-declaration]
close (fd);
^~~~~
pclose
Ffsll function randomly regress by ~20%, depending on how code gets
aligned in memory. Ffsll function code size is 17 bytes. Since default
function alignment is 16 bytes, it can load on 16, 32, 48 or 64 bytes
aligned memory. When ffsll function load at 16, 32 or 64 bytes aligned
memory, entire code fits in single 64 bytes cache line. When ffsll
function load at 48 bytes aligned memory, it splits in two cache line,
hence random regression.
Ffsll function size reduction from 17 bytes to 12 bytes ensures that it
will always fit in single 64 bytes cache line.
This patch fixes ffsll function random performance regression.
Noah Goldstein [Fri, 11 Aug 2023 23:47:17 +0000 (18:47 -0500)]
x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.
On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision. Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8b9a0af8ca012217bf90d1dc0694f85b49ae09da)
x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```
Split `shared` (cumulative cache size) from `shared_per_thread` (cache
size per socket), the `shared_per_thread` *can* be slightly off from
the previous calculation.
Previously we added `core` even if `threads_l2` was invalid, and only
used `threads_l2` to divide `core` if it was present. The changed
version only included `core` if `threads_l2` was valid.
This change restores the old behavior if `threads_l2` is invalid by
adding the entire value of `core`. Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 47f747217811db35854ea06741be3685e8bbd44d)
Noah Goldstein [Thu, 10 Aug 2023 17:13:26 +0000 (12:13 -0500)]
x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
Current `non_temporal_threshold` set to roughly '3/4 * sizeof_L3 /
ncores_per_socket'. This patch updates that value to roughly
'sizeof_L3 / 4`
The original value (specifically dividing the `ncores_per_socket`) was
done to limit the amount of other threads' data a `memcpy`/`memset`
could evict.
Dividing by 'ncores_per_socket', however leads to exceedingly low
non-temporal thresholds and leads to using non-temporal stores in
cases where REP MOVSB is multiple times faster.
Furthermore, non-temporal stores are written directly to main memory
so using it at a size much smaller than L3 can place soon to be
accessed data much further away than it otherwise could be. As well,
modern machines are able to detect streaming patterns (especially if
REP MOVSB is used) and provide LRU hints to the memory subsystem. This
in affect caps the total amount of eviction at 1/cache_associativity,
far below meaningfully thrashing the entire cache.
As best I can tell, the benchmarks that lead this small threshold
where done comparing non-temporal stores versus standard cacheable
stores. A better comparison (linked below) is to be REP MOVSB which,
on the measure systems, is nearly 2x faster than non-temporal stores
at the low-end of the previous threshold, and within 10% for over
100MB copies (well past even the current threshold). In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.
The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.
Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores where done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks
Sheets results (also available in pdf on the github):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml Reviewed-by: DJ Delorie <dj@redhat.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit af992e7abdc9049714da76cae1e5e18bc4838fb8)
The signal handler installed in the ELF constructor cannot easily
be removed again (because the program may have changed handlers
in the meantime). Mark the object as NODELETE so that the registered
handler function is never unloaded.
Andreas Schwab [Mon, 29 Aug 2022 13:05:40 +0000 (15:05 +0200)]
Add test for bug 29530
This tests for a bug that was introduced in commit edc1686af0 ("vfprintf:
Reuse work_buffer in group_number") and fixed as a side effect of commit 6caddd34bd ("Remove most vfprintf width/precision-dependent allocations
(bug 14231, bug 26211).").
Joseph Myers [Mon, 8 Nov 2021 19:11:51 +0000 (19:11 +0000)]
Fix memmove call in vfprintf-internal.c:group_number
A recent GCC mainline change introduces errors of the form:
vfprintf-internal.c: In function 'group_number':
vfprintf-internal.c:2093:15: error: 'memmove' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
2093 | memmove (w, s, (front_ptr -s) * sizeof (CHAR_T));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a genuine bug in the glibc code: s > front_ptr is always true
at this point in the code, and the intent is clearly for the
subtraction to be the other way round. The other arguments to the
memmove call here also appear to be wrong; w and s point just *after*
the destination and source for copying the rest of the number, so the
size needs to be subtracted to get appropriate pointers for the
copying. Adjust the memmove call to conform to the apparent intent of
the code, so fixing the -Wstringop-overflow error.
Now, if the original code were ever executed, a buffer overrun would
result. However, I believe this code (introduced in commit edc1686af0c0fc2eb535f1d38cdf63c1a5a03675, "vfprintf: Reuse work_buffer
in group_number", so in glibc 2.26) is unreachable in prior glibc
releases (so there is no need for a bug in Bugzilla, no need to
consider any backports unless someone wants to build older glibc
releases with GCC 12 and no possibility of this buffer overrun
resulting in a security issue).
work_buffer is 1000 bytes / 250 wide characters. This case is only
reachable if an initial part of the number, plus a grouped copy of the
rest of the number, fail to fit in that space; that is, if the grouped
number fails to fit in the space. In the wide character case,
grouping is always one wide character, so even with a locale (of which
there aren't any in glibc) grouping every digit, a number would need
to occupy at least 125 wide characters to overflow, and a 64-bit
integer occupies at most 23 characters in octal including a leading 0.
In the narrow character case, the multibyte encoding of the grouping
separator would need to be at least 42 bytes to overflow, again
supposing grouping every digit, but MB_LEN_MAX is 16. So even if we
admit the case of artificially constructed locales not shipped with
glibc, given that such a locale would need to use one of the character
sets supported by glibc, this code cannot be reached at present. (And
POSIX only actually specifies the ' flag for grouping for decimal
output, though glibc acts on it for other bases as well.)
With binary output (if you consider use of grouping there to be
valid), you'd need a 15-byte multibyte character for overflow; I don't
know if any supported character set has such a character (if, again,
we admit constructed locales using grouping every digit and a grouping
separator chosen to have a multibyte encoding as long as possible, as
well as accepting use of grouping with binary), but given that we have
this code at all (clearly it's not *correct*, or in accordance with
the principle of avoiding arbitrary limits, to skip grouping on
running out of internal space like that), I don't think it should need
any further changes for binary printf support to go in.
On the other hand, support for large sizes of _BitInt in printf (see
the N2858 proposal) *would* require something to be done about such
arbitrary limits (presumably using dynamic allocation in printf again,
for sufficiently large _BitInt arguments only - currently only
floating-point uses dynamic allocation, and, as previously discussed,
that could actually be replaced by bounded allocation given smarter
code).
Tested with build-many-glibcs.py for aarch64-linux-gnu (GCC mainline).
Also tested natively for x86_64.
Joseph Myers [Tue, 7 Jul 2020 14:54:12 +0000 (14:54 +0000)]
Remove most vfprintf width/precision-dependent allocations (bug 14231, bug 26211).
The vfprintf implementation (used for all printf-family functions)
contains complicated logic to allocate internal buffers of a size
depending on the width and precision used for a format, using either
malloc or alloca depending on that size, and with consequent checks
for size overflow and allocation failure.
As noted in bug 26211, the version of that logic used when '$' plus
argument number formats are in use is missing the overflow checks,
which can result in segfaults (quite possibly exploitable, I didn't
try to work that out) when the width or precision is in the range
0x7fffffe0 through 0x7fffffff (maybe smaller values as well in the
wprintf case on 32-bit systems, when the multiplication by sizeof
(CHAR_T) can overflow).
All that complicated logic in fact appears to be useless. As far as I
can tell, there has been no need (outside the floating-point printf
code, which does its own allocations) for allocations depending on
width or precision since commit 3e95f6602b226e0de06aaff686dc47b282d7cc16 ("Remove limitation on size
of precision for integers", Sun Sep 12 21:23:32 1999 +0000). Thus,
this patch removes that logic completely, thereby fixing both problems
with excessive allocations for large width and precision for
non-floating-point formats, and the problem with missing overflow
checks with such allocations. Note that this does have the
consequence that width and precision up to INT_MAX are now allowed
where previously INT_MAX / sizeof (CHAR_T) - EXTSIZ or more would have
been rejected, so could potentially expose any other overflows where
the value would previously have been rejected by those removed checks.
I believe this completely fixes bugs 14231 and 26211.
Excessive allocations are still possible in the floating-point case
(bug 21127), as are other integer or buffer overflows (see bug 26201).
This does not address the cases where a precision larger than INT_MAX
(embedded in the format string) would be meaningful without printf's
return value overflowing (when it's used with a string format, or %g
without the '#' flag, so the actual output will be much smaller), as
mentioned in bug 17829 comment 8; using size_t internally for
precision to handle that case would be complicated by struct
printf_info being a public ABI. Nor does it address the matter of an
INT_MIN width being negated (bug 17829 comment 7; the same logic
appears a second time in the file as well, in the form of multiplying
by -1). There may be other sources of memory allocations with malloc
in printf functions as well (bug 24988, bug 16060). From inspection,
I think there are also integer overflows in two copies of "if ((width
-= len) < 0)" logic (where width is int, len is size_t and a very long
string could result in spurious padding being output on a 32-bit
system before printf overflows the count of output characters).
Florian Weimer [Thu, 19 Mar 2020 21:32:28 +0000 (18:32 -0300)]
stdio: Remove memory leak from multibyte convertion [BZ#25691]
This is an updated version of a previous patch [1] with the
following changes:
- Use compiler overflow builtins on done_add_func function.
- Define the scratch +utstring_converted_wide_string using
CHAR_T.
- Added a testcase and mention the bug report.
Both default and wide printf functions might leak memory when
manipulate multibyte characters conversion depending of the size
of the input (whether __libc_use_alloca trigger or not the fallback
heap allocation).
This patch fixes it by removing the extra memory allocation on
string formatting with conversion parts.
The testcase uses input argument size that trigger memory leaks
on unpatched code (using a scratch buffer the threashold to use
heap allocation is lower).
Noah Goldstein [Fri, 18 Feb 2022 23:00:25 +0000 (17:00 -0600)]
x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.c
Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)
Noah Goldstein [Fri, 18 Feb 2022 20:19:15 +0000 (14:19 -0600)]
x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)
Noah Goldstein [Tue, 15 Feb 2022 14:18:15 +0000 (08:18 -0600)]
x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]
In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
added wcsnlen-sse4.1 to the wcslen ifunc implementation list. Since the
random value in the the RSI register is larger than the wide-character
string length in the existing wcslen test, it didn't trigger the wcslen
test failure. Add a test to force 0 into the RSI register before calling
wcslen.
Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did
not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit
fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to the ifunc
implementation list.
Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
Noah Goldstein [Wed, 9 Jun 2021 20:17:14 +0000 (16:17 -0400)]
String: Add overflow tests for strnlen, memchr, and strncat [BZ #27974]
This commit adds tests for a bug in the wide char variant of the
functions where the implementation may assume that maxlen for wcsnlen
or n for wmemchr/strncat will not overflow when multiplied by
sizeof(wchar_t).
These tests show the following implementations failing on x86_64:
wcsnlen-sse4_1
wcsnlen-avx2
wmemchr-sse2
wmemchr-avx2
strncat would fail as well if it where on a system that prefered
either of the wcsnlen implementations that failed as it relies on
wcsnlen.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Noah Goldstein [Wed, 23 Jun 2021 05:19:34 +0000 (01:19 -0400)]
x86-64: Add wcslen optimize for sse4.1
No bug. This comment adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)
H.J. Lu [Wed, 23 Jun 2021 03:42:10 +0000 (20:42 -0700)]
x86-64: Move strlen.S to multiarch/strlen-vec.S
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
Noah Goldstein [Mon, 3 May 2021 07:03:19 +0000 (03:03 -0400)]
x86: Optimize memchr-evex.S
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.
Noah Goldstein [Mon, 3 May 2021 07:01:58 +0000 (03:01 -0400)]
x86: Optimize memchr-avx2.S
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
asaving a few instructions the in loop return loop. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)
H.J. Lu [Sun, 7 Mar 2021 17:45:23 +0000 (09:45 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Sun, 7 Mar 2021 17:44:18 +0000 (09:44 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Tue, 23 Feb 2021 14:33:10 +0000 (06:33 -0800)]
x86: Add string/memory function tests in RTM region
At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.
H.J. Lu [Fri, 5 Mar 2021 15:26:42 +0000 (07:26 -0800)]
x86-64: Add AVX optimized string/memory functions for RTM
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with
xtest
jz 1f
vzeroall
ret
1:
vzeroupper
ret
at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
H.J. Lu [Fri, 5 Mar 2021 15:20:28 +0000 (07:20 -0800)]
x86-64: Add memcmp family functions with 256-bit EVEX
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
H.J. Lu [Fri, 5 Mar 2021 15:15:03 +0000 (07:15 -0800)]
x86-64: Add memset family functions with 256-bit EVEX
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Fri, 5 Mar 2021 14:46:08 +0000 (06:46 -0800)]
x86-64: Add memmove family functions with 256-bit EVEX
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:36:50 +0000 (06:36 -0800)]
x86-64: Add strcpy family functions with 256-bit EVEX
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:24:52 +0000 (06:24 -0800)]
x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.
For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
H.J. Lu [Fri, 26 Feb 2021 13:36:59 +0000 (05:36 -0800)]
x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
Noah Goldstein [Sun, 9 Jan 2022 22:02:21 +0000 (16:02 -0600)]
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
test-container: Install with $(all-subdirs) [BZ #24794]
Whenever a sub-make is created, it inherits the variable subdirs from its
parent. This is also true when make check is called with a restricted
list of subdirs. In this scenario, make install is executed "partially"
and testroot.pristine ends up with an incomplete installation.
[BZ #24794]
* Makefile (testroot.pristine/install.stamp): Pass
subdirs='$(all-subdirs)' to make install.
Fix SXID_ERASE behavior in setuid programs (BZ #27471)
When parse_tunables tries to erase a tunable marked as SXID_ERASE for
setuid programs, it ends up setting the envvar string iterator
incorrectly, because of which it may parse the next tunable
incorrectly. Given that currently the implementation allows malformed
and unrecognized tunables pass through, it may even allow SXID_ERASE
tunables to go through.
This change revamps the SXID_ERASE implementation so that:
- Only valid tunables are written back to the tunestr string, because
of which children of SXID programs will only inherit a clean list of
identified tunables that are not SXID_ERASE.
- Unrecognized tunables get scrubbed off from the environment and
subsequently from the child environment.
- This has the side-effect that a tunable that is not identified by
the setxid binary, will not be passed on to a non-setxid child even
if the child could have identified that tunable. This may break
applications that expect this behaviour but expecting such tunables
to cross the SXID boundary is wrong. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2ed18c5b534d9e92fc006202a5af0df6b72e7aca)
Instead of passing GLIBC_TUNABLES via the environment, pass the
environment variable from parent to child. This allows us to test
multiple variables to ensure better coverage.
The test list currently only includes the case that's already being
tested. More tests will be added later. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 061fe3f8add46a89b7453e87eabb9c4695005ced)
tst-env-setuid: Use support_capture_subprogram_self_sgid
Use the support_capture_subprogram_self_sgid to spawn an sgid child. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ca335281068a1ed549a75ee64f90a8310755956f)
Add a new function support_capture_subprogram_self_sgid that spawns an
sgid child of the running program with its own image and returns the
exit code of the child process. This functionality is used by at
least three tests in the testsuite at the moment, so it makes sense to
consolidate.
There is also a new function support_subprogram_wait which should
provide simple system() like functionality that does not set up file
actions. This is useful in cases where only the return code of the
spawned subprocess is interesting.
This patch also ports tst-secure-getenv to this new function. A
subsequent patch will port other tests. This also brings an important
change to tst-secure-getenv behaviour. Now instead of succeeding, the
test fails as UNSUPPORTED if it is unable to spawn a setgid child,
which is how it should have been in the first place. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 716a3bdc41b2b4b864dc64475015ba51e35e1273)
DJ Delorie [Thu, 25 Feb 2021 21:08:21 +0000 (16:08 -0500)]
nscd: Fix double free in netgroupcache [BZ #27462]
In commit 745664bd798ec8fd50438605948eea594179fba1 a use-after-free
was fixed, but this led to an occasional double-free. This patch
tracks the "live" allocation better.
Florian Weimer [Wed, 27 Jan 2021 12:36:12 +0000 (13:36 +0100)]
gconv: Fix assertion failure in ISO-2022-JP-3 module (bug 27256)
The conversion loop to the internal encoding does not follow
the interface contract that __GCONV_FULL_OUTPUT is only returned
after the internal wchar_t buffer has been filled completely. This
is enforced by the first of the two asserts in iconv/skeleton.c:
/* We must run out of output buffer space in this
rerun. */
assert (outbuf == outerr);
assert (nstatus == __GCONV_FULL_OUTPUT);
This commit solves this issue by queuing a second wide character
which cannot be written immediately in the state variable, like
other converters already do (e.g., BIG5-HKSCS or TSCII).
H.J. Lu [Mon, 28 Dec 2020 13:28:49 +0000 (05:28 -0800)]
x86: Check IFUNC definition in unrelocated executable [BZ #20019]
Calling an IFUNC function defined in unrelocated executable also leads to
segfault. Issue a fatal error message when calling IFUNC function defined
in the unrelocated executable from a shared library.
On x86, ifuncmain6pie failed with:
[hjl@gnu-cfl-2 build-i686-linux]$ ./elf/ifuncmain6pie --direct
./elf/ifuncmain6pie: IFUNC symbol 'foo' referenced in '/export/build/gnu/tools-build/glibc-32bit/build-i686-linux/elf/ifuncmod6.so' is defined in the executable and creates an unsatisfiable circular dependency.
[hjl@gnu-cfl-2 build-i686-linux]$ readelf -rW elf/ifuncmod6.so | grep foo 00003ff400000706 R_386_GLOB_DAT 0000400c foo_ptr 00003ff800000406 R_386_GLOB_DAT 00000000 foo 0000400c00000401 R_386_32 00000000 foo
[hjl@gnu-cfl-2 build-i686-linux]$
Remove non-JUMP_SLOT relocations against foo in ifuncmod6.so, which
trigger the circular IFUNC dependency, and build ifuncmain6pie with
-Wl,-z,lazy.
H.J. Lu [Tue, 12 Jan 2021 13:15:49 +0000 (05:15 -0800)]
x86-64: Avoid rep movsb with short distance [BZ #27130]
When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:
cmpl $63, %ecx
// Don't use "rep movsb" if ECX <= 63
jbe L(Don't use rep movsb")
Use "rep movsb"
Benchtests data with bench-memcpy, bench-memcpy-large, bench-memcpy-random
and bench-memcpy-walk on Skylake, Ice Lake and Tiger Lake show that its
performance impact is within noise range as "rep movsb" is only used for
data size >= 4KB.
The variant PCS support was ineffective because in the common case
linkmap->l_mach.plt == 0 but then the symbol table flags were ignored
and normal lazy binding was used instead of resolving the relocs early.
(This was a misunderstanding about how GOT[1] is setup by the linker.)
In practice this mainly affects SVE calls when the vector length is
more than 128 bits, then the top bits of the argument registers get
clobbered during lazy binding.
Wilco Dijkstra [Wed, 11 Mar 2020 17:15:25 +0000 (17:15 +0000)]
[AArch64] Improve integer memcpy
Further optimize integer memcpy. Small cases now include copies up
to 32 bytes. 64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies. Comments have been rewritten.
Krzysztof Koch [Tue, 5 Nov 2019 17:35:18 +0000 (17:35 +0000)]
aarch64: Increase small and medium cases for __memcpy_generic
Increase the upper bound on medium cases from 96 to 128 bytes.
Now, up to 128 bytes are copied unrolled.
Increase the upper bound on small cases from 16 to 32 bytes so that
copies of 17-32 bytes are not impacted by the larger medium case.
Benchmarking:
The attached figures show relative timing difference with respect
to 'memcpy_generic', which is the existing implementation.
'memcpy_med_128' denotes the the version of memcpy_generic with
only the medium case enlarged. The 'memcpy_med_128_small_32' numbers
are for the version of memcpy_generic submitted in this patch, which
has both medium and small cases enlarged. The figures were generated
using the script from:
https://www.sourceware.org/ml/libc-alpha/2019-10/msg00563.html
Depending on the platform, the performance improvement in the
bench-memcpy-random.c benchmark ranges from 6% to 20% between
the original and final version of memcpy.S
Tested against GLIBC testsuite and randomized tests.
Wilco Dijkstra [Fri, 28 Aug 2020 16:51:40 +0000 (17:51 +0100)]
AArch64: Improve backwards memmove performance
On some microarchitectures performance of the backwards memmove improves if
the stores use STR with decreasing addresses. So change the memmove loop
in memcpy_advsimd.S to use 2x STR rather than STP.
Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize. Similar to the generic memcpy, small cases
include copies up to 32 bytes. 64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies. Large copies align
the source rather than the destination.
bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).
strcmp-avx2.S: In avx2 strncmp function, strings are compared in
chunks of 4 vector size(i.e. 32x4=128 byte for avx2). After first 4
vector size comparison, code must check whether it already passed
the given offset. This patch implement avx2 offset check condition
for strncmp function, if both string compare same for first 4 vector
size.
Florian Weimer [Tue, 19 May 2020 12:09:38 +0000 (14:09 +0200)]
nss_compat: internal_end*ent may clobber errno, hiding ERANGE [BZ #25976]
During cleanup, before returning from get*_r functions, the end*ent
calls must not change errno. Otherwise, an ERANGE error from the
underlying implementation can be hidden, causing unexpected lookup
failures. This commit introduces an internal_end*ent_noerror
function which saves and restore errno, and marks the original
internal_end*ent function as warn_unused_result, so that it is used
only in contexts were errors from it can be handled explicitly.
Andreas Schwab [Mon, 20 Jan 2020 16:01:50 +0000 (17:01 +0100)]
Fix array overflow in backtrace on PowerPC (bug 25423)
When unwinding through a signal frame the backtrace function on PowerPC
didn't check array bounds when storing the frame address. Fixes commit d400dcac5e ("PowerPC: fix backtrace to handle signal trampolines").
Joseph Myers [Wed, 12 Feb 2020 23:31:56 +0000 (23:31 +0000)]
Avoid ldbl-96 stack corruption from range reduction of pseudo-zero (bug 25487).
Bug 25487 reports stack corruption in ldbl-96 sinl on a pseudo-zero
argument (an representation where all the significand bits, including
the explicit high bit, are zero, but the exponent is not zero, which
is not a valid representation for the long double type).
Although this is not a valid long double representation, existing
practice in this area (see bug 4586, originally marked invalid but
subsequently fixed) is that we still seek to avoid invalid memory
accesses as a result, in case of programs that treat arbitrary binary
data as long double representations, although the invalid
representations of the ldbl-96 format do not need to be consistently
handled the same as any particular valid representation.
This patch makes the range reduction detect pseudo-zero and unnormal
representations that would otherwise go to __kernel_rem_pio2, and
returns a NaN for them instead of continuing with the range reduction
process. (Pseudo-zero and unnormal representations whose unbiased
exponent is less than -1 have already been safely returned from the
function before this point without going through the rest of range
reduction.) Pseudo-zero representations would previously result in
the value passed to __kernel_rem_pio2 being all-zero, which is
definitely unsafe; unnormal representations would previously result in
a value passed whose high bit is zero, which might well be unsafe
since that is not a form of input expected by __kernel_rem_pio2.
Florian Weimer [Thu, 5 Dec 2019 16:29:42 +0000 (17:29 +0100)]
misc/test-errno-linux: Handle EINVAL from quotactl
In commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 ("xfs: Sanity check
flags of Q_XQUOTARM call"), Linux 5.4 added checking for the flags
argument, causing the test to fail due to too restrictive test
expectations.
Kamlesh Kumar [Thu, 5 Dec 2019 15:55:19 +0000 (16:55 +0100)]
<string.h>: Define __CORRECT_ISO_CPP_STRING_H_PROTO for Clang [BZ #25232]
Without the asm redirects, strchr et al. are not const-correct.
libc++ has a wrapper header that works with and without
__CORRECT_ISO_CPP_STRING_H_PROTO (using a Clang extension). But when
Clang is used with libstdc++ or just C headers, the overloaded functions
with the correct types are not declared.
This change does not impact current GCC (with libstdc++ or libc++).
Florian Weimer [Tue, 3 Dec 2019 20:08:49 +0000 (21:08 +0100)]
x86: Assume --enable-cet if GCC defaults to CET [BZ #25225]
This links in CET support if GCC defaults to CET. Otherwise, __CET__
is defined, yet CET functionality is not compiled and linked into the
dynamic loader, resulting in a linker failure due to undefined
references to _dl_cet_check and _dl_open_check.
Florian Weimer [Thu, 28 Nov 2019 13:17:27 +0000 (14:17 +0100)]
libio: Disable vtable validation for pre-2.1 interposed handles [BZ #25203]
Commit c402355dfa7807b8e0adb27c009135a7e2b9f1b0 ("libio: Disable
vtable validation in case of interposition [BZ #23313]") only covered
the interposable glibc 2.1 handles, in libio/stdfiles.c. The
parallel code in libio/oldstdfiles.c needs similar detection logic.
rtld: Check __libc_enable_secure before honoring LD_PREFER_MAP_32BIT_EXEC (CVE-2019-19126) [BZ #25204]
The problem was introduced in glibc 2.23, in commit b9eb92ab05204df772eb4929eccd018637c9f3e9
("Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT").
Linux: Use in-tree copy of SO_ constants for !__USE_MISC [BZ #24532]
The kernel changes for a 64-bit time_t on 32-bit architectures
resulted in <asm/socket.h> indirectly including <linux/posix_types.h>.
The latter is not namespace-clean for the POSIX version of
<sys/socket.h>.
This issue has persisted across several Linux releases, so this commit
creates our own copy of the SO_* definitions for !__USE_MISC mode.
The new test socket/tst-socket-consts ensures that the copy is
consistent with the kernel definitions (which vary across
architectures). The test is tricky to get right because CPPFLAGS
includes include/libc-symbols.h, which in turn defines _GNU_SOURCE
unconditionally.
Tested with build-many-glibcs.py. I verified that a discrepancy in
the definitions actually results in a failure of the
socket/tst-socket-consts test.
DJ Delorie [Wed, 30 Oct 2019 22:03:14 +0000 (18:03 -0400)]
Base max_fast on alignment, not width, of bins (Bug 24903)
set_max_fast sets the "impossibly small" value based on,
eventually, MALLOC_ALIGNMENT. The comparisons for the smallest
chunk used is, eventually, MIN_CHUNK_SIZE. Note that i386
is the only platform where these are the same, so a smallest
chunk *would* be put in a no-fastbins fastbin.
This change calculates the "impossibly small" value
based on MIN_CHUNK_SIZE instead, so that we can know it will
always be impossibly small.
malloc: Fix missing accounting of top chunk in malloc_info [BZ #24026]
Fixes `<total type="rest" size="..."> incorrectly showing as 0 most
of the time.
The rest value being wrong is significant because to compute the
actual amount of memory handed out via malloc, the user must subtract
it from <system type="current" size="...">. That result being wrong
makes investigating memory fragmentation issues like
<https://bugzilla.redhat.com/show_bug.cgi?id=843478> close to
impossible.