Andreas Schwab [Mon, 29 Aug 2022 13:05:40 +0000 (15:05 +0200)]
Add test for bug 29530
This tests for a bug that was introduced in commit edc1686af0 ("vfprintf:
Reuse work_buffer in group_number") and fixed as a side effect of commit 6caddd34bd ("Remove most vfprintf width/precision-dependent allocations
(bug 14231, bug 26211).").
Joseph Myers [Mon, 8 Nov 2021 19:11:51 +0000 (19:11 +0000)]
Fix memmove call in vfprintf-internal.c:group_number
A recent GCC mainline change introduces errors of the form:
vfprintf-internal.c: In function 'group_number':
vfprintf-internal.c:2093:15: error: 'memmove' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
2093 | memmove (w, s, (front_ptr -s) * sizeof (CHAR_T));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a genuine bug in the glibc code: s > front_ptr is always true
at this point in the code, and the intent is clearly for the
subtraction to be the other way round. The other arguments to the
memmove call here also appear to be wrong; w and s point just *after*
the destination and source for copying the rest of the number, so the
size needs to be subtracted to get appropriate pointers for the
copying. Adjust the memmove call to conform to the apparent intent of
the code, so fixing the -Wstringop-overflow error.
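For reference, a minimal sketch of the adjusted call this implies (with
s > front_ptr, and w and s pointing just past the destination and
source regions):

    /* The copy size is s - front_ptr; subtract it from the end
       pointers to obtain the start of each region.  */
    memmove (w - (s - front_ptr), front_ptr,
             (s - front_ptr) * sizeof (CHAR_T));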
Now, if the original code were ever executed, a buffer overrun would
result. However, I believe this code (introduced in commit edc1686af0c0fc2eb535f1d38cdf63c1a5a03675, "vfprintf: Reuse work_buffer
in group_number", so in glibc 2.26) is unreachable in prior glibc
releases (so there is no need for a bug in Bugzilla, no need to
consider any backports unless someone wants to build older glibc
releases with GCC 12 and no possibility of this buffer overrun
resulting in a security issue).
work_buffer is 1000 bytes / 250 wide characters. This case is only
reachable if an initial part of the number, plus a grouped copy of the
rest of the number, fail to fit in that space; that is, if the grouped
number fails to fit in the space. In the wide character case,
grouping is always one wide character, so even with a locale (of which
there aren't any in glibc) grouping every digit, a number would need
to occupy at least 125 wide characters to overflow, and a 64-bit
integer occupies at most 23 characters in octal including a leading 0.
In the narrow character case, the multibyte encoding of the grouping
separator would need to be at least 42 bytes to overflow, again
supposing grouping every digit, but MB_LEN_MAX is 16. So even if we
admit the case of artificially constructed locales not shipped with
glibc, given that such a locale would need to use one of the character
sets supported by glibc, this code cannot be reached at present. (And
POSIX only actually specifies the ' flag for grouping for decimal
output, though glibc acts on it for other bases as well.)
With binary output (if you consider use of grouping there to be
valid), you'd need a 15-byte multibyte character for overflow; I don't
know if any supported character set has such a character (if, again,
we admit constructed locales using grouping every digit and a grouping
separator chosen to have a multibyte encoding as long as possible, as
well as accepting use of grouping with binary), but given that we have
this code at all (clearly it's not *correct*, or in accordance with
the principle of avoiding arbitrary limits, to skip grouping on
running out of internal space like that), I don't think it should need
any further changes for binary printf support to go in.
On the other hand, support for large sizes of _BitInt in printf (see
the N2858 proposal) *would* require something to be done about such
arbitrary limits (presumably using dynamic allocation in printf again,
for sufficiently large _BitInt arguments only - currently only
floating-point uses dynamic allocation, and, as previously discussed,
that could actually be replaced by bounded allocation given smarter
code).
Tested with build-many-glibcs.py for aarch64-linux-gnu (GCC mainline).
Also tested natively for x86_64.
Joseph Myers [Tue, 7 Jul 2020 14:54:12 +0000 (14:54 +0000)]
Remove most vfprintf width/precision-dependent allocations (bug 14231, bug 26211).
The vfprintf implementation (used for all printf-family functions)
contains complicated logic to allocate internal buffers of a size
depending on the width and precision used for a format, using either
malloc or alloca depending on that size, and with consequent checks
for size overflow and allocation failure.
As noted in bug 26211, the version of that logic used when '$' plus
argument number formats are in use is missing the overflow checks,
which can result in segfaults (quite possibly exploitable, I didn't
try to work that out) when the width or precision is in the range
0x7fffffe0 through 0x7fffffff (maybe smaller values as well in the
wprintf case on 32-bit systems, when the multiplication by sizeof
(CHAR_T) can overflow).
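To make that overflow concrete, a hypothetical 32-bit illustration
(assuming EXTSIZ is 32, consistent with the 0x7fffffe0 bound quoted
above):

    int width = 0x7fffffe0;   /* taken from the format string */
    int needed = width + 32;  /* width + EXTSIZ overflows INT_MAX */
    size_t bytes = (size_t) needed * sizeof (wchar_t);  /* bogus size */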
All that complicated logic in fact appears to be useless. As far as I
can tell, there has been no need (outside the floating-point printf
code, which does its own allocations) for allocations depending on
width or precision since commit 3e95f6602b226e0de06aaff686dc47b282d7cc16 ("Remove limitation on size
of precision for integers", Sun Sep 12 21:23:32 1999 +0000). Thus,
this patch removes that logic completely, thereby fixing both problems
with excessive allocations for large width and precision for
non-floating-point formats, and the problem with missing overflow
checks with such allocations. Note that this does have the
consequence that width and precision up to INT_MAX are now allowed
where previously INT_MAX / sizeof (CHAR_T) - EXTSIZ or more would have
been rejected, so could potentially expose any other overflows where
the value would previously have been rejected by those removed checks.
I believe this completely fixes bugs 14231 and 26211.
Excessive allocations are still possible in the floating-point case
(bug 21127), as are other integer or buffer overflows (see bug 26201).
This does not address the cases where a precision larger than INT_MAX
(embedded in the format string) would be meaningful without printf's
return value overflowing (when it's used with a string format, or %g
without the '#' flag, so the actual output will be much smaller), as
mentioned in bug 17829 comment 8; using size_t internally for
precision to handle that case would be complicated by struct
printf_info being a public ABI. Nor does it address the matter of an
INT_MIN width being negated (bug 17829 comment 7; the same logic
appears a second time in the file as well, in the form of multiplying
by -1). There may be other sources of memory allocations with malloc
in printf functions as well (bug 24988, bug 16060). From inspection,
I think there are also integer overflows in two copies of "if ((width
-= len) < 0)" logic (where width is int, len is size_t and a very long
string could result in spurious padding being output on a 32-bit
system before printf overflows the count of output characters).
Florian Weimer [Thu, 19 Mar 2020 21:32:28 +0000 (18:32 -0300)]
stdio: Remove memory leak from multibyte conversion [BZ#25691]
This is an updated version of a previous patch [1] with the
following changes:
- Use compiler overflow builtins in the done_add_func function (see
the sketch after this list).
- Define the scratch string converted_wide_string using CHAR_T.
- Add a testcase and mention the bug report.
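A minimal sketch of the overflow-builtin idea from the first item
(assumed shape only, not the exact code in vfprintf-internal.c):

    #include <stdbool.h>
    #include <stddef.h>

    /* Accumulate the count of written characters, failing on int
       overflow instead of wrapping.  */
    static bool
    done_add (int *done, size_t length)
    {
      int ret;
      if (__builtin_add_overflow (*done, length, &ret))
        return false;   /* the conversion must fail, not wrap */
      *done = ret;
      return true;
    }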
Both the default and wide printf functions might leak memory when
handling multibyte character conversion, depending on the size of the
input (whether or not __libc_use_alloca triggers the fallback heap
allocation).
This patch fixes it by removing the extra memory allocation on
string formatting with conversion parts.
The testcase uses an input argument size that triggers the memory
leak on unpatched code (with a scratch buffer, the threshold for
using heap allocation is lower).
Noah Goldstein [Fri, 18 Feb 2022 23:00:25 +0000 (17:00 -0600)]
x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.c
Previously TEST_NAME was a function pointer rather than a string. This
didn't fail because of the -Wno-error flag (used to allow overflow
sizes to be passed to strncmp/wcsncmp).
Noah Goldstein [Fri, 18 Feb 2022 20:19:15 +0000 (14:19 -0600)]
x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]
In the overflow fallback, strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. These have no checks
around vzeroupper and would trigger spurious aborts. This commit
fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)
Noah Goldstein [Tue, 15 Feb 2022 14:18:15 +0000 (08:18 -0600)]
x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]
In the overflow fallback, strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. These have no checks
around vzeroupper and would trigger spurious aborts. This commit
fixes that.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Commit 6f573a27b6c8b4236445810a44660612323f5a73 ("x86-64: Add wcslen
optimize for sse4.1") added wcsnlen-sse4.1 to the wcslen ifunc
implementation list. Since the random value in the RSI register is
larger than the wide-character string length in the existing wcslen
test, it didn't trigger the wcslen test failure. Add a test to force 0
into the RSI register before calling wcslen.
That commit added wcsnlen-sse4.1 to the wcslen ifunc implementation
list but did not add wcslen-sse4.1. This commit fixes that by removing
wcsnlen-sse4.1 from the wcslen ifunc implementation list and adding
wcslen-sse4.1 to it.
Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)
* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
which is set to indicate to updated software that the loaded microcode is
forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
be set by default after microcode update.
* Workloads that benefited from Intel TSX might experience a change
in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
Elision (HLE) and RTM bits to indicate to software that Intel TSX is
disabled.
1. Add RTM_ALWAYS_ABORT to CPUID features (a detection sketch follows
this list).
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set. This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcode update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.
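The detection sketch referenced in item 1, using GCC's cpuid
intrinsics (illustrative only, not the glibc feature plumbing):

    #include <cpuid.h>

    /* CPUID.07H.0H.EDX[11] is RTM_ALWAYS_ABORT, per the note above.  */
    static int
    rtm_always_abort (void)
    {
      unsigned int eax, ebx, ecx, edx;
      if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
      return (edx >> 11) & 1;
    }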
Noah Goldstein [Wed, 9 Jun 2021 20:17:14 +0000 (16:17 -0400)]
String: Add overflow tests for strnlen, memchr, and strncat [BZ #27974]
This commit adds tests for a bug in the wide char variant of the
functions where the implementation may assume that maxlen for wcsnlen
or n for wmemchr/strncat will not overflow when multiplied by
sizeof(wchar_t).
These tests show the following implementations failing on x86_64:
wcsnlen-sse4_1
wcsnlen-avx2
wmemchr-sse2
wmemchr-avx2
strncat would fail as well if it were run on a system that preferred
either of the failing wcsnlen implementations, as it relies on
wcsnlen.
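A simplified sketch of the overflow case these tests exercise (not
the actual glibc test harness):

    #include <assert.h>
    #include <stdint.h>
    #include <wchar.h>

    int
    main (void)
    {
      wchar_t buf[] = L"hello";
      /* A buggy implementation that computes maxlen * sizeof (wchar_t)
         wraps SIZE_MAX and may stop early or read out of bounds.  */
      assert (wcsnlen (buf, SIZE_MAX) == 5);
      return 0;
    }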
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)
No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
a bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.
Noah Goldstein [Wed, 23 Jun 2021 05:19:34 +0000 (01:19 -0400)]
x86-64: Add wcslen optimize for sse4.1
No bug. This commit adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)
H.J. Lu [Wed, 23 Jun 2021 03:42:10 +0000 (20:42 -0700)]
x86-64: Move strlen.S to multiarch/strlen-vec.S
Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.
Noah Goldstein [Mon, 3 May 2021 07:03:19 +0000 (03:03 -0400)]
x86: Optimize memchr-evex.S
No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)
No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are a bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.
Noah Goldstein [Mon, 3 May 2021 07:01:58 +0000 (03:01 -0400)]
x86: Optimize memchr-avx2.S
No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict, and
saving a few instructions in the loop return path. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)
H.J. Lu [Sun, 7 Mar 2021 17:45:23 +0000 (09:45 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions
Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Sun, 7 Mar 2021 17:44:18 +0000 (09:44 -0800)]
x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Tue, 23 Feb 2021 14:33:10 +0000 (06:33 -0800)]
x86: Add string/memory function tests in RTM region
At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.
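The shape of such a test, sketched with the RTM intrinsics
(hypothetical harness; compile with -mrtm):

    #include <immintrin.h>
    #include <string.h>

    /* Returns 0 if the string function commits inside an RTM
       transaction, 1 if the transaction aborts (e.g. because the
       function executed VZEROUPPER).  */
    static int
    rtm_aborts_on_memcpy (void)
    {
      char dst[64], src[64] = { 0 };
      if (_xbegin () == _XBEGIN_STARTED)
        {
          memcpy (dst, src, sizeof dst);
          _xend ();
          return 0;
        }
      return 1;
    }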
H.J. Lu [Fri, 5 Mar 2021 15:26:42 +0000 (07:26 -0800)]
x86-64: Add AVX optimized string/memory functions for RTM
Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with
xtest
jz 1f
vzeroall
ret
1:
vzeroupper
ret
at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.
H.J. Lu [Fri, 5 Mar 2021 15:20:28 +0000 (07:20 -0800)]
x86-64: Add memcmp family functions with 256-bit EVEX
Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.
H.J. Lu [Fri, 5 Mar 2021 15:15:03 +0000 (07:15 -0800)]
x86-64: Add memset family functions with 256-bit EVEX
Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.
H.J. Lu [Fri, 5 Mar 2021 14:46:08 +0000 (06:46 -0800)]
x86-64: Add memmove family functions with 256-bit EVEX
Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:36:50 +0000 (06:36 -0800)]
x86-64: Add strcpy family functions with 256-bit EVEX
Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.
H.J. Lu [Fri, 5 Mar 2021 14:24:52 +0000 (06:24 -0800)]
x86-64: Add ifunc-avx2.h functions with 256-bit EVEX
Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.
For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.
H.J. Lu [Fri, 26 Feb 2021 13:36:59 +0000 (05:36 -0800)]
x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP
1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.
Noah Goldstein [Sun, 9 Jan 2022 22:02:21 +0000 (16:02 -0600)]
x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]
Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.
test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.
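In C terms the redirect amounts to the following sketch (wcscmp
stands in for the internal __wcscmp_avx2 symbol named above):

    #include <stddef.h>
    #include <wchar.h>

    /* 2^56 wide characters is 2^58 bytes, beyond any x86_64 address
       space, so such an n cannot bound a real buffer; treat the
       comparison as unbounded.  */
    static int
    wcsncmp_sketch (const wchar_t *s1, const wchar_t *s2, size_t n)
    {
      if (n >= ((size_t) 1 << 56))
        return wcscmp (s1, s2);
      return wcsncmp (s1, s2, n);  /* bounded path */
    }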
test-container: Install with $(all-subdirs) [BZ #24794]
Whenever a sub-make is created, it inherits the variable subdirs from its
parent. This is also true when make check is called with a restricted
list of subdirs. In this scenario, make install is executed "partially"
and testroot.pristine ends up with an incomplete installation.
[BZ #24794]
* Makefile (testroot.pristine/install.stamp): Pass
subdirs='$(all-subdirs)' to make install.
Fix SXID_ERASE behavior in setuid programs (BZ #27471)
When parse_tunables tries to erase a tunable marked as SXID_ERASE for
setuid programs, it ends up setting the envvar string iterator
incorrectly, because of which it may parse the next tunable
incorrectly. Given that currently the implementation allows malformed
and unrecognized tunables pass through, it may even allow SXID_ERASE
tunables to go through.
This change revamps the SXID_ERASE implementation so that:
- Only valid tunables are written back to the tunestr string, because
of which children of SXID programs will only inherit a clean list of
identified tunables that are not SXID_ERASE.
- Unrecognized tunables get scrubbed off from the environment and
subsequently from the child environment.
- This has the side-effect that a tunable that is not identified by
the setxid binary, will not be passed on to a non-setxid child even
if the child could have identified that tunable. This may break
applications that expect this behaviour, but expecting such tunables
to cross the SXID boundary is wrong.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2ed18c5b534d9e92fc006202a5af0df6b72e7aca)
Instead of passing GLIBC_TUNABLES via the environment, pass the
environment variable from parent to child. This allows us to test
multiple variables to ensure better coverage.
The test list currently only includes the case that's already being
tested. More tests will be added later.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 061fe3f8add46a89b7453e87eabb9c4695005ced)
tst-env-setuid: Use support_capture_subprogram_self_sgid
Use support_capture_subprogram_self_sgid to spawn an sgid child.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ca335281068a1ed549a75ee64f90a8310755956f)
Add a new function support_capture_subprogram_self_sgid that spawns an
sgid child of the running program with its own image and returns the
exit code of the child process. This functionality is used by at
least three tests in the testsuite at the moment, so it makes sense to
consolidate.
There is also a new function support_subprogram_wait which should
provide simple system()-like functionality that does not set up file
actions. This is useful in cases where only the return code of the
spawned subprocess is interesting.
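A hedged usage sketch (the prototype of
support_capture_subprogram_self_sgid is assumed here, not quoted from
the support/ headers):

    #include <support/capture_subprocess.h>
    #include <support/check.h>

    /* Re-run the current program as an sgid child and check its exit
       status; the argument string is assumed to select the child
       mode.  */
    static void
    run_sgid_case (void)
    {
      int status = support_capture_subprogram_self_sgid ("SGID-CHILD");
      TEST_COMPARE (status, 0);
    }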
This patch also ports tst-secure-getenv to this new function. A
subsequent patch will port other tests. This also brings an important
change to tst-secure-getenv behaviour. Now instead of succeeding, the
test fails as UNSUPPORTED if it is unable to spawn a setgid child,
which is how it should have been in the first place.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 716a3bdc41b2b4b864dc64475015ba51e35e1273)
DJ Delorie [Thu, 25 Feb 2021 21:08:21 +0000 (16:08 -0500)]
nscd: Fix double free in netgroupcache [BZ #27462]
In commit 745664bd798ec8fd50438605948eea594179fba1 a use-after-free
was fixed, but this led to an occasional double-free. This patch
tracks the "live" allocation better.
Florian Weimer [Wed, 27 Jan 2021 12:36:12 +0000 (13:36 +0100)]
gconv: Fix assertion failure in ISO-2022-JP-3 module (bug 27256)
The conversion loop to the internal encoding does not follow
the interface contract that __GCONV_FULL_OUTPUT is only returned
after the internal wchar_t buffer has been filled completely. This
is enforced by the first of the two asserts in iconv/skeleton.c:
/* We must run out of output buffer space in this
rerun. */
assert (outbuf == outerr);
assert (nstatus == __GCONV_FULL_OUTPUT);
This commit solves this issue by queuing a second wide character
which cannot be written immediately in the state variable, like
other converters already do (e.g., BIG5-HKSCS or TSCII).
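A toy model of that queuing contract (field and function names are
illustrative, not the gconv ABI):

    #include <stddef.h>
    #include <wchar.h>

    struct toy_state
    {
      wchar_t pending;      /* queued second character */
      int has_pending;
    };

    /* Emit two produced characters into [out, out_end); the caller
       guarantees room for at least one.  A second character that does
       not fit is queued in the state rather than dropped, so "output
       full" is only reported with the buffer completely filled.  */
    static size_t
    emit_pair (struct toy_state *st, wchar_t a, wchar_t b,
               wchar_t *out, wchar_t *out_end)
    {
      size_t n = 0;
      out[n++] = a;
      if (out + 1 < out_end)
        out[n++] = b;
      else
        {
          st->pending = b;
          st->has_pending = 1;
        }
      return n;
    }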
H.J. Lu [Mon, 28 Dec 2020 13:28:49 +0000 (05:28 -0800)]
x86: Check IFUNC definition in unrelocated executable [BZ #20019]
Calling an IFUNC function defined in an unrelocated executable also
leads to a segfault. Issue a fatal error message when calling an IFUNC
function defined in the unrelocated executable from a shared library.
On x86, ifuncmain6pie failed with:
[hjl@gnu-cfl-2 build-i686-linux]$ ./elf/ifuncmain6pie --direct
./elf/ifuncmain6pie: IFUNC symbol 'foo' referenced in '/export/build/gnu/tools-build/glibc-32bit/build-i686-linux/elf/ifuncmod6.so' is defined in the executable and creates an unsatisfiable circular dependency.
[hjl@gnu-cfl-2 build-i686-linux]$ readelf -rW elf/ifuncmod6.so | grep foo
00003ff4  00000706 R_386_GLOB_DAT   0000400c  foo_ptr
00003ff8  00000406 R_386_GLOB_DAT   00000000  foo
0000400c  00000401 R_386_32         00000000  foo
[hjl@gnu-cfl-2 build-i686-linux]$
Remove non-JUMP_SLOT relocations against foo in ifuncmod6.so, which
trigger the circular IFUNC dependency, and build ifuncmain6pie with
-Wl,-z,lazy.
H.J. Lu [Tue, 12 Jan 2021 13:15:49 +0000 (05:15 -0800)]
x86-64: Avoid rep movsb with short distance [BZ #27130]
When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:
    cmpl    $63, %ecx
    // Don't use "rep movsb" if ECX <= 63
    jbe     L(dont_use_rep_movsb)
    Use "rep movsb"
Benchtests data with bench-memcpy, bench-memcpy-large, bench-memcpy-random
and bench-memcpy-walk on Skylake, Ice Lake and Tiger Lake show that its
performance impact is within noise range as "rep movsb" is only used for
data size >= 4KB.
The variant PCS support was ineffective because in the common case
linkmap->l_mach.plt == 0 but then the symbol table flags were ignored
and normal lazy binding was used instead of resolving the relocs early.
(This was a misunderstanding about how GOT[1] is set up by the linker.)
In practice this mainly affects SVE calls when the vector length is
more than 128 bits, then the top bits of the argument registers get
clobbered during lazy binding.
Wilco Dijkstra [Wed, 11 Mar 2020 17:15:25 +0000 (17:15 +0000)]
[AArch64] Improve integer memcpy
Further optimize integer memcpy. Small cases now include copies up
to 32 bytes. 64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies. Comments have been rewritten.
Krzysztof Koch [Tue, 5 Nov 2019 17:35:18 +0000 (17:35 +0000)]
aarch64: Increase small and medium cases for __memcpy_generic
Increase the upper bound on medium cases from 96 to 128 bytes.
Now, up to 128 bytes are copied unrolled.
Increase the upper bound on small cases from 16 to 32 bytes so that
copies of 17-32 bytes are not impacted by the larger medium case.
Benchmarking:
The attached figures show relative timing difference with respect
to 'memcpy_generic', which is the existing implementation.
'memcpy_med_128' denotes the version of memcpy_generic with
only the medium case enlarged. The 'memcpy_med_128_small_32' numbers
are for the version of memcpy_generic submitted in this patch, which
has both medium and small cases enlarged. The figures were generated
using the script from:
https://www.sourceware.org/ml/libc-alpha/2019-10/msg00563.html
Depending on the platform, the performance improvement in the
bench-memcpy-random.c benchmark ranges from 6% to 20% between
the original and final version of memcpy.S
Tested against GLIBC testsuite and randomized tests.
Wilco Dijkstra [Fri, 28 Aug 2020 16:51:40 +0000 (17:51 +0100)]
AArch64: Improve backwards memmove performance
On some microarchitectures performance of the backwards memmove improves if
the stores use STR with decreasing addresses. So change the memmove loop
in memcpy_advsimd.S to use 2x STR rather than STP.
Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize. Similar to the generic memcpy, small cases
include copies up to 32 bytes. 64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies. Large copies align
the source rather than the destination.
bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).
strcmp-avx2.S: In the AVX2 strncmp function, strings are compared in
chunks of 4 vectors (i.e. 32x4 = 128 bytes for AVX2). After the first
4-vector comparison, the code must check whether it has already passed
the given offset. This patch implements the AVX2 offset check for
strncmp, for the case where both strings compare equal for the first
4 vectors.
Florian Weimer [Tue, 19 May 2020 12:09:38 +0000 (14:09 +0200)]
nss_compat: internal_end*ent may clobber errno, hiding ERANGE [BZ #25976]
During cleanup, before returning from get*_r functions, the end*ent
calls must not change errno. Otherwise, an ERANGE error from the
underlying implementation can be hidden, causing unexpected lookup
failures. This commit introduces an internal_end*ent_noerror
function which saves and restores errno, and marks the original
internal_end*ent function as warn_unused_result, so that it is used
only in contexts where errors from it can be handled explicitly.
Andreas Schwab [Mon, 20 Jan 2020 16:01:50 +0000 (17:01 +0100)]
Fix array overflow in backtrace on PowerPC (bug 25423)
When unwinding through a signal frame the backtrace function on PowerPC
didn't check array bounds when storing the frame address. Fixes commit d400dcac5e ("PowerPC: fix backtrace to handle signal trampolines").
Joseph Myers [Wed, 12 Feb 2020 23:31:56 +0000 (23:31 +0000)]
Avoid ldbl-96 stack corruption from range reduction of pseudo-zero (bug 25487).
Bug 25487 reports stack corruption in ldbl-96 sinl on a pseudo-zero
argument (a representation where all the significand bits, including
the explicit high bit, are zero, but the exponent is not zero, which
is not a valid representation for the long double type).
Although this is not a valid long double representation, existing
practice in this area (see bug 4586, originally marked invalid but
subsequently fixed) is that we still seek to avoid invalid memory
accesses as a result, in case of programs that treat arbitrary binary
data as long double representations, although the invalid
representations of the ldbl-96 format do not need to be consistently
handled the same as any particular valid representation.
This patch makes the range reduction detect pseudo-zero and unnormal
representations that would otherwise go to __kernel_rem_pio2, and
returns a NaN for them instead of continuing with the range reduction
process. (Pseudo-zero and unnormal representations whose unbiased
exponent is less than -1 have already been safely returned from the
function before this point without going through the rest of range
reduction.) Pseudo-zero representations would previously result in
the value passed to __kernel_rem_pio2 being all-zero, which is
definitely unsafe; unnormal representations would previously result in
a value passed whose high bit is zero, which might well be unsafe
since that is not a form of input expected by __kernel_rem_pio2.
Florian Weimer [Thu, 5 Dec 2019 16:29:42 +0000 (17:29 +0100)]
misc/test-errno-linux: Handle EINVAL from quotactl
In commit 3dd4d40b420846dd35869ccc8f8627feef2cff32 ("xfs: Sanity check
flags of Q_XQUOTARM call"), Linux 5.4 added checking for the flags
argument, causing the test to fail due to too restrictive test
expectations.
Kamlesh Kumar [Thu, 5 Dec 2019 15:55:19 +0000 (16:55 +0100)]
<string.h>: Define __CORRECT_ISO_CPP_STRING_H_PROTO for Clang [BZ #25232]
Without the asm redirects, strchr et al. are not const-correct.
libc++ has a wrapper header that works with and without
__CORRECT_ISO_CPP_STRING_H_PROTO (using a Clang extension). But when
Clang is used with libstdc++ or just C headers, the overloaded functions
with the correct types are not declared.
This change does not impact current GCC (with libstdc++ or libc++).
Florian Weimer [Tue, 3 Dec 2019 20:08:49 +0000 (21:08 +0100)]
x86: Assume --enable-cet if GCC defaults to CET [BZ #25225]
This links in CET support if GCC defaults to CET. Otherwise, __CET__
is defined, yet CET functionality is not compiled and linked into the
dynamic loader, resulting in a linker failure due to undefined
references to _dl_cet_check and _dl_open_check.
Florian Weimer [Thu, 28 Nov 2019 13:17:27 +0000 (14:17 +0100)]
libio: Disable vtable validation for pre-2.1 interposed handles [BZ #25203]
Commit c402355dfa7807b8e0adb27c009135a7e2b9f1b0 ("libio: Disable
vtable validation in case of interposition [BZ #23313]") only covered
the interposable glibc 2.1 handles, in libio/stdfiles.c. The
parallel code in libio/oldstdfiles.c needs similar detection logic.
rtld: Check __libc_enable_secure before honoring LD_PREFER_MAP_32BIT_EXEC (CVE-2019-19126) [BZ #25204]
The problem was introduced in glibc 2.23, in commit b9eb92ab05204df772eb4929eccd018637c9f3e9
("Add Prefer_MAP_32BIT_EXEC to map executable pages with MAP_32BIT").
Linux: Use in-tree copy of SO_ constants for !__USE_MISC [BZ #24532]
The kernel changes for a 64-bit time_t on 32-bit architectures
resulted in <asm/socket.h> indirectly including <linux/posix_types.h>.
The latter is not namespace-clean for the POSIX version of
<sys/socket.h>.
This issue has persisted across several Linux releases, so this commit
creates our own copy of the SO_* definitions for !__USE_MISC mode.
The new test socket/tst-socket-consts ensures that the copy is
consistent with the kernel definitions (which vary across
architectures). The test is tricky to get right because CPPFLAGS
includes include/libc-symbols.h, which in turn defines _GNU_SOURCE
unconditionally.
Tested with build-many-glibcs.py. I verified that a discrepancy in
the definitions actually results in a failure of the
socket/tst-socket-consts test.
DJ Delorie [Wed, 30 Oct 2019 22:03:14 +0000 (18:03 -0400)]
Base max_fast on alignment, not width, of bins (Bug 24903)
set_max_fast sets the "impossibly small" value based on,
eventually, MALLOC_ALIGNMENT. The comparisons for the smallest
chunk used is, eventually, MIN_CHUNK_SIZE. Note that i386
is the only platform where these are the same, so a smallest
chunk *would* be put in a no-fastbins fastbin.
This change calculates the "impossibly small" value
based on MIN_CHUNK_SIZE instead, so that we can know it will
always be impossibly small.
malloc: Fix missing accounting of top chunk in malloc_info [BZ #24026]
Fixes <total type="rest" size="..."> incorrectly showing as 0 most
of the time.
The rest value being wrong is significant because to compute the
actual amount of memory handed out via malloc, the user must subtract
it from <system type="current" size="...">. That result being wrong
makes investigating memory fragmentation issues like
<https://bugzilla.redhat.com/show_bug.cgi?id=843478> close to
impossible.
Joseph Myers [Mon, 4 Feb 2019 23:46:58 +0000 (23:46 +0000)]
Fix assertion in malloc.c:tcache_get.
One of the warnings that appears with -Wextra is "ordered comparison
of pointer with integer zero" in malloc.c:tcache_get, for the
assertion:
assert (tcache->entries[tc_idx] > 0);
Indeed, a "> 0" comparison does not make sense for
tcache->entries[tc_idx], which is a pointer. My guess is that
tcache->counts[tc_idx] is what's intended here, and this patch changes
the assertion accordingly.
Tested for x86_64.
* malloc/malloc.c (tcache_get): Compare tcache->counts[tc_idx]
with 0, not tcache->entries[tc_idx].
Stefan Liebler [Wed, 6 Feb 2019 08:06:34 +0000 (09:06 +0100)]
Fix alignment of TLS variables for tls variant TLS_TCB_AT_TP [BZ #23403]
The alignment of TLS variables is wrong if accessed from within a thread
for architectures with tls variant TLS_TCB_AT_TP.
For the main thread the static tls data is properly aligned.
For other threads the alignment depends on the alignment of the thread
pointer as the static tls data is located relative to this pointer.
This patch adds this alignment for TLS_TCB_AT_TP variants in the same way
as it is already done for TLS_DTV_AT_TP. The thread pointer is also already
properly aligned if the user provides its own stack for the new thread.
This patch extends the testcase nptl/tst-tls1.c in order to check the
alignment of the tls variables and it adds a pthread_create invocation
with a user provided stack.
The test itself is migrated from test-skeleton.c to test-driver.c
and the missing support functions xpthread_attr_setstack and xposix_memalign
are added.
mips: Force RWX stack for hard-float builds that can run on pre-4.8 kernels
Linux/Mips kernels prior to 4.8 could potentially crash the user
process when doing FPU emulation while running on non-executable
user stack.
Currently, gcc doesn't emit .note.GNU-stack for mips, but that will
change in the future. To ensure that glibc can be used with such
future gcc, without silently resulting in binaries that might crash
in runtime, this patch forces RWX stack for all built objects if
configured to run against minimum kernel version less than 4.8.
* sysdeps/unix/sysv/linux/mips/Makefile
(test-xfail-check-execstack):
Move under mips-has-gnustack != yes.
(CFLAGS-.o*, ASFLAGS-.o*): New rules.
Apply -Wa,-execstack if mips-force-execstack == yes.
* sysdeps/unix/sysv/linux/mips/configure: Regenerated.
* sysdeps/unix/sysv/linux/mips/configure.ac
(mips-force-execstack): New var.
Set to yes for hard-float builds with minimum_kernel < 4.8.0
or minimum_kernel not set at all.
(mips-has-gnustack): New var.
Use value of libc_cv_as_noexecstack
if mips-force-execstack != yes, otherwise set to no.
nss_db allows for getpwent et al to be called without a set*ent,
but it only works once. After the last get*ent a set*ent is
required to restart, because the end*ent did not properly reset
the module. Resetting it to NULL allows for a proper restart.
If the database doesn't exist, however, end*ent erroneously called
munmap, which set errno.
The test case runs "makedb" inside the testroot, so needs selinux
DSOs installed.
H.J. Lu [Mon, 1 Jul 2019 19:23:10 +0000 (12:23 -0700)]
Call _dl_open_check after relocation [BZ #24259]
This is a workaround for [BZ #20839] which doesn't remove the NODELETE
object when _dl_open_check throws an exception. Move it after relocation
in dl_open_worker to avoid leaving the NODELETE object mapped without
relocation.
Joseph Myers [Wed, 18 Sep 2019 13:22:24 +0000 (13:22 +0000)]
Fix RISC-V vfork build with Linux 5.3 kernel headers.
Building glibc for RISC-V with Linux 5.3 kernel headers fails because
<linux/sched.h>, included in vfork.S for CLONE_* constants, contains a
structure definition not safe for inclusion in assembly code.
All other architectures already avoid use of that header in vfork.S,
either defining the CLONE_* constants locally or embedding the
required values directly in the relevant instruction, where they
implement vfork using the clone syscall (see the implementations for
aarch64, ia64, mips and nios2). This patch makes the RISC-V version
define the constants locally like the other architectures.
Tested build for all three RISC-V configurations in
build-many-glibcs.py with Linux 5.3 headers.
* sysdeps/unix/sysv/linux/riscv/vfork.S: Do not include
<linux/sched.h>.
(CLONE_VM): New macro.
(CLONE_VFORK): Likewise.
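The values involved are the standard Linux UAPI ones, so the local
definitions are small (a sketch of what the patch adds to the RISC-V
vfork.S):

    /* Matching the Linux UAPI values from <linux/sched.h>.  */
    #define CLONE_VM     0x00000100  /* Share memory.  */
    #define CLONE_VFORK  0x00004000  /* Suspend parent until child
                                        releases it.  */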
alpha: force old OSF1 syscalls for getegid, geteuid and getppid [BZ #24986]
On alpha, Linux kernel 5.1 added the standard getegid, geteuid and
getppid syscalls (commit ecf7e0a4ad15287). Up to now alpha was using
the corresponding OSF1 syscalls through:
- sysdeps/unix/alpha/getegid.S
- sysdeps/unix/alpha/geteuid.S
- sysdeps/unix/alpha/getppid.S
When building against kernel headers >= 5.1, glibc now uses the new
syscalls through sysdeps/unix/sysv/linux/syscalls.list. When then run
on an older kernel, the corresponding 3 functions fail.
A quick fix is to move the OSF1 wrappers under the
sysdeps/unix/sysv/linux/alpha directory so they override the standard
linux ones. A better fix would be to try the new syscalls and fallback
to the old OSF1 in case the new ones fail. This can be implemented in
a later commit.
Changelog:
[BZ #24986]
* sysdeps/unix/alpha/getegid.S: Move to ...
* sysdeps/unix/sysv/linux/alpha/getegid.S: ... here.
* sysdeps/unix/alpha/geteuid.S: Move to ...
* sysdeps/unix/sysv/linux/alpha/geteuid.S: ... here.
* sysdeps/unix/alpha/getppid.S: Move to ...
* sysdeps/unix/sysv/linux/alpha/getppid.S: ... here.
Wilco Dijkstra [Wed, 12 Jun 2019 10:42:34 +0000 (11:42 +0100)]
Improve performance of memmem
This patch significantly improves performance of memmem using a novel
modified Horspool algorithm. Needles up to size 256 use a bad-character
table indexed by hashed pairs of characters to quickly skip past mismatches.
Long needles use a self-adapting filtering step to avoid comparing the whole
needle repeatedly.
By limiting the needle length to 256, the shift table only requires 8 bits
per entry, lowering preprocessing overhead and minimizing cache effects.
This limit also implies worst-case performance is linear.
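A compact illustration of such a pair-hashed shift table (illustrative
only, much simpler than the glibc code):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hash two adjacent bytes into an 8-bit table index.  */
    #define HASH2(p) ((uint8_t) ((p)[-1] ^ ((p)[0] << 3)))

    /* Horspool-style search with a shift table indexed by hashed byte
       pairs; needle length is capped at 256 so shifts fit in 8 bits.  */
    static const char *
    pair_horspool (const char *hay, size_t hn, const char *ned, size_t nn)
    {
      if (nn < 2 || nn > 256 || hn < nn)
        return NULL;

      uint8_t shift[256];
      memset (shift, nn - 1, sizeof shift);    /* default skip */
      for (size_t i = 1; i < nn; i++)
        shift[HASH2 (ned + i)] = nn - 1 - i;   /* last pair maps to 0 */

      for (size_t j = 0; j + nn <= hn; )
        {
          size_t skip = shift[HASH2 (hay + j + nn - 1)];
          if (skip != 0)
            j += skip;                         /* last pair cannot match here */
          else if (memcmp (hay + j, ned, nn) == 0)
            return hay + j;
          else
            j++;                               /* hash collision or mismatch */
        }
      return NULL;
    }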
Small needles up to size 2 use a dedicated linear search. Very long needles
use the Two-Way algorithm (to avoid increasing stack size or slowing down
the common case, inlining is disabled).
The performance gain is 6.6 times on English text on AArch64 using random
needles with average size 8.
Tested against GLIBC testsuite and randomized tests.
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* string/memmem.c (__memmem): Rewrite to improve performance.
Wilco Dijkstra [Wed, 12 Jun 2019 10:38:52 +0000 (11:38 +0100)]
Improve performance of strstr
This patch significantly improves performance of strstr using a novel
modified Horspool algorithm. Needles up to size 256 use a bad-character
table indexed by hashed pairs of characters to quickly skip past mismatches.
Long needles use a self-adapting filtering step to avoid comparing the whole
needle repeatedly.
By limiting the needle length to 256, the shift table only requires 8 bits
per entry, lowering preprocessing overhead and minimizing cache effects.
This limit also implies worst-case performance is linear.
Small needles up to size 3 use a dedicated linear search. Very long needles
use the Two-Way algorithm.
The performance gain using the improved bench-strstr on Cortex-A72 is 5.8
times basic_strstr and 3.7 times twoway_strstr.
Tested against GLIBC testsuite, randomized tests and the GNULIB strstr test
(https://git.savannah.gnu.org/cgit/gnulib.git/tree/tests/test-strstr.c).
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
* string/str-two-way.h (two_way_short_needle): Add inline to avoid
warning.
(two_way_long_needle): Block inlining.
* string/strstr.c (strstr2): Add new function.
(strstr3): Likewise.
(STRSTR): Completely rewrite strstr to improve performance.
alpha: Do not redefine __NR_shmat or __NR_osf_shmat
Fixes build using v5.1-rc1 headers.
The kernel has cleaned up how these are defined. Previous behavior
was to define __NR_osf_shmat as 209 and not define __NR_shmat.
Current behavior is to define __NR_shmat as 209 and then define
__NR_osf_shmat as __NR_shmat.
* sysdeps/unix/sysv/linux/alpha/kernel-features.h (__NR_shmat):
Do not redefine.
* sysdeps/unix/sysv/linux/alpha/sysdep.h (__NR_osf_shmat):
Do not redefine.
posix: Fix large mmap64 offset for mips64n32 (BZ#24699)
The fix for BZ#21270 (commit 158d5fa0e19) added a mask to avoid offsets
larger than 2^44 being used with __NR_mmap2. However, mips64n32 uses
__NR_mmap, as mips64n64 does, but still defines off_t as the old
non-LFS type (other ILP32 ABIs, such as x32, define off_t equal to
off64_t). This leads to the mask meant only for __NR_mmap2 calls being
applied to __NR_mmap as well, thus limiting the maximum offset that can
be used with mmap64.
This patch fixes it by setting the high mask only for __NR_mmap2 usage.
The posix/tst-mmap-offset.c test already covers it and also failed for
mips64n32. The patch also changes the test to check for an
arch-specific header that defines the maximum supported offset.
Checked on x86_64-linux-gnu and i686-linux-gnu, and I also tested
tst-mmap-offset on QEMU-simulated mips64 with a 3.2.0 kernel for both
mips-linux-gnu and mips64-n32-linux-gnu.
[BZ #24699]
* posix/tst-mmap-offset.c: Mention BZ #24699.
(do_test_bz21270): Rename to do_test_large_offset and use
mmap64_maximum_offset to check for maximum expected offset value.
* sysdeps/generic/mmap_info.h: New file.
* sysdeps/unix/sysv/linux/mips/mmap_info.h: Likewise.
* sysdeps/unix/sysv/linux/mmap64.c (MMAP_OFF_HIGH_MASK): Define iff
__NR_mmap2 is used.