git.ipfire.org Git - thirdparty/glibc.git/log

aarch64: MTE compatible strncmp

Add support for MTE to strncmp. Regression tested with xcheck and benchmarked
with glibc's benchtests on the Cortex-A53, Cortex-A72, and Neoverse N1.

The existing implementation assumes that any access to the pages in which the
string resides is safe. This assumption is not true when MTE is enabled. This
patch updates the algorithm to ensure that accesses remain within the bounds
of an MTE tag (16-byte chunks) and improves overall performance.

Co-authored-by: Branislav Rankov <branislav.rankov@arm.com>
Co-authored-by: Wilco Dijkstra <wilco.dijkstra@arm.com>
(cherry picked from commit 03e1378f94173fc192a81e421457198f7b8a34a0)

Force DT_RPATH for --enable-hardcoded-path-in-tests

On Fedora 40/x86-64, linker enables --enable-new-dtags by default which
generates DT_RUNPATH instead of DT_RPATH.  Unlike DT_RPATH, DT_RUNPATH
only applies to DT_NEEDED entries in the executable and doesn't applies
to DT_NEEDED entries in shared libraries which are loaded via DT_NEEDED
entries in the executable.  Some glibc tests have libstdc++.so.6 in
DT_NEEDED, which has libm.so.6 in DT_NEEDED.  When DT_RUNPATH is generated,
/lib64/libm.so.6 is loaded for such tests.  If the newly built glibc is
older than glibc 2.36, these tests fail with

assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_2.36' not found (required by /lib64/libm.so.6)
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /lib64/libm.so.6)

Pass -Wl,--disable-new-dtags to linker when building glibc tests with
--enable-hardcoded-path-in-tests.  This fixes BZ #31719.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2dcaf70643710e22f92a351e36e3cff8b48c60dc)

Don't make errlist.o[s].d depend on errlist-compat.h

stdio-common/errlist.o.d and stdio-common/errlist.os.d aren't generated
alongside with stdio-common/errlist-compat.h. Don't make them depend on
stdio-common/errlist-compat.h to avoid infinite loop with make-4.4. This
fixes BZ #31330.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>

Makerules: fix MAKEFLAGS assignment for upcoming make-4.4 [BZ# 29564]

make-4.4 will add long flags to MAKEFLAGS variable:

    * WARNING: Backward-incompatibility!
      Previously only simple (one-letter) options were added to the MAKEFLAGS
      variable that was visible while parsing makefiles.  Now, all options
      are available in MAKEFLAGS.

This causes locale builds to fail when long options are used:

    $ make --shuffle
    ...
    make  -C localedata install-locales
    make: invalid shuffle mode: '1662724426r'

The change fixes it by passing eash option via whitespace and dashes.
That way option is appended to both single-word form and whitespace
separated form.

While at it fixed --silent mode detection in $(MAKEFLAGS) by filtering
out --long-options. Otherwise options like --shuffle flag enable silent
mode unintentionally. $(silent-make) variable consolidates the checks.

Resolves: BZ# 29564

CC: Paul Smith <psmith@gnu.org>
CC: Siddhesh Poyarekar <siddhesh@gotplt.org>
Signed-off-by: Sergei Trofimovich <slyich@gmail.com>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 2d7ed98add14f75041499ac189696c9bd3d757fe)

elf: Disable some subtests of ifuncmain1, ifuncmain5 for !PIE

(cherry picked from commit 9cc9d61ee12f2f8620d8e0ea3c42af02bf07fe1e)

resolv/tst-idna_name_classify: Isolate from system libraries

Loading NSS modules from static binaries uses installed system
libraries if LD_LIBRARY_PATH is not set.

(cherry picked from commit 8dddf0bd5a3d57fba8da27e93f3d1a7032fce184)

x86_64: Optimize ffsll function code size.

Ffsll function randomly regress by ~20%, depending on how code gets
aligned in memory.  Ffsll function code size is 17 bytes.  Since default
function alignment is 16 bytes, it can load on 16, 32, 48 or 64 bytes
aligned memory.  When ffsll function load at 16, 32 or 64 bytes aligned
memory, entire code fits in single 64 bytes cache line.  When ffsll
function load at 48 bytes aligned memory, it splits in two cache line,
hence random regression.

Ffsll function size reduction from 17 bytes to 12 bytes ensures that it
will always fit in single 64 bytes cache line.

This patch fixes ffsll function random performance regression.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 9d94997b5f9445afd4f2bccc5fa60ff7c4361ec1)

x86: Fix incorrect scope of setting `shared_per_thread` [BZ# 30745]

The:

```
    if (shared_per_thread > 0 && threads > 0)
      shared_per_thread /= threads;
```

Code was accidentally moved to inside the else scope.  This doesn't
match how it was previously (before af992e7abd).

This patch fixes that by putting the division after the `else` block.

(cherry picked from commit 084fb31bc2c5f95ae0b9e6df4d3cf0ff43471ede)

x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8b9a0af8ca012217bf90d1dc0694f85b49ae09da)

x86: Fix slight bug in `shared_per_thread` cache size calculation.

After:
```
    commit af992e7abdc9049714da76cae1e5e18bc4838fb8
    Author: Noah Goldstein <goldstein.w.n@gmail.com>
    Date:   Wed Jun 7 13:18:01 2023 -0500

        x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```

Split `shared` (cumulative cache size) from `shared_per_thread` (cache
size per socket), the `shared_per_thread` *can* be slightly off from
the previous calculation.

Previously we added `core` even if `threads_l2` was invalid, and only
used `threads_l2` to divide `core` if it was present. The changed
version only included `core` if `threads_l2` was valid.

This change restores the old behavior if `threads_l2` is invalid by
adding the entire value of `core`.
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 47f747217811db35854ea06741be3685e8bbd44d)

x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`

Current `non_temporal_threshold` set to roughly '3/4 * sizeof_L3 /
ncores_per_socket'. This patch updates that value to roughly
'sizeof_L3 / 4`

The original value (specifically dividing the `ncores_per_socket`) was
done to limit the amount of other threads' data a `memcpy`/`memset`
could evict.

Dividing by 'ncores_per_socket', however leads to exceedingly low
non-temporal thresholds and leads to using non-temporal stores in
cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory
so using it at a size much smaller than L3 can place soon to be
accessed data much further away than it otherwise could be. As well,
modern machines are able to detect streaming patterns (especially if
REP MOVSB is used) and provide LRU hints to the memory subsystem. This
in affect caps the total amount of eviction at 1/cache_associativity,
far below meaningfully thrashing the entire cache.

As best I can tell, the benchmarks that lead this small threshold
where done comparing non-temporal stores versus standard cacheable
stores. A better comparison (linked below) is to be REP MOVSB which,
on the measure systems, is nearly 2x faster than non-temporal stores
at the low-end of the previous threshold, and within 10% for over
100MB copies (well past even the current threshold). In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores where done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Sheets results (also available in pdf on the github):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit af992e7abdc9049714da76cae1e5e18bc4838fb8)

debug: Mark libSegFault.so as NODELETE

The signal handler installed in the ELF constructor cannot easily
be removed again (because the program may have changed handlers
in the meantime). Mark the object as NODELETE so that the registered
handler function is never unloaded.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 23ee92deea4c99d0e6a5f48fa7b942909b123ec5)

x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]

Previous implementation was adjusting length (rsi) to match
bytes (eax), but since there is no bound to length this can cause
overflow.

Fix is to just convert the byte-count (eax) to length by dividing by
sizeof (wchar_t) before the comparison.

Full check passes on x86-64 and build succeeds w/ and w/o multiarch.

(cherry picked from commit b0969fa53a28b4ab2159806bf6c99a98999502ee)

x86-64: Require BMI2 for avx2 functions [BZ #29611]

This patch fixes BZ #29611

x86-64: Require BMI2 for strchr-avx2.S [BZ #29611]

Since strchr-avx2.S updated by

commit 1f745ecc2109890886b161d4791e1406fdfc29b8
Author: noah <goldstein.w.n@gmail.com>
Date:   Wed Feb 3 00:38:59 2021 -0500

    x86-64: Refactor and improve performance of strchr-avx2.S

uses sarx:

c4 e2 72 f7 c0        sarx   %ecx,%eax,%eax

for strchr-avx2 family functions, require BMI2 in ifunc-impl-list.c and
ifunc-avx2.h.

This fixes BZ #29611.

(cherry picked from commit 83c5b368226c34a2f0a5287df40fc290b2b34359)

Add test for bug 29530

This tests for a bug that was introduced in commit edc1686af0 ("vfprintf:
Reuse work_buffer in group_number") and fixed as a side effect of commit
6caddd34bd ("Remove most vfprintf width/precision-dependent allocations
(bug 14231, bug 26211).").

(cherry picked from commit ca6466e8be32369a658035d69542d47603e58a99)

support: Add xsetlocale function

(cherry picked from commit cce35a50c1de0cec5cd1f6c18979ff6ee3ea1dd1)

Fix memmove call in vfprintf-internal.c:group_number

A recent GCC mainline change introduces errors of the form:

vfprintf-internal.c: In function 'group_number':
vfprintf-internal.c:2093:15: error: 'memmove' specified bound between 9223372036854775808 and 18446744073709551615 exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
2093 |               memmove (w, s, (front_ptr -s) * sizeof (CHAR_T));
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a genuine bug in the glibc code: s > front_ptr is always true
at this point in the code, and the intent is clearly for the
subtraction to be the other way round.  The other arguments to the
memmove call here also appear to be wrong; w and s point just *after*
the destination and source for copying the rest of the number, so the
size needs to be subtracted to get appropriate pointers for the
copying.  Adjust the memmove call to conform to the apparent intent of
the code, so fixing the -Wstringop-overflow error.

Now, if the original code were ever executed, a buffer overrun would
result.  However, I believe this code (introduced in commit
edc1686af0c0fc2eb535f1d38cdf63c1a5a03675, "vfprintf: Reuse work_buffer
in group_number", so in glibc 2.26) is unreachable in prior glibc
releases (so there is no need for a bug in Bugzilla, no need to
consider any backports unless someone wants to build older glibc
releases with GCC 12 and no possibility of this buffer overrun
resulting in a security issue).

work_buffer is 1000 bytes / 250 wide characters.  This case is only
reachable if an initial part of the number, plus a grouped copy of the
rest of the number, fail to fit in that space; that is, if the grouped
number fails to fit in the space.  In the wide character case,
grouping is always one wide character, so even with a locale (of which
there aren't any in glibc) grouping every digit, a number would need
to occupy at least 125 wide characters to overflow, and a 64-bit
integer occupies at most 23 characters in octal including a leading 0.
In the narrow character case, the multibyte encoding of the grouping
separator would need to be at least 42 bytes to overflow, again
supposing grouping every digit, but MB_LEN_MAX is 16.  So even if we
admit the case of artificially constructed locales not shipped with
glibc, given that such a locale would need to use one of the character
sets supported by glibc, this code cannot be reached at present.  (And
POSIX only actually specifies the ' flag for grouping for decimal
output, though glibc acts on it for other bases as well.)

With binary output (if you consider use of grouping there to be
valid), you'd need a 15-byte multibyte character for overflow; I don't
know if any supported character set has such a character (if, again,
we admit constructed locales using grouping every digit and a grouping
separator chosen to have a multibyte encoding as long as possible, as
well as accepting use of grouping with binary), but given that we have
this code at all (clearly it's not *correct*, or in accordance with
the principle of avoiding arbitrary limits, to skip grouping on
running out of internal space like that), I don't think it should need
any further changes for binary printf support to go in.

On the other hand, support for large sizes of _BitInt in printf (see
the N2858 proposal) *would* require something to be done about such
arbitrary limits (presumably using dynamic allocation in printf again,
for sufficiently large _BitInt arguments only - currently only
floating-point uses dynamic allocation, and, as previously discussed,
that could actually be replaced by bounded allocation given smarter
code).

Tested with build-many-glibcs.py for aarch64-linux-gnu (GCC mainline).
Also tested natively for x86_64.

(cherry picked from commit db6c4935fae6005d46af413b32aa92f4f6059dce)

Remove most vfprintf width/precision-dependent allocations (bug 14231, bug 26211).

The vfprintf implementation (used for all printf-family functions)
contains complicated logic to allocate internal buffers of a size
depending on the width and precision used for a format, using either
malloc or alloca depending on that size, and with consequent checks
for size overflow and allocation failure.

As noted in bug 26211, the version of that logic used when '$' plus
argument number formats are in use is missing the overflow checks,
which can result in segfaults (quite possibly exploitable, I didn't
try to work that out) when the width or precision is in the range
0x7fffffe0 through 0x7fffffff (maybe smaller values as well in the
wprintf case on 32-bit systems, when the multiplication by sizeof
(CHAR_T) can overflow).

All that complicated logic in fact appears to be useless.  As far as I
can tell, there has been no need (outside the floating-point printf
code, which does its own allocations) for allocations depending on
width or precision since commit
3e95f6602b226e0de06aaff686dc47b282d7cc16 ("Remove limitation on size
of precision for integers", Sun Sep 12 21:23:32 1999 +0000).  Thus,
this patch removes that logic completely, thereby fixing both problems
with excessive allocations for large width and precision for
non-floating-point formats, and the problem with missing overflow
checks with such allocations.  Note that this does have the
consequence that width and precision up to INT_MAX are now allowed
where previously INT_MAX / sizeof (CHAR_T) - EXTSIZ or more would have
been rejected, so could potentially expose any other overflows where
the value would previously have been rejected by those removed checks.

I believe this completely fixes bugs 14231 and 26211.

Excessive allocations are still possible in the floating-point case
(bug 21127), as are other integer or buffer overflows (see bug 26201).
This does not address the cases where a precision larger than INT_MAX
(embedded in the format string) would be meaningful without printf's
return value overflowing (when it's used with a string format, or %g
without the '#' flag, so the actual output will be much smaller), as
mentioned in bug 17829 comment 8; using size_t internally for
precision to handle that case would be complicated by struct
printf_info being a public ABI.  Nor does it address the matter of an
INT_MIN width being negated (bug 17829 comment 7; the same logic
appears a second time in the file as well, in the form of multiplying
by -1).  There may be other sources of memory allocations with malloc
in printf functions as well (bug 24988, bug 16060).  From inspection,
I think there are also integer overflows in two copies of "if ((width
-= len) < 0)" logic (where width is int, len is size_t and a very long
string could result in spurious padding being output on a 32-bit
system before printf overflows the count of output characters).

Tested for x86-64 and x86.

(cherry picked from commit 6caddd34bd7ffb5ac4f36c8e036eee100c2cc535)

stdio: Add tests for printf multibyte convertion leak [BZ#25691]

Checked on x86_64-linux-gnu and i686-linux-gnu.

(cherry picked from commit 910a835dc96c1f518ac2a6179fc622ba81ffb159)

stdio: Remove memory leak from multibyte convertion [BZ#25691]

This is an updated version of a previous patch [1] with the
following changes:

  - Use compiler overflow builtins on done_add_func function.
  - Define the scratch +utstring_converted_wide_string using
    CHAR_T.
  - Added a testcase and mention the bug report.

Both default and wide printf functions might leak memory when
manipulate multibyte characters conversion depending of the size
of the input (whether __libc_use_alloca trigger or not the fallback
heap allocation).

This patch fixes it by removing the extra memory allocation on
string formatting with conversion parts.

The testcase uses input argument size that trigger memory leaks
on unpatched code (using a scratch buffer the threashold to use
heap allocation is lower).

Checked on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
[1] https://sourceware.org/pipermail/libc-alpha/2017-June/082098.html

(cherry picked from commit 3cc4a8367c23582b7db14cf4e150e4068b7fd461)

NEWS: Add a bug fix entry for BZ #28896

x86: Fix TEST_NAME to make it a string in tst-strncmp-rtm.c

Previously TEST_NAME was passing a function pointer. This didn't fail
because of the -Wno-error flag (to allow for overflow sizes passed
to strncmp/wcsncmp)

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b98d0bbf747f39770e0caba7e984ce9f8f900330)

x86: Test wcscmp RTM in the wcsncmp overflow case [BZ #28896]

In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 7835d611af0854e69a0c71e3806f8fe379282d6f)

x86: Fallback {str|wcs}cmp RTM in the ncmp overflow case [BZ #28896]

In the overflow fallback strncmp-avx2-rtm and wcsncmp-avx2-rtm would
call strcmp-avx2 and wcscmp-avx2 respectively. This would have
not checks around vzeroupper and would trigger spurious
aborts. This commit fixes that.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass on
AVX2 machines with and without RTM.

Co-authored-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6272098323153db373f2986c67786ea8c85f1cf)

string: Add a testcase for wcsncmp with SIZE_MAX [BZ #28755]

Verify that wcsncmp (L("abc"), L("abd"), SIZE_MAX) == 0.  The new test
fails without

commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:21 2022 -0600

    x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

and

commit 7e08db3359c86c94918feb33a1182cd0ff3bb10b
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jan 9 16:02:28 2022 -0600

    x86: Fix __wcsncmp_evex in strcmp-evex.S [BZ# 28755]

This is for BZ #28755.

Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit aa5a720056d37cf24924c138a3dbe6dace98e97c)

x86-64: Test strlen and wcslen with 0 in the RSI register [BZ #28064]

commit 6f573a27b6c8b4236445810a44660612323f5a73
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Wed Jun 23 01:19:34 2021 -0400

    x86-64: Add wcslen optimize for sse4.1

added wcsnlen-sse4.1 to the wcslen ifunc implementation list.  Since the
random value in the the RSI register is larger than the wide-character
string length in the existing wcslen test, it didn't trigger the wcslen
test failure.  Add a test to force 0 into the RSI register before calling
wcslen.

(cherry picked from commit a6e7c3745d73ff876b4ba6991fb00768a938aef5)

x86: Remove wcsnlen-sse4_1 from wcslen ifunc-impl-list [BZ #28064]

The following commit

commit 6f573a27b6c8b4236445810a44660612323f5a73
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 23 01:19:34 2021 -0400

x86-64: Add wcslen optimize for sse4.1

Added wcsnlen-sse4.1 to the wcslen ifunc implementation list and did
not add wcslen-sse4.1 to wcslen ifunc implementation list. This commit
fixes that by removing wcsnlen-sse4.1 from the wcslen ifunc
implementation list and adding wcslen-sse4.1 to the ifunc
implementation list.

Testing:
test-wcslen.c, test-rsi-wcslen.c, and test-rsi-strlen.c are passing as
well as all other tests in wcsmbs and string.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 0679442defedf7e52a94264975880ab8674736b2)

x86: Black list more Intel CPUs for TSX [BZ #27398]

Disable TSX and enable RTM_ALWAYS_ABORT for Intel CPUs listed in:

https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html

This fixes BZ #27398.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 1e000d3d33211d5a954300e2a69b90f93f18a1a1)

x86: Check RTM_ALWAYS_ABORT for RTM [BZ #28033]

From

https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html

* Intel TSX will be disabled by default.
* The processor will force abort all Restricted Transactional Memory (RTM)
  transactions by default.
* A new CPUID bit CPUID.07H.0H.EDX[11](RTM_ALWAYS_ABORT) will be enumerated,
  which is set to indicate to updated software that the loaded microcode is
  forcing RTM abort.
* On processors that enumerate support for RTM, the CPUID enumeration bits
  for Intel TSX (CPUID.07H.0H.EBX[11] and CPUID.07H.0H.EBX[4]) continue to
  be set by default after microcode update.
* Workloads that were benefited from Intel TSX might experience a change
  in performance.
* System software may use a new bit in Model-Specific Register (MSR) 0x10F
  TSX_FORCE_ABORT[TSX_CPUID_CLEAR] functionality to clear the Hardware Lock
  Elision (HLE) and RTM bits to indicate to software that Intel TSX is
  disabled.

1. Add RTM_ALWAYS_ABORT to CPUID features.
2. Set RTM usable only if RTM_ALWAYS_ABORT isn't set.  This skips the
string/tst-memchr-rtm etc. testcases on the affected processors, which
always fail after a microcde update.
3. Check RTM feature, instead of usability, against /proc/cpuinfo.

This fixes BZ #28033.

(cherry picked from commit ea8e465a6b8d0f26c72bcbe453a854de3abf68ec)

NEWS: Add a bug fix entry for BZ #27974

String: Add overflow tests for strnlen, memchr, and strncat [BZ #27974]

This commit adds tests for a bug in the wide char variant of the
functions where the implementation may assume that maxlen for wcsnlen
or n for wmemchr/strncat will not overflow when multiplied by
sizeof(wchar_t).

These tests show the following implementations failing on x86_64:

wcsnlen-sse4_1
wcsnlen-avx2

wmemchr-sse2
wmemchr-avx2

strncat would fail as well if it where on a system that prefered
either of the wcsnlen implementations that failed as it relies on
wcsnlen.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit da5a6fba0febbfc90896ce1b2eb75c6d8a88a72d)

x86: Optimize strlen-evex.S

No bug. This commit optimizes strlen-evex.S. The
optimizations are mostly small things but they add up to roughly
10-30% performance improvement for strlen. The results for strnlen are
bit more ambiguous. test-strlen, test-strnlen, test-wcslen, and
test-wcsnlen are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 4ba65586847751372520a36757c17f114588794e)

x86: Fix overflow bug in wcsnlen-sse4_1 and wcsnlen-avx2 [BZ #27974]

This commit fixes the bug mentioned in the previous commit.

The previous implementations of wmemchr in these files relied
on maxlen * sizeof(wchar_t) which was not guranteed by the standard.

The new overflow tests added in the previous commit now
pass (As well as all the other tests).

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit a775a7a3eb1e85b54af0b4ee5ff4dcf66772a1fb)

x86-64: Add wcslen optimize for sse4.1

No bug. This comment adds the ifunc / build infrastructure
necessary for wcslen to prefer the sse4.1 implementation
in strlen-vec.S. test-wcslen.c is passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6f573a27b6c8b4236445810a44660612323f5a73)

x86-64: Move strlen.S to multiarch/strlen-vec.S

Since strlen.S contains SSE2 version of strlen/strnlen and SSE4.1
version of wcslen/wcsnlen, move strlen.S to multiarch/strlen-vec.S
and include multiarch/strlen-vec.S from SSE2 and SSE4.1 variants.
This also removes the unused symbols, __GI___strlen_sse2 and
__GI___wcsnlen_sse4_1.

(cherry picked from commit a0db678071c60b6c47c468d231dd0b3694ba7a98)

x86-64: Fix an unknown vector operation in memchr-evex.S

An unknown vector operation occurred in commit 2a76821c308. Fixed it
by using "ymm{k1}{z}" but not "ymm {k1} {z}".

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6ea916adfa0ab9af6e7dc6adcf6f977dfe017835)

x86: Optimize memchr-evex.S

No bug. This commit optimizes memchr-evex.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
saving some ALU in the alignment process, and most importantly
increasing ILP in the 4x loop. test-memchr, test-rawmemchr, and
test-wmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2a76821c3081d2c0231ecd2618f52662cb48fccd)

x86: Optimize strlen-avx2.S

No bug. This commit optimizes strlen-avx2.S. The optimizations are
mostly small things but they add up to roughly 10-30% performance
improvement for strlen. The results for strnlen are bit more
ambiguous. test-strlen, test-strnlen, test-wcslen, and test-wcsnlen
are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit aaa23c35071537e2dcf5807e956802ed215210aa)

x86: Fix overflow bug with wmemchr-sse2 and wmemchr-avx2 [BZ #27974]

This commit fixes the bug mentioned in the previous commit.

The previous implementations of wmemchr in these files relied
on n * sizeof(wchar_t) which was not guranteed by the standard.

The new overflow tests added in the previous commit now
pass (As well as all the other tests).

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 645a158978f9520e74074e8c14047503be4db0f0)

x86: Optimize memchr-avx2.S

No bug. This commit optimizes memchr-avx2.S. The optimizations include
replacing some branches with cmovcc, avoiding some branches entirely
in the less_4x_vec case, making the page cross logic less strict,
asaving a few instructions the in loop return loop. test-memchr,
test-rawmemchr, and test-wmemchr are all passing.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit acfd088a1963ba51cd83c78f95c0ab25ead79e04)

test-strnlen.c: Check that strnlen won't go beyond the maximum length

Place strings ending at page boundary without the null byte. If an
implementation goes beyond EXP_LEN, it will trigger the segfault.

(cherry picked from commit cb882b21b63606aabd6e55afe23b42434d95f2ef)

test-strnlen.c: Initialize wchar_t string with wmemset [BZ #27655]

Use wmemset to initialize wchar_t string.

(cherry picked from commit 86859b7e58d8670b186c5209ba25f0fbd6612fb7)

x86-64: Require BMI2 for __strlen_evex and __strnlen_evex

Since __strlen_evex and __strnlen_evex added by

commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Mar 5 06:24:52 2021 -0800

    x86-64: Add ifunc-avx2.h functions with 256-bit EVEX

use sarx:

c4 e2 6a f7 c0        sarx   %edx,%eax,%eax

require BMI2 for __strlen_evex and __strnlen_evex in ifunc-impl-list.c.
ifunc-avx2.h already requires BMI2 for EVEX implementation.

(cherry picked from commit 55bf411b451c13f0fb7ff3d3bf9a820020b45df1)

NEWS: Add a bug fix entry for BZ #27457

x86-64: Fix ifdef indentation in strlen-evex.S

Fix some indentations of ifdef in file strlen-evex.S which are off by 1
and confusing to read.

(cherry picked from commit 595c22ecd8e87a27fd19270ed30fdbae9ad25426)

x86-64: Use ZMM16-ZMM31 in AVX512 memmove family functions

Update ifunc-memmove.h to select the function optimized with AVX512
instructions using ZMM16-ZMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.

(cherry picked from commit e4fda4631017e49d4ee5a2755db34289b6860fa4)

x86-64: Use ZMM16-ZMM31 in AVX512 memset family functions

Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with AVX512 instructions using ZMM16-ZMM31 registers to avoid RTM abort
with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.

(cherry picked from commit 4e2d8f352774b56078c34648b14a2412c38384f4)

x86: Add string/memory function tests in RTM region

At function exit, AVX optimized string/memory functions have VZEROUPPER
which triggers RTM abort. When such functions are called inside a
transactionally executing RTM region, RTM abort causes severe performance
degradation. Add tests to verify that string/memory functions won't
cause RTM abort in RTM region.

(cherry picked from commit 4bd660be40967cd69072f69ebc2ad32bfcc1f206)

x86-64: Add AVX optimized string/memory functions for RTM

Since VZEROUPPER triggers RTM abort while VZEROALL won't, select AVX
optimized string/memory functions with

xtest
jz 1f
vzeroall
ret
1:
vzeroupper
ret

at function exit on processors with usable RTM, but without 256-bit EVEX
instructions to avoid VZEROUPPER inside a transactionally executing RTM
region.

(cherry picked from commit 7ebba91361badf7531d4e75050627a88d424872f)

x86-64: Add memcmp family functions with 256-bit EVEX

Update ifunc-memcmp.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL, AVX512BW and MOVBE since VZEROUPPER isn't needed at function
exit.

(cherry picked from commit 91264fe3577fe887b4860923fa6142b5274c8965)

x86-64: Add memset family functions with 256-bit EVEX

Update ifunc-memset.h/ifunc-wmemset.h to select the function optimized
with 256-bit EVEX instructions using YMM16-YMM31 registers to avoid RTM
abort with usable AVX512VL and AVX512BW since VZEROUPPER isn't needed at
function exit.

(cherry picked from commit 1b968b6b9b3aac702ac2f133e0dd16cfdbb415ee)

x86-64: Add memmove family functions with 256-bit EVEX

Update ifunc-memmove.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL since VZEROUPPER isn't needed at function exit.

(cherry picked from commit 63ad43566f7a25d140dc723598aeb441ad657eed)

x86-64: Add strcpy family functions with 256-bit EVEX

Update ifunc-strcpy.h to select the function optimized with 256-bit EVEX
instructions using YMM16-YMM31 registers to avoid RTM abort with usable
AVX512VL and AVX512BW since VZEROUPPER isn't needed at function exit.

(cherry picked from commit 525bc2a32c9710df40371f951217c6ae7a923aee)

x86-64: Add ifunc-avx2.h functions with 256-bit EVEX

Update ifunc-avx2.h, strchr.c, strcmp.c, strncmp.c and wcsnlen.c to
select the function optimized with 256-bit EVEX instructions using
YMM16-YMM31 registers to avoid RTM abort with usable AVX512VL, AVX512BW
and BMI2 since VZEROUPPER isn't needed at function exit.

For strcmp/strncmp, prefer AVX2 strcmp/strncmp if Prefer_AVX2_STRCMP
is set.

(cherry picked from commit 1fd8c163a83d96ace1ff78fa6bac7aee084f6f77)

x86: Set Prefer_No_VZEROUPPER and add Prefer_AVX2_STRCMP

1. Set Prefer_No_VZEROUPPER if RTM is usable to avoid RTM abort triggered
by VZEROUPPER inside a transactionally executing RTM region.
2. Since to compare 2 32-byte strings, 256-bit EVEX strcmp requires 2
loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp requires 1 load, 2 VPCMPEQs,
1 VPMINU and 1 VPMOVMSKB, AVX2 strcmp is faster than EVEX strcmp. Add
Prefer_AVX2_STRCMP to prefer AVX2 strcmp family functions.

(cherry picked from commit 1da50d4bda07f04135dca39f40e79fc9eabed1f8)

NEWS: Add a bug fix entry for BZ #28755

x86: Fix __wcsncmp_avx2 in strcmp-avx2.S [BZ# 28755]

Fixes [BZ# 28755] for wcsncmp by redirecting length >= 2^56 to
__wcscmp_avx2. For x86_64 this covers the entire address range so any
length larger could not possibly be used to bound `s1` or `s2`.

test-strcmp, test-strncmp, test-wcscmp, and test-wcsncmp all pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit ddf0992cf57a93200e0c782e2a94d0733a5a0b87)

Fix SXID_ERASE behavior in setuid programs (BZ #27471)

When parse_tunables tries to erase a tunable marked as SXID_ERASE for
setuid programs, it ends up setting the envvar string iterator
incorrectly, because of which it may parse the next tunable
incorrectly.  Given that currently the implementation allows malformed
and unrecognized tunables pass through, it may even allow SXID_ERASE
tunables to go through.

This change revamps the SXID_ERASE implementation so that:

- Only valid tunables are written back to the tunestr string, because
  of which children of SXID programs will only inherit a clean list of
  identified tunables that are not SXID_ERASE.

- Unrecognized tunables get scrubbed off from the environment and
  subsequently from the child environment.

- This has the side-effect that a tunable that is not identified by
  the setxid binary, will not be passed on to a non-setxid child even
  if the child could have identified that tunable.  This may break
  applications that expect this behaviour but expecting such tunables
  to cross the SXID boundary is wrong.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 2ed18c5b534d9e92fc006202a5af0df6b72e7aca)

Enhance setuid-tunables test

Instead of passing GLIBC_TUNABLES via the environment, pass the
environment variable from parent to child. This allows us to test
multiple variables to ensure better coverage.

The test list currently only includes the case that's already being
tested. More tests will be added later.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 061fe3f8add46a89b7453e87eabb9c4695005ced)

tst-env-setuid: Use support_capture_subprogram_self_sgid

Use the support_capture_subprogram_self_sgid to spawn an sgid child.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit ca335281068a1ed549a75ee64f90a8310755956f)

support: Add capability to fork an sgid child

Add a new function support_capture_subprogram_self_sgid that spawns an
sgid child of the running program with its own image and returns the
exit code of the child process.  This functionality is used by at
least three tests in the testsuite at the moment, so it makes sense to
consolidate.

There is also a new function support_subprogram_wait which should
provide simple system() like functionality that does not set up file
actions.  This is useful in cases where only the return code of the
spawned subprocess is interesting.

This patch also ports tst-secure-getenv to this new function.  A
subsequent patch will port other tests.  This also brings an important
change to tst-secure-getenv behaviour.  Now instead of succeeding, the
test fails as UNSUPPORTED if it is unable to spawn a setgid child,
which is how it should have been in the first place.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 716a3bdc41b2b4b864dc64475015ba51e35e1273)

support: Typo and formatting fixes

- Add a newline to the end of error messages in transfer().
- Fixed the name of support_subprocess_init().

(cherry picked from commit 95c68080a3ded882789b1629f872c3ad531efda0)

support: Pass environ to child process

Pass environ to posix_spawn so that the child process can inherit
environment of the test.

(cherry picked from commit e958490f8c74e660bd93c128b3bea746e268f3f6)

S390: Also check vector support in memmove ifunc-selector [BZ #27511]

The arch13 memmove variant is currently selected by the ifunc selector
if the Miscellaneous-Instruction-Extensions Facility 3 facility bit
is present, but the function is also using vector instructions.
If the vector support is not present, one is receiving an operation
exception.

Therefore this patch also checks for vector support in the ifunc
selector and in ifunc-impl-list.c.

Just to be sure, the configure check is now also testing an arch13
vector instruction and an arch13 Miscellaneous-Instruction-Extensions
Facility 3 instruction.

(cherry picked from commit 7759be2593b689cb1eafc0f52ee7f59c639e5d2f)

nscd: Fix double free in netgroupcache [BZ #27462]

In commit 745664bd798ec8fd50438605948eea594179fba1 a use-after-free
was fixed, but this led to an occasional double-free. This patch
tracks the "live" allocation better.

Tested manually by a third party.

Related: RHBZ 1927877

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit dca565886b5e8bd7966e15f0ca42ee5cff686673)

gconv: Fix assertion failure in ISO-2022-JP-3 module (bug 27256)

The conversion loop to the internal encoding does not follow
the interface contract that __GCONV_FULL_OUTPUT is only returned
after the internal wchar_t buffer has been filled completely.  This
is enforced by the first of the two asserts in iconv/skeleton.c:

      /* We must run out of output buffer space in this
rerun.  */
      assert (outbuf == outerr);
      assert (nstatus == __GCONV_FULL_OUTPUT);

This commit solves this issue by queuing a second wide character
which cannot be written immediately in the state variable, like
other converters already do (e.g., BIG5-HKSCS or TSCII).

Reported-by: Tavis Ormandy <taviso@gmail.com>
(cherry picked from commit 7d88c6142c6efc160c0ee5e4f85cde382c072888)

x86: Check IFUNC definition in unrelocated executable [BZ #20019]

Calling an IFUNC function defined in unrelocated executable also leads to
segfault.  Issue a fatal error message when calling IFUNC function defined
in the unrelocated executable from a shared library.

On x86, ifuncmain6pie failed with:

[hjl@gnu-cfl-2 build-i686-linux]$ ./elf/ifuncmain6pie --direct
./elf/ifuncmain6pie: IFUNC symbol 'foo' referenced in '/export/build/gnu/tools-build/glibc-32bit/build-i686-linux/elf/ifuncmod6.so' is defined in the executable and creates an unsatisfiable circular dependency.
[hjl@gnu-cfl-2 build-i686-linux]$ readelf -rW elf/ifuncmod6.so | grep foo
00003ff4  00000706 R_386_GLOB_DAT         0000400c   foo_ptr
00003ff8  00000406 R_386_GLOB_DAT         00000000   foo
0000400c  00000401 R_386_32               00000000   foo
[hjl@gnu-cfl-2 build-i686-linux]$

Remove non-JUMP_SLOT relocations against foo in ifuncmod6.so, which
trigger the circular IFUNC dependency, and build ifuncmain6pie with
-Wl,-z,lazy.

(cherry picked from commits 6ea5b57afa5cdc9ce367d2b69a2cebfb273e4617
and 7137d682ebfcb6db5dfc5f39724718699922f06c)

x86: Set header.feature_1 in TCB for always-on CET [BZ #27177]

Update dl_cet_check() to set header.feature_1 in TCB when both IBT and
SHSTK are always on.

(cherry picked from commit 2ef23b520597f4ea1790a669b83e608f24f4cf12)

x86-64: Avoid rep movsb with short distance [BZ #27130]

When copying with "rep movsb", if the distance between source and
destination is N*4GB + [1..63] with N >= 0, performance may be very
slow. This patch updates memmove-vec-unaligned-erms.S for AVX and
AVX512 versions with the distance in RCX:

cmpl $63, %ecx
// Don't use "rep movsb" if ECX <= 63
jbe L(Don't use rep movsb")
Use "rep movsb"

Benchtests data with bench-memcpy, bench-memcpy-large, bench-memcpy-random
and bench-memcpy-walk on Skylake, Ice Lake and Tiger Lake show that its
performance impact is within noise range as "rep movsb" is only used for
data size >= 4KB.

(cherry picked from commit 3ec5d83d2a237d39e7fd6ef7a0bc8ac4c171a4a5)

aarch64: Fix DT_AARCH64_VARIANT_PCS handling [BZ #26798]

The variant PCS support was ineffective because in the common case
linkmap->l_mach.plt == 0 but then the symbol table flags were ignored
and normal lazy binding was used instead of resolving the relocs early.
(This was a misunderstanding about how GOT[1] is setup by the linker.)

In practice this mainly affects SVE calls when the vector length is
more than 128 bits, then the top bits of the argument registers get
clobbered during lazy binding.

Fixes bug 26798.

(cherry picked from commit 558251bd8785760ad40fcbfeaaee5d27fa5b0fe4)

AArch64: Use __memcpy_simd on Neoverse N2/V1

Add CPU detection of Neoverse N2 and Neoverse V1, and select __memcpy_simd as
the memcpy/memmove ifunc.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit e11ed9d2b4558eeacff81557dc9557001af42a6b)

[AArch64] Improve integer memcpy

Further optimize integer memcpy.  Small cases now include copies up
to 32 bytes.  64-128 byte copies are split into two cases to improve
performance of 64-96 byte copies.  Comments have been rewritten.

(cherry picked from commit 700065132744e0dfa6d4d9142d63f6e3a1934726)

aarch64: Increase small and medium cases for __memcpy_generic

Increase the upper bound on medium cases from 96 to 128 bytes.
Now, up to 128 bytes are copied unrolled.

Increase the upper bound on small cases from 16 to 32 bytes so that
copies of 17-32 bytes are not impacted by the larger medium case.

Benchmarking:
The attached figures show relative timing difference with respect
to 'memcpy_generic', which is the existing implementation.
'memcpy_med_128' denotes the the version of memcpy_generic with
only the medium case enlarged. The 'memcpy_med_128_small_32' numbers
are for the version of memcpy_generic submitted in this patch, which
has both medium and small cases enlarged. The figures were generated
using the script from:
https://www.sourceware.org/ml/libc-alpha/2019-10/msg00563.html

Depending on the platform, the performance improvement in the
bench-memcpy-random.c benchmark ranges from 6% to 20% between
the original and final version of memcpy.S

Tested against GLIBC testsuite and randomized tests.

(cherry picked from commit b9f145df85145506f8e61bac38b792584a38d88f)

AArch64: Rename IS_ARES to IS_NEOVERSE_N1

Rename IS_ARES to IS_NEOVERSE_N1 since that is a bit clearer.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 0f6278a8793a5d04ea31878119eccf99f469a02d)

AArch64: Improve backwards memmove performance

On some microarchitectures performance of the backwards memmove improves if
the stores use STR with decreasing addresses. So change the memmove loop
in memcpy_advsimd.S to use 2x STR rather than STP.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit bd394d131c10c9ec22c6424197b79410042eed99)

AArch64: Add optimized Q-register memcpy

Add a new memcpy using 128-bit Q registers - this is faster on modern
cores and reduces codesize.  Similar to the generic memcpy, small cases
include copies up to 32 bytes.  64-128 byte copies are split into two
cases to improve performance of 64-96 byte copies.  Large copies align
the source rather than the destination.

bench-memcpy-random is ~9% faster than memcpy_falkor on Neoverse N1,
so make this memcpy the default on N1 (on Centriq it is 15% faster than
memcpy_falkor).

Passes GLIBC regression tests.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 4a733bf375238a6a595033b5785cea7f27d61307)

AArch64: Align ENTRY to a cacheline

Given almost all uses of ENTRY are for string/memory functions,
align ENTRY to a cacheline to simplify things.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 34f0d01d5e43c7dedd002ab47f6266dfb5b79c22)

arm: CVE-2020-6096: Fix multiarch memcpy for negative length [BZ #25620]

Unsigned branch instructions could be used for r2 to fix the wrong
behavior when a negative length is passed to memcpy.
This commit fixes the armv7 version.

(cherry picked from commit beea361050728138b82c57dda0c4810402d342b9)

arm: CVE-2020-6096: fix memcpy and memmove for negative length [BZ #25620]

Unsigned branch instructions could be used for r2 to fix the wrong
behavior when a negative length is passed to memcpy and memmove.
This commit fixes the generic arm implementation of memcpy amd memmove.

(cherry picked from commit 79a4fa341b8a89cb03f84564fd72abaa1a2db394)

NEWS: Mention BZ 25933 fix

Fix avx2 strncmp offset compare condition check [BZ #25933]

strcmp-avx2.S: In avx2 strncmp function, strings are compared in
chunks of 4 vector size(i.e. 32x4=128 byte for avx2). After first 4
vector size comparison, code must check whether it already passed
the given offset. This patch implement avx2 offset check condition
for strncmp function, if both string compare same for first 4 vector
size.

(cherry picked from commit 75870237ff3bb363447b03f4b0af100227570910)

nss_compat: internal_end*ent may clobber errno, hiding ERANGE [BZ #25976]

During cleanup, before returning from get*_r functions, the end*ent
calls must not change errno. Otherwise, an ERANGE error from the
underlying implementation can be hidden, causing unexpected lookup
failures. This commit introduces an internal_end*ent_noerror
function which saves and restore errno, and marks the original
internal_end*ent function as warn_unused_result, so that it is used
only in contexts were errors from it can be handled explicitly.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 790b8dda4455865cb8c3a47801f4304c1a43baf6)

NEWS: Merge two bug lists in the glibc 2.30.1 section

NEWS: Mention fixes for BZ 25810/25896/25902/25966

x86-64: Use RDX_LP on __x86_shared_non_temporal_threshold [BZ #25966]

Since __x86_shared_non_temporal_threshold is defined as

long int __x86_shared_non_temporal_threshold;

and long int is 4 bytes for x32, use RDX_LP to compare against
__x86_shared_non_temporal_threshold in assembly code.

(cherry picked from commit 55c7bcc71b84123d5d4bd2814366a6b05fcf8ebd)

Add a C wrapper for prctl [BZ #25896]

Add a C wrapper to pass arguments in

/* Control process execution. */
extern int prctl (int __option, ...) __THROW;

to prctl syscall:

extern int prctl (int, unsigned long int, unsigned long int,
unsigned long int, unsigned long int);

(cherry picked from commit ff026950e280bc3e9487b41b460fb31bc5b57721)

powerpc: Rename argN to _argN in LOADARGS_N [BZ #25902]

LOADARGS_N in powerpc/sysdep.h uses argN as local variables. It breaks
when argN is also a function argument. Rename argN to _argN to avoid
conflict.

(cherry picked from commit 14f43dd34dcf1ba29386c01cd0b286dffb37412d)

Add C wrappers for process_vm_readv/process_vm_writev [BZ #25810]

Since the the U marker can only be applied to 2 unsigned long arguments
in syscalls.list files, add a C wrapper for process_vm_readv and
process_vm_writev syscals which have more than 2 unsigned long arguments.

(cherry picked from commit ad9fd65d716f1ccd757b6b2feeee826d0f187ed4)

Mark unsigned long arguments with U in more syscalls [BZ #25810]

Mark unsigned long arguments in mmap, read, recv, recvfrom, send, sendto,
write, ioperm, sendfile64, setxattr, lsetxattr, fsetxattr, getxattr,
lgetxattr, fgetxattr, listxattr, llistxattr and flistxattr with U in
syscalls.list files.

(cherry picked from commit 86f4f2263bf21ff7f80905b3062c16213b016fe6)

Add a syscall test for [BZ #25810]

Add a test to pass 64-bit long arguments to syscall with undefined upper
32 bits on x32.

Tested on i386, x86-64 and x32 as well as with build-many-glibcs.py.

(cherry picked from commit 781dacc4f41332098e3a272514b20a490a7ebc8c)

Add SYSCALL_ULONG_ARG_[12] to pass long to syscall [BZ #25810]

X32 has 32-bit long and pointer with 64-bit off_t.  Since x32 psABI
requires that pointers passed in registers must be zero-extended to
64bit, x32 can share many syscall interfaces with LP64.  When a LP64
syscall with long and unsigned long int arguments is used for x32, these
arguments must be properly extended to 64-bit.  Otherwise if the upper
32 bits of the register have undefined value, such a syscall will be
rejected by kernel.

For syscalls implemented in assembly codes, 'U' is added to syscall
signature key letters for unsigned long, which is zero-extended to
64-bit types.  SYSCALL_ULONG_ARG_1 and SYSCALL_ULONG_ARG_2 are passed
to syscall-template.S for the first and the second unsigned long int
arguments if PSEUDOS_HAVE_ULONG_INDICES is defined.  They are used by
x32 to zero-extend 32-bit arguments to 64 bits.

Tested on i386, x86-64 and x32 as well as with build-many-glibcs.py.

(cherry picked from commit 2ad5d0845d80589d0adf86593bd36a7c71a521f8)

x32: Properly pass long to syscall [BZ #25810]

X32 has 32-bit long and pointer with 64-bit off_t.  Since x32 psABI
requires that pointers passed in registers must be zero-extended to
64bit, x32 can share many syscall interfaces with LP64.  When a LP64
syscall with long and unsigned long arguments is used for x32, these
arguments must be properly extended to 64-bit.  Otherwise if the upper
32 bits of the register have undefined value, such a syscall will be
rejected by kernel.

Enforce zero-extension for pointers and array system call arguments.
For integer types, extend to int64_t (the full register) using a
regular cast, resulting in zero or sign extension based on the
signedness of the original type.

For

       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);

we now generate

   0: 41 f7 c1 ff 0f 00 00 test   $0xfff,%r9d
   7: 75 1f                 jne    28 <__mmap64+0x28>
   9: 48 63 d2              movslq %edx,%rdx
   c: 89 f6                 mov    %esi,%esi
   e: 4d 63 c0              movslq %r8d,%r8
  11: 4c 63 d1              movslq %ecx,%r10
  14: b8 09 00 00 40        mov    $0x40000009,%eax
  19: 0f 05                 syscall

That is

1. addr is unchanged.
2. length is zero-extend to 64 bits.
3. prot is sign-extend to 64 bits.
4. flags is sign-extend to 64 bits.
5. fd is sign-extend to 64 bits.
6. offset is unchanged.

For int arguments, since kernel uses only the lower 32 bits and ignores
the upper 32 bits in 64-bit registers, these work correctly.

Tested on x86-64 and x32.  There are no code changes on x86-64.

(cherry picked from commit df76ff3a446a787a95cf74cb15c285464d73a93d)

Add new file missed in previous hppa commit.

(cherry picked from commit acdcca72940e060270e4e54d9c0457398110f409)

Fix data race in setting function descriptors during lazy binding on hppa.

This addresses an issue that is present mainly on SMP machines running
threaded code.  In a typical indirect call or PLT import stub, the
target address is loaded first.  Then the global pointer is loaded into
the PIC register in the delay slot of a branch to the target address.
During lazy binding, the target address is a trampoline which transfers
to _dl_runtime_resolve().

_dl_runtime_resolve() uses the relocation offset stored in the global
pointer and the linkage map stored in the trampoline to find the
relocation.  Then, the function descriptor is updated.

In a multi-threaded application, it is possible for the global pointer
to be updated between the load of the target address and the global
pointer.  When this happens, the relocation offset has been replaced
by the new global pointer.  The function pointer has probably been
updated as well but there is no way to find the address of the function
descriptor and to transfer to the target.  So, _dl_runtime_resolve()
typically crashes.

HP-UX addressed this problem by adding an extra pc-relative branch to
the trampoline.  The descriptor is initially setup to point to the
branch.  The branch then transfers to the trampoline.  This allowed
the trampoline code to figure out which descriptor was being used
without any modification to user code.  I didn't use this approach
as it is more complex and changes function pointer canonicalization.

The order of loading the target address and global pointer in
indirect calls was not consistent with the order used in import stubs.
In particular, $$dyncall and some inline versions of it loaded the
global pointer first.  This was inconsistent with the global pointer
being updated first in dl-machine.h.  Assuming the accesses are
ordered, we want elf_machine_fixup_plt() to store the global pointer
first and calls to load it last.  Then, the global pointer will be
correct when the target function is entered.

However, just to make things more fun, HP added support for
out-of-order execution of accesses in PA 2.0.  The accesses used by
calls are weakly ordered. So, it's possibly under some circumstances
that a function might be entered with the wrong global pointer.
However, HP uses weakly ordered accesses in 64-bit HP-UX, so I assume
that loading the global pointer in the delay slot of the branch must
work consistently.

The basic fix for the race is a combination of modifying user code to
preserve the address of the function descriptor in register %r22 and
setting the least-significant bit in the relocation offset.  The
latter was suggested by Carlos as a way to distinguish relocation
offsets from global pointer values.  Conventionally, %r22 is used
as the address of the function descriptor in calls to $$dyncall.
So, it wasn't hard to preserve the address in %r22.

I have updated gcc trunk and gcc-9 branch to not clobber %r22 in
$$dyncall and inline indirect calls.  I have also modified the import
stubs in binutils trunk and the 2.33 branch to preserve %r22.  This
required making the stubs one instruction longer but we save one
relocation.  I also modified binutils to align the .plt section on
a 8-byte boundary.  This allows descriptors to be updated atomically
with a floting-point store.

With these changes, _dl_runtime_resolve() can fallback to an alternate
mechanism to find the relocation offset when it has been clobbered.
There's just one additional instruction in the fast path. I tested
the fallback function, _dl_fix_reloc_arg(), by changing the branch to
always use the fallback.  Old code still runs as it did before.

Fixes bug 23296.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 1a044511a3f9020c3f430164e0a6a77426fecd7e)

stdlib: Move tst-system to tests-container

Fix some issues with different shell and error messages.

Checked on x86_64-linux-gnu and i686-linux-gnu.

(cherry picked from commit 4eda036f5b897fa8bc20ddd2099b5a6ed4239dc9)

support/shell-container.c: Add builtin kill

No options supported.

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 1c17100c43c0913ec94f3bcc966bf3792236c690)

support/shell-container.c: Add builtin exit

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 5a5a3a3234bc220a5192d620e0cbc5360da46f14)

support/shell-container.c: Return 127 if execve fails

Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 5fce0e095bc413f908f472074c2235198cd76bf4)

Add NEWS entry for CVE-2020-1751 (bug 25423)

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 07d16a6debc830ebcf9533da5396edd2eff688e0)