git.ipfire.org Git - thirdparty/glibc.git/log

io: ftw: Use state stack instead of recursion (BZ 33882)

The current implementation of ftw relies on recursion to traverse
directories (ftw_dir calls process_entry, which calls ftw_dir).  In deep
directory trees, this could lead to a stack overflow (as demonstrated by
the new tst-nftw-bz33882.c test).

This patch refactors ftw to use an explicit, heap-allocated stack to
manage directory traversal:

  * The 'struct ftw_frame' encapsulates the state of a single directory
    level (directory stream, stat buffer, previous base offset, and
    current state).

  * ftw_dir is rewritten to use an iterative loop instead of recursion,
    which enables immediate state transitions without function call
    overhead.
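
The explicit-stack technique can be illustrated with a small, self-contained sketch. It is not glibc's ftw code. An in-memory tree (the hypothetical struct node) stands in for directories. The hypothetical struct frame plays the role struct ftw_frame plays for a directory level, recording which child to visit next. Traversal depth is then limited by heap memory rather than by the C call stack.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical in-memory tree node standing in for a directory entry.  */
struct node
{
  const char *name;
  struct node **children;
  size_t nchildren;
};

/* Per-level traversal state kept on the heap instead of the C call stack,
   similar in spirit to struct ftw_frame: the level being walked and the
   index of the next child to visit.  */
struct frame
{
  struct node *dir;
  size_t next;
};

typedef void (*visit_fn) (struct node *, size_t depth, void *arg);

/* Pre-order walk without recursion.  Returns 0 on success or -1 if the
   explicit stack cannot be grown.  */
int
walk_iterative (struct node *root, visit_fn visit, void *arg)
{
  size_t cap = 16, depth = 0;
  struct frame *stack = malloc (cap * sizeof *stack);
  if (stack == NULL)
    return -1;

  visit (root, 0, arg);
  stack[depth++] = (struct frame) { root, 0 };

  while (depth > 0)
    {
      struct frame *top = &stack[depth - 1];
      if (top->next == top->dir->nchildren)
        {
          /* Level finished: pop it (real code would close the stream).  */
          depth--;
          continue;
        }
      struct node *child = top->dir->children[top->next++];
      visit (child, depth, arg);
      if (child->nchildren == 0)
        continue;
      if (depth == cap)
        {
          /* Grow the heap stack instead of overflowing the C stack.  */
          struct frame *grown = realloc (stack, 2 * cap * sizeof *stack);
          if (grown == NULL)
            {
              free (stack);
              return -1;
            }
          stack = grown;
          cap *= 2;
        }
      stack[depth++] = (struct frame) { child, 0 };  /* Descend.  */
    }

  free (stack);
  return 0;
}

/* Example callback: stats[0] counts nodes, stats[1] tracks max depth.  */
void
count_visit (struct node *n, size_t depth, void *arg)
{
  size_t *stats = arg;
  (void) n;
  stats[0]++;
  if (depth > stats[1])
    stats[1] = depth;
}
```

The sketch can walk a degenerate chain of 100000 nested nodes by growing the heap array. A recursive walk of the same chain would need one call frame per level.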

The patch also cleans up some unused definitions and assumptions (e.g.,
free-clobbering errno) and fixes undefined behavior when handling the ftw callback.

Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>

math: Sync sinh from CORE-MATH

CORE-MATH commit e756933f improved the error bound in the fast path for
x_0 <= x < 1/4, along with a formal proof [1].

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and arm-linux-gnueabihf.

[1] https://core-math.gitlabpages.inria.fr/sinh.pdf

testsuite: fix test-narrowing-trap failure on platforms where FE_INVALID is not defined

I didn't realize it can be undefined at all instead of simply
unsupported :(.

Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Document CVE-2026-4046

Signed-off-by: Siddhesh Poyarekar <siddhesh@gotplt.org>

x86_64: Prefer EVEX512 code-path on AMD Zen5 CPUs

Introduced a synthetic architecture preference flag (Prefer_EVEX512)
and enabled it for AMD Zen5 (CPUID Family 0x1A) when AVX-512 is supported.

This flag modifies IFUNC dispatch to prefer 512-bit EVEX variants over
256-bit EVEX variants for string and memory functions on Zen5 processors,
leveraging their native 512-bit execution units for improved throughput.
When Prefer_EVEX512 is set, the dispatcher selects evex512 implementations;
otherwise, it falls back to evex (256-bit) variants.
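
The dispatch policy described above is a two-level preference check. The sketch below is illustrative only. struct cpu_snapshot, enum strlen_variant, and select_strlen_variant are hypothetical names. glibc's actual selectors in ifunc-evex.h and ifunc-avx2.h use their own cpu_features macros and return function pointers rather than an enum.

```c
#include <stdbool.h>

/* Hypothetical view of the relevant CPU state; glibc's cpu_features
   structure and query macros are different.  */
struct cpu_snapshot
{
  bool evex256_usable;  /* 256-bit EVEX variants can run.  */
  bool evex512_usable;  /* 512-bit EVEX variants can run.  */
  bool prefer_evex512;  /* Synthetic Prefer_EVEX512 preference flag.  */
};

enum strlen_variant
{
  VARIANT_BASELINE,     /* Stand-in for the AVX2/SSE2 choices.  */
  VARIANT_EVEX,         /* 256-bit EVEX.  */
  VARIANT_EVEX512       /* 512-bit EVEX.  */
};

/* Pick a variant following the described policy: 512-bit code only when
   it is usable and explicitly preferred, otherwise 256-bit EVEX.  */
enum strlen_variant
select_strlen_variant (const struct cpu_snapshot *cpu)
{
  if (cpu->evex512_usable && cpu->prefer_evex512)
    return VARIANT_EVEX512;
  if (cpu->evex256_usable)
    return VARIANT_EVEX;
  return VARIANT_BASELINE;
}
```

Clearing the preference bit, for example through a tunable, moves a Zen5-like snapshot back to the 256-bit variant.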

The implementation updates the IFUNC selection logic in ifunc-avx2.h and
ifunc-evex.h to check for the Prefer_EVEX512 flag before dispatching to
EVEX512 implementations. This change affects six string/memory functions:

  - strchr
  - strlen
  - strnlen
  - strrchr
  - strchrnul
  - memchr

Benchmarks conducted on AMD Zen5 hardware demonstrate significant
performance improvements across all affected functions:

Function    Baseline   Patched    Avg         Avg        Avg      Max
            Variant    Variant    Baseline    Patched    Change   Improve
                                  (ns)        (ns)       %        %
------------+----------+----------+-----------+----------+--------+--------
STRCHR      evex       evex512    16.408      12.293     25.08%   37.69%
STRLEN      evex       evex512    16.862      11.436     32.18%   56.74%
STRNLEN     evex       evex512    18.493      11.762     36.40%   64.40%
STRRCHR     evex       evex512    15.154      10.874     28.24%   44.38%
STRCHRNUL   evex       evex512    16.464      12.605     23.44%   45.56%
MEMCHR      evex       evex512    9.984       8.268      17.19%   39.99%

Additionally, a tunable option (glibc.cpu.x86_cpu_features.preferred)
is provided to allow runtime control of the Prefer_EVEX512 flag for testing
and compatibility.

Reviewed-by: Ganesh Gopalasubramanian <Ganesh.Gopalasubramanian@amd.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

math: Fix lgammaf regression on i686

The new test from 19781c2221 triggers a failure on i686:

  testing float (without inline functions)
  Failure: lgamma (0x3.12be38p+120): errno set to 0, expected 34 (ERANGE)
  Failure: lgamma_upward (0x3.12be38p+120): errno set to 0, expected 34 (ERANGE)

Use math_narrow_eval on the multiplication to force the expected
precision.

Checked on i686-linux-gnu.

math: Use polydd_cosh instead of polydd on cosh

This matches the original CORE-MATH code and is the reason the
function exists.

Checked on x86_64-linux-gnu, i686-linux-gnu, aarch64-linux-gnu,
and arm-linux-gnueabihf.

localedata: Add disclaimer to files contributed with assignment

Add the FSF's disclaimer to bi_VU, C, gbm_IN, hif_FJ, sah_RU,
sm_WS, and to_TO which were created under copyright assignment
(not DCO).

This change ensures that all 352 localedata files have either
the FSF disclaimer or the related DCO text we are using (e.g.,
ab_GE).

Link: https://inbox.sourceware.org/libc-alpha/80426eb7-70cd-4178-8fda-51d590aa38d4@redhat.com/
Link: https://inbox.sourceware.org/libc-alpha/20130220215701.B263F2C0A7@topped-with-meat.com/
Link: https://inbox.sourceware.org/libc-alpha/87pmtq54hs.fsf@oldenburg.str.redhat.com/
Reviewed-by: Collin Funk <collin.funk1@gmail.com>

advisories: Update GLIBC-SA-2026-0005 and GLIBC-SA-2026-0006.

Update advisories with Fix-Commit information for 2.43.9000 and 2.44.

Update NEWS with advisory entries.

resolv: Check hostname for validity (CVE-2026-4438)

The hostname processed in getanswer_ptr must be checked so that invalid
characters, including shell metacharacters, are rejected. Failing to
check the returned hostname for validity is a security issue.

A regression test is added for invalid metacharacters and other cases
of invalid or valid characters.
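
The following is a rough illustration of what checking a hostname for validity can mean: a simplified letter-digit-hyphen check in the style of RFC 952/1123. It is a hypothetical helper (hostname_is_valid), not the check glibc applies in getanswer_ptr. It shows how accepting only a small allow-list of characters keeps shell metacharacters such as ';' or '$' out of returned names.

```c
#include <stdbool.h>
#include <string.h>

/* True for ASCII letters, digits and '-', independent of locale.  */
static bool
is_ldh (unsigned char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
         || (c >= '0' && c <= '9') || c == '-';
}

/* Hypothetical validity check: dot-separated labels of 1..63 LDH
   characters that neither start nor end with '-', at most 253 bytes in
   total, with an optional trailing dot.  */
bool
hostname_is_valid (const char *name)
{
  size_t total = strlen (name);
  if (total == 0 || total > 253)
    return false;
  const char *p = name;
  while (*p != '\0')
    {
      const char *label = p;
      while (*p != '\0' && *p != '.')
        {
          if (!is_ldh ((unsigned char) *p))
            return false;      /* Rejects ';', '$', '`', spaces, ...  */
          p++;
        }
      size_t len = (size_t) (p - label);
      if (len == 0 || len > 63 || label[0] == '-' || label[len - 1] == '-')
        return false;
      if (*p == '.')
        p++;                   /* Step over the separator.  */
    }
  return true;
}
```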

No regressions on x86_64-linux-gnu.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Use #!/usr/bin/python3 in remaining Python scripts

Some distributions ban the /usr/bin/python path in their build
systems due to the ambiguity of whether it refers to Python 2 or
Python 3. Python 2 has been out of support for many years, and
glibc has required Python 3 at build time for a while. So it seems
safe to switch the remaining scripts over to /usr/bin/python3.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

LoongArch: Add new files for LA32 in sysdeps/unix/sysv/linux/loongarch/ilp32

Add implies, abilist, c++-types and syscall files.

LoongArch: Add support for LA32 in sysdeps/unix/sysv/linux/loongarch

LoongArch: Add new file for LA32 in sysdeps/loongarch/ilp32

LoongArch: Add support for LA32 in sysdeps/loongarch/fpu

Move the loongarch64 implementation to sysdeps/loongarch/lp64/fpu.

LoongArch: Add support for LA32 in sysdeps/loongarch

LoongArch: fix missing trap for enabled exceptions on narrowing operation

The libc_feupdateenv_test macro is supposed to trap when the trap for a
previously held exception is enabled. But
libc_feupdateenv_test_loongarch wasn't doing it properly: the comment
claims "setting of the cause bits" would cause "the hardware to generate
the exception" but that's simply not true for the LoongArch movgr2fcsr
instruction.

To fix the issue, call __feraiseexcept if a held exception has its
trap enabled.

Reviewed-by: caiyinyu <caiyinyu@loongson.cn>
Signed-off-by: Xi Ruoyao <xry111@xry111.site>

nptl: Fix intermittent nptl/tst-cancel31 failure

tst-cancel31 sometimes fails on la32 qemu-system with a single-core
system.

If the test and an infinite loop run on the same x86_64 core, the
test also fails sometimes:

  taskset -c 0 make test t=nptl/tst-cancel31
  taskset -c 0 ./a.out (a.out is an infinite loop)

After the writeopener thread opens the file, it may switch to the
main thread, which then finds an unexpected open file descriptor.

Fix this by calling pthread_cancel and pthread_join on the
writeopener thread before support_descriptors_check.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

resolv: Count records correctly (CVE-2026-4437)

The answer section boundary was previously ignored, and the code in
getanswer_ptr would iterate past the last resource record, but not
beyond the end of the returned data.  This could lead to subsequent data
being interpreted as answer records, thus violating the DNS
specification.  Such resource records could be maliciously crafted and
hidden from other tooling, but processed by the glibc stub resolver and
acted upon by the application.  While we trust the data returned by the
configured recursive resolvers, we should not trust its format and
should validate it as required.  It is a security issue to incorrectly
process the DNS protocol.
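
The core of the fix is to bound the parsing loop by the answer count from the header, not only by the end of the buffer. The sketch below shows that pattern on a deliberately simplified, hypothetical record format: one length byte followed by that many payload bytes. Real DNS resource records carry names, types, and RDLENGTH fields, and glibc's getanswer_ptr is structured differently.

```c
#include <stddef.h>

/* Walk exactly ANCOUNT records of the hypothetical format from BUF.
   The loop is bounded by BOTH the answer count and BUFLEN, so bytes that
   follow the answer section are never treated as answers.  On success,
   stores the offset where the answer section ends in *END and returns
   ANCOUNT; returns -1 if a record is truncated or missing.  */
int
parse_answers (const unsigned char *buf, size_t buflen, unsigned ancount,
               size_t *end)
{
  size_t off = 0;
  for (unsigned i = 0; i < ancount; i++)
    {
      if (off >= buflen)
        return -1;                     /* Fewer records than announced.  */
      size_t rdlen = buf[off];
      if (rdlen > buflen - off - 1)
        return -1;                     /* Record overruns the buffer.  */
      off += 1 + rdlen;
    }
  *end = off;
  return (int) ancount;
}
```

With three records in the buffer but an answer count of 2, only the first two are consumed. The third is left for the next section.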

A regression test is added for response section crossing.

No regressions on x86_64-linux-gnu.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

Add advisory text for CVE-2026-4438

Explain the security issue and set the context for the vulnerability to
help downstreams get a better understanding of the issue.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Add advisory text for CVE-2026-4437

Explain the security issue and set the context for the vulnerability to
help downstreams get a better understanding of the issue.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

Use binutils 2.46, MPC 1.4.0 in build-many-glibcs.py

Note that MPC 1.4.0 has moved from .tar.gz to .tar.xz distribution.

Tested with build-many-glibcs.py (host-libraries, compilers and glibcs
builds).

LoongArch: feclearexcept: skip clearing CAUSE

The comment explaining the reason to clear CAUSE does not make any
sense: it says the next "CTC" instruction would raise the FP exception
of which both the CAUSE and ENABLE bits are set, but LoongArch does not
have the CTC instruction. LoongArch has the movgr2fcsr instruction but
movgr2fcsr never raises any FP exception, different from the MIPS CTC
instruction.

So we don't really need to care about CAUSE at all.

Signed-off-by: Xi Ruoyao <xry111@xry111.site>

riscv: Resolve calls to memcpy using memcpy-generic in early startup

This patch from Adhemerval sets up the ifunc redirections so that we
resolve memcpy to memcpy_generic in early startup. This avoids infinite
recursion for memcpy calls before the loader is fully initialized.

Tested-by: Jeff Law <jeffrey.law@oss.qualcomm.com>

riscv: Treat clang separately in RVV compiler checks

Detect clang explicitly and apply compiler-specific version checks for
RVV support.

Signed-off-by: Zihong Yao <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>

math: Fix spurious overflow and missing errno for lgammaf

It syncs with CORE-MATH 9a75500ba1831 and 20d51f2ee.

Checked on aarch64-linux-gnu.

misc: Fix a few typos in comments

math: Sync lgammaf with CORE-MATH

It removes some unnecessary corner-case checks and uses a slightly
different algorithm for the binary search of the hard-case database.

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

math: Sync tgammaf with CORE-MATH

It adds a minor optimization to the fast path.

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Makefile: add allow-list for failures

Allow known failures to be listed in allowed-failures.txt, and ignore
failures of tests in that list. If allowed-failures.txt does not exist,
all failures lead to a failed status as before.

When the file is present, failures of listed tests are ignored and reported
on stdout. If tests not in the allowed list fail, summarize-tests exits with
status 1 and reports the failing tests.

The expected format of allowed-failures.txt file is:
<test_name> # <comment>
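
For illustration, one line of this format can be parsed as below. The helper parse_allowed_line and the sample test name used with it are hypothetical. The real handling lives in the glibc build scripts, not in C.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Copy the test name from a line of the form "<test_name> # <comment>"
   into OUT (size OUTSZ), dropping the comment and surrounding blanks.
   Returns false for blank or comment-only lines, or if OUT is too small.  */
bool
parse_allowed_line (const char *line, char *out, size_t outsz)
{
  const char *end = strchr (line, '#');
  if (end == NULL)
    end = line + strlen (line);        /* No comment on this line.  */
  while (line < end && isspace ((unsigned char) *line))
    line++;                            /* Trim leading blanks.  */
  while (end > line && isspace ((unsigned char) end[-1]))
    end--;                             /* Trim trailing blanks/newline.  */
  size_t len = (size_t) (end - line);
  if (len == 0 || len >= outsz)
    return false;
  memcpy (out, line, len);
  out[len] = '\0';
  return true;
}
```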

Reviewed-by: Florian Weimer <fweimer@redhat.com>

string: Add fallback implementation for ctz/clz

The libgcc implementations of __builtin_clzl/__builtin_ctzl may require
access to additional data that is not marked as hidden, which could
introduce additional GOT indirection and necessitate RELATIVE relocs.
And the RELATIVE reloc is an issue if the code is used during static-pie
startup before self-relocation (for instance, during an assert).

For this case, the ABI can add a string-bitops.h header that defines
HAVE_BITOPTS_WORKING to 0. A configure check for this issue is tricky
because it requires linking against the standard libraries, which
create many RELATIVE relocations and complicate filtering those that
might be created by the builtins.

The fallback is disabled by default, so no target is affected.
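
A fallback of this kind needs only shifts and masks, so it has no hidden data references. The loop-based sketch below is one hypothetical way to write it. fallback_ctzl and fallback_clzl are made-up names, and the glibc fallback may use a different method. Compilers can sometimes recognize such loops as clz/ctz idioms, so the generated code of a real implementation would still need checking.

```c
#include <limits.h>

#define ULONG_BITS (sizeof (unsigned long) * CHAR_BIT)

/* Count trailing zero bits.  Returns ULONG_BITS for X == 0 (the builtin
   leaves that case undefined).  */
unsigned int
fallback_ctzl (unsigned long x)
{
  if (x == 0)
    return ULONG_BITS;
  unsigned int n = 0;
  while ((x & 1UL) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}

/* Count leading zero bits, with the same X == 0 convention.  */
unsigned int
fallback_clzl (unsigned long x)
{
  if (x == 0)
    return ULONG_BITS;
  const unsigned long top = 1UL << (ULONG_BITS - 1);
  unsigned int n = 0;
  while ((x & top) == 0)
    {
      x <<= 1;
      n++;
    }
  return n;
}
```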

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

AArch64: Remove prefer_sve_ifuncs

Remove the prefer_sve_ifuncs CPU feature since it was intended for older
kernels. Current distros all use modern Linux kernels with improved support
for SVE save/restore, making this check redundant.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>

This reverts commit 6e8f32d39a57aa1f31bf15375810aab79a0f5f4b.

First off, apologies for my misunderstanding on how madvise(MADV_HUGEPAGE)
works. I had the misconception that doing madvise(p, 1, MADV_HUGEPAGE) will set
VM_HUGEPAGE on the entire VMA - it does not, it will align the size to
PAGE_SIZE (4k) and then *split* the VMA. Only the first page-length of the
virtual space will be VM_HUGEPAGE'd; the rest of it will stay the same.

The above is the semantics for all madvise() calls - which makes sense from a
UABI perspective. madvise() should apply the advice only to the (page-aligned)
length it was asked to cover; doing any more than that is not something the
user expects.

Commit 6e8f32d39a57 tries to optimize around the madvise() call by determining
whether the VMA got madvise'd before. This will work for most cases except
the following: if check_may_shrink_heap() is true, shrink_heap() re-maps the
shrunk portion, giving us a new VMA altogether. That VMA won't have the
VM_HUGEPAGE flag.

Reverting this commit, we will again mark the new VMA with VM_HUGEPAGE, and
the kernel will merge the two into a single VMA marked with VM_HUGEPAGE.

This may be the only case where we lose VM_HUGEPAGE, and we could micro-optimize
by extending the current if-condition with !check_may_shrink_heap. But let us
not do this - this is very difficult to reason about, and I am soon going
to propose mmap(MAP_HUGEPAGE) in Linux to do away with all these workarounds.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: factor out ld.conf parsing

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>

x86: Fix tanh ifunc selection

The inclusion of the generic tanh implementation without undefining
libm_alias_double (to provide the __tanh_sse2 implementation) makes
the exported tanh symbol point to the SSE2 variant.

Reviewed-by: DJ Delorie <dj@redhat.com>

x86_64: Add cosh with FMA

cosh shows an improvement of about 35% when building for
x86_64-v3.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Consolidate common definition/data for cosh/sinh/tanh

Common data definitions are moved to e_coshsinh_data, cosh only
data is moved to e_cosh_data, sinh to e_sinh_data, and tanh to
e_tanh_data.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use tanh from CORE-MATH

The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -10]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-10, 10]
* FE_TONEAREST
     0:       4059325526  94.51%
     1:        231023238   5.38%
     2:          4618531   0.11%
* FE_UPWARD
     0:       2106654900  49.05%
     1:       2145413180  49.95%
     2:         40847554   0.95%
     3:          2051661   0.05%
* FE_DOWNWARD
     0:       2106618401  49.05%
     1:       2145409958  49.95%
     2:         40880992   0.95%
     3:          2057944   0.05%
* FE_TOWARDZERO
     0:       4061659952  94.57%
     1:        221006985   5.15%
     2:         12285512   0.29%
     3:            14846   0.00%

* Range [10, DBL_MAX]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Performance-wise, it shows:

latency                      master        patched        improvement
x86_64                     109.7420       184.5950            -68.21%
x86_64v2                   109.1230       187.1890            -71.54%
x86_64v3                    99.4471        49.1104             50.62%
aarch64                     43.0474        32.2933             24.98%
armhf-vfpv4                 41.0954        35.8473             12.77%
powerpc64le                 27.3282        22.7134             16.89%

reciprocal-throughput        master        patched        improvement
x86_64                      42.5562       158.1820           -271.70%
x86_64v2                    42.5734       159.2560           -274.07%
x86_64v3                    35.9899        24.2877             32.52%
aarch64                     24.7660        22.8466              7.75%
armhf-vfpv4                 27.0251        25.8150              4.48%
powerpc64le                 11.7350        11.2504              4.13%

* x86_64:        gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64:       gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4:  gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le:   gcc version 15.2.1 20260128, POWER10

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Remove the SVID error handling from sinh

It improves throughput by 8% to 18% and latency by 1% to 10%,
depending on the ABI.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use sinh from CORE-MATH

The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -10]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-10, 10]
* FE_TONEAREST
     0:       3169388892  73.79%
     1:       1125270674  26.20%
     2:           307729   0.01%
* FE_UPWARD
     0:       1450068660  33.76%
     1:       2146926394  49.99%
     2:        697404986  16.24%
     3:           567255   0.01%
* FE_DOWNWARD
     0:       1449727976  33.75%
     1:       2146957381  49.99%
     2:        697719649  16.25%
     3:           562289   0.01%
* FE_TOWARDZERO
     0:       2519351889  58.66%
     1:       1773434502  41.29%
     2:          2180904   0.05%

* Range [10, DBL_MAX]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Performance-wise, it shows:

latency                      master        patched        improvement
x86_64                     101.0710       129.4710            -28.10%
x86_64v2                   101.1810       127.6370            -26.15%
x86_64v3                    96.0685        48.5911             49.42%
aarch64                     41.4229        22.3971             45.93%
armhf-vfpv4                 42.8620        25.6011             40.27%
powerpc64le                 29.2630        13.1450             55.08%

reciprocal-throughput        master        patched        improvement
x86_64                      42.6895       105.7150           -147.64%
x86_64v2                    42.7255       104.7480           -145.17%
x86_64v3                    39.6949        25.9087             34.73%
aarch64                     26.0104        19.2236             26.09%
armhf-vfpv4                 29.4362        23.6350             19.71%
powerpc64le                 12.9170        8.34582             35.39%

* x86_64:        gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64:       gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4:  gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le:   gcc version 15.2.1 20260128, POWER10

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Remove the SVID error handling from cosh

It improves throughput by 3.5% to 9%.

Reviewed-by: DJ Delorie <dj@redhat.com>

math: Use cosh from CORE-MATH

The current implementation shows the following accuracy, on
three ranges ([-DBL_MAX,-10], [-10,10], [10,DBL_MAX]) with 10e9 uniform
randomly generated numbers for each range (first column is the
accuracy in ULP, with '0' being correctly rounded, second is the
number of samples with the corresponding precision):

* Range [-DBL_MAX, -10]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

* Range [-10, 10]
* FE_TONEAREST
     0:       3291614060  76.64%
     1:       1003353235  23.36%
* FE_UPWARD
     0:       2295272497  53.44%
     1:       1999675198  46.56%
     2:            19600   0.00%
* FE_DOWNWARD
     0:       2294966533  53.43%
     1:       1999981461  46.57%
     2:            19301   0.00%
* FE_TOWARDZERO
     0:       2306015780  53.69%
     1:       1988942093  46.31%
     2:             9422   0.00%

* Range [10, DBL_MAX]
* FE_TONEAREST
     0:      10000000000 100.00%
* FE_UPWARD
     0:      10000000000 100.00%
* FE_DOWNWARD
     0:      10000000000 100.00%
* FE_TOWARDZERO
     0:      10000000000 100.00%

The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).

Performance-wise, it shows:

latency                      master        patched     improvement
x86_64                      52.1066       126.4120        -142.60%
x86_64v2                    49.5781       119.8520        -141.74%
x86_64v3                    45.0811        50.5758         -12.19%
aarch64                     19.9977        21.7814          -8.92%
armhf-vfpv4                 20.5969        27.0479         -31.32%
powerpc64le                 12.6405        13.6768          -8.20%

reciprocal-throughput        master        patched     improvement
x86_64                      18.4833        102.9120       -456.78%
x86_64v2                    17.5409        99.5179        -467.35%
x86_64v3                    18.9187        25.3662         -34.08%
aarch64                     10.9045        18.8217         -72.60%
armhf-vfpv4                 15.7430        24.0822         -52.97%
powerpc64le                  5.4275         8.1269         -49.73%

* x86_64:        gcc version 15.2.1 20260112, Ryzen 9 5900X, --disable-multi-arch
* aarch64:       gcc version 15.2.1 20251105, Neoverse-N1
* armv7a-vfpv4:  gcc version 15.2.1 20251105, Neoverse-N1
* powerpc64le:   gcc version 15.2.1 20260128, POWER10

Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.

Reviewed-by: DJ Delorie <dj@redhat.com>

nptl/htl: Add missing AC_PROVIDES

nptl/htl: Fix confusion over PTHREAD_IN_LIBC and __PTHREAD_NPTL/HTL

The last remaining uses of PTHREAD_IN_LIBC were places where
__PTHREAD_NPTL/HTL should have been used, but the latter was not
conveniently available everywhere. Defining it from config.h makes
things simpler.

nptl: Drop comment about PTHREAD_IN_LIBC

nptl is now always in libc.

elf: directly call dl_init_static_tls

htl can now have it directly in ld.so

resolv: Move libanl symbols to libc on hurd too

elf: Drop librt.so from localplt-built-dso

It's always empty now.

rt: Move librt symbols to libc on hurd too

htl: Use pthread_rwlock for libc_rwlock

Like nptl does, so we really get rwlock behavior.

mach: Add __mach_rwlock_*

We cannot use pthread_rwlock for these until we have reimplemented
pthread_rwlock with gsync, so fork __libc_rwlock off for now.

configure: Remove extra ')' from b4c110022c

configure: Fix bootstrap build after 570c46d36b (BZ 33985)

Commit 570c46d36b made libgcc_s be defined for have-cc-with-libunwind=no
(the default for gcc builds) without taking into account that the compiler
can link against -lgcc_s (as detected by have-libgcc_s).

Checked with build-many-glibcs.py for x86_64-linux-gnu.

linux: Fix aliasing violations and assert address in __check_pf (bug #33927)

The Linux implementation of __check_pf retrieves interface data via
make_request, which queries the kernel via netlink. The IFA_ADDRESS
received from the kernel's RTM_NEWADDR netlink message is (a)
type-punned via pointer-casting leading to strict aliasing violations,
and (b) dereferenced assuming that it is non-NULL.

This commit removes the strict-aliasing violations using memcpy, and
adds an assert that the address is indeed non-NULL before dereferencing
it.
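
The memcpy approach can be sketched as follows. Instead of dereferencing a cast pointer (for example `*(struct in6_addr *) p`), the code copies the address bytes into a properly typed object. copy_ifa_address is a hypothetical helper. In the real code the payload pointer would come from the netlink attribute, for instance via RTA_DATA on the IFA_ADDRESS attribute.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Copy an IPv6 address out of an attribute payload without type-punning.
   PAYLOAD must hold at least sizeof (struct in6_addr) bytes but needs no
   particular alignment.  */
struct in6_addr
copy_ifa_address (const void *payload)
{
  struct in6_addr addr;
  assert (payload != NULL);              /* Mirrors the added check.  */
  memcpy (&addr, payload, sizeof addr);  /* No strict-aliasing issue.  */
  return addr;
}
```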

Reported-by: Siteshwar Vashisht <svashisht@redhat.com>
Reviewed-by: Sam James <sam@gentoo.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

x86: Don't left shift negative values

GCC warns about this with -Wshift-negative-value:

    In file included from ../sysdeps/x86/cpu-features.c:24:
    ../sysdeps/x86/dl-cacheinfo.h: In function ‘get_common_cache_info’:
    ../sysdeps/x86/dl-cacheinfo.h:913:45: warning: left shift of negative value [-Wshift-negative-value]
      913 |                           count_mask = ~(-1 << (count_mask + 1));
          |                                             ^~
    ../sysdeps/x86/dl-cacheinfo.h:930:45: warning: left shift of negative value [-Wshift-negative-value]
      930 |                           count_mask = ~(-1 << (count_mask + 1));
          |                                             ^~

This is because C23 § 6.5.8 specifies that this is undefined behavior.
We can cast it to unsigned instead, which is equivalent to UINT_MAX and
can be left shifted without undefined behavior.
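
A minimal sketch of the corrected mask computation follows, using a hypothetical helper name (low_bits_mask). Shifting an unsigned all-ones value is well defined as long as the shift count is smaller than the width of unsigned int.

```c
/* Mask with the low N + 1 bits set, as in
   count_mask = ~(-1 << (count_mask + 1)), but shifting an unsigned value.
   Assumes N + 1 is less than the width of unsigned int.  */
unsigned int
low_bits_mask (unsigned int n)
{
  return ~((unsigned int) -1 << (n + 1));
}
```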

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Support loading libunwind instead of libgcc_s

The 'unwind-link' facility allows glibc to support thread cancellation
and exit (pthread_cancel, pthread_exit, backtrace) by dynamically
loading the unwind library at runtime, preventing a hard dependency on
libgcc_s within libc.so.

When building with libunwind (for clang/LLVM toolchains [1]), two
assumptions in the existing code break:

  1. The runtime library is libunwind.so instead of libgcc_s.so.

  2. libgcc relies on __gcc_personality_v0 to handle unwinding mechanics.
     libunwind exposes the standard '_Unwind_*' accessors directly.

This patch adapts `unwind-link` to handle both environments based on
the HAVE_CC_WITH_LIBUNWIND configuration:

  * The UNWIND_SONAME macro now selects between LIBGCC_S_SO and
    LIBUNWIND_SO.

  * For libgcc, it continues to resolve `__gcc_personality_v0`.

  * For libunwind, it instead resolves the standard
    _Unwind_GetLanguageSpecificData, _Unwind_SetGR, _Unwind_SetIP,
    and _Unwind_GetRegionStart helpers.

  * unwind-resume.c is updated to implement wrappers for these
    accessors that forward calls to the dynamically loaded function
    pointers, effectively shimming the unwinder.

Tests and Makefiles are updated to link against `$(libunwind)` where
appropriate.

[1] https://github.com/libunwind/libunwind

Reviewed-by: Sam James <sam@gentoo.org>

configure: Repurpose have-cc-with-libunwind for clang support

The `have-cc-with-libunwind` check (and its corresponding macro
HAVE_CC_WITH_LIBUNWIND) was historically specific to IA64, intended
to supplement libgcc with libunwind.  Since this logic is unused in
current GCC configurations, this patch repurposes it to support
clang-based toolchains that utilize LLVM's libunwind instead of
libgcc_s.

The configure script now detects if the compiler natively supports
unwinding via `-lunwind`.

Additionally, when this mode is enabled, `-lclang_rt.builtins` is
explicitly added to the `libgcc_eh` definition.  This is necessary
because `links-dso-program` otherwise fails to link due to a missing
`__gcc_personality_v0` symbol.  It appears that clang does not
automatically link the builtins providing this personality routine
when `rlink-path` is actively used during the build.

Reviewed-by: Sam James <sam@gentoo.org>

configure: Parametrize runtime libraries to support compiler-rt

Historically, the build system has hardcoded references to `-lgcc` and
`-lgcc_eh`, explicitly assuming the use of the GCC runtime.  This
prevents building glibc with alternative toolchains, specifically clang
configured with `--rtlib=compiler-rt`, where these libraries are
replaced by `libclang_rt.builtins`.

This patch introduces a mechanism to dynamically detect the compiler's
underlying runtime library.

The logic works as follows:

1. It queries the compiler using `-print-libgcc-file-name`.
2. It parses the output path to determine if `libgcc` or `compiler-rt`
   is in use.
3. Based on this detection, it parametrizes the build variables for
   the static runtime and exception handling libraries (replacing
   hardcoded `-lgcc` and `-lgcc_eh`).
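
The classification step (2) can be illustrated in C. This is only a sketch of the idea. The configure script does the actual detection in shell, and classify_rtlib, enum rtlib, and the sample paths are hypothetical.

```c
#include <string.h>

enum rtlib { RTLIB_UNKNOWN, RTLIB_LIBGCC, RTLIB_COMPILER_RT };

/* Classify the path printed by "$CC -print-libgcc-file-name" by looking
   at its final component.  */
enum rtlib
classify_rtlib (const char *path)
{
  static const char clang_rt[] = "libclang_rt.builtins";
  const char *slash = strrchr (path, '/');
  const char *base = slash != NULL ? slash + 1 : path;

  if (strncmp (base, clang_rt, sizeof clang_rt - 1) == 0)
    return RTLIB_COMPILER_RT;   /* LLVM compiler-rt builtins.  */
  if (strcmp (base, "libgcc.a") == 0)
    return RTLIB_LIBGCC;        /* Traditional GCC runtime.  */
  return RTLIB_UNKNOWN;
}
```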

This ensures that the build system correctly links against the active
compiler runtime—whether it is the traditional libgcc or LLVM's
compiler-rt—without requiring manual overrides.

Reviewed-by: Sam James <sam@gentoo.org>

malloc: Remove lingering DIAG_POP_NEEDS_COMMENT

From 0ea9ebe48ad624919d579dbe651293975fb6a699.

malloc: Cleanup warnings

Cleanup warnings - malloc builds with -Os and -Og without needing any
complex warning avoidance defines.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

Document CVE-2026-3904

All branches already have a fix, so this is mainly for distributions
that may have cherry-picked the SSE2 memcmp implementation.

Signed-off-by: Siddhesh Poyarekar <siddhesh@gotplt.org>

LoongArch: Optimize float environment functions

In LoongArch, fcsr1 is an alias of the enables field in fcsr0, and fcsr3
is an alias of the RM field in fcsr0. This patch uses the fcsr1 and fcsr3
registers to optimize fedisableexcept, feenableexcept, fegetexcept,
fegetround, fesetround, and get_rounding_mode, which avoids an extra andi
instruction.

nptl: Only issue __libc_unwind_link_get for SHARED

The compiler already optimizes it away for static builds.

Reviewed-by: Collin Funk <collin.funk1@gmail.com>

x86_64: Conditionally define __sfp_handle_exceptions for compiler-rt

The LLVM compiler-rt builtins library does not currently provide an
implementation for __sfp_handle_exceptions. On x86_64, this causes
unresolved symbol errors when building glibc in environments that
exclude libgcc.

This patch implements __sfp_handle_exceptions specifically for x86_64,
bridging the gap for non-GNU compiler runtimes.

The implementation is used conditionally, only if the compiler does
not already provide the symbol.

NB: the implementation is based on libgcc and raises both SSE and i387
exceptions (unlike the one from 460ee50de054396cc9791ff4).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

test-assert-c++-variadic.cc: Disable assert_works for GCC 14.2 and 14.1

PR118629 [1], fixed in GCC 14.3, resolved an issue with using
__PRETTY_FUNCTION__ (to which assert expands) inside an unevaluated
context. Only versions 14.1 and 14.2 are affected, as the -std=c++26
option has been supported since 14.1.

Clang supports this usage in all versions that support the --std=c++26
flag (since 17.0.1).

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118629

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

libio: Properly link in function _IO_wfile_doallocate in static binaries

This patch addresses Bug 33935 - _IO_wfile_doallocate not linked correctly
when linking glibc statically.
https://sourceware.org/bugzilla/show_bug.cgi?id=33935

The function _IO_wfile_doallocate is declared with #pragma weak in vtable.c,
is the only symbol contained in wfiledoalloc.c, and is never called
directly in libio.

As a result, when a static binary that uses wide-character functions is
linked against glibc, the real _IO_wfile_doallocate definition may not be
pulled in; only the weak reference in the vtable is linked, which causes a
segmentation fault at run time.

This patch fixes the problem the same way as for _IO_file_doallocate: it
adds libio_static_fn_required(_IO_wfile_doallocate) to wgenops.c so that
_IO_wfile_doallocate is always linked into static binaries.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

malloc: Improve memalign alignment

Use the generic stdc_bit_width to safely adapt to input types. Move rounding
up of alignments that are not powers of 2 to __libc_memalign. Simplify the
alignment handling of aligned_alloc and __posix_memalign. Add a testcase for
non-power-of-2 memalign alignments and fix malloc-debug.

Reviewed-by: DJ Delorie <dj@redhat.com>

feat(rtld): Allow LD_DEBUG category exclusion

Adds support for excluding specific categories from `LD_DEBUG` output.

The `LD_DEBUG` environment variable now accepts category names prefixed
with a dash (`-`) to disable their debugging output. This allows users
to enable broad categories (e.g., `all`) while suppressing verbose or
irrelevant information from specific sub-categories (e.g., `-tls`).

The `process_dl_debug` function in `rtld.c` has been updated to parse
these exclusion options and unset the corresponding bits in
`GLRO(dl_debug_mask)`. The `LD_DEBUG=help` output has also been updated
to document this new functionality. A new test `tst-dl-debug-exclude.sh`
is added to verify the correct behavior of category exclusion.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf(tls): Add debug logging for TLS operations

This commit introduces extensive debug logging for thread-local storage
(TLS) operations within the dynamic linker. When `LD_DEBUG=tls` is
enabled, messages are printed for:
- TLS module assignment and release.
- DTV (Dynamic Thread Vector) resizing events.
- TLS block allocations and deallocations.
- `__tls_get_addr` slow path events (DTV updates, lazy allocations, and
static TLS usage).

The log format is standardized to use a "tls: " prefix and identifies
modules using the "modid %lu" convention. To aid in debugging
multithreaded applications, thread-specific logs include the Thread
Control Block (TCB) address to identify the context of the operation.

A new test module `tst-tls-debug-mod.c` and a corresponding shell script
`tst-tls-debug-recursive.sh` have been added. Additionally, the existing
`tst-dl-debug-tid` NPTL test has been updated to verify these TLS debug
messages in a multithreaded context.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

elf: Check the result of openat against -1, not 1

Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>

htl: Fix pthread_once memory ordering

The fast-path read must synchronize with the store that marks
initialization as done, so that a thread that reads 1 on the fast path also
sees all the effects performed by the init routine.

A full barrier is not needed; an acquire/release pair is enough.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

htl: Make sure the exit path of last thread sees all thread cleanups

This matters in case, e.g., some atexit() handlers expect all threads to
have finished their side effects.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

hurd: Check for _hurdsig_preempted_set with _hurd_siglock held

Without taking _hurd_siglock, we could miss the addition of a global
preemptor.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

htl: Call thread-specific destructors for last thread too

As required by POSIX.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

htl: Fix checking for mutex not being recoverable

pthread_mutex_unlock sets __owner_id to NOTRECOVERABLE_ID, so that is the
value the check must look for.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

benchtests: Adapt tanh

Random values in the range of [-4,4].

benchtests: Adapt sinh

Random values in the range of [-10,10].

benchtests: Adapt cosh

Random values in the range of [-10,10].

Fix Makefile alphabetical ordering

hurd: Fix return value for sigwait

It is supposed to return an error code, not just -1.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

hurd: Fix cleaning on sigtimedwait timing out

sigtimedwait also needs to clean up preemptors and the blocked mask before
returning EAGAIN.

Also add some sigtimedwait testing.

Linux: Only define OPEN_TREE_* macros in <sys/mount.h> if undefined (bug 33921)

There is a conditional inclusion of <linux/mount.h> earlier in the file.
If that header defines the macros, do not redefine them. This avoids build
problems, because the token sequences used by the UAPI macro definitions
change between Linux versions.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

malloc: Avoid accessing /sys/kernel/mm files

On AArch64, malloc always checks /sys/kernel/mm/transparent_hugepage/enabled
to set the THP mode. However, this check is quite expensive, and the file may
not be accessible in containers. If DEFAULT_THP_PAGESIZE is non-zero, use
malloc_thp_mode_madvise so that we take advantage of THP in all cases. Since
madvise is a fast system call, it adds only a small overhead compared to the
cost of mmap and populating the pages.

Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>

misc: Fix a few typos in comments

htl: Fix race between timedrd/wrlock and unlock

If the rwlock is unlocked right before we time out, we will already have
been given ownership, so we must not report a timeout.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

hurd: Take cancel_lock in critical section

read, write, etc. must be signal-safe and take cancel_lock, so signal
delivery has to be deferred while cancel_lock is held.

Reported-by: Brent Baccala <cosine@freesoft.org> 's Claude assistant

resolv: Avoid duplicate query if search list contains '.' (bug 33804)

Co-authored-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

support: no_override_resolv_conf_search flag for resolver test framework

It is required to test "search ." in /etc/resolv.conf files. By default,
the framework overrides the search path to isolate tests from unexpected
settings in the test execution environment.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>

AArch64: Improve memset when len is 64

Change the mask to 48 to support len==64. The second memory store now accesses
offset 32, whereas the third one accesses offset 16. As a result, performance
for len==64 almost doubles.

malloc: Add asserts for malloc assumptions

Currently malloc has various assumptions, some documented, some implicit.
Add a few asserts to check the most fundamental assumptions using verify().
Remove some odd #define void.

Reviewed-by: Paul Eggert <eggert@cs.ucla.edu>

tests: posix: use cpu clock for sleep

On some emulated targets, sleep may produce inconsistent wait times, which
can cause the tst-chmod test to fail.

To account for this, use the CLOCK_PROCESS_CPUTIME_ID clock ID while also
consuming CPU time by repeatedly calling clock_gettime.

Reviewed-by: DJ Delorie <dj@redhat.com>

assert: Support assert as variadic macro for C++26 [PR27276]

C++26 changes assert into a variadic macro to support using
assignment-expressions that would otherwise be interpreted as multiple macro
arguments, in particular ones containing:
* template parameter lists: func<int, float>()
* calls to overloaded operator[] that accepts multiple arguments: arr[1, 2]
  (a C++23 feature, see libstdc++ PR/119855 [1])
* lambdas with explicit captures: [x, y] { ... }

The new expansion, of the form:
  (__VA_ARGS__) ? void (1 ? 1 : bool (__VA_ARGS__))
                : __assert_fail (...)
has the following properties:
* The use of (__VA_ARGS__) ? ... : ... requires that __VA_ARGS__
  is contextually convertible to bool. This means that enumerators
  of scoped enumerations are no longer accepted (they are only
  explicitly convertible). Thus this patch addresses glibc PR/27276 [2].
* The nested ternary 1 ? 1 : bool (__VA_ARGS__) guarantees that the
  expression expanded from __VA_ARGS__ is not evaluated twice.
  This is used instead of an unevaluated context (like sizeof...)
  to support C++ expressions that are not allowed in unevaluated
  contexts (lambdas until C++20, co_await, co_yield).
* bool (__VA_ARGS__) is ill-formed if __VA_ARGS__ expands to
  multiple arguments: assert(1, 2)
* bool (__VA_ARGS__) also triggers warnings when __VA_ARGS__
  expands to an assignment such as x = 1: assert(x = 1)

To guarantee that the code snippets from assert/test-assert-c++-variadic.cc
are actually checked for validity, we need to compile this test in C++26
(-std=c++26) mode. To achieve that, this patch compiles the file with the
test-config-cxxflags-stdcxx26 variable as an additional flag; it is set to
-std=c++26 if the $(TEST_CXX) executable supports that flag, and is empty
otherwise.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119855
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=27276

Co-authored-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>

math: Sync atanh with CORE-MATH

It speeds up the fast path for |x| < 0.25. The CORE-MATH muldd is the
same as muldd2 from glibc's ddcoremath.h.

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

math: Sync log10p1f with CORE-MATH

The new code shows a small performance increase on x86_64:

latency                      patched        sync   improvement
x86_64                       41.8873     40.2864         3.82%
x86_64v2                     40.5859     39.2079         3.40%
x86_64v3                     34.6393     33.5018         3.28%
aarch64                      15.2731     14.5953         4.44%
armhf-vfpv4                  17.0373     17.0186         0.11%
powerpc64le                   8.3341      8.3298         0.05%

reciprocal-throughput        patched        sync   improvement
x86_64                       15.6516     13.6373        12.87%
x86_64v2                     15.0551     13.2769        11.81%
x86_64v3                     12.8994     11.0628        14.24%
aarch64                       8.8306      9.1898        -4.07%
armhf-vfpv4                   9.5855     10.0199        -4.53%
powerpc64le                   4.0074      4.4466       -10.96%

x86_64 / i686:     gcc version 15.2.1 20260112, Ryzen 5900X
aarch64:           gcc version 15.2.1 20251105, Neoverse-N1
armv7a-vfpv4:      gcc version 15.2.1 20251105, Neoverse-N1
powerpc64le:       gcc version 14.2.1 20241230, POWER10
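
The improvement column appears to be the relative reduction with respect to the first column, as a percentage; this reading is inferred from the numbers rather than stated in the commit. For the x86_64 latency row, for example:

```latex
\text{improvement} = \frac{t_{\text{patched}} - t_{\text{sync}}}{t_{\text{patched}}} \times 100\%,
\qquad
\frac{41.8873 - 40.2864}{41.8873} \times 100\% \approx 3.82\%
```

Negative values therefore mean the synced code is slower or, in the size tables, larger.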

The code size also shows a slight improvement; the s_log10p1f.os
'size' output shows:

size                patched     sync   improvement
x86_64                 2345     2243         4.35%
x86_64v2               2345     2243         4.35%
x86_64v3               2226     2162         2.88%
aarch64                2104     2112        -0.38%
armhf-vfpv4            2016     2012         0.20%
powerpc64le            2324     2340        -0.69%

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

math: Sync log10f with CORE-MATH

The performance is similar:

latency                        master        sync  improvement
x86_64                        34.6851     32.7977        5.44%
x86_64v2                      34.0921     32.4295        4.88%
x86_64v3                      27.8292     27.6070        0.80%
aarch64                       11.7246     11.1351        5.03%
armhf-vfpv4                   13.3748     12.9055        3.51%
powerpc64le                    6.4036      6.5825       -2.79%

reciprocal-throughput          master        sync  improvement
x86_64                        10.2653     10.0437        2.16%
x86_64v2                      10.8432     10.7040        1.28%
x86_64v3                      10.9006     11.0765       -1.61%
aarch64                        6.6447      6.2743        5.57%
armhf-vfpv4                    6.8916      6.7538        2.00%
powerpc64le                    2.9494      2.7661        6.21%

x86_64 / i686:     gcc version 15.2.1 20260112, Ryzen 5900X
aarch64:           gcc version 15.2.1 20251105, Neoverse-N1
armv7a-vfpv4:      gcc version 15.2.1 20251105, Neoverse-N1
powerpc64le:       gcc version 14.2.1 20241230, POWER10

The code size is also similar.

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

math: Sync log2p1f with CORE-MATH

The new code shows better performance overall:

latency                      patched        sync   improvement
x86_64                       48.5909     33.3368        31.39%
x86_64v2                     49.1357     33.9981        30.81%
x86_64v3                     39.2397     28.0957        28.40%
aarch64                      16.5372     12.8133        22.52%
armhf-vfpv4                  18.1434     14.5273        19.93%
powerpc64le                   9.0999     7.49235        17.67%

reciprocal-throughput        patched        sync   improvement
x86_64                       14.5197     10.9726        24.43%
x86_64v2                     14.7640     11.1358        24.57%
x86_64v3                     11.5523     9.83253        14.89%
aarch64                       8.2854      7.8479         5.28%
armhf-vfpv4                   8.8586      8.5245         3.77%
powerpc64le                   3.8995      4.0069        -2.75%

x86_64 / i686:     gcc version 15.2.1 20260112, Ryzen 5900X
aarch64:           gcc version 15.2.1 20251105, Neoverse-N1
armv7a-vfpv4:      gcc version 15.2.1 20251105, Neoverse-N1
powerpc64le:       gcc version 14.2.1 20241230, POWER10

The sync also reduces the internal table size; the s_log2p1f.os
'size' output shows:

size                         master        sync   improvement
x86_64                         3417        2089        38.86%
x86_64v2                       3417        2089        38.86%
x86_64v3                       3228        2001        38.01%
i686                           3490        2151        38.37%
aarch64                        3200        1888        41.00%
armhf-vfpv4                    3080        1804        41.43%
powerpc64le                    3408        2148        36.97%

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

math: Sync log1pf with CORE-MATH

The performance is similar, with some minor regressions:

latency                      master        sync   improvement
x86_64                      38.2841     38.1375         0.38%
x86_64v2                    37.7338     37.4292         0.81%
x86_64v3                    31.3500     32.3576        -3.21%
aarch64                     13.7384     13.9030        -1.20%
armhf-vfpv4                 15.5730     16.5105        -6.02%
powerpc64le                  7.6038      7.5757         0.37%

reciprocal-throughput        master        sync   improvement
x86_64                      12.4910     11.9683         4.18%
x86_64v2                    12.2935     11.7614         4.33%
x86_64v3                    11.5444     10.6369         7.86%
aarch64                      7.7262      7.8954        -2.19%
armhf-vfpv4                  8.3502      8.8741        -6.27%
powerpc64le                  3.5883      3.5259         1.74%

x86_64 / i686:     gcc version 15.2.1 20260112, Ryzen 5900X
aarch64:           gcc version 15.2.1 20251105, Neoverse-N1
armv7a-vfpv4:      gcc version 15.2.1 20251105, Neoverse-N1
powerpc64le:       gcc version 14.2.1 20241230, POWER10

The sync also reduces the internal table size; the s_log1pf.os
'size' output shows:

size                         master       sync    improvement
x86_64                         2078       1641         21.03%
x86_64v2                       2078       1641         21.03%
x86_64v3                       1975       1514         23.34%
aarch64                        1808       1336         26.11%
armhf-vfpv4                    1716       1284         25.17%
powerpc64le                    2132       1616         24.20%

Checked on aarch64-linux-gnu, arm-linux-gnueabihf,
powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu.

Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>

htl: Fix mt-safeness of libio

Since d2e04918833 ("Single threaded stdio optimization")
we are supposed to call _IO_enable_locks when creating the first thread,
but that commit missed doing it for htl.