git.ipfire.org Git - thirdparty/gcc.git/log

rs6000: Add shift count guards to avoid undefined behavior [PR118890]

This patch adds missing guards on shift amounts to prevent UB when the
shift count equals or exceeds HOST_BITS_PER_WIDE_INT.

In the patch (r16-2666-g647bd0a02789f1), shift counts were only checked
for nonzero but not for being within valid bounds. This patch tightens
those conditions by enforcing that shift counts are greater than zero
and less than HOST_BITS_PER_WIDE_INT.
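
A minimal sketch of the kind of guard described above, not the actual
rs6000.cc code. The helper name shift_left_guarded and the constant
kHostBitsPerWideInt (assumed here to be 64) are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Stand-in for HOST_BITS_PER_WIDE_INT; assumed to be 64 in this sketch.
constexpr int kHostBitsPerWideInt = 64;

// Hypothetical helper: perform the shift only when the count is greater
// than zero and less than the type width, so the shift is always defined.
inline std::optional<std::uint64_t>
shift_left_guarded(std::uint64_t value, int count) {
  if (count > 0 && count < kHostBitsPerWideInt)
    return value << count;
  return std::nullopt;  // caller must fall back to another strategy
}
```

A check for "nonzero" alone would let a count of 64 (or a negative count)
through, which is exactly the case the tightened condition rejects.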

2025-08-23 Kishan Parmar <kishan@linux.ibm.com>

gcc/
PR target/118890
* config/rs6000/rs6000.cc (can_be_rotated_to_negative_lis): Add bounds
checks for shift counts to prevent undefined behavior.
(rs6000_emit_set_long_const): Likewise.

[PR rtl-optimization/120553] Improve selecting between constants based on sign bit test

While working to remove mvconst_internal I stumbled over a regression in
the code to handle signed division by a power of two.

In that sequence we want to select between 0, 2^n-1 by pairing a sign
bit splat with a subsequent logical right shift.  This can be done
without branches or conditional moves.

Playing with it a bit made me realize there's a handful of selections we
can do based on a sign bit test.  Essentially there's two broad cases.

Clearing bits after the sign bit splat.  So we have 0, -1, if we clear
bits the 0 stays as-is, but the -1 could easily turn into 2^n-1, ~2^n-1,
or some small constants.

Setting bits after the sign bit splat. If we have 0, -1, setting bits
the -1 stays as-is, but the 0 can turn into 2^n, a small constant, etc.
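
The two cases can be illustrated at the source level with a rough sketch.
This is not the RTL that ifcvt emits; the function names are hypothetical,
and the splat is written in unsigned arithmetic only so that the example is
fully defined C++.

```cpp
#include <cassert>
#include <cstdint>

// All-ones if x is negative, zero otherwise. (ux >> 31) is 0 or 1, and
// 0u - 1u wraps to 0xFFFFFFFF, so no signed shift is involved.
constexpr std::uint32_t sign_splat(std::int32_t x) {
  return 0u - (static_cast<std::uint32_t>(x) >> 31);
}

// Clearing bits: selects 0 (x >= 0) or 2^n - 1 (x < 0) by following the
// splat with a logical right shift. Requires 0 < n < 32.
constexpr std::uint32_t select_0_or_low_mask(std::int32_t x, unsigned n) {
  return sign_splat(x) >> (32 - n);
}

// Setting bits: selects 2^n (x >= 0) or -1 (x < 0) by OR-ing a bit into
// the splat. Requires n < 32.
constexpr std::uint32_t select_pow2_or_minus1(std::int32_t x, unsigned n) {
  return sign_splat(x) | (std::uint32_t{1} << n);
}
```

Neither function needs a branch or a conditional move, which is the point
of recognizing these selections.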

Shreya and I originally started looking at target patterns to do this,
essentially discovering conditional move forms of the selects and
rewriting them into something more efficient.  That got out of control
pretty quickly and it relied on if-conversion to initially create the
conditional move.

The better solution is to actually discover the cases during
if-conversion itself.  That catches cases that were previously being
missed, checks cost models, and is actually simpler since we don't have
to distinguish between things like ori and bseti, instead we just emit
the natural RTL and let the target figure it out.

In the ifcvt implementation we put these cases just before trying the
traditional conditional move sequences.  Essentially these are a last
attempt before trying the generalized conditional move sequence.

This has been bootstrapped and regression tested on aarch64, riscv,
ppc64le, s390x, alpha, m68k, sh4eb, x86_64 and probably a couple others
I've forgotten.  It's also been tested on the other embedded targets.
Obviously the new tests are risc-v specific, so that testing was
primarily to make sure we didn't ICE, generate incorrect code or regress
existing target-specific tests.

Raphael has some changes to attack this from the gimple direction as
well.  I think the latest version of those is on me to push through
internal review.

PR rtl-optimization/120553
gcc/
* ifcvt.cc (noce_try_sign_bit_splat): New function.
(noce_process_if_block): Use it.

gcc/testsuite/

* gcc.target/riscv/pr120553-1.c: New test.
* gcc.target/riscv/pr120553-2.c: New test.
* gcc.target/riscv/pr120553-3.c: New test.
* gcc.target/riscv/pr120553-4.c: New test.
* gcc.target/riscv/pr120553-5.c: New test.
* gcc.target/riscv/pr120553-6.c: New test.
* gcc.target/riscv/pr120553-7.c: New test.
* gcc.target/riscv/pr120553-8.c: New test.

Pass representative of live SLP node to vect_create_epilog_for_reduction

We passed the reduc_info which is close, but the representative is
more spot on and will not collide with making the reduc_info a
distinct type.

* tree-vect-loop.cc (vectorizable_live_operation): Pass
the representative of the PHIs node to
vect_create_epilog_for_reduction.

Fixups around reduction info and STMT_VINFO_REDUC_VECTYPE_IN

STMT_VINFO_REDUC_VECTYPE_IN exists on relevant reduction stmts, not
the reduction info.  And STMT_VINFO_DEF_TYPE exists on the
reduction info.  The following fixes up a few places.

* tree-vect-loop.cc (vectorizable_lane_reducing): Get
reduction info properly.  Adjust checks according to
comments.
(vectorizable_reduction): Do not set STMT_VINFO_REDUC_VECTYPE_IN
on the reduc info.
(vect_transform_reduction): Query STMT_VINFO_REDUC_VECTYPE_IN
on the actual reduction stmt, not the info.

RISC-V: Add testcase for scalar unsigned SAT_MUL form 3

Add run and asm check test cases for scalar unsigned SAT_MUL form 3.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/sat/sat_arith.h: Add test helper macros.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u64-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u16.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u64-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u16.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u64.rv32.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>

Match: Add form 3 for unsigned SAT_MUL

This patch tries to match the unsigned
SAT_MUL form 3, shown below:

  #define DEF_SAT_U_MUL_FMT_3(NT, WT)             \
  NT __attribute__((noinline))                    \
  sat_u_mul_##NT##_from_##WT##_fmt_3 (NT a, NT b) \
  {                                               \
    WT x = (WT)a * (WT)b;                         \
    if ((x >> sizeof(a) * 8) == 0)                \
      return (NT)x;                               \
    else                                          \
      return (NT)-1;                              \
  }

where WT is uint16_t, uint32_t, uint64_t or uint128_t,
and NT is uint8_t, uint16_t, uint32_t or uint64_t.
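
For illustration, here is one hand-written instance of the form-3 pattern
with NT = uint8_t and WT = uint16_t. The function name is hypothetical; the
body follows the macro shown above.

```cpp
#include <cassert>
#include <cstdint>

// Saturating unsigned multiply: widen, multiply, and return the narrow
// maximum if any bit above the narrow type's width is set.
constexpr std::uint8_t sat_u_mul_u8_from_u16(std::uint8_t a, std::uint8_t b) {
  std::uint16_t x = static_cast<std::uint16_t>(
      static_cast<std::uint16_t>(a) * static_cast<std::uint16_t>(b));
  if ((x >> (sizeof(a) * 8)) == 0)
    return static_cast<std::uint8_t>(x);  // product fits in NT
  else
    return static_cast<std::uint8_t>(-1); // saturate to 0xFF
}
```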

gcc/ChangeLog:

* match.pd: Add form 3 for unsigned SAT_MUL.

Signed-off-by: Pan Li <pan2.li@intel.com>

Emit the TLS call after NOTE_INSN_FUNCTION_BEG

For the beginning basic block:

(note 4 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 4 26 2 NOTE_INSN_FUNCTION_BEG)

emit the TLS call after NOTE_INSN_FUNCTION_BEG.

gcc/

PR target/121635
* config/i386/i386-features.cc (ix86_emit_tls_call): Emit the
TLS call after NOTE_INSN_FUNCTION_BEG.

gcc/testsuite/

PR target/121635
* gcc.target/i386/pr121635-1a.c: New test.
* gcc.target/i386/pr121635-1b.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

Use REDUC_GROUP_FIRST_ELEMENT less

REDUC_GROUP_FIRST_ELEMENT is often checked to see whether we are
dealing with a SLP reduction or a reduction chain. When we are
in the context of analyzing the reduction (so we are sure
the SLP instance we see is correct), then we can use the SLP
instance kind instead.

* tree-vect-loop.cc (get_initial_defs_for_reduction): Adjust
comment.
(vect_create_epilog_for_reduction): Get at the reduction
kind via the instance, re-use the slp_reduc flag instead
of checking REDUC_GROUP_FIRST_ELEMENT again.
Remove unreachable code.
(vectorizable_reduction): Compute a reduc_chain flag from
the SLP instance kind, avoid REDUC_GROUP_FIRST_ELEMENT
checks.
(vect_transform_cycle_phi): Likewise.
(vectorizable_live_operation): Check the SLP instance
kind instead of REDUC_GROUP_FIRST_ELEMENT.

testsuite: Fix g++.dg/abi/mangle83.C for -fshort-enums

Linaro CI informed me that this test fails on ARM thumb-m7-hard-eabi.
This appears to be because the target defaults to -fshort-enums, and so
the mangled names are inaccurate.

This patch just disables the implicit type enum test for this case.

gcc/testsuite/ChangeLog:

* g++.dg/abi/mangle83.C: Disable implicit enum test for
-fshort-enums.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

Decouple parloops from vect reduction infra some more

The following removes the use of STMT_VINFO_REDUC_* from parloops,
also fixing a mistake with analyzing double reductions which rely
on the outer loop vinfo so the inner loop is properly detected as
nested.

* tree-parloops.cc (parloops_is_simple_reduction): Pass
in double reduction inner loop LC phis and query that.
(parloops_force_simple_reduction): Similar, but set it.
Check for valid reduction types here.
(valid_reduction_p): Remove.
(gather_scalar_reductions): Adjust, fixup double
reduction inner loop processing.

RTEMS: Add riscv multilibs

gcc/ChangeLog:

* config/riscv/t-rtems: Add -mstrict-align multilibs for
targets without support for misaligned access in hardware.

[arm] require armv7 support for [PR120424]

Without stating the architecture version required by the test, test
runs with options that are incompatible with the required
architecture version fail, e.g. -mfloat-abi=hard.

armv7 was not covered by the long list of arm variants in
target-supports.exp, so add it, and use it for the effective target
requirement and for the option.

for gcc/testsuite/ChangeLog

PR rtl-optimization/120424
* lib/target-supports.exp (arm arches): Add arm_arch_v7.
* g++.target/arm/pr120424.C: Require armv7 support. Use
dg-add-options arm_arch_v7 instead of explicit -march=armv7.

Daily bump.

Fortran: Fix NULL pointer issue.

PR fortran/121627

gcc/fortran/ChangeLog:

* module.cc (create_int_parameter_array): Avoid NULL
pointer dereference and enhance error message.

gcc/testsuite/ChangeLog:

* gfortran.dg/pr121627.f90: New test.

pru: libgcc: Add software implementation for multiplication

For cores without a hardware multiplier, set respective optabs
with library functions which use software implementation of
multiplication.

The implementation was copied from the RL78 backend.

gcc/ChangeLog:

* config/pru/pru.cc (pru_init_libfuncs): Set softmpy libgcc
functions for optab multiplication entries if TARGET_OPT_MUL
option is not set.

libgcc/ChangeLog:

* config/pru/libgcc-eabi.ver: Add __pruabi_softmpyi and
__pruabi_softmpyll symbols.
* config/pru/t-pru: Add softmpy source files.
* config/pru/pru-softmpy.h: New file.
* config/pru/softmpyi.c: New file.
* config/pru/softmpyll.c: New file.

Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>

pru: Define multilib for different core variants

Enable multilib builds for contemporary PRU core versions (AM335x and
later), and older versions present in AM18xx.

gcc/ChangeLog:

* config.gcc: Include pru/t-multilib.
* config/pru/pru.h (MULTILIB_DEFAULTS): Define.
* config/pru/t-multilib: New file.

Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>

pru: Add options to disable MUL/FILL/ZERO instructions

Older PRU core versions (e.g. in AM1808 SoC) do not support
XIN, XOUT, FILL, ZERO instructions. Add GCC command line options to
optionally disable generation of those instructions, so that code
can be executed on such older PRU cores.

gcc/ChangeLog:

* common/config/pru/pru-common.cc (TARGET_DEFAULT_TARGET_FLAGS):
Keep multiplication, FILL and ZERO instructions enabled by
default.
* config/pru/pru.md (prumov<mode>): Gate code generation on
TARGET_OPT_FILLZERO.
(mov<mode>): Ditto.
(zero_extendqidi2): Ditto.
(zero_extendhidi2): Ditto.
(zero_extendsidi2): Ditto.
(@pru_ior_fillbytes<mode>): Ditto.
(@pru_and_zerobytes<mode>): Ditto.
(@<code>di3): Ditto.
(mulsi3): Gate code generation on TARGET_OPT_MUL.
* config/pru/pru.opt: Add mmul and mfillzero options.
* config/pru/pru.opt.urls: Regenerate.
* config/rl78/rl78.opt.urls: Regenerate.
* doc/invoke.texi: Document new options.

Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>

c: Add folding of nullptr_t in some cases [PR121478]

The middle-end does not fully understand NULLPTR_TYPE. So it
gets confused a lot of the time when dealing with it.
This adds the folding that is similarly done in the C++ front-end already.
In some cases it should produce slightly better code as there is no
reason to load from a nullptr_t variable as it is always NULL.

The following is handled:
nullptr_v ==/!= nullptr_v -> true/false
(ptr)nullptr_v -> (ptr)0, nullptr_v
f(nullptr_v) -> f ((nullptr, nullptr_v))

The last one is for conversion inside ... .

Bootstrapped and tested on x86_64-linux-gnu.

PR c/121478
gcc/c/ChangeLog:

* c-fold.cc (c_fully_fold_internal): Fold nullptr_t ==/!= nullptr_t.
* c-typeck.cc (convert_arguments): Handle conversion from nullptr_t
for varargs.
(convert_for_assignment): Handle conversions from nullptr_t to
pointer type specially.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/pr121478-1.c: New test.

Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>

c++: constexpr clobber of const [PR121068]

Since r16-3022, 20_util/variant/102912.cc was failing in C++20 and above due
to wrong errors about destruction modifying a const object; destruction is
OK.

PR c++/121068

gcc/cp/ChangeLog:

* constexpr.cc (cxx_eval_store_expression): Allow clobber of a const
object.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/constexpr-dtor18.C: New test.

RISC-V: testsuite: Fix DejaGnu support for riscv_zvfh

Call check_effective_target_riscv_zvfh_ok rather than
check_effective_target_riscv_zvfh in vx_vf_*run-1-f16.c run tests and ensure
that they are actually run.
Also fix remove_options_for_riscv_zvfh.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmacc-run-1-f16.c: Call
check_effective_target_riscv_zvfh_ok rather than
check_effective_target_riscv_zvfh.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwnmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwnmsac-run-1-f16.c: Likewise.
* lib/target-supports.exp (check_effective_target_riscv_zvfh_ok): Append
zvfh instead of v to march.
(remove_options_for_riscv_zvfh): Remove duplicate and
call remove_ rather than add_options_for_riscv_z_ext.

rtl-ssa: Add missing live-out uses [PR121619]

This PR is another bug in the rtl-ssa code to manage live-out uses.
It seems that this didn't get much coverage until recently.

In the testcase, late-combine first removed a register-to-register
move by substituting into all uses, some of which were in other EBBs.
This was done after checking make_uses_available, which (as expected)
says that single dominating definitions are available everywhere
that the definition dominates. But the update failed to add
appropriate live-out uses, so a later parallelisation attempt
tried to move the new destination into a later block.

gcc/
PR rtl-optimization/121619
* rtl-ssa/functions.h (function_info::commit_make_use_available):
Declare.
* rtl-ssa/blocks.cc (function_info::commit_make_use_available):
New function.
* rtl-ssa/changes.cc (function_info::apply_changes_to_insn): Use it.

gcc/testsuite/
PR rtl-optimization/121619
* gcc.dg/pr121619.c: New test.

libstdc++: Use pthread_mutex_clocklock when TSan is active [PR121496]

This reverts r14-905-g3b7cb33033fbe6 which disabled the use of
pthread_mutex_clocklock when TSan is active. That's no longer needed,
because GCC has TSan interceptors for pthread_mutex_clocklock since GCC
15.1 and Clang has them since 18.1.0 (released March 2024).

The interceptor was added by https://github.com/llvm/llvm-project/pull/75713

libstdc++-v3/ChangeLog:

PR libstdc++/121496
* acinclude.m4 (GLIBCXX_CHECK_PTHREAD_MUTEX_CLOCKLOCK): Do not
use _GLIBCXX_TSAN in _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK macro.
* configure: Regenerate.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>

libstdc++: Check _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK with #if [PR121496]

The change in r14-905-g3b7cb33033fbe6 to disable the use of
pthread_mutex_clocklock when TSan is active assumed that the
_GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK macro was always checked with #if
rather than #ifdef, which was not true.

This makes the checks use #if consistently.
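
The difference matters when a macro is defined with the value 0, as this
small sketch shows. The macro name HYPOTHETICAL_USE_CLOCKLOCK is made up for
the example and is not the libstdc++ macro.

```cpp
#include <cassert>

// A feature macro that is defined, but disabled.
#define HYPOTHETICAL_USE_CLOCKLOCK 0

// #ifdef only asks "is it defined?", so the disabled feature looks enabled.
#ifdef HYPOTHETICAL_USE_CLOCKLOCK
constexpr bool seen_by_ifdef = true;
#else
constexpr bool seen_by_ifdef = false;
#endif

// #if evaluates the value, so the disabled feature stays disabled.
#if HYPOTHETICAL_USE_CLOCKLOCK
constexpr bool seen_by_if = true;
#else
constexpr bool seen_by_if = false;
#endif
```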

libstdc++-v3/ChangeLog:

PR libstdc++/121496
* include/std/mutex (__timed_mutex_impl::_M_try_wait_until):
Change preprocessor condition to use #if instead of #ifdef.
(recursive_timed_mutex::_M_clocklock): Likewise.
* testsuite/30_threads/timed_mutex/121496.cc: New test.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>

tree-optimization/111494 - reduction vectorization with signed UB

The following makes sure to pun arithmetic that's used in vectorized
reduction to unsigned when overflow invokes undefined behavior.
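
As a rough source-level analogy (not the GIMPLE the vectorizer produces), a
reassociated sum can overflow in an intermediate lane even when the in-order
signed sum does not. Doing the additions in unsigned arithmetic makes the
intermediate wraparound well defined. The function and array names below are
hypothetical.

```cpp
#include <cassert>
#include <climits>

// Two-lane sum of four ints, mimicking how a vectorized reduction adds
// elements in a different order than the scalar loop would.
inline int sum_two_lanes_unsigned(const int (&v)[4]) {
  unsigned lane0 = static_cast<unsigned>(v[0]) + static_cast<unsigned>(v[2]);
  unsigned lane1 = static_cast<unsigned>(v[1]) + static_cast<unsigned>(v[3]);
  unsigned total = lane0 + lane1;  // wraps modulo 2^N, never undefined
  // In range for the example below; converting an out-of-range value is
  // implementation-defined before C++20.
  return static_cast<int>(total);
}

// In order: INT_MAX - 1 + 1 + 0 never overflows. Lane 0, however, computes
// INT_MAX + 1, which would be undefined if done in signed arithmetic.
constexpr int kExample[4] = {INT_MAX, -1, 1, 0};
```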

PR tree-optimization/111494
* gimple-fold.h (arith_code_with_undefined_signed_overflow): Declare.
* gimple-fold.cc (arith_code_with_undefined_signed_overflow): Export.
* tree-vect-stmts.cc (vectorizable_operation): Use unsigned
arithmetic for operations participating in a reduction.

x86-64: Emit the TLS call after NOTE_INSN_BASIC_BLOCK

For a basic block with only a label:

(code_label 78 11 77 3 14 (nil) [1 uses])
(note 77 78 54 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

emit the TLS call after NOTE_INSN_BASIC_BLOCK, instead of before
NOTE_INSN_BASIC_BLOCK, to avoid

x.c: In function ‘aout_16_write_syms’:
x.c:54:1: error: NOTE_INSN_BASIC_BLOCK is missing for block 3
54 | }
| ^
x.c:54:1: error: NOTE_INSN_BASIC_BLOCK 77 in middle of basic block 3
during RTL pass: x86_cse
x.c:54:1: internal compiler error: verify_flow_info failed

gcc/

PR target/121607
* config/i386/i386-features.cc (ix86_emit_tls_call): Emit the
TLS call after NOTE_INSN_BASIC_BLOCK in a basic block with only
a label.

gcc/testsuite/

PR target/121607
* gcc.target/i386/pr121607-1a.c: New test.
* gcc.target/i386/pr121607-1b.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

libstdc++: Implement aligned_accessor from mdspan [PR120994]

This commit completes the implementation of P2897R7 by implementing and
testing the template class aligned_accessor.

PR libstdc++/120994

libstdc++-v3/ChangeLog:

* include/bits/version.def (aligned_accessor): Add.
* include/bits/version.h: Regenerate.
* include/std/mdspan (aligned_accessor): New class.
* src/c++23/std.cc.in (aligned_accessor): Add.
* testsuite/23_containers/mdspan/accessors/generic.cc: Add tests
for aligned_accessor.
* testsuite/23_containers/mdspan/accessors/aligned_neg.cc: New test.
* testsuite/23_containers/mdspan/version.cc: Add test for
__cpp_lib_aligned_accessor.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Implement is_sufficiently_aligned [PR120994]

This commit implements and tests the function is_sufficiently_aligned
from P2897R7.
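
A minimal sketch of what such a check amounts to, assuming a power-of-two
alignment. The name is_sufficiently_aligned_sketch and the example buffer
are hypothetical; this is not the libstdc++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if ptr is aligned to at least Alignment bytes.
template <std::size_t Alignment, class T>
bool is_sufficiently_aligned_sketch(const T* ptr) {
  static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                "Alignment must be a power of two");
  return reinterpret_cast<std::uintptr_t>(ptr) % Alignment == 0;
}

// A 16-byte aligned buffer to try the check on.
alignas(16) static unsigned char example_buffer[32];
```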

PR libstdc++/120994

libstdc++-v3/ChangeLog:

* include/bits/align.h (is_sufficiently_aligned): New function.
* include/bits/version.def (is_sufficiently_aligned): Add.
* include/bits/version.h: Regenerate.
* include/std/memory: Add __glibcxx_want_is_sufficiently_aligned.
* src/c++23/std.cc.in (is_sufficiently_aligned): Add.
* testsuite/20_util/headers/memory/version.cc: Add test for
__cpp_lib_is_sufficiently_aligned.
* testsuite/20_util/is_sufficiently_aligned/1.cc: New test.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Fix std::numeric_limits<__float128>::max_digits10 [PR121374]

When I added this explicit specialization in r14-1433-gf150a084e25eaa I
used the wrong value for the number of mantissa digits (I used 112
instead of 113). Then when I refactored it in r14-1582-g6261d10521f9fd I
used the value calculated from the incorrect value (35 instead of 36).
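
The relation between the two pairs of numbers can be checked with the usual
formula max_digits10 = ceil(1 + p * log10(2)), where p is the number of
mantissa digits. The sketch below uses the rational 643/2136 as an integer
approximation of log10(2); the helper name is hypothetical.

```cpp
#include <cassert>

// 2 + floor(p * log10(2)) equals ceil(1 + p * log10(2)) whenever
// p * log10(2) is not an integer, which holds for the values used here.
constexpr int max_digits10_for(int mantissa_bits) {
  return 2 + mantissa_bits * 643 / 2136;
}

static_assert(max_digits10_for(112) == 35, "value derived from the wrong p");
static_assert(max_digits10_for(113) == 36, "value for a 113-bit mantissa");
```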

libstdc++-v3/ChangeLog:

PR libstdc++/121374
* include/std/limits (numeric_limits<__float128>::max_digits10):
Fix value.
* testsuite/18_support/numeric_limits/128bit.cc: Check value.

libstdc++: Suppress some more additional diagnostics [PR117294]

libstdc++-v3/ChangeLog:

PR c++/117294
* testsuite/20_util/optional/cons/value_neg.cc: Prune additional
output for C++20 and later.
* testsuite/20_util/scoped_allocator/69293_neg.cc: Match
additional error for C++20 and later.

libstdc++: Implement std::dims from <mdspan>.

This commit implements the C++26 feature std::dims described in P2389R2.
It sets the feature testing macro to 202406 and adds tests.

Also fixes the test mdspan/version.cc

libstdc++-v3/ChangeLog:

* include/bits/version.def (mdspan): Set value for C++26.
* include/bits/version.h: Regenerate.
* include/std/mdspan (dims): Add.
* src/c++23/std.cc.in (dims): Add.
* testsuite/23_containers/mdspan/extents/misc.cc: Add tests.
* testsuite/23_containers/mdspan/version.cc: Update test.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Simplify precomputed partial products in <mdspan>.

Prior to this commit, the partial products of static extents in <mdspan>
were computed in a loop that calls a function that computes the partial
product. The complexity is quadratic in the rank.

This commit removes the quadratic complexity.
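
The linear-time idea can be sketched with a running product. This is an
illustration only: the function names are hypothetical, and the sketch
ignores how the real code treats dynamic extents.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// out[r] = product of e[0..r), computed with one multiplication per step.
template <std::size_t N>
constexpr std::array<std::size_t, N>
fwd_partial_prods_sketch(const std::array<std::size_t, N>& e) {
  std::array<std::size_t, N> out{};
  std::size_t running = 1;
  for (std::size_t r = 0; r < N; ++r) {
    out[r] = running;
    running *= e[r];
  }
  return out;
}

// out[r] = product of e(r..N), walking from the back.
template <std::size_t N>
constexpr std::array<std::size_t, N>
rev_partial_prods_sketch(const std::array<std::size_t, N>& e) {
  std::array<std::size_t, N> out{};
  std::size_t running = 1;
  for (std::size_t r = N; r-- > 0;) {
    out[r] = running;
    running *= e[r];
  }
  return out;
}

constexpr std::array<std::size_t, 3> kExtents{3, 5, 7};
```

Calling a separate "product up to r" function for every r would redo the
same multiplications each time, which is where the quadratic cost comes from.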

libstdc++-v3/ChangeLog:

* include/std/mdspan (__static_prod): Delete.
(__fwd_partial_prods): Compute at compile-time in O(rank), not
O(rank**2).
(__rev_partial_prods): Ditto.
(__size): Inline __static_prod.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Reduce size of static storage for __fwd_prod in mdspan.

This fixes an oversight in a previous commit that improved mdspan
related code. Because __size doesn't use __fwd_prod, __fwd_prod(__rank)
is not needed anymore. Hence, one can shrink the size of
__fwd_partial_prods.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__fwd_partial_prods): Reduce size of the
array by 1 element.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

xtensa: Small improvement to "*btrue_INT_MIN"

This patch changes the implementation of the insn to test whether the
result itself is negative or not, rather than the MSB of the result of
the ABS machine instruction.  This eliminates the need to consider bit-
endianness and allows for longer branch distances.

     /* example */
     extern void foo(int);
     void test0(int a) {
       if (a == -2147483648)
         foo(a);
     }
     void test1(int a) {
       if (a != -2147483648)
         foo(a);
     }

     ;; before (endianness: little)
     test0:
      entry sp, 32
      abs a8, a2
      bbci a8, 31, .L1
      mov.n a10, a2
      call8 foo
     .L1:
      retw.n
     test1:
      entry sp, 32
      abs a8, a2
      bbsi a8, 31, .L4
      mov.n a10, a2
      call8 foo
     .L4:
      retw.n

     ;; after (endianness-independent)
     test0:
      entry sp, 32
      abs a8, a2
      bgez a8, .L1
      mov.n a10, a2
      call8 foo
     .L1:
      retw.n
     test1:
      entry sp, 32
      abs a8, a2
      bltz a8, .L4
      mov.n a10, a2
      call8 foo
     .L4:
      retw.n
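
The logic behind both versions can be sketched in C++, assuming (as the
pattern relies on) that the absolute value wraps for INT_MIN. The function
names are hypothetical, and the wraparound is spelled out in unsigned
arithmetic because abs(INT_MIN) is undefined for a signed int in C++.

```cpp
#include <cassert>
#include <cstdint>

// Absolute value with two's complement wraparound: INT32_MIN maps to
// 0x80000000, every other input maps to a value with the top bit clear.
constexpr std::uint32_t wrapping_abs(std::int32_t a) {
  std::uint32_t u = static_cast<std::uint32_t>(a);
  return a < 0 ? 0u - u : u;
}

// "The result is negative" is the same test as "the top bit is set", so
// only INT32_MIN passes.
constexpr bool is_int_min(std::int32_t a) {
  return (wrapping_abs(a) >> 31) != 0;
}
```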

gcc/ChangeLog:

* config/xtensa/xtensa.md (*btrue_INT_MIN):
Change the branch insn condition to test for a negative number
rather than testing for the MSB.

libstdc++: Replace numeric_limits with __int_traits in mdspan.

Using __int_traits avoids the need to include <limits> from <mdspan>.
This in turn should reduce the size of the pre-compiled <mdspan>.
Similar refactoring was carried out for PR92546. Unfortunately,

./gcc/xgcc -std=c++23 -P -E -x c++ - -include mdspan | wc -l

shows a decrease by 1(!) line. This is due to bits/max_size_type.h which
includes <limits>.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__valid_static_extent): Replace
numeric_limits with __int_traits.
(extents::_S_ctor_explicit): Ditto.
(extents::__static_quotient): Ditto.
(layout_stride::mapping::mapping): Ditto.
(mdspan::size): Ditto.
* testsuite/23_containers/mdspan/extents/class_mandates_neg.cc:
Update test with additional diagnostics.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Improve extents::operator==.

An interesting case to consider is:

  bool same11(const std::extents<int, dyn,   2, 3>& e1,
              const std::extents<int, dyn, dyn, 3>& e2)
  { return e1 == e2; }

Which has the following properties:

  - There's no mismatching static extents, preventing any
    short-circuiting.

  - There's a comparison between dynamic and static extents.

  - There's one trivial comparison: ... && 3 == 3.

Let E[i] denote the array of static extents, D[k] denote the array of
dynamic extents and k[i] be the index of the i-th extent in D.
(Naturally, k[i] is only meaningful if i is a dynamic extent).

The previous implementation results in assembly that's more or less a
literal translation of:

  for (i = 0; i < 3; ++i) {
    e1 = E1[i] == -1 ? D1[k1[i]] : E1[i];
    e2 = E2[i] == -1 ? D2[k2[i]] : E2[i];
    if (e1 != e2)
      return false;
  }
  return true;

While the proposed method results in assembly for

  if (D1[0] != D2[0]) return false;
  return 2 == D2[1];

i.e.

  110:  8b 17                  mov    edx,DWORD PTR [rdi]
  112:  31 c0                  xor    eax,eax
  114:  39 16                  cmp    DWORD PTR [rsi],edx
  116:  74 08                  je     120 <same11+0x10>
  118:  c3                     ret
  119:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  120:  83 7e 04 02            cmp    DWORD PTR [rsi+0x4],0x2
  124:  0f 94 c0               sete   al
  127:  c3                     ret

It has the following nice properties:

  - It eliminated the indirection D[k[i]], because k[i] is known at
    compile time. Saving us a comparison E[i] == -1 and conditionally
    loading k[i].

  - It eliminated the trivial condition 3 == 3.

The result is code that only loads the required values and performs
exactly the number of comparisons needed by the algorithm. It also
results in smaller object files. Therefore, this seems like a sensible
change. We've checked several other examples, including fully statically
determined cases and high-rank examples. The example given above
illustrates the other cases well.

The constexpr condition:

  if constexpr (!_S_is_compatible_extents<...>)
    return false;

is no longer needed, because the optimizer correctly handles this case.
However, it's retained for clarity/certainty.
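
To make the idea concrete, here is a toy model of a pack-expansion
comparison. It is a sketch under simplifying assumptions (rank at least 1,
int index type), not the libstdc++ code, and every name in it is
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

constexpr std::size_t dyn = static_cast<std::size_t>(-1);

// Toy stand-in for std::extents<int, Es...>; stores only dynamic extents.
template <std::size_t... Es>
struct toy_extents {
  static constexpr std::size_t rank = sizeof...(Es);
  static constexpr std::size_t statics[rank] = {Es...};
  static constexpr std::size_t rank_dynamic =
      (std::size_t{0} + ... + (Es == dyn ? 1 : 0));

  int dynamic[rank_dynamic == 0 ? 1 : rank_dynamic];

  // Position of extent i within the dynamic array (the k[i] of the text).
  static constexpr std::size_t dyn_index(std::size_t i) {
    std::size_t n = 0;
    for (std::size_t j = 0; j < i; ++j)
      n += (statics[j] == dyn) ? 1 : 0;
    return n;
  }

  // I is a compile-time constant, so the static/dynamic decision and the
  // index into the dynamic array are both resolved at compile time.
  template <std::size_t I>
  constexpr int extent() const {
    if constexpr (statics[I] == dyn)
      return dynamic[dyn_index(I)];
    else
      return static_cast<int>(statics[I]);
  }
};

template <std::size_t... As, std::size_t... Bs, std::size_t... Is>
constexpr bool same_extents_impl(const toy_extents<As...>& a,
                                 const toy_extents<Bs...>& b,
                                 std::index_sequence<Is...>) {
  // One comparison per index, expanded at compile time instead of looped.
  return ((a.template extent<Is>() == b.template extent<Is>()) && ...);
}

template <std::size_t... As, std::size_t... Bs>
constexpr bool same_extents(const toy_extents<As...>& a,
                            const toy_extents<Bs...>& b) {
  if constexpr (sizeof...(As) != sizeof...(Bs))
    return false;
  else
    return same_extents_impl(a, b, std::make_index_sequence<sizeof...(As)>{});
}

constexpr toy_extents<dyn, 2, 3> kE1{{7}};
constexpr toy_extents<dyn, dyn, 3> kE2{{7, 2}};
constexpr toy_extents<dyn, dyn, 3> kE3{{7, 4}};
static_assert(same_extents(kE1, kE2), "7x2x3 equals 7x2x3");
static_assert(!same_extents(kE1, kE3), "7x2x3 differs from 7x4x3");
```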

libstdc++-v3/ChangeLog:

* include/std/mdspan (extents::operator==): Replace loop with
pack expansion.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Reduce indirection in extents::extent.

In both fully static and dynamic extents the comparison
  static_extent(i) == dynamic_extent
is known at compile time. As a result, extents::extent doesn't
need to perform the check at runtime.

An illustrative example is:

  using E = std::extents<int, 3, 5, 7, 11, 13, 17>;
  int required_span_size(const typename Layout::mapping<E>& m)
  { return m.required_span_size(); }

Prior to this commit the generated code (on -O2) is:

2a0:  b9 01 00 00 00         mov    ecx,0x1
2a5:  31 d2                  xor    edx,edx
2a7:  66 66 2e 0f 1f 84 00   data16 cs nop WORD PTR [rax+rax*1+0x0]
2ae:  00 00 00 00
2b2:  66 66 2e 0f 1f 84 00   data16 cs nop WORD PTR [rax+rax*1+0x0]
2b9:  00 00 00 00
2bd:  0f 1f 00               nop    DWORD PTR [rax]
2c0:  48 8b 04 d5 00 00 00   mov    rax,QWORD PTR [rdx*8+0x0]
2c7:  00
2c8:  48 83 f8 ff            cmp    rax,0xffffffffffffffff
2cc:  0f 84 00 00 00 00      je     2d2 <required_span_size_6d_static+0x32>
2d2:  83 e8 01               sub    eax,0x1
2d5:  0f af 04 97            imul   eax,DWORD PTR [rdi+rdx*4]
2d9:  48 83 c2 01            add    rdx,0x1
2dd:  01 c1                  add    ecx,eax
2df:  48 83 fa 06            cmp    rdx,0x6
2e3:  75 db                  jne    2c0 <required_span_size_6d_static+0x20>
2e5:  89 c8                  mov    eax,ecx
2e7:  c3                     ret

which is a scalar loop, and notably includes the check

  308:  48 83 f8 ff            cmp    rax,0xffffffffffffffff

to assert that the static extent is indeed not -1. Note that on -O3 the
optimizer eliminates the comparison and generates a sequence of scalar
operations: lea, shl, add and mov. The aim of this commit is to
eliminate this comparison at -O2 as well. With the optimization applied
we get:

  2e0:  f3 0f 6f 0f            movdqu xmm1,XMMWORD PTR [rdi]
  2e4:  66 0f 6f 15 00 00 00   movdqa xmm2,XMMWORD PTR [rip+0x0]
  2eb:  00
  2ec:  8b 57 10               mov    edx,DWORD PTR [rdi+0x10]
  2ef:  66 0f 6f c1            movdqa xmm0,xmm1
  2f3:  66 0f 73 d1 20         psrlq  xmm1,0x20
  2f8:  66 0f f4 c2            pmuludq xmm0,xmm2
  2fc:  66 0f 73 d2 20         psrlq  xmm2,0x20
  301:  8d 14 52               lea    edx,[rdx+rdx*2]
  304:  66 0f f4 ca            pmuludq xmm1,xmm2
  308:  66 0f 70 c0 08         pshufd xmm0,xmm0,0x8
  30d:  66 0f 70 c9 08         pshufd xmm1,xmm1,0x8
  312:  66 0f 62 c1            punpckldq xmm0,xmm1
  316:  66 0f 6f c8            movdqa xmm1,xmm0
  31a:  66 0f 73 d9 08         psrldq xmm1,0x8
  31f:  66 0f fe c1            paddd  xmm0,xmm1
  323:  66 0f 6f c8            movdqa xmm1,xmm0
  327:  66 0f 73 d9 04         psrldq xmm1,0x4
  32c:  66 0f fe c1            paddd  xmm0,xmm1
  330:  66 0f 7e c0            movd   eax,xmm0
  334:  8d 54 90 01            lea    edx,[rax+rdx*4+0x1]
  338:  8b 47 14               mov    eax,DWORD PTR [rdi+0x14]
  33b:  c1 e0 04               shl    eax,0x4
  33e:  01 d0                  add    eax,edx
  340:  c3                     ret

This shows that eliminating the trivial comparison unlocks a new set of
optimizations, i.e. SIMD vectorization. In particular, the loop has been
vectorized by loading the first four constants from aligned memory and
the first four strides from non-aligned memory, then computing the
product and reduction. It interleaves the above with computing 1 +
12*S[4] + 16*S[5] (as scalar operations) and then finishes the reduction.
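
For reference, the quantity being computed is 1 plus the sum over all
dimensions of (extent - 1) * stride. A scalar sketch of that formula follows;
the function and constant names are hypothetical, and the two stride arrays
are just sample inputs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// required_span_size for a strided layout: one past the largest offset.
template <std::size_t N>
constexpr int required_span_size_sketch(const std::array<int, N>& extents,
                                        const std::array<int, N>& strides) {
  int size = 1;
  for (std::size_t i = 0; i < N; ++i) {
    if (extents[i] == 0)
      return 0;  // an empty extent means an empty span
    size += (extents[i] - 1) * strides[i];
  }
  return size;
}

// The static extents from the example: (13 - 1) and (17 - 1) give the
// factors 12 and 16 that appear in the scalar tail.
constexpr std::array<int, 6> kStaticExtents{3, 5, 7, 11, 13, 17};
constexpr std::array<int, 6> kUnitStrides{1, 1, 1, 1, 1, 1};
constexpr std::array<int, 6> kLeftStrides{1, 3, 15, 105, 1155, 15015};
```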

A similar effect can be observed for fully dynamic extents.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__mdspan::__all_static): New function.
(__mdspan::_StaticExtents::_S_is_dyn): Inline and eliminate.
(__mdspan::_ExtentsStorage::_S_is_dynamic): New method.
(__mdspan::_ExtentsStorage::_M_extent): Use _S_is_dynamic.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Improve nearly fully dynamic extents in mdspan.

One previous commit optimized fully dynamic extents; and another
refactored __size such that __fwd_prod is valid for __r = 0, ..., rank
(exclusive).

Therefore, by noticing that __rev_prod (and __fwd_prod) never accesses
the first (or last) extent, one can avoid pre-computing partial products
of static extents in those cases, if all other extents are dynamic.

We check that the size of the reference object file decreases further
and the .rodata sections for

__fwd_prod<dyn, ..., dyn, 11>
__rev_prod<3, dyn, ..., dyn>

are absent.
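
As a rough illustration of the relaxed condition, the sketch below tests
for the two extent patterns named above. The names dyn,
dynamic_except_last and dynamic_except_first are hypothetical and are not
the identifiers used in libstdc++; the code assumes C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for std::dynamic_extent.
constexpr std::size_t dyn = static_cast<std::size_t>(-1);

// True if every extent other than the last is dynamic, i.e. the pattern
// (dyn, ..., dyn, X).  A forward product over i < r with r < rank then
// never touches a static extent.
template <std::size_t N>
constexpr bool dynamic_except_last(const std::array<std::size_t, N>& e)
{
  for (std::size_t i = 0; i + 1 < N; ++i)
    if (e[i] != dyn)
      return false;
  return true;
}

// Mirror image for reverse products: the pattern (X, dyn, ..., dyn).
template <std::size_t N>
constexpr bool dynamic_except_first(const std::array<std::size_t, N>& e)
{
  for (std::size_t i = 1; i < N; ++i)
    if (e[i] != dyn)
      return false;
  return true;
}

constexpr std::array<std::size_t, 4> fwd_case{dyn, dyn, dyn, 11};
constexpr std::array<std::size_t, 4> rev_case{3, dyn, dyn, dyn};
static_assert(dynamic_except_last(fwd_case));
static_assert(!dynamic_except_first(fwd_case));
static_assert(dynamic_except_first(rev_case));
static_assert(!dynamic_except_last(rev_case));
```

When either predicate holds, every static partial product that could be
read is 1, which is why no table needs to be emitted.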

libstdc++-v3/ChangeLog:

* include/std/mdspan (__fwd_prods): Relax condition for fully-dynamic
extents to cover (dyn, ..., dyn, X).
(__rev_partial_prods): Analogous for (X, dyn, ..., dyn).

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Improve fully dynamic extents in mdspan.

In mdspan related code, for extents with no static extents, i.e. only
dynamic extents, the following simplifications can be made:

  - The array of dynamic extents has size rank.
  - The two arrays dynamic-index and dynamic-index-inv become
  trivial, e.g. k[i] == i.
  - All elements of the arrays __{fwd,rev}_partial_prods are 1.

This commit eliminates the arrays for dynamic-index, dynamic-index-inv
and __{fwd,rev}_partial_prods. It also removes the indirection k[i] == i
from the source code, which matters less because the optimizer is
(often) capable of eliminating the indirection.
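
To see why the dynamic-index array becomes trivial, here is a minimal
sketch that models it as the number of dynamic extents preceding each
position. The names dyn and dynamic_index are hypothetical; this is a
model of the idea under C++17, not the libstdc++ implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for std::dynamic_extent.
constexpr std::size_t dyn = static_cast<std::size_t>(-1);

// k[i]: number of dynamic extents strictly before position i, i.e. the
// slot that extent i would occupy in the array of dynamic extents.
template <std::size_t N>
constexpr std::array<std::size_t, N>
dynamic_index(const std::array<std::size_t, N>& exts)
{
  std::array<std::size_t, N> k{};
  std::size_t count = 0;
  for (std::size_t i = 0; i < N; ++i)
    {
      k[i] = count;
      if (exts[i] == dyn)
        ++count;
    }
  return k;
}

// Mixed extents: the mapping is non-trivial.
constexpr std::array<std::size_t, 4> mixed{3, dyn, 5, dyn};
static_assert(dynamic_index(mixed)[1] == 0);
static_assert(dynamic_index(mixed)[3] == 1);

// Fully dynamic extents: k[i] == i, so the table carries no information.
constexpr std::array<std::size_t, 4> all_dyn{dyn, dyn, dyn, dyn};
static_assert(dynamic_index(all_dyn)[0] == 0);
static_assert(dynamic_index(all_dyn)[3] == 3);
```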

To check if it's working we look at:

  using E2 = std::extents<int, dyn, dyn, dyn, dyn>;

  int stride_left_E2(const std::layout_left::mapping<E2>& m, size_t r)
  { return m.stride(r); }

which generates the following

  0000000000000190 <stride_left_E2>:
  190:  48 c1 e6 02            shl    rsi,0x2
  194:  74 22                  je     1b8 <stride_left_E2+0x28>
  196:  48 01 fe               add    rsi,rdi
  199:  b8 01 00 00 00         mov    eax,0x1
  19e:  66 90                  xchg   ax,ax
  1a0:  48 63 17               movsxd rdx,DWORD PTR [rdi]
  1a3:  48 83 c7 04            add    rdi,0x4
  1a7:  48 0f af c2            imul   rax,rdx
  1ab:  48 39 fe               cmp    rsi,rdi
  1ae:  75 f0                  jne    1a0 <stride_left_E2+0x10>
  1b0:  c3                     ret
  1b1:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  1b8:  b8 01 00 00 00         mov    eax,0x1
  1bd:  c3                     ret

We see that:

  - There's no code to load the partial product of static extents.
  - There's no indirection D[k[i]], it's just D[i] (as before).

On a test file which computes both mapping::stride(r) and
mapping::required_span_size, we check for static storage with

  objdump -h

and no longer see the NTTP _Extents, or anything related to
_StaticExtents, __fwd_partial_prods or __rev_partial_prods. We also
check that the size of the reference object file (described three
commits prior) is reduced by a few percent from 41.9kB to 39.4kB.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__mdspan::__all_dynamic): New function.
(__mdspan::_StaticExtents::_S_dynamic_index): Convert to method.
(__mdspan::_StaticExtents::_S_dynamic_index_inv): Ditto.
(__mdspan::_StaticExtents): New specialization for fully dynamic
extents.
(__mdspan::__fwd_prod): New constexpr if branch to avoid
instantiating __fwd_partial_prods.
(__mdspan::__rev_prod): Ditto.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Improve low-rank layout_{left,right}::stride.

The methods layout_{left,right}::mapping::stride are defined
as

  \prod_{i = 0}^r E[i]
  \prod_{i = r+1}^n E[i]

This is computed as the product of a precomputed static product and the
product of the required dynamic extents.

Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one dynamic extent, the generated code loads
two values; and then runs the loop over at most one element, e.g. for
stride_left_d5 defined below the generated code is:

220:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
227:  00
228:  31 d2                  xor    edx,edx
22a:  48 85 c0               test   rax,rax
22d:  74 23                  je     252 <stride_left_d5+0x32>
22f:  48 8b 0c f5 00 00 00   mov    rcx,QWORD PTR [rsi*8+0x0]
236:  00
237:  48 c1 e1 02            shl    rcx,0x2
23b:  74 13                  je     250 <stride_left_d5+0x30>
23d:  48 01 f9               add    rcx,rdi
240:  48 63 17               movsxd rdx,DWORD PTR [rdi]
243:  48 83 c7 04            add    rdi,0x4
247:  48 0f af c2            imul   rax,rdx
24b:  48 39 f9               cmp    rcx,rdi
24e:  75 f0                  jne    240 <stride_left_d5+0x20>
250:  89 c2                  mov    edx,eax
252:  89 d0                  mov    eax,edx
254:  c3                     ret

If there are no dynamic extents, it simply loads the precomputed product
of static extents.

For rank == 1 the answer is the constant `1`; for rank == 2 it's either 1 or
extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.

Consider,

  using Ed = std::extents<int, dyn>;
  int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
  { return m.stride(r); }

  using E3d = std::extents<int, 3, dyn>;
  int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t r)
  { return m.stride(r); }

  using Ed5 = std::extents<int, dyn, 5>;
  int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t r)
  { return m.stride(r); }

The optimized code for these three cases is:

  0000000000000060 <stride_left_d>:
  60:  b8 01 00 00 00         mov    eax,0x1
  65:  c3                     ret

  0000000000000090 <stride_left_3d>:
  90:  48 83 fe 01            cmp    rsi,0x1
  94:  19 c0                  sbb    eax,eax
  96:  83 e0 fe               and    eax,0xfffffffe
  99:  83 c0 03               add    eax,0x3
  9c:  c3                     ret

  00000000000000a0 <stride_left_d5>:
  a0:  b8 01 00 00 00         mov    eax,0x1
  a5:  48 85 f6               test   rsi,rsi
  a8:  74 02                  je     ac <stride_left_d5+0xc>
  aa:  8b 07                  mov    eax,DWORD PTR [rdi]
  ac:  c3                     ret

For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula, or conditionally loads one
value. In all cases involving a dynamic extent the new code appears to
do clearly less work, both in terms of computation and loads.
In cases not involving a dynamic extent, it replaces loading one value
with a branchless sequence of four instructions.
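
The rank == 2 case can be sketched as follows. The first function is the
plain select; the second mirrors the cmp/sbb/and/add sequence shown for
stride_left_3d. Both function names and the extent0 parameter are
hypothetical and only model the generated code.

```cpp
#include <cassert>
#include <cstddef>

// Rank-2 layout_left: stride(0) == 1 and stride(1) == extent(0).
// `extent0` is a hypothetical parameter standing in for extents.extent(0).
constexpr int stride_left_rank2(int extent0, std::size_t r)
{ return r == 0 ? 1 : extent0; }

// Branchless form for the case where extent(0) is the constant 3.
constexpr int stride_left_3d_branchless(std::size_t r)
{
  int mask = -static_cast<int>(r < 1);  // cmp/sbb: -1 if r == 0, else 0
  return (mask & -2) + 3;               // and 0xfffffffe; add 3
}

static_assert(stride_left_3d_branchless(0) == 1);
static_assert(stride_left_3d_branchless(1) == 3);
static_assert(stride_left_rank2(3, 0) == stride_left_3d_branchless(0));
static_assert(stride_left_rank2(3, 1) == stride_left_3d_branchless(1));
```

For r == 0 the mask is all ones, so (-1 & -2) + 3 == 1; otherwise the mask
is zero and the result is 3.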

This commit also refactors __size to not use any of the precomputed
arrays. This prevents instantiating __{fwd,rev}_partial_prods for
low-rank extents. This results in a further size reduction of a
reference object file (described two commits prior) by 9% from 46.0kB to
41.9kB.

In a prior commit we optimized __size to produce better object code by
precomputing the static products. This refactor enables the optimizer to
generate the same optimized code.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__mdspan::__fwd_prod): Optimize
for rank <= 2.
(__mdspan::__rev_prod): Ditto.
(__mdspan::__size): Refactor to use a pre-computed product, not
a partial product.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Precompute products of static extents.

Let E denote a multi-dimensional extent; n the rank of E; r = 0, ...,
n; E[i] the i-th extent; and D[k] be the (possibly empty) array of
dynamic extents.

The two partial products for r = 0, ..., n:

  \prod_{i = 0}^r E[i]     (fwd)
  \prod_{i = r+1}^n E[i]   (rev)

can be computed as the product of static and dynamic extents. The static
fwd and rev product can be computed at compile time for all values of r.
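
A minimal sketch of such compile-time tables, assuming C++17. The names
dyn, fwd_static_prods and rev_static_prods are hypothetical. The sketch
also assumes that fwd[r] covers the extents with index i < r and rev[r]
those with i > r, with dynamic extents counted as 1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for std::dynamic_extent.
constexpr std::size_t dyn = static_cast<std::size_t>(-1);

// fwd[r]: product of the static extents E[i] with i < r.
template <std::size_t N>
constexpr std::array<std::size_t, N + 1>
fwd_static_prods(const std::array<std::size_t, N>& e)
{
  std::array<std::size_t, N + 1> p{};
  p[0] = 1;
  for (std::size_t i = 0; i < N; ++i)
    p[i + 1] = p[i] * (e[i] == dyn ? 1 : e[i]);
  return p;
}

// rev[r]: product of the static extents E[i] with i > r.
template <std::size_t N>
constexpr std::array<std::size_t, N>
rev_static_prods(const std::array<std::size_t, N>& e)
{
  std::array<std::size_t, N> p{};
  std::size_t acc = 1;
  for (std::size_t r = N; r-- > 0;)
    {
      p[r] = acc;
      acc *= (e[r] == dyn ? 1 : e[r]);
    }
  return p;
}

constexpr std::array<std::size_t, 7> exts{3, 5, dyn, dyn, dyn, 7, dyn};
constexpr auto fwd = fwd_static_prods(exts);
constexpr auto rev = rev_static_prods(exts);
static_assert(fwd[0] == 1 && fwd[2] == 15 && fwd[7] == 105);
static_assert(rev[0] == 35 && rev[4] == 7 && rev[5] == 1);
```

At run time only the dynamic part of the product remains to be computed.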

Three methods are directly affected by this optimization:

  layout_left::mapping::stride
  layout_right::mapping::stride
  mdspan::size

We'll check the generated code (-O2) for all three methods for generic,
(artificially) high-dimensional extents.

Consider a generic case:

  using Extents = std::extents<int, 3, 5, dyn, dyn, dyn, 7, dyn>;

  int stride_left(const std::layout_left::mapping<Extents>& m, size_t r)
  { return m.stride(r); }

The code generated prior to this commit:

  4f0:  66 0f 6f 05 00 00 00   movdqa xmm0,XMMWORD PTR [rip+0x0]        # 4f8
  4f7:  00
  4f8:  48 83 c6 01            add    rsi,0x1
  4fc:  48 c7 44 24 e8 ff ff   mov    QWORD PTR [rsp-0x18],0xffffffffffffffff
  503:  ff ff
  505:  48 8d 04 f5 00 00 00   lea    rax,[rsi*8+0x0]
  50c:  00
  50d:  0f 29 44 24 b8         movaps XMMWORD PTR [rsp-0x48],xmm0
  512:  66 0f 76 c0            pcmpeqd xmm0,xmm0
  516:  0f 29 44 24 c8         movaps XMMWORD PTR [rsp-0x38],xmm0
  51b:  66 0f 6f 05 00 00 00   movdqa xmm0,XMMWORD PTR [rip+0x0]        # 523
  522:  00
  523:  0f 29 44 24 d8         movaps XMMWORD PTR [rsp-0x28],xmm0
  528:  48 83 f8 38            cmp    rax,0x38
  52c:  74 72                  je     5a0 <stride_right_E1+0xb0>
  52e:  48 8d 54 04 b8         lea    rdx,[rsp+rax*1-0x48]
  533:  4c 8d 4c 24 f0         lea    r9,[rsp-0x10]
  538:  b8 01 00 00 00         mov    eax,0x1
  53d:  0f 1f 00               nop    DWORD PTR [rax]
  540:  48 8b 0a               mov    rcx,QWORD PTR [rdx]
  543:  49 89 c0               mov    r8,rax
  546:  4c 0f af c1            imul   r8,rcx
  54a:  48 83 f9 ff            cmp    rcx,0xffffffffffffffff
  54e:  49 0f 45 c0            cmovne rax,r8
  552:  48 83 c2 08            add    rdx,0x8
  556:  49 39 d1               cmp    r9,rdx
  559:  75 e5                  jne    540 <stride_right_E1+0x50>
  55b:  48 85 c0               test   rax,rax
  55e:  74 38                  je     598 <stride_right_E1+0xa8>
  560:  48 8b 14 f5 00 00 00   mov    rdx,QWORD PTR [rsi*8+0x0]
  567:  00
  568:  48 c1 e2 02            shl    rdx,0x2
  56c:  48 83 fa 10            cmp    rdx,0x10
  570:  74 1e                  je     590 <stride_right_E1+0xa0>
  572:  48 8d 4f 10            lea    rcx,[rdi+0x10]
  576:  48 01 d7               add    rdi,rdx
  579:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  580:  48 63 17               movsxd rdx,DWORD PTR [rdi]
  583:  48 83 c7 04            add    rdi,0x4
  587:  48 0f af c2            imul   rax,rdx
  58b:  48 39 f9               cmp    rcx,rdi
  58e:  75 f0                  jne    580 <stride_right_E1+0x90>
  590:  c3                     ret
  591:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  598:  c3                     ret
  599:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  5a0:  b8 01 00 00 00         mov    eax,0x1
  5a5:  eb b9                  jmp    560 <stride_right_E1+0x70>
  5a7:  66 0f 1f 84 00 00 00   nop    WORD PTR [rax+rax*1+0x0]
  5ae:  00 00

which seems to be performing:

  preparatory_work();
  ret = 1
  for(i = 0; i < rank; ++i)
    tmp = ret * E[i]
    if E[i] != -1
      ret = tmp
  for(i = 0; i < rank_dynamic; ++i)
    ret *= D[i]

This commit reduces it down to:

  270:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
  277:  00
  278:  31 d2                  xor    edx,edx
  27a:  48 85 c0               test   rax,rax
  27d:  74 33                  je     2b2 <stride_right_E1+0x42>
  27f:  48 8b 14 f5 00 00 00   mov    rdx,QWORD PTR [rsi*8+0x0]
  286:  00
  287:  48 c1 e2 02            shl    rdx,0x2
  28b:  48 83 fa 10            cmp    rdx,0x10
  28f:  74 1f                  je     2b0 <stride_right_E1+0x40>
  291:  48 8d 4f 10            lea    rcx,[rdi+0x10]
  295:  48 01 d7               add    rdi,rdx
  298:  0f 1f 84 00 00 00 00   nop    DWORD PTR [rax+rax*1+0x0]
  29f:  00
  2a0:  48 63 17               movsxd rdx,DWORD PTR [rdi]
  2a3:  48 83 c7 04            add    rdi,0x4
  2a7:  48 0f af c2            imul   rax,rdx
  2ab:  48 39 f9               cmp    rcx,rdi
  2ae:  75 f0                  jne    2a0 <stride_right_E1+0x30>
  2b0:  89 c2                  mov    edx,eax
  2b2:  89 d0                  mov    eax,edx
  2b4:  c3                     ret

Loosely speaking this does the following:

  1. Load the starting position k in the array of dynamic extents; and
     return if possible.
  2. Load the partial product of static extents.
  3. Compute \prod_{i = k}^d D[i] in a loop, where d is the number of
     dynamic extents.

It shows that the span used for passing in the dynamic extents is
completely eliminated; and the fact that the product always runs to the
end of the array of dynamic extents is used by the compiler to eliminate
one indirection to determine the end position in the array of dynamic
extents.
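
The steps above can be sketched as follows for layout_right and
extents<int, 3, 5, dyn, dyn, dyn, 7, dyn>. The tables rev_static and
dyn_begin are hypothetical and written out by hand here; the sketch
models the shape of the computation and is not the libstdc++ code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// rev_static[r]: product of the static extents E[i] with i > r.
constexpr std::array<long, 7> rev_static{35, 7, 7, 7, 7, 1, 1};
// dyn_begin[r]: index in D of the first dynamic extent with i > r.
constexpr std::array<std::size_t, 7> dyn_begin{0, 0, 1, 2, 3, 3, 4};

// D holds the four dynamic extents in order of appearance.
constexpr long stride_right(const std::array<int, 4>& D, std::size_t r)
{
  long ret = rev_static[r];
  if (ret == 0)  // a zero static extent makes the whole product zero
    return 0;
  for (std::size_t j = dyn_begin[r]; j < D.size(); ++j)
    ret *= D[j];  // the loop always runs to the end of D
  return ret;
}

constexpr std::array<int, 4> dyn_exts{2, 3, 4, 6};
static_assert(stride_right(dyn_exts, 6) == 1);
static_assert(stride_right(dyn_exts, 4) == 7 * 6);
static_assert(stride_right(dyn_exts, 0) == 35L * 2 * 3 * 4 * 6);
```

Because the upper bound of the loop is always the end of D, only the
starting index has to be looked up per call.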

The analogous code is generated for layout_left.

Next, consider

  using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
  int size2(const std::mdspan<double, E2>& md)
  { return md.size(); }

On the immediately preceding commit, the generated code is

  10:  66 0f 6f 05 00 00 00   movdqa xmm0,XMMWORD PTR [rip+0x0]        # 18
  17:  00
  18:  49 89 f8               mov    r8,rdi
  1b:  48 8d 44 24 b8         lea    rax,[rsp-0x48]
  20:  48 c7 44 24 e8 0b 00   mov    QWORD PTR [rsp-0x18],0xb
  27:  00 00
  29:  48 8d 7c 24 f0         lea    rdi,[rsp-0x10]
  2e:  ba 01 00 00 00         mov    edx,0x1
  33:  0f 29 44 24 b8         movaps XMMWORD PTR [rsp-0x48],xmm0
  38:  66 0f 76 c0            pcmpeqd xmm0,xmm0
  3c:  0f 29 44 24 c8         movaps XMMWORD PTR [rsp-0x38],xmm0
  41:  66 0f 6f 05 00 00 00   movdqa xmm0,XMMWORD PTR [rip+0x0]        # 49
  48:  00
  49:  0f 29 44 24 d8         movaps XMMWORD PTR [rsp-0x28],xmm0
  4e:  66 66 2e 0f 1f 84 00   data16 cs nop WORD PTR [rax+rax*1+0x0]
  55:  00 00 00 00
  59:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
  60:  48 8b 08               mov    rcx,QWORD PTR [rax]
  63:  48 89 d6               mov    rsi,rdx
  66:  48 0f af f1            imul   rsi,rcx
  6a:  48 83 f9 ff            cmp    rcx,0xffffffffffffffff
  6e:  48 0f 45 d6            cmovne rdx,rsi
  72:  48 83 c0 08            add    rax,0x8
  76:  48 39 c7               cmp    rdi,rax
  79:  75 e5                  jne    60 <size2+0x50>
  7b:  48 85 d2               test   rdx,rdx
  7e:  74 18                  je     98 <size2+0x88>
  80:  49 63 00               movsxd rax,DWORD PTR [r8]
  83:  49 63 48 04            movsxd rcx,DWORD PTR [r8+0x4]
  87:  48 0f af c1            imul   rax,rcx
  8b:  41 0f af 40 08         imul   eax,DWORD PTR [r8+0x8]
  90:  0f af c2               imul   eax,edx
  93:  c3                     ret
  94:  0f 1f 40 00            nop    DWORD PTR [rax+0x0]
  98:  31 c0                  xor    eax,eax
  9a:  c3                     ret

which is needlessly long. The current commit reduces it down to:

  10:  48 63 07               movsxd rax,DWORD PTR [rdi]
  13:  48 63 57 04            movsxd rdx,DWORD PTR [rdi+0x4]
  17:  48 0f af c2            imul   rax,rdx
  1b:  0f af 47 08            imul   eax,DWORD PTR [rdi+0x8]
  1f:  69 c0 83 04 00 00      imul   eax,eax,0x483
  25:  c3                     ret

Which simply computes the product:

  D[0] * D[1] * D[2] * const

where const is the product of all static extents. This means the loop to
compute the product of dynamic extents has been fully unrolled and
all constants are precomputed.
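
As a quick check of the immediate 0x483 in the final imul: the static
extents of E2 are 3, 5, 7 and 11, and their product is 1155 == 0x483.
The function size2_model below is a hypothetical model of the optimized
code, not the library's implementation.

```cpp
#include <cassert>

// Static extents of std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>.
constexpr int static_prod = 3 * 5 * 7 * 11;
static_assert(static_prod == 0x483);

// Model of the optimized size(): D holds the three dynamic extents.
constexpr int size2_model(const int (&D)[3])
{ return D[0] * D[1] * D[2] * static_prod; }
```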

The size of the object file described in the previous commit reduces
by 17% from 55.8kB to 46.0kB.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__mdspan::__static_prod): New function.
(__mdspan::__fwd_partial_prods): Constexpr array of partial
forward products.
(__mdspan::__rev_partial_prods): Same for reverse partial
products.
(__mdspan::__static_extents_prod): Delete function.
(__mdspan::__extents_prod): Renamed from __exts_prod and refactored.
* include/std/mdspan (__mdspan::__fwd_prod): Compute as the
product of the pre-computed static product and the product of dynamic
extents.
(__mdspan::__rev_prod): Ditto.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

libstdc++: Reduce template instantiations in <mdspan>.

In mdspan related code involving static extents, often the IndexType is
part of the template parameters, even though it's not needed.

This commit extracts the parts of _ExtentsStorage not related to
IndexType into a separate class _StaticExtents.

It also prefers passing the array of static extents, instead of the
whole extents object where possible.

The size of an object file compiled with -O2 that instantiates
  Layout::mapping<extents<IndexType, Indices...>>::stride
  Layout::mapping<extents<IndexType, Indices...>>::required_span_size
for the product of
  - eight IndexTypes,
  - three Layouts,
  - nine choices of Indices...
decreases by 19% from 69.2kB to 55.8kB.

libstdc++-v3/ChangeLog:

* include/std/mdspan (__mdspan::_StaticExtents): Extract non IndexType
related code from _ExtentsStorage.
(__mdspan::_ExtentsStorage): Use _StaticExtents.
(__mdspan::__static_extents): Return reference to NTTP of _StaticExtents.
(__mdspan::__contains_zero): New overload.
(__mdspan::__exts_prod, __mdspan::__static_quotient): Use span to avoid
copying __sta_exts.

Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>

Merge BB and loop path in vect_analyze_stmt

We now have common patterns for most of the vectorizable_* calls, so
merge them. This also avoids calling vectorizable_early_exit for BB
vect and clarifies the signatures of it and vectorizable_phi.

* tree-vectorizer.h (vectorizable_phi): Take bb_vec_info.
(vectorizable_early_exit): Take loop_vec_info.
* tree-vect-loop.cc (vectorizable_phi): Adjust.
* tree-vect-slp.cc (vect_slp_analyze_operations): Likewise.
(vectorize_slp_instance_root_stmt): Likewise.
* tree-vect-stmts.cc (vectorizable_early_exit): Likewise.
(vect_transform_stmt): Likewise.
(vect_analyze_stmt): Merge the sequences of vectorizable_*
where common.

MAINTAINERS: Update my email address and stand down as AArch64 maintainer

Today is my last working day at Arm, so this patch switches my
MAINTAINERS entries to my personal email address. (It turns out
that I never updated some of the later entries...oops)

In order to avoid setting false expectations, and to try to avoid
getting in the way, I'm also standing down as an AArch64 maintainer,
effective EOB today. I might still end up reviewing the odd AArch64
patch under global reviewership though, depending on how things go :)

ChangeLog:

* MAINTAINERS: Update my email address and stand down as AArch64
maintainer.

Fortran: gfortran PDT component access [PR84122, PR85942]

2025-08-21 Paul Thomas <pault@gcc.gnu.org>

gcc/fortran
PR fortran/84122
* parse.cc (parse_derived): PDT type parameters are not allowed
an explicit access specification and must appear before a
PRIVATE statement. If a PRIVATE statement is seen, mark all the
other components as PRIVATE.

PR fortran/85942
* simplify.cc (get_kind): Convert a PDT KIND component into a
specification expression using the default initializer.

gcc/testsuite/
PR fortran/84122
* gfortran.dg/pdt_38.f03: New test.

PR fortran/85942
* gfortran.dg/pdt_39.f03: New test.

c++: pointer to auto member function [PR120757]

Here r13-1210 correctly changed &A<int>::foo to not be considered
type-dependent, but tsubst_expr of the OFFSET_REF got confused trying to
tsubst a type that involved auto. Fixed by getting the type from the
member rather than tsubst.

PR c++/120757

gcc/cp/ChangeLog:

* pt.cc (tsubst_expr) [OFFSET_REF]: Don't tsubst the type.

gcc/testsuite/ChangeLog:

* g++.dg/cpp1y/auto-fn66.C: New test.

Daily bump.

c++: lambda capture and shadowing [PR121553]

P2036 says that this:

[x=1]{ int x; }

should be rejected, but with my P2036 patch we started giving an error
for the attached testcase as well, breaking Dolphin. So let's
keep the error only for init-captures.

PR c++/121553

gcc/cp/ChangeLog:

* name-lookup.cc (check_local_shadow): Check !is_normal_capture_proxy.

gcc/testsuite/ChangeLog:

* g++.dg/warn/Wshadow-19.C: Revert P2036 changes.
* g++.dg/warn/Wshadow-6.C: Likewise.
* g++.dg/warn/Wshadow-20.C: New test.
* g++.dg/warn/Wshadow-21.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>

Regenerate common.opt.urls for -fdiagnostics-show-context

When the -fdiagnostics-show-context[=DEPTH] options were added, they were
documented, but common.opt.urls wasn't regenerated.

gcc/ChangeLog:

* common.opt.urls: Regenerate.

Provide new option -fdiagnostics-show-context=N for -Warray-bounds, -Wstringop-* warnings [PR109071,PR85788,PR88771,PR106762,PR108770,PR115274,PR117179]

'-fdiagnostics-show-context[=DEPTH]'
'-fno-diagnostics-show-context'
     With this option, the compiler might print the interesting control
     flow chain that guards the basic block of the statement which has
     the warning.  DEPTH is the maximum depth of the control flow chain.
     Currently, the list of the impacted warning options includes:
     '-Warray-bounds', '-Wstringop-overflow', '-Wstringop-overread',
     '-Wstringop-truncation', and '-Wrestrict'.  More warning options
     might be added to this list in future releases.  The forms
     '-fdiagnostics-show-context' and '-fno-diagnostics-show-context'
     are aliases for '-fdiagnostics-show-context=1' and
     '-fdiagnostics-show-context=0', respectively.

For example:

$ cat t.c
extern void warn(void);
static inline void assign(int val, int *regs, int *index)
{
  if (*index >= 4)
    warn();
  *regs = val;
}
struct nums {int vals[4];};

void sparx5_set (int *ptr, struct nums *sg, int index)
{
  int *val = &sg->vals[index];

  assign(0,    ptr, &index);
  assign(*val, ptr, &index);
}

$ gcc -Wall -O2  -c -o t.o t.c
t.c: In function ‘sparx5_set’:
t.c:12:23: warning: array subscript 4 is above array bounds of ‘int[4]’ [-Warray-bounds=]
   12 |   int *val = &sg->vals[index];
      |        ~~~~~~~~^~~~~~~
t.c:8:18: note: while referencing ‘vals’
    8 | struct nums {int vals[4];};
      |   ^~~~

In the above, although the warning is correct in theory, the warning message
itself is confusing to the end user since it contains information that cannot
be connected to the source code directly.

It would be a nice improvement to add more information to the warning message
to report where such an index value comes from.

With the new option -fdiagnostics-show-context=1, the warning message for
the above test case is now:

$ gcc -Wall -O2 -fdiagnostics-show-context=1 -c -o t.o t.c
t.c: In function ‘sparx5_set’:
t.c:12:23: warning: array subscript 4 is above array bounds of ‘int[4]’ [-Warray-bounds=]
   12 |   int *val = &sg->vals[index];
      |        ~~~~~~~~^~~~~~~
  ‘sparx5_set’: events 1-2
    4 |   if (*index >= 4)
      |      ^
      |      |
      |      (1) when the condition is evaluated to true
......
   12 |   int *val = &sg->vals[index];
      |        ~~~~~~~~~~~~~~~
      |        |
      |        (2) warning happens here
t.c:8:18: note: while referencing ‘vals’
    8 | struct nums {int vals[4];};
      |   ^~~~

PR tree-optimization/109071
PR tree-optimization/85788
PR tree-optimization/88771
PR tree-optimization/106762
PR tree-optimization/108770
PR tree-optimization/115274
PR tree-optimization/117179

gcc/ChangeLog:

* Makefile.in (OBJS): Add diagnostic-context-rich-location.o.
* common.opt (fdiagnostics-show-context): New option.
(fdiagnostics-show-context=): New option.
* diagnostic-context-rich-location.cc: New file.
* diagnostic-context-rich-location.h: New file.
* doc/invoke.texi (fdiagnostics-details): Add
documentation for the new options.
* gimple-array-bounds.cc (check_out_of_bounds_and_warn): Add
one new parameter. Use rich location with details for warning_at.
(array_bounds_checker::check_array_ref): Use rich location with
details for warning_at.
(array_bounds_checker::check_mem_ref): Add one new parameter.
Use rich location with details for warning_at.
(array_bounds_checker::check_addr_expr): Use rich location with
move_history_diagnostic_path for warning_at.
(array_bounds_checker::check_array_bounds): Call check_mem_ref with
one more parameter.
* gimple-array-bounds.h: Update prototype for check_mem_ref.
* gimple-ssa-warn-access.cc (warn_string_no_nul): Use rich location
with details for warning_at.
(maybe_warn_nonstring_arg): Likewise.
(maybe_warn_for_bound): Likewise.
(warn_for_access): Likewise.
(check_access): Likewise.
(pass_waccess::check_strncat): Likewise.
(pass_waccess::maybe_check_access_sizes): Likewise.
* gimple-ssa-warn-restrict.cc (pass_wrestrict::execute): Calculate
dominance info for diagnostics show context.
(maybe_diag_overlap): Use rich location with details for warning_at.
(maybe_diag_access_bounds): Use rich location with details for
warning_at.

gcc/testsuite/ChangeLog:

* gcc.dg/pr109071.c: New test.
* gcc.dg/pr109071_1.c: New test.
* gcc.dg/pr109071_10.c: New test.
* gcc.dg/pr109071_11.c: New test.
* gcc.dg/pr109071_12.c: New test.
* gcc.dg/pr109071_2.c: New test.
* gcc.dg/pr109071_3.c: New test.
* gcc.dg/pr109071_4.c: New test.
* gcc.dg/pr109071_5.c: New test.
* gcc.dg/pr109071_6.c: New test.
* gcc.dg/pr109071_7.c: New test.
* gcc.dg/pr109071_8.c: New test.
* gcc.dg/pr109071_9.c: New test.
* gcc.dg/pr117375.c: New test.

sra: Make build_ref_for_offset static [PR121568]

build_ref_for_offset was originally made external
with r0-95095-g3f84bf08c48ea4. The call was extracted
out into ipa_get_jf_ancestor_result by r0-110216-g310bc6334823b9.
Then the call was removed by r10-7273-gf3280e4c0c98e1.
There is therefore no use of build_ref_for_offset outside of SRA, so
let's make it static again.

Bootstrapped and tested on x86_64-linux-gnu.

PR tree-optimization/121568
gcc/ChangeLog:

* ipa-prop.h (build_ref_for_offset): Remove.
* tree-sra.cc (build_ref_for_offset): Make static.

Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>

Merge aarch64-cc-fusion into late-combine

I'd added the aarch64-specific CC fusion pass to fold a PTEST
instruction into the instruction that feeds the PTEST, in cases
where the latter instruction can set the appropriate flags as a
side-effect.

Combine does the same optimisation.  However, as explained in the
comments, the PTEST case often has:

  A: set predicate P based on inputs X
  B: clobber X
  C: test P

and so the fusion is only possible if we move C before B.
That's something that combine currently can't do (for the cases
that we needed).

The optimisation was never really AArch64-specific.  It's just that,
in an all-too-familiar fashion, we needed it in stage 3, when it was
too late to add something target-independent.

late-combine adds a convenient place to do the optimisation in a
target-independent way, just as combine is a convenient place to
do its related optimisation.

gcc/
* config.gcc (aarch64*-*-*): Remove aarch64-cc-fusion.o from
extra_objs.
* config/aarch64/aarch64-passes.def (pass_cc_fusion): Delete.
* config/aarch64/aarch64-protos.h (make_pass_cc_fusion): Delete.
* config/aarch64/t-aarch64 (aarch64-cc-fusion.o): Delete.
* config/aarch64/aarch64-cc-fusion.cc: Delete.
* late-combine.cc (late_combine::optimizable_set): Take a set_info *
rather than an insn_info * and move destination tests from...
(late_combine::combine_into_uses): ...here. Take a set_info * rather
than an insn_info *.  Take the rtx set.
(late_combine::parallelize_insns, late_combine::combine_cc_setter)
(late_combine::combine_insn): New member functions.
(late_combine::m_parallel): New member variable.
* rtlanal.cc (pattern_cost): Handle sets of CC registers in the
same way as comparisons.

rtl-ssa: Fix thinko when adding live-out uses

While testing a later patch, I found that create_degenerate_phi
had an inverted test for bitmap_set_bit.  It was assuming that
the return value was the previous bit value, rather than a
"something changed" value. :(

Also, the call to add_live_out_use shouldn't be conditional
on the DF_LR_OUT operation, since the register could be live-out
because of uses later in the same EBB (which do not require a
live-out use to be added to the rtl-ssa instruction).  Instead,
add_live_out_use should itself check whether a live-out use already exists.
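
To illustrate the return-value convention behind the inverted test, here
is a small sketch of a set operation that reports whether anything
changed rather than the previous bit value. The helper names
set_bit_changed and count_newly_live are hypothetical and use
std::vector<bool> instead of GCC's bitmap type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sets bit `i` and returns true only if the bit was previously clear,
// i.e. a "something changed" result rather than the old bit value.
inline bool set_bit_changed(std::vector<bool>& bits, std::size_t i)
{
  bool was_set = bits[i];
  bits[i] = true;
  return !was_set;
}

// Returns the number of registers that became live-out for the first time.
inline int count_newly_live(std::vector<bool>& live_out,
                            const std::vector<std::size_t>& regs)
{
  int n = 0;
  for (std::size_t r : regs)
    if (set_bit_changed(live_out, r))  // true means "bit was newly set"
      ++n;
  return n;
}
```

Treating the result as the previous bit value would invert the condition
and count only the registers that were already marked.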

gcc/
* rtl-ssa/blocks.cc (function_info::create_degenerate_phi): Fix
inverted test of bitmap_set_bit.  Call add_live_out_use even
if the register was previously live-out from the predecessor block.
Instead...
(function_info::add_live_out_use): ...check here whether a live-out
use already exists.

rtl-ssa: Add a find_uses function

rtl-ssa already has a find_def function for finding the definition
of a particular resource (register or memory) at a particular point
in the program. This patch adds a similar function for looking
up uses. Both functions have amortised logarithmic complexity.

gcc/
* rtl-ssa/accesses.h (use_lookup): New class.
* rtl-ssa/functions.h (function_info::find_def): Expand comment.
(function_info::find_use): Declare.
* rtl-ssa/member-fns.inl (use_lookup::prev_use, use_lookup::next_use)
(use_lookup::matching_use, use_lookup::matching_or_prev_use)
(use_lookup::matching_or_next_use): New member functions.
* rtl-ssa/accesses.cc (function_info::find_use): Likewise.

tree-optimization/114480 - speedup IDF compute

The testcase in the PR shows that it's worth splitting the processing
of the initial workset, which is def_blocks, from the main iteration.
This reduces SSA incremental update time from 44.7s to 32.9s. Changing
the workset bitmap of the main iteration to a vector speeds things up
further to 23.5s, nearly halving the SSA incremental update time overall
and saving 12% of compile time at -O1.

Using bitmap_ior in the first loop or avoiding (immediate) re-processing
of blocks in def_blocks does not make a measurable difference for the
testcase so I left this as-is.
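
The shape of the change can be sketched as below. This is a simplified
model, not the code in cfganal.cc: compute_idf_sketch, Block and the df
table of precomputed dominance frontiers are hypothetical names, and
std::vector<bool> stands in for GCC's bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Block = std::size_t;

// df[b] lists the dominance frontier of block b (assumed precomputed).
// Returns a flag per block telling whether it is in the iterated
// dominance frontier of def_blocks.
std::vector<bool>
compute_idf_sketch(const std::vector<std::vector<Block>>& df,
                   const std::vector<Block>& def_blocks)
{
  std::vector<bool> idf(df.size(), false);
  std::vector<Block> work;  // plain vector used as a stack

  // Phase 1: process the initial workset (def_blocks) separately.
  for (Block b : def_blocks)
    for (Block f : df[b])
      if (!idf[f])
        {
          idf[f] = true;
          work.push_back(f);
        }

  // Phase 2: main iteration over newly discovered frontier blocks.
  while (!work.empty())
    {
      Block b = work.back();
      work.pop_back();
      for (Block f : df[b])
        if (!idf[f])
          {
            idf[f] = true;
            work.push_back(f);
          }
    }
  return idf;
}
```

Each block enters the vector at most once, because it is pushed only when
its flag flips from false to true.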

PR tree-optimization/114480
* cfganal.cc (compute_idf): Split processing of the initial
workset from the main iteration. Use a vector for the
workset of the main iteration.

AVR: target/121608 - Don't add --relax when linking with -r.

The linker rejects --relax in relocatable links (-r), hence only
add --relax when -r is not specified.

gcc/
PR target/121608
* config/avr/specs.h (LINK_RELAX_SPEC): Wrap in %{!r...}.

Thread the remains of vect_analyze_slp_instance

vect_analyze_slp_instance still handles stores and reduction chains.
The following threads the special handling of those two kinds,
duplicating vect_build_slp_instance into two specialized entries.

* tree-vect-slp.cc (vect_analyze_slp_reduc_chain): New,
copied from vect_analyze_slp_instance and only handle
slp_inst_kind_reduc_chain. Inline vect_build_slp_instance.
(vect_analyze_slp_instance): Only handle slp_inst_kind_store.
Inline vect_build_slp_instance.
(vect_build_slp_instance): Remove now unused stmt_info parameter,
remove special code for store groups and reduction chains.
(vect_analyze_slp): Call vect_analyze_slp_reduc_chain
for reduction chain SLP build and adjust.

Enable gather/scatter for epilogues of vector epilogues

The restriction no longer applies, so remove it.

* tree-vect-data-refs.cc (vect_check_gather_scatter):
Remove restriction on epilogue of epilogue vectorization.

Remove most of the epilogue vinfo fixup

The following removes the fixup we apply to pattern stmt operands
before code generating vector epilogues. This isn't necessary anymore
since the SLP graph now exclusively records the data flow. Similarly
fixing up of SSA references inside DR_REF of gather/scatter isn't
necessary since we now record the analysis result and avoid re-doing
it during transform.

What we still need to keep is the adjustment of the actual pointers
to gimple stmts from stmt_vec_info and the back-reference from the DRs.

* tree-vect-loop.cc (update_epilogue_loop_vinfo): Remove
fixing up pattern stmt operands and gather/scatter DR_REFs.
(find_in_mapping): Remove.

Record get_load_store_info results from analysis

The following is a patch to make us record the get_load_store_info
results from load/store analysis and re-use them during transform.
In particular this moves where SLP_TREE_MEMORY_ACCESS_TYPE is stored.

A major hassle was (and still is, to some extent) gather/scatter
handling with its accompanying gather_scatter_info.  As
get_load_store_info no longer fully re-analyzes them, but parts of
the information are recorded in the SLP tree during SLP build, the
following goes and eliminates the use of this data in
vectorizable_load/store, instead recording the other relevant
part in the load-store info (namely the IFN or decl chosen).
Strided load handling keeps the re-analysis but populates the
data back to the SLP tree and the load-store info.  That's something
for further improvement.  This also shows that early classifying
a SLP tree as load/store and allocating the load-store data might
be a way to move back all of the gather/scatter auxiliary data
into one place.

Rather than mass-replacing references to variables I've kept the
locals but made them read-only, only adjusting a few elsval setters
and adding a FIXME to strided SLP handling of alignment (allowing
local override there).

The FIXME shows that while a lot of analysis is done in
get_load_store_type that's far from all of it.  There's also
a possibility that splitting up the transform phase into
separate load/store def types, based on VMAT chosen, will make
the code more maintainable.

* tree-vectorizer.h (vect_load_store_data): New.
(_slp_tree::memory_access_type): Remove.
(SLP_TREE_MEMORY_ACCESS_TYPE): Turn into inline function.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Do not
initialize SLP_TREE_MEMORY_ACCESS_TYPE.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
Remove gather_scatter_info pointer argument, instead get
info from the SLP node.
(vect_build_one_gather_load_call): Get SLP node and builtin
decl as argument and remove uses of gather_scatter_info.
(vect_build_one_scatter_store_call): Likewise.
(vect_get_gather_scatter_ops): Remove uses of gather_scatter_info.
(vect_get_strided_load_store_ops): Get SLP node and remove
uses of gather_scatter_info.
(get_load_store_type): Take pointer to vect_load_store_data
instead of individual pointers.
(vectorizable_store): Adjust.  Re-use get_load_store_type
result from analysis time.
(vectorizable_load): Likewise.

cobol: Eliminate errors that cause valgrind messages.

gcc/cobol/ChangeLog:

* genutil.cc (get_binary_value): Fix a comment.
* parse.y: udf_args_valid(): Fix loc calculation.
* symbols.cc (assert): extend_66_capacity(): Avoid assert(e < e2) in
-O0 build until symbol_table expansion is fixed.

libgcobol/ChangeLog:

* libgcobol.cc (format_for_display_internal): Handle NumericDisplay
properly.
(compare_88): Fix memory access error.
(__gg__unstring): Likewise.

Fortran: Clean up and fix some refs.

gcc/fortran/ChangeLog:

* intrinsic.texi: Correct the example given for FRACTION.
Move the TEAM_NUMBER section to after the TANPI to align
with the order given in the index.

x86: Place the TLS call before all register setting BBs

We can't place a TLS call before a conditional jump in a basic block like

(code_label 13 11 14 4 2 (nil) [1 uses])
(note 14 13 16 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(jump_insn 16 14 17 4 (set (pc)
        (if_then_else (le (reg:CCNO 17 flags)
                (const_int 0 [0]))
            (label_ref 27)
            (pc))) "x.c":10:21 discrim 1 1462 {*jcc}
     (expr_list:REG_DEAD (reg:CCNO 17 flags)
        (int_list:REG_BR_PROB 628353713 (nil)))
-> 27)

since the TLS call will clobber the flags register.  Nor can we place a TLS
call in a basic block if any live caller-saved registers aren't dead at the
end of the basic block:

;; live  in      6 [bp] 7 [sp] 16 [argp] 17 [flags] 19 [frame] 104
;; live  gen     0 [ax] 102 106 108 116 117 118 120
;; live  kill    5 [di]

Instead, we should place such call before all register setting basic
blocks which dominate the current basic block.

Keep track of the replaced GNU and GNU2 TLS instructions.  Use this info to
place the __tls_get_addr call and mark the FLAGS register as dead.

gcc/

PR target/121572
* config/i386/i386-features.cc (replace_tls_call): Add a bitmap
argument and put the updated TLS instruction in the bitmap.
(ix86_get_dominator_for_reg): New.
(ix86_check_flags_reg): Likewise.
(ix86_emit_tls_call): Likewise.
(ix86_place_single_tls_call): Add 2 bitmap arguments for updated
GNU and GNU2 TLS instructions.  Call ix86_emit_tls_call to emit
TLS instruction.  Correct debug dump for before instruction.

gcc/testsuite/

PR target/121572
* gcc.target/i386/pr121572-1a.c: New test.
* gcc.target/i386/pr121572-1b.c: Likewise.
* gcc.target/i386/pr121572-2a.c: Likewise.
* gcc.target/i386/pr121572-2b.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

Daily bump.

c++: testcase tweak for -fimplicit-constexpr

This testcase is testing the difference between functions that are or are
not declared constexpr.

gcc/testsuite/ChangeLog:

* g++.dg/cpp26/expansion-stmt16.C: Add -fno-implicit-constexpr.

c++: Fix ICE on mangling invalid compound requirement [PR120618]

This testcase caused an ICE when mangling the invalid type-constraint in
write_requirement since write_type_constraint expects a TEMPLATE_TYPE_PARM.

Setting the trailing return type to NULL_TREE when a
return-type-requirement is found in place of a type-constraint prevents the
failed assertion in write_requirement. It also allows the invalid
constraint to be satisfied in some contexts to prevent redundant errors,
e.g. in concepts-requires5.C.

Bootstrapped and tested on x86_64-linux-gnu.

PR c++/120618

gcc/cp/ChangeLog:

* parser.cc (cp_parser_compound_requirement): Set type to
NULL_TREE for invalid type-constraint.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-requires5.C: Don't require
redundant diagnostic in static assertion.
* g++.dg/concepts/pr120618.C: New test.

Suggested-by: Jason Merrill <jason@redhat.com>

middle-end: Fix malloc like functions when calling with void "return" [PR120024]

When expanding malloc like functions, we copy the return register into a temporary
and then mark that temporary register with a noalias regnote and the alignment.
This works fine unless you are calling the function with a return type of void.
At this point then the valreg will be null and a crash will happen.

A few cleanups are included in this patch because it was easier to do the fix
with the cleanups added.
The start_sequence/end_sequence for ECF_MALLOC is no longer needed; I can't tell
if it was ever needed.
The emit_move_insn function returns the last emitted instruction anyways so
there is no reason to call get_last_insn as we can just use the return value
of emit_move_insn. This has been true since this code was originally added
so I don't understand why it was done that way beforehand.

Bootstrapped and tested on x86_64-linux-gnu.

PR middle-end/120024

gcc/ChangeLog:

* calls.cc (expand_call): Remove start_sequence/end_sequence
for ECF_MALLOC.
Check valreg before dereferencing it when it comes to malloc like
functions. Use the return value of emit_move_insn instead of
calling get_last_insn.

gcc/testsuite/ChangeLog:

* gcc.dg/torture/malloc-1.c: New test.
* gcc.dg/torture/malloc-2.c: New test.

Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>

c++: constrained corresponding using from partial spec [PR121351]

When comparing constraints during correspondence checking for a using
from a partial specialization, we need to substitute the partial
specialization arguments into the constraints rather than the primary
template arguments. Otherwise we incorrectly reject e.g. the below
testcase as ambiguous since we substitute T=int* instead of T=int
into #1's constraints and don't notice the correspondence.

This patch corrects the recent r16-2771-gb9f1cc4e119da9 fix by using
outer_template_args instead of TI_ARGS of the DECL_CONTEXT, which
should always give the correct outer arguments for substitution.

PR c++/121351

gcc/cp/ChangeLog:

* class.cc (add_method): Use outer_template_args when
substituting outer template arguments into constraints.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-using7.C: New test.

Reviewed-by: Jason Merrill <jason@redhat.com>

Remove reduction chain detection from parloops

Historically SLP reduction chains were the only multi-stmt reductions
supported. But since we have check_reduction_path more complicated
cases are handled. As parloops doesn't do any specific chain
processing it can solely rely on that functionality instead.

* tree-parloops.cc (parloops_is_slp_reduction): Remove.
(parloops_is_simple_reduction): Do not call it.

A few missing SLP node passings to vector costing

The following fixes another few missed cases to pass a SLP node
instead of a stmt_info.

* tree-vect-loop.cc (vectorizable_reduction): Pass the
appropriate SLP node for costing of single-def-use-cycle
operations.
(vectorizable_live_operation): Pass the SLP node to the
costing hook.
* tree-vect-stmts.cc (vectorizable_bswap): Likewise.
(vectorizable_store): Likewise.

tree-optimization/121592 - failed reduction SLP discovery

The testcase in the PR shows that when we have a reduction chain
with a wrapped conversion we fail to properly fall back to a
regular reduction, resulting in wrong-code. The following fixes
this by failing discovery. The testcase has other issues, so
I'm not including it here.

PR tree-optimization/121592
* tree-vect-slp.cc (vect_analyze_slp): When SLP reduction chain
discovery fails, fail overall when the tail of the chain
isn't also the entry for the non-SLP reduction.

Fix riscv build, no longer works with python2

Building riscv no longer works with python2:

> python ./config/riscv/arch-canonicalize -misa-spec=20191213 rv64gc
  File "./config/riscv/arch-canonicalize", line 229
    print(f"ERROR: Unhandled conditional dependency: '{ext_name}' with condition:", file=sys.stderr)
                                                                                 ^
SyntaxError: invalid syntax

On systems that have python aliased to python2 we chose that, even
when python3 is available.  Don't.

* config.gcc (riscv*-*-*): Look for python3, then fall back
to python.  Never use python2.

tree-optimization/121527 - wrong SRA with aggregate copy

SRA handles outermost VIEW_CONVERT_EXPRs but it wrongly ignores
those when building an access which leads to the wrong size
used when the VIEW_CONVERT_EXPR does not have the same size as
its operand which is valid GENERIC and is used by Ada upcasting.

PR tree-optimization/121527
* tree-sra.cc (build_access_from_expr_1): Do not strip an
outer VIEW_CONVERT_EXPR as it's relevant for the size of
the access.
(get_access_for_expr): Likewise.

AArch64: Use vectype from SLP node instead of stmt_info [PR121536]

commit g:1786be14e94bf1a7806b9dc09186f021737f0227 stops storing in
STMT_VINFO_VECTYPE the vectype of the current stmt being vectorized and instead
requires the use of SLP_TREE_VECTYPE for everything but data-refs.

This means that STMT_VINFO_VECTYPE (stmt_info) will always be NULL and so
aarch64_bool_compound_p will never properly cost predicate AND operations
anymore resulting in less vectorization.

This patch changes it to use SLP_TREE_VECTYPE and pass the slp_node to
aarch64_bool_compound_p.

gcc/ChangeLog:

PR target/121536
* config/aarch64/aarch64.cc (aarch64_bool_compound_p): Use
SLP_TREE_VECTYPE instead of STMT_VINFO_VECTYPE.
(aarch64_adjust_stmt_cost, aarch64_vector_costs::count_ops): Pass SLP
node to aarch64_bool_compound_p.

gcc/testsuite/ChangeLog:

PR target/121536
* g++.target/aarch64/sve/pr121536.cc: New test.

middle-end: Fix costing hooks of various vectorizable_* [PR121536]

commit g:1786be14e94bf1a7806b9dc09186f021737f0227 stops storing in
STMT_VINFO_VECTYPE the vectype of the current stmt being vectorized and instead
requires the use of SLP_TREE_VECTYPE for everything but data-refs.

However contrary to what the commit says not all usages of STMT_VINFO_VECTYPE
have been purged from vectorizable_* as the costing hooks which don't pass the
SLP tree as an argument will extract vectype using STMT_VINFO_VECTYPE.

This results in no vector type being passed to the backends and results in a few
costing test failures in AArch64.

This commit replaces the last few cases I could find, all except for in
vectorizable_reduction when single_defuse_cycle where the stmt being costed is
not the representative of the PHI in the SLP tree but rather the out of tree
reduction statement. So I've left that alone, but it does mean vectype is NULL.

Most likely this needs to use the overload where we pass an explicit vectype but
I wasn't sure so left it for now.

gcc/ChangeLog:

PR target/121536
* tree-vect-loop.cc (vectorizable_phi, vectorizable_recurr,
vectorizable_nonlinear_induction, vectorizable_induction): Pass slp_node
instead of stmt_info to record_stmt_cost.

AArch64: Fix scalar costing after removal of vectype from mid-end [PR121536]

commit g:fb59c5719c17a04ecfd58b5e566eccd6d2ac583a stops passing the scalar type
(confusingly named vectype) to the costing hook when doing scalar costing.

As a result, we could no longer distinguish between FPR and GPR scalar stmts.
A later commit also removed STMT_VINFO_VECTYPE from stmt_info.

This leaves the only remaining option to get the type of the original stmt in
the stmt_info. This patch does this when we're performing scalar costing.

Ideally I'd refactor this a bit because a lot of the hooks just need to know if
it's FP or not, but this seems pointless with the ongoing costing churn. So for
now this restores our costing.

gcc/ChangeLog:

PR target/121536
* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost): Set
vectype from type of lhs of gimple stmt.

libstdc++: Restore call to test6642 in string_vector_iterators.cc test [PR104874]

The test call was accidentally omitted in r16-2484-gdc49c0a46ec96e,
a commit that refactored this test file. This patch adds it back.

PR libstdc++/104874

libstdc++-v3/ChangeLog:

* testsuite/24_iterators/random_access/string_vector_iterators.cc:
Call test6642.

testsuite: Fix g++.dg/abi/mangle83.C [PR121578]

This testcase (added in r16-3233-g7921bb4afcb7a3) mistakenly only
required C++14, but auto template parameters are a C++17 feature.
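
For context, a minimal sketch of the language feature involved. The names
Constant and its members are hypothetical and are not taken from
mangle83.C. A non-type template parameter declared with auto is only
accepted with -std=c++17 or later:

```cpp
#include <type_traits>

// Hypothetical illustration, not the contents of mangle83.C.
// `auto` as a non-type template parameter requires C++17.
template <auto V>
struct Constant
{
  using type = decltype (V);
  static constexpr type value = V;
};

// The parameter's type is deduced separately for each argument.
static_assert (std::is_same_v<Constant<42>::type, int>, "deduced as int");
static_assert (std::is_same_v<Constant<'x'>::type, char>, "deduced as char");
```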

PR c++/121578

gcc/testsuite/ChangeLog:

* g++.dg/abi/mangle83.C: Requires C++17.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

c++/modules: Fix exporting using-decls of unattached purview functions [PR120195]

We have logic to adjust a function decl if it gets re-declared as a
using-decl with different purviewness, but we also need to do the same
if it gets redeclared with different exportedness.

PR c++/120195

gcc/cp/ChangeLog:

* name-lookup.cc (do_nonmember_using_decl): Also handle change
in exportedness of a function.

gcc/testsuite/ChangeLog:

* g++.dg/modules/using-32_a.C: New test.
* g++.dg/modules/using-32_b.C: New test.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

testsuite: Fix PR108080 testcase for some targets [PR121396]

I added a testcase for the (temporary) warning that we don't currently
support the 'gnu::optimize' or 'gnu::target' attributes in r15-10183;
however, some targets produce target nodes even when only an optimize
attribute is present. This adjusts the warning.

PR c++/108080
PR c++/121396

gcc/testsuite/ChangeLog:

* g++.dg/modules/pr108080.H: Also allow target warnings.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

Daily bump.

docs: Fix __builtin_object_size example [PR121581]

This example used to work (with C) in GCC 14 before the
warning for different pointer types without a cast was changed
to an error.
The fix is to make the q variable `int*` rather than the current `char*`.
This also fixes the example for C++.
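
A sketch of the corrected shape, modeled on the manual's example rather
than quoted from it; the struct layout and the helper names are
assumptions. Note that __builtin_object_size may return (size_t) -1 when
the compiler cannot determine the object at compile time, for instance
without optimization, so the check below accepts both outcomes:

```cpp
#include <cassert>
#include <cstddef>

// Struct modeled on the manual's example; treat the member names as
// illustrative rather than quoted.
struct V { char buf1[10]; int b; char buf2[10]; };
V var;

// Bytes remaining in the whole object `var` starting at var.b (type 0).
inline std::size_t remaining_from_b ()
{
  int *q = &var.b;            // `char *q = &var.b;` is rejected
  return __builtin_object_size (q, 0);
}

// Hypothetical self-check: either the size is known, or it is (size_t) -1.
inline void object_size_demo ()
{
  const std::size_t expected = sizeof (V) - offsetof (V, b);
  const std::size_t got = remaining_from_b ();
  assert (got == expected || got == static_cast<std::size_t> (-1));
}
```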

Pushed as obvious after doing a `make html`.

PR middle-end/121581
gcc/ChangeLog:

* doc/extend.texi (__builtin_object_size): Fix example.

Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>

opts: use sanitize_code_type for sanitizer flags

Currently, the data type of sanitizer flags is unsigned int, with
SANITIZE_SHADOW_CALL_STACK (1UL << 31) being the highest individual
enumerator for enum sanitize_code. Use 'sanitize_code_type' data type
to allow for more distinct instrumentation modes to be added when needed.
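
To illustrate the limit being lifted, a small sketch. It assumes a 64-bit
flag type; the log does not show how sanitize_code_type is defined, so
wide_flags_t and the constants below are hypothetical stand-ins:

```cpp
#include <cstdint>

// Assumption: a 64-bit flag type.  This alias is hypothetical and is not
// the definition of sanitize_code_type.
using wide_flags_t = std::uint64_t;

// 1 << 31 is the last bit that fits in a 32-bit unsigned int.
constexpr wide_flags_t HIGHEST_32BIT_FLAG = wide_flags_t{1} << 31;
// A hypothetical future mode one bit higher.
constexpr wide_flags_t NEXT_FLAG = wide_flags_t{1} << 32;

// True if FLAGS survives a round trip through a 32-bit unsigned value.
constexpr bool fits_in_32_bits (wide_flags_t flags)
{
  return static_cast<wide_flags_t> (static_cast<std::uint32_t> (flags)) == flags;
}

static_assert (fits_in_32_bits (HIGHEST_32BIT_FLAG), "bit 31 still fits");
static_assert (!fits_in_32_bits (NEXT_FLAG), "bit 32 needs a wider type");
```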

gcc/ChangeLog:

* flag-types.h (sanitize_code_type): Define.
* asan.h (sanitize_flags_p): Use 'sanitize_code_type' instead of
'unsigned int'.
* common.opt: Likewise.
* dwarf2asm.cc (dw2_output_indirect_constant_1): Likewise.
* opts.cc (find_sanitizer_argument): Likewise.
(report_conflicting_sanitizer_options): Likewise.
(parse_sanitizer_options): Likewise.
(parse_no_sanitize_attribute): Likewise.
* opts.h (parse_sanitizer_options): Likewise.
(parse_no_sanitize_attribute): Likewise.
* tree-cfg.cc (print_no_sanitize_attr_value): Likewise.
* tree.cc (tree_fits_sanitize_code_type_p): Define.
(tree_to_sanitize_code_type): Likewise.
* tree.h (tree_fits_sanitize_code_type_p): Declare.
(tree_to_sanitize_code_type): Likewise.

gcc/c-family/ChangeLog:

* c-attribs.cc (add_no_sanitize_value): Use 'sanitize_code_type'
instead of 'unsigned int'.
(handle_no_sanitize_attribute): Likewise.
(handle_no_sanitize_address_attribute): Likewise.
(handle_no_sanitize_thread_attribute): Likewise.
(handle_no_address_safety_analysis_attribute): Likewise.
* c-common.h (add_no_sanitize_value): Likewise.

gcc/c/ChangeLog:

* c-parser.cc (c_parser_declaration_or_fndef): Use
'sanitize_code_type' instead of 'unsigned int'.

gcc/cp/ChangeLog:

* typeck.cc (get_member_function_from_ptrfunc): Use
'sanitize_code_type' instead of 'unsigned int'.

gcc/d/ChangeLog:

* d-attribs.cc (d_handle_no_sanitize_attribute): Use
'sanitize_code_type' instead of 'unsigned int'.

Signed-off-by: Claudiu Zissulescu <claudiu.zissulescu-ianculescu@oracle.com>

aarch64: add new constants for MTE insns

Define new constants to be used by the MTE pattern definitions.

gcc/

* config/aarch64/aarch64.md (MEMTAG_TAG_MASK): New define
constant.
(MEMTAG_ADDR_MASK): Likewise.
(irg, subp, ldg): Use new constants.

Signed-off-by: Claudiu Zissulescu <claudiu.zissulescu-ianculescu@oracle.com>

MAINTAINERS: Update my email address

ChangeLog:

* MAINTAINERS: Update my email address.

Signed-off-by: Spencer Abson <spencer.abson@student.manchester.ac.uk>

libstdc++: Add nodiscard attribute for ranges algorithm [PR121476]

This patch adds the [[nodiscard]] attribute to the operator() of ranges
algorithm function objects if their std counterpart has it.

Furthermore, we [[nodiscard]] the operator() of the following ranges
algorithms that lack a std counterpart:
* find_last, find_last_if, find_last_if_not (to match other find
algorithms)
* contains, contains_subrange (to match find/any_of and search)

Finally, [[nodiscard]] is added to std::min and std::max overloads
that accept std::initializer_list. This appears to be an oversight,
as std::minmax is already marked, and other min overloads are as well.
The same applies to corresponding operator() overloads of ranges::min and
ranges::max.
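
The pattern can be sketched as follows. contains_fn here is a hypothetical
function object written for illustration; it is not libstdc++'s
implementation. It also shows the kind of change the adjusted tests need
when a result is deliberately ignored:

```cpp
// Hypothetical function object in the style described above.
struct contains_fn
{
  template <typename Range, typename T>
  [[nodiscard]] constexpr bool
  operator() (const Range &r, const T &value) const
  {
    for (const auto &elem : r)
      if (elem == value)
        return true;
    return false;
  }
};

inline constexpr contains_fn contains{};

inline void discard_demo ()
{
  const int vals[] = {1, 2, 3};
  // contains (vals, 2);       // ignoring the result would draw a warning
  (void) contains (vals, 2);   // an explicit cast to void silences it
}
```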

PR libstdc++/121476

libstdc++-v3/ChangeLog:

* include/bits/ranges_algo.h (__all_of_fn::operator())
(__any_of_fn::operator(), __none_of_fn::operator())
(__find_first_of_fn::operator(), __count_fn::operator())
(__find_end_fn::operator(), __remove_if_fn::operator())
(__remove_fn::operator(), __unique_fn::operator())
(__is_sorted_until_fn::operator(), __is_sorted_fn::operator())
(__lower_bound_fn::operator(), __upper_bound_fn::operator())
(__equal_range_fn::operator(), __binary_search_fn::operator())
(__is_partitioned_fn::operator(), __partition_point_fn::operator())
(__minmax_fn::operator(), __min_element_fn::operator())
(__includes_fn::operator(), __max_fn::operator())
(__lexicographical_compare_fn::operator(), __clamp_fn::operator())
(__find_last_fn::operator(), __find_last_if_fn::operator())
(__find_last_if_not_fn::operator()): Add [[nodiscard]] attribute.
* include/bits/ranges_algobase.h (__equal_fn::operator()):
Add [[nodiscard]] attribute.
* include/bits/ranges_util.h (__find_fn::operator())
(__find_if_fn::operator(), __find_if_not_fn::operator())
(__mismatch_fn::operator(), __search_fn::operator())
(__min_fn::operator(), __adjacent_find_fn::operator()):
Add [[nodiscard]] attribute.
* include/bits/stl_algo.h (std::min(initializer_list<T>))
(std::min(initializer_list<T>, _Compare))
(std::max(initializer_list<T>))
(std::max(initializer_list<T>, _Compare)): Add _GLIBCXX_NODISCARD.
* testsuite/25_algorithms/min/constrained.cc: Silence nodiscard
warning.
* testsuite/25_algorithms/max/constrained.cc: Likewise.
* testsuite/25_algorithms/minmax/constrained.cc: Likewise.
* testsuite/25_algorithms/minmax_element/constrained.cc: Likewise.

gcse: Fix handling of partial clobbers [PR97497]

This patch fixes an internal disagreement in gcse about how to
handle partial clobbers.  Like many passes, gcse doesn't track
the modes of live values, so if a call clobbers only part of
a register, the pass has to make conservative assumptions.
As the comment in the patch says, this means:

(1) ignoring partial clobbers when computing liveness and reaching
    definitions

(2) treating partial clobbers as full clobbers when computing
    availability

DF is mostly concerned with (1), so ignores partial clobbers.

compute_hash_table_work did (2) when calculating kill sets,
but compute_transp didn't do (2) when computing transparency.
This led to a nonsensical situation of a register being in both
the transparency and kill sets.
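
A toy model of the two rules, using one bit per register. The type and
function names are hypothetical and do not correspond to gcse's data
structures; the point is only that transparency has to be derived from the
same conservative kill set as availability:

```cpp
#include <cstdint>

using regset = std::uint32_t;   // one bit per hard register (toy model)

struct call_effects
{
  regset full_clobbers;     // registers entirely overwritten by the call
  regset partial_clobbers;  // registers only partly overwritten
};

// (1) Liveness / reaching definitions: ignore partial clobbers.
constexpr regset killed_for_liveness (call_effects c)
{ return c.full_clobbers; }

// (2) Availability: treat partial clobbers as full clobbers.
constexpr regset killed_for_availability (call_effects c)
{ return c.full_clobbers | c.partial_clobbers; }

// Transparency uses the same rule as the availability kill set, so a
// register can never be both transparent and killed.
constexpr regset transparent (regset used, call_effects c)
{ return used & ~killed_for_availability (c); }
```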

gcc/
PR rtl-optimization/97497
* function-abi.h (predefined_function_abi::only_partial_reg_clobbers)
(function_abi::only_partial_reg_clobbers): New member functions.
* gcse-common.cc: Include regs.h and function-abi.h.
(compute_transp): Check for partially call-clobbered registers
and treat them as not being transparent in blocks with calls.

libstdc++: Fix element self-assignments when inserting an empty range [PR121313]

For __n == 0, the elements were self move-assigned by
std::move_backward(__ins, __old_finish - __n, __old_finish).
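
A simplified model of the problem, not the code in vector.tcc. Tracked,
open_gap and count_self_moves are hypothetical names. The back-to-front
loop stands in for the move_backward call, and with a gap of zero every
shifted element is assigned to itself unless the empty case is skipped:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy element type that counts self move-assignments.
struct Tracked
{
  int value = 0;
  static inline int self_moves = 0;

  Tracked () = default;
  Tracked (Tracked &&) = default;
  Tracked &operator= (Tracked &&other) noexcept
  {
    if (this == &other)
      ++self_moves;
    else
      value = other.value;
    return *this;
  }
};

// Shift buf[pos, old_size) up by n slots, back to front.  buf must
// already hold at least old_size + n elements.
inline void open_gap (std::vector<Tracked> &buf, std::size_t pos,
                      std::size_t old_size, std::size_t n, bool guard)
{
  assert (pos <= old_size && old_size + n <= buf.size ());
  if (guard && n == 0)
    return;                       // empty range: nothing to move
  for (std::size_t i = old_size; i-- > pos;)
    buf[i + n] = std::move (buf[i]);
}

// Number of self move-assignments seen when opening a gap of n slots
// after the first of four elements.
inline int count_self_moves (std::size_t n, bool guard)
{
  std::vector<Tracked> buf (4 + n);
  Tracked::self_moves = 0;
  open_gap (buf, 1, 4, n, guard);
  return Tracked::self_moves;
}
```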

PR libstdc++/121313

libstdc++-v3/ChangeLog:

* include/bits/vector.tcc (vector::insert_range): Add check for
empty size.
* testsuite/23_containers/vector/modifiers/insert/insert_range.cc:
New tests.

LoongArch: Implement 16-byte atomic add, sub, and, or, xor, and nand with sc.q

gcc/ChangeLog:

* config/loongarch/sync.md (UNSPEC_TI_FETCH_ADD): New unspec.
(UNSPEC_TI_FETCH_SUB): Likewise.
(UNSPEC_TI_FETCH_AND): Likewise.
(UNSPEC_TI_FETCH_XOR): Likewise.
(UNSPEC_TI_FETCH_OR): Likewise.
(UNSPEC_TI_FETCH_NAND_MASK_INVERTED): Likewise.
(ALL_SC): New define_mode_iterator.
(_scq): New define_mode_attr.
(atomic_fetch_nand<mode>): Accept ALL_SC instead of only GPR.
(UNSPEC_TI_FETCH_DIRECT): New define_int_iterator.
(UNSPEC_TI_FETCH): New define_int_iterator.
(amop_ti_fetch): New define_int_attr.
(size_ti_fetch): New define_int_attr.
(atomic_fetch_<amop_ti_fetch>ti_scq): New define_insn.
(atomic_fetch_<amop_ti_fetch>ti): New define_expand.

LoongArch: Implement 16-byte atomic exchange with sc.q

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_exchangeti_scq): New
define_insn.
(atomic_exchangeti): New define_expand.

LoongArch: Implement 16-byte CAS with sc.q

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_compare_and_swapti_scq): New
define_insn.
(atomic_compare_and_swapti): New define_expand.

LoongArch: Implement 16-byte atomic store with sc.q

When LSX is not available but sc.q is (for example on LA664 where the
SIMD unit is not enabled), we can use a LL-SC loop for 16-byte atomic
store.

gcc/ChangeLog:

* config/loongarch/loongarch.cc (loongarch_print_operand_reloc):
Accept "%t" for printing the number of the 64-bit machine
register holding the upper half of a TImode.
* config/loongarch/sync.md (atomic_storeti_scq): New
define_insn.
(atomic_storeti): Expand to atomic_storeti_scq if !ISA_HAS_LSX.

LoongArch: Add -m[no-]scq option

We'll use the sc.q instruction for some 16-byte atomic operations, but
it's only added in LoongArch 1.1 evolution so we need to gate it with
an option.

gcc/ChangeLog:

* config/loongarch/genopts/isa-evolution.in (scq): New evolution
feature.
* config/loongarch/loongarch-evolution.cc: Regenerate.
* config/loongarch/loongarch-evolution.h: Regenerate.
* config/loongarch/loongarch-str.h: Regenerate.
* config/loongarch/loongarch.opt: Regenerate.
* config/loongarch/loongarch.opt.urls: Regenerate.
* config/loongarch/loongarch-def.cc: Make -mscq the default for
-march=la664 and -march=la64v1.1.
* doc/invoke.texi (LoongArch Options): Document -m[no-]scq.

LoongArch: Implement 16-byte atomic store with LSX

If the vector is naturally aligned, it cannot cross cache lines so the
LSX store is guaranteed to be atomic. Thus we can use LSX to do the
lock-free atomic store, instead of using a lock.
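
The alignment argument can be checked with a little arithmetic. This sketch
assumes the cache line size is a power of two of at least 16 bytes; the
64-byte value used in the checks is an assumption for illustration, and
both helper names are hypothetical:

```cpp
#include <cstdint>

// True if ADDR is a multiple of SIZE (SIZE must be a power of two).
constexpr bool naturally_aligned (std::uint64_t addr, std::uint64_t size)
{ return (addr & (size - 1)) == 0; }

// True if [ADDR, ADDR + SIZE) spans more than one LINE-byte cache line.
constexpr bool crosses_line (std::uint64_t addr, std::uint64_t size,
                             std::uint64_t line)
{ return addr / line != (addr + size - 1) / line; }

// An aligned 16-byte access stays inside one 64-byte line ...
static_assert (naturally_aligned (0x30, 16) && !crosses_line (0x30, 16, 64),
               "aligned access stays in one line");
// ... while a misaligned one can straddle two lines.
static_assert (!naturally_aligned (0x38, 16) && crosses_line (0x38, 16, 64),
               "misaligned access can cross a line");
```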

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_storeti_lsx): New
define_insn.
(atomic_storeti): New define_expand.

LoongArch: Implement 16-byte atomic load with LSX

If the vector is naturally aligned, it cannot cross cache lines so the
LSX load is guaranteed to be atomic. Thus we can use LSX to do the
lock-free atomic load, instead of using a lock.

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_loadti_lsx): New define_insn.
(atomic_loadti): New define_expand.

LoongArch: Implement atomic_fetch_nand<GPR:mode>

Without atomic_fetch_nandsi and atomic_fetch_nanddi, __atomic_fetch_nand
is expanded to a loop containing a CAS in the body, and CAS itself is a
LL-SC loop so we have a nested loop. This is obviously not a good idea
as we just need one LL-SC loop in fact.

As ~(atom & mask) is (~mask) | (~atom), we can just invert the mask
first and the body of the LL-SC loop would be just one orn instruction.
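
A portable model of the idea, using std::atomic rather than the LoongArch
patterns. fetch_nand_model is a hypothetical name, and
compare_exchange_weak merely stands in for the single LL-SC retry loop;
this sketch does not show what the backend emits:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically replace ATOM with ~(ATOM & MASK) and return the old value.
inline std::uint32_t fetch_nand_model (std::atomic<std::uint32_t> &atom,
                                       std::uint32_t mask)
{
  const std::uint32_t inv_mask = ~mask;   // inverted once, outside the loop
  std::uint32_t old = atom.load (std::memory_order_relaxed);
  // De Morgan: (~mask) | (~atom) equals ~(atom & mask).
  assert ((inv_mask | ~old) == static_cast<std::uint32_t> (~(old & mask)));
  // The loop body is a single "or with complement" of the loaded value.
  while (!atom.compare_exchange_weak (old, inv_mask | ~old))
    ;
  return old;
}
```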

gcc/ChangeLog:

* config/loongarch/sync.md
(atomic_fetch_nand_mask_inverted<GPR:mode>): New define_insn.
(atomic_fetch_nand<GPR:mode>): New define_expand.

LoongArch: Don't expand atomic_fetch_sub_{hi, qi} to LL-SC loop if -mlam-bh

With -mlam-bh, we should negate the addend first, and use an amadd
instruction. Disabling the expander makes the compiler do it correctly.
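
The equivalence being relied on, sketched with std::atomic.
fetch_sub_via_add is a hypothetical helper; unsigned arithmetic is used so
that the negation is well defined for every operand:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// fetch_sub expressed as fetch_add of the negated operand.
inline std::uint16_t fetch_sub_via_add (std::atomic<std::uint16_t> &atom,
                                        std::uint16_t addend)
{
  const std::uint16_t negated = static_cast<std::uint16_t> (0u - addend);
  return atom.fetch_add (negated);
}

inline void fetch_sub_demo ()
{
  std::atomic<std::uint16_t> counter{10};
  const std::uint16_t old = fetch_sub_via_add (counter, 3);
  assert (old == 10);              // the old value is returned
  assert (counter.load () == 7);   // 10 - 3
}
```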

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_fetch_sub<SHORT:mode>):
Disable if ISA_HAS_LAM_BH.

LoongArch: Implement subword atomic_fetch_{and, or, xor} with am*.w instructions

We can just shift the mask and fill the other bits with 0 (for ior/xor)
or 1 (for and), and use an am*.w instruction to perform the atomic
operation, instead of using a LL-SC loop.
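
How the widened operand might be built, as a sketch. The names are
hypothetical, and byte_index is assumed to be the byte's little-endian
position inside its aligned 32-bit word. The neighbouring bytes receive the
identity element of the operation, so applying the result to the whole word
leaves them unchanged:

```cpp
#include <cstdint>

enum class bitop { ior, xor_, and_ };

// Widen an 8-bit operand to the 32-bit word that contains it.
// Other bytes get 0 for ior/xor and all-ones for and.
constexpr std::uint32_t widen_operand (bitop op, std::uint8_t val,
                                       unsigned byte_index)
{
  const unsigned shift = 8 * byte_index;
  const std::uint32_t placed = static_cast<std::uint32_t> (val) << shift;
  return op == bitop::and_
         ? (placed | ~(std::uint32_t{0xff} << shift))
         : placed;
}

// Word-wide operation; in real code this would be one atomic instruction.
constexpr std::uint32_t apply (bitop op, std::uint32_t word,
                               std::uint32_t operand)
{
  return op == bitop::ior ? (word | operand)
       : op == bitop::xor_ ? (word ^ operand)
       : (word & operand);
}
```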

gcc/ChangeLog:

* config/loongarch/sync.md (UNSPEC_COMPARE_AND_SWAP_AND):
Remove.
(UNSPEC_COMPARE_AND_SWAP_XOR): Remove.
(UNSPEC_COMPARE_AND_SWAP_OR): Remove.
(atomic_test_and_set): Rename to ...
(atomic_fetch_<any_bitwise:amop><SHORT:mode>): ... this, and
adapt the expansion to use it for any bitwise operations and any
val, instead of just ior 1.
(atomic_test_and_set): New define_expand.

LoongArch: Remove unneeded "andi offset, addr, 3" instruction in atomic_test_and_set

On LoongArch sll.w and srl.w instructions only take the [4:0] bits of
rk (shift amount) into account, and we've already defined
SHIFT_COUNT_TRUNCATED to 1 so the compiler knows this fact, thus we
don't need this instruction.
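
Why the masking is redundant, modeled in plain C++. sll_w below is a
hypothetical model of a shift that looks only at bits [4:0] of its count,
and the sketch assumes the byte offset is scaled by 8 to form a bit shift:

```cpp
#include <cstdint>

// Model of a 32-bit left shift that uses only bits [4:0] of the count.
constexpr std::uint32_t sll_w (std::uint32_t value, std::uint64_t rk)
{ return value << (rk & 31); }

// Shift by the byte's bit offset, with the explicit "andi ..., 3".
constexpr std::uint32_t shifted_with_andi (std::uint32_t value,
                                           std::uint64_t addr)
{ return sll_w (value, (addr & 3) << 3); }

// Same, relying on the truncation of the shift count instead.
constexpr std::uint32_t shifted_without_andi (std::uint32_t value,
                                              std::uint64_t addr)
{ return sll_w (value, addr << 3); }

// ((addr << 3) & 31) equals ((addr & 3) << 3), so both forms agree.
static_assert (shifted_with_andi (0xff, 0x1003)
               == shifted_without_andi (0xff, 0x1003),
               "andi is redundant");
```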

gcc/ChangeLog:

* config/loongarch/sync.md (atomic_test_and_set): Remove
unneeded andi instruction from the expansion.

LoongArch: Remove unneeded "b 3f" instruction after LL-SC loops

This instruction is used to skip a redundant barrier if -mno-ld-seq-sa
or the memory model requires a barrier on failure. But with -mld-seq-sa
and other memory models the barrier may be nonexistent, and we
should remove the "b 3f" instruction as well.

The implementation uses a new operand modifier "%T" to output a comment
marker if the operand is a memory order for which the barrier won't be
generated. "%T", and also "%t", are not really used before and the code
for them in loongarch_print_operand_reloc is just some MIPS legacy.

gcc/ChangeLog:

* config/loongarch/loongarch.cc (loongarch_print_operand_reloc):
Make "%T" output a comment marker if the operand is a memory
order for which the barrier won't be generated; remove "%t".
* config/loongarch/sync.md (atomic_cas_value_strong<mode>): Add
%T before "b 3f".
(atomic_cas_value_cmp_and_7_<mode>): Likewise.

LoongArch: Don't emit overly-restrictive barrier for LL-SC loops

For LL-SC loops, if the atomic operation has succeeded, the SC
instruction always implies a full barrier, so the barrier we manually
inserted only needs to take into account the failure memorder, not
the success memorder (the barrier is skipped with "b 3f" on success
anyway).

Note that if we use the AMCAS instructions, we indeed need to consider
both the success memorder and the failure memorder when deciding if the
"_db" suffix is needed. Thus the semantics of atomic_cas_value_strong<mode>
and atomic_cas_value_strong<mode>_amcas start to be different. To
prevent the compiler from being too clever, use a different unspec code
for AMCAS instructions.
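
For readers less familiar with the two memory orders, this is where they
come from at the source level. The sketch uses std::atomic with hypothetical
helper names; it only shows the interface and says nothing about which
instructions the backend selects:

```cpp
#include <atomic>
#include <cassert>

// A strong CAS takes two orders: one applied on success, one on failure.
inline bool cas (std::atomic<int> &x, int expected, int desired)
{
  return x.compare_exchange_strong (expected, desired,
                                    std::memory_order_seq_cst,   // success
                                    std::memory_order_relaxed);  // failure
}

inline void cas_demo ()
{
  std::atomic<int> x{1};
  const bool first = cas (x, 1, 2);    // matches, so x becomes 2
  const bool second = cas (x, 1, 3);   // no longer matches, x stays 2
  assert (first && !second && x.load () == 2);
}
```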

gcc/ChangeLog:

* config/loongarch/sync.md (UNSPEC_COMPARE_AND_SWAP_AMCAS): New
UNSPEC code.
(atomic_cas_value_strong<mode>): NFC, update the comment to note
we only need to consider failure memory order.
(atomic_cas_value_strong<mode>_amcas): Use
UNSPEC_COMPARE_AND_SWAP_AMCAS instead of
UNSPEC_COMPARE_AND_SWAP.
(atomic_compare_and_swap<mode:GPR>): Pass failure memorder to
gen_atomic_cas_value_strong<mode>.
(atomic_compare_and_swap<mode:SHORT>): Pass failure memorder to
gen_atomic_cas_value_cmp_and_7_si.