Tomasz Kamiński [Tue, 19 Aug 2025 13:32:47 +0000 (15:32 +0200)]
libstdc++: Refactor bound arguments storage for bind_front/back
This patch refactors the implementation of bind_front and bind_back to avoid
using std::tuple for argument storage. Instead, bound arguments are now:
* stored directly if there is only one,
* within a dedicated _Bound_arg_storage otherwise.
_Bound_arg_storage is less expensive to instantiate and access than std::tuple.
It can also be trivially copyable, as it doesn't require a non-trivial assignment
operator for reference types. Storing a single argument directly provides similar
benefits compared to both a one-element tuple and _Bound_arg_storage.
_Bound_arg_storage holds each argument in an _Indexed_bound_arg base object.
The base class is parameterized by both type and index to allow storing
multiple arguments of the same type. Invocations are handled by the _S_apply_front
and _S_apply_back static functions, which simulate explicit object parameters.
To facilitate this, the __like_t alias template is now unconditionally available
since C++11 in bits/move.h.
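As an illustration, a minimal sketch of the indexed-base storage scheme
described above (hypothetical names, not the actual libstdc++ code):

#include <cstddef>
#include <utility>

// Each bound argument lives in its own base class, indexed by its
// position, so several arguments of the same type can coexist.
template<std::size_t _Ind, typename _Tp>
  struct Indexed_arg
  { _Tp _M_arg; };

template<typename _Seq, typename... _Args>
  struct Bound_storage_impl;

template<std::size_t... _Inds, typename... _Args>
  struct Bound_storage_impl<std::index_sequence<_Inds...>, _Args...>
  : Indexed_arg<_Inds, _Args>...
  { };

// Trivially copyable whenever all _Args are, unlike a tuple that needs
// a non-trivial assignment operator for reference members.
template<typename... _Args>
  using Bound_storage
    = Bound_storage_impl<std::index_sequence_for<_Args...>, _Args...>;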
libstdc++-v3/ChangeLog:
* include/bits/move.h (std::__like_impl, std::__like_t): Make
available in C++11.
* include/std/functional (std::_Indexed_bound_arg)
(std::_Bound_arg_storage, std::__make_bound_args): Define.
(std::_Bind_front, std::_Bind_back): Use _Bound_arg_storage.
* testsuite/20_util/function_objects/bind_back/1.cc: Expand
test to cover cases of 0, 1, many bound args.
* testsuite/20_util/function_objects/bind_back/111327.cc: Likewise.
* testsuite/20_util/function_objects/bind_front/1.cc: Likewise.
* testsuite/20_util/function_objects/bind_front/111327.cc: Likewise.
Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
Reviewed-by: Patrick Palka <ppalka@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
Tomasz Kamiński [Thu, 21 Aug 2025 16:00:25 +0000 (18:00 +0200)]
libstdc++: Specialize _Never_valueless_alt for jthread, stop_token and stop_source
The move constructors for stop_source and stop_token are equivalent to
copying and clearing the raw pointer, as they are wrappers for a
counted-shared state.
For jthread, the move constructor performs a member-wise move of stop_source
and thread. While std::thread could also have a _Never_valueless_alt
specialization due to its inexpensive move (only moving a handle), doing
so now would change the ABI. This patch takes the opportunity to correct
this behavior for jthread before the C++20 API is marked stable.
Enable unroll in the vectorizer when there's reduction for FMA/DOT_PROD_EXPR/SAD_EXPR
The patch unrolls the vectorized loop when there are
FMA/DOT_PROD_EXPR/SAD_EXPR reductions; this breaks the cross-iteration
dependence and enables more parallelism (since vectorization will also
enable partial sums).
When there's gather/scatter or scalarization in the loop, don't do the
unroll since the performance bottleneck is not at the reduction.
The unroll factor is set according to the FMA/DOT_PROD_EXPR/SAD_EXPR formula
CEIL (latency * throughput, num_of_reduction)
i.e. for FMA, latency is 4 and throughput is 2, so if there's 1 FMA for the
reduction then the unroll factor is 2 * 4 / 1 = 8.
There's also a vect_unroll_limit; the final suggested_unroll_factor is
set as MIN (vect_unroll_limit, 8).
The vect_unroll_limit is mainly for register pressure, to avoid too many
spills.
Ideally, all instructions in the vectorized loop should be used to
determine the unroll_factor with their (latency * throughput) / number,
but that would be too much for this patch and might just be GIGO, so the
patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR and
SAD_EXPR.
Note that when DOT_PROD_EXPR is not natively supported,
m_num_reduction += 3 * count, which almost always prevents unrolling.
There's a performance boost for simple benchmarks with a DOT_PROD_EXPR/FMA
chain, and a slight improvement in SPEC2017 performance.
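A worked sketch of the heuristic (the numbers are the FMA example from
above; the helper name is made up):

static int
suggested_unroll_factor (int latency, int throughput, int n_reductions,
                         int vect_unroll_limit)
{
  /* CEIL (latency * throughput, n_reductions).  */
  int factor = (latency * throughput + n_reductions - 1) / n_reductions;
  return factor < vect_unroll_limit ? factor : vect_unroll_limit;
}

/* With latency 4, throughput 2 and one FMA reduction this gives
   CEIL (8, 1) = 8, then capped by vect_unroll_limit.  */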
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
Add new members m_num_reduc and m_prefer_unroll.
(ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and
m_num_reduc.
(ix86_vector_costs::finish_cost): Determine
m_suggested_unroll_factor with consideration of
reduc_lat_mult_thr, m_num_reduc and
ix86_vect_unroll_limit.
* config/i386/i386.h (enum ix86_reduc_unroll_factor): New
enum.
(processor_costs): Add reduc_lat_mult_thr and
vect_unroll_limit.
* config/i386/x86-tune-costs.h: Initialize
reduc_lat_mult_thr and vect_unroll_limit.
* config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect_unroll-1.c: New test.
* gcc.target/i386/vect_unroll-2.c: New test.
* gcc.target/i386/vect_unroll-3.c: New test.
* gcc.target/i386/vect_unroll-4.c: New test.
* gcc.target/i386/vect_unroll-5.c: New test.
RISC-V: Add pattern for reverse floating-point divide
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a div RTL instruction. The vec_duplicate is the
dividend operand.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v2,fa0
vfdiv.vv v1,v2,v1
After, we get only one:
vfrdiv.vf v1,v1,fa0
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*vfrdiv_vf_<mode>): Add new pattern to
combine vec_duplicate + vfdiv.vv into vfrdiv.vf.
* config/riscv/vector.md (@pred_<optab><mode>_reverse_scalar): Allow VLS
modes.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfrdiv.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: Add support for reverse
variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: Add data for
reverse variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f64.c: New test.
Tamar Christina [Tue, 26 Aug 2025 12:10:10 +0000 (13:10 +0100)]
AArch64: extend cost model to cost outer loop vect where the inner loop is invariant [PR121290]
Consider the example:
void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < 4; ++i)
    {
      int res = 0;
      for (int j = 0; j < 100; ++j)
        res += y[j] * z[i];
      x[i] = res;
    }
}
Vectorizing the outer loop here is not useful, because by doing outer-loop
vectorization we're performing less work per iteration than we would have
had we done inner-loop vectorization and simply unrolled the inner loop.
This patch teaches the cost model that if all the leaves are invariant, it
should adjust the loop cost by multiplying with VF, since every vector
iteration has at least one lane really just doing 1 scalar operation.
There are a couple of ways we could have solved this, one is to increase the
unroll factor to process more iterations of the inner loop. This removes the
need for the broadcast, however we don't support unrolling the inner loop within
the outer loop. We only support unrolling by increasing the VF, which would
affect the outer loop as well as the inner loop.
We also don't directly support costing inner-loop vs outer-loop vectorization,
and as such we're left trying to predict/steer the cost model ahead of time to
what we think should be profitable. This patch attempts to do so using a
heuristic which penalizes the outer-loop vectorization.
We now cost the loop as
note: Cost model analysis:
Vector inside of loop cost: 2000
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar iteration cost: 300
Scalar outside cost: 0
Vector outside cost: 4
prologue iterations: 0
epilogue iterations: 0
missed: cost model: the vector iteration cost = 2000 divided by the scalar iteration cost = 300 is greater or equal to the vectorization factor = 4.
missed: not vectorized: vectorization not profitable.
missed: not vectorized: vector version will never be profitable.
missed: Loop costings may not be worthwhile.
Jeff Law [Tue, 26 Aug 2025 12:10:00 +0000 (06:10 -0600)]
Fix RISC-V bootstrap
Recent changes from Kito have an unused parameter. On the assumption that he's
going to likely want it as part of the API, I've simply removed the parameter's
name until such time as Kito needs it.
This should restore bootstrapping to the RISC-V port. Committing now rather
than waiting for the CI system given bootstrap builds currently fail.
* config/riscv/riscv.cc (riscv_arg_partial_bytes): Remove name
from unused parameter.
Richard Earnshaw [Tue, 26 Aug 2025 10:55:29 +0000 (11:55 +0100)]
arm: testsuite: make gcc.target/arm/bics_3.c generate bics again
The compiler is getting too smart! But this test is really intended
to test that we generate BICS instead of BIC+CMP, so make the test use
something that we can't subsequently fold away into a bit manipulation
of a store-flag value.
I've also added a couple of extra tests, so we now cover both the
cases where we fold the result away and where that cannot be done.
Also add a test that we don't generate a compare against 0, since
that's really part of what this test is covering.
gcc/testsuite:
* gcc.target/arm/bics_3.c: Add some additional tests that
cannot be folded to a bit manipulation.
Richard Biener [Tue, 26 Aug 2025 07:04:36 +0000 (09:04 +0200)]
Compute vect_reduc_type off SLP node instead of stmt-info
The following changes the vect_reduc_type API to work on the SLP node.
The API is only used from the aarch64 backend, so all changes are there.
In particular I noticed aarch64_force_single_cycle is invoked even
for scalar costing (where the flag tested isn't computed yet), I
figured in scalar costing all reductions are a single cycle.
* tree-vectorizer.h (vect_reduc_type): Get SLP node as argument.
* config/aarch64/aarch64.cc (aarch64_sve_in_loop_reduction_latency):
Take SLP node as argument and adjust.
(aarch64_in_loop_reduction_latency): Likewise.
(aarch64_detect_vector_stmt_subtype): Adjust.
(aarch64_vector_costs::count_ops): Likewise. Treat reductions
during scalar costing as single-cycle.
Richard Biener [Tue, 26 Aug 2025 08:34:01 +0000 (10:34 +0200)]
tree-optimization/121659 - bogus swap of reduction operands
The following addresses a bogus swapping of SLP operands of a
reduction operation which gets STMT_VINFO_REDUC_IDX out of sync
with the SLP operand order. In fact the most obvious mistake is
that we simply swap operands on the first stmt even when
there's no difference in the comparison operators (for == and !=
at least). But there are more latent issues that I noticed and
fixed up in the process.
PR tree-optimization/121659
* tree-vect-slp.cc (vect_build_slp_tree_1): Do not allow
matching up comparison operators by swapping if that would
disturb STMT_VINFO_REDUC_IDX. Make sure to only
actually mark operands for swapping when there was a
mismatch and we're not processing the first stmt.
Richard Biener [Mon, 25 Aug 2025 12:40:27 +0000 (14:40 +0200)]
Remove STMT_VINFO_REDUC_VECTYPE_IN
This was added when invariants/externals outside of SLP didn't have
an easily accessible vector type. Now it's redundant so the
following removes it.
* tree-vectorizer.h (_stmt_vec_info::reduc_vectype_in): Remove.
(STMT_VINFO_REDUC_VECTYPE_IN): Likewise.
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Get
at the input vectype via the SLP node child.
(vectorizable_lane_reducing): Likewise.
(vect_transform_reduction): Likewise.
(vectorizable_reduction): Do not set STMT_VINFO_REDUC_VECTYPE_IN.
Jakub Jelinek [Tue, 26 Aug 2025 04:43:39 +0000 (06:43 +0200)]
i386: Fix up recent changes to use GFNI for rotates/shifts [PR121658]
The vgf2p8affineqb_<mode><mask_name> pattern uses "register_operand"
predicate for the first input operand, so using "general_operand"
for the rotate operand passed to it leads to ICEs, and so does
the "nonimmediate_operand" in the <insn>v16qi3 define_expand.
The following patch fixes it by using "register_operand" in the former
case (that pattern is TARGET_GFNI only) and using force_reg in
the latter case (the pattern is TARGET_XOP || TARGET_GFNI and for XOP
we can handle MEM operand).
The rest of the changes are small formatting tweaks or use of const0_rtx
instead of GEN_INT (0).
2025-08-26 Jakub Jelinek <jakub@redhat.com>
PR target/121658
* config/i386/sse.md (<insn><mode>3 any_shift): Use const0_rtx
instead of GEN_INT (0).
(cond_<insn><mode> any_shift): Likewise. Formatting fix.
(<insn><mode>3 any_rotate): Use register_operand predicate instead of
general_operand for match_operand 1. Use const0_rtx instead of
GEN_INT (0).
(<insn>v16qi3 any_rotate): Use force_reg on operands[1]. Formatting
fix.
* config/i386/i386.cc (ix86_shift_rotate_cost): Comment formatting
fixes.
Pan Li [Sat, 23 Aug 2025 04:55:50 +0000 (12:55 +0800)]
RISC-V: Combine vec_duplicate + vmacc.vv to vmacc.vx on GR2VR cost
This patch combines vec_duplicate + vmacc.vv into vmacc.vx, as in the
example code below. The related pattern will depend on the cost of the
vec_duplicate from GR2VR: late-combine will take action if the cost of
GR2VR is zero, and reject the combination if the GR2VR cost is greater
than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_TERNARY_CASE_0(T, OP_1, OP_2, NAME) \
void \
test_vx_ternary_##NAME##_##T##_case_0 (T * restrict vd, T * restrict vs2, \
                                       T rs1, unsigned n) \
{ \
  for (unsigned i = 0; i < n; i++) \
    vd[i] = vd[i] OP_2 vs2[i] OP_1 rs1; \
}
* config/riscv/vector.md (@pred_mul_plus_vx_<mode>): Add new pattern to
generate vmacc rtl.
(*pred_macc_<mode>_scalar_undef): Ditto.
* config/riscv/autovec-opt.md (*vmacc_vx_<mode>): Add new
pattern to match the vmacc vx combine.
Jakub Jelinek [Mon, 25 Aug 2025 22:28:10 +0000 (00:28 +0200)]
omp-expand: Initialize fd->loop.n2 if needed for the zero iter case [PR121453]
When expand_omp_for_init_counts is called from expand_omp_for_generic,
zero_iter1_bb is NULL and the code always creates a new bb in which it
clears fd->loop.n2 var (if it is a var), because it can dominate code
with lastprivate guards that use the var.
When called from other places, zero_iter1_bb is non-NULL and so we don't
insert the clearing (and can't, because the same bb is used also for the
non-zero iterations exit and in that case we need to preserve the iteration
count). Clearing is also not necessary when e.g. the outermost collapsed
loop has a constant non-zero number of iterations; in that case we have
already initialized the var earlier. The following patch makes sure to clear
it if it hasn't been initialized yet before the first check for zero iterations.
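A hypothetical testcase shape for this situation: a collapsed loop whose
iteration count can be zero at run time, where lastprivate guard code can
read the variable backing fd->loop.n2:

void
f (int n, int m, int *out)
{
  int last = 0;
  #pragma omp parallel for collapse(2) lastprivate(last)
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
      last = i * m + j;
  *out = last;
}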
2025-08-26 Jakub Jelinek <jakub@redhat.com>
PR middle-end/121453
* omp-expand.cc (expand_omp_for_init_counts): Clear fd->loop.n2
before first zero count check if zero_iter1_bb is non-NULL upon
entry and fd->loop.n2 has not been written yet.
David Faust [Wed, 6 Aug 2025 16:24:40 +0000 (09:24 -0700)]
ctf: avoid overflow for array num elements [PR121411]
CTF array encoding uses uint32 for number of elements. This means there
is a hard upper limit on array types which the format can represent.
GCC internally was also using a uint32_t for this, which would overflow
when translating from DWARF for arrays with more than UINT32_MAX
elements. Use an unsigned HOST_WIDE_INT instead to fetch the array
bound, and fall back to CTF_K_UNKNOWN if the array cannot be
represented in CTF.
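A minimal sketch of the failure mode (on a 64-bit host): an array type
whose element count does not fit in 32 bits.

/* More than UINT32_MAX elements; the CTF array encoding cannot
   represent the count, so the type now degrades to CTF_K_UNKNOWN
   instead of overflowing.  */
static char huge[1ULL << 33];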
PR debug/121411
gcc/
* dwarf2ctf.cc (gen_ctf_subrange_type): Use unsigned HWI for
array_num_elements. Fall back to CTF_K_UNKNOWN if the array
type has too many elements for CTF to represent.
After changing the return type of remove_prop_source_from_use,
forward_propagate_into_comparison will never return 2. So boolify
forward_propagate_into_comparison.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (forward_propagate_into_comparison): Boolify.
(pass_forwprop::execute): Don't handle return of 2 from
forward_propagate_into_comparison.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Sun, 24 Aug 2025 20:33:21 +0000 (13:33 -0700)]
forwprop: Remove return type of remove_prop_source_from_use
Since r5-4705-ga499aac5dfa5d9, remove_prop_source_from_use has always
returned false. This removes the return type of remove_prop_source_from_use
and cleans up the usage of remove_prop_source_from_use.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (remove_prop_source_from_use): Remove
return type.
(forward_propagate_into_comparison): Update dealing with
no return type of remove_prop_source_from_use.
(forward_propagate_into_gimple_cond): Likewise.
(simplify_permutation): Likewise.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Sun, 24 Aug 2025 06:47:22 +0000 (23:47 -0700)]
forwprop: Mark the old switch index for (maybe) dceing
While looking at this code I noticed that we don't remove
the old switch index assignment, whose only use was the switch,
after the switch is modified in simplify_gimple_switch.
This fixes that by marking the old switch index for the dce worklist.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (simplify_gimple_switch): Add simple_dce_worklist
argument. Mark the old index when doing the replacement.
(pass_forwprop::execute): Update call to simplify_gimple_switch.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Fri, 15 Aug 2025 23:17:35 +0000 (16:17 -0700)]
Rewrite bool loads for undefined case [PR121279]
Just like r16-465-gf2bb7ffe84840d8 but this time
instead of a VCE there is a full on load from a boolean.
This showed up when trying to remove the extra copy
in the testcase from the revision mentioned above (pr120122-1.c).
So when moving loads from a boolean type from being conditional
to non-conditional, the load needs to become a full load and then
be cast into a bool so that the upper bits are correct.
Bitfield loads will always do the truncation so they don't need to
be rewritten. Non-boolean types always do the truncation too.
What we do is wrap the original reference with a VCE, which causes
the full load, and then do a cast to do the truncation. Using
fold_build1 with VCE will do the correct thing if there is a secondary
VCE, and will also fold if this was just a plain MEM_REF, so there is
no need to handle those 2 cases specially either.
Changes since v1:
* v2: Use VIEW_CONVERT_EXPR instead of doing a manual load.
Accept all non mode precision loads rather than just
boolean ones.
* v3: Move back to checking boolean type. Don't handle BIT_FIELD_REF.
Add asserts for IMAG/REAL_PART_EXPR.
Bootstrapped and tested on x86_64-linux-gnu.
PR tree-optimization/121279
gcc/ChangeLog:
* gimple-fold.cc (gimple_needing_rewrite_undefined): Return
true for non mode precision boolean loads.
(rewrite_to_defined_unconditional): Handle non mode precision loads.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr121279-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Tue, 19 Aug 2025 21:52:18 +0000 (14:52 -0700)]
LIM: Manually put uninit decl into ssa
When working on PR121279, I noticed that lim
would create an uninitialized decl and marking
it with supression for uninitialization warning.
This is fine but then into ssa would just call
get_or_create_ssa_default_def on that new decl which
could in theory take some extra compile time to figure
that out.
Plus when doing the rewriting for undefinedness, there
would now be a VCE around the decl. This means the ssa
name is kept around and not propagated in some cases.
So instead this patch manually calls get_or_create_ssa_default_def
to get the "uninitalized" ssa name for this decl and
no longer needs the write into ssa nor for undefined ness.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-loop-im.cc (execute_sm): Call
get_or_create_ssa_default_def for the new uninitialized
decl.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
xtensa: Make use of compact insn definition syntax for insns that have multiple alternatives
The use of compact syntax makes the relationship between asm output,
operand constraints, and insn attributes easier to understand and modify,
especially for "mov<mode>_internal".
* config/xtensa/xtensa.md
(The auxiliary define_split for *masktrue_const_bitcmpl):
Use a more concise function call, i.e.,
(1 << GET_MODE_BITSIZE (mode)) - 1 is equivalent to
GET_MODE_MASK (mode).
* config/xtensa/xtensa.md (mode_bits):
New mode attribute.
(zero_extend<mode>si2): Use the appropriate mode iterator and
attribute to unify "zero_extend[hq]isi2" to this description.
Jakub Jelinek [Mon, 25 Aug 2025 14:29:17 +0000 (16:29 +0200)]
c++: Implement C++ CWG3048 - Empty destructuring expansion statements
The following patch implements the proposed resolution of
https://cplusplus.github.io/CWG/issues/3048.html
Instead of rejecting a structured binding size of 0, it just builds a normal
decl rather than a structured binding declaration.
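A sketch of the affected construct (assuming the C++26 P1306
expansion-statement syntax; the empty type makes the structured binding
size 0):

struct Empty { };

constexpr int
count_members ()
{
  int n = 0;
  template for (auto member : Empty{}) // size 0: now valid, expands to
    ++n;                               // zero iterations per CWG3048
  return n;
}
static_assert (count_members () == 0);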
2025-08-25 Jakub Jelinek <jakub@redhat.com>
* pt.cc (finish_expansion_stmt): Implement C++ CWG3048
- Empty destructuring expansion statements. Don't error for
destructuring expansion stmts if sz is 0, don't call
fit_decomposition_lang_decl if n is 0 and pass NULL rather than
this_decomp to cp_finish_decl.
* g++.dg/cpp26/expansion-stmt15.C: Don't expect error on
destructuring expansion stmts with structured binding size 0.
* g++.dg/cpp26/expansion-stmt21.C: New test.
* g++.dg/cpp26/expansion-stmt22.C: New test.
Jakub Jelinek [Mon, 25 Aug 2025 14:27:35 +0000 (16:27 +0200)]
c++: Check for *jump_target earlier in cxx_bind_parameters_in_call [PR121601]
The following testcase ICEs, because the
/* Check we aren't dereferencing a null pointer when calling a non-static
   member function, which is undefined behaviour.  */
if (i == 0 && DECL_OBJECT_MEMBER_FUNCTION_P (fun)
    && integer_zerop (arg)
    /* But ignore calls from within compiler-generated code, to handle
       cases like lambda function pointer conversion operator thunks
       which pass NULL as the 'this' pointer.  */
    && !(TREE_CODE (t) == CALL_EXPR && CALL_FROM_THUNK_P (t)))
  {
    if (!ctx->quiet)
      error_at (cp_expr_loc_or_input_loc (x),
                "dereferencing a null pointer");
    *non_constant_p = true;
  }
checking is done before testing if (*jump_target). Especially when
throws (jump_target) is true, arg can be (and is on this testcase) NULL_TREE,
so calling integer_zerop on it ICEs.
Fixed by moving the if (*jump_target) test earlier.
2025-08-25 Jakub Jelinek <jakub@redhat.com>
PR c++/121601
* constexpr.cc (cxx_bind_parameters_in_call): Move break
if *jump_target before the check for null this object pointer.
Richard Biener [Mon, 25 Aug 2025 09:02:52 +0000 (11:02 +0200)]
tree-optimization/121638 - missed SLP discovery of live induction
The following fixes a missed SLP discovery of a live induction.
Our pattern matching of those fails because of the PR81529 fix
which I think was misguided and should now no longer be relevant.
So this essentially reverts that fix. I have added a GIMPLE
testcase to increase the chance the particular IL is preserved
through the future.
This shows that the way we make some IVs live because of early-break
isn't quite correct, so I had to preserve a hack here. Hopefully
to be investigated at some point.
PR tree-optimization/121638
* tree-vect-stmts.cc (process_use): Do not make induction
PHI backedge values relevant.
Kito Cheng [Mon, 18 Aug 2025 13:42:59 +0000 (21:42 +0800)]
RISC-V: Replace deprecated FUNCTION_VALUE/LIBCALL_VALUE macros with target hooks
The FUNCTION_VALUE and LIBCALL_VALUE macros are deprecated in favor of
the TARGET_FUNCTION_VALUE and TARGET_LIBCALL_VALUE target hooks. This
patch replaces the macro definitions with proper target hook implementations.
This change is also a preparatory step for VLS calling convention support,
which will require additional information that is more easily handled
through the target hook interface.
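For illustration, the general shape of such a conversion (a sketch, not
the exact riscv.cc code; the hook signatures are the standard ones, but
the simplified bodies ignore aggregates and floating-point returns):

static rtx
riscv_function_value (const_tree ret_type, const_tree fn_decl_or_type,
                      bool outgoing ATTRIBUTE_UNUSED)
{
  /* Decide the return-value location from the type, as the old
     FUNCTION_VALUE macro did.  */
  return gen_rtx_REG (TYPE_MODE (ret_type), GP_RETURN);
}

static rtx
riscv_libcall_value (machine_mode mode, const_rtx fun ATTRIBUTE_UNUSED)
{
  return gen_rtx_REG (mode, GP_RETURN);
}

#undef TARGET_FUNCTION_VALUE
#define TARGET_FUNCTION_VALUE riscv_function_value
#undef TARGET_LIBCALL_VALUE
#define TARGET_LIBCALL_VALUE riscv_libcall_value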
gcc/ChangeLog:
* config/riscv/riscv-protos.h (riscv_init_cumulative_args): Change
fntype parameter from tree to const_tree.
* config/riscv/riscv.cc (riscv_init_cumulative_args): Likewise.
(riscv_function_value): Replace with new implementation that
conforms to TARGET_FUNCTION_VALUE hook signature.
(riscv_libcall_value): New function implementing TARGET_LIBCALL_VALUE.
(TARGET_FUNCTION_VALUE): Define.
(TARGET_LIBCALL_VALUE): Define.
* config/riscv/riscv.h (FUNCTION_VALUE): Remove.
(LIBCALL_VALUE): Remove.
Andi Kleen [Mon, 4 Aug 2025 00:35:39 +0000 (17:35 -0700)]
Use x86 GFNI for vectorized constant byte shifts/rotates
The GFNI AVX gf2p8affineqb instruction can be used to implement
vectorized byte shifts or rotates. This patch uses them to implement
shift and rotate patterns to allow the vectorizer to use them.
Previously AVX couldn't do rotates (except with XOP) and had to handle
8-bit shifts with a half-throughput 16-bit shift.
This is only implemented for constant shifts. In theory it could
be used with a lookup table for variable shifts, but it's unclear
if it's worth it.
The vectorizer cost model could be improved, but seems to work for now.
It doesn't model the true latencies of the instructions. Also it doesn't
account for the memory loading of the mask, assuming that for a loop
it will be loaded outside the loop.
The instructions would also support more complex patterns
(e.g. arbitrary bit movement or inversions), so some of the tricks
applied to ternlog could be applied here too to collapse
more code. It's trickier because the input patterns
can be much longer since they can apply to every bit individually. I didn't
attempt any of this.
There's currently no test case for the masked/cond_ variants, they seem
to be difficult to trigger with the vectorizer. Suggestions for a test
case for them welcome.
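For reference, the kind of loop this enables (assuming e.g.
-O2 -mgfni -mavx2): a constant byte rotate that the vectorizer can now
implement with vgf2p8affineqb instead of widening 16-bit shifts:

void
rotl3 (unsigned char *restrict dst, const unsigned char *restrict src,
       int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char) ((src[i] << 3) | (src[i] >> 5));
}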
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_vgf2p8affine_shift_matrix):
New function to lookup shift/rotate matrixes for gf2p8affine.
* config/i386/i386-protos.h (ix86_vgf2p8affine_shift_matrix):
Declare new function.
* config/i386/i386.cc (ix86_shift_rotate_cost): Add cost model
for shift/rotate implemented using gf2p8affine.
* config/i386/sse.md (VI1_AVX512_3264): New mode iterator.
(<insn><mode>3): Add GFNI case for shift patterns.
(cond_<insn><mode>3): New pattern.
(<insn><mode>3<mask_name>): Dito.
(<insn>v16qi): New rotate pattern to handle XOP V16QI case
and GFNI.
(rotl<mode>3, rotr<mode>3): Exclude V16QI case.
gcc/testsuite/ChangeLog:
* gcc.target/i386/shift-gf2p8affine-1.c: New test
* gcc.target/i386/shift-gf2p8affine-2.c: New test
* gcc.target/i386/shift-gf2p8affine-3.c: New test
* gcc.target/i386/shift-v16qi-4.c: New test
* gcc.target/i386/shift-gf2p8affine-5.c: New test
* gcc.target/i386/shift-gf2p8affine-6.c: New test
* gcc.target/i386/shift-gf2p8affine-7.c: New test
Jeff Law [Mon, 25 Aug 2025 01:55:44 +0000 (19:55 -0600)]
Fix invalid right shift count with recent ifcvt changes
I got too clever trying to simplify the right shift computation in my recent
ifcvt patch. Interestingly enough, I haven't seen anything but the Linaro CI
configuration actually trip the problem, though the code is clearly wrong.
The problem I was trying to avoid was the leading zeros when calling clz on a
HWI when the real object is just, say, 32 bits.
The net is we get a right shift count of "2" when we really wanted a right
shift count of 30. That causes the execution aspect of bics_3 to fail.
The scan failures are due to creating slightly more efficient code. The new
code sequences don't need to use conditional execution for selection and thus
we can use bic rather than bics, which requires a twiddle in the scan.
I reviewed recent bug reports and haven't seen one for this issue. So no new
testcase as this is covered by the armv7 testsuite in the right configuration.
Bootstrapped and regression tested on x86_64, also verified it fixes the Linaro
reported CI failure and verified the crosses are still happy. Pushing to the
trunk.
gcc/
* ifcvt.cc (noce_try_sign_bit_splat): Fix right shift computation.
H.J. Lu [Sat, 23 Aug 2025 19:50:33 +0000 (12:50 -0700)]
x86: Compile noplt-(g|l)d-1.c with -mtls-dialect=gnu
Compile noplt-gd-1.c and noplt-ld-1.c with -mtls-dialect=gnu to support
the --with-tls=gnu2 configure option since they scan the assembly output
for the __tls_get_addr call which is generated by -mtls-dialect=gnu.
Sam James [Fri, 22 Aug 2025 23:54:01 +0000 (00:54 +0100)]
i386: wire up --with-tls to control -mtls-dialect= default
Allow passing --with-tls= at configure-time to control the default value
of -mtls-dialect= for i386 and x86_64. The default itself (gnu) is not changed
unless --with-tls= is passed.
--with-tls= is already wired up for ARM and RISC-V.
gcc/ChangeLog:
PR target/120933
* config.gcc (supported_defaults): Add tls for i386, x86_64.
* config/i386/i386.h (host_detect_local_cpu): Add tls.
* doc/install.texi: Document --with-tls= for i386, x86_64.
John Ericson [Sat, 23 Aug 2025 02:24:56 +0000 (22:24 -0400)]
driver: Rework for_each_path using C++
The old C style was cumbersome, making one responsible for manually
creating and passing a closure in two parts (a separate function and a
*_info class for the closed-over variables).
Also thanks to templates we can
- make the return type polymorphic, to avoid casting pointee types.
Note that `struct spec_path` was *not* converted because it is used
multiple times. We could still convert to a lambda, but we would want to
put the for_each_path call with that lambda inside a separate function
anyways, to support the multiple callers. Unlike the other two
refactors, it is not clear that this one would make anything shorter.
Instead, I define the `operator()` explicitly. Keeping the explicit
struct gives us some nice "named arguments", versus the wrapper function
alternative, too.
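The rough shape of the new interface (a simplified, self-contained
sketch; the real for_each_path in gcc.cc takes more parameters):

#include <string>
#include <vector>

// Simplified stand-in for the driver's prefix list.
struct path_prefix { std::vector<std::string> paths; };

// The callback type is a template parameter, so capturing lambdas can be
// passed directly, and the result type is deduced from the callback
// (a null result means "keep searching").
template <typename CallbackT>
auto
for_each_path (const path_prefix &paths, CallbackT callback)
  -> decltype (callback (std::string ()))
{
  for (const auto &dir : paths.paths)
    if (auto result = callback (dir))
      return result;
  return {};
}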
gcc/ChangeLog:
* gcc.cc (for_each_path): Templated, to make passing lambdas
possible/easy/safe, and to have a polymorphic return type.
(struct add_to_obstack_info): Deleted, lambda captures replace
it.
(add_to_obstack): Moved to lambda in build_search_list.
(build_search_list): Has above lambda now.
(struct file_at_path_info): Deleted, lambda captures replace
it.
(file_at_path): Moved to lambda in find_a_file.
(find_a_file): Has above lambda now.
(struct spec_path_info): Renamed to just struct spec_path.
(struct spec_path): New name.
(spec_path): Renamed to spec_path::operator().
(spec_path::operator()): New name.
(do_spec_1): Updated for_each_path call sites.
Signed-off-by: John Ericson <git@JohnEricson.me>
Reviewed-by: Jason Merrill <jason@redhat.com>
Nathaniel Shead [Fri, 22 Aug 2025 05:15:01 +0000 (15:15 +1000)]
c++/modules: Provide definitions of synthesized methods outside their defining module [PR120499]
In the PR, we're getting a linker error from _Vector_impl's destructor
never getting emitted. This is because of a combination of factors:
1. in imp-member-4_a, the destructor is not used and so there is no
definition generated.
2. in imp-member-4_b, the destructor gets synthesized (as part of the
synthesis for Coll's destructor) but is not ODR-used and so does not
get emitted. Despite there being a definition provided in this TU,
the destructor is still considered imported and so isn't streamed
into the module body.
3. in imp-member-4_c, we need to ODR-use the destructor but we only got
a forward declaration from imp-member-4_b, so we cannot emit a body.
The point of failure here is step 2; this function has effectively been
declared in the imp-member-4_b module, and so we shouldn't treat it as
imported. This way we'll properly stream the body so that importers can
emit it.
PR c++/120499
gcc/cp/ChangeLog:
* method.cc (synthesize_method): Set the instantiating module.
gcc/testsuite/ChangeLog:
* g++.dg/modules/imp-member-4_a.C: New test.
* g++.dg/modules/imp-member-4_b.C: New test.
* g++.dg/modules/imp-member-4_c.C: New test.
Kishan Parmar [Fri, 22 Aug 2025 18:58:09 +0000 (00:28 +0530)]
rs6000: Add shift count guards to avoid undefined behavior [PR118890]
This patch adds missing guards on shift amounts to prevent UB when the
shift count equals or exceeds HOST_BITS_PER_WIDE_INT.
In an earlier patch (r16-2666-g647bd0a02789f1), shift counts were only checked
for nonzero but not for being within valid bounds. This patch tightens
those conditions by enforcing that shift counts are greater than zero
and less than HOST_BITS_PER_WIDE_INT.
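The shape of the guard being added (a sketch, not the exact rs6000.cc
hunks):

/* Shifting a HOST_WIDE_INT by a count outside [0, HOST_BITS_PER_WIDE_INT)
   is undefined behavior, so only shift for counts strictly in range.  */
if (shift > 0 && shift < HOST_BITS_PER_WIDE_INT)
  c = (unsigned HOST_WIDE_INT) c << shift;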
2025-08-23 Kishan Parmar <kishan@linux.ibm.com>
gcc/
PR target/118890
* config/rs6000/rs6000.cc (can_be_rotated_to_negative_lis): Add bounds
checks for shift counts to prevent undefined behavior.
(rs6000_emit_set_long_const): Likewise.
Jeff Law [Fri, 22 Aug 2025 17:53:27 +0000 (11:53 -0600)]
[PR rtl-optimization/120553] Improve selecting between constants based on sign bit test
While working to remove mvconst_internal I stumbled over a regression in
the code to handle signed division by a power of two.
In that sequence we want to select between 0, 2^n-1 by pairing a sign
bit splat with a subsequent logical right shift. This can be done
without branches or conditional moves.
Playing with it a bit made me realize there's a handful of selections we
can do based on a sign bit test. Essentially there's two broad cases.
Clearing bits after the sign bit splat: we have 0 or -1; if we clear
bits, the 0 stays as-is, but the -1 could easily turn into 2^n-1, ~(2^n-1),
or some small constants.
Setting bits after the sign bit splat: if we have 0 or -1, setting bits
leaves the -1 as-is, but the 0 can turn into 2^n, a small constant, etc.
essentially discovering conditional move forms of the selects and
rewriting them into something more efficient. That got out of control
pretty quickly and it relied on if-conversion to initially create the
conditional move.
The better solution is to actually discover the cases during
if-conversion itself. That catches cases that were previously being
missed, checks cost models, and is actually simpler since we don't have
to distinguish between things like ori and bseti, instead we just emit
the natural RTL and let the target figure it out.
In the ifcvt implementation we put these cases just before trying the
traditional conditional move sequences. Essentially these are a last
attempt before trying the generalized conditional move sequence.
This has been bootstrapped and regression tested on aarch64, riscv,
ppc64le, s390x, alpha, m68k, sh4eb, x86_64 and probably a couple others
I've forgotten. It's also been tested on the other embedded targets.
Obviously the new tests are risc-v specific, so that testing was
primarily to make sure we didn't ICE, generate incorrect code or regress
existing target-specific tests.
Raphael has some changes to attack this from the gimple direction as
well. I think the latest version of those is on me to push through
internal review.
PR rtl-optimization/120553
gcc/
* ifcvt.cc (noce_try_sign_bit_splat): New function.
(noce_process_if_block): Use it.
gcc/testsuite/
* gcc.target/riscv/pr120553-1.c: New test.
* gcc.target/riscv/pr120553-2.c: New test.
* gcc.target/riscv/pr120553-3.c: New test.
* gcc.target/riscv/pr120553-4.c: New test.
* gcc.target/riscv/pr120553-5.c: New test.
* gcc.target/riscv/pr120553-6.c: New test.
* gcc.target/riscv/pr120553-7.c: New test.
* gcc.target/riscv/pr120553-8.c: New test.
Richard Biener [Fri, 22 Aug 2025 12:29:49 +0000 (14:29 +0200)]
Fixups around reduction info and STMT_VINFO_REDUC_VECTYPE_IN
STMT_VINFO_REDUC_VECTYPE_IN exists on relevant reduction stmts, not
the reduction info. And STMT_VINFO_DEF_TYPE exists on the
reduction info. The following fixes up a few places.
* tree-vect-loop.cc (vectorizable_lane_reducing): Get
reduction info properly. Adjust checks according to
comments.
(vectorizable_reduction): Do not set STMT_VINFO_REDUC_VECTYPE_IN
on the reduc info.
(vect_transform_reduction): Query STMT_VINFO_REDUC_VECTYPE_IN
on the actual reduction stmt, not the info.
Pan Li [Wed, 13 Aug 2025 06:01:20 +0000 (14:01 +0800)]
RISC-V: Add testcase for scalar unsigned SAT_MUL form 3
Add run and asm check test cases for scalar unsigned SAT_MUL form 3.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/sat/sat_arith.h: Add test helper macros.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u16-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u32-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u64-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u16.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-4-u8-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u16-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u32-from-u64.rv32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u64-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u128.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u16.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u32.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u64.c: New test.
* gcc.target/riscv/sat/sat_u_mul-run-4-u8-from-u64.rv32.c: New test.
Richard Biener [Thu, 21 Aug 2025 11:41:29 +0000 (13:41 +0200)]
Use REDUC_GROUP_FIRST_ELEMENT less
REDUC_GROUP_FIRST_ELEMENT is often checked to see whether we are
dealing with a SLP reduction or a reduction chain. When we are
in the context of analyzing the reduction (so we are sure
the SLP instance we see is correct), then we can use the SLP
instance kind instead.
* tree-vect-loop.cc (get_initial_defs_for_reduction): Adjust
comment.
(vect_create_epilog_for_reduction): Get at the reduction
kind via the instance, re-use the slp_reduc flag instead
of checking REDUC_GROUP_FIRST_ELEMENT again.
Remove unreachable code.
(vectorizable_reduction): Compute a reduc_chain flag from
the SLP instance kind, avoid REDUC_GROUP_FIRST_ELEMENT
checks.
(vect_transform_cycle_phi): Likewise.
(vectorizable_live_operation): Check the SLP instance
kind instead of REDUC_GROUP_FIRST_ELEMENT.
Nathaniel Shead [Fri, 22 Aug 2025 03:47:43 +0000 (13:47 +1000)]
testsuite: Fix g++.dg/abi/mangle83.C for -fshort-enums
Linaro CI informed me that this test fails on ARM thumb-m7-hard-eabi.
This appears to be because the target defaults to -fshort-enums, and so
the mangled names are inaccurate.
This patch just disables the implicit type enum test for this case.
gcc/testsuite/ChangeLog:
* g++.dg/abi/mangle83.C: Disable implicit enum test for
-fshort-enums.
Richard Biener [Fri, 22 Aug 2025 08:39:33 +0000 (10:39 +0200)]
Decouple parloops from vect reduction infra some more
The following removes the use of STMT_VINFO_REDUC_* from parloops,
also fixing a mistake with analyzing double reductions which rely
on the outer loop vinfo so the inner loop is properly detected as
nested.
* tree-parloops.cc (parloops_is_simple_reduction): Pass
in double reduction inner loop LC phis and query that.
(parloops_force_simple_reduction): Similar, but set it.
Check for valid reduction types here.
(valid_reduction_p): Remove.
(gather_scalar_reductions): Adjust, fixup double
reduction inner loop processing.
Alexandre Oliva [Fri, 22 Aug 2025 00:46:22 +0000 (21:46 -0300)]
[arm] require armv7 support for [PR120424]
Without stating the architecture version required by the test, test
runs with options that are incompatible with the required
architecture version fail, e.g. -mfloat-abi=hard.
armv7 was not covered by the long list of arm variants in
target-supports.exp, so add it, and use it for the effective target
requirement and for the option.
for gcc/testsuite/ChangeLog
PR rtl-optimization/120424
* lib/target-supports.exp (arm arches): Add arm_arch_v7.
* g++.target/arm/pr120424.C: Require armv7 support. Use
dg-add-options arm_arch_v7 instead of explicit -march=armv7.
Dimitar Dimitrov [Sat, 16 Aug 2025 16:51:32 +0000 (19:51 +0300)]
pru: Add options to disable MUL/FILL/ZERO instructions
Older PRU core versions (e.g. in AM1808 SoC) do not support
XIN, XOUT, FILL, ZERO instructions. Add GCC command line options to
optionally disable generation of those instructions, so that code
can be executed on such older PRU cores.
gcc/ChangeLog:
* common/config/pru/pru-common.cc (TARGET_DEFAULT_TARGET_FLAGS):
Keep multiplication, FILL and ZERO instructions enabled by
default.
* config/pru/pru.md (prumov<mode>): Gate code generation on
TARGET_OPT_FILLZERO.
(mov<mode>): Ditto.
(zero_extendqidi2): Ditto.
(zero_extendhidi2): Ditto.
(zero_extendsidi2): Ditto.
(@pru_ior_fillbytes<mode>): Ditto.
(@pru_and_zerobytes<mode>): Ditto.
(@<code>di3): Ditto.
(mulsi3): Gate code generation on TARGET_OPT_MUL.
* config/pru/pru.opt: Add mmul and mfillzero options.
* config/pru/pru.opt.urls: Regenerate.
* config/rl78/rl78.opt.urls: Regenerate.
* doc/invoke.texi: Document new options.
Andrew Pinski [Wed, 20 Aug 2025 22:34:15 +0000 (15:34 -0700)]
c: Add folding of nullptr_t in some cases [PR121478]
The middle-end does not fully understand NULLPTR_TYPE. So it
gets confused a lot of the time when dealing with it.
This adds the folding that is similarly done in the C++ front-end already.
In some cases it should produce slightly better code as there is no
reason to load from a nullptr_t variable as it is always NULL.
The following is handled:
nullptr_v ==/!= nullptr_v -> true/false
(ptr)nullptr_v -> (ptr)0, nullptr_v
f(nullptr_v) -> f ((nullptr, nullptr_v))
The last one is for conversion inside ... .
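A small C23 illustration of these foldings (a hypothetical test shape,
not the new testcase):

#include <stddef.h>

int
g (nullptr_t np, int *p)
{
  if (np == np)        /* folds to true, no load from np */
    p = np;            /* conversion folds to (int *) 0 */
  return p == nullptr; /* folds to 1 */
}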
Bootstrapped and tested on x86_64-linux-gnu.
PR c/121478
gcc/c/ChangeLog:
* c-fold.cc (c_fully_fold_internal): Fold nullptr_t ==/!= nullptr_t.
* c-typeck.cc (convert_arguments): Handle conversion from nullptr_t
for varargs.
(convert_for_assignment): Handle conversions from nullptr_t to
pointer type specially.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr121478-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Jason Merrill [Thu, 21 Aug 2025 17:52:25 +0000 (13:52 -0400)]
c++: constexpr clobber of const [PR121068]
Since r16-3022, 20_util/variant/102912.cc was failing in C++20 and above due
to wrong errors about destruction modifying a const object; destruction is
OK.
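A reduced sketch of the pattern involved (modeled loosely on the variant
testcase; P2231 makes variant::emplace usable in constant expressions in
C++20):

#include <variant>

constexpr bool
f ()
{
  std::variant<const int, float> v(42);
  v.emplace<float>(1.0f); // ends the lifetime of the const int alternative
  return std::holds_alternative<float>(v);
}
static_assert (f ());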
PR c++/121068
gcc/cp/ChangeLog:
* constexpr.cc (cxx_eval_store_expression): Allow clobber of a const
object.
RISC-V: testsuite: Fix DejaGnu support for riscv_zvfh
Call check_effective_target_riscv_zvfh_ok rather than
check_effective_target_riscv_zvfh in vx_vf_*run-1-f16.c run tests and ensure
that they are actually run.
Also fix remove_options_for_riscv_zvfh.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmacc-run-1-f16.c: Call
check_effective_target_riscv_zvfh_ok rather than
check_effective_target_riscv_zvfh.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwnmacc-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwnmsac-run-1-f16.c: Likewise.
* lib/target-supports.exp (check_effective_target_riscv_zvfh_ok): Append
zvfh instead of v to march.
(remove_options_for_riscv_zvfh): Remove duplicate and
call remove_ rather than add_options_for_riscv_z_ext.
This PR is another bug in the rtl-ssa code to manage live-out uses.
It seems that this didn't get much coverage until recently.
In the testcase, late-combine first removed a register-to-register
move by substituting into all uses, some of which were in other EBBs.
This was done after checking make_uses_available, which (as expected)
says that single dominating definitions are available everywhere
that the definition dominates. But the update failed to add
appropriate live-out uses, so a later parallelisation attempt
tried to move the new destination into a later block.
gcc/
PR rtl-optimization/121619
* rtl-ssa/functions.h (function_info::commit_make_use_available):
Declare.
* rtl-ssa/blocks.cc (function_info::commit_make_use_available):
New function.
* rtl-ssa/changes.cc (function_info::apply_changes_to_insn): Use it.
gcc/testsuite/
PR rtl-optimization/121619
* gcc.dg/pr121619.c: New test.
Jonathan Wakely [Tue, 19 Aug 2025 17:02:53 +0000 (18:02 +0100)]
libstdc++: Use pthread_mutex_clocklock when TSan is active [PR121496]
This reverts r14-905-g3b7cb33033fbe6 which disabled the use of
pthread_mutex_clocklock when TSan is active. That's no longer needed,
because GCC has TSan interceptors for pthread_mutex_clocklock since GCC
15.1 and Clang has them since 18.1.0 (released March 2024).
The interceptor was added by https://github.com/llvm/llvm-project/pull/75713
libstdc++-v3/ChangeLog:
PR libstdc++/121496
* acinclude.m4 (GLIBCXX_CHECK_PTHREAD_MUTEX_CLOCKLOCK): Do not
use _GLIBCXX_TSAN in _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK macro.
* configure: Regenerate.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Jonathan Wakely [Tue, 19 Aug 2025 17:02:53 +0000 (18:02 +0100)]
libstdc++: Check _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK with #if [PR121496]
The change in r14-905-g3b7cb33033fbe6 to disable the use of
pthread_mutex_clocklock when TSan is active assumed that the
_GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK macro was always checked with #if
rather than #ifdef, which was not true.
This makes the checks use #if consistently.
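A generic illustration of the difference (the macro here is defined, but
to the value 0):

#define _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK 0

#ifdef _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK
/* Taken: the macro is defined, regardless of its value.  */
#endif

#if _GLIBCXX_USE_PTHREAD_MUTEX_CLOCKLOCK
/* Not taken: the macro evaluates to 0.  */
#endif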
libstdc++-v3/ChangeLog:
PR libstdc++/121496
* include/std/mutex (__timed_mutex_impl::_M_try_wait_until):
Change preprocessor condition to use #if instead of #ifdef.
(recursive_timed_mutex::_M_clocklock): Likewise.
* testsuite/30_threads/timed_mutex/121496.cc: New test.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Emit the TLS call after NOTE_INSN_BASIC_BLOCK, instead of before
NOTE_INSN_BASIC_BLOCK, to avoid:
x.c: In function ‘aout_16_write_syms’:
x.c:54:1: error: NOTE_INSN_BASIC_BLOCK is missing for block 3
54 | }
| ^
x.c:54:1: error: NOTE_INSN_BASIC_BLOCK 77 in middle of basic block 3
during RTL pass: x86_cse
x.c:54:1: internal compiler error: verify_flow_info failed
gcc/
PR target/121607
* config/i386/i386-features.cc (ix86_emit_tls_call): Emit the
TLS call after NOTE_INSN_BASIC_BLOCK in a basic block with only
a label.
gcc/testsuite/
PR target/121607
* gcc.target/i386/pr121607-1a.c: New test.
* gcc.target/i386/pr121607-1b.c: Likewise.
When I added this explicit specialization in r14-1433-gf150a084e25eaa I
used the wrong value for the number of mantissa digits (I used 112
instead of 113). Then when I refactored it in r14-1582-g6261d10521f9fd I
used the value calculated from the incorrect value (35 instead of 36).
Jonathan Wakely [Wed, 20 Aug 2025 15:50:12 +0000 (16:50 +0100)]
libstdc++: Suppress some more additional diagnostics [PR117294]
libstdc++-v3/ChangeLog:
PR c++/117294
* testsuite/20_util/optional/cons/value_neg.cc: Prune additional
output for C++20 and later.
* testsuite/20_util/scoped_allocator/69293_neg.cc: Match
additional error for C++20 and later.
Luc Grosheintz [Mon, 11 Aug 2025 20:14:55 +0000 (22:14 +0200)]
libstdc++: Simplify precomputed partial products in <mdspan>.
Prior to this commit, the partial products of static extents in <mdspan>
were computed in a loop that calls a function that computes the partial
product. The complexity was quadratic in the rank.
This commit removes the quadratic complexity.
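A minimal sketch of the single-pass computation (hypothetical names, not
the libstdc++ code; assumes rank > 0):

#include <array>
#include <cstddef>
#include <span> // std::dynamic_extent

template<std::size_t... _Extents>
  constexpr auto
  fwd_partial_prods()
  {
    constexpr std::size_t __exts[] = { _Extents... };
    std::array<std::size_t, sizeof...(_Extents)> __prods{};
    std::size_t __prod = 1;
    for (std::size_t __r = 0; __r < sizeof...(_Extents); ++__r)
      {
        __prods[__r] = __prod; // product of static extents before __r
        if (__exts[__r] != std::dynamic_extent)
          __prod *= __exts[__r];
      }
    return __prods;
  }

// e.g. fwd_partial_prods<3, std::dynamic_extent, 5>() == {1, 3, 3}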
libstdc++-v3/ChangeLog:
* include/std/mdspan (__static_prod): Delete.
(__fwd_partial_prods): Compute at compile-time in O(rank), not
O(rank**2).
(__rev_partial_prods): Ditto.
(__size): Inline __static_prod.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Luc Grosheintz [Mon, 11 Aug 2025 20:14:54 +0000 (22:14 +0200)]
libstdc++: Reduce size static storage for __fwd_prod in mdspan.
This fixes an oversight in a previous commit that improved mdspan
related code. Because __size doesn't use __fwd_prod, __fwd_prod(__rank)
is not needed anymore. Hence, one can shrink the size of
__fwd_partial_prods.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__fwd_partial_prods): Reduce size of the
array by 1 element.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
This patch changes the implementation of the insn to test whether the
result itself is negative or not, rather than the MSB of the result of
the ABS machine instruction. This eliminates the need to consider bit-
endianness and allows for longer branch distances.
/* example */
extern void foo(int);
void test0(int a) {
  if (a == -2147483648)
    foo(a);
}
void test1(int a) {
  if (a != -2147483648)
    foo(a);
}
Luc Grosheintz [Sun, 3 Aug 2025 20:57:30 +0000 (22:57 +0200)]
libstdc++: Replace numeric_limit with __int_traits in mdspan.
Using __int_traits avoids the need to include <limits> from <mdspan>.
This in turn should reduce the size of the pre-compiled <mdspan>.
Similar refactoring was carried out for PR92546. Unfortunately,
- There's no mismatching static extents, preventing any
short-circuiting.
- There's a comparison between dynamic and static extents.
- There's one trivial comparison: ... && 3 == 3.
Let E[i] denote the array of static extents, D[k] denote the array of
dynamic extents and k[i] be the index of the i-th extent in D.
(Naturally, k[i] is only meaningful if i is a dynamic extent).
The previous implementation results in assembly that's more or less a
literal translation of:
for (i = 0; i < 3; ++i)
e1 = E1[i] == -1 ? D1[k1[i]] : E1[i];
e2 = E2[i] == -1 ? D2[k2[i]] : E2[i];
if e1 != e2:
return false
return true;
The new pack-expansion implementation has two advantages:
- It eliminated the indirection D[k[i]], because k[i] is known at
compile time, saving us a comparison E[i] == -1 and conditionally
loading k[i].
- It eliminated the trivial condition 3 == 3.
The result is code that only loads the required values and performs
exactly the number of comparisons needed by the algorithm. It also
results in smaller object files. Therefore, this seems like a sensible
change. We've check several other examples, including fully statically
determined cases and high-rank examples. The example given above
illustrates the other cases well.
The constexpr condition:
if constexpr (!_S_is_compatible_extents<...>)
return false;
is no longer needed, because the optimizer correctly handles this case.
However, it's retained for clarity/certainty.
libstdc++-v3/ChangeLog:
* include/std/mdspan (extents::operator==): Replace loop with
pack expansion.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Luc Grosheintz [Sun, 3 Aug 2025 20:57:28 +0000 (22:57 +0200)]
libstdc++: Reduce indirection in extents::extent.
For both fully static and fully dynamic extents, the comparison
static_extent(i) == dynamic_extent
is known at compile time. As a result, extents::extent doesn't
need to perform the check at runtime.
An illustrative example is:
using E = std::extents<int, 3, 5, 7, 11, 13, 17>;
int required_span_size(const typename Layout::mapping<E>& m)
{ return m.required_span_size(); }
Prior to this commit, the generated code (at -O2) is a scalar loop, which
notably includes the check
308: 48 83 f8 ff cmp rax,0xffffffffffffffff
to assert that the static extent is indeed not -1. Note that at -O3 the
optimizer eliminates the comparison and generates a sequence of scalar
operations: lea, shl, add and mov. The aim of this commit is to
eliminate this comparison also at -O2. With the optimization applied,
eliminating the trivial comparison unlocks a new set of
optimizations, i.e. SIMD-vectorization. In particular, the loop has been
vectorized by loading the first four constants from aligned memory and the
first four strides from non-aligned memory, then computing the product
and reduction. It interleaves the above with computing 1 + 12*S[4] +
16*S[5] (as scalar operations) and then finishes the reduction.
A similar effect can be observed for fully dynamic extents.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::__all_static): New function.
(__mdspan::_StaticExtents::_S_is_dyn): Inline and eliminate.
(__mdspan::_ExtentsStorage::_S_is_dynamic): New method.
(__mdspan::_ExtentsStorage::_M_extent): Use _S_is_dynamic.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Luc Grosheintz [Sun, 3 Aug 2025 20:57:27 +0000 (22:57 +0200)]
libstdc++: Improve nearly fully dynamic extents in mdspan.
One previous commit optimized fully dynamic extents; and another
refactored __size such that __fwd_prod is valid for __r = 0, ..., rank
(exclusive).
Therefore, by noticing that __rev_prod (and __fwd_prod) never accesses
the first (or last) extent, one can avoid pre-computing partial products
of static extents in those cases, if all other extents are dynamic.
We check that the size of the reference object file decreases further
and the .rodata sections for
Luc Grosheintz [Sun, 3 Aug 2025 20:57:26 +0000 (22:57 +0200)]
libstdc++: Improve fully dynamic extents in mdspan.
In mdspan related code, for extents with no static extents, i.e. only
dynamic extents, the following simplifications can be made:
- The array of dynamic extents has size rank.
- The two arrays dynamic-index and dynamic-index-inv become
trivial, e.g. k[i] == i.
- All elements of the arrays __{fwd,rev}_partial_prods are 1.
This commit eliminates the arrays for dynamic-index, dynamic-index-inv
and __{fwd,rev}_partial_prods. It also removes the indirection k[i] == i
from the source code, which isn't as relevant because the optimizer is
(often) capable of eliminating the indirection.
To check if it's working we look at:
using E2 = std::extents<int, dyn, dyn, dyn, dyn>;
int stride_left_E2(const std::layout_left::mapping<E2>& m, size_t r)
{ return m.stride(r); }
In the generated code:
- There's no code to load the partial product of static extents.
- There's no indirection D[k[i]], it's just D[i] (as before).
On a test file which computes both mapping::stride(r) and
mapping::required_span_size, checking the static storage with
objdump -h
shows nothing (anymore) related to the NTTP _Extents, _StaticExtents,
__fwd_partial_prods or __rev_partial_prods. We also
check that the size of the reference object file (described three
commits prior) is reduced by a few percent, from 41.9kB to 39.4kB.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::__all_dynamic): New function.
(__mdspan::_StaticExtents::_S_dynamic_index): Convert to method.
(__mdspan::_StaticExtents::_S_dynamic_index_inv): Ditto.
(__mdspan::_StaticExtents): New specialization for fully dynamic
extents.
(__mdspan::__fwd_prod): New constexpr if branch to avoid
instantiating __fwd_partial_prods.
(__mdspan::__rev_prod): Ditto.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com>
Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
The methods layout_{left,right}::mapping::stride are defined as
\prod_{i = 0}^{r} E[i] (layout_left)
\prod_{i = r+1}^{n} E[i] (layout_right)
This is computed as the product of a precomputed static product and the
product of the required dynamic extents.
Disassembly shows that even for low-rank extents, i.e. rank == 1 and
rank == 2, with at least one dynamic extent, the generated code loads
two values and then runs the loop over at most one element, e.g. for
stride_left_d5 defined below.
If there are no dynamic extents, it simply loads the precomputed product
of static extents.
For rank == 1 the answer is the constant `1`; for rank == 2 it's either 1 or
extents.extent(k), with k == 0 for layout_left and k == 1 for
layout_right.
Consider,
using Ed = std::extents<int, dyn>;
int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t r)
{ return m.stride(r); }
using E3d = std::extents<int, 3, dyn>;
int stride_left_3d(const std::layout_left::mapping<E3d>& m, size_t r)
{ return m.stride(r); }
using Ed5 = std::extents<int, dyn, 5>;
int stride_left_d5(const std::layout_left::mapping<Ed5>& m, size_t r)
{ return m.stride(r); }
0000000000000090 <stride_left_3d>:
90: 48 83 fe 01 cmp rsi,0x1
94: 19 c0 sbb eax,eax
96: 83 e0 fe and eax,0xfffffffe
99: 83 c0 03 add eax,0x3
9c: c3 ret
00000000000000a0 <stride_left_d5>:
a0: b8 01 00 00 00 mov eax,0x1
a5: 48 85 f6 test rsi,rsi
a8: 74 02 je ac <stride_left_d5+0xc>
aa: 8b 07 mov eax,DWORD PTR [rdi]
ac: c3 ret
For rank == 1 it simply returns 1 (as expected). For rank == 2, it
either implements a branchless formula, or conditionally loads one
value. In all cases involving a dynamic extent, the new code clearly
does less work, both in terms of computation and loads.
In cases not involving a dynamic extent, it replaces loading one value
with a branchless sequence of four instructions.
This commit also refactors __size to not use any of the precomputed
arrays. This prevents instantiating __{fwd,rev}_partial_prods for
low-rank extents. This results in a further size reduction of a
reference object file (described two commits prior) by 9% from 46.0kB to
41.9kB.
In a prior commit we optimized __size to produce better object code by
precomputing the static products. This refactor enables the optimizer to
generate the same optimized code.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::__fwd_prod): Optimize
for rank <= 2.
(__mdspan::__rev_prod): Ditto.
(__mdspan::__size): Refactor to use a pre-computed product, not
a partial product.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com> Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Luc Grosheintz [Sun, 3 Aug 2025 20:57:24 +0000 (22:57 +0200)]
libstdc++: Precompute products of static extents.
Let E denote a multi-dimensional extent; n the rank of E; r = 0, ...,
n; E[i] the i-th extent; and D[k] be the (possibly empty) array of
dynamic extents.
To compute such a product, the generated code does the following:
1. Load the starting position k in the array of dynamic extents, and
return early if possible.
2. Load the partial product of static extents.
3. Compute \prod_{i = k}^{d - 1} D[i] in a loop, where d is the number
of dynamic extents.
It shows that the span used for passing in the dynamic extents is
completely eliminated; and because the product always runs to the end of
the array of dynamic extents, the compiler can eliminate the indirection
that would otherwise determine the end position in that array.
The analogous code is generated for layout_left.
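A minimal sketch of the precomputation (illustrative names; dynamic
extents contribute 1 to the static product):

#include <array>
#include <cstddef>

inline constexpr std::size_t dyn = std::size_t(-1);  // stand-in for dynamic_extent

// prods[r] = product of the static extents among E[0..r), counting each
// dynamic extent as 1. __fwd_prod(E, r) is then prods[r] times the
// product of the dynamic extents below r (the loop in step 3 above).
template<std::size_t... Exts>
constexpr auto fwd_partial_prods_sketch()
{
  constexpr std::size_t n = sizeof...(Exts);
  constexpr std::array<std::size_t, n> ext{Exts...};
  std::array<std::size_t, n + 1> prods{};
  prods[0] = 1;
  for (std::size_t r = 0; r < n; ++r)
    prods[r + 1] = prods[r] * (ext[r] == dyn ? 1 : ext[r]);
  return prods;
}

// e.g. for E2 below: {1, 3, 15, 15, 15, 105, 105, 1155}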
Next, consider
using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
int size2(const std::mdspan<double, E2>& md)
{ return md.size(); }
with the immediately preceding commit the generated code reduces to
const * D[0] * D[1] * D[2], where const is the product of all static
extents. That is, the loop to compute the product of dynamic extents has
been fully unrolled and all constants are perfectly precomputed.
The size of the object file described in the previous commit reduces
by 17% from 55.8kB to 46.0kB.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::__static_prod): New function.
(__mdspan::__fwd_partial_prods): Constexpr array of partial
forward products.
(__mdspan::__rev_partial_prods): Same for reverse partial
products.
(__mdspan::__static_extents_prod): Delete function.
(__mdspan::__extents_prod): Renamed from __exts_prod and refactored.
(__mdspan::__fwd_prod): Compute as the product of the
pre-computed static product and the product of the dynamic
extents.
(__mdspan::__rev_prod): Ditto.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com> Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Luc Grosheintz [Sun, 3 Aug 2025 20:57:23 +0000 (22:57 +0200)]
libstdc++: Reduce template instantiations in <mdspan>.
In mdspan related code involving static extents, often the IndexType is
part of the template parameters, even though it's not needed.
This commit extracts the parts of _ExtentsStorage not related to
IndexType into a separate class _StaticExtents.
It also prefers passing the array of static extents, instead of the
whole extents object where possible.
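The pattern, as a minimal sketch with invented names (the real classes
differ): the IndexType-independent data lives in a class keyed only on
the static extents, so it is instantiated once per shape rather than
once per (IndexType, shape) pair.

#include <array>
#include <cstddef>

inline constexpr std::size_t dyn = std::size_t(-1);

// Instantiated once per pack of static extents, shared by all IndexTypes.
template<std::size_t... Exts>
struct StaticExtentsSketch
{
  static constexpr std::size_t rank = sizeof...(Exts);
  static constexpr std::size_t rank_dynamic
    = ((Exts == dyn ? std::size_t{1} : std::size_t{0}) + ... + std::size_t{0});
  static constexpr std::array<std::size_t, rank> values{Exts...};
};

// Only the storage of the dynamic extents depends on IndexType, so the
// amount of code instantiated per IndexType stays small.
template<typename IndexType, std::size_t... Exts>
struct ExtentsStorageSketch
{
  using Static = StaticExtentsSketch<Exts...>;
  std::array<IndexType, Static::rank_dynamic> dyn_exts{};
};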
The size of an object file compiled with -O2 that instantiates
Layout::mapping<extents<IndexType, Indices...>>::stride
Layout::mapping<extents<IndexType, Indices...>>::required_span_size
for the product of
- eight IndexTypes,
- three Layouts,
- nine choices of Indices...
decreases by 19% from 69.2kB to 55.8kB.
libstdc++-v3/ChangeLog:
* include/std/mdspan (__mdspan::_StaticExtents): Extract non-IndexType
related code from _ExtentsStorage.
(__mdspan::_ExtentsStorage): Use _StaticExtents.
(__mdspan::__static_extents): Return reference to NTTP of _StaticExtents.
(__mdspan::__contains_zero): New overload.
(__mdspan::__exts_prod, __mdspan::__static_quotient): Use span to avoid
copying __sta_exts.
Reviewed-by: Tomasz Kamiński <tkaminsk@redhat.com> Signed-off-by: Luc Grosheintz <luc.grosheintz@gmail.com>
Richard Biener [Wed, 20 Aug 2025 12:07:34 +0000 (14:07 +0200)]
Merge BB and loop path in vect_analyze_stmt
We now have common patterns for most of the vectorizable_* calls, so
merge them. This also avoids calling vectorizable_early_exit for BB
vect and clarifies the signatures of it and vectorizable_phi.
* tree-vectorizer.h (vectorizable_phi): Take bb_vec_info.
(vectorizable_early_exit): Take loop_vec_info.
* tree-vect-loop.cc (vectorizable_phi): Adjust.
* tree-vect-slp.cc (vect_slp_analyze_operations): Likewise.
(vectorize_slp_instance_root_stmt): Likewise.
* tree-vect-stmts.cc (vectorizable_early_exit): Likewise.
(vect_transform_stmt): Likewise.
(vect_analyze_stmt): Merge the sequences of vectorizable_*
where common.
MAINTAINERS: Update my email address and stand down as AArch64 maintainer
Today is my last working day at Arm, so this patch switches my
MAINTAINERS entries to my personal email address. (It turns out
that I never updated some of the later entries...oops)
In order to avoid setting false expectations, and to try to avoid
getting in the way, I'm also standing down as an AArch64 maintainer,
effective EOB today. I might still end up reviewing the odd AArch64
patch under global reviewership though, depending on how things go :)
ChangeLog:
* MAINTAINERS: Update my email address and stand down as AArch64
maintainer.
gcc/fortran
PR fortran/84122
* parse.cc (parse_derived): PDT type parameters are not allowed
an explicit access specification and must appear before a
PRIVATE statement. If a PRIVATE statement is seen, mark all the
other components as PRIVATE.
PR fortran/85942
* simplify.cc (get_kind): Convert a PDT KIND component into a
specification expression using the default initializer.
gcc/testsuite/
PR fortran/84122
* gfortran.dg/pdt_38.f03: New test.
PR fortran/85942
* gfortran.dg/pdt_39.f03: New test.
Jason Merrill [Wed, 20 Aug 2025 03:15:20 +0000 (23:15 -0400)]
c++: pointer to auto member function [PR120757]
Here r13-1210 correctly changed &A<int>::foo to not be considered
type-dependent, but tsubst_expr of the OFFSET_REF got confused trying to
tsubst a type that involved auto. Fixed by getting the type from the
member rather than tsubst.
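A minimal sketch of the shape of the problem (an assumed reproducer,
not the actual testcase from the PR):

template<typename T>
struct A { auto foo() { return 42; } };

template<typename T>
auto addr()
{
  // Not type-dependent (since r13-1210), but the expression's type
  // still involves a not-yet-deduced auto at template definition time.
  return &A<int>::foo;
}

int main()
{
  A<int> a;
  return (a.*addr<void>())() == 42 ? 0 : 1;
}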
PR c++/120757
gcc/cp/ChangeLog:
* pt.cc (tsubst_expr) [OFFSET_REF]: Don't tsubst the type.
Marek Polacek [Wed, 20 Aug 2025 14:49:47 +0000 (10:49 -0400)]
c++: lambda capture and shadowing [PR121553]
P2036 says that this:
[x=1]{ int x; }
should be rejected, but with my P2036 change we started giving an error
for the attached testcase as well, breaking Dolphin. So let's
keep the error only for init-captures.
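Concretely, the intended behaviour after this fix (my reading of the
description above, sketched):

int main()
{
  int y = 0;
  // auto f = [x = 1]{ int x; };        // still an error: shadows an init-capture
  auto g = [y]{ int y = 2; return y; }; // accepted again: plain capture
  return g() - 2;
}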
Qing Zhao [Wed, 20 Aug 2025 15:56:06 +0000 (15:56 +0000)]
Provide new option -fdiagnostics-show-context=N for -Warray-bounds, -Wstringop-* warnings [PR109071,PR85788,PR88771,PR106762,PR108770,PR115274,PR117179]
'-fdiagnostics-show-context[=DEPTH]'
'-fno-diagnostics-show-context'
With this option, the compiler might print the interesting control
flow chain that guards the basic block of the statement which has
the warning. DEPTH is the maximum depth of the control flow chain.
Currently, the list of the impacted warning options includes
'-Warray-bounds', '-Wstringop-overflow', '-Wstringop-overread',
'-Wstringop-truncation', and '-Wrestrict'. More warning options
might be added to this list in future releases. The forms
'-fdiagnostics-show-context' and '-fno-diagnostics-show-context'
are aliases for '-fdiagnostics-show-context=1' and
'-fdiagnostics-show-context=0', respectively.
For example:
$ cat t.c
extern void warn(void);
static inline void assign(int val, int *regs, int *index)
{
if (*index >= 4)
warn();
*regs = val;
}
struct nums {int vals[4];};
void sparx5_set (int *ptr, struct nums *sg, int index)
{
int *val = &sg->vals[index];

assign(0,    ptr, &index);
assign(*val, ptr, &index);
}
$ gcc -Wall -O2 -c -o t.o t.c
t.c: In function ‘sparx5_set’:
t.c:12:23: warning: array subscript 4 is above array bounds of ‘int[4]’ [-Warray-bounds=]
12 | int *val = &sg->vals[index];
| ~~~~~~~~^~~~~~~
t.c:8:18: note: while referencing ‘vals’
8 | struct nums {int vals[4];};
| ^~~~
In the above, although the warning is correct in theory, the warning message
itself is confusing to the end user, since there is information that cannot
be connected to the source code directly.
It would be a nice improvement to add more information to the warning
message, reporting where such an index value comes from.
With the new option -fdiagnostics-show-context=1, the warning message for
the above test case is now:
$ gcc -Wall -O2 -fdiagnostics-show-context=1 -c -o t.o t.c
t.c: In function ‘sparx5_set’:
t.c:12:23: warning: array subscript 4 is above array bounds of ‘int[4]’ [-Warray-bounds=]
12 | int *val = &sg->vals[index];
| ~~~~~~~~^~~~~~~
‘sparx5_set’: events 1-2
4 | if (*index >= 4)
| ^
| |
| (1) when the condition is evaluated to true
......
12 | int *val = &sg->vals[index];
| ~~~~~~~~~~~~~~~
| |
| (2) warning happens here
t.c:8:18: note: while referencing ‘vals’
8 | struct nums {int vals[4];};
| ^~~~
gcc/ChangeLog:
* Makefile.in (OBJS): Add diagnostic-context-rich-location.o.
* common.opt (fdiagnostics-show-context): New option.
(fdiagnostics-show-context=): New option.
* diagnostic-context-rich-location.cc: New file.
* diagnostic-context-rich-location.h: New file.
* doc/invoke.texi (fdiagnostics-show-context): Add
documentation for the new options.
* gimple-array-bounds.cc (check_out_of_bounds_and_warn): Add
one new parameter. Use rich location with details for warning_at.
(array_bounds_checker::check_array_ref): Use rich location with
details for warning_at.
(array_bounds_checker::check_mem_ref): Add one new parameter.
Use rich location with details for warning_at.
(array_bounds_checker::check_addr_expr): Use rich location with
details for warning_at.
(array_bounds_checker::check_array_bounds): Call check_mem_ref with
one more parameter.
* gimple-array-bounds.h: Update prototype for check_mem_ref.
* gimple-ssa-warn-access.cc (warn_string_no_nul): Use rich location
with details for warning_at.
(maybe_warn_nonstring_arg): Likewise.
(maybe_warn_for_bound): Likewise.
(warn_for_access): Likewise.
(check_access): Likewise.
(pass_waccess::check_strncat): Likewise.
(pass_waccess::maybe_check_access_sizes): Likewise.
* gimple-ssa-warn-restrict.cc (pass_wrestrict::execute): Calculate
dominance info for diagnostics show context.
(maybe_diag_overlap): Use rich location with details for warning_at.
(maybe_diag_access_bounds): Use rich location with details for
warning_at.
gcc/testsuite/ChangeLog:
* gcc.dg/pr109071.c: New test.
* gcc.dg/pr109071_1.c: New test.
* gcc.dg/pr109071_10.c: New test.
* gcc.dg/pr109071_11.c: New test.
* gcc.dg/pr109071_12.c: New test.
* gcc.dg/pr109071_2.c: New test.
* gcc.dg/pr109071_3.c: New test.
* gcc.dg/pr109071_4.c: New test.
* gcc.dg/pr109071_5.c: New test.
* gcc.dg/pr109071_6.c: New test.
* gcc.dg/pr109071_7.c: New test.
* gcc.dg/pr109071_8.c: New test.
* gcc.dg/pr109071_9.c: New test.
* gcc.dg/pr117375.c: New test.
Andrew Pinski [Tue, 19 Aug 2025 23:14:57 +0000 (16:14 -0700)]
sra: Make build_ref_for_offset static [PR121568]
build_ref_for_offset was originally made external
with r0-95095-g3f84bf08c48ea4. The call was extracted
out into ipa_get_jf_ancestor_result by r0-110216-g310bc6334823b9.
Then the call was removed by r10-7273-gf3280e4c0c98e1.
There is thus no remaining use of build_ref_for_offset outside of SRA,
so let's make it static again.
Bootstrapped and tested on x86_64-linux-gnu.
PR tree-optimization/121568
gcc/ChangeLog:
* ipa-prop.h (build_ref_for_offset): Remove.
* tree-sra.cc (build_ref_for_offset): Make static.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
I'd added the aarch64-specific CC fusion pass to fold a PTEST
instruction into the instruction that feeds the PTEST, in cases
where the latter instruction can set the appropriate flags as a
side-effect.
Combine does the same optimisation. However, as explained in the
comments, the PTEST case often has:
A: set predicate P based on inputs X
B: clobber X
C: test P
and so the fusion is only possible if we move C before B.
That's something that combine currently can't do (for the cases
that we needed).
The optimisation was never really AArch64-specific. It's just that,
in an all-too-familiar fashion, we needed it in stage 3, when it was
too late to add something target-independent.
late-combine adds a convenient place to do the optimisation in a
target-independent way, just as combine is a convenient place to
do its related optimisation.
gcc/
* config.gcc (aarch64*-*-*): Remove aarch64-cc-fusion.o from
extra_objs.
* config/aarch64/aarch64-passes.def (pass_cc_fusion): Delete.
* config/aarch64/aarch64-protos.h (make_pass_cc_fusion): Delete.
* config/aarch64/t-aarch64 (aarch64-cc-fusion.o): Delete.
* config/aarch64/aarch64-cc-fusion.cc: Delete.
* late-combine.cc (late_combine::optimizable_set): Take a set_info *
rather than an insn_info * and move destination tests from...
(late_combine::combine_into_uses): ...here. Take a set_info * rather
than an insn_info *. Take the rtx set.
(late_combine::parallelize_insns, late_combine::combine_cc_setter)
(late_combine::combine_insn): New member functions.
(late_combine::m_parallel): New member variable.
* rtlanal.cc (pattern_cost): Handle sets of CC registers in the
same way as comparisons.
While testing a later patch, I found that create_degenerate_phi
had an inverted test for bitmap_set_bit. It was assuming that
the return value was the previous bit value, rather than a
"something changed" value. :(
Also, the call to add_live_out_use shouldn't be conditional
on the DF_LR_OUT operation, since the register could be live-out
because of uses later in the same EBB (which do not require a
live-out use to be added to the rtl-ssa instruction). Instead,
add_live_out_use should itself check whether a live-out use already exists.
gcc/
* rtl-ssa/blocks.cc (function_info::create_degenerate_phi): Fix
inverted test of bitmap_set_bit. Call add_live_out_use even
if the register was previously live-out from the predecessor block.
Instead...
(function_info::add_live_out_use): ...check here whether a live-out
use already exists.
rtl-ssa already has a find_def function for finding the definition
of a particular resource (register or memory) at a particular point
in the program. This patch adds a similar function for looking
up uses. Both functions have amortised logarithmic complexity.
gcc/
* rtl-ssa/accesses.h (use_lookup): New class.
* rtl-ssa/functions.h (function_info::find_def): Expand comment.
(function_info::find_use): Declare.
* rtl-ssa/member-fns.inl (use_lookup::prev_use, use_lookup::next_use)
(use_lookup::matching_use, use_lookup::matching_or_prev_use)
(use_lookup::matching_or_next_use): New member functions.
* rtl-ssa/accesses.cc (function_info::find_use): Likewise.