git.ipfire.org Git - thirdparty/gcc.git/log
2 years agofortran: Factor scalar descriptor generation
Mikael Morin [Mon, 17 Jul 2023 12:14:14 +0000 (14:14 +0200)] 
fortran: Factor scalar descriptor generation

The same scalar descriptor generation code is present twice: in the
case of derived type entities, and in the case of polymorphic
non-coarray entities.  Factor it in preparation for a future third
case that will need the same code.

gcc/fortran/ChangeLog:

* trans.cc (get_var_descr): Factor scalar descriptor generation.

2 years agofortran: Outline virtual table pointer evaluation
Mikael Morin [Mon, 17 Jul 2023 12:14:08 +0000 (14:14 +0200)] 
fortran: Outline virtual table pointer evaluation

gcc/fortran/ChangeLog:

* trans.cc (get_vptr): New function.
(gfc_add_finalizer_call): Move virtual table pointer evaluation
to get_vptr.

2 years agofortran: Remove redundant argument in get_var_descr
Mikael Morin [Mon, 17 Jul 2023 12:14:03 +0000 (14:14 +0200)] 
fortran: Remove redundant argument in get_var_descr

get_var_descr gets passed both expr and expr->ts as arguments.
Remove the type argument, which can be retrieved from the other
argument.

gcc/fortran/ChangeLog:

* trans.cc (get_var_descr): Remove argument ts.  Use var->ts
instead.
(gfc_add_finalizer_call): Update caller.

2 years agofortran: Inline variable definition
Mikael Morin [Mon, 17 Jul 2023 12:13:58 +0000 (14:13 +0200)] 
fortran: Inline variable definition

The variable has_finalizer is only used in one place; inline its
definition there.

gcc/fortran/ChangeLog:

* trans.cc (gfc_add_finalizer_call): Inline definition of
variable has_finalizer.  Merge nested conditions.

2 years agofortran: Push final procedure expr gen close to its one usage.
Mikael Morin [Mon, 17 Jul 2023 12:13:53 +0000 (14:13 +0200)] 
fortran: Push final procedure expr gen close to its one usage.

The final procedure pointer expression is generated in gfc_build_final_call
and only used in get_final_proc_ref.  Move the generation there.

gcc/fortran/ChangeLog:

* trans.cc (gfc_add_finalizer_call): Remove local variable
final_expr.  Pass down expr to get_final_proc_ref and move
final procedure expression generation down to its one usage
in get_final_proc_ref.
(get_final_proc_ref): Add argument expr.  Remove argument
final_wrapper.  Recreate final_wrapper from expr.

2 years agofortran: Push element size expression generation close to its usage
Mikael Morin [Mon, 17 Jul 2023 12:13:48 +0000 (14:13 +0200)] 
fortran: Push element size expression generation close to its usage

gfc_add_finalizer_call creates one expression which is only used
by the get_final_proc_ref function.  Move the expression generation
there.

gcc/fortran/ChangeLog:

* trans.cc (gfc_add_finalizer_call): Remove local variable
elem_size.  Pass expression to get_elem_size and move the
element size expression generation close to its usage there.
(get_elem_size): Add argument expr, remove class_size argument
and rebuild it from expr.  Remove ts argument and use the
type of expr instead.

2 years agofortran: Reuse final procedure pointer expression
Mikael Morin [Mon, 17 Jul 2023 12:13:44 +0000 (14:13 +0200)] 
fortran: Reuse final procedure pointer expression

Reuse the same final procedure pointer expression instead of
translating it twice.
Final procedure pointer expressions were translated twice: once for the
final procedure call, and once for the check for non-nullness (if
applicable).

gcc/fortran/ChangeLog:

* trans.cc (gfc_add_finalizer_call): Move pre and post code for
the final procedure pointer expression to the outer block.
Reuse the previously evaluated final procedure pointer
expression.

2 years agofortran: Add missing cleanup blocks
Mikael Morin [Mon, 17 Jul 2023 12:13:37 +0000 (14:13 +0200)] 
fortran: Add missing cleanup blocks

Move the cleanup code for the data descriptor after the finalization
code, where it makes more sense.
The other cleanup blocks should be empty (element size and final
pointer are just data references), but add them anyway, just in case.

gcc/fortran/ChangeLog:

* trans.cc (gfc_add_finalizer_call): Add post code for desc_se
after the finalizer call.  Add post code for final_se and
size_se as well.

2 years agofortran: Inline gfc_build_final_call
Mikael Morin [Mon, 17 Jul 2023 12:13:32 +0000 (14:13 +0200)] 
fortran: Inline gfc_build_final_call

Function gfc_build_final_call has been simplified; inline it.

gcc/fortran/ChangeLog:

* trans.cc (gfc_build_final_call): Inline...
(gfc_add_finalizer_call): ... to its one caller.

2 years agofortran: Outline data reference descriptor evaluation
Mikael Morin [Mon, 17 Jul 2023 12:13:26 +0000 (14:13 +0200)] 
fortran: Outline data reference descriptor evaluation

gcc/fortran/ChangeLog:

* trans.cc (get_var_descr): New function.
(gfc_build_final_call): Outline the data reference descriptor
evaluation code to get_var_descr.

2 years agofortran: Outline element size evaluation
Mikael Morin [Mon, 17 Jul 2023 12:13:19 +0000 (14:13 +0200)] 
fortran: Outline element size evaluation

gcc/fortran/ChangeLog:

* trans.cc (get_elem_size): New function.
(gfc_build_final_call): Outline the element size evaluation
to get_elem_size.

2 years agofortran: Outline final procedure pointer evaluation
Mikael Morin [Mon, 17 Jul 2023 12:13:09 +0000 (14:13 +0200)] 
fortran: Outline final procedure pointer evaluation

gcc/fortran/ChangeLog:

* trans.cc (get_final_proc_ref): New function.
(gfc_build_final_call): Outline the pointer evaluation code
to get_final_proc_ref.

2 years agofortran: Remove commented out assertion
Mikael Morin [Mon, 17 Jul 2023 12:13:01 +0000 (14:13 +0200)] 
fortran: Remove commented out assertion

r13-6747-gd7caf313525a46f200d7f5db1ba893f853774aee commented out an
assertion without any test exercising it.  This adds such a test where
the assertion would fail, and removes the commented code.

gcc/fortran/ChangeLog:

* trans.cc (gfc_build_final_call): Remove commented assertion.

gcc/testsuite/ChangeLog:

* gfortran.dg/finalize_53.f90: New test.

2 years agoExport value/mask known bits from CCP.
Aldy Hernandez [Sun, 16 Jul 2023 18:48:29 +0000 (20:48 +0200)] 
Export value/mask known bits from CCP.

Currently CCP throws away the known 1 bits because VRP and irange have
traditionally only had a way of tracking known 0s (set_nonzero_bits).
With the ability to keep all the known bits in the irange, we can now
save this between passes.

gcc/ChangeLog:

* tree-ssa-ccp.cc (ccp_finalize): Export value/mask known bits.

2 years agoRISC-V: Ensure all implied extensions are included [PR110696]
Lehua Ding [Mon, 17 Jul 2023 04:27:12 +0000 (12:27 +0800)] 
RISC-V: Ensure all implied extensions are included [PR110696]

This patch fixes target/PR110696 by recursively adding all implied extensions.

PR target/110696

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc (riscv_subset_list::handle_implied_ext):
Recursively add all implied extensions.
(riscv_subset_list::check_implied_ext): Add new method.
(riscv_subset_list::parse): Call checker check_implied_ext.
* config/riscv/riscv-subset.h: Add new method.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/attribute-20.c: New test.
* gcc.target/riscv/pr110696.c: New test.

Signed-off-by: Lehua Ding <lehua.ding@rivai.ai>
2 years agoRISC-V: Support non-SLP unordered reduction
Juzhe-Zhong [Mon, 17 Jul 2023 08:19:46 +0000 (16:19 +0800)] 
RISC-V: Support non-SLP unordered reduction

This patch adds reduc_*_scal to support reduction auto-vectorization.

Use COND_LEN_* + reduc_*_scal to support unordered non-SLP auto-vectorization.

Consider this following case:
int __attribute__((noipa))
and_loop (int32_t * __restrict x, int32_t n, int res)
{
  for (int i = 0; i < n; ++i)
    res &= x[i];
  return res;
}

ASM:
and_loop:
        ble     a1,zero,.L4
        vsetvli a3,zero,e32,m1,ta,ma
        vmv.v.i v1,-1
.L3:
        vsetvli a5,a1,e32,m1,tu,ma       ------------> MUST BE "TU".
        slli    a4,a5,2
        sub     a1,a1,a5
        vle32.v v2,0(a0)
        add     a0,a0,a4
        vand.vv v1,v2,v1
        bne     a1,zero,.L3
        vsetivli        zero,1,e32,m1,ta,ma
        vmv.v.i v2,-1
        vsetvli a3,zero,e32,m1,ta,ma
        vredand.vs      v1,v1,v2
        vmv.x.s a5,v1
        and     a0,a2,a5
        ret
.L4:
        mv      a0,a2
        ret

Fix a bug in the VSETVL pass which is exposed by a reduction testcase.

SLP reduction and floating-point in-order reduction are not supported yet.

gcc/ChangeLog:

* config/riscv/autovec.md (reduc_plus_scal_<mode>): New pattern.
(reduc_smax_scal_<mode>): Ditto.
(reduc_umax_scal_<mode>): Ditto.
(reduc_smin_scal_<mode>): Ditto.
(reduc_umin_scal_<mode>): Ditto.
(reduc_and_scal_<mode>): Ditto.
(reduc_ior_scal_<mode>): Ditto.
(reduc_xor_scal_<mode>): Ditto.
* config/riscv/riscv-protos.h (enum insn_type): Add reduction.
(expand_reduction): New function.
* config/riscv/riscv-v.cc (emit_vlmax_reduction_insn): Ditto.
(emit_vlmax_fp_reduction_insn): Ditto.
(get_m1_mode): Ditto.
(expand_cond_len_binop): Fix name.
(expand_reduction): New function.
* config/riscv/riscv-vsetvl.cc (gen_vsetvl_pat): Fix VSETVL BUG.
(validate_change_or_fail): New function.
(change_insn): Fix VSETVL BUG.
(change_vsetvl_insn): Ditto.
(pass_vsetvl::backward_demand_fusion): Ditto.
(pass_vsetvl::df_post_optimization): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/rvv.exp: Add reduction tests.
* gcc.target/riscv/rvv/autovec/reduc/reduc-1.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-2.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-3.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-4.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-4.c: New test.

2 years agoExport value/mask known bits from IPA.
Aldy Hernandez [Fri, 14 Jul 2023 10:24:29 +0000 (12:24 +0200)] 
Export value/mask known bits from IPA.

Currently IPA throws away the known 1 bits because VRP and irange have
traditionally only had a way of tracking known 0s (set_nonzero_bits).
With the ability to keep all the known bits in the irange, we can now
save this between passes.

gcc/ChangeLog:

* ipa-prop.cc (ipcp_update_bits): Export value/mask known bits.

2 years agoriscv: Fix warning in riscv_regno_ok_for_index_p
Christoph Müllner [Mon, 17 Jul 2023 08:59:15 +0000 (10:59 +0200)] 
riscv: Fix warning in riscv_regno_ok_for_index_p

The variable `regno` is currently not used in riscv_regno_ok_for_index_p(),
which triggers a compiler warning. Let's address this.

Fixes: 423604278ed5 ("riscv: Prepare backend for index registers")
Reported-by: Juzhe Zhong <juzhe.zhong@rivai.ai>
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_regno_ok_for_index_p):
Remove parameter name from declaration of unused parameter.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
2 years agovect: Initialize new_temp to avoid false positive warning [PR110652]
Kewen Lin [Mon, 17 Jul 2023 08:44:59 +0000 (03:44 -0500)] 
vect: Initialize new_temp to avoid false positive warning [PR110652]

As PR110652 and its duplicate PRs show, some build configurations
can hit the build error

  error: 'new_temp' may be used uninitialized

It's a false positive warning (or error at -Werror), but in order
to make the build succeed, this patch initializes the reported
variable 'new_temp' to NULL_TREE.

PR tree-optimization/110652

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Initialize new_temp as
NULL_TREE.

2 years agotree-optimization/110669 - bogus matching of loop bitop
Richard Biener [Mon, 17 Jul 2023 07:20:33 +0000 (09:20 +0200)] 
tree-optimization/110669 - bogus matching of loop bitop

The matching code lacked a check that we end up with a PHI node
in the loop header.  This caused us to match a random PHI argument
now caught by the extra PHI_ARG_DEF_FROM_EDGE checking.

PR tree-optimization/110669
* tree-scalar-evolution.cc (analyze_and_compute_bitop_with_inv_effect):
Check we matched a header PHI.

* gcc.dg/torture/pr110669.c: New testcase.

2 years agoAdd global setter for value/mask pair for SSA names.
Aldy Hernandez [Sun, 16 Jul 2023 18:44:02 +0000 (20:44 +0200)] 
Add global setter for value/mask pair for SSA names.

This patch provides a way to set the value/mask pair of known bits
globally, similarly to how we can use set_nonzero_bits for known 0
bits.  This can then be used by CCP and IPA to set value/mask info
instead of throwing away the known 1 bits.

In further clean-ups, I will see if it makes sense to remove
set_nonzero_bits altogether, since it is subsumed by value/mask.

gcc/ChangeLog:

* tree-ssanames.cc (set_bitmask): New.
* tree-ssanames.h (set_bitmask): New.

2 years agoNormalize irange_bitmask before union/intersect.
Aldy Hernandez [Fri, 14 Jul 2023 10:16:17 +0000 (12:16 +0200)] 
Normalize irange_bitmask before union/intersect.

The bit twiddling in union/intersect for the value/mask pair must be
normalized to have the unknown bits with a value of 0 in order to make
the math simpler.  Normalizing at construction slowed VRP by 1.5% so I
opted to normalize before updating the bitmask in range-ops, since it
was the only user.  However, with upcoming changes there will be
multiple setters of the mask (IPA and CCP), so we need something more
general.

I played with various alternatives, and settled on normalizing before
union/intersect which were the ones needing the bits cleared.  With
this patch, there's no noticeable difference in performance either in
VRP or in overall compilation.

gcc/ChangeLog:

* value-range.cc (irange_bitmask::verify_mask): Mask need not be
normalized.
* value-range.h (irange_bitmask::union_): Normalize beforehand.
(irange_bitmask::intersect): Same.

2 years agoPR 95923: More (boolean) bitop simplifications in match.pd
Andrew Pinski [Sun, 16 Jul 2023 22:31:59 +0000 (22:31 +0000)] 
PR 95923: More (boolean) bitop simplifications in match.pd

This adds the boolean version of some of the simplifications
that were added with r8-4395-ge268a77b59cb78.

That are the following:
(a | b) & (a == b) --> a & b
a | (a == b)       --> a | (b ^ 1)
(a & b) | (a == b) --> a == b

OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.

gcc/ChangeLog:

PR tree-optimization/95923
* match.pd ((a|b)&(a==b),a|(a==b),(a&b)|(a==b)): New transformation.

gcc/testsuite/ChangeLog:

PR tree-optimization/95923
* gcc.dg/tree-ssa/bitops-2.c: New test.
* gcc.dg/tree-ssa/bool-checks-1.c: New test.

2 years agoFix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.
Roger Sayle [Mon, 17 Jul 2023 06:35:08 +0000 (07:35 +0100)] 
Fix bootstrap failure (with g++ 4.8.5) in tree-if-conv.cc.

This patch fixes the bootstrap failure I'm seeing using gcc 4.8.5 as
the host compiler.

2023-07-17  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
* tree-if-conv.cc (predicate_scalar_phi): Make the arguments
to the std::sort comparison lambda function const.

2 years agoFix PR 110666: `(a != 2) == a` produces wrong code
Andrew Pinski [Fri, 14 Jul 2023 16:55:57 +0000 (09:55 -0700)] 
Fix PR 110666: `(a != 2) == a` produces wrong code

I had messed up the case where the outer operator is `==`.
The check for the result should have been `==` and not `!=`.
This patch fixes that and adds a full runtime testcase covering
all cases to make sure it works.

OK? Bootstrapped and tested on x86-64-linux-gnu with no regressions.

gcc/ChangeLog:

PR tree-optimization/110666
* match.pd (A NEEQ (A NEEQ CST)): Fix Outer EQ case.

gcc/testsuite/ChangeLog:

PR tree-optimization/110666
* gcc.c-torture/execute/pr110666-1.c: New test.

2 years agoInitial Lunar Lake, Arrow Lake and Arrow Lake S Support
Mo, Zewei [Mon, 17 Jul 2023 02:53:36 +0000 (10:53 +0800)] 
Initial Lunar Lake, Arrow Lake and Arrow Lake S Support

gcc/ChangeLog:

* common/config/i386/cpuinfo.h (get_intel_cpu): Handle Lunar Lake,
Arrow Lake and Arrow Lake S.
* common/config/i386/i386-common.cc:
(processor_name): Add arrowlake.
(processor_alias_table): Add arrow lake, arrow lake s and lunar
lake.
* common/config/i386/i386-cpuinfo.h (enum processor_subtypes):
Add INTEL_COREI7_ARROWLAKE and INTEL_COREI7_ARROWLAKE_S.
* config.gcc: Add -march=arrowlake and -march=arrowlake-s.
* config/i386/driver-i386.cc (host_detect_local_cpu): Handle
arrowlake-s.
* config/i386/i386-c.cc (ix86_target_macros_internal): Add
arrowlake.
* config/i386/i386-options.cc (m_ARROWLAKE): New.
(processor_cost_table): Add arrowlake.
* config/i386/i386.h (enum processor_type):
Add PROCESSOR_ARROWLAKE.
* config/i386/x86-tune.def: Add m_ARROWLAKE.
* doc/extend.texi: Add arrowlake and arrowlake-s.
* doc/invoke.texi: Ditto.

gcc/testsuite/ChangeLog:

* g++.target/i386/mv16.C: Add arrowlake and arrowlake-s.
* gcc.target/i386/funcspec-56.inc: Handle new march.

2 years agoi386: Auto vectorize usdot_prod, udot_prod with AVXVNNIINT16 instruction.
Haochen Jiang [Mon, 17 Jul 2023 02:46:07 +0000 (10:46 +0800)] 
i386: Auto vectorize usdot_prod, udot_prod with AVXVNNIINT16 instruction.

gcc/ChangeLog:

* config/i386/sse.md (VI2_AVX2): Delete V32HI since we actually
have the same iterator.  Also rename all occurrences to
VI2_AVX2_AVX512BW.
(usdot_prod<mode>): New define_expand.
(udot_prod<mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vnniint16-auto-vectorize-1.c: New test.
* gcc.target/i386/vnniint16-auto-vectorize-2.c: Ditto.

2 years agoSupport Intel SM4
Haochen Jiang [Mon, 17 Jul 2023 02:46:04 +0000 (10:46 +0800)] 
Support Intel SM4

gcc/ChangeLog:

* common/config/i386/cpuinfo.h (get_available_features):
Detect SM4.
* common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM4_SET,
OPTION_MASK_ISA2_SM4_UNSET): New.
(OPTION_MASK_ISA2_AVX_UNSET): Add SM4.
(ix86_handle_option): Handle -msm4.
* common/config/i386/i386-cpuinfo.h (enum processor_features):
Add FEATURE_SM4.
* common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
sm4.
* config.gcc: Add sm4intrin.h.
* config/i386/cpuid.h (bit_SM4): New.
* config/i386/i386-builtin.def (BDESC): Add new builtins.
* config/i386/i386-c.cc (ix86_target_macros_internal): Define
__SM4__.
* config/i386/i386-isa.def (SM4): Add DEF_PTA(SM4).
* config/i386/i386-options.cc (isa2_opts): Add -msm4.
(ix86_valid_target_attribute_inner_p): Handle sm4.
* config/i386/i386.opt: Add option -msm4.
* config/i386/immintrin.h: Include sm4intrin.h
* config/i386/sse.md (vsm4key4_<mode>): New define insn.
(vsm4rnds4_<mode>): Ditto.
* doc/extend.texi: Document sm4.
* doc/invoke.texi: Document -msm4.
* doc/sourcebuild.texi: Document target sm4.
* config/i386/sm4intrin.h: New file.

gcc/testsuite/ChangeLog:

* g++.dg/other/i386-2.C: Add -msm4.
* g++.dg/other/i386-3.C: Ditto.
* gcc.target/i386/funcspec-56.inc: Add new target attribute.
* gcc.target/i386/sse-12.c: Add -msm4.
* gcc.target/i386/sse-13.c: Ditto.
* gcc.target/i386/sse-14.c: Ditto.
* gcc.target/i386/sse-22.c: Add sm4.
* gcc.target/i386/sse-23.c: Ditto.
* lib/target-supports.exp (check_effective_target_sm4): New.
* gcc.target/i386/sm4-1.c: New test.
* gcc.target/i386/sm4-check.h: Ditto.
* gcc.target/i386/sm4key4-2.c: Ditto.
* gcc.target/i386/sm4rnds4-2.c: Ditto.

2 years agoSupport Intel SHA512
Haochen Jiang [Mon, 17 Jul 2023 02:45:57 +0000 (10:45 +0800)] 
Support Intel SHA512

gcc/ChangeLog:

* common/config/i386/cpuinfo.h (get_available_features):
Detect SHA512.
* common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SHA512_SET,
OPTION_MASK_ISA2_SHA512_UNSET): New.
(OPTION_MASK_ISA2_AVX_UNSET): Add SHA512.
(ix86_handle_option): Handle -msha512.
* common/config/i386/i386-cpuinfo.h (enum processor_features):
Add FEATURE_SHA512.
* common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
sha512.
* config.gcc: Add sha512intrin.h.
* config/i386/cpuid.h (bit_SHA512): New.
* config/i386/i386-builtin-types.def:
Add DEF_FUNCTION_TYPE (V4DI, V4DI, V4DI, V2DI).
* config/i386/i386-builtin.def (BDESC): Add new builtins.
* config/i386/i386-c.cc (ix86_target_macros_internal): Define
__SHA512__.
* config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
V4DI_FTYPE_V4DI_V4DI_V2DI and V4DI_FTYPE_V4DI_V2DI.
* config/i386/i386-isa.def (SHA512): Add DEF_PTA(SHA512).
* config/i386/i386-options.cc (isa2_opts): Add -msha512.
(ix86_valid_target_attribute_inner_p): Handle sha512.
* config/i386/i386.opt: Add option -msha512.
* config/i386/immintrin.h: Include sha512intrin.h.
* config/i386/sse.md (vsha512msg1): New define insn.
(vsha512msg2): Ditto.
(vsha512rnds2): Ditto.
* doc/extend.texi: Document sha512.
* doc/invoke.texi: Document -msha512.
* doc/sourcebuild.texi: Document target sha512.
* config/i386/sha512intrin.h: New file.

gcc/testsuite/ChangeLog:

* g++.dg/other/i386-2.C: Add -msha512.
* g++.dg/other/i386-3.C: Ditto.
* gcc.target/i386/funcspec-56.inc: Add new target attribute.
* gcc.target/i386/sse-12.c: Add -msha512.
* gcc.target/i386/sse-13.c: Ditto.
* gcc.target/i386/sse-14.c: Ditto.
* gcc.target/i386/sse-22.c: Add sha512.
* gcc.target/i386/sse-23.c: Ditto.
* lib/target-supports.exp (check_effective_target_sha512): New.
* gcc.target/i386/sha512-1.c: New test.
* gcc.target/i386/sha512-check.h: Ditto.
* gcc.target/i386/sha512msg1-2.c: Ditto.
* gcc.target/i386/sha512msg2-2.c: Ditto.
* gcc.target/i386/sha512rnds2-2.c: Ditto.

2 years agoSupport Intel SM3
Haochen Jiang [Mon, 17 Jul 2023 02:45:50 +0000 (10:45 +0800)] 
Support Intel SM3

gcc/ChangeLog:

* common/config/i386/cpuinfo.h (get_available_features):
Detect SM3.
* common/config/i386/i386-common.cc (OPTION_MASK_ISA2_SM3_SET,
OPTION_MASK_ISA2_SM3_UNSET): New.
(OPTION_MASK_ISA2_AVX_UNSET): Add SM3.
(ix86_handle_option): Handle -msm3.
* common/config/i386/i386-cpuinfo.h (enum processor_features):
Add FEATURE_SM3.
* common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
SM3.
* config.gcc: Add sm3intrin.h.
* config/i386/cpuid.h (bit_SM3): New.
* config/i386/i386-builtin-types.def:
Add DEF_FUNCTION_TYPE (V4SI, V4SI, V4SI, V4SI, INT).
* config/i386/i386-builtin.def (BDESC): Add new builtins.
* config/i386/i386-c.cc (ix86_target_macros_internal): Define
__SM3__.
* config/i386/i386-expand.cc (ix86_expand_args_builtin): Handle
V4SI_FTYPE_V4SI_V4SI_V4SI_INT.
* config/i386/i386-isa.def (SM3): Add DEF_PTA(SM3).
* config/i386/i386-options.cc (isa2_opts): Add -msm3.
(ix86_valid_target_attribute_inner_p): Handle sm3.
* config/i386/i386.opt: Add option -msm3.
* config/i386/immintrin.h: Include sm3intrin.h.
* config/i386/sse.md (vsm3msg1): New define insn.
(vsm3msg2): Ditto.
(vsm3rnds2): Ditto.
* doc/extend.texi: Document sm3.
* doc/invoke.texi: Document -msm3.
* doc/sourcebuild.texi: Document target sm3.
* config/i386/sm3intrin.h: New file.

gcc/testsuite/ChangeLog:

* g++.dg/other/i386-2.C: Add -msm3.
* g++.dg/other/i386-3.C: Ditto.
* gcc.target/i386/avx-1.c: Add new define for immediate.
* gcc.target/i386/funcspec-56.inc: Add new target attribute.
* gcc.target/i386/sse-12.c: Add -msm3.
* gcc.target/i386/sse-13.c: Ditto.
* gcc.target/i386/sse-14.c: Ditto.
* gcc.target/i386/sse-22.c: Add sm3.
* gcc.target/i386/sse-23.c: Ditto.
* lib/target-supports.exp (check_effective_target_sm3): New.
* gcc.target/i386/sm3-1.c: New test.
* gcc.target/i386/sm3-check.h: Ditto.
* gcc.target/i386/sm3msg1-2.c: Ditto.
* gcc.target/i386/sm3msg2-2.c: Ditto.
* gcc.target/i386/sm3rnds2-2.c: Ditto.

2 years agoSupport Intel AVX-VNNI-INT16
Kong Lingling [Mon, 17 Jul 2023 02:45:42 +0000 (10:45 +0800)] 
Support Intel AVX-VNNI-INT16

gcc/ChangeLog

* common/config/i386/cpuinfo.h (get_available_features): Detect
avxvnniint16.
* common/config/i386/i386-common.cc
(OPTION_MASK_ISA2_AVXVNNIINT16_SET): New.
(OPTION_MASK_ISA2_AVXVNNIINT16_UNSET): Ditto.
(ix86_handle_option): Handle -mavxvnniint16.
* common/config/i386/i386-cpuinfo.h (enum processor_features):
Add FEATURE_AVXVNNIINT16.
* common/config/i386/i386-isas.h: Add ISA_NAME_TABLE_ENTRY for
avxvnniint16.
* config.gcc: Add avxvnniint16.h.
* config/i386/avxvnniint16intrin.h: New file.
* config/i386/cpuid.h (bit_AVXVNNIINT16): New.
* config/i386/i386-builtin.def: Add new builtins.
* config/i386/i386-c.cc (ix86_target_macros_internal): Define
__AVXVNNIINT16__.
* config/i386/i386-options.cc (isa2_opts): Add -mavxvnniint16.
(ix86_valid_target_attribute_inner_p): Handle avxvnniint16intrin.h.
* config/i386/i386-isa.def: Add DEF_PTA(AVXVNNIINT16).
* config/i386/i386.opt: Add option -mavxvnniint16.
* config/i386/immintrin.h: Include avxvnniint16.h.
* config/i386/sse.md
(vpdp<vpdpwprodtype>_<mode>): New define_insn.
* doc/extend.texi: Document avxvnniint16.
* doc/invoke.texi: Document -mavxvnniint16.
* doc/sourcebuild.texi: Document target avxvnniint16.

gcc/testsuite/ChangeLog

* g++.dg/other/i386-2.C: Add -mavxvnniint16.
* g++.dg/other/i386-3.C: Ditto.
* gcc.target/i386/avx-check.h: Add avxvnniint16 check.
* gcc.target/i386/sse-12.c: Add -mavxvnniint16.
* gcc.target/i386/sse-13.c: Ditto.
* gcc.target/i386/sse-14.c: Ditto.
* gcc.target/i386/sse-22.c: Ditto.
* gcc.target/i386/sse-23.c: Ditto.
* gcc.target/i386/funcspec-56.inc: Add new target attribute.
* lib/target-supports.exp
(check_effective_target_avxvnniint16): New.
* gcc.target/i386/avxvnniint16-1.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwusd-2.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwusds-2.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwsud-2.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwsuds-2.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwuud-2.c: Ditto.
* gcc.target/i386/avxvnniint16-vpdpwuuds-2.c: Ditto.

Co-authored-by: Haochen Jiang <haochen.jiang@intel.com>
2 years agoDaily bump.
GCC Administrator [Mon, 17 Jul 2023 00:17:26 +0000 (00:17 +0000)] 
Daily bump.

2 years agoFix profile update in scale_profile_for_vect_loop
Jan Hubicka [Sun, 16 Jul 2023 21:56:59 +0000 (23:56 +0200)] 
Fix profile update in scale_profile_for_vect_loop

When vectorizing 4 times, we sometimes do
  for
    <4x vectorized body>
  for
    <2x vectorized body>
  for
    <1x vectorized body>

Here the last two loops, which handle the epilogue, never iterate.
Currently the vectorizer thinks that the middle loop iterates twice.
This turns out to be caused by scale_profile_for_vect_loop, which
uses niter_for_unrolled_loop.

At that time we know the epilogue will iterate at most 2 times, but
niter_for_unrolled_loop does not know that the last iteration will be
taken by the epilogue-of-epilogue, and thus it thinks that the loop
may iterate once and exit in the middle of the second iteration.

We already do a correct job updating the niter bounds; this is just
an ordering issue.  This patch makes us first update the bounds and
then update the loop.  I re-implemented the function to be more
correct and precise.

The loop reducing the iteration factor for overly flat profiles is a
bit funny, but the only other method I can think of is to compute an
sreal scale, which I think would have similar overhead.

Bootstrapped/regtested x86_64-linux, will commit it shortly.

gcc/ChangeLog:

PR middle-end/110649
* tree-vect-loop.cc (scale_profile_for_vect_loop): Rewrite.
(vect_transform_loop): Move scale_profile_for_vect_loop after
upper bound updates.

2 years agoFix optimize_mask_stores profile update
Jan Hubicka [Sun, 16 Jul 2023 21:55:14 +0000 (23:55 +0200)] 
Fix optimize_mask_stores profile update

While looking into the sphinx3 regression I noticed that the vectorizer
produces BBs with an overall probability count of 120%.  This patch fixes it.
Richi, I don't know how to create a testcase, but having one would
be nice.

Bootstrapped/regtested x86_64-linux, will commit it shortly.

gcc/ChangeLog:

PR tree-optimization/110649
* tree-vect-loop.cc (optimize_mask_stores): Set correctly
probability of the if-then-else construct.

2 years agoAvoid double profile update in try_peel_loop
Jan Hubicka [Sun, 16 Jul 2023 21:53:56 +0000 (23:53 +0200)] 
Avoid double profile update in try_peel_loop

try_peel_loop uses gimple_duplicate_loop_body_to_header_edge, which
subtracts the profile from the original loop.  However, it then tries
to scale the profile in a wrong way (it forces the header count to be
the entry count).

This eliminates two profile misupdates in the internal loop of sphinx3.

gcc/ChangeLog:

PR middle-end/110649
* tree-ssa-loop-ivcanon.cc (try_peel_loop): Avoid double profile update.

2 years agoDaily bump.
GCC Administrator [Sun, 16 Jul 2023 00:16:53 +0000 (00:16 +0000)] 
Daily bump.

2 years agotestsuite: Require 128 bit long double for ibmlongdouble.
David Edelsohn [Sat, 15 Jul 2023 22:44:25 +0000 (18:44 -0400)] 
testsuite: Require 128 bit long double for ibmlongdouble.

pr103628.f90 adds the -mabi=ibmlongdouble option, but AIX defaults
to 64 bit long double. This patch adds -mlong-double-128 to ensure
that the testcase is compiled with 128 bit long double.

gcc/testsuite/ChangeLog:
* gfortran.dg/pr103628.f90: Add -mlong-double-128 option.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
2 years agoUpdate my contrib entry
Andrew Pinski [Sat, 15 Jul 2023 21:35:04 +0000 (21:35 +0000)] 
Update my contrib entry

Committed as obvious after making sure the documentation still builds.

gcc/ChangeLog:

* doc/contrib.texi: Update my entry.

2 years agohppa: Modify TLS patterns to provide both 32 and 64-bit support.
John David Anglin [Sat, 15 Jul 2023 17:20:24 +0000 (17:20 +0000)] 
hppa: Modify TLS patterns to provide both 32 and 64-bit support.

2023-07-15  John David Anglin  <danglin@gcc.gnu.org>

gcc/ChangeLog:

* config/pa/pa.md: Define constants R1_REGNUM, R19_REGNUM and
R27_REGNUM.
(tgd_load): Restrict to !TARGET_64BIT. Use register constants.
(tld_load): Likewise.
(tgd_load_pic): Change to expander.
(tld_load_pic, tld_offset_load, tp_load): Likewise.
(tie_load_pic, tle_load): Likewise.
(tgd_load_picsi, tgd_load_picdi): New.
(tld_load_picsi, tld_load_picdi): New.
(tld_offset_load<P:mode>): New.
(tp_load<P:mode>): New.
(tie_load_picsi, tie_load_picdi): New.
(tle_load<P:mode>): New.

2 years agoc++: copy elision w/ obj arg and static memfn call [PR110441]
Patrick Palka [Sat, 15 Jul 2023 13:50:51 +0000 (09:50 -0400)] 
c++: copy elision w/ obj arg and static memfn call [PR110441]

Here the call A().f() is represented as a COMPOUND_EXPR whose first
operand is the otherwise unused object argument A() and second operand
is the call result (both are TARGET_EXPRs).  Within the return statement,
this outermost COMPOUND_EXPR ends up foiling the copy elision check in
build_special_member_call, resulting in us introducing a bogus call to the
deleted move constructor.  (Within the variable initialization, which goes
through ocp_convert instead of convert_for_initialization, we've already
been eliding the copy -- despite the outermost COMPOUND_EXPR -- ever since
r10-7410-g72809d6fe8e085 made ocp_convert look through COMPOUND_EXPR).

In contrast I noticed '(A(), A::f())' (which should be equivalent to
the above call) is represented with the COMPOUND_EXPR inside the RHS's
TARGET_EXPR initializer thanks to a special case in cp_build_compound_expr.

So this patch fixes this by making keep_unused_object_arg use
cp_build_compound_expr as well.

PR c++/110441

gcc/cp/ChangeLog:

* call.cc (keep_unused_object_arg): Use cp_build_compound_expr
instead of building a COMPOUND_EXPR directly.

gcc/testsuite/ChangeLog:

* g++.dg/cpp1z/elide8.C: New test.

2 years agoc++: mangling template-id of unknown template [PR110524]
Patrick Palka [Sat, 15 Jul 2023 13:47:36 +0000 (09:47 -0400)] 
c++: mangling template-id of unknown template [PR110524]

This fixes a crash when mangling an ADL-enabled call to a template-id
naming an unknown template (as per P0846R0).

PR c++/110524

gcc/cp/ChangeLog:

* mangle.cc (write_expression): Handle TEMPLATE_ID_EXPR
whose template is already an IDENTIFIER_NODE.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/fn-template26.C: New test.

2 years agoDaily bump.
GCC Administrator [Sat, 15 Jul 2023 00:17:26 +0000 (00:17 +0000)] 
Daily bump.

2 years agoc++: style tweak
Nathaniel Shead [Thu, 13 Jul 2023 21:40:10 +0000 (17:40 -0400)] 
c++: style tweak

At this point r == t, but it makes more sense to refer to t like all the
other cases do.

gcc/cp/ChangeLog:

* constexpr.cc (cxx_eval_constant_expression): Pass t to get_value.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>
Jason Merrill [Fri, 14 Jul 2023 13:37:21 +0000 (09:37 -0400)] 
c++: c++26 regression fixes

Apparently I wasn't actually running the testsuite in C++26 mode like I
thought I was, so there were some failures I wasn't seeing.

The constexpr hunk fixes regressions with the P2738 implementation; we still
need to use the old handling for casting from void pointers to heap
variables.
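As a rough illustration of the P2738 behaviour referred to above (C++26
allows casting from a void pointer back to the pointee type during constant
evaluation), here is the runtime equivalent, so the snippet compiles at any
standard level; the constexpr form is noted in comments:

```cpp
#include <cassert>

// Runtime equivalent of what P2738 permits in constant evaluation.
int roundtrip() {
  int i = 42;
  void *p = &i;                    // implicit conversion to void*
  int *q = static_cast<int *>(p);  // cast back; constexpr-valid only in C++26
  return *q;
}

// In C++26 the same body can be constexpr:
//   constexpr int croundtrip() { /* as above */ }
//   static_assert(croundtrip() == 42);
```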

PR c++/110344

gcc/cp/ChangeLog:

* constexpr.cc (cxx_eval_constant_expression): Move P2738 handling
after heap handling.
* name-lookup.cc (get_cxx_dialect_name): Add C++26.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/constexpr-cast2.C: Adjust for P2738.
* g++.dg/ipa/devirt-45.C: Handle -fimplicit-constexpr.

Christophe Lyon [Wed, 12 Jul 2023 17:27:23 +0000 (17:27 +0000)] 
arm: [MVE intrinsics] rework vcmlaq

Implement vcmlaq using the new MVE builtins framework.
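For orientation, a plain scalar model of what the vcmlaq rotations compute
per complex lane, as I read the Arm documentation (an illustration of the
operation being wired up, not the intrinsic itself; the `cf` type and
function names are made up):

```cpp
#include <cassert>

struct cf { float re, im; };  // one complex lane

// rotation 0: accumulate a.re * b
cf vcmla_rot0(cf acc, cf a, cf b) {
  acc.re += a.re * b.re;
  acc.im += a.re * b.im;
  return acc;
}

// rotation 90: accumulate i * a.im * b
cf vcmla_rot90(cf acc, cf a, cf b) {
  acc.re -= a.im * b.im;
  acc.im += a.im * b.re;
  return acc;
}

// rot0 followed by rot90 composes a full complex multiply-accumulate.
cf cmla(cf acc, cf a, cf b) {
  return vcmla_rot90(vcmla_rot0(acc, a, b), a, b);
}
```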

2023-07-13  Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm-mve-builtins-base.cc (vcmlaq, vcmlaq_rot90)
(vcmlaq_rot180, vcmlaq_rot270): New.
* config/arm/arm-mve-builtins-base.def (vcmlaq, vcmlaq_rot90)
(vcmlaq_rot180, vcmlaq_rot270): New.
* config/arm/arm-mve-builtins-base.h: (vcmlaq, vcmlaq_rot90)
(vcmlaq_rot180, vcmlaq_rot270): New.
* config/arm/arm-mve-builtins.cc
(function_instance::has_inactive_argument): Handle vcmlaq,
vcmlaq_rot90, vcmlaq_rot180, vcmlaq_rot270.
* config/arm/arm_mve.h (vcmlaq): Delete.
(vcmlaq_rot180): Delete.
(vcmlaq_rot270): Delete.
(vcmlaq_rot90): Delete.
(vcmlaq_m): Delete.
(vcmlaq_rot180_m): Delete.
(vcmlaq_rot270_m): Delete.
(vcmlaq_rot90_m): Delete.
(vcmlaq_f16): Delete.
(vcmlaq_rot180_f16): Delete.
(vcmlaq_rot270_f16): Delete.
(vcmlaq_rot90_f16): Delete.
(vcmlaq_f32): Delete.
(vcmlaq_rot180_f32): Delete.
(vcmlaq_rot270_f32): Delete.
(vcmlaq_rot90_f32): Delete.
(vcmlaq_m_f32): Delete.
(vcmlaq_m_f16): Delete.
(vcmlaq_rot180_m_f32): Delete.
(vcmlaq_rot180_m_f16): Delete.
(vcmlaq_rot270_m_f32): Delete.
(vcmlaq_rot270_m_f16): Delete.
(vcmlaq_rot90_m_f32): Delete.
(vcmlaq_rot90_m_f16): Delete.
(__arm_vcmlaq_f16): Delete.
(__arm_vcmlaq_rot180_f16): Delete.
(__arm_vcmlaq_rot270_f16): Delete.
(__arm_vcmlaq_rot90_f16): Delete.
(__arm_vcmlaq_f32): Delete.
(__arm_vcmlaq_rot180_f32): Delete.
(__arm_vcmlaq_rot270_f32): Delete.
(__arm_vcmlaq_rot90_f32): Delete.
(__arm_vcmlaq_m_f32): Delete.
(__arm_vcmlaq_m_f16): Delete.
(__arm_vcmlaq_rot180_m_f32): Delete.
(__arm_vcmlaq_rot180_m_f16): Delete.
(__arm_vcmlaq_rot270_m_f32): Delete.
(__arm_vcmlaq_rot270_m_f16): Delete.
(__arm_vcmlaq_rot90_m_f32): Delete.
(__arm_vcmlaq_rot90_m_f16): Delete.
(__arm_vcmlaq): Delete.
(__arm_vcmlaq_rot180): Delete.
(__arm_vcmlaq_rot270): Delete.
(__arm_vcmlaq_rot90): Delete.
(__arm_vcmlaq_m): Delete.
(__arm_vcmlaq_rot180_m): Delete.
(__arm_vcmlaq_rot270_m): Delete.
(__arm_vcmlaq_rot90_m): Delete.

Christophe Lyon [Wed, 12 Jul 2023 16:02:59 +0000 (16:02 +0000)] 
arm: [MVE intrinsics] factorize vcmlaq

Factorize vcmlaq builtins so that they use parameterized names.

2023-07-13  Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm_mve_builtins.def (vcmlaq_rot90_f)
(vcmlaq_rot270_f, vcmlaq_rot180_f, vcmlaq_f): Add "_f" suffix.
* config/arm/iterators.md (MVE_VCMLAQ_M): New.
(mve_insn): Add vcmla.
(rot): Add VCMLAQ_M_F, VCMLAQ_ROT90_M_F, VCMLAQ_ROT180_M_F,
VCMLAQ_ROT270_M_F.
(mve_rot): Add VCMLAQ_M_F, VCMLAQ_ROT90_M_F, VCMLAQ_ROT180_M_F,
VCMLAQ_ROT270_M_F.
* config/arm/mve.md (mve_vcmlaq<mve_rot><mode>): Rename into ...
(@mve_<mve_insn>q<mve_rot>_f<mode>): ... this.
(mve_vcmlaq_m_f<mode>, mve_vcmlaq_rot180_m_f<mode>)
(mve_vcmlaq_rot270_m_f<mode>, mve_vcmlaq_rot90_m_f<mode>): Merge
into ...
(@mve_<mve_insn>q<mve_rot>_m_f<mode>): ... this.

Christophe Lyon [Wed, 12 Jul 2023 14:35:29 +0000 (14:35 +0000)] 
arm: [MVE intrinsics] rework vcmulq

Implement vcmulq using the new MVE builtins framework.

2023-07-13 Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm-mve-builtins-base.cc (vcmulq, vcmulq_rot90)
(vcmulq_rot180, vcmulq_rot270): New.
* config/arm/arm-mve-builtins-base.def (vcmulq, vcmulq_rot90)
(vcmulq_rot180, vcmulq_rot270): New.
* config/arm/arm-mve-builtins-base.h: (vcmulq, vcmulq_rot90)
(vcmulq_rot180, vcmulq_rot270): New.
* config/arm/arm_mve.h (vcmulq_rot90): Delete.
(vcmulq_rot270): Delete.
(vcmulq_rot180): Delete.
(vcmulq): Delete.
(vcmulq_m): Delete.
(vcmulq_rot180_m): Delete.
(vcmulq_rot270_m): Delete.
(vcmulq_rot90_m): Delete.
(vcmulq_x): Delete.
(vcmulq_rot90_x): Delete.
(vcmulq_rot180_x): Delete.
(vcmulq_rot270_x): Delete.
(vcmulq_rot90_f16): Delete.
(vcmulq_rot270_f16): Delete.
(vcmulq_rot180_f16): Delete.
(vcmulq_f16): Delete.
(vcmulq_rot90_f32): Delete.
(vcmulq_rot270_f32): Delete.
(vcmulq_rot180_f32): Delete.
(vcmulq_f32): Delete.
(vcmulq_m_f32): Delete.
(vcmulq_m_f16): Delete.
(vcmulq_rot180_m_f32): Delete.
(vcmulq_rot180_m_f16): Delete.
(vcmulq_rot270_m_f32): Delete.
(vcmulq_rot270_m_f16): Delete.
(vcmulq_rot90_m_f32): Delete.
(vcmulq_rot90_m_f16): Delete.
(vcmulq_x_f16): Delete.
(vcmulq_x_f32): Delete.
(vcmulq_rot90_x_f16): Delete.
(vcmulq_rot90_x_f32): Delete.
(vcmulq_rot180_x_f16): Delete.
(vcmulq_rot180_x_f32): Delete.
(vcmulq_rot270_x_f16): Delete.
(vcmulq_rot270_x_f32): Delete.
(__arm_vcmulq_rot90_f16): Delete.
(__arm_vcmulq_rot270_f16): Delete.
(__arm_vcmulq_rot180_f16): Delete.
(__arm_vcmulq_f16): Delete.
(__arm_vcmulq_rot90_f32): Delete.
(__arm_vcmulq_rot270_f32): Delete.
(__arm_vcmulq_rot180_f32): Delete.
(__arm_vcmulq_f32): Delete.
(__arm_vcmulq_m_f32): Delete.
(__arm_vcmulq_m_f16): Delete.
(__arm_vcmulq_rot180_m_f32): Delete.
(__arm_vcmulq_rot180_m_f16): Delete.
(__arm_vcmulq_rot270_m_f32): Delete.
(__arm_vcmulq_rot270_m_f16): Delete.
(__arm_vcmulq_rot90_m_f32): Delete.
(__arm_vcmulq_rot90_m_f16): Delete.
(__arm_vcmulq_x_f16): Delete.
(__arm_vcmulq_x_f32): Delete.
(__arm_vcmulq_rot90_x_f16): Delete.
(__arm_vcmulq_rot90_x_f32): Delete.
(__arm_vcmulq_rot180_x_f16): Delete.
(__arm_vcmulq_rot180_x_f32): Delete.
(__arm_vcmulq_rot270_x_f16): Delete.
(__arm_vcmulq_rot270_x_f32): Delete.
(__arm_vcmulq_rot90): Delete.
(__arm_vcmulq_rot270): Delete.
(__arm_vcmulq_rot180): Delete.
(__arm_vcmulq): Delete.
(__arm_vcmulq_m): Delete.
(__arm_vcmulq_rot180_m): Delete.
(__arm_vcmulq_rot270_m): Delete.
(__arm_vcmulq_rot90_m): Delete.
(__arm_vcmulq_x): Delete.
(__arm_vcmulq_rot90_x): Delete.
(__arm_vcmulq_rot180_x): Delete.
(__arm_vcmulq_rot270_x): Delete.

Christophe Lyon [Wed, 12 Jul 2023 13:55:26 +0000 (13:55 +0000)] 
arm: [MVE intrinsics] factorize vcmulq

Factorize vcmulq builtins so that they use parameterized names.

We can merge them with vcadd.

2023-07-13  Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm_mve_builtins.def (vcmulq_rot90_f)
(vcmulq_rot270_f, vcmulq_rot180_f, vcmulq_f): Add "_f" suffix.
* config/arm/iterators.md (MVE_VCADDQ_VCMULQ)
(MVE_VCADDQ_VCMULQ_M): New.
(mve_insn): Add vcmul.
(rot): Add VCMULQ_M_F, VCMULQ_ROT90_M_F, VCMULQ_ROT180_M_F,
VCMULQ_ROT270_M_F.
(VCMUL): Delete.
(mve_rot): Add VCMULQ_M_F, VCMULQ_ROT90_M_F, VCMULQ_ROT180_M_F,
VCMULQ_ROT270_M_F.
* config/arm/mve.md (mve_vcmulq<mve_rot><mode>): Merge into
@mve_<mve_insn>q<mve_rot>_f<mode>.
(mve_vcmulq_m_f<mode>, mve_vcmulq_rot180_m_f<mode>)
(mve_vcmulq_rot270_m_f<mode>, mve_vcmulq_rot90_m_f<mode>): Merge
into @mve_<mve_insn>q<mve_rot>_m_f<mode>.

Christophe Lyon [Tue, 11 Jul 2023 16:13:30 +0000 (16:13 +0000)] 
arm: [MVE intrinsics] rework vcaddq vhcaddq

Implement vcaddq, vhcaddq using the new MVE builtins framework.

2023-07-13  Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm-mve-builtins-base.cc (vcaddq_rot90)
(vcaddq_rot270, vhcaddq_rot90, vhcaddq_rot270): New.
* config/arm/arm-mve-builtins-base.def (vcaddq_rot90)
(vcaddq_rot270, vhcaddq_rot90, vhcaddq_rot270): New.
* config/arm/arm-mve-builtins-base.h: (vcaddq_rot90)
(vcaddq_rot270, vhcaddq_rot90, vhcaddq_rot270): New.
* config/arm/arm-mve-builtins-functions.h (class
unspec_mve_function_exact_insn_rot): New.
* config/arm/arm_mve.h (vcaddq_rot90): Delete.
(vcaddq_rot270): Delete.
(vhcaddq_rot90): Delete.
(vhcaddq_rot270): Delete.
(vcaddq_rot270_m): Delete.
(vcaddq_rot90_m): Delete.
(vhcaddq_rot270_m): Delete.
(vhcaddq_rot90_m): Delete.
(vcaddq_rot90_x): Delete.
(vcaddq_rot270_x): Delete.
(vhcaddq_rot90_x): Delete.
(vhcaddq_rot270_x): Delete.
(vcaddq_rot90_u8): Delete.
(vcaddq_rot270_u8): Delete.
(vhcaddq_rot90_s8): Delete.
(vhcaddq_rot270_s8): Delete.
(vcaddq_rot90_s8): Delete.
(vcaddq_rot270_s8): Delete.
(vcaddq_rot90_u16): Delete.
(vcaddq_rot270_u16): Delete.
(vhcaddq_rot90_s16): Delete.
(vhcaddq_rot270_s16): Delete.
(vcaddq_rot90_s16): Delete.
(vcaddq_rot270_s16): Delete.
(vcaddq_rot90_u32): Delete.
(vcaddq_rot270_u32): Delete.
(vhcaddq_rot90_s32): Delete.
(vhcaddq_rot270_s32): Delete.
(vcaddq_rot90_s32): Delete.
(vcaddq_rot270_s32): Delete.
(vcaddq_rot90_f16): Delete.
(vcaddq_rot270_f16): Delete.
(vcaddq_rot90_f32): Delete.
(vcaddq_rot270_f32): Delete.
(vcaddq_rot270_m_s8): Delete.
(vcaddq_rot270_m_s32): Delete.
(vcaddq_rot270_m_s16): Delete.
(vcaddq_rot270_m_u8): Delete.
(vcaddq_rot270_m_u32): Delete.
(vcaddq_rot270_m_u16): Delete.
(vcaddq_rot90_m_s8): Delete.
(vcaddq_rot90_m_s32): Delete.
(vcaddq_rot90_m_s16): Delete.
(vcaddq_rot90_m_u8): Delete.
(vcaddq_rot90_m_u32): Delete.
(vcaddq_rot90_m_u16): Delete.
(vhcaddq_rot270_m_s8): Delete.
(vhcaddq_rot270_m_s32): Delete.
(vhcaddq_rot270_m_s16): Delete.
(vhcaddq_rot90_m_s8): Delete.
(vhcaddq_rot90_m_s32): Delete.
(vhcaddq_rot90_m_s16): Delete.
(vcaddq_rot270_m_f32): Delete.
(vcaddq_rot270_m_f16): Delete.
(vcaddq_rot90_m_f32): Delete.
(vcaddq_rot90_m_f16): Delete.
(vcaddq_rot90_x_s8): Delete.
(vcaddq_rot90_x_s16): Delete.
(vcaddq_rot90_x_s32): Delete.
(vcaddq_rot90_x_u8): Delete.
(vcaddq_rot90_x_u16): Delete.
(vcaddq_rot90_x_u32): Delete.
(vcaddq_rot270_x_s8): Delete.
(vcaddq_rot270_x_s16): Delete.
(vcaddq_rot270_x_s32): Delete.
(vcaddq_rot270_x_u8): Delete.
(vcaddq_rot270_x_u16): Delete.
(vcaddq_rot270_x_u32): Delete.
(vhcaddq_rot90_x_s8): Delete.
(vhcaddq_rot90_x_s16): Delete.
(vhcaddq_rot90_x_s32): Delete.
(vhcaddq_rot270_x_s8): Delete.
(vhcaddq_rot270_x_s16): Delete.
(vhcaddq_rot270_x_s32): Delete.
(vcaddq_rot90_x_f16): Delete.
(vcaddq_rot90_x_f32): Delete.
(vcaddq_rot270_x_f16): Delete.
(vcaddq_rot270_x_f32): Delete.
(__arm_vcaddq_rot90_u8): Delete.
(__arm_vcaddq_rot270_u8): Delete.
(__arm_vhcaddq_rot90_s8): Delete.
(__arm_vhcaddq_rot270_s8): Delete.
(__arm_vcaddq_rot90_s8): Delete.
(__arm_vcaddq_rot270_s8): Delete.
(__arm_vcaddq_rot90_u16): Delete.
(__arm_vcaddq_rot270_u16): Delete.
(__arm_vhcaddq_rot90_s16): Delete.
(__arm_vhcaddq_rot270_s16): Delete.
(__arm_vcaddq_rot90_s16): Delete.
(__arm_vcaddq_rot270_s16): Delete.
(__arm_vcaddq_rot90_u32): Delete.
(__arm_vcaddq_rot270_u32): Delete.
(__arm_vhcaddq_rot90_s32): Delete.
(__arm_vhcaddq_rot270_s32): Delete.
(__arm_vcaddq_rot90_s32): Delete.
(__arm_vcaddq_rot270_s32): Delete.
(__arm_vcaddq_rot270_m_s8): Delete.
(__arm_vcaddq_rot270_m_s32): Delete.
(__arm_vcaddq_rot270_m_s16): Delete.
(__arm_vcaddq_rot270_m_u8): Delete.
(__arm_vcaddq_rot270_m_u32): Delete.
(__arm_vcaddq_rot270_m_u16): Delete.
(__arm_vcaddq_rot90_m_s8): Delete.
(__arm_vcaddq_rot90_m_s32): Delete.
(__arm_vcaddq_rot90_m_s16): Delete.
(__arm_vcaddq_rot90_m_u8): Delete.
(__arm_vcaddq_rot90_m_u32): Delete.
(__arm_vcaddq_rot90_m_u16): Delete.
(__arm_vhcaddq_rot270_m_s8): Delete.
(__arm_vhcaddq_rot270_m_s32): Delete.
(__arm_vhcaddq_rot270_m_s16): Delete.
(__arm_vhcaddq_rot90_m_s8): Delete.
(__arm_vhcaddq_rot90_m_s32): Delete.
(__arm_vhcaddq_rot90_m_s16): Delete.
(__arm_vcaddq_rot90_x_s8): Delete.
(__arm_vcaddq_rot90_x_s16): Delete.
(__arm_vcaddq_rot90_x_s32): Delete.
(__arm_vcaddq_rot90_x_u8): Delete.
(__arm_vcaddq_rot90_x_u16): Delete.
(__arm_vcaddq_rot90_x_u32): Delete.
(__arm_vcaddq_rot270_x_s8): Delete.
(__arm_vcaddq_rot270_x_s16): Delete.
(__arm_vcaddq_rot270_x_s32): Delete.
(__arm_vcaddq_rot270_x_u8): Delete.
(__arm_vcaddq_rot270_x_u16): Delete.
(__arm_vcaddq_rot270_x_u32): Delete.
(__arm_vhcaddq_rot90_x_s8): Delete.
(__arm_vhcaddq_rot90_x_s16): Delete.
(__arm_vhcaddq_rot90_x_s32): Delete.
(__arm_vhcaddq_rot270_x_s8): Delete.
(__arm_vhcaddq_rot270_x_s16): Delete.
(__arm_vhcaddq_rot270_x_s32): Delete.
(__arm_vcaddq_rot90_f16): Delete.
(__arm_vcaddq_rot270_f16): Delete.
(__arm_vcaddq_rot90_f32): Delete.
(__arm_vcaddq_rot270_f32): Delete.
(__arm_vcaddq_rot270_m_f32): Delete.
(__arm_vcaddq_rot270_m_f16): Delete.
(__arm_vcaddq_rot90_m_f32): Delete.
(__arm_vcaddq_rot90_m_f16): Delete.
(__arm_vcaddq_rot90_x_f16): Delete.
(__arm_vcaddq_rot90_x_f32): Delete.
(__arm_vcaddq_rot270_x_f16): Delete.
(__arm_vcaddq_rot270_x_f32): Delete.
(__arm_vcaddq_rot90): Delete.
(__arm_vcaddq_rot270): Delete.
(__arm_vhcaddq_rot90): Delete.
(__arm_vhcaddq_rot270): Delete.
(__arm_vcaddq_rot270_m): Delete.
(__arm_vcaddq_rot90_m): Delete.
(__arm_vhcaddq_rot270_m): Delete.
(__arm_vhcaddq_rot90_m): Delete.
(__arm_vcaddq_rot90_x): Delete.
(__arm_vcaddq_rot270_x): Delete.
(__arm_vhcaddq_rot90_x): Delete.
(__arm_vhcaddq_rot270_x): Delete.

Christophe Lyon [Wed, 28 Jun 2023 14:29:15 +0000 (14:29 +0000)] 
arm: [MVE intrinsics] Factorize vcaddq vhcaddq

Factorize vcaddq, vhcaddq so that they use the same parameterized
names.

To be able to use the same patterns, we add a suffix to vcaddq.

Note that vcadd uses UNSPEC_VCADDxx for builtins without predication,
and VCADDQ_ROTxx_M_x (that is, not starting with "UNSPEC_").  The
UNSPEC_* names are also used by neon.md.
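For reference, a scalar model of the vcaddq rotations per complex lane, as I
understand the operation (illustrative only; `cf` is a made-up type, and
vhcaddq additionally halves each result element):

```cpp
#include <cassert>

struct cf { float re, im; };  // one complex lane

// rotation 90: a + i*b
cf vcadd_rot90(cf a, cf b)  { return { a.re - b.im, a.im + b.re }; }

// rotation 270: a - i*b
cf vcadd_rot270(cf a, cf b) { return { a.re + b.im, a.im - b.re }; }
```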

2023-07-13  Christophe Lyon  <christophe.lyon@linaro.org>

gcc/
* config/arm/arm_mve_builtins.def (vcaddq_rot90_, vcaddq_rot270_)
(vcaddq_rot90_f, vcaddq_rot270_f): Add "_" or "_f" suffix.
* config/arm/iterators.md (mve_insn): Add vcadd, vhcadd.
(isu): Add UNSPEC_VCADD90, UNSPEC_VCADD270, VCADDQ_ROT270_M_U,
VCADDQ_ROT270_M_S, VCADDQ_ROT90_M_U, VCADDQ_ROT90_M_S,
VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S, VHCADDQ_ROT90_S,
VHCADDQ_ROT270_S.
(rot): Add VCADDQ_ROT90_M_F, VCADDQ_ROT90_M_S, VCADDQ_ROT90_M_U,
VCADDQ_ROT270_M_F, VCADDQ_ROT270_M_S, VCADDQ_ROT270_M_U,
VHCADDQ_ROT90_S, VHCADDQ_ROT270_S, VHCADDQ_ROT90_M_S,
VHCADDQ_ROT270_M_S.
(mve_rot): Add VCADDQ_ROT90_M_F, VCADDQ_ROT90_M_S,
VCADDQ_ROT90_M_U, VCADDQ_ROT270_M_F, VCADDQ_ROT270_M_S,
VCADDQ_ROT270_M_U, VHCADDQ_ROT90_S, VHCADDQ_ROT270_S,
VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S.
(supf): Add VHCADDQ_ROT90_M_S, VHCADDQ_ROT270_M_S,
VHCADDQ_ROT90_S, VHCADDQ_ROT270_S, UNSPEC_VCADD90,
UNSPEC_VCADD270.
(VCADDQ_ROT270_M): Delete.
(VCADDQ_M_F, VxCADDQ, VxCADDQ_M): New.
(VCADDQ_ROT90_M): Delete.
* config/arm/mve.md (mve_vcaddq<mve_rot><mode>)
(mve_vhcaddq_rot270_s<mode>, mve_vhcaddq_rot90_s<mode>): Merge
into ...
(@mve_<mve_insn>q<mve_rot>_<supf><mode>): ... this.
(mve_vcaddq<mve_rot><mode>): Rename into ...
(@mve_<mve_insn>q<mve_rot>_f<mode>): ... this.
(mve_vcaddq_rot270_m_<supf><mode>)
(mve_vcaddq_rot90_m_<supf><mode>, mve_vhcaddq_rot270_m_s<mode>)
(mve_vhcaddq_rot90_m_s<mode>): Merge into ...
(@mve_<mve_insn>q<mve_rot>_m_<supf><mode>): ... this.
(mve_vcaddq_rot270_m_f<mode>, mve_vcaddq_rot90_m_f<mode>): Merge
into ...
(@mve_<mve_insn>q<mve_rot>_m_f<mode>): ... this.

Roger Sayle [Fri, 14 Jul 2023 17:21:56 +0000 (18:21 +0100)] 
PR target/110588: Add *bt<mode>_setncqi_2 to generate btl on x86.

This patch resolves PR target/110588 to catch another case in combine
where the i386 backend should be generating a btl instruction.  This adds
another define_insn_and_split to recognize the RTL representation for this
case.
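A guess at the shape of source code affected (hypothetical; the committed
test is gcc.target/i386/pr110588.c): extracting the complement of a bit at a
variable position, which on x86 should now combine into btl plus setnc
rather than a shift/mask/xor sequence.

```cpp
#include <cassert>

// Returns true when bit n of x is clear; with the new *bt<mode>_setncqi_2
// pattern, combine can match this as btl + setnc.
bool bit_clear(unsigned int x, unsigned int n) {
  return !((x >> n) & 1);
}
```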

I also noticed that two related define_insn_and_split weren't using the
preferred string style for single statement preparation-statements, so
I've reformatted these to be consistent in style with the new one.

2023-07-14  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
PR target/110588
* config/i386/i386.md (*bt<mode>_setcqi): Prefer string form
preparation statement over braces for a single statement.
(*bt<mode>_setncqi): Likewise.
(*bt<mode>_setncqi_2): New define_insn_and_split.

gcc/testsuite/ChangeLog
PR target/110588
* gcc.target/i386/pr110588.c: New test case.

Marek Polacek [Thu, 25 May 2023 22:54:18 +0000 (18:54 -0400)] 
c++: wrong error with static constexpr var in tmpl [PR109876]

Since r8-509, we'll no longer create a static temporary var for
the initializer '{ 1, 2 }' for num in the attached test because
the code in finish_compound_literal is now guarded by
'&& fcl_context == fcl_c99' but it's fcl_functional here.  This
causes us to reject num as non-constant when evaluating it in
a template.

Jason's idea was to treat num as value-dependent even though it
actually isn't.  This patch implements that suggestion.

We weren't marking objects whose type is an empty class type
constant.  This patch changes that so that v_d_e_p doesn't need
to check is_really_empty_class.
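A hypothetical sketch of the rejected pattern (the real tests are the
constexpr-template*.C files listed in the ChangeLog; the names here are
illustrative):

```cpp
#include <cassert>

template <int N> struct C { static constexpr int value = N; };

template <class T>
int g() {
  // Since r8-509 the '{ 1, 2 }' initializer no longer gets a static
  // temporary var; num must nevertheless remain usable in a constant
  // expression inside the template.
  static constexpr int num[] = { 1, 2 };
  return C<num[0] + num[1]>::value;
}
```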

Co-authored-by: Jason Merrill <jason@redhat.com>
PR c++/109876

gcc/cp/ChangeLog:

* decl.cc (cp_finish_decl): Set TREE_CONSTANT when initializing
an object of empty class type.
* pt.cc (value_dependent_expression_p) <case VAR_DECL>: Treat a
constexpr-declared non-constant variable as value-dependent.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/constexpr-template12.C: New test.
* g++.dg/cpp1z/constexpr-template1.C: New test.
* g++.dg/cpp1z/constexpr-template2.C: New test.

Roger Sayle [Fri, 14 Jul 2023 17:10:05 +0000 (18:10 +0100)] 
i386: Improved insv of DImode/DFmode {high,low}parts into TImode.

This is the next piece towards a fix for (the x86_64 ABI issues affecting)
PR 88873.  This patch generalizes the recent tweak to ix86_expand_move
for setting the highpart of a TImode reg from a DImode source using
*insvti_highpart_1, to handle both DImode and DFmode sources, and also
use the recently added *insvti_lowpart_1 for setting the lowpart.

Although this is another intermediate step (not yet a fix) towards enabling
*insvti and *concat* patterns to be candidates for TImode STV (by using
V2DI/V2DF instructions), it already improves things a little.

For the test case from PR 88873

typedef struct { double x, y; } s_t;
typedef double v2df __attribute__ ((vector_size (2 * sizeof(double))));

s_t foo (s_t a, s_t b, s_t c)
{
  return (s_t) { fma(a.x, b.x, c.x), fma (a.y, b.y, c.y) };
}

With -O2 -march=cascadelake, GCC currently generates:

Before (29 instructions):
        vmovq   %xmm2, -56(%rsp)
        movq    -56(%rsp), %rdx
        vmovq   %xmm4, -40(%rsp)
        movq    $0, -48(%rsp)
        movq    %rdx, -56(%rsp)
        movq    -40(%rsp), %rdx
        vmovq   %xmm0, -24(%rsp)
        movq    %rdx, -40(%rsp)
        movq    -24(%rsp), %rsi
        movq    -56(%rsp), %rax
        movq    $0, -32(%rsp)
        vmovq   %xmm3, -48(%rsp)
        movq    -48(%rsp), %rcx
        vmovq   %xmm5, -32(%rsp)
        vmovq   %rax, %xmm6
        movq    -40(%rsp), %rax
        movq    $0, -16(%rsp)
        movq    %rsi, -24(%rsp)
        movq    -32(%rsp), %rsi
        vpinsrq $1, %rcx, %xmm6, %xmm6
        vmovq   %rax, %xmm7
        vmovq   %xmm1, -16(%rsp)
        vmovapd %xmm6, %xmm3
        vpinsrq $1, %rsi, %xmm7, %xmm7
        vfmadd132pd     -24(%rsp), %xmm7, %xmm3
        vmovapd %xmm3, -56(%rsp)
        vmovsd  -48(%rsp), %xmm1
        vmovsd  -56(%rsp), %xmm0
        ret

After (20 instructions):
        vmovq   %xmm2, -56(%rsp)
        movq    -56(%rsp), %rax
        vmovq   %xmm3, -48(%rsp)
        vmovq   %xmm4, -40(%rsp)
        movq    -48(%rsp), %rcx
        vmovq   %xmm5, -32(%rsp)
        vmovq   %rax, %xmm6
        movq    -40(%rsp), %rax
        movq    -32(%rsp), %rsi
        vpinsrq $1, %rcx, %xmm6, %xmm6
        vmovq   %xmm0, -24(%rsp)
        vmovq   %rax, %xmm7
        vmovq   %xmm1, -16(%rsp)
        vmovapd %xmm6, %xmm2
        vpinsrq $1, %rsi, %xmm7, %xmm7
        vfmadd132pd     -24(%rsp), %xmm7, %xmm2
        vmovapd %xmm2, -56(%rsp)
        vmovsd  -48(%rsp), %xmm1
        vmovsd  -56(%rsp), %xmm0
        ret

2023-07-14  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Generalize special
case inserting of 64-bit values into a TImode register, to handle
both DImode and DFmode using either *insvti_lowpart_1
or *insvti_highpart_1.

Uros Bizjak [Fri, 14 Jul 2023 09:46:22 +0000 (11:46 +0200)] 
cprop: Do not set REG_EQUAL note when simplifying paradoxical subreg [PR110206]

cprop1 pass does not consider paradoxical subregs, and for (insn 22) claims
that it equals 8 elements of HImode by setting a REG_EQUAL note:

(insn 21 19 22 4 (set (reg:V4QI 98)
        (mem/u/c:V4QI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S4 A32])) "pr110206.c":12:42 1530 {*movv4qi_internal}
     (expr_list:REG_EQUAL (const_vector:V4QI [
                (const_int -52 [0xffffffffffffffcc]) repeated x4
            ])
        (nil)))
(insn 22 21 23 4 (set (reg:V8HI 100)
        (zero_extend:V8HI (vec_select:V8QI (subreg:V16QI (reg:V4QI 98) 0)
                (parallel [
                        (const_int 0 [0])
                        (const_int 1 [0x1])
                        (const_int 2 [0x2])
                        (const_int 3 [0x3])
                        (const_int 4 [0x4])
                        (const_int 5 [0x5])
                        (const_int 6 [0x6])
                        (const_int 7 [0x7])
                    ])))) "pr110206.c":12:42 7471 {sse4_1_zero_extendv8qiv8hi2}
     (expr_list:REG_EQUAL (const_vector:V8HI [
                (const_int 204 [0xcc]) repeated x8
            ])
        (expr_list:REG_DEAD (reg:V4QI 98)
            (nil))))

We rely on the "undefined" vals to have a specific value (from the earlier
REG_EQUAL note), but actual code generation doesn't ensure this (it doesn't
need to).  That said, the issue isn't the constant folding per se, but that
we do not actually constant fold and instead register an equality that
doesn't hold.

PR target/110206

gcc/ChangeLog:

* fwprop.cc (contains_paradoxical_subreg_p): Move to ...
* rtlanal.cc (contains_paradoxical_subreg_p): ... here.
* rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
* cprop.cc (try_replace_reg): Do not set REG_EQUAL note
when the original source contains a paradoxical subreg.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr110206.c: New test.

Jan Hubicka [Fri, 14 Jul 2023 15:14:15 +0000 (17:14 +0200)] 
Turn TODO_rebuild_frequencies to a pass

Currently we rebuild profile_counts from profile_probability after inlining,
because there is a chance that producing large loop nests may get
unrealistically large profile_count values.  This is much less of a concern
since we switched to the new profile_count representation a while back.

This propagation can also compensate for profile inconsistencies caused by
optimization passes.  Since the inliner is followed by basic cleanup passes
that do not use the profile, we get a more realistic profile by delaying the
recomputation until the basic optimizations exposed by inlining are
finished.

This does not fit into the TODO machinery, so I turn rebuilding into a
standalone pass and schedule it before the first consumer of the profile in
the optimization queue.

I also added logic that avoids repropagating when the CFG is good and not
too close to overflow.  Propagation visits every basic block loop_depth
times, so it is not linear, and avoiding it may help a bit.

On tramp3d we get 14 functions repropagated and 916 that are OK.  The
repropagated functions are the RB-tree ones, where we produce crazy loop
nests by recursive inlining.  This is something to fix independently.

gcc/ChangeLog:

* passes.cc (execute_function_todo): Remove
TODO_rebuild_frequencies.
* passes.def: Add rebuild_frequencies pass.
* predict.cc (estimate_bb_frequencies): Drop
force parameter.
(tree_estimate_probability): Update call of
estimate_bb_frequencies.
(rebuild_frequencies): Turn into a pass; verify CFG profile consistency
first and do not rebuild if not necessary.
(class pass_rebuild_frequencies): New.
(make_pass_rebuild_frequencies): New.
* profile-count.h: Add profile_count::very_large_p.
* tree-inline.cc (optimize_inline_calls): Do not return
TODO_rebuild_frequencies.
* tree-pass.h (TODO_rebuild_frequencies): Remove.
(make_pass_rebuild_frequencies): Declare.

Juzhe-Zhong [Thu, 13 Jul 2023 22:17:09 +0000 (06:17 +0800)] 
RISC-V: Enable COND_LEN_FMA auto-vectorization

Add comments per Robin's suggestion in scatter_store_run-7.c.

Enable COND_LEN_FMA auto-vectorization for floating-point FMA **without** ffast-math.

The middle-end support has been approved, and I will merge it after bootstrap && regression on x86 finish:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624395.html

Now, it's time to send this patch.

Consider this following case:

  __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,            \
      TYPE *__restrict a,              \
      TYPE *__restrict b, int n)       \
  {                                                                            \
    for (int i = 0; i < n; i++)                                                \
      dst[i] += a[i] * b[i];                                                   \
  }

TEST_ALL ()

Before this patch:

ternop_double:
        ble     a3,zero,.L5
        mv      a6,a0
.L3:
        vsetvli a5,a3,e64,m1,tu,ma
        slli    a4,a5,3
        vle64.v v1,0(a0)
        vle64.v v2,0(a1)
        vle64.v v3,0(a2)
        sub     a3,a3,a5
        vfmul.vv        v2,v2,v3
        vfadd.vv        v1,v1,v2
        vse64.v v1,0(a6)
        add     a0,a0,a4
        add     a1,a1,a4
        add     a2,a2,a4
        add     a6,a6,a4
        bne     a3,zero,.L3
.L5:
        ret

After this patch:

ternop_double:
        ble     a3,zero,.L5
        mv      a6,a0
.L3:
        vsetvli a5,a3,e64,m1,tu,ma
        slli    a4,a5,3
        vle64.v v1,0(a0)
        vle64.v v2,0(a1)
        vle64.v v3,0(a2)
        sub     a3,a3,a5
        vfmacc.vv       v1,v3,v2
        vse64.v v1,0(a6)
        add     a0,a0,a4
        add     a1,a1,a4
        add     a2,a2,a4
        add     a6,a6,a4
        bne     a3,zero,.L3
.L5:
        ret

Notice: This patch only supports COND_LEN_FMA, **not** COND_LEN_FNMA, etc.,
        since I didn't support them in the middle-end yet.

        Will support them in the following patches soon.

gcc/ChangeLog:

* config/riscv/autovec.md (cond_len_fma<mode>): New pattern.
* config/riscv/riscv-protos.h (enum insn_type): New enum.
(expand_cond_len_ternop): New function.
* config/riscv/riscv-v.cc (emit_nonvlmax_fp_ternary_tu_insn): Ditto.
(expand_cond_len_ternop): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-7.c:
Adapt testcase for link fail.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-1.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-3.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-3.c: New test.

Jose E. Marchesi [Fri, 14 Jul 2023 11:54:06 +0000 (13:54 +0200)] 
bpf: enable instruction scheduling

This patch adds a dummy FSM to bpf.md in order to get INSN_SCHEDULING
defined.  If the latter is not defined, the `combine' pass generates
paradoxical subregs of mems, which then seem to be mishandled by LRA,
resulting in invalid code.

Tested in bpf-unknown-none.

gcc/ChangeLog:

2023-07-14  Jose E. Marchesi  <jose.marchesi@oracle.com>

PR target/110657
* config/bpf/bpf.md: Enable instruction scheduling.

Mikael Morin [Fri, 14 Jul 2023 12:15:51 +0000 (14:15 +0200)] 
fortran: Reorder array argument evaluation parts [PR92178]

In the case of an array actual arg passed to a polymorphic array dummy
with INTENT(OUT) attribute, reorder the argument evaluation code to
the following:
 - first evaluate arguments' values, and data references,
 - deallocate data references associated with an allocatable,
   intent(out) dummy,
 - create a class container using the freed data references.

The ordering used to be incorrect between the first two items,
when one argument was deallocated before a later argument evaluated
its expression depending on the former argument.
r14-2395-gb1079fc88f082d3c5b583c8822c08c5647810259 fixed it by treating
arguments associated with an allocatable, intent(out) dummy in a
separate, later block.  This, however, wasn't working either if the data
reference of such an argument was depending on its own content, as
the class container initialization was trying to use deallocated
content.

This change generates class container initialization code in a separate
block, so that it is moved after the deallocation block without moving
the rest of the argument evaluation code.

This alone is not sufficient to fix the problem, because the class
container generation code repeatedly uses the full expression of
the argument at a place where deallocation might have happened
already.  This is non-optimal, but may also be invalid, because the data
reference may depend on its own content.  In that case the expression
can't be evaluated after the data has been deallocated.

As in the scalar case previously treated, this is fixed by saving
the data reference to a pointer before any deallocation happens,
and then only refering to the pointer.  gfc_reset_vptr is updated
to take into account the already evaluated class container if it's
available.

Contrary to the scalar case, one hunk is needed to wrap the parameter
evaluation in a conditional, to avoid regressing in
optional_class_2.f90.  This used to be handled by the class wrapper
construction which wrapped the whole code in a conditional.  With
this change the class wrapper construction can't see the parameter
evaluation code, so the latter is updated with an additional handling
for optional arguments.

PR fortran/92178

gcc/fortran/ChangeLog:

* trans.h (gfc_reset_vptr): Add class_container argument.
* trans-expr.cc (gfc_reset_vptr): Ditto.  If a valid vptr can
be obtained through class_container argument, bypass evaluation
of e.
(gfc_conv_procedure_call):  Wrap the argument evaluation code
in a conditional if the associated dummy is optional.  Evaluate
the data reference to a pointer now, and replace later
references with usage of the pointer.

gcc/testsuite/ChangeLog:

* gfortran.dg/intent_out_21.f90: New test.

Mikael Morin [Fri, 14 Jul 2023 12:15:21 +0000 (14:15 +0200)] 
fortran: Factor data references for scalar class argument wrapping [PR92178]

In the case of a scalar actual arg passed to a polymorphic assumed-rank
dummy with INTENT(OUT) attribute, avoid repeatedly evaluating the actual
argument reference by saving a pointer to it.  This is non-optimal, but
may also be invalid, because the data reference may depend on its own
content.  In that case the expression can't be evaluated after the data
has been deallocated.

There are two ways redundant expressions are generated:
 - parmse.expr, which contains the actual argument expression, is
   reused to get or set subfields in gfc_conv_class_to_class.
 - gfc_conv_class_to_class, to get the virtual table pointer associated
   with the argument, generates a new expression from scratch starting
   with the frontend expression.

The first part is fixed by saving parmse.expr to a pointer and using
the pointer instead of the original expression.

The second part is fixed by adding a separate field to gfc_se that
is set to the class container expression when the expression to
evaluate is polymorphic.  This needs the same field in gfc_ss_info
so that its value can be propagated to gfc_conv_class_to_class, which
is modified to use that value.  Finally, gfc_conv_procedure saves the
expression in that field to a pointer in between, to avoid the same
problem as for the first part.

PR fortran/92178

gcc/fortran/ChangeLog:

* trans.h (struct gfc_se): New field class_container.
(struct gfc_ss_info): Ditto.
(gfc_evaluate_data_ref_now): New prototype.
* trans.cc (gfc_evaluate_data_ref_now):  Implement it.
* trans-array.cc (gfc_conv_ss_descriptor): Copy class_container
field from gfc_se struct to gfc_ss_info struct.
(gfc_conv_expr_descriptor): Copy class_container field from
gfc_ss_info struct to gfc_se struct.
* trans-expr.cc (gfc_conv_class_to_class): Use class container
set in class_container field if available.
(gfc_conv_variable): Set class_container field on encountering
class variables or components, clear it on encountering
non-class components.
(gfc_conv_procedure_call): Evaluate data ref to a pointer now,
and replace later references by usage of the pointer.

gcc/testsuite/ChangeLog:

* gfortran.dg/intent_out_20.f90: New test.

2 years agofortran: defer class wrapper initialization after deallocation [PR92178]
Mikael Morin [Fri, 14 Jul 2023 12:15:07 +0000 (14:15 +0200)] 
fortran: defer class wrapper initialization after deallocation [PR92178]

If an actual argument is associated with an INTENT(OUT) dummy, and code
to deallocate it is generated, generate the class wrapper initialization
after the actual argument deallocation.

This is achieved by passing a cleaned up expression to
gfc_conv_class_to_class, so that the class wrapper initialization code
can be isolated and moved independently after the deallocation.

PR fortran/92178

gcc/fortran/ChangeLog:

* trans-expr.cc (gfc_conv_procedure_call): Use a separate gfc_se
struct, initialized from parmse, to generate the class wrapper.
After the class wrapper code has been generated, copy it back
depending on whether parameter deallocation code has been
generated.

gcc/testsuite/ChangeLog:

* gfortran.dg/intent_out_19.f90: New test.

2 years agolibgomp.texi: Extend memory allocation documentation
Tobias Burnus [Fri, 14 Jul 2023 11:15:07 +0000 (13:15 +0200)] 
libgomp.texi: Extend memory allocation documentation

libgomp/
* libgomp.texi (OMP_ALLOCATOR): Document the default values for
the traits. Add crossref to 'Memory allocation'.
(Memory allocation): Refer to OMP_ALLOCATOR for the available
traits and allocators/mem spaces; document the default value
for the pool_size trait.

2 years agoifcvt: Sort PHI arguments not only occurrences but also complexity [PR109154]
Tamar Christina [Fri, 14 Jul 2023 10:21:46 +0000 (11:21 +0100)] 
ifcvt: Sort PHI arguments not only occurrences but also complexity [PR109154]

This patch builds on the previous patch by fixing another issue with the
way ifcvt currently picks which branches to test.

The issue with the current implementation is while it sorts for
occurrences of the argument, it doesn't check for complexity of the arguments.

As an example:

  <bb 15> [local count: 528603100]:
  ...
  if (distbb_75 >= 0.0)
    goto <bb 17>; [59.00%]
  else
    goto <bb 16>; [41.00%]

  <bb 16> [local count: 216727269]:
  ...
  goto <bb 19>; [100.00%]

  <bb 17> [local count: 311875831]:
  ...
  if (distbb_75 < iftmp.0_98)
    goto <bb 18>; [20.00%]
  else
    goto <bb 19>; [80.00%]

  <bb 18> [local count: 62375167]:
  ...

  <bb 19> [local count: 528603100]:
  # prephitmp_175 = PHI <_173(18), 0.0(17), _174(16)>

All three arguments to the PHI have the same number of occurrences, namely 1,
however it makes a big difference which comparison we test first.

Sorting only on occurrences we'll pick the compares coming from BB 18 and BB 17.
This means we end up generating 4 comparisons, while 2 would have been enough.

By keeping track of the "complexity" of the COND in each BB, (i.e. the number
of comparisons needed to traverse from the start [BB 15] to end [BB 19]) and
using a key tuple of <occurrences, complexity> we end up selecting the compare
from BB 16 and BB 18 first.  BB 16 only requires 1 compare, and BB 18, after we
test BB 16, also only requires one additional compare.  This change, paired
with the previous one above, results in the optimal 2 compares.

For deep nesting, i.e. for

...
  _79 = vr_15 > 20;
  _80 = _68 & _79;
  _82 = vr_15 <= 20;
  _83 = _68 & _82;
  _84 = vr_15 < -20;
  _85 = _73 & _84;
  _87 = vr_15 >= -20;
  _88 = _73 & _87;
  _ifc__111 = _55 ? 10 : 12;
  _ifc__112 = _70 ? 7 : _ifc__111;
  _ifc__113 = _85 ? 8 : _ifc__112;
  _ifc__114 = _88 ? 9 : _ifc__113;
  _ifc__115 = _45 ? 1 : _ifc__114;
  _ifc__116 = _63 ? 3 : _ifc__115;
  _ifc__117 = _65 ? 4 : _ifc__116;
  _ifc__118 = _83 ? 6 : _ifc__117;
  _ifc__119 = _60 ? 2 : _ifc__118;
  _ifc__120 = _43 ? 13 : _ifc__119;
  _ifc__121 = _75 ? 11 : _ifc__120;
  vw_1 = _80 ? 5 : _ifc__121;

Most of the comparisons are still needed because the chains of conditions
do not negate each other.  i.e. _80 is _73 & vr_15 >= -20 and
_85 is _73 & vr_15 < -20.  Clearly, given that _73 needs to be true in both
branches, the only additional test needed is on vr_15, where the one test is
the negation of the other.  So we don't need to do the comparison of _73 twice.

The changes in the patch reduces the overall number of compares by one, but has
a bigger effect on the dependency chain.

Previously we would generate 5 instructions chain:

cmple   p7.s, p4/z, z29.s, z30.s
cmpne   p7.s, p7/z, z29.s, #0
cmple   p6.s, p7/z, z31.s, z30.s
cmpge   p6.s, p6/z, z27.s, z25.s
cmplt   p15.s, p6/z, z28.s, z21.s

as the longest chain.  With this patch we generate 3:

cmple   p7.s, p3/z, z27.s, z30.s
cmpne   p7.s, p7/z, z27.s, #0
cmpgt   p7.s, p7/z, z31.s, z30.s

and I don't think (x <= y) && (x != 0) && (z > y) can be reduced further.

gcc/ChangeLog:

PR tree-optimization/109154
* tree-if-conv.cc (INCLUDE_ALGORITHM): Include.
(struct bb_predicate): Add no_predicate_stmts.
(set_bb_predicate): Increase predicate count.
(set_bb_predicate_gimplified_stmts): Conditionally initialize
no_predicate_stmts.
(get_bb_num_predicate_stmts): New.
(init_bb_predicate): Initialize no_predicate_stmts.
(release_bb_predicate): Cleanup no_predicate_stmts.
(insert_gimplified_predicates): Preserve no_predicate_stmts.

gcc/testsuite/ChangeLog:

PR tree-optimization/109154
* gcc.dg/vect/vect-ifcvt-20.c: New test.

2 years agoifcvt: Reduce comparisons on conditionals by tracking truths [PR109154]
Tamar Christina [Fri, 14 Jul 2023 10:21:12 +0000 (11:21 +0100)] 
ifcvt: Reduce comparisons on conditionals by tracking truths [PR109154]

Following on from Jakub's patch in g:de0ee9d14165eebb3d31c84e98260c05c3b33acb
these two patches finishes the work fixing the regression and improves codegen.

As explained in that commit, ifconvert sorts PHI args in increasing number of
occurrences in order to reduce the number of comparisons done while
traversing the tree.

The remaining task that this patch fixes is dealing with the long chain of
comparisons that can be created from phi nodes, particularly when they share
any common successor (classical example is a diamond node).

On a PHI node, the true and false branches each carry a condition: true will
carry `a` and false `~a`.  The issue is that at the moment GCC tests both `a`
and `~a` when the phi node has more than 2 arguments.  Clearly this isn't
needed.  The deeper the nesting of phi nodes, the larger the repetition.

As an example, for

foo (int *f, int d, int e)
{
  for (int i = 0; i < 1024; i++)
    {
      int a = f[i];
      int t;
      if (a < 0)
t = 1;
      else if (a < e)
t = 1 - a * d;
      else
t = 0;
      f[i] = t;
    }
}

after Jakub's patch we generate:

  _7 = a_10 < 0;
  _21 = a_10 >= 0;
  _22 = a_10 < e_11(D);
  _23 = _21 & _22;
  _ifc__42 = _23 ? t_13 : 0;
  t_6 = _7 ? 1 : _ifc__42

but while better than before it is still inefficient, since in the false
branch, where we know ~_7 is true, we still test _21.

This leads to superfluous tests for every diamond node.  After this patch we
generate

 _7 = a_10 < 0;
 _22 = a_10 < e_11(D);
 _ifc__42 = _22 ? t_13 : 0;
 t_6 = _7 ? 1 : _ifc__42;

Which correctly elides the test of _21.  This is done by borrowing the
vectorizer's helper functions to limit predicate mask usage.  Ifcvt chains
conditionals on the false edge (unless specifically inverted), so on creating
cond a ? b : c, this patch registers ~a when traversing c.  If c is a
conditional, then c is simplified to the smallest possible predicate given
the assumptions we already know to be true.

gcc/ChangeLog:

PR tree-optimization/109154
* tree-if-conv.cc (gen_simplified_condition,
gen_phi_nest_statement): New.
(gen_phi_arg_condition, predicate_scalar_phi): Use it.

gcc/testsuite/ChangeLog:

PR tree-optimization/109154
* gcc.dg/vect/vect-ifcvt-19.c: New test.

2 years agoProvide extra checking for phi argument access from edge
Richard Biener [Fri, 14 Jul 2023 08:01:39 +0000 (10:01 +0200)] 
Provide extra checking for phi argument access from edge

The following adds checking that the edge we query an associated
PHI arg for is related to the PHI node.  Triggered by questionable
code in one of my reviews.

* gimple.h (gimple_phi_arg): New const overload.
(gimple_phi_arg_def): Make gimple arg const.
(gimple_phi_arg_def_from_edge): New inline function.
* tree-phinodes.h (gimple_phi_arg_imm_use_ptr_from_edge):
Likewise.
* tree-ssa-operands.h (PHI_ARG_DEF_FROM_EDGE): Direct to
new inline function.
(PHI_ARG_DEF_PTR_FROM_EDGE): Likewise.

2 years agolibgomp: Fix allocator handling for Linux when libnuma is not available
Tobias Burnus [Fri, 14 Jul 2023 07:14:37 +0000 (09:14 +0200)] 
libgomp: Fix allocator handling for Linux when libnuma is not available

Follow up to r14-2462-g450b05ce54d3f0.  The case that libnuma was not
available at runtime was not properly handled; now it falls back to
the normal malloc.

libgomp/

* allocator.c (omp_init_allocator): Check whether symbol from
dlopened libnuma is available before using libnuma for
allocations.

2 years agoRISC-V: Recognize zihintntl extension
Monk Chiang [Thu, 13 Jul 2023 05:38:55 +0000 (13:38 +0800)] 
RISC-V: Recognize zihintntl extension

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc:
(riscv_implied_info): Add zihintntl item.
(riscv_ext_version_table): Ditto.
(riscv_ext_flag_table): Ditto.
* config/riscv/riscv-opts.h (MASK_ZIHINTNTL): New macro.
(TARGET_ZIHINTNTL): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/arch-22.c: New test.
* gcc.target/riscv/predef-28.c: New test.

2 years agoRISC-V: Remove the redundant expressions in the and<mode>3.
Die Li [Fri, 14 Jul 2023 02:02:05 +0000 (02:02 +0000)] 
RISC-V: Remove the redundant expressions in the and<mode>3.

When generating the gen_and<mode>3 function based on the and<mode>3
template, it produces the expression emit_insn (gen_rtx_SET (operand0,
gen_rtx_AND (<mode>, operand1, operand2)));, which is identical to the
portion I removed in this patch. Therefore, the redundant portion can be
deleted.

Signed-off-by: Die Li <lidie@eswincomputing.com>
gcc/ChangeLog:

* config/riscv/riscv.md: Remove redundant portion in and<mode>3.

2 years agoSH: Fix PR101496 peephole bug
Oleg Endo [Fri, 14 Jul 2023 00:54:20 +0000 (09:54 +0900)] 
SH: Fix PR101496 peephole bug

gcc/ChangeLog:

PR target/101469
* config/sh/sh.md (peephole2): Handle case where eliminated reg is also
used by the address of the following memory operand.

2 years agoDaily bump.
GCC Administrator [Fri, 14 Jul 2023 00:16:43 +0000 (00:16 +0000)] 
Daily bump.

2 years agopdp11: Fix epilogue generation [PR107841]
Mikael Pettersson [Thu, 13 Jul 2023 20:06:39 +0000 (16:06 -0400)] 
pdp11: Fix epilogue generation [PR107841]

gcc/

PR target/107841
* config/pdp11/pdp11.cc (pdp11_expand_epilogue): Also
deallocate alloca-only frame.

gcc/testsuite/

PR target/107841
* gcc.target/pdp11/pr107841.c: New test.

2 years agom2, build: Use LDFLAGS for mklink
Rainer Orth [Thu, 13 Jul 2023 19:44:52 +0000 (21:44 +0200)] 
m2, build: Use LDFLAGS for mklink

When trying to bootstrap current trunk on macOS 14.0 beta 3 with Xcode
15 beta 4, the build failed running mklink in stage 2:

unset CC ; m2/boot-bin/mklink -s --langc++ --exit --name m2/mc-boot/main.cc
/vol/gcc/src/hg/master/darwin/gcc/m2/init/mcinit
dyld[55825]: Library not loaded: /vol/gcc/lib/libstdc++.6.dylib

While it's unclear to me why this only happens on macOS 14, the problem
is clear: unlike other C++ executables, mklink isn't linked with
-static-libstdc++ which is passed in from toplevel in LDFLAGS.

This patch fixes that and allows the build to continue.

Bootstrapped on x86_64-apple-darwin23.0.0, i386-pc-solaris2.11, and
sparc-sun-solaris2.11.

2023-07-11  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>

gcc/m2:
* Make-lang.in (m2/boot-bin/mklink$(exeext)): Add $(LDFLAGS).

2 years agofortran: Release symbols in reversed order [PR106050]
Mikael Morin [Thu, 13 Jul 2023 19:23:44 +0000 (21:23 +0200)] 
fortran: Release symbols in reversed order [PR106050]

Release symbols in reversed order wrt the order they were allocated.
This fixes an error recovery ICE in the case of a misplaced
derived type declaration.  Such a declaration creates nested
symbols, one for the derived type and one for each type parameter,
which should be immediately released as the declaration is
rejected.  This breaks if the derived type is released first.
As the type parameter symbols are in the namespace of the derived
type, releasing the derived type releases the type parameters, so
one can't access them after that, even to release them.  Hence,
the type parameters should be released first.

PR fortran/106050

gcc/fortran/ChangeLog:

* symbol.cc (gfc_restore_last_undo_checkpoint): Release symbols
in reverse order.

gcc/testsuite/ChangeLog:

* gfortran.dg/pdt_33.f90: New test.

2 years agoDarwin: Use -platform_version when available [PR110624].
Iain Sandoe [Thu, 13 Jul 2023 06:36:51 +0000 (07:36 +0100)] 
Darwin: Use -platform_version when available [PR110624].

Later versions of the static linker support a more flexible flag to
describe the OS, OS version and SDK used to build the code.  This
replaces the functionality of '-mmacosx_version_min' (which is now
deprecated, leading to the diagnostic described in the PR).

We now use the platform_version flag when available which avoids the
diagnostic.

Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>
PR target/110624

gcc/ChangeLog:

* config/darwin.h (DARWIN_PLATFORM_ID): New.
(LINK_COMMAND_A): Use DARWIN_PLATFORM_ID to pass OS, OS version
and SDK data to the static linker.

2 years agors6000, Add return value to __builtin_set_fpscr_rn
Carl Love [Thu, 13 Jul 2023 17:44:43 +0000 (13:44 -0400)] 
rs6000, Add return value to __builtin_set_fpscr_rn

Change the return value from void to double for __builtin_set_fpscr_rn.
The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, and
RN in their bit positions.  A new test file, gcc.target/powerpc/test_fpscr_rn_builtin_2.c,
is added to test the new return value of the built-in.

The value __SET_FPSCR_RN_RETURNS_FPSCR__ is defined if
__builtin_set_fpscr_rn returns a double.

gcc/ChangeLog:
* config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn): Update
built-in definition return type.
* config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Add check,
define __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
* config/rs6000/rs6000.md (rs6000_set_fpscr_rn): Add return
argument to return FPSCR fields.
* doc/extend.texi (__builtin_set_fpscr_rn): Update description for
the return value.  Add description for
__SET_FPSCR_RN_RETURNS_FPSCR__ macro.

gcc/testsuite/ChangeLog:
* gcc.target/powerpc/test_fpscr_rn_builtin.c: Rename to
test_fpscr_rn_builtin_1.c.  Add comment.
* gcc.target/powerpc/test_fpscr_rn_builtin_2.c: New test for the
return value of __builtin_set_fpscr_rn builtin.

2 years agolibstdc++: std::stoi etc. do not need C99 <stdlib.h> support [PR110653]
Jonathan Wakely [Thu, 13 Jul 2023 09:44:57 +0000 (10:44 +0100)] 
libstdc++: std::stoi etc. do not need C99 <stdlib.h> support [PR110653]

std::stoi, std::stol, std::stoul, and std::stod only depend on C89
functions, so don't need to be guarded by _GLIBCXX_USE_C99_STDLIB

std::stoll and std::stoull don't need C99 strtoll and strtoull if
sizeof(long) == sizeof(long long).

std::stold doesn't need C99 strtold if DBL_MANT_DIG == LDBL_MANT_DIG.

This only applies to the narrow character overloads, the wchar_t
overloads depend on a separate _GLIBCXX_USE_C99_WCHAR macro and none of
them can be implemented in C89 easily.

libstdc++-v3/ChangeLog:

PR libstdc++/110653
* include/bits/basic_string.h (stoi, stol, stoul, stod): Do not
depend on _GLIBCXX_USE_C99_STDLIB.
[__LONG_WIDTH__ == __LONG_LONG_WIDTH__] (stoll, stoull): Define
in terms of stol and stoul respectively.
[__DBL_MANT_DIG__ == __LDBL_MANT_DIG__] (stold): Define in terms
of stod.

2 years agoalpha: Fix computation mode in alpha_emit_set_long_const [PR106966]
Uros Bizjak [Thu, 13 Jul 2023 16:32:15 +0000 (18:32 +0200)] 
alpha: Fix computation mode in alpha_emit_set_long_const [PR106966]

PR target/106966

gcc/ChangeLog:

* config/alpha/alpha.cc (alpha_emit_set_long_const):
Always use DImode when constructing long const.

gcc/testsuite/ChangeLog:

* gcc.target/alpha/pr106966.c: New test.

2 years agoRA+sched: Change TRUE/FALSE to true/false
Uros Bizjak [Thu, 13 Jul 2023 15:47:49 +0000 (17:47 +0200)] 
RA+sched: Change TRUE/FALSE to true/false

gcc/ChangeLog:

* haifa-sched.cc: Change TRUE/FALSE to true/false.
* ira.cc: Ditto.
* lra-assigns.cc: Ditto.
* lra-constraints.cc: Ditto.
* sel-sched.cc: Ditto.

2 years agoFix part of PR 110293: `A NEEQ (A NEEQ CST)` part
Andrew Pinski [Wed, 12 Jul 2023 07:33:14 +0000 (00:33 -0700)] 
Fix part of PR 110293: `A NEEQ (A NEEQ CST)` part

This fixes part of PR 110293, for the outer comparison case
being `!=` or `==`.  In turn PR 110539 is able to be optimized
again as the if statement for `(a&1) == ((a & 1) != 0)` gets optimized
to `false` early enough to allow FRE/DOM to do a CSE for memory store/load.

OK? Bootstrapped and tested on x86_64-linux with no regressions.

gcc/ChangeLog:

PR tree-optimization/110293
PR tree-optimization/110539
* match.pd: Expand the `x != (typeof x)(x == 0)`
pattern to handle cases where the inner and outer comparisons
are either `!=` or `==` and handle constants other
than 0.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr110293-1.c: New test.
* gcc.dg/tree-ssa/pr110539-1.c: New test.
* gcc.dg/tree-ssa/pr110539-2.c: New test.
* gcc.dg/tree-ssa/pr110539-3.c: New test.
* gcc.dg/tree-ssa/pr110539-4.c: New test.

2 years ago[RA][PR109520]: Catch error when there are not enough registers for asm insn
Vladimir N. Makarov [Thu, 13 Jul 2023 14:42:17 +0000 (10:42 -0400)] 
[RA][PR109520]: Catch error when there are not enough registers for asm insn

An asm insn, unlike other insns, can have so many operands that their
constraints cannot be satisfied.  This results in LRA cycling for such
a test case.  The following patch catches this situation and reports the
problem.

PR middle-end/109520

gcc/ChangeLog:

* lra-int.h (lra_insn_recog_data): Add member asm_reloads_num.
(lra_asm_insn_error): New prototype.
* lra.cc: Include rtl-error.h.
(lra_set_insn_recog_data): Initialize asm_reloads_num.
(lra_asm_insn_error): New func whose code is taken from ...
* lra-assigns.cc (lra_split_hard_reg_for): ... here.  Use lra_asm_insn_error.
* lra-constraints.cc (curr_insn_transform): Check reloads number for asm.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr109520.c: New test.

2 years agoSSA MATH: Support COND_LEN_FMA for floating-point math optimization
Ju-Zhe Zhong [Thu, 13 Jul 2023 08:54:34 +0000 (16:54 +0800)] 
SSA MATH: Support COND_LEN_FMA for floating-point math optimization

Hi, Richard and Richi.

A previous patch added support for COND_LEN_* binary operations; however, it
didn't support COND_LEN_* ternary operations.

Now, this patch supports COND_LEN_* ternary.  Consider the following case:

  __attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst,            \
      TYPE *__restrict a,              \
      TYPE *__restrict b,\
                TYPE *__restrict c, int n)       \
  {                                                                            \
    for (int i = 0; i < n; i++)                                                \
      dst[i] += a[i] * b[i];                                                   \
  }

TEST_ALL ()

Before this patch:
...
COND_LEN_MUL
COND_LEN_ADD

After this patch:
...
COND_LEN_FMA

gcc/ChangeLog:

* genmatch.cc (commutative_op): Add COND_LEN_*.
* internal-fn.cc (first_commutative_argument): Ditto.
(CASE): Ditto.
(get_unconditional_internal_fn): Ditto.
(can_interpret_as_conditional_op_p): Ditto.
(internal_fn_len_index): Ditto.
* internal-fn.h (can_interpret_as_conditional_op_p): Ditto.
* tree-ssa-math-opts.cc (convert_mult_to_fma_1): Ditto.
(convert_mult_to_fma): Ditto.
(math_opts_dom_walker::after_dom_children): Ditto.

2 years agotestsuite: dg-require LTO for libgomp LTO tests
David Edelsohn [Wed, 12 Jul 2023 18:31:20 +0000 (14:31 -0400)] 
testsuite: dg-require LTO for libgomp LTO tests

Some test cases in the libgomp testsuite pass -flto as an option, but
the test cases do not require LTO target support.  This patch adds
the necessary DejaGnu requirement for LTO support to the test cases.

libgomp/ChangeLog:
* testsuite/libgomp.c++/target-map-class-2.C: Require LTO.
* testsuite/libgomp.c-c++-common/requires-4.c: Require LTO.
* testsuite/libgomp.c-c++-common/requires-4a.c: Require LTO.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
2 years agoRISC-V: Refactor riscv mode after for VXRM and FRM
Pan Li [Wed, 12 Jul 2023 05:38:42 +0000 (13:38 +0800)] 
RISC-V: Refactor riscv mode after for VXRM and FRM

When investigating the FRM dynamic rounding mode, we found that the global
unknown status is quite different between fixed-point and
floating-point.  Thus, we separate the unknown handling, extracting
some common inner functions.

We will also prepare more test cases in another PATCH.

Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:

* config/riscv/riscv.cc (vxrm_rtx): New static var.
(frm_rtx): Ditto.
(global_state_unknown_p): Removed.
(riscv_entity_mode_after): Removed.
(asm_insn_p): New function.
(vxrm_unknown_p): New function for fixed-point.
(riscv_vxrm_mode_after): Ditto.
(frm_unknown_dynamic_p): New function for floating-point.
(riscv_frm_mode_after): Ditto.
(riscv_mode_after): Leverage new functions.

2 years agoRISC-V: Add more tests for RVV floating-point FRM.
Pan Li [Wed, 12 Jul 2023 15:01:39 +0000 (23:01 +0800)] 
RISC-V: Add more tests for RVV floating-point FRM.

Add more test cases include both the asm check and run for RVV FRM.

Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/float-point-frm-insert-10.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-7.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-8.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-9.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-1.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-2.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-3.c: New test.

2 years agovect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS

This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.  It removes function
vect_model_load_cost which becomes useless and unreachable
now.

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_model_load_cost): Remove.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
calling vect_model_load_cost.

2 years agovect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE

This patch adjusts the cost handling on
VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

As the affected test case gcc.target/i386/pr70021.c shows,
the previous costing can under-cost the total generated
vector loads: for VMAT_CONTIGUOUS_PERMUTE, function
vect_model_load_cost doesn't consider the group size, which
is taken as vec_num during the transformation.

This patch makes the count of vector loads in costing
consistent with what we generate during the transformation.
To be more specific, for the given test case, the memory
access b[i_20] was previously costed as 2 vector loads;
with this patch it costs 8 instead, matching the final
count of vector loads generated from b.  This costing
change makes the cost model analysis consider the first
loop unprofitable to vectorize, so this patch adjusts the
test case to disable the vect cost model.

But note that this test case also exposes something we can
improve further: although the number of vector permutations
we costed and generated is consistent, DCE can optimize
some unused permutations away.  It would be good if we
could predict that and generate only the necessary
permutations.

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
handle memory_access_type VMAT_CONTIGUOUS, remove some
VMAT_CONTIGUOUS_PERMUTE related handlings.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
without calling vect_model_load_cost.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.

2 years agovect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE

This patch adjusts the cost handling on
VMAT_CONTIGUOUS_REVERSE in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

This change stops us from miscounting some required vector
permutations, as the associated test case shows.

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_model_load_cost): Assert it won't get
VMAT_CONTIGUOUS_REVERSE any more.
(vectorizable_load): Adjust the costing handling on
VMAT_CONTIGUOUS_REVERSE without calling vect_model_load_cost.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c: New test.

2 years agovect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES

This patch adjusts the cost handling on
VMAT_LOAD_STORE_LANES in function vectorizable_load.  We
don't call function vect_model_load_cost for it any more.

It follows what we do in the function vect_model_load_cost,
and shouldn't have any functional changes.

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
VMAT_LOAD_STORE_LANES without calling vect_model_load_cost.
(vect_model_load_cost): Remove VMAT_LOAD_STORE_LANES related handling and
assert it will never get VMAT_LOAD_STORE_LANES.

2 years agovect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER

This patch adjusts the cost handling on VMAT_GATHER_SCATTER
in function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.

It's mainly for gather loads with IFN or emulated gather
loads, it follows the handlings in function
vect_model_load_cost.  This patch shouldn't have any
functional changes.

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
VMAT_GATHER_SCATTER without calling vect_model_load_cost.
(vect_model_load_cost): Adjust the assertion on VMAT_GATHER_SCATTER,
remove VMAT_GATHER_SCATTER related handlings and the related parameter
gs_info.

2 years agovect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP

This patch adjusts the cost handling on VMAT_ELEMENTWISE
and VMAT_STRIDED_SLP in function vectorizable_load.  We
don't call function vect_model_load_cost for them any more.

As PR82255 shows, we don't always need a vector construction
there, moving costing next to the transform can make us only
cost for vector construction when it's actually needed.
Besides, it can count the number of loads consistently for
some cases.

 PR tree-optimization/82255

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling
on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP without calling
vect_model_load_cost.
(vect_model_load_cost): Assert it won't get VMAT_ELEMENTWISE and
VMAT_STRIDED_SLP any more, and remove their related handlings.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New test.

2023-06-13  Bill Schmidt  <wschmidt@linux.ibm.com>
    Kewen Lin  <linkw@linux.ibm.com>

2 years agovect: Adjust vectorizable_load costing on VMAT_INVARIANT
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_INVARIANT

This patch adjusts the cost handling on VMAT_INVARIANT in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.

To make the costing on VMAT_INVARIANT better, this patch
queries hoist_defs_of_uses for the hoisting decision, and adds
costs for the different "where" values based on it.  Currently
function hoist_defs_of_uses would always hoist the defs of all
SSA uses; adding one argument HOIST_P avoids the actual
hoisting during the costing phase.

gcc/ChangeLog:

* tree-vect-stmts.cc (hoist_defs_of_uses): Add one argument HOIST_P.
(vectorizable_load): Adjust the handling on VMAT_INVARIANT to respect
hoisting decision and without calling vect_model_load_cost.
(vect_model_load_cost): Assert it won't get VMAT_INVARIANT any more
and remove VMAT_INVARIANT related handlings.

2 years agovect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl

This patch adds one extra argument cost_vec to function
vect_build_gather_load_calls, so that we can do costing
next to the transform in vect_build_gather_load_calls.
For now, the implementation just follows the handling in
vect_model_load_cost; it isn't ideal, so a FIXME is placed
for any further improvement.  This patch should not
cause any functional changes.

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_build_gather_load_calls): Add the handlings
on costing with one extra argument cost_vec.
(vectorizable_load): Adjust the call to vect_build_gather_load_calls.
(vect_model_load_cost): Assert it won't get VMAT_GATHER_SCATTER with
gs_info.decl set any more.

2 years agovect: Move vect_model_load_cost next to the transform in vectorizable_load
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)] 
vect: Move vect_model_load_cost next to the transform in vectorizable_load

This patch is an initial patch to move costing next to the
transform, it still adopts vect_model_load_cost for costing
but moves and duplicates it down according to the handlings
of different vect_memory_access_types, hope it can make the
subsequent patches easy to review.  This patch should not
have any functional changes.

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Move and duplicate the call
to vect_model_load_cost down to some different transform paths
according to the handlings of different vect_memory_access_types.

2 years agotree: Hide wi::from_mpz from GENERATOR_FILE
Kewen Lin [Thu, 13 Jul 2023 02:22:26 +0000 (21:22 -0500)] 
tree: Hide wi::from_mpz from GENERATOR_FILE

Similar to r0-85707-g34917a102a4e0c for PR35051, the uses
of mpz_t should be guarded with "#ifndef GENERATOR_FILE".
This patch fixes that and avoids some possible build
errors.

gcc/ChangeLog:

* tree.h (wi::from_mpz): Hide from GENERATOR_FILE.

2 years agomklog: Add --append option to auto add generate ChangeLog to patch file
Lehua Ding [Wed, 12 Jul 2023 03:36:40 +0000 (11:36 +0800)] 
mklog: Add --append option to auto add generate ChangeLog to patch file

This tiny patch adds an --append option to mklog.py that supports
appending the generated ChangeLog to the corresponding patch file.
With this option there is no need to manually copy the generated
ChangeLog into the patch file, e.g.:

Running `mklog.py --append /path/to/this/patch` will add the generated
ChangeLog to the right place in the /path/to/this/patch file.

contrib/ChangeLog:

* mklog.py: Add --append option.

Signed-off-by: Lehua Ding <lehua.ding@rivai.ai>
2 years agoRISC-V: Support gather_load/scatter_store RVV auto-vectorization
Ju-Zhe Zhong [Wed, 12 Jul 2023 09:38:49 +0000 (17:38 +0800)] 
RISC-V: Support gather_load/scatter_store RVV auto-vectorization

This patch fully supports gather_load/scatter_store:
1. Support single-rgroup on both RV32/RV64.
2. Support an indexed element width that is the same as or smaller than Pmode.
3. Support VLA SLP with gather/scatter.
4. Fully tested all gather/scatter with LMUL = M1/M2/M4/M8, both VLA and VLS.
5. Fix a bug in handling (subreg:SI (const_poly_int:DI)).
6. Fix a bug in vec_perm which is used by gather/scatter SLP.

All kinds of GATHER/SCATTER are normalized into LEN_MASK_*.
We fully support these four kinds of gather/scatter:
1. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with dummy length and dummy mask (Full vector).
2. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with dummy length and real mask.
3. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with real length and dummy mask.
4. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with real length and real mask.

Based on the discussions with Richards, we don't lower vlse/vsse in the RISC-V backend for strided load/store.
Instead, we leave it to the middle-end to handle that.

Regression tests pass.  OK for trunk?

gcc/ChangeLog:

* config/riscv/autovec.md
(len_mask_gather_load<VNX1_QHSD:mode><VNX1_QHSDI:mode>): New pattern.
(len_mask_gather_load<VNX2_QHSD:mode><VNX2_QHSDI:mode>): Ditto.
(len_mask_gather_load<VNX4_QHSD:mode><VNX4_QHSDI:mode>): Ditto.
(len_mask_gather_load<VNX8_QHSD:mode><VNX8_QHSDI:mode>): Ditto.
(len_mask_gather_load<VNX16_QHSD:mode><VNX16_QHSDI:mode>): Ditto.
(len_mask_gather_load<VNX32_QHS:mode><VNX32_QHSI:mode>): Ditto.
(len_mask_gather_load<VNX64_QH:mode><VNX64_QHI:mode>): Ditto.
(len_mask_gather_load<mode><mode>): Ditto.
(len_mask_scatter_store<VNX1_QHSD:mode><VNX1_QHSDI:mode>): Ditto.
(len_mask_scatter_store<VNX2_QHSD:mode><VNX2_QHSDI:mode>): Ditto.
(len_mask_scatter_store<VNX4_QHSD:mode><VNX4_QHSDI:mode>): Ditto.
(len_mask_scatter_store<VNX8_QHSD:mode><VNX8_QHSDI:mode>): Ditto.
(len_mask_scatter_store<VNX16_QHSD:mode><VNX16_QHSDI:mode>): Ditto.
(len_mask_scatter_store<VNX32_QHS:mode><VNX32_QHSI:mode>): Ditto.
(len_mask_scatter_store<VNX64_QH:mode><VNX64_QHI:mode>): Ditto.
(len_mask_scatter_store<mode><mode>): Ditto.
* config/riscv/predicates.md (const_1_operand): New predicate.
(vector_gs_scale_operand_16): Ditto.
(vector_gs_scale_operand_32): Ditto.
(vector_gs_scale_operand_64): Ditto.
(vector_gs_extension_operand): Ditto.
(vector_gs_scale_operand_16_rv32): Ditto.
(vector_gs_scale_operand_32_rv32): Ditto.
* config/riscv/riscv-protos.h (enum insn_type): Add gather/scatter.
(expand_gather_scatter): New function.
* config/riscv/riscv-v.cc (gen_const_vector_dup): Add gather/scatter.
(emit_vlmax_masked_store_insn): New function.
(emit_nonvlmax_masked_store_insn): Ditto.
(modulo_sel_indices): Ditto.
(expand_vec_perm): Fix SLP for gather/scatter.
(prepare_gather_scatter): New function.
(expand_gather_scatter): Ditto.
* config/riscv/riscv.cc (riscv_legitimize_move): Fix bug of
(subreg:SI (DI CONST_POLY_INT)).
* config/riscv/vector-iterators.md: Add gather/scatter.
* config/riscv/vector.md (vec_duplicate<mode>): Use "@" instead.
(@vec_duplicate<mode>): Ditto.
(@pred_indexed_<order>store<VNX16_QHS:mode><VNX16_QHSDI:mode>):
Fix name.
(@pred_indexed_<order>store<VNX16_QHSD:mode><VNX16_QHSDI:mode>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/rvv.exp: Add gather/scatter tests.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-1.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-12.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-2.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-3.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-4.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-5.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-6.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-7.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-8.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-9.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-12.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-2.c:
New test.

2 years agoDaily bump.
GCC Administrator [Thu, 13 Jul 2023 00:17:12 +0000 (00:17 +0000)] 
Daily bump.

2 years agoRISC-V: Support COND_LEN_* patterns
Juzhe-Zhong [Wed, 12 Jul 2023 15:24:38 +0000 (23:24 +0800)] 
RISC-V: Support COND_LEN_* patterns

The middle-end support has been merged:
https://github.com/gcc-mirror/gcc/commit/0d4dd7e07a879d6c07a33edb2799710faa95651e

With this patch, we can handle operations that may trap on elements outside the loop.

The following two cases are addressed by this patch:

1. integer division:

  #define TEST_TYPE(TYPE) \
  __attribute__((noipa)) \
  void vrem_##TYPE (TYPE * __restrict dst, TYPE * __restrict a, TYPE * __restrict b, int n) \
  { \
    for (int i = 0; i < n; i++) \
      dst[i] = a[i] % b[i]; \
  }
  #define TEST_ALL() \
    TEST_TYPE(int8_t)

  TEST_ALL()

  Before this patch:

   vrem_int8_t:
        ble     a3,zero,.L14
        csrr    t4,vlenb
        addiw   a5,a3,-1
        addiw   a4,t4,-1
        sext.w  t5,a3
        bltu    a5,a4,.L10
        csrr    t3,vlenb
        subw    t3,t5,t3
        li      a5,0
        vsetvli t6,zero,e8,m1,ta,ma
.L4:
        add     a6,a2,a5
        add     a7,a0,a5
        add     t1,a1,a5
        mv      a4,a5
        add     a5,a5,t4
        vl1re8.v        v2,0(a6)
        vl1re8.v        v1,0(t1)
        sext.w  a6,a5
        vrem.vv v1,v1,v2
        vs1r.v  v1,0(a7)
        bleu    a6,t3,.L4
        csrr    a5,vlenb
        addw    a4,a4,a5
        sext.w  a5,a4
        beq     t5,a4,.L16
.L3:
        csrr    a6,vlenb
        subw    t5,t5,a4
        srli    a6,a6,1
        addiw   t1,t5,-1
        addiw   a7,a6,-1
        bltu    t1,a7,.L9
        slli    a4,a4,32
        srli    a4,a4,32
        add     t0,a1,a4
        add     t6,a2,a4
        add     a4,a0,a4
        vsetvli a7,zero,e8,mf2,ta,ma
        sext.w  t3,a6
        vle8.v  v1,0(t0)
        vle8.v  v2,0(t6)
        subw    t4,t5,a6
        vrem.vv v1,v1,v2
        vse8.v  v1,0(a4)
        mv      t1,t3
        bltu    t4,t3,.L7
        csrr    t1,vlenb
        add     a4,a4,a6
        add     t0,t0,a6
        add     t6,t6,a6
        sext.w  t1,t1
        vle8.v  v1,0(t0)
        vle8.v  v2,0(t6)
        vrem.vv v1,v1,v2
        vse8.v  v1,0(a4)
.L7:
        addw    a5,t1,a5
        beq     t5,t1,.L14
.L9:
        add     a4,a1,a5
        add     a6,a2,a5
        lb      a6,0(a6)
        lb      a4,0(a4)
        add     a7,a0,a5
        addi    a5,a5,1
        remw    a4,a4,a6
        sext.w  a6,a5
        sb      a4,0(a7)
        bgt     a3,a6,.L9
.L14:
        ret
.L10:
        li      a4,0
        li      a5,0
        j       .L3
.L16:
        ret

After this patch:

   vrem_int8_t:
        ble     a3,zero,.L5
.L3:
        vsetvli a5,a3,e8,m1,tu,ma
        vle8.v  v1,0(a1)
        vle8.v  v2,0(a2)
        sub     a3,a3,a5
        vrem.vv v1,v1,v2
        vse8.v  v1,0(a0)
        add     a1,a1,a5
        add     a2,a2,a5
        add     a0,a0,a5
        bne     a3,zero,.L3
.L5:
        ret

2. Floating-point operation **WITHOUT** -ffast-math:

    #define TEST_TYPE(TYPE) \
    __attribute__((noipa)) \
    void vadd_##TYPE (TYPE * __restrict dst, TYPE *__restrict a, TYPE *__restrict b, int n) \
    { \
      for (int i = 0; i < n; i++) \
        dst[i] = a[i] + b[i]; \
    }

    #define TEST_ALL() \
     TEST_TYPE(float)

    TEST_ALL()

Before this patch:

   vadd_float:
        ble     a3,zero,.L10
        csrr    a4,vlenb
        srli    t3,a4,2
        addiw   a5,a3,-1
        addiw   a6,t3,-1
        sext.w  t6,a3
        bltu    a5,a6,.L7
        subw    t5,t6,t3
        mv      t1,a1
        mv      a7,a2
        mv      a6,a0
        li      a5,0
        vsetvli t4,zero,e32,m1,ta,ma
.L4:
        vl1re32.v       v1,0(t1)
        vl1re32.v       v2,0(a7)
        addw    a5,a5,t3
        vfadd.vv        v1,v1,v2
        vs1r.v  v1,0(a6)
        add     t1,t1,a4
        add     a7,a7,a4
        add     a6,a6,a4
        bgeu    t5,a5,.L4
        beq     t6,a5,.L10
        sext.w  a5,a5
.L3:
        slli    a4,a5,2
.L6:
        add     a6,a1,a4
        add     a7,a2,a4
        flw     fa4,0(a6)
        flw     fa5,0(a7)
        add     a6,a0,a4
        addiw   a5,a5,1
        fadd.s  fa5,fa5,fa4
        addi    a4,a4,4
        fsw     fa5,0(a6)
        bgt     a3,a5,.L6
.L10:
        ret
.L7:
        li      a5,0
        j       .L3

After this patch:

   vadd_float:
        ble     a3,zero,.L5
.L3:
        vsetvli a5,a3,e32,m1,tu,ma
        slli    a4,a5,2
        vle32.v v1,0(a1)
        vle32.v v2,0(a2)
        sub     a3,a3,a5
        vfadd.vv        v1,v1,v2
        vse32.v v1,0(a0)
        add     a1,a1,a4
        add     a2,a2,a4
        add     a0,a0,a4
        bne     a3,zero,.L3
.L5:
        ret

gcc/ChangeLog:

* config/riscv/autovec.md (cond_len_<optab><mode>): New pattern.
* config/riscv/riscv-protos.h (enum insn_type): New enum.
(expand_cond_len_binop): New function.
* config/riscv/riscv-v.cc (emit_nonvlmax_tu_insn): Ditto.
(emit_nonvlmax_fp_tu_insn): Ditto.
(need_fp_rounding_p): Ditto.
(expand_cond_len_binop): Ditto.
* config/riscv/riscv.cc (riscv_preferred_else_value): Ditto.
(TARGET_PREFERRED_ELSE_VALUE): New target hook.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv.c: Adapt testcase.
* gcc.target/riscv/rvv/autovec/binop/vdiv-rv64gcv.c: Ditto.
* gcc.target/riscv/rvv/autovec/binop/vrem-rv32gcv.c: Ditto.
* gcc.target/riscv/rvv/autovec/binop/vrem-rv64gcv.c: Ditto.
* gcc.target/riscv/rvv/autovec/binop/vadd-run-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vadd-rv32gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vadd-rv64gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vdiv-run-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vdiv-rv64gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vmul-run-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vmul-rv32gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vmul-rv64gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vsub-run-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vsub-rv32gcv-nofm.c: New test.
* gcc.target/riscv/rvv/autovec/binop/vsub-rv64gcv-nofm.c: New test.

2 years agoBreak out profile updating code from gimple_duplicate_sese_region
Jan Hubicka [Wed, 12 Jul 2023 23:29:17 +0000 (01:29 +0200)] 
Break out profile updating code from gimple_duplicate_sese_region

Move the profile updating to tree-ssa-loop-ch.cc since it is
now quite ch-specific.  There are no functional changes.

Bootstrapped/regtested on x86_64-linux, committed.

gcc/ChangeLog:

* tree-cfg.cc (gimple_duplicate_sese_region): Rename to ...
(gimple_duplicate_seme_region): ... this; break out profile updating
code to ...
* tree-ssa-loop-ch.cc (update_profile_after_ch): ... here.
(ch_base::copy_headers): Update.
* tree-cfg.h (gimple_duplicate_sese_region): Rename to ...
(gimple_duplicate_seme_region): ... this.

2 years ago[range-op] Take known mask into account for bitwise ands [PR107043]
Aldy Hernandez [Thu, 6 Jul 2023 08:55:46 +0000 (10:55 +0200)] 
[range-op] Take known mask into account for bitwise ands [PR107043]

PR tree-optimization/107043

gcc/ChangeLog:

* range-op.cc (operator_bitwise_and::op1_range): Update bitmask.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr107043.c: New test.

2 years ago[range-op] Take known set bits into account in popcount [PR107053]
Aldy Hernandez [Fri, 30 Jun 2023 18:40:02 +0000 (20:40 +0200)] 
[range-op] Take known set bits into account in popcount [PR107053]

This patch teaches popcount about known set bits, which are now
available in the irange.

PR tree-optimization/107053

gcc/ChangeLog:

* gimple-range-op.cc (cfn_popcount): Use known set bits.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/pr107053.c: New test.