Jose E. Marchesi [Fri, 14 Jul 2023 11:54:06 +0000 (13:54 +0200)]
bpf: enable instruction scheduling
This patch adds a dummy FSM to bpf.md in order to get INSN_SCHEDULING
defined. If the latter is not defined, the `combine' pass generates
paradoxical subregs of mems, which seems to then be mishandled by LRA,
resulting in invalid code.
Tested on bpf-unknown-none.
gcc/ChangeLog:
2023-07-14 Jose E. Marchesi <jose.marchesi@oracle.com>
Mikael Morin [Fri, 14 Jul 2023 12:15:51 +0000 (14:15 +0200)]
fortran: Reorder array argument evaluation parts [PR92178]
In the case of an array actual arg passed to a polymorphic array dummy
with INTENT(OUT) attribute, reorder the argument evaluation code to
the following:
- first evaluate arguments' values, and data references,
- deallocate data references associated with an allocatable,
intent(out) dummy,
- create a class container using the freed data references.
The ordering used to be incorrect between the first two items,
when one argument was deallocated before a later argument evaluated
its expression depending on the former argument. r14-2395-gb1079fc88f082d3c5b583c8822c08c5647810259 fixed it by treating
arguments associated with an allocatable, intent(out) dummy in a
separate, later block. This, however, wasn't working either if the data
reference of such an argument was depending on its own content, as
the class container initialization was trying to use deallocated
content.
This change generates class container initialization code in a separate
block, so that it is moved after the deallocation block without moving
the rest of the argument evaluation code.
This alone is not sufficient to fix the problem, because the class
container generation code repeatedly uses the full expression of
the argument at a place where deallocation might have happened
already. This is non-optimal, but may also be invalid, because the data
reference may depend on its own content. In that case the expression
can't be evaluated after the data has been deallocated.
As in the scalar case previously treated, this is fixed by saving
the data reference to a pointer before any deallocation happens,
and then only referring to the pointer. gfc_reset_vptr is updated
to take into account the already evaluated class container if it's
available.
Contrary to the scalar case, one hunk is needed to wrap the parameter
evaluation in a conditional, to avoid regressing in
optional_class_2.f90. This used to be handled by the class wrapper
construction which wrapped the whole code in a conditional. With
this change the class wrapper construction can't see the parameter
evaluation code, so the latter is updated with an additional handling
for optional arguments.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans.h (gfc_reset_vptr): Add class_container argument.
* trans-expr.cc (gfc_reset_vptr): Ditto. If a valid vptr can
be obtained through class_container argument, bypass evaluation
of e.
(gfc_conv_procedure_call): Wrap the argument evaluation code
in a conditional if the associated dummy is optional. Evaluate
the data reference to a pointer now, and replace later
references with usage of the pointer.
Mikael Morin [Fri, 14 Jul 2023 12:15:21 +0000 (14:15 +0200)]
fortran: Factor data references for scalar class argument wrapping [PR92178]
In the case of a scalar actual arg passed to a polymorphic assumed-rank
dummy with INTENT(OUT) attribute, avoid repeatedly evaluating the actual
argument reference by saving a pointer to it. This is non-optimal, but
may also be invalid, because the data reference may depend on its own
content. In that case the expression can't be evaluated after the data
has been deallocated.
There are two ways redundant expressions are generated:
- parmse.expr, which contains the actual argument expression, is
reused to get or set subfields in gfc_conv_class_to_class.
- gfc_conv_class_to_class, to get the virtual table pointer associated
with the argument, generates a new expression from scratch starting
with the frontend expression.
The first part is fixed by saving parmse.expr to a pointer and using
the pointer instead of the original expression.
The second part is fixed by adding a separate field to gfc_se that
is set to the class container expression when the expression to
evaluate is polymorphic. This needs the same field in gfc_ss_info
so that its value can be propagated to gfc_conv_class_to_class which
is modified to use that value. Finally gfc_conv_procedure saves the
expression in that field to a pointer in between to avoid the same
problem as for the first part.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans.h (struct gfc_se): New field class_container.
(struct gfc_ss_info): Ditto.
(gfc_evaluate_data_ref_now): New prototype.
* trans.cc (gfc_evaluate_data_ref_now): Implement it.
* trans-array.cc (gfc_conv_ss_descriptor): Copy class_container
field from gfc_se struct to gfc_ss_info struct.
(gfc_conv_expr_descriptor): Copy class_container field from
gfc_ss_info struct to gfc_se struct.
* trans-expr.cc (gfc_conv_class_to_class): Use class container
set in class_container field if available.
(gfc_conv_variable): Set class_container field on encountering
class variables or components, clear it on encountering
non-class components.
(gfc_conv_procedure_call): Evaluate data ref to a pointer now,
and replace later references by usage of the pointer.
Mikael Morin [Fri, 14 Jul 2023 12:15:07 +0000 (14:15 +0200)]
fortran: defer class wrapper initialization after deallocation [PR92178]
If an actual argument is associated with an INTENT(OUT) dummy, and code
to deallocate it is generated, generate the class wrapper initialization
after the actual argument deallocation.
This is achieved by passing a cleaned up expression to
gfc_conv_class_to_class, so that the class wrapper initialization code
can be isolated and moved independently after the deallocation.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans-expr.cc (gfc_conv_procedure_call): Use a separate gfc_se
struct, initialized from parmse, to generate the class wrapper.
After the class wrapper code has been generated, copy it back
depending on whether parameter deallocation code has been
generated.
libgomp/
* libgomp.texi (OMP_ALLOCATOR): Document the default values for
the traits. Add crossref to 'Memory allocation'.
(Memory allocation): Refer to OMP_ALLOCATOR for the available
traits and allocators/mem spaces; document the default value
for the pool_size trait.
All tree arguments to the PHI have the same number of occurrences, namely 1;
however, it makes a big difference which comparison we test first.
Sorting only on occurrences we'll pick the compares coming from BB 18 and BB 17.
This means we end up generating 4 comparisons, while 2 would have been enough.
By keeping track of the "complexity" of the COND in each BB, (i.e. the number
of comparisons needed to traverse from the start [BB 15] to end [BB 19]) and
using a key tuple of <occurrences, complexity> we end up selecting the compare
from BB 16 and BB 18 first. BB 16 only requires 1 compare, and BB 18, after we
test BB 16 also only requires one additional compare. This change paired with
the one previous above results in the optimal 2 compares.
Most of the comparisons are still needed because the conditions in the
chain do not negate each other; e.g. _80 is _73 & vr_15 >= -20 and
_85 is _73 & vr_15 < -20. Clearly, given that _73 needs to be true in both
branches, the only additional test needed is on vr_15, where one test is the
negation of the other. So we don't need to do the comparison of _73 twice.
The changes in the patch reduces the overall number of compares by one, but has
a bigger effect on the dependency chain.
Previously we would generate a chain of 5 instructions:
As explained in that commit, ifconvert sorts PHI args in increasing number of
occurrences in order to reduce the number of comparisons done while
traversing the tree.
The remaining task that this patch fixes is dealing with the long chain of
comparisons that can be created from phi nodes, particularly when they share
any common successor (classical example is a diamond node).
On a PHI node the true and false branches each carry a condition: the true
edge carries `a` and the false edge `~a`. The issue is that at the moment GCC
tests both `a` and `~a` when the PHI node has more than 2 arguments. Clearly
this isn't needed. The deeper the nesting of PHI nodes, the larger the repetition.
As an example, for
foo (int *f, int d, int e)
{
for (int i = 0; i < 1024; i++)
{
int a = f[i];
int t;
if (a < 0)
t = 1;
else if (a < e)
t = 1 - a * d;
else
t = 0;
f[i] = t;
}
}
Which correctly elides the test of _21. This is done by borrowing the
vectorizer's helper functions to limit predicate mask usages. Ifcvt will chain
conditionals on the false edge (unless specifically inverted), so this patch,
on creating cond a ? b : c, will register ~a when traversing c. If c is a
conditional then c will be simplified to the smallest possible predicate given
the assumptions we already know to be true.
gcc/ChangeLog:
PR tree-optimization/109154
* tree-if-conv.cc (gen_simplified_condition,
gen_phi_nest_statement): New.
(gen_phi_arg_condition, predicate_scalar_phi): Use it.
gcc/testsuite/ChangeLog:
PR tree-optimization/109154
* gcc.dg/vect/vect-ifcvt-19.c: New test.
Richard Biener [Fri, 14 Jul 2023 08:01:39 +0000 (10:01 +0200)]
Provide extra checking for phi argument access from edge
The following adds checking that the edge we query an associated
PHI arg for is related to the PHI node. Triggered by questionable
code in one of my reviews.
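A sketch of the new accessor's likely shape (assumed for illustration, not the verbatim GCC source):

/* Fetch the PHI argument incoming via edge E, checking that E really
   is an incoming edge of the PHI's basic block.  */
static inline tree
gimple_phi_arg_def_from_edge (const gphi *phi, const edge e)
{
  gcc_checking_assert (e->dest == gimple_bb (phi));
  return gimple_phi_arg_def (phi, e->dest_idx);
}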
* gimple.h (gimple_phi_arg): New const overload.
(gimple_phi_arg_def): Make gimple arg const.
(gimple_phi_arg_def_from_edge): New inline function.
* tree-phinodes.h (gimple_phi_arg_imm_use_ptr_from_edge):
Likewise.
* tree-ssa-operands.h (PHI_ARG_DEF_FROM_EDGE): Direct to
new inline function.
(PHI_ARG_DEF_PTR_FROM_EDGE): Likewise.
libgomp: Fix allocator handling for Linux when libnuma is not available
Follow up to r14-2462-g450b05ce54d3f0. The case that libnuma was not
available at runtime was not properly handled; now it falls back to
the normal malloc.
libgomp/
* allocator.c (omp_init_allocator): Check whether symbol from
dlopened libnuma is available before using libnuma for
allocations.
Die Li [Fri, 14 Jul 2023 02:02:05 +0000 (02:02 +0000)]
RISC-V: Remove the redundant expressions in the and<mode>3.
When generating the gen_and<mode>3 function based on the and<mode>3
template, it produces the expression emit_insn (gen_rtx_SET (operand0,
gen_rtx_AND (<mode>, operand1, operand2))), which is identical to the
portion I removed in this patch. Therefore, the redundant portion can be
deleted.
Signed-off-by: Die Li <lidie@eswincomputing.com>
gcc/ChangeLog:
* config/riscv/riscv.md: Remove redundant portion in and<mode>3.
When trying to bootstrap current trunk on macOS 14.0 beta 3 with Xcode
15 beta 4, the build failed running mklink in stage 2:
unset CC ; m2/boot-bin/mklink -s --langc++ --exit --name m2/mc-boot/main.cc
/vol/gcc/src/hg/master/darwin/gcc/m2/init/mcinit
dyld[55825]: Library not loaded: /vol/gcc/lib/libstdc++.6.dylib
While it's unclear to me why this only happens on macOS 14, the problem
is clear: unlike other C++ executables, mklink isn't linked with
-static-libstdc++ which is passed in from toplevel in LDFLAGS.
This patch fixes that and allows the build to continue.
Bootstrapped on x86_64-apple-darwin23.0.0, i386-pc-solaris2.11, and
sparc-sun-solaris2.11.
Mikael Morin [Thu, 13 Jul 2023 19:23:44 +0000 (21:23 +0200)]
fortran: Release symbols in reversed order [PR106050]
Release symbols in reversed order wrt the order they were allocated.
This fixes an error recovery ICE in the case of a misplaced
derived type declaration. Such a declaration creates nested
symbols, one for the derived type and one for each type parameter,
which should be immediately released as the declaration is
rejected. This breaks if the derived type is released first.
As the type parameter symbols are in the namespace of the derived
type, releasing the derived type releases the type parameters, so
one can't access them after that, even to release them. Hence,
the type parameters should be released first.
PR fortran/106050
gcc/fortran/ChangeLog:
* symbol.cc (gfc_restore_last_undo_checkpoint): Release symbols
in reverse order.
Darwin: Use -platform_version when available [PR110624].
Later versions of the static linker support a more flexible flag to
describe the OS, OS version and SDK used to build the code. This
replaces the functionality of '-mmacosx_version_min' (which is now
deprecated, leading to the diagnostic described in the PR).
We now use the platform_version flag when available which avoids the
diagnostic.
Carl Love [Thu, 13 Jul 2023 17:44:43 +0000 (13:44 -0400)]
rs6000, Add return value to __builtin_set_fpscr_rn
Change the return value from void to double for __builtin_set_fpscr_rn.
The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI
and RN, at their bit positions. A new test file, powerpc/test_fpscr_rn_builtin_2.c,
is added to test the new return value of the built-in.
The value __SET_FPSCR_RN_RETURNS_FPSCR__ is defined if
__builtin_set_fpscr_rn returns a double.
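A hedged usage sketch (the restore path is omitted and the exact field layout is target-defined):

void
set_round_to_nearest (void)
{
#ifdef __SET_FPSCR_RN_RETURNS_FPSCR__
  double saved = __builtin_set_fpscr_rn (0); /* new: captures old FPSCR fields */
  (void) saved;                              /* keep for a later restore */
#else
  __builtin_set_fpscr_rn (0);                /* older GCC: returns void */
#endif
}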
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/test_fpscr_rn_builtin.c: Rename to
test_fpscr_rn_builtin_1.c. Add comment.
* gcc.target/powerpc/test_fpscr_rn_builtin_2.c: New test for the
return value of __builtin_set_fpscr_rn builtin.
Jonathan Wakely [Thu, 13 Jul 2023 09:44:57 +0000 (10:44 +0100)]
libstdc++: std::stoi etc. do not need C99 <stdlib.h> support [PR110653]
std::stoi, std::stol, std::stoul, and std::stod only depend on C89
functions, so they don't need to be guarded by _GLIBCXX_USE_C99_STDLIB.
std::stoll and std::stoull don't need C99 strtoll and strtoull if
sizeof(long) == sizeof(long long).
std::stold doesn't need C99 strtold if DBL_MANT_DIG == LDBL_MANT_DIG.
This only applies to the narrow character overloads; the wchar_t
overloads depend on a separate _GLIBCXX_USE_C99_WCHAR macro, and none of
them can be implemented in C89 easily.
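A sketch of the width-based fallback idea used for stoll (function name hypothetical, mirroring the guard in the ChangeLog below):

#include <stdlib.h>

/* When long and long long have the same width, C89 strtol covers the
   full range, so C99 strtoll is not needed.  */
long long
stoll_fallback (const char *s, char **end, int base)
{
#if __LONG_WIDTH__ == __LONG_LONG_WIDTH__
  return strtol (s, end, base);
#else
  return strtoll (s, end, base);
#endif
}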
libstdc++-v3/ChangeLog:
PR libstdc++/110653
* include/bits/basic_string.h (stoi, stol, stoul, stod): Do not
depend on _GLIBCXX_USE_C99_STDLIB.
[__LONG_WIDTH__ == __LONG_LONG_WIDTH__] (stoll, stoull): Define
in terms of stol and stoul respectively.
[__DBL_MANT_DIG__ == __LDBL_MANT_DIG__] (stold): Define in terms
of stod.
Andrew Pinski [Wed, 12 Jul 2023 07:33:14 +0000 (00:33 -0700)]
Fix part of PR 110293: `A NEEQ (A NEEQ CST)` part
This fixes part of PR 110293, for the outer comparison case
being `!=` or `==`. In turn PR 110539 is able to be optimized
again as the if statement for `(a&1) == ((a & 1) != 0)` gets optimized
to `false` early enough to allow FRE/DOM to do a CSE for memory store/load.
OK? Bootstrapped and tested on x86_64-linux with no regressions.
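A minimal example of the kind of expression this pattern family folds (a sketch, not taken from the PRs):

/* Folds to a constant: for x == 0 the test is 0 != 1, and for any
   nonzero x it is x != 0, so the result is always 1.  */
int
always_one (int x)
{
  return x != (x == 0);
}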
gcc/ChangeLog:
PR tree-optimization/110293
PR tree-optimization/110539
* match.pd: Expand the `x != (typeof x)(x == 0)`
pattern to handle cases where the inner and outer comparisons
are either `!=` or `==` and handle constants
other than 0.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/pr110293-1.c: New test.
* gcc.dg/tree-ssa/pr110539-1.c: New test.
* gcc.dg/tree-ssa/pr110539-2.c: New test.
* gcc.dg/tree-ssa/pr110539-3.c: New test.
* gcc.dg/tree-ssa/pr110539-4.c: New test.
[RA][PR109520]: Catch error when there are not enough registers for asm insn
An asm insn, unlike other insns, can have so many operands that their
constraints cannot be satisfied. This results in LRA cycling for such a
test case. The following patch catches this situation and reports the
problem.
PR middle-end/109520
gcc/ChangeLog:
* lra-int.h (lra_insn_recog_data): Add member asm_reloads_num.
(lra_asm_insn_error): New prototype.
* lra.cc: Include rtl_error.h.
(lra_set_insn_recog_data): Initialize asm_reloads_num.
(lra_asm_insn_error): New func whose code is taken from ...
* lra-assigns.cc (lra_split_hard_reg_for): ... here. Use lra_asm_insn_error.
* lra-constraints.cc (curr_insn_transform): Check reloads number for asm.
SSA MATH: Support COND_LEN_FMA for floating-point math optimization
Hi, Richard and Richi.
In a previous patch we supported COND_LEN_* binary operations; however, we
didn't support COND_LEN_* ternary operations.
Now, this patch supports COND_LEN_* ternary operations. Consider the following case:
__attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst, \
TYPE *__restrict a, \
TYPE *__restrict b,\
TYPE *__restrict c, int n) \
{ \
for (int i = 0; i < n; i++) \
dst[i] += a[i] * b[i]; \
}
David Edelsohn [Wed, 12 Jul 2023 18:31:20 +0000 (14:31 -0400)]
testsuite: dg-require LTO for libgomp LTO tests
Some test cases in libgomp testsuite pass -flto as an option, but
the testcases do not require LTO target support. This patch adds
the necessary DejaGNU requirement for LTO support to the testcases.
Pan Li [Wed, 12 Jul 2023 05:38:42 +0000 (13:38 +0800)]
RISC-V: Refactor riscv mode after for VXRM and FRM
When investigating the FRM dynamic rounding mode, we found that the global
unknown status is quite different between fixed-point and
floating-point. Thus, we separate the unknown handling and extract
some common inner functions.
We will also prepare more test cases in another patch.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv.cc (vxrm_rtx): New static var.
(frm_rtx): Ditto.
(global_state_unknown_p): Removed.
(riscv_entity_mode_after): Removed.
(asm_insn_p): New function.
(vxrm_unknown_p): New function for fixed-point.
(riscv_vxrm_mode_after): Ditto.
(frm_unknown_dynamic_p): New function for floating-point.
(riscv_frm_mode_after): Ditto.
(riscv_mode_after): Leverage new functions.
Pan Li [Wed, 12 Jul 2023 15:01:39 +0000 (23:01 +0800)]
RISC-V: Add more tests for RVV floating-point FRM.
Add more test cases include both the asm check and run for RVV FRM.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-frm-insert-10.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-7.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-8.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-9.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-1.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-2.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-3.c: New test.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load. We don't call function
vect_model_load_cost for it any more. It removes function
vect_model_load_cost which becomes useless and unreachable
now.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Remove.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
calling vect_model_load_cost.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
This patch adjusts the cost handling on
VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load. We
don't call function vect_model_load_cost for it any more.
As the affected test case gcc.target/i386/pr70021.c shows,
the previous costing can under-cost the total generated
vector loads, since for VMAT_CONTIGUOUS_PERMUTE function
vect_model_load_cost doesn't consider the group size, which
is used as vec_num during the transformation.
This patch makes the count of vector loads in costing
consistent with what we generate during the transformation.
To be more specific, for the given test case, for the memory
access b[i_20], we costed 2 vector loads before;
with this patch it costs 8 instead, matching the final
count of vector loads generated from b. This costing
change makes the cost model analysis conclude it's not profitable
to vectorize the first loop, so this patch adjusts the test
case to run without the vect cost model.
Note that this test case also exposes something we can
improve further: although the numbers of vector
permutations we costed and generated are consistent,
DCE can further optimize some unused permutations out;
it would be good if we could predict that and generate only
the necessary permutations.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
handle memory_access_type VMAT_CONTIGUOUS, remove some
VMAT_CONTIGUOUS_PERMUTE related handlings.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
without calling vect_model_load_cost.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
This patch adjusts the cost handling on
VMAT_CONTIGUOUS_REVERSE in function vectorizable_load. We
don't call function vect_model_load_cost for it any more.
This change makes us not miscount some required vector
permutation as the associated test case shows.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Assert it won't get
VMAT_CONTIGUOUS_REVERSE any more.
(vectorizable_load): Adjust the costing handling on
VMAT_CONTIGUOUS_REVERSE without calling vect_model_load_cost.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c: New test.
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_LOAD_STORE_LANES
This patch adjusts the cost handling on
VMAT_LOAD_STORE_LANES in function vectorizable_load. We
don't call function vect_model_load_cost for it any more.
It follows what we do in the function vect_model_load_cost,
and shouldn't have any functional changes.
gcc/ChangeLog:
* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
VMAT_LOAD_STORE_LANES without calling vect_model_load_cost.
(vectorizable_load): Remove VMAT_LOAD_STORE_LANES related handling and
assert it will never get VMAT_LOAD_STORE_LANES.
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER
This patch adjusts the cost handling on VMAT_GATHER_SCATTER
in function vectorizable_load. We don't call function
vect_model_load_cost for it any more.
It's mainly for gather loads with IFN or emulated gather
loads; it follows the handling in function
vect_model_load_cost. This patch shouldn't have any
functional changes.
gcc/ChangeLog:
* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling on
VMAT_GATHER_SCATTER without calling vect_model_load_cost.
(vect_model_load_cost): Adjust the assertion on VMAT_GATHER_SCATTER,
remove VMAT_GATHER_SCATTER related handlings and the related parameter
gs_info.
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP
This patch adjusts the cost handling on VMAT_ELEMENTWISE
and VMAT_STRIDED_SLP in function vectorizable_load. We
don't call function vect_model_load_cost for them any more.
As PR82255 shows, we don't always need a vector construction
there; moving costing next to the transform lets us cost the
vector construction only when it's actually needed.
Besides, it can count the number of loads consistently for
some cases.
PR tree-optimization/82255
gcc/ChangeLog:
* tree-vect-stmts.cc (vectorizable_load): Adjust the cost handling
on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP without calling
vect_model_load_cost.
(vect_model_load_cost): Assert it won't get VMAT_ELEMENTWISE and
VMAT_STRIDED_SLP any more, and remove their related handlings.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New test.
2023-06-13 Bill Schmidt <wschmidt@linux.ibm.com>
Kewen Lin <linkw@linux.ibm.com>
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_INVARIANT
This patch adjusts the cost handling on VMAT_INVARIANT in
function vectorizable_load. We don't call function
vect_model_load_cost for it any more.
To make the costing on VMAT_INVARIANT better, this patch
queries hoist_defs_of_uses for the hoisting decision, and adds
costs for the different "where" values based on it. Currently function
hoist_defs_of_uses would always hoist the defs of all SSA
uses; adding one argument HOIST_P avoids the actual
hoisting during the costing phase.
gcc/ChangeLog:
* tree-vect-stmts.cc (hoist_defs_of_uses): Add one argument HOIST_P.
(vectorizable_load): Adjust the handling on VMAT_INVARIANT to respect
hoisting decision and without calling vect_model_load_cost.
(vect_model_load_cost): Assert it won't get VMAT_INVARIANT any more
and remove VMAT_INVARIANT related handlings.
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_GATHER_SCATTER && gs_info.decl
This patch adds one extra argument cost_vec to function
vect_build_gather_load_calls, so that we can do costing
next to the transform in vect_build_gather_load_calls.
For now, the implementation just follows the handling in
vect_model_load_cost; it isn't ideal, so a FIXME is placed
there for further improvement. This patch should not
cause any functional changes.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_build_gather_load_calls): Add the handlings
on costing with one extra argument cost_vec.
(vectorizable_load): Adjust the call to vect_build_gather_load_calls.
(vect_model_load_cost): Assert it won't get VMAT_GATHER_SCATTER with
gs_info.decl set any more.
Kewen Lin [Thu, 13 Jul 2023 02:23:21 +0000 (21:23 -0500)]
vect: Move vect_model_load_cost next to the transform in vectorizable_load
This is an initial patch to move costing next to the
transform. It still adopts vect_model_load_cost for costing,
but moves and duplicates it down according to the handling
of the different vect_memory_access_types, hopefully making the
subsequent patches easier to review. This patch should not
have any functional changes.
gcc/ChangeLog:
* tree-vect-stmts.cc (vectorizable_load): Move and duplicate the call
to vect_model_load_cost down to some different transform paths
according to the handlings of different vect_memory_access_types.
Kewen Lin [Thu, 13 Jul 2023 02:22:26 +0000 (21:22 -0500)]
tree: Hide wi::from_mpz from GENERATOR_FILE
Similar to r0-85707-g34917a102a4e0c for PR35051, the uses
of mpz_t should be guarded with "#ifndef GENERATOR_FILE".
This patch fixes that and avoids some possible build errors.
gcc/ChangeLog:
* tree.h (wi::from_mpz): Hide from GENERATOR_FILE.
mklog: Add --append option to automatically add the generated ChangeLog to the patch file
This tiny patch adds an --append option to mklog.py that appends the
generated ChangeLog to the corresponding patch file. With this option there
is no need to manually copy the generated ChangeLog into the patch file, e.g.:
running `mklog.py --append /path/to/this/patch` will add the generated
ChangeLog at the right place in the /path/to/this/patch file.
RISC-V: Support gather_load/scatter_store RVV auto-vectorization
This patch fully supports gather_load/scatter_store:
1. Support single-rgroup on both RV32/RV64.
2. Support indexed element widths equal to or smaller than Pmode.
3. Support VLA SLP with gather/scatter.
4. Fully tested all gather/scatter with LMUL = M1/M2/M4/M8, both VLA and VLS.
5. Fix a bug in handling (subreg:SI (const_poly_int:DI)).
6. Fix a bug in vec_perm, which is used by gather/scatter SLP.
All kinds of GATHER/SCATTER are normalized into LEN_MASK_*.
We fully support these 4 kinds of gather/scatter:
1. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with dummy length and dummy mask (Full vector).
2. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with dummy length and real mask.
3. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with real length and dummy mask.
4. LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE with real length and real mask.
Based on the discussions with Richards, we don't lower vlse/vsse in the RISC-V backend for strided load/store.
Instead, we leave it to the middle-end to handle that.
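A typical loop shape that now auto-vectorizes as a gather load (illustrative, not one of the added tests):

void
gather (float *restrict dst, float *restrict src,
        int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[idx[i]];   /* indexed load -> LEN_MASK_GATHER_LOAD */
}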
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/rvv.exp: Add gather/scatter tests.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-1.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-12.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-2.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-3.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-4.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-5.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-6.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-7.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-8.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-9.c: New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-12.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-11.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-10.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-3.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-4.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-5.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-6.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-7.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-8.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-9.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load_run-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-2.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-1.c:
New test.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store_run-2.c:
New test.
Jonathan Wakely [Wed, 12 Jul 2023 13:40:19 +0000 (14:40 +0100)]
libstdc++: Check conversion from filesystem::path to wide strings [PR95048]
The testcase added for this bug only checks conversion from wide strings
on construction, but the fix also covered conversion to wide strings via
path::wstring(). Add checks for that, and for u16string() and u32string().
Jonathan Wakely [Fri, 7 Jul 2023 13:36:06 +0000 (14:36 +0100)]
libstdc++: Compile basic_file_stdio.cc for LFS
Instead of using fopen64, lseek64, and fstat64 we can just include
<bits/largefile-config.h> which defines _FILE_OFFSET_BITS=64 (and
similar target-specific macros). Then we can just use fopen, lseek and
fstat as normal, and they'll be the LFS versions if supported by the
target.
libstdc++-v3/ChangeLog:
* config/io/basic_file_stdio.cc: Define LFS macros.
(__basic_file<char>::open): Use fopen unconditionally.
(get_file_offset): Use lseek unconditionally.
(__basic_file<char>::seekoff): Likewise.
(__basic_file<char>::showmanyc): Use fstat unconditionally.
When configured with --enable-cstdio=stdio_pure we need to consistently
use fseek and not mix seeks on the file descriptor with reads and writes
on the FILE stream.
There are also a number of bugs related to error handling and return
values, because fread and fwrite return 0 on error, not -1, and fseek
returns 0 on success, not the file offset.
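A sketch of the get_file_offset idea (assumed shape, not the exact implementation):

#include <stdio.h>

/* Report the stream position via ftello/ftell instead of misreading
   fseek's return value, which is 0 on success rather than an offset.  */
static long long
get_file_offset (FILE *f)
{
#ifdef _GLIBCXX_USE_FSEEKO_FTELLO
  return ftello (f);   /* off_t-sized result, works with LFS */
#else
  return ftell (f);
#endif
}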
libstdc++-v3/ChangeLog:
PR libstdc++/110574
* acinclude.m4 (GLIBCXX_CHECK_LFS): Check for fseeko and ftello
and define _GLIBCXX_USE_FSEEKO_FTELLO.
* config.h.in: Regenerate.
* configure: Regenerate.
* config/io/basic_file_stdio.cc (xwrite) [_GLIBCXX_USE_STDIO_PURE]:
Check for fwrite error correctly.
(__basic_file<char>::xsgetn) [_GLIBCXX_USE_STDIO_PURE]: Check for
fread error correctly.
(get_file_offset): New function.
(__basic_file<char>::seekoff) [_GLIBCXX_USE_STDIO_PURE]: Use
fseeko if available. Use get_file_offset instead of return value
of fseek.
(__basic_file<char>::showmanyc): Use get_file_offset.
IRA+LRA: Change return type of predicate functions from int to bool
gcc/ChangeLog:
* ira.cc (equiv_init_varies_p): Change return type from int to bool
and adjust function body accordingly.
(equiv_init_movable_p): Ditto.
(memref_used_between_p): Ditto.
* lra-constraints.cc (valid_address_p): Ditto.
Aldy Hernandez [Thu, 29 Jun 2023 09:27:49 +0000 (11:27 +0200)]
[range-op] Enable value/mask propagation in range-op.
Throw the switch in range-ops to make full use of the value/mask
information instead of only the nonzero bits. This will cause most of
the operators implemented in range-ops to use the value/mask
information calculated by CCP's bit_value_binop() function which
range-ops uses. This opens up more optimization opportunities.
In follow-up patches I will change the global range setter
(set_range_info) to be able to save the value/mask pair, and make both
CCP and IPA be able to save the known ones bit info, instead of
throwing it away.
gcc/ChangeLog:
* range-op.cc (irange_to_masked_value): Remove.
(update_known_bitmask): Update irange value/mask pair instead of
only updating nonzero bits.
Jan Hubicka [Wed, 12 Jul 2023 15:22:03 +0000 (17:22 +0200)]
Improve profile update in loop-ch
Improve the profile update in loop-ch to handle the situation where the duplicated
header has a loop-invariant test. In this case we know that all of the count of the
exit edge belongs to the duplicated loop header edge and we can update probabilities
accordingly. Since we already do all the work to track this information from analysis
to duplication, I also added code to turn those conditionals into constants so we do
not need a later jump threading pass to clean up.
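A sketch of the shape in question (illustrative, not from the patch): the header test includes a loop-invariant operand, so after header duplication the in-loop copy of that test can be folded to a constant.

void
f (int *a, int n, int inv)
{
  /* `inv` is loop-invariant; the duplicated header decides it once.  */
  for (int i = 0; i < n && inv; i++)
    a[i] = 0;
}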
Working on this, I found that the propagation was buggy in a few aspects:
1) it handled every PHI as a PHI in the header and incorrectly classified
some PHIs as IV-like when they are not;
2) it did not check for novops calls, which are not required to return the
same value on every invocation;
3) I also added a check for asm statements, since those are not necessarily
reproducible either.
I would like to make more changes, but tried to prevent this patch from
snowballing. The analysis of which statements will remain after duplication can
be improved. I think we should use the ranger query for blocks other than the
first basic block too, and possibly drop the IV heuristics then. Also it seems
that a lot of this logic is pretty much the same as the analysis in the peeling
pass, so unifying the two would be nice.
I also think I should move the profile update out of
gimple_duplicate_sese_region (it is now very specific to ch) and rename it,
since those regions are single entry multiple exit.
Bootstrapped/regtested on x86_64-linux, OK?
Honza
gcc/ChangeLog:
* tree-cfg.cc (gimple_duplicate_sese_region): Add ORIG_ELIMINATED_EDGES
parameter and rewrite the profile updating code to handle edge elimination.
* tree-cfg.h (gimple_duplicate_sese_region): Update prototype.
* tree-ssa-loop-ch.cc (loop_invariant_op_p): New function.
(loop_iv_derived_p): New function.
(should_duplicate_loop_header_p): Track invariant exit edges; fix handling
of PHIs and propagation of IV derived variables.
(ch_base::copy_headers): Pass around the invariant edges hash set.
Recently, two identical XTheadCondMov tests have been added, which both fail.
Let's fix that by changing the following:
* Merge both files into one (no need for separate tests for rv32 and rv64)
* Drop unrelated attribute check test (we already test for `th.mveqz`
and `th.mvnez` instructions, so there is little additional value)
* Fix the pattern to allow matching
ifcvt: Change return type of predicate functions from int to bool
Also change some internal variables and function arguments from int to bool.
gcc/ChangeLog:
* ifcvt.cc (cond_exec_changed_p): Change variable to bool.
(last_active_insn): Change "skip_use_p" function argument to bool.
(noce_operand_ok): Change return type from int to bool.
(find_cond_trap): Ditto.
(block_jumps_and_fallthru_p): Change "fallthru_p" and
"jump_p" variables to bool.
(noce_find_if_block): Change return type from int to bool.
(cond_exec_find_if_block): Ditto.
(find_if_case_1): Ditto.
(find_if_case_2): Ditto.
(dead_or_predicable): Ditto. Change "reversep" function arg to bool.
(block_jumps_and_fallthru): Rename from block_jumps_and_fallthru_p.
(cond_exec_process_insns): Change return type from int to bool.
Change "mod_ok" function arg to bool.
(cond_exec_process_if_block): Change return type from int to bool.
Change "do_multiple_p" function arg to bool. Change "then_mod_ok"
variable to bool.
(noce_emit_store_flag): Change return type from int to bool.
Change "reversep" function arg to bool. Change "cond_complex"
variable to bool.
(noce_try_move): Change return type from int to bool.
(noce_try_ifelse_collapse): Ditto.
(noce_try_store_flag): Ditto. Change "reversep" variable to bool.
(noce_try_addcc): Change return type from int to bool. Change
"subtract" variable to bool.
(noce_try_store_flag_constants): Change return type from int to bool.
(noce_try_store_flag_mask): Ditto. Change "reversep" variable to bool.
(noce_try_cmove): Change return type from int to bool.
(noce_try_cmove_arith): Ditto. Change "is_mem" variable to bool.
(noce_try_minmax): Change return type from int to bool. Change
"unsignedp" variable to bool.
(noce_try_abs): Change return type from int to bool. Change
"negate" variable to bool.
(noce_try_sign_mask): Change return type from int to bool.
(noce_try_move): Ditto.
(noce_try_store_flag_constants): Ditto.
(noce_try_cmove): Ditto.
(noce_try_cmove_arith): Ditto.
(noce_try_minmax): Ditto. Change "unsignedp" variable to bool.
(noce_try_bitop): Change return type from int to bool.
(noce_operand_ok): Ditto.
(noce_convert_multiple_sets): Ditto.
(noce_convert_multiple_sets_1): Ditto.
(noce_process_if_block): Ditto.
(check_cond_move_block): Ditto.
(cond_move_process_if_block): Ditto. Change "success_p"
variable to bool.
(rest_of_handle_if_conversion): Change return type to void.
VECT: Apply COND_LEN_* into vectorizable_operation
Hi, Richard and Richi.
As we discussed before, COND_LEN_* patterns were added for multiple situations.
This patch applies COND_LEN_* to the following situation in
"vectorizable_operation":
/* If operating on inactive elements could generate spurious traps,
we need to restrict the operation to active lanes. Note that this
specifically doesn't apply to unhoisted invariants, since they
operate on the same value for every lane.
Similarly, if this operation is part of a reduction, a fully-masked
loop should only change the active lanes of the reduction chain,
keeping the inactive lanes as-is. */
bool mask_out_inactive = ((!is_invariant && gimple_could_trap_p (stmt))
|| reduc_idx >= 0);
When mask_out_inactive is true with length loop control,
we can handle the following 2 cases:
1. Integer division:
#define TEST_TYPE(TYPE) \
__attribute__((noipa)) \
void vrem_##TYPE (TYPE *dst, TYPE *a, TYPE *b, int n) \
{ \
for (int i = 0; i < n; i++) \
dst[i] = a[i] % b[i]; \
}
#define TEST_ALL() \
TEST_TYPE(int8_t) \
TEST_ALL()
* libgomp.texi (OpenMP 5.0): Replace '... stub' by @ref to
'Memory allocation' section which contains the full status.
(TR11): Remove differently worded duplicated entry.
Roger Sayle [Wed, 12 Jul 2023 13:14:15 +0000 (14:14 +0100)]
i386: Fix FAIL of gcc.target/i386/pr91681-1.c
The recent change in TImode parameter passing on x86_64 results in the
FAIL of pr91681-1.c. The issue is that with the extra flexibility,
the combine pass is now spoilt for choice between using either the
*add<dwi>3_doubleword_concat or the *add<dwi>3_doubleword_zext
patterns, when one operand is a *concat and the other is a zero_extend.
The solution proposed below is to provide an *add<dwi>3_doubleword_concat_zext
define_insn_and_split, that can benefit both from the register allocation
of *concat, and still avoid the xor normally required by zero extension.
I'm investigating a follow-up refinement to improve register allocation
further by avoiding the early clobber in the =&r, and handling (custom)
reloads explicitly, but this piece resolves the testcase failure.
2023-07-12 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/91681
* config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): New
define_insn_and_split derived from *add<dwi>3_doubleword_concat
and *add<dwi>3_doubleword_zext.
This patch fixes the regression PR target/110598 caused by my recent
addition of a peephole2. The intention of that optimization was to
simplify zeroing a register, followed by an IOR, XOR or PLUS operation
on it into a move, or as described in the comment:
;; Peephole2 rega = 0; rega op= regb into rega = regb.
The issue is that I'd failed to consider the (rare and unusual) case,
where regb is rega, where the transformation leads to the incorrect
"rega = rega", when it should be "rega = 0". The minimal fix is to
add a !reg_mentioned_p check to the recent peephole2.
In addition to resolving the regression, I've added a second peephole2
to optimize the problematic case above, which contains a false
dependency and is therefore tricky to optimize elsewhere. This is an
improvement over GCC 13, for example, which generates the redundant:
xorl %edx, %edx
xorq %rdx, %rdx
2023-07-12 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110598
* config/i386/i386.md (peephole2): Check !reg_mentioned_p when
optimizing rega = 0; rega op= regb for op in [XOR,IOR,PLUS].
(peephole2): Simplify rega = 0; rega op= rega cases.
gcc/testsuite/ChangeLog
PR target/110598
* gcc.target/i386/pr110598.c: New test case.
Roger Sayle [Wed, 12 Jul 2023 13:09:54 +0000 (14:09 +0100)]
i386: Tweak ix86_expand_int_compare to use PTEST for vector equality.
I've come up with an alternate/complementary/supplementary fix to the
patch https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622706.html
for generating the PTEST during RTL expansion, rather than relying on
this being caught/optimized later during STV.
You'll notice in this patch, the tests for TARGET_SSE4_1 and TImode
appear last. When I was writing this, I initially also added support
for AVX VPTEST and OImode, before realizing that x86 doesn't (yet)
support 256-bit OImode (which also explains why we don't have an OImode
to V1OImode scalar-to-vector pass). Retaining this clause ordering
should minimize the lines changed if things change in future.
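The kind of comparison affected (a sketch, not the PR's testcase): a 128-bit value living in a vector register tested against zero, which can now expand directly to PTEST.

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Compare the 128-bit vector value against zero via its TImode view.  */
int
is_zero (v2di v)
{
  unsigned __int128 x;
  __builtin_memcpy (&x, &v, sizeof (x));
  return x == 0;
}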
2023-07-12 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_int_compare): If
testing a TImode SUBREG of a 128-bit vector register against
zero, use a PTEST instruction instead of first moving it to
a pair of scalar registers.
Robin Dapp [Mon, 10 Jul 2023 20:00:08 +0000 (22:00 +0200)]
genopinit: Allow more than 256 modes.
Upcoming changes for RISC-V will have us exceed 255 modes or 8 bits.
This patch increases the limit to 10 bits and adjusts the hashing
function for the gen* and optabs-query lookups accordingly.
Consequently, the number of optabs is limited to 4095.
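A sketch of the resulting lookup-key layout, inferred from the ChangeLog below (helper name hypothetical):

/* Pack an optab/mode-pair key: the optab index sits above two 10-bit
   mode fields, allowing up to 1024 machine modes and 4095 optabs.  */
static inline unsigned int
optab_key (unsigned int op, unsigned int mode0, unsigned int mode1)
{
  return (op << 20) | (mode0 << 10) | mode1;
}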
gcc/ChangeLog:
* genopinit.cc (main): Adjust maximal number of optabs and
machine modes.
* gensupport.cc (find_optab): Shift optab by 20 and mode by
10 bits.
* optabs-query.h (optab_handler): Ditto.
(convert_optab_handler): Ditto.
libgomp: Use libnuma for OpenMP's partition=nearest allocation trait
As with the memkind library, it is only used when found at runtime;
it does not need to be present when building GCC.
The included testcase does not check whether the memory has been placed
on the nearest node as the Linux kernel memory handling too often ignores
that hint, using a different node for the allocation. However, when
running with 'numactl --preferred=<node> ./executable', it is clearly
visible that the feature works by comparing malloc/default vs. nearest
placement (using get_mempolicy to obtain the node for a mem addr).
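A hedged usage sketch with the standard OpenMP allocator API (the omp.h names are standard; the libnuma backing is what this patch adds):

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  /* Ask for allocation on the NUMA node nearest to the allocating
     thread; with this patch libgomp backs this with libnuma's
     numa_alloc_local when the library is available at runtime.  */
  omp_alloctrait_t traits[] = { { omp_atk_partition, omp_atv_nearest } };
  omp_allocator_handle_t al
    = omp_init_allocator (omp_default_mem_space, 1, traits);
  double *p = (double *) omp_alloc (1024 * sizeof (double), al);
  printf ("allocated at %p\n", (void *) p);
  omp_free (p, al);
  omp_destroy_allocator (al);
  return 0;
}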
libgomp/ChangeLog:
* allocator.c: Add ifdef for LIBGOMP_USE_LIBNUMA.
(enum gomp_numa_memkind_kind): Renamed from gomp_memkind_kind;
add GOMP_MEMKIND_LIBNUMA.
(struct gomp_libnuma_data, gomp_init_libnuma, gomp_get_libnuma): New.
(omp_init_allocator): Handle partition=nearest with libnuma if avail.
(omp_aligned_alloc, omp_free, omp_aligned_calloc, omp_realloc): Add
numa_alloc_local (+ memset), numa_free, and numa_realloc calls as
needed.
* config/linux/allocator.c (LIBGOMP_USE_LIBNUMA): Define.
* libgomp.texi: Fix a typo; use 'fi' instead of its ligature char.
(Memory allocation): Renamed from 'Memory allocation with libmemkind';
updated for libnuma usage.
* testsuite/libgomp.c-c++-common/alloc-11.c: New test.
* testsuite/libgomp.c-c++-common/alloc-12.c: New test.
gfortran: Allow ref'ing PDT's len() in parameter-initializer.
Fix the case where declaring a parameter initialized using a pdt_len
reference did not simplify the reference to a constant.
2023-07-12 Andre Vehreschild <vehre@gcc.gnu.org>
gcc/fortran/ChangeLog:
PR fortran/102003
* expr.cc (find_inquiry_ref): Replace len of pdt_string by
constant.
(simplify_ref_chain): Ensure input to find_inquiry_ref is
NULL.
(gfc_match_init_expr): Prevent PDT analysis for function calls.
(gfc_pdt_find_component_copy_initializer): Get the initializer
value for given component.
* gfortran.h (gfc_pdt_find_component_copy_initializer): New
function.
* simplify.cc (gfc_simplify_len): Replace len() of PDT with pdt
component ref or constant.
Richard Biener [Wed, 12 Jul 2023 09:19:58 +0000 (11:19 +0200)]
tree-optimization/110630 - enhance SLP permute support
The following enhances the existing lowpart extraction support for
SLP VEC_PERM nodes to cover all vector aligned extractions. This
allows the existing bb-slp-pr95839.c testcase to be vectorized
with mips -mpaired-single and the new bb-slp-pr95839-3.c testcase
with SSE2.
PR tree-optimization/110630
* tree-vect-slp.cc (vect_add_slp_permutation): New
offset parameter, honor that for the extract code generation.
(vectorizable_slp_permutation_1): Handle offsetted identities.
* gcc.dg/vect/bb-slp-pr95839.c: Make stricter.
* gcc.dg/vect/bb-slp-pr95839-3.c: New variant testcase.
RISC-V: Support integer mult highpart auto-vectorization
This patch adds an obvious missing mult_high auto-vectorization pattern.
Consider the following case:
void __attribute__ ((noipa)) \
mod_##TYPE (TYPE *__restrict dst, TYPE *__restrict src, int count) \
{ \
for (int i = 0; i < count; ++i) \
dst[i] = src[i] / 17; \
}
T (int32_t) \
TEST_ALL (DEF_LOOP)
Before this patch:
mod_int32_t:
ble a2,zero,.L5
li a5,17
vsetvli a3,zero,e32,m1,ta,ma
vmv.v.x v2,a5
.L3:
vsetvli a5,a2,e8,mf4,ta,ma
vle32.v v1,0(a1)
vsetvli a3,zero,e32,m1,ta,ma
slli a4,a5,2
vdiv.vv v1,v1,v2
sub a2,a2,a5
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
add a1,a1,a4
add a0,a0,a4
bne a2,zero,.L3
.L5:
ret
After this patch:
mod_int32_t:
ble a2,zero,.L5
li a5,2021163008
addiw a5,a5,-1927
vsetvli a3,zero,e32,m1,ta,ma
vmv.v.x v3,a5
.L3:
vsetvli a5,a2,e8,mf4,ta,ma
vle32.v v2,0(a1)
vsetvli a3,zero,e32,m1,ta,ma
slli a4,a5,2
vmulh.vv v1,v2,v3
sub a2,a2,a5
vsra.vi v2,v2,31
vsra.vi v1,v1,3
vsub.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a0)
add a1,a1,a4
add a0,a0,a4
bne a2,zero,.L3
.L5:
ret
Even though a single "vdiv" is lowered into "1 vmulh + 2 vsra + 1 vsub" and
4 more instructions are generated, we believe it's much better than before,
since division is very slow in the hardware.
gcc/ChangeLog:
* config/riscv/autovec.md (smul<mode>3_highpart): New pattern.
(umul<mode>3_highpart): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/binop/mulh-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/mulh-2.c: New test.
* gcc.target/riscv/rvv/autovec/binop/mulh_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/binop/mulh_run-2.c: New test.
Jan Beulich [Wed, 12 Jul 2023 10:16:08 +0000 (12:16 +0200)]
x86: improve fast bfloat->float conversion
There's nothing AVX512BW-ish in here, so no reason to use Yw as the
constraints for the AVX alternative. Furthermore, by using the 512-bit
form of VPSLLD (in a new alternative) all 32 registers can be used
directly by the insn without AVX512VL needing to be enabled.
Also adjust the originally last alternative's "prefix" attribute to
maybe_evex.
gcc/
* config/i386/i386.md (extendbfsf2_1): Add new AVX512F
alternative. Adjust original last alternative's "prefix"
attribute to maybe_evex.
Jan Beulich [Wed, 12 Jul 2023 10:14:08 +0000 (12:14 +0200)]
x86: make better use of VBROADCASTSS / VPBROADCASTD
... in vec_dupv4sf / *vec_dupv4si. The respective broadcast insns are
never longer (yet sometimes shorter) than the corresponding VSHUFPS /
VPSHUFD, due to the immediate operand of the shuffle insns balancing the
(uniform) need for VEX3 in the broadcast ones. When EVEX encoding is
in effect, the broadcast insns are always shorter.
Add new alternatives to cover the AVX2 and AVX512 cases as appropriate.
While touching this anyway, switch to consistently using "sseshuf1" in
the "type" attributes for all shuffle forms.
gcc/
* config/i386/sse.md (vec_dupv4sf): Make first alternative use
vbroadcastss for AVX2. New AVX512F alternative.
(*vec_dupv4si): New AVX2 and AVX512F alternatives using
vpbroadcastd. Replace sselog1 by sseshuf1 in "type" attribute.
riscv: thead: Factor out XThead*-specific peepholes
This patch moves the XThead*-specific peephole passes
into thead-peephole.md with the intent of keeping vendor-specific
code separated from standard RISC-V code.
This patch does not contain any functional changes.
gcc/ChangeLog:
* config/riscv/peephole.md: Remove XThead* peephole passes.
* config/riscv/thead.md: Include thead-peephole.md.
* config/riscv/thead-peephole.md: New file.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
RISC-V currently does not support index registers.
However, there are some vendor extensions that specify them.
Let's do the necessary changes in the backend so that we can
add support for such a vendor extension in the future.
This is a non-functional change without any intended side-effects.
gcc/ChangeLog:
* config/riscv/riscv-protos.h (riscv_regno_ok_for_index_p):
New prototype.
(riscv_index_reg_class): Likewise.
* config/riscv/riscv.cc (riscv_regno_ok_for_index_p): New function.
(riscv_index_reg_class): New function.
* config/riscv/riscv.h (INDEX_REG_CLASS): Call new function
riscv_index_reg_class().
(REGNO_OK_FOR_INDEX_P): Call new function
riscv_regno_ok_for_index_p().
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
riscv: Move address classification info types to riscv-protos.h
enum riscv_address_type and struct riscv_address_info are used
to store address classification information. Let's move these types
into our common header file in order to share them with other
compilation units.
This is a non-functional change without any intended side-effects.
gcc/ChangeLog:
* config/riscv/riscv-protos.h (enum riscv_address_type):
New location of type definition.
(struct riscv_address_info): Likewise.
* config/riscv/riscv.cc (enum riscv_address_type):
Old location of type definition.
(struct riscv_address_info): Likewise.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Define an Xmode macro that specifies the register size (XLEN),
similar to Pmode. This allows backend code to be written generically
for RV32/RV64 C code (under certain circumstances).
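An illustrative backend use (a sketch with a hypothetical helper, mirroring how Pmode is used):

/* An XLEN-wide scratch created in Xmode works unchanged on RV32 and
   RV64, just as Pmode does for pointer-sized values.  */
static rtx
riscv_force_xlen_const (HOST_WIDE_INT val)
{
  rtx tmp = gen_reg_rtx (Xmode);
  emit_move_insn (tmp, gen_int_mode (val, Xmode));
  return tmp;
}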
gcc/ChangeLog:
* config/riscv/riscv.h (Xmode): New macro.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
We have the following situation for MEM RTX objects:
* TARGET_PRINT_OPERAND expands to riscv_print_operand()
* This falls into the default case (unknown or on letter) of the outer
switch-case-block and the MEM case of the inner switch-case-block and
calls output_address() in final.cc with XEXP (op, 0) (the address)
* This calls targetm.asm_out.print_operand_address() which is
riscv_print_operand_address()
* riscv_print_operand_address() is targeting the address of a MEM RTX
* riscv_print_operand_address() calls riscv_print_operand() for the offset
and directly prints the register if the address is classified as ADDRESS_REG
* This falls into the default case (unknown or no letter) of the outer
switch-case-block and the default case of the inner switch-case-block and
calls output_addr_const().
However, since we know that the offset must be a CONST_INT (which will be
followed by a '(<reg>)' string), there is no need to call
riscv_print_operand() for the offset.
Instead we can take a shortcut and use output_addr_const().
This change also brings the code in riscv_print_operand_address()
in line with the other cases, where output_addr_const() is used
to print offsets.
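The resulting ADDRESS_REG case plausibly looks like this (a sketch
using the riscv_address_info field names; the real code may differ):
case ADDRESS_REG:
  /* The offset is known to be a CONST_INT here, so print it via
     output_addr_const and append the base register directly.  */
  output_addr_const (file, addr.offset);
  fprintf (file, "(%s)", reg_names[REGNO (addr.reg)]);
  return;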
Tested with the GCC regression test suite and SPEC intrate.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_print_operand_address): Use
output_addr_const rather than riscv_print_operand.
The current implementation triggers an assertion in
dwarf2out_frame_debug_cfa_offset() under certain circumstances.
The standard code uses REG_FRAME_RELATED_EXPR notes instead
of REG_CFA_OFFSET notes when saving registers on the stack.
So let's do this as well.
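A hedged sketch of the note-emission pattern (the insn/mem/reg
variables stand in for the actual prologue code in
th_mempair_save_regs):
/* Describe the store with a REG_FRAME_RELATED_EXPR note; dwarf2cfi
   then derives the CFA offsets from the SET itself.  */
rtx set = gen_rtx_SET (mem, reg);
RTX_FRAME_RELATED_P (set) = 1;
add_reg_note (insn, REG_FRAME_RELATED_EXPR, set);
RTX_FRAME_RELATED_P (insn) = 1;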
gcc/ChangeLog:
* config/riscv/thead.cc (th_mempair_save_regs):
Emit REG_FRAME_RELATED_EXPR notes in prologue.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
riscv: xtheadbb: Add sign/zero extension support for th.ext and th.extu
The current support of the bitfield-extraction instructions
th.ext and th.extu (XTheadBb extension) only covers sign_extract
and zero_extract. This patch adds support for sign_extend and
zero_extend to avoid any shifts for sign or zero extensions.
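For example (assuming RV64 with XTheadBb; the expected instructions in
the comments are a guess from the extension spec), both of these
should now become a single bitfield extract instead of a shift pair:
long
sext16 (long x)
{
  /* Sign-extend the low 16 bits: th.ext a0, a0, 15, 0.  */
  return (short) x;
}
unsigned long
zext32 (unsigned long x)
{
  /* Zero-extend the low 32 bits: th.extu a0, a0, 31, 0.  */
  return (unsigned int) x;
}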
gcc/ChangeLog:
* config/riscv/riscv.md: No base-ISA extension splitter for XThead*.
* config/riscv/thead.md (*extend<SHORT:mode><SUPERQI:mode>2_th_ext):
New XThead extension INSN.
(*zero_extendsidi2_th_extu): New XThead extension INSN.
(*zero_extendhi<GPR:mode>2_th_extu): New XThead extension INSN.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/xtheadbb-ext-1.c: New test.
* gcc.target/riscv/xtheadbb-extu-1.c: New test.
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
liuhongt [Thu, 29 Jun 2023 06:25:28 +0000 (14:25 +0800)]
Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'
A false dependency arises when the destination is only written by
vpternlog. There is no false dependency when the destination is also
used as a source. So either a vpxor should be inserted, or an input
operand should be tied with constraint '0'.
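As an illustration (not from the patch), one common way such a
vpternlog arises:
#include <immintrin.h>
/* The all-ones constant is typically materialized via vpternlogd with
   immediate 0xff, which writes the destination without reading it
   architecturally, yet still carries a false dependence on the
   register's previous value.  */
__m512i
all_ones (void)
{
  return _mm512_set1_epi32 (-1);
}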
gcc/ChangeLog:
PR target/110438
PR target/110202
* config/i386/predicates.md
(int_float_vector_all_ones_operand): New predicate.
* config/i386/sse.md (*vmov<mode>_constm1_pternlog_false_dep): New
define_insn.
(*<avx512>_cvtmask2<ssemodesuffix><mode>_pternlog_false_dep):
Ditto.
(*<avx512>_cvtmask2<ssemodesuffix><mode>_pternlog_false_dep):
Ditto.
(*<avx512>_cvtmask2<ssemodesuffix><mode>): Adjust to
define_insn_and_split to avoid false dependence.
(*<avx512>_cvtmask2<ssemodesuffix><mode>): Ditto.
(<mask_codefor>one_cmpl<mode>2<mask_name>): Adjust constraint
of operands 1 to '0' to avoid false dependence.
(*andnot<mode>3): Ditto.
(iornot<mode>3): Ditto.
(*<nlogic><mode>3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110438.c: New test.
* gcc.target/i386/pr100711-6.c: Adjust testcase.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/compress_run-6.c: New test.
Harald Anlauf [Tue, 11 Jul 2023 19:21:25 +0000 (21:21 +0200)]
Fortran: formal symbol attributes for intrinsic procedures [PR110288]
gcc/fortran/ChangeLog:
PR fortran/110288
* symbol.cc (gfc_copy_formal_args_intr): When deriving the formal
argument attributes from the actual ones for intrinsic procedure
calls, take special care of CHARACTER arguments so that we do not
wrongly treat them as deferred-length.
gcc/testsuite/ChangeLog:
PR fortran/110288
* gfortran.dg/findloc_10.f90: New test.
cfg+gcse: Change return type of predicate functions from int to bool
Also change some internal variables from int to bool.
gcc/ChangeLog:
* cfghooks.cc (verify_flow_info): Change "err" variable to bool.
* cfghooks.h (struct cfg_hooks): Change return type of
verify_flow_info from integer to bool.
* cfgrtl.cc (can_delete_note_p): Change return type from int to bool.
(can_delete_label_p): Ditto.
(rtl_verify_flow_info): Change return type from int to bool
and adjust function body accordingly. Change "err" variable to bool.
(rtl_verify_flow_info_1): Ditto.
(free_bb_for_insn): Change return type to void.
(rtl_merge_blocks): Change "b_empty" variable to bool.
(try_redirect_by_replacing_jump): Change "fallthru" variable to bool.
(verify_hot_cold_block_grouping): Change return type from int to bool.
Change "err" variable to bool.
(rtl_verify_edges): Ditto.
(rtl_verify_bb_insns): Ditto.
(rtl_verify_bb_pointers): Ditto.
(rtl_verify_bb_insn_chain): Ditto.
(rtl_verify_fallthru): Ditto.
(rtl_verify_bb_layout): Ditto.
(purge_all_dead_edges): Change "purged" variable to bool.
* cfgrtl.h (free_bb_for_insn): Change return type from int to void.
* postreload-gcse.cc (expr_hasher::equal): Change "equiv_p" to bool.
(load_killed_in_block_p): Change return type from int to bool
and adjust function body accordingly.
(oprs_unchanged_p): Return true/false.
(rest_of_handle_gcse2): Change return type to void.
* tree-cfg.cc (gimple_verify_flow_info): Change return type from
int to bool. Change "err" variable to bool.
Carl Love [Tue, 11 Jul 2023 16:28:47 +0000 (12:28 -0400)]
rs6000: Update the vsx-vector-6.* tests.
The vsx-vector-6.h file is included in the processor-specific test files
vsx-vector-6.p7.c, vsx-vector-6.p8.c, and vsx-vector-6.p9.c. The .h file
contains a large number of vsx vector built-in tests. The
processor-specific files contain the number of instructions that the
tests are expected to generate for that processor. The tests are compile
only.
This patch reworks the tests into a series of files, each covering a
group of related tests.
The new tests consist of a runnable test to verify the built-in argument
types and the functional correctness of each built-in. There is also a
compile only test that verifies the built-ins generate the expected number
of instructions for the various built-in tests.
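The runnable tests presumably follow the usual pattern of comparing a
built-in's result against expected values, roughly like this
hypothetical sketch (not one of the new files):
#include <altivec.h>
extern void abort (void);
int
main (void)
{
  vector double arg = { 2.0, -4.0 };
  vector double res = vec_abs (arg);
  /* Check the functional correctness of the built-in.  */
  if (res[0] != 2.0 || res[1] != 4.0)
    abort ();
  return 0;
}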
gcc/testsuite/
* gcc.target/powerpc/vsx-vector-6-func-1op.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-1op-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-1op.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2lop.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2lop-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2lop.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2op.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2op-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-2op.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-3op.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-3op-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-3op.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp-all.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp-all-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp-all.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp.h: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp-run.c: New test file.
* gcc.target/powerpc/vsx-vector-6-func-cmp.c: New test file.
* gcc.target/powerpc/vsx-vector-6.h: Remove test file.
* gcc.target/powerpc/vsx-vector-6.p7.c: Remove test file.
* gcc.target/powerpc/vsx-vector-6.p8.c: Remove test file.
* gcc.target/powerpc/vsx-vector-6.p9.c: Remove test file.
testsuite: Require vectors of doubles for pr97428.c
The pr97428.c test assumes support for vectors of doubles, but some
targets only support vectors of floats, causing this test to fail with
such targets. So limit this test to targets that support vectors of
doubles.
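In DejaGnu terms this is presumably the usual effective-target gate:
/* { dg-require-effective-target vect_double } */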
gcc/testsuite/
* gcc.dg/vect/pr97428.c: Limit to `vect_double' targets.
Gaius Mulley [Tue, 11 Jul 2023 14:28:42 +0000 (15:28 +0100)]
[modula2] Improve uninitialized variable analysis by combining basic blocks
This patch combines basic blocks for the static analysis of uninitialized
variables, provided that they are not the top of a loop, are not reached
by a conditional, and are not reached after a procedure call. It also
avoids checking array accesses during static analysis. Finally, the patch
adds switch modifiers to allow the static analysis to include conditional
branches in subsequent basic block analysis.
gcc/ChangeLog:
* doc/gm2.texi (-Wuninit-variable-checking=): New item.
gcc/m2/ChangeLog:
* gm2-compiler/M2BasicBlock.def (InitBasicBlocksFromRange): New
parameter ScopeSym.
* gm2-compiler/M2BasicBlock.mod (ConvertQuads2BasicBlock): New
parameter ScopeSym.
(InitBasicBlocksFromRange): New parameter ScopeSym. Call
ConvertQuads2BasicBlock with ScopeSym.
(DisplayBasicBlocks): Uncomment.
* gm2-compiler/M2Code.mod: Replace VariableAnalysis with
ScopeBlockVariableAnalysis.
(InitialDeclareAndOptiomize): Add parameter scope.
(SecondDeclareAndOptimize): Add parameter scope.
* gm2-compiler/M2GCCDeclare.mod (DeclareConstructor): Add scope
parameter to DeclareTypesConstantsProceduresInRange.
(DeclareTypesConstantsProceduresInRange): New parameter scope.
Pass scope to DisplayQuadRange. Reformatted.
* gm2-compiler/M2GenGCC.def (ConvertQuadsToTree): New parameter
scope.
* gm2-compiler/M2GenGCC.mod (ConvertQuadsToTree): New parameter
scope.
* gm2-compiler/M2Optimize.mod (KnownReachable): New parameter
scope.
* gm2-compiler/M2Options.def (SetUninitVariableChecking): Add
arg parameter.
* gm2-compiler/M2Options.mod (SetUninitVariableChecking): Add
arg parameter and set boolean UninitVariableChecking and
UninitVariableConditionalChecking.
(UninitVariableConditionalChecking): New boolean set to FALSE.
* gm2-compiler/M2Quads.def (IsGoto): New procedure function.
(DisplayQuadRange): Add scope parameter.
(LoopAnalysis): Add scope parameter.
* gm2-compiler/M2Quads.mod: Import PutVarArrayRef.
(IsGoto): New procedure function.
(LoopAnalysis): Add scope parameter and use MetaErrorT1 instead
of WarnStringAt.
(BuildStaticArray): Call PutVarArrayRef.
(BuildDynamicArray): Call PutVarArrayRef.
(DisplayQuadRange): Add scope parameter.
(GetM2OperatorDesc): Add relational condition cases.
* gm2-compiler/M2Scope.def (ScopeProcedure): Add parameter.
* gm2-compiler/M2Scope.mod (DisplayScope): Pass scopeSym to
DisplayQuadRange.
(ForeachScopeBlockDo): Pass scopeSym to p.
* gm2-compiler/M2SymInit.def (VariableAnalysis): Rename to ...
(ScopeBlockVariableAnalysis): ... this.
* gm2-compiler/M2SymInit.mod (ScopeBlockVariableAnalysis): Add
scope parameter.
(bbEntry): New pointer to record.
(bbArray): New array.
(bbFreeList): New variable.
(errorList): New list.
(IssueConditional): New procedure.
(GenerateNoteFlow): New procedure.
(IssueWarning): New procedure.
(IsUniqueWarning): New procedure.
(CheckDeferredRecordAccess): Re-implement.
(CheckBinary): Add warning and lst parameters.
(CheckUnary): Add warning and lst parameters.
(CheckXIndr): Add warning and lst parameters.
(CheckIndrX): Add warning and lst parameters.
(CheckBecomes): Add warning and lst parameters.
(CheckComparison): Add warning and lst parameters.
(CheckReadBeforeInitQuad): Add warning and lst parameters to all
Check procedures. Add all case quadruple clauses.
(FilterCheckReadBeforeInitQuad): Add warning and lst parameters.
(CheckReadBeforeInitFirstBasicBlock): Add warning and lst parameters.
(bbArrayKill): New procedure.
(DumpBBEntry): New procedure.
(DumpBBArray): New procedure.
(DumpBBSequence): New procedure.
(TestBBSequence): New procedure.
(CreateBBPermultations): New procedure.
(ScopeBlockVariableAnalysis): New procedure.
(GetOp3): New procedure.
(GenerateCFG): New procedure.
(NewEntry): New procedure.
(AppendEntry): New procedure.
(init): Initialize bbFreeList and errorList.
* gm2-compiler/SymbolTable.def (PutVarArrayRef): New procedure.
(IsVarArrayRef): New procedure function.
* gm2-compiler/SymbolTable.mod (SymVar): ArrayRef new field.
(MakeVar): Set ArrayRef to FALSE.
(PutVarArrayRef): New procedure.
(IsVarArrayRef): New procedure function.
* gm2-gcc/init.cc (_M2_M2SymInit_init): New prototype.
(init_PerCompilationInit): Add call to _M2_M2SymInit_init.
* gm2-gcc/m2options.h (M2Options_SetUninitVariableChecking):
New definition.
* gm2-lang.cc (gm2_langhook_handle_option): Add new case
OPT_Wuninit_variable_checking_.
* lang.opt: Wuninit-variable-checking= new entry.
gcc/testsuite/ChangeLog:
* gm2/switches/uninit-variable-checking/cascade/fail/cascadedif.mod: New test.
* gm2/switches/uninit-variable-checking/cascade/fail/switches-uninit-variable-checking-cascade-fail.exp:
New test.
* allocator.c (omp_init_allocator): Use malloc for
omp_high_bw_mem_space when the memkind lib is unavailable
instead of returning omp_null_allocator.
* libgomp.texi (OpenMP 5.0): Fix typo.
(Memory allocation with libmemkind): Document implementation
in more detail.
Patrick Palka [Tue, 11 Jul 2023 14:05:19 +0000 (10:05 -0400)]
c++: coercing variable template from current inst [PR110580]
Here during ahead of time coercion of the variable template-id v1<int>,
since we pass only the innermost arguments to coerce_template_parms (and
outer arguments are still dependent at this point), substitution of the
default template argument V=U just lowers U from level 2 to level 1 rather
than replacing it with int as expected. Thus after coercion we incorrectly
end up with (effectively) v1<int, T> instead of v1<int, int>.
Coercion of a class/alias template-id, on the other hand, always passes
all levels of arguments, which avoids this issue. So this patch makes us
do the same for variable template-ids.
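A hypothetical reduction of the problem (not the PR testcase): inside
the current instantiation, coercing v1<int> must substitute V = U with
U = int rather than merely lowering U to the enclosing level:
template<typename T>
struct A
{
  template<typename U, typename V = U>
  static constexpr bool v1 = sizeof (V) > 0;
  // With the bug this effectively instantiated v1<int, T>; for T = void
  // below, sizeof (V) would then be ill-formed.
  static_assert (v1<int>, "");
};
template struct A<void>;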
PR c++/110580
gcc/cp/ChangeLog:
* pt.cc (lookup_template_variable): Pass all levels of arguments
to coerce_template_parms, and use the parameters from the most
general template.
Notice here, we use dummy_mask = { -1, -1, ..., -1 }
2. Integer conditional division:
Similar to case (1), but with a condition:
void
f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t * cond, int n)
{
for (int i = 0; i < n; ++i)
{
if (cond[i])
a[i] = b[i] / c[i];
}
}
Here we expect COND_LEN_DIV predicated by a real mask, which is the outcome of the comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
and a real length, which is produced by the loop control: loop_len_55 = SELECT_VL
3. Conditional floating-point operations (without -ffast-math):
void
f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
{
for (int i = 0; i < n; ++i)
{
if (cond[i])
a[i] = b[i] + a[i];
}
}
ARM SVE IR:
max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
4. Conditional reduction:
int32_t
f (int32_t *restrict a,
int32_t *restrict cond, int n)
{
int32_t result = 0;
for (int i = 0; i < n; ++i)
{
if (cond[i])
result += a[i];
}
return result;
}
I name these patterns "cond_len_*" since I want the length operand to come after the mask operand, with all other operands (except the length) in the
same order as in the "cond_*" patterns. Such an order will make life easier in the following loop vectorizer support.
Richard Biener [Tue, 11 Jul 2023 08:40:19 +0000 (10:40 +0200)]
tree-optimization/110614 - SLP splat and re-align (optimized)
The following properly guards the re-align (optimized) paths used
on old power CPUs for the added case of SLP splats from non-grouped
loads. Testcases already exist in dg-torture.
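The shape in question is presumably along these lines (a guessed
reduction, not one of the existing testcases):
void
f (double *__restrict out, const double *__restrict in, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* Both lanes of the grouped store read the same non-grouped
         load, i.e. an SLP splat.  */
      out[2 * i] = in[i];
      out[2 * i + 1] = in[i];
    }
}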
PR tree-optimization/110614
* tree-vect-data-refs.cc (vect_supportable_dr_alignment):
SLP splats are not suitable for re-align ops.
Bob Duff [Mon, 3 Jul 2023 16:01:01 +0000 (12:01 -0400)]
ada: Avoid renaming_decl in case of constrained array
This patch avoids rewriting "X: S := F(...);" as "X: S renames F(...);".
That rewrite is incorrect if S is a constrained array subtype,
because it changes the semantics. In the original, the
bounds of X are that of S. But constraints are ignored in
renamings, so the bounds of X would come from F'Result.
This can cause spurious Constraint_Errors in some obscure
cases. It causes unnecessary checks to be inserted, and even
when such checks pass (the more common case), they might be less
efficient.
gcc/ada/
* exp_ch3.adb (Expand_N_Object_Declaration): Avoid transforming to
a renaming in case of constrained array that comes from source.
Eric Botcazou [Sun, 2 Jul 2023 22:33:18 +0000 (00:33 +0200)]
ada: Fix wrong resolution for hidden discriminant in predicate
The problem occurs for hidden discriminants of private discriminated types.
gcc/ada/
* sem_ch13.adb (Replace_Type_References_Generic.Visible_Component):
In the case of private discriminated types, return a discriminant
only if it is listed in the discriminant part of the declaration.
Peter Bergner [Mon, 10 Jul 2023 22:51:23 +0000 (17:51 -0500)]
rs6000: Remove redundant MEM_P predicate usage
The quad_memory_operand and vsx_quad_dform_memory_operand predicates contain
a (match_code "mem") test, making their MEM_P usage redundant. Remove them.
- Import dmd v2.104.1.
- Deprecation phase ended for access to private method when
overloaded with public method.
D runtime changes:
- Import druntime v2.104.1.
- Linux input header translations were added to druntime.
- Integration with the Valgrind `memcheck' tool has been added
to the garbage collector.
Phobos changes:
- Import phobos v2.104.1.
gcc/d/ChangeLog:
* dmd/MERGE: Merge upstream dmd a88e1335f7.
* dmd/VERSION: Bump version to v2.104.1.