Patrick Palka [Wed, 19 Apr 2023 17:07:46 +0000 (13:07 -0400)]
c++: bad ggc_free in try_class_unification [PR109556]
Aside from correcting how try_class_unification copies multi-dimensional
'targs', r13-377-g3e948d645bc908 also made it ggc_free this copy as an
optimization. But this is wrong since the call to unify within might've
captured the args in persistent memory such as the satisfaction cache
(as part of constrained auto deduction).
PR c++/109556
gcc/cp/ChangeLog:
* pt.cc (try_class_unification): Don't ggc_free the copy of
'targs'.
gcc/testsuite/ChangeLog:
* g++.dg/cpp2a/concepts-placeholder13.C: New test.
Adjust scan-tree-dump patterns so that they do not accidentally match a
valid path.
gcc/testsuite/ChangeLog:
PR testsuite/83904
PR fortran/100297
* gfortran.dg/allocatable_function_1.f90: Use "__builtin_free "
instead of the naive "free".
* gfortran.dg/reshape_8.f90: Extend pattern from a simple "data".
Andrew Pinski [Thu, 13 Apr 2023 00:40:40 +0000 (00:40 +0000)]
i386: Add new pattern for zero-extend cmov
After a phiopt change, I got a failure of cmov9.c.
The RTL IR has zero_extend on the outside of
the if_then_else rather than on the inside. Both
ways are considered canonical as mentioned in
PR 66588.
This fixes the failure I got and also adds a testcase
which fails before even my phiopt patch but will pass
with this patch.
OK? Bootstrapped and tested on x86_64-linux-gnu with
no regressions.
gcc/ChangeLog:
* config/i386/i386.md (*movsicc_noc_zext_1): New pattern.
gcc/testsuite/ChangeLog:
* gcc.target/i386/cmov10.c: New test.
* gcc.target/i386/cmov11.c: New test.
My earlier patch for 108099 made us accept this non-standard pattern but
messed up the semantics, so that e.g. unsigned __int128_t was not a 128-bit
type.
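For illustration (a sketch of the pattern, not the PR's own testcase), with the
fix the following non-standard declaration is still accepted and the type stays
128 bits wide on targets that provide __int128:
unsigned __int128_t u;   /* non-standard: 'unsigned' applied to the __int128_t typedef */
static_assert (sizeof (u) * __CHAR_BIT__ == 128,
               "unsigned __int128_t must remain a 128-bit type");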
PR c++/108099
gcc/cp/ChangeLog:
* decl.cc (grokdeclarator): Keep typedef_decl for __int128_t.
RISC-V provides different VLEN configurations via different ISA
extensions like `zve32x`, `zve64x` and `v`:
zve32x just guarantees the minimal VLEN is 32 bits,
zve64x guarantees the minimal VLEN is 64 bits,
and v guarantees the minimal VLEN is 128 bits.
Current status (without this patch):
Zve32x: The mode for one vector register is VNx1SImode, and VNx1DImode
is an invalid mode.
- one vector register could hold 1 + 1x SImode where x is 0~n, so it
might hold just one SI
Zve64x: The mode for one vector register is VNx1DImode or VNx2SImode.
- one vector register could hold 1 + 1x DImode where x is 0~n, so it
might hold just one DI.
- one vector register could hold 2 + 2x SImode where x is 0~n, so it
might hold just two SI.
However, the `v` extension guarantees the minimal VLEN is 128 bits.
We introduce another type/mode mapping for this configuration:
v: The mode for one vector register is VNx2DImode or VNx4SImode.
- one vector register could hold 2 + 2x DImode where x is 0~n, so it
will hold at least two DI
- one vector register could hold 4 + 4x SImode where x is 0~n, so it
will hold at least four SI
This patch models the modes more precisely for RVV, and helps some
middle-end optimizations that assume the number of elements must be a
multiple of two.
Pan Li [Wed, 19 Apr 2023 09:18:20 +0000 (17:18 +0800)]
RISC-V: Align IOR optimization MODE_CLASS condition to AND.
This patch aligns the MODE_CLASS condition of the IOR with that of the AND. Then
more mode classes besides SCALAR_INT are able to perform the optimization
A | (~A) -> -1, similar to the AND operator. For example, see the sample code below.
Before this patch:
vsetvli a5,zero,e8,mf4,ta,ma
vlm.v v24,0(a1)
vsetvli zero,a2,e8,mf4,ta,ma
vmorn.mm v24,v24,v24
vsetvli a5,zero,e8,mf4,ta,ma
vsm.v v24,0(a0)
ret
After this patch:
vsetvli zero,a2,e8,mf4,ta,ma
vmset.m v24
vsetvli a5,zero,e8,mf4,ta,ma
vsm.v v24,0(a0)
ret
Or in RTL's perspective,
from:
(ior:VNx2BI (reg/v:VNx2BI 137 [ v1 ]) (not:VNx2BI (reg/v:VNx2BI 137 [ v1 ])))
to:
(const_vector:VNx2BI repeat [ (const_int 1 [0x1]) ])
A similar optimization for VMANDN has been enabled already. There should
be no difference except the operator when comparing VMORN and VMANDN
for this kind of optimization. The patch aligns the IOR MODE_CLASS condition
of the simplification with that of the AND operator.
gcc/ChangeLog:
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Align IOR (A | (~A) -> -1) optimization MODE_CLASS condition to AND.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/mask_insn_shortcut.c: Update check
condition.
* gcc.target/riscv/simplify_ior_optimization.c: New test.
The compare could use the high register %ah instead of %dil:
movl %edi, %eax
cmpb ts(%rsi), %ah
setl %al
ret
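A hypothetical source shape that can produce such code (an assumption for
illustration, not the PR testcase) compares bits 8..15 of an incoming int
against a char loaded from memory:
extern signed char ts[];

int f (int x, long i)
{
  return (signed char) (x >> 8) < ts[i];
}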
Use the any_extract code iterator to handle signed and unsigned extracts
from a high register, and introduce peephole2 patterns to propagate
the norex memory operand into the compare insn.
gcc/ChangeLog:
PR target/78904
PR target/78952
* config/i386/i386.md (*cmpqi_ext<mode>_1_mem_rex64): New insn pattern.
(*cmpqi_ext<mode>_1): Use nonimmediate_operand predicate
for operand 0. Use any_extract code iterator.
(*cmpqi_ext<mode>_1 peephole2): New peephole2 pattern.
(*cmpqi_ext<mode>_2): Use any_extract code iterator.
(*cmpqi_ext<mode>_3_mem_rex64): New insn pattern.
(*cmpqi_ext<mode>_1): Use general_operand predicate
for operand 1. Use any_extract code iterator.
(*cmpqi_ext<mode>_3 peephole2): New peephole2 pattern.
(*cmpqi_ext<mode>_4): Use any_extract code iterator.
gcc/testsuite/ChangeLog:
PR target/78904
PR target/78952
* gcc.target/i386/pr78952-3.c: New test.
aarch64: Factorise widening add/sub high-half expanders with iterators
I noticed these define_expand patterns are almost identical modulo some string substitutions.
This patch compresses them together with a couple of code iterators.
No functional change intended.
Bootstrapped and tested on aarch64-none-linux-gnu.
Richard Biener [Tue, 14 Mar 2023 13:39:17 +0000 (14:39 +0100)]
Use solve_add_graph_edge in more places
The following makes sure to use solve_add_graph_edge and honoring
special-cases, especially edges from escaped, in the remaining places
the solver adds edges.
* tree-ssa-structalias.cc (do_ds_constraint): Use
solve_add_graph_edge.
Richard Biener [Wed, 22 Mar 2023 13:13:02 +0000 (14:13 +0100)]
Remove odd code from gimple_can_merge_blocks_p
The following removes a special case to not merge a block with
only a non-local label. We have a restriction of non-local labels
to be the first statement (and label) in a block, but otherwise nothing:
if the last stmt of A is a non-local label then it will still be
the first statement of the combined A + B. In particular we'd
happily merge when there's a stmt after that label.
The check originates from the tree-ssa merge.
Bootstrapped and tested on x86_64-unknown-linux-gnu with all
languages.
* tree-cfg.cc (gimple_can_merge_blocks_p): Remove condition
rejecting the merge when A contains only a non-local label.
Introduce VIRTUAL_REGISTER_P and VIRTUAL_REGISTER_NUM_P predicates
These two predicates are similar to existing HARD_REGISTER_P and
HARD_REGISTER_NUM_P predicates and return 1 if the given register
corresponds to a virtual register.
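A sketch of how such predicates are typically defined, modeled on
HARD_REGISTER_P/HARD_REGISTER_NUM_P (the authoritative definitions are the
ones added to rtl.h):
#define VIRTUAL_REGISTER_P(X) (VIRTUAL_REGISTER_NUM_P (REGNO (X)))
#define VIRTUAL_REGISTER_NUM_P(N) \
  IN_RANGE ((N), FIRST_VIRTUAL_REGISTER, LAST_VIRTUAL_REGISTER)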
gcc/ChangeLog:
* rtl.h (VIRTUAL_REGISTER_P): New predicate.
(VIRTUAL_REGISTER_NUM_P): Ditto.
(REGNO_PTR_FRAME_P): Use VIRTUAL_REGISTER_NUM_P predicate.
* expr.cc (force_operand): Use VIRTUAL_REGISTER_P predicate.
* function.cc (instantiate_decl_rtl): Ditto.
* rtlanal.cc (rtx_addr_can_trap_p_1): Ditto.
(nonzero_address_p): Ditto.
(refers_to_regno_p): Use VIRTUAL_REGISTER_NUM_P predicate.
Richard Biener [Wed, 19 Apr 2023 07:45:55 +0000 (09:45 +0200)]
Transform more gmp/mpfr uses to use RAII
The following picks up the coccinelle generated patch from Bernhard,
leaving out the fortran frontend parts and fixing up the rest.
In particular both gmp.h and mpfr.h contain macros like
#define mpfr_inf_p(_x) ((_x)->_mpfr_exp == __MPFR_EXP_INF)
for which I add operator-> overloads to the auto_* classes.
* system.h (auto_mpz::operator->()): New.
* realmpfr.h (auto_mpfr::operator->()): New.
* builtins.cc (do_mpfr_lgamma_r): Use auto_mpfr.
* real.cc (real_from_string): Likewise.
(dconst_e_ptr): Likewise.
(dconst_sqrt2_ptr): Likewise.
* tree-ssa-loop-niter.cc (refine_value_range_using_guard):
Use auto_mpz.
(bound_difference_of_offsetted_base): Likewise.
(number_of_iterations_ne): Likewise.
(number_of_iterations_lt_to_ne): Likewise.
* ubsan.cc: Include realmpfr.h.
(ubsan_instrument_float_cast): Use auto_mpfr.
Richard Biener [Tue, 14 Mar 2023 13:39:32 +0000 (14:39 +0100)]
Avoid non-unified nodes on the topological sorting for PTA solving
Since we do not update successor edges when merging nodes we have
to deal with this in the users. The following avoids putting those
on the topo order vector.
* tree-ssa-structalias.cc (topo_visit): Look at the real
destination of edges.
Richard Biener [Thu, 9 Mar 2023 08:02:07 +0000 (09:02 +0100)]
tree-optimization/44794 - avoid excessive RTL unrolling on epilogues
The following adjusts tree_[transform_and_]unroll_loop to set an
upper bound on the number of iterations on the epilogue loop it
creates. For the testcase at hand which involves array prefetching
this avoids applying RTL unrolling to them when -funroll-loops is
specified.
Other users of this API include predictive commoning and
unroll-and-jam.
PR tree-optimization/44794
* tree-ssa-loop-manip.cc (tree_transform_and_unroll_loop):
If an epilogue loop is required set its iteration upper bound.
Xi Ruoyao [Wed, 12 Apr 2023 11:45:48 +0000 (11:45 +0000)]
LoongArch: Improve cpymemsi expansion [PR109465]
We'd been generating really bad block move sequences, which kernel
developers who tried __builtin_memcpy recently complained about. To improve
it:
1. Take advantage of -mno-strict-align. When it is set, set the mode
size to UNITS_PER_WORD regardless of the alignment.
2. Halve the mode size when (block size) % (mode size) != 0, instead of
falling back to ld.bu/st.b at once.
3. Limit the length of the block move sequence considering the number of
instructions, not the size of the block. When -mstrict-align is set and
the block is not aligned, the old size limit for straight-line
implementation (64 bytes) was definitely too large (we don't have 64
registers anyway).
Change since v1: add a comment about the calculation of num_reg.
gcc/ChangeLog:
PR target/109465
* config/loongarch/loongarch-protos.h
(loongarch_expand_block_move): Add a parameter as alignment RTX.
* config/loongarch/loongarch.h:
(LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER): Remove.
(LARCH_MAX_MOVE_BYTES_STRAIGHT): Remove.
(LARCH_MAX_MOVE_OPS_PER_LOOP_ITER): Define.
(LARCH_MAX_MOVE_OPS_STRAIGHT): Define.
(MOVE_RATIO): Use LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
* config/loongarch/loongarch.cc (loongarch_expand_block_move):
Take the alignment from the parameter, but set it to
UNITS_PER_WORD if !TARGET_STRICT_ALIGN. Limit the length of
straight-line implementation with LARCH_MAX_MOVE_OPS_STRAIGHT
instead of LARCH_MAX_MOVE_BYTES_STRAIGHT.
(loongarch_block_move_straight): When there are left-over bytes,
halve the mode size instead of falling back to byte mode at once.
(loongarch_block_move_loop): Limit the length of loop body with
LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
* config/loongarch/loongarch.md (cpymemsi): Pass the alignment
to loongarch_expand_block_move.
gcc/testsuite/ChangeLog:
PR target/109465
* gcc.target/loongarch/pr109465-1.c: New test.
* gcc.target/loongarch/pr109465-2.c: New test.
* gcc.target/loongarch/pr109465-3.c: New test.
Xi Ruoyao [Tue, 28 Mar 2023 17:36:09 +0000 (01:36 +0800)]
LoongArch: Improve GAR store for va_list
The LoongArch backend used to save all GARs for a function with variable
arguments. But sometimes a function only accepts variable arguments for
a purpose like C++ function overloading. For example, POSIX defines
open() as:
int open(const char *path, int oflag, ...);
But only two forms are actually used:
int open(const char *pathname, int flags);
int open(const char *pathname, int flags, mode_t mode);
So it's obviously a waste to save all 8 GARs in open(). We can use the
cfun->va_list_gpr_size field set by the stdarg pass to save only the
GARs that actually need to be saved.
If the va_list escapes (for example, in fprintf() we pass it to
vfprintf()), stdarg would set cfun->va_list_gpr_size to 255 so we
don't need a special case.
With this patch, only one GAR ($a2/$r6) is saved in open(). Ideally
even this stack store should be omitted too, but doing so is not trivial
and AFAIK there are no compilers (for any target) performing the "ideal"
optimization here, see https://godbolt.org/z/n1YqWq9c9.
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk
(GCC 14 or now)?
gcc/ChangeLog:
* config/loongarch/loongarch.cc
(loongarch_setup_incoming_varargs): Don't save more GARs than
cfun->va_list_gpr_size / UNITS_PER_WORD.
Richard Biener [Fri, 16 Dec 2022 12:48:58 +0000 (13:48 +0100)]
Simplify gimple_assign_load
The following simplifies and outlines gimple_assign_load. In
particular it is not necessary to get at the base of the possibly
loaded expression but just handle the case of a single handled
component wrapping a non-memory operand.
* gimple.h (gimple_assign_load): Outline...
* gimple.cc (gimple_assign_load): ... here. Avoid
get_base_address and instead just strip the outermost
handled component, treating a remaining handled component
as load.
aarch64: Delete __builtin_aarch64_neg* builtins and their use
I don't think we need to keep the __builtin_aarch64_neg* builtins around.
They are only used once, in the vnegh_f16 intrinsic in arm_fp16.h, and AFAICT
it was added this way only for the sake of orthogonality in
https://gcc.gnu.org/g:d7f33f07d88984cbe769047e3d07fc21067fbba9
We already use normal "-" negation in the other vneg* intrinsics, so do so here as well.
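What the reimplementation presumably looks like in arm_fp16.h, following the
file's existing inline-function style (the attribute boilerplate here is an
assumption, not quoted from the patch):
__extension__ extern __inline float16_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vnegh_f16 (float16_t __a)
{
  return -__a;
}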
Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (neg): Delete builtins
definition.
* config/aarch64/arm_fp16.h (vnegh_f16): Reimplement using normal negation.
For __builtin_popcountll tree-vect-patterns.cc has
vect_recog_popcount_pattern, which improves the vectorized code.
Without that the vectorization is always multi-type vectorization
in the loop (at least int and long long types) where we emit two
.POPCOUNT calls with long long arguments and int return value and then
widen to long long, so effectively after vectorization do the
V?DImode -> V?DImode popcount twice, then pack the result into V?SImode
and immediately unpack.
The following patch extends that handling to __builtin_{clz,ctz,ffs}ll
builtins as well (as long as there is an optab for them; more to come
later).
x86 can do __builtin_popcountll with -mavx512vpopcntdq, __builtin_clzll
with -mavx512cd, ppc can do __builtin_popcountll and __builtin_clzll
with -mpower8-vector and __builtin_ctzll with -mpower9-vector, s390
can do __builtin_{popcount,clz,ctz}ll with -march=z13 -mzarch (i.e. VX).
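The kind of loop this targets looks roughly like the following sketch (my own
example, not from the patch): the int result of __builtin_clzll is widened to
long long, which is exactly the situation the pattern now shortcuts. Whether a
single vector .CLZ is emitted still depends on the target providing the optab
(e.g. -mavx512cd on x86, as noted above):
void
f (long long *__restrict out, const unsigned long long *__restrict in, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = __builtin_clzll (in[i]);
}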
2023-04-19 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/109011
* tree-vect-patterns.cc (vect_recog_popcount_pattern): Rename to ...
(vect_recog_popcount_clz_ctz_ffs_pattern): ... this. Handle also
CLZ, CTZ and FFS. Remove vargs variable, use
gimple_build_call_internal rather than gimple_build_call_internal_vec.
(vect_vect_recog_func_ptrs): Adjust popcount entry.
Jakub Jelinek [Wed, 19 Apr 2023 09:13:11 +0000 (11:13 +0200)]
dse: Use SUBREG_REG for copy_to_mode_reg in DSE replace_read for WORD_REGISTER_OPERATIONS targets [PR109040]
While we've agreed this is not the right fix for the PR109040 bug,
the patch clearly improves generated code (at least on the testcase from the
PR), so I'd like to propose this as an optimization heuristics improvement
for GCC 14.
2023-04-19 Jakub Jelinek <jakub@redhat.com>
PR target/109040
* dse.cc (replace_read): If read_reg is a SUBREG of a word mode
REG, for WORD_REGISTER_OPERATIONS copy SUBREG_REG of it into
a new REG rather than the SUBREG.
In this PR we fail to eliminate explicit &31 operations for variable shifts such as in:
void
bar (int x[3], int y)
{
x[0] <<= (y & 31);
x[1] <<= (y & 31);
x[2] <<= (y & 31);
}
This is rejected by RTX costs that end up giving too high a cost for:
(set (reg:SI 96)
(ashift:SI (reg:SI 98)
(subreg:QI (and:SI (reg:SI 99)
(const_int 31 [0x1f])) 0)))
There is code to handle the AND-31 case in rtx costs, but it gets confused by the subreg.
It's easy enough to fix by looking inside the subreg when costing the expression.
While doing that I noticed that the ASHIFT case and the other shift-like cases are almost identical
and we should just merge them. This code will only be used for valid insns anyway, so the code after this
patch should do the Right Thing (TM) for all such shift cases.
With this patch there are no more "and wn, wn, 31" instructions left in the testcase.
Bootstrapped and tested on aarch64-none-linux-gnu.
PR target/108840
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Merge ASHIFT and
ROTATE, ROTATERT, LSHIFTRT, ASHIFTRT cases. Handle subregs in op1.
Richard Biener [Wed, 22 Mar 2023 08:29:49 +0000 (09:29 +0100)]
rtl-optimization/109237 - quadraticness in delete_trivially_dead_insns
The following addresses quadraticness in processing debug insns
in delete_trivially_dead_insns and insn_live_p by using TREE_VISITED
on the INSN_VAR_LOCATION_DECL to indicate a later debug bind
with the same decl and no intervening real insn or debug marker.
That gets rid of the NEXT_INSN walk in insn_live_p in favor of
first clearing TREE_VISITED in the first loop over insns and
book-keeping the decls we set the bit on, since we need to clear
them when visiting a real or debug marker insn.
That improves the time spent in delete_trivially_dead_insns from
10.6s to 2.2s for the testcase.
PR rtl-optimization/109237
* cse.cc (insn_live_p): Remove NEXT_INSN walk, instead check
TREE_VISITED on INSN_VAR_LOCATION_DECL.
(delete_trivially_dead_insns): Maintain TREE_VISITED on
active debug bind INSN_VAR_LOCATION_DECL.
For the testcase bb_is_just_return is on top of the profile; changing
it to walk BB insns backwards puts it off the profile. That's because
in the forward walk you have to process possibly many debug insns
but in a backward walk you very likely run into control insns first.
PR rtl-optimization/109237
* cfgcleanup.cc (bb_is_just_return): Walk insns backwards.
Jakub Jelinek [Wed, 19 Apr 2023 08:01:04 +0000 (10:01 +0200)]
testsuite: Fix up pr109524.C for -std=c++23 [PR109524]
This testcase was reduced such that it isn't valid C++23, so with my
usual testing with GXX_TESTSUITE_STDS=98,11,14,17,20,2b it fails:
FAIL: g++.dg/pr109524.C -std=gnu++2b (test for excess errors)
.../gcc/testsuite/g++.dg/pr109524.C: In function 'nn hh(nn)':
.../gcc/testsuite/g++.dg/pr109524.C:35:12: error: cannot bind non-const lvalue reference of type 'nn&' to an rvalue of type 'nn'
.../gcc/testsuite/g++.dg/pr109524.C:17:6: note: initializing argument 1 of 'nn::nn(nn&)'
The following patch fixes that and I've verified it doesn't change
anything on what the test was testing, it still ICEs in r13-7198 and
passes in r13-7203, now in all language modes (except for 98 where
it is intentionally UNSUPPORTED).
2023-04-19 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/109524
* g++.dg/pr109524.C (nn::nn): Change argument type from nn & to
const nn &.
Check hard_regno_mode_ok before setting lowest memory move cost for the mode with different reg classes.
There's a potential performance issue when the backend returns some
unreasonable value for a mode which can never be allocated with the
reg class.
gcc/ChangeLog:
PR rtl-optimization/109351
* ira.cc (setup_class_subset_and_memory_move_costs): Check
hard_regno_mode_ok before setting lowest memory move cost for
the mode with different reg classes.
Jonathan Wakely [Tue, 18 Apr 2023 23:07:36 +0000 (00:07 +0100)]
libstdc++: Adjust uses of null pointer constants in docs
libstdc++-v3/ChangeLog:
* doc/xml/manual/extensions.xml: Fix example to declare and
qualify std::free, and use NULL instead of 0.
* doc/html/manual/ext_demangling.html: Regenerate.
* libsupc++/cxxabi.h: Adjust doxygen comments.
ifcvt.cc: Prevent excessive if-conversion for conditional moves
gcc/
* ifcvt.cc (cond_move_process_if_block): Consider the result of
targetm.noce_conversion_profitable_p() when replacing the original
sequence with the converted one.
Andrew Pinski [Fri, 31 Mar 2023 00:00:20 +0000 (00:00 +0000)]
PHIOPT: Move tree_ssa_cs_elim into pass_cselim::execute.
This moves around the code for tree_ssa_cs_elim slightly
improving code readability and removing declarations that
are no longer needed.
OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
* tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Remove declaration.
(make_pass_phiopt): Make execute out of line.
(tree_ssa_cs_elim): Move code into ...
(pass_cselim::execute): here.
i386: Improve permutations with INSERTPS instruction [PR94908]
INSERTPS can select any element from src and insert into any place
of the dest. For SSE4.1 targets, compiler can generate e.g.
insertps $64, %xmm0, %xmm1
to insert element 1 from %xmm1 to element 0 of %xmm0.
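For illustration, a two-operand permutation of this shape (a sketch, not one
of the new tests) can now be matched to a single INSERTPS: lane 0 comes from
element 1 of the second operand and the remaining lanes from the first:
typedef float v4sf __attribute__ ((vector_size (16)));
typedef int v4si __attribute__ ((vector_size (16)));

v4sf
f (v4sf a, v4sf b)
{
  v4si mask = { 5, 1, 2, 3 };  /* 5 selects b[1]; 1..3 keep a[1..3].  */
  return __builtin_shuffle (a, b, mask);
}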
gcc/ChangeLog:
PR target/94908
* config/i386/i386-builtin.def (__builtin_ia32_insertps128):
Use CODE_FOR_sse4_1_insertps_v4sf.
* config/i386/i386-expand.cc (expand_vec_perm_insertps): New.
(expand_vec_perm_1): Call expand_vec_per_insertps.
* config/i386/i386.md ("unspec"): Declare UNSPEC_INSERTPS here.
* config/i386/mmx.md (mmxscalarmode): New mode attribute.
(@sse4_1_insertps_<mode>): New insn pattern.
* config/i386/sse.md (@sse4_1_insertps_<mode>): Macroize insn
pattern from sse4_1_insertps using VI4F_128 mode iterator.
gcc/testsuite/ChangeLog:
PR target/94908
* gcc.target/i386/pr94908.c: New test.
* gcc.target/i386/sse4_1-insertps-5.c: New test.
* gcc.target/i386/vperm-v4sf-2-sse4.c: New test.
Jonathan Wakely [Tue, 18 Apr 2023 16:22:40 +0000 (17:22 +0100)]
libstdc++: Fix preprocessor condition in linker script [PR108969]
The linker script is preprocessed with $(top_builddir)/config.h not the
include/$target/bits/c++config.h version, which means that configure
macros do not have the _GLIBCXX_ prefix yet.
The _GLIBCXX_SYMVER_GNU and _GLIBCXX_SHARED checks are redundant,
because the gnu.ver file is only used for _GLIBCXX_SYMVER_GNU and the
linker script is only used for the shared library. Remove those.
Aldy Hernandez [Thu, 23 Feb 2023 08:10:16 +0000 (09:10 +0100)]
Add GTY support for vrange.
IPA currently puts *some* irange's in GC memory. When I contribute
support for generic ranges in IPA, we'll need to change this to
vrange. This patch adds GTY support for both vrange and frange.
constraint: fix relaxed memory and repeated constraint handling
The function `constrain_operands' lacked the logic to consider relaxed
memory constraints when "traditional" memory constraints were not
satisfied, creating potential issues as observed during the reload
compilation pass.
In addition, it was observed that while `constrain_operands' chooses
to disregard constraints when more than one alternative is provided,
e.g. "m,r" using CONSTRAINT__UNKNOWN, it has no checks in place to
determine whether the multiple constraints in a given string are in
fact repetitions of the same constraint and should thus in fact be
treated as a single constraint, as ought to be the case for something
like "m,m".
Both of these issues are dealt with here, thus ensuring that we get
appropriate pattern matching.
Jonathan Wakely [Tue, 18 Apr 2023 13:37:38 +0000 (14:37 +0100)]
libstdc++: Export global iostreams with GLIBCXX_3.4.31 symver [PR108969]
Since GCC 13 the global iostream objects are only initialized once in
libstdc++, and not by a std::ios::Init object in every translation unit
that includes <iostream>. To avoid using uninitialized streams defined
in an older libstdc++.so, translation units using the global iostreams
should depend on the GLIBCXX_3.4.31 symver.
Define std::cin as std::__io::cin and then export it as
std::cin@@GLIBCXX_3.4.31 so that references to std::cin bind to the new
symver. Also export it as @GLIBCXX_3.4 for backwards compatibility.
libstdc++-v3/ChangeLog:
PR libstdc++/108969
* src/Makefile.am: Move globals_io.cc to here.
* src/Makefile.in: Regenerate.
* src/c++98/Makefile.am: Remove globals_io.cc from here.
* src/c++98/Makefile.in: Regenerate.
* src/c++98/globals_io.cc [_GLIBCXX_SYMVER_GNU] (cin): Adjust
symbol name and then export with GLIBCXX_3.4.31 symver.
(cout, cerr, clog, wcin, wcout, wcerr, wclog): Likewise.
* config/abi/post/aarch64-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/i486-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/m68k-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/powerpc64-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/riscv64-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/x86_64-linux-gnu/32/baseline_symbols.txt:
Regenerate.
* config/abi/post/s390x-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/post/x86_64-linux-gnu/baseline_symbols.txt:
Regenerate.
* config/abi/pre/gnu.ver: Add iostream objects to new symver.
Kito Cheng [Tue, 18 Apr 2023 10:07:06 +0000 (18:07 +0800)]
Docs: Add doc for RISC-V vector intrinsics
Document which version of the RISC-V vector intrinsics specification is
implemented in GCC.
gcc/ChangeLog:
* doc/extend.texi (Target Builtins): Add RISC-V Vector
Intrinsics.
(RISC-V Vector Intrinsics): Document which version of the
RISC-V vector intrinsics GCC implements and its reference.
This adds bitmap_clear_first_set_bit and uses it where previously
bitmap_clear_bit followed bitmap_first_set_bit. The advantage
is speeding up the search and avoiding clobbering ->current.
Richard Biener [Fri, 10 Mar 2023 11:23:09 +0000 (12:23 +0100)]
Shrink points-to analysis dumps when not dumping with -details
The following allows getting PTA stats with -stats without blowing
up your filesystem by guarding constraint and solution dumping
with TDF_DETAILS and the SSA points-to info with TDF_DETAILS
or TDF_ALIAS.
* tree-ssa-structalias.cc (dump_sa_stats): Split out from...
(dump_sa_points_to_info): ... this function.
(compute_points_to_sets): Guard large dumps with TDF_DETAILS,
and call dump_sa_stats guarded with TDF_STATS.
(ipa_pta_execute): Likewise.
(compute_may_aliases): Guard dump_alias_info with
TDF_DETAILS|TDF_ALIAS.
Andrew Pinski [Tue, 4 Apr 2023 00:09:27 +0000 (00:09 +0000)]
PHIOPT: add folding/simplification detail to the dump
While debugging PHI-OPT with match-and-simplify,
I found that adding more dumping to the debug dumps made
it easier to understand what was going on than stepping through
it in the debugger, so this adds them. Note I used TDF_FOLDING rather
than TDF_DETAILS as these debug messages can be chatty and are
only needed if you are debugging match and simplify
with PHI-OPT; match and simplify uses TDF_FOLDING as
its check.
OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
* tree-ssa-phiopt.cc (gimple_simplify_phiopt): Dump
the expression that is being tried when TDF_FOLDING
is true.
(phiopt_worker::match_simplify_replacement): Dump
the sequence which was created by gimple_simplify_phiopt
when TDF_FOLDING is true.
Andrew Pinski [Sun, 9 Apr 2023 02:03:38 +0000 (19:03 -0700)]
PHIOPT: small cleanup in match_simplify_replacement
We know that the statement we are moving already
has an SSA_NAME on the lhs, so we don't need to
check that and can just call reset_flow_sensitive_info
with the name we already have.
OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
gcc/ChangeLog:
* tree-ssa-phiopt.cc (match_simplify_replacement):
Simplify code that does the movement slightly.
aarch64: Use standard RTL codes for __rev16 intrinsic expansion
I noticed that for the expansion of the __rev16* arm_acle.h intrinsics we don't need to use an unspec just because it doesn't map neatly onto a bswap code.
We have organic combine patterns for it that we can reuse.
This patch removes the define_insn using UNSPEC_REV (should it have been an UNSPEC_REV16?) and adds an expander to emit
the patterns we have for rev16 using standard RTL codes.
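For reference, the standard-RTL idiom those combine patterns recognize
corresponds to the usual C rev16 formulation (a sketch, not taken from the
patch):
unsigned int
rev16 (unsigned int x)
{
  return ((x & 0xff00ff00u) >> 8) | ((x & 0x00ff00ffu) << 8);
}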
Bootstrapped and tested on aarch64-none-linux-gnu.
Declare dconstm0 to go along with dconst0 and friends.
Negating dconst0 is getting pretty old, and we will keep adding copies
of the same idiom. Fixed by adding a dconstm0 constant to go along
with dconst1, dconstm1, etc.
gcc/ChangeLog:
* emit-rtl.cc (init_emit_once): Initialize dconstm0.
* gimple-range-op.cc (class cfn_signbit): Remove dconstm0
declaration.
* range-op-float.cc (zero_range): Use dconstm0.
(zero_to_inf_range): Same.
* real.h (dconstm0): New.
* value-range.cc (frange::flush_denormals_to_zero): Use dconstm0.
(frange::set_zero): Do not declare dconstm0.
Richard Biener [Mon, 6 Mar 2023 10:06:38 +0000 (11:06 +0100)]
RAII auto_mpfr and auto_mpz
The following adds two RAII classes, one for mpz_t and one for mpfr_t,
making object lifetime management easier. Both formerly required
explicit initialization with {mpz,mpfr}_init and release with
{mpz,mpfr}_clear.
I've converted two example places (where lifetime is trivial).
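A minimal sketch of the RAII idea for the mpz_t case (not the exact GCC class;
the names and members here are illustrative only):
#include <gmp.h>

class auto_mpz_sketch
{
public:
  auto_mpz_sketch () { mpz_init (m_mpz); }        /* acquire */
  ~auto_mpz_sketch () { mpz_clear (m_mpz); }      /* release */
  operator mpz_t & () { return m_mpz; }           /* usable as a plain mpz_t */
  auto_mpz_sketch (const auto_mpz_sketch &) = delete;
  auto_mpz_sketch &operator= (const auto_mpz_sketch &) = delete;
private:
  mpz_t m_mpz;
};

int
main ()
{
  auto_mpz_sketch x;
  mpz_set_ui (x, 42);            /* passes straight through to GMP */
  return mpz_get_ui (x) != 42;
}                                /* mpz_clear runs automatically here */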
aarch64: Use intrinsic flags information rather than hardcoding FLAG_AUTO_FP
We record the flags to use for the intrinsics in aarch64_simd_intrinsic_data, so use it when initialising them
rather than using a hardcoded FLAG_AUTO_FP. The current vreinterpret intrinsics use FLAG_AUTO_FP anyway, so this
patch is an NFC, but this will be needed as we migrate more builtins into the intrinsics infrastructure.
Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/ChangeLog:
* config/aarch64/aarch64-builtins.cc (aarch64_init_simd_intrinsics): Take
builtin flags from intrinsic data rather than hardcoded FLAG_AUTO_FP.
Aldy Hernandez [Thu, 26 Jan 2023 03:46:54 +0000 (04:46 +0100)]
Return true from operator== for two identical ranges containing NAN.
The == operator for ranges signifies that two ranges contain the same
thing, not that they are ultimately equal. So [2,4] == [2,4], even
though one may be a 2 and the other may be a 3. Similarly with two
VARYING ranges.
There is an oversight in frange::operator== where we are returning
false for two identical NANs. This is causing us to never cache NANs
in sbr_sparse_bitmap::set_bb_range.
gcc/ChangeLog:
* value-range.cc (frange::operator==): Adjust for NAN.
(range_tests_nan): Remove some NAN tests.
Aldy Hernandez [Thu, 23 Feb 2023 07:48:28 +0000 (08:48 +0100)]
Add inchash support for vrange.
This patch provides inchash support for vrange. It is along the lines
of the streaming support I just posted and will be used for IPA
hashing of ranges.
Richard Biener [Tue, 18 Apr 2023 09:49:48 +0000 (11:49 +0200)]
tree-optimization/109539 - restrict PHI handling in access diagnostics
Access diagnostics visits the SSA def-use chains to diagnose things like
dangling pointer uses. When that runs into PHIs it tries to prove that
all incoming pointers, one of which is the currently visited use, are
related, to decide whether to keep looking for the PHI def uses.
That turns out to be overly optimistic and thus costly. The following
scraps the existing handling for simply requiring that we eventually
visit all incoming pointers of the PHI during the def-use chain
analysis and only then process uses of the PHI def.
Note this handles backedges of natural loops optimistically, diagnosing
the first iteration. There's gcc.dg/Wuse-after-free-2.c containing
a testcase requiring this.
PR tree-optimization/109539
* gimple-ssa-warn-access.cc (pass_waccess::check_pointer_uses):
Re-implement pointer relatedness for PHIs.
Patrick Palka [Tue, 18 Apr 2023 11:21:13 +0000 (07:21 -0400)]
libstdc++: Implement range_adaptor_closure from P2387R3 [PR108827]
PR libstdc++/108827
libstdc++-v3/ChangeLog:
* include/bits/ranges_cmp.h (__cpp_lib_ranges): Bump value
for C++23.
* include/std/ranges (range_adaptor_closure): Define for C++23.
* include/std/version (__cpp_lib_ranges): Bump value for
C++23.
* testsuite/std/ranges/version_c++23.cc: Bump expected value
of __cpp_lib_ranges.
* testsuite/std/ranges/range_adaptor_closure.cc: New test.
Andrew Stubbs [Fri, 14 Apr 2023 16:05:15 +0000 (17:05 +0100)]
amdgcn: HardFP divide
Implement FP division using hardware instructions. This replaces both the
softfp library calls, and the -ffast-math inaccurate division we had previously.
The GCN architecture does not have a single divide instruction, but it does
have a number of support instructions designed to make multiply-by-reciprocal
sufficiently accurate for non-fast-math usage.
gcc/ChangeLog:
* config/gcn/gcn-valu.md (SV_SFDF): New iterator.
(SV_FP): New iterator.
(scalar_mode, SCALAR_MODE): Add identity mappings for scalar modes.
(recip<mode>2): Unify the two patterns using SV_FP.
(div_scale<mode><exec_vcc>): New insn.
(div_fmas<mode><exec>): New insn.
(div_fixup<mode><exec>): New insn.
(div<mode>3): Unify the two expanders and rewrite using hardfp.
* config/gcn/gcn.cc (gcn_md_reorg): Support "vccwait" attribute.
* config/gcn/gcn.md (unspec): Add UNSPEC_DIV_SCALE, UNSPEC_DIV_FMAS,
and UNSPEC_DIV_FIXUP.
(vccwait): New attribute.
gcc/testsuite/ChangeLog:
* gcc.target/gcn/fpdiv.c: Remove the -ffast-math requirement.
This patch is a straightforward extension of the zero-extending LDAPR
pattern to represent QI -> HI load-extends. This maps down to a LDAPRB-W
instruction.
This lets us remove a redundant zero-extend in the new test function.
Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/ChangeLog:
* config/aarch64/atomics.md
(*aarch64_atomic_load<ALLX:mode>_rcpc_zext):
Use SD_HSDI for destination mode iterator.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ldapr-zext.c: Add test for u8 to u16
extension.
Jin Ma [Tue, 18 Apr 2023 09:26:49 +0000 (17:26 +0800)]
RISC-V: Adjust the parsing order of extensions to be consistent with riscv-spec and binutils.
The order in which gcc and binutils currently parse extensions is inconsistent.
According to the latest RISC-V spec, the canonical order in which extension names must
appear in the name string, specified in Table 29.1, is different from before.
In the latest table, non-standard extensions must be listed after all standard
extensions. To stay consistent, we now change the parsing order.
Related llvm patch links:
https://reviews.llvm.org/D148315
gcc/ChangeLog:
* common/config/riscv/riscv-common.cc (multi_letter_subset_rank): Swap the order
of z-extensions and s-extensions.
(riscv_subset_list::parse): Likewise.
match.pd has mostly for AArch64 an optimization in which it optimizes
certain forms of __builtin_shuffle of x + y and x - y vectors into
fneg using twice as wide element type so that every other sign is changed,
followed by fadd.
The following patch extends that optimization, so that it can handle
other forms as well, using the same fneg but fsub instead of fadd.
As the plus is commutative and minus is not, and I want to handle
vec_perm with plus-minus and minus-plus order preferably in one
pattern, I had to do the matching operand checks by hand.
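A sketch of the newly handled shape, with minus as the first vec_perm operand
and plus as the second (my own example; the committed tests are the addsub_2.c
files listed below):
typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

v2df
f (v2df x, v2df y)
{
  v2df diff = x - y, sum = x + y;
  v2di mask = { 0, 3 };          /* result is { x0 - y0, x1 + y1 }  */
  return __builtin_shuffle (diff, sum, mask);
}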
2023-04-18 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/109240
* match.pd (fneg/fadd): Rewrite such that it handles both plus as
first vec_perm operand and minus as second using fneg/fadd and
minus as first vec_perm operand and plus as second using fneg/fsub.
* gcc.target/aarch64/simd/addsub_2.c: New test.
* gcc.target/aarch64/sve/addsub_2.c: New test.
In upcoming patches I will contribute code to stream out frange's as
well as vrange's. This patch abstracts out the REAL_VALUE_TYPE
streaming into its own functions, so that they may be used elsewhere.
Aldy Hernandez [Fri, 10 Feb 2023 11:52:24 +0000 (12:52 +0100)]
Abstract out calculation of max HWIs per wide int.
I'm about to add one more use of the same snippet of code, for a total
of 4 identical calculations in the code base.
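The repeated snippet is the round-up division of a precision in bits by
HOST_BITS_PER_WIDE_INT; presumably the new macro looks roughly like this
sketch (the authoritative definition is the one added to wide-int.h):
#define WIDE_INT_MAX_HWIS(PRECISION) \
  (((PRECISION) + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT)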
gcc/ChangeLog:
* wide-int.h (WIDE_INT_MAX_HWIS): New.
(class fixed_wide_int_storage): Use it.
(trailing_wide_ints <N>::set_precision): Use it.
(trailing_wide_ints <N>::extra_size): Use it.
Xi Ruoyao [Sun, 2 Apr 2023 13:37:49 +0000 (21:37 +0800)]
LoongArch: Optimize additions with immediates
1. Use addu16i.d for TARGET_64BIT and suitable immediates.
2. Split one addition with immediate into two addu16i.d or addi.{d/w}
instructions if possible. This can avoid using a temp register without
increasing the count of instructions (see the example below).
Inspired by https://reviews.llvm.org/D143710 and
https://reviews.llvm.org/D147222.
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for GCC 14?
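As an illustration of the splitting described in point 2 (the constant here is
my own choice, not from the patch): 0x12345 = (0x1 << 16) + 0x345, so the
addition can be emitted as addu16i.d plus addi.d with no temporary register:
/* Expected (under the patch) to become roughly:
   addu16i.d $a0, $a0, 1; addi.d $a0, $a0, 0x345  */
long
f (long x)
{
  return x + 0x12345;
}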
gcc/ChangeLog:
* config/loongarch/loongarch-protos.h
(loongarch_addu16i_imm12_operand_p): New function prototype.
(loongarch_split_plus_constant): Likewise.
* config/loongarch/loongarch.cc
(loongarch_addu16i_imm12_operand_p): New function.
(loongarch_split_plus_constant): Likewise.
* config/loongarch/loongarch.h (ADDU16I_OPERAND): New macro.
(DUAL_IMM12_OPERAND): Likewise.
(DUAL_ADDU16I_OPERAND): Likewise.
* config/loongarch/constraints.md (La, Lb, Lc, Ld, Le): New
constraint.
* config/loongarch/predicates.md (const_dual_imm12_operand): New
predicate.
(const_addu16i_operand): Likewise.
(const_addu16i_imm12_di_operand): Likewise.
(const_addu16i_imm12_si_operand): Likewise.
(plus_di_operand): Likewise.
(plus_si_operand): Likewise.
(plus_si_extend_operand): Likewise.
* config/loongarch/loongarch.md (add<mode>3): Convert to
define_insn_and_split. Use plus_<mode>_operand predicate
instead of arith_operand. Add alternatives for La, Lb, Lc, Ld,
and Le constraints.
(*addsi3_extended): Convert to define_insn_and_split. Use
plus_si_extend_operand instead of arith_operand. Add
alternatives for La and Le alternatives.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/add-const.c: New test.
* gcc.target/loongarch/stack-check-cfa-1.c: Adjust for stack
frame size change.
* gcc.target/loongarch/stack-check-cfa-2.c: Likewise.
libsanitizer, darwin: Unsupport Darwin >= 22 for now.
The mechanism for locating dyld has altered from Darwin22, since dyld is now
in the shared cache. The implemented mechanism for walking the cache uses
Apple Blocks which GCC does not yet support, and the fallback to the original
mechanism does not work there.
Until a suitable work-around can be found, unsupport Darwin22+.
Aldy Hernandez [Thu, 2 Mar 2023 12:12:45 +0000 (13:12 +0100)]
Constify invariant fields of vrange and irange.
The discriminator in vrange cannot change after construction,
similarly the number of allocated ranges in an irange. It's best to
make them constant to avoid invalid changes.
Patrick Palka [Mon, 17 Apr 2023 22:52:07 +0000 (18:52 -0400)]
c++: bound ttp level lowering [PR109531]
Here when level lowering the bound ttp TT<typename T::type> via the
substitution T=C, we're neglecting to canonicalize (and thereby strip
them of simple typedefs) the substituted template arguments {A<int>} before
determining the new canonical type via hash table lookup. This leads to
a hash mismatch ICE for the two equivalent types TT<int> and TT<A<int>>
since iterative_hash_template_arg assumes type arguments are already
canonicalized.
We can fix this by canonicalizing or coercing the substituted arguments
directly, but seeing as creation and ordinary substitution of bound ttps
both go through lookup_template_class, which in turn performs the desired
coercion/canonicalization, it seems preferable to make this code path go
through lookup_template_class as well.
PR c++/109531
gcc/cp/ChangeLog:
* pt.cc (tsubst) <case BOUND_TEMPLATE_TEMPLATE_PARM>:
In the level-lowering case just use lookup_template_class
to rebuild the bound ttp.
gcc/testsuite/ChangeLog:
* g++.dg/template/canon-type-20.C: New test.
* g++.dg/template/ttp36.C: New test.
RISC-V: optimize stack manipulation in save-restore
The stack that save-restore reserves is not well accumulated in stack allocation and deallocation.
This patch allows fewer instructions to be used in stack allocation and deallocation if save-restore is enabled.
before patch:
bar:
call t0,__riscv_save_4
addi sp,sp,-64
...
li t0,-12288
addi t0,t0,-1968 # optimized out after patch
add sp,sp,t0 # prologue
...
li t0,12288 # epilogue
addi t0,t0,2000 # optimized out after patch
add sp,sp,t0
...
addi sp,sp,32
tail __riscv_restore_4
after patch:
bar:
call t0,__riscv_save_4
addi sp,sp,-2032
...
li t0,-12288
add sp,sp,t0 # prologue
...
li t0,12288 # epilogue
add sp,sp,t0
...
addi sp,sp,2032
tail __riscv_restore_4
gcc/
* config/riscv/riscv.cc (riscv_expand_prologue): Consider save-restore in
stack allocation.
(riscv_expand_epilogue): Consider save-restore in stack deallocation.
gcc/testsuite
* gcc.target/riscv/stack_save_restore.c: New test.
A warning pass should not be exporting global ranges it finds along
the way, because that will alter the behavior of future passes.
The present behavior was there because of some long-ago
forgotten regression in another pass. This regression is no longer
there, and if there's ever any fallout from cleaning this up, we can
address it in the pass that is missing some information.
gcc/ChangeLog:
* gimple-ssa-warn-alloca.cc (pass_walloca::execute): Do not export
global ranges.
The RVV test harness currently sets the ISA according to the target
tuple, but doesn't also set the ABI. This just sets the ABI to match
the ISA, though we should really also be respecting the user's specific
ISA to test.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/rvv.exp (gcc_mabi): New variable.
The test case that was added is rv64i-specific, as there are better ways
to generate this code on rv32i (where the long/int cast is a NOP) and on
rv64i_zba (where we have word shifts). This renames the original test
case and adds two more for those targets.
gcc/testsuite/ChangeLog:
PR target/106602
* gcc.target/riscv/pr106602.c: Moved to...
* gcc.target/riscv/pr106602-rv64i.c: ...here.
* gcc.target/riscv/pr106602-rv32i.c: New test.
* gcc.target/riscv/pr106602-rv64i_zba.c: New test.
RISC-V: add a new parameter in riscv_first_stack_step.
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_first_stack_step): Add a new function
parameter remaining_size.
(riscv_compute_frame_info): Adapt new riscv_first_stack_step interface.
(riscv_expand_prologue): Likewise.
(riscv_expand_epilogue): Likewise.
Feng Wang [Sat, 15 Apr 2023 16:11:15 +0000 (10:11 -0600)]
RISC-V: Optimize the reverse conditions of rotate shift
gcc/ChangeLog:
* config/riscv/bitmanip.md (rotrsi3_sext): Support generating
roriw for constant counts.
* rtl.h (reverse_rotate_by_imm_p): Add function declaration.
* simplify-rtx.cc (reverse_rotate_by_imm_p): New function.
(simplify_context::simplify_binary_operation_1): Use it.
* expmed.cc (expand_shift_1): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/zbb-rol-ror-04.c: New test.
* gcc.target/riscv/zbb-rol-ror-05.c: New test.
* gcc.target/riscv/zbb-rol-ror-06.c: New test.
* gcc.target/riscv/zbb-rol-ror-07.c: New test.
Jakub Jelinek [Mon, 17 Apr 2023 13:12:49 +0000 (15:12 +0200)]
Update crontab and git_update_version.py
2023-04-17 Jakub Jelinek <jakub@redhat.com>
maintainer-scripts/
* crontab: Snapshots from trunk are now GCC 14 related.
Add GCC 13 snapshots from the respective branch.
contrib/
* gcc-changelog/git_update_version.py (active_refs): Add
releases/gcc-13.
Martin Jambor [Mon, 17 Apr 2023 10:59:51 +0000 (12:59 +0200)]
ipa: Fix double reference-count decrements for the same edge (PR 107769, PR 109318)
It turns out that since the addition of the code that can identify globals
which are only read from, the code that keeps track of the references
can decrement their count for the same calls, once during IPA-CP and
then again during inlining. Fixed by adding a special flag to the
pass-through variant and simply wiping out the reference to the
refdesc structure from the constant ones.
Moreover, during debugging of the issue I have discovered that the
code removing references could remove a reference associated with the
same statement but of a wrong type. In all cases it wanted to remove
an IPA_REF_ADDR reference so removing a lesser one instead should do
no harm in practice, but we should try to be consistent and so this
patch extends symtab_node::find_reference so that it searches for a
reference of a given type only.
gcc/ChangeLog:
2023-04-14 Martin Jambor <mjambor@suse.cz>
PR ipa/107769
PR ipa/109318
* cgraph.h (symtab_node::find_reference): Add parameter use_type.
* ipa-prop.h (ipa_pass_through_data): New flag refdesc_decremented.
(ipa_zap_jf_refdesc): New function.
(ipa_get_jf_pass_through_refdesc_decremented): Likewise.
(ipa_set_jf_pass_through_refdesc_decremented): Likewise.
* ipa-cp.cc (ipcp_discover_new_direct_edges): Provide a value for
the new parameter of find_reference.
(adjust_references_in_caller): Likewise. Make sure the constant jump
function is not used to decrement a refdesc counter again. Only
decrement refdesc counters when the pass_through jump function allows
it. Added a detailed dump when decrementing refdesc counters.
* ipa-prop.cc (ipa_print_node_jump_functions_for_edge): Dump new flag.
(ipa_set_jf_simple_pass_through): Initialize the new flag.
(ipa_set_jf_unary_pass_through): Likewise.
(ipa_set_jf_arith_pass_through): Likewise.
(remove_described_reference): Provide a value for the new parameter of
find_reference.
(update_jump_functions_after_inlining): Zap refdesc of new jfunc if
the previous pass_through had a flag mandating that we do so.
(propagate_controlled_uses): Likewise. Only decrement refdesc
counters when the pass_through jump function allows it.
(ipa_edge_args_sum_t::duplicate): Provide a value for the new
parameter of find_reference.
(ipa_write_jump_function): Assert the new flag does not have to be
streamed.
* symtab.cc (symtab_node::find_reference): Add parameter use_type, use
it in searching.
Philipp Tomsich [Thu, 23 Mar 2023 18:47:57 +0000 (19:47 +0100)]
aarch64: disable LDP via tuning structure for -mcpu=ampere1
AmpereOne (-mcpu=ampere1) breaks LDP instructions into two uops.
Given the chance that this causes instructions to slip into the next
decoding cycle and the additional overheads when handling
cacheline-crossing LDP instructions, we disable the generation of LDP
isntructions through the tuning structure from instruction combining
(such as in peephole2).
Given the code-density benefits in builtins and prologue/epilogue
expansion, we allow LDPs there.
This commit:
* adds a new tuning option AARCH64_EXTRA_TUNE_NO_LDP_COMBINE
* allows -moverride=tune=... to override this
These changes are benchmark-driven, yielding the following changes
(with a net-overall improvement):
503.bwaves_r -0.88%
507.cactuBSSN_r 0.35%
508.namd_r 3.09%
510.parest_r -2.99%
511.povray_r 5.54%
519.lbm_r 15.83%
521.wrf_r 0.56%
526.blender_r 2.47%
527.cam4_r 0.70%
538.imagick_r 0.00%
544.nab_r -0.33%
549.fotonik3d_r -0.42%
554.roms_r 0.00%
-------------------------
= total 1.79%
Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
Co-Authored-By: Di Zhao <di.zhao@amperecomputing.com>
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AARCH64_EXTRA_TUNE_NO_LDP_COMBINE.
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp):
Check for the above tuning option when processing loads.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ampere1-no_ldp_combine.c: New test.
Jakub Jelinek [Mon, 17 Apr 2023 09:45:53 +0000 (11:45 +0200)]
testsuite: Fix up vect-simd-clone-1[678]f.c tests some more
With
make check-gcc check-g++ -j32 -k RUNTESTFLAGS='--target_board=unix\{-m32,-m32/-mavx,-m32/-mavx512f,-m32/-march=cascadelake,-m64,-m64/-mavx,-m64/-mavx512f,-m64/-march=cascadelake\}
+vect.exp=vect-simd-clone*'
the vect-simd-clone-1[678]f.c tests fail with -m32/-mavx512f and -m32/-march=cascadelake;
in that case there are zero matches rather than the 4 expected for ia32.
-m64/-mavx512f and -m64/-march=cascadelake works fine though (2 expected
matches).
So, the following patch just adds -mno-avx512f for x86 non-lp64.
Richard Biener [Mon, 17 Apr 2023 07:22:57 +0000 (09:22 +0200)]
tree-optimization/109524 - ICE with VRP edge removal
VRP queues edges to process late for updating global ranges for
__builtin_unreachable. But this interferes with edge removal
from substitute_and_fold. The following deals with this by
looking up the edge with source/dest block indices which do not
become stale.
PR tree-optimization/109524
* tree-vrp.cc (remove_unreachable::m_list): Change to a
vector of pairs of block indices.
(remove_unreachable::maybe_register_block): Adjust.
(remove_unreachable::remove_and_update_globals): Likewise.
Deal with removed blocks.
As PR108809 mentioned, vec_xl_len_r and vec_xst_len_r are tested
in gcc.target/powerpc/builtins-5-p9-runnable.c.
The vector operand of these two bifs are different from the view
of v16_int8 between BE and LE, even it is same from the view of
128bits(uint128/V1TI).
The test case gcc.target/powerpc/builtins-5-p9-runnable.c was
written for LE environment, this patch updates it for BE.
Tested on ppc64 BE and LE.
Is this ok for trunk?
BR,
Jeff (Jiufu)
gcc/testsuite/ChangeLog:
PR testsuite/108809
* gcc.target/powerpc/builtins-5-p9-runnable.c: Update for BE.
Pan Li [Fri, 14 Apr 2023 03:25:11 +0000 (11:25 +0800)]
RISC-V: Add test cases for the RVV mask insn shortcut.
There are various shortcuts in codegen for the RVV mask insns. For
example:
vmxor vd, va, va => vmclr vd.
We would like to add more optimizations like this, but first of all
we must add tests for the existing shortcut optimizations, to
ensure we don't break them.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/mask_insn_shortcut.c: New test.