Patrick Palka [Fri, 8 Dec 2023 18:34:04 +0000 (13:34 -0500)]
c++: guard more against undiagnosed error_mark_node [PR112658]
This adds a sanity check to cp_parser_expression_statement similar to
the one in finish_expr_stmt added by r6-6795-g0fd9d4921f7ba2, which
effectively downgrades accepts-invalid/wrong-code bugs like this one
into ice-on-invalid/ice-on-valid ones.
PR c++/112658
gcc/cp/ChangeLog:
* parser.cc (cp_parser_expression_statement): If the statement
is error_mark_node, make sure we've seen_error().
Patrick Palka [Fri, 8 Dec 2023 18:33:55 +0000 (13:33 -0500)]
c++: undiagnosed error_mark_node from cp_build_c_cast [PR112658]
When cp_build_c_cast commits to an erroneous const_cast, we neglect to
replay errors from build_const_cast_1, which can lead to us incorrectly
accepting (and "miscompiling") the cast, or triggering the assert in
finish_expr_stmt.
This patch fixes this oversight. This was the original fix for the ICE
in PR112658 before r14-5941-g305a2686c99bf9 made us accept the testcase
there after all. I wasn't able to come up with an alternate testcase for
which this fix has an effect anymore, but below is a reduced version of
the PR112658 testcase (accepted ever since r14-5941) for good measure.
PR c++/112658
PR c++/94264
gcc/cp/ChangeLog:
* typeck.cc (cp_build_c_cast): If we're committed to a const_cast
and the result is erroneous, call build_const_cast_1 a second
time to issue errors. Use complain=tf_none instead of =false.
Robin Dapp [Fri, 1 Dec 2023 09:07:23 +0000 (10:07 +0100)]
RISC-V: Add vectorized strcmp and strncmp.
This patch adds vectorized strcmp and strncmp implementations and
tests. Similar to strlen, expansion is still guarded by
-minline-str(n)cmp.
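For illustration, a plain C call of the kind the new expansion handles
inline (a sketch only; the exact -march string is an assumption,
-minline-strcmp is the guard mentioned above):

#include <string.h>

/* Compiled with e.g. -O2 -march=rv64gcv -minline-strcmp, this call is
   expanded to inline vector code instead of a libc call.  */
int
cmp (const char *a, const char *b)
{
  return strcmp (a, b);
}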
gcc/ChangeLog:
PR target/112109
* config/riscv/riscv-protos.h (expand_strcmp): Declare.
* config/riscv/riscv-string.cc (riscv_expand_strcmp): Add
strategy handling and delegation to scalar and vector expanders.
(expand_strcmp): Vectorized implementation.
* config/riscv/riscv.md: Add TARGET_VECTOR to strcmp and strncmp
expander.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/builtin/strcmp-run.c: New test.
* gcc.target/riscv/rvv/autovec/builtin/strcmp.c: New test.
* gcc.target/riscv/rvv/autovec/builtin/strncmp-run.c: New test.
* gcc.target/riscv/rvv/autovec/builtin/strncmp.c: New test.
Robin Dapp [Fri, 1 Dec 2023 08:57:15 +0000 (09:57 +0100)]
RISC-V: Add vectorized strlen.
This patch implements a vectorized strlen by re-using and slightly
adjusting the rawmemchr implementation. Rawmemchr returns the address
of the needle, while strlen returns the difference between the needle
address and the start address.
As before, strlen expansion is guarded by -minline-strlen.
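A minimal C model of the relationship described above, using glibc's
rawmemchr (the vectorized expansion computes the same thing inline):

#define _GNU_SOURCE
#include <string.h>
#include <stddef.h>

size_t
strlen_via_rawmemchr (const char *s)
{
  /* rawmemchr returns the address of the needle; strlen is that
     address minus the start address.  */
  return (size_t) ((const char *) rawmemchr (s, '\0') - s);
}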
While testing with -minline-strlen I encountered a vsetvl problem in
memcpy-chk.c where we didn't insert a vsetvl at the proper spot (after
a setjmp). This needs to be fixed separately and I figured I'd post
this patch as-is.
early-ra's likely_operand_match_p didn't handle relaxed and special
memory constraints, which meant that the pass wasn't able to match
LD1RQ instructions to their constraints, and so backed out of
trying to allocate. This patch fixes that by switching the sense
of the match: "does the rtx seem appropriate for the constraint?"
rather than "does the constraint seem appropriate for the rtx?".
Also, I came across a case that needed more general equivalence
detection. Previously we would only record equivalences after
the last definition of the source register, but it's worth trying
to handle cases where the destination register's live range is
restricted to a block, and the next definition of the source
occurs only after the end of the destination register's live range.
The patch also fixes a cut-&-pasto that Alex noticed (thanks).
gcc/
* config/aarch64/aarch64-early-ra.cc (allocno_info::chain_next):
Put into an enum with...
(allocno_info::last_def_point): ...new member variable.
(allocno_info::m_current_bb_point): New member variable.
(likely_operand_match_p): Switch based on get_constraint_type,
rather than based on rtx code. Handle relaxed and special memory
constraints.
(early_ra::record_copy): Allow the source of an equivalence to be
assigned to more than once.
(early_ra::record_allocno_use): Invalidate any previous equivalence.
Initialize last_def_point.
(early_ra::record_allocno_def): Set last_def_point.
(early_ra::valid_equivalence_p): New function, split out from...
(early_ra::record_copy): ...here. Use last_def_point to handle
source registers that have a later definition.
(make_pass_aarch64_early_ra): Fix comment.
gcc/testsuite/
* gcc.target/aarch64/sme/strided_2.c: New test.
Tobias Burnus [Fri, 8 Dec 2023 14:18:25 +0000 (15:18 +0100)]
OpenMP/Fortran: Implement omp allocators/allocate for ptr/allocatables
This commit adds -fopenmp-allocators which enables support for
'omp allocators' and 'omp allocate' that are associated with a Fortran
allocate-stmt. If such a construct is encountered, an error is shown,
unless the -fopenmp-allocators flag is present.
With -fopenmp -fopenmp-allocators, those constructs get turned into
GOMP_alloc allocations, while -fopenmp-allocators (also without -fopenmp)
ensures deallocation and reallocation (via intrinsic assignments) are
properly directed to GOMP_free/omp_realloc; normal Fortran
allocations continue to be processed by free/realloc.
In order to distinguish 'malloc'ed from 'GOMP_alloc'ed memory, the
version field of the Fortran array descriptor is (mis)used: 0 indicates
a normal Fortran allocation while 1 denotes GOMP_alloc. For scalars,
there is record keeping in libgomp: GOMP_add_alloc(ptr) adds the
pointer address to a splay_tree, while GOMP_is_alloc(ptr) returns
true if it was previously added and also removes it from the list.
Besides the Fortran FE work, BUILT_IN_GOMP_REALLOC is now part of
omp-builtins.def and libgomp gains the two new functions mentioned above.
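A sketch of the array-descriptor discrimination described above (the
struct is a simplified stand-in, not the real descriptor layout, and
the GOMP_free prototype here is an assumption):

#include <stdlib.h>

struct descriptor_sketch
{
  void *base_addr;
  int version;   /* 0: normal Fortran allocation, 1: GOMP_alloc */
};

extern void GOMP_free (void *, unsigned long);  /* assumed signature */

static void
deallocate (struct descriptor_sketch *d)
{
  if (d->version == 1)
    GOMP_free (d->base_addr, 0);  /* allocated by GOMP_alloc */
  else
    free (d->base_addr);          /* allocated by malloc */
}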
Szabolcs Nagy [Fri, 29 Sep 2023 12:55:51 +0000 (13:55 +0100)]
libgcc: aarch64: Add SME unwinder support
To support the ZA lazy save scheme, the PCS requires the unwinder to
reset the SME state to PSTATE.SM=0, PSTATE.ZA=0, TPIDR2_EL0=0 on entry
to an exception handler. We use the __arm_za_disable SME runtime call
unconditionally to achieve this.
https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#exceptions
The hidden alias is used to avoid a PLT and to avoid inconsistent VPCS
marking (we don't rely on a special PCS at the call site). In the case of
static linking, the SME runtime init code is linked into code that raises
exceptions.
libgcc/ChangeLog:
* config/aarch64/__arm_za_disable.S: Add hidden alias.
* config/aarch64/aarch64-unwind.h: Reset the SME state before
EH return via the _Unwind_Frames_Extra hook.
Szabolcs Nagy [Tue, 15 Nov 2022 14:08:55 +0000 (14:08 +0000)]
libgcc: aarch64: Add SME runtime support
The call ABI for SME (Scalable Matrix Extension) requires a number of
helper routines which are added to libgcc so they are tied to the
compiler version instead of the libc version. See
https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#sme-support-routines
The routines are in shared libgcc and static libgcc eh, even though
they are not related to exception handling. This is to avoid linking
a copy of the routines into dynamically linked binaries, because the
TPIDR2_EL0 block can be extended in the future, which is better handled
in a single place per process.
The support routines have to decide if SME is accessible or not. Linux
tells userspace if SME is accessible via AT_HWCAP2; otherwise a new
__aarch64_sme_accessible symbol was introduced that a libc can define.
Due to the libgcc and libc build order, the symbol availability cannot be
checked, so for __aarch64_sme_accessible a unistd.h feature test macro
is used. No such detection mechanism is available for __getauxval,
so we rely on configure checks based on the target triplet.
Asm helper code is added to make writing the routines easier.
libgcc/ChangeLog:
* config/aarch64/t-aarch64: Add sources to the build.
* config/aarch64/__aarch64_have_sme.c: New file.
* config/aarch64/__arm_sme_state.S: New file.
* config/aarch64/__arm_tpidr2_restore.S: New file.
* config/aarch64/__arm_tpidr2_save.S: New file.
* config/aarch64/__arm_za_disable.S: New file.
* config/aarch64/aarch64-asm.h: New file.
* config/aarch64/libgcc-sme.ver: New file.
Szabolcs Nagy [Mon, 4 Dec 2023 10:52:52 +0000 (10:52 +0000)]
libgcc: aarch64: Configure check for __getauxval
Add a configure check for the __getauxval ABI symbol, which is always
available on aarch64 glibc, and may be available on other Linux C
runtimes. For now it is only enabled on glibc; others have to override it
via target_configargs=libgcc_cv_have___getauxval=yes
This is deliberately obscure as it should be auto-detected, ideally
via a feature test macro in unistd.h (link-time detection is not
possible since the libc may not be installed at libgcc build time),
but currently there is no such feature test mechanism.
Without __getauxval, libgcc cannot do runtime CPU feature detection
and has to assume that only the features known at build time are available.
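A sketch of the kind of runtime detection this enables (the AT_HWCAP
value and the __getauxval prototype are as provided by glibc):

extern unsigned long __getauxval (unsigned long type);
#define AT_HWCAP 16

static int
cpu_has (unsigned long hwcap_bit)
{
  return (__getauxval (AT_HWCAP) & hwcap_bit) != 0;
}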
Richard Biener [Fri, 8 Dec 2023 08:14:43 +0000 (09:14 +0100)]
tree-optimization/112909 - uninit diagnostic with abnormal copy
The following avoids spurious uninit diagnostics for SSA name
copies, which mostly appear when the source is marked as abnormal,
preventing copy propagation.
To prevent regressions I remove the bail-out for anonymous SSA
names in the PHI argument position from warn_uninitialized_phi, leaving
that to warn_uninit, where I handle SSA copies from an SSA name
which isn't anonymous. In theory this might cause both more
valid and more false positive diagnostics to pop up.
PR tree-optimization/112909
* tree-ssa-uninit.cc (find_uninit_use): Look through a
single level of SSA name copies with single use.
Jiahao Xu [Wed, 29 Nov 2023 03:16:59 +0000 (11:16 +0800)]
LoongArch: Fix lsx-vshuf.c and lasx-xvshuf_b.c tests fail on LA664 [PR112611]
For [x]vshuf instructions, if the index value in the selector exceeds 63, it triggers
undefined behavior on LA464, but not on LA664. To ensure compatibility of these two
tests on both LA464 and LA664, we have modified both tests to ensure that the index
value in the selector does not exceed 63.
gcc/testsuite/ChangeLog:
PR target/112611
* gcc.target/loongarch/vector/lasx/lasx-xvshuf_b.c: Ensure the index is less than 64.
* gcc.target/loongarch/vector/lsx/lsx-vshuf.c: Ditto.
Jiahao Xu [Wed, 6 Dec 2023 07:04:53 +0000 (15:04 +0800)]
LoongArch: Vectorized loop unrolling is disabled for divf/sqrtf/rsqrtf when -mrecip is enabled.
Using -mrecip generates a sequence of instructions to replace divf, sqrtf and rsqrtf. The number
of generated instructions is close to or exceeds the maximum number of instructions issued per
cycle on LoongArch, so vectorized loop unrolling is not performed on them.
gcc/ChangeLog:
* config/loongarch/loongarch.cc (loongarch_vector_costs::determine_suggested_unroll_factor):
If m_has_recip is true, return an unroll factor of 1.
(loongarch_vector_costs::add_stmt_cost): Detect the use of approximate instruction sequence.
Jiahao Xu [Wed, 6 Dec 2023 07:04:52 +0000 (15:04 +0800)]
LoongArch: New options -mrecip and -mrecip= with ffast-math.
When both the -mrecip and -mfrecipe options are enabled, use approximate reciprocal
instructions and approximate reciprocal square root instructions with additional
Newton-Raphson steps to implement single precision floating-point division, square
root and reciprocal square root operations, for a better performance.
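For reference, the scalar form of the Newton-Raphson refinement the
expansion pairs with the approximate instructions (a sketch; the real
code emits the vector equivalents):

/* One refinement step each; x0 is the hardware approximation.  */
static float
refine_recip (float d, float x0)   /* x1 converges to 1/d */
{
  return x0 * (2.0f - d * x0);
}

static float
refine_rsqrt (float d, float x0)   /* x1 converges to 1/sqrt(d) */
{
  return x0 * (1.5f - 0.5f * d * x0 * x0);
}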
gcc/ChangeLog:
* config/loongarch/genopts/loongarch.opt.in (recip_mask): New variable.
(-mrecip, -mrecip=): New options.
* config/loongarch/lasx.md (div<mode>3): New expander.
(*div<mode>3): Rename.
(sqrt<mode>2): New expander.
(*sqrt<mode>2): Rename.
(rsqrt<mode>2): New expander.
* config/loongarch/loongarch-protos.h (loongarch_emit_swrsqrtsf): New prototype.
(loongarch_emit_swdivsf): Ditto.
* config/loongarch/loongarch.cc (loongarch_option_override_internal): Set
recip_mask for -mrecip and -mrecip= options.
(loongarch_emit_swrsqrtsf): New function.
(loongarch_emit_swdivsf): Ditto.
* config/loongarch/loongarch.h (RECIP_MASK_NONE, RECIP_MASK_DIV, RECIP_MASK_SQRT,
RECIP_MASK_RSQRT, RECIP_MASK_VEC_DIV, RECIP_MASK_VEC_SQRT, RECIP_MASK_VEC_RSQRT,
RECIP_MASK_ALL): New bitmasks.
(TARGET_RECIP_DIV, TARGET_RECIP_SQRT, TARGET_RECIP_RSQRT, TARGET_RECIP_VEC_DIV,
TARGET_RECIP_VEC_SQRT, TARGET_RECIP_VEC_RSQRT): New tests.
* config/loongarch/loongarch.md (sqrt<mode>2): New expander.
(*sqrt<mode>2): Rename.
(rsqrt<mode>2): New expander.
* config/loongarch/loongarch.opt (recip_mask): New variable.
(-mrecip, -mrecip=): New options.
* config/loongarch/lsx.md (div<mode>3): New expander.
(*div<mode>3): Rename.
(sqrt<mode>2): New expander.
(*sqrt<mode>2): Rename.
(rsqrt<mode>2): New expander.
* config/loongarch/predicates.md (reg_or_vecotr_1_operand): New predicate.
* doc/invoke.texi (LoongArch Options): Document new options.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/divf.c: New test.
* gcc.target/loongarch/recip-divf.c: New test.
* gcc.target/loongarch/recip-sqrtf.c: New test.
* gcc.target/loongarch/sqrtf.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-divf.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-recip-divf.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-recip-sqrtf.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-recip.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-sqrtf.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-divf.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-recip-divf.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-recip-sqrtf.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-recip.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-sqrtf.c: New test.
Jiahao Xu [Wed, 6 Dec 2023 07:04:51 +0000 (15:04 +0800)]
LoongArch: Redefine pattern for xvfrecip/vfrecip instructions.
Redefine the pattern for [x]vfrecip instructions to use an rtx code instead of an unspec,
and enable [x]vfrecip instructions to be generated during auto-vectorization.
gcc/ChangeLog:
* config/loongarch/lasx.md (lasx_xvfrecip_<flasxfmt>): Renamed to ..
(recip<mode>3): .. this.
* config/loongarch/loongarch-builtins.cc (CODE_FOR_lsx_vfrecip_d): Redefine
to new pattern name.
(CODE_FOR_lsx_vfrecip_s): Ditto.
(CODE_FOR_lasx_xvfrecip_d): Ditto.
(CODE_FOR_lasx_xvfrecip_s): Ditto.
(loongarch_expand_builtin_direct): For the vector recip instructions, construct a
temporary parameter const1_vector.
* config/loongarch/lsx.md (lsx_vfrecip_<flsxfmt>): Renamed to ..
(recip<mode>3): .. this.
* config/loongarch/predicates.md (const_vector_1_operand): New predicate.
Jiahao Xu [Wed, 6 Dec 2023 07:04:49 +0000 (15:04 +0800)]
LoongArch: Add support for LoongArch V1.1 approximate instructions.
This patch adds define_insn/builtins/intrinsics for these instructions, and adds the option
-mfrecipe to control instruction generation.
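Usage sketch for the new scalar intrinsics (compile with -mfrecipe; the
float/double prototypes are assumptions based on the _s/_d suffixes):

#include <larchintrin.h>

float
approx_recip_s (float x)
{
  return __frecipe_s (x);   /* approximate 1.0f / x */
}

double
approx_rsqrt_d (double x)
{
  return __frsqrte_d (x);   /* approximate 1.0 / sqrt (x) */
}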
gcc/ChangeLog:
* config/loongarch/genopts/isa-evolution.in (frecipe): Add.
* config/loongarch/larchintrin.h (__frecipe_s): New intrinsic.
(__frecipe_d): Ditto.
(__frsqrte_s): Ditto.
(__frsqrte_d): Ditto.
* config/loongarch/lasx.md (lasx_xvfrecipe_<flasxfmt>): New insn pattern.
(lasx_xvfrsqrte_<flasxfmt>): Ditto.
* config/loongarch/lasxintrin.h (__lasx_xvfrecipe_s): New intrinsic.
(__lasx_xvfrecipe_d): Ditto.
(__lasx_xvfrsqrte_s): Ditto.
(__lasx_xvfrsqrte_d): Ditto.
* config/loongarch/loongarch-builtins.cc (AVAIL_ALL): Add predicates.
(LSX_EXT_BUILTIN): New macro.
(LASX_EXT_BUILTIN): Ditto.
* config/loongarch/loongarch-cpucfg-map.h: Regenerate.
* config/loongarch/loongarch-c.cc: Add builtin macro "__loongarch_frecipe".
* config/loongarch/loongarch-def.cc: Regenerate.
* config/loongarch/loongarch-str.h (OPTSTR_FRECIPE): Regenerate.
* config/loongarch/loongarch.cc (loongarch_asm_code_end): Dump status for TARGET_FRECIPE.
* config/loongarch/loongarch.md (loongarch_frecipe_<fmt>): New insn pattern.
(loongarch_frsqrte_<fmt>): Ditto.
* config/loongarch/loongarch.opt: Regenerate.
* config/loongarch/lsx.md (lsx_vfrecipe_<flsxfmt>): New insn pattern.
(lsx_vfrsqrte_<flsxfmt>): Ditto.
* config/loongarch/lsxintrin.h (__lsx_vfrecipe_s): New intrinsic.
(__lsx_vfrecipe_d): Ditto.
(__lsx_vfrsqrte_s): Ditto.
(__lsx_vfrsqrte_d): Ditto.
* doc/extend.texi: Add documentation for LoongArch new builtins and intrinsics.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/larch-frecipe-builtin.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-frecipe-builtin.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-frecipe-builtin.c: New test.
Richard Biener [Wed, 6 Dec 2023 10:33:10 +0000 (11:33 +0100)]
Shrink out-of-SSA dump
The following removes the second GIMPLE function dump after
remove_ssa_form, which used to rewrite the IL with the coalescing
result but has not done so for a long time.
* tree-outof-ssa.cc (rewrite_out_of_ssa): Dump GIMPLE once only,
after final IL adjustments.
Pan Li [Fri, 8 Dec 2023 06:48:48 +0000 (14:48 +0800)]
RISC-V: Fix ICE for incorrect mode attr in V_F2DI_CONVERT_BRIDGE
The mode attr V_F2DI_CONVERT_BRIDGE is designed to convert a floating-point
mode to the widened floating-point mode, but we took (RVVM1HF "RVVM2SI") by
mistake.
This patch fixes it by replacing (RVVM1HF "RVVM2SI") with
(RVVM1HF "RVVM2SF") as designed.
gcc/ChangeLog:
* config/riscv/vector-iterators.md: Replace RVVM2SI with RVVM2SF
for mode attr V_F2DI_CONVERT_BRIDGE.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/unop/math-lroundf16-rv64-ice-1.c: New test.
Jiahao Xu [Fri, 17 Nov 2023 09:00:21 +0000 (17:00 +0800)]
LoongArch: Add support for xorsign.
This patch adds support for the xorsign pattern for scalar fp and vectors. The
new expanders uniformly use vector bitwise logical operations to handle xorsign.
On LoongArch64, floating-point registers and vector registers share the same register
file, so this patch also allows conversion between LSX vector modes and scalar fp modes to
avoid unnecessary instruction generation.
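What the pattern computes, modeled in scalar C (a sketch; the expanders
perform the same XOR with the sign bit using vector bitwise operations):

#include <stdint.h>
#include <string.h>

/* xorsign (x, y) = x * copysign (1.0, y), done as bit twiddling.  */
static float
xorsign (float x, float y)
{
  uint32_t xb, yb;
  memcpy (&xb, &x, sizeof xb);
  memcpy (&yb, &y, sizeof yb);
  xb ^= yb & 0x80000000u;   /* flip x's sign when y is negative */
  memcpy (&x, &xb, sizeof x);
  return x;
}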
gcc/ChangeLog:
* config/loongarch/lasx.md (xorsign<mode>3): New expander.
* config/loongarch/loongarch.cc (loongarch_can_change_mode_class): Allow
conversion between LSX vector mode and scalar fp mode.
* config/loongarch/loongarch.md (@xorsign<mode>3): New expander.
* config/loongarch/lsx.md (@xorsign<mode>3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/vector/lasx/lasx-xorsign-run.c: New test.
* gcc.target/loongarch/vector/lasx/lasx-xorsign.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-xorsign-run.c: New test.
* gcc.target/loongarch/vector/lsx/lsx-xorsign.c: New test.
* gcc.target/loongarch/xorsign-run.c: New test.
* gcc.target/loongarch/xorsign.c: New test.
Jakub Jelinek [Fri, 8 Dec 2023 08:03:18 +0000 (09:03 +0100)]
lower-bitint: Avoid merging non-mergeable stmt with cast and mergeable stmt [PR112902]
Before bitint lowering, the IL has:
b.0_1 = b;
_2 = -b.0_1;
_3 = (unsigned _BitInt(512)) _2;
a.1_4 = a;
a.2_5 = (unsigned _BitInt(512)) a.1_4;
_6 = _3 * a.2_5;
on the first function. Now, gimple_lower_bitint has an optimization
(when not at -O0) that avoids assigning underlying VAR_DECLs for certain
SSA_NAMEs where it is possible to lower them in a single loop (or
straight-line code) rather than in multiple loops.
So, e.g. the multiplication above uses handle_operand_addr, which can deal
with INTEGER_CST arguments and loads, but also casts, so it is fine
not to assign an underlying VAR_DECL for SSA_NAMEs a.1_4 and a.2_5, as
the multiplication can handle them fine.
The more problematic case is the other multiplication operand.
It is again a result of a (in this case narrowing) cast, so it is fine
not to assign VAR_DECL for _3. Normally we can merge the load (b.0_1)
with the negation (_2) and even with the following cast (_3). If _3
was used in a mergeable operation like addition, subtraction, negation,
&|^ or equality comparison, all of b.0_1, _2 and _3 could be without
underlying VAR_DECLs.
The problem is that the current code does that even when the cast is used
by a non-mergeable operation, and handle_operand_addr certainly can't handle
the mergeable operations feeding the rhs1 of the cast: for multiplication
we don't emit any loop in which it could appear, and for other operations like
shifts or non-equality comparisons we emit loops, but either in the reverse
direction or with unpredictable indexes (for shifts).
So, in order to lower the above correctly, we need to have an underlying
VAR_DECL for either _2 or _3; if we choose _2, then the load and negation
would be done in one loop and extension handled as part of the
multiplication, if we choose _3, then the load, negation and cast are done
in one loop and the multiplication just uses the underlying VAR_DECL
computed by that.
It is far easier to do this for _3, which is what the following patch
implements.
It actually already had code for most of it, it just did that for widening
casts only (optimize unless the cast rhs1 is not a SSA_NAME, or is a SSA_NAME
defined in some other bb, or with more than one use, etc.).
With this patch we fall through into such code even for narrowing or
same-precision casts, unless the cast is used in a mergeable operation.
2023-12-08 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/112902
* gimple-lower-bitint.cc (gimple_lower_bitint): For a narrowing
or same precision cast don't set SSA_NAME_VERSION in m_names only
if use_stmt is mergeable_op or fall through into the check that
use is a store or rhs1 is not mergeable or other reasons prevent
merging.
Jakub Jelinek [Fri, 8 Dec 2023 08:02:15 +0000 (09:02 +0100)]
vr-values: Avoid ICEs on large _BitInt cast to floating point [PR112901]
For casts from integers to floating point,
simplify_float_conversion_using_ranges uses SCALAR_INT_TYPE_MODE
and queries optabs for the optimization it wants to make.
That doesn't really work for large/huge BITINT_TYPE: those have BLKmode,
which is not a scalar int mode, and querying an optab is not useful for
them either.
I think it is best to just skip this optimization for those bitints,
after all, bitint lowering uses ranges already to determine minimum
precision for bitint operands of the integer to float casts.
2023-12-08 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/112901
* vr-values.cc
(simplify_using_ranges::simplify_float_conversion_using_ranges):
Return false if rhs1 has BITINT_TYPE type with BLKmode TYPE_MODE.
Jakub Jelinek [Fri, 8 Dec 2023 07:56:33 +0000 (08:56 +0100)]
haifa-sched: Avoid overflows in extend_h_i_d [PR112411]
On Thu, Dec 07, 2023 at 09:36:23AM +0100, Jakub Jelinek wrote:
> Without the dg-skip-if I got on 64-bit host with
> -O3 --param min-nondebug-insn-uid=0x40000000:
> cc1: out of memory allocating 571230784744 bytes after a total of 2772992 bytes
I've looked at this and the problem is in haifa-sched.cc:
9047 h_i_d.safe_grow_cleared (3 * get_max_uid () / 2, true);
get_max_uid () is 0x4000024d with the --param min-nondebug-insn-uid=0x40000000
and so 3 * get_max_uid () / 2 actually overflows to -536870028, but as vec.h
then treats the value as unsigned, it attempts to allocate
0xe0000374U * 152UL bytes, i.e. those 532GB. If the above is fixed to do
3U * get_max_uid () / 2 instead, it will get slightly better and will only
need 0x60000373U * 152UL bytes, i.e. 228GB.
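The overflow in isolation (a demo only; the cast computes the wrapped
signed value via unsigned arithmetic so that the demo itself avoids
signed-overflow undefined behavior):

#include <stdio.h>

int
main (void)
{
  unsigned int max_uid = 0x4000024dU;
  int wrapped = (int) (3U * max_uid);   /* what 3 * get_max_uid () wraps to */
  printf ("%d\n", wrapped / 2);         /* -536870028 */
  printf ("0x%x\n", 3U * max_uid / 2);  /* 0x60000373 */
  return 0;
}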
Perhaps more could be helped by making the vector indirect (containing pointers
to haifa_insn_data_def rather than the structures themselves) and pool allocating
those, but the more important question is how sparse uids are in normal
compilations without those large --param min-nondebug-insn-uid= parameters.
Because if they aren't sparse, such a change would increase compile-time memory
just to help the unusual case.
2023-12-08 Jakub Jelinek <jakub@redhat.com>
PR middle-end/112411
* haifa-sched.cc (extend_h_i_d): Use 3U instead of 3 in
3 * get_max_uid () / 2 calculation.
Lulu Cheng [Fri, 1 Dec 2023 03:51:51 +0000 (11:51 +0800)]
LoongArch: Remove the definition of ISA_BASE_LA64V110 from the code.
The instructions defined in the LoongArch Reference Manual v1.1 do not form a
v1.1 version of the instruction set. CPUs defined later may only support some
instructions in the LoongArch Reference Manual v1.1. Therefore, the macro
ISA_BASE_LA64V110 and related definitions are removed here.
gcc/ChangeLog:
* config/loongarch/genopts/loongarch-strings: Delete STR_ISA_BASE_LA64V110.
* config/loongarch/genopts/loongarch.opt.in: Likewise.
* config/loongarch/loongarch-cpu.cc (ISA_BASE_LA64V110_FEATURES): Delete macro.
(fill_native_cpu_config): Define a new variable hw_isa_evolution to record
the extended instruction set support read from cpucfg.
* config/loongarch/loongarch-def.cc: Set evolution at initialization.
* config/loongarch/loongarch-def.h (ISA_BASE_LA64V100): Delete.
(ISA_BASE_LA64V110): Likewise.
(N_ISA_BASE_TYPES): Likewise.
(defined): Likewise.
* config/loongarch/loongarch-opts.cc: Likewise.
* config/loongarch/loongarch-opts.h (TARGET_64BIT): Likewise.
(ISA_BASE_IS_LA64V110): Likewise.
* config/loongarch/loongarch-str.h (STR_ISA_BASE_LA64V110): Likewise.
* config/loongarch/loongarch.opt: Regenerate.
Xi Ruoyao [Fri, 1 Dec 2023 02:09:33 +0000 (10:09 +0800)]
LoongArch: Switch loongarch-def from C to C++ to make it possible.
We'll use HOST_WIDE_INT in LoongArch static properties in following patches.
To keep the same readability as C99 designated initializers, create a
std::array-like data structure with a position setter function, and add
field setter functions for structs used in loongarch-def.cc.
Remove unneeded guards #if
!defined(IN_LIBGCC2) && !defined(IN_TARGET_LIBS) && !defined(IN_RTS)
in loongarch-def.h and loongarch-opts.h.
gcc/ChangeLog:
* config/loongarch/loongarch-def.h: Remove extern "C".
(loongarch_isa_base_strings): Declare as loongarch_def_array
instead of plain array.
(loongarch_isa_ext_strings): Likewise.
(loongarch_abi_base_strings): Likewise.
(loongarch_abi_ext_strings): Likewise.
(loongarch_cmodel_strings): Likewise.
(loongarch_cpu_strings): Likewise.
(loongarch_cpu_default_isa): Likewise.
(loongarch_cpu_issue_rate): Likewise.
(loongarch_cpu_multipass_dfa_lookahead): Likewise.
(loongarch_cpu_cache): Likewise.
(loongarch_cpu_align): Likewise.
(loongarch_cpu_rtx_cost_data): Likewise.
(loongarch_isa): Add a constructor and field setter functions.
* config/loongarch/loongarch-opts.h (loongarch-defs.h): Do not
include for target libraries.
* config/loongarch/loongarch-opts.cc: Comment code that doesn't
run and causes compilation errors.
* config/loongarch/loongarch-tune.h (LOONGARCH_TUNE_H): Likewise.
(struct loongarch_rtx_cost_data): Likewise.
(struct loongarch_cache): Likewise.
(struct loongarch_align): Likewise.
* config/loongarch/t-loongarch: Compile loongarch-def.cc with the
C++ compiler.
* config/loongarch/loongarch-def-array.h: New file for a
std::array-like data structure with a position setter function.
* config/loongarch/loongarch-def.c: Rename to ...
* config/loongarch/loongarch-def.cc: ... here.
(loongarch_cpu_strings): Define as loongarch_def_array instead
of plain array.
(loongarch_cpu_default_isa): Likewise.
(loongarch_cpu_cache): Likewise.
(loongarch_cpu_align): Likewise.
(loongarch_cpu_rtx_cost_data): Likewise.
(loongarch_cpu_issue_rate): Likewise.
(loongarch_cpu_multipass_dfa_lookahead): Likewise.
(loongarch_isa_base_strings): Likewise.
(loongarch_isa_ext_strings): Likewise.
(loongarch_abi_base_strings): Likewise.
(loongarch_abi_ext_strings): Likewise.
(loongarch_cmodel_strings): Likewise.
(abi_minimal_isa): Likewise.
(loongarch_rtx_cost_optimize_size): Use field setter functions
instead of designated initializers.
(loongarch_rtx_cost_data): Implement default constructor.
Jakub Jelinek [Fri, 8 Dec 2023 07:29:44 +0000 (08:29 +0100)]
Add IntegerRange for -param=min-nondebug-insn-uid= and fix vector growing in LRA and vec [PR112411]
As documented, --param min-nondebug-insn-uid= is very useful in debugging
-fcompare-debug issues in RTL dumps; without it, it is really hard to
find differences. With it, DEBUG_INSNs generally use low INSN_UIDs
(1+) and non-DEBUG_INSNs use INSN_UIDs from the parameter up.
For good results, the parameter should be larger than the number of
DEBUG_INSNs in all or at least problematic functions, so I typically
use --param min-nondebug-insn-uid=10000 or --param
min-nondebug-insn-uid=1000.
The PR is about using --param min-nondebug-insn-uid=2147483647 (or
values slightly below it; similar behavior can be achieved with that
minus some epsilon): INSN_UIDs for the non-debug insns then wrap around
and, as they are signed, all kinds of things break. Obviously, that can
happen even without that option, but functions containing more than
2147483647 insns usually fail to compile much earlier by running out of memory.
As it is a debugging option, I'd prefer not to impose any drastically small
limits on it, because if a function has a lot of DEBUG_INSNs, it is useful
to still start above them; otherwise the allocation of uids will DTRT
even for DEBUG_INSNs, but there will then be differences in non-DEBUG_INSN
allocations.
So, the following patch uses a 0x40000000 limit: half the range for
DEBUG_INSNs and half for non-DEBUG_INSNs. It will still make overflows
very unlikely in the real world.
Note, using a large min-nondebug-insn-uid is very expensive for compile-time
memory and compile time, because DF as well as various RTL passes use
arrays indexed by INSN_UIDs, e.g. LRA with sizeof (void *) elements,
ditto df (df->insns).
Now, in LRA I've run into ICEs already with
--param min-nondebug-insn-uid=0x2aaaaaaa
on a 64-bit host. It uses custom vector management and wants to grow the
allocation 1.5x when growing, but all this computation is done in int,
so already 0x2aaaaaab * 3 / 2 + 1 overflows to a negative value. And
unlike the vec.cc growing code, which uses an unsigned int type for the above
(and the + 1 is not there), it also doesn't make sure, if there is an
overflow, that it allocates at least as much as needed; vec.cc
does
if ...
else
/* Grow slower when large. */
alloc = (alloc * 3 / 2);
/* If this is still too small, set it to the right size. */
if (alloc < desired)
alloc = desired;
so even if there is an overflow during the * 1.5 computation, as long as
desired is still representable in the range of the alloc counter
(31 bits in both vec.h and LRA), it doesn't grow exponentially but
at least works for the current value.
The patch now uses there:
lra_insn_recog_data_len = index * 3U / 2;
if (lra_insn_recog_data_len <= index)
lra_insn_recog_data_len = index + 1;
basically doing what vec.cc does. I thought we could do better for
both vec.cc and LRA on 64-bit hosts even without growing the allocated
counters, but now that I look at it again, perhaps we can't.
The above overflows already with an original alloc or lra_insn_recog_data_len
of 0x55555556, where 0x55555555 * 3U / 2 is still 0x7fffffff
and so representable in 32 bits, but 0x55555556 * 3U / 2 is
1. I thought that we could use alloc * (size_t) 3 / 2 so that on 64-bit
hosts it wouldn't overflow that quickly, but 0x55555556 * (size_t) 3 / 2
there is 0x80000001, which is still ok in unsigned, but given that vec.h
then stores the counter into an unsigned m_alloc:31; bit-field, it is too much.
With the lra.cc change, one can actually compile a simple function
with -O0 on a 64-bit host with --param min-nondebug-insn-uid=0x40000000
(i.e. the new limit), but it already needed quite a big part of my 32GB
RAM + 24GB swap.
The patch adds a dg-skip-if for that case though, because such an option
is way too much for 32-bit hosts even at -O0 with an empty function,
and with -O3 on a longer function it is too much for an average 64-bit host
as well. Without the dg-skip-if I got on a 64-bit host:
cc1: out of memory allocating 571230784744 bytes after a total of 2772992 bytes
and
cc1: out of memory allocating 1388 bytes after a total of 2002944 bytes
on a 32-bit host. A test requiring more than 532GB of RAM on 64-bit hosts
is just too much for our testsuite.
2023-12-08 Jakub Jelinek <jakub@redhat.com>
PR middle-end/112411
* params.opt (-param=min-nondebug-insn-uid=): Add
IntegerRange(0, 1073741824).
* lra.cc (check_and_expand_insn_recog_data): Use 3U rather than 3
in * 3 / 2 computation and if the result is smaller or equal to
index, use index + 1.
* gcc.dg/params/blocksort-part.c: Add dg-skip-if for
--param min-nondebug-insn-uid=1073741824.
Haochen Jiang [Fri, 10 Nov 2023 02:03:37 +0000 (10:03 +0800)]
i386: Mark Xeon Phi ISAs as deprecated
Since the Knights Landing and Knights Mill microarchitectures are EOL, we
would like to remove their support in GCC 15. In GCC 14, we will first
emit a warning for their usage.
gcc/ChangeLog:
* config/i386/driver-i386.cc (host_detect_local_cpu):
Do not append "-mno-" for Xeon Phi ISAs.
* config/i386/i386-options.cc (ix86_option_override_internal):
Emit a warning for KNL/KNM targets.
* config/i386/i386.opt: Emit a warning for Xeon Phi ISAs.
Hao Liu [Wed, 6 Dec 2023 06:52:19 +0000 (14:52 +0800)]
tree-optimization/112774: extend the SCEV CHREC tree with a nonwrapping flag
The flag is defined as CHREC_NOWRAP(tree); a chrec formerly dumped as
"{offset, +, 1}_1" is now dumped as "{offset, +, 1}<nw>_1" when the flag
is set (nw is short for nonwrapping).
Two SCEV interfaces, record_nonwrapping_chrec and nonwrapping_chrec_p, are
added to set and check the flag respectively.
As resetting the SCEV cache (i.e., the chrec trees) may not reset
loop->estimate_state, free_numbers_of_iterations_estimates is called
explicitly in loop vectorization to make sure the flag can be
calculated appropriately by niter analysis.
gcc/ChangeLog:
PR tree-optimization/112774
* tree-pretty-print.cc: If the nonwrapping flag is set, the chrec is
printed with additional <nw> info.
* tree-scalar-evolution.cc: Add record_nonwrapping_chrec and
nonwrapping_chrec_p to set and check the new flag respectively.
* tree-scalar-evolution.h: Likewise.
* tree-ssa-loop-niter.cc (idx_infer_loop_bounds,
infer_loop_bounds_from_pointer_arith, infer_loop_bounds_from_signedness,
scev_probably_wraps_p): Call record_nonwrapping_chrec before
record_nonwrapping_iv, call nonwrapping_chrec_p to check the flag is
set and return false from scev_probably_wraps_p.
* tree-vect-loop.cc (vect_analyze_loop): Call
free_numbers_of_iterations_estimates explicitly.
* tree-core.h: Document the nothrow_flag usage in CHREC_NOWRAP.
* tree.h: Add CHREC_NOWRAP(NODE); base.nothrow_flag is used to
represent the nonwrapping info.
David Malcolm [Fri, 8 Dec 2023 00:42:45 +0000 (19:42 -0500)]
analyzer: fix ICE for 2 bits before the start of base region [PR112889]
Concrete bindings were using -1 and -2 in the offset field to signify
deleted and empty hash slots, but these are valid offset values, leading to
assertion failures inside hash_map::put on a debug build, and probable
bugs in a release build.
Fix by using the size field rather than the offset.
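A simplified sketch of the sentinel scheme after the fix (field names
abbreviated from gcc/analyzer/store.h): the size is positive for every
valid binding, so it can carry the hash-slot markers, while the offset,
which may legitimately be -1 or -2, cannot:

struct binding_sketch
{
  long start_bit_offset;  /* -1 and -2 are valid offsets */
  long size_in_bits;      /* > 0 for any valid binding */
};

static void mark_deleted (struct binding_sketch *b) { b->size_in_bits = -1; }
static void mark_empty (struct binding_sketch *b) { b->size_in_bits = -2; }
static int is_deleted (const struct binding_sketch *b) { return b->size_in_bits == -1; }
static int is_empty (const struct binding_sketch *b) { return b->size_in_bits == -2; }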
gcc/analyzer/ChangeLog:
PR analyzer/112889
* store.h (concrete_binding::concrete_binding): Strengthen
assertion to require the size to be positive, rather than just
non-zero.
(concrete_binding::mark_deleted): Use size rather than start bit
offset.
(concrete_binding::mark_empty): Likewise.
(concrete_binding::is_deleted): Likewise.
(concrete_binding::is_empty): Likewise.
gcc/testsuite/ChangeLog:
PR analyzer/112889
* c-c++-common/analyzer/ice-pr112889.c: New test.
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
Juzhe-Zhong [Thu, 7 Dec 2023 22:09:10 +0000 (06:09 +0800)]
RISC-V: Support interleave vector with different step sequence
This patch fixes 64 ICEs in full coverage testing, since they all happen for the same reason.
Before this patch:
internal compiler error: in expand_const_vector, at config/riscv/riscv-v.cc:1270
appears 400 times in the full coverage testing report.
The root cause is that we didn't support interleaved vectors with different steps.
Here is the story:
We already supported interleaving with a single common step, that is:
e.g. v = { 0, 100, 2, 102, 4, 104, ... }
This sequence can be interpreted as an interleaving of 2 separate sequences:
sequence1 = { 0, 2, 4, ... } and sequence2 = { 100, 102, 104, ... }.
Their steps are both 2.
However, we didn't support interleaved vectors whose sequences have different
steps, which caused an ICE in such situations.
This patch supports interleaved vectors with different steps for the following 2 situations:
1. When the vector can be extended to a wider EEW:
Case 1: { 0, 0, 1, 0, 2, 0, ... }
It's interleaved from sequence1 = { 0, 1, 2, ... } and sequence2 = { 0, 0, 0, ... }.
Suppose the original vector can be extended to a wider EEW, e.g. mode = RVVM1SImode.
Then such an interleaved vector can be achieved with { 1, 2, 3, ... } with RVVM1DImode.
So, for this situation the codegen is pretty efficient and clean.
gcc/ChangeLog:
* config/riscv/riscv-protos.h (expand_vec_series): Adapt function.
* config/riscv/riscv-v.cc (rvv_builder::double_steps_npatterns_p): New function.
(expand_vec_series): Adapt function.
(expand_const_vector): Support new interleave vector with different step.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/slp-interleave-1.c: New test.
* gcc.target/riscv/rvv/autovec/slp-interleave-2.c: New test.
* gcc.target/riscv/rvv/autovec/slp-interleave-3.c: New test.
* gcc.target/riscv/rvv/autovec/slp-interleave-4.c: New test.
Jonathan Wakely [Thu, 7 Dec 2023 12:40:18 +0000 (12:40 +0000)]
libstdc++: Fix misleading typedef name in <format>
This local typedef for uintptr_t was accidentally named uint64_t,
probably from a careless code completion shortcut. We don't need the
typedef at all since it's only used once. Just use __UINTPTR_TYPE__
directly instead.
libstdc++-v3/ChangeLog:
* include/std/format (_Iter_sink<charT, contiguous_iterator>):
Remove uint64_t local type.
Jonathan Wakely [Thu, 7 Dec 2023 11:00:02 +0000 (11:00 +0000)]
libstdc++: Use <cstdint> instead of <stdint.h> in <bits/atomic_wait.h>
In r14-5922-g6c8f2d3a08bc01 I added <stdint.h> to <bits/atomic_wait.h>,
so that uintptr_t is declared if that header is compiled as a header
unit. I used <stdint.h> because that's what <atomic> already includes,
so it seemed simpler to be consistent. However, this means that name
lookup for uintptr_t in <bits/atomic_wait.h> depends on whether
<cstdint> has been included by another header first. Whether name lookup
finds std::uintptr_t or ::uintptr_t will depend on include order. This
causes problems when compiling modules with Clang:
bits/atomic_wait.h:251:7: error: 'std::__detail::__waiter_pool_base' has different definitions in different modules; first difference is defined here found method '_S_for' with body
_S_for(const void* __addr) noexcept
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bits/atomic_wait.h:251:7: note: but in 'tm.<global>' found method '_S_for' with different body
_S_for(const void* __addr) noexcept
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By including <cstdint> we would ensure that name lookup always finds the
name in namespace std. Alternatively, we can stop including <stdint.h>
for those types, so that we don't declare the entire contents of
<stdint.h> when we only need a couple of types from it. This patch does
the former, which is appropriate for backporting.
libstdc++-v3/ChangeLog:
* include/bits/atomic_wait.h: Include <cstdint> instead of
<stdint.h>.
Jonathan Wakely [Wed, 6 Dec 2023 17:21:29 +0000 (17:21 +0000)]
libstdc++: Fix recent changes to __glibcxx_assert [PR112882]
The changes in r14-6198-g5e8a30d8b8f4d7 were broken, as I used
_GLIBCXX17_CONSTEXPR for the 'if _GLIBCXX17_CONSTEXPR (true)' condition,
forgetting that it would also be used for the is_constant_evaluated()
check. Using 'if constexpr (std::is_constant_evaluated())' is a bug.
Additionally, relying on __glibcxx_assert_fail to give a "not a constant
expression" error is a problem because at -O0 an undefined reference to
__glibcxx_assert_fail is present in the compiled code. This means you
can't use libstdc++ headers without also linking to libstdc++ for the
symbol definition.
This fix rewrites the __glibcxx_assert macro again. This still avoids
doing the duplicate checks, once for constexpr and once at runtime (if
_GLIBCXX_ASSERTIONS is defined). When _GLIBCXX_ASSERTIONS is defined we
still rely on __glibcxx_assert_fail to give a "not a constant
expression" error during constant evaluation (because when assertions
are defined it's not a problem to emit a reference to the symbol). But
when that macro is not defined, we use a new inline (but not constexpr)
overload of __glibcxx_assert_fail to cause compilation to fail. That
inline function doesn't cause an undefined reference to a symbol in the
library (and will be optimized away anyway).
We can also add always_inline to the __is_constant_evaluated function,
although this doesn't actually matter for -O0 and it's always inlined
with any optimization enabled.
libstdc++-v3/ChangeLog:
PR libstdc++/112882
* include/bits/c++config (__is_constant_evaluated): Add
always_inline attribute.
(_GLIBCXX_DO_ASSERT): Remove macro.
(__glibcxx_assert): Define separately for assertions-enabled and
constexpr-only cases.
This pass adds a simple register allocator for FP & SIMD registers.
Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4
instructions, which require a very specific grouping structure,
and so would be difficult to exploit with general allocation.
The allocator is very simple. It gives up on anything that would
require spilling, or that it might not handle well for other reasons.
The allocator needs to track liveness at the level of individual FPRs.
Doing that fixes a lot of the PRs relating to redundant moves caused by
structure loads and stores. That particular problem is going to be
fixed more generally for GCC 15 by Lehua's RA patches.
However, the early-RA pass runs before scheduling, so it has a chance
to bag a spill-free allocation of vector code before the scheduler moves
things around. It could therefore still be useful for non-SME code
(e.g. for hand-scheduled ACLE code) even after Lehua's patches are in.
The pass is controlled by a tristate switch:
- -mearly-ra=all: run on all functions
- -mearly-ra=strided: run on functions that have access to strided registers
- -mearly-ra=none: don't run on any function
The patch makes -mearly-ra=all the default at -O2 and above for now.
We can revisit this for GCC 15 once Lehua's patches are in;
-mearly-ra=strided might then be more appropriate.
As said previously, the pass is very naive. There's much more that we
could do, such as handling invariants better. The main focus is on not
committing to a bad allocation, rather than on handling as much as
possible.
gcc/
PR rtl-optimization/106694
PR rtl-optimization/109078
PR rtl-optimization/109391
* config.gcc: Add aarch64-early-ra.o for AArch64 targets.
* config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
* config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
* config/aarch64/aarch64.opt (mearly_ra): New option.
* doc/invoke.texi: Document it.
* common/config/aarch64/aarch64-common.cc
(aarch_option_optimization_table): Use -mearly-ra=all by
default for -O2 and above.
* config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New pass.
* config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
(make_pass_aarch64_early_ra): Declare.
* config/aarch64/aarch64-sme.md (@aarch64_sme_lut<LUTI_BITS><mode>):
Add a stride_type attribute.
(@aarch64_sme_lut<LUTI_BITS><mode>_strided2): New pattern.
(@aarch64_sme_lut<LUTI_BITS><mode>_strided4): Likewise.
* config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
(svldnt1_impl::expand, svst1_impl::expand, svstnt1_impl::expand): Handle
new way of defining multi-register loads and stores.
* config/aarch64/aarch64-sve.md (@aarch64_ld1<SVE_FULLx24:mode>)
(@aarch64_ldnt1<SVE_FULLx24:mode>, @aarch64_st1<SVE_FULLx24:mode>)
(@aarch64_stnt1<SVE_FULLx24:mode>): Delete.
* config/aarch64/aarch64-sve2.md (@aarch64_<LD1_COUNT:optab><mode>)
(@aarch64_<LD1_COUNT:optab><mode>_strided2): New patterns.
(@aarch64_<LD1_COUNT:optab><mode>_strided4): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>_strided2): Likewise.
(@aarch64_<ST1_COUNT:optab><mode>_strided4): Likewise.
* config/aarch64/aarch64.cc (aarch64_strided_registers_p): New
function.
* config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
(UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
(UNSPEC_STNT1_SVE_COUNT): Likewise.
(stride_type): New attribute.
* config/aarch64/constraints.md (Uwd, Uwt): New constraints.
* config/aarch64/iterators.md (UNSPEC_LD1_COUNT, UNSPEC_LDNT1_COUNT)
(UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
(optab): Handle them.
(LD1_COUNT, ST1_COUNT): New iterators.
* config/aarch64/aarch64-early-ra.cc: New file.
gcc/testsuite/
PR rtl-optimization/106694
PR rtl-optimization/109078
PR rtl-optimization/109391
* gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
output test.
* gcc.target/aarch64/sve/shift_1.c: Allow reversed shifts for .s
as well as .d.
* gcc.target/aarch64/sme/strided_1.c: New test.
* gcc.target/aarch64/pr109078.c: Likewise.
* gcc.target/aarch64/pr109391.c: Likewise.
* gcc.target/aarch64/sve/pr106694.c: Likewise.
Ezra Sitorus [Thu, 7 Dec 2023 15:41:06 +0000 (15:41 +0000)]
arm: vld1_types_x4 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1 intrinsic for the arm port. This patch adds the
_x4 variants of the vld1 intrinsic.
The previous vld1_x4 has been updated to vld1q_x4 to take into
account that it works with 4-word-length types. vld1_x4 is now
only for 2-word-length types.
ISA documents:
https://developer.arm.com/documentation/ddi0487/latest/
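Usage sketch of one of the new variants (prototype per the ACLE; the
compile flags, e.g. -mfpu=neon -mfloat-abi=hard, are assumptions):

#include <arm_neon.h>

uint8x8x4_t
load_x4 (const uint8_t *p)
{
  return vld1_u8_x4 (p);   /* one call loads four D-register vectors */
}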
gcc/ChangeLog:
* config/arm/arm_neon.h
(vld1_u8_x4, vld1_u16_x4, vld1_u32_x4, vld1_u64_x4): New.
(vld1_s8_x4, vld1_s16_x4, vld1_s32_x4, vld1_s64_x4): New.
(vld1_f16_x4, vld1_f32_x4): New.
(vld1_p8_x4, vld1_p16_x4, vld1_p64_x4): New.
(vld1_bf16_x4): New.
(vld1q_types_x4): Updated to use vld1q_x4
from arm_neon_builtins.def
* config/arm/arm_neon_builtins.def
(vld1_x4): Updated entries.
(vld1q_x4): New entries, but they come from the old vld1_x4.
* config/arm/neon.md (neon_vld1q_x4<mode>):
Updated from neon_vld1_x4<mode>.
gcc/testsuite/ChangeLog:
* gcc.target/arm/simd/vld1_base_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_bf16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_fp16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_p64_xN_1.c: Add new tests.
Ezra Sitorus [Thu, 7 Dec 2023 15:41:05 +0000 (15:41 +0000)]
arm: vld1_types_x3 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1 intrinsic for the arm port. This patch adds the
_x3 variants of the vld1 intrinsic.
The previous vld1_x3 has been updated to vld1q_x3 to take into
account that it works with 4-word-length types. vld1_x3 is now
only for 2-word-length types.
ISA documents:
https://developer.arm.com/documentation/ddi0487/latest/
gcc/ChangeLog:
* config/arm/arm_neon.h
(vld1_u8_x3, vld1_u16_x3, vld1_u32_x3, vld1_u64_x3): New.
(vld1_s8_x3, vld1_s16_x3, vld1_s32_x3, vld1_s64_x3): New.
(vld1_f16_x3, vld1_f32_x3): New.
(vld1_p8_x3, vld1_p16_x3, vld1_p64_x3): New.
(vld1_bf16_x3): New.
(vld1q_types_x3): Updated to use vld1q_x3 from
arm_neon_builtins.def
* config/arm/arm_neon_builtins.def
(vld1_x3): Updated entries.
(vld1q_x3): New entries, but they come from the old vld1_x3.
* config/arm/neon.md (neon_vld1q_x3<mode>): Updated from
neon_vld1_x3<mode>.
gcc/testsuite/ChangeLog:
* gcc.target/arm/simd/vld1_base_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_bf16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_fp16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_p64_xN_1.c: Add new tests.
Ezra Sitorus [Thu, 7 Dec 2023 15:41:04 +0000 (15:41 +0000)]
arm: vld1_types_x2 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1 intrinsic for the arm port. This patch adds the
_x2 variants of the vld1 intrinsic.
The previous vld1_x2 has been updated to vld1q_x2 to take into
account that it works with 4-word-length types. vld1_x2 is now
only for 2-word-length types.
ISA documents:
https://developer.arm.com/documentation/ddi0487/latest/
gcc/ChangeLog:
* config/arm/arm_neon.h
(vld1_u8_x2, vld1_u16_x2, vld1_u32_x2, vld1_u64_x2): New.
(vld1_s8_x2, vld1_s16_x2, vld1_s32_x2, vld1_s64_x2): New.
(vld1_f16_x2, vld1_f32_x2): New.
(vld1_p8_x2, vld1_p16_x2, vld1_p64_x2): New.
(vld1_bf16_x2): New.
(vld1q_types_x2): Updated to use vld1q_x2 from
arm_neon_builtins.def
* config/arm/arm_neon_builtins.def
(vld1_x2): Updated entries.
(vld1q_x2): New entries, but they come from the old vld1_x2.
* config/arm/neon.md
(neon_vld1<VMEMX2_q>_x2<VDQX:mode>): Updated
from neon_vld1_x2<mode>.
gcc/testsuite/ChangeLog:
* gcc.target/arm/simd/vld1_base_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_bf16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_fp16_xN_1.c: Add new tests.
* gcc.target/arm/simd/vld1_p64_xN_1.c: Add new tests.
Ezra Sitorus [Thu, 7 Dec 2023 15:36:52 +0000 (15:36 +0000)]
arm: vst1q_types_x4 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1q intrinsic for the arm port. This patch adds the
_x4 variants of the vst1q intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:36:51 +0000 (15:36 +0000)]
arm: vst1q_types_x3 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1q intrinsic for the arm port. This patch adds the
_x3 variants of the vst1q intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:36:50 +0000 (15:36 +0000)]
arm: vst1q_types_x2 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1q intrinsic for the arm port. This patch adds the
_x2 variants of the vst1q intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:28:44 +0000 (15:28 +0000)]
arm: vst1_types_x4 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1 intrinsic for the arm port. This patch adds the
_x4 variants of the vst1 intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:28:43 +0000 (15:28 +0000)]
arm: vst1_types_x3 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1 intrinsic for the arm port. This patch adds the
_x3 variants of the vst1 intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:28:42 +0000 (15:28 +0000)]
arm: vst1_types_x2 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vst1 intrinsic for the arm port. This patch adds the
_x2 variants of the vst1 intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:21:56 +0000 (15:21 +0000)]
arm: vld1q_types_x4 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1q intrinsic for the arm port. This patch adds the
_x4 variants of the vld1q intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:21:55 +0000 (15:21 +0000)]
arm: vld1q_types_x3 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1q intrinsic for the arm port. This patch adds the
_x3 variants of the vld1q intrinsic.
Ezra Sitorus [Thu, 7 Dec 2023 15:21:54 +0000 (15:21 +0000)]
arm: vld1q_types_x2 ACLE intrinsics
This patch is part of a series of patches implementing the _xN
variants of the vld1q intrinsic for the arm port. This patch adds the
_x2 variants of the vld1q intrinsic.
Alexandre Oliva [Thu, 7 Dec 2023 15:58:20 +0000 (12:58 -0300)]
strub: enable conditional support
Targets that don't expose callee stacks to callers, such as nvptx, as
well as -fsplit-stack compilations, violate fundamental assumptions of
the current strub implementation. This patch enables targets to
disable strub, and disables it when -fsplit-stack is enabled.
When strub support is disabled, the testsuite will now skip strub
tests, and libgcc will not build the strub runtime components.
for gcc/ChangeLog
* target.def (have_strub_support_for): New hook.
* doc/tm.texi.in: Document it.
* doc/tm.texi: Rebuild.
* ipa-strub.cc: Include target.h.
(strub_target_support_p): New.
(can_strub_p): Call it. Test for no flag_split_stack.
(pass_ipa_strub::adjust_at_calls_call): Check for target
support.
* config/nvptx/nvptx.cc (TARGET_HAVE_STRUB_SUPPORT_FOR):
Disable.
* doc/sourcebuild.texi (strub): Document new effective
target.
Marc Poulhiès [Mon, 6 Nov 2023 11:01:17 +0000 (12:01 +0100)]
testsuite: refine gcc.dg/analyzer/fd-4.c test for newlib
Unlike with glibc, including stdio.h from newlib defines mode_t, which
conflicts with the test's type definition.
.../gcc/testsuite/gcc.dg/analyzer/fd-4.c:19:3: error: redefinition of typedef 'mode_t' with different type
...
.../include/sys/types.h:189:25: note: previous declaration of 'mode_t' with type 'mode_t' {aka 'unsigned int'}
Defining _MODE_T_DECLARED skips the type definition.
Gaius Mulley [Thu, 7 Dec 2023 13:10:49 +0000 (13:10 +0000)]
PR modula2/112893 detect procedure address incompatible with cardinal in iso
In ISO m2 the type cardinal is assignment incompatible with address (but
this is allowed in PIM). The patch also extends the type checker to include
procedures (which appear as having GetType () = address). At some point
this should be improved to use a pointer to proc type, perhaps in
the next stage1.
For now this will catch procedures being passed as actual parameters into
a formal cardinal parameter in ISO m2 (for example).
gcc/m2/ChangeLog:
PR modula2/112893
* gm2-compiler/M2Base.mod (Ass): Extend array to include proc row
and column. Allow PIM to assign cardinal variables to address
variables.
(Expr): Ditto.
(Comp): Ditto.
* gm2-compiler/M2Check.mod (getSType): New procedure function.
Replace all occurrences of GetSType with getSType.
* gm2-compiler/M2GenGCC.mod (CodeParam): Rewrite format specifier
error message.
* gm2-compiler/M2Quads.mod (CheckProcTypeAndProcedure): Add tokno
parameter.
* gm2-compiler/M2Range.def (InitTypesParameterCheck): Add tokno
parameter.
(InitParameterRangeCheck): Add tokno parameter.
Remove EXPORT QUALIFIED list.
* gm2-compiler/M2Range.mod (InitTypesParameterCheck): Add tokno
parameter and pass tokno to PutRangeParam.
(InitParameterRangeCheck): Add tokno parameter and pass tokno to
PutRangeParam.
(PutRangeParam): Add tokno parameter and assign to tokenNo.
(FoldTypeParam): Rewrite format string.
gcc/testsuite/ChangeLog:
PR modula2/112893
* gm2/iso/fail/proccard.mod: New test.
* gm2/pim/pass/proccard.mod: New test.
which is the operand changing according to location; such operands change in 2 locations instead of 1.
So regenerate the pattern for such instructions in AVL propagation to fix the ICEs.
gcc/ChangeLog:
* config/riscv/riscv-avlprop.cc (simplify_replace_avl): New function.
(simplify_replace_vlmax_avl): Fix bug.
* config/riscv/t-riscv: Add a new include file.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/vsetvl/avl_prop-2.c: New test.
RISC-V: xtheadmemidx: Document inline asm issue with memory constraint
The XTheadMemIdx support relies on the fact that memory operands that
can be expressed by XTheadMemIdx instructions will only appear as
operands of such instructions. For internal instruction generation
this is guaranteed by the implementation. However, in the case of inline
assembly, this guarantee is not given and we cannot differentiate
these two cases when printing the operand:
If XTheadMemIdx is enabled, then the address will be printed as if an
XTheadMemIdx instruction is emitted, which is obviously wrong in the
first case.
There might be solutions to handle this (e.g. using TARGET_MEM_CONSTRAINT
or extending the mnemonics to accept the standard operands for
XTheadMemIdx instructions), but let's document this behavior for now
as a known issue by adding xfail tests until we have an acceptable fix.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/xtheadmemidx-inline-asm-1.c: New test.
Reported-by: Jin Ma <jinma@linux.alibaba.com>
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
This results in the following assembler error:
Assembler messages:
Error: unrecognized opcode `th.lrd a5,a5,a0,0', extension `xtheadmemidx' required
There seems to be a (reasonable) assumption that addressing modes
for FP registers are compatible with those of GP registers.
We already ran into a similar issue during development of the
XTheadFMemIdx support patch, where we could trace the issue down to
the optimization splitters. Back then we simply disabled them in case
XTheadMemIdx was not available. But as it turned out, that was not
enough.
To ensure we won't see such issues anymore, let's make the support
for XTheadFMemIdx depend on XTheadMemIdx. I.e., if only XTheadFMemIdx
is available, then no instructions of this extension will be emitted.
While this looks a bit drastic at first view, it is the best practical
solution since XTheadFMemIdx without XTheadMemIdx does not exist in real
hardware and would be an odd thing to do.
gcc/ChangeLog:
* config/riscv/thead.cc (th_memidx_classify_address_index):
Require TARGET_XTHEADMEMIDX for FP modes.
* config/riscv/thead.md: Require TARGET_XTHEADMEMIDX for all
XTheadFMemIdx patterns.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/xtheadfmemidx-without-xtheadmemidx.c: New test.
Reported-by: Jin Ma <jinma@linux.alibaba.com>
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Jakub Jelinek [Thu, 7 Dec 2023 08:48:57 +0000 (09:48 +0100)]
testsuite: Add testcase for already fixed PR [PR111068]
This one unfortunately can't be bisected: it ICEd until r14-3430
inclusive, but r14-3431 removed the -mavx10.1-512 support, and when that
support was readded in r14-5607 the ICE no longer appeared.
I'm just committing the testcase so that it doesn't reappear.
2023-12-07 Jakub Jelinek <jakub@redhat.com>
PR target/111068
* gcc.target/i386/pr111068.c: New test.
Jakub Jelinek [Thu, 7 Dec 2023 08:47:54 +0000 (09:47 +0100)]
c-family: Fix up -fno-debug-cpp [PR111965]
As can be seen in the second testcase, -fno-debug-cpp is actually
implemented the same as -fdebug-cpp and so doesn't turn the debugging
off.
The following patch fixes that.
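A paraphrased sketch of the shape of the fix (the stub type and helper
are mine, not the actual c-opts.cc code):

  /* Stand-in for the cpp_opts structure; only the field we care about.  */
  struct cpp_options_stub { int debug; };

  /* Option handlers receive VALUE: 1 for -fdebug-cpp and 0 for
     -fno-debug-cpp.  Storing VALUE instead of a hard-coded 1 makes the
     negated form actually turn the output off.  */
  static void
  handle_fdebug_cpp (struct cpp_options_stub *opts, int value)
  {
    opts->debug = value;  /* before the fix: opts->debug = 1;  */
  }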
2023-12-07 Andrew Pinski <pinskia@gmail.com>
Jakub Jelinek <jakub@redhat.com>
PR preprocessor/111965
gcc/c-family/
* c-opts.cc (c_common_handle_option) <case OPT_fdebug_cpp>: Set
cpp_opts->debug to value rather than 1.
gcc/testsuite/
* gcc.dg/cpp/pr111965-1.c: New test.
* gcc.dg/cpp/pr111965-2.c: New test.
Jakub Jelinek [Thu, 7 Dec 2023 08:47:16 +0000 (09:47 +0100)]
expr: Handle BITINT_TYPE in count_type_elements [PR112881]
The following testcase ICEs during gimplification, because
count_type_elements doesn't handle BITINT_TYPE. It should handle it like
other integral types.
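The testcase is not quoted in this log; a hypothetical reproducer in
the same spirit might look like:

  /* An aggregate initializer containing a large _BitInt reaches
     count_type_elements during gimplification.  */
  struct S { _BitInt(575) b; };

  struct S
  f (void)
  {
    struct S s = { 0 };
    return s;
  }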
2023-12-07 Jakub Jelinek <jakub@redhat.com>
PR middle-end/112881
* expr.cc (count_type_elements): Handle BITINT_TYPE like INTEGER_TYPE.
Jakub Jelinek [Thu, 7 Dec 2023 08:46:38 +0000 (09:46 +0100)]
tree-ssa-dce: Fix up maybe_optimize_arith_overflow for BITINT_TYPE [PR112880]
The following testcase ICEs because maybe_optimize_arith_overflow
uses build_nonstandard_integer_type, which is inappropriate if
the type is a large BITINT_TYPE.
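The testcase is likewise not quoted here; a hypothetical reproducer
shape, assuming the usual trigger of an unused overflow flag:

  /* The overflow flag of the builtin is ignored, so DCE may rewrite
     the .SUB_OVERFLOW call into plain unsigned arithmetic, which
     requires an unsigned type for the large _BitInt.  */
  _BitInt(256)
  f (_BitInt(256) a, _BitInt(256) b)
  {
    _BitInt(256) r;
    __builtin_sub_overflow (a, b, &r);
    return r;
  }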
2023-12-07 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/112880
* tree-ssa-dce.cc (maybe_optimize_arith_overflow): Use
unsigned_type_for instead of conditionally calling
build_nonstandard_integer_type.
Jakub Jelinek [Thu, 7 Dec 2023 08:45:13 +0000 (09:45 +0100)]
testsuite: Fix up gcc.target/s390/pr96127.c test for modern C [PR96127]
I've noticed this test regressed on s390x-linux with the addition of the
switch-to-modern-C patchset. I haven't tried to reproduce the ICE, but as
it was a backend ICE and the FE used to add such casts itself after
emitting a warning (it now errors instead), I think this ought to keep
the testcase testing what was intended before.
2023-12-07 Jakub Jelinek <jakub@redhat.com>
PR target/96127
* gcc.target/s390/pr96127.c (c1): Add casts to long int *.
Alexandre Oliva [Thu, 7 Dec 2023 03:38:18 +0000 (00:38 -0300)]
analyzer: deal with -fshort-enums
On platforms that enable -fshort-enums by default, various switch-enum
analyzer tests fail, because apply_constraints_for_gswitch doesn't
expect the integral promotion type cast. I've arranged for the code
to cope with those casts.
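As an illustration (my example, not from the patch): with -fshort-enums
the enum may be char-sized, so the gswitch index is effectively wrapped
in an integral promotion cast.

  /* The switch index arrives as (int) c under -fshort-enums; the
     analyzer now looks through that cast when matching case values.  */
  enum color { RED, GREEN, BLUE };

  int
  classify (enum color c)
  {
    switch (c)
      {
      case RED:   return 0;
      case GREEN: return 1;
      default:    return 2;
      }
  }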
for gcc/analyzer/ChangeLog
* region-model.cc (has_nondefault_case_for_value_p): Take the
enumeration type as a parameter.
(region_model::apply_constraints_for_gswitch): Cope with
integral promotion type casts.
Alexandre Oliva [Thu, 7 Dec 2023 03:38:14 +0000 (00:38 -0300)]
libsupc++: try cxa_thread_atexit_impl at runtime
g++.dg/tls/thread_local-order2.C fails when the toolchain is built for
a platform that lacks __cxa_thread_atexit_impl, even if the program is
built and run using that toolchain on a (later) platform that offers
__cxa_thread_atexit_impl.
This patch adds runtime testing for __cxa_thread_atexit_impl on select
platforms (GNU variants, for starters) that support weak symbols.
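A minimal sketch of the technique, assuming weak-symbol support
(simplified relative to the actual libsupc++ change):

  /* Declare the glibc hook weak; its address is null at run time when
     the platform lacks it, so the fallback can be chosen dynamically
     rather than at build time.  */
  extern int __cxa_thread_atexit_impl (void (*) (void *), void *, void *)
    __attribute__ ((__weak__));

  static int
  have_thread_atexit_impl (void)
  {
    return __cxa_thread_atexit_impl != 0;
  }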
aarch64: rcpc3: Add relevant iterators to handle Neon intrinsics
The LDAP1 and STL1 Neon ACLE intrinsics, operating on 64-bit data
values, use single-lane (Vt.1D) or twin-lane (Vt.2D) SIMD register
configurations, in either the DI or DF modes. This leads to the need
for a mode iterator accounting for the V1DI, V1DF, V2DI and V2DF
modes.
This patch therefore introduces the new V12DIF mode iterator, used to
generate functions operating on signed 64-bit integer and float
values, and V12DUP for generating the unsigned and
polynomial-type counterparts. Along with this, we modify the
associated mode attributes accordingly in order to allow for the
implementation of the relevant backend patterns for the intrinsics.
gcc/ChangeLog:
* config/aarch64/iterators.md (V12DIF): New.
(V12DUP): Likewise.
(VEL): Add support for all V12DIF-associated modes.
(Vetype): Add support for V1DI and V1DF.
(Vel): Likewise.
Hongyu Wang [Sat, 2 Dec 2023 04:55:59 +0000 (12:55 +0800)]
[APX NDD] Support TImode shift for NDD
TImode shifts are split by splitter functions, which assume
operands[0] and operands[1] to be the same. For the NDD alternative this
assumption may not hold, so add split functions for NDD to emit the NDD
form instructions, and omit the handling of the !64bit target split.
Although the NDD form allows a memory src, for the post-reload splitter
there is no extra register available to accept an NDD form shift,
especially for shld/shrd. So only accept the register alternative for
the shift src under NDD.
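For illustration (my example), the affected shapes are 128-bit shifts
that go through the doubleword splitters:

  /* TImode shifts are split into word-sized sequences; with an NDD
     destination, operands[0] may differ from operands[1], which the
     new split functions handle.  */
  unsigned __int128
  shl (unsigned __int128 x, int n)
  {
    return x << n;
  }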
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_split_ashl_ndd): New
function to split NDD form lshift.
(ix86_split_rshift_ndd): Likewise for l/ashiftrt.
* config/i386/i386-protos.h (ix86_split_ashl_ndd): New
prototype.
(ix86_split_rshift_ndd): Likewise.
* config/i386/i386.md (ashl<mode>3_doubleword): Add NDD
alternative, call ndd split function when operands[0]
not equal to operands[1].
(define_split for doubleword lshift): Likewise.
(define_peephole for doubleword lshift): Likewise.
(<insn><mode>3_doubleword): Likewise for l/ashiftrt.
(define_split for doubleword l/ashiftrt): Likewise.
(define_peephole for doubleword l/ashiftrt): Likewise.
Hongyu Wang [Wed, 8 Nov 2023 08:04:26 +0000 (16:04 +0800)]
[APX NDD] Support APX NDD for cmove insns
gcc/ChangeLog:
* config/i386/i386.md (*mov<mode>cc_noc): Extend with new constraints
to support NDD.
(*movsicc_noc_zext): Likewise.
(*movsicc_noc_zext_1): Likewise.
(*movqicc_noc): Likewise.
Hongyu Wang [Tue, 7 Nov 2023 08:28:28 +0000 (16:28 +0800)]
[APX NDD] Support APX NDD for shld/shrd insns
For shld/shrd insns, the old patterns use match_dup 0 as the shift src and
+r*m as its constraint. To support NDD, we added new define_insns to handle
the NDD form pattern, with an extra input operand and a dest operand fixed
in a register.
Hongyu Wang [Tue, 31 Oct 2023 06:21:16 +0000 (14:21 +0800)]
[APX NDD] Support APX NDD for rotate insns
gcc/ChangeLog:
* config/i386/i386.md (*<insn><mode>3_1): Extend with a new
alternative to support NDD for SI/DI rotate, and adjust output
template.
(*<insn>si3_1_zext): Likewise.
(*<insn><mode>3_1): Likewise for QI/HI modes.
(rcrsi2): Likewise, and use nonimmediate_operand for operands[1]
to accept memory input for NDD alternative.
(rcrdi2): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/i386/apx-ndd.c: Add test for left/right rotate.
Hongyu Wang [Wed, 25 Oct 2023 08:26:49 +0000 (16:26 +0800)]
[APX NDD] Support APX NDD for right shift insns
Similar to left shift, right shifts do not need to omit $1 for the NDD form.
gcc/ChangeLog:
* config/i386/i386.md (ashr<mode>3_cvt): Extend with new
alternatives to support NDD, and adjust output templates.
(*ashr<mode>3_1): Likewise for SI/DI mode.
(*lshr<mode>3_1): Likewise.
(*<insn>si3_1_zext): Likewise.
(*ashr<mode>3_1): Likewise for QI/HI mode.
(*lshrqi3_1): Likewise.
(*lshrhi3_1): Likewise.
(<insn><mode>3_cmp): Likewise.
(*<insn><mode>3_cconly): Likewise.
(*ashrsi3_cvt_zext): Likewise, and use nonimmediate_operand for
operands[1] to accept memory input for NDD alternative.
(*highpartdisi2): Likewise.
(*<insn>si3_cmp_zext): Likewise.
(<insn><mode>3_carry): Likewise.
Hongyu Wang [Wed, 25 Oct 2023 07:07:29 +0000 (15:07 +0800)]
[APX NDD] Support APX NDD for left shift insns
For left shift, there is an optimization, TARGET_DOUBLE_WITH_ADD, under
which a shift by 1 can be optimized to an add. As the NDD form of add
requires the src operand to be a register (NDD cannot take 2 memory srcs),
we currently just keep using the NDD form shift instead of add.
The TARGET_SHIFT1 optimization tries to remove the constant 1 to use a
shorter opcode, but under NDD the assembler will automatically use the
short form whether $1 exists or not, so do not involve NDD with it.
The doubleword insns for left shift call ix86_expand_ashl, which assumes
all shift-related patterns have the same operands[0] and operands[1].
These patterns will be supported in a standalone patch.
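A small illustration (mine, not from the patch):

  /* A shift by 1: the legacy path may turn this into an add via
     TARGET_DOUBLE_WITH_ADD; with an NDD destination the shift form is
     kept, since the NDD add cannot take two memory sources.  */
  unsigned int
  dbl (unsigned int x)
  {
    return x << 1;
  }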
gcc/ChangeLog:
* config/i386/i386.md (*ashl<mode>3_1): Extend with new
alternatives to support NDD, limit the new alternative to
generate sal only, and adjust output template for NDD.
(*ashlsi3_1_zext): Likewise.
(*ashlhi3_1): Likewise.
(*ashlqi3_1): Likewise.
(*ashl<mode>3_cmp): Likewise.
(*ashlsi3_cmp_zext): Likewise, and use nonimmediate_operand for
operands[1] to accept memory input for NDD alternative.
(*ashl<mode>3_cconly): Likewise.
(*ashl<dwi>3_doubleword_highpart): Adjust codegen for NDD.
Kong Lingling [Fri, 19 May 2023 02:50:29 +0000 (10:50 +0800)]
[APX NDD] Support APX NDD for or/xor insn
Similar to the AND insn, two splitters need to be adjusted to prevent
misoptimization for NDD OR/XOR.
Also adjust *one_cmplsi2_2_zext and its corresponding splitter, which will
generate the xor insn.
gcc/ChangeLog:
* config/i386/i386.md (<code><mode>3): Add new alternative for NDD
and adjust output templates.
(*<code><mode>_1): Likewise.
(*<code>qi_1): Likewise.
(*notxor<mode>_1): Likewise.
(*<code>si_1_zext): Likewise.
(*notxorqi_1): Likewise.
(*<code><mode>_2): Likewise.
(*<code>si_2_zext): Likewise.
(*<code>si_2_zext_imm): Likewise.
(*<code>si_1_zext_imm): Likewise, and use nonimmediate_operand for
operands[1] to accept memory input for NDD alternative.
(*one_cmplsi2_2_zext): Likewise.
(define_split for *one_cmplsi2_2_zext): Use nonimmediate_operand for
operands[3].
(*<code><dwi>3_doubleword): Add NDD constraints, adopt '&' to NDD dest
and emit move for optimized case if operands[0] != operands[1] or
operands[4] != operands[5].
(define_split for QI highpart OR/XOR): Prohibit splitter to split NDD
form OR/XOR insn to <any_logic:code>qi_ext<mode>_3.
(define_split for QI strict_lowpart optimization): Prohibit splitter to
split NDD form AND insn to *<code><mode>3_1_slp.
Kong Lingling [Wed, 17 May 2023 09:20:37 +0000 (17:20 +0800)]
[APX NDD] Support APX NDD for and insn
For the NDD form AND insn, there are three splitter fixes after extending
the legacy patterns (a small illustration follows the list).
1. APX NDD does not support high QImode registers like ah, bh, ch, dh, so
optimization splitters that generate a highpart zero_extract for QImode
need to be prohibited under the NDD pattern.
2. The legacy AND insn uses the r/qm/L constraint, and a post-reload
splitter transforms it into a zero_extend move. For the NDD form AND the
splitter is not strict enough: it assumes such an AND has a const_int
operand matching the constraint "L", while the NDD form AND allows a
const_int with any QI value. Restrict the splitter condition to match the
"L" constraint, which strictly matches the zero-extend semantics.
3. The legacy AND insn adopts the r/0/Z constraint, and a splitter tries
to optimize such a form into a strict_lowpart QImode AND when the 7th bit
is not set. But the splitter would wrongly convert a non-zext NDD form AND
with a memory src: the strict_lowpart transform then matches alternative 1
of *<code><mode>_slp_1 and generates *movstrict<mode>_1, so the zext
semantics are omitted. This could leave the highpart of the dest uncleared
and generate wrong code. Disable the splitter when NDD is adopted and
operands[0] and operands[1] are not equal.
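An illustration of the zero-extend case in fix 2 (my example):

  /* x & 0xff matches the "L" constraint and may legitimately be
     rewritten as a zero-extending move; an NDD AND with an arbitrary
     QImode const_int must not take that splitter path.  */
  unsigned int
  low_byte (unsigned int x)
  {
    return x & 0xff;
  }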
gcc/ChangeLog:
* config/i386/i386.md (and<mode>3): Add NDD alternatives and adjust
output template.
(*anddi_1): Likewise.
(*and<mode>_1): Likewise.
(*andqi_1): Likewise.
(*andsi_1_zext): Likewise.
(*anddi_2): Likewise.
(*andsi_2_zext): Likewise.
(*andqi_2_maybe_si): Likewise.
(*and<mode>_2): Likewise.
(*and<dwi>3_doubleword): Add NDD alternative, adopt '&' to NDD dest and
emit move for optimized case if operands[0] not equal to operands[1].
(define_split for QI highpart AND): Prohibit splitter to split NDD
form AND insn to <any_logic:code>qi_ext<mode>_3.
(define_split for QI strict_lowpart optimization): Prohibit splitter to
split NDD form AND insn to *<code><mode>3_1_slp.
(define_split for zero_extend and optimization): Prohibit splitter to
split NDD form AND insn to zero_extend insn.
Kong Lingling [Mon, 22 May 2023 02:08:39 +0000 (10:08 +0800)]
[APX NDD] Support APX NDD for not insn
For *one_cmplsi2_2_zext, it will be split into xor, so its NDD form will be
added together with the xor NDD support.
gcc/ChangeLog:
* config/i386/i386.md (one_cmpl<mode>2): Add new constraints for NDD
and adjust output template.
(*one_cmpl<mode>2_1): Likewise.
(*one_cmplqi2_1): Likewise.
(*one_cmpl<dwi>2_doubleword): Likewise, and adopt '&' to NDD dest.
(*one_cmpl<mode>2_2): Likewise.
(*one_cmplsi2_1_zext): Likewise, and use nonimmediate_operand for
operands[1] to accept memory input for NDD alternative.
Kong Lingling [Fri, 19 May 2023 09:15:52 +0000 (17:15 +0800)]
[APX NDD] Support APX NDD for neg insn
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_unary_operator): Add use_ndd
parameter and adjust for NDD.
* config/i386/i386-protos.h: Add use_ndd parameter for
ix86_unary_operator_ok and ix86_expand_unary_operator.
* config/i386/i386.cc (ix86_unary_operator_ok): Add use_ndd parameter
and adjust for NDD.
* config/i386/i386.md (neg<mode>2): Add new constraint for NDD and
adjust output template.
(*neg<mode>_1): Likewise.
(*neg<dwi>2_doubleword): Likewise and adopt '&' to NDD dest.
(*neg<mode>_2): Likewise.
(*neg<mode>_ccc_1): Likewise.
(*neg<mode>_ccc_2): Likewise.
(*negsi_1_zext): Likewise, and use nonimmediate_operand for operands[1]
to accept memory input for NDD alternatives.
(*negsi_2_zext): Likewise.
Kong Lingling [Wed, 18 Jan 2023 07:51:23 +0000 (15:51 +0800)]
[APX NDD] Support APX NDD for sbb insn
Similar to *add<dwi>3_doubleword, operands[1] may not be equal to
operands[0], so an extra move and an earlyclobber are required.
gcc/ChangeLog:
* config/i386/i386.md (*sub<dwi>3_doubleword): Add new alternative for
NDD, adopt '&' modifier to NDD dest and emit move when operands[0] not
equal to operands[1].
(*sub<dwi>3_doubleword_zext): Likewise.
(*subv<dwi>4_doubleword): Likewise.
(*subv<dwi>4_doubleword_1): Likewise.
(*subv<mode>4_overflow_1): Add NDD alternatives and adjust output
templates.
(*subv<mode>4_overflow_2): Likewise.
(@sub<mode>3_carry): Likewise.
(*addsi3_carry_zext_0r): Likewise, and use nonimmediate_operand for
operands[1] to accept memory input for NDD alternative.
(*subsi3_carry_zext): Likewise.
(subborrow<mode>): Parse TARGET_APX_NDD to ix86_binary_operator_ok.
(subborrow<mode>_0): Likewise.
(*sub<mode>3_eq): Likewise.
(*sub<mode>3_ne): Likewise.
(*sub<mode>3_eq_1): Likewise.