git.ipfire.org Git - thirdparty/gcc.git/log

Use memcpy instead of memmove in __relocate_a_1

__relocate_a_1 is used to copy data after vector reizing. This can be done by memcpy
rather than memmove.

libstdc++-v3/ChangeLog:

PR middle-end/109849
* include/bits/stl_uninitialized.h (__relocate_a_1): Use memcpy instead
of memmove.

sra: SRA of non-escaped aggregates passed by reference to calls

PR109849 shows that a loop that heavily pushes and pops from a stack
implemented by a C++ std::vec results in slow code, mainly because the
vector structure is not split by SRA and so we end up in many loads
and stores into it.  This is because it is passed by reference
to (re)allocation methods and so needs to live in memory, even though
it does not escape from them and so we could SRA it if we
re-constructed it before the call and then separated it to distinct
replacements afterwards.

This patch does exactly that, first relaxing the selection of
candidates to also include those which are addressable but do not
escape and then adding code to deal with the calls.  The
micro-benchmark that is also the (scan-dump) testcase in this patch
runs twice as fast with it than with current trunk.  Honza measured
its effect on the libjxl benchmark and it almost closes the
performance gap between Clang and GCC while not requiring excessive
inlining and thus code growth.

The patch disallows creation of replacements for such aggregates which
are also accessed with a precision smaller than their size because I
have observed that this led to excessive zero-extending of data
leading to slow-downs of perlbench (on some CPUs).  Apart from this
case I have not noticed any regressions, at least not so far.

Gimple call argument flags can tell if an argument is unused (and then
we do not need to generate any statements for it) or if it is not
written to and then we do not need to generate statements loading
replacements from the original aggregate after the call statement.
Unfortunately, we cannot symmetrically use flags that an aggregate is
not read because to avoid re-constructing the aggregate before the
call because flags don't tell which what parts of aggregates were not
written to, so we load all replacements, and so all need to have the
correct value before the call.

This version of the patch also takes care to avoid attempts to modify
abnormal edges, something which was missing in the previosu version.

gcc/ChangeLog:

2023-11-23  Martin Jambor  <mjambor@suse.cz>

PR middle-end/109849
* tree-sra.cc (passed_by_ref_in_call): New.
(sra_initialize): Allocate passed_by_ref_in_call.
(sra_deinitialize): Free passed_by_ref_in_call.
(create_access): Add decl pool candidates only if they are not
already candidates.
(build_access_from_expr_1): Bail out on ADDR_EXPRs.
(build_access_from_call_arg): New function.
(asm_visit_addr): Rename to scan_visit_addr, change the
disqualification dump message.
(scan_function): Check taken addresses for all non-call statements,
including phi nodes.  Process all call arguments, including the static
chain, build_access_from_call_arg.
(maybe_add_sra_candidate): Relax need_to_live_in_memory check to allow
non-escaped local variables.
(sort_and_splice_var_accesses): Disallow smaller-than-precision
replacements for aggregates passed by reference to functions.
(sra_modify_expr): Use a separate stmt iterator for adding satements
before the processed statement and after it.
(enum out_edge_check): New type.
(abnormal_edge_after_stmt_p): New function.
(sra_modify_call_arg): New function.
(sra_modify_assign): Adjust calls to sra_modify_expr.
(sra_modify_function_body): Likewise, use sra_modify_call_arg to
process call arguments, including the static chain.

gcc/testsuite/ChangeLog:

2023-11-23  Martin Jambor  <mjambor@suse.cz>

PR middle-end/109849
* g++.dg/tree-ssa/pr109849.C: New test.
* g++.dg/tree-ssa/sra-eh-1.C: Likewise.
* gcc.dg/tree-ssa/pr109849.c: Likewise.
* gcc.dg/tree-ssa/sra-longjmp-1.c: Likewise.
* gfortran.dg/pr43984.f90: Added -fno-tree-sra to dg-options.

i386: Fix ICE with -fsplit-stack -mcmodel=large [PR112686]

For -mcmodel=large, we have to load function address to a register.

PR target/112686

gcc/ChangeLog:

* config/i386/i386.cc (ix86_expand_split_stack_prologue): Load
function address to a register for ix86_cmodel == CM_LARGE.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112686.c: New test.

OpenMP: Add -Wopenmp and use it

The new warning has two purposes: First, it makes clearer to the
user that it is about OpenMP and, secondly and more importantly,
it permits to use -Wno-openmp.

The newly added -Wopenmp is enabled by default and replaces the
'0' (always warning) in several OpenMP-related warning calls.
For code shared with OpenACC, it only uses OPT_Wopenmp for
'flag_openmp | flag_openmp_simd'.

gcc/c-family/ChangeLog:

* c.opt (Wopenmp): Add, enable by default.

gcc/c/ChangeLog:

* c-parser.cc (c_parser_omp_clause_num_threads,
c_parser_omp_clause_num_tasks, c_parser_omp_clause_grainsize,
c_parser_omp_clause_priority, c_parser_omp_clause_schedule,
c_parser_omp_clause_num_teams, c_parser_omp_clause_thread_limit,
c_parser_omp_clause_dist_schedule, c_parser_omp_depobj,
c_parser_omp_scan_loop_body, c_parser_omp_assumption_clauses):
Add OPT_Wopenmp to warning_at.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_omp_clause_dist_schedule,
cp_parser_omp_scan_loop_body, cp_parser_omp_assumption_clauses,
cp_parser_omp_depobj): Add OPT_Wopenmp to warning_at.
* semantics.cc (finish_omp_clauses): Likewise.

gcc/ChangeLog:

* doc/invoke.texi (-Wopenmp): Add.
* gimplify.cc (gimplify_omp_for): Add OPT_Wopenmp to warning_at.
* omp-expand.cc (expand_omp_ordered_sink): Likewise.
* omp-general.cc (omp_check_context_selector): Likewise.
* omp-low.cc (scan_omp_for, check_omp_nesting_restrictions,
lower_omp_ordered_clauses): Likewise.
* omp-simd-clone.cc (simd_clone_clauses_extract): Likewise.

gcc/fortran/ChangeLog:

* lang.opt (Wopenmp): Add, enabled by dafault and documented in C.
* openmp.cc (gfc_match_omp_declare_target, resolve_positive_int_expr,
resolve_nonnegative_int_expr, resolve_omp_clauses,
gfc_resolve_omp_do_blocks): Use OPT_Wopenmp with gfc_warning{,_now}.

arm: libgcc: provide implementations of __sync_synchronize

Prior to Armv6 there was no architected method to synchronize data
across processors.  Armv6 saw the first introduction of
multi-processor support, using a CP15 operation; but this was
deprecated in Armv7 and is not supported on m-profile devices of any
form.  Armv7 (and armv6-m) and later support data synchronization via
the DMB instruction.

This all leads to difficulties when linking programs as the user
generally needs to know which synchronization method is needed, but
there seems no easy way around this, when there are no OS-related
primitives available.

I've addressed this by adding multiple variants of __sync_synchronize
to libgcc, one for each of the above use cases.  I've named these
__sync_synchronize_none, __sync_synchronize_cp15dmb and
__sync_synchronize_dmb.  I've also added three specs files that can be
used to direct the linker to pick the appropriate implementation.
Using specs fragments for this is preferable to directing the user to
directly use --defsym as the latter has to be placed at the correct
position on the command line to be effective and the spec rule ensures
this automatically.

I've also added a default implementation of __sync_synchronize.  The
default implementation will use DMB if that is available in the target
ISA, or fall back to a nul-implementation if it isn't.  In the latter
case it will cause the linker (GNU LD) to emit a warning that
specifies how to pick a specific implementation.  I've chosen not to
permit this default to use the CP15 solution as that has been
deprecated.

libgcc:

* config.host (arm*-*-eabi* | arm*-*-rtems*):
Add arm/t-sync to the makefile rules.
* config/arm/lib1funcs.S (__sync_synchronize_none)
(__sync_synchronize_cp15dmb, __sync_synchronize_dmb)
(__sync_synchronize): New functions.
* config/arm/t-sync: New file.
* config/arm/sync-none.specs: Likewise.
* config/arm/sync-dmb.specs: Likewise.
* config/arm/sync-cp15dmb.specs: Likewise.

OpenMP: Accept argument to depobj's destroy clause

Since OpenMP 5.2, the destroy clause takes an depend argument as argument;
for the depobj directive, it the new argument is optional but, if present,
it must be identical to the directive's argument.

gcc/c/ChangeLog:

* c-parser.cc (c_parser_omp_depobj): Accept optionally an argument
to the destroy clause.

gcc/cp/ChangeLog:

* parser.cc (cp_parser_omp_depobj): Accept optionally an argument
to the destroy clause.

gcc/fortran/ChangeLog:

* openmp.cc (gfc_match_omp_depobj): Accept optionally an argument
to the destroy clause.

libgomp/ChangeLog:

* libgomp.texi (5.2 Impl. Status): An argument to the destroy clause
is now supported.

gcc/testsuite/ChangeLog:

* c-c++-common/gomp/depobj-3.c: New test.
* gfortran.dg/gomp/depobj-3.f90: New test.

c++: Allow exporting const-qualified namespace-scope variables [PR99232]

By [basic.link] p3.2.1, a non-template non-volatile const-qualified
variable is not necessarily internal linkage in a module declaration,
and rather may have module linkage (or external linkage if it is
exported, see p4.8).

PR c++/99232

gcc/cp/ChangeLog:

* decl.cc (grokvardecl): Don't mark variables attached to
modules as internal.

gcc/testsuite/ChangeLog:

* g++.dg/modules/pr99232_a.C: New test.
* g++.dg/modules/pr99232_b.C: New test.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

RISC-V: Fix inconsistency among all vectorization hooks

This patches 200+ ICEs exposed by testing with rv64gc_zve64d.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112694

The rootcause is we disallow poly (1,1) size vectorization in preferred_simd_mode.
with this following code:
- if (TARGET_MIN_VLEN < 128 && TARGET_MAX_LMUL < RVV_M2)
- return word_mode;

However, we allow poly (1,1) size in hook:
TARGET_VECTORIZE_RELATED_MODE
TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES

And also enables it in all vectorization patterns.

I was adding this into preferred_simd_mode because poly (1,1) size mode will cause
ICE in can_duplicate_and_interleave_p.

So, the alternative approach we need to block poly (1,1) size in both TARGET_VECTORIZE_RELATED_MODE
and TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES hooks and all vectorization patterns.
which is ugly approach and too much codes change.

Now, after investivation, I find it's nice that loop vectorizer can automatically block poly (1,1)
size vector in interleave vectorization with this commit:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=730909fa858bd691095bc23655077aa13b7941a9

So, we don't need to worry about ICE in interleave vectorization and allow poly (1,1) size vector
in vectorization which fixes 200+ ICEs in zve64d march.

PR target/112694

gcc/ChangeLog:

* config/riscv/riscv-v.cc (preferred_simd_mode): Allow poly_int (1,1) vectors.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112694-1.c: New test.

gcc: configure: drop Valgrind 3.1 compatibility

Our system.h and configure.ac try to accommodate valgrind-3.1, but it is
more than 15 years old at this point. As Valgrind-based checking is a
developer-oriented feature, drop the compatibility stuff and streamline
the detection.

gcc/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac: Delete manual checks for old Valgrind headers.
* system.h (VALGRIND_MAKE_MEM_NOACCESS): Delete.
(VALGRIND_MAKE_MEM_DEFINED): Delete.
(VALGRIND_MAKE_MEM_UNDEFINED): Delete.
(VALGRIND_MALLOCLIKE_BLOCK): Delete.
(VALGRIND_FREELIKE_BLOCK): Delete.

libcpp: configure: drop unused Valgrind detection

When top-level configure has either --enable-checking=valgrind or
--enable-valgrind-annotations, we want to activate a couple of workarounds
in libcpp. They do not use anything from the Valgrind API, so just
delete all detection.

libcpp/ChangeLog:

* config.in: Regenerate.
* configure: Regenerate.
* configure.ac (ENABLE_VALGRIND_CHECKING): Delete.
(ENABLE_VALGRIND_ANNOTATIONS): Rename to
ENABLE_VALGRIND_WORKAROUNDS. Delete Valgrind header checks.
* lex.cc (new_buff): Adjust for renaming.
(_cpp_free_buff): Ditto.

i386: Fix ICE during cbranchv16qi4 expansion [PR112681]

The following testcase ICEs, because cbranchv16qi4 expansion calls
ix86_expand_branch with op1 being a pre-AVX unaligned memory and
ix86_expand_branch emits a xorv16qi3 instruction without making sure
the operand predicates are satisfied.
While I could manually check if the argument (or both?) doesn't
match vector_operand predicate (apparently this one or bcst_vector_operand
is used in all integral 16+ bytes *xorv*3 instructions) force it into a
register, but as all gen_xorv*3 expanders call
ix86_expand_vector_logical_operator, it seems easier to just call that
function which ensures the right thing happens. Calling the individual
gen_xorv*3 functions would mean ugly switch on the modes and using high
level expand_simple_binop here seems too high level to me.

2023-11-24 Jakub Jelinek <jakub@redhat.com>

PR target/112681
* config/i386/i386-expand.cc (ix86_expand_branch): Use
ix86_expand_vector_logical_operator to expand vector XOR rather than
gen_rtx_SET on gen_rtx_XOR.

* gcc.target/i386/sse4-pr112681.c: New test.

rtl-ssa: Add some helpers for removing accesses

This adds some helpers to access-utils.h for removing accesses from an
access_array. This is needed by the upcoming aarch64 load/store pair
fusion pass.

gcc/ChangeLog:

* rtl-ssa/access-utils.h (filter_accesses): New.
(remove_regno_access): New.
(check_remove_regno_access): New.
* rtl-ssa/accesses.cc (rtl_ssa::remove_note_accesses_base): Use
new filter_accesses helper.

rtl-ssa: Support for inserting new insns

The upcoming aarch64 load pair pass needs to form store pairs, and can
re-order stores over loads when alias analysis determines this is safe.
In the case that both mem defs have uses in the RTL-SSA IR, and both
stores require re-ordering over their uses, we represent that as
(tentative) deletion of the original store insns and creation of a new
insn, to prevent requiring repeated re-parenting of uses during the
pass. We then update all mem uses that require re-parenting in one go
at the end of the pass.

To support this, RTL-SSA needs to handle inserting new insns (rather
than just changing existing ones), so this patch adds support for that.

New insns (and new accesses) are temporaries, allocated above a temporary
obstack_watermark, such that the user can easily back out of a change without
awkward bookkeeping.

gcc/ChangeLog:

* rtl-ssa/accesses.cc (function_info::create_set): New.
* rtl-ssa/accesses.h (access_info::is_temporary): New.
* rtl-ssa/changes.cc (move_insn): Handle new (temporary) insns.
(function_info::finalize_new_accesses): Handle new/temporary
user-created accesses.
(function_info::apply_changes_to_insn): Ensure m_is_temp flag
on new insns gets cleared.
(function_info::change_insns): Handle new/temporary insns.
(function_info::create_insn): New.
* rtl-ssa/changes.h (class insn_change): Make function_info a
friend class.
* rtl-ssa/functions.h (function_info): Declare new entry points:
create_set, create_insn. Declare new change_alloc helper.
* rtl-ssa/insns.cc (insn_info::print_full): Identify temporary insns in
dump.
* rtl-ssa/insns.h (insn_info): Add new m_is_temp flag and accompanying
is_temporary accessor.
* rtl-ssa/internals.inl (insn_info::insn_info): Initialize m_is_temp to
false.
* rtl-ssa/member-fns.inl (function_info::change_alloc): New.
* rtl-ssa/movement.h (restrict_movement_for_defs_ignoring): Add
handling for temporary defs.

match.pd: Avoid simplification into invalid BIT_FIELD_REFs [PR112673]

The following testcase is lowered by the bitint lowering pass, then
vectorizer vectorizes one of the loops in it, so we have
  vect__18.6_34 = VIEW_CONVERT_EXPR<vector(4) unsigned long>(x_35(D));
  _8 = BIT_FIELD_REF <vect__18.6_34, 64, 0>;
...
  _18 = BIT_FIELD_REF <vect__18.6_34, 64, 64>;
etc. where x_35(D) is _BitInt(256) argument.  That is valid BIT_FIELD_REF,
the first argument is a vector and it extracts the vector elements from it.
Then comes forwprop4 and simplifies that using match.pd into
  _8 = (unsigned long) x_35(D);
...
  _18 = BIT_FIELD_REF <x_35(D), 64, 64>;
and tree-cfg verification ICEs on the latter (though, even the first cast
is kind of undesirable after bitint lowering, we want large/huge bitints
lowered).  The ICE is because if BIT_FIELD_REFs first argument has
INTEGRAL_TYPE_P, we require type_has_mode_precision_p, but that is not the
case of _BitInt(256), it has BLKmode.

The following patch fixes it by doing the BIT_FIELD_REF with VCE to
BIT_FIELD_REF simplification only if the result is valid.

2023-11-24  Jakub Jelinek  <jakub@redhat.com>

PR tree-optimization/112673
* match.pd (bit_field_ref (vce @0) -> bit_field_ref @0): Only simplify
if either @0 doesn't have scalar integral type or if it has mode
precision.

* gcc.dg/pr112673.c: New test.

lower-bitint: Lower FLOAT_EXPR from BITINT_TYPE INTEGER_CST [PR112679]

The bitint lowering pass only does something if it sees BITINT_TYPE (medium,
large, huge) SSA_NAMEs.  In the past I've already ran into one special case
where the above doesn't work well, if there is a store of medium/large/huge
BITINT_TYPE INTEGER_CST into memory, there might not be any BITINT_TYPE
SSA_NAMEs in the function, yet we need to lower.  This has been solved by
also checking for SSA_NAME_IS_VIRTUAL_OPERAND if at the vdef there isn't
such a store (the whole intent is make the pass as cheap as possible in the
currently very likely case that the IL doesn't have any BITINT_TYPEs at
all).
And the following testcase shows a similar problem.  With -frounding-math
we don't fold some of FLOAT_EXPRs with INTEGER_CST operands, and if those
INTEGER_CSTs are medium/large/huge BITINT_TYPEs, we need to either cast
the INTEGER_CST to corresponding INTEGER_TYPE (for medium) or lower to
internal fn call which is later turned into libgcc call (for large/huge).
The following patch does that, but of course admittedly this discovery
of stores and FLOAT_EXPRs means we already look through quite a few
SSA_NAME_DEF_STMTs even when BITINT_TYPEs never appear.

2023-11-23  Jakub Jelinek  <jakub@redhat.com>

PR middle-end/112679
* gimple-lower-bitint.cc (gimple_lower_bitint): Also stop first loop on
floating point SSA_NAME set in FLOAT_EXPR assignment from BITINT_TYPE
INTEGER_CST.  Set has_large_huge for those if that BITINT_TYPE is large
or huge.  Set kind to such FLOAT_EXPR assignment rhs1 BITINT_TYPE's kind.

* gcc.dg/bitint-42.c: New test.

tree-optimization/112677 - stack corruption with .COND_* reduction

The following makes sure to allocate enough space for vectype_op
in vectorizable_reduction.

PR tree-optimization/112677
* tree-vect-loop.cc (vectorizable_reduction): Use alloca
to allocate vectype_op.

Clean up by_pieces_ninsns

The by pieces compare can be implemented by overlapped operations. So
it should be taken into consideration when doing the adjustment for
overlap operations.  The mode returned from
widest_fixed_size_mode_for_size is already checked with mov_optab in
by_pieces_mode_supported_p called by widest_fixed_size_mode_for_size.
So it is no need to check mov_optab again in by_pieces_ninsns.  The
patch fixes these issues.

gcc/
* expr.cc (by_pieces_ninsns): Include by pieces compare when
do the adjustment for overlap operations.  Replace mov_optab
checks with gcc assertion.

lower-bitint: Fix up -fnon-call-exceptions bit-field load lowering [PR112668]

As the following testcase shows, there are some bugs in the
-fnon-call-exceptions bit-field load lowering.  In particular, there
is a case where we want to emit a load early in the initialization
(before m_init_gsi) and because that load might throw exception, need
to split block after the load so that it has an EH edge.
Now, across this splitting, we have m_init_gsi, save_gsi (something
we put back into m_gsi afterwards) statement iterators and m_preheader_bb
which is used to determine the pre-header edge of a loop (if any).
As the testcase shows, both of these statement iterators and m_preheader_bb
as well need adjustments if the block was split.  If the stmt iterators
refer to a statement, they need to be updated so that if the statement is
in the bb after the split gsi_bb and gsi_seq is updated, otherwise they
ought to be the start of the new (second) bb.
Similarly, m_preheader_bb should be updated to the second bb if it was
the first before.  Other spots where we insert something before m_init_gsi
don't split blocks in there and are fine.

The m_gsi iterator is normal iterator to insert statements before it,
so gsi_end_p means insert statements at the end of basic block.
m_init_gsi is on the other side an iterator after which statements should be
inserted (so gsi_end_p means insert statements at the start of basic block
after labels), but the whole pass is written for insertion of statements before
iterators, so when in 3 spots it wants to insert something after m_init_gsi,
it saves current iterator to save_gsi and sets m_gsi to gsi_after_labels
if m_init_gsi was gsi_end_p, or to the next statement.  But it actually wasn't
updating m_init_gsi back when switching to normal iterator, this patch changes
that such that further statements after m_init_gsi will appear after the
set of statements inserted before m_init_gsi.

Finally, the pass had a couple of places where it wanted to create a gsi_end_p
iterator for a particular basic block, instead of doing
m_gsi = gsi_last_bb (bb); if (!gsi_end_p (m_gsi)) gsi_next (&m_gsi);
the pass now uses new m_gsi = gsi_end_bb (bb) function.

2023-11-24  Jakub Jelinek  <jakub@redhat.com>

PR middle-end/112668
* gimple-iterator.h (gsi_end, gsi_end_bb): New inline functions.
* gimple-lower-bitint.cc (bitint_large_huge::handle_cast): After
temporarily adding statements after m_init_gsi, update m_init_gsi
such that later additions after it will be after the added statements.
(bitint_large_huge::handle_load): Likewise.  When splitting
gsi_bb (m_init_gsi) basic block, update m_preheader_bb if needed
and update saved m_gsi as well if needed.
(bitint_large_huge::lower_mergeable_stmt,
bitint_large_huge::lower_comparison_stmt,
bitint_large_huge::lower_mul_overflow,
bitint_large_huge::lower_bit_query): Use gsi_end_bb.

* gcc.dg/bitint-40.c: New test.

tree: Fix up try_catch_may_fallthru [PR112619]

The following testcase ICEs with -std=c++98 since r14-5086 because
block_may_fallthru is called on a TRY_CATCH_EXPR whose second operand
is a MODIFY_EXPR rather than STATEMENT_LIST, which try_catch_may_fallthru
apparently expects.
I've been wondering whether that isn't some kind of FE bug and whether
there isn't some unwritten rule that second operand of TRY_CATCH_EXPR
must be a STATEMENT_LIST. Looking at the FEs, the C++ FE uses mostly its
own trees, TRY_BLOCK (TRY_CATCH_EXPR replacement) with HANDLER in it (CATCH_EXPR
replacement) - but HANDLER can be immediate second operand rather than nested
in STATEMENT_LIST, EH_SPEC_BLOCK (this one stands for both TRY_CATCH_EXPR
and EH_FILTER_EXPR in its second argument); both of these are only replaced
by the generic trees during gimplification though, so will unlikely be seen
by block_may_fallthru; and then CLEANUP_STMT, which is genericized
into TRY_CATCH_EXPR with non-CATCH_EXPR/EH_FILTER_EXPR in its body (this is
the one that causes the ICE on this testcase).
The Go and Rust FEs create TRY_CATCH_EXPR with CATCH_EXPR immediately in its
second argument (but either are unlucky that block_may_fallthru isn't called
or the body can always fallthru, or latent ICE), while the D FE most likely
hit this ICE and attempts to work around it, by checking at TRY_CATCH_EXPR
creation time if the second argument from pop_stmt_list is STATEMENT_LIST and
if not, forcefully wraps it into a STATEMENT_LIST.

Unfortunately, I don't see an easy way to create an artificial tree iterator
from just a single tree statement, so the patch duplicates what the loops
later do (after all, it is very simple, just didn't want to duplicate
also the large comments explaning it, so the 3 See below. comments).

2023-11-24 Jakub Jelinek <jakub@redhat.com>

PR c++/112619
* tree.cc (try_catch_may_fallthru): If second operand of
TRY_CATCH_EXPR is not a STATEMENT_LIST, handle it as if it was a
STATEMENT_LIST containing a single statement.

* g++.dg/eh/pr112619.C: New test.

tree-optimization/112344 - relax final value-replacement fix

The following tries to reduce the number of cases we use an unsigned
type for the addition when we know the original signed increment was
OK which is when the total unsigned increment computed fits the signed
type as well.

This fixes the observed testsuite fallout.

PR tree-optimization/112344
* tree-chrec.cc (chrec_apply): Only use an unsigned add
when the overall increment doesn't fit the signed type.

RISC-V: Optimize a special case of VLA SLP

When working on fixing bugs of zvl1024b. I notice a special VLA SLP case
can be better optimized.

v = vec_perm (op1, op2, { nunits - 1, nunits, nunits + 1, ... })

Before this patch, we are using genriec approach (vrgather):

vid
vadd.vx
vrgather
vmsgeu
vrgather

With this patch, we use vec_extract + slide1up:

scalar = vec_extract (last element of op1)
v = slide1up (op2, scalar)

Tested on zvl128b/zvl256b/zvl512b/zvl1024b of both RV32 and RV64 no regression.

Ok for trunk ?

PR target/112599

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_extract_and_slide1up_patterns): New function.
(expand_vec_perm_const_1): Add new optimization.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112599-2.c: New test.

RISC-V: Disable BSWAP optimization for NUNITS < 4

When fixing bugs, I notice there is a piece odd codes look incorrect.
which probably make codegen worse.

#include <stdint.h>

typedef int8_t vnx2qi __attribute__ ((vector_size (2)));

#define MASK_2(X, Y) (Y) - 1 - (X), (Y) - 2 - (X)

#define PERMUTE(TYPE, NUNITS)                                                  \
  __attribute__ ((noipa)) void permute_##TYPE (TYPE values1, TYPE values2,     \
       TYPE *out)                      \
  {                                                                            \
    TYPE v                                                                     \
      = __builtin_shufflevector (values1, values2, MASK_##NUNITS (0, NUNITS)); \
    *(TYPE *) out = v;                                                         \
  }

#define TEST_ALL(T)                                                            \
  T (vnx2qi, 2)

TEST_ALL (PERMUTE)

Before this patch:

        vsetivli        zero,2,e8,mf8,ta,ma
        vle8.v  v1,0(a0)
        vsetivli        zero,1,e16,mf4,ta,ma
        vsrl.vi v2,v1,8
        vsll.vi v1,v1,8
        vor.vv  v1,v2,v1
        vsetivli        zero,2,e8,mf8,ta,ma
        vse8.v  v1,0(a2)
        ret

After this patch:

        vsetivli        zero,2,e8,mf8,ta,ma
        vle8.v  v3,0(a0)
        vid.v   v1
        vrsub.vi        v1,v1,1
        vrgather.vv     v2,v3,v1
        vse8.v  v2,0(a2)
        ret

Committed as it is very obvious if during code review.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_bswap_pattern): Disable for NUNIT < 4.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/perm-4.c: Adapt test.
* gcc.target/riscv/rvv/autovec/vls/perm-4.c: Ditto.

c++: Support lambdas in static template member initialisers [PR107398]

The testcase noted in the PR fails because the context of the lambda is
not in namespace scope, but rather in class scope. This patch removes
the assertion that the context must be a namespace and ensures that
lambdas in class scope still get the correct merge_kind.

PR c++/107398

gcc/cp/ChangeLog:

* module.cc (trees_out::get_merge_kind): Handle lambdas in class
scope.
(maybe_key_decl): Remove assertion and fix whitespace.

gcc/testsuite/ChangeLog:

* g++.dg/modules/lambda-6_a.C: New test.
* g++.dg/modules/lambda-6_b.C: New test.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

i386: Fix AVX512 and AVX10 option issues

gcc/ChangeLog:

PR target/112643
* config/i386/driver-i386.cc (check_avx10_avx512_features):
Renamed to ...
(check_avx512_features): this and remove avx10 check.
(host_detect_local_cpu): Never append -mno-avx10.1-{256,512} to
avoid emitting warnings when building GCC with native arch.
* config/i386/i386-builtin.def (BDESC): Add missing AVX512VL for
128/256 bit builtin for AVX512VP2INTERSECT.
* config/i386/i386-options.cc (ix86_option_override_internal):
Also check whether the AVX512 flags is set when trying to reset.
* config/i386/i386.h
(PTA_SKYLAKE_AVX512): Add missing PTA_EVEX512.
(PTA_ZNVER4): Ditto.

c++: check mismatching exports for class tags [PR98885]

Checks for exporting a declaration that was previously declared as not
exported is implemented in 'duplicate_decls', but this doesn't handle
declarations of classes. This patch adds these checks and slightly
adjusts the associated error messages for clarity.

PR c++/98885

gcc/cp/ChangeLog:

* decl.cc (duplicate_decls): Adjust error message.
(xref_tag): Adjust error message. Check exporting decl that is
already declared as non-exporting.

gcc/testsuite/ChangeLog:

* g++.dg/modules/export-1.C: Adjust error messages. Remove
xfails for working case. Add new test case.

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

Daily bump.

MAINTAINERS: Add myself to write after approval and DCO

ChangeLog:

* MAINTAINERS: Add myself to write after approval and DCO

Signed-off-by: Nathaniel Shead <nathanieloshead@gmail.com>

contrib/regression/btest-gcc.sh: Optionally handle XPASS.

Tests with keys that match both PASS, FAIL (or now
optionally XPASS), count as fail.  XPASSes were previously
ignored.  Handling them as FAIL seems the most useful
alternative, but not counting XPASSes may be deliberate.
It's also a matter of compatibility, so make it optional.

Attempts to use --handle-xpass-as-fail was previously
flagged as a usage error.  If you pass it now, on state with
previous mixed XPASS and PASS results but doesn't change in
this run, the XPASS is discovered as a (new) regression.
For new XPASSing tests, it's handled as a new FAIL.

* btest-gcc.sh (--handle-xpass-as-fail): New option.

contrib/regression/btest-gcc.sh: Simplify option handling.

* btest-gcc.sh (Option handling): Break out shifts from each
option alternative.

contrib/regression/btest-gcc.sh: Handle multiple options.

This is a long-standing bug: passing "-j --add-passes-despite-regression"
or "--add-passes-despite-regression -j" caused the second option to be
treated as TARGET; the first non-option parameter.

* btest-gcc.sh (Option handling): Handle multiple options.

hppa: Fix g++.dg/modules/bad-mapper-1.C on hpux

2023-11-23 John David Anglin <danglin@gcc.gnu.org>

gcc/testsuite/ChangeLog:

* g++.dg/modules/bad-mapper-1.C: Add hppa*-*-hpux* to dg-error
"-:failed mapper handshake communication" targets.

hppa: Fix gcc.dg/analyzer/fd-4.c on hpux

2023-11-23 John David Anglin <danglin@gcc.gnu.org>

gcc/testsuite/ChangeLog:

* gcc.dg/analyzer/fd-4.c: Define _MODE_T on hpux.

hppa: Export main in pr104869.C on hpux

This is needed to avoid a linker warning.

2023-11-23 John David Anglin <danglin@gcc.gnu.org>

gcc/testsuite/ChangeLog:

* g++.dg/pr104869.C: Export main on hpux.

testsuite, lib: Re-allow mulitple function start labels.

The change applied in r14-5760-g2a46e0e7e20 changed the behaviour of
functions with assembly like:

bar:
__acle_se_bar:

Where both bar and __acle_se_bar are globals refering to the same
function body. The old behaviour overrides 'bar' with '__acle_se_bar'
and the scan tests for that label.

The change here re-allows the override.

Case like this are not legal Mach-O (where two global symbols cannot
have the same address in the assembler output). However, given the
constraints on the Mach-O scanning, it does not seem that it is
necessary to skip the change (any incorrect case should be easily
evident in the assembler).

gcc/testsuite/ChangeLog:

* lib/scanasm.exp: Allow multiple function start symbols,
taking the last as the function name.

Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>

testsuite: fortran: fix invalid testcases (missing MOLD argument to NULL)

The Fortran standard requires that NULL() passed to an assumed-rank
dummy argument has a MOLD argument.

gcc/testsuite/ChangeLog:

PR fortran/104819
* gfortran.dg/assumed_rank_10.f90: Add MOLD argument to NULL().
* gfortran.dg/assumed_rank_8.f90: Likewise.

Fortran: restrictions on integer arguments to SYSTEM_CLOCK [PR112609]

Fortran 2023 added restrictions on integer arguments to SYSTEM_CLOCK to
have a decimal exponent range at least as large as a default integer,
and that all integer arguments have the same kind type parameter.

gcc/fortran/ChangeLog:

PR fortran/112609
* check.cc (gfc_check_system_clock): Add checks on integer arguments
to SYSTEM_CLOCK specific to F2023.
* error.cc (notify_std_msg): Adjust to handle new features added
in F2023.
* gfortran.texi (_gfortran_set_options): Document GFC_STD_F2023_DEL,
remove obsolete option GFC_STD_F2008_TS and fix enumeration values.
* libgfortran.h (GFC_STD_F2023_DEL): Add and use in GFC_STD_OPT_F23.
* options.cc (set_default_std_flags): Add GFC_STD_F2023_DEL.

gcc/testsuite/ChangeLog:

PR fortran/112609
* gfortran.dg/system_clock_1.f90: Add option -std=f2003.
* gfortran.dg/system_clock_3.f08: Add option -std=f2008.
* gfortran.dg/system_clock_4.f90: New test.

AVR: PR target/86776: Implement CVE-2017-5753.

gcc/
PR target/86776
* config/avr/avr.cc (TARGET_HAVE_SPECULATION_SAFE_VALUE): Define
to speculation_safe_value_not_needed.

hppa: xfail scan-assembler-not check in g++.dg/cpp0x/initlist-const1.C

2023-11-23 John David Anglin <danglin@gcc.gnu.org>

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/initlist-const1.C: xfail scan-assembler-not
check on hppa*-*-hpux*.

libstdc++: Define std::ranges::to for C++23 (P1206R7) [PR111055]

This adds the std::ranges::to functions for C++23. The rest of P1206R7
is not yet implemented, i.e. the new constructors taking the
std::from_range tag, and the new insert_range, assign_range, etc. member
functions. std::ranges::to works with the standard containers even
without the new constructors, so this is useful immediately.

The __cpp_lib_ranges_to_container feature test macro can be defined now,
because that only indicates support for the changes in <ranges>, which
are implemented by this patch. The __cpp_lib_containers_ranges macro
will be defined once all containers support the new member functions.

libstdc++-v3/ChangeLog:

PR libstdc++/111055
* include/bits/ranges_base.h (from_range_t): Define new tag
type.
(from_range): Define new tag object.
* include/bits/version.def (ranges_to_container): Define.
* include/bits/version.h: Regenerate.
* include/std/ranges (ranges::to): Define.
* testsuite/std/ranges/conv/1.cc: New test.
* testsuite/std/ranges/conv/2_neg.cc: New test.
* testsuite/std/ranges/conv/version.cc: New test.

libstdc++: Fix access error in __gnu_test::uneq_allocator

The operator== function is only a friend of the LHS argument, so cannot
access the private member of the RHS argument. Use the public accessor
instead.

libstdc++-v3/ChangeLog:

* testsuite/util/testsuite_allocator.h (uneq_allocator): Fix
equality operator for heterogeneous comparisons.

Don't skip check for warning at line 411 in Wattributes.c on hppa*64*-*-*

2023-11-23 John David Anglin <danglin@gcc.gnu.org>

gcc/testsuite/ChangeLog:

* c-c++-common/Wattributes.c: Don't skip check for warning
at line 411 in Wattributes.c on hppa*64*-*-*.

gcc: Introduce -fhardened

In <https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628748.html>
I proposed -fhardened, a new umbrella option that enables a reasonable set
of hardening flags.  The read of the room seems to be that the option
would be useful.  So here's a patch implementing that option.

Currently, -fhardened enables:

  -D_FORTIFY_SOURCE=3 (or =2 for older glibcs)
  -D_GLIBCXX_ASSERTIONS
  -ftrivial-auto-var-init=zero
  -fPIE  -pie  -Wl,-z,relro,-z,now
  -fstack-protector-strong
  -fstack-clash-protection
  -fcf-protection=full (x86 GNU/Linux only)

-fhardened will not override options that were specified on the command line
(before or after -fhardened).  For example,

     -D_FORTIFY_SOURCE=1 -fhardened

means that _FORTIFY_SOURCE=1 will be used.  Similarly,

      -fhardened -fstack-protector

will not enable -fstack-protector-strong.

Currently, -fhardened is only supported on GNU/Linux.

In DW_AT_producer it is reflected only as -fhardened; it doesn't expand
to anything.  This patch provides -Whardened, enabled by default, which
warns when -fhardened couldn't enable a particular option.  I think most
often it will say that _FORTIFY_SOURCE wasn't enabled because optimization
were not enabled.

gcc/c-family/ChangeLog:

* c-opts.cc: Include "target.h".
(c_finish_options): Maybe cpp_define _FORTIFY_SOURCE
and _GLIBCXX_ASSERTIONS.

gcc/ChangeLog:

* common.opt (Whardened, fhardened): New options.
* config.in: Regenerate.
* config/bpf/bpf.cc: Include "opts.h".
(bpf_option_override): If flag_stack_protector_set_by_fhardened_p, do
not inform that -fstack-protector does not work.
* config/i386/i386-options.cc (ix86_option_override_internal): When
-fhardened, maybe enable -fcf-protection=full.
* config/linux-protos.h (linux_fortify_source_default_level): Declare.
* config/linux.cc (linux_fortify_source_default_level): New.
* config/linux.h (TARGET_FORTIFY_SOURCE_DEFAULT_LEVEL): Redefine.
* configure: Regenerate.
* configure.ac: Check if the linker supports '-z now' and '-z relro'.
Check if -fhardened is supported on $target_os.
* doc/invoke.texi: Document -fhardened and -Whardened.
* doc/tm.texi: Regenerate.
* doc/tm.texi.in (TARGET_FORTIFY_SOURCE_DEFAULT_LEVEL): Add.
* gcc.cc (driver_handle_option): Remember if any link options or -static
were specified on the command line.
(process_command): When -fhardened, maybe enable -pie and
-Wl,-z,relro,-z,now.
* opts.cc (flag_stack_protector_set_by_fhardened_p): New global.
(finish_options): When -fhardened, enable
-ftrivial-auto-var-init=zero and -fstack-protector-strong.
(print_help_hardened): New.
(print_help): Call it.
* opts.h (flag_stack_protector_set_by_fhardened_p): Declare.
* target.def (fortify_source_default_level): New target hook.
* targhooks.cc (default_fortify_source_default_level): New.
* targhooks.h (default_fortify_source_default_level): Declare.
* toplev.cc (process_options): When -fhardened, enable
-fstack-clash-protection.  If flag_stack_protector_set_by_fhardened_p,
do not warn that -fstack-protector not supported for this target.
Don't enable -fhardened when !HAVE_FHARDENED_SUPPORT.

gcc/testsuite/ChangeLog:

* gcc.misc-tests/help.exp: Test -fhardened.
* c-c++-common/fhardened-1.S: New test.
* c-c++-common/fhardened-1.c: New test.
* c-c++-common/fhardened-10.c: New test.
* c-c++-common/fhardened-11.c: New test.
* c-c++-common/fhardened-12.c: New test.
* c-c++-common/fhardened-13.c: New test.
* c-c++-common/fhardened-14.c: New test.
* c-c++-common/fhardened-15.c: New test.
* c-c++-common/fhardened-2.c: New test.
* c-c++-common/fhardened-3.c: New test.
* c-c++-common/fhardened-4.c: New test.
* c-c++-common/fhardened-5.c: New test.
* c-c++-common/fhardened-6.c: New test.
* c-c++-common/fhardened-7.c: New test.
* c-c++-common/fhardened-8.c: New test.
* c-c++-common/fhardened-9.c: New test.
* gcc.target/i386/cf_check-6.c: New test.

libgcc: mark __hardcfr_check_fail as always_inline

The function __hardcfr_check_fail in hardcfr.c is internal and static
inline.  It receives many arguments, which require more than five
registers to be passed in bpf-none-unknown targets.  BPF is limited to
that number of registers to pass arguments, and therefore libgcc fails
to build in that target.  This patch marks the function with the
always_inline attribute, fixing the bpf build.

Tested in bpf-unknown-none target and x86_64-linux-gnu host.

libgcc/ChangeLog:

* hardcfr.c (__hardcfr_check_fail): Mark as always_inline.

testsuite: Fix subexpressions with `scan-assembler-times'

We have an issue with `scan-assembler-times' handling expressions using
subexpressions as produced by capturing parentheses `()' in an odd way,
and one that is inconsistent with `scan-assembler', `scan-assembler-not',
etc.  The problem comes from calling `regexp' with `-inline -all', which
causes a list to be returned that would otherwise be placed in match
variables.

Consequently if we have say:

/* { dg-final { scan-assembler-times "\\s(foo|bar)\\s" 1 } } */

in a test case and there is a lone `foo' present in output being matched,
then our invocation of `regexp -inline -all' in `scan-assembler-times'
will return:

{ foo } foo

and that in turn will confuse our match count calculation as `llength'
will return 2 rather than 1, making the test fail even though `foo' was
only actually matched once.

It seems unclear why we chose to call `regexp' in such an odd way in the
first place just to figure out the number of matches.  The first version
of TCL that supports the `-all' option to `regexp' is 8.3, and according
to its documentation[1][2] `regexp' already returns the number of matches
found whenever `-all' has been used *unless* `-inline' has also been used.

Remove the `-inline' option then along with the `llength' invocation.

References:

[1] "Tcl Built-In Commands - regexp manual page",
    <https://www.tcl.tk/man/tcl8.2.3/TclCmd/regexp.html>

[2] "Tcl Built-In Commands - regexp manual page",
    <https://www.tcl.tk/man/tcl8.3/TclCmd/regexp.html>

gcc/testsuite/
* lib/scanasm.exp (scan-assembler-times): Remove the `-inline'
option to `regexp' and the wrapping `llength' call.

AArch64/testsuite: Use non-capturing parentheses with ccmp_1.c

Use non-capturing parentheses for the subexpressions used with
`scan-assembler-times', to avoid a quirk with double-counting.

gcc/testsuite/
* gcc.target/aarch64/ccmp_1.c: Use non-capturing parentheses
with `scan-assembler-times'.

ARM/testsuite: Use non-capturing parentheses with pr53447-5.c

Use non-capturing parentheses for the subexpressions used with
`scan-assembler-times', to avoid a quirk with double-counting.

gcc/testsuite/
* gcc.target/arm/pr53447-5.c: Use non-capturing parentheses with
`scan-assembler-times'.

arm: [MVE intrinsics] Add default clause to full_width_access::memory_vector_mode

My recent commit 0c2037d9d93a8f768cb11698ff794278246bb31f added a
switch statement lacking a default clause, leading to warnings or
errors when building with --enable-werror-always.

Fix by adding an empty default.

Committed as obvious.

2023-11-23 Christophe Lyon <christophe.lyon@linaro.org>

gcc/
* config/arm/arm-mve-builtins-functions.h
(full_width_access::memory_vector_mode): Add default clause.

i386: Wrong code with __builtin_parityl [PR112672]

gen_parityhi2_cmp instruction clobbers its input operand, so use
a temporary register in the call to gen_parityhi2_cmp.

PR target/112672

gcc/ChangeLog:

* config/i386/i386.md (parityhi2):
Use temporary register in the call to gen_parityhi2_cmp.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr112672.c: New test.

i386: Fix ICE with -mforce-indirect-call and -fsplit-stack [PR89316]

With the above two options, use a temporary register regno (as returned
from split_stack_prologue_scratch_regno) as an indirect call scratch
register to hold __morestack function address.  On 64-bit targets, two
temporary registers are always available, so load the function addres in
%r11 and call __morestack_large_model with its one-argument-register value
rn %r10.  On 32-bit targets, bail out with a "sorry" if the temporary
register can not be obtained.

On 32-bit targets, also emit PIC sequence that re-uses the obtained indirect
call scratch register before moving the function address to it.  We can
not set up %ebx PIC register in this case, but __morestack is prepared
for this situation and sets it up by itself.

PR target/89316

gcc/ChangeLog:

* config/i386/i386.cc (ix86_expand_split_stack_prologue): Obtain
scratch regno when flag_force_indirect_call is set.  On 64-bit
targets, call __morestack_large_model when  flag_force_indirect_call
is set and on 32-bit targets with -fpic, manually expand PIC sequence
to call __morestack.  Move the function address to an indirect
call scratch register.

gcc/testsuite/ChangeLog:

* g++.target/i386/pr89316.C: New test.
* gcc.target/i386/pr112605-1.c: New test.
* gcc.target/i386/pr112605-2.c: New test.
* gcc.target/i386/pr112605.c: New test.

gcov: No atomic ops for -fprofile-update=single

gcc/ChangeLog:

PR tree-optimization/112678

* tree-profile.cc (tree_profiling): Do not use atomic operations
for -fprofile-update=single.

s390: implement flags output

Implement flags output for inline assemblies.  Only use one output constraint
that captures the whole condition code.  No breakout into different condition
codes is allowed.  Also, only one condition code variable is allowed.

Add further logic to canonicalize various cases where we combine different
cases of possible condition codes.

gcc/ChangeLog:

* config/s390/s390-c.cc (s390_cpu_cpp_builtins): Define
__GCC_ASM_FLAG_OUTPUTS__.
* config/s390/s390.cc (s390_canonicalize_comparison): More
UNSPEC_CC_TO_INT cases.
(s390_md_asm_adjust): Implement flags output.
* config/s390/s390.md (ccstore4): Allow mask operands.
* doc/extend.texi: Document flags output.

gcc/testsuite/ChangeLog:

* gcc.target/s390/ccor.c: New test.

Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>

s390: split int128 load

Issue two loads when using GPRs instead of one load-multiple.

Bootstrapped and tested on s390. OK for mainline?

gcc/ChangeLog:

* config/s390/s390.md: Split TImode loads.

gcc/testsuite/ChangeLog:

* gcc.target/s390/int128load.c: New test.

Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>

s390: Fix ICE in testcase pr89233

When using GNU vector extensions, an access outside of the vector size
caused an ICE on s390. Fix this by aligning with the vec_extract
builtin, i.e., computing constant index modulo number of lanes.

Fixes testcase gcc.target/s390/pr89233.c.

gcc/ChangeLog:

* config/s390/vector.md: (*vec_extract) Fix.

Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>

swap ops in reassoc to reduce cross backedge FMA

Previously for ops.length >= 3, when FMA is present, we don't
rank the operands so that more FMAs can be preserved. But this
brings more FMAs with loop dependency, which lead to worse
performance on some targets.

Rank the oprands (set width=2) when:
1. avoid_fma_max_bits is set.
2. And loop dependent FMA sequence is found.

In this way, we don't have to discard all the FMA candidates
in the bad shaped sequence in widening_mul, instead we can keep
fewer FMAs without loop dependency.

With this patch, there's about 2% improvement in 510.parest_r
1-copy run on ampere1 (with "-Ofast -mcpu=ampere1 -flto
--param avoid-fma-max-bits=512").

PR tree-optimization/110279

gcc/ChangeLog:

* tree-ssa-reassoc.cc (get_reassociation_width): check
for loop dependent FMAs.
(reassociate_bb): For 3 ops, refine the condition to call
swap_ops_for_binary_stmt.

gcc/testsuite/ChangeLog:

* gcc.dg/pr110279-1.c: New test.

RISC-V: Add wrapper for emit vec_extract[NFC]

Add wrapper for vec_extract since my following patch will need to call it.
gcc/ChangeLog:

* config/riscv/riscv-protos.h (emit_vec_extract): New function.
* config/riscv/riscv-v.cc (emit_vec_extract): Ditto.
* config/riscv/riscv.cc (riscv_legitimize_move): Refine codes.

RISC-V: Disable AVL propagation of vrgather instruction

This patch fixes following FAILs in zvl1024b of both RV32/RV64:

FAIL: gcc.c-torture/execute/990128-1.c   -O2  execution test
FAIL: gcc.c-torture/execute/990128-1.c   -O2 -flto -fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.c-torture/execute/990128-1.c   -O2 -flto -fuse-linker-plugin -fno-fat-lto-objects  execution test
FAIL: gcc.c-torture/execute/990128-1.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
FAIL: gcc.c-torture/execute/990128-1.c   -O3 -g  execution test
FAIL: gcc.dg/torture/pr58955-2.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test

The root case can be simpliy described in this following small case:

https://godbolt.org/z/7GaxbEGzG

typedef int64_t v1024b __attribute__ ((vector_size (128)));

void foo (void *out, void *in, int64_t a, int64_t b)
{
  v1024b v = {a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a};
  v1024b v2 = {b,b,b,b,b,b,b,b,b,b,b,b,b,b,b,b};
  v1024b index = *(v1024b*)in;
  v1024b v3 = __builtin_shuffle (v, v2, index);
  __riscv_vse64_v_i64m1 (out, (vint64m1_t)v3, 10);
}

Incorrect ASM:

foo:
        li      a5,31
        vsetivli        zero,10,e64,m1,ta,mu
        vmv.v.x v2,a5
        vl1re64.v       v1,0(a1)
        vmv.v.x v4,a2
        vand.vv v1,v1,v2
        vmv.v.x v3,a3
        vmsgeu.vi       v0,v1,16
        vrgather.vv     v2,v4,v1       --> AVL = VLMAX according to codes.
        vadd.vi v1,v1,-16
        vrgather.vv     v2,v3,v1,v0.t  --> AVL = VLMAX according to codes.
        vse64.v v2,0(a0)               --> AVL = 10 according to codes.
        ret

For vrgather dest, source, index instruction, when index may has the value > the following store AVL
that is index value > 10.  In this situation, the codes above will end up with:

The source vector of vrgather has undefined value on index >= AVL (which is 10 in this case).

So disable AVL propagation for vrgather instruction.

PR target/112599
PR target/112670

gcc/ChangeLog:

* config/riscv/riscv-avlprop.cc (alv_can_be_propagated_p): New function.
(vlmax_ta_p): Disable vrgather AVL propagation.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112599-1.c: New test.

expr: Fix &bitint_var handling in initializers [PR112336]

As the following testcase shows, we ICE when trying to emit ADDR_EXPR of
a bitint variable which doesn't have mode width.
The problem is in the EXTEND_BITINT stuff which makes sure we treat the
padding bits on memory reads from user bitint vars as undefined.
When expanding ADDR_EXPR on such vars inside outside of initializers,
expand_expr_addr* uses EXPAND_CONST_ADDRESS modifier and EXTEND_BITINT
does nothing, but in initializers it keeps using EXPAND_INITIALIZER
modifier. So, we need to treat EXPAND_INITIALIZER the same as
EXPAND_CONST_ADDRESS for this regard.

2023-11-23 Jakub Jelinek <jakub@redhat.com>

PR middle-end/112336
* expr.cc (EXTEND_BITINT): Don't call reduce_to_bit_field_precision
if modifier is EXPAND_INITIALIZER.

* gcc.dg/bitint-41.c: New test.

RISC-V: Refine some codes of riscv-v.cc[NFC]

This patch is NFC patch to refine unreasonable codes I notice.

Tested on zvl128b/zvl256b/zvl512b/zvl1024b no regression.

Committed.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vlmax_gather_insn): Refine codes.
(emit_vlmax_masked_gather_mu_insn): Ditto.
(modulo_sel_indices): Ditto.
(expand_vec_perm): Ditto.
(shuffle_generic_patterns): Ditto.

c++: Require C++11 for g++.dg/opt/pr110879.C [PR110879]

The _M_realloc_insert member does not have the trivial relocation
optimization for C++98, which seems to be why the _M_end_of_storage
member does not get optimized away. Make this test unsupported for
C++98.

gcc/testsuite/ChangeLog:

PR libstdc++/110879
* g++.dg/opt/pr110879.C: Require C++11 or later.

c: Add __builtin_stdc_* builtins

As discussed in the
https://sourceware.org/pipermail/libc-alpha/2023-November/152756.html
thread, including e.g.
https://sourceware.org/pipermail/libc-alpha/2023-November/152795.html
patch, while one can use the new __builtin_{clz,ctz,popcount}g builtins
to implement the stdbit.h type-generic macros, there are certain problems
with that implementation if those macros must be usable outside of
function bodies (e.g. int a = sizeof (stdc_bit_floor (0ULL));), must not
evaluate their arguments multiple times and especially for deep stdc_*
macro nesting don't expand the argument more than once.  Plus ideally are
usable in constant expressions for all the types if they have constant
arguments.  The above second URL satisfies it all but the last two (the
last one satisfies for many of them).  While we could get away with just
adding __biultin_stdc_bit_{ceil,floor,width} which are complicated and
2 further extensions (some way to say that __builtin_c{l,t}zg should
imply bit precision of the first argument for the second argument without
using __builtin_popcountg ((__typeof (x)) -1) in there because that
causes another expansion of the macro argument and say __builtin_bit_complement
type-generic builtin which would be like (__typeof (x)) ~(x)), it was decided
we want to implement builtins for all the stdc type-generic macros.
As we are close to running out of 8-bit enum rid (when adding the 14 new
RID_* we have 7 too many), this patch implements those 14 keywords using
a single RID_BUILTIN_STDC and simply in the rare case this is being
parsed check values of 1-2 characters from the builtin names to see which
one it is.

2023-11-23  Jakub Jelinek  <jakub@redhat.com>

gcc/
* doc/extend.texi (__builtin_stdc_bit_ceil, __builtin_stdc_bit_floor,
__builtin_stdc_bit_width, __builtin_stdc_count_ones,
__builtin_stdc_count_zeros, __builtin_stdc_first_leading_one,
__builtin_stdc_first_leading_zero, __builtin_stdc_first_trailing_one,
__builtin_stdc_first_trailing_zero, __builtin_stdc_has_single_bit,
__builtin_stdc_leading_ones, __builtin_stdc_leading_zeros,
__builtin_stdc_trailing_ones, __builtin_stdc_trailing_zeros): Document.
gcc/c-family/
* c-common.h (enum rid): Add RID_BUILTIN_STDC: New.
* c-common.cc (c_common_reswords): Add __builtin_stdc_bit_ceil,
__builtin_stdc_bit_floor, __builtin_stdc_bit_width,
__builtin_stdc_count_ones, __builtin_stdc_count_zeros,
__builtin_stdc_first_leading_one, __builtin_stdc_first_leading_zero,
__builtin_stdc_first_trailing_one, __builtin_stdc_first_trailing_zero,
__builtin_stdc_has_single_bit, __builtin_stdc_leading_ones,
__builtin_stdc_leading_zeros, __builtin_stdc_trailing_ones and
__builtin_stdc_trailing_zeros.  Move __builtin_assoc_barrier
alphabetically earlier.
gcc/c/
* c-parser.cc (c_parser_postfix_expression): Handle RID_BUILTIN_STDC.
* c-decl.cc (names_builtin_p): Likewise.
gcc/testsuite/
* gcc.dg/builtin-stdc-bit-1.c: New test.
* gcc.dg/builtin-stdc-bit-2.c: New test.

middle-end/32667 - document cpymem and memcpy exact overlap requirement

The following amends the cpymem documentation to mention that exact
overlap needs to be handled gracefully, also noting that the target
runtime is expected to behave the same way where -ffreestanding
docs mention the set of routines required.

PR middle-end/32667
* doc/md.texi (cpymem): Document that exact overlap of source
and destination needs to work.
* doc/standards.texi (ffreestanding): Mention memcpy is required
to handle the exact overlap case.

c++: Implement C++26 P2741R3 - user-generated static_assert messages [PR110348]

The following patch implements the user generated static_assert messages next
to string literals.

As I wrote already in the PR, in addition to looking through the paper
I looked at the clang++ testcase for this feature implemented there from
paper's author and on godbolt played with various parts of the testcase
coverage below, and there are some differences between what the patch
implements and what clang++ implements.

The first is that clang++ diagnoses if M.size () or M.data () methods
are present, but aren't constexpr; while the paper introduction talks about
that, the standard wording changes don't seem to require that, all they say
is that those methods need to exist (assuming accessible and the like)
and be implicitly convertible to std::size_t or const char *, but rest is
only if the static assertion fails.  If there is intent to change that
wording, the question is how far to go, e.g. while M.size () could be
constexpr, they could e.g. return some class object which wouldn't have
constexpr conversion operator to size_t/const char * and tons of other
reasons why the constant evaluation could fail.  Without actually evaluating
it I don't see how we could guarantee anything for non-failed static_assert.

The second difference is that
static_assert (false, "foo"_myd);
in the testcase is normal failed static assertion and
static_assert (true, "foo"_myd);
would be accepted, while clang++ rejects it.  IMHO
"foo"_myd doesn't match the syntactic requirements of unevaluated-string
as mentioned in http://eel.is/c++draft/dcl.pre#10 , and because
a constexpr udlit operator can return something which is valid, it shouldn't
be rejected just in case.
Last is clang++ ICEs on non-static data members size/data.

The first version of this support had a difference where M.data () was not
a constant expression but a core constant expression, but if M.size () != 0
M.data ()[0] ... M.data ()[M.size () - 1] were integer constant expressions.
We don't have any routine to test whether an expression is a core constant
expression, so what the code does is try silently whether M.data () is
a constant expression (maybe_constant_value), if it is, nice, we can use
that result to attempt to optimize the extraction of the message from it
if it is some recognized form involving a STRING_CST and just to double-check
try to constant evaluate M.data ()[0] and M.data ()[M.size () - 1] expressions
as boundaries but not anything in between.  If M.data () is not a constant
expression, we don't fail, but use a slower method of evaluating M.data ()[i]
for i 0, 1, ... M.size () - 1.  And if M.size () == 0, the above wouldn't
evaluate anything, so we try to constant evaluate (M.data (), 0) as constant
expression, which should succeed if M.data () is a core constant expression
and fail otherwise.

The patch assumes that these expressions are manifestly constant evaluated.

The patch implements what I see in the paper, because it is unclear what
further changes will be voted in (and the changes can be done at that
point).
The initial patch used tf_none in 6 spots so that just the static_assert
specific errors were emitted and not others, but during review this has been
changed, so that we emit both the more detailed errors why something wasn't
found or wasn't callable or wasn't convertible and diagnostics that
static_assert second argument needs to satisfy some of the needed properties.

2023-11-23  Jakub Jelinek  <jakub@redhat.com>

PR c++/110348
gcc/
* doc/invoke.texi (-Wno-c++26-extensions): Document.
gcc/c-family/
* c.opt (Wc++26-extensions): New option.
* c-cppbuiltin.cc (c_cpp_builtins): For C++26 predefine
__cpp_static_assert to 202306L rather than 201411L.
gcc/cp/
* parser.cc: Implement C++26 P2741R3 - user-generated static_assert
messages.
(cp_parser_static_assert): Parse message argument as
conditional-expression if it is not a pure string literal or
several of them concatenated followed by closing paren.
* semantics.cc (finish_static_assert): Handle message which is not
STRING_CST.  For condition with bare parameter packs return early.
* pt.cc (tsubst_expr) <case STATIC_ASSERT>: Also tsubst_expr
message and make sure that if it wasn't originally STRING_CST, it
isn't after tsubst_expr either.
gcc/testsuite/
* g++.dg/cpp26/static_assert1.C: New test.
* g++.dg/cpp26/feat-cxx26.C (__cpp_static_assert): Expect
202306L rather than 201411L.
* g++.dg/cpp0x/udlit-error1.C: Expect different diagnostics for
static_assert with user-defined literal.

ifcvt: remove obsolete SUBREG handling in noce_convert_multiple_sets

This code used to handle SUBREG for register replacement when ifcvt
was doing the replacements manually. This special handling is not
needed anymore because simplify_replace_rtx is used for the
replacements and it properly handles these cases.

gcc/ChangeLog:

* ifcvt.cc (noce_convert_multiple_sets_1): Remove old code.

Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>

DSE: Allow vector type for get_stored_val when read < store

Update in v4:
* Merge upstream and removed some independent changes.

Update in v3:
* Take known_le instead of known_lt for vector size.
* Return NULL_RTX when gap is not equal 0 and not constant.

Update in v2:
* Move vector type support to get_stored_val.

Original log:

This patch would like to allow the vector mode in the
get_stored_val in the DSE. It is valid for the read
rtx if and only if the read bitsize is less than the
stored bitsize.

Given below example code with
--param=riscv-autovec-preference=fixed-vlmax.

vuint8m1_t test () {
  uint8_t arr[32] = {
    1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
    1, 2, 7, 1, 3, 4, 5, 3, 1, 0, 1, 2, 4, 4, 9, 9,
  };

  return __riscv_vle8_v_u8m1(arr, 32);
}

Before this patch:
test:
  lui     a5,%hi(.LANCHOR0)
  addi    sp,sp,-32
  addi    a5,a5,%lo(.LANCHOR0)
  li      a3,32
  vl2re64.v       v2,0(a5)
  vsetvli zero,a3,e8,m1,ta,ma
  vs2r.v  v2,0(sp)             <== Unnecessary store to stack
  vle8.v  v1,0(sp)             <== Ditto
  vs1r.v  v1,0(a0)
  addi    sp,sp,32
  jr      ra

After this patch:
test:
  lui     a5,%hi(.LANCHOR0)
  addi    a5,a5,%lo(.LANCHOR0)
  li      a4,32
  addi    sp,sp,-32
  vsetvli zero,a4,e8,m1,ta,ma
  vle8.v  v1,0(a5)
  vs1r.v  v1,0(a0)
  addi    sp,sp,32
  jr      ra

Below tests are passed within this patch:
* The risc-v regression test.
* The x86 bootstrap and regression test.
* The aarch64 regression test.

PR target/111720

gcc/ChangeLog:

* dse.cc (get_stored_val): Allow vector mode if read size is
less than or equal to stored size.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/pr111720-0.c: New test.
* gcc.target/riscv/rvv/base/pr111720-1.c: New test.
* gcc.target/riscv/rvv/base/pr111720-10.c: New test.
* gcc.target/riscv/rvv/base/pr111720-2.c: New test.
* gcc.target/riscv/rvv/base/pr111720-3.c: New test.
* gcc.target/riscv/rvv/base/pr111720-4.c: New test.
* gcc.target/riscv/rvv/base/pr111720-5.c: New test.
* gcc.target/riscv/rvv/base/pr111720-6.c: New test.
* gcc.target/riscv/rvv/base/pr111720-7.c: New test.
* gcc.target/riscv/rvv/base/pr111720-8.c: New test.
* gcc.target/riscv/rvv/base/pr111720-9.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>

mingw: Exclude utf8 manifest [PR111170, PR108865]

Make the utf8 manifest optional (on by default and
explicitly off with --disable-win32-utf8-manifest)
in the mingw hosts.

Also eliminate duplication between the 32-bit and
64-bit mingw hosts by putting them both in the
same branch and special-case only the 64-bit long
long setting.

PR mingw/111170
PR mingw/108865

Signed-off-by: Costas Argyris <costas.argyris@gmail.com>
Signed-off-by: Jonathan Yong <10walls@gmail.com>
gcc/Changelog:

* configure.ac: Handle new --enable-win32-utf8-manifest
option.
* config.host: allow win32 utf8 manifest to be disabled
by user.
* configure: Regenerate.

testsuite: Tweak xfail bogus g++.dg/warn/Wstringop-overflow-4.C:144, PR106120

The conditions under which this this bogus warning is
emitted has changed to not happen for 32-bit targets
anymore. Adjust accordingly.

PR testsuite/106120
* g++.dg/warn/Wstringop-overflow-4.C:144 XFAIL bogus warning for
lp64 targets with c++98.

Daily bump.

hppa: Define MAX_FIXED_MODE_SIZE

Replace default define. We support TImode when TARGET_64BIT is true.

2023-11-22 John David Anglin <danglin@gcc.gnu.org>

gcc/ChangeLog:

PR target/112592
* config/pa/pa.h (MAX_FIXED_MODE_SIZE): Define.

hppa: Fix integer REG+D address reloads

I made a mistake in the previous change to integer_store_memory_operand.
There is no support pa_emit_move sequence to handle secondary reloads of
integer REG+D instructions.  Further, the Q constraint is used for some
non-simple instructions (movb and addib).  Thus, we need to return true
when reload is in progress.

2023-11-22  John David Anglin  <danglin@gcc.gnu.org>

gcc/ChangeLog:

PR target/112617
* config/pa/predicates.md (integer_store_memory_operand): Return
true for REG+D addresses when reload_in_progress is true.

c++: alias template of non-template class [PR112633]

The entering_scope adjustment in tsubst_aggr_type assumes if an alias is
dependent, then so is the aliased type (and therefore it has template info)
but that's not true for the dependent alias template specialization ty1<T>
below which aliases the non-template class A. In this case no adjustment
is needed anyway, so we can just punt.

PR c++/112633

gcc/cp/ChangeLog:

* pt.cc (tsubst_aggr_type): Handle empty TYPE_TEMPLATE_INFO
in the entering_scope adjustment.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/alias-decl-75.C: New test.

Adjust 'libgomp.c/declare-variant-{3,4}-[...]' for inter-procedural value range propagation

..., that is, commit 53ba8d669550d3a1f809048428b97ca607f95cf5
"inter-procedural value range propagation", after which we see:

    [-PASS:-]{+FAIL:+} libgomp.c/declare-variant-3-sm30.c scan-nvptx-none-offload-tree-dump optimized "= f30 \$\$;"

Etc.  That's due to:

    @@ -144,13 +144,11 @@
     __attribute__((omp target entrypoint, noclone))
     void main._omp_fn.0 (const struct .omp_data_t.3 & restrict .omp_data_i)
     {
    -  int _3;
       int * _5;

       <bb 2> [local count: 1073741824]:
    -  _3 = f30 ();
       _5 = *.omp_data_i_4(D).v;
    -  *_5 = _3;
    +  *_5 = 30;
       return;

It's nice to see this optimization work here, too, but it does interfere with
how we're currently testing OpenMP 'declare variant'.

libgomp/
* testsuite/libgomp.c/declare-variant-3.h (f30, f35, f53, f70)
(f75, f80, f): Add '__attribute__ ((noipa))'.
* testsuite/libgomp.c/declare-variant-4.h (gfx803, gfx900, gfx906)
(gfx908, gfx90a, f): Likewise.

testsuite: Update path to intl include.

When we are building libintl in-tree, we need to pass the path
to the generated libintl.h include to the plugin tests. This
path has changed with the use of gettext directly.

gcc/testsuite/ChangeLog:

* lib/plugin-support.exp: Update the expected path to an
in-tree build of libintl.

Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>

testsuite, Darwin: Add support for Mach-O function body scans.

We need to process the source slightly differently from ELF, especially
in that we have __USER_LABEL_PREFIX__ and there are no function start
and end assembler directives. This means we cannot delineate functions
when frame output is switched off.

TODO: consider adding -mtest-markers or something similar to inject
assembler comments that can be scanned for.

gcc/testsuite/ChangeLog:

* lib/scanasm.exp: Initial handling for Mach-O function body scans.

Signed-off-by: Iain Sandoe <iain@sandoe.co.uk>
Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>

tree-optimization/112344 - wrong final value replacement

When performing final value replacement chrec_apply that's used to
compute the overall effect of niters to a CHREC doesn't consider that
the overall increment of { -2147483648, +, 2 } doesn't fit in
a signed integer when the loop iterates until the value of the IV
of 20. The following fixes this mistake, carrying out the multiply
and add in an unsigned type instead, avoiding undefined overflow
and thus later miscompilation by path range analysis.

PR tree-optimization/112344
* tree-chrec.cc (chrec_apply): Perform the overall increment
calculation and increment in an unsigned type.

* gcc.dg/torture/pr112344.c: New testcase.

amdgcn: Fix vector TImode reload loop

I've only observed the problem on the devel/omp/gcc-13 branch, but this
could theoretically affect mainline also. The mov insns for the other modes
already have '$', so this completes the set.

gcc/ChangeLog:

* config/gcn/gcn-valu.md (*mov<mode>_4reg): Disparage AVGPR use when a
reload is required.

[IRA]: Fix using undefined dump file in IRA code during insn scheduling

Part of IRA code is used for register pressure sensitive insn
scheduling and live range shrinkage.  Numerous changes of IRA resulted
in that this IRA code uses dump file passed by the scheduler and
internal ira dump file (in called functions) which can be undefined or
freed by the scheduler during compiling previous functions.  The patch
fixes this problem.  To reproduce the error valgrind should be used
and GCC should be compiled with valgrind annotations.  Therefor the
patch does not contain the test case.

gcc/ChangeLog:

PR rtl-optimization/112610
* ira-costs.cc: (find_costs_and_classes): Remove arg.
Use ira_dump_file for printing.
(print_allocno_costs, print_pseudo_costs): Ditto.
(ira_costs): Adjust call of find_costs_and_classes.
(ira_set_pseudo_classes): Set up and restore ira_dump_file.

gcc.misc-tests/linkage-y.c: Compatibility with C99+ system compilers

This program is compiled with an installed "cc" compiler, not the
built GCC compiler, so it should be as compatible as possible across a
wide range of compilers.

gcc/testsuite/

* gcc.misc-tests/linkage-y.c (puts): Declare.
(main): Add int return type and return 0.

RISC-V: Fix incorrect use of vcompress in permutation auto-vectorization

This patch fixes following FAILs on zvl512b of RV32 system:

FAIL: gcc.target/riscv/rvv/autovec/struct/struct_vect_run-12.c execution test
FAIL: gcc.target/riscv/rvv/autovec/struct/struct_vect_run-9.c execution test

The root cause is that for permutation indice = {0,3,7,0} use vcompress optimization
which is incorrect. Fix vcompress optimization bug.

PR target/112598

gcc/ChangeLog:

* config/riscv/riscv-v.cc (shuffle_compress_patterns): Fix vcompress bug.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112598-3.c: New test.

Build: fix error in fixinclude configure

The stray line defining enable_darwin_at_rpath outside of the scope of
_LT_DARWIN_LINKER_FEATURES is a mistake and should be removed. It leads
to a wrong line in fixincludes/ChangeLog because there is no $1 argument
at that point.

ChangeLog:

* libtool.m4: Fix stray call

fixincludes/ChangeLog:

* configure: Regenerated.

AArch64: fix aarch64_usubw pattern

It looks like during my pre-commit testrun I forgot to apply this patch
to the patch stack. It had a typo in the element size.

It also looks like since the hi/lo operations take different element
counts for the assembler syntax that I can't have a unified pattern.

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md
(aarch64_uaddw<mode>_<PERM_EXTEND:perm_hilo>_zip,
aarch64_usubw<mode>_<PERM_EXTEND:perm_hilo>_zip): Split into...
(aarch64_uaddw<mode>_lo_zip, aarch64_uaddw<mode>_hi_zip,
"aarch64_usubw<mode>_lo_zip, "aarch64_usubw<mode>_hi_zip): ... This.
* config/aarch64/iterators.md (PERM_EXTEND, perm_index): Remove.
(perm_hilo): Remove UNSPEC_ZIP1, UNSPEC_ZIP2.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/uxtl-combine-4.c: Fix typo.
* gcc.target/aarch64/uxtl-combine-5.c: Likewise.
* gcc.target/aarch64/uxtl-combine-6.c: Likewise.

testsuite: Add testcase for already fixed PR112518

This PR has been fixed by the PR112526 fix.

2023-11-22 Jakub Jelinek <jakub@redhat.com>

PR target/112518
* gcc.target/i386/bmi2-pr112518.c: New test.

arm: [MVE intrinsics] Fix typo

In commt 0c2037d9d93a8f768cb11698ff794278246bb31f (Add support for
contiguous loads and stores), I added a spurious line which broke
bootstrap because of an unused variable error.

This patch removes it.

Committed as obvious.

2023-11-22 Christophe Lyon <christophe.lyon@linaro.org>

gcc/ChangeLog:

* config/arm/arm-mve-builtins.cc
(function_resolver::infer_pointer_type): Remove spurious line.

LoongArch: Optimize LSX vector shuffle on floating-point vector

The vec_perm expander was wrongly defined.  GCC internal says:

Operand 3 is the “selector”.  It is an integral mode vector of the same
width and number of elements as mode M.

But we made operand 3 in the same mode as the shuffled vectors, so it
would be a FP mode vector if the shuffled vectors are FP mode.

With this mistake, the generic code manages to work around and it ends
up creating some very nasty code for a simple __builtin_shuffle (a, b,
c) where a and b are V4SF, c is V4SI:

    la.local    $r12,.LANCHOR0
    la.local    $r13,.LANCHOR1
    vld $vr1,$r12,48
    vslli.w $vr1,$vr1,2
    vld $vr2,$r12,16
    vld $vr0,$r13,0
    vld $vr3,$r13,16
    vshuf.b $vr0,$vr1,$vr1,$vr0
    vld $vr1,$r12,32
    vadd.b  $vr0,$vr0,$vr3
    vandi.b $vr0,$vr0,31
    vshuf.b $vr0,$vr1,$vr2,$vr0
    vst $vr0,$r12,0
    jr  $r1

This is obviously stupid.  Fix the expander definition and adjust
loongarch_expand_vec_perm to handle it correctly.

gcc/ChangeLog:

* config/loongarch/lsx.md (vec_perm<mode:LSX>): Make the
selector VIMODE.
* config/loongarch/loongarch.cc (loongarch_expand_vec_perm):
Use the mode of the selector (instead of the shuffled vector)
for truncating it.  Operate on subregs in the selector mode if
the shuffled vector has a different mode (i. e. it's a
floating-point vector).

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/vect-shuf-fp.c: New test.

[APX PUSH2POP2] Adjust operand order for PUSH2POP2

The push2/pop2 operand order does not match the binutils implementation
for AT&T syntax that it will first push operands[2] then operands[1].
Correct it by reverse operand order for AT&T syntax.

gcc/ChangeLog:

* config/i386/i386.md (push2_di): Adjust operand order for AT&T
syntax.
(pop2_di): Likewise.
(push2p_di): Likewise.
(pop2p_di): Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/apx-push2pop2-1.c: Adjust output scan.
* gcc.target/i386/apx-push2pop2_force_drap-1.c: Likewise.

RISC-V: Fix permutation indice mode bug

This patch fixes following FAILs on zvl512b:
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-1.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-1.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-16.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-16.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-17.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-17.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-5.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-5.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-6.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/slp_run-6.c execution test

The root cause is that we are using vrgather.vv on vector QI mode which
is incorrect for zvl512b since it exceed 256.

Instead, we should use vrgatherei16.vv

PR target/112598

gcc/ChangeLog:

* config/riscv/riscv-v.cc (emit_vlmax_gather_insn): Adapt the priority.
(shuffle_generic_patterns): Fix permutation indice bug.
* config/riscv/vector-iterators.md: Fix VEI16 bug.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr112598-2.c: New test.

Support cbranchm for Vector HI/QImode.

gcc/ChangeLog:

* config/i386/sse.md (cbranch<mode>4): Extend to Vector
HI/QImode.

c++: start_preparsed_function tweak

In review of the deducing 'this' patch, it came up that the logic in
start_preparsed_function around the ctype variable was convoluted, being
set for non-static member functions and friends, but not for static member
functions. Let's set it for any member function, and not rely on it to
decide whether to set up 'this'.

gcc/cp/ChangeLog:

* decl.cc (start_preparsed_function): Clarify ctype logic.

PR target/111815: VAX: Only accept the index scaler as the RHS operand to ASHIFT

As from commit 9df1ba9a35b8 ("libbacktrace: support zstd decompression")
GCC for the `vax-netbsdelf' target fails to complete building, with an
ICE:

during RTL pass: final
.../libbacktrace/elf.c: In function 'elf_zstd_decompress':
.../libbacktrace/elf.c:5006:1: internal compiler error: in print_operand_address, at config/vax/vax.cc:514
5006 | }
      | ^
0x1113df97 print_operand_address(_IO_FILE*, rtx_def*)
.../gcc/config/vax/vax.cc:514
0x10c2489b default_print_operand_address(_IO_FILE*, machine_mode, rtx_def*)
.../gcc/targhooks.cc:373
0x106ddd0b output_address(machine_mode, rtx_def*)
.../gcc/final.cc:3648
0x106ddd0b output_asm_insn(char const*, rtx_def**)
.../gcc/final.cc:3505
0x106e2143 output_asm_insn(char const*, rtx_def**)
.../gcc/final.cc:3421
0x106e2143 final_scan_insn_1
.../gcc/final.cc:2841
0x106e28e3 final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
.../gcc/final.cc:2887
0x106e2bf7 final_1
.../gcc/final.cc:1979
0x106e3c67 rest_of_handle_final
.../gcc/final.cc:4240
0x106e3c67 execute
.../gcc/final.cc:4318
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

This is due to combine producing an invalid address RTX:

(plus:SI (ashift:SI (const_int 1 [0x1])
        (reg:QI 3 %r3 [1232]))
    (reg/v:SI 10 %r10 [orig:736 weight_mask ] [736]))

where the expression is ((1 << R3) + R10), which does not match a valid
machine addressing mode.  Consequently `print_operand_address' chokes.

This can be reduced to the testcase included, where it triggers the same
ICE in `p'.  Preincrements are required so that their results land in
registers and consequently an indexed addressing mode is tried or
otherwise doing operations piecemeal on stack-based function arguments
as direct input operands turns out more profitable in terms of RTX costs
and the ICE is avoided.

The ultimate cause has been commit c605a8bf9270 ("VAX: Accept ASHIFT in
address expressions"), where a shift of an immediate value by a register
has been mistakenly allowed as an index expression as if the shift
operation was commutative such as multiplication is.  So with ASHIFT the
scaler in an index expression has to be the right-hand operand, and the
backend has to enforce that, whereas with MULT the scaler can be either
operand.

Fix this by only accepting the index scaler as the RHS operand to
ASHIFT.

gcc/
PR target/111815
* config/vax/vax.cc (index_term_p): Only accept the index scaler
as the RHS operand to ASHIFT.

gcc/testsuite/
PR target/111815
* gcc.dg/torture/pr111815.c: New test.

RISC-V: Remove duplicate `order_operator' predicate

Remove our RISC-V-specific `order_operator' predicate, which is exactly
the same as generic `ordered_comparison_operator' one.

gcc/
* config/riscv/predicates.md (order_operator): Remove predicate.
* config/riscv/riscv.cc (riscv_rtx_costs): Update accordingly.
* config/riscv/riscv.md (*branch<mode>, *mov<GPR:mode><X:mode>cc)
(cstore<mode>4): Likewise.

RISC-V/testsuite: Add branchless cases for FP NE cond-add operation

Verify, for the generic floating-point NE conditional-add operation,
that if-conversion triggers via `noce_try_addcc' at `-mbranch-cost=3'
setting, which makes branchless code sequences emitted by if-conversion
cheaper than their original branched equivalents, and that extraneous
instructions such as SNEZ, etc. are not present in output.

The reason to XFAIL the SImode test for RV64 targets is GCC thinks it
has to sign-extend addends, which causes if-conversion to give up.

gcc/testsuite/
* gcc.target/riscv/adddifne.c: New test.
* gcc.target/riscv/addsifne.c: New test.

RISC-V/testsuite: Add branched cases for FP NE cond-add operation

Verify, for the generic floating-point NE conditional-add operation,
that if-conversion does *not* trigger at `-mbranch-cost=2' setting,
which makes original branched code sequences cheaper than their
branchless equivalents if-conversion would emit.

gcc/testsuite/
* gcc.target/riscv/adddibfne.c: New test.
* gcc.target/riscv/addsibfne.c: New test.

RISC-V/testsuite: Add branched cases for FP NE cond-move operations

Verify, for the floating-point NE conditional-move operation, that
if-conversion triggers via `noce_try_cmove' at the respective
sufficiently high `-mbranch-cost=' settings that make branchless code
sequences produced by if-conversion cheaper than their original branched
equivalents, and that extraneous instructions such as SNEZ, etc. are not
present in output.

gcc/testsuite/
* gcc.target/riscv/movdifeq-sfb.c: New test.
* gcc.target/riscv/movdifeq-thead.c: New test.
* gcc.target/riscv/movdifeq-ventana.c: New test.
* gcc.target/riscv/movdifeq-zicond.c: New test.
* gcc.target/riscv/movdifeq.c: New test.
* gcc.target/riscv/movsifeq-sfb.c: New test.
* gcc.target/riscv/movsifeq-thead.c: New test.
* gcc.target/riscv/movsifeq-ventana.c: New test.
* gcc.target/riscv/movsifeq-zicond.c: New test.
* gcc.target/riscv/movsifeq.c: New test.

RISC-V/testsuite: Add branched cases for FP NE cond-move operations

Verify, for generic, Ventana and Zicond targets and the floating-point
NE conditional-move operation, that if-conversion does *not* trigger at
the respective sufficiently low `-mbranch-cost=' settings that make
original branched code sequences cheaper than their branchless
equivalents if-conversion would emit.

gcc/testsuite/
* gcc.target/riscv/movdibfeq-ventana.c: New test.
* gcc.target/riscv/movdibfeq-zicond.c: New test.
* gcc.target/riscv/movdibfeq.c: New test.
* gcc.target/riscv/movsibfeq-ventana.c: New test.
* gcc.target/riscv/movsibfeq-zicond.c: New test.
* gcc.target/riscv/movsibfeq.c: New test.

RISC-V: Handle FP NE operator via inversion in cond-operation expansion

We have no FNE.fmt machine instructions, but we can emulate them for the
purpose of conditional-move and conditional-add operations by using the
respective FEQ.fmt instruction and then swapping the data input operands
or complementing the mask for the conditional addend respectively, so
update our handlers accordingly.

gcc/
* config/riscv/riscv-protos.h (riscv_expand_float_scc): Add
`invert_ptr' parameter.
* config/riscv/riscv.cc (riscv_emit_float_compare): Add NE
inversion handling.
(riscv_expand_float_scc): Pass `invert_ptr' through to
`riscv_emit_float_compare'.
(riscv_expand_conditional_move): Pass `&invert' to
`riscv_expand_float_scc'.
* config/riscv/riscv.md (add<mode>cc): Likewise.

RISC-V/testsuite: Add branchless cases for generic FP cond adds

Verify, for generic floating-point conditional-add operations that have
a corresponding conditional-set machine instruction, that if-conversion
triggers via `noce_try_addcc' at `-mbranch-cost=3' setting, which makes
branchless code sequences emitted by if-conversion cheaper than their
original branched equivalents, and that extraneous instructions such as
SNEZ, etc. are not present in output.

The reason to XFAIL SImode tests for RV64 targets is the compiler thinks
it has to sign-extend addends, which causes if-conversion to give up.

gcc/testsuite/
* gcc.target/riscv/adddifeq.c: New test.
* gcc.target/riscv/adddifge.c: New test.
* gcc.target/riscv/adddifgt.c: New test.
* gcc.target/riscv/adddifle.c: New test.
* gcc.target/riscv/adddiflt.c: New test.
* gcc.target/riscv/addsifeq.c: New test.
* gcc.target/riscv/addsifge.c: New test.
* gcc.target/riscv/addsifgt.c: New test.
* gcc.target/riscv/addsifle.c: New test.
* gcc.target/riscv/addsiflt.c: New test.

RISC-V/testsuite: Add branched cases for generic FP cond adds

Verify, for generic floating-point conditional-add operations that have
a corresponding conditional-set machine instruction, that if-conversion
does *not* trigger at `-mbranch-cost=2' setting, which makes original
branched code sequences cheaper than their branchless equivalents
if-conversion would emit. Cover all the relevant floating-point
relational operations to make sure no corner case escapes.

gcc/testsuite/
* gcc.target/riscv/adddibfeq.c: New test.
* gcc.target/riscv/adddibfge.c: New test.
* gcc.target/riscv/adddibfgt.c: New test.
* gcc.target/riscv/adddibfle.c: New test.
* gcc.target/riscv/adddibflt.c: New test.
* gcc.target/riscv/addsibfeq.c: New test.
* gcc.target/riscv/addsibfge.c: New test.
* gcc.target/riscv/addsibfgt.c: New test.
* gcc.target/riscv/addsibfle.c: New test.
* gcc.target/riscv/addsibflt.c: New test.

RISC-V/testsuite: Add branchless cases for generic FP cond moves

Verify, for generic floating-point conditional-move operations that have
a corresponding conditional-set machine instruction, that if-conversion
triggers (via `cond_move_convert_if_block', which doesn't report) at
`-mbranch-cost=5' setting, which makes branchless code sequences emitted
by if-conversion cheaper than their original branched equivalents, and
that extraneous instructions such as SNEZ, etc. are not present in
output.

gcc/testsuite/
* gcc.target/riscv/movdifge.c: New test.
* gcc.target/riscv/movdifgt.c: New test.
* gcc.target/riscv/movdifle.c: New test.
* gcc.target/riscv/movdiflt.c: New test.
* gcc.target/riscv/movdifne.c: New test.
* gcc.target/riscv/movsifge.c: New test.
* gcc.target/riscv/movsifgt.c: New test.
* gcc.target/riscv/movsifle.c: New test.
* gcc.target/riscv/movsiflt.c: New test.
* gcc.target/riscv/movsifne.c: New test.

RISC-V/testsuite: Add branched cases for generic FP cond moves

Verify, for generic floating-point conditional-move operations that have
a corresponding conditional-set machine instruction, that if-conversion
does *not* trigger at `-mbranch-cost=4' setting, which makes original
branched code sequences cheaper than their branchless equivalents
if-conversion would emit. Cover all the relevant floating-point
relational operations to make sure no corner case escapes.

gcc/testsuite/
* gcc.target/riscv/movdibfge.c: New test.
* gcc.target/riscv/movdibfgt.c: New test.
* gcc.target/riscv/movdibfle.c: New test.
* gcc.target/riscv/movdibflt.c: New test.
* gcc.target/riscv/movdibfne.c: New test.
* gcc.target/riscv/movsibfge.c: New test.
* gcc.target/riscv/movsibfgt.c: New test.
* gcc.target/riscv/movsibfle.c: New test.
* gcc.target/riscv/movsibflt.c: New test.
* gcc.target/riscv/movsibfne.c: New test.

RISC-V: Avoid extraneous integer comparison for FP comparisons

We have floating-point coditional-set machine instructions for a subset
of FP comparisons, so avoid going through a comparison against constant
zero in `riscv_expand_float_scc' where not necessary, preventing an
extraneous RTL instruction from being produced that counts against the
cost of the replacement branchless code sequence in if-conversion, e.g.:

(insn 29 6 30 2 (set (reg:DI 142)
        (ge:DI (reg/v:DF 135 [ w ])
            (reg/v:DF 136 [ x ]))) 297 {*cstoredfdi4}
     (nil))
(insn 30 29 31 2 (set (reg:DI 143)
        (ne:DI (reg:DI 142)
            (const_int 0 [0]))) 319 {*sne_zero_didi}
     (nil))
(insn 31 30 32 2 (set (reg:DI 141)
        (reg:DI 143)) 206 {*movdi_64bit}
     (nil))
(insn 32 31 33 2 (set (reg:DI 144)
        (neg:DI (reg:DI 141))) 15 {negdi2}
     (nil))
(insn 33 32 34 2 (set (reg:DI 145)
        (and:DI (reg:DI 144)
            (reg/v:DI 137 [ y ]))) 102 {*anddi3}
     (nil))
(insn 34 33 35 2 (set (reg:DI 146)
        (not:DI (reg:DI 144))) 111 {one_cmpldi2}
     (nil))
(insn 35 34 36 2 (set (reg:DI 147)
        (and:DI (reg:DI 146)
            (reg/v:DI 138 [ z ]))) 102 {*anddi3}
     (nil))
(insn 36 35 21 2 (set (reg/v:DI 138 [ z ])
        (ior:DI (reg:DI 145)
            (reg:DI 147))) 105 {iordi3}
     (nil))

where the second insn effectively just copies its input.  This now gets
simplified to:

(insn 29 6 30 2 (set (reg:DI 141)
        (ge:DI (reg/v:DF 135 [ w ])
            (reg/v:DF 136 [ x ]))) 297 {*cstoredfdi4}
     (nil))
(insn 30 29 31 2 (set (reg:DI 142)
        (neg:DI (reg:DI 141))) 15 {negdi2}
     (nil))
(insn 31 30 32 2 (set (reg:DI 143)
        (and:DI (reg:DI 142)
            (reg/v:DI 137 [ y ]))) 102 {*anddi3}
     (nil))
(insn 32 31 33 2 (set (reg:DI 144)
        (not:DI (reg:DI 142))) 111 {one_cmpldi2}
     (nil))
(insn 33 32 34 2 (set (reg:DI 145)
        (and:DI (reg:DI 144)
            (reg/v:DI 138 [ z ]))) 102 {*anddi3}
     (nil))
(insn 34 33 21 2 (set (reg/v:DI 138 [ z ])
        (ior:DI (reg:DI 143)
            (reg:DI 145))) 105 {iordi3}
     (nil))

lowering the cost of the code sequence produced (even though combine
would swallow the second insn anyway).

We still need to produce a comparison against constant zero where the
instruction following a floating-point coditional-set operation is a
branch, so add canonicalization to `riscv_expand_conditional_branch'
instead.

gcc/
* config/riscv/riscv.cc (riscv_emit_float_compare) <NE>: Handle
separately.
<EQ, LE, LT, GE, GT>: Return operands supplied as is.
(riscv_emit_binary): Call `riscv_emit_binary' directly rather
than going through a temporary register for word-mode targets.
(riscv_expand_conditional_branch): Canonicalize the comparison
if not against constant zero.

RISC-V: Provide FP conditional-branch instructions for if-conversion

Do not expand floating-point conditional-branch RTL instructions right
away that use a comparison operation that is either directly available
as a machine conditional-set instruction or is NE, which can be emulated
by EQ.  This is so that if-conversion sees them in their original form
and can produce fewer operations tried in a branchless code sequence
compared to when such an instruction has been already converted to a
sequence of a floating-point conditional-set RTL instruction followed by
an integer conditional-branch RTL instruction.  Split any floating-point
conditional-branch RTL instructions still remaining after reload then.

Adjust the testsuite accordingly: since the middle end uses the inverse
condition internally, an inverse conditional-set instruction may make it
to assembly output and also `cond_move_process_if_block' will be used by
if-conversion rather than `noce_process_if_block', because the latter
function not yet been updated to handle inverted conditions.

gcc/
* config/riscv/predicates.md (ne_operator): New predicate.
* config/riscv/riscv.cc (riscv_insn_cost): Handle branches on a
floating-point condition.
* config/riscv/riscv.md (@cbranch<mode>4): Rename expander to...
(@cbranch<ANYF:mode>4): ... this.  Only expand the RTX via
`riscv_expand_conditional_branch' for `!signed_order_operator'
operators, otherwise let it through.
(*cbranch<ANYF:mode>4, *cbranch<ANYF:mode>4): New insns and
splitters.

gcc/testsuite/
* gcc.target/riscv/movdifge-sfb.c: Reject "if-conversion
succeeded through" rather than accepting it.
* gcc.target/riscv/movdifge-thead.c: Likewise.
* gcc.target/riscv/movdifge-ventana.c: Likewise.
* gcc.target/riscv/movdifge-zicond.c: Likewise.
* gcc.target/riscv/movdifgt-sfb.c: Likewise.
* gcc.target/riscv/movdifgt-thead.c: Likewise.
* gcc.target/riscv/movdifgt-ventana.c: Likewise.
* gcc.target/riscv/movdifgt-zicond.c: Likewise.
* gcc.target/riscv/movdifle-sfb.c: Likewise.
* gcc.target/riscv/movdifle-thead.c: Likewise.
* gcc.target/riscv/movdifle-ventana.c: Likewise.
* gcc.target/riscv/movdifle-zicond.c: Likewise.
* gcc.target/riscv/movdiflt-sfb.c: Likewise.
* gcc.target/riscv/movdiflt-thead.c: Likewise.
* gcc.target/riscv/movdiflt-ventana.c: Likewise.
* gcc.target/riscv/movdiflt-zicond.c: Likewise.
* gcc.target/riscv/movsifge-sfb.c: Likewise.
* gcc.target/riscv/movsifge-thead.c: Likewise.
* gcc.target/riscv/movsifge-ventana.c: Likewise.
* gcc.target/riscv/movsifge-zicond.c: Likewise.
* gcc.target/riscv/movsifgt-sfb.c: Likewise.
* gcc.target/riscv/movsifgt-thead.c: Likewise.
* gcc.target/riscv/movsifgt-ventana.c: Likewise.
* gcc.target/riscv/movsifgt-zicond.c: Likewise.
* gcc.target/riscv/movsifle-sfb.c: Likewise.
* gcc.target/riscv/movsifle-thead.c: Likewise.
* gcc.target/riscv/movsifle-ventana.c: Likewise.
* gcc.target/riscv/movsifle-zicond.c: Likewise.
* gcc.target/riscv/movsiflt-sfb.c: Likewise.
* gcc.target/riscv/movsiflt-thead.c: Likewise.
* gcc.target/riscv/movsiflt-ventana.c: Likewise.
* gcc.target/riscv/movsiflt-zicond.c: Likewise.
* gcc.target/riscv/smax-ieee.c: Also accept FLT.D.
* gcc.target/riscv/smaxf-ieee.c: Also accept FLT.S.
* gcc.target/riscv/smin-ieee.c: Also accept FGT.D.
* gcc.target/riscv/sminf-ieee.c: Also accept FGT.S.