git.ipfire.org Git - thirdparty/gcc.git/log

Fix up bootstrap with GCC 4.[89] after RAII auto_mpfr and autp_mpz [PR109589]

On Tue, Apr 18, 2023 at 03:39:41PM +0200, Richard Biener via Gcc-patches wrote:
> The following adds two RAII classes, one for mpz_t and one for mpfr_t
> making object lifetime management easier. Both formerly require
> explicit initialization with {mpz,mpfr}_init and release with
> {mpz,mpfr}_clear.

This unfortunately broke bootstrap when using GCC 4.8.x or 4.9.x as
it uses deleted friends which weren't supported until PR62101 fixed
them in 2014 for GCC 5.

The following patch adds an workaround, not deleting those friends
for those old versions.
While it means if people add those mp*_{init{,2},clear} calls on auto_mp*
objects they won't notice when doing non-bootstrap builds using
very old system compilers, people should be bootstrapping their changes
and it will be caught during bootstraps even when starting with those
old compilers, plus most people actually use much newer compilers
when developing.

2023-04-22 Jakub Jelinek <jakub@redhat.com>

PR bootstrap/109589
* system.h (class auto_mpz): Workaround PR62101 bug in GCC 4.8 and 4.9.
* realmpfr.h (class auto_mpfr): Likewise.

Adjust rx movsicc tests

The rx port has target specific test movsicc which is naturally meant to verify
that if-conversion is happening on the expected cases.

Unfortunately the test is poorly written.  The core problem is there are 8
distinct tests and each of those tests is expected to generate a specific
sequence.  Unfortunately, various generic bits might turn an equality test
into an inequality test or make other similar changes.

The net result is the assembly matching patterns may find a particular sequence,
but it may be for a different function than was originally intended.  ie,
test1's output may match the expected assembly for test5.  Ugh!

This patch breaks the movsicc test down into 8 distinct tests and adjusts the
patterns they match.  The nice thing is all these tests are supposed to have
branches that use a bCC 1f form.  So we can make them a bit more robust by
ignoring the actual condition code used.  So if we change eq to ne, as long
as we match the movsicc pattern, we're OK.  And the 1f style is only used by
the movsicc pattern.

With the tests broken down it's a lot easier to diagnose why one test fails
after the recent changes to if-conversion.  movsicc-3 fails because of the
profitability test.  It's more expensive than the other cases because of its
use of (const_int 10) rather than (const_int 0).  (const_int 0) naturally has
a smaller cost.

It looks to me like in this context (const_int 10) should have the same cost
as (const_int 0).  But I'm nowhere near well versed in the cost model for the
rx port.  So I'm just leaving the test as xfailed.  If someone cares enough,
they can dig into it further.

gcc/testsuite
* gcc.target/rx/movsicc.c: Broken down into ...
* gcc.target/rx/movsicc-1.c: Here.
* gcc.target/rx/movsicc-2.c: Here.
* gcc.target/rx/movsicc-3.c: Here.  xfail one test.
* gcc.target/rx/movsicc-4.c: Here.
* gcc.target/rx/movsicc-5.c: Here.
* gcc.target/rx/movsicc-6.c: Here.
* gcc.target/rx/movsicc-7.c: Here.
* gcc.target/rx/movsicc-8.c: Here.

match.pd: Fix fneg/fadd optimization [PR109583]

The following testcase ICEs on x86, foo function since my r14-22
improvement, but bar already since r13-4122.  The problem is the same,
in the if expression related_vector_mode is called and that starts with
  gcc_assert (VECTOR_MODE_P (vector_mode));
but nothing in the fneg/fadd match.pd pattern actually checks if the
VEC_PERM type has VECTOR_MODE_P (vec_mode).  In this case it has BLKmode
and so it ICEs.

The following patch makes sure we don't ICE on it.

2023-04-22  Jakub Jelinek  <jakub@redhat.com>

PR tree-optimization/109583
* match.pd (fneg/fadd simplify): Don't call related_vector_mode
if vec_mode is not VECTOR_MODE_P.

* gcc.dg/pr109583.c: New test.

Update loop estimate after header duplication

Loop header copying implements partial loop peelng.  If all exits of the loop
are peeled (which is the common case) the number of iterations decreases by 1.
Without noting this, for loops iterating zero times, we end up re-peeling them
later in the loop peeling pass which is wasteful.

This patch commonizes the code for esitmate update and adds logic to detect
when all (likely) exits were peeled by loop-ch.

We are still wrong about update of estimate however: if the exits behave
randomly with given probability, loop peeling does not decrease expected
iteration counts, just decreases probability that loop will be executed.
In this case we thus incorrectly decrease any_estimate. Doing so however
at least help us to not peel or optimize hard the lop later.

If the loop iterates precisely the estimated nuner of iterations. the estimate
decreases, but we are wrong about decreasing the header frequncy.  We already
have logic that tries to prove that loop exit will not be taken in peeled out
iterations and it may make sense to special case this.

I also fixed problem where we had off by one error in iteration count updating.
It makes perfect sense to expect loop to have 0 iterations.  However if bounds
drops to negative, we lose info about the loop behaviour (since we have no
profile data reaching the loop body).

Bootstrapped/regtested x86_64-linux, comitted.
Honza

gcc/ChangeLog:

2023-04-22  Jan Hubicka  <hubicka@ucw.cz>
    Ondrej Kubanek  <kubanek0ondrej@gmail.com>

* cfgloopmanip.h (adjust_loop_info_after_peeling): Declare.
* tree-ssa-loop-ch.cc (ch_base::copy_headers): Fix updating of
loop profile and bounds after header duplication.
* tree-ssa-loop-ivcanon.cc (adjust_loop_info_after_peeling):
Break out from try_peel_loop; fix handling of 0 iterations.
(try_peel_loop): Use adjust_loop_info_after_peeling.

gcc/testsuite/ChangeLog:

2023-04-22  Jan Hubicka  <hubicka@ucw.cz>
    Ondrej Kubanek  <kubanek0ondrej@gmail.com>

* gcc.dg/tree-ssa/peel1.c: Decrease number of peels by 1.
* gcc.dg/unroll-8.c: Decrease loop iteration estimate.
* gcc.dg/tree-prof/peel-2.c: New test.

Daily bump.

Do not fold ADDR_EXPR conditions leading to builtin_unreachable early.

Ranges can not represent &var globally yet, so we cannot fold these
expressions early or we lose the __builtin_unreachable information.

PR tree-optimization/109546
gcc/
* tree-vrp.cc (remove_unreachable::remove_and_update_globals): Do
not fold conditions with ADDR_EXPR early.

gcc/testsuite/
* gcc.dg/pr109546.c: New.

c++: fix 'unsigned typedef-name' extension [PR108099]

In the comments for PR108099 Jakub provided some testcases that demonstrated
that even before the regression noted in the patch we were getting the
semantics of this extension wrong: in the unsigned case we weren't producing
the corresponding standard unsigned type but another distinct one of the
same size, and in the signed case we were just dropping it on the floor and
not actually returning a signed type at all.

The former issue is fixed by using c_common_signed_or_unsigned_type instead
of unsigned_type_for, and the latter issue by adding a (signed_p &&
typedef_decl) case.

This patch introduces a failure on std/ranges/iota/max_size_type.cc due to
the latter issue, since the testcase expects 'signed rep_t' to do something
sensible, and previously we didn't. Now that we do, it exposes a bug in the
__max_diff_type::operator>>= handling of sign extension: when we evaluate
-1000 >> 2 in __max_diff_type we keep the MSB set, but leave the
second-most-significant bit cleared.

PR c++/108099

gcc/cp/ChangeLog:

* decl.cc (grokdeclarator): Don't clear typedef_decl after 'unsigned
typedef' pedwarn. Use c_common_signed_or_unsigned_type. Also
handle 'signed typedef'.

gcc/testsuite/ChangeLog:

* g++.dg/ext/int128-8.C: Remove xfailed dg-bogus markers.
* g++.dg/ext/unsigned-typedef2.C: New test.
* g++.dg/ext/unsigned-typedef3.C: New test.

configure: Only create serdep.tmp if needed

There's no reason to create this file if none of the serial configure
options are passed.

ChangeLog:

* configure: Regenerate.
* configure.ac: Only create serdep.tmp if needed

gcc/m2: Drop references to $(P)

$(P) seems to have been a workaround for some old, proprietary make
implementations that we no longer support. It was removed in
r0-31149-gb8dad04b688e9c.

gcc/m2/ChangeLog:

* Make-lang.in: Remove references to $(P).
* Make-maintainer.in: Ditto.

Adjust x86 testsuite for recent if-conversion cost checking

gcc/testsuite
PR testsuite/109549
* gcc.target/i386/cmov6.c: No longer expect this test to
generate 'cmov' instructions.

aarch64: Emit single-instruction for smin (x, 0) and smax (x, 0)

Motivated by https://reviews.llvm.org/D148249, we can expand to a single instruction
for the SMIN (x, 0) and SMAX (x, 0) cases using the combined AND/BIC and ASR operations.
Given that we already have well-fitting TARGET_CSSC patterns and expanders for the min/max codes
in the backend this patch does some minor refactoring to ensure we emit the right SMAX/SMIN RTL codes
for TARGET_CSSC, fall back to the generic expanders or emit a simple SMIN/SMAX with 0 RTX for !TARGET_CSSC
that is now matched by a separate pattern.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64.md (aarch64_umax<mode>3_insn): Delete.
(umax<mode>3): Emit raw UMAX RTL instead of going through gen_ function
for umax.
(<optab><mode>3): New define_expand for MAXMIN_NOUMAX codes.
(*aarch64_<optab><mode>3_zero): Define.
(*aarch64_<optab><mode>3_cssc): Likewise.
* config/aarch64/iterators.md (maxminand): New code attribute.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/sminmax-asr_1.c: New test.

PR target/108779 aarch64: Implement -mtp= option

A user has requested that we support the -mtp= option in aarch64 GCC for changing
the TPIDR register to read for TLS accesses. I'm not a big fan of the option name,
but we already support it in the arm port and Clang supports it for AArch64 already,
where it accepts the 'el0', 'el1', 'el2', 'el3' values.

This patch implements the same functionality in GCC.

Bootstrapped and tested on aarch64-none-linux-gnu.
Confirmed with godbolt that the sequences and options are the same as what Clang accepts/generates.

gcc/ChangeLog:

PR target/108779
* config/aarch64/aarch64-opts.h (enum aarch64_tp_reg): Define.
* config/aarch64/aarch64-protos.h (aarch64_output_load_tp):
Define prototype.
* config/aarch64/aarch64.cc (aarch64_tpidr_register): Declare.
(aarch64_override_options_internal): Handle the above.
(aarch64_output_load_tp): New function.
* config/aarch64/aarch64.md (aarch64_load_tp_hard): Call
aarch64_output_load_tp.
* config/aarch64/aarch64.opt (aarch64_tp_reg): Define enum.
(mtp=): New option.
* doc/invoke.texi (AArch64 Options): Document -mtp=.

gcc/testsuite/ChangeLog:

PR target/108779
* gcc.target/aarch64/mtp.c: New test.
* gcc.target/aarch64/mtp_1.c: New test.
* gcc.target/aarch64/mtp_2.c: New test.
* gcc.target/aarch64/mtp_3.c: New test.
* gcc.target/aarch64/mtp_4.c: New test.

aarch64: PR target/99195 Add scheme to optimise away vec_concat with zeroes on 64-bit Advanced SIMD ops

I finally got around to trying out the define_subst approach for PR target/99195.
The problem we have is that many Advanced SIMD instructions have 64-bit vector variants that
clear the top half of the 128-bit Q register. This would allow the compiler to avoid generating
explicit zeroing instructions to concat the 64-bit result with zeroes for code like:
vcombine_u16(vadd_u16(a, b), vdup_n_u16(0))
We've been getting user reports of GCC missing this optimisation in real world code, so it's worth
doing something about it.
The straightforward approach that we've been taking so far is adding extra patterns in aarch64-simd.md
that match the 64-bit result in a vec_concat with zeroes. Unfortunately for big-endian the vec_concat
operands to match have to be the other way around, so we would end up adding two extra define_insns.
This would lead to too much bloat in aarch64-simd.md

This patch defines a pair of define_subst constructs that allow us to annotate patterns in aarch64-simd.md
with the <vczle> and <vczbe> subst_attrs and the compiler will automatically produce the vec_concat widening patterns,
properly gated for BYTES_BIG_ENDIAN when needed. This seems like the least intrusive way to describe the extra zeroing semantics.

I've had a look at the generated insn-*.cc files in the build directory and it seems that define_subst does what we want it to do
when applied multiple times on a pattern in terms of insn conditions and modes.

This patch adds the define_subst machinery and adds the annotations to some of the straightforward binary and unary integer
operations. Many more such annotations are possible and I aim add them in future patches if this approach is acceptable.

Bootstrapped and tested on aarch64-none-linux-gnu and on aarch64_be-none-elf.

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (add_vec_concat_subst_le): Define.
(add_vec_concat_subst_be): Likewise.
(vczle): Likewise.
(vczbe): Likewise.
(add<mode>3): Rename to...
(add<mode>3<vczle><vczbe>): ... This.
(sub<mode>3): Rename to...
(sub<mode>3<vczle><vczbe>): ... This.
(mul<mode>3): Rename to...
(mul<mode>3<vczle><vczbe>): ... This.
(and<mode>3): Rename to...
(and<mode>3<vczle><vczbe>): ... This.
(ior<mode>3): Rename to...
(ior<mode>3<vczle><vczbe>): ... This.
(xor<mode>3): Rename to...
(xor<mode>3<vczle><vczbe>): ... This.
* config/aarch64/iterators.md (VDZ): Define.

gcc/testsuite/ChangeLog:

PR target/99195
* gcc.target/aarch64/simd/pr99195_1.c: New test.

c++, tree: optimize walk_tree_1 and cp_walk_subtrees

These functions currently repeatedly dereference tp during the subtree
walks, dereferences which the compiler can't CSE because it can't
guarantee that the subtree walking doesn't modify *tp.

But we already implicitly require that TREE_CODE (*tp) remains the same
throughout the subtree walks, so it doesn't seem to be a huge leap to
strengthen that to requiring *tp remains the same.

So this patch manually CSEs the dereferences of *tp. This means that a
callback function can no longer replace *tp with another tree (of the
same TREE_CODE) when walking one of its subtrees, but that doesn't sound
like a useful capability anyway.

gcc/cp/ChangeLog:

* tree.cc (cp_walk_subtrees): Avoid repeatedly dereferencing tp.
<case DECLTYPE_TYPE>: Use cp_unevaluated and WALK_SUBTREE.
<case ALIGNOF_EXPR etc>: Likewise.

gcc/ChangeLog:

* tree.cc (walk_tree_1): Avoid repeatedly dereferencing tp
and type_p.

Add Ajit Kumar Agarwal to write after approval

2023-04-21 Ajit Kumar Agarwal <aagarwa1@linux.ibm.com>

ChangeLog:

* MAINTAINERS (Write After Approval): Add myself.

Fix boostrap failure in tree-ssa-loop-ch.cc

I managed to mix up patch and its WIP version in previous commit.
This patch adds the missing edge iterator and also fixes a side
case where new loop header would have multiple latches.

gcc/ChangeLog:

* tree-ssa-loop-ch.cc (ch_base::copy_headers): Fix previous
commit.

expansion: make layout of x_shift*cost[][][] more efficient

when debugging expmed.[ch] for PR/108987 saw that some of the cost arrays have
less than ideal layout as follows:

x_shift*cost[0..63][speed][modes]

We would want speed to be first index since a typical compile will have
that fixed, followed by mode and then the shift values.

It should be non-functional from compiler semantics pov, except
executing slightly faster due to better locality of shift values for
given speed and mode. And also a bit more intutive when debugging.

gcc/Changelog:

* expmed.h (x_shift*_cost): convert to int [speed][mode][shift].
(shift*_cost_ptr ()): Access x_shift*_cost array directly.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>

MAINTAINERS: add Vineet Gupta to write after approval

ChangeLog:

* MAINTAINERS (Write After Approval): Add myself.
Ref: <680c7bbe-5d6e-07cd-8468-247afc65e1dd@gmail.com>

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>

[aarch64] Use force_reg instead of copy_to_mode_reg.

Use force_reg instead of copy_to_mode_reg in aarch64_simd_dup_constant
and aarch64_expand_vector_init to avoid creating pseudo if original value
is already in a register.

gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_simd_dup_constant): Use
force_reg instead of copy_to_mode_reg.
(aarch64_expand_vector_init): Likewise.

i386: Remove REG_OK_FOR_INDEX/REG_OK_FOR_BASE and their derivatives

x86 was converted to TARGET_LEGITIMATE_ADDRESS_P long ago.  Remove
remnants of the conversion.  Also, cleanup the remaining macros a bit
by introducing INDEX_REGNO_P macro.

No functional change.

gcc/ChangeLog:

2023-04-21  Uroš Bizjak  <ubizjak@gmail.com>

* config/i386/i386.h (REG_OK_FOR_INDEX_P, REG_OK_FOR_BASE_P): Remove.
(REG_OK_FOR_INDEX_NONSTRICT_P,  REG_OK_FOR_BASE_NONSTRICT_P): Ditto.
(REG_OK_FOR_INDEX_STRICT_P, REG_OK_FOR_BASE_STRICT_P): Ditto.

(FIRST_INDEX_REG, LAST_INDEX_REG): New defines.
(LEGACY_INDEX_REG_P, LEGACY_INDEX_REGNO_P): New macros.
(INDEX_REG_P, INDEX_REGNO_P): Ditto.

(REGNO_OK_FOR_INDEX_P): Use INDEX_REGNO_P predicates.

(REGNO_OK_FOR_INDEX_NONSTRICT_P): New macro.
(EG_OK_FOR_BASE_NONSTRICT_P): Ditto.

* config/i386/predicates.md (index_register_operand):
Use REGNO_OK_FOR_INDEX_P and REGNO_OK_FOR_INDEX_NONSTRICT_P macros.

* config/i386/i386.cc (ix86_legitimate_address_p): Use
REGNO_OK_FOR_BASE_P, REGNO_OK_FOR_BASE_NONSTRICT_P,
REGNO_OK_FOR_INDEX_P and REGNO_OK_FOR_INDEX_NONSTRICT_P macros.

Fix latent bug in loop header copying which forgets to update the loop header pointer

gcc/ChangeLog:

2023-04-21 Jan Hubicka <hubicka@ucw.cz>
Ondrej Kubanek <kubanek0ondrej@gmail.com>

* tree-ssa-loop-ch.cc (ch_base::copy_headers): Update loop header and
latch.

Add safe_is_a

The following adds safe_is_a, an is_a check handling nullptr
gracefully.

* is-a.h (safe_is_a): New.

Add operator* to gimple_stmt_iterator and gphi_iterator

This allows STL style iterator dereference. It's the same
as gsi_stmt () or .phi ().

* gimple-iterator.h (gimple_stmt_iterator::operator*): Add.
(gphi_iterator::operator*): Likewise.

Stabilize inliner

The Fibonacci heap can change its behaviour quite significantly for no good
reasons when multiple edges with same key occurs.  This is quite common
for small functions.

This patch stabilizes the order by adding edge uids into the info.
Again I think this is good idea regardless of the incremental WPA project
since we do not want random changes in inline decisions.

gcc/ChangeLog:

2023-04-21  Jan Hubicka  <hubicka@ucw.cz>
    Michal Jires  <michal@jires.eu>

* ipa-inline.cc (class inline_badness): New class.
(edge_heap_t, edge_heap_node_t): Use inline_badness for badness instead
of sreal.
(update_edge_key): Update.
(lookup_recursive_calls): Likewise.
(recursive_inlining): Likewise.
(add_new_edges_to_heap): Likewise.
(inline_small_functions): Likewise.

Cleanup odr_types_equivalent_p

gcc/ChangeLog:

2023-04-21 Jan Hubicka <hubicka@ucw.cz>

* ipa-devirt.cc (odr_types_equivalent_p): Cleanup warned checks.

PR modula2/109586 cc1gm2 ICE when compiling large source files.

The function m2block_RememberConstant calls m2tree_IsAConstant.
However IsAConstant does not recognise TREE_CODE(t) ==
CONSTRUCTOR as a constant. Without this patch CONSTRUCTOR
contants are garbage collected (and not preserved) resulting in
a corrupt tree and crash.

gcc/m2/ChangeLog:

PR modula2/109586
* gm2-gcc/m2tree.cc (m2tree_IsAConstant): Add (TREE_CODE
(t) == CONSTRUCTOR) to expression.

Signed-off-by: Gaius Mulley <gaiusmod2@gmail.com>

tree-optimization/109573 - avoid ICEing on unexpected live def

The following relaxes the assert in vectorizable_live_operation
where we catch currently unhandled cases to also allow an
intermediate copy as it happens here but also relax the assert
to checking only.

PR tree-optimization/109573
* tree-vect-loop.cc (vectorizable_live_operation): Allow
unhandled SSA copy as well. Demote assert to checking only.

* g++.dg/vect/pr109573.cc: New testcase.

Use correct CFG orders for DF worklist processing

This adjusts the remaining three RPO computes in DF.  The DF_FORWARD
problems should use a RPO on the forward graph, the DF_BACKWARD
problems should use a RPO on the inverted graph.

Conveniently now inverted_rev_post_order_compute computes a RPO.
We still use post_order_compute and reverse its order for its
side-effect of deleting unreachable blocks.

This resuls in an overall reduction on visited blocks on cc1files by 5.2%.

Because on the reverse CFG most regions are irreducible, there's
few cases the number of visited blocks increases.  For the set
of cc1files I have this is for et-forest.i, graph.i, hwint.i,
tree-ssa-dom.i, tree-ssa-loop-ch.i and tree-ssa-threadedge.i.  For
tree-ssa-dse.i it's off-noise and I've more closely investigated
and figured it is really bad luck due to the irreducibility.

* df-core.cc (df_analyze): Compute RPO on the reverse graph
for DF_BACKWARD problems.
(loop_post_order_compute): Rename to ...
(loop_rev_post_order_compute): ... this, compute a RPO.
(loop_inverted_post_order_compute): Rename to ...
(loop_inverted_rev_post_order_compute): ... this, compute a RPO.
(df_analyze_loop): Use RPO on the forward graph for DF_FORWARD
problems, RPO on the inverted graph for DF_BACKWARD.

change inverted_post_order_compute to inverted_rev_post_order_compute

The following changes the inverted_post_order_compute API back to
a plain C array interface and computing a reverse post order since
that's what's always required.  It will make massaging DF to use
the correct iteration orders easier.  Elsewhere it requires turning
backward iteration over the computed order with forward iteration.

* cfganal.h (inverted_rev_post_order_compute): Rename
from ...
(inverted_post_order_compute): ... this.  Add struct function
argument, change allocation to a C array.
* cfganal.cc (inverted_rev_post_order_compute): Likewise.
* lcm.cc (compute_antinout_edge): Adjust.
* lra-lives.cc (lra_create_live_ranges_1): Likewise.
* tree-ssa-dce.cc (remove_dead_stmt): Likewise.
* tree-ssa-pre.cc (compute_antic): Likewise.

change DF to use the proper CFG order for DF_FORWARD problems

This changes DF to use RPO on the forward graph for DF_FORWARD
problems.  While that naturally maps to pre_and_rev_postorder_compute
we use the existing (wrong) CFG order for DF_BACKWARD problems
computed by post_order_compute since that provides the required
side-effect of deleting unreachable blocks.

The change requires turning the inconsistent vec<int> vs int * back
to consistent int *.  A followup patch will change the
inverted_post_order_compute API and change the DF_BACKWARD problem
to use the correct RPO on the backward graph together with statistics
I produced last year for the combined effect.

* df.h (df_d::postorder_inverted): Change back to int *,
clarify comments.
* df-core.cc (rest_of_handle_df_finish): Adjust.
(df_analyze_1): Likewise.
(df_analyze): For DF_FORWARD problems use RPO on the forward
graph.  Adjust.
(loop_inverted_post_order_compute): Adjust API.
(df_analyze_loop): Adjust.
(df_get_n_blocks): Likewise.
(df_get_postorder): Likewise.

RISC-V: Defer vsetvli insertion to later if possible [PR108270]

Fix issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108270.

Consider the following testcase:
void f (void * restrict in, void * restrict out, int l, int n, int m)
{
  for (int i = 0; i < l; i++){
    for (int j = 0; j < m; j++){
      for (int k = 0; k < n; k++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, 17);
  __riscv_vse8_v_i8mf8 (out + i + j, v, 17);
}
    }
  }
}

Compile option: -O3

Before this patch:
mv      a7,a2
mv      a6,a0
mv      t1,a1
mv      a2,a3
vsetivli zero,17,e8,mf8,ta,ma
ble     a7,zero,.L1
ble     a4,zero,.L1
ble     a3,zero,.L1
...

After this patch:
mv      a7,a2
mv      a6,a0
mv      t1,a1
mv      a2,a3
ble     a7,zero,.L1
ble     a4,zero,.L1
ble     a3,zero,.L1
add     a1,a0,a4
li      a0,0
vsetivli zero,17,e8,mf8,ta,ma
...

This issue is a missed optmization produced by Phase 3 global backward demand
fusion instead of LCM.

This patch is fixing poor placement of the vsetvl.

This point is seletected not because LCM but by Phase 3 (VL/VTYPE demand info
backward fusion and propogation) which
is I introduced into VSETVL PASS to enhance LCM && improve vsetvl instruction
performance.

This patch is to supress the Phase 3 too aggressive backward fusion and
propagation to the top of the function program
when there is no define instruction of AVL (AVL is 0 ~ 31 imm since vsetivli
instruction allows imm value instead of reg).

You may want to ask why we need Phase 3 to the job.
Well, we have so many situations that pure LCM fails to optimize, here I can
show you a simple case to demonstrate it:

void f (void * restrict in, void * restrict out, int n, int m, int cond)
{
  size_t vl = 101;
  for (size_t j = 0; j < m; j++){
    if (cond) {
      for (size_t i = 0; i < n; i++)
{
  vint8mf8_t v = __riscv_vle8_v_i8mf8 (in + i + j, vl);
  __riscv_vse8_v_i8mf8 (out + i, v, vl);
}
    } else {
      for (size_t i = 0; i < n; i++)
{
  vint32mf2_t v = __riscv_vle32_v_i32mf2 (in + i + j, vl);
  v = __riscv_vadd_vv_i32mf2 (v,v,vl);
  __riscv_vse32_v_i32mf2 (out + i, v, vl);
}
    }
  }
}

You can see:
The first inner loop needs vsetvli e8 mf8 for vle+vse.
The second inner loop need vsetvli e32 mf2 for vle+vadd+vse.

If we don't have Phase 3 (Only handled by LCM (Phase 4)), we will end up with :

outerloop:
...
vsetvli e8mf8
inner loop 1:
....

vsetvli e32mf2
inner loop 2:
....

However, if we have Phase 3, Phase 3 is going to fuse the vsetvli e32 mf2 of
inner loop 2 into vsetvli e8 mf8, then we will end up with this result after
phase 3:

outerloop:
...
inner loop 1:
vsetvli e32mf2
....

inner loop 2:
vsetvli e32mf2
....

Then, this demand information after phase 3 will be well optimized after phase 4
(LCM), after Phase 4 result is:

vsetvli e32mf2
outerloop:
...
inner loop 1:
....

inner loop 2:
....

You can see this is the optimal codegen after current VSETVL PASS (Phase 3:
Demand backward fusion and propagation + Phase 4: LCM ). This is a known issue
when I start to implement VSETVL PASS.

gcc/ChangeLog:

PR target/108270
* config/riscv/riscv-vsetvl.cc
(vector_infos_manager::all_empty_predecessor_p): New function.
(pass_vsetvl::backward_demand_fusion): Ditto.
* config/riscv/riscv-vsetvl.h: Ditto.

gcc/testsuite/ChangeLog:

PR target/108270
* gcc.target/riscv/rvv/vsetvl/imm_bb_prop-1.c: Adapt testcase.
* gcc.target/riscv/rvv/vsetvl/imm_conflict-3.c: Ditto.
* gcc.target/riscv/rvv/vsetvl/pr108270.c: New test.

riscv: Fix <bitmanip_insn> fallout.

PR109582: Since r14-116 generic.md uses standard names instead of the
types defined in the <bitmanip_insn> iterator (that match instruction
names). Change this.

gcc/ChangeLog:

PR target/109582
* config/riscv/generic.md: Change standard names to insn names.

rs6000: xfail float128 comparison test case that fails on powerpc64.

This patch xfails a float128 comparison test case on powerpc64 that
fails due to a longstanding issue with floating-point compares.

See PR58684 for more information.

When float128 hardware is enabled (-mfloat128-hardware), xscmpuqp is
generated for comparison which is unexpected. When float128 software
emulation is enabled (-mno-float128-hardware), we still have to xfail
the hardware version (__lekf2_hw) which finally generates xscmpuqp.

gcc/testsuite/
PR target/108728
* gcc.dg/torture/float128-cmp-invalid.c: Add xfail.

testsuite: make ppc_cpu_supports_hw as effective target keyword [PR108728]

gcc/testsuite/
PR target/108728
* lib/target-supports.exp (is-effective-target-keyword): Add
ppc_cpu_supports_hw.

Fix LCM dataflow CFG order

The following fixes the initial order the LCM dataflow routines process
BBs. For a forward problem you want reverse postorder, for a backward
problem you want reverse postorder on the inverted graph.

The LCM iteration has very many other issues but this allows to
turn inverted_post_order_compute into computing a reverse postorder
more easily.

* lcm.cc (compute_antinout_edge): Use RPO on the inverted graph.
(compute_laterin): Use RPO.
(compute_available): Likewise.

LoongArch: Fix MUSL_DYNAMIC_LINKER

The system based on musl has no '/lib64', so change it.

https://wiki.musl-libc.org/guidelines-for-distributions.html,
"Multilib/multi-arch" section of this introduces it.

gcc/
* config/loongarch/gnu-user.h (MUSL_DYNAMIC_LINKER): Redefine.

Signed-off-by: Peng Fan <fanpeng@loongson.cn>
Suggested-by: Xi Ruoyao <xry111@xry111.site>

RISC-V: Add local user vsetvl instruction elimination [PR109547]

This patch is to enhance optimization for auto-vectorization.

Before this patch:

Loop:
vsetvl a5,a2...
vsetvl zero,a5...
vle

After this patch:

Loop:
vsetvl a5,a2
vle

gcc/ChangeLog:

PR target/109547
* config/riscv/riscv-vsetvl.cc (local_eliminate_vsetvl_insn): New function.
(vector_insn_info::skip_avl_compatible_p): Ditto.
(vector_insn_info::merge): Remove default value.
(pass_vsetvl::compute_local_backward_infos): Ditto.
(pass_vsetvl::cleanup_insns): Add local vsetvl elimination.
* config/riscv/riscv-vsetvl.h: Ditto.

gcc/testsuite/ChangeLog:

PR target/109547
* gcc.target/riscv/rvv/vsetvl/pr109547.c: New.
* gcc.target/riscv/rvv/vsetvl/vsetvl-17.c: Update scan
condition.

Daily bump.

update_web_docs_git: Allow setting TEXI2*, add git build default

maintainer-scripts/ChangeLog:

* update_web_docs_git: Add a mechanism to override makeinfo,
texi2dvi and texi2pdf, and default them to
/home/gccadmin/texinfo/install-git/bin/${tool}, if present.

c++: simplify TEMPLATE_TYPE_PARM level lowering

1. Don't bother recursing when level lowering a cv-qualified type
   template parameter.
2. Get rid of the recursive loop breaker when level lowering a
   constrained auto, and enable the TEMPLATE_PARM_DESCENDANTS cache in
   this case too.  This should be safe to do so now that we no longer
   substitute constraints on an auto.

gcc/cp/ChangeLog:

* pt.cc (tsubst) <case TEMPLATE_TYPE_PARM>: Don't recurse when
level lowering a cv-qualified type template parameter.  Remove
recursive loop breaker in the level lowering case for constrained
autos.  Use the TEMPLATE_PARM_DESCENDANTS cache in this case as
well.

c++: use TREE_VEC for trailing args of variadic built-in traits

This patch makes us use TREE_VEC instead of TREE_LIST to represent the
trailing arguments of a variadic built-in trait.  These built-ins are
typically passed a simple pack expansion as the second argument, e.g.

   __is_constructible(T, Ts...)

and the main benefit of this representation change is that substituting
into this argument list is now basically free since tsubst_template_args
makes sure we reuse the TREE_VEC of the corresponding ARGUMENT_PACK when
expanding such a pack expansion.  In the previous TREE_LIST representation
we would need need to convert the expanded pack expansion into a TREE_LIST
(via tsubst_tree_list).

Note that an empty set of trailing arguments is now represented as an
empty TREE_VEC instead of NULL_TREE, so now TRAIT_TYPE/EXPR_TYPE2 will
be empty only for unary traits.

gcc/cp/ChangeLog:

* constraint.cc (diagnose_trait_expr): Convert a TREE_VEC
of arguments into a TREE_LIST for sake of pretty printing.
* cxx-pretty-print.cc (pp_cxx_trait): Handle TREE_VEC
instead of TREE_LIST of trailing variadic trait arguments.
* method.cc (constructible_expr): Likewise.
(is_xible_helper): Likewise.
* parser.cc (cp_parser_trait): Represent trailing variadic trait
arguments as a TREE_VEC instead of TREE_LIST.
* pt.cc (value_dependent_expression_p): Handle TREE_VEC
instead of TREE_LIST of trailing variadic trait arguments.
* semantics.cc (finish_type_pack_element): Likewise.
(check_trait_type): Likewise.

c++: make strip_typedefs generalize strip_typedefs_expr

Currently if we have a TREE_VEC of types that we want to strip of typedefs,
we unintuitively need to call strip_typedefs_expr instead of strip_typedefs
since only strip_typedefs_expr handles TREE_VEC, and it also dispatches
to strip_typedefs when given a type.  But this seems backwards: arguably
strip_typedefs_expr should be the more specialized function, which
strip_typedefs dispatches to (and thus generalizes).

So this patch makes strip_typedefs subsume strip_typedefs_expr rather
than vice versa, which allows for some simplifications.

gcc/cp/ChangeLog:

* tree.cc (strip_typedefs): Move TREE_LIST handling to
strip_typedefs_expr.  Dispatch to strip_typedefs_expr for
non-type 't'.
<case TYPENAME_TYPE>: Remove manual dispatching to
strip_typedefs_expr.
<case TRAIT_TYPE>: Likewise.
(strip_typedefs_expr): Replaces calls to strip_typedefs_expr
with strip_typedefs throughout.  Don't dispatch to strip_typedefs
for type 't'.
<case TREE_LIST>: Replace this with the better version from
strip_typedefs.

doc: Remove repeated word (typo)

gcc/ChangeLog:

* doc/extend.texi (Common Function Attributes): Remove duplicate
word.

Signed-off-by: Alejandro Colomar <alx@kernel.org>

Do not ignore UNDEFINED ranges when determining PHI equivalences.

Do not ignore UNDEFINED name arguments when registering two-way equivalences
from PHIs.

PR tree-optimization/109564
gcc/
* gimple-range-fold.cc (fold_using_range::range_of_phi): Do no ignore
UNDEFINED range names when deciding if all PHI arguments are the same,

gcc/testsuite/
* gcc.dg/torture/pr109564-1.c: New testcase.
* gcc.dg/torture/pr109564-2.c: Likewise.
* gcc.dg/tree-ssa/evrp-ignore.c: XFAIL.
* gcc.dg/tree-ssa/vrp06.c: Likewise.

tree-vect-patterns: One small vect_recog_ctz_ffs_pattern tweak [PR109011]

I've noticed I've made a typo, ifn in this function this late
is always only IFN_CTZ or IFN_FFS, never IFN_CLZ.

Due to this typo, we weren't using the originally intended
.CTZ (X) = .POPCOUNT ((X - 1) & ~X)
but
.CTZ (X) = PREC - .POPCOUNT (X | -X)
instead when we want to emit __builtin_ctz*/.CTZ using .POPCOUNT.
Both compute the same value, both are defined at 0 with the
same value (PREC), both have same number of GIMPLE statements,
but I think the former ought to be preferred, because lots of targets
have andn as a single operation rather than two, and also putting
a -1 constant into a vector register is often cheaper than vector
with broadcast PREC power of two value.

2023-04-20 Jakub Jelinek <jakub@redhat.com>

PR tree-optimization/109011
* tree-vect-patterns.cc (vect_recog_ctz_ffs_pattern): Use
.CTZ (X) = .POPCOUNT ((X - 1) & ~X) in preference to
.CTZ (X) = PREC - .POPCOUNT (X | -X).

c: Avoid -Wenum-int-mismatch warning for redeclaration of builtin acc_on_device [PR107041]

The new -Wenum-int-mismatch warning triggers with -Wsystem-headers in
<openacc.h>, for obvious reasons the builtin acc_on_device uses int
type argument rather than enum which isn't defined yet when the builtin
is created, while the OpenACC spec requires it to have acc_device_t
enum argument. The header makes sure it has int underlying type by using
negative and __INT_MAX__ enumerators.

I've tried to make the builtin typegeneric or just varargs, but that
changes behavior e.g. when one calls it with some C++ class which has
cast operator to acc_device_t, so the following patch instead disables
the warning for this builtin.

2023-04-20 Jakub Jelinek <jakub@redhat.com>

PR c/107041
* c-decl.cc (diagnose_mismatched_decls): Avoid -Wenum-int-mismatch
warning on acc_on_device declaration.

* gcc.dg/goacc/pr107041.c: New test.

[LRA]: Exclude some hard regs for multi-reg inout reload pseudos used in asm in different mode

See gcc.c-torture/execute/20030222-1.c.  Consider the code for 32-bit (e.g. BE) target:
  int i, v; long x; x = v; asm ("" : "=r" (i) : "0" (x));
We generate the following RTL with reload insns:
  1. subreg:si(x:di, 0) = 0;
  2. subreg:si(x:di, 4) = v:si;
  3. t:di = x:di, dead x;
  4. asm ("" : "=r" (subreg:si(t:di,4)) : "0" (t:di))
  5. i:si = subreg:si(t:di,4);
If we assign hard reg of x to t, dead code elimination will remove insn #2
and we will use unitialized hard reg.  So exclude the hard reg of x for t.
We could ignore this problem for non-empty asm using all x value but it is hard to
check that the asm are expanded into insn realy using x and setting r.
The old reload pass used the same approach.

gcc/ChangeLog

* lra-constraints.cc (match_reload): Exclude some hard regs for
multi-reg inout reload pseudos used in asm in different mode.

arch: Use VIRTUAL_REGISTER_P predicate.

gcc/ChangeLog:

* config/arm/arm.cc (thumb1_legitimate_address_p):
Use VIRTUAL_REGISTER_P predicate.
(arm_eliminable_register): Ditto.
* config/avr/avr.md (push<mode>_1): Ditto.
* config/bfin/predicates.md (register_no_elim_operand): Ditto.
* config/h8300/predicates.md (register_no_sp_elim_operand): Ditto.
* config/i386/predicates.md (register_no_elim_operand): Ditto.
* config/iq2000/predicates.md (call_insn_operand): Ditto.
* config/microblaze/microblaze.h (CALL_INSN_OP): Ditto.

i386: Handle sign-extract for QImode operations with high registers [PR78952]

Introduce extract_operator predicate to handle both, zero-extract and
sign-extract extract operations with expressions like:

    (subreg:QI
      (zero_extract:SWI248
        (match_operand 1 "int248_register_operand" "0")
(const_int 8)
(const_int 8)) 0)

As shown in the testcase, this will enable generation of QImode
instructions with high registers when signed arguments are used.

gcc/ChangeLog:

PR target/78952
* config/i386/predicates.md (extract_operator): New predicate.
* config/i386/i386.md (any_extract): Remove code iterator.
(*cmpqi_ext<mode>_1_mem_rex64): Use extract_operator predicate.
(*cmpqi_ext<mode>_1): Ditto.
(*cmpqi_ext<mode>_2): Ditto.
(*cmpqi_ext<mode>_3_mem_rex64): Ditto.
(*cmpqi_ext<mode>_3): Ditto.
(*cmpqi_ext<mode>_4): Ditto.
(*extzvqi_mem_rex64): Ditto.
(*extzvqi): Ditto.
(*insvqi_2): Ditto.
(*extendqi<SWI24:mode>_ext_1): Ditto.
(*addqi_ext<mode>_0): Ditto.
(*addqi_ext<mode>_1): Ditto.
(*addqi_ext<mode>_2): Ditto.
(*subqi_ext<mode>_0): Ditto.
(*subqi_ext<mode>_2): Ditto.
(*testqi_ext<mode>_1): Ditto.
(*testqi_ext<mode>_2): Ditto.
(*andqi_ext<mode>_0): Ditto.
(*andqi_ext<mode>_1): Ditto.
(*andqi_ext<mode>_1_cc): Ditto.
(*andqi_ext<mode>_2): Ditto.
(*<any_or:code>qi_ext<mode>_0): Ditto.
(*<any_or:code>qi_ext<mode>_1): Ditto.
(*<any_or:code>qi_ext<mode>_2): Ditto.
(*xorqi_ext<mode>_1_cc): Ditto.
(*negqi_ext<mode>_2): Ditto.
(*ashlqi_ext<mode>_2): Ditto.
(*<any_shiftrt:insn>qi_ext<mode>_2): Ditto.

gcc/testsuite/ChangeLog:

PR target/78952
* gcc.target/i386/pr78952-4.c: New test.

[PR target/108248] [RISC-V] Break down some bitmanip insn types

This is primarily Raphael's work.  All I did was adjust it to apply to the
trunk and add the new types to generic.md's scheduling model.

The basic idea here is to make sure we have the ability to schedule the
bitmanip instructions with a finer degree of control.  Some of the bitmanip
instructions are likely to have differing scheduler characteristics across
different implementations.

So rather than assign these instructions a generic "bitmanip" type, this
patch assigns them a type based on their RTL code by using the <bitmanip_insn>
iterator for the type.  Naturally we have to add a few new types.  It affects
clz, ctz, cpop, min, max.

We didn't do this for things like shNadd, single bit manipulation, etc. We
certainly could if the needs presents itself.

I threw all the new types into the generic_alu bucket in the generic
scheduling model.  Seems as good a place as any. Someone who knows the
sifive uarch should probably add these types (and bitmanip) to the sifive
scheduling model.

We also noticed that the recently added orc.b didn't have a type at all.
So we added it as a generic bitmanip type.

This has been bootstrapped in a gcc-12 base and I've built and run the
testsuite without regressions on the trunk.

Given it was primarily Raphael's work I could probably approve & commit it.
But I'd like to give the other RISC-V folks a chance to chime in.

PR target/108248
gcc/
* config/riscv/bitmanip.md (clz, ctz, pcnt, min, max patterns): Use
<bitmanip_insn> as the type to allow for fine grained control of
scheduling these insns.
* config/riscv/generic.md (generic_alu): Add bitmanip, clz, ctz, pcnt,
min, max.
* config/riscv/riscv.md (type attribute): Add types for clz, ctz,
pcnt, signed and unsigned min/max.

RISC-V: Fix RVV register order

This patch fixes the issue of incorrect reigster order of RVV.
The new register order is coming from kito original RVV GCC implementation.

Consider this case:
void f (void *base,void *base2,void *out,size_t vl, int n)
{
    vuint64m8_t bindex = __riscv_vle64_v_u64m8 (base + 100, vl);
    for (int i = 0; i < n; i++){
      vbool8_t m = __riscv_vlm_v_b8 (base + i, vl);
      vuint64m8_t v = __riscv_vluxei64_v_u64m8_m(m,base,bindex,vl);
      vuint64m8_t v2 = __riscv_vle64_v_u64m8_tu (v, base2 + i, vl);
      vint8m1_t v3 = __riscv_vluxei64_v_i8m1_m(m,base,v,vl);
      vint8m1_t v4 = __riscv_vluxei64_v_i8m1_m(m,base,v2,vl);
      __riscv_vse8_v_i8m1 (out + 100*i,v3,vl);
      __riscv_vse8_v_i8m1 (out + 222*i,v4,vl);
    }
}

Before this patch:
f:
csrr    t0,vlenb
slli    t1,t0,3
sub     sp,sp,t1
addi    a5,a0,100
vsetvli zero,a3,e64,m8,ta,ma
vle64.v v24,0(a5)
vs8r.v  v24,0(sp)
ble     a4,zero,.L1
mv      a6,a0
add     a4,a4,a0
mv      a5,a2
.L3:
vsetvli zero,zero,e64,m8,ta,ma
vl8re64.v       v24,0(sp)
vlm.v   v0,0(a6)
vluxei64.v      v24,(a0),v24,v0.t
addi    a6,a6,1
vsetvli zero,zero,e8,m1,tu,ma
vmv8r.v v16,v24
vluxei64.v      v8,(a0),v24,v0.t
vle64.v v16,0(a1)
vluxei64.v      v24,(a0),v16,v0.t
vse8.v  v8,0(a2)
vse8.v  v24,0(a5)
addi    a1,a1,1
addi    a2,a2,100
addi    a5,a5,222
bne     a4,a6,.L3
.L1:
csrr    t0,vlenb
slli    t1,t0,3
add     sp,sp,t1
jr      ra

After this patch:
f:
addi    a5,a0,100
vsetvli zero,a3,e64,m8,ta,ma
vle64.v v24,0(a5)
ble     a4,zero,.L1
mv      a6,a0
add     a4,a4,a0
mv      a5,a2
.L3:
vsetvli zero,zero,e64,m8,ta,ma
vlm.v   v0,0(a6)
addi    a6,a6,1
vluxei64.v      v8,(a0),v24,v0.t
vsetvli zero,zero,e8,m1,tu,ma
vmv8r.v v16,v8
vluxei64.v      v2,(a0),v8,v0.t
vle64.v v16,0(a1)
vluxei64.v      v1,(a0),v16,v0.t
vse8.v  v2,0(a2)
vse8.v  v1,0(a5)
addi    a1,a1,1
addi    a2,a2,100
addi    a5,a5,222
bne     a4,a6,.L3
.L1:
ret

The redundant register spillings is eliminated.
However, there is one more issue need to be addressed which is the redundant
move instruction "vmv8r.v". This is another story, and it will be fixed by another
patch (Fine tune RVV machine description RA constraint).

gcc/ChangeLog:

* config/riscv/riscv.h (enum reg_class): Fix RVV register order.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/spill-4.c: Adapt testcase.
* gcc.target/riscv/rvv/base/spill-6.c: Adapt testcase.
* gcc.target/riscv/rvv/base/reg_order-1.c: New test.

Signed-off-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
Co-authored-by: kito-cheng <kito.cheng@sifive.com>

RISC-V: Fix riscv/arch-19.c with different ISA spec version

In newer ISA spec, F will implied zicsr, add that into -march option to
prevent different test result on different default -misa-spec version.

gcc/testsuite/

* gcc.target/riscv/arch-19.c: Add -misa-spec.

RISC-V: Fix wrong check of register occurrences [PR109535]

count_occurrences will conly count same RTX (same code and same mode),
but what we want to track is the occurrence of a register, a register
might appeared in the insn with different mode or contain in SUBREG.

Testcase coming from Kito.

gcc/ChangeLog:

PR target/109535
* config/riscv/riscv-vsetvl.cc (count_regno_occurrences): New function.
(pass_vsetvl::cleanup_insns): Fix bug.

gcc/testsuite/ChangeLog:

PR target/109535
* g++.target/riscv/rvv/base/pr109535.C: New test.
* gcc.target/riscv/rvv/base/pr109535.c: New test.

Signed-off-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
Co-authored-by: kito-cheng <kito.cheng@sifive.com>

RISC-V: Fix simplify_ior_optimization.c on rv32

GCC will complaint if target ABI isn't have corresponding multi-lib on
glibc toolchain, use stdint-gcc.h to suppress that.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/simplify_ior_optimization.c: Use stdint-gcc.h
rather than stdint.h

amdgcn: bug fix ldexp insn

The vop3 instructions don't support B constraint immediates.
Also, take the use the SV_FP iterator to delete a redundant pattern.

gcc/ChangeLog:

* config/gcn/gcn-valu.md (vnsi, VnSI): Add scalar modes.
(ldexp<mode>3): Delete.
(ldexp<mode>3<exec>): Change "B" to "A".

amdgcn: update target-supports.exp

The backend can now vectorize more things.

gcc/testsuite/ChangeLog:

* lib/target-supports.exp
(check_effective_target_vect_call_copysignf): Add amdgcn.
(check_effective_target_vect_call_sqrtf): Add amdgcn.
(check_effective_target_vect_call_ceilf): Add amdgcn.
(check_effective_target_vect_call_floor): Add amdgcn.
(check_effective_target_vect_logical_reduc): Add amdgcn.

tree: Add 3+ argument fndecl_built_in_p

On Wed, Feb 22, 2023 at 09:52:06AM +0000, Richard Biener wrote:
> > The following testcase ICEs because we still have some spots that
> > treat BUILT_IN_UNREACHABLE specially but not BUILT_IN_UNREACHABLE_TRAP
> > the same.

This patch uses (fndecl_built_in_p (node, BUILT_IN_UNREACHABLE)
                 || fndecl_built_in_p (node, BUILT_IN_UNREACHABLE_TRAP))
a lot and from grepping around, we do something like that in lots of
other places, or in some spots instead as
(fndecl_built_in_p (node, BUILT_IN_NORMAL)
&& (DECL_FUNCTION_CODE (node) == BUILT_IN_WHATEVER1
     || DECL_FUNCTION_CODE (node) == BUILT_IN_WHATEVER2))
The following patch adds an overload for this case, so we can write
it in a shorter way, using C++11 argument packs so that it supports
as many codes as one needs.

2023-04-20  Jakub Jelinek  <jakub@redhat.com>
    Jonathan Wakely  <jwakely@redhat.com>

* tree.h (built_in_function_equal_p): New helper function.
(fndecl_built_in_p): Turn into variadic template to support
1 or more built_in_function arguments.
* builtins.cc (fold_builtin_expect): Use 3 argument fndecl_built_in_p.
* gimplify.cc (goa_stabilize_expr): Likewise.
* cgraphclones.cc (cgraph_node::create_clone): Likewise.
* ipa-fnsummary.cc (compute_fn_summary): Likewise.
* omp-low.cc (setjmp_or_longjmp_p): Likewise.
* cgraph.cc (cgraph_edge::redirect_call_stmt_to_callee,
cgraph_update_edges_for_call_stmt_node,
cgraph_edge::verify_corresponds_to_fndecl,
cgraph_node::verify_node): Likewise.
* tree-stdarg.cc (optimize_va_list_gpr_fpr_size): Likewise.
* gimple-ssa-warn-access.cc (matching_alloc_calls_p): Likewise.
* ipa-prop.cc (try_make_edge_direct_virtual_call): Likewise.

tree-vect-patterns: Pattern recognize ctz or ffs using clz, popcount or ctz [PR109011]

The following patch allows to vectorize __builtin_ffs*/.FFS even if
we just have vector .CTZ support, or __builtin_ffs*/.FFS/__builtin_ctz*/.CTZ
if we just have vector .CLZ or .POPCOUNT support.
It uses various expansions from Hacker's Delight book as well as GCC's
expansion, in particular:
.CTZ (X) = PREC - .CLZ ((X - 1) & ~X)
.CTZ (X) = .POPCOUNT ((X - 1) & ~X)
.CTZ (X) = (PREC - 1) - .CLZ (X & -X)
.FFS (X) = PREC - .CLZ (X & -X)
.CTZ (X) = PREC - .POPCOUNT (X | -X)
.FFS (X) = (PREC + 1) - .POPCOUNT (X | -X)
.FFS (X) = .CTZ (X) + 1
where the first one can be only used if both CTZ and CLZ have value
defined at zero (kind 2) and both have value of PREC there.
If the original has value defined at zero and the latter doesn't
for other forms or if it doesn't have matching value for that case,
a COND_EXPR is added for that afterwards.

The patch also modifies vect_recog_popcount_clz_ctz_ffs_pattern
such that the two can work together.

2023-04-20 Jakub Jelinek <jakub@redhat.com>

PR tree-optimization/109011
* tree-vect-patterns.cc (vect_recog_ctz_ffs_pattern): New function.
(vect_recog_popcount_clz_ctz_ffs_pattern): Move vect_pattern_detected
call later. Don't punt for IFN_CTZ or IFN_FFS if it doesn't have
direct optab support, but has instead IFN_CLZ, IFN_POPCOUNT or
for IFN_FFS IFN_CTZ support, use vect_recog_ctz_ffs_pattern for that
case.
(vect_vect_recog_func_ptrs): Add ctz_ffs entry.

* gcc.dg/vect/pr109011-1.c: Remove -mpower9-vector from
dg-additional-options.
(baz, qux): Remove functions and corresponding dg-final.
* gcc.dg/vect/pr109011-2.c: New test.
* gcc.dg/vect/pr109011-3.c: New test.
* gcc.dg/vect/pr109011-4.c: New test.
* gcc.dg/vect/pr109011-5.c: New test.

Remove duplicate DFS walks from DF init

The following removes unused CFG order computes from
rest_of_handle_df_initialize. The CFG orders are computed from df_analyze ().
This also removes code duplication that would have to be kept in sync.

* df-core.cc (rest_of_handle_df_initialize): Remove
computation of df->postorder, df->postorder_inverted and
df->n_blocks.

testsuite: Fix up g++.dg/ext/int128-8.C testcase [PR109560]

The testcase needs to be restricted to int128 effective targets,
it expectedly fails on i386 and other 32-bit targets.

2023-04-20 Jakub Jelinek <jakub@redhat.com>

PR c++/108099
PR testsuite/109560
* g++.dg/ext/int128-8.C: Require int128 effective target.

PR testsuite/106879 FAIL: gcc.dg/vect/bb-slp-layout-19.c on powerpc64

On P7, option -mno-allow-movmisalign is added during testing, which
prevents slp happen on the case.

Like PR65484 and PR87306, this patch use vect_hw_misalign to guard
the case on powerpc targets.

gcc/testsuite/ChangeLog:

PR testsuite/106879
* gcc.dg/vect/bb-slp-layout-19.c: Modify to guard the check with
vect_hw_misalign on POWERs.

i386: Share AES xmm intrin with VAES

Currently in GCC, the 128 bit intrin for instruction vaes{end,dec}{last,}
is under AES ISA. Because there is no dependency between ISA set AES
and VAES, The 128 bit intrin is not available when we use compiler flag
-mvaes -mavx512vl and there is no other way to use that intrin. But it
should according to Intel SDM.

Although VAES aims to be a VEX/EVEX promotion for AES, but it is only part
of it. Therefore, we share the AES xmm intrin with VAES.

Also, since -mvaes indicates that we could use VEX encoding for ymm, we
should imply AVX for VAES.

gcc/ChangeLog:

* common/config/i386/i386-common.cc
(OPTION_MASK_ISA2_AVX_UNSET): Add OPTION_MASK_ISA2_VAES_UNSET.
(ix86_handle_option): Set AVX flag for VAES.
* config/i386/i386-builtins.cc (ix86_init_mmx_sse_builtins):
Add OPTION_MASK_ISA2_VAES_UNSET.
(def_builtin): Share builtin between AES and VAES.
* config/i386/i386-expand.cc (ix86_check_builtin_isa_match):
Ditto.
* config/i386/i386.md (aes): New isa attribute.
* config/i386/sse.md (aesenc): Add pattern for VAES with xmm.
(aesenclast): Ditto.
(aesdec): Ditto.
(aesdeclast): Ditto.
* config/i386/vaesintrin.h: Remove redundant avx target push.
* config/i386/wmmintrin.h (_mm_aesdec_si128): Change to macro.
(_mm_aesdeclast_si128): Ditto.
(_mm_aesenc_si128): Ditto.
(_mm_aesenclast_si128): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512fvl-vaes-1.c: Add VAES xmm test.
* gcc.target/i386/pr109117-1.c: Modify error message.

Add reduce_*_ep[i|u][8|16] series intrinsics

gcc/ChangeLog:

* config/i386/avx2intrin.h
(_MM_REDUCE_OPERATOR_BASIC_EPI16): New macro.
(_MM_REDUCE_OPERATOR_MAX_MIN_EP16): Ditto.
(_MM256_REDUCE_OPERATOR_BASIC_EPI16): Ditto.
(_MM256_REDUCE_OPERATOR_MAX_MIN_EP16): Ditto.
(_MM_REDUCE_OPERATOR_BASIC_EPI8): Ditto.
(_MM_REDUCE_OPERATOR_MAX_MIN_EP8): Ditto.
(_MM256_REDUCE_OPERATOR_BASIC_EPI8): Ditto.
(_MM256_REDUCE_OPERATOR_MAX_MIN_EP8): Ditto.
(_mm_reduce_add_epi16): New instrinsics.
(_mm_reduce_mul_epi16): Ditto.
(_mm_reduce_and_epi16): Ditto.
(_mm_reduce_or_epi16): Ditto.
(_mm_reduce_max_epi16): Ditto.
(_mm_reduce_max_epu16): Ditto.
(_mm_reduce_min_epi16): Ditto.
(_mm_reduce_min_epu16): Ditto.
(_mm256_reduce_add_epi16): Ditto.
(_mm256_reduce_mul_epi16): Ditto.
(_mm256_reduce_and_epi16): Ditto.
(_mm256_reduce_or_epi16): Ditto.
(_mm256_reduce_max_epi16): Ditto.
(_mm256_reduce_max_epu16): Ditto.
(_mm256_reduce_min_epi16): Ditto.
(_mm256_reduce_min_epu16): Ditto.
(_mm_reduce_add_epi8): Ditto.
(_mm_reduce_mul_epi8): Ditto.
(_mm_reduce_and_epi8): Ditto.
(_mm_reduce_or_epi8): Ditto.
(_mm_reduce_max_epi8): Ditto.
(_mm_reduce_max_epu8): Ditto.
(_mm_reduce_min_epi8): Ditto.
(_mm_reduce_min_epu8): Ditto.
(_mm256_reduce_add_epi8): Ditto.
(_mm256_reduce_mul_epi8): Ditto.
(_mm256_reduce_and_epi8): Ditto.
(_mm256_reduce_or_epi8): Ditto.
(_mm256_reduce_max_epi8): Ditto.
(_mm256_reduce_max_epu8): Ditto.
(_mm256_reduce_min_epi8): Ditto.
(_mm256_reduce_min_epu8): Ditto.
* config/i386/avx512vlbwintrin.h:
(_mm_mask_reduce_add_epi16): Ditto.
(_mm_mask_reduce_mul_epi16): Ditto.
(_mm_mask_reduce_and_epi16): Ditto.
(_mm_mask_reduce_or_epi16): Ditto.
(_mm_mask_reduce_max_epi16): Ditto.
(_mm_mask_reduce_max_epu16): Ditto.
(_mm_mask_reduce_min_epi16): Ditto.
(_mm_mask_reduce_min_epu16): Ditto.
(_mm256_mask_reduce_add_epi16): Ditto.
(_mm256_mask_reduce_mul_epi16): Ditto.
(_mm256_mask_reduce_and_epi16): Ditto.
(_mm256_mask_reduce_or_epi16): Ditto.
(_mm256_mask_reduce_max_epi16): Ditto.
(_mm256_mask_reduce_max_epu16): Ditto.
(_mm256_mask_reduce_min_epi16): Ditto.
(_mm256_mask_reduce_min_epu16): Ditto.
(_mm_mask_reduce_add_epi8): Ditto.
(_mm_mask_reduce_mul_epi8): Ditto.
(_mm_mask_reduce_and_epi8): Ditto.
(_mm_mask_reduce_or_epi8): Ditto.
(_mm_mask_reduce_max_epi8): Ditto.
(_mm_mask_reduce_max_epu8): Ditto.
(_mm_mask_reduce_min_epi8): Ditto.
(_mm_mask_reduce_min_epu8): Ditto.
(_mm256_mask_reduce_add_epi8): Ditto.
(_mm256_mask_reduce_mul_epi8): Ditto.
(_mm256_mask_reduce_and_epi8): Ditto.
(_mm256_mask_reduce_or_epi8): Ditto.
(_mm256_mask_reduce_max_epi8): Ditto.
(_mm256_mask_reduce_max_epu8): Ditto.
(_mm256_mask_reduce_min_epi8): Ditto.
(_mm256_mask_reduce_min_epu8): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512vlbw-reduce-op-1.c: New test.

i386: Add PCLMUL dependency for VPCLMULQDQ

Currently in GCC, the 128 bit intrin for instruction vpclmulqdq is
under PCLMUL ISA. Because there is no dependency between ISA set PCLMUL
and VPCLMULQDQ, The 128 bit intrin is not available when we just use
compiler flag -mvpclmulqdq. But it should according to Intel SDM.

Since VPCLMULQDQ is a VEX/EVEX promotion for PCLMUL, it is natural to
add dependency between them.

Also, with -mvpclmulqdq, we can use ymm under VEX encoding, so
VPCLMULQDQ should imply AVX.

gcc/ChangeLog:

* common/config/i386/i386-common.cc
(OPTION_MASK_ISA_VPCLMULQDQ_SET):
Add OPTION_MASK_ISA_PCLMUL_SET and OPTION_MASK_ISA_AVX_SET.
(OPTION_MASK_ISA_AVX_UNSET):
Add OPTION_MASK_ISA_VPCLMULQDQ_UNSET.
(OPTION_MASK_ISA_PCLMUL_UNSET): Ditto.
* config/i386/i386.md (vpclmulqdqvl): New.
* config/i386/sse.md (pclmulqdq): Add evex encoding.
* config/i386/vpclmulqdqintrin.h: Remove redudant avx target
push.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vpclmulqdq.c: Add compile test for xmm.

i386: Fix vpblendm{b,w} intrins and insns

For vpblendm{b,w}, they actually do not have constant parameters.
Therefore, there is no need for them been wrapped in __OPTIMIZE__.

Also, we should check TARGET_AVX512VL for 128/256 bit vectors.

gcc/ChangeLog:

* config/i386/avx512vlbwintrin.h
(_mm_mask_blend_epi16): Remove __OPTIMIZE__ wrapper.
(_mm_mask_blend_epi8): Ditto.
(_mm256_mask_blend_epi16): Ditto.
(_mm256_mask_blend_epi8): Ditto.
* config/i386/avx512vlintrin.h
(_mm256_mask_blend_pd): Ditto.
(_mm256_mask_blend_ps): Ditto.
(_mm256_mask_blend_epi64): Ditto.
(_mm256_mask_blend_epi32): Ditto.
(_mm_mask_blend_pd): Ditto.
(_mm_mask_blend_ps): Ditto.
(_mm_mask_blend_epi64): Ditto.
(_mm_mask_blend_epi32): Ditto.
* config/i386/sse.md (VF_AVX512BWHFBF16): Removed.
(VF_AVX512HFBFVL): Move it before the first usage.
(<avx512>_blendm<mode>): Change iterator from VF_AVX512BWHFBF16
to VF_AVX512HFBFVL.

i386: Add AVX512BW dependency to AVX512VBMI2

gcc/ChangeLog:

* common/config/i386/i386-common.cc
(OPTION_MASK_ISA_AVX512VBMI2_SET): Change OPTION_MASK_ISA_AVX512F_SET
to OPTION_MASK_ISA_AVX512BW_SET.
(OPTION_MASK_ISA_AVX512F_UNSET):
Remove OPTION_MASK_ISA_AVX512VBMI2_UNSET.
(OPTION_MASK_ISA_AVX512BW_UNSET):
Add OPTION_MASK_ISA_AVX512VBMI2_UNSET.
* config/i386/avx512vbmi2intrin.h: Do not push avx512bw.
* config/i386/avx512vbmi2vlintrin.h: Ditto.
* config/i386/i386-builtin.def: Remove OPTION_MASK_ISA_AVX512BW.
* config/i386/sse.md (VI12_AVX512VLBW): Removed.
(VI12_VI48F_AVX512VLBW): Rename to VI12_VI48F_AVX512VL.
(compress<mode>_mask): Change iterator from VI12_AVX512VLBW to
VI12_AVX512VL.
(compressstore<mode>_mask): Ditto.
(expand<mode>_mask): Ditto.
(expand<mode>_maskz): Ditto.
(*expand<mode>_mask): Change iterator from VI12_VI48F_AVX512VLBW to
VI12_VI48F_AVX512VL.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512bw-pr100267-1.c: Remove avx512f and avx512bw.
* gcc.target/i386/avx512bw-pr100267-b-2.c: Ditto.
* gcc.target/i386/avx512bw-pr100267-d-2.c: Ditto.
* gcc.target/i386/avx512bw-pr100267-q-2.c: Ditto.
* gcc.target/i386/avx512bw-pr100267-w-2.c: Ditto.
* gcc.target/i386/avx512f-vpcompressb-1.c: Ditto.
* gcc.target/i386/avx512f-vpcompressb-2.c: Ditto.
* gcc.target/i386/avx512f-vpcompressw-1.c: Ditto.
* gcc.target/i386/avx512f-vpcompressw-2.c: Ditto.
* gcc.target/i386/avx512f-vpexpandb-1.c: Ditto.
* gcc.target/i386/avx512f-vpexpandb-2.c: Ditto.
* gcc.target/i386/avx512f-vpexpandw-1.c: Ditto.
* gcc.target/i386/avx512f-vpexpandw-2.c: Ditto.
* gcc.target/i386/avx512f-vpshld-1.c: Ditto.
* gcc.target/i386/avx512f-vpshldd-2.c: Ditto.
* gcc.target/i386/avx512f-vpshldq-2.c: Ditto.
* gcc.target/i386/avx512f-vpshldv-1.c: Ditto.
* gcc.target/i386/avx512f-vpshldvd-2.c: Ditto.
* gcc.target/i386/avx512f-vpshldvq-2.c: Ditto.
* gcc.target/i386/avx512f-vpshldvw-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdd-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdq-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdv-1.c: Ditto.
* gcc.target/i386/avx512f-vpshrdvd-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdvq-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdvw-2.c: Ditto.
* gcc.target/i386/avx512f-vpshrdw-2.c: Ditto.
* gcc.target/i386/avx512vbmi2-vpshld-1.c: Ditto.
* gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcompressb-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcompressb-2.c: Ditto.
* gcc.target/i386/avx512vl-vpcompressw-2.c: Ditto.
* gcc.target/i386/avx512vl-vpexpandb-1.c: Ditto.
* gcc.target/i386/avx512vl-vpexpandb-2.c: Ditto.
* gcc.target/i386/avx512vl-vpexpandw-1.c: Ditto.
* gcc.target/i386/avx512vl-vpexpandw-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshldd-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshldq-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshldv-1.c: Ditto.
* gcc.target/i386/avx512vl-vpshldvd-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshldvq-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshldvw-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdd-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdq-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdv-1.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdvd-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdvq-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdvw-2.c: Ditto.
* gcc.target/i386/avx512vl-vpshrdw-2.c: Ditto.
* gcc.target/i386/avx512vlbw-pr100267-1.c: Ditto.
* gcc.target/i386/avx512vlbw-pr100267-b-2.c: Ditto.
* gcc.target/i386/avx512vlbw-pr100267-w-2.c: Ditto.

i386: Add AVX512BW dependency to AVX512BITALG

Since some of the AVX512BITALG intrins use 32/64 bit mask,
AVX512BW should be implied.

gcc/ChangeLog:

* common/config/i386/i386-common.cc
(OPTION_MASK_ISA_AVX512BITALG_SET):
Change OPTION_MASK_ISA_AVX512F_SET
to OPTION_MASK_ISA_AVX512BW_SET.
(OPTION_MASK_ISA_AVX512F_UNSET):
Remove OPTION_MASK_ISA_AVX512BITALG_SET.
(OPTION_MASK_ISA_AVX512BW_UNSET):
Add OPTION_MASK_ISA_AVX512BITALG_SET.
* config/i386/avx512bitalgintrin.h: Do not push avx512bw.
* config/i386/i386-builtin.def:
Remove redundant OPTION_MASK_ISA_AVX512BW.
* config/i386/sse.md (VI1_AVX512VLBW): Removed.
(avx512vl_vpshufbitqmb<mode><mask_scalar_merge_name>):
Change the iterator from VI1_AVX512VLBW to VI1_AVX512VL.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512bitalg-vpopcntb-1.c:
Remove avx512bw.
* gcc.target/i386/avx512bitalg-vpopcntb.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntbvl.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntw-1.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntw.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntwvl.c: Ditto.
* gcc.target/i386/avx512bitalg-vpshufbitqmb-1.c: Ditto.
* gcc.target/i386/avx512bitalg-vpshufbitqmb.c: Ditto.
* gcc.target/i386/avx512bitalgvl-vpopcntb-1.c: Ditto.
* gcc.target/i386/avx512bitalgvl-vpopcntw-1.c: Ditto.
* gcc.target/i386/avx512bitalgvl-vpshufbitqmb-1.c: Ditto.
* gcc.target/i386/pr93696-1.c: Ditto.
* gcc.target/i386/pr93696-2.c: Ditto.

i386: Use macro to wrap up share builtin exceptions in builtin isa check

gcc/ChangeLog:

* config/i386/i386-expand.cc
(ix86_check_builtin_isa_match): Correct wrong comments.
Add a new macro SHARE_BUILTIN and refactor the current if
clauses to macro.

Re-arrange sections of i386 cpuid

gcc/ChangeLog:

* config/i386/cpuid.h: Open a new section for Extended Features
Leaf (%eax == 7, %ecx == 0) and Extended Features Sub-leaf (%eax == 7,
%ecx == 1).

Optimize vshuf{i,f}{32x4,64x2} ymm and vperm{i,f}128 ymm

vshuf{i,f}{32x4,64x2} ymm and vperm{i,f}128 ymm are 3 clk.
We can optimze them to vblend, vmovaps when there's no cross-lane.

gcc/ChangeLog:

* config/i386/sse.md: Modify insn vperm{i,f}
and vshuf{i,f}.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512vl-vshuff32x4-1.c: Modify test.
* gcc.target/i386/avx512vl-vshuff64x2-1.c: Ditto.
* gcc.target/i386/avx512vl-vshufi32x4-1.c: Ditto.
* gcc.target/i386/avx512vl-vshufi64x2-1.c: Ditto.
* gcc.target/i386/opt-vperm-vshuf-1.c: New test.
* gcc.target/i386/opt-vperm-vshuf-2.c: Ditto.
* gcc.target/i386/opt-vperm-vshuf-3.c: Ditto.

Daily bump.

gcc: xtensa: add -m[no-]strict-align option

gcc/
* config/xtensa/xtensa-opts.h: New header.
* config/xtensa/xtensa.h (STRICT_ALIGNMENT): Redefine as
xtensa_strict_align.
* config/xtensa/xtensa.cc (xtensa_option_override): When
-m[no-]strict-align is not specified in the command line set
xtensa_strict_align to 0 if the hardware supports both unaligned
loads and stores or to 1 otherwise.
* config/xtensa/xtensa.opt (mstrict-align): New option.
* doc/invoke.texi (Xtensa Options): Document -m[no-]strict-align.

gcc: xtensa: add data alignment properties to dynconfig

gcc/
* config/xtensa/xtensa-dynconfig.cc (xtensa_get_config_v4): New
function.

include/
* xtensa-dynconfig.h (xtensa_config_v4): New struct.
(XCHAL_DATA_WIDTH, XCHAL_UNALIGNED_LOAD_EXCEPTION)
(XCHAL_UNALIGNED_STORE_EXCEPTION, XCHAL_UNALIGNED_LOAD_HW)
(XCHAL_UNALIGNED_STORE_HW, XTENSA_CONFIG_V4_ENTRY_LIST): New
definitions.
(XTENSA_CONFIG_INSTANCE_LIST): Add xtensa_config_v4 instance.
(XTENSA_CONFIG_ENTRY_LIST): Add XTENSA_CONFIG_V4_ENTRY_LIST.

c++: Define built-in for std::tuple_element [PR100157]

This adds a new built-in to replace the recursive class template
instantiations done by traits such as std::tuple_element and
std::variant_alternative.  The purpose is to select the Nth type from a
list of types, e.g. __type_pack_element<1, char, int, float> is int.
We implement it as a special kind of TRAIT_TYPE.

For a pathological example tuple_element_t<1000, tuple<2000 types...>>
the compilation time is reduced by more than 90% and the memory used by
the compiler is reduced by 97%.  In realistic examples the gains will be
much smaller, but still relevant.

Unlike the other built-in traits, __type_pack_element uses template-id
syntax instead of call syntax and is SFINAE-enabled, matching Clang's
implementation.  And like the other built-in traits, it's not mangleable
so we can't use it directly in function signatures.

N.B. Clang seems to implement __type_pack_element as a first-class
template that can e.g. be used as a template-template argument.  For
simplicity we implement it in a more ad-hoc way.

Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
PR c++/100157

gcc/cp/ChangeLog:

* cp-trait.def (TYPE_PACK_ELEMENT): Define.
* cp-tree.h (finish_trait_type): Add complain parameter.
* cxx-pretty-print.cc (pp_cxx_trait): Handle
CPTK_TYPE_PACK_ELEMENT.
* parser.cc (cp_parser_constant_expression): Document default
arguments.
(cp_parser_trait): Handle CPTK_TYPE_PACK_ELEMENT.  Pass
tf_warning_or_error to finish_trait_type.
* pt.cc (tsubst) <case TRAIT_TYPE>: Handle non-type first
argument.  Pass complain to finish_trait_type.
* semantics.cc (finish_type_pack_element): Define.
(finish_trait_type): Add complain parameter.  Handle
CPTK_TYPE_PACK_ELEMENT.
* tree.cc (strip_typedefs): Handle non-type first argument.
Pass tf_warning_or_error to finish_trait_type.
* typeck.cc (structural_comptypes) <case TRAIT_TYPE>: Use
cp_tree_equal instead of same_type_p for the first argument.

libstdc++-v3/ChangeLog:

* include/bits/utility.h (_Nth_type): Conditionally define in
terms of __type_pack_element if available.
* testsuite/20_util/tuple/element_access/get_neg.cc: Prune
additional errors from the new built-in.

gcc/testsuite/ChangeLog:

* g++.dg/ext/type_pack_element1.C: New test.
* g++.dg/ext/type_pack_element2.C: New test.
* g++.dg/ext/type_pack_element3.C: New test.

c++: bad ggc_free in try_class_unification [PR109556]

Aside from correcting how try_class_unification copies multi-dimensional
'targs', r13-377-g3e948d645bc908 also made it ggc_free this copy as an
optimization. But this is wrong since the call to unify within might've
captured the args in persistent memory such as the satisfaction cache
(as part of constrained auto deduction).

PR c++/109556

gcc/cp/ChangeLog:

* pt.cc (try_class_unification): Don't ggc_free the copy of
'targs'.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-placeholder13.C: New test.

testsuite: fix scan-tree-dump patterns [PR83904,PR100297]

Adjust scan-tree-dump patterns so that they do not accidentally match a
valid path.

gcc/testsuite/ChangeLog:

PR testsuite/83904
PR fortran/100297
* gfortran.dg/allocatable_function_1.f90: Use "__builtin_free "
instead of the naive "free".
* gfortran.dg/reshape_8.f90: Extend pattern from a simple "data".

i386: Add new pattern for zero-extend cmov

After a phiopt change, I got a failure of cmov9.c.
The RTL IR has zero_extend on the outside of
the if_then_else rather than on the side. Both
ways are considered canonical as mentioned in
PR 66588.

This fixes the failure I got and also adds a testcase
which fails before even my phiopt patch but will pass
with this patch.

OK? Bootstrapped and tested on x86_64-linux-gnu with
no regressions.

gcc/ChangeLog:

* config/i386/i386.md (*movsicc_noc_zext_1): New pattern.

gcc/testsuite/ChangeLog:

* gcc.target/i386/cmov10.c: New test.
* gcc.target/i386/cmov11.c: New test.

c++: fix 'unsigned __int128_t' semantics [PR108099]

My earlier patch for 108099 made us accept this non-standard pattern but
messed up the semantics, so that e.g. unsigned __int128_t was not a 128-bit
type.

PR c++/108099

gcc/cp/ChangeLog:

* decl.cc (grokdeclarator): Keep typedef_decl for __int128_t.

gcc/testsuite/ChangeLog:

* g++.dg/ext/int128-8.C: New test.

RISC-V: Support 128 bit vector chunk

RISC-V has provide different VLEN configuration by different ISA
extension like `zve32x`, `zve64x` and `v`
zve32x just guarantee the minimal VLEN is 32 bits,
zve64x guarantee the minimal VLEN is 64 bits,
and v guarantee the minimal VLEN is 128 bits,

Current status (without this patch):

Zve32x: Mode for one vector register mode is VNx1SImode and VNx1DImode
is invalid mode
- one vector register could hold 1 + 1x SImode where x is 0~n, so it
might hold just one SI

Zve64x: Mode for one vector register mode is VNx1DImode or VNx2SImode
- one vector register could hold 1 + 1x DImode where x is 0~n, so it
might hold just one DI.
- one vector register could hold 2 + 2x SImode where x is 0~n, so it
might hold just two SI.

However `v` extension guarantees the minimal VLEN is 128 bits.

We introduce another type/mode mapping for this configure:

v: Mode for one vector register mode is VNx2DImode or VNx4SImode
- one vector register could hold 2 + 2x DImode where x is 0~n, so it
will hold at least two DI
- one vector register could hold 4 + 4x SImode where x is 0~n, so it
will hold at least four DI

This patch model the mode more precisely for the RVV, and help some
middle-end optimization that assume number of element must be a
multiple of two.

gcc/ChangeLog:

* config/riscv/riscv-modes.def (FLOAT_MODE): Add chunk 128 support.
(VECTOR_BOOL_MODE): Ditto.
(ADJUST_NUNITS): Ditto.
(ADJUST_ALIGNMENT): Ditto.
(ADJUST_BYTESIZE): Ditto.
(ADJUST_PRECISION): Ditto.
(RVV_MODES): Ditto.
(VECTOR_MODE_WITH_PREFIX): Ditto.
* config/riscv/riscv-v.cc (ENTRY): Ditto.
(get_vlmul): Ditto.
(get_ratio): Ditto.
* config/riscv/riscv-vector-builtins.cc (DEF_RVV_TYPE): Ditto.
* config/riscv/riscv-vector-builtins.def (DEF_RVV_TYPE): Ditto.
(vbool64_t): Ditto.
(vbool32_t): Ditto.
(vbool16_t): Ditto.
(vbool8_t): Ditto.
(vbool4_t): Ditto.
(vbool2_t): Ditto.
(vbool1_t): Ditto.
(vint8mf8_t): Ditto.
(vuint8mf8_t): Ditto.
(vint8mf4_t): Ditto.
(vuint8mf4_t): Ditto.
(vint8mf2_t): Ditto.
(vuint8mf2_t): Ditto.
(vint8m1_t): Ditto.
(vuint8m1_t): Ditto.
(vint8m2_t): Ditto.
(vuint8m2_t): Ditto.
(vint8m4_t): Ditto.
(vuint8m4_t): Ditto.
(vint8m8_t): Ditto.
(vuint8m8_t): Ditto.
(vint16mf4_t): Ditto.
(vuint16mf4_t): Ditto.
(vint16mf2_t): Ditto.
(vuint16mf2_t): Ditto.
(vint16m1_t): Ditto.
(vuint16m1_t): Ditto.
(vint16m2_t): Ditto.
(vuint16m2_t): Ditto.
(vint16m4_t): Ditto.
(vuint16m4_t): Ditto.
(vint16m8_t): Ditto.
(vuint16m8_t): Ditto.
(vint32mf2_t): Ditto.
(vuint32mf2_t): Ditto.
(vint32m1_t): Ditto.
(vuint32m1_t): Ditto.
(vint32m2_t): Ditto.
(vuint32m2_t): Ditto.
(vint32m4_t): Ditto.
(vuint32m4_t): Ditto.
(vint32m8_t): Ditto.
(vuint32m8_t): Ditto.
(vint64m1_t): Ditto.
(vuint64m1_t): Ditto.
(vint64m2_t): Ditto.
(vuint64m2_t): Ditto.
(vint64m4_t): Ditto.
(vuint64m4_t): Ditto.
(vint64m8_t): Ditto.
(vuint64m8_t): Ditto.
(vfloat32mf2_t): Ditto.
(vfloat32m1_t): Ditto.
(vfloat32m2_t): Ditto.
(vfloat32m4_t): Ditto.
(vfloat32m8_t): Ditto.
(vfloat64m1_t): Ditto.
(vfloat64m2_t): Ditto.
(vfloat64m4_t): Ditto.
(vfloat64m8_t): Ditto.
* config/riscv/riscv-vector-switch.def (ENTRY): Ditto.
* config/riscv/riscv.cc (riscv_legitimize_poly_move): Ditto.
(riscv_convert_vector_bits): Ditto.
* config/riscv/riscv.md:
* config/riscv/vector-iterators.md:
* config/riscv/vector.md
(@pred_indexed_<order>store<VNX32_QH:mode><VNX32_QHI:mode>): Ditto.
(@pred_indexed_<order>store<VNX32_QHS:mode><VNX32_QHSI:mode>): Ditto.
(@pred_indexed_<order>store<VNX64_Q:mode><VNX64_Q:mode>): Ditto.
(@pred_indexed_<order>store<VNX64_QH:mode><VNX64_QHI:mode>): Ditto.
(@pred_indexed_<order>store<VNX128_Q:mode><VNX128_Q:mode>): Ditto.
(@pred_reduc_<reduc><mode><vlmul1_zve64>): Ditto.
(@pred_widen_reduc_plus<v_su><mode><vwlmul1_zve64>): Ditto.
(@pred_reduc_plus<order><mode><vlmul1_zve64>): Ditto.
(@pred_widen_reduc_plus<order><mode><vwlmul1_zve64>): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/pr108185-4.c: Adapt testcase.
* gcc.target/riscv/rvv/base/spill-1.c: Ditto.
* gcc.target/riscv/rvv/base/spill-11.c: Ditto.
* gcc.target/riscv/rvv/base/spill-2.c: Ditto.
* gcc.target/riscv/rvv/base/spill-3.c: Ditto.
* gcc.target/riscv/rvv/base/spill-5.c: Ditto.
* gcc.target/riscv/rvv/base/spill-9.c: Ditto.

RISC-V: Align IOR optimization MODE_CLASS condition to AND.

This patch aligned the MODE_CLASS condition of the IOR to the AND. Then
more MODE_CLASS besides SCALAR_INT can able to perform the optimization
A | (~A) -> -1 similar to AND operator. For example as below sample code.

vbool32_t test_shortcut_for_riscv_vmorn_case_5(vbool32_t v1, size_t vl)
{
  return __riscv_vmorn_mm_b32(v1, v1, vl);
}

Before this patch:
vsetvli  a5,zero,e8,mf4,ta,ma
vlm.v    v24,0(a1)
vsetvli  zero,a2,e8,mf4,ta,ma
vmorn.mm v24,v24,v24
vsetvli  a5,zero,e8,mf4,ta,ma
vsm.v    v24,0(a0)
ret

After this patch:
vsetvli zero,a2,e8,mf4,ta,ma
vmset.m v24
vsetvli a5,zero,e8,mf4,ta,ma
vsm.v   v24,0(a0)
ret

Or in RTL's perspective,
from:
(ior:VNx2BI (reg/v:VNx2BI 137 [ v1 ]) (not:VNx2BI (reg/v:VNx2BI 137 [ v1 ])))
to:
(const_vector:VNx2BI repeat [ (const_int 1 [0x1]) ])

The similar optimization like VMANDN has enabled already. There should
be no difference execpt the operator when compare the VMORN and VMANDN
for such kind of optimization. The patch aligns the IOR MODE_CLASS condition
of the simplification to the AND operator.

gcc/ChangeLog:

* simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
Align IOR (A | (~A) -> -1) optimization MODE_CLASS condition to AND.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/base/mask_insn_shortcut.c: Update check
condition.
* gcc.target/riscv/simplify_ior_optimization.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>

i386: Emit compares between high registers and memory

Following code:

typedef __SIZE_TYPE__ size_t;

struct S1s
{
  char pad1;
  char val;
  short pad2;
};

extern char ts[256];

_Bool foo (struct S1s a, size_t i)
{
  return (ts[i] > a.val);
}

compiles with -O2 to:

        movl    %edi, %eax
        movsbl  %ah, %edi
        cmpb    %dil, ts(%rsi)
        setg    %al
        ret

the compare could use high register %ah instead of %dil:

        movl    %edi, %eax
        cmpb    ts(%rsi), %ah
        setl    %al
        ret

Use any_extract code iterator to handle signed and unsigned extracts
from high register and introduce peephole2 patterns to propagate
norex memory opeerand into the compare insn.

gcc/ChangeLog:

PR target/78904
PR target/78952
* config/i386/i386.md (*cmpqi_ext<mode>_1_mem_rex64): New insn pattern.
(*cmpqi_ext<mode>_1): Use nonimmediate_operand predicate
for operand 0. Use any_extract code iterator.
(*cmpqi_ext<mode>_1 peephole2): New peephole2 pattern.
(*cmpqi_ext<mode>_2): Use any_extract code iterator.
(*cmpqi_ext<mode>_3_mem_rex64): New insn pattern.
(*cmpqi_ext<mode>_1): Use general_operand predicate
for operand 1. Use any_extract code iterator.
(*cmpqi_ext<mode>_3 peephole2): New peephole2 pattern.
(*cmpqi_ext<mode>_4): Use any_extract code iterator.

gcc/testsuite/ChangeLog:

PR target/78904
PR target/78952
* gcc.target/i386/pr78952-3.c: New test.

aarch64: Factorise widening add/sub high-half expanders with iterators

I noticed these define_expand are almost identical modulo some string substitutions.
This patch compresses them together with a couple of code iterators.
No functional change intended.
Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64-simd.md (aarch64_saddw2<mode>): Delete.
(aarch64_uaddw2<mode>): Delete.
(aarch64_ssubw2<mode>): Delete.
(aarch64_usubw2<mode>): Delete.
(aarch64_<ANY_EXTEND:su><ADDSUB:optab>w2<mode>): New define_expand.

Use solve_add_graph_edge in more places

The following makes sure to use solve_add_graph_edge and honoring
special-cases, especially edges from escaped, in the remaining places
the solver adds edges.

* tree-ssa-structalias.cc (do_ds_constraint): Use
solve_add_graph_edge.

Split out solve_add_graph_edge

Split out a worker with all the special-casings when adding a graph
edge during solving.

* tree-ssa-structalias.cc (solve_add_graph_edge): New function,
split out from ...
(do_sd_constraint): ... here.

Remove odd code from gimple_can_merge_blocks_p

The following removes a special case to not merge a block with
only a non-local label. We have a restriction of non-local labels
to be the first statement (and label) in a block, but otherwise nothing,
if the last stmt of A is a non-local label then it will be still
the first statement of the combined A + B. In particular we'd
happily merge when there's a stmt after that label.

The check originates from the tree-ssa merge.

Bootstrapped and tested on x86_64-unknown-linux-gnu with all
languages.

* tree-cfg.cc (gimple_can_merge_blocks_p): Remove condition
rejecting the merge when A contains only a non-local label.

Introduce VIRTUAL_REGISTER_P and VIRTUAL_REGISTER_NUM_P predicates

These two predicates are similar to existing HARD_REGISTER_P and
HARD_REGISTER_NUM_P predicates and return 1 if the given register
corresponds to a virtual register.

gcc/ChangeLog:

* rtl.h (VIRTUAL_REGISTER_P): New predicate.
(VIRTUAL_REGISTER_NUM_P): Ditto.
(REGNO_PTR_FRAME_P): Use VIRTUAL_REGISTER_NUM_P predicate.
* expr.cc (force_operand): Use VIRTUAL_REGISTER_P predicate.
* function.cc (instantiate_decl_rtl): Ditto.
* rtlanal.cc (rtx_addr_can_trap_p_1): Ditto.
(nonzero_address_p): Ditto.
(refers_to_regno_p): Use VIRTUAL_REGISTER_NUM_P predicate.

Fix pointer sharing in Value_Range constructor.

gcc/ChangeLog:

* value-range.h (Value_Range::Value_Range): Avoid pointer sharing.

Transform more gmp/mpfr uses to use RAII

The following picks up the coccinelle generated patch from Bernhard,
leaving out the fortran frontend parts and fixing up the rest.
In particular both gmp.h and mpfr.h contain macros like
#define mpfr_inf_p(_x) ((_x)->_mpfr_exp == __MPFR_EXP_INF)
for which I add operator-> overloads to the auto_* classes.

* system.h (auto_mpz::operator->()): New.
* realmpfr.h (auto_mpfr::operator->()): New.
* builtins.cc (do_mpfr_lgamma_r): Use auto_mpfr.
* real.cc (real_from_string): Likewise.
(dconst_e_ptr): Likewise.
(dconst_sqrt2_ptr): Likewise.
* tree-ssa-loop-niter.cc (refine_value_range_using_guard):
Use auto_mpz.
(bound_difference_of_offsetted_base): Likewise.
(number_of_iterations_ne): Likewise.
(number_of_iterations_lt_to_ne): Likewise.
* ubsan.cc: Include realmpfr.h.
(ubsan_instrument_float_cast): Use auto_mpfr.

Revert "libstdc++: Export global iostreams with GLIBCXX_3.4.31 symver [PR108969]"

This reverts commit b7c54e3f48086c29179f7765a35c381de5109a0a.

libstdc++-v3/ChangeLog:

* config/abi/post/aarch64-linux-gnu/baseline_symbols.txt:
* config/abi/post/i486-linux-gnu/baseline_symbols.txt:
* config/abi/post/m68k-linux-gnu/baseline_symbols.txt:
* config/abi/post/powerpc64-linux-gnu/baseline_symbols.txt:
* config/abi/post/riscv64-linux-gnu/baseline_symbols.txt:
* config/abi/post/s390x-linux-gnu/baseline_symbols.txt:
* config/abi/post/x86_64-linux-gnu/32/baseline_symbols.txt:
* config/abi/post/x86_64-linux-gnu/baseline_symbols.txt:
* config/abi/pre/gnu.ver:
* src/Makefile.am:
* src/Makefile.in:
* src/c++98/Makefile.am:
* src/c++98/Makefile.in:
* src/c++98/globals_io.cc (defined):
(_GLIBCXX_IO_GLOBAL):

Revert "libstdc++: Fix preprocessor condition in linker script [PR108969]"

This reverts commit 6067ae4557a3a7e5b08359e78a29b8a9d5dfedce.

libstdc++-v3/ChangeLog:

* config/abi/pre/gnu.ver:

Remove special-cased edges when solving copies

The following makes sure to remove the copy edges we ignore or
need to special-case only once.

* tree-ssa-structalias.cc (solve_graph): Remove self-copy
edges, remove edges from escaped after special-casing them.

Fix do_sd_constraint escape special casing

The following fixes the escape special casing to test the proper
variable IDs.

* tree-ssa-structalias.cc (do_sd_constraint): Fixup escape
special casing.

Remove senseless store in do_sd_constraint

* tree-ssa-structalias.cc (do_sd_constraint): Do not write
to the LHS varinfo solution member.

Avoid non-unified nodes on the topological sorting for PTA solving

Since we do not update successor edges when merging nodes we have
to deal with this in the users. The following avoids putting those
on the topo order vector.

* tree-ssa-structalias.cc (topo_visit): Look at the real
destination of edges.

tree-optimization/44794 - avoid excessive RTL unrolling on epilogues

The following adjusts tree_[transform_and_]unroll_loop to set an
upper bound on the number of iterations on the epilogue loop it
creates. For the testcase at hand which involves array prefetching
this avoids applying RTL unrolling to them when -funroll-loops is
specified.

Other users of this API includes predictive commoning and
unroll-and-jam.

PR tree-optimization/44794
* tree-ssa-loop-manip.cc (tree_transform_and_unroll_loop):
If an epilogue loop is required set its iteration upper bound.

LoongArch: Improve cpymemsi expansion [PR109465]

We'd been generating really bad block move sequences which is recently
complained by kernel developers who tried __builtin_memcpy.  To improve
it:

1. Take the advantage of -mno-strict-align.  When it is set, set mode
   size to UNITS_PER_WORD regardless of the alignment.
2. Half the mode size when (block size) % (mode size) != 0, instead of
   falling back to ld.bu/st.b at once.
3. Limit the length of block move sequence considering the number of
   instructions, not the size of block.  When -mstrict-align is set and
   the block is not aligned, the old size limit for straight-line
   implementation (64 bytes) was definitely too large (we don't have 64
   registers anyway).

Change since v1: add a comment about the calculation of num_reg.

gcc/ChangeLog:

PR target/109465
* config/loongarch/loongarch-protos.h
(loongarch_expand_block_move): Add a parameter as alignment RTX.
* config/loongarch/loongarch.h:
(LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER): Remove.
(LARCH_MAX_MOVE_BYTES_STRAIGHT): Remove.
(LARCH_MAX_MOVE_OPS_PER_LOOP_ITER): Define.
(LARCH_MAX_MOVE_OPS_STRAIGHT): Define.
(MOVE_RATIO): Use LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
* config/loongarch/loongarch.cc (loongarch_expand_block_move):
Take the alignment from the parameter, but set it to
UNITS_PER_WORD if !TARGET_STRICT_ALIGN.  Limit the length of
straight-line implementation with LARCH_MAX_MOVE_OPS_STRAIGHT
instead of LARCH_MAX_MOVE_BYTES_STRAIGHT.
(loongarch_block_move_straight): When there are left-over bytes,
half the mode size instead of falling back to byte mode at once.
(loongarch_block_move_loop): Limit the length of loop body with
LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
* config/loongarch/loongarch.md (cpymemsi): Pass the alignment
to loongarch_expand_block_move.

gcc/testsuite/ChangeLog:

PR target/109465
* gcc.target/loongarch/pr109465-1.c: New test.
* gcc.target/loongarch/pr109465-2.c: New test.
* gcc.target/loongarch/pr109465-3.c: New test.

LoongArch: Improve GAR store for va_list

LoongArch backend used to save all GARs for a function with variable
arguments.  But sometimes a function only accepts variable arguments for
a purpose like C++ function overloading.  For example, POSIX defines
open() as:

    int open(const char *path, int oflag, ...);

But only two forms are actually used:

    int open(const char *pathname, int flags);
    int open(const char *pathname, int flags, mode_t mode);

So it's obviously a waste to save all 8 GARs in open().  We can use the
cfun->va_list_gpr_size field set by the stdarg pass to only save the
GARs necessary to be saved.

If the va_list escapes (for example, in fprintf() we pass it to
vfprintf()), stdarg would set cfun->va_list_gpr_size to 255 so we
don't need a special case.

With this patch, only one GAR ($a2/$r6) is saved in open().  Ideally
even this stack store should be omitted too, but doing so is not trivial
and AFAIK there are no compilers (for any target) performing the "ideal"
optimization here, see https://godbolt.org/z/n1YqWq9c9.

Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk
(GCC 14 or now)?

gcc/ChangeLog:

* config/loongarch/loongarch.cc
(loongarch_setup_incoming_varargs): Don't save more GARs than
cfun->va_list_gpr_size / UNITS_PER_WORD.

gcc/testsuite/ChangeLog:

* gcc.target/loongarch/va_arg.c: New test.

Avoid unnecessary epilogues from tree_unroll_loop

The following fixes the condition determining whether we need an
epilogue.

* tree-ssa-loop-manip.cc (determine_exit_conditions): Fix
no epilogue condition.

Simplify gimple_assign_load

The following simplifies and outlines gimple_assign_load. In
particular it is not necessary to get at the base of the possibly
loaded expression but just handle the case of a single handled
component wrapping a non-memory operand.

* gimple.h (gimple_assign_load): Outline...
* gimple.cc (gimple_assign_load): ... here. Avoid
get_base_address and instead just strip the outermost
handled component, treating a remaining handled component
as load.

aarch64: Delete __builtin_aarch64_neg* builtins and their use

I don't think we need to keep the __builtin_aarch64_neg* builtins around.
They are only used once in the vnegh_f16 intrinsic in arm_fp16.h and I AFAICT
it was added this way only for the sake of orthogonality in
https://gcc.gnu.org/g:d7f33f07d88984cbe769047e3d07fc21067fbba9
We already use normal "-" negation in the other vneg* intrinsics, so do so here as well.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64-simd-builtins.def (neg): Delete builtins
definition.
* config/aarch64/arm_fp16.h (vnegh_f16): Reimplement using normal negation.