Pan Li [Fri, 14 Jul 2023 02:14:07 +0000 (10:14 +0800)]
RISC-V: Support basic floating-point dynamic rounding mode
This patch supports the basic floating-point dynamic
rounding mode for RVV.
We implement the dynamic rounding mode in the following steps.
1. Set entry to DYN and exit to DYN_EXIT.
2. Add one rtl variable into machine_function for backup/restore.
3. Back up the frm value on entry.
4. Restore the frm value on exit when the previous mode is not DYN.
5. Restore frm when mode switching to DYN.
6. Set frm when mode switching to STATIC.
Please *NOTE*: inline asm and calls within the cfun will be handled
in follow-up patches.
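As a rough illustration of the scheme above, consider this sketch (an
assumption-laden example: __riscv_vfadd_vv_f32m1_rm and __RISCV_FRM_RDN
come from the RVV intrinsic API, not from this patch):
#include <riscv_vector.h>
/* The add requests a static rounding mode (RDN), so the compiler is
   expected to back up frm on entry (frrm), write the requested mode
   (fsrm) before the add, and restore the dynamic mode afterwards.  */
vfloat32m1_t
f (vfloat32m1_t a, vfloat32m1_t b, size_t vl)
{
  return __riscv_vfadd_vv_f32m1_rm (a, b, __RISCV_FRM_RDN, vl);
}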
Signed-off-by: Pan Li <pan2.li@intel.com>
Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/riscv.cc (struct machine_function): Add new field.
(riscv_static_frm_mode_p): New function.
(riscv_emit_frm_mode_set): New function for emit FRM.
(riscv_emit_mode_set): Extract function for FRM.
(riscv_mode_needed): Fix the TODO.
(riscv_mode_entry): Initial dynamic frm RTL.
(riscv_mode_exit): Return DYN_EXIT.
* config/riscv/riscv.md: Add rdfrm.
* config/riscv/vector-iterators.md (unspecv): Add DYN_EXIT unspecv.
* config/riscv/vector.md (frm_mode): Add new mode dyn_exit.
(fsrm): Removed.
(fsrmsi_backup): New pattern for swap.
(fsrmsi_restore): New pattern for restore.
(fsrmsi_restore_exit): New pattern for restore exit.
(frrmsi): New pattern for backup.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-frm-insert-1.c: Adjust
test cases.
* gcc.target/riscv/rvv/base/float-point-frm-insert-10.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-2.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-3.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-4.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-5.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-6.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-7.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-8.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-insert-9.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-run-1.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-run-2.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-frm-run-3.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-1.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-10.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-11.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-12.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-13.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-14.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-15.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-16.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-17.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-18.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-19.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-2.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-20.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-21.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-22.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-23.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-24.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-25.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-26.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-27.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-28.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-29.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-3.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-30.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-31.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-32.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-4.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-5.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-6.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-7.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-8.c: New test.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-9.c: New test.
Jason Merrill [Thu, 13 Jul 2023 21:48:05 +0000 (17:48 -0400)]
c++: only cache constexpr calls that are constant exprs
In reviewing Nathaniel's patch for PR70331, it occurred to me that instead
of looking for various specific problematic things in the result of a
constexpr call to decide whether to cache it, we should use
reduced_constant_expression_p.
The change to that function is to avoid crashing on uninitialized objects of
non-class type.
In a trial version of this patch I checked to see what cases this stopped
caching; most were instances of partially-initialized return values, which
seem fine to not cache. Some were returning pointers to expiring local
variables, which we definitely don't want to cache. And one was bit-cast3.C,
which will be handled in a follow-up patch.
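For illustration, a hypothetical example of a call result that is no
longer cached (not from the patch; C++20, where uninitialized constexpr
locals are allowed):
struct S { int a, b; };
/* The returned S has s.b uninitialized, so the call result is not a
   reduced constant expression and is not cached.  */
constexpr S f (int x) { S s; s.a = x; return s; }
constexpr int g () { return f (1).a; }  // OK: reads only the initialized member
static_assert (g () == 1, "");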
gcc/cp/ChangeLog:
* constexpr.cc (cxx_eval_call_expression): Only cache
reduced_constant_expression_p results.
(reduced_constant_expression_p): Handle CONSTRUCTOR of scalar type.
(cxx_eval_constant_expression): Fold vectors here.
(cxx_eval_bare_aggregate): Not here.
combine-stack-adj: Change return type of predicate function from int to bool
gcc/ChangeLog:
* combine-stack-adj.cc (stack_memref_p): Change return type from
int to bool and adjust function body accordingly.
(rest_of_handle_stack_adjustments): Change return type to void.
combine: Change return type of predicate functions from int to bool
Also change some internal variables and function arguments from int to bool.
gcc/ChangeLog:
* combine.cc (struct reg_stat_type): Change last_set_invalid to bool.
(cant_combine_insn_p): Change return type from int to bool and adjust
function body accordingly.
(can_combine_p): Ditto.
(combinable_i3pat): Ditto. Change "i1_not_in_src" and "i0_not_in_src"
function arguments from int to bool.
(contains_muldiv): Change return type from int to bool and adjust
function body accordingly.
(try_combine): Ditto. Change "new_direct_jump" pointer function
argument from int to bool. Change "substed_i2", "substed_i1",
"substed_i0", "added_sets_0", "added_sets_1", "added_sets_2",
"i2dest_in_i2src", "i1dest_in_i1src", "i2dest_in_i1src",
"i0dest_in_i0src", "i1dest_in_i0src", "i2dest_in_i0src",
"i2dest_killed", "i1dest_killed", "i0dest_killed", "i1_feeds_i2_n",
"i0_feeds_i2_n", "i0_feeds_i1_n", "i3_subst_into_i2", "have_mult",
"swap_i2i3", "split_i2i3" and "changed_i3_dest" variables
from int to bool.
(subst): Change "in_dest", "in_cond" and "unique_copy" function
arguments from int to bool.
(combine_simplify_rtx): Change "in_dest" and "in_cond" function
arguments from int to bool.
(make_extraction): Change "unsignedp", "in_dest" and "in_compare"
function argument from int to bool.
(force_int_to_mode): Change "just_select" function argument
from int to bool. Change "next_select" variable to bool.
(rtx_equal_for_field_assignment_p): Change return type from
int to bool and adjust function body accordingly.
(merge_outer_ops): Ditto. Change "pcomp_p" pointer function
argument from int to bool.
(get_last_value_validate): Change return type from int to bool
and adjust function body accordingly.
(reg_dead_at_p): Ditto.
(reg_bitfield_target_p): Ditto.
(combine_instructions): Ditto. Change "new_direct_jump"
variable to bool.
(can_combine_p): Change return type from int to bool
and adjust function body accordingly.
(likely_spilled_retval_p): Ditto.
(can_change_dest_mode): Change "added_sets" function argument
from int to bool.
(find_split_point): Change "unsignedp" variable to bool.
(simplify_if_then_else): Change "comparison_p" and "swapped"
variables to bool.
(simplify_set): Change "other_changed" variable to bool.
(expand_compound_operation): Change "unsignedp" variable to bool.
(force_to_mode): Change "just_select" function argument
from int to bool. Change "next_select" variable to bool.
(extended_count): Change "unsignedp" function argument to bool.
(simplify_shift_const_1): Change "complement_p" variable to bool.
(simplify_comparison): Change "changed" variable to bool.
(rest_of_handle_combine): Change return type to void.
Harald Anlauf [Sun, 16 Jul 2023 20:17:27 +0000 (22:17 +0200)]
Fortran: intrinsics and deferred-length character arguments [PR95947,PR110658]
gcc/fortran/ChangeLog:
PR fortran/95947
PR fortran/110658
* trans-expr.cc (gfc_conv_procedure_call): For intrinsic procedures
whose result characteristics depend on the first argument and which
can be of type character, the character length will not be deferred.
gcc/testsuite/ChangeLog:
PR fortran/95947
PR fortran/110658
* gfortran.dg/deferred_character_37.f90: New test.
Andre Vieira [Mon, 17 Jul 2023 16:00:54 +0000 (17:00 +0100)]
Include insn-opinit.h in PLUGIN_H [PR110610]
This patch fixes PR110610 by including insn-opinit.h in the INTERNAL_FN_H list,
as insn-opinit.h is now required by internal-fn.h. This will lead to
insn-opinit.h being installed in the plugin directory.
ira: Skip empty regclass when setting up reg class relations
ira.cc:setup_reg_class_relations sets up ira_reg_class_subset (among
other things). If reg class cl3 has no registers, then that empty set
is always hard_reg_set_subset_p of any other set, and this makes
ira_reg_class_subset[ALL_REGS][NO_REGS] equal to such a regclass,
rather than NO_REGS.
This breaks code (lra-constraints.cc:in_class_p/curr_insn_transform,
for example) which uses NO_REGS to check for an empty regclass.
Why define an empty regclass? A regclass could be conditionally empty (via
TARGET_CONDITIONAL_REGISTER_USAGE) - for the avr target, ADDW_REGS and
NO_LD_REGS are empty for the avrtiny subarch, for example.
Fix by continuing the innermost loop if the corresponding reg class is empty.
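The trap can be seen in a tiny self-contained model (a sketch of the
semantics only; GCC's HARD_REG_SET and the real loops in ira.cc are
more involved):
#include <stdbool.h>
/* Model register classes as bitsets.  The empty set is a subset of
   every set, so ira_reg_class_subset[ALL_REGS][NO_REGS] could end up
   pointing at a conditionally empty class instead of NO_REGS.  */
typedef unsigned long long hard_reg_set;
static bool
hard_reg_set_subset_p (hard_reg_set a, hard_reg_set b)
{
  return (a & ~b) == 0;   /* trivially true whenever a == 0 */
}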
gcc/ChangeLog:
* ira.cc (setup_reg_class_relations): Continue
if regclass cl3 is hard_reg_set_empty_p.
OpenMP/Fortran: Parsing support for 'uses_allocators'
The 'uses_allocators' clause to the 'target' construct accepts predefined
allocators and can also be used to define a new allocator for a target region.
As predefined allocators in GCC do not require special handling, they can be,
and are, ignored after parsing, such that this feature now works. On the other
hand, defining a new allocator will fail for now with a 'sorry, unimplemented'.
Note that both the OpenMP 5.0/5.1 and 5.2 syntax for uses_allocators
is supported by this commit.
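For illustration, the C/C++ spelling of the accepted case looks like
this (this commit adds the Fortran parsing; the clause is analogous,
and the snippet assumes a working OpenMP setup):
#include <omp.h>
void
f (void)
{
  /* A predefined allocator: accepted and effectively ignored after
     parsing, since it needs no special handling in GCC.  */
  #pragma omp target uses_allocators(omp_default_mem_alloc)
  {
    int *p = (int *) omp_alloc (sizeof (int), omp_default_mem_alloc);
    if (p)
      omp_free (p, omp_default_mem_alloc);
  }
}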
Martin Jambor [Mon, 17 Jul 2023 12:22:06 +0000 (14:22 +0200)]
Restore bootstrap by removing unused variable in tree-ssa-loop-ivcanon.cc
This restores bootstrap by removing the variable causing:
/home/mjambor/gcc/trunk/src/gcc/tree-ssa-loop-ivcanon.cc: In function ‘bool try_peel_loop(loop*, edge, tree, bool, long int)’:
/home/mjambor/gcc/trunk/src/gcc/tree-ssa-loop-ivcanon.cc:1170:17: error: variable ‘entry_count’ set but not used [-Werror=unused-but-set-variable]
1170 | profile_count entry_count = profile_count::zero ();
| ^~~~~~~~~~~
cc1plus: all warnings being treated as errors
Mikael Morin [Mon, 17 Jul 2023 12:14:22 +0000 (14:14 +0200)]
fortran: Pass pre-calculated class container argument [pr110618]
Pass already evaluated class container argument from
gfc_conv_procedure_call down to gfc_add_finalizer_call through
gfc_deallocate_scalar_with_status and gfc_deallocate_with_status,
to avoid repeatedly evaluating the same data reference expressions
in the generated code.
PR fortran/110618
gcc/fortran/ChangeLog:
* trans.h (gfc_deallocate_with_status): Add class container
argument.
(gfc_deallocate_scalar_with_status): Ditto.
* trans.cc (gfc_deallocate_with_status): Add class container
argument and pass it down to gfc_add_finalize_call.
(gfc_deallocate_scalar_with_status): Same.
* trans-array.cc (structure_alloc_comps): Update caller.
* trans-stmt.cc (gfc_trans_deallocate): Ditto.
* trans-expr.cc (gfc_conv_procedure_call): Ditto. Pass
pre-evaluated class container argument if it's available.
Mikael Morin [Mon, 17 Jul 2023 12:14:18 +0000 (14:14 +0200)]
fortran: Use pre-evaluated class container if available [PR110618]
Add the possibility to provide a pre-evaluated class container argument
to gfc_add_finalizer to avoid repeatedly evaluating data reference
expressions in the generated code.
PR fortran/110618
gcc/fortran/ChangeLog:
* trans.h (gfc_add_finalizer_call): Add class container argument.
* trans.cc (gfc_add_finalizer_call): Ditto. Pass down new
argument to get_final_proc_ref, get_elem_size, get_var_descr,
and get_vptr.
(get_elem_size): Add class container argument.
Use provided class container if it's available.
(get_var_descr): Same.
(get_vptr): Same.
(get_final_proc_ref): Same. Add a boolean telling whether the class
container argument is used, and set it. Don't try to use
final_wrapper if the class container argument was used.
Mikael Morin [Mon, 17 Jul 2023 12:14:14 +0000 (14:14 +0200)]
fortran: Factor scalar descriptor generation
The same scalar descriptor generation code is present twice, in the
case of derived type entities, and in the case of polymorphic
non-coarray entities. Factor it in preparation for a future third case
that will also need the same code for scalar descriptor generation.
Mikael Morin [Mon, 17 Jul 2023 12:13:53 +0000 (14:13 +0200)]
fortran: Push final procedure expr gen close to its one usage.
Final procedure pointer expression is generated in gfc_build_final_call
and only used in get_final_proc_ref. Move the generation there.
gcc/fortran/ChangeLog:
* trans.cc (gfc_add_finalizer_call): Remove local variable
final_expr. Pass down expr to get_final_proc_ref and move
final procedure expression generation down to its one usage
in get_final_proc_ref.
(get_final_proc_ref): Add argument expr. Remove argument
final_wrapper. Recreate final_wrapper from expr.
Mikael Morin [Mon, 17 Jul 2023 12:13:48 +0000 (14:13 +0200)]
fortran: Push element size expression generation close to its usage
gfc_add_finalizer_call creates one expression which is only used
by the get_final_proc_ref function. Move the expression generation
there.
gcc/fortran/ChangeLog:
* trans.cc (gfc_add_finalizer_call): Remove local variable
elem_size. Pass expression to get_elem_size and move the
element size expression generation close to its usage there.
(get_elem_size): Add argument expr, remove class_size argument
and rebuild it from expr. Remove ts argument and use the
type of expr instead.
Mikael Morin [Mon, 17 Jul 2023 12:13:44 +0000 (14:13 +0200)]
fortran: Reuse final procedure pointer expression
Reuse the same final procedure pointer expression twice instead of
translating it twice.
Final procedure pointer expressions were translated twice, once for the
final procedure call, and once for the check for non-nullness (if
applicable).
gcc/fortran/ChangeLog:
* trans.cc (gfc_add_finalizer_call): Move pre and post code for
the final procedure pointer expression to the outer block.
Reuse the previously evaluated final procedure pointer
expression.
Mikael Morin [Mon, 17 Jul 2023 12:13:37 +0000 (14:13 +0200)]
fortran: Add missing cleanup blocks
Move cleanup code for the data descriptor after the finalization code,
as it makes more sense to have it there.
Other cleanup blocks should be empty (element size and final pointer
are just data references), but add them by the way, just in case.
gcc/fortran/ChangeLog:
* trans.cc (gfc_add_finalizer_call): Add post code for desc_se
after the finalizer call. Add post code for final_se and
size_se as well.
Currently CCP throws away the known 1 bits because VRP and irange have
traditionally only had a way of tracking known 0s (set_nonzero_bits).
With the ability to keep all the known bits in the irange, we can now
save this between passes.
gcc/ChangeLog:
* tree-ssa-ccp.cc (ccp_finalize): Export value/mask known bits.
This patch adds reduc_*_scal patterns to support reduction auto-vectorization.
Use COND_LEN_* + reduc_*_scal to support unordered non-SLP auto-vectorization.
Consider the following case:
int __attribute__((noipa))
and_loop (int32_t * __restrict x,
          int32_t n, int res)
{
  for (int i = 0; i < n; ++i)
    res &= x[i];
  return res;
}
ASM:
and_loop:
        ble a1,zero,.L4
        vsetvli a3,zero,e32,m1,ta,ma
        vmv.v.i v1,-1
.L3:
        vsetvli a5,a1,e32,m1,tu,ma ------------> MUST BE "TU".
        slli a4,a5,2
        sub a1,a1,a5
        vle32.v v2,0(a0)
        add a0,a0,a4
        vand.vv v1,v2,v1
        bne a1,zero,.L3
        vsetivli zero,1,e32,m1,ta,ma
        vmv.v.i v2,-1
        vsetvli a3,zero,e32,m1,ta,ma
        vredand.vs v1,v1,v2
        vmv.x.s a5,v1
        and a0,a2,a5
        ret
.L4:
        mv a0,a2
        ret
Fix a bug in the VSETVL pass that was exposed by a reduction testcase.
SLP reduction and floating-point in-order reduction are not supported yet.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/rvv.exp: Add reduction tests.
* gcc.target/riscv/rvv/autovec/reduc/reduc-1.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-2.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-3.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc-4.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_run-4.c: New test.
Currently IPA throws away the known 1 bits because VRP and irange have
traditionally only had a way of tracking known 0s (set_nonzero_bits).
With the ability to keep all the known bits in the irange, we can now
save this between passes.
gcc/ChangeLog:
* ipa-prop.cc (ipcp_update_bits): Export value/mask known bits.
Kewen Lin [Mon, 17 Jul 2023 08:44:59 +0000 (03:44 -0500)]
vect: Initialize new_temp to avoid false positive warning [PR110652]
As PR110652 and its duplicate PRs show, there could be one
build error
error: 'new_temp' may be used uninitialized
for some build configurations. It's a false positive warning
(or error at -Werror), but in order to make the build succeed,
this patch is to initialize the reported variable 'new_temp'
as NULL_TREE.
PR tree-optimization/110652
gcc/ChangeLog:
* tree-vect-stmts.cc (vectorizable_load): Initialize new_temp as
NULL_TREE.
Richard Biener [Mon, 17 Jul 2023 07:20:33 +0000 (09:20 +0200)]
tree-optimization/110669 - bogus matching of loop bitop
The matching code lacked a check that we end up with a PHI node
in the loop header. This caused us to match a random PHI argument,
now caught by the extra PHI_ARG_DEF_FROM_EDGE checking.
PR tree-optimization/110669
* tree-scalar-evolution.cc (analyze_and_compute_bitop_with_inv_effect):
Check we matched a header PHI.
Add global setter for value/mask pair for SSA names.
This patch provides a way to set the value/mask pair of known bits
globally, similarly to how we can use set_nonzero_bits for known 0
bits. This can then be used by CCP and IPA to set value/mask info
instead of throwing away the known 1 bits.
In further clean-ups, I will see if it makes sense to remove
set_nonzero_bits altogether, since it is subsumed by value/mask.
The bit twiddling in union/intersect for the value/mask pair must be
normalized to have the unknown bits with a value of 0 in order to make
the math simpler. Normalizing at construction slowed VRP by 1.5% so I
opted to normalize before updating the bitmask in range-ops, since it
was the only user. However, with upcoming changes there will be
multiple setters of the mask (IPA and CCP), so we need something more
general.
I played with various alternatives, and settled on normalizing before
union/intersect which were the ones needing the bits cleared. With
this patch, there's no noticeable difference in performance either in
VRP or in overall compilation.
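To make the normalization concrete, here is a tiny self-contained
model of the value/mask semantics (an illustrative sketch; GCC's
irange_bitmask uses wide_int rather than plain integers):
typedef unsigned long long uw;
/* A mask bit of 1 means the bit is unknown; normalization forces the
   unknown bits of the value to 0.  */
static uw normalize (uw value, uw mask) { return value & ~mask; }
/* Union of two known-bits pairs: a bit stays known only when it is
   known on both sides and the known values agree.  */
static void
union_ (uw *value, uw *mask, uw v2, uw m2)
{
  uw v1 = normalize (*value, *mask);
  v2 = normalize (v2, m2);
  *mask |= m2 | (v1 ^ v2);
  *value = v1 & ~*mask;
}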
gcc/ChangeLog:
* value-range.cc (irange_bitmask::verify_mask): Mask need not be
normalized.
* value-range.h (irange_bitmask::union_): Normalize beforehand.
(irange_bitmask::intersect): Same.
I had messed up the case where the outer operator is `==`.
The check for the result should have been `==` and not `!=`.
This patch fixes that and adds a full runtime testcase now for
all cases to make sure it works.
OK? Bootstrapped and tested on x86-64-linux-gnu with no regressions.
gcc/ChangeLog:
PR tree-optimization/110666
* match.pd (A NEEQ (A NEEQ CST)): Fix Outer EQ case.
gcc/testsuite/ChangeLog:
PR tree-optimization/110666
* gcc.c-torture/execute/pr110666-1.c: New test.
i386: Auto vectorize usdot_prod, udot_prod with AVXVNNIINT16 instruction.
gcc/ChangeLog:
* config/i386/sse.md (VI2_AVX2): Delete V32HI since we actually
have the same iterator. Also rename all occurrences to
VI2_AVX2_AVX512BW.
(usdot_prod<mode>): New define_expand.
(udot_prod<mode>): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vnniint16-auto-vectorize-1.c: New test.
* gcc.target/i386/vnniint16-auto-vectorize-2.c: Ditto.
Jan Hubicka [Sun, 16 Jul 2023 21:56:59 +0000 (23:56 +0200)]
Fix profile update in scale_profile_for_vect_loop
When vectorizing 4 times, we sometimes do
for
  <4x vectorized body>
for
  <2x vectorized body>
for
  <1x vectorized body>
Here the last two fors, which handle the epilogue, never iterate.
Currently the vectorizer thinks that the middle for iterates twice.
This turns out to be scale_profile_for_vect_loop, which uses
niter_for_unrolled_loop.
At that time we know the epilogue will iterate at most 2 times,
but niter_for_unrolled_loop does not know that the last iteration
will be taken by the epilogue-of-epilogue, and thus it thinks
that the loop may iterate once and exit in the middle of the second
iteration.
We already do a correct job updating the niter bounds; this is
just an ordering issue. This patch makes us first update
the bounds and then update the loop profile. I re-implemented
the function more correctly and precisely.
The loop reducing the iteration factor for overly flat profiles is a bit
funny, but the only other method I can think of is to compute an sreal
scale, which I think would have similar overhead.
Bootstrapped/regtested x86_64-linux, will commit it shortly.
Jan Hubicka [Sun, 16 Jul 2023 21:55:14 +0000 (23:55 +0200)]
Fix optimize_mask_stores profile update
While looking into sphinx3 regression I noticed that vectorizer produces
BBs with overall probability count 120%. This patch fixes it.
Richi, I don't know how to create a testcase, but having one would
be nice.
Bootstrapped/regtested x86_64-linux, will commit it shortly.
gcc/ChangeLog:
PR tree-optimization/110649
* tree-vect-loop.cc (optimize_mask_stores): Set correctly
probability of the if-then-else construct.
Jan Hubicka [Sun, 16 Jul 2023 21:53:56 +0000 (23:53 +0200)]
Avoid double profile update in try_peel_loop
try_peel_loop uses gimple_duplicate_loop_body_to_header_edge, which subtracts
the profile from the original loop. However, it then tries to scale the
profile in a wrong way (it forces the header count to be the entry count).
This eliminates two profile misupdates in the internal loop of sphinx3.
David Edelsohn [Sat, 15 Jul 2023 22:44:25 +0000 (18:44 -0400)]
testsuite: Require 128 bit long double for ibmlongdouble.
pr103628.f90 adds the -mabi=ibmlongdouble option, but AIX defaults
to 64 bit long double. This patch adds -mlong-double-128 to ensure
that the testcase is compiled with 128 bit long double.
Here the call A().f() is represented as a COMPOUND_EXPR whose first
operand is the otherwise unused object argument A() and second operand
is the call result (both are TARGET_EXPRs). Within the return statement,
this outermost COMPOUND_EXPR ends up foiling the copy elision check in
build_special_member_call, resulting in us introducing a bogus call to the
deleted move constructor. (Within the variable initialization, which goes
through ocp_convert instead of convert_for_initialization, we've already
been eliding the copy -- despite the outermost COMPOUND_EXPR -- ever since r10-7410-g72809d6fe8e085 made ocp_convert look through COMPOUND_EXPR).
In contrast I noticed '(A(), A::f())' (which should be equivalent to
the above call) is represented with the COMPOUND_EXPR inside the RHS's
TARGET_EXPR initializer thanks to a special case in cp_build_compound_expr.
So this patch fixes this by making keep_unused_object_arg use
cp_build_compound_expr as well.
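A hedged reconstruction of the problematic pattern (hypothetical,
modeled on the description above; the committed testcase may differ):
struct A {
  A () {}
  A (A&&) = delete;           // deleted move constructor
  static A f ();              // static, so A() below is an unused object arg
};
A g () { return A ().f (); }  // bogus call to the deleted move ctor here
A x = A ().f ();              // the variable init already elided the copy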
PR c++/110441
gcc/cp/ChangeLog:
* call.cc (keep_unused_object_arg): Use cp_build_compound_expr
instead of building a COMPOUND_EXPR directly.
Jason Merrill [Fri, 14 Jul 2023 13:37:21 +0000 (09:37 -0400)]
c++: c++26 regression fixes
Apparently I wasn't actually running the testsuite in C++26 mode like I
thought I was, so there were some failures I wasn't seeing.
The constexpr hunk fixes regressions with the P2738 implementation; we still
need to use the old handling for casting from void pointers to heap
variables.
PR c++/110344
gcc/cp/ChangeLog:
* constexpr.cc (cxx_eval_constant_expression): Move P2738 handling
after heap handling.
* name-lookup.cc (get_cxx_dialect_name): Add C++26.
gcc/testsuite/ChangeLog:
* g++.dg/cpp0x/constexpr-cast2.C: Adjust for P2738.
* g++.dg/ipa/devirt-45.C: Handle -fimplicit-constexpr.
Christophe Lyon [Wed, 28 Jun 2023 14:29:15 +0000 (14:29 +0000)]
arm: [MVE intrinsics] Factorize vcaddq vhcaddq
Factorize vcaddq, vhcaddq so that they use the same parameterized
names.
To be able to use the same patterns, we add a suffix to vcaddq.
Note that vcadd uses UNSPEC_VCADDxx for builtins without predication,
and VCADDQ_ROTxx_M_x (that is, not starting with "UNSPEC_"). The
UNSPEC_* names are also used by neon.md.
Roger Sayle [Fri, 14 Jul 2023 17:21:56 +0000 (18:21 +0100)]
PR target/110588: Add *bt<mode>_setncqi_2 to generate btl on x86.
This patch resolves PR target/110588 to catch another case in combine
where the i386 backend should be generating a btl instruction. This adds
another define_insn_and_split to recognize the RTL representation for this
case.
I also noticed that two related define_insn_and_split weren't using the
preferred string style for single statement preparation-statements, so
I've reformatted these to be consistent in style with the new one.
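A hypothetical flavor of the kind of source involved (illustrative
only; the committed pr110588.c testcase may differ):
unsigned char
bit_clear_p (unsigned int x, int n)
{
  /* Test bit n and return its complement; combine should now match
     the RTL for this as btl followed by setnc.  */
  return ((x >> n) & 1) == 0;
}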
2023-07-14 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110588
* config/i386/i386.md (*bt<mode>_setcqi): Prefer string form
preparation statement over braces for a single statement.
(*bt<mode>_setncqi): Likewise.
(*bt<mode>_setncqi_2): New define_insn_and_split.
gcc/testsuite/ChangeLog
PR target/110588
* gcc.target/i386/pr110588.c: New test case.
Marek Polacek [Thu, 25 May 2023 22:54:18 +0000 (18:54 -0400)]
c++: wrong error with static constexpr var in tmpl [PR109876]
Since r8-509, we'll no longer create a static temporary var for
the initializer '{ 1, 2 }' for num in the attached test because
the code in finish_compound_literal is now guarded by
'&& fcl_context == fcl_c99' but it's fcl_functional here. This
causes us to reject num as non-constant when evaluating it in
a template.
Jason's idea was to treat num as value-dependent even though it
actually isn't. This patch implements that suggestion.
We weren't marking objects whose type is an empty class type
constant. This patch changes that so that v_d_e_p doesn't need
to check is_really_empty_class.
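A hedged sketch of the affected pattern (hypothetical; the committed
tests differ in details):
template<typename T>
void f ()
{
  static constexpr int num[] = { 1, 2 };
  static_assert (num[0] == 1, "");  // was wrongly rejected as non-constant
}
template void f<int> ();  // instantiation still checks the assertion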
Co-authored-by: Jason Merrill <jason@redhat.com>
PR c++/109876
gcc/cp/ChangeLog:
* decl.cc (cp_finish_decl): Set TREE_CONSTANT when initializing
an object of empty class type.
* pt.cc (value_dependent_expression_p) <case VAR_DECL>: Treat a
constexpr-declared non-constant variable as value-dependent.
gcc/testsuite/ChangeLog:
* g++.dg/cpp0x/constexpr-template12.C: New test.
* g++.dg/cpp1z/constexpr-template1.C: New test.
* g++.dg/cpp1z/constexpr-template2.C: New test.
Roger Sayle [Fri, 14 Jul 2023 17:10:05 +0000 (18:10 +0100)]
i386: Improved insv of DImode/DFmode {high,low}parts into TImode.
This is the next piece towards a fix for (the x86_64 ABI issues affecting)
PR 88873. This patch generalizes the recent tweak to ix86_expand_move
for setting the highpart of a TImode reg from a DImode source using
*insvti_highpart_1, to handle both DImode and DFmode sources, and also
use the recently added *insvti_lowpart_1 for setting the lowpart.
Although this is another intermediate step (not yet a fix), towards
enabling *insvti and *concat* patterns to be candidates for TImode STV
(by using V2DI/V2DF instructions), it already improves things a little.
2023-07-14 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Generalize special
case inserting of 64-bit values into a TImode register, to handle
both DImode and DFmode using either *insvti_lowpart_1
or *insvti_highpart_1.
We rely on the "undefined" vals having a specific value (from the earlier
REG_EQUAL note), but actual code generation doesn't ensure this (it doesn't
need to). That said, the issue isn't the constant folding per se, but that
we do not actually constant fold and instead register an equality that
doesn't hold.
PR target/110206
gcc/ChangeLog:
* fwprop.cc (contains_paradoxical_subreg_p): Move to ...
* rtlanal.cc (contains_paradoxical_subreg_p): ... here.
* rtlanal.h (contains_paradoxical_subreg_p): Add prototype.
* cprop.cc (try_replace_reg): Do not set REG_EQUAL note
when the original source contains a paradoxical subreg.
Jan Hubicka [Fri, 14 Jul 2023 15:14:15 +0000 (17:14 +0200)]
Turn TODO_rebuild_frequencies to a pass
Currently we rebuild profile_counts from profile_probability after inlining,
because there is a chance that producing large loop nests may get
unrealistically large profile_count values. This is much less of a concern
since we switched to the new profile_count representation a while back.
This propagation can also compensate for profile inconsistencies caused by
optimization passes. Since the inliner is followed by basic cleanup passes
that do not use the profile, we get a more realistic profile by delaying the
recomputation until the basic optimizations exposed by inlining are finished.
This does not fit into the TODO machinery, so I turn rebuilding into a
stand-alone pass and schedule it before the first consumer of the profile
in the optimization queue.
I also added logic that avoids repropagating when the CFG is good and not
too close to overflow. Propagating visits every basic block loop_depth
times, so it is not linear, and avoiding it may help a bit.
On tramp3d we get 14 functions repropagated and 916 are OK. The repropagated
functions are the RB tree ones, where we produce crazy loop nests by
recursive inlining. This is something to fix independently.
gcc/ChangeLog:
* passes.cc (execute_function_todo): Remove
TODO_rebuild_frequencies
* passes.def: Add rebuild_frequencies pass.
* predict.cc (estimate_bb_frequencies): Drop
force parameter.
(tree_estimate_probability): Update call of
estimate_bb_frequencies.
(rebuild_frequencies): Turn into a pass; verify CFG profile consistency
first and do not rebuild if not necessary.
(class pass_rebuild_frequencies): New.
(make_pass_rebuild_frequencies): New.
* profile-count.h: Add profile_count::very_large_p.
* tree-inline.cc (optimize_inline_calls): Do not return
TODO_rebuild_frequencies
* tree-pass.h (TODO_rebuild_frequencies): Remove.
(make_pass_rebuild_frequencies): Declare.
Add comments to scatter_store_run-7.c, as per Robin's suggestion.
Enable COND_LEN_FMA auto-vectorization for floating-point FMA **without** ffast-math.
The middle-end support has been approved, and I will merge it after I finish
bootstrap && regression on x86:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624395.html
Now, it's time to send this patch.
Consider the following case:
__attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst, \
					    TYPE *__restrict a, \
					    TYPE *__restrict b, int n) \
{ \
  for (int i = 0; i < n; i++) \
    dst[i] += a[i] * b[i]; \
}
Notice: This patch only supports COND_LEN_FMA, **NO** COND_LEN_FNMA etc.,
since I haven't supported them in the middle end yet.
They will be supported in following patches soon.
gcc/ChangeLog:
* config/riscv/autovec.md (cond_len_fma<mode>): New pattern.
* config/riscv/riscv-protos.h (enum insn_type): New enum.
(expand_cond_len_ternop): New function.
* config/riscv/riscv-v.cc (emit_nonvlmax_fp_ternary_tu_insn): Ditto.
(expand_cond_len_ternop): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store_run-7.c:
Adapt testcase for link fail.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-1.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-3.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm_run-3.c: New test.
Jose E. Marchesi [Fri, 14 Jul 2023 11:54:06 +0000 (13:54 +0200)]
bpf: enable instruction scheduling
This patch adds a dummy FSM to bpf.md in order to get INSN_SCHEDULING
defined. If the latter is not defined, the `combine' pass generates
paradoxical subregs of mems, which seem to then be mishandled by LRA,
resulting in invalid code.
Tested in bpf-unknown-none.
gcc/ChangeLog:
2023-07-14 Jose E. Marchesi <jose.marchesi@oracle.com>
Mikael Morin [Fri, 14 Jul 2023 12:15:51 +0000 (14:15 +0200)]
fortran: Reorder array argument evaluation parts [PR92178]
In the case of an array actual arg passed to a polymorphic array dummy
with INTENT(OUT) attribute, reorder the argument evaluation code to
the following:
- first evaluate arguments' values, and data references,
- deallocate data references associated with an allocatable,
intent(out) dummy,
- create a class container using the freed data references.
The ordering used to be incorrect between the first two items,
when one argument was deallocated before a later argument evaluated
its expression depending on the former argument. r14-2395-gb1079fc88f082d3c5b583c8822c08c5647810259 fixed it by treating
arguments associated with an allocatable, intent(out) dummy in a
separate, later block. This, however, wasn't working either if the data
reference of such an argument was depending on its own content, as
the class container initialization was trying to use deallocated
content.
This change generates class container initialization code in a separate
block, so that it is moved after the deallocation block without moving
the rest of the argument evaluation code.
This alone is not sufficient to fix the problem, because the class
container generation code repeatedly uses the full expression of
the argument at a place where deallocation might have happened
already. This is non-optimal, but may also be invalid, because the data
reference may depend on its own content. In that case the expression
can't be evaluated after the data has been deallocated.
As in the scalar case previously treated, this is fixed by saving
the data reference to a pointer before any deallocation happens,
and then only referring to the pointer. gfc_reset_vptr is updated
to take into account the already evaluated class container if it's
available.
Contrary to the scalar case, one hunk is needed to wrap the parameter
evaluation in a conditional, to avoid regressing in
optional_class_2.f90. This used to be handled by the class wrapper
construction which wrapped the whole code in a conditional. With
this change the class wrapper construction can't see the parameter
evaluation code, so the latter is updated with an additional handling
for optional arguments.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans.h (gfc_reset_vptr): Add class_container argument.
* trans-expr.cc (gfc_reset_vptr): Ditto. If a valid vptr can
be obtained through class_container argument, bypass evaluation
of e.
(gfc_conv_procedure_call): Wrap the argument evaluation code
in a conditional if the associated dummy is optional. Evaluate
the data reference to a pointer now, and replace later
references with usage of the pointer.
Mikael Morin [Fri, 14 Jul 2023 12:15:21 +0000 (14:15 +0200)]
fortran: Factor data references for scalar class argument wrapping [PR92178]
In the case of a scalar actual arg passed to a polymorphic assumed-rank
dummy with INTENT(OUT) attribute, avoid repeatedly evaluating the actual
argument reference by saving a pointer to it. This is non-optimal, but
may also be invalid, because the data reference may depend on its own
content. In that case the expression can't be evaluated after the data
has been deallocated.
There are two ways redundant expressions are generated:
- parmse.expr, which contains the actual argument expression, is
reused to get or set subfields in gfc_conv_class_to_class.
- gfc_conv_class_to_class, to get the virtual table pointer associated
with the argument, generates a new expression from scratch starting
with the frontend expression.
The first part is fixed by saving parmse.expr to a pointer and using
the pointer instead of the original expression.
The second part is fixed by adding a separate field to gfc_se that
is set to the class container expression when the expression to
evaluate is polymorphic. This needs the same field in gfc_ss_info
so that its value can be propagated to gfc_conv_class_to_class which
is modified to use that value. Finally gfc_conv_procedure_call saves the
expression in that field to a pointer in between to avoid the same
problem as for the first part.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans.h (struct gfc_se): New field class_container.
(struct gfc_ss_info): Ditto.
(gfc_evaluate_data_ref_now): New prototype.
* trans.cc (gfc_evaluate_data_ref_now): Implement it.
* trans-array.cc (gfc_conv_ss_descriptor): Copy class_container
field from gfc_se struct to gfc_ss_info struct.
(gfc_conv_expr_descriptor): Copy class_container field from
gfc_ss_info struct to gfc_se struct.
* trans-expr.cc (gfc_conv_class_to_class): Use class container
set in class_container field if available.
(gfc_conv_variable): Set class_container field on encountering
class variables or components, clear it on encountering
non-class components.
(gfc_conv_procedure_call): Evaluate data ref to a pointer now,
and replace later references by usage of the pointer.
Mikael Morin [Fri, 14 Jul 2023 12:15:07 +0000 (14:15 +0200)]
fortran: defer class wrapper initialization after deallocation [PR92178]
If an actual argument is associated with an INTENT(OUT) dummy, and code
to deallocate it is generated, generate the class wrapper initialization
after the actual argument deallocation.
This is achieved by passing a cleaned up expression to
gfc_conv_class_to_class, so that the class wrapper initialization code
can be isolated and moved independently after the deallocation.
PR fortran/92178
gcc/fortran/ChangeLog:
* trans-expr.cc (gfc_conv_procedure_call): Use a separate gfc_se
struct, initialized from parmse, to generate the class wrapper.
After the class wrapper code has been generated, copy it back
depending on whether parameter deallocation code has been
generated.
libgomp/
* libgomp.texi (OMP_ALLOCATOR): Document the default values for
the traits. Add crossref to 'Memory allocation'.
(Memory allocation): Refer to OMP_ALLOCATOR for the available
traits and allocators/mem spaces; document the default value
for the pool_size trait.
All tree arguments to the PHI have the same number of occurrences, namely 1;
however, it makes a big difference which comparison we test first.
Sorting only on occurrences, we'll pick the compares coming from BB 18 and
BB 17. This means we end up generating 4 comparisons, while 2 would have
been enough.
By keeping track of the "complexity" of the COND in each BB, (i.e. the number
of comparisons needed to traverse from the start [BB 15] to end [BB 19]) and
using a key tuple of <occurrences, complexity> we end up selecting the compare
from BB 16 and BB 18 first. BB 16 only requires 1 compare, and BB 18, after we
test BB 16 also only requires one additional compare. This change paired with
the one previous above results in the optimal 2 compares.
Most of the comparisons are still needed because the chains of
occurrences do not negate each other. i.e. _80 is _73 & vr_15 >= -20 and
_85 is _73 & vr_15 < -20. Clearly, given that _73 needs to be true in both
branches, the only additional test needed is on vr_15, where one test is
the negation of the other. So we don't need to do the comparison of _73
twice.
The changes in the patch reduces the overall number of compares by one, but has
a bigger effect on the dependency chain.
Previously we would generate a chain of 5 instructions.
As explained in that commit, ifconvert sorts PHI args in increasing number of
occurrences in order to reduce the number of comparisons done while
traversing the tree.
The remaining task that this patch fixes is dealing with the long chain of
comparisons that can be created from phi nodes, particularly when they share
any common successor (classical example is a diamond node).
On a PHI node the true and false branches carry a condition: true carries
`a` and false carries `~a`. The issue is that at the moment GCC tests both
`a` and `~a` when the phi node has more than 2 arguments. Clearly this
isn't needed. The deeper the nesting of phi nodes, the larger the repetition.
As an example, for
foo (int *f, int d, int e)
{
  for (int i = 0; i < 1024; i++)
    {
      int a = f[i];
      int t;
      if (a < 0)
        t = 1;
      else if (a < e)
        t = 1 - a * d;
      else
        t = 0;
      f[i] = t;
    }
}
This correctly elides the test of _21. It is done by borrowing the
vectorizer's helper functions to limit predicate mask usage. Ifcvt chains
conditionals on the false edge (unless specifically inverted), so this
patch, on creating cond a ? b : c, will register ~a when traversing c.
If c is a conditional, then c will be simplified to the smallest possible
predicate given the assumptions we already know to be true.
gcc/ChangeLog:
PR tree-optimization/109154
* tree-if-conv.cc (gen_simplified_condition,
gen_phi_nest_statement): New.
(gen_phi_arg_condition, predicate_scalar_phi): Use it.
gcc/testsuite/ChangeLog:
PR tree-optimization/109154
* gcc.dg/vect/vect-ifcvt-19.c: New test.
Richard Biener [Fri, 14 Jul 2023 08:01:39 +0000 (10:01 +0200)]
Provide extra checking for phi argument access from edge
The following adds checking that the edge we query an associated
PHI arg for is related to the PHI node. Triggered by questionable
code in one of my reviews.
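A minimal sketch of the new checked accessor's shape (an
approximation; the exact declarations live in gimple.h):
/* Fetch the PHI argument for edge E, asserting that E really enters
   the PHI's basic block, so unrelated edges are caught at checking
   time instead of silently indexing a random argument.  */
static inline tree
gimple_phi_arg_def_from_edge (const gimple *gs, const edge e)
{
  gcc_checking_assert (e->dest == gimple_bb (gs));
  return gimple_phi_arg_def (gs, e->dest_idx);
}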
* gimple.h (gimple_phi_arg): New const overload.
(gimple_phi_arg_def): Make gimple arg const.
(gimple_phi_arg_def_from_edge): New inline function.
* tree-phinodes.h (gimple_phi_arg_imm_use_ptr_from_edge):
Likewise.
* tree-ssa-operands.h (PHI_ARG_DEF_FROM_EDGE): Direct to
new inline function.
(PHI_ARG_DEF_PTR_FROM_EDGE): Likewise.
libgomp: Fix allocator handling for Linux when libnuma is not available
Follow up to r14-2462-g450b05ce54d3f0. The case that libnuma was not
available at runtime was not properly handled; now it falls back to
the normal malloc.
libgomp/
* allocator.c (omp_init_allocator): Check whether symbol from
dlopened libnuma is available before using libnuma for
allocations.
Die Li [Fri, 14 Jul 2023 02:02:05 +0000 (02:02 +0000)]
RISC-V: Remove the redundant expressions in the and<mode>3.
When generating the gen_and<mode>3 function based on the and<mode>3
template, it produces the expression emit_insn (gen_rtx_SET (operand0,
gen_rtx_AND (<mode>, operand1, operand2)));, which is identical to the
portion I removed in this patch. Therefore, the redundant portion can be
deleted.
Signed-off-by: Die Li <lidie@eswincomputing.com>
gcc/ChangeLog:
* config/riscv/riscv.md: Remove redundant portion in and<mode>3.
When trying to bootstrap current trunk on macOS 14.0 beta 3 with Xcode
15 beta 4, the build failed running mklink in stage 2:
unset CC ; m2/boot-bin/mklink -s --langc++ --exit --name m2/mc-boot/main.cc
/vol/gcc/src/hg/master/darwin/gcc/m2/init/mcinit
dyld[55825]: Library not loaded: /vol/gcc/lib/libstdc++.6.dylib
While it's unclear to me why this only happens on macOS 14, the problem
is clear: unlike other C++ executables, mklink isn't linked with
-static-libstdc++ which is passed in from toplevel in LDFLAGS.
This patch fixes that and allows the build to continue.
Bootstrapped on x86_64-apple-darwin23.0.0, i386-pc-solaris2.11, and
sparc-sun-solaris2.11.
Mikael Morin [Thu, 13 Jul 2023 19:23:44 +0000 (21:23 +0200)]
fortran: Release symbols in reversed order [PR106050]
Release symbols in reversed order wrt the order they were allocated.
This fixes an error recovery ICE in the case of a misplaced
derived type declaration. Such a declaration creates nested
symbols, one for the derived type and one for each type parameter,
which should be immediately released as the declaration is
rejected. This breaks if the derived type is released first.
As the type parameter symbols are in the namespace of the derived
type, releasing the derived type releases the type parameters, so
one can't access them after that, even to release them. Hence,
the type parameters should be released first.
PR fortran/106050
gcc/fortran/ChangeLog:
* symbol.cc (gfc_restore_last_undo_checkpoint): Release symbols
in reverse order.
Darwin: Use -platform_version when available [PR110624].
Later versions of the static linker support a more flexible flag to
describe the OS, OS version and SDK used to build the code. This
replaces the functionality of '-macosx_version_min' (which is now
deprecated, leading to the diagnostic described in the PR).
We now use the platform_version flag when available which avoids the
diagnostic.
Carl Love [Thu, 13 Jul 2023 17:44:43 +0000 (13:44 -0400)]
rs6000, Add return value to __builtin_set_fpscr_rn
Change the return value from void to double for __builtin_set_fpscr_rn.
The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI,
and RN bit positions. A new test file, powerpc/test_fpscr_rn_builtin_2.c,
is added to test the new return value of the built-in.
The value __SET_FPSCR_RN_RETURNS_FPSCR__ is defined if
__builtin_set_fpscr_rn returns a double.
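A hedged usage sketch (illustrative; assumes the documented RN
encoding 0-3 and the existing __builtin_mtfsf builtin for the restore):
#ifdef __SET_FPSCR_RN_RETURNS_FPSCR__
double
scaled (double x)
{
  /* Switch to round-toward-zero (RN = 1), keeping the old bits.  */
  double saved = __builtin_set_fpscr_rn (1);
  double r = x * 3.0;
  /* Restore the saved DRN/VE/.../RN fields.  */
  __builtin_mtfsf (0xff, saved);
  return r;
}
#endif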
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/test_fpscr_rn_builtin.c: Rename to
test_fpscr_rn_builtin_1.c. Add comment.
* gcc.target/powerpc/test_fpscr_rn_builtin_2.c: New test for the
return value of __builtin_set_fpscr_rn builtin.
Jonathan Wakely [Thu, 13 Jul 2023 09:44:57 +0000 (10:44 +0100)]
libstdc++: std::stoi etc. do not need C99 <stdlib.h> support [PR110653]
std::stoi, std::stol, std::stoul, and std::stod only depend on C89
functions, so they don't need to be guarded by _GLIBCXX_USE_C99_STDLIB.
std::stoll and std::stoull don't need C99 strtoll and strtoull if
sizeof(long) == sizeof(long long).
std::stold doesn't need C99 strtold if DBL_MANT_DIG == LDBL_MANT_DIG.
This only applies to the narrow character overloads, the wchar_t
overloads depend on a separate _GLIBCXX_USE_C99_WCHAR macro and none of
them can be implemented in C89 easily.
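A simplified sketch of the fallback idea (not the actual libstdc++
code, which lives in basic_string.h and uses internal helpers):
#include <string>
#include <cstddef>
// When long and long long have the same width, strtol can stand in
// for C99 strtoll, so stoll can simply forward to stol.
#if __LONG_WIDTH__ == __LONG_LONG_WIDTH__
inline long long
stoll_fallback (const std::string& s, std::size_t* pos = nullptr,
                int base = 10)
{
  return std::stol (s, pos, base);
}
#endif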
libstdc++-v3/ChangeLog:
PR libstdc++/110653
* include/bits/basic_string.h (stoi, stol, stoul, stod): Do not
depend on _GLIBCXX_USE_C99_STDLIB.
[__LONG_WIDTH__ == __LONG_LONG_WIDTH__] (stoll, stoull): Define
in terms of stol and stoul respectively.
[__DBL_MANT_DIG__ == __LDBL_MANT_DIG__] (stold): Define in terms
of stod.
Andrew Pinski [Wed, 12 Jul 2023 07:33:14 +0000 (00:33 -0700)]
Fix part of PR 110293: `A NEEQ (A NEEQ CST)` part
This fixes part of PR 110293, for the outer comparison case
being `!=` or `==`. In turn PR 110539 is able to be optimized
again as the if statement for `(a&1) == ((a & 1) != 0)` gets optimized
to `false` early enough to allow FRE/DOM to do a CSE for memory store/load.
OK? Bootstrapped and tested on x86_64-linux with no regressions.
gcc/ChangeLog:
PR tree-optimization/110293
PR tree-optimization/110539
* match.pd: Expand the `x != (typeof x)(x == 0)`
pattern to handle cases where the inner and outer comparisons
are either `!=` or `==`, and handle constants other than 0.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/pr110293-1.c: New test.
* gcc.dg/tree-ssa/pr110539-1.c: New test.
* gcc.dg/tree-ssa/pr110539-2.c: New test.
* gcc.dg/tree-ssa/pr110539-3.c: New test.
* gcc.dg/tree-ssa/pr110539-4.c: New test.
[RA][PR109520]: Catch error when there are not enough registers for asm insn
An asm insn, unlike other insns, can have so many operands that their
constraints cannot be satisfied. This results in LRA cycling for such a
test case. The following patch catches this situation and reports the
problem.
PR middle-end/109520
gcc/ChangeLog:
* lra-int.h (lra_insn_recog_data): Add member asm_reloads_num.
(lra_asm_insn_error): New prototype.
* lra.cc: Include rtl-error.h.
(lra_set_insn_recog_data): Initialize asm_reloads_num.
(lra_asm_insn_error): New func whose code is taken from ...
* lra-assigns.cc (lra_split_hard_reg_for): ... here. Use lra_asm_insn_error.
* lra-constraints.cc (curr_insn_transform): Check the number of reloads for asm.
SSA MATH: Support COND_LEN_FMA for floating-point math optimization
Hi, Richard and Richi.
In a previous patch we supported COND_LEN_* binary operations; however, we
didn't support COND_LEN_* ternary operations.
Now, this patch supports COND_LEN_* ternary operations. Consider the
following case:
__attribute__ ((noipa)) void ternop_##TYPE (TYPE *__restrict dst, \
					    TYPE *__restrict a, \
					    TYPE *__restrict b, \
					    TYPE *__restrict c, int n) \
{ \
  for (int i = 0; i < n; i++) \
    dst[i] += a[i] * b[i]; \
}
David Edelsohn [Wed, 12 Jul 2023 18:31:20 +0000 (14:31 -0400)]
testsuite: dg-require LTO for libgomp LTO tests
Some test cases in libgomp testsuite pass -flto as an option, but
the testcases do not require LTO target support. This patch adds
the necessary DejaGNU requirement for LTO support to the testcases.
Pan Li [Wed, 12 Jul 2023 05:38:42 +0000 (13:38 +0800)]
RISC-V: Refactor riscv mode after for VXRM and FRM
When investigating the FRM dynamic rounding mode, we found that the global
unknown status is quite different between fixed-point and
floating-point. Thus, we separate the unknown handling, extracting
some common inner functions.
We will also prepare more test cases in another PATCH.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv.cc (vxrm_rtx): New static var.
(frm_rtx): Ditto.
(global_state_unknown_p): Removed.
(riscv_entity_mode_after): Removed.
(asm_insn_p): New function.
(vxrm_unknown_p): New function for fixed-point.
(riscv_vxrm_mode_after): Ditto.
(frm_unknown_dynamic_p): New function for floating-point.
(riscv_frm_mode_after): Ditto.
(riscv_mode_after): Leverage new functions.
Pan Li [Wed, 12 Jul 2023 15:01:39 +0000 (23:01 +0800)]
RISC-V: Add more tests for RVV floating-point FRM.
Add more test cases include both the asm check and run for RVV FRM.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-frm-insert-10.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-7.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-8.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-insert-9.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-1.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-2.c: New test.
* gcc.target/riscv/rvv/base/float-point-frm-run-3.c: New test.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS
This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load. We don't call function
vect_model_load_cost for it any more. It removes function
vect_model_load_cost which becomes useless and unreachable
now.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Remove.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
calling vect_model_load_cost.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_PERMUTE
This patch adjusts the cost handling on
VMAT_CONTIGUOUS_PERMUTE in function vectorizable_load. We
don't call function vect_model_load_cost for it any more.
As the affected test case gcc.target/i386/pr70021.c shows,
the previous costing can under-cost the total generated
vector loads, since for VMAT_CONTIGUOUS_PERMUTE function
vect_model_load_cost doesn't consider the group size, which
is used as vec_num during the transformation.
This patch makes the count of vector loads in costing
consistent with what we generate during the transformation.
To be more specific, for the given test case, for memory
access b[i_20], it costed 2 vector loads before;
with this patch it costs 8 instead, matching the final
count of vector loads generated from b. This costing
change makes the cost model analysis consider it unprofitable
to vectorize the first loop, so this patch adjusts the test
case to disable the vect cost model.
But note that this test case also exposes something we can
improve further: although the numbers of vector
permutations we cost and generate are consistent,
DCE can further optimize some unused permutations out.
It would be good if we could predict that and generate only
the necessary permutations.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Assert this function only
handle memory_access_type VMAT_CONTIGUOUS, remove some
VMAT_CONTIGUOUS_PERMUTE related handlings.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS_PERMUTE
without calling vect_model_load_cost.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr70021.c: Adjust with -fno-vect-cost-model.
Kewen Lin [Thu, 13 Jul 2023 02:23:22 +0000 (21:23 -0500)]
vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS_REVERSE
This patch adjusts the cost handling on
VMAT_CONTIGUOUS_REVERSE in function vectorizable_load. We
don't call function vect_model_load_cost for it any more.
This change makes us not miscount some required vector
permutation as the associated test case shows.
gcc/ChangeLog:
* tree-vect-stmts.cc (vect_model_load_cost): Assert it won't get
VMAT_CONTIGUOUS_REVERSE any more.
(vectorizable_load): Adjust the costing handling on
VMAT_CONTIGUOUS_REVERSE without calling vect_model_load_cost.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/ppc/costmodel-vect-reversed.c: New test.