Richard Biener [Sun, 13 Oct 2024 09:42:27 +0000 (11:42 +0200)]
tree-optimization/116481 - avoid building function_type[]
The following avoids building an array type with function or method
element type during diagnosing an array bound violation as this
will result in an error, rejecting a program with a not too useful
error message. Instead build such array type manually.
PR tree-optimization/116481
* pointer-query.cc (build_printable_array_type):
Build array types with function or method element type
manually to avoid a bogus diagnostic.
Richard Biener [Sun, 13 Oct 2024 13:12:44 +0000 (15:12 +0200)]
tree-optimization/116290 - fix compare-debug issue in ldist
Loop distribution does different analysis with -g0/-g due to counting
a debug stmt starting a BB against a limit which will eventually
lead to different IVOPTs choices. I've fixed a possible IVOPTs
issue on the way even though it doesn't make a difference here.
PR tree-optimization/116290
* tree-loop-distribution.cc (determine_reduction_stmt_1): PHIs
have no debug variants. Start with first non-debug real stmt.
* tree-ssa-loop-ivopts.cc (find_givs_in_bb): Do not analyze
debug stmts.
Richard Biener [Fri, 17 May 2024 09:02:29 +0000 (11:02 +0200)]
middle-end/115110 - Fix view_converted_memref_p
view_converted_memref_p was checking the reference type against the
pointer type of the offset operand rather than its pointed-to type
which leads to all refs being subject to view-convert treatment
in get_alias_set, causing numerous testsuite failures; with its
new uses from r15-512-g9b7cad5884f21c it is also a wrong-code issue.
liuhongt [Wed, 16 Oct 2024 05:43:48 +0000 (13:43 +0800)]
Refine splitters related to "combine vpcmpuw + zero_extend to vpcmpuw"
r12-6103-g1a7ce8570997eb combines vpcmpuw + zero_extend to vpcmpuw
with the pre_reload splitter, but the splitter transforms the
zero_extend into a subreg, which makes reload think the upper part is
garbage, which is not correct.
The patch adjusts the zero_extend define_insn_and_split to a
define_insn to keep the zero_extend.
gcc/ChangeLog:
PR target/117159
* config/i386/sse.md
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Change from define_insn_and_split to define_insn.
(*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Split to the zero_extend pattern.
(*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
(*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
(*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
Martin Jambor [Fri, 18 Oct 2024 19:32:16 +0000 (21:32 +0200)]
ipa: Treat static constructors and destructors as non-local (PR 115815)
In PR 115815, IPA-SRA thought it had control over all invocations of a
(recursive) static destructor but it did not see the implied
invocation which led to the original being left behind and the
clean-up code encountering uses of SSAs that definitely should have
been dead.
Fixed by teaching cgraph_node::can_be_local_p about static
constructors and destructors. A similar test was missing in
cgraph_node::local_p, so I added the check there as well.
In addition to the commit with the fix, this backport also contains
squashed commit 1a458bdeb223ffa501bac8e76182115681967094 which fixes
dejagnu directives in the testcase.
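For illustration, a minimal hedged sketch (not the PR 115815 testcase) of
the construct involved: a recursive function marked as a static destructor,
which therefore has an implicit caller besides the explicit recursive one.
  static int n;

  __attribute__ ((destructor))
  static void
  fini (void)
  {
    if (n++ < 3)
      fini ();   /* explicit recursive call; the runtime also calls fini at exit */
  }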
gcc/ChangeLog:
2024-07-25 Martin Jambor <mjambor@suse.cz>
PR ipa/115815
* cgraph.cc (cgraph_node_cannot_be_local_p_1): Also check
DECL_STATIC_CONSTRUCTOR and DECL_STATIC_DESTRUCTOR.
* ipa-visibility.cc (non_local_p): Likewise.
(cgraph_node::local_p): Delete extraneous line of tabs.
gcc/testsuite/ChangeLog:
2024-07-25 Martin Jambor <mjambor@suse.cz>
PR ipa/115815
* gcc.dg/lto/pr115815_0.c: New test.
Li Xu [Thu, 10 Oct 2024 14:51:19 +0000 (08:51 -0600)]
RISC-V:Bugfix for C++ code compilation failure with rv32imafc_zve32f[pr116883]
From: xuli <xuli1@eswincomputing.com>
Example as follows:
int main()
{
unsigned long arraya[128], arrayb[128], arrayc[128];
for (int i = 0; i < 128; i++)
{
arraya[i] = arrayb[i] + arrayc[i];
}
return 0;
}
Compiled with -march=rv32imafc_zve32f -mabi=ilp32f, it will cause a compilation issue:
riscv_vector.h:40:25: error: ambiguating new declaration of 'vint64m4_t __riscv_vle64(vbool16_t, const long long int*, unsigned int)'
40 | #pragma riscv intrinsic "vector"
| ^~~~~~~~
riscv_vector.h:40:25: note: old declaration 'vint64m1_t __riscv_vle64(vbool64_t, const long long int*, unsigned int)'
With zvl=32b, vbool16_t is registered in init_builtins() with
type_common.precision=0x101 (nunits=2), mode_nunits[E_RVVMF16BI]=[2,2].
Normally, vbool64_t is only valid when TARGET_MIN_VLEN > 32, so vbool64_t
is not registered in init_builtins(), meaning vbool64_t=null.
In order to implement __attribute__((target("arch=+v"))), we must register
all vector types and all RVV intrinsics. Therefore, vbool64_t will be registered
by default with zvl=128b in reinit_builtins(), resulting in
type_common.precision=0x101 (nunits=2) and mode_nunits[E_RVVMF64BI]=[2,2].
We then get TYPE_VECTOR_SUBPARTS(vbool16_t) == TYPE_VECTOR_SUBPARTS(vbool64_t),
calculated using type_common.precision, resulting in 2. Since vbool16_t and
vbool64_t have the same element type (boolean_type), the compiler treats them
as the same type, leading to a re-declaration conflict.
After all types and intrinsics have been registered, processing
__attribute__((target("arch=+v"))) will update the parameters option and
init_adjust_machine_modes. Therefore, to avoid conflicts, we can choose
zvl=4096b for the null type in reinit_builtins().
Patrick Palka [Tue, 15 Oct 2024 17:13:15 +0000 (13:13 -0400)]
c++: checking ICE w/ constexpr if and lambda as def targ [PR117054]
Here we're tripping over the assert in extract_locals_r which enforces
that an extra-args tree appearing inside another extra-args tree doesn't
actually have extra args. This invariant doesn't always hold for lambdas
(which recently gained the extra-args mechanism) but that should be
harmless since cp_walk_subtrees doesn't walk LAMBDA_EXPR_EXTRA_ARGS and
so should be immune to the PR114303 issue for now. So let's just disable
this assert for lambdas.
PR c++/117054
gcc/cp/ChangeLog:
* pt.cc (extract_locals_r): Disable tree_extra_args assert
for LAMBDA_EXPR.
Marek Polacek [Thu, 29 Aug 2024 14:40:50 +0000 (10:40 -0400)]
c++: ICE with -Wtautological-compare in template [PR116534]
Pre r14-4793, we'd call warn_tautological_cmp -> operand_equal_p
with operands wrapped in NON_DEPENDENT_EXPR, which works, since
o_e_p bails for codes it doesn't know. But now we pass operands
not encapsulated in NON_DEPENDENT_EXPR, and crash, because the
template tree for &a[x] has null DECL_FIELD_OFFSET.
This patch extends r12-7797 to cover the case when DECL_FIELD_OFFSET
is null.
PR c++/116534
gcc/ChangeLog:
* fold-const.cc (operand_compare::operand_equal_p): If either
field's DECL_FIELD_OFFSET is null, compare the fields with ==.
Marek Polacek [Wed, 28 Aug 2024 19:45:49 +0000 (15:45 -0400)]
c++: wrong error due to std::initializer_list opt [PR116476]
Here maybe_init_list_as_array gets elttype=field, init={NON_LVALUE_EXPR <2>}
and it tries to convert the init's element type (int) to field
using implicit_conversion, which works, so overall maybe_init_list_as_array
is successful.
But it constifies init_elttype so we end up with "const int". Later,
when we actually perform the conversion and invoke field::field(T&&),
we end up with this error:
error: binding reference of type 'int&&' to 'const int' discards qualifiers
So I think maybe_init_list_as_array should try to perform the conversion,
like it does below with fc.
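For illustration, a hedged sketch (not the PR's testcase) of the shape of
code involved, where the braced list's int elements must convert to the
element type through a constructor taking an rvalue reference:
  #include <vector>

  struct field
  {
    field (int &&) { }   // binds only to non-const rvalues
  };

  int
  main ()
  {
    // Constifying the deduced element type would make the int -> field
    // conversion ill-formed, as in the reported error above.
    std::vector<field> v = { 2 };
    return v.size () == 1 ? 0 : 1;
  }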
PR c++/116476
gcc/cp/ChangeLog:
* call.cc (maybe_init_list_as_array): Try convert_like and see if it
worked.
We cannot elide the TARGET_EXPR because we're taking its address.
It is set as eliding in massage_init_elt. I've tried to not set
TARGET_EXPR_ELIDING_P when the context is not direct-initialization.
That didn't work: even when it's not direct-initialization now, it
can become one later, for instance, after split_nonconstant_init.
One problem is that replace_placeholders_for_class_temp_r will replace
placeholders in non-eliding TARGET_EXPRs with the slot, but if we then
elide the TARGET_EXPR, we end up with a "stray" VAR_DECL and crash.
(Only some TARGET_EXPRs are handled by replace_decl.)
I thought I'd have to go back to
<https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651163.html> but
then I realized that this problem occurs only with ()-init but not
{}-init. With {}-init, there is no problem, because we are clearing
TARGET_EXPR_ELIDING_P in process_init_constructor_record:
/* We can't actually elide the temporary when initializing a
potentially-overlapping field from a function that returns by
value. */
if (ce->index
&& TREE_CODE (next) == TARGET_EXPR
&& unsafe_copy_elision_p (ce->index, next))
TARGET_EXPR_ELIDING_P (next) = false;
But that does not happen for ()-init because we have no ce->index.
()-init doesn't allow brace elision so we don't really reshape them.
But I can just move the clearing a few lines down and then it handles
both ()-init and {}-init.
PR c++/116424
gcc/cp/ChangeLog:
* typeck2.cc (process_init_constructor_record): Move the clearing of
TARGET_EXPR_ELIDING_P down.
Uros Bizjak [Tue, 15 Oct 2024 14:51:33 +0000 (16:51 +0200)]
i386: Fix expand_vector_set for VEC_MERGE/VEC_DUPLICATE RTX [PR117116]
Middle end can generate SYMBOL_REF RTX as a value "val" in the call
to expand_vector_set, but SYMBOL_REF RTX is not accepted in
<sse2p4_1>_pinsr<ssemodesuffix> insn pattern, generated via
VEC_MERGE/VEC_DUPLICATE RTX path.
Force the value into a register before VEC_MERGE/VEC_DUPLICATE RTX
is generated if it doesn't satisfy nonimmediate_operand predicate.
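For illustration, a hedged sketch (not the PR's testcase) of source that can
feed a SYMBOL_REF as the element value, by storing the address of a global
into a vector element:
  extern int g;
  typedef long long v2di __attribute__ ((vector_size (16)));

  v2di
  f (v2di v)
  {
    v[0] = (long long) &g;   // the value expands to a SYMBOL_REF at the RTL level
    return v;
  }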
PR target/117116
gcc/ChangeLog:
* config/i386/i386-expand.cc (expand_vector_set): Force "val"
into a register before VEC_MERGE/VEC_DUPLICATE RTX is generated
if it doesn't satisfy nonimmediate_operand predicate.
Jonathan Wakely [Wed, 16 Oct 2024 08:22:37 +0000 (09:22 +0100)]
libstdc++: Fix Python deprecation warning in printers.py
python/libstdcxx/v6/printers.py:1355: DeprecationWarning: 'count' is passed as positional argument
The Python docs say:
Deprecated since version 3.13: Passing count and flags as positional
arguments is deprecated. In future Python versions they will be
keyword-only parameters.
Using a keyword argument for count only became possible with Python 3.1
so introduce a new function to do the substitution.
libstdc++-v3/ChangeLog:
* python/libstdcxx/v6/printers.py (strip_fundts_namespace): New.
(StdExpAnyPrinter, StdExpOptionalPrinter): Use it.
Jonathan Wakely [Sun, 13 Oct 2024 20:47:14 +0000 (21:47 +0100)]
libstdc++: Implement LWG 3564 for ranges::transform_view
The _Iterator<true> type returned by begin() const uses const F& to
transform the elements, so it should use const F& to determine the
iterator's value_type and iterator_category as well.
This was accepted into the WP in July 2022.
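A hedged sketch of what the change affects, using an illustrative callable
whose const and non-const call operators return different types:
  #include <concepts>
  #include <ranges>
  #include <vector>

  struct F
  {
    int    operator() (int)       { return 0; }   // used via begin()
    double operator() (int) const { return 0.0; } // used via begin() const
  };

  int
  main ()
  {
    std::vector<int> v{1, 2, 3};
    auto tv = v | std::views::transform (F{});
    const auto &ctv = tv;
    // The const iterator transforms through const F&, so with LWG 3564 its
    // value_type here is double rather than int.
    static_assert (std::same_as<std::ranges::range_value_t<decltype (ctv)>, double>);
  }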
libstdc++-v3/ChangeLog:
* include/std/ranges (transform_view:_Iterator): Use const F&
to determine value_type and iterator_category of
_Iterator<true>, as per LWG 3564.
* testsuite/std/ranges/adaptors/transform.cc: Check value_type
and iterator_category.
Jonathan Wakely [Fri, 11 Oct 2024 08:40:38 +0000 (09:40 +0100)]
libstdc++: Fix localized %c formatting for <chrono> [PR117085]
When formatting a time point with %c we call std::vformat_to using the
formatting locale's D_T_FMT string, but we weren't adding the L option
to the format string. This meant we always interpreted D_T_FMT in the C
locale, instead of using the formatting locale as obviously intended
when %c is used.
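For illustration, roughly the kind of use affected (the locale name and the
time point are assumptions for the example; the named locale must be
installed):
  #include <chrono>
  #include <format>
  #include <iostream>
  #include <locale>

  int
  main ()
  {
    using namespace std::chrono;
    std::locale de ("de_DE.UTF-8");            // assumed to be installed
    auto tp = sys_days{2024y/October/11} + 8h + 40min;
    // %c should be interpreted with the formatting locale's D_T_FMT,
    // not the C locale's.
    std::cout << std::format (de, "{:L%c}", tp) << '\n';
  }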
libstdc++-v3/ChangeLog:
PR libstdc++/117085
* include/bits/chrono_io.h (__formatter_chrono::_M_c): Add L
option to format string.
* testsuite/std/time/format.cc: Move to...
* testsuite/std/time/format/format.cc: ...here.
* testsuite/std/time/format/pr117085.cc: New test.
Jonathan Wakely [Fri, 27 Sep 2024 15:54:31 +0000 (16:54 +0100)]
libstdc++: Tweak %c formatting for chrono types
libstdc++-v3/ChangeLog:
* include/bits/chrono_io.h (__formatter_chrono::_M_c): Add
[[unlikely]] attribute to condition for missing %c format in
locale. Use %T instead of %H:%M:%S in fallback.
Jonathan Wakely [Tue, 24 Sep 2024 22:20:56 +0000 (23:20 +0100)]
libstdc++: Populate generic std::time_get's wide %c format [PR117135]
I missed out the __timepunct<wchar_t> specialization for the "generic"
implementation when defining the %c format in r15-4016-gc534e37faccf48.
libstdc++-v3/ChangeLog:
PR libstdc++/117135
* config/locale/generic/time_members.cc
(__timepunct<wchar_t>::_M_initialize_timepunc): Set
_M_date_time_format for C locale. Set %Ex formats to the same
values as the %x formats.
Jonathan Wakely [Sun, 13 Oct 2024 21:48:43 +0000 (22:48 +0100)]
libstdc++: Use std::move for iterator in ranges::fill [PR117094]
Input iterators aren't required to be copyable.
libstdc++-v3/ChangeLog:
PR libstdc++/117094
* include/bits/ranges_algobase.h (__fill_fn): Use std::move for
iterator that might not be copyable.
* testsuite/25_algorithms/fill/constrained.cc: Check
non-copyable iterator with sized sentinel.
middle-end: Fix ifcvt predicate generation for masked function calls
Up until now, due to a latent bug in the code for the ifcvt pass,
irrespective of the branch taken in a conditional statement, the
original condition for the if statement was used in masking the
function call.
This patch ensures that the correct predicate mask generation is
carried out such that, upon autovectorization, the correct vector
lanes are selected in the vectorized function call.
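A hedged illustration (not the actual testcase) of the kind of loop
affected: a call under a condition inside a vectorizable loop, where the
masked call must use the predicate of the branch the call actually sits on
(assuming e.g. -Ofast and a target providing a vectorized math routine):
  void
  f (double *a, const double *b, const int *c, int n)
  {
    for (int i = 0; i < n; ++i)
      {
        if (c[i] > 0)
          a[i] = __builtin_sin (b[i]);   // call masked by the if-converted predicate
        else
          a[i] = b[i];
      }
  }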
gcc/ChangeLog:
* tree-if-conv.cc (predicate_statements): Fix handling of
predicated function calls.
Jan Hubicka [Mon, 22 Jul 2024 21:01:50 +0000 (23:01 +0200)]
Fix handling of ECF_NOVOPS in ipa-modref
As shown in the somewhat convoluted testcase, ipa-modref is mistreating
ECF_NOVOPS as "having no side effects". This comes from a time when
modref cared only about memory accesses and thus it was possible to
shortcut on it.
This patch removes (hopefully) all those bad shortcuts.
Bootstrapped/regtested x86_64-linux, committed.
The testcase contains a VNx2QImode pseudo that is live across a call
and that cannot be allocated a call-preserved register. LRA quite
reasonably tried to save it before the call and restore it afterwards.
Unfortunately, the target told it to do that in SImode, even though
punning between SImode and VNx2QImode is disallowed by both
TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.
The natural class to use for SImode is GENERAL_REGS, so this led
to an unsalvageable situation in which we had:
(set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))
where A needed GENERAL_REGS and B needed FP_REGS. We therefore ended
up in a reload loop.
The hooks above should ensure that this situation can never occur
for incoming subregs. It only happened here because the target
explicitly forced it.
The decision to use SImode for modes smaller than 4 bytes dates
back to the beginning of the port, before 16-bit floating-point
modes existed. I'm not sure whether promoting to SImode really
makes sense for any FPR, but that's a separate performance/QoI
discussion. For now, this patch just disallows using SImode
when it is wrong for correctness reasons, since that should be
safer to backport.
gcc/
PR testsuite/116238
* config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
Only return SImode if we can convert to and from it.
gcc/testsuite/
PR testsuite/116238
* gcc.target/aarch64/sve/pr116238.c: New test.
Steve Baird [Mon, 8 Jul 2024 21:45:55 +0000 (14:45 -0700)]
ada: Type conversion in instance incorrectly rejected.
In some cases, a legal type conversion in a generic package is correctly
accepted but the corresponding type conversion in an instance of the generic
is incorrectly rejected.
gcc/ada/
PR ada/114593
* sem_res.adb (Valid_Conversion): Test In_Instance instead of
In_Instance_Body.
According to Intel SOM[1], for Crestmont, most 256-bit Intel AVX2
instructions can be decomposed into two independent 128-bit
micro-operations, except for a subset of Intel AVX2 instructions
known as cross-lane operations, which can only compute the result for an
element by utilizing one or more sources belonging to other elements.
The 256-bit instructions listed below use more operand sources than
can be natively supported by a single reservation station within these
microarchitectures. They are decomposed into two μops, where the first
μop resolves a subset of operand dependencies across two cycles. The
dependent second μop executes the 256-bit operation by using a single
128-bit execution port for two consecutive cycles with a five-cycle
latency for a total latency of seven cycles.
Instead of setting the avx128_optimal tune for SRF, the patch adds a new
tune avx256_avoid_vec_perm for it, so by default the vectorizer still
uses a 256-bit VF if the cost is profitable, but lowers it to 128-bit
whenever a 256-bit vec_perm is needed for auto-vectorization. Without
vec_perm, the performance of 256-bit vectorization should be similar to
128-bit (some benchmark results show it's even better than 128-bit
vectorization since it enables more parallelism for convert cases).
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
Add new member m_num_avx256_vec_perm.
(ix86_vector_costs::add_stmt_cost): Record 256-bit vec_perm.
(ix86_vector_costs::finish_cost): Prevent vectorization for
TARGET_AVX256_AVOID_VEC_PERM when there's a 256-bit vec_perm
instruction.
* config/i386/i386.h (TARGET_AVX256_AVOID_VEC_PERM): New
Macro.
* config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): Add
m_CORE_ATOM.
(X86_TUNE_AVX256_AVOID_VEC_PERM): New tune.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx256_avoid_vec_perm.c: New test.
For Crestmont, 4-operand VEX blendv instructions come from MSROM and
are slower than the 3-instruction sequence (op1 & mask) | (op2 & ~mask);
the legacy blendv instruction can still be handled by the decoder.
The patch adds a new tune which is enabled for all processors except
SRF/CWF. It will use vpand + vpandn + vpor instead of
vpblendvb (similar for vblendvps/vblendvpd) for SRF/CWF.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_sse_movcc): Guard
instruction blendv generation under new tune.
* config/i386/i386.h (TARGET_SSE_MOVCC_USE_BLENDV): New Macro.
* config/i386/x86-tune.def (X86_TUNE_SSE_MOVCC_USE_BLENDV):
New tune.
Richard Biener [Wed, 9 Oct 2024 09:42:59 +0000 (11:42 +0200)]
tree-optimization/117041 - fix load classification of former grouped load
When we first detect a grouped load but later dis-associate it we
only set DR_GROUP_FIRST_ELEMENT to NULL, indicating it is not a
STMT_VINFO_GROUPED_ACCESS but leave DR_GROUP_NEXT_ELEMENT set. This
causes a stray DR_GROUP_NEXT_ELEMENT access in get_group_load_store_type
to go wrong, indicating a load isn't single_element_p when it actually
is, leading to wrong classification and an ICE.
PR tree-optimization/117041
* tree-vect-stmts.cc (get_group_load_store_type): Only
check DR_GROUP_NEXT_ELEMENT for STMT_VINFO_GROUPED_ACCESS.
Richard Biener [Mon, 30 Sep 2024 11:38:28 +0000 (13:38 +0200)]
tree-optimization/116879 - failure to recognize non-empty latch
When we relaxed the vectorizer's constraint on loop structure, verifying
the emptiness of the latch became too loose, as can be seen in the case
of PR116879 where the latch effectively contains two basic blocks, one
of which is an unmerged forwarder that's not empty.
PR tree-optimization/116879
* tree-vect-loop.cc (vect_analyze_loop_form): Scan all
blocks that form the latch.
Richard Biener [Thu, 26 Sep 2024 13:41:59 +0000 (15:41 +0200)]
tree-optimization/116850 - corrupt post-dom info
Path isolation computes post-dominators on demand but can end up
splitting blocks after that, wrecking it. We can delay splitting
of blocks until we no longer need the post-dom info which is what
the following patch does to solve the issue.
PR tree-optimization/116850
* gimple-ssa-isolate-paths.cc (bb_split_points): New global.
(insert_trap): Delay BB splitting if post-doms are computed.
(find_explicit_erroneous_behavior): Process delayed BB
splitting after releasing post dominators.
(gimple_ssa_isolate_erroneous_paths): Do not free post-dom
info here.
The following reverts a bogus fix done for PR101009 and instead makes
sure we get into the same_access_functions () case when computing
the distance vector for g[1] and g[1] where the constants ended up
having different types. The generic code doesn't seem to handle
loop invariant dependences. The special case gets us both
( 0 ) and ( 1 ) as distance vectors while formerly we got ( 1 ),
which the PR101009 fix changed to ( 0 ) with bad effects on other
cases as shown in this PR.
PR tree-optimization/116768
* tree-data-ref.cc (build_classic_dist_vector_1): Revert
PR101009 change.
* tree-chrec.cc (eq_evolutions_p): Make sure (sizetype)1
and (int)1 compare equal.
Currently the forward threader isn't limited as to the search space
it explores and with it now using path-ranger for simplifying
conditions it runs into, it became pretty slow for degenerate cases
like compiling insn-emit.cc for RISC-V, especially when compiling for
a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
The following makes the forward threader honor the search space
limit I introduced for the backward threader. This reduces
compile-time from minutes to seconds for the testcase in PR116166.
Note this wasn't necessary before we had ranger but with ranger
the work we do is quadratic in the length of the threading path
we build up (the same is true for the backwards threader).
PR tree-optimization/116166
* tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
Add limit parameter.
(jump_threader::thread_through_normal_block): Likewise.
* tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
Honor and decrement limit parameter.
(jump_threader::thread_through_normal_block): Likewise.
(jump_threader::thread_across_edge): Initialize limit from
param_max_jump_thread_paths and pass it down to workers.
Jakub Jelinek [Fri, 4 Oct 2024 11:12:45 +0000 (13:12 +0200)]
i386: Fix up ix86_expand_int_compare with TImode comparisons of SUBREGs from V8{H,B}Fmode against zero [PR116921]
The following testcase ICEs, because the ix86_expand_int_compare
optimization to use {,v}ptest assumes there are instructions for all
16-byte vector modes. That isn't the case, we only have one for
V16QI, V8HI, V4SI, V2DI, V1TI, V4SF and V2DF, not for
V8HF nor V8BF.
The following patch fixes that by using the V8HI instruction instead
for those 2 modes. tmp can't be a SUBREG, because it is SUBREG_REG
of another SUBREG, so we don't need to worry about gen_lowpart
failing.
2024-10-04 Jakub Jelinek <jakub@redhat.com>
PR target/116921
* config/i386/i386-expand.cc (ix86_expand_int_compare): Add a SUBREG
to V8HImode from V8HFmode or V8BFmode before generating a ptest.
Jakub Jelinek [Tue, 1 Oct 2024 07:52:20 +0000 (09:52 +0200)]
range-cache: Fix ranger ICE if number of bbs increases [PR116899]
Ranger cache during initialization reserves number of basic block slots
in the m_workback vector (in an inefficient way by doing create (0)
and safe_grow_cleared (n) and truncate (0) rather than say create (n)
or reserve (n) after create). The problem is that some passes split bbs and/or
create new basic blocks and so when one is unlucky, the quick_push into that
vector can ICE.
The following patch replaces those 4 quick_push calls with safe_push.
I've also gathered some statistics from compiling 63 gcc sources (picked those
that depend on gimple-range-cache.h just because I had to rebuild them once
for the instrumentation), and that showed that in 81% of cases nothing has
been pushed into the vector at all (and note, not everything was small, there
were even cases with 10000+ basic blocks), another 18.5% of cases had just 1-4
elements in the vector at most, 0.08% had 5-8 and 19 out of 305386 cases
had at most 9-11 elements, nothing more. So, IMHO reserving the number
of basic blocks in the vector is a waste of compile-time memory, and with
the safe_push calls the patch just initializes the vector to vNULL.
2024-10-01 Jakub Jelinek <jakub@redhat.com>
PR middle-end/116899
* gimple-range-cache.cc (ranger_cache::ranger_cache): Set m_workback
to vNULL instead of creating it, growing and then truncating.
(ranger_cache::fill_block_cache): Use safe_push rather than quick_push
on m_workback.
(ranger_cache::range_from_dom): Likewise.
Jakub Jelinek [Tue, 1 Oct 2024 07:49:49 +0000 (09:49 +0200)]
range-cache: Fix ICE on SSA_NAME with def_stmt not yet in the IL [PR116898]
Some passes like the bitint lowering queue some statements on edges and only
commit them at the end of the pass. If they use ranger at the same time,
the ranger might see such SSA_NAMEs and ICE on those. The following patch
instead just punts on them.
2024-10-01 Jakub Jelinek <jakub@redhat.com>
PR middle-end/116898
* gimple-range-cache.cc (ranger_cache::block_range): If a SSA_NAME
with NULL def_bb isn't SSA_NAME_IS_DEFAULT_DEF, return false instead
of failing assertion. Formatting fix.
Jakub Jelinek [Sun, 29 Sep 2024 19:52:32 +0000 (21:52 +0200)]
cselib: Discard useless locs of preserved VALUEs [PR116627]
remove_useless_values iteratively discards useless locs (locs of
cselib_val which refer to non-preserved VALUEs with no locations),
which in turn can make further values useless until no further VALUEs
are made useless and then discards the useless VALUEs.
Preserved VALUEs (something done during var-tracking only I think)
live in a different hash table, cselib_preserved_hash_table rather
than cselib_hash_table. cselib_find_slot first looks up slot in
cselib_preserved_hash_table and only if not found looks it up in
cselib_hash_table (and INSERTs only into the latter), whereas preservation
of a VALUE results in move of a cselib_val from the latter to the former
hash table.
The testcase in the PR (apparently too fragile, it only reproduces on 14
branch with various flags on a single arch, not on trunk) ICEs, because
we have a preserved VALUE (QImode with (const_int 0) as one of the locs).
In a different BB SImode r2 is looked up, a non-preserved VALUE is created
for it, and the code added in r13-2916 attempts to also look up SUBREGs of
that in narrower modes, among them QImode, so it adds to that SImode r2
non-preserved VALUE a new loc of (subreg:QI (value:SI) 0). That SImode
value is considered useless, so remove_useless_values discards it, but
nothing discarded it from the preserved VALUE's loc_list, so when looking
something up in the hash table we ICE trying to dereference the CSELIB_VAL
of the discarded VALUE.
I think we need to discard useless locs even from the preserved VALUEs.
That IMHO shouldn't create any further useless VALUEs, the preserved
VALUEs are never useless, so we don't need to iterate with it and can do
it just once, but IMHO it needs to be done before discard_useless_values
actually discards them.
The following patch does that.
2024-09-29 Jakub Jelinek <jakub@redhat.com>
PR target/116627
* cselib.cc (remove_useless_values): Discard useless locs
even from preserved cselib_vals in cselib_preserved_hash_table
hash table.
Jakub Jelinek [Fri, 20 Sep 2024 07:14:29 +0000 (09:14 +0200)]
i386: Fix up _mm_min_ss etc. handling of zeros and NaNs [PR116738]
The min/max patterns for intrinsics which on x86 return the second
input operand if the two operands are both zeros or if one or both of
them are a NaN shouldn't use SMIN/SMAX RTL, because, like for
MIN_EXPR/MAX_EXPR, it is undefined what the result will be in those cases.
The following patch adds an expander which uses either a new pattern with
UNSPEC_IEEE_M{AX,IN} or the S{MIN,MAX} representation of the same.
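For illustration, the IEEE semantics the intrinsic guarantees and that
SMIN/SMAX RTL does not (x86 only):
  #include <immintrin.h>
  #include <cstdio>

  int
  main ()
  {
    // minss returns the second source operand when both inputs are zeros
    // (of either sign) or when an operand is a NaN.
    __m128 a = _mm_set_ss (0.0f);
    __m128 b = _mm_set_ss (-0.0f);
    float r = _mm_cvtss_f32 (_mm_min_ss (a, b));
    std::printf ("%g\n", r);   // must print -0, the second operand
  }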
2024-09-20 Uros Bizjak <ubizjak@gmail.com>
Jakub Jelinek <jakub@redhat.com>
PR target/116738
* config/i386/subst.md (mask_scalar_operand_arg34,
mask_scalar_expand_op3, round_saeonly_scalar_mask_arg3): New
subst attributes.
* config/i386/sse.md
(<sse>_vm<code><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
Change from define_insn to define_expand, rename the old define_insn
to ...
(*<sse>_vm<code><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
... this.
(<sse>_ieee_vm<ieee_maxmin><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
New define_insn.
Another spot where we mark_used a function (in this case ctor or dtor)
even when it is just artificially used inside of thunks (emitted on mingw
with -Os for the testcase).
2024-09-13 Jakub Jelinek <jakub@redhat.com>
PR c++/116678
* optimize.cc: Include decl.h.
(maybe_thunk_body): Temporarily change deprecated_state to
UNAVAILABLE_DEPRECATED_SUPPRESS.
Richard Ball [Thu, 10 Oct 2024 18:16:39 +0000 (19:16 +0100)]
aarch64: Alter pr116258.c test to correct for big endian.
The test at pr116258.c fails on big endian targets;
this is because the test checks that the index of a floating
point multiply is 0, which is correct only for little endian.
gcc/testsuite/ChangeLog:
PR tree-optimization/116258
* gcc.target/aarch64/pr116258.c:
Alter test to add big-endian support.
Martin Uecker [Tue, 17 Sep 2024 09:37:29 +0000 (11:37 +0200)]
c: fix crash when checking for compatibility of structures [PR116726]
When checking for compatibility of structure or union types in
tagged_types_tu_compatible_p, restore the old value of the pointer to
the top of the temporary cache after recursively calling comptypes_internal
when looping over the members of a structure or union. While the next
iteration of the loop overwrites the pointer, I missed the fact that it can
be accessed again when types of function arguments are compared as part
of recursive type checking and the function is entered again.
PR c/116726
gcc/c/ChangeLog:
* c-typeck.cc (tagged_types_tu_compatible_p): Restore value
of the cache after recursing into comptypes_internal.
Eric Botcazou [Wed, 4 Sep 2024 22:19:25 +0000 (00:19 +0200)]
ada: Fix wrong finalization of anonymous array aggregate
The issue arises when the aggregate consists only of iterated associations
because, in this case, its expansion uses a 2-pass mechanism which creates
a temporary that needs a fully-fledged initialization, thus running afoul
of the optimization that avoids building the initialization procedure in
the anonymous array case.
gcc/ada/ChangeLog:
* exp_aggr.ads (Is_Two_Pass_Aggregate): New function declaration.
* exp_aggr.adb (Is_Two_Pass_Aggregate): New function body.
(Expand_Array_Aggregate): Call Is_Two_Pass_Aggregate to detect the
aggregates that need the 2-pass expansion.
* exp_ch3.adb (Expand_Freeze_Array_Type): In the anonymous array
case, build the initialization procedure if the initial value in
the object declaration is a 2-pass aggregate.
Eric Botcazou [Wed, 11 Sep 2024 17:37:08 +0000 (19:37 +0200)]
ada: Fix negative value returned by 'Image for array with nonnegative component
The problem is that Exp_Put_Image.Build_Elementary_Put_Image_Call uses the
signedness of the base type but the size of the first subtype, hence the
discrepancy between them.
gcc/ada/ChangeLog:
PR ada/115535
* exp_put_image.adb (Build_Elementary_Put_Image_Call): Use the size
of the underlying type to find the support type.
Eric Botcazou [Wed, 11 Sep 2024 17:26:18 +0000 (19:26 +0200)]
ada: Fix internal error on elsif part of if-statement containing if-expression
The problem occurs when the compiler is trying to find a context to which
it can hoist finalization actions coming from the if-expression, because
Find_Hook_Context incorrectly returns the N_Elsif_Part node.
gcc/ada/ChangeLog:
PR ada/114640
* exp_util.adb (Find_Hook_Context): For a node present within a
conditional expression, do not return an N_Elsif_Part node.
Eric Botcazou [Wed, 11 Sep 2024 17:42:03 +0000 (19:42 +0200)]
ada: Fix bogus error in instantiation with formal package
The compiler reports that an actual does not match the formal when there
is a defaulted formal discrete type because Check_Formal_Package_Instance
fails to skip the implicit base type generated by the compiler.
gcc/ada/ChangeLog:
PR ada/114636
* sem_ch12.adb (Check_Formal_Package_Instance): For a defaulted
formal discrete type, skip the generated implicit base type.
Jan Beulich [Tue, 8 Oct 2024 14:05:33 +0000 (16:05 +0200)]
x86/{,V}AES: adjust when to force EVEX encoding
Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
said "..., but we need to emit {evex} prefix in the assembly if AES ISA
is not enabled". Yet it did so only for the TARGET_AES insns. Going from
the alternative chosen in the TARGET_VAES insns isn't quite right: If
AES is (also) enabled, EVEX encoding would needlessly be forced.
AVR: target/116953 - ICE due to operands clobber in avr_out_sbxx_branch.
PR target/116953
gcc/
* config/avr/avr.cc (avr_out_sbxx_branch): Work on a copy of
the operands rather than on operands itself, which is just
recog_data.operand and may be clobbered by jump_over_one_insn_p.
gcc/testsuite/
* gcc.target/avr/torture/pr116953.c: New test.
Eric Botcazou [Fri, 4 Oct 2024 09:27:33 +0000 (11:27 +0200)]
Fix crash with subunit of local package
This is a regression present on the 14 branch only: the expander gets
confused when trying to insert the finalizer of a procedure that contains
a package as a subunit. The offending code no longer exists on the
mainline so this adds the minimal fix to address the issue.
gcc/ada
PR ada/116430
* exp_ch7.adb (Build_Finalizer.Create_Finalizer): For the insertion
point of the finalizer, deal with package bodies that are subunits.
Jonathan Wakely [Wed, 21 Aug 2024 11:29:32 +0000 (12:29 +0100)]
libstdc++: Make debug sequence members mutable [PR116369]
We need to be able to attach debug mode iterators to const containers,
so the safe iterator constructor uses const_cast to get a modifiable
pointer to the container. If the container was defined as const, that
const_cast to access its members results in undefined behaviour. PR
116369 shows a case where it results in a segfault because the container
is in a rodata section (which shouldn't have happened, but the undefined
behaviour in the library still exists in any case).
This makes the _M_iterators and _M_const_iterators data members mutable,
so that it's safe to modify them even if the declared type of the
container is a const type.
Ideally we would not need the const_cast at all. Instead, the _M_attach
member (and everything it calls) should be const-qualified. That would
work fine now, because the members that it ends up modifying are
mutable. Making that change would require a number of new exports from
the shared library, and would require retaining the old non-const member
functions (maybe as symbol aliases) for backwards compatibility. That
might be worth changing at some point, but isn't done here.
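For illustration, a hedged sketch (only meaningful when compiled with
-D_GLIBCXX_DEBUG) of attaching a debug-mode iterator to a container whose
declared type is const:
  #include <vector>

  const std::vector<int> v{1, 2, 3};   // declared type of the container is const

  int
  main ()
  {
    // In debug mode, begin() attaches a safe iterator to v; that bookkeeping
    // now only touches the mutable _M_const_iterators member.
    return *v.begin () == 1 ? 0 : 1;
  }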
Jonathan Wakely [Wed, 28 Aug 2024 11:38:18 +0000 (12:38 +0100)]
libstdc++: Fix autoconf check for O_NONBLOCK in <fcntl.h>
I misused the AC_CHECK_DECL macro, assuming that it behaved like
AC_CHECK_DECLS and always defined a HAVE_xxx macro if the decl was
found. Instead, the [action-if-found] shell commands are needed to
define HAVE_O_NONBLOCK explicitly.
Jonathan Wakely [Tue, 11 Jun 2024 15:45:43 +0000 (16:45 +0100)]
libstdc++: Fix std::codecvt<wchar_t, char, mbstate_t> for empty dest [PR37475]
For the GNU locale model, codecvt::do_out and codecvt::do_in incorrectly
return 'ok' when the destination range is empty. That happens because
detecting incomplete output is done in the loop body, and the loop is
never even entered if to == to_end.
By restructuring the loop condition so that we check the output range
separately, we can ensure that for a non-empty source range, we always
enter the loop at least once, and detect if the destination range is too
small.
The loops also seem easier to reason about if we return immediately on
any error, instead of checking the result twice on every iteration. We
can use an RAII type to restore the locale before returning, which also
simplifies all the other member functions.
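For illustration, a hedged sketch of the behaviour being fixed: converting
a non-empty source into an empty destination range should yield partial,
not ok:
  #include <cwchar>
  #include <locale>

  int
  main ()
  {
    using CVT = std::codecvt<wchar_t, char, std::mbstate_t>;
    const CVT &cvt = std::use_facet<CVT> (std::locale::classic ());

    const wchar_t src[] = L"x";
    const wchar_t *from_next;
    char buf[1];
    char *to_next;
    std::mbstate_t st{};
    // Empty destination range [buf, buf) but a non-empty source.
    auto res = cvt.out (st, src, src + 1, from_next, buf, buf, to_next);
    return res == CVT::partial ? 0 : 1;
  }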
libstdc++-v3/ChangeLog:
PR libstdc++/37475
* config/locale/gnu/codecvt_members.cc (Guard): New RAII type.
(do_out, do_in): Return partial if the destination is empty but
the source is not. Use Guard to restore locale on scope exit.
Return immediately on any conversion error.
(do_encoding, do_max_length, do_length): Use Guard.
* testsuite/22_locale/codecvt/in/char/37475.cc: New test.
* testsuite/22_locale/codecvt/in/wchar_t/37475.cc: New test.
* testsuite/22_locale/codecvt/out/char/37475.cc: New test.
* testsuite/22_locale/codecvt/out/wchar_t/37475.cc: New test.
Jonathan Wakely [Tue, 24 Sep 2024 22:20:56 +0000 (23:20 +0100)]
libstdc++: Populate std::time_get::get's %c format for C locale
We were using the empty string "" for D_T_FMT and ERA_D_T_FMT in the C
locale, instead of "%a %b %e %T %Y" as the C standard requires. Set it
correctly for each locale implementation that defines time_members.cc.
We can also explicitly set the _M_era_xxx pointers to the same values as
the corresponding _M_xxx ones, rather than setting them to point to
identical string literals. This doesn't rely on the compiler merging
string literals, and makes it more explicit that they're the same in the
C locale.
libstdc++-v3/ChangeLog:
* config/locale/dragonfly/time_members.cc
(__timepunct<char>::_M_initialize_timepunc)
(__timepunct<wchar_t>::_M_initialize_timepunc): Set
_M_date_time_format for C locale. Set %Ex formats to the same
values as the %x formats.
* config/locale/generic/time_members.cc: Likewise.
* config/locale/gnu/time_members.cc: Likewise.
* testsuite/22_locale/time_get/get/char/5.cc: New test.
* testsuite/22_locale/time_get/get/wchar_t/5.cc: New test.
Jonathan Wakely [Thu, 6 Jun 2024 10:50:06 +0000 (11:50 +0100)]
libstdc++: Work around some PSTL test failures for debug mode [PR90276]
This addresses one known failure due to a bug in the upstream tests, and
a number of timeouts due to the algorithms running much more slowly with
debug mode checks enabled.
libstdc++-v3/ChangeLog:
PR libstdc++/90276
* testsuite/25_algorithms/pstl/alg_sorting/partial_sort.cc
[_GLIBCXX_DEBUG]: Add xfail-run-if for debug mode.
* testsuite/25_algorithms/pstl/alg_nonmodifying/nth_element.cc
[_GLIBCXX_DEBUG]: Reduce size of test data.
* testsuite/25_algorithms/pstl/alg_sorting/includes.cc:
Likewise.
* testsuite/25_algorithms/pstl/alg_sorting/set_util.h:
Likewise.
I noticed that char8_t was missing from the list of types that were
prevented from using the std::formatter partial specialization for
integer types. That partial specialization was also matching
cv-qualified integer types, because std::integral<const int> is true.
This change simplifies the constraints by introducing a new variable
template which is only true for cv-unqualified integer types, with
explicit specializations to exclude the character types. This should be
slightly more efficient than the previous constraints that checked
std::integral<T> and (!__is_one_of<T, char, wchar_t, ...>). It also
avoids the need for a separate std::formatter specialization for 128-bit
integers, as they can be handled by the new variable template too.
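A hedged sketch of the property the new variable template expresses
(illustrative names, not the actual libstdc++ identifiers, which also
handle 128-bit integers):
  #include <concepts>
  #include <type_traits>

  // Only cv-unqualified integer types qualify; character types are excluded
  // by explicit specializations.
  template<typename T>
  constexpr bool formattable_integer
    = std::integral<T> && std::same_as<T, std::remove_cv_t<T>>;

  template<> constexpr bool formattable_integer<char>     = false;
  template<> constexpr bool formattable_integer<wchar_t>  = false;
  template<> constexpr bool formattable_integer<char8_t>  = false;
  template<> constexpr bool formattable_integer<char16_t> = false;
  template<> constexpr bool formattable_integer<char32_t> = false;

  static_assert ( formattable_integer<int>);
  static_assert (!formattable_integer<char8_t>);
  static_assert (!formattable_integer<const int>);

  int main () { }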
libstdc++-v3/ChangeLog:
* include/std/format (__format::__is_formattable_integer): New
variable template and specializations.
(template<integral, __char> struct formatter): Replace
constraints on first arg with __is_formattable_integer.
* testsuite/std/format/formatter/requirements.cc: Check that
std::formatter specializations for char8_t and const int are
disabled.
Jonathan Wakely [Wed, 18 Sep 2024 16:20:29 +0000 (17:20 +0100)]
libstdc++: Fix formatting of most negative chrono::duration [PR116755]
When formatting chrono::duration<signed-integer-type, P>::min() we were
causing undefined behaviour by trying to form the negative of the most
negative value. If we convert negative durations with integer rep to the
corresponding unsigned integer rep then we can safely represent all
values.
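A minimal illustration (hedged; the exact output text is an assumption) of
the boundary value involved; before the fix, formatting it negated the most
negative count and overflowed:
  #include <chrono>
  #include <format>
  #include <iostream>

  int
  main ()
  {
    auto d = std::chrono::duration<int>::min ();   // count is INT_MIN
    // Safe with the fix: the count is converted to the unsigned rep before
    // the '-' sign is emitted.
    std::cout << std::format ("{}", d) << '\n';    // prints -2147483648s
  }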
libstdc++-v3/ChangeLog:
PR libstdc++/116755
* include/bits/chrono_io.h (formatter<duration<R,P>>::format):
Cast negative integral durations to unsigned rep.
* testsuite/20_util/duration/io.cc: Test the most negative
integer durations.
Richard Biener [Wed, 18 Sep 2024 07:52:55 +0000 (09:52 +0200)]
tree-optimization/116585 - SSA corruption with split_constant_offset
split_constant_offset when looking through SSA defs can end up
picking SSA leafs that are subject to abnormal coalescing. This
can lead downstream consumers to insert code based on the
result (like from dataref analysis) in places that violate constraints
for abnormal coalescing. It's best to not expand defs whose operands
are subject to abnormal coalescing, and to not do anything either when
a subexpression already has operands like that.
PR tree-optimization/116585
* tree-data-ref.cc (split_constant_offset_1): When either
operand is subject to abnormal coalescing do no further
processing.
Jason Merrill [Sun, 15 Sep 2024 11:50:04 +0000 (13:50 +0200)]
c++: -Wdangling-reference and empty class [PR115361]
We can't have a dangling reference to an empty class unless it's
specifically to that class or one of its bases. This was giving a
false positive on the _ExtractKey pattern in libstdc++ hashtable.h.
This also adjusts the order of arguments to reference_related_p, which
is relevant for empty classes (unlike scalars).
Several of the classes in the testsuite needed to gain data members to
continue to warn.
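A hedged sketch (illustrative, not the hashtable.h code itself) of the
_ExtractKey-like pattern: a temporary of an empty class returning a
reference to its argument, which must not be flagged as the source of a
dangling reference:
  struct extract_key
  {
    template<typename T>
    const T &operator() (const T &t) const { return t; }
  };

  int
  main ()
  {
    int i = 42;
    // The temporary extract_key{} is an empty class, so r cannot dangle
    // into it and -Wdangling-reference should not warn here.
    const int &r = extract_key{} (i);
    return r == 42 ? 0 : 1;
  }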
This fixes another false positive. When a function is taking a
temporary of scalar type that couldn't be bound to the return type
of the function, don't warn; such a program would be ill-formed.
Thanks to Jonathan for reporting the problem.
PR c++/115987
gcc/cp/ChangeLog:
* call.cc (do_warn_dangling_reference): Don't consider a
temporary with a scalar type that cannot bind to the return type.
gcc/testsuite/ChangeLog:
* g++.dg/ext/attr-no-dangling6.C: Adjust.
* g++.dg/ext/attr-no-dangling7.C: Likewise.
* g++.dg/warn/Wdangling-reference22.C: New test.
Use iterative PTA definitions for members of the same AMD processor family.
Also, fix a couple of related M_CPU_TYPE/M_CPU_SUBTYPE inconsistencies.
No functional changes intended.
gcc/ChangeLog:
* config/i386/i386.h: Add PTA_BDVER1, PTA_BDVER2, PTA_BDVER3,
PTA_BDVER4, PTA_BTVER1 and PTA_BTVER2.
* common/config/i386/i386-common.cc (processor_alias_table)
<"bdver1">: Use PTA_BDVER1.
<"bdver2">: Use PTA_BDVER2.
<"bdver3">: Use PTA_BDVER3.
<"bdver4">: Use PTA_BDVER4.
<"btver1">: Use PTA_BTVER1. Use M_CPU_TYPE (AMD_BTVER1).
<"btver2">: Use PTA_BTVER2.
<"shanghai>: Use M_CPU_SUBTYPE (AMDFAM10H_SHANGHAI).
<"istanbul>: Use M_CPU_SUBTYPE (AMDFAM10H_ISTANBUL).
Jan Hubicka [Tue, 3 Sep 2024 16:20:34 +0000 (18:20 +0200)]
Zen5 tuning part 4: update reassociation width
Zen5 has 6 instead of 4 ALUs and the integer multiplication can now execute in
3 of them. FP units can do 2 additions and 2 multiplications with latency 2
and 3. This patch updates reassociation width accordingly. This has the
potential of increasing register pressure, but unlike when benchmarking
znver1 tuning I did not notice this actually causing problems on SPEC,
so this patch bumps up the reassociation width to 6 for everything
except integer vectors, where there are 4 units with a typical latency of 1.
Jan Hubicka [Tue, 3 Sep 2024 14:26:16 +0000 (16:26 +0200)]
Zen5 tuning part 3: scheduler tweaks
This patch adds support for a new fusion in znver5 documented in the
optimization manual:
The Zen5 microarchitecture adds support to fuse reg-reg MOV Instructions
with certain ALU instructions. The following conditions need to be met for
fusion to happen:
- The MOV should be reg-reg mov with Opcode 0x89 or 0x8B
- The MOV is followed by an ALU instruction where the MOV and ALU destination register match.
- The ALU instruction may source only registers or immediate data. There cannot be any memory source.
- The ALU instruction sources either the source or dest of MOV instruction.
- If ALU instruction has 2 reg sources, they should be different.
- The following ALU instructions can fuse with an older qualified MOV instruction:
ADD ADC AND XOR OP SUB SBB INC DEC NOT SAL / SHL SHR SAR
(I assume OP is OR)
I also increased issue rate from 4 to 6. Theoretically znver5 can do more, but
with our model we can't really use it.
Increasing the issue rate to 8 leads to an infinite loop in the scheduler.
Finally, I also enabled fuse_alu_and_branch since it is supported by
znver5 (I think by earlier zens too).
The new fusion pattern moves quite a few instructions around in common code:
@@ -2210,13 +2210,13 @@
.cfi_offset 3, -32
leaq 63(%rsi), %rbx
movq %rbx, %rbp
+ shrq $6, %rbp
+ salq $3, %rbp
subq $16, %rsp
.cfi_def_cfa_offset 48
movq %rdi, %r12
- shrq $6, %rbp
- movq %rsi, 8(%rsp)
- salq $3, %rbp
movq %rbp, %rdi
+ movq %rsi, 8(%rsp)
call _Znwm
movq 8(%rsp), %rsi
movl $0, 8(%r12)
@@ -2224,8 +2224,8 @@
movq %rax, (%r12)
movq %rbp, 32(%r12)
testq %rsi, %rsi
- movq %rsi, %rdx
cmovns %rsi, %rbx
+ movq %rsi, %rdx
sarq $63, %rdx
shrq $58, %rdx
sarq $6, %rbx
which should help decoder bandwidth and perhaps also the cache, though I was
not able to measure an off-noise effect on SPEC.
gcc/ChangeLog:
* config/i386/i386.h (TARGET_FUSE_MOV_AND_ALU): New tune.
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Updat for znver5.
(ix86_adjust_cost): Add TODO about znver5 memory latency.
(ix86_fuse_mov_alu_p): New.
(ix86_macro_fusion_pair_p): Use it.
* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): Add ZNVER5.
(X86_TUNE_FUSE_MOV_AND_ALU): New tune;
Jan Hubicka [Tue, 3 Sep 2024 13:07:41 +0000 (15:07 +0200)]
Zen5 tuning part 2: disable gather and scatter
We disable gathers for zen4. It seems that gather has improved a bit compared
to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
the indices are known ahead of time. Vector loads followed by shuffles result
in a higher load bandwidth." however the situation seems to be more
complicated.
Gather is a 5-10% loss on the parest benchmark as well as a 30% loss on sparse
dot products in TSVC. Curiously enough, breaking these out into a
microbenchmark reversed the situation, and it turns out that the performance
depends on how the indices are distributed: gather is a loss if the indices
are sequential, neutral if they are random, and a win for some strides (4, 8).
This seems to be similar to earlier zens, so I think (especially for
backporting znver5 support) that it makes sense to be consistent and disable
gather unless we work out a good heuristic on when to use it. Since we
typically do not know the indices in advance, I don't see how that can be done.
I opened PR116582 with some examples of wins and losses.
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
ZNVER5.
(X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
Jan Hubicka [Tue, 3 Sep 2024 11:38:33 +0000 (13:38 +0200)]
Zen5 tuning part 1: avoid FMA chains
Testing matrix multiplication benchmarks shows that FMA on a critical chain
is a performance loss over separate multiply and add. While the latency of 4
is lower than multiply + add (3+2), the problem is that all values need to
be ready before the computation starts.
While on znver4 AVX512 code fared well with FMA, it was because of the split
registers. Znver5 benefits from avoiding FMA on all widths. This may be
different with the mobile version though.
On a naive matrix multiplication benchmark the difference is 8% with -O3
only, since with -Ofast loop interchange solves the problem differently.
It is a 30% win, for example, on S323 from TSVC:
if address override is used to avoid the invalid memory operand like
cmpl %fs:previous_emax@dtpoff(%eax), %r12d
gcc/
PR target/116839
* config/i386/i386.cc (ix86_rewrite_tls_address_1): Make it
static. Return if TLS address is thread register plus an integer
register.
gcc/testsuite/
PR target/116839
* gcc.target/i386/pr116839.c: New file.
Currently subregs originating from *tf_to_fprx2_0 and *tf_to_fprx2_1
survive register allocation. This in turn leads to wrong register
renaming. Keeping the current approach would mean we need two insns for
*tf_to_fprx2_0 and *tf_to_fprx2_1, respectively. Something along the
lines
and similar for *tf_to_fprx2_1. Note, pre register allocation operand 0
has mode FPRX2 and afterwards DF once subregs have been eliminated.
Since we always copy a whole vector register into a floating-point
register pair, another way to fix this is to merge *tf_to_fprx2_0 and
*tf_to_fprx2_1 into a single insn which means we don't have to use
subregs at all. The downside of this is that the assembler template
contains two instructions, now. The upside is that we don't have to
come up with some artificial insn before RA which might be more
readable/maintainable. That is implemented by this patch.
In commit r11-4872-ge627cda5686592, the output operand specifier %V was
introduced, which is now used only in tf_to_fprx2. Instead of coming up
with its counterpart %F for floating-point registers, which would also
only be used in tf_to_fprx2, I print the operands directly. This
renders %V unused which is why it is removed by this patch.
Ensure for AQ and AR constraints that the resulting displacement after
adding any positive offset less than the size of the object being
referenced is still valid.
gcc/ChangeLog:
* config/s390/s390.cc (s390_mem_constraint): Check displacement
for AQ and AR constraints.