Iain Buclaw [Tue, 11 Mar 2025 16:56:18 +0000 (17:56 +0100)]
d: Fix regression returning from function with invariants [PR119139]
An optimization was added in GDC-12 which sets the TREE_READONLY flag on
all local variables with the storage class `const' assigned. For some
reason, const is also being added by the front-end to `__result'
variables in non-virtual functions, which ends up getting wrong code by
the gimplify pass promoting the local to static storage.
A bug has been raised upstream, as this looks like an error in the AST.
For now, turn off setting TREE_READONLY on all result variables.
PR d/119139
gcc/d/ChangeLog:
* decl.cc (get_symbol_decl): Don't set TREE_READONLY for __result
declarations.
Fix folding of BIT_NOT_EXPR for POLY_INT_CST [PR118976]
There was an embarrassing typo in the folding of BIT_NOT_EXPR for
POLY_INT_CSTs: it used - rather than ~ on the poly_int. Not sure
how that happened, but it might have been due to the way that
~x is implemented as -1 - x internally.
gcc/
PR tree-optimization/118976
* fold-const.cc (const_unop): Use ~ rather than - for BIT_NOT_EXPR.
* config/aarch64/aarch64.cc (aarch64_test_sve_folding): New function.
(aarch64_run_selftests): Run it.
aarch64: Fix folding of degenerate svwhilele case [PR117045]
The svwhilele folder mishandled the degenerate case in which
the second argument is the maximum integer. In that case,
the result is all-true regardless of the first parameter:
If the second scalar operand is equal to the maximum signed integer
value then a condition which includes an equality test can never fail
and the result will be an all-true predicate.
This is because the conceptual "increment the first operand
by 1 after each element" is done modulo the range of the operand.
The GCC code was instead treating it as infinite precision.
whilele_5.c even had a test for the incorrect behaviour.
The easiest fix seemed to be to handle that case specially before
doing constant folding. This also copes with variable first operands.
gcc/
PR target/116999
PR target/117045
* config/aarch64/aarch64-sve-builtins-base.cc
(svwhilelx_impl::fold): Check for WHILELTs of the minimum value
and WHILELEs of the maximum value. Fold them to all-false and
all-true respectively.
The testcase contains a VNx2QImode pseudo that is live across a call
and that cannot be allocated a call-preserved register. LRA quite
reasonably tried to save it before the call and restore it afterwards.
Unfortunately, the target told it to do that in SImode, even though
punning between SImode and VNx2QImode is disallowed by both
TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.
The natural class to use for SImode is GENERAL_REGS, so this led
to an unsalvageable situation in which we had:
(set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))
where A needed GENERAL_REGS and B needed FP_REGS. We therefore ended
up in a reload loop.
The hooks above should ensure that this situation can never occur
for incoming subregs. It only happened here because the target
explicitly forced it.
The decision to use SImode for modes smaller than 4 bytes dates
back to the beginning of the port, before 16-bit floating-point
modes existed. I'm not sure whether promoting to SImode really
makes sense for any FPR, but that's a separate performance/QoI
discussion. For now, this patch just disallows using SImode
when it is wrong for correctness reasons, since that should be
safer to backport.
gcc/
PR testsuite/116238
* config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
Only return SImode if we can convert to and from it.
gcc/testsuite/
PR testsuite/116238
* gcc.target/aarch64/sve/pr116238.c: New test.
Christophe Lyon [Mon, 3 Mar 2025 11:12:18 +0000 (11:12 +0000)]
arm: Handle fixed PIC register in require_pic_register (PR target/115485)
Commit r9-4307-g89d7557202d25a forgot to accept a fixed PIC register
when extending the assert in require_pic_register.
arm_pic_register can be set explicitly by the user
(e.g. -mpic-register=r9) or implicitly as the default value with
-fpic/-fPIC/-fPIE and -mno-pic-data-is-text-relative -mlong-calls, and
we want to use/accept it when recording cfun->machine->pic_reg as used
to be the case.
--cut here--
/* Otherwise, if this register is used by I3, then this register
now dies here, so we must put a REG_DEAD note here unless there
is one already. */
else if (reg_referenced_p (XEXP (note, 0), PATTERN (i3))
&& ! (REG_P (XEXP (note, 0))
? find_regno_note (i3, REG_DEAD,
REGNO (XEXP (note, 0)))
: find_reg_note (i3, REG_DEAD, XEXP (note, 0))))
{
PUT_REG_NOTE_KIND (note, REG_DEAD);
place = i3;
}
--cut here--
Flags register is used in I3, but there already is a REG_DEAD note in I3.
The above condition doesn't trigger and continues in the "else" part where
REG_DEAD note is put to I2. The proposed solution corrects the above
logic to trigger every time the register is referenced in I3, avoiding the
"else" part.
PR rtl-optimization/118739
gcc/ChangeLog:
* combine.cc (distribute_notes) <case REG_UNUSED>: Correct the
logic when the register is used by I3.
Iain Buclaw [Fri, 28 Feb 2025 18:22:36 +0000 (19:22 +0100)]
d: Fix comparing uninitialized memory in dstruct.d [PR116961]
Floating-point emulation in the D front-end is done via a type named
`struct longdouble`, which in GDC is a small interface around the
real_value type. Because the D code cannot include gcc/real.h directly,
a big enough buffer is used for the data instead.
On x86_64, this buffer is actually bigger than real_value itself, so
when a new longdouble object is created with
there is uninitialized padding at the end of `r`. This was never a
problem when D was implemented in C++ (until GCC 12) as comparing two
longdouble objects with `==' would be forwarded to the relevant
operator== overload that extracted the underlying real_value.
However when the front-end was translated to D, such conditions were
instead rewritten into identity comparisons
return exp.toReal() is CTFloat.zero
The `is` operator gets lowered as a call to `memcmp() == 0', which is
where the read of uninitialized memory occurs, as seen by valgrind.
==26778== Conditional jump or move depends on uninitialised value(s)
==26778== at 0x911F41: dmd.dstruct._isZeroInit(dmd.expression.Expression) (dstruct.d:635)
==26778== by 0x9123BE: StructDeclaration::finalizeSize() (dstruct.d:373)
==26778== by 0x86747C: dmd.aggregate.AggregateDeclaration.determineSize(ref const(dmd.location.Loc)) (aggregate.d:226)
[...]
To avoid accidentally reading uninitialized data, explicitly initialize
all `longdouble` variables with an empty constructor on C++ side of the
implementation before initializing underlying real_value type it holds.
where the shift count operand does not trivially fit the scheme of
address operands. Reject those operands, especially since
strip_address_mutations() expects expressions of the form
(and ... (const_int ...)) and fails for (and ... (const_wide_int ...)).
Thus, be more strict here and accept only CONST_INT operands. Done by
replacing immediate_operand() with const_int_operand() which is enough
since the former only additionally checks for LEGITIMATE_PIC_OPERAND_P
and targetm.legitimate_constant_p which are always true for CONST_INT
operands.
While on it, fix indentation of the if block.
gcc/ChangeLog:
PR target/118835
* config/s390/s390.cc (s390_valid_shift_count): Reject shift
count operands which do not trivially fit the scheme of
address operands.
Dimitry Andric [Tue, 28 Jan 2025 17:36:16 +0000 (18:36 +0100)]
libgcc: On FreeBSD use GCC's crt objects for static linking
Add crtbeginT.o to extra_parts on FreeBSD. This ensures we use GCC's
crt objects for static linking. Otherwise it could mix crtbeginT.o
from the base system with libgcc's crtend.o, possibly leading to
segfaults.
libgcc:
PR target/118685
* config.host (*-*-freebsd*): Add crtbeginT.o to extra_parts.
Lewis Hyatt [Sun, 26 Jan 2025 23:57:00 +0000 (18:57 -0500)]
options: Adjust cl_optimization_compare to avoid checking ICE [PR115913]
At the end of a sequence like:
#pragma GCC push_options
...
#pragma GCC pop_options
the handler for pop_options calls cl_optimization_compare() (as generated by
optc-save-gen.awk) to make sure that all global state has been restored to
the value it had prior to the push_options call. The verification is
performed for almost all entries in the global_options struct. This leads to
unexpected checking asserts, as discussed in the PR, in case the state of
warnings-related options has been intentionally modified in between
push_options and pop_options via a call to #pragma GCC diagnostic. Address
that by skipping the verification for CL_WARNING-flagged options.
gcc/ChangeLog:
PR middle-end/115913
* optc-save-gen.awk (cl_optimization_compare): Skip options with
CL_WARNING flag.
gcc/testsuite/ChangeLog:
PR middle-end/115913
* c-c++-common/cpp/pr115913.c: New test.
Peter Bergner [Thu, 16 Jan 2025 16:49:45 +0000 (10:49 -0600)]
rs6000: Fix loop limit for built-in constant checking
The loop checking for built-in constant operand restrictions was missing
some operands due to the loop limit being too small. Fixing that exposed
a testsuite failure which is caused by a typo in the pmxvi4ger8pp definition
where we had made the PMASK field too small.
2025-01-16 Peter Bergner <bergner@linux.ibm.com>
gcc/
* config/rs6000/rs6000-builtin.cc (rs6000_expand_builtin): Use correct
array size for the loop limit.
* config/rs6000/rs6000-builtins.def: Fix field size for PMASK operand.
d: Fix ICE in build_deref, at d/d-codegen.cc:1650 [PR111650]
PR d/111650
gcc/d/ChangeLog:
* decl.cc (get_fndecl_arguments): Move generation of frame type to ...
(DeclVisitor::visit (FuncDeclaration *)): ... here, after the call to
build_closure.
Jan Hubicka [Tue, 3 Sep 2024 13:07:41 +0000 (15:07 +0200)]
Zen5 tuning part 2: disable gather and scatter
We disable gathers for zen4. It seems that gather has improved a bit compared
to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
the indices are known ahead of time. Vector loads followed by shuffles result
in a higher load bandwidth." however the situation seems to be more
complicated.
gather is 5-10% loss on parest benchmark as well as 30% loss on sparse dot
products in TSVC. Curiously enough breaking these out into microbenchmark
reversed the situation and it turns out that the performance depends on
how indices are distributed. gather is loss if indices are sequential,
neutral if they are random and win for some strides (4, 8).
This seems to be similar to earlier zens, so I think (especially for
backporting znver5 support) that it makes sense to be conistent and disable
gather unless we work out a good heuristics on when to use it. Since we
typically do not know the indices in advance, I don't see how that can be done.
I opened PR116582 with some examples of wins and loses
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
ZNVER5.
(X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
Iain Buclaw [Mon, 20 Jan 2025 19:01:03 +0000 (20:01 +0100)]
d: Fix failing test with 32-bit compiler [PR114434]
Since the introduction of gdc.test/runnable/test23514.d, it's exposed an
incorrect compilation when adding a 64-bit constant to a link-time
address. The current cast to size_t causes a loss of precision, which
can result in incorrect compilation.
PR d/114434
gcc/d/ChangeLog:
* expr.cc (ExprVisitor::visit (PtrExp *)): Get the offset as a
dinteger_t rather than a size_t.
(ExprVisitor::visit (SymOffExp *)): Likewise.
Uros Bizjak [Mon, 20 Jan 2025 15:19:43 +0000 (16:19 +0100)]
i386: Disable SImode/DImode moves from/to mask regs without avx512bw [PR118067]
SImode and DImode moves from/to mask registers are valid only with AVX512BW,
so mark relevant alternatives in *movsi_internal and *movdi_internal as such.
Simon Martin [Sun, 5 Jan 2025 09:36:47 +0000 (10:36 +0100)]
c++: Friend classes don't shadow enclosing template class paramater [PR118255]
We currently reject the following code
=== code here ===
template <int non_template> struct S { friend class non_template; };
class non_template {};
S<0> s;
=== code here ===
While EDG agrees with the current behaviour, clang and MSVC don't (see
https://godbolt.org/z/69TGaabhd), and I believe that this code is valid,
since the friend clause does not actually declare a type, so it cannot
shadow anything. The fact that we didn't error out if the non_template
class was declared before S backs this up as well.
This patch fixes this by skipping the call to check_template_shadow for
hidden bindings.
PR c++/118255
gcc/cp/ChangeLog:
* name-lookup.cc (pushdecl): Don't call check_template_shadow
for hidden bindings.
gcc/testsuite/ChangeLog:
* g++.dg/lookup/pr99116-1.C: Adjust test expectation.
* g++.dg/template/friend84.C: New test.
Eugene Rozenfeld [Sat, 11 Jan 2025 03:48:52 +0000 (19:48 -0800)]
Fix setting of call graph node AutoFDO count
We are initializing both the call graph node count and
the entry block count of the function with the head_count value
from the profile.
Count propagation algorithm may refine the entry block count
and we may end up with a case where the call graph node count
is set to zero but the entry block count is non-zero. That becomes
a problem because we have this code in execute_fixup_cfg:
profile_count num = node->count;
profile_count den = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
bool scale = num.initialized_p () && !(num == den);
Here if num is 0 but den is not 0, scale becomes true and we
lose the counts in
if (scale)
bb->count = bb->count.apply_scale (num, den);
This is what happened in the issue reported in PR116743
(a 10% regression in MySQL HAMMERDB tests). 3d9e6767939e9658260e2506e81ec32b37cba041 made an improvement in
AutoFDO count propagation, which caused a mismatch between
the call graph node count (zero) and the entry block count (non-zero)
and subsequent loss of counts as described above.
The fix is to update the call graph node count once we've done count propagation.
Tested on x86_64-pc-linux-gnu.
gcc/ChangeLog:
PR gcov-profile/116743
* auto-profile.cc (afdo_annotate_cfg): Fix mismatch between the call graph node count
and the entry block count.
Nathaniel Shead [Fri, 20 Dec 2024 11:09:39 +0000 (22:09 +1100)]
c++: Allow pragmas in NSDMIs [PR118147]
This patch removes the (unnecessary) CPP_PRAGMA_EOL case from
cp_parser_cache_defarg, which currently has the result that any pragmas
in the NSDMI cause an error.
PR c++/118147
gcc/cp/ChangeLog:
* parser.cc (cp_parser_cache_defarg): Don't error when
CPP_PRAGMA_EOL.
Richard Biener [Tue, 12 Nov 2024 10:15:15 +0000 (11:15 +0100)]
tree-optimization/117417 - ICE with complex load optimization
When we decompose a complex load only used as real and imaginary
parts we fail to honor IL constraints which are that a BIT_FIELD_REF
of register type should be outermost in a ref. The following
simply avoids the transform when the complex load has such a
BIT_FIELD_REF.
STMT_VINFO_SLP_VECT_ONLY isn't properly computed as union of all
group members and when the group is later split due to duplicates
not all sub-groups inherit the flag.
PR tree-optimization/117307
* tree-vect-data-refs.cc (vect_analyze_data_ref_accesses):
Properly compute STMT_VINFO_SLP_VECT_ONLY. Set it on all
parts of a split group.
Richard Biener [Sat, 12 Oct 2024 12:51:37 +0000 (14:51 +0200)]
tree-optimization/117104 - add missed guards to max(a,b) != a simplification
For vector types we have to make sure the comparison result is a vector
type and the resulting compare operation is supported. As the resulting
compare is never an equality compare I didn't bother to check for the
cbranch case.
Jakub Jelinek [Tue, 15 Oct 2024 17:38:46 +0000 (19:38 +0200)]
match.pd: Further fma negation fixes [PR116891]
On Mon, Oct 14, 2024 at 08:53:29AM +0200, Jakub Jelinek wrote:
> > PR middle-end/116891
> > * match.pd ((negate (IFN_FNMS@3 @0 @1 @2)) -> (IFN_FMA @0 @1 @2)):
> > Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING.
>
> Guess it would be nice to have a testcase which FAILs without the patch and
> PASSes with it, but it can be added later.
I've added such a testcase now, and additionally found the fix only fixed
one of the 4 problematic similar cases.
Here is a patch which fixes the others too and adds the testcases.
fma-pr116891.c FAILed without your patch, FAILs with your patch too (but
only due to the bar/baz/qux checks) and PASSes with the patch.
The following reverts a bogus fix done for PR101009 and instead makes
sure we get into the same_access_functions () case when computing
the distance vector for g[1] and g[1] where the constants ended up
having different types. The generic code doesn't seem to handle
loop invariant dependences. The special case gets us both
( 0 ) and ( 1 ) as distance vectors while formerly we got ( 1 ),
which the PR101009 fix changed to ( 0 ) with bad effects on other
cases as shown in this PR.
PR tree-optimization/116768
* tree-data-ref.cc (build_classic_dist_vector_1): Revert
PR101009 change.
* tree-chrec.cc (eq_evolutions_p): Make sure (sizetype)1
and (int)1 compare equal.
Richard Biener [Sun, 13 Oct 2024 13:12:44 +0000 (15:12 +0200)]
tree-optimization/116290 - fix compare-debug issue in ldist
Loop distribution does different analysis with -g0/-g due to counting
a debug stmt starting a BB against a limit which will everntually
lead to different IVOPTs choices. I've fixed a possible IVOPTs
issue on the way even though it doesn't make a difference here.
PR tree-optimization/116290
* tree-loop-distribution.cc (determine_reduction_stmt_1): PHIs
have no debug variants. Start with first non-debug real stmt.
* tree-ssa-loop-ivopts.cc (find_givs_in_bb): Do not analyze
debug stmts.
Richard Biener [Mon, 9 Jan 2023 11:46:28 +0000 (12:46 +0100)]
middle-end/69482 - not preserving volatile accesses
The following addresses a long standing issue with not preserving
accesses to non-volatile objects through volatile qualified
pointers in the case that object gets expanded to a register. The
fix is to treat accesses to an object with a volatile qualified
access as forcing that object to memory. This issue got more
exposed recently so it regressed more since GCC 11.
PR middle-end/69482
* cfgexpand.cc (discover_nonconstant_array_refs_r): Volatile
qualified accesses also force objects to memory.
* gcc.target/i386/pr69482-1.c: New testcase.
* gcc.target/i386/pr69482-2.c: Likewise.
Jan Hubicka [Wed, 4 Sep 2024 07:19:08 +0000 (09:19 +0200)]
Zen5 tuning part 5: update instruction latencies in x86-tune-costs
there is nothing exciting in this patch. I measured latencies and also compared
them with newly released optimization guide. There are no dramatic changes
compared to zen4. One interesting new bit is that addss is faster and can be
2 cycles when fed by another addss.
I also increased the large insn bound since decoders seems no longer require
instructions to be 8 bytes or less.