and the memory operand size is 1 byte. As a result, the remaining 511
bytes are ignored by GCC. Implement ldtilecfg and sttilecfg intrinsics
with a pointer to XImode to honor the 512-byte memory block.
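A minimal usage sketch (hypothetical: the buffer name and zeroed palette
are illustrative; assumes a toolchain with AMX support and -mamx-tile):

  #include <immintrin.h>

  // With the new builtins the whole configuration block is the memory
  // operand, so GCC can no longer discard stores into it as dead.
  void roundtrip_tilecfg (void)
  {
    alignas(64) unsigned char cfg[512] = { 0 };  // zeroed config: palette 0
    _tile_loadconfig (cfg);    // reads the full block
    _tile_storeconfig (cfg);   // writes the full block back
  }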
gcc/ChangeLog:
PR target/114098
* config/i386/amxtileintrin.h (_tile_loadconfig): Use
__builtin_ia32_ldtilecfg.
(_tile_storeconfig): Use __builtin_ia32_sttilecfg.
* config/i386/i386-builtin.def (BDESC): Add
__builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
* config/i386/i386-expand.c (ix86_expand_builtin): Handle
IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
* config/i386/i386.md (ldtilecfg): New pattern.
(sttilecfg): Likewise.
gcc/testsuite/ChangeLog:
PR target/114098
* gcc.target/i386/amxtile-4.c: New test.
There are no instructions that do traditional AltiVec addresses (i.e.
with the low four bits of the address masked off) for OOmode and XOmode
objects. The solution is to modify the constraints used in the movoo and
movxo patterns to disallow these types of addresses, which assists LRA in
resolving this issue. Furthermore, the mode size 16 check has been
removed from vsx_quad_dform_memory_operand to allow OOmode and XOmode,
since quad_address_p already handles sizes less than 16.
Eric Botcazou [Mon, 26 Feb 2024 12:13:34 +0000 (13:13 +0100)]
Finalization of object allocated by anonymous access designating local type
The finalization of objects dynamically allocated through an anonymous
access type is deferred to the enclosing library unit in the current
implementation, and a warning is given for each of them.
However, this cannot be done if the designated type is local, because this
would generate dangling references to the local finalization routine, so
the finalization needs to be dropped in this case and the warning adjusted.
gcc/ada/
PR ada/113893
* exp_ch7.adb (Build_Anonymous_Master): Do not build the master
for a local designated type.
* exp_util.adb (Build_Allocate_Deallocate_Proc): Force Needs_Fin
to false if no finalization master is attached to an access type
and assert that it is anonymous in this case.
* sem_res.adb (Resolve_Allocator): Mention that the object might
not be finalized at all in the warning given when the type is an
anonymous access-to-controlled type.
Richard Earnshaw [Thu, 22 Feb 2024 16:47:20 +0000 (16:47 +0000)]
arm: fix ICE with vectorized reciprocal division [PR108120]
The expand pattern for reciprocal division was enabled for all math
optimization modes, but the patterns it was generating were not
enabled unless -funsafe-math-optimizations was enabled; this led to
an ICE when the pattern we generated could not be recognized.
Fixed by only enabling vector division when doing unsafe math.
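A sketch of the kind of loop involved (my reduction, not the new test):
compiled for Neon at -O3, the division may only go through the
reciprocal-step patterns when -funsafe-math-optimizations is given.

  // Vectorizable single-precision division; with unsafe math enabled the
  // Neon backend may expand this via vrecpe/vrecps reciprocal estimates.
  void vdiv (float *__restrict a, const float *__restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] /= b[i];
  }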
gcc:
PR target/108120
* config/arm/neon.md (div<VCVTF:mode>3): Rename from div<mode>3.
Gate with ARM_HAVE_NEON_<MODE>_ARITH.
gcc/testsuite:
PR target/108120
* gcc.target/arm/neon-recip-div-1.c: New file.
The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components. The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp. That
is because (at least for TImode) it can be allocated to both GPRs and
FPRs; in the GPR case that is an x-register ldp/stp, while in the FPR
case it is a q-register load/store.
As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range. In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.
Unlike for GCC 14 we need additional handling in the load/store pair
code as various cases are not expecting to see V16QImode (particularly
the writeback patterns, but also aarch64_gen_load_pair).
gcc/ChangeLog:
PR target/111677
* config/aarch64/aarch64.c (aarch64_reg_save_mode): Use
V16QImode for the full 16-byte FPR saves in the vector PCS case.
(aarch64_gen_storewb_pair): Handle V16QImode.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_gen_load_pair): Likewise.
* config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>):
Rename to ...
(loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
(storewb_pair<TX:mode>_<P:mode>): Rename to ...
(storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
* config/aarch64/iterators.md (TX_V16QI): New.
gcc/testsuite/ChangeLog:
PR target/111677
* gcc.target/aarch64/torture/pr111677.c: New test.
Jakub Jelinek [Thu, 15 Feb 2024 19:04:01 +0000 (20:04 +0100)]
testsuite: Require lra effective target for pr107385.c
Old reload doesn't support asm goto with output operands.
We have an lra effective target (though, strangely, it returns
0 for just 2 targets out of at least 16 targets with no LRA support),
so this patch uses it, similarly to how it is done in other asm goto
tests with output operands.
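The gating is a one-line DejaGnu directive in the test; one common
spelling (exact placement in pr107385.c is a sketch):

  /* { dg-require-effective-target lra } */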
Jakub Jelinek [Thu, 15 Feb 2024 14:53:01 +0000 (15:53 +0100)]
expand: Fix handling of asm goto outputs vs. PHI argument adjustments [PR113921]
The Linux kernel and the following testcase distilled from it are
miscompiled, because tree-outof-ssa.cc (eliminate_phi) emits some
fixups on some of the edges (but doesn't commit edge insertions).
Later expand_asm_stmt emits further instructions on the same edge.
Now the problem is that expand_asm_stmt uses insert_insn_on_edge
to add its own fixups, but that function appends to the existing
sequence on the edge if any. And the bug triggers when the
fixup sequence emitted by eliminate_phi uses a pseudo which the
fixup sequence emitted by expand_asm_stmt later on sets.
So, we end up with
(set (reg A) (asm_operands ...))
and on one of the edges queued sequence
(set (reg C) (reg B)) // added by eliminate_phi
(set (reg B) (reg A)) // added by expand_asm_stmt
That is wrong: what we emit in expand_asm_stmt needs to be as close
to the asm_operands as possible (they aren't known until expand_asm_stmt
is called, and the PHI fixup code assumes it is reg B that holds the right
value), and the PHI adjustments need to be done after it.
So, the following patch introduces a prepend_insn_to_edge function and
uses it from expand_asm_stmt, so that we queue
(set (reg B) (reg A)) // added by expand_asm_stmt
(set (reg C) (reg B)) // added by eliminate_phi
instead and so the value from the asm_operands output propagates correctly
to the PHI result.
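A reduced sketch of the shape involved (hypothetical, not the kernel
testcase): an asm goto output that feeds a PHI at the join point.

  int g (int);

  int
  f (void)
  {
    int a;
    /* Output set on the fallthrough edge; its fixup copy must precede
       the PHI fixup queued on the same edge.  */
    asm goto ("" : "=r" (a) : : : out);
    return g (a);   /* uses the PHI result of 'a' */
   out:
    return 0;
  }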
2024-02-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/113921
* cfgrtl.h (prepend_insn_to_edge): New declaration.
* cfgrtl.c (insert_insn_on_edge): Clarify behavior in function
comment.
(prepend_insn_to_edge): New function.
* cfgexpand.c (expand_asm_stmt): Use prepend_insn_to_edge instead of
insert_insn_on_edge.
The linking of libgcc is already present in %(liborig), so the current
situation duplicates libraries. This was not an issue until macOS's new
linker started giving warnings for such cases.
Harald Anlauf [Sat, 27 Jan 2024 16:41:43 +0000 (17:41 +0100)]
Fortran: fix bounds-checking errors for CLASS array dummies [PR104908]
Commit r11-1235 addressed issues with bounds of unlimited polymorphic array
dummies. However, using the descriptor from sym->backend_decl does break
the case of CLASS array dummies. The obvious solution is to restrict the
fix to the unlimited polymorphic case, thus keeping the original descriptor
in the ordinary case.
gcc/fortran/ChangeLog:
PR fortran/104908
* trans-array.c (gfc_conv_array_ref): Restrict use of transformed
descriptor (sym->backend_decl) to the unlimited polymorphic case.
gcc/testsuite/ChangeLog:
PR fortran/104908
* gfortran.dg/pr104908.f90: New test.
Martin Jambor [Fri, 9 Feb 2024 17:58:43 +0000 (18:58 +0100)]
sra: Disqualify bases of operands of asm gotos
PR 110422 shows that SRA can ICE when it assumes there is a single
edge outgoing from a block terminated with an asm goto. We need a
single outgoing edge for BB-terminating statements so that any
adjustments they make to the aggregates can be copied over to their
replacements. Because we can't have that after asm gotos, we need to
punt.
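A sketch of the kind of input that must now be disqualified (my
reduction): the asm goto both terminates its block and writes into an
SRA candidate aggregate.

  struct S { int i, j; };

  int
  f (void)
  {
    S s = { 0, 0 };                         // SRA candidate
    asm goto ("" : "=r" (s.i) : : : fail);  // terminates the block
    return s.i + s.j;
   fail:
    return -1;
  }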
gcc/ChangeLog:
2024-01-17 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/110422
* tree-sra.c (scan_function): Disqualify bases of operands of asm
gotos.
gcc/testsuite/ChangeLog:
2024-01-17 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/110422
* gcc.dg/torture/pr110422.c: New test.
Xi Ruoyao [Fri, 2 Feb 2024 19:35:07 +0000 (03:35 +0800)]
MIPS: Fix wrong MSA FP vector negation
We expanded (neg x) to (minus const0 x) for MSA FP vectors; this is
wrong because -0.0 is not 0 - 0.0. This causes some Python tests to
fail when Python is built with MSA enabled.
Use the bnegi.df instructions to simply reverse the sign bit instead.
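The scalar analogue of the bug (a sketch): subtracting from zero loses
the sign of a zero operand, while negation must flip the sign bit.

  #include <cmath>
  #include <cstdio>

  int
  main ()
  {
    float x = 0.0f;
    std::printf ("%d %d\n",
                 (int) std::signbit (-x),         // 1: -0.0
                 (int) std::signbit (0.0f - x));  // 0: +0.0, wrong for neg
  }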
gcc/ChangeLog:
* config/mips/mips-msa.md (elmsgnbit): New define_mode_attr.
(neg<mode>2): Change the mode iterator from MSA to IMSA because
in FP arithmetic we cannot use (0 - x) for -x.
(neg<mode>2): New define_insn to implement FP vector negation,
using a bnegi instruction to negate the sign bit.
The first alternative stores the floating-point status register
in the destination. It should store zero. We need to copy %fr0
to another floating-point register to initialize it to zero.
2024-02-01 John David Anglin <danglin@gcc.gnu.org>
gcc/ChangeLog:
* config/pa/pa.md (atomic_storedi_1): Fix bug in
alternative 1.
Lewis Hyatt [Tue, 5 Dec 2023 16:33:39 +0000 (11:33 -0500)]
c-family: Fix ICE with large column number after restoring a PCH [PR105608]
Users are allowed to define macros prior to restoring a precompiled header
file, as long as those macros are not defined (or are defined identically)
in the PCH. However, the PCH restoration process destroys all the macro
definitions, so libcpp has to record them before restoring the PCH and then
redefine them afterward.
This process does not currently assign great locations to the macros after
redefining them. Some work is needed to also remember the original locations
and get the line_maps instance in the right state (since, like all other
data structures, the line_maps instance is also reset after restoring a PCH).
This patch addresses a more pressing issue, which is that we ICE in some
cases since GCC 11, hitting an assert in line-maps.cc. It happens if the
first line encountered after the PCH restore requires an LC_RENAME map, such
as will happen if the line is sufficiently long. This is much easier to
fix, since we just need to call linemap_line_start before asking libcpp to
redefine the stored macros, instead of afterward, to avoid the unexpected
need for an LC_RENAME before an LC_ENTER has been seen.
gcc/c-family/ChangeLog:
PR preprocessor/105608
* c-pch.c (c_common_read_pch): Start a new line map before asking
libcpp to restore macros defined prior to reading the PCH, instead
of afterward.
gcc/testsuite/ChangeLog:
PR preprocessor/105608
* g++.dg/pch/line-map-1.C: New test.
* g++.dg/pch/line-map-1.Hs: New test.
* g++.dg/pch/line-map-2.C: New test.
* g++.dg/pch/line-map-2.Hs: New test.
Jason Merrill [Tue, 19 Dec 2023 21:12:02 +0000 (16:12 -0500)]
c++: xvalue array subscript [PR103185]
Normally we handle xvalue array subscripting with ARRAY_REF, but in this
case we weren't doing that because the operands were reversed. Handle that
case better.
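A sketch of the reversed-operand form (my reduction, not necessarily the
PR testcase):

  struct A { int a[3]; };
  A f ();

  int
  g ()
  {
    return 0[f ().a];   // same as f ().a[0]: subscripting an xvalue array
  }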
Jan Hubicka [Thu, 22 Dec 2022 09:55:46 +0000 (10:55 +0100)]
Zen4 tuning part 2
Adds tunes needed for the zen4 microarchitecture. I added two new knobs.
TARGET_AVX512_SPLIT_REGS is used to specify that 512-bit vectors are
internally split into 256-bit vectors. This affects vectorization costs and
reassociation width. It probably should also affect RTX costs, though I
doubt that is very useful since RTL optimizers are usually not judging
between 256-bit and 512-bit vectors.
I also added X86_TUNE_AVOID_256FMA_CHAINS. Since FMA has improved in zen4,
this flag may not be a win except for very specific benchmarks. I am still
doing some more detailed testing here.
Otherwise, I disabled gathers on zen4 for 2 parts and 4 parts. We can open
code them, and since the latencies have only increased since zen3, open
coding is better than the actual instruction. This shows in 4 TSVC
benchmarks.
I ended up setting AVX256_OPTIMAL. This is a compromise. There are some TSVC
benchmarks that improve noticeably (up to 250%), but there are also a few
regressions. Most of these can be solved by increasing the vec_perm cost in
the vectorizer. However, this does not cure a roughly 14% regression on x264
that is quite important. Here we produce vectorized loops for avx512 that
probably would be faster if the loops in question had a high enough
iteration count. We hit this problem with avx256 too: since the loop
iterates few times, only prologues/epilogues are used. Adding another round
of prologue/epilogue code does not make it better.
Finally, I enabled avx stores for constant-sized memcpy and memset. I am not
sure why this is an opt-in feature. I think for most hardware this is a win.
gcc/ChangeLog:
2022-12-22 Jan Hubicka <hubicka@ucw.cz>
* config/i386/i386-expand.c (ix86_expand_set_or_cpymem): Add
TARGET_AVX512_SPLIT_REGS.
* config/i386/i386-options.c (ix86_option_override_internal):
Honor X86_TUNE_AVOID_256FMA_CHAINS.
* config/i386/i386.c (ix86_vec_cost): Honor TARGET_AVX512_SPLIT_REGS.
(ix86_reassociation_width): Likewise.
* config/i386/i386.h (TARGET_AVX512_SPLIT_REGS): New tune.
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable
for znver4.
(X86_TUNE_USE_GATHER_4PARTS): Likewise.
(X86_TUNE_AVOID_256FMA_CHAINS): Set for znver4.
(X86_TUNE_AVOID_512FMA_CHAINS): New tune; set for znver4.
(X86_TUNE_AVX256_OPTIMAL): Add znver4.
(X86_TUNE_AVX512_SPLIT_REGS): New tune.
(X86_TUNE_AVX256_MOVE_BY_PIECES): Add znver1-3.
(X86_TUNE_AVX256_STORE_BY_PIECES): Add znver1-3.
(X86_TUNE_AVX512_MOVE_BY_PIECES): Add znver4.
(X86_TUNE_AVX512_STORE_BY_PIECES): Add znver4.
Jan Hubicka [Thu, 22 Dec 2022 01:16:24 +0000 (02:16 +0100)]
Update znver4 costs
Update costs of znver4, mostly based on data measured by Agner Fog.
Compared to previous generations, x87 became a bit slower, which is probably
not a big deal (and we have minimal benchmarking coverage for it). One
interesting improvement is the reduction of FMA cost. I also updated costs
of AVX256 loads/stores based on latencies (not throughput, which is twice
that of avx256).
Overall, AVX512 vectorization seems to improve some TSVC benchmarks
noticeably, but since 512-bit vectors are internally split into 256-bit
vectors it is somewhat risky and does not win in SPEC scores (mostly by
regressing benchmarks with loops that have a small trip count, like x264
and exchange), so for now I am going to set the AVX256_OPTIMAL tune, but I
am still playing with it. We have improved since ZNVER1 at choosing the
vectorization size and also have vectorized prologues/epilogues, so it may
be possible to make avx512 a small win overall.
2022-12-22 Jan Hubicka <hubicka@ucw.cz>
* config/i386/x86-tune-costs.h (znver4_cost): Update costs of FP and SSE
moves, division, multiplication, gathers, L2 cache size, and more
complex FP instructions.
Ken Matsui [Thu, 11 Jan 2024 06:08:07 +0000 (22:08 -0800)]
libstdc++: Fix error handling in filesystem::equivalent [PR113250]
This patch makes std::filesystem::equivalent correctly throw an exception
when either path does not exist, as per [fs.op.equivalent]/4.
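A sketch of the corrected behavior (paths are illustrative):

  #include <filesystem>
  #include <system_error>
  namespace fs = std::filesystem;

  int
  main ()
  {
    std::error_code ec;
    // One path exists and the other doesn't: an error per
    // [fs.op.equivalent]/4 (previously only reported when *both* failed).
    fs::equivalent (".", "does-not-exist", ec);
    if (!ec)
      return 1;          // fixed library sets ec here
    try
      {
        fs::equivalent (".", "does-not-exist");  // throwing overload
      }
    catch (const fs::filesystem_error &)
      {
        return 0;        // expected
      }
    return 1;
  }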
PR libstdc++/113250
libstdc++-v3/ChangeLog:
* src/c++17/fs_ops.cc (fs::equivalent): Use || instead of &&.
* src/filesystem/ops.cc (fs::equivalent): Likewise.
* testsuite/27_io/filesystem/operations/equivalent.cc: Handle
error codes.
* testsuite/experimental/filesystem/operations/equivalent.cc:
Likewise.
Signed-off-by: Ken Matsui <kmatsui@gcc.gnu.org>
Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
(cherry picked from commit df147e2ee7199d33d66959c6509ce9c21072077f)
Patrick Palka [Wed, 3 Jan 2024 02:31:20 +0000 (21:31 -0500)]
libstdc++: testsuite: Reduce max_size_type.cc exec time [PR113175]
The adjustment to max_size_type.cc in r14-205-g83470a5cd4c3d2
inadvertently increased the execution time of this test by over 5x due
to making the two main loops actually run in the signed_p case instead
of being dead code.
To compensate, this patch cuts the relevant loops' range [-1000,1000] by
10x as proposed in the PR. This shouldn't significantly weaken the test
since the same important edge cases are still checked in the smaller range
and/or elsewhere. On my machine this reduces the test's execution time by
roughly 10x (and 1.6x relative to before r14-205).
PR testsuite/113175
libstdc++-v3/ChangeLog:
* testsuite/std/ranges/iota/max_size_type.cc (test02): Reduce
'limit' to 100 from 1000 and adjust 'log2_limit' accordingly.
(test03): Likewise.
Patrick Palka [Fri, 22 Sep 2023 10:25:49 +0000 (06:25 -0400)]
c++: constraint rewriting during ttp coercion [PR111485]
In order to compare the constraints of a ttp with those of its argument,
we rewrite the ttp's constraints in terms of the argument template's
template parameters. The substitution to achieve this currently uses a
single level of template arguments, but that never does the right thing
because a ttp's template parameters always have level >= 2. This patch
fixes this by including the outer template arguments in the substitution,
which ought to match the depth of the ttp.
The second testcase demonstrates it's better to substitute the concrete
outer template arguments instead of generic ones since a ttp's constraints
could depend on outer parameters.
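A sketch of the shape involved (not the PR's exact testcases):

  template<class T> concept C = true;

  // TT's own parameter lives at level 2, so rewriting its constraint C
  // in terms of B's template parameters needs more than a single level
  // of template arguments.
  template<template<C> class TT> struct X { };

  template<C U> struct B { };

  X<B> x;   // requires comparing TT's constraints with B's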
PR c++/111485
gcc/cp/ChangeLog:
* pt.c (is_compatible_template_arg): New parameter 'args'.
Add the outer template arguments 'args' to 'new_args'.
(convert_template_argument): Pass 'args' to
is_compatible_template_arg.
gcc/testsuite/ChangeLog:
* g++.dg/cpp2a/concepts-ttp5.C: New test.
* g++.dg/cpp2a/concepts-ttp6.C: New test.
Jason Merrill [Fri, 23 Apr 2021 20:41:35 +0000 (16:41 -0400)]
c++: -Wdeprecated-copy and using operator= [PR92145]
For the purpose of [depr.impldec] "if the class has a user-declared copy
assignment operator", an operator= brought in from a base class with 'using'
may be a copy-assignment operator, but it isn't a copy-assignment operator
for the derived class.
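A sketch of the distinction (my example):

  struct B {
    B& operator= (const B&);   // user-declared copy assignment for B
  };

  struct D : B {
    using B::operator=;        // copies B, but is not a copy-assignment
  };                           // operator *for D* (parameter is const B&)

  // D's implicit copy operations should therefore not be deprecated on
  // account of the using-declaration alone.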
gcc/cp/ChangeLog:
PR c++/92145
* class.c (classtype_has_depr_implicit_copy): Check DECL_CONTEXT
of operator=.
gcc/testsuite/ChangeLog:
PR c++/92145
* g++.dg/cpp0x/depr-copy3.C: New test.
Jason Merrill [Sun, 4 Jun 2023 16:00:55 +0000 (12:00 -0400)]
c++: NRV and goto [PR92407]
Here our named return value optimization was breaking the required
destructor when the goto takes 'a' out of scope. A simple fix for the
release branches is to disable the optimization in the presence of backward
goto.
We could do better by disabling the optimization only if there is a backward
goto across the variable declaration, but we don't track that, and in GCC 14
we instead make the goto work with NRV.
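A sketch of the problematic pattern (reduced by hand):

  struct A { A (); ~A (); };
  bool cond ();

  A
  f ()
  {
   again:
    A a;            // NRV would construct 'a' in the return slot ...
    if (cond ())
      goto again;   // ... but this goto must destroy 'a' first
    return a;
  }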
PR c++/92407
gcc/cp/ChangeLog:
* cp-tree.h (struct language_function): Add backward_goto.
* decl.c (check_goto): Set it.
* typeck.c (check_return_expr): Prevent NRV if set.
Patrick Palka [Tue, 25 Apr 2023 19:59:22 +0000 (15:59 -0400)]
c++: value dependence of by-ref lambda capture [PR108975]
We are still ICEing on the generic lambda version of the testcase from
this PR, even after r13-6743-g6f90de97634d6f, due to the by-ref capture
of the constant local variable 'dim' being considered value-dependent
when regenerating the lambda (at which point processing_template_decl is
set since the lambda is generic), which prevents us from constant folding
its uses. Later during prune_lambda_captures we end up not thoroughly
walking the body of the lambda and overlook the (non-folded) uses of
'dim' within the array bound and using-decls.
We could fix this by making prune_lambda_captures walk the body of the
lambda more thoroughly so that it finds these uses of 'dim', but ideally
we should be able to constant fold all uses of 'dim' ahead of time and
prune the implicit capture after all.
To that end this patch makes value_dependent_expression_p return false
for such by-ref captures of constant local variables, allowing their
uses to get constant folded ahead of time. It seems we just need to
disable the predicate's conservative early exit for reference variables
(added by r5-5022-g51d72abe5ea04e) when DECL_HAS_VALUE_EXPR_P. This
effectively makes us treat by-value and by-ref captures more consistently
when it comes to value dependence.
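A sketch of the generic-lambda case (reduced):

  template<class T>
  void
  f ()
  {
    const int dim = 1;
    [&] {                 // capture of 'dim' should be pruned ...
      int arr[dim];       // ... once these uses are constant folded
      using U = int[dim];
      (void) sizeof (arr);
      (void) sizeof (U);
    } ();
  }

  template void f<int> ();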
PR c++/108975
gcc/cp/ChangeLog:
* pt.c (value_dependent_expression_p) <case VAR_DECL>:
Suppress conservative early exit for reference variables
when DECL_HAS_VALUE_EXPR_P.
gcc/testsuite/ChangeLog:
* g++.dg/cpp0x/lambda/lambda-const11a.C: New test.