Darwin, Objective-C: Support -fconstant-cfstrings [PR108743].
This support the -fconstant-cfstrings option as used by clang (and
expect by some build scripts) as an alias to the target-specific
-mconstant-cfstrings.
The documentation is also updated to reflect that the 'f' option is
only available on Darwin, and to add the 'm' option to the Darwin
section of the invocation text.
* config/darwin.opt: Add fconstant-cfstrings alias to
mconstant-cfstrings.
* doc/invoke.texi: Amend invocation descriptions to reflect
that the fconstant-cfstrings is a target-option alias and to
add the missing mconstant-cfstrings option description to the
Darwin section.
Eric Botcazou [Thu, 7 Mar 2024 14:05:54 +0000 (15:05 +0100)]
Fix bogus error on allocator for array type with Dynamic_Predicate
This is a regression present on all active branches: the compiler gives
a bogus error on an allocator for an unconstrained array type declared
with a Dynamic_Predicate because Apply_Predicate_Check is invoked directly
on a subtype reference, which it cannot handle.
This moves the check to the resulting access value (after dereference) like
in Expand_Allocator_Expression.
gcc/ada/
PR ada/113979
* exp_ch4.adb (Expand_N_Allocator): In the subtype indication case,
remove call to Apply_Predicate_Check.
gcc/testsuite/
* gnat.dg/predicate15.adb: New test.
Iain Buclaw [Sun, 4 Feb 2024 21:04:14 +0000 (22:04 +0100)]
d: Fix callee destructor call invalidates the live object [PR113758]
When generating the argument, check the isCalleeDestroyingArgs hook, and
force a TARGET_EXPR to be created if true, so that a reference to the
live object isn't passed directly to the function that runs dtors.
When instead dealing with caller running destructors, two temporaries
were being generated, one explicit temporary generated by the D
front-end, and another implicitly by the code generator. This has been
reduced to one by setting DECL_VALUE_EXPR on the explicit temporary to
bind it to the implicit slot created for the TARGET_EXPR, as that has
the shorter lifetime of the two.
PR d/113758
gcc/d/ChangeLog:
* d-codegen.cc (d_build_call): Force a TARGET_EXPR when callee
destorys its arguments.
* decl.cc (DeclVisitor::visit (VarDeclaration *)): Set
SET_DECL_VALUE_EXPR on the temporary variable to make it a placeholder
for the TARGET_EXPR_SLOT.
and the memory operand size is 1 byte. As the result, the rest of 511
bytes is ignored by GCC. Implement ldtilecfg and sttilecfg intrinsics
with a pointer to XImode to honor the 512-byte memory block.
gcc/ChangeLog:
PR target/114098
* config/i386/amxtileintrin.h (_tile_loadconfig): Use
__builtin_ia32_ldtilecfg.
(_tile_storeconfig): Use __builtin_ia32_sttilecfg.
* config/i386/i386-builtin.def (BDESC): Add
__builtin_ia32_ldtilecfg and __builtin_ia32_sttilecfg.
* config/i386/i386-expand.c (ix86_expand_builtin): Handle
IX86_BUILTIN_LDTILECFG and IX86_BUILTIN_STTILECFG.
* config/i386/i386.md (ldtilecfg): New pattern.
(sttilecfg): Likewise.
gcc/testsuite/ChangeLog:
PR target/114098
* gcc.target/i386/amxtile-4.c: New test.
There are no instructions that do traditional AltiVec addresses (i.e.
with the low four bits of the address masked off) for OOmode and XOmode
objects. The solution is to modify the constraints used in the movoo and
movxo pattern to disallow these types of addresses, which assists LRA in
resolving this issue. Furthermore, the mode size 16 check has been
removed in vsx_quad_dform_memory_operand to allow OOmode and XOmode, and
quad_address_p already handles less than size 16.
Eric Botcazou [Mon, 26 Feb 2024 12:13:34 +0000 (13:13 +0100)]
Finalization of object allocated by anonymous access designating local type
The finalization of objects dynamically allocated through an anonymous
access type is deferred to the enclosing library unit in the current
implementation and a warning is given on each of them.
However this cannot be done if the designated type is local, because this
would generate dangling references to the local finalization routine, so
the finalization needs to be dropped in this case and the warning adjusted.
gcc/ada/
PR ada/113893
* exp_ch7.adb (Build_Anonymous_Master): Do not build the master
for a local designated type.
* exp_util.adb (Build_Allocate_Deallocate_Proc): Force Needs_Fin
to false if no finalization master is attached to an access type
and assert that it is anonymous in this case.
* sem_res.adb (Resolve_Allocator): Mention that the object might
not be finalized at all in the warning given when the type is an
anonymous access-to-controlled type.
Richard Earnshaw [Thu, 22 Feb 2024 16:47:20 +0000 (16:47 +0000)]
arm: fix ICE with vectorized reciprocal division [PR108120]
The expand pattern for reciprocal division was enabled for all math
optimization modes, but the patterns it was generating were not
enabled unless -funsafe-math-optimizations were enabled, this leads to
an ICE when the pattern we generate cannot be recognized.
Fixed by only enabling vector division when doing unsafe math.
gcc:
PR target/108120
* config/arm/neon.md (div<VCVTF:mode>3): Rename from div<mode>3.
Gate with ARM_HAVE_NEON_<MODE>_ARITH.
gcc/testsuite:
PR target/108120
* gcc.target/arm/neon-recip-div-1.c: New file.
The PR shows us ICEing due to an unrecognizable TFmode save emitted by
aarch64_process_components. The problem is that for T{I,F,D}mode we
conservatively require mems to be in range for x-register ldp/stp. That
is because (at least for TImode) it can be allocated to both GPRs and
FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is
a q-register load/store.
As Richard pointed out in the PR, aarch64_get_separate_components
already checks that the offsets are suitable for a single load, so we
just need to choose a mode in aarch64_reg_save_mode that gives the full
q-register range. In this patch, we choose V16QImode as an alternative
16-byte "bag-of-bits" mode that doesn't have the artificial range
restrictions imposed on T{I,F,D}mode.
Unlike for GCC 14 we need additional handling in the load/store pair
code as various cases are not expecting to see V16QImode (particularly
the writeback patterns, but also aarch64_gen_load_pair).
gcc/ChangeLog:
PR target/111677
* config/aarch64/aarch64.c (aarch64_reg_save_mode): Use
V16QImode for the full 16-byte FPR saves in the vector PCS case.
(aarch64_gen_storewb_pair): Handle V16QImode.
(aarch64_gen_loadwb_pair): Likewise.
(aarch64_gen_load_pair): Likewise.
* config/aarch64/aarch64.md (loadwb_pair<TX:mode>_<P:mode>):
Rename to ...
(loadwb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
(storewb_pair<TX:mode>_<P:mode>): Rename to ...
(storewb_pair<TX_V16QI:mode>_<P:mode>): ... this, extending to
V16QImode.
* config/aarch64/iterators.md (TX_V16QI): New.
gcc/testsuite/ChangeLog:
PR target/111677
* gcc.target/aarch64/torture/pr111677.c: New test.
Jakub Jelinek [Thu, 15 Feb 2024 19:04:01 +0000 (20:04 +0100)]
testsuite: Require lra effective target for pr107385.c
Old reload doesn't support asm goto with output operands.
We have lra effective target (though, strangely it returns
0 just for 2 targets out of at least 16 targets with no LRA support),
so this patch uses it, similarly how it is done in other asm goto
tests with output operands.
Jakub Jelinek [Thu, 15 Feb 2024 14:53:01 +0000 (15:53 +0100)]
expand: Fix handling of asm goto outputs vs. PHI argument adjustments [PR113921]
The Linux kernel and the following testcase distilled from it is
miscompiled, because tree-outof-ssa.cc (eliminate_phi) emits some
fixups on some of the edges (but doesn't commit edge insertions).
Later expand_asm_stmt emits further instructions on the same edge.
Now the problem is that expand_asm_stmt uses insert_insn_on_edge
to add its own fixups, but that function appends to the existing
sequence on the edge if any. And the bug triggers when the
fixup sequence emitted by eliminate_phi uses a pseudo which the
fixup sequence emitted by expand_asm_stmt later on sets.
So, we end up with
(set (reg A) (asm_operands ...))
and on one of the edges queued sequence
(set (reg C) (reg B)) // added by eliminate_phi
(set (reg B) (reg A)) // added by expand_asm_stmt
That is wrong, what we emit by expand_asm_stmt needs to be as close
to the asm_operands as possible (they aren't known until expand_asm_stmt
is called, the PHI fixup code assumes it is reg B which holds the right
value) and the PHI adjustments need to be done after it.
So, the following patch introduces a prepend_insn_to_edge function and
uses it from expand_asm_stmt, so that we queue
(set (reg B) (reg A)) // added by expand_asm_stmt
(set (reg C) (reg B)) // added by eliminate_phi
instead and so the value from the asm_operands output propagates correctly
to the PHI result.
2024-02-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/113921
* cfgrtl.h (prepend_insn_to_edge): New declaration.
* cfgrtl.c (insert_insn_on_edge): Clarify behavior in function
comment.
(prepend_insn_to_edge): New function.
* cfgexpand.c (expand_asm_stmt): Use prepend_insn_to_edge instead of
insert_insn_on_edge.
The linking of libgcc is already present in %(liborig), so the current
situation duplicates libraries. This was not an issue until macOS's new
linker started giving warnings for such cases.
Harald Anlauf [Sat, 27 Jan 2024 16:41:43 +0000 (17:41 +0100)]
Fortran: fix bounds-checking errors for CLASS array dummies [PR104908]
Commit r11-1235 addressed issues with bounds of unlimited polymorphic array
dummies. However, using the descriptor from sym->backend_decl does break
the case of CLASS array dummies. The obvious solution is to restrict the
fix to the unlimited polymorphic case, thus keeping the original descriptor
in the ordinary case.
gcc/fortran/ChangeLog:
PR fortran/104908
* trans-array.c (gfc_conv_array_ref): Restrict use of transformed
descriptor (sym->backend_decl) to the unlimited polymorphic case.
gcc/testsuite/ChangeLog:
PR fortran/104908
* gfortran.dg/pr104908.f90: New test.
Martin Jambor [Fri, 9 Feb 2024 17:58:43 +0000 (18:58 +0100)]
sra: Disqualify bases of operands of asm gotos
PR 110422 shows that SRA can ICE assuming there is a single edge
outgoing from a block terminated with an asm goto. We need that for
BB-terminating statements so that any adjustments they make to the
aggregates can be copied over to their replacements. Because we can't
have that after ASM gotos, we need to punt.
gcc/ChangeLog:
2024-01-17 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/110422
* tree-sra.c (scan_function): Disqualify bases of operands of asm
gotos.
gcc/testsuite/ChangeLog:
2024-01-17 Martin Jambor <mjambor@suse.cz>
PR tree-optimization/110422
* gcc.dg/torture/pr110422.c: New test.
Xi Ruoyao [Fri, 2 Feb 2024 19:35:07 +0000 (03:35 +0800)]
MIPS: Fix wrong MSA FP vector negation
We expanded (neg x) to (minus const0 x) for MSA FP vectors, this is
wrong because -0.0 is not 0 - 0.0. This causes some Python tests to
fail when Python is built with MSA enabled.
Use the bnegi.df instructions to simply reverse the sign bit instead.
gcc/ChangeLog:
* config/mips/mips-msa.md (elmsgnbit): New define_mode_attr.
(neg<mode>2): Change the mode iterator from MSA to IMSA because
in FP arithmetic we cannot use (0 - x) for -x.
(neg<mode>2): New define_insn to implement FP vector negation,
using a bnegi instruction to negate the sign bit.
The first alternative stores the floating-point status register
in the destination. It should store zero. We need to copy %fr0
to another floating-point register to initialize it to zero.
2024-02-01 John David Anglin <danglin@gcc.gnu.org>
gcc/ChangeLog:
* config/pa/pa.md (atomic_storedi_1): Fix bug in
alternative 1.
Lewis Hyatt [Tue, 5 Dec 2023 16:33:39 +0000 (11:33 -0500)]
c-family: Fix ICE with large column number after restoring a PCH [PR105608]
Users are allowed to define macros prior to restoring a precompiled header
file, as long as those macros are not defined (or are defined identically)
in the PCH. However, the PCH restoration process destroys all the macro
definitions, so libcpp has to record them before restoring the PCH and then
redefine them afterward.
This process does not currently assign great locations to the macros after
redefining them. Some work is needed to also remember the original locations
and get the line_maps instance in the right state (since, like all other
data structures, the line_maps instance is also reset after restoring a PCH).
This patch addresses a more pressing issue, which is that we ICE in some
cases since GCC 11, hitting an assert in line-maps.cc. It happens if the
first line encountered after the PCH restore requires an LC_RENAME map, such
as will happen if the line is sufficiently long. This is much easier to
fix, since we just need to call linemap_line_start before asking libcpp to
redefine the stored macros, instead of afterward, to avoid the unexpected
need for an LC_RENAME before an LC_ENTER has been seen.
gcc/c-family/ChangeLog:
PR preprocessor/105608
* c-pch.c (c_common_read_pch): Start a new line map before asking
libcpp to restore macros defined prior to reading the PCH, instead
of afterward.
gcc/testsuite/ChangeLog:
PR preprocessor/105608
* g++.dg/pch/line-map-1.C: New test.
* g++.dg/pch/line-map-1.Hs: New test.
* g++.dg/pch/line-map-2.C: New test.
* g++.dg/pch/line-map-2.Hs: New test.
Jason Merrill [Tue, 19 Dec 2023 21:12:02 +0000 (16:12 -0500)]
c++: xvalue array subscript [PR103185]
Normally we handle xvalue array subscripting with ARRAY_REF, but in this
case we weren't doing that because the operands were reversed. Handle that
case better.
Jan Hubicka [Thu, 22 Dec 2022 09:55:46 +0000 (10:55 +0100)]
Zen4 tuning part 2
Adds tunes needed for zen4 microarchitecture. I added two new knobs.
TARGET_AVX512_SPLIT_REGS which is used to specify that internally 512 vectors
are split to 256 vectors. This affects vectorization costs and reassociation
width. It probably should also affect RTX costs however I doubt it is very useful
since RTL optimizers are usually not judging between 256 and 512 vectors.
I also added X86_TUNE_AVOID_256FMA_CHAINS. Since fma has improved in zen4 this
flag may not be a win except for very specific benchmarks. I am still doing some
more detailed testing here.
Oherwise I disabled gathers on zen4 for 2 parts nad 4 parts. We can open code them
and since the latencies has only increased since zen3 opencoding is better than
actual instrucction. This shows at 4 tsvc benchmarks.
I ended up setting AVX256_OPTIMAL. This is a compromise. There are some tsvc
benchmarks that increase noticeably (up to 250%) however there are also few
regressions. Most of these can be solved by incrasing vec_perm cost in the
vectorizer. However this does not cure about 14% regression on x264 that is
quite important. Here we produce vectorized loops for avx512 that probably
would be faster if the loops in question had high enough iteration count.
We hit this problem with avx256 too: since the loop iterates few times, only
prologues/epilogues are used. Adding another round of prologue/epilogue
code does not make it better.
Finally I enabled avx stores for constnat sized memcpy and memset. I am not
sure why this is an opt-in feature. I think for most hardware this is a win.
gcc/ChangeLog:
2022-12-22 Jan Hubicka <hubicka@ucw.cz>
* config/i386/i386-expand.c (ix86_expand_set_or_cpymem): Add
TARGET_AVX512_SPLIT_REGS
* config/i386/i386-options.c (ix86_option_override_internal):
Honor x86_TONE_AVOID_256FMA_CHAINS.
* config/i386/i386.c (ix86_vec_cost): Honor TARGET_AVX512_SPLIT_REGS.
(ix86_reassociation_width): Likewise.
* config/i386/i386.h (TARGET_AVX512_SPLIT_REGS): New tune.
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable
for znver4.
(X86_TUNE_USE_GATHER_4PARTS): Likewise.
(X86_TUNE_AVOID_256FMA_CHAINS): Set for znver4.
(X86_TUNE_AVOID_512FMA_CHAINS): New utne; set for znver4.
(X86_TUNE_AVX256_OPTIMAL): Add znver4.
(X86_TUNE_AVX512_SPLIT_REGS): New tune.
(X86_TUNE_AVX256_MOVE_BY_PIECES): Add znver1-3.
(X86_TUNE_AVX256_STORE_BY_PIECES): Add znver1-3.
(X86_TUNE_AVX512_MOVE_BY_PIECES): Add znver4.
(X86_TUNE_AVX512_STORE_BY_PIECES): Add znver4.
Jan Hubicka [Thu, 22 Dec 2022 01:16:24 +0000 (02:16 +0100)]
Update znver4 costs
Update cost of znver4 mostly based on data measued by Agner Fog.
Compared to previous generations x87 became bit slower which is probably not
big deal (and we have minimal benchmarking coverage for it). One interesting
improvement is reducation of FMA cost. I also updated costs of AVX256
loads/stores based on latencies (not throughput which is twice of avx256).
Overall AVX512 vectorization seems to improve noticeably some of TSVC
benchmarks but since internally 512 vectors are split to 256 vectors it is
somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks
with loop that have small trip count like x264 and exchange), so for now I am
going to set AVX256_OPTIMAL tune but I am still playing with it. We improved
since ZNVER1 on choosing vectorization size and also have vectorized
prologues/epilogues so it may be possible to make avx512 small win overall.
2022-12-22 Jan Hubicka <hubicka@ucw.cz>
* config/i386/x86-tune-costs.h (znver4_cost): Upate costs of FP and SSE
moves, division multiplication, gathers, L2 cache size, and more
complex FP instrutions.
Ken Matsui [Thu, 11 Jan 2024 06:08:07 +0000 (22:08 -0800)]
libstdc++: Fix error handling in filesystem::equivalent [PR113250]
This patch made std::filesystem::equivalent correctly throw an exception
when either path does not exist as per [fs.op.equivalent]/4.
PR libstdc++/113250
libstdc++-v3/ChangeLog:
* src/c++17/fs_ops.cc (fs::equivalent): Use || instead of &&.
* src/filesystem/ops.cc (fs::equivalent): Likewise.
* testsuite/27_io/filesystem/operations/equivalent.cc: Handle
error codes.
* testsuite/experimental/filesystem/operations/equivalent.cc:
Likewise.
Signed-off-by: Ken Matsui <kmatsui@gcc.gnu.org> Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
(cherry picked from commit df147e2ee7199d33d66959c6509ce9c21072077f)