Richard Biener [Sun, 13 Oct 2024 09:42:27 +0000 (11:42 +0200)]
tree-optimization/116481 - avoid building function_type[]
The following avoids building an array type with function or method
element type during diagnosing an array bound violation as this
will result in an error, rejecting a program with a not too useful
error message. Instead build such array type manually.
PR tree-optimization/116481
* pointer-query.cc (build_printable_array_type):
Build array types with function or method element type
manually to avoid a bogus diagnostic.
Richard Biener [Sun, 13 Oct 2024 13:12:44 +0000 (15:12 +0200)]
tree-optimization/116290 - fix compare-debug issue in ldist
Loop distribution does different analysis with -g0/-g due to counting
a debug stmt starting a BB against a limit which will eventually
lead to different IVOPTs choices. I've fixed a possible IVOPTs
issue on the way even though it doesn't make a difference here.
PR tree-optimization/116290
* tree-loop-distribution.cc (determine_reduction_stmt_1): PHIs
have no debug variants. Start with first non-debug real stmt.
* tree-ssa-loop-ivopts.cc (find_givs_in_bb): Do not analyze
debug stmts.
Richard Biener [Fri, 17 May 2024 09:02:29 +0000 (11:02 +0200)]
middle-end/115110 - Fix view_converted_memref_p
view_converted_memref_p was checking the reference type against the
pointer type of the offset operand rather than its pointed-to type
which leads to all refs being subject to view-convert treatment
in get_alias_set, causing numerous testsuite failures; with its
new uses from r15-512-g9b7cad5884f21c it is also a wrong-code issue.
liuhongt [Wed, 16 Oct 2024 05:43:48 +0000 (13:43 +0800)]
Refine splitters related to "combine vpcmpuw + zero_extend to vpcmpuw"
r12-6103-g1a7ce8570997eb combines vpcmpuw + zero_extend to vpcmpuw
with the pre_reload splitter, but the splitter transforms the
zero_extend into a subreg, which makes reload think the upper part is
garbage, which is not correct.
The patch adjusts the zero_extend define_insn_and_split to a
define_insn to keep the zero_extend.
gcc/ChangeLog:
PR target/117159
* config/i386/sse.md
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Change from define_insn_and_split to define_insn.
(*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Ditto.
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Split to the zero_extend pattern.
(*<avx512>_cmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
(*<avx512>_ucmp<VI12_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
(*<avx512>_ucmp<VI48_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
Martin Jambor [Fri, 18 Oct 2024 19:32:16 +0000 (21:32 +0200)]
ipa: Treat static constructors and destructors as non-local (PR 115815)
In PR 115815, IPA-SRA thought it had control over all invocations of a
(recursive) static destructor but it did not see the implied
invocation which led to the original being left behind and the
clean-up code encountering uses of SSAs that definitely should have
been dead.
Fixed by teaching cgraph_node::can_be_local_p about static
constructors and destructors. A similar test was missing in
cgraph_node::local_p, so I added the check there as well.
In addition to the commit with the fix, this backport also contains
squashed commit 1a458bdeb223ffa501bac8e76182115681967094 which fixes
dejagnu directives in the testcase.
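For illustration, a minimal hedged sketch (not the PR 115815 testcase) of
the construct involved: a recursive function marked as a static destructor,
which therefore has an implicit caller besides the explicit recursive one.
  static int n;

  __attribute__ ((destructor))
  static void
  fini (void)
  {
    if (n++ < 3)
      fini ();   /* explicit recursive call; the runtime also calls fini at exit */
  }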
gcc/ChangeLog:
2024-07-25 Martin Jambor <mjambor@suse.cz>
PR ipa/115815
* cgraph.cc (cgraph_node_cannot_be_local_p_1): Also check
DECL_STATIC_CONSTRUCTOR and DECL_STATIC_DESTRUCTOR.
* ipa-visibility.cc (non_local_p): Likewise.
(cgraph_node::local_p): Delete extraneous line of tabs.
gcc/testsuite/ChangeLog:
2024-07-25 Martin Jambor <mjambor@suse.cz>
PR ipa/115815
* gcc.dg/lto/pr115815_0.c: New test.
Li Xu [Thu, 10 Oct 2024 14:51:19 +0000 (08:51 -0600)]
RISC-V:Bugfix for C++ code compilation failure with rv32imafc_zve32f[pr116883]
From: xuli <xuli1@eswincomputing.com>
Example as follows:
int main()
{
unsigned long arraya[128], arrayb[128], arrayc[128];
for (int i = 0; i < 128; i++)
{
arraya[i] = arrayb[i] + arrayc[i];
}
return 0;
}
Compiled with -march=rv32imafc_zve32f -mabi=ilp32f, it will cause a compilation issue:
riscv_vector.h:40:25: error: ambiguating new declaration of 'vint64m4_t __riscv_vle64(vbool16_t, const long long int*, unsigned int)'
40 | #pragma riscv intrinsic "vector"
| ^~~~~~~~
riscv_vector.h:40:25: note: old declaration 'vint64m1_t __riscv_vle64(vbool64_t, const long long int*, unsigned int)'
With zvl=32b, vbool16_t is registered in init_builtins() with
type_common.precision=0x101 (nunits=2), mode_nunits[E_RVVMF16BI]=[2,2].
Normally, vbool64_t is only valid when TARGET_MIN_VLEN > 32, so vbool64_t
is not registered in init_builtins(), meaning vbool64_t=null.
In order to implement __attribute__((target("arch=+v"))), we must register
all vector types and all RVV intrinsics. Therefore, vbool64_t will be registered
by default with zvl=128b in reinit_builtins(), resulting in
type_common.precision=0x101 (nunits=2) and mode_nunits[E_RVVMF64BI]=[2,2].
We then get TYPE_VECTOR_SUBPARTS(vbool16_t) == TYPE_VECTOR_SUBPARTS(vbool64_t),
calculated using type_common.precision, resulting in 2. Since vbool16_t and
vbool64_t have the same element type (boolean_type), the compiler treats them
as the same type, leading to a re-declaration conflict.
After all types and intrinsics have been registered, processing
__attribute__((target("arch=+v"))) will update the parameters option and
init_adjust_machine_modes. Therefore, to avoid conflicts, we can choose
zvl=4096b for the null type in reinit_builtins().
Patrick Palka [Tue, 15 Oct 2024 17:13:15 +0000 (13:13 -0400)]
c++: checking ICE w/ constexpr if and lambda as def targ [PR117054]
Here we're tripping over the assert in extract_locals_r which enforces
that an extra-args tree appearing inside another extra-args tree doesn't
actually have extra args. This invariant doesn't always hold for lambdas
(which recently gained the extra-args mechanism) but that should be
harmless since cp_walk_subtrees doesn't walk LAMBDA_EXPR_EXTRA_ARGS and
so should be immune to the PR114303 issue for now. So let's just disable
this assert for lambdas.
PR c++/117054
gcc/cp/ChangeLog:
* pt.cc (extract_locals_r): Disable tree_extra_args assert
for LAMBDA_EXPR.
Marek Polacek [Thu, 29 Aug 2024 14:40:50 +0000 (10:40 -0400)]
c++: ICE with -Wtautological-compare in template [PR116534]
Pre r14-4793, we'd call warn_tautological_cmp -> operand_equal_p
with operands wrapped in NON_DEPENDENT_EXPR, which works, since
o_e_p bails for codes it doesn't know. But now we pass operands
not encapsulated in NON_DEPENDENT_EXPR, and crash, because the
template tree for &a[x] has null DECL_FIELD_OFFSET.
This patch extends r12-7797 to cover the case when DECL_FIELD_OFFSET
is null.
PR c++/116534
gcc/ChangeLog:
* fold-const.cc (operand_compare::operand_equal_p): If either
field's DECL_FIELD_OFFSET is null, compare the fields with ==.
Marek Polacek [Wed, 28 Aug 2024 19:45:49 +0000 (15:45 -0400)]
c++: wrong error due to std::initializer_list opt [PR116476]
Here maybe_init_list_as_array gets elttype=field, init={NON_LVALUE_EXPR <2>}
and it tries to convert the init's element type (int) to field
using implicit_conversion, which works, so overall maybe_init_list_as_array
is successful.
But it constifies init_elttype so we end up with "const int". Later,
when we actually perform the conversion and invoke field::field(T&&),
we end up with this error:
error: binding reference of type 'int&&' to 'const int' discards qualifiers
So I think maybe_init_list_as_array should try to perform the conversion,
like it does below with fc.
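For illustration, a hedged sketch (not the PR's testcase) of the shape of
code involved, where the braced list's int elements must convert to the
element type through a constructor taking an rvalue reference:
  #include <vector>

  struct field
  {
    field (int &&) { }   // binds only to non-const rvalues
  };

  int
  main ()
  {
    // Constifying the deduced element type would make the int -> field
    // conversion ill-formed, as in the reported error above.
    std::vector<field> v = { 2 };
    return v.size () == 1 ? 0 : 1;
  }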
PR c++/116476
gcc/cp/ChangeLog:
* call.cc (maybe_init_list_as_array): Try convert_like and see if it
worked.
We cannot elide the TARGET_EXPR because we're taking its address.
It is set as eliding in massage_init_elt. I've tried to not set
TARGET_EXPR_ELIDING_P when the context is not direct-initialization.
That didn't work: even when it's not direct-initialization now, it
can become one later, for instance, after split_nonconstant_init.
One problem is that replace_placeholders_for_class_temp_r will replace
placeholders in non-eliding TARGET_EXPRs with the slot, but if we then
elide the TARGET_EXPR, we end up with a "stray" VAR_DECL and crash.
(Only some TARGET_EXPRs are handled by replace_decl.)
I thought I'd have to go back to
<https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651163.html> but
then I realized that this problem occurs only with ()-init but not
{}-init. With {}-init, there is no problem, because we are clearing
TARGET_EXPR_ELIDING_P in process_init_constructor_record:
/* We can't actually elide the temporary when initializing a
potentially-overlapping field from a function that returns by
value. */
if (ce->index
&& TREE_CODE (next) == TARGET_EXPR
&& unsafe_copy_elision_p (ce->index, next))
TARGET_EXPR_ELIDING_P (next) = false;
But that does not happen for ()-init because we have no ce->index.
()-init doesn't allow brace elision so we don't really reshape them.
But I can just move the clearing a few lines down and then it handles
both ()-init and {}-init.
PR c++/116424
gcc/cp/ChangeLog:
* typeck2.cc (process_init_constructor_record): Move the clearing of
TARGET_EXPR_ELIDING_P down.
Uros Bizjak [Tue, 15 Oct 2024 14:51:33 +0000 (16:51 +0200)]
i386: Fix expand_vector_set for VEC_MERGE/VEC_DUPLICATE RTX [PR117116]
Middle end can generate SYMBOL_REF RTX as a value "val" in the call
to expand_vector_set, but SYMBOL_REF RTX is not accepted in
<sse2p4_1>_pinsr<ssemodesuffix> insn pattern, generated via
VEC_MERGE/VEC_DUPLICATE RTX path.
Force the value into a register before VEC_MERGE/VEC_DUPLICATE RTX
is generated if it doesn't satisfy nonimmediate_operand predicate.
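For illustration, a hedged sketch (not the PR's testcase) of source that can
feed a SYMBOL_REF as the element value, by storing the address of a global
into a vector element:
  extern int g;
  typedef long long v2di __attribute__ ((vector_size (16)));

  v2di
  f (v2di v)
  {
    v[0] = (long long) &g;   // the value expands to a SYMBOL_REF at the RTL level
    return v;
  }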
PR target/117116
gcc/ChangeLog:
* config/i386/i386-expand.cc (expand_vector_set): Force "val"
into a register before VEC_MERGE/VEC_DUPLICATE RTX is generated
if it doesn't satisfy nonimmediate_operand predicate.
Jonathan Wakely [Wed, 16 Oct 2024 08:22:37 +0000 (09:22 +0100)]
libstdc++: Fix Python deprecation warning in printers.py
python/libstdcxx/v6/printers.py:1355: DeprecationWarning: 'count' is passed as positional argument
The Python docs say:
Deprecated since version 3.13: Passing count and flags as positional
arguments is deprecated. In future Python versions they will be
keyword-only parameters.
Using a keyword argument for count only became possible with Python 3.1
so introduce a new function to do the substitution.
libstdc++-v3/ChangeLog:
* python/libstdcxx/v6/printers.py (strip_fundts_namespace): New.
(StdExpAnyPrinter, StdExpOptionalPrinter): Use it.
Jonathan Wakely [Sun, 13 Oct 2024 20:47:14 +0000 (21:47 +0100)]
libstdc++: Implement LWG 3564 for ranges::transform_view
The _Iterator<true> type returned by begin() const uses const F& to
transform the elements, so it should use const F& to determine the
iterator's value_type and iterator_category as well.
This was accepted into the WP in July 2022.
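A hedged sketch of what the change affects, using an illustrative callable
whose const and non-const call operators return different types:
  #include <concepts>
  #include <ranges>
  #include <vector>

  struct F
  {
    int    operator() (int)       { return 0; }   // used via begin()
    double operator() (int) const { return 0.0; } // used via begin() const
  };

  int
  main ()
  {
    std::vector<int> v{1, 2, 3};
    auto tv = v | std::views::transform (F{});
    const auto &ctv = tv;
    // The const iterator transforms through const F&, so with LWG 3564 its
    // value_type here is double rather than int.
    static_assert (std::same_as<std::ranges::range_value_t<decltype (ctv)>, double>);
  }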
libstdc++-v3/ChangeLog:
* include/std/ranges (transform_view:_Iterator): Use const F&
to determine value_type and iterator_category of
_Iterator<true>, as per LWG 3564.
* testsuite/std/ranges/adaptors/transform.cc: Check value_type
and iterator_category.
Jonathan Wakely [Fri, 11 Oct 2024 08:40:38 +0000 (09:40 +0100)]
libstdc++: Fix localized %c formatting for <chrono> [PR117085]
When formatting a time point with %c we call std::vformat_to using the
formatting locale's D_T_FMT string, but we weren't adding the L option
to the format string. This meant we always interpreted D_T_FMT in the C
locale, instead of using the formatting locale as obviously intended
when %c is used.
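For illustration, roughly the kind of use affected (the locale name and the
time point are assumptions for the example; the named locale must be
installed):
  #include <chrono>
  #include <format>
  #include <iostream>
  #include <locale>

  int
  main ()
  {
    using namespace std::chrono;
    std::locale de ("de_DE.UTF-8");            // assumed to be installed
    auto tp = sys_days{2024y/October/11} + 8h + 40min;
    // %c should be interpreted with the formatting locale's D_T_FMT,
    // not the C locale's.
    std::cout << std::format (de, "{:L%c}", tp) << '\n';
  }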
libstdc++-v3/ChangeLog:
PR libstdc++/117085
* include/bits/chrono_io.h (__formatter_chrono::_M_c): Add L
option to format string.
* testsuite/std/time/format.cc: Move to...
* testsuite/std/time/format/format.cc: ...here.
* testsuite/std/time/format/pr117085.cc: New test.
Jonathan Wakely [Fri, 27 Sep 2024 15:54:31 +0000 (16:54 +0100)]
libstdc++: Tweak %c formatting for chrono types
libstdc++-v3/ChangeLog:
* include/bits/chrono_io.h (__formatter_chrono::_M_c): Add
[[unlikely]] attribute to condition for missing %c format in
locale. Use %T instead of %H:%M:%S in fallback.
Jonathan Wakely [Tue, 24 Sep 2024 22:20:56 +0000 (23:20 +0100)]
libstdc++: Populate generic std::time_get's wide %c format [PR117135]
I missed out the __timepunct<wchar_t> specialization for the "generic"
implementation when defining the %c format in r15-4016-gc534e37faccf48.
libstdc++-v3/ChangeLog:
PR libstdc++/117135
* config/locale/generic/time_members.cc
(__timepunct<wchar_t>::_M_initialize_timepunc): Set
_M_date_time_format for C locale. Set %Ex formats to the same
values as the %x formats.
Jonathan Wakely [Sun, 13 Oct 2024 21:48:43 +0000 (22:48 +0100)]
libstdc++: Use std::move for iterator in ranges::fill [PR117094]
Input iterators aren't required to be copyable.
libstdc++-v3/ChangeLog:
PR libstdc++/117094
* include/bits/ranges_algobase.h (__fill_fn): Use std::move for
iterator that might not be copyable.
* testsuite/25_algorithms/fill/constrained.cc: Check
non-copyable iterator with sized sentinel.
middle-end: Fix ifcvt predicate generation for masked function calls
Up until now, due to a latent bug in the code for the ifcvt pass,
irrespective of the branch taken in a conditional statement, the
original condition for the if statement was used in masking the
function call.
This patch ensures that the correct predicate mask generation is
carried out such that, upon autovectorization, the correct vector
lanes are selected in the vectorized function call.
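A hedged illustration (not the actual testcase) of the kind of loop
affected: a call under a condition inside a vectorizable loop, where the
masked call must use the predicate of the branch the call actually sits on
(assuming e.g. -Ofast and a target providing a vectorized math routine):
  void
  f (double *a, const double *b, const int *c, int n)
  {
    for (int i = 0; i < n; ++i)
      {
        if (c[i] > 0)
          a[i] = __builtin_sin (b[i]);   // call masked by the if-converted predicate
        else
          a[i] = b[i];
      }
  }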
gcc/ChangeLog:
* tree-if-conv.cc (predicate_statements): Fix handling of
predicated function calls.
Jan Hubicka [Mon, 22 Jul 2024 21:01:50 +0000 (23:01 +0200)]
Fix handling of ECF_NOVOPS in ipa-modref
As shown in the somewhat convoluted testcase, ipa-modref is mistreating
ECF_NOVOPS as "having no side effects". This comes from a time when
modref cared only about memory accesses and thus it was possible to
shortcut on it.
This patch removes (hopefully) all those bad shortcuts.
Bootstrapped/regtested x86_64-linux, committed.
The testcase contains a VNx2QImode pseudo that is live across a call
and that cannot be allocated a call-preserved register. LRA quite
reasonably tried to save it before the call and restore it afterwards.
Unfortunately, the target told it to do that in SImode, even though
punning between SImode and VNx2QImode is disallowed by both
TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P.
The natural class to use for SImode is GENERAL_REGS, so this led
to an unsalvageable situation in which we had:
(set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B))
where A needed GENERAL_REGS and B needed FP_REGS. We therefore ended
up in a reload loop.
The hooks above should ensure that this situation can never occur
for incoming subregs. It only happened here because the target
explicitly forced it.
The decision to use SImode for modes smaller than 4 bytes dates
back to the beginning of the port, before 16-bit floating-point
modes existed. I'm not sure whether promoting to SImode really
makes sense for any FPR, but that's a separate performance/QoI
discussion. For now, this patch just disallows using SImode
when it is wrong for correctness reasons, since that should be
safer to backport.
gcc/
PR testsuite/116238
* config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode):
Only return SImode if we can convert to and from it.
gcc/testsuite/
PR testsuite/116238
* gcc.target/aarch64/sve/pr116238.c: New test.
Steve Baird [Mon, 8 Jul 2024 21:45:55 +0000 (14:45 -0700)]
ada: Type conversion in instance incorrectly rejected.
In some cases, a legal type conversion in a generic package is correctly
accepted but the corresponding type conversion in an instance of the generic
is incorrectly rejected.
gcc/ada/
PR ada/114593
* sem_res.adb (Valid_Conversion): Test In_Instance instead of
In_Instance_Body.
According to Intel SOM[1], for Crestmont, most 256-bit Intel AVX2
instructions can be decomposed into two independent 128-bit
micro-operations, except for a subset of Intel AVX2 instructions
known as cross-lane operations, which can only compute the result for an
element by utilizing one or more sources belonging to other elements.
The 256-bit instructions listed below use more operand sources than
can be natively supported by a single reservation station within these
microarchitectures. They are decomposed into two μops, where the first
μop resolves a subset of operand dependencies across two cycles. The
dependent second μop executes the 256-bit operation by using a single
128-bit execution port for two consecutive cycles with a five-cycle
latency for a total latency of seven cycles.
Instead of setting the avx128_optimal tune for SRF, the patch adds a new
tune avx256_avoid_vec_perm for it, so by default the vectorizer still
uses a 256-bit VF if the cost is profitable, but lowers it to 128-bit
whenever a 256-bit vec_perm is needed for auto-vectorization. Without
vec_perm, the performance of 256-bit vectorization should be similar to
128-bit (some benchmark results show it's even better than 128-bit
vectorization since it enables more parallelism for convert cases).
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
Add new member m_num_avx256_vec_perm.
(ix86_vector_costs::add_stmt_cost): Record 256-bit vec_perm.
(ix86_vector_costs::finish_cost): Prevent vectorization for
TARGET_AVX256_AVOID_VEC_PERM when there's a 256-bit vec_perm
instruction.
* config/i386/i386.h (TARGET_AVX256_AVOID_VEC_PERM): New
Macro.
* config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): Add
m_CORE_ATOM.
(X86_TUNE_AVX256_AVOID_VEC_PERM): New tune.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx256_avoid_vec_perm.c: New test.
For Crestmont, 4-operand VEX blendv instructions come from MSROM and
are slower than the 3-instruction sequence (op1 & mask) | (op2 & ~mask);
the legacy blendv instruction can still be handled by the decoder.
The patch adds a new tune which is enabled for all processors except
SRF/CWF. It will use vpand + vpandn + vpor instead of
vpblendvb (similar for vblendvps/vblendvpd) for SRF/CWF.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_sse_movcc): Guard
instruction blendv generation under new tune.
* config/i386/i386.h (TARGET_SSE_MOVCC_USE_BLENDV): New Macro.
* config/i386/x86-tune.def (X86_TUNE_SSE_MOVCC_USE_BLENDV):
New tune.
Richard Biener [Wed, 9 Oct 2024 09:42:59 +0000 (11:42 +0200)]
tree-optimization/117041 - fix load classification of former grouped load
When we first detect a grouped load but later dis-associate it we
only set DR_GROUP_FIRST_ELEMENT to NULL, indicating it is not a
STMT_VINFO_GROUPED_ACCESS but leave DR_GROUP_NEXT_ELEMENT set. This
causes a stray DR_GROUP_NEXT_ELEMENT access in get_group_load_store_type
to go wrong, indicating a load isn't single_element_p when it actually
is, leading to wrong classification and an ICE.
PR tree-optimization/117041
* tree-vect-stmts.cc (get_group_load_store_type): Only
check DR_GROUP_NEXT_ELEMENT for STMT_VINFO_GROUPED_ACCESS.
Richard Biener [Mon, 30 Sep 2024 11:38:28 +0000 (13:38 +0200)]
tree-optimization/116879 - failure to recognize non-empty latch
When we relaxed the vectorizer's constraint on loop structure, verifying
the emptiness of the latch became too loose, as can be seen in the case
of PR116879 where the latch effectively contains two basic blocks, one
of which is an unmerged forwarder that's not empty.
PR tree-optimization/116879
* tree-vect-loop.cc (vect_analyze_loop_form): Scan all
blocks that form the latch.
Richard Biener [Thu, 26 Sep 2024 13:41:59 +0000 (15:41 +0200)]
tree-optimization/116850 - corrupt post-dom info
Path isolation computes post-dominators on demand but can end up
splitting blocks after that, wrecking it. We can delay splitting
of blocks until we no longer need the post-dom info which is what
the following patch does to solve the issue.
PR tree-optimization/116850
* gimple-ssa-isolate-paths.cc (bb_split_points): New global.
(insert_trap): Delay BB splitting if post-doms are computed.
(find_explicit_erroneous_behavior): Process delayed BB
splitting after releasing post dominators.
(gimple_ssa_isolate_erroneous_paths): Do not free post-dom
info here.
The following reverts a bogus fix done for PR101009 and instead makes
sure we get into the same_access_functions () case when computing
the distance vector for g[1] and g[1] where the constants ended up
having different types. The generic code doesn't seem to handle
loop invariant dependences. The special case gets us both
( 0 ) and ( 1 ) as distance vectors while formerly we got ( 1 ),
which the PR101009 fix changed to ( 0 ) with bad effects on other
cases as shown in this PR.
PR tree-optimization/116768
* tree-data-ref.cc (build_classic_dist_vector_1): Revert
PR101009 change.
* tree-chrec.cc (eq_evolutions_p): Make sure (sizetype)1
and (int)1 compare equal.
Currently the forward threader isn't limited as to the search space
it explores and with it now using path-ranger for simplifying
conditions it runs into, it became pretty slow for degenerate cases
like compiling insn-emit.cc for RISC-V, especially when compiling for
a host with LOGICAL_OP_NON_SHORT_CIRCUIT disabled.
The following makes the forward threader honor the search space
limit I introduced for the backward threader. This reduces
compile-time from minutes to seconds for the testcase in PR116166.
Note this wasn't necessary before we had ranger but with ranger
the work we do is quadratic in the length of the threading path
we build up (the same is true for the backwards threader).
PR tree-optimization/116166
* tree-ssa-threadedge.h (jump_threader::thread_around_empty_blocks):
Add limit parameter.
(jump_threader::thread_through_normal_block): Likewise.
* tree-ssa-threadedge.cc (jump_threader::thread_around_empty_blocks):
Honor and decrement limit parameter.
(jump_threader::thread_through_normal_block): Likewise.
(jump_threader::thread_across_edge): Initialize limit from
param_max_jump_thread_paths and pass it down to workers.
Jakub Jelinek [Fri, 4 Oct 2024 11:12:45 +0000 (13:12 +0200)]
i386: Fix up ix86_expand_int_compare with TImode comparisons of SUBREGs from V8{H,B}Fmode against zero [PR116921]
The following testcase ICEs, because the ix86_expand_int_compare
optimization to use {,v}ptest assumes there are instructions for all
16-byte vector modes. That isn't the case, we only have one for
V16QI, V8HI, V4SI, V2DI, V1TI, V4SF and V2DF, not for
V8HF nor V8BF.
The following patch fixes that by using the V8HI instruction instead
for those 2 modes. tmp can't be a SUBREG, because it is SUBREG_REG
of another SUBREG, so we don't need to worry about gen_lowpart
failing.
2024-10-04 Jakub Jelinek <jakub@redhat.com>
PR target/116921
* config/i386/i386-expand.cc (ix86_expand_int_compare): Add a SUBREG
to V8HImode from V8HFmode or V8BFmode before generating a ptest.
Jakub Jelinek [Tue, 1 Oct 2024 07:52:20 +0000 (09:52 +0200)]
range-cache: Fix ranger ICE if number of bbs increases [PR116899]
Ranger cache during initialization reserves number of basic block slots
in the m_workback vector (in an inefficient way by doing create (0)
and safe_grow_cleared (n) and truncate (0) rather than say create (n)
or reserve (n) after create). The problem is that some passes split bbs and/or
create new basic blocks and so when one is unlucky, the quick_push into that
vector can ICE.
The following patch replaces those 4 quick_push calls with safe_push.
I've also gathered some statistics from compiling 63 gcc sources (picked those
that depend on gimple-range-cache.h just because I had to rebuild them once
for the instrumentation), and that showed that in 81% of cases nothing has
been pushed into the vector at all (and note, not everything was small, there
were even cases with 10000+ basic blocks), another 18.5% of cases had just 1-4
elements in the vector at most, 0.08% had 5-8 and 19 out of 305386 cases
had at most 9-11 elements, nothing more. So, IMHO reserving the number
of basic blocks in the vector is a waste of compile-time memory, and with
the safe_push calls the patch just initializes the vector to vNULL.
2024-10-01 Jakub Jelinek <jakub@redhat.com>
PR middle-end/116899
* gimple-range-cache.cc (ranger_cache::ranger_cache): Set m_workback
to vNULL instead of creating it, growing and then truncating.
(ranger_cache::fill_block_cache): Use safe_push rather than quick_push
on m_workback.
(ranger_cache::range_from_dom): Likewise.
Jakub Jelinek [Tue, 1 Oct 2024 07:49:49 +0000 (09:49 +0200)]
range-cache: Fix ICE on SSA_NAME with def_stmt not yet in the IL [PR116898]
Some passes like the bitint lowering queue some statements on edges and only
commit them at the end of the pass. If they use ranger at the same time,
the ranger might see such SSA_NAMEs and ICE on those. The following patch
instead just punts on them.
2024-10-01 Jakub Jelinek <jakub@redhat.com>
PR middle-end/116898
* gimple-range-cache.cc (ranger_cache::block_range): If a SSA_NAME
with NULL def_bb isn't SSA_NAME_IS_DEFAULT_DEF, return false instead
of failing assertion. Formatting fix.
Jakub Jelinek [Sun, 29 Sep 2024 19:52:32 +0000 (21:52 +0200)]
cselib: Discard useless locs of preserved VALUEs [PR116627]
remove_useless_values iteratively discards useless locs (locs of
cselib_val which refer to non-preserved VALUEs with no locations),
which in turn can make further values useless until no further VALUEs
are made useless and then discards the useless VALUEs.
Preserved VALUEs (something done during var-tracking only I think)
live in a different hash table, cselib_preserved_hash_table rather
than cselib_hash_table. cselib_find_slot first looks up slot in
cselib_preserved_hash_table and only if not found looks it up in
cselib_hash_table (and INSERTs only into the latter), whereas preservation
of a VALUE results in move of a cselib_val from the latter to the former
hash table.
The testcase in the PR (apparently too fragile, it only reproduces on 14
branch with various flags on a single arch, not on trunk) ICEs, because
we have a preserved VALUE (QImode with (const_int 0) as one of the locs).
In a different BB SImode r2 is looked up, a non-preserved VALUE is created
for it, and the code added in r13-2916 attempts to also look up SUBREGs of
that in narrower modes, among them QImode, so it adds to that SImode r2
non-preserved VALUE a new loc of (subreg:QI (value:SI) 0). That SImode
value is considered useless, so remove_useless_values discards it, but
nothing discarded it from the preserved VALUE's loc_list, so when looking
something up in the hash table we ICE trying to dereference the CSELIB_VAL
of the discarded VALUE.
I think we need to discard useless locs even from the preserved VALUEs.
That IMHO shouldn't create any further useless VALUEs, the preserved
VALUEs are never useless, so we don't need to iterate with it and can do
it just once, but IMHO it needs to be done before discard_useless_values
actually discards them.
The following patch does that.
2024-09-29 Jakub Jelinek <jakub@redhat.com>
PR target/116627
* cselib.cc (remove_useless_values): Discard useless locs
even from preserved cselib_vals in cselib_preserved_hash_table
hash table.
Jakub Jelinek [Fri, 20 Sep 2024 07:14:29 +0000 (09:14 +0200)]
i386: Fix up _mm_min_ss etc. handling of zeros and NaNs [PR116738]
The min/max patterns for intrinsics which on x86 return the second
input operand if the two operands are both zeros or if one or both of
them are a NaN shouldn't use SMIN/SMAX RTL, because, like for
MIN_EXPR/MAX_EXPR, it is undefined what the result will be in those cases.
The following patch adds an expander which uses either a new pattern with
UNSPEC_IEEE_M{AX,IN} or the S{MIN,MAX} representation of the same.
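For illustration, the IEEE semantics the intrinsic guarantees and that
SMIN/SMAX RTL does not (x86 only):
  #include <immintrin.h>
  #include <cstdio>

  int
  main ()
  {
    // minss returns the second source operand when both inputs are zeros
    // (of either sign) or when an operand is a NaN.
    __m128 a = _mm_set_ss (0.0f);
    __m128 b = _mm_set_ss (-0.0f);
    float r = _mm_cvtss_f32 (_mm_min_ss (a, b));
    std::printf ("%g\n", r);   // must print -0, the second operand
  }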
2024-09-20 Uros Bizjak <ubizjak@gmail.com>
Jakub Jelinek <jakub@redhat.com>
PR target/116738
* config/i386/subst.md (mask_scalar_operand_arg34,
mask_scalar_expand_op3, round_saeonly_scalar_mask_arg3): New
subst attributes.
* config/i386/sse.md
(<sse>_vm<code><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
Change from define_insn to define_expand, rename the old define_insn
to ...
(*<sse>_vm<code><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
... this.
(<sse>_ieee_vm<ieee_maxmin><mode>3<mask_scalar_name><round_saeonly_scalar_name>):
New define_insn.
Another spot where we mark_used a function (in this case ctor or dtor)
even when it is just artificially used inside of thunks (emitted on mingw
with -Os for the testcase).
2024-09-13 Jakub Jelinek <jakub@redhat.com>
PR c++/116678
* optimize.cc: Include decl.h.
(maybe_thunk_body): Temporarily change deprecated_state to
UNAVAILABLE_DEPRECATED_SUPPRESS.
Richard Ball [Thu, 10 Oct 2024 18:16:39 +0000 (19:16 +0100)]
aarch64: Alter pr116258.c test to correct for big endian.
The test at pr116258.c fails on big endian targets;
this is because the test checks that the index of a floating
point multiply is 0, which is correct only for little endian.
gcc/testsuite/ChangeLog:
PR tree-optimization/116258
* gcc.target/aarch64/pr116258.c:
Alter test to add big-endian support.
Martin Uecker [Tue, 17 Sep 2024 09:37:29 +0000 (11:37 +0200)]
c: fix crash when checking for compatibility of structures [PR116726]
When checking for compatibility of structure or union types in
tagged_types_tu_compatible_p, restore the old value of the pointer to
the top of the temporary cache after recursively calling comptypes_internal
when looping over the members of a structure or union. While the next
iteration of the loop overwrites the pointer, I missed the fact that it can
be accessed again when types of function arguments are compared as part
of recursive type checking and the function is entered again.
PR c/116726
gcc/c/ChangeLog:
* c-typeck.cc (tagged_types_tu_compatible_p): Restore value
of the cache after recursing into comptypes_internal.
Eric Botcazou [Wed, 4 Sep 2024 22:19:25 +0000 (00:19 +0200)]
ada: Fix wrong finalization of anonymous array aggregate
The issue arises when the aggregate consists only of iterated associations
because, in this case, its expansion uses a 2-pass mechanism which creates
a temporary that needs a fully-fledged initialization, thus running afoul
of the optimization that avoids building the initialization procedure in
the anonymous array case.
gcc/ada/ChangeLog:
* exp_aggr.ads (Is_Two_Pass_Aggregate): New function declaration.
* exp_aggr.adb (Is_Two_Pass_Aggregate): New function body.
(Expand_Array_Aggregate): Call Is_Two_Pass_Aggregate to detect the
aggregates that need the 2-pass expansion.
* exp_ch3.adb (Expand_Freeze_Array_Type): In the anonymous array
case, build the initialization procedure if the initial value in
the object declaration is a 2-pass aggregate.
Eric Botcazou [Wed, 11 Sep 2024 17:37:08 +0000 (19:37 +0200)]
ada: Fix negative value returned by 'Image for array with nonnegative component
The problem is that Exp_Put_Image.Build_Elementary_Put_Image_Call uses the
signedness of the base type but the size of the first subtype, hence the
discrepancy between them.
gcc/ada/ChangeLog:
PR ada/115535
* exp_put_image.adb (Build_Elementary_Put_Image_Call): Use the size
of the underlying type to find the support type.
Eric Botcazou [Wed, 11 Sep 2024 17:26:18 +0000 (19:26 +0200)]
ada: Fix internal error on elsif part of if-statement containing if-expression
The problem occurs when the compiler is trying to find a context to which
it can hoist finalization actions coming from the if-expression, because
Find_Hook_Context incorrectly returns the N_Elsif_Part node.
gcc/ada/ChangeLog:
PR ada/114640
* exp_util.adb (Find_Hook_Context): For a node present within a
conditional expression, do not return an N_Elsif_Part node.
Eric Botcazou [Wed, 11 Sep 2024 17:42:03 +0000 (19:42 +0200)]
ada: Fix bogus error in instantiation with formal package
The compiler reports that an actual does not match the formal when there
is a defaulted formal discrete type because Check_Formal_Package_Instance
fails to skip the implicit base type generated by the compiler.
gcc/ada/ChangeLog:
PR ada/114636
* sem_ch12.adb (Check_Formal_Package_Instance): For a defaulted
formal discrete type, skip the generated implicit base type.
Jan Beulich [Tue, 8 Oct 2024 14:05:33 +0000 (16:05 +0200)]
x86/{,V}AES: adjust when to force EVEX encoding
Commit a79d13a01f8c ("i386: Fix aes/vaes patterns [PR114576]") correctly
said "..., but we need to emit {evex} prefix in the assembly if AES ISA
is not enabled". Yet it did so only for the TARGET_AES insns. Going from
the alternative chosen in the TARGET_VAES insns isn't quite right: If
AES is (also) enabled, EVEX encoding would needlessly be forced.
AVR: target/116953 - ICE due to operands clobber in avr_out_sbxx_branch.
PR target/116953
gcc/
* config/avr/avr.cc (avr_out_sbxx_branch): Work on a copy of
the operands rather than on operands itself, which is just
recog_data.operand and may be clobbered by jump_over_one_insn_p.
gcc/testsuite/
* gcc.target/avr/torture/pr116953.c: New test.
Eric Botcazou [Fri, 4 Oct 2024 09:27:33 +0000 (11:27 +0200)]
Fix crash with subunit of local package
This is a regression present on the 14 branch only: the expander gets
confused when trying to insert the finalizer of a procedure that contains
a package as a subunit. The offending code no longer exists on the
mainline so this adds the minimal fix to address the issue.
gcc/ada
PR ada/116430
* exp_ch7.adb (Build_Finalizer.Create_Finalizer): For the insertion
point of the finalizer, deal with package bodies that are subunits.
Jonathan Wakely [Wed, 21 Aug 2024 11:29:32 +0000 (12:29 +0100)]
libstdc++: Make debug sequence members mutable [PR116369]
We need to be able to attach debug mode iterators to const containers,
so the safe iterator constructor uses const_cast to get a modifiable
pointer to the container. If the container was defined as const, that
const_cast to access its members results in undefined behaviour. PR
116369 shows a case where it results in a segfault because the container
is in a rodata section (which shouldn't have happened, but the undefined
behaviour in the library still exists in any case).
This makes the _M_iterators and _M_const_iterators data members mutable,
so that it's safe to modify them even if the declared type of the
container is a const type.
Ideally we would not need the const_cast at all. Instead, the _M_attach
member (and everything it calls) should be const-qualified. That would
work fine now, because the members that it ends up modifying are
mutable. Making that change would require a number of new exports from
the shared library, and would require retaining the old non-const member
functions (maybe as symbol aliases) for backwards compatibility. That
might be worth changing at some point, but isn't done here.
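For illustration, a hedged sketch (only meaningful when compiled with
-D_GLIBCXX_DEBUG) of attaching a debug-mode iterator to a container whose
declared type is const:
  #include <vector>

  const std::vector<int> v{1, 2, 3};   // declared type of the container is const

  int
  main ()
  {
    // In debug mode, begin() attaches a safe iterator to v; that bookkeeping
    // now only touches the mutable _M_const_iterators member.
    return *v.begin () == 1 ? 0 : 1;
  }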
Jonathan Wakely [Wed, 28 Aug 2024 11:38:18 +0000 (12:38 +0100)]
libstdc++: Fix autoconf check for O_NONBLOCK in <fcntl.h>
I misused the AC_CHECK_DECL macro, assuming that it behaved like
AC_CHECK_DECLS and always defined a HAVE_xxx macro if the decl was
found. Instead, the [action-if-found] shell commands are needed to
define HAVE_O_NONBLOCK explicitly.
Jonathan Wakely [Tue, 11 Jun 2024 15:45:43 +0000 (16:45 +0100)]
libstdc++: Fix std::codecvt<wchar_t, char, mbstate_t> for empty dest [PR37475]
For the GNU locale model, codecvt::do_out and codecvt::do_in incorrectly
return 'ok' when the destination range is empty. That happens because
detecting incomplete output is done in the loop body, and the loop is
never even entered if to == to_end.
By restructuring the loop condition so that we check the output range
separately, we can ensure that for a non-empty source range, we always
enter the loop at least once, and detect if the destination range is too
small.
The loops also seem easier to reason about if we return immediately on
any error, instead of checking the result twice on every iteration. We
can use an RAII type to restore the locale before returning, which also
simplifies all the other member functions.
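For illustration, a hedged sketch of the behaviour being fixed: converting
a non-empty source into an empty destination range should yield partial,
not ok:
  #include <cwchar>
  #include <locale>

  int
  main ()
  {
    using CVT = std::codecvt<wchar_t, char, std::mbstate_t>;
    const CVT &cvt = std::use_facet<CVT> (std::locale::classic ());

    const wchar_t src[] = L"x";
    const wchar_t *from_next;
    char buf[1];
    char *to_next;
    std::mbstate_t st{};
    // Empty destination range [buf, buf) but a non-empty source.
    auto res = cvt.out (st, src, src + 1, from_next, buf, buf, to_next);
    return res == CVT::partial ? 0 : 1;
  }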
libstdc++-v3/ChangeLog:
PR libstdc++/37475
* config/locale/gnu/codecvt_members.cc (Guard): New RAII type.
(do_out, do_in): Return partial if the destination is empty but
the source is not. Use Guard to restore locale on scope exit.
Return immediately on any conversion error.
(do_encoding, do_max_length, do_length): Use Guard.
* testsuite/22_locale/codecvt/in/char/37475.cc: New test.
* testsuite/22_locale/codecvt/in/wchar_t/37475.cc: New test.
* testsuite/22_locale/codecvt/out/char/37475.cc: New test.
* testsuite/22_locale/codecvt/out/wchar_t/37475.cc: New test.
Jonathan Wakely [Tue, 24 Sep 2024 22:20:56 +0000 (23:20 +0100)]
libstdc++: Populate std::time_get::get's %c format for C locale
We were using the empty string "" for D_T_FMT and ERA_D_T_FMT in the C
locale, instead of "%a %b %e %T %Y" as the C standard requires. Set it
correctly for each locale implementation that defines time_members.cc.
We can also explicitly set the _M_era_xxx pointers to the same values as
the corresponding _M_xxx ones, rather than setting them to point to
identical string literals. This doesn't rely on the compiler merging
string literals, and makes it more explicit that they're the same in the
C locale.
libstdc++-v3/ChangeLog:
* config/locale/dragonfly/time_members.cc
(__timepunct<char>::_M_initialize_timepunc)
(__timepunct<wchar_t>::_M_initialize_timepunc): Set
_M_date_time_format for C locale. Set %Ex formats to the same
values as the %x formats.
* config/locale/generic/time_members.cc: Likewise.
* config/locale/gnu/time_members.cc: Likewise.
* testsuite/22_locale/time_get/get/char/5.cc: New test.
* testsuite/22_locale/time_get/get/wchar_t/5.cc: New test.
Jonathan Wakely [Thu, 6 Jun 2024 10:50:06 +0000 (11:50 +0100)]
libstdc++: Work around some PSTL test failures for debug mode [PR90276]
This addresses one known failure due to a bug in the upstream tests, and
a number of timeouts due to the algorithms running much more slowly with
debug mode checks enabled.
libstdc++-v3/ChangeLog:
PR libstdc++/90276
* testsuite/25_algorithms/pstl/alg_sorting/partial_sort.cc
[_GLIBCXX_DEBUG]: Add xfail-run-if for debug mode.
* testsuite/25_algorithms/pstl/alg_nonmodifying/nth_element.cc
[_GLIBCXX_DEBUG]: Reduce size of test data.
* testsuite/25_algorithms/pstl/alg_sorting/includes.cc:
Likewise.
* testsuite/25_algorithms/pstl/alg_sorting/set_util.h:
Likewise.
I noticed that char8_t was missing from the list of types that were
prevented from using the std::formatter partial specialization for
integer types. That partial specialization was also matching
cv-qualified integer types, because std::integral<const int> is true.
This change simplifies the constraints by introducing a new variable
template which is only true for cv-unqualified integer types, with
explicit specializations to exclude the character types. This should be
slightly more efficient than the previous constraints that checked
std::integral<T> and (!__is_one_of<T, char, wchar_t, ...>). It also
avoids the need for a separate std::formatter specialization for 128-bit
integers, as they can be handled by the new variable template too.
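A hedged sketch of the property the new variable template expresses
(illustrative names, not the actual libstdc++ identifiers, which also
handle 128-bit integers):
  #include <concepts>
  #include <type_traits>

  // Only cv-unqualified integer types qualify; character types are excluded
  // by explicit specializations.
  template<typename T>
  constexpr bool formattable_integer
    = std::integral<T> && std::same_as<T, std::remove_cv_t<T>>;

  template<> constexpr bool formattable_integer<char>     = false;
  template<> constexpr bool formattable_integer<wchar_t>  = false;
  template<> constexpr bool formattable_integer<char8_t>  = false;
  template<> constexpr bool formattable_integer<char16_t> = false;
  template<> constexpr bool formattable_integer<char32_t> = false;

  static_assert ( formattable_integer<int>);
  static_assert (!formattable_integer<char8_t>);
  static_assert (!formattable_integer<const int>);

  int main () { }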
libstdc++-v3/ChangeLog:
* include/std/format (__format::__is_formattable_integer): New
variable template and specializations.
(template<integral, __char> struct formatter): Replace
constraints on first arg with __is_formattable_integer.
* testsuite/std/format/formatter/requirements.cc: Check that
std::formatter specializations for char8_t and const int are
disabled.
Jonathan Wakely [Wed, 18 Sep 2024 16:20:29 +0000 (17:20 +0100)]
libstdc++: Fix formatting of most negative chrono::duration [PR116755]
When formatting chrono::duration<signed-integer-type, P>::min() we were
causing undefined behaviour by trying to form the negative of the most
negative value. If we convert negative durations with integer rep to the
corresponding unsigned integer rep then we can safely represent all
values.
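A minimal illustration (hedged; the exact output text is an assumption) of
the boundary value involved; before the fix, formatting it negated the most
negative count and overflowed:
  #include <chrono>
  #include <format>
  #include <iostream>

  int
  main ()
  {
    auto d = std::chrono::duration<int>::min ();   // count is INT_MIN
    // Safe with the fix: the count is converted to the unsigned rep before
    // the '-' sign is emitted.
    std::cout << std::format ("{}", d) << '\n';    // prints -2147483648s
  }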
libstdc++-v3/ChangeLog:
PR libstdc++/116755
* include/bits/chrono_io.h (formatter<duration<R,P>>::format):
Cast negative integral durations to unsigned rep.
* testsuite/20_util/duration/io.cc: Test the most negative
integer durations.
Richard Biener [Wed, 18 Sep 2024 07:52:55 +0000 (09:52 +0200)]
tree-optimization/116585 - SSA corruption with split_constant_offset
split_constant_offset when looking through SSA defs can end up
picking SSA leafs that are subject to abnormal coalescing. This
can lead downstream consumers to insert code based on the
result (like from dataref analysis) in places that violate constraints
for abnormal coalescing. It's best to not expand defs whose operands
are subject to abnormal coalescing, and to not do anything either when
a subexpression already has operands like that.
PR tree-optimization/116585
* tree-data-ref.cc (split_constant_offset_1): When either
operand is subject to abnormal coalescing do no further
processing.
Jason Merrill [Sun, 15 Sep 2024 11:50:04 +0000 (13:50 +0200)]
c++: -Wdangling-reference and empty class [PR115361]
We can't have a dangling reference to an empty class unless it's
specifically to that class or one of its bases. This was giving a
false positive on the _ExtractKey pattern in libstdc++ hashtable.h.
This also adjusts the order of arguments to reference_related_p, which
is relevant for empty classes (unlike scalars).
Several of the classes in the testsuite needed to gain data members to
continue to warn.
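A hedged sketch (illustrative, not the hashtable.h code itself) of the
_ExtractKey-like pattern: a temporary of an empty class returning a
reference to its argument, which must not be flagged as the source of a
dangling reference:
  struct extract_key
  {
    template<typename T>
    const T &operator() (const T &t) const { return t; }
  };

  int
  main ()
  {
    int i = 42;
    // The temporary extract_key{} is an empty class, so r cannot dangle
    // into it and -Wdangling-reference should not warn here.
    const int &r = extract_key{} (i);
    return r == 42 ? 0 : 1;
  }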
This fixes another false positive. When a function is taking a
temporary of scalar type that couldn't be bound to the return type
of the function, don't warn; such a program would be ill-formed.
Thanks to Jonathan for reporting the problem.
PR c++/115987
gcc/cp/ChangeLog:
* call.cc (do_warn_dangling_reference): Don't consider a
temporary with a scalar type that cannot bind to the return type.
gcc/testsuite/ChangeLog:
* g++.dg/ext/attr-no-dangling6.C: Adjust.
* g++.dg/ext/attr-no-dangling7.C: Likewise.
* g++.dg/warn/Wdangling-reference22.C: New test.
Use iterative PTA definitions for members of the same AMD processor family.
Also, fix a couple of related M_CPU_TYPE/M_CPU_SUBTYPE inconsistencies.
No functional changes intended.
gcc/ChangeLog:
* config/i386/i386.h: Add PTA_BDVER1, PTA_BDVER2, PTA_BDVER3,
PTA_BDVER4, PTA_BTVER1 and PTA_BTVER2.
* common/config/i386/i386-common.cc (processor_alias_table)
<"bdver1">: Use PTA_BDVER1.
<"bdver2">: Use PTA_BDVER2.
<"bdver3">: Use PTA_BDVER3.
<"bdver4">: Use PTA_BDVER4.
<"btver1">: Use PTA_BTVER1. Use M_CPU_TYPE (AMD_BTVER1).
<"btver2">: Use PTA_BTVER2.
<"shanghai>: Use M_CPU_SUBTYPE (AMDFAM10H_SHANGHAI).
<"istanbul>: Use M_CPU_SUBTYPE (AMDFAM10H_ISTANBUL).
Jan Hubicka [Tue, 3 Sep 2024 16:20:34 +0000 (18:20 +0200)]
Zen5 tuning part 4: update reassociation width
Zen5 has 6 instead of 4 ALUs and the integer multiplication can now execute in
3 of them. FP units can do 2 additions and 2 multiplications with latency 2
and 3. This patch updates reassociation width accordingly. This has the
potential of increasing register pressure, but unlike when benchmarking
znver1 tuning I did not notice this actually causing problems on SPEC,
so this patch bumps up the reassociation width to 6 for everything
except integer vectors, where there are 4 units with a typical latency of 1.
Jan Hubicka [Tue, 3 Sep 2024 14:26:16 +0000 (16:26 +0200)]
Zen5 tuning part 3: scheduler tweaks
This patch adds support for a new fusion in znver5 documented in the
optimization manual:
The Zen5 microarchitecture adds support to fuse reg-reg MOV Instructions
with certain ALU instructions. The following conditions need to be met for
fusion to happen:
- The MOV should be reg-reg mov with Opcode 0x89 or 0x8B
- The MOV is followed by an ALU instruction where the MOV and ALU destination register match.
- The ALU instruction may source only registers or immediate data. There cannot be any memory source.
- The ALU instruction sources either the source or dest of MOV instruction.
- If ALU instruction has 2 reg sources, they should be different.
- The following ALU instructions can fuse with an older qualified MOV instruction:
ADD ADC AND XOR OP SUB SBB INC DEC NOT SAL / SHL SHR SAR
(I assume OP is OR)
I also increased issue rate from 4 to 6. Theoretically znver5 can do more, but
with our model we can't really use it.
Increasing the issue rate to 8 leads to an infinite loop in the scheduler.
Finally, I also enabled fuse_alu_and_branch since it is supported by
znver5 (I think by earlier zens too).
The new fusion pattern moves quite a few instructions around in common code:
@@ -2210,13 +2210,13 @@
.cfi_offset 3, -32
leaq 63(%rsi), %rbx
movq %rbx, %rbp
+ shrq $6, %rbp
+ salq $3, %rbp
subq $16, %rsp
.cfi_def_cfa_offset 48
movq %rdi, %r12
- shrq $6, %rbp
- movq %rsi, 8(%rsp)
- salq $3, %rbp
movq %rbp, %rdi
+ movq %rsi, 8(%rsp)
call _Znwm
movq 8(%rsp), %rsi
movl $0, 8(%r12)
@@ -2224,8 +2224,8 @@
movq %rax, (%r12)
movq %rbp, 32(%r12)
testq %rsi, %rsi
- movq %rsi, %rdx
cmovns %rsi, %rbx
+ movq %rsi, %rdx
sarq $63, %rdx
shrq $58, %rdx
sarq $6, %rbx
which should help decoder bandwidth and perhaps also the cache, though I was
not able to measure an off-noise effect on SPEC.
gcc/ChangeLog:
* config/i386/i386.h (TARGET_FUSE_MOV_AND_ALU): New tune.
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Updat for znver5.
(ix86_adjust_cost): Add TODO about znver5 memory latency.
(ix86_fuse_mov_alu_p): New.
(ix86_macro_fusion_pair_p): Use it.
* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): Add ZNVER5.
(X86_TUNE_FUSE_MOV_AND_ALU): New tune;
Jan Hubicka [Tue, 3 Sep 2024 13:07:41 +0000 (15:07 +0200)]
Zen5 tuning part 2: disable gather and scatter
We disable gathers for zen4. It seems that gather has improved a bit compared
to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
the indices are known ahead of time. Vector loads followed by shuffles result
in a higher load bandwidth." however the situation seems to be more
complicated.
Gather is a 5-10% loss on the parest benchmark as well as a 30% loss on sparse
dot products in TSVC. Curiously enough, breaking these out into a
microbenchmark reversed the situation, and it turns out that the performance
depends on how the indices are distributed: gather is a loss if the indices
are sequential, neutral if they are random, and a win for some strides (4, 8).
This seems to be similar to earlier zens, so I think (especially for
backporting znver5 support) that it makes sense to be consistent and disable
gather unless we work out a good heuristic on when to use it. Since we
typically do not know the indices in advance, I don't see how that can be done.
I opened PR116582 with some examples of wins and losses.
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
ZNVER5.
(X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
(X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
(X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
Jan Hubicka [Tue, 3 Sep 2024 11:38:33 +0000 (13:38 +0200)]
Zen5 tuning part 1: avoid FMA chains
Testing matrix multiplication benchmarks shows that FMA on a critical chain
is a performance loss over separate multiply and add. While the latency of 4
is lower than multiply + add (3+2), the problem is that all values need to
be ready before the computation starts.
While on znver4 AVX512 code fared well with FMA, it was because of the split
registers. Znver5 benefits from avoiding FMA on all widths. This may be
different with the mobile version though.
On a naive matrix multiplication benchmark the difference is 8% with -O3
only, since with -Ofast loop interchange solves the problem differently.
It is a 30% win, for example, on S323 from TSVC:
if address override is used to avoid the invalid memory operand like
cmpl %fs:previous_emax@dtpoff(%eax), %r12d
gcc/
PR target/116839
* config/i386/i386.cc (ix86_rewrite_tls_address_1): Make it
static. Return if TLS address is thread register plus an integer
register.
gcc/testsuite/
PR target/116839
* gcc.target/i386/pr116839.c: New file.
Currently subregs originating from *tf_to_fprx2_0 and *tf_to_fprx2_1
survive register allocation. This in turn leads to wrong register
renaming. Keeping the current approach would mean we need two insns for
*tf_to_fprx2_0 and *tf_to_fprx2_1, respectively. Something along the
lines
and similar for *tf_to_fprx2_1. Note, pre register allocation operand 0
has mode FPRX2 and afterwards DF once subregs have been eliminated.
Since we always copy a whole vector register into a floating-point
register pair, another way to fix this is to merge *tf_to_fprx2_0 and
*tf_to_fprx2_1 into a single insn which means we don't have to use
subregs at all. The downside of this is that the assembler template
contains two instructions, now. The upside is that we don't have to
come up with some artificial insn before RA which might be more
readable/maintainable. That is implemented by this patch.
In commit r11-4872-ge627cda5686592, the output operand specifier %V was
introduced, which is now used only in tf_to_fprx2. Instead of coming up
with its counterpart %F for floating-point registers, which would also
only be used in tf_to_fprx2, I print the operands directly. This
renders %V unused which is why it is removed by this patch.
Ensure for AQ and AR constraints that the resulting displacement after
adding any positive offset less than the size of the object being
referenced is still valid.
gcc/ChangeLog:
* config/s390/s390.cc (s390_mem_constraint): Check displacement
for AQ and AR constraints.