Andrew Pinski [Fri, 27 Mar 2026 22:42:16 +0000 (15:42 -0700)]
phiprop: Move the check on vuse before the dominator tests
This again is a small optimization of the order of the checks here.
The dominator tests no longer determine whether the propagation can
happen, so putting them after the tests that can reject the
propagation is a good thing.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-phiprop.cc (propagate_with_phi): Move vuse checks
before the dominator tests.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Fri, 27 Mar 2026 22:25:13 +0000 (15:25 -0700)]
phiprop: Factor out the vdef check into new function
This is just a small cleanup and should make the code easier
to understand. It should also make it easier to allow skipping
over some store statements that don't affect the
variable being loaded.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-phiprop.cc (propagate_with_phi): Factor out
checking the load for vdef to ....
(can_move_into_conditional): Here.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Andrew Pinski [Thu, 9 Apr 2026 19:40:22 +0000 (12:40 -0700)]
testsuite: Add phiprop testcase that is already fixed [PR116823]
This testcase was extracted from fold-const.cc but was fixed
by r16-4212-gf256a13f8aed83 which removed the clobber.
Since this is fixed separately from the other improvements,
it is in a separate patch.
Philipp Tomsich [Tue, 28 Apr 2026 17:22:35 +0000 (11:22 -0600)]
[4/6] fold-mem-offsets: Move RISC-V size-optimization workaround to the backend
The fold-mem-offsets pass contained a target-specific workaround that
skipped basic blocks optimized for size, to avoid conflicting with
RISC-V's shorten-memrefs pass. This penalized all targets.
Move the workaround to the RISC-V backend by disabling fold-mem-offsets
via SET_OPTION_IF_UNSET in riscv_option_override when optimizing for
size with compressed instructions enabled (the same condition that gates
the shorten-memrefs pass). This preserves the RISC-V behavior while
allowing other targets to fold offsets in size-optimized blocks.
gcc/ChangeLog:
* fold-mem-offsets.cc (pass_fold_mem_offsets::execute): Remove
optimize_bb_for_size_p check.
* config/riscv/riscv.cc (riscv_option_override): Disable
flag_fold_mem_offsets when optimizing for size with compressed
instructions.
This patch supports the RISC-V Zalasr [1] (load-acquire/store-release) extension. Based on Edwin Lu's old patch:
https://patchwork.sourceware.org/project/gcc/patch/20250410214940.2712673-1-ewlu@rivosinc.com/
Implements TARGET_MEMTAG_CAN_TAG_ADDRESSES and TARGET_MEMTAG_TAG_BITSIZE
for the RISC-V back end, allowing -fsanitize=hwaddress if the target
machine supports the pointer masking extension.
------
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_can_tag_addresses): New function.
(RISCV_HWASAN_TAG_SIZE): New definition.
(riscv_memtag_tag_bitsize): New function.
(TARGET_MEMTAG_CAN_TAG_ADDRESSES): New definition.
(TARGET_MEMTAG_TAG_BITSIZE): Likewise.
[HWASAN] [RISC-V] Update EnableTaggingAbi for RISC-V linux. (#176616)
Cherry-picked from LLVM commit: 32d21326f3b60874fd72bbe509c06dbe5b729a32
Enabling pointer tagging in the userspace ABI for RISC-V kernels differs
from that of AArch64. It requires requesting a particular number of masked
pointer bits; an error is returned if the platform cannot accommodate
the request:
https://docs.kernel.org/arch/riscv/uabi.html#pointer-masking
While experimenting with enabling RISC-V HWASAN on GCC I was hitting the
error
> HWAddressSanitizer failed to enable tagged address syscall ABI
when attempting to run instrumented programs in the spike simulator
running kernel release 6.18. This patch successfully allows the tagged
address syscall ABI to be enabled by the support runtime.
Jeff Law [Tue, 28 Apr 2026 16:07:07 +0000 (10:07 -0600)]
[RISC-V][PR tree-optimization/94892] Improve equality test of sign bit splat against zero
One of the tests in pr94892 showed a case where we failed to convert a
sign bit splat + equality test against zero into a simple lt/ge test which
doesn't require the sign bit splat.
This is only failing on rv64, probably because the case in question has
a DI sign bit splat, then we take a lowpart SI subreg. The lowpart
dance isn't needed for rv32, though I've structured the test to verify
that we get sensible code on rv32 as well as rv64.
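As a hedged sketch (not the testcase from the PR), the idiom in C looks
roughly like this, assuming 64-bit long as on rv64:
  /* x >> 63 splats the sign bit: all ones if x is negative, zero
     otherwise, so the equality test against zero reduces to a simple
     signed comparison.  */
  int f (long x)
  {
    return (x >> 63) == 0;   /* becomes: return x >= 0;  */
  }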
Like many other patches I'm submitting now, this has been in my tester
for a while, but the test has not. I'll be waiting on the pre-commit
tester to verify sanity before moving forward. I'm particularly
interested to see how it behaves with no -march flags. It should be
taking the defaults from when the toolchain was built, which should do
what we want.
Jeff
PR tree-optimization/94892
gcc/
* config/riscv/riscv.md (sign_bit_splat_equality_test): New pattern.
Tomasz Kamiński [Tue, 28 Apr 2026 14:01:47 +0000 (16:01 +0200)]
libstdc++: Make pointer_traits::pointer_to constexpr for main template.
This resolves LWG3454, "pointer_traits::pointer_to should be constexpr",
accepted in Kona 2025.
The change is applied since C++20, i.e. the standard in which pointer_to
was made constexpr for the T* specialization.
libstdc++-v3/ChangeLog:
* include/bits/ptr_traits.h (__ptr_traits_ptr_to::pointer_to):
Define as constexpr since C++20.
* testsuite/20_util/pointer_traits/pointer_to_constexpr.cc:
New test for custom pointer-like type.
libgomp.fortran/map-subarray-6.f90: Fix and robustify
Changes:
* Actually initialize the proper variable.
* Handle the three cases explicitly: self mapping/host fallback, mapping
but host accessible, and mapping and (potentially) not host accessible.
Hence, remove 'dg-shouldfail', as the code should now always run.
* Add more checks that no pointer attachment happens, using values outside
the mapped range.
* Add several comments and handle the case that 'tgt' is actually removed
during gimplification as unused. (Two cases: once checking the result with
'tgt' removed, and once using 'tgt'/'tgt2' in the target region and then
checking the result.)
libgomp/ChangeLog:
* testsuite/libgomp.fortran/map-subarray-6.f90: Fix, extend, and
robustify.
Richard Biener [Thu, 5 Mar 2026 10:20:44 +0000 (11:20 +0100)]
Avoid live code-generation for stmts kept as scalars
The following avoids trying to code-generate live lane extracts for
scalar defs that we have to keep anyway because they are used in
SLP graph leaves as extern inputs.
This resolves the known cases of one of the workarounds in live
code-generation.
* tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Do not
attempt to live code-generate defs that are kept in scalar
form anyway.
* tree-vect-loop.cc (vectorizable_live_operation): Update
comment.
Richard Biener [Tue, 3 Mar 2026 14:09:22 +0000 (15:09 +0100)]
Cost each BB vect live lane only once
The following makes sure to cost live scalar stmts appearing in multiple
SLP nodes only once and code-generate them from the SLP node we verified
we can replace all scalar uses from.
* tree-vectorizer.h (_slp_tree::live_lanes): New vector.
(SLP_TREE_LIVE_LANES): New.
* tree-vect-loop.cc (vectorizable_live_operation): Append
to SLP_TREE_LIVE_LANES.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize
SLP_TREE_LIVE_LANES.
(_slp_tree::~_slp_tree): Release SLP_TREE_LIVE_LANES.
(vect_print_slp_tree): Adjust live lane dumping, indicating
the SLP node a lane is code generated from.
(vect_bb_slp_mark_live_stmts): No longer verify we can
code-generate from all SLP nodes but at least one, picking
the first.
* tree-vect-stmts.cc (vect_transform_stmt): Iterate over
SLP_TREE_LIVE_LANES.
(vect_analyze_stmt): Also analyze reductions for live
lanes.
The following uses the vector coverage indicated by SLP_TREE_TYPE to
improve and simplify BB vector scalar costing, finally handling SLP
patterns properly.
PR tree-optimization/124222
* tree-vect-slp.cc (vect_slp_gather_vectorized_scalar_stmts): Remove.
(vect_bb_slp_scalar_cost): Simplify by using SLP_TREE_TYPE and
a use-def walk of the scalar stmts SSA uses.
(vect_bb_vectorization_profitable_p): Simplify.
* gcc.dg/vect/costmodel/x86_64/costmodel-pr124222.c: New testcase.
Richard Biener [Mon, 2 Mar 2026 13:53:04 +0000 (14:53 +0100)]
Simplify vect_bb_slp_mark_live_stmts
The following uses the full scalar stmt coverage now denoted by
SLP_TREE_TYPE to simplify computing STMT_VINFO_LIVE_P for code
generation of live lanes.
Richard Biener [Tue, 3 Mar 2026 12:48:17 +0000 (13:48 +0100)]
Re-do vect_mark_slp_stmts to compute full scalar stmt coverage
The following re-purposes STMT_SLP_TYPE for BB vectorization to indicate
the scalar (non-pattern) stmt coverage of the vectorized SLP graph.
This will allow for simpler and more precise determining of live lanes
and scalar costing.
* tree-vect-slp.cc (vect_slp_analyze_bb_1): Split out pure_slp
marking into ...
(vect_bb_slp_mark_stmts_vectorized): ... new function. Compute
full scalar stmt coverage of the SLP graph.
(vect_slp_gather_extern_scalar_stmts): New helper.
(vect_bb_slp_mark_live_stmts): Adjust.
* tree-vect-loop.cc (vectorizable_live_operation): Likewise.
Richard Biener [Mon, 2 Mar 2026 14:10:14 +0000 (15:10 +0100)]
Move BB analysis code to make flow more obvious
The following moves BB vect live stmt marking out of
vect_slp_analyze_operations to vect_slp_analyze_bb_1, and moves SLP stmt
marking, which marks some vectorized stmts as PURE_SLP, right before it,
the only remaining consumer.
* tree-vect-slp.cc (vect_slp_analyze_operations): Move
vect_bb_slp_mark_live_stmts call ...
(vect_slp_analyze_bb_1): ... here. Move SLP stmt marking
right before it.
(vect_mark_slp_stmts): Remove unused overload.
i386: Avoid redundant classify_argument call in construct_container
In construct_container, remove the early call to classify_argument.
The examine_argument function already invokes classify_argument internally
and returns early if the argument must be passed in memory, making the
initial call in construct_container redundant. Recompute the
classification only when needed and assert that it succeeds.
While there, change the type of the in_return parameter from int to
bool in examine_argument and construct_container, and adjust all call
sites accordingly.
No functional change intended.
gcc/
* config/i386/i386.cc (construct_container): Remove redundant
early call to classify_argument. Recompute the classification
only when needed and assert that it succeeds.
(examine_argument): Change in_return parameter type to bool.
(function_arg_advance_64): Update call to examine_argument.
(function_arg_64): Update call to construct_container.
(function_value_64): Likewise.
(ix86_return_in_memory): Update call to examine_argument.
(ix86_gimplify_va_arg): Update calls to construct_container
and examine_argument.
Bohan Lei [Sat, 28 Feb 2026 02:41:32 +0000 (10:41 +0800)]
RISC-V: Specify -mcpu if --with-cpu is used
Commit 5be645a introduced support for --with-cpu. The current
implementation specifies an `-march` option based on the default cpu
value. This behavior is not consistent with how the `-mcpu` option
works, however, as the `-mcpu` option involves both `-march` and
`-mtune`. Only setting the `-march` value can be confusing for users,
who may expect the --with-cpu option to act the same way as an explicit
`-mcpu` option.
If it is some design choice, though, I would be glad to know.
Thanks,
Bohan
gcc/ChangeLog:
* config/riscv/riscv.h (OPTION_DEFAULT_SPECS): Specify -mcpu
instead of -march when --with-cpu is used.
Bohan Lei [Tue, 31 Mar 2026 07:27:05 +0000 (15:27 +0800)]
simplify-rtx: Simplify (cmp (and/ior x C1) C2)
This is v6 of
https://gcc.gnu.org/pipermail/gcc-patches/2026-March/711809.html, fixed
and enhanced as was suggested by Philipp and Andrew. The previous
version:
https://gcc.gnu.org/pipermail/gcc-patches/2026-April/714878.html.
Andrew noticed the x86 regression and suggested updating the testcase.
This patch adds missing simplifications for (cmp (and/ior x C1) C2) in
special cases. In the AND case, when (and C1 C2) is not equal to C1,
some bits set in C2 are not set in C1, and thus (eq (and x C1) C2) can
never be true. The OR case is similar when (and C1 C2) is not equal to
C2. As we know that the result of (and x C1) cannot be greater than C1,
and that the result of (ior x C1) cannot be less than C1 for unsigned
integers, the LTU, LEU, GTU, GEU cases can be optimized, too.
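As a hedged illustration with constants chosen here (not taken from the
patch):
  /* Bit 3 is set in C2 (0x8) but not in C1 (0x5), so the equality can
     never hold; and (x | 0x5) is at least 0x5 for unsigned x, so the
     unsigned less-than test is always false.  */
  int f1 (unsigned x) { return (x & 0x5) == 0x8; }  /* folds to 0 */
  int f2 (unsigned x) { return (x | 0x5) < 0x4; }   /* folds to 0 */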
The patch is meant to fix an ICE on RISC-V. In a former patch, I tried
to change the insn condition directly, but Jeff pointed out that it was
more reasonable to optimize it out before the split. As was suggested
by Jeff, this patch tries to simplify the expression in
simplify_relational_operation_1.
The URL for the former patch:
https://patchwork.sourceware.org/project/gcc/patch/20251229024238.15044-1-garthlei@linux.alibaba.com/
gcc/ChangeLog:
* simplify-rtx.cc (simplify_context::simplify_relational_operation_1):
Add simplifications for `(cmp (and/ior x C1) C2)`.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr113609-1.c: Change assembly check after
optimization.
* gcc.target/riscv/zbs-if_then_else-02.c: New test.
Richard Biener [Tue, 10 Mar 2026 13:39:24 +0000 (14:39 +0100)]
Add comment to vect_estimate_min_profitable_iters
The following adds a comment about how it's awkward to add the scalar loop
stmt cost vectors with scaled count to the vector loop cost vector
to estimate peeling costs.
* tree-vect-loop.cc (vect_estimate_min_profitable_iters):
Add comment about costing of prologue/epilogue.
Richard Biener [Tue, 10 Mar 2026 13:36:24 +0000 (14:36 +0100)]
Use scalar_costs in vect_get_known_peeling_cost
The following makes us use the scalar_costs summary when computing
the peeling cost part comparing different peelings for alignment.
That gets rid of repeated walks of LOOP_VINFO_SCALAR_ITERATION_COST
and the fallback to builtin_vectorization_cost.
* tree-vect-loop.cc (vect_get_known_peeling_cost): Use
scalar_costs instead of guesstimating it.
Richard Biener [Tue, 10 Mar 2026 13:34:56 +0000 (14:34 +0100)]
Cost scalar into vect_body
The following adjusts vect_compute_single_scalar_iteration_cost to
record stmts as vect_body rather than vect_prologue so that
scalar_costs->body_cost () will not be zero.
* tree-vect-loop.cc (vect_compute_single_scalar_iteration_cost):
Record stmt cost to vect_body.
Richard Biener [Tue, 10 Mar 2026 13:03:24 +0000 (14:03 +0100)]
Simplify vect_get_known_peeling_cost
The following makes vect_get_known_peeling_cost reflect what it actually
does and simplifies it with its three callers in mind, which do not
need most of what is computed. The function ends up using
legacy builtin_vectorization_cost to sum up N scalar loop copies.
With the next patch in the series this should improve, also
compile-time wise.
* tree-vectorizer.h (vect_get_known_peeling_cost): Simplify API.
* tree-vect-loop.cc (vect_get_known_peeling_cost): Avoid
all the overhead of record_stmt_cost as we only are interested
in the overall sum of the included builtin_vectorization_cost
calls.
* tree-vect-data-refs.cc (vect_peeling_hash_get_lowest_cost):
Adjust.
(vect_enhance_data_refs_alignment): Likewise.
match.pd: x != CST1 ? x + CST2 : CST3 -> x + CST2 [PR112659, PR122996]
This patch extends the conditional addition simplification introduced
in PR 122996 to handle a third constant and uniform vectors. This also
resolves the missing vector addition fold noted in PR 112659.
It simplifies x != CST1 ? x + CST2 : CST3 into x + CST2 when
CST1 + CST2 == CST3.
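A hedged example with concrete constants (not taken from the new tests):
  /* 3 + 2 == 5, so when x == 3 both arms yield 5 and the conditional
     collapses to the addition alone.  */
  int f (int x)
  {
    return x != 3 ? x + 2 : 5;   /* simplifies to: return x + 2;  */
  }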
Bootstrapped and regression tested on x86_64-pc-linux-gnu.
* match.pd: Extend conditional addition pattern to handle
a third constant and uniform vectors.
gcc/testsuite/ChangeLog:
* g++.dg/tree-ssa/cond-add-vec-1.C: New test (positive cases).
* g++.dg/tree-ssa/cond-add-vec-2.C: New test (negative cases).
* gcc.dg/tree-ssa/cond-add-1.c: New test (positive cases).
* gcc.dg/tree-ssa/cond-add-2.c: New test (negative cases).
Monk Chiang [Wed, 7 Jan 2026 02:59:07 +0000 (18:59 -0800)]
RISC-V: Use long jump for crossing section boundaries
When -freorder-blocks-and-partition is used, GCC places cold code in
.text.unlikely section. Jumps from hot code (.text) to cold code
(.text.unlikely) may cross section boundaries. Since the linker may
place these sections more than 1MB apart, the JAL instruction's ±1MB
range can be exceeded, causing linker errors like:
relocation truncated to fit: R_RISCV_JAL against `.text.unlikely'
This patch fixes the issue by checking CROSSING_JUMP_P in the length
attribute calculation for jump instructions. When a jump crosses
section boundaries, the length is set to 8 bytes (AUIPC+JALR) instead
of 4 bytes (JAL), ensuring the long form is used.
This approach is consistent with other backends (NDS32, SH, ARC) that
also use CROSSING_JUMP_P to handle cross-section jumps.
gcc/ChangeLog:
* config/riscv/riscv.md (length attribute): Check CROSSING_JUMP_P
for jump instructions and use length 8 for crossing jumps.
(jump): Update comment to explain when long form is used.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/pr-crossing-jump-1.c: New test.
* gcc.target/riscv/pr-crossing-jump-2.c: New test.
* gcc.target/riscv/pr-crossing-jump-3.c: New test.
Jakub Jelinek [Tue, 28 Apr 2026 07:28:44 +0000 (09:28 +0200)]
range-op-float: Fix ICE on undefined_p ranges [PR125039]
The following testcase ICEs at -O1 since r14-4153.
lower_bound/upper_bound methods on frange (and others) assert
they aren't called on undefined_p () ranges, because such ranges
don't really have any lower or upper bound.
Most fold_range virtual methods call empty_range_varying early
which checks if the operand ranges aren't undefined and in that case
return true and set r to varying, and then can safely use
lower_bound/upper_bound etc.
Now, operator_not_equal::fold_range did that until r14-4152 indirectly,
by calling it in frelop_early_resolve which it called unconditionally.
r14-4153 changed it not to call frelop_early_resolve in some cases
because it could misbehave as mentioned in the comment.
frelop_early_resolve has 3 conditionals it handles.
  if (!maybe_isnan (op1, op2) && relation_union (rel, my_rel) == my_rel)
doesn't apply for this case, because the
  if (rel == VREL_EQ && maybe_isnan (op1, op2))
condition means maybe_isnan (op1, op2) will be true.
  if (relation_intersect (rel, my_rel) == VREL_UNDEFINED)
is the condition which r14-4153 wanted to avoid. And finally
  if (empty_range_varying (r, type, op1, op2))
is the condition the following patch readds, so that we don't ICE on those.
2026-04-27 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/125039
* range-op-float.cc (operator_not_equal::fold_range): Call
empty_range_varying when not calling frelop_early_resolve.
Jakub Jelinek [Tue, 28 Apr 2026 06:54:42 +0000 (08:54 +0200)]
c, middle-end: Implement C2Y N3747 paper - Integer Sets, v5
C23 disallowed signed _BitInt(1); it only allowed unsigned _BitInt(1)
and signed _BitInt(2) and larger precisions.
The https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3747.pdf paper
changes this for C2Y and allows signed _BitInt(1) and (backwards
incompatibly) changes the type of 0wb from _BitInt(2) to _BitInt(1);
all other literals keep their earlier types. The paper contains a large
redesign of the C types hierarchy, but my understanding is that only
those two changes are changing something for users.
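A hedged sketch of what this means for user code under -std=c2y (assumed,
not from the new tests):
  signed _BitInt(1) x = 0;   /* rejected in C23, accepted in C2Y */
  /* 0wb now has type _BitInt(1) rather than _BitInt(2).  */
  _Static_assert (_Generic (0wb, _BitInt(1): 1, default: 0),
                  "0wb has type _BitInt(1)");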
2026-04-28 Jakub Jelinek <jakub@redhat.com>
gcc/
* tree.cc (build_bitint_type): Allow build_bitint_type (1, 0).
(signed_or_unsigned_type_for): Call that for !unsignedp case
of BITINT_TYPE with bits 1.
gcc/c-family/
* c-common.cc (c_common_signed_or_unsigned_type): Use
build_bitint_type for TREE_CODE (type) == BITINT_TYPE whenever
flag_isoc2y even when precision is 1.
(c_common_get_alias_set): Don't special case BITINT_TYPE
with precision 1 for flag_isoc2y.
* c-lex.cc (interpret_integer): Use _BitInt(1) type for 0wb
if flag_isoc2y, rather than _BitInt(2).
gcc/c/
* c-decl.cc (finish_declspecs) <case cts_bitint>: Implement
C2Y N3747 - Integer Sets, v5. Allow signed _BitInt(1) for
flag_isoc2y.
gcc/testsuite/
* gcc.dg/torture/bitint-96.c: New test.
* gcc.dg/torture/bitint-97.c: New test.
* gcc.dg/torture/bitint-98.c: New test.
* gcc.dg/bitint-130.c: New test.
* gcc.dg/bitint-131.c: New test.
* gcc.dg/bitint-132.c: New test.
Andrew Pinski [Mon, 27 Apr 2026 18:30:25 +0000 (11:30 -0700)]
ivopts: Fix up doloop support for enum and bitint types [PR125036]
After r17-89-g78280307c78ead, niter analysis handles enum types, so ivcanon
will also use enum types, and now niter returns an enum type here.
Also, add_iv_candidate_for_doloop was expecting only integer types.
This fixes the issue by allowing non-integer types for what niter
returns and converting it into an integer type which is a full-mode
integer type.
Bootstrapped and tested on powerpc64le-linux-gnu.
PR tree-optimization/125036
gcc/ChangeLog:
* tree-ssa-loop-ivopts.cc (add_iv_candidate_for_doloop): Don't
assert on niter being an integer type. Convert to a full mode
integer type if non-integer or non-full-mode integer type.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr125036-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Jeff Law [Mon, 27 Apr 2026 22:03:53 +0000 (16:03 -0600)]
[RISC-V][PR tree-optimization/57650] Detect more czero opportunities
So in pr57650 we have RTL like this:
> (set (reg:DI 147)
> (and:DI (gt:DI (reg:DI 153 [ y ])
> (reg:DI 154 [ z ]))
> (ne:DI (reg/v/f:DI 138 [ x ])
> (const_int 0 [0]))))
That's going to generate:
sgt a1,a1,a2
snez a5,a0
and a5,a5,a1
But with zicond we can do better. That's really just:
sgt a1,a1,a2
czero.eqz a1,a1,a0
We already had patterns to clean this kind of mess up a bit, but they needed a
bit more generalization. First they only accepted NE forms, but EQ is just as
valid and just requires us to select between czero.nez and czero.eqz. Second
the AND is commutative, so the equality test can appear in either position.
With those generalizations we can get the desired code. Note I'm not trying to
tackle the larger problems with 57650, just the low level code generation
inefficiencies.
This has been in my tester for a while without regressions and is being
exercised during a bootstrap on the BPI. I'll wait for pre-commit CI to render
a verdict.
PR tree-optimization/57650
gcc/
* config/riscv/zicond.md: Generalize patterns which identify
a logical AND of an equality test and some other sCC insn to
handle more cases.
gcc/testsuite/
* gcc.target/riscv/pr57650.c: New test.
Patrick Palka [Mon, 27 Apr 2026 21:56:18 +0000 (17:56 -0400)]
c++: fix decltype(id) for pointer-to-data-member access expr [PR124978]
Here after substitution into decltype(X), X is the expanded but not
constant-evaluated pointer-to-data-member access expression
*((const int *) *cw<Divide{42}>::value + (sizetype) *cw<&Divide::value>::value)
and finish_decltype_type wrongly strips the outermost INDIRECT_REF under
the assumption that it's an implicit dereference of a reference, but here
it's an explicit pointer dereference. This causes the decltype to yield
const int* instead of the expected int.
This patch fixes this particular bug by checking REFERENCE_REF_P instead
of INDIRECT_REF_P, which additionally verifies that the dereferenced thing
actually has reference type. The decltype now yields the correct type
modulo an unnecessary const due to the separate bug PR115314.
PR c++/124978
PR c++/115314
gcc/cp/ChangeLog:
* semantics.cc (finish_decltype_type): Check REFERENCE_REF_P
instead of INDIRECT_REF_P before stripping implicit dereferences.
Patrick Palka [Mon, 27 Apr 2026 21:52:03 +0000 (17:52 -0400)]
c++/modules: defer completion of streamed-in cNTTPs [PR124953]
Here we hit lazy loading recursion when streaming in the cNTTP object
wrap<Storage>{}, via get_template_parm_object -> cp_finish_decl ->
ensure_literal_type_for_constexpr_object -> complete_type, apparently
the class definition of wrap<Storage> hasn't been streamed in yet.
If we disable that literal type check for NTTP objects, we still hit
recursion, from layout_var_decl.
It seems prudent to defer calling cp_finish_decl for NTTP objects until
after lazy loading has completed like we do for expand_or_defer_fn and
cdtors. This patch arranges that, as a follow-up to some previous
NTTP object streaming fixes r15-3031 and r16-318.
PR c++/124953
gcc/cp/ChangeLog:
* module.cc (trees_in::tree_node) <tt_nttp_var>: Push the result
of get_template_parm_object to post_load_decls.
(post_load_processing): Call cp_finish_decl on any not yet
completed NTTP objects.
* pt.cc (get_template_parm_object): Don't call cp_finish_decl
when !check_init.
gcc/testsuite/ChangeLog:
* g++.dg/modules/tpl-nttp-3_a.H: New test.
* g++.dg/modules/tpl-nttp-3_b.C: New test.
Philipp Tomsich [Fri, 6 Mar 2026 09:55:00 +0000 (10:55 +0100)]
ext-dce: Promote narrow operations to wider mode when extended bits are dead
When an operation like (sign_extend:DI (plus:SI ...)) has dead extended
bits, promote the inner operation to the wider mode, eliminating the
extension wrapper. This enables combine to see DI-mode sequences and
form instructions like sh1add, sh2add, sh3add on RISC-V.
Only promote candidates that form chains — where one candidate's result
feeds into another's operand. Standalone (isolated) promotions are
skipped because they cause regressions on targets with free sign
extension (e.g., RISC-V W-suffix instructions): they prevent combine
from folding sext.w patterns and break combine split patterns that
depend on the sign_extend wrapper (sh1add, packw).
Chain detection tracks promotion candidates and their register
connections within each basic block, propagating through copies
created by optimized extensions.
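A hedged example of the kind of code that can benefit (assumed, not taken
from the testsuite; rv64 with Zba):
  /* The adds and shifts are only ever consumed in SImode (the store
     takes the low 32 bits), so the extended bits are dead and the chain
     can be promoted to DImode, where combine can form sh3add for the
     multiply-by-9.  */
  void f (int *d, int a, int b)
  {
    d[0] = (a + b) * 9;
  }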
gcc/ChangeLog:
* ext-dce.cc (promotion_candidate_info): New struct.
(copy_info): New struct.
(promotion_candidates, promotable_dests): New file-scope variables.
(consumed_by_candidate, promotion_copies): Likewise.
(ext_dce_try_promote_operation): New function to promote
sign/zero-extended arithmetic to wider mode.
(ext_dce_record_promotion_candidate): New function to record
promotion candidates for deferred chain analysis.
(ext_dce_promote_chained_candidates): New function to promote
only chained candidates.
(ext_dce_process_uses): Record candidates instead of promoting
immediately; propagate chain info through optimized copies.
(ext_dce_process_bb): Call ext_dce_promote_chained_candidates
after processing all insns in a block.
(ext_dce_init): Allocate chain detection bitmaps.
(ext_dce_finish): Free chain detection data structures.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/ext-dce-promote-2.c: Update to verify both
chain promotions (sh1add, sh3add) and standalone skipping.
Philipp Tomsich [Fri, 6 Mar 2026 09:54:13 +0000 (10:54 +0100)]
ext-dce: Only remove REG_EQUAL/EQUIV notes on successful optimization
In ext_dce_try_optimize_extension, REG_EQUAL/EQUIV notes were removed
unconditionally after attempting validate_change, even when the
validation failed and the insn was reverted to its original state.
This could cause subsequent passes to generate different (incorrect)
code because they lost the REG_EQUAL hint on an unchanged insn.
Guard the note removal with the 'ok' flag so notes are only stripped
when validate_change actually committed the transformation.
gcc/ChangeLog:
* ext-dce.cc (ext_dce_try_optimize_extension): Only remove
REG_EQUAL/EQUIV notes when validate_change succeeds.
Soumya AR [Mon, 27 Apr 2026 20:47:57 +0000 (20:47 +0000)]
aarch64: Update br_mispredict_factor for generic tunings
After some testing, we have found that a br_mispredict_factor of 7 is more
suitable than the default factor of 6 that was proposed in d7aebc72899.
6 can be too restrictive on certain workloads and reject cheaper csels in favour
of conditional branches.
On an Olympus core, this change improves SPEC2017 fp rate geomean by 1% while
the int rate geomean is unchanged. There are no visible regressions >1%.
Additionally, github.com/facebook/zstd retains the performance improvement this
patch introduced.
Signed-off-by: Soumya AR <soumyaa@nvidia.com>
gcc/ChangeLog:
* config/aarch64/tuning_models/generic.h: Update br_mispredict_factor
to 7.
Philipp Tomsich [Thu, 12 Mar 2026 20:58:09 +0000 (21:58 +0100)]
match.pd: Relax single_use for fold-to-zero comparisons
The single_use restriction on the X +- C1 CMP C2 -> X CMP C2 -+ C1
simplification (for eq/ne) prevents folding patterns like (++*a == 1)
into (*a == 0) when the defining SSA value has multiple uses.
Comparing against zero is cheaper on most targets (beqz on RISC-V,
cbz on AArch64), so the transform is profitable even when the
defining SSA has multiple uses. Relax single_use when the folded
comparison constant is zero.
For example, given:
  _1 = *a;
  _2 = _1 + 1;
  *a = _2;
  if (_2 == 1)
match.pd now produces:
  if (_1 == 0)
which generates beqz/cbz instead of li+beq/cmp+b.eq.
This is a partial fix towards the issue described in PR120283.
gcc/ChangeLog:
* match.pd: Relax single_use for eq/ne when folded constant
is zero.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/forwprop-pre-incr-cmp.c: New test.
Jason Merrill [Mon, 20 Apr 2026 19:46:05 +0000 (15:46 -0400)]
c++: constexpr union with no active member [PR124910]
Patrick pointed out that while r16-8767 made a union constant after
destroying its active member, we still weren't treating a union that never
had an active member as constant; the difference is the
CONSTRUCTOR_NO_CLEARING flag, and what that means to
reduced_constant_expression_p.
It seems to me that since P2686 [expr.const] says whether a prvalue
expression is a constant expression depends on the constituent values, and
[intro.object] says that only the active member is a constituent value of a
union, so a union with no active member has no constituent values and so is
vacuously constant, like an object of empty type. P2686 as a whole is not a
DR, but the draft was previously unclear, and CWG2658 also clarified that
copying a union is equivalent to copying the active member *if any*.
I was somewhat surprised that none of the existing tests needed to be
changed.
PR c++/124910
DR 2658
gcc/cp/ChangeLog:
* constexpr.cc (reduced_constant_expression_p): Allow a union
with no active member.
There is a typo in using dead_set instead of set in
clear_sparseset_regnos and regnos_in_sparseset_p. This can result in
wrong (stale) unused notes and wrong or worse code generation by
optimizations using unused notes after RA.
gcc/ChangeLog:
* lra-lives.cc (clear_sparseset_regnos, regnos_in_sparseset_p):
Use set instead of dead_set.
ira_memory_move_cost is used in many places in IRA, but I found 2 places where
the load and store costs are used instead of, correspondingly, the store and load
costs. The patch fixes this.
gcc/ChangeLog:
* ira-costs.cc (record_reg_classes): When calculating alt_cost use
the right cost of memory-reg move.
* ira-emit.cc (emit_move_list): Use load cost instead of store for
moving memory to reg.
Jeff Law [Mon, 27 Apr 2026 17:27:12 +0000 (11:27 -0600)]
[RISC-V][PR target/121268] Add splitters to improve andn generation
So if we have something like (and (not X) (not Y)) where X or Y is a simple
register and the other is possibly more complex, but implementable with a
single instruction, we want to split at the complex expression. Let's say
it's Y above. We want to generate Y's complement into a temporary with a
single instruction, then combine it with X via andn.
The most interesting cases for Y exploit the ~x = -x - 1 identity or
(x & -x) - 1 = (x - 1) & ~x.
If we take two functions from the PR:
unsigned int f1 (unsigned int x)
{
  return ~(x | -x);
}
unsigned int f3 (unsigned int x)
{
  return (x & -x) - 1;
}
Currently generates this on rv64:
f1:
        negw    a5,a0
        or      a0,a5,a0
        not     a0,a0
        ret
f3:
        negw    a5,a0
        and     a0,a5,a0
        addiw   a0,a0,-1
        ret
After this patch we generate:
f1:
        addiw   a5,a0,-1
        andn    a0,a5,a0
        ret
f3:
        addiw   a5,a0,-1
        andn    a0,a5,a0
        ret
I considered doing these in simplify-rtx. My biggest worry is over-fitting to
the way the RISC-V port expresses the "w" form instructions. So I stuck with a
target specific solution.
It's just a few 3->2 splitters. The bulk of the patch has been in my tester
for a while, but the last pattern is new after I did some experimentation on
rv32 to make sure it's generating sensible code too. The runs in my tester
have all been without regressions. Obviously I'll be waiting on the pre-commit
CI system to render a verdict.
PR target/121268
gcc/
* config/riscv/bitmanip.md: Add splitters to exploit identities
that relate subtraction and bitwise negation on 2's complement
arithmetic.
gcc/testsuite/
* gcc.target/riscv/pr121268.c: New test.
Muhammad Kamran [Mon, 27 Apr 2026 14:29:53 +0000 (15:29 +0100)]
aarch64/testsuite: add LTO coverage for branch-protection notes and attributes
Recent binutils (e.g. 2.46) switched AArch64 branch-protection emission
from .note.gnu.property to build attributes
(Tag_Feature_BTI, Tag_Feature_PAC, Tag_Feature_GCS) when GCC is
configured with such toolchains.
PR target/124365 exposed an issue where -flto with
-mbranch-protection=standard caused loss of branch-protection metadata in build
attributes. This was due to an LTO bug, now fixed upstream
(8b39ec70741b7fb9d059b6944f30a6743dea996a).
Add tests to verify both forms in LTO builds, covering:
• older binutils behaviour (.note.gnu.property), and
• newer binutils behaviour (build attributes).
This ensures branch-protection metadata is preserved across LTO for both
toolchain configurations.
PR target/124365
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/lto/lto.exp: New DejaGnu test driver for LTO tests
for aarch64. Copied from gcc/testsuite/gcc.target/arm/lto/lto.exp with
minor changes.
* gcc.target/aarch64/lto/pr124365-build-attributes-1_0.c: New test
for build attributes with branch protection.
* gcc.target/aarch64/lto/pr124365-build-attributes-1_1.c: Companion
source file for the LTO test.
* gcc.target/aarch64/lto/pr124365-build-attributes-2_0.c: New test
for build attributes without branch protection.
* gcc.target/aarch64/lto/pr124365-build-attributes-2_1.c: Companion
source file for the LTO test with branch protection enabled.
* gcc.target/aarch64/lto/pr124365-gnu-property-1_0.c: New test for
`.note.gnu.property` with branch protection.
* gcc.target/aarch64/lto/pr124365-gnu-property-1_1.c: Companion
source file for the LTO test.
* gcc.target/aarch64/lto/pr124365-gnu-property-2_0.c: New test for
`.note.gnu.property` without branch protection.
* gcc.target/aarch64/lto/pr124365-gnu-property-2_1.c: Companion
source file for the LTO test with branch protection enabled.
object-readelf in lib/lto.exp was hard-wired to use readelf -A,
limiting it to attribute checks. Extend it to accept a readelf option and
a regex, where the option selects the readelf flag and the regex is
matched against the output.
Add wrapper procedures for common use cases:
• attribute checks, and
• note checks.
Also add support for negative checks via an "is-negative" argument,
which requires that the regex is not present in the output.
gcc/ChangeLog:
* doc/sourcebuild.texi (Scan object metadata with readelf): Document
object-readelf-attributes, object-readelf-attributes-not,
object-readelf-notes, and object-readelf-notes-not as regex-based
checks with optional target/xfail selectors.
gcc/testsuite/ChangeLog:
* lib/lto.exp (object-readelf): Accept a readelf option and a single
regex; match against full readelf output. Keep positive/negative
behaviour via wrappers.
(object-readelf-attributes, object-readelf-attributes-not,
object-readelf-notes, object-readelf-notes-not): Implement as wrappers
over the generic matcher.
* gcc.dg-selftests/dg-final.exp (dg_final_directive_check_num_args):
Update for object-readelf-* wrappers to regex-style arguments (1..3).
* gcc.target/arm/lto/pr61123-enum-size_0.c: Update to use
object-readelf-attributes with a single regex.
[PATCH v3] tree-optimization: lower mempcpy to memcpy when result is unused [PR93556]
This patch allows the GIMPLE folder to transform __builtin_mempcpy into
__builtin_memcpy in cases where the return value is ignored. This is beneficial
because most targets have an efficient implementation for memcpy.
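A hedged sketch of the fold (not one of the new tests):
  /* The returned pointer is ignored, so the call can become memcpy,
     for which most targets have better expanders.  */
  void f (void *d, const void *s, unsigned long n)
  {
    __builtin_mempcpy (d, s, n);   /* folded to __builtin_memcpy (d, s, n) */
  }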
Existing tests that relied on the unfolded mempcpy have been duplicated - one
version now takes the folded mempcpy into account, and the other intentionally
prevents the folding from happening.
Bootstrapped and regression tested on x86_64-linux-gnu.
PR tree-optimization/93556
gcc/ChangeLog:
* gimple-fold.cc (gimple_fold_builtin_mempcpy): New function.
(gimple_fold_builtin): Handle BUILT_IN_MEMPCPY.
gcc/testsuite/ChangeLog:
* gcc.dg/pr79223.c: Rename to gcc.dg/pr79223-1.c and update scans.
* gcc.dg/tree-prof/val-prof-7.c: Rename to
gcc.dg/tree-prof/val-prof-7-1.c and update scans.
* gcc.dg/tree-ssa/builtins-folding-gimple-3.c: Update scans.
* gcc.dg/builtin-mempcpy-1.c: New test.
* gcc.dg/builtin-mempcpy-2.c: New test.
* gcc.dg/pr79223-2.c: New test.
* gcc.dg/tree-prof/val-prof-7-2.c: New test.
* gcc.dg/tree-ssa/builtins-folding-gimple-4.c: New test.
Jakub Jelinek [Mon, 27 Apr 2026 11:30:58 +0000 (13:30 +0200)]
libstdc++: Fix up std::is_scalar for std::meta::info [PR125024]
https://eel.is/c++draft/basic.types.general#9.sentence-1 says that
std::meta::info and its cv-qualified versions are scalar types too
(and in https://eel.is/c++draft/basic.fundamental#19.sentence-1
that they are fundamental types too).
Now, on the reflection side, eval_is_scalar_type is handled
in the compiler and uses SCALAR_TYPE_P (type) which includes
REFLECTION_TYPE_P check and eval_is_fundamental_type includes that
explicitly too.
std::is_fundamental uses
  template<typename _Tp>
    struct is_fundamental
    : public __or_<is_arithmetic<_Tp>, is_void<_Tp>,
                   is_null_pointer<_Tp>
#if __cpp_impl_reflection >= 202506L
                   , is_reflection<_Tp>
#endif
                  >::type
    { };
but for std::is_scalar we apparently forgot to include is_reflection.
The following patch fixes that.
2026-04-26 Jakub Jelinek <jakub@redhat.com>
PR libstdc++/125024
* include/std/type_traits (std::is_scalar): For
__cpp_impl_reflection >= 202506L handle is_reflection types as
scalar.
* testsuite/20_util/is_scalar/reflection.cc: New test.
Jakub Jelinek [Mon, 27 Apr 2026 08:11:20 +0000 (10:11 +0200)]
testsuite: Fix up bitint-95.c test [PR124988]
I forgot to add the usual guards of bitint tests to bitint-95.c test
(which were done even in the 4 other tests from the same commit).
2026-04-27 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/124988
* gcc.dg/torture/bitint-95.c: Add bitint effective targets and
guard parts of test which need _BitInt(192) support with
__BITINT_MAXWIDTH__ >= 192.
Richard Biener [Mon, 27 Apr 2026 07:00:40 +0000 (09:00 +0200)]
tree-optimization/125025 - ICE with niter analysis and UBSAN
The following avoids trying to compute the absolute step by
negating a signed step, instead, as done in one other place
already, first convert to unsigned and then negate.
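A hedged illustration of the idiom (assumed, not the actual code):
  /* Negating in the signed type overflows for INT_MIN; converting to
     unsigned first makes the negation well-defined.  */
  unsigned int abs_step (int step)
  {
    return step < 0 ? -(unsigned int) step : (unsigned int) step;
  }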
PR tree-optimization/125025
* tree-ssa-loop-niter.cc (number_of_iterations_ne): Avoid
negation of most negative signed integer.
(number_of_iterations_lt): Likewise.
Richard Biener [Sun, 26 Apr 2026 09:16:33 +0000 (11:16 +0200)]
tree-optimization/125019 - fix ICE with recurrence vectorization
This fixes an oversight with the PR124677 fix.
PR tree-optimization/125019
* tree-vect-loop.cc (vectorizable_recurr): Properly guard
against hitting last stmt when searching for the insertion
place.
While looking into PR 110252 a few years back, I noticed this missed
optimization in code from sel-sched.cc. I only realized today that
I could generalize it to handle all positive values rather than
just 1.
This adds the pattern to optimize:
  signed < 0 ? positive : min<signed, positive>
into:
  unsigned ts = signed;
  unsigned ps = positive;
  unsigned ru = min<ts, ps>;
  (signed)ru
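A hedged C rendering of the transform (names invented here):
  /* When s < 0, (unsigned) s is larger than any positive p, so the
     unsigned minimum automatically selects p and the sign test can go.  */
  int f (int s, int p)   /* p is known positive */
  {
    return s < 0 ? p : (s < p ? s : p);
    /* becomes: (int) ((unsigned) s < (unsigned) p
                       ? (unsigned) s : (unsigned) p)  */
  }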
gcc:
* doc/install.texi (Prerequisites): Use Binutils over binutils to
refer to that project.
(Downloading the source): Ditto.
(Configuration): Ditto.
(Building): Ditto.
(Specific): Ditto.
Clearly a permutation of a permutation is another permutation, so
the above expression can be simplified/canonicalized. Conveniently
there's already code in simplify_rtx to spot that a vec_select of
vec_select is an identity; this patch extends that functionality to
simplify a vec_select of a vec_select to a single vec_select.
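A hedged example using GNU vector extensions (masks chosen here for
illustration):
  typedef int v4si __attribute__ ((vector_size (16)));
  /* Composing the two constant permutations yields a single one; the
     combined mask here is { 3, 2, 1, 0 }.  */
  v4si f (v4si x)
  {
    v4si a = __builtin_shuffle (x, (v4si) { 2, 3, 0, 1 });
    return __builtin_shuffle (a, (v4si) { 1, 0, 3, 2 });
  }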
With this transformation in simplify-rtx.cc, combine now reports:
2026-04-26 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* simplify-rtx.cc (simplify_context::simplify_binary_operation_1)
<case VEC_SELECT>: Simplify a (non-identity) vec_select of a
vec_select.
gcc/testsuite/ChangeLog
* gcc.target/i386/sse2-pshufd-2.c: New test case.
Roger Sayle [Sun, 26 Apr 2026 09:56:43 +0000 (10:56 +0100)]
PR tree-optimization/124715: pow(0,-1) sets errno with -fmath-errno
This patch addresses PR tree-optimization/124715, where it is unsafe for
GCC (specifically match.pd) to transform pow(x,-1) into 1.0/x when x may be
zero, since pow(0,-1) sets errno, unless -fno-math-errno (included in
-ffast-math) is specified.
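A hedged illustration of the observable difference (not one of the new
tests):
  #include <errno.h>
  #include <math.h>
  /* With -fmath-errno (the default), pow(0.0, -1.0) is a pole error and
     must set errno to ERANGE; 1.0 / 0.0 would return +Inf without
     touching errno.  */
  int f (void)
  {
    errno = 0;
    (void) pow (0.0, -1.0);
    return errno;   /* expected: ERANGE */
  }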
2026-04-26 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR tree-optimization/124715
* match.pd (simplify pows): Check flag_errno_math before simplifying
pow(x,-1) -> 1/x when x could be zero.
gcc/testsuite/ChangeLog
PR tree-optimization/124715
* gcc.dg/no-math-errno-5.c: New test case.
* gcc.dg/no-math-errno-6.c: Likewise.
Roger Sayle [Sun, 26 Apr 2026 09:53:20 +0000 (10:53 +0100)]
i386: Refactor AVX512 comparisons in machine description sse.md.
This patch refactors/tidies up the define_insns for vector comparisons
on 512-bit vectors in sse.md. The motivation is that the current
organization (accidentally) introduces dubious instructions such as
avx512f_cmpv16si3_mask_round and avx512vl_cmpv2di3_mask_round, which
are integer comparisons that specify a floating point rounding mode!?
The problem is caused by the decomposition of mode iterators.
Currently, sse.md uses four patterns: (1) for signed comparions
of floating point and large integer modes (V48H), (2) for signed
comparisons of small integer modes (VI12), (3) for unsigned
comparisons of small integer modes (VI12) and (4) for unsigned
comparisons of large integer modes (VI48). The first pattern
also allows for variants specifying the FP rounding mode.
The refactoring below uses a more sensible decomposition into
only three patterns: (1) for [signed] comparisons of floating
point modes (VFH), (2) for signed comparisons of integers (VI1248)
and (3) for unsigned comparisons of integers (VI1248).
For the record, to show this produces the same coverage:
The simplification also allows a clean-up of predicates
(for operand[3]) as there are 8 integer comparison operators
and 32 floating point comparison operators, and we no longer
need cmp_imm_predicate to restrict range based upon <mode>.
There are no changes other than removing the non-sensical patterns
from insn-emit, insn-recog and friends.
2026-04-26 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/sse.md
(<avx512>_cmp<mode>3<mask_scalar_merge_name><round_saeonly_name>):
Change mode iterator from V48H_AVX512VL to VFH_AVX512VL and op3's
predicate from <cmp_imm_predicate> to const_0_to_31_operand.
(<avx512>_cmp<mode>3<mask_scalar_merge_name>): Change mode
iterator from VI12_AVX512VL to VI1248_AVX512VLBW.
(<avx512>_ucmp<mode>3<mask_scalar_merge_name>): Likewise.
Jeff Law [Sun, 26 Apr 2026 00:12:27 +0000 (18:12 -0600)]
[RISC-V][PR rtl-optimization/56096] Improve equality comparisons of logical AND expressions
This BZ shows that we can improve certain comparisons for RISC-V. In
particular if we are testing the result of a logical AND for equality and one
operand of the AND requires synthesis, we may be able to do better if we right
shift away any trailing zeros from the constant and shift the other input as
well. This wins when the shifted constant does not require synthesis.
That may in turn allow improvement of a select of 0 and 2^n based on the
zero/nonzero status of a logical AND. Essentially we can rewrite the sequence
to remove a data dependency.
Concretely:
>
> unsigned f1 (unsigned x, unsigned m)
> {
> x >>= ((m & 0x008080) ? 8 : 0);
> return x;
> }
Compiles into:
> li a5,32768
> addi a5,a5,128
> and a1,a1,a5
> snez a1,a1
> slliw a1,a1,3
> srlw a0,a0,a1
> ret
But after this patch we generate this instead:
> srai a5,a1,7
> andi a5,a5,257
> li a4,8
> czero.eqz a1,a4,a5
> srlw a0,a0,a1
> ret
It's just one less instruction, but the li can issue whenever the uarch wants
before the srlw as it has no incoming dependency. So we're slightly more
dense on encoding and slightly more efficient as well. Much like 57650, I'm focused
on the low level RISC-V codegen issues, not the broader issues that are raised
in the PR.
This has been in my tree for a while, so it's been tested on riscv32-elf,
riscv64-elf and bootstrapped on the BPI which has support for czero. Waiting
on pre-commit CI before moving forward.
PR rtl-optimization/56096
gcc/
* config/riscv/riscv.md: Add new patterns to optimize certain cases with
a logical AND feeding an equality test against zero.
Andrew Pinski [Tue, 10 Feb 2026 17:41:48 +0000 (09:41 -0800)]
scev/niter: Use INTEGRAL_NB_TYPE_P instead of direct comparison to INTEGER_TYPE [PR124061]
I noticed this while looking into PR 124052. This is not the first time we had
a direct type comparison against INTEGER_TYPE which should have been different.
As mentioned in PR 124052, I didn't include bool types, so I needed a new macro
to simplify things.
Bootstrapped and tested on x86_64-linux-gnu.
PR tree-optimization/124061
gcc/ChangeLog:
* tree-scalar-evolution.cc (interpret_rhs_expr): Use
INTEGRAL_NB_TYPE_P instead of comparing the code to INTEGER_TYPE.
* tree-ssa-loop-niter.cc (number_of_iterations_ne): Likewise.
(number_of_iterations_cltz): Likewise.
(number_of_iterations_exit_assumptions): Likewise.
* tree.h (INTEGRAL_NB_TYPE_P): New macro.
gcc/testsuite/ChangeLog:
* g++.dg/opt/enum-loop-1.C: New test.
* gcc.dg/tree-ssa/bitint-loop-opt-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
Jeff Law [Sat, 25 Apr 2026 18:18:34 +0000 (12:18 -0600)]
[RISC-V][PR target/123904] Improve bit masking of shifted values
If we are masking off bits on the upper and lower part of a register on riscv,
depending on the precise mask it may be best implemented as a shift triplet.
i.e., shift left to clear upper bits, shift right to clear lower bits, shift left
again to put the bits into their proper position.
If the input value is already left shifted and the shift count corresponds to
the low mask bits, then we can get away with just two shifts. We shift left to
clear the relevant high bits, then shift right to put them into their proper
position.
This likely came from SPEC or coremark given it was reported to me by the RAU
team a while back. But the testcase didn't include enough breadcrumbs to know
for sure.
This has been repeatedly bootstrapped and regression tested on the Pioneer and
BPI as well as regularly regression tested on the riscv32-elf and riscv64-elf
embedded targets.
I'll wait for pre-commit CI to spin before pushing to the trunk.
PR target/123904
gcc/
* config/riscv/riscv.md (masking shifted value): New splitter to
optimize certain masking operations on shifted values.
gcc/testsuite/
* gcc.target/riscv/pr123904.c: New test.
Jeff Law [Sat, 25 Apr 2026 17:40:38 +0000 (11:40 -0600)]
[RISC-V][PR target/123838] Improve code generated for shifts with counts 31-N or 63-N
A shift count expressed as 31 - n ends up generating code like this:
        li      a5,31
        subw    a5,a5,a1
        sllw    a0,a0,a5
        ret
Note how we had to load the constant 31 into a register for the subtraction. But instead of
using 31 - n we can use a bit-not as it'll do precisely what we need in the
bits that the shift instruction actually uses. This results in:
        not     a1,a1
        sllw    a0,a0,a1
        ret
The core idea we're exploiting here is that the processor implements
SHIFT_COUNT_TRUNCATED semantics. So a SI shift only cares about the low 5 bits
and DI the low 6 bits of the shift count. And if we think about what bit
pattern -1 would be in those cases we get 31 and 63. We then exploit the
identity
  -x = ~x + 1    // identity
  -1 - x = ~x    // a tiny bit of algebra
So in these limited cases we can replace -1 - x with ~x.
I didn't implement this in simplify-rtx. It wasn't actually going to help
because while the RISC-V chip implements SHIFT_COUNT_TRUNCATED semantics, it
doesn't define SHIFT_COUNT_TRUNCATED for "reasons".
So there are two patterns. One for an X mode destination, naturally the shift
count is 31/63 - n for SI/DI respectively. It's a bit odd that the subtraction
is always SImode, but that's probably narrowing happening somewhere.
The second pattern covers the "w" forms for rv64.
This trick probably works for the zbs instructions as well. That's going to be
a whole lot more patterns and I haven't seen this idiom show up anywhere in
practice, so it doesn't seem like a good cost/benefit analysis.
This spun overnight on riscv32-elf and riscv64-elf and on the Pioneer without
regressions. I'll wait for pre-commit CI to do its thing before pushing.
PR target/123838
gcc/
* config/riscv/riscv.md: Use splitters to simplify shifts where
the shift count is 31-N or 63-N.
gcc/testsuite
* gcc.target/riscv/pr123838.c: New test.
Pan Li [Tue, 13 Jan 2026 02:03:46 +0000 (10:03 +0800)]
RISC-V: Combine vec_duplicate + vmsle.vv to vmsle.vx on GR2VR cost
This patch would like to combine the vec_duplicate + vmsle.vv into the
vmsle.vx, as in the example code below. The related pattern will depend
on the cost of vec_duplicate from GR2VR. Then the late-combine will
take action if the cost of GR2VR is zero, and reject the combination
if the GR2VR cost is greater than zero.
Assume we have asm code like below, where the GR2VR cost is 0.
After this patch:
11 beq a3,zero,.L8
...
14 .L3:
15 vsetvli a5,a3,e32,m1,ta,ma
...
20 vmsle.vx v1,a2,v3
...
23 bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/predicates.md: Add ge to the swappable
cmp operator iterator.
* config/riscv/riscv-v.cc (get_swapped_cmp_rtx_code): Take
care of the swapped rtx code as well.
Daniel Barboza [Wed, 18 Feb 2026 13:29:50 +0000 (10:29 -0300)]
match.pd: remove bit set/bit clear branch mispredict [PR64567]
Add two patterns to eliminate mispredicts in the following bit-op
scenarios (a sketch follows the list):
- checking if a single bit is not set, and in this case set it: always
set the bit;
- checking if a bitmask is set (even partially), and in this case clear
it: always clear the bitmask.
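A hedged sketch of the first scenario (constants invented here):
  /* Whether or not bit 2 is already set, the result is a | 4, so the
     branch (and its potential mispredict) goes away.  */
  unsigned f (unsigned a)
  {
    if (!(a & 4))
      a |= 4;
    return a;   /* folds to: return a | 4;  */
  }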
Bootstrapped and tested with x86_64-pc-linux-gnu.
PR tree-optimization/64567
gcc/ChangeLog:
* match.pd (`cond (bit_and A IMM) (bit_or A IMM) A`): New
pattern.
(`cond (bit_and A IMM) (bit_and A ~IMM) A`): New pattern.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/pr64567-2.c: New test.
* gcc.dg/tree-ssa/pr64567.c: New test.
tree-ssa-strlen: Use gimple_build/gimple_convert_to_ptrofftype [PR122989]
Replace convert_to_ptrofftype, force_gimple_operand_gsi,
gimple_build_assign, and gsi_insert_before with
gimple_convert_to_ptrofftype and gimple_build.
gcc/ChangeLog:
PR tree-optimization/122989
* tree-ssa-strlen.cc (get_string_length): Use
gimple_convert_to_ptrofftype and gimple_build instead of
convert_to_ptrofftype/force_gimple_operand_gsi/gimple_build_assign.
gld-2.46: warning: /tmp//cckSN7Ts.o: missing .note.GNU-stack section implies executable stack
gld-2.46: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
As shown in the PR, we can trigger an RTL checking abort when classifying thead
specific addressing modes. As far as I can tell, the code is supposed to be
extracting constant value from the multiply operation, but instead is
referencing the wrong object.
The fix is trivial. I don't think this is anywhere near serious enough to try
to get into the imminent gcc-16 release. So after pre-commit testing is done
I'll push to the trunk, then backport in a week or so after the gcc-16 release
has been made.
This has been regression tested on riscv64-elf and riscv32-elf. While it will
spin on the Pioneer overnight, which has the relevant thead extensions, they
aren't enabled by default, so I don't really expect any meaningful improvements
to coverage.
PR target/124984
gcc/
* config/riscv/thead.cc (th_memidx_classify_address_index): Extract
constant multiplicand value from the right object.
gcc/testsuite
* gcc.target/riscv/pr124984.c: New test.
Jeff Law [Fri, 24 Apr 2026 20:58:00 +0000 (14:58 -0600)]
[RISC-V][PR rtl-optimization/80770] Canonicalize extending byte loads for RISC-V
In the process of debugging pr80770 with Shreya it became apparent that a
failure to CSE certain memory references was inhibiting Shreya's RTL
simplification from firing in all the cases we cared about as the simplifier
requires two operands to be the same pseudo.
The failure to CSE stems from having two QI loads which are sign extended to
different sized destinations. As it turns out the code to fix that was
something I already had in flight as it's a small piece of eliminating a few
define_insn_and_split patterns (or simplifying them down to just a
define_split).
To expose the missed CSE what we really want to do is extend the value out to
word mode in a temporary, then use a lowpart extraction to set the real
destination. The key being we haven't changed the size of the load, just how
widely it gets extended. Think of it as canonicalization for the purposes of
CSE.
This isn't the full set of changes I had in flight in that space, but does
clean things up enough for QImode loads to get CSE'd better and is enough to
trigger Shreya's pr80770 changes consistently for the testcodes we have on
RISC-V.
This has been spinning in my tester for a while. So it's clean on riscv64-elf,
riscv32-elf as well as bootstrapped and regression tested on the Pioneer and
BPI-F3. I'll wait for the pre-commit tester to do its thing before pushing to
the trunk.
In case it's not obvious, I'm focused on trickling RISC-V target improvements
right now so as not to potentially interfere with the release process. So this
doesn't include Shreya's simplify-rtx.cc changes.
PR rtl-optimization/80770
gcc/
* config/riscv/riscv.md (zero_extendqi<SUPERQI:mode>2): Always extend
out to a word and use a subreg lowpart extraction to get the right bits.
(extend<SHORT:mode><SUPERQI:mode>2): Similarly.
Carter Rennick [Fri, 3 Apr 2026 13:07:38 +0000 (13:07 +0000)]
mips: Fix ICE on mips64-elf by removing MAX_FIXED_MODE_SIZE override [PR120144]
The definition of MAX_FIXED_MODE_SIZE did not account for MIPS supporting
TImode, which causes an internal compiler error when building libstdc++. Upon further
investigation, this definition appears to be a historical mistake.
This patch removes the MAX_FIXED_MODE_SIZE override, which fixes the error.
Eikansh Gupta [Tue, 31 Mar 2026 11:21:00 +0000 (16:51 +0530)]
tree-ssa-dce: eliminate dead relaxed atomic loads with no LHS [PR123966]
A relaxed atomic load whose result is never used has no observable
effect: the value is discarded and __ATOMIC_RELAXED provides no
inter-thread synchronisation guarantee.
Fix this by adding an early-return check for
BUILT_IN_ATOMIC_LOAD_1/2/4/8/16 calls that have no LHS and a
compile-time-constant relaxed memory order.
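A hedged sketch of a load that can now be removed (not the new test):
  #include <stdatomic.h>
  /* The value is discarded and the relaxed order provides no
     synchronisation, so the load is dead.  */
  void f (atomic_int *p)
  {
    (void) atomic_load_explicit (p, memory_order_relaxed);
  }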
PR tree-optimization/123966
gcc/ChangeLog:
* tree-ssa-dce.cc (mark_stmt_if_obviously_necessary):
Don't mark a relaxed atomic load with no LHS as necessary.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/pr123966.c: New test.
Milan Tripkovic [Fri, 24 Apr 2026 15:01:51 +0000 (09:01 -0600)]
[PATCH] RISC-V: Add vector cost model for Spacemit-X60
This patch implements a dedicated vector cost model for the Spacemit-X60
core. The cost values are derived from micro-benchmarking
data provided by the Camel CDR project.
Following discussions during the RISC-V Patchwork Meeting and based on
the upstream review process, this model applies a clamping
for long-latency instructions. Specifically, all long reservations
are capped at 7 cycles.
As we do not have access to the SPEC CPU benchmark suite, no testing
was performed using that suite. The implementation is based on the
cycle counts reported in the linked data source.
Data source:
https://camel-cdr.github.io/rvv-bench-results/spacemit_x60/index.html
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_sched_adjust_cost): Enable
TARGET_ADJUST_LMUL_COST for spacemit_x60.
* config/riscv/spacemit-x60.md: Add vector pipeline model
for Spacemit-X60.
Co-authored-by: Dusan Stojkovic <Dusan.Stojkovic@rt-rk.com>
Co-authored-by: Nikola Ratkovac <Nikola.Ratkovac@rt-rk.com>
When P2165R4 updated __has_tuple_element in C++23 to reuse the __tuple_like
concept, it dropped the requirement on the validity of get, assuming that for
a tuple-like type of size N, get<I> on an lvalue is well-formed for any I < N.
This, however, does not hold for ranges::subrange (a tuple-like of size 2) with
a move-only iterator, for which get can only be applied to an rvalue. In
consequence, the constraints allowed instantiating elements_view for a range of
such subranges, but instantiating its iterator led to a hard error from the
iterator_category computation.
This patch applies the requirement on the validity of get also in C++23 and
later standard modes.
libstdc++-v3/ChangeLog:
* include/std/ranges (__detail::__has_tuple_element): Check
if std::get<_Nm>(__t) returns referenceable type also for C++23
and later.
* testsuite/std/ranges/adaptors/elements.cc: Add test covering
vector of ranges::subrange with move-only iterator.
Reviewed-by: Patrick Palka <ppalka@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
Richard Biener [Thu, 26 Feb 2026 14:27:10 +0000 (15:27 +0100)]
Some TLC to vect_create_new_slp_node APIs
The following properly documents the overloads of vect_create_new_slp_node
and adjusts callers in tree-vect-slp-patterns.cc.
* tree-vect-slp.cc (vect_create_new_slp_node): Assert that 'code'
is either ERROR_MARK or VEC_PERM_EXPR. Document properly.
* tree-vect-slp-patterns.cc (vect_build_swap_evenodd_node):
Use lane_permutation_t.
(vect_build_combine_node): Likewise. Pass VEC_PERM_EXPR
as code.
[RISC-V][V2][PR target/123839] Improve subset of constant permutes for RISC-V
There's a set of constant permutes that are currently implemented
via vslideup+vcompress which requires a mask (and setup of the
mask), but which can be implemented via vslideup+vslidedown.
This has been tested on riscv{32,64}-elf as well as on a BPI-F3 which
is configured to use V by default.
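For instance, a permute of this shape (a hypothetical example; the new
test covers the real cases):
  typedef int v4si __attribute__ ((vector_size (16)));
  v4si
  f (v4si a, v4si b)
  {
    /* A contiguous slice spanning both inputs: vslidedown on a plus
       vslideup on b, with no mask or mask setup needed.  */
    return __builtin_shufflevector (a, b, 2, 3, 4, 5);
  }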
PR target/123839
gcc/
* config/riscv/riscv-v.cc (shuffle_slide_patterns): Use a
vslideup+vslidedown pair rather than a vcompress-based
sequence.
gcc/testsuite/
* gcc.target/riscv/rvv/autovec/binop/vcompress-avlprop-1.c: Adjust
expected output.
* gcc.target/riscv/rvv/autovec/pr123839.c: New test.
Jakub Jelinek [Fri, 24 Apr 2026 12:50:23 +0000 (14:50 +0200)]
rs6000: Don't fold stuff for C++ during targetm.resolve_overloaded_builtin [PR124133]
The following testcase ICEs starting with the removal of NON_DEPENDENT_EXPR
in GCC 14. The problem is that while parsing templates if all the arguments
of the overloaded builtins are non-dependent types,
targetm.resolve_overloaded_builtin can be called on it. And trying to
fold_convert or fold_build2 subexpressions of such arguments can ICE,
because they can contain various FE specific trees, or standard trees
with NULL_TREE types, or e.g. type mismatches in binary tree operands etc.
All that goes away later when the trees are instantiated and
targetm.resolve_overloaded_builtin is called again, but if it ICEs while
doing that, it won't reach that point. And the reason to call that
hook in that case, when none of the arguments are type dependent, is to
figure out whether the result type is also non-dependent.
Given the general desire to fold stuff in the FE during parsing as little
as possible and fold it only during cp_fold later on and because from the
target *-c.cc files it isn't easily possible to find out if it is
processing_template_decl or not, the following patch just stops folding
anything in the arguments, calls convert instead of fold_convert and
just build2 instead of fold_build2 etc. when in C++ (and keeps doing what
it did for C).
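The new helpers are roughly of this shape (an assumed sketch of the
idea, not the exact rs6000-c.cc code):
  /* Fold when compiling C; for C++ just convert without folding, since
     template-parsing trees may not survive fold_convert.  */
  static tree
  c_fold_convert (tree type, tree expr)
  {
    if (c_dialect_cxx ())
      return convert (type, expr);
    return fold_convert (type, expr);
  }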
2026-04-24 Jakub Jelinek <jakub@redhat.com>
PR target/124133
* config/rs6000/rs6000-c.cc (c_fold_convert): New function.
(c_fold_build2_loc): Likewise.
(fully_fold_convert): Use c_fold_convert instead of fold_convert.
(altivec_build_resolved_builtin): Likewise. Use c_fold_build2_loc
instead of fold_build2.
(resolve_vec_mul, resolve_vec_adde_sube, resolve_vec_addec_subec):
Use c_fold_build2_loc instead of fold_build2_loc.
(resolve_vec_splats, resolve_vec_extract): Use c_fold_convert instead
of fold_convert.
(resolve_vec_insert): Use c_fold_build2_loc instead of fold_build2.
(altivec_resolve_overloaded_builtin): Use c_fold_convert instead
of fold_convert.
* g++.target/powerpc/pr124133-1.C: New test.
* g++.target/powerpc/pr124133-2.C: New test.
Reviewed-by: Michael Meissner <meissner@linux.ibm.com>
Jakub Jelinek [Fri, 24 Apr 2026 12:36:29 +0000 (14:36 +0200)]
bitintlower: Padding bit fixes, part 5 [PR123635]
The following patch is hopefully the last missing part of the _BitInt
bitint_extended padding bit fixes, this time for
__builtin_{add,sub,mul}_overflow. For __builtin_{add,sub}_overflow,
the extension in the padding bits of a partial limb (if any) is already
done in some cases during the handling of the limbs (and the last
hunk in gimple-lower-bitint.cc just adds it to one spot where it was
missing). The extension in the padding bits of a full limb of padding
bits (if any) and for __builtin_mul_overflow partial limb too is done
in finish_arith_overflow. If both var and obj are NULL, it is
__builtin_*_overflow_p or __builtin_*_overflow that ignores the result
of the operation and only cares about whether it overflowed or not; in
that case there is nothing to extend.
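A sketch of the affected shape (a hypothetical test; the new
bitint-9*.c tests are the real coverage):
  int
  f (_BitInt(135) a, _BitInt(135) b, _BitInt(135) *r)
  {
    /* On bitint_extended targets with 64-bit limbs, the 135-bit result
       has a partial top limb whose padding bits must be extended.  */
    return __builtin_add_overflow (a, b, r);
  }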
2026-04-24 Jakub Jelinek <jakub@redhat.com>
PR middle-end/123635
PR tree-optimization/124988
* gimple-lower-bitint.cc (bitint_large_huge::finish_arith_overflow):
Handle bitint_extend.
(bitint_large_huge::lower_addsub_overflow): Fix up comment spelling.
For bitint_extended extend the partial limb if any.
* gcc.dg/torture/bitint-91.c: New test.
* gcc.dg/torture/bitint-92.c: New test.
* gcc.dg/torture/bitint-93.c: New test.
* gcc.dg/torture/bitint-94.c: New test.
* gcc.dg/torture/bitint-95.c: New test.
Tomasz Kamiński [Fri, 24 Apr 2026 11:02:22 +0000 (13:02 +0200)]
libstdc++: Reject using views::iota on iota_view.
Resolves LWG4096, views::iota(views::iota(0)) should be rejected.
For __e of type _Tp that is a specialization of iota_view, the CTAD-based
expression iota_view(__e) is well-formed and creates a copy of __e.
As iota_view<decay_t<_Tp>> is ill-formed in this case (an iota_view is not
weakly_incrementable), naming that type explicitly as the return type
removes the overload from overload resolution in this case.
The (now redundant) __detail::__can_iota_view constraint in the template
head is preserved to provide error messages consistent with the adaptors
for other non-incrementable types.
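For example:
  #include <ranges>
  int main ()
  {
    auto v = std::views::iota (0);
    // auto w = std::views::iota (v);  // ill-formed after this change (LWG4096)
  }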
libstdc++-v3/ChangeLog:
* include/std/ranges (_Iota::operator()(_Tp&&)): Replace
auto return type and CTAD with iota_view<decay_t<_Tp>>.
* testsuite/std/ranges/iota/iota_view.cc: Test that
views::iota(iota_view) is rejected.
Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
Tomasz Kamiński [Fri, 24 Apr 2026 09:58:39 +0000 (11:58 +0200)]
libstdc++: Constrain views::adjacent(_transform)?<0> to forward_ranges.
This resolves LWG 4098, "views::adjacent<0> should reject non-forward
ranges", which was approved in Sofia 2025.
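A sketch of what is now rejected (using views::istream as a non-forward
input range):
  #include <ranges>
  #include <sstream>
  int main ()
  {
    std::istringstream s ("1 2 3");
    auto ints = std::views::istream<int> (s);    // an input_range only
    // auto a = ints | std::views::adjacent<0>;  // ill-formed after this change
  }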
libstdc++-v3/ChangeLog:
* include/std/ranges (_AdjacentTransform::operator())
(_Adjacent::operator()): Require forward_range for N == 0.
* testsuite/std/ranges/adaptors/adjacent/1.cc: Test that input
ranges are rejected.
* testsuite/std/ranges/adaptors/adjacent_transform/1.cc: Likewise.
Reviewed-by: Jonathan Wakely <jwakely@redhat.com>
Signed-off-by: Tomasz Kamiński <tkaminsk@redhat.com>
Tomasz Kamiński [Fri, 24 Apr 2026 09:13:02 +0000 (11:13 +0200)]
libstdc++: Add _GLIBCXX_RESOLVE_LIB_DEFECTS comment for LWG4083.
LWG4083, "views::as_rvalue should reject non-input ranges", is already
resolved, as input_range<_Range> is implied by __detail::__can_as_rvalue_view<_Range>.
A CONST_VECTOR load no larger than an integer register can use an integer
load. Use the inner mode as the scalar mode for a CONST_VECTOR load source.
gcc/
PR target/125009
* config/i386/i386-features.cc (ix86_place_single_vector_set):
Support CONST_VECTOR load no larger than integer register.
(ix86_broadcast_inner): Use inner mode as the scalar mode for
CONST_VECTOR load source.
(pass_x86_cse::x86_cse): Generate CONST_VECTOR broadcast source
for CONST_VECTOR load no larger than integer register.
gcc/testsuite/
PR target/125009
* g++.target/i386/pr125009.C: New test.
* gcc.target/i386/pr125009.c: Likewise.
Richard Biener [Wed, 15 Apr 2026 09:10:56 +0000 (11:10 +0200)]
tree-optimization/124843 - vectorize inversion of scalar bools
Scalar bool inversion vectorization fails due to bools having
bit precision. The following adds a pattern to rewrite the inversion
to the corresponding BIT_XOR_EXPR operation, which we can vectorize
just fine.
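A sketch of the kind of loop this enables (a hypothetical test):
  void
  f (_Bool *restrict r, const _Bool *restrict a, int n)
  {
    for (int i = 0; i < n; i++)
      /* A BIT_NOT of a 1-bit-precision bool, now rewritten to
         a[i] ^ 1 so the loop vectorizes.  */
      r[i] = !a[i];
  }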
PR tree-optimization/124843
* tree-vect-patterns.cc (vect_recog_bool_pattern): Recognize
BIT_NOT_EXPR of scalar bools and rewrite with BIT_XOR_EXPR.