AArch64: tweak inner-loop penalty when doing outer-loop vect [PR121290]
r16-3394-g28ab83367e8710a78fffa2513e6e008ebdfbee3e added a cost model adjustment
to detect invariant load and replicate cases when doing outer-loop vectorization
where the inner loop uses a value defined in the outer-loop.
In other words, it's trying to detect the cases where the inner loop would need
to do an ld1r and all inputs are then working on replicated values. The
argument is that in this case the vector loop is just the scalar loop since each
lane just works on the duplicated values.
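For illustration, a loop nest of the shape this heuristic targets might look
like the sketch below. It is an assumed example, not one of the testcases
touched by this patch, and the function name is made up. With outer-loop
vectorization each lane handles one value of i, so y[j] has to be loaded and
replicated across all lanes on every inner iteration.

```c
/* Illustrative only: an assumed loop shape, not a testcase from this patch.
   With outer-loop vectorization each vector lane handles one value of i.  */
void
outer_vect_example (int *restrict x, const int *restrict y,
                    const int *restrict z, int n)
{
  for (int i = 0; i < 4; ++i)
    {
      int res = 0;
      for (int j = 0; j < n; ++j)
        /* y[j] does not depend on i, so every lane needs the same value,
           i.e. a load and replicate.  z[i] differs per outer iteration,
           so it is an ordinary vector of four values.  */
        res += y[j] * z[i];
      x[i] = res;
    }
}
```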
But it had two shortcomings.
1. It's an all or nothing thing. The load and replicate may only be a small
percentage of the amount of data being processed. As such this patch now
requires the load and replicate to be at least 50% of the leaves of an SLP
tree. Ideally we'd only increase the body cost by VF * invariant leaves, but we
can't since the middle-end cost model applies a rather large penalty to the
scalar code (* 50) and as such the base cost ends up being too high and we
just never vectorize. The 50% is an attempt to strike a balance in this
awkward situation. Experiments show it works reasonably well and we get the
right codegen in all the test cases.
2. It does not keep in mind that a load + replicate where that vector value is
used in a by-index operation will result in the load being decomposed back to
a scalar load, e.g.
ld1r {v0.4s}, [x0]
mul v1.4s, v2.4s, v0.4s
is transformed into
ldr s0, [x0]
mul v1.4s, v2.4s, v0.s[0]
and as such this case may actually be profitable because we're only doing a
scalar load of a single element, similar to the scalar loop.
This patch tries to detect (loosely) such cases and doesn't apply the penalty
for these. It's hard to tell this early whether we will end up with a by-index
operation, since the vectorizer itself is not aware of them. As such the
patch does not do an exhaustive check and only handles the most obvious
case.
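The two points above can be summarized with a rough sketch of the decision.
This is not the code in aarch64.cc: the struct, its fields and the function
name are hypothetical, and treating by-index candidates as simply excluded
from the replicated count is an assumption made to keep the example small.

```c
#include <stdbool.h>

/* Hypothetical sketch; names are invented and this is not the code in
   aarch64.cc.  */
struct leaf_counts
{
  unsigned num_dup;      /* Leaves that are an invariant load + replicate.  */
  unsigned num_by_lane;  /* Of those, leaves likely to feed a by-index op.  */
  unsigned num_total;    /* All leaves seen in the SLP tree.  */
};

static bool
apply_inner_loop_penalty_p (struct leaf_counts c)
{
  /* Nothing to penalize, or inconsistent counts.  */
  if (c.num_total == 0 || c.num_by_lane > c.num_dup)
    return false;

  /* Assumption: replicated leaves expected to fold into a by-index
     operation behave like scalar loads, so they are not counted.  */
  unsigned penalized = c.num_dup - c.num_by_lane;

  /* Only penalize when at least 50% of the leaves are load and
     replicate.  */
  return penalized * 2 >= c.num_total;
}
```

With these assumptions a tree with 2 replicated leaves out of 4 is penalized,
while the same tree is not penalized if one of those leaves is expected to
become a by-index operand.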
gcc/ChangeLog:
PR target/121290
* config/aarch64/aarch64.cc (aarch64_possible_by_lane_insn_p): New.
(aarch64_vector_costs): Add m_num_dup_stmts and m_num_total_stmts.
(aarch64_vector_costs::add_stmt_cost): Use them.
(adjust_body_cost): Likewise.
gcc/testsuite/ChangeLog:
PR target/121290
* gcc.target/aarch64/pr121290.c: Move to...
* gcc.target/aarch64/pr121290_1.c: ...here.
* g++.target/aarch64/pr121290_1.C: New test.
* gcc.target/aarch64/pr121290_2.c: New test.