Avoid splitting store dataref groups during SLP discovery
The following avoids splitting store dataref groups during SLP
discovery and instead forces (eventually single-lane) consecutive-lane
SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes that merge them so the store always covers the whole group.
With this, for example,

int x[1024], y[1024], z[1024], w[1024];

void foo (void)
{
  for (int i = 0; i < 256; i++)
    {
      x[4*i+0] = y[2*i+0];
      x[4*i+1] = y[2*i+1];
      x[4*i+2] = z[i];
      x[4*i+3] = w[i];
    }
}
which was previously using hybrid SLP can now be fully SLPed and
the generated SSE code looks better (though I didn't actually
benchmark it).  We of course need a VF of four here.
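The resulting SLP graph for foo then has roughly the following shape
(a hand-drawn sketch, not actual vectorizer dump output; the exact
structure of the permute nodes may differ):

  store x[4*i+0..3]        <4 lanes>
    VEC_PERM               <merges lanes 0-3>
      load y[2*i+0..1]     <lanes 0-1>
      load z[i]            <lane 2>
      load w[i]            <lane 3>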
The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, especially since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently, as the merged lanes can be a mix of both.
The patch keeps the sub-trees formed from consecutive lanes, but
that is in principle not necessary: an even/odd split, for example,
would now result in N single-lane sub-trees.  That is left for
future improvements; the hypothetical example below shows the shape.
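As a hypothetical example (reusing the declarations from foo above),
the following store group splits into four single-lane sub-trees
under the consecutive-lane scheme, while an even/odd split could
instead use two two-lane sub-trees for the y and z loads:

void bar (void)
{
  for (int i = 0; i < 256; i++)
    {
      x[4*i+0] = y[2*i+0];
      x[4*i+1] = z[2*i+0];
      x[4*i+2] = y[2*i+1];
      x[4*i+3] = z[2*i+1];
    }
}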
The interesting part is how VLA vector ISAs handle the merging of
two vectors when it is not a trivial even/odd merge.  The strategy
for building the permute tree might need adjustments for that
(in the end, splitting each branch into single lanes and then doing
even/odd merging would be the brute-force fallback; a sketch of
that follows below).  I'm not sure how much we can or should rely
on the SLP optimize pass to handle this.
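To illustrate that brute-force fallback, here is a minimal sketch in
plain C (simulating vectors as arrays; zip and merge_lanes are
made-up names for illustration, not GCC internals).  It merges N
single-lane sub-trees into group order using only two-input even/odd
zips, the kind of permute VLA ISAs handle cheaply:

#include <stdio.h>

#define VF 4	/* simulated vectorization factor */

/* Trivial even/odd merge: out = { a[0], b[0], a[1], b[1], ... }.  */
static void
zip (const int *a, const int *b, int *out, int n)
{
  for (int i = 0; i < n; i++)
    {
      out[2*i+0] = a[i];
      out[2*i+1] = b[i];
    }
}

/* Merge the single-lane vectors lanes[0..n) (each 'width' elements,
   n a power of two) so that out interleaves them in group order,
   using only two-input even/odd zips: recurse on the even and odd
   lanes and zip the two results.  */
static void
merge_lanes (int *lanes[], int n, int width, int *out)
{
  if (n == 1)
    {
      for (int i = 0; i < width; i++)
	out[i] = lanes[0][i];
      return;
    }
  int *ev[n/2], *od[n/2];
  for (int i = 0; i < n/2; i++)
    {
      ev[i] = lanes[2*i+0];
      od[i] = lanes[2*i+1];
    }
  int evenv[n/2 * width], oddv[n/2 * width];
  merge_lanes (ev, n/2, width, evenv);
  merge_lanes (od, n/2, width, oddv);
  zip (evenv, oddv, out, n/2 * width);
}

int
main (void)
{
  /* Lane j of group instance i stores value 4*i+j.  */
  int l0[VF] = { 0, 4, 8, 12 }, l1[VF] = { 1, 5, 9, 13 };
  int l2[VF] = { 2, 6, 10, 14 }, l3[VF] = { 3, 7, 11, 15 };
  int *lanes[4] = { l0, l1, l2, l3 };
  int out[4 * VF];
  merge_lanes (lanes, 4, VF, out);
  for (int i = 0; i < 4 * VF; i++)
    printf ("%d ", out[i]);	/* prints 0 1 2 ... 15 */
  printf ("\n");
  return 0;
}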
The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four. Thus
we're using hybrid SLP within a single store group. After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group. But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving, which is IMO good and results in better code.
This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP. I'm unsure how to update
the testcase though.
gcc.dg/vect/slp-21.c runs into similar situations.  Note that when
we discard an instance while analyzing SLP operations we currently
force the full loop to have no SLP because hybrid detection is
broken.  It's probably not worth fixing this at this moment.
For gcc.dg/vect/pr97428.c we no longer split the 16-lane store group
into two but instead merge the two 8-lane loads into one before doing
the store and thus have only a single SLP instance.  A similar
situation happens in gcc.dg/vect/slp-11c.c, but there the branches
feeding the single SLP store have only a single lane each.  Likewise
for gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with an SLP store group of size two fed by two single-lane branches.
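That shape is essentially the following (an illustrative loop, not
the actual testcase), where the two-lane store group is fed by two
single-lane sub-trees merged by a single VEC_PERM node:

void baz (void)
{
  for (int i = 0; i < 512; i++)
    {
      x[2*i+0] = y[i];
      x[2*i+1] = z[i];
    }
}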
* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure; instead
create a single SLP instance for the stores, branching to SLP
sub-trees that are merged with a series of VEC_PERM nodes.
* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.
* gcc.dg/vect/slp-21.c: Remove unimportant scan for SLP.
* gcc.dg/vect/slp-cond-1.c: Expect one more SLP if !vect_load_lanes.
* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.