The following extends SLP discovery to handle non-grouped loads
in loop vectorization in the case where the same load appears in all
lanes.
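
As a rough illustration of the shape of loop meant here (a
hypothetical sketch, not the testcase from the PR; the names
splat_store, x, y and N are made up for this example), both lanes
of the store group are fed by the same unit-stride load, which is
not part of any grouped access:

```c
#define N 512

double x[2 * N], y[N];

/* The two stores x[2*i] and x[2*i+1] form an SLP store group with
   two lanes.  Both lanes use the very same load y[i], which is a
   plain unit-stride load and not a member of a load group.  */
void
splat_store (void)
{
  for (int i = 0; i < N; ++i)
    {
      x[2 * i] = y[i];
      x[2 * i + 1] = y[i];
    }
}
```

Whether a given target actually vectorizes such a loop with SLP
depends on the cost model and the available permutes, so this
only shows the source pattern and not a guaranteed outcome.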
Code generation is adjusted to mimic what we do for the case
of single element interleaving (when the load is not unit-stride)
which is already handled by SLP. There are some limits we
run into because peeling for gaps cannot cover all cases and
we choose VMAT_CONTIGUOUS. The patch does not try to address
these issues yet.
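
For comparison, a minimal sketch of what I take single element
interleaving to look like at the source level (again
hypothetical; gap_copy, a and b are made-up names): the load is
not unit-stride and only one element of each group of two is
used, leaving a gap after it.

```c
#define N 512

double a[2 * N], b[2 * N];

/* The load b[2*i] has a stride of two elements and b[2*i+1] is
   never read, so the load forms a group of size two with a single
   used element followed by a gap.  */
void
gap_copy (void)
{
  for (int i = 0; i < N; ++i)
    {
      a[2 * i] = b[2 * i];
      a[2 * i + 1] = b[2 * i];
    }
}
```

A full-vector load in the last iteration of such a loop may touch
elements past the last one the scalar code reads, which is the
situation peeling for gaps is meant to guard against.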
The main obstacle is that these loads are not
STMT_VINFO_GROUPED_ACCESS and that's a new thing with SLP.
I know from the past that it's not a good idea to make them
grouped. Instead the following adjusts the places that have to
deal with SLP loads that are not STMT_VINFO_GROUPED_ACCESS.
There's already a testcase for the case the PR is after; it is
just XFAILed. The following adjusts that testcase instead of
adding another.
I do expect to have missed some places, so I don't plan to push
this on a Friday. Still there may be feedback, so I'm posting
this now.
Bootstrapped and tested on x86_64-unknown-linux-gnu.
PR tree-optimization/96208
* tree-vect-slp.cc (vect_build_slp_tree_1): Allow
a non-grouped load if it is the same for all lanes.
(vect_build_slp_tree_2): Handle not grouped loads.
(vect_optimize_slp_pass::remove_redundant_permutations):
Likewise.
(vect_transform_slp_perm_load_1): Likewise.
* tree-vect-stmts.cc (vect_model_load_cost): Likewise.
(get_group_load_store_type): Likewise. Handle
invariant accesses.
(vectorizable_load): Likewise.
* gcc.dg/vect/slp-46.c: Adjust for new vectorizations.
* gcc.dg/vect/bb-slp-pr65935.c: Adjust.