The problem is that the code only considers the permutation for
the first scalar iteration, rather than for all VF iterations.
This patch tries to fix that by making vect_transform_slp_perm_load
calculate the value instead.
gcc/
* tree-vectorizer.h (vect_transform_slp_perm_load): Take an
optional extra parameter.
* tree-vect-slp.c (vect_transform_slp_perm_load): Calculate
the number of loads as well as the number of permutes, taking
the counting loop from...
* tree-vect-stmts.c (vect_model_load_cost): ...here. Use the
value computed by vect_transform_slp_perm_load for ncopies.