AArch64: take gather/scatter decode overhead into account
Gathers and scatters are not usually beneficial when the loop iteration count
is small.  This is because they incur not only an execution cost within the
loop but also a one-time cost on entering a loop that contains them.
This patch therefore models that entry overhead.  For generic tuning, however,
we still prefer gathers/scatters when the loop costs work out.
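
As a rough illustration of the trade-off described above (not the code in
this patch), the sketch below compares a scalar loop against a vector loop
that pays a one-time entry cost for its gathers/scatters.  The struct, the
function names and the cost values are all hypothetical and chosen only to
show why a small iteration count favours the scalar loop.

```cpp
#include <cstdint>

// Hypothetical cost parameters; these are NOT the names or values used by
// the AArch64 cost model.
struct LoopCosts {
  uint64_t scalar_per_iter;    // cost of one scalar iteration
  uint64_t vector_per_iter;    // cost of one vector iteration, gather included
  uint64_t gather_entry_cost;  // one-time overhead paid on entering the loop
  uint64_t vf;                 // scalar iterations covered by one vector iteration
};

// Total cost of running the loop in scalar form.
uint64_t scalar_cost(const LoopCosts &c, uint64_t niters) {
  return c.scalar_per_iter * niters;
}

// Total cost of the vector form: entry overhead plus per-iteration cost.
uint64_t vector_cost(const LoopCosts &c, uint64_t niters) {
  uint64_t vec_iters = (niters + c.vf - 1) / c.vf;  // round up
  return c.gather_entry_cost + c.vector_per_iter * vec_iters;
}

// Vectorise only when the vector form is strictly cheaper overall.
bool prefer_vector(const LoopCosts &c, uint64_t niters) {
  return vector_cost(c, niters) < scalar_cost(c, niters);
}
```

With the made-up costs {4, 10, 20, 4}, a 4-iteration loop costs 16 in scalar
form and 30 in vector form, so the scalar loop wins; without the entry cost
the vector form would have looked cheaper (10 against 16).  At 64 iterations
the vector form costs 180 against 256 and is preferred.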