This round of optimization picks both low-hanging and high-hanging
fruit. Altivec has an instruction (intrinsic vec_sum4s) that sums
characters into 4 lanes of integers, which seems basically made for
this algorithm. There is also a similar multiply-accumulate routine
that takes two character vectors as input and outputs a vector of 4
ints holding the sums of their adjacent products. This alone accounted
for a good share of the performance gains.
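To make the mapping onto Adler-32 concrete, here is a minimal scalar
sketch of one 16-byte block step. It is not the Altivec code itself:
sum4_model, msum_model, adler_block16 and the taps table are
hypothetical names that only model the lane arithmetic described above,
so that the result can be compared with the byte-at-a-time definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (assumption) of a sum-across-bytes op in the spirit of
 * vec_sum4s: each 32-bit lane gets acc + the sum of its 4 bytes. */
static void sum4_model(const uint8_t b[16], const uint32_t acc[4],
                       uint32_t out[4])
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t s = acc[lane];
        for (int j = 0; j < 4; j++)
            s += b[4 * lane + j];
        out[lane] = s;
    }
}

/* Scalar model (assumption) of the multiply-accumulate op: each lane
 * gets acc + the sum of its 4 adjacent byte products. */
static void msum_model(const uint8_t a[16], const uint8_t b[16],
                       const uint32_t acc[4], uint32_t out[4])
{
    for (int lane = 0; lane < 4; lane++) {
        uint32_t s = acc[lane];
        for (int j = 0; j < 4; j++)
            s += (uint32_t)a[4 * lane + j] * b[4 * lane + j];
        out[lane] = s;
    }
}

/* Fold one 16-byte block into the running sums (no modulus here). */
static void adler_block16(const uint8_t *p, uint32_t *s1, uint32_t *s2)
{
    /* Byte i is counted (16 - i) times in s2 over the block. */
    static const uint8_t taps[16] = {16, 15, 14, 13, 12, 11, 10, 9,
                                     8, 7, 6, 5, 4, 3, 2, 1};
    const uint32_t zero[4] = {0, 0, 0, 0};
    uint32_t v1[4], v2[4];

    assert(p != NULL && s1 != NULL && s2 != NULL);
    sum4_model(p, zero, v1);
    msum_model(p, taps, zero, v2);
    /* The old s1 contributes once per byte, i.e. 16 times. */
    *s2 += (*s1 << 4) + v2[0] + v2[1] + v2[2] + v2[3];
    *s1 += v1[0] + v1[1] + v1[2] + v1[3];
}

/* Byte-at-a-time reference for comparison. */
static void adler_bytewise(const uint8_t *p, size_t n,
                           uint32_t *s1, uint32_t *s2)
{
    for (size_t i = 0; i < n; i++) {
        *s1 += p[i];
        *s2 += *s1;
    }
}
```

The weights 16..1 are what turn the per-byte "s2 += s1" recurrence into
a single multiply-accumulate per block.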
The shift by 4 was still done inside the loop even though it was easy
to hoist it out of the loop and do it only once. This removed some
latency spent waiting for a dependent operand to be ready. We also
unrolled the loop with independent sums, though this only seems to help
for much larger input sizes.
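The hoisting can be sketched in scalar C as follows. This is an
illustration under assumptions, not the vector code: block_sums,
shift_in_loop and shift_hoisted are hypothetical names, and the only
point is that summing the pre-block s1 values and shifting once gives
the same s2 as shifting on every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain and position-weighted sums of one 16-byte block. */
static void block_sums(const uint8_t *p, uint32_t *plain,
                       uint32_t *weighted)
{
    uint32_t s = 0, w = 0;
    for (uint32_t i = 0; i < 16; i++) {
        s += p[i];
        w += (16 - i) * p[i];
    }
    *plain = s;
    *weighted = w;
}

/* Before: the shift by 4 sits on the loop-carried path of s2. */
static void shift_in_loop(const uint8_t *p, size_t nblocks,
                          uint32_t *s1, uint32_t *s2)
{
    uint32_t a = *s1, b = *s2;
    assert(p != NULL && s1 != NULL && s2 != NULL);
    for (size_t k = 0; k < nblocks; k++, p += 16) {
        uint32_t s, w;
        block_sums(p, &s, &w);
        b += (a << 4) + w;
        a += s;
    }
    *s1 = a;
    *s2 = b;
}

/* After: accumulate the pre-block s1 values, shift once at the end. */
static void shift_hoisted(const uint8_t *p, size_t nblocks,
                          uint32_t *s1, uint32_t *s2)
{
    uint32_t a = *s1, b = *s2, a_before = 0;
    assert(p != NULL && s1 != NULL && s2 != NULL);
    for (size_t k = 0; k < nblocks; k++, p += 16) {
        uint32_t s, w;
        block_sums(p, &s, &w);
        a_before += a;      /* s1 as it was before this block */
        b += w;
        a += s;
    }
    b += a_before << 4;     /* one shift for the whole run */
    *s1 = a;
    *s2 = b;
}
```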
We also simplified feeding the two 16 bit halves of the sum into the
vector code by packing them next to each other in an aligned
allocation on the stack. Once loaded, we permute and shift the values
into two separate vector registers from the same input registers. The
separation of these scalars probably could have been done in vector
registers through some tricks, but we need them in scalar GPRs anyhow
every time they leave the loop, so it was naturally better to keep them
separate before hitting the vectorized code.
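A rough scalar picture of that hand-off, assuming the two halves end up
in lane 0 of their respective vectors with zeros elsewhere (the lane
choice, vec4_model and seed_vectors are assumptions made for
illustration, not taken from the actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 4 x 32-bit vector register. */
typedef struct { uint32_t lane[4]; } vec4_model;

static void seed_vectors(uint32_t adler, vec4_model *vs1,
                         vec4_model *vs2)
{
    /* The two 16-bit halves, side by side in one 16-byte-aligned
     * stack slot, so a single aligned vector load can fetch both. */
    _Alignas(16) uint32_t packed[4] = { adler & 0xffff, adler >> 16,
                                        0, 0 };

    assert(vs1 != NULL && vs2 != NULL);
    /* Stand-in for the load plus permute/shift: each output keeps one
     * half in lane 0 and zeros in every other lane. */
    for (int i = 0; i < 4; i++) {
        vs1->lane[i] = 0;
        vs2->lane[i] = 0;
    }
    vs1->lane[0] = packed[0];
    vs2->lane[0] = packed[1];
}
```

Keeping every other lane zero means a later horizontal add of each
vector yields the seeded running sum without extra fix-ups.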
For the horizontal addition, the code was modified to use a sequence of
shifts and adds to produce the vector sum in the first lane. Then, the
much cheaper vec_ste was used instead of vec_extract to store that
value to memory, from where it is read back into a general purpose
register.
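The shift-and-add reduction can be modeled in scalar C like this. The
names vec4_model, shift_lanes, add4 and hsum_first_lane are
hypothetical, and the lane numbering is an assumption; on real hardware
the shifts would be whole-register shifts and the final read would be a
single-element store.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a 4 x 32-bit vector register. */
typedef struct { uint32_t lane[4]; } vec4_model;

/* Move every lane n positions toward lane 0, filling with zeros. */
static vec4_model shift_lanes(vec4_model v, int n)
{
    vec4_model out;
    assert(n >= 0 && n <= 4);
    for (int i = 0; i < 4; i++)
        out.lane[i] = (i + n < 4) ? v.lane[i + n] : 0;
    return out;
}

/* Lane-wise addition. */
static vec4_model add4(vec4_model a, vec4_model b)
{
    vec4_model out;
    for (int i = 0; i < 4; i++)
        out.lane[i] = a.lane[i] + b.lane[i];
    return out;
}

/* Two shift+add steps leave the total of all 4 lanes in lane 0. */
static uint32_t hsum_first_lane(vec4_model v)
{
    uint32_t result;
    v = add4(v, shift_lanes(v, 2));  /* lane0 = l0+l2, lane1 = l1+l3 */
    v = add4(v, shift_lanes(v, 1));  /* lane0 = l0+l1+l2+l3 */
    result = v.lane[0];              /* models storing one element */
    return result;
}
```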
Lastly, instead of doing the relatively expensive modulus in GPRs after
we perform the scalar operations that align all of the loads in the
loop, we can instead reduce "n" for the first round to be n minus the
alignment offset.
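A minimal scalar sketch of that bookkeeping, assuming the usual zlib
constants BASE = 65521 and NMAX = 5552 and a 16-byte alignment target.
The function names are hypothetical, and the inner byte loop merely
stands in for the vectorized loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u  /* largest prime below 65536 */
#define NMAX 5552u   /* max bytes between reductions without overflow */

/* Reference: reduce after every byte. */
static uint32_t adler32_reference(uint32_t adler, const uint8_t *buf,
                                  size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = adler >> 16;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

static uint32_t adler32_align_sketch(uint32_t adler, const uint8_t *buf,
                                     size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = adler >> 16;
    size_t off, budget;

    assert(buf != NULL || len == 0);
    /* Bytes needed to reach the next 16-byte boundary. */
    off = (size_t)(-(uintptr_t)buf & 15);
    if (off > len)
        off = len;

    /* Scalar prelude with no modulus afterwards... */
    for (size_t i = 0; i < off; i++) {
        s1 += *buf++;
        s2 += s1;
    }
    len -= off;
    /* ...the first round simply gets a smaller budget instead. */
    budget = NMAX - off;

    do {
        size_t n = len < budget ? len : budget;
        len -= n;
        for (size_t i = 0; i < n; i++) {  /* stand-in for vector loop */
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= BASE;
        s2 %= BASE;
        budget = NMAX;                    /* later rounds: full budget */
    } while (len > 0);

    return (s2 << 16) | s1;
}
```

Since the prelude plus the first round never cover more than NMAX bytes
between reductions, the 32-bit sums cannot overflow before the first
modulus.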