Use vaddvq_u32 for adler32 NEON horizontal reduction
Replace interleaved pairwise reduction with vaddvq_u32 to break the
dependency chain between s1 and s2 modulo computations. The original
code merged both accumulators through a shared addp, serializing the
subsequent umull/lsr/msub chains. Independent reductions allow them
to execute in parallel.
On AArch64 this maps to the ADDV instruction. A compatibility shim
in neon_intrins.h emulates this on 32-bit ARM using vadd and vpadd.