This patch implements the lane-wise fp16fml intrinsics.
There are quite a few of them, so I've split them out from
the other, simpler fp16fml intrinsics.
These intrinsics expose instructions such as:
vfmal.f16 Dd, Sn, Sm[<index>] 0 <= index <= 1
vfmal.f16 Qd, Dn, Dm[<index>] 0 <= index <= 3
vfmsl.f16 Dd, Sn, Sm[<index>] 0 <= index <= 1
vfmsl.f16 Qd, Dn, Dm[<index>] 0 <= index <= 3
These instructions extract a single half-precision
floating-point value from the indexed source register
and perform the same vfmal/vfmsl operation as the
normal (non-lane) variant, using that value as the multiplier.
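For reviewers less familiar with these instructions, here is a rough scalar model of the Q-register by-lane forms as I understand them. It is only a sketch: the function names are made up for illustration and are not part of the patch, and plain float stands in for the half-precision inputs.

```c
#include <assert.h>

/* Hypothetical scalar model of "vfmal.f16 Qd, Dn, Dm[index]".
   d: four single-precision accumulators (Qd).
   n: four values standing in for the half-precision lanes of Dn.
   m: four values standing in for the half-precision lanes of Dm.
   One lane of m is selected and multiplied into every lane of n.  */
static void
model_vfmal_lane_q (float d[4], const float n[4], const float m[4], int index)
{
  assert (index >= 0 && index <= 3);
  for (int i = 0; i < 4; i++)
    d[i] += n[i] * m[index];
}

/* Same as above for "vfmsl.f16 Qd, Dn, Dm[index]": the product is
   subtracted from the accumulator instead of added.  */
static void
model_vfmsl_lane_q (float d[4], const float n[4], const float m[4], int index)
{
  assert (index >= 0 && index <= 3);
  for (int i = 0; i < 4; i++)
    d[i] -= n[i] * m[index];
}
```

The D-register forms would look the same with two lanes instead of four and an index range of 0 to 1.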
The nuance here is that some of the intrinsics want
to do things like:
where the float16x8_t value of '__b' is held in a Q
register, so we need to be a bit smart about finding
the right D or S sub-register and translating the
lane number to a lane in that sub-register, instead
of just passing the language-level const-int down to
the assembly instruction.
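To illustrate the kind of translation involved, here is a minimal sketch in plain C. The struct and function names are hypothetical and do not appear in the patch, and the sketch ignores any big-endian lane renumbering, so treat it as an outline of the idea rather than what the patch actually does.

```c
#include <assert.h>

/* Hypothetical result type: which half-sized sub-register holds the
   lane, and the lane number within that sub-register.  */
struct sublane
{
  int subreg; /* 0 = low half, 1 = high half of the wider register.  */
  int lane;   /* Lane number inside the selected sub-register.  */
};

/* Split a language-level lane number of an f16 vector with NLANES
   lanes (8 for a Q register, 4 for a D register) into a sub-register
   selector and a lane within it.  For a Q register Qn the sub-registers
   are D(2n) and D(2n+1); for D0-D15 they are S(2n) and S(2n+1).  */
static struct sublane
split_lane (int index, int nlanes)
{
  int half = nlanes / 2;
  assert (index >= 0 && index < nlanes);
  struct sublane r = { index / half, index % half };
  return r;
}
```

For example, under these assumptions lane 5 of a float16x8_t held in a Q register maps to lane 1 of the high D sub-register, which is within the 0 <= index <= 3 range that the Dm[<index>] form accepts.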
That's where most of the complexity of this patch comes from,
but hopefully it's self-contained enough to make sense.
Bootstrapped and tested on arm-none-linux-gnueabihf as well as
armeb-none-eabi.