Change the replacement for memcpy to a vectorised version that does
word copies whenever possible. This drastically reduces the number of
memory references Memcheck has to process and speeds up a test program
that does repeated memcpys of large blocks by a factor of 4 or more.
Also add a vectorised version of memset.
The memcpy version is also constructed with a view to be used in
exp-ptrcheck, so it can copy areas of memory without losing
pointer-identity shadow data, as happens when doing all copies at a
byte granularity.