kern/misc: Implement faster grub_memcpy() for aligned buffers

When both "dest" and "src" are aligned, copying the data in
grub_addr_t-sized chunks is more efficient than a byte-by-byte copy.

Also tweak __aeabi_memcpy(), __aeabi_memcpy4(), and __aeabi_memcpy8(),
since grub_memcpy() is no longer inline.
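
For illustration only, the aligned fast path described above could look
roughly like the sketch below. It is not the code from this patch: the
names sketch_memcpy and word_t are hypothetical, uintptr_t stands in for
grub_addr_t, and the word-sized accesses assume a build that tolerates
type punning (for example one using -fno-strict-aliasing).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for grub_addr_t: an unsigned pointer-sized integer. */
typedef uintptr_t word_t;

/* Illustrative sketch only; not the implementation added by this patch. */
void *
sketch_memcpy (void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = src;

  /* Fast path: taken only when BOTH pointers are word-aligned. */
  if ((((uintptr_t) d | (uintptr_t) s) & (sizeof (word_t) - 1)) == 0)
    {
      word_t *dw = (word_t *) d;
      const word_t *sw = (const word_t *) s;

      /* Copy one word per iteration while a full word remains. */
      while (n >= sizeof (word_t))
        {
          *dw++ = *sw++;
          n -= sizeof (word_t);
        }

      d = (unsigned char *) dw;
      s = (const unsigned char *) sw;
    }

  /* Byte loop: the tail of an aligned copy, or the whole unaligned copy. */
  while (n-- > 0)
    *d++ = *s++;

  return dest;
}
```

In this sketch, unaligned buffers simply fall through to the byte loop,
so they get no speedup but remain correct.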

Optimization for unaligned buffers was omitted to maintain code
simplicity and readability. The current chunk-copy optimization
for aligned buffers already provides a noticeable performance
improvement (*) for Argon2 keyslot decryption.

(*) On my system, for a LUKS2 keyslot configured with a 1 GB Argon2
    memory requirement, this patch reduces the decryption time from
    22 seconds to 12 seconds.

Signed-off-by: Gary Lin <glin@suse.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>