The function was tuned around 64-byte entry alignment and performs
better for all sizes with it.
In addition, the different code paths were explicitly written to touch
the minimum number of cache lines, i.e. sizes <= 32 touch only the
entry cache line.
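
As a rough illustration of why entry alignment matters here, the sketch
below counts how many 64-byte cache lines a span of code covers for a
given start address. The helper name lines_touched, the addresses, and
the 40-byte path length are hypothetical and are not taken from the
patch; they only show the arithmetic.

```c
#include <stdint.h>

/* Assumed cache line size; matches the 64-byte alignment in the message. */
#define CACHE_LINE 64u

/* Hypothetical helper: number of cache lines covered by the byte
   range [start, start + len).  */
static unsigned
lines_touched (uintptr_t start, unsigned len)
{
  if (len == 0)
    return 0;
  uintptr_t first = start / CACHE_LINE;
  uintptr_t last = (start + len - 1) / CACHE_LINE;
  return (unsigned) (last - first + 1);
}
```

Under these assumptions, a 40-byte path starting at a 64-byte-aligned
entry (e.g. 0x1000) stays within one line, while the same path starting
at offset 48 within a line (e.g. 0x1030) spills into a second one.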
(cherry picked from commit 227afaa67213efcdce6a870ef5086200f1076438)
# define VEC_SIZE 32
# define PAGE_SIZE 4096
.section SECTION(.text), "ax", @progbits
-ENTRY(MEMRCHR)
+ENTRY_P2ALIGN(MEMRCHR, 6)
# ifdef __ILP32__
/* Clear upper bits. */
and %RDX_LP, %RDX_LP