As it turns out, trying to peel off the remainder with so many branches
inflated the code size enough that this function would not inline without
fairly aggressive optimization flags. Catching only vector-sized chunks
here keeps the loop body small, and the byte-by-byte copy idiom at the
bottom leaves the compiler free to pick whatever small-copy sequence it
likes for the tail.
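
To make the inlining point concrete: a memcpy whose length is a compile-time
constant such as 16 is normally expanded in place into a couple of vector
loads and stores, whereas a variable-length memcpy usually stays an
out-of-line call. A minimal sketch of the two shapes, not taken from this
patch (the function names are made up):

    #include <stddef.h>
    #include <string.h>

    /* Variable-length copy: the length is only known at run time, so the
     * compiler generally keeps this as a real call to memcpy. */
    static void copy_variable(unsigned char *dst, const unsigned char *src, size_t n) {
        memcpy(dst, src, n);
    }

    /* Fixed-size blocks: every memcpy has a constant length of 16, so each
     * one can be expanded into a vector load/store pair and the whole loop
     * stays small enough to inline. */
    static void copy_fixed_blocks(unsigned char *dst, const unsigned char *src, size_t n) {
        while (n >= 16) {
            memcpy(dst, src, 16);
            dst += 16;
            src += 16;
            n -= 16;
        }
        /* Byte-by-byte tail: a simple idiom the compiler can lower however
         * it sees fit. */
        while (n--) {
            *dst++ = *src++;
        }
    }
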
* behind or lookahead distance. */
    uint64_t non_olap_size = llabs(from - out); // llabs rather than labs: long is only 32 bits on Windows (LLP64)
- memcpy(out, from, (size_t)non_olap_size);
- out += non_olap_size;
- from += non_olap_size;
- len -= non_olap_size;
-
    /* So this doesn't give us a worst case scenario of function calls in a loop,
* we want to instead break this down into copy blocks of fixed lengths */
while (len) {
tocopy = MIN(non_olap_size, len);
len -= tocopy;
- while (tocopy >= 32) {
- memcpy(out, from, 32);
- out += 32;
- from += 32;
- tocopy -= 32;
- }
-
- if (tocopy >= 16) {
+ while (tocopy >= 16) {
memcpy(out, from, 16);
out += 16;
from += 16;
            tocopy -= 16;
}
- if (tocopy >= 2) {
- memcpy(out, from, 2);
- out += 2;
- from += 2;
- tocopy -= 2;
- }
-
- if (tocopy) {
+ while (tocopy--) {
*out++ = *from++;
}
}
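
For reference, with this hunk applied the copy loop reads roughly as follows.
This is just the diff above reassembled into one piece; the declarations of
out, from, len and tocopy, and the MIN macro, are assumed to come from the
surrounding function:

    uint64_t non_olap_size = llabs(from - out); // long is only 32 bits on Windows

    /* Copy in blocks no larger than the overlap distance, so every memcpy
     * stays within non-overlapping source and destination ranges. */
    while (len) {
        tocopy = MIN(non_olap_size, len);
        len -= tocopy;

        /* Vector-sized chunks only: constant-length memcpy inlines cheaply. */
        while (tocopy >= 16) {
            memcpy(out, from, 16);
            out += 16;
            from += 16;
            tocopy -= 16;
        }

        /* Byte-by-byte tail, left to the compiler. */
        while (tocopy--) {
            *out++ = *from++;
        }
    }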