For very small loops (< 6 insns), it would be fine to unroll 4
times to run fast with less latency and better cache usage. Like
below loops:
while (i) a[--i] = NULL; while (p < e) *d++ = *p++;
With this patch enhances, we could see some performance improvement
for some workloads(e.g. SPEC2017).