guest reg with zeroes and then overwrite the lower half. Doing so forces
the back end to generate code which creates huge write-after-write
stalls in the memory system of P4s due to the differently sized writes.
This apparently small change reduces the run time of one
SSE2-intensive floating-point program from 145 seconds to 90 seconds
(--tool=none).
git-svn-id: svn://svn.valgrind.org/vex/trunk@1121
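
A minimal sketch of the two write patterns described above, assuming the
128-bit guest register can be modelled as two 64-bit lanes in memory. The
names XmmModel, load_lo64_old, load_lo64_new and patterns_agree are
hypothetical and are not part of VEX. The sketch only illustrates that both
patterns leave the register in the same final state; it does not reproduce
the P4 stall or its timing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit guest XMM register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } XmmModel;

/* Old pattern: one 128-bit write of zeroes, then a 64-bit write that
   overlaps the lower half of it. */
static void load_lo64_old(XmmModel *reg, uint64_t value) {
    memset(reg, 0, sizeof *reg);   /* full-width write */
    reg->lane[0] = value;          /* narrower write on top of it */
}

/* New pattern: two non-overlapping 64-bit writes, upper lane then lower. */
static void load_lo64_new(XmmModel *reg, uint64_t value) {
    reg->lane[1] = 0;
    reg->lane[0] = value;
}

/* Returns 1 if both patterns leave the register in the same state,
   starting from a register filled with all-ones. */
int patterns_agree(uint64_t value) {
    XmmModel a = {{ ~0ULL, ~0ULL }};
    XmmModel b = {{ ~0ULL, ~0ULL }};
    load_lo64_old(&a, value);
    load_lo64_new(&b, value);
    return a.lane[0] == b.lane[0] && a.lane[1] == b.lane[1];
}
```

Calling patterns_agree with any 64-bit value should return 1, which is the
property the change below relies on: the upper lane is still zeroed and the
lower lane still receives the loaded value, but no write overlaps a wider
one.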
delta += 3+1;
} else {
addr = disAMode ( &alen, sorb, delta+3, dis_buf );
- putXMMReg( gregOfRM(modrm), mkV128(0) );
+ putXMMRegLane64( gregOfRM(modrm), 1, mkU64(0) );
putXMMRegLane64( gregOfRM(modrm), 0,
loadLE(Ity_I64, mkexpr(addr)) );
DIP("movsd %s,%s\n", dis_buf,