Merge from CGTUNE branch, code generation improvements for amd64:
r1772:
When generating code for helper calls, be more aggressive about
computing values directly into argument registers, thereby avoiding
some reg-reg shuffling.  This reduces the size of the code Cachegrind
generates on amd64 by about 6%; other tools see little or no
benefit.
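
A toy cost model of the idea in r1772, not the actual instruction
selector: the argument kinds, the per-kind costs and the function
names are all hypothetical.  It assumes an argument can be computed
directly into its argument register only when doing so cannot clobber
an argument register that has already been loaded, and that everything
else must go through temporaries.

```c
#include <stddef.h>

/* Hypothetical argument kinds for a toy instruction selector. */
typedef enum { ARG_CONST, ARG_TEMP, ARG_COMPLEX } ArgKind;

/* Assumed cost, in instructions, of computing an argument into a
   freshly allocated temporary register. */
static int compute_cost(ArgKind k)
{
    switch (k) {
        case ARG_CONST: return 1;  /* mov $imm, tmp */
        case ARG_TEMP:  return 0;  /* already lives in a register */
        default:        return 3;  /* arbitrary stand-in for a longer sequence */
    }
}

/* Conservative scheme: compute every argument into a temporary, then
   copy each temporary into its argument register. */
static int cost_via_temporaries(const ArgKind *args, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++)
        cost += compute_cost(args[i]) + 1;  /* +1 for the reg-reg move */
    return cost;
}

/* Direct scheme: write each argument straight into its argument
   register.  Returns -1 if any argument is complex, since evaluating
   it might clobber argument registers that are already loaded. */
static int cost_direct(const ArgKind *args, size_t n)
{
    int cost = 0;
    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_COMPLEX)
            return -1;
        cost += 1;  /* mov $imm, argreg  or  mov tmp, argreg */
    }
    return cost;
}
```

In this model a call with the arguments (constant, temporary,
constant) costs 5 instructions through temporaries and 3 when computed
directly; the saving is one reg-reg move per constant argument.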
r1773:
Emit 64-bit branch targets using 32-bit short forms when possible.
With Valgrind's default amd64 load address of 0x38000000 this is
usually possible, and it saves about 7% in code size for Memcheck and
even more for Cachegrind.
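
A minimal sketch of the short-form choice in r1773, assuming the
branch target is first loaded into %rax; the function name and the
choice of register are hypothetical and this is not the actual
emitter.  It relies on the amd64 rule that a 32-bit write to %eax
zero-extends into %rax, so any target whose upper 32 bits are zero can
use the 5-byte `movl` form instead of the 10-byte `movabsq` form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit code that loads `target` into %rax and return the number of
   bytes written.  `buf` must have room for at least 10 bytes. */
static size_t emit_load_target(uint8_t *buf, uint64_t target)
{
    assert(buf != NULL);
    uint8_t *p = buf;

    if (target <= UINT32_MAX) {
        /* movl $imm32, %eax  (B8 id): 5 bytes, zero-extends to %rax */
        *p++ = 0xB8;
        for (int i = 0; i < 4; i++)
            *p++ = (uint8_t)(target >> (8 * i));
    } else {
        /* movabsq $imm64, %rax  (REX.W B8 io): 10 bytes */
        *p++ = 0x48;
        *p++ = 0xB8;
        for (int i = 0; i < 8; i++)
            *p++ = (uint8_t)(target >> (8 * i));
    }
    return (size_t)(p - buf);
}
```

A target such as 0x38000000 takes the short path and needs 5 bytes; a
target at or above 0x100000000 falls back to the 10-byte form.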