Tidy up some loose ends in the self-checking-translations machinery,
and unroll the adler32 loop in a not-very-successful attempt to reduce
the overhead of checking.
Basic support for self-checking translations. It fits quite neatly
into the IR: if a translation self-check fails, the translation exits
to the despatcher with VEX_TRC_JMP_TINVAL, with the
guest_TISTART/guest_TILEN pseudo-registers indicating which area of
guest code needs to be invalidated. The actual checksumming is done
by a helper function which does (a variant of) the Adler32 checksum.
Space/time overhead, whilst substantial, looks tolerable. There's a
little room for optimisation of the basic scheme. It would certainly
be viable to run with self-checking for all translations to support
Valgrinding JITs (including V itself) without any assistance from the
JIT.
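
To make the scheme concrete, here is a minimal sketch of an
Adler32-variant checksummer with the unrolled inner loop mentioned
above; the function name, the deferred modulus and the unroll factor
of 4 are illustrative assumptions, not the actual VEX helper.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the VEX helper; name, modulus
       placement and unroll factor are illustrative.               */
    static uint32_t adler32_variant ( const uint8_t* buf, size_t len )
    {
       uint32_t a = 1, b = 0;
       /* 4-way unrolled inner loop, to trim per-byte loop overhead
          (the "not-very-successful" optimisation noted above).     */
       while (len >= 4) {
          a += buf[0]; b += a;
          a += buf[1]; b += a;
          a += buf[2]; b += a;
          a += buf[3]; b += a;
          buf += 4; len -= 4;
       }
       while (len > 0) {
          a += *buf++; b += a;
          len--;
       }
       /* A single deferred reduction is safe here because guest
          basic blocks are short; classic Adler32 reduces often.    */
       a %= 65521; b %= 65521;
       return (b << 16) | a;
    }

At translation time the sum over the original guest bytes would be
recorded; the translation's self-check recomputes it and, on
mismatch, exits with VEX_TRC_JMP_TINVAL and guest_TISTART/guest_TILEN
set as described above.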
A further hack to reduce ppc32 reg-alloc costs: don't give the
regalloc so many registers to play with. In the majority of cases it
won't be able to make much use of vast hordes of FP and Altivec
registers anyway.
Fix (well, ameliorate, at least) some lurking performance problems
(time taken to do register allocation, not quality of result) which
were tolerable when allocating for x86/amd64 but got bad when dealing
with ppc-ish numbers of real registers (90 ish).
* Don't sanity-check the entire regalloc state after each insn
processed; this is total overkill. Instead do it every 7th insn
processed (somewhat arbitrarily) and just before the last insn.
* Reinstate an optimisation from the old UCode allocator: shadow
the primary state structure (rreg_state) with a redundant inverse
mapping (vreg_state) to remove the need to search through
rreg_state when looking for info about a given vreg, a very common
operation. Add logic to keep the two maps consistent, and a sanity
check to ensure they really are consistent (see the sketch after
this list).
* Rename some variables and macros to make the code easier to
understand.
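
A minimal sketch of the shadowing idea, assuming simple array-based
maps; the two structure names follow the commit, but the fields and
helpers are hypothetical (the real allocator state carries much
more):

    #include <assert.h>

    #define INVALID (-1)

    /* Illustrative shapes only.                                   */
    typedef struct { int vreg; } RRegState;  /* rreg -> vreg or -1 */
    typedef struct { int rreg; } VRegState;  /* vreg -> rreg or -1 */

    /* Keep the inverse map in step whenever a binding is made, so
       a "which rreg holds vreg v?" query is O(1) instead of a scan
       of rreg_state.                                              */
    static void bind_vreg_to_rreg ( RRegState* rreg_state,
                                    VRegState* vreg_state,
                                    int r, int v )
    {
       rreg_state[r].vreg = v;
       vreg_state[v].rreg = r;
    }

    /* Cross-check that the two maps agree; run every 7th insn and
       just before the last insn, not after every insn.            */
    static void sanity_check_maps ( const RRegState* rreg_state,
                                    const VRegState* vreg_state,
                                    int n_rregs )
    {
       int r;
       for (r = 0; r < n_rregs; r++) {
          int v = rreg_state[r].vreg;
          if (v != INVALID)
             assert(vreg_state[v].rreg == r);
       }
    }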
On x86->x86 (--tool=none), total Vex runtime is reduced by about 10%,
and amd64 is similar. For ppc32 the Vex runtime is nearly halved. On
x86->x86 (--tool=none), register allocation now consumes only about
10% of the total Vex runtime.
When hooked up to Valgrind, run time of short-running programs --
which is dominated by translation time -- is reduced by up to 10%.
Calltree/kcachegrind/cachegrind proved instrumental in tracking down
and quantifying these performance problems. Thanks, Josef & Nick.
The logic that drove basic-block-to-IR disassembly had been duplicated
across the 3 front ends (x86, amd64, ppc32). Given the need to take
into account basic block chasing, the adding of instruction marks,
etc., the logic is not completely straightforward, so commoning it up
is a good thing.
Julian Seward [Thu, 30 Jun 2005 23:31:27 +0000 (23:31 +0000)]
Enhance IR so as to distinguish between little- and big-endian loads and
stores, so that PPC can be properly handled. Until now it's been hardwired
to assume little-endian.
As a result, IRStmt_STle is renamed IRStmt_Store and IRExpr_LDle is
renamed IRExpr_Load.
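
A before/after sketch of the constructor change; the signatures shown
are those of later VEX releases, and the exact 2005 forms may differ
in detail:

    #include "libvex_ir.h"

    /* Before: endianness was implicit (little) in the node itself:
          data = IRExpr_LDle( Ity_I32, addr );
          st   = IRStmt_STle( addr, data );
       After: each load/store names its endianness explicitly.     */
    static void example ( IRExpr* addr )
    {
       /* A big-endian 32-bit load, as a PPC front end would emit: */
       IRExpr* data = IRExpr_Load( Iend_BE, Ity_I32, addr );
       /* A little-endian store, as x86/amd64 front ends emit:     */
       IRStmt* st   = IRStmt_Store( Iend_LE, addr, data );
       (void)st;
    }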
Julian Seward [Thu, 30 Jun 2005 12:08:48 +0000 (12:08 +0000)]
Connect up the plumbing which allows the ppc32 front end to know the
cache line size it is supposed to simulate. Use this in
dis_cache_manage(). Finally reinstate 'dcbz'.
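
As an illustration of how the plumbed-through line size might get
used, here is a hypothetical expansion of 'dcbz' into IR via a
word-at-a-time zeroing loop; constructor names follow later VEX
(WrTmp/RdTmp), and this is not the actual dis_cache_manage() code:

    #include "libvex_ir.h"

    /* Hypothetical: zero the cache line containing 'ea', where
       lineszB is the simulated cache line size in bytes.          */
    static void expand_dcbz ( IRSB* bb, IRExpr* ea, Int lineszB )
    {
       Int    i;
       IRTemp base = newIRTemp( bb->tyenv, Ity_I32 );
       /* base = ea & ~(lineszB-1): align down to the line start.  */
       addStmtToIRSB( bb, IRStmt_WrTmp( base,
          IRExpr_Binop( Iop_And32, ea,
             IRExpr_Const(IRConst_U32( ~(UInt)(lineszB-1) )))));
       /* Emit one 32-bit store of zero per word in the line.      */
       for (i = 0; i < lineszB; i += 4)
          addStmtToIRSB( bb, IRStmt_Store( Iend_BE,
             IRExpr_Binop( Iop_Add32, IRExpr_RdTmp(base),
                           IRExpr_Const(IRConst_U32((UInt)i)) ),
             IRExpr_Const(IRConst_U32(0)) ));
    }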
Julian Seward [Thu, 30 Jun 2005 11:49:14 +0000 (11:49 +0000)]
(API-visible change): generalise the VexSubArch idea. Everywhere
where a VexSubArch was previously passed around, a VexArchInfo is now
passed around. This is a struct which carries more details about any
given architecture and in particular gives a clean way to pass around
info about PPC cache line sizes, which is needed for guest-side PPC.
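
The struct is roughly the following shape (an approximation from the
description above; see libvex.h for the authoritative definition):

    /* Approximate shape; VexSubArch is the pre-existing enum.     */
    typedef struct {
       VexSubArch subarch;             /* the old, narrower notion */
       Int        ppc_cache_line_szB;  /* e.g. 32, 64 or 128; PPC  */
    } VexArchInfo;

Callers would initialise one (LibVEX_default_VexArchInfo() supplies
defaults), fill in the probed cache line size, and pass it to
LibVEX_Translate().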
guest-ppc32
~~~~~~~~~~~
- store-with-update instrs: Valgrind's pagefault handler expects the
  faulting address to be >= the current stack ptr, so the stack ptr
  register must be updated _before_ the old stack ptr is stored (see
  the sketch after this list)
- branch_ctr_ok (bad calc for 'branch if %ctr zero' case)
- mcrf: scanning bitfields in the wrong direction
- on spotting the magic sequence, delta += 24
- updated DIPs for +ve-only args
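
A sketch of the emission order the first fix above requires, for a
store-with-update such as 'stwu rS,d(rA)'; the helper is hypothetical
and constructor names follow later VEX:

    #include "libvex_ir.h"

    /* Hypothetical: 'stwu rS,d(rA)', where rA_off/rS_off are the
       guest-state offsets of the two registers.                   */
    static void gen_stwu ( IRSB* bb, Int rA_off, Int rS_off, Int d )
    {
       IRTemp ea  = newIRTemp( bb->tyenv, Ity_I32 );
       IRTemp old = newIRTemp( bb->tyenv, Ity_I32 );
       /* ea = rA + d */
       addStmtToIRSB( bb, IRStmt_WrTmp( ea,
          IRExpr_Binop( Iop_Add32, IRExpr_Get( rA_off, Ity_I32 ),
                        IRExpr_Const(IRConst_U32((UInt)d)) )));
       /* Capture rS now; rS and rA may be the same register (r1). */
       addStmtToIRSB( bb, IRStmt_WrTmp( old,
          IRExpr_Get( rS_off, Ity_I32 ) ));
       /* Update the stack pointer FIRST ...                       */
       addStmtToIRSB( bb, IRStmt_Put( rA_off, IRExpr_RdTmp(ea) ));
       /* ... THEN store, so that if the store faults, the handler
          sees the faulting address >= the updated stack pointer.  */
       addStmtToIRSB( bb, IRStmt_Store( Iend_BE,
          IRExpr_RdTmp(ea), IRExpr_RdTmp(old) ));
    }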
host-ppc32
~~~~~~~~~~
- fixed CMov reg usage
- fixed Pin_Call in emit_PPC32Instr(): we already know how far we're jumping
- fixed Pin_Goto in emit_PPC32Instr(): vassert the correct range of
  jump deltas
other-ppc32
~~~~~~~~~~~
- exported OFFSET_ppc32_(various) for valgrind
Julian Seward [Wed, 18 May 2005 11:47:47 +0000 (11:47 +0000)]
Handle XCHG rAX, reg for 32-bit regs as well as 64-bit regs. I'm not
sure this is right -- the AMD64 docs are very difficult to interpret
on the subtle point of precisely what is and isn't to be regarded as a
no-op.
Julian Seward [Thu, 12 May 2005 17:55:01 +0000 (17:55 +0000)]
Add the beginnings of what might be a general mechanism to pass
ABI-specific knowledge through the IR compilation pipeline. This
entails a new IR construction, AbiHint.
Currently there is only one kind of hint, and it is generated by the
amd64 front end. This tells whoever wants to know that a function
call or return has happened, and so the 128 bytes below %rsp should be
considered undefined.
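
A sketch of what such a hint might look like at a call/return site on
amd64; the helper is hypothetical, rsp_off stands for the front end's
%rsp guest-state offset (OFFB_RSP), and the two-argument AbiHint form
shown here was later extended with a next-instruction-address
argument:

    #include "libvex_ir.h"

    /* Hypothetical emission at a call/return on amd64: mark the
       128 bytes below the new %rsp as having undefined contents.  */
    static void hint_redzone_undefined ( IRSB* bb, Int rsp_off )
    {
       /* base = %rsp - 128, the bottom of the red zone.           */
       IRExpr* base = IRExpr_Binop( Iop_Sub64,
                         IRExpr_Get( rsp_off, Ity_I64 ),
                         IRExpr_Const(IRConst_U64(128)) );
       addStmtToIRSB( bb, IRStmt_AbiHint( base, 128 ) );
    }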
Julian Seward [Wed, 11 May 2005 23:16:13 +0000 (23:16 +0000)]
Allow reg-alloc to use %rbx. This is a callee-saved register and
therefore particularly valuable - bringing it into circulation reduces
the volume of code generated by memcheck by about 3%.
Julian Seward [Wed, 11 May 2005 22:55:08 +0000 (22:55 +0000)]
Ah, the joys of register allocation. You might think that giving
reg-alloc as many registers as possible maximises performance. You
would be wrong. Giving it more registers generates more spilling of
caller-saved regs around the innumerable helper calls created by
Memcheck. What we really need are zillions of callee-save registers,
but those are in short supply. Hmm, perhaps I should let it use %rbx
too -- that's listed as callee-save.
Anyway, the current arrangement allows reg-alloc to use 8
general-purpose regs and 10 xmm registers. The x87 registers are not
used at all. This seems to work fairly well.
Julian Seward [Mon, 9 May 2005 22:23:38 +0000 (22:23 +0000)]
Finish off amd64 MMX instructions before they finish me off (it's
either them or me). Honestly, the amd64 insn set has the most complex
encoding I have ever seen.