Fix (well, ameliorate, at least) some lurking performance problems
(time taken to do register allocation, not quality of result) which
were tolerable when allocating for x86/amd64 but became serious when
dealing with ppc-sized numbers of real registers (around 90).
* Don't sanity-check the entire regalloc state after each insn
processed; this is total overkill. Instead do it every 7th insn
processed (somewhat arbitrarily) and just before the last insn.
* Reinstate an optimisation from the old UCode allocator: shadow
the primary state structure (rreg_state) with a redundant inverse
mapping (vreg_state) to remove the need to search
through rreg_state when looking for info about a given vreg, a
very common operation (see the sketch after this list). Add logic
to keep the two maps consistent.
Add a sanity check to ensure they really are consistent.
* Rename some variables and macros to make the code easier to
understand.
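A minimal sketch of the shadow-map idea in C (names, sizes and fields
here are illustrative, not the actual reg_alloc code):

    #include <assert.h>

    #define N_RREGS  96             /* roughly a ppc-sized real-register universe */
    #define N_VREGS  512            /* example bound on virtual registers         */
    #define INVALID_RREG_NO ((short)-1)

    typedef struct {
       int has_vreg;                /* does this real reg currently hold a vreg?  */
       int vreg_no;                 /* if so, which one                           */
    } RRegState;

    static RRegState rreg_state[N_RREGS];   /* primary map, indexed by real reg   */
    static short     vreg_state[N_VREGS];   /* inverse map: vreg -> rreg_state ix */

    static void init_maps(void)
    {
       for (int r = 0; r < N_RREGS; r++) rreg_state[r].has_vreg = 0;
       for (int v = 0; v < N_VREGS; v++) vreg_state[v] = INVALID_RREG_NO;
    }

    /* O(1) lookup that previously required scanning all of rreg_state. */
    static short rreg_holding_vreg(int vreg_no)
    {
       return vreg_state[vreg_no];
    }

    /* Cross-check that the two maps agree; the allocator runs a check
       like this every few insns rather than after every one. */
    static void sanity_check_maps(void)
    {
       for (int v = 0; v < N_VREGS; v++) {
          short r = vreg_state[v];
          if (r == INVALID_RREG_NO) continue;
          assert(r >= 0 && r < N_RREGS);
          assert(rreg_state[r].has_vreg && rreg_state[r].vreg_no == v);
       }
    }

With this in place, finding which real register (if any) currently
holds a given vreg is a single array lookup rather than a scan over
~90 rreg_state entries.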
On x86->x86 (--tool=none), total Vex runtime is reduced by about 10%,
and amd64 is similar. For ppc32 the Vex runtime is nearly halved. On
x86->x86 (--tool=none), register allocation now consumes only about
10% of the total Vex runtime.
When hooked up to Valgrind, run time of short-running programs --
which is dominated by translation time -- is reduced by up to 10%.
Calltree/kcachegrind/cachegrind proved instrumental in tracking down
and quantifying these performance problems. Thanks, Josef & Nick.
The logic that drove basic-block-to-IR disassembly had been duplicated
across the 3 front ends (x86, amd64, ppc32). Given the need to take
into account basic-block chasing, adding of instruction marks, etc.,
the logic is not completely straightforward, and so commoning it up is
a good thing.
Julian Seward [Thu, 30 Jun 2005 23:31:27 +0000 (23:31 +0000)]
Enhance IR so as to distinguish between little- and big-endian loads and
stores, so that PPC can be properly handled. Until now it's been hardwired
to assume little-endian.
As a result, IRStmt_STle is renamed IRStmt_Store and IRExpr_LDle is
renamed IRExpr_Load.
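As a sketch (signatures here approximate, rather than quote, the new
libvex_ir.h declarations), the renamed constructors take an explicit
endianness tag, Iend_LE or Iend_BE:

    #include "libvex_ir.h"

    /* Hypothetical wrappers, just to show the endianness-aware forms;
       'addr' and 'data' are IRExprs the front end has already built. */
    static IRStmt* mk_store(IREndness end, IRExpr* addr, IRExpr* data)
    {
       return IRStmt_Store(end, addr, data);    /* was IRStmt_STle(addr, data)    */
    }
    static IRExpr* mk_load32(IREndness end, IRExpr* addr)
    {
       return IRExpr_Load(end, Ity_I32, addr);  /* was IRExpr_LDle(Ity_I32, addr) */
    }

The x86 and amd64 front/back ends pass Iend_LE throughout; the ppc32
ones pass Iend_BE.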
Julian Seward [Thu, 30 Jun 2005 12:08:48 +0000 (12:08 +0000)]
Connect up the plumbing which allows the ppc32 front end to know the
cache line size it is supposed to simulate. Use this in
dis_cache_manage(). Finally reinstate 'dcbz'.
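For reference, the behaviour dis_cache_manage() has to model for
'dcbz' is roughly the following (a C sketch of the semantics, not the
IR it actually emits; the line size comes from the new plumbing):

    #include <stdint.h>
    #include <string.h>

    /* Zero the naturally aligned block, of the simulated cache line
       size, that contains the effective address 'ea'. */
    static void dcbz_model(uint8_t* guest_mem, uint32_t ea, uint32_t line_szB)
    {
       uint32_t base = ea & ~(line_szB - 1);   /* align down to the line start */
       memset(guest_mem + base, 0, line_szB);
    }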
Julian Seward [Thu, 30 Jun 2005 11:49:14 +0000 (11:49 +0000)]
(API-visible change): generalise the VexSubArch idea. Everywhere
where a VexSubArch was previously passed around, a VexArchInfo is now
passed around. This is a struct which carries more details about any
given architecture and in particular gives a clean way to pass around
info about PPC cache line sizes, which is needed for guest-side PPC.
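The shape of the change is roughly this (field and tag names are
illustrative, not the exact libvex.h declaration):

    typedef enum {
       VexSubArch_INVALID = 0
       /* ... the existing per-architecture subarch tags ... */
    } VexSubArch;

    typedef struct {
       VexSubArch subarch;            /* what used to be passed on its own       */
       int        ppc_cache_line_szB; /* line size for dis_cache_manage()/'dcbz' */
    } VexArchInfo;

Callers fill in the struct once and hand it to the front and back
ends, which can then pick out whatever details they need.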
guest-ppc32
~~~~~~~~~~~
- store-with-update instrs: Valgrind's pagefault handler expects the
  faulting address to be >= the current stack pointer, so we need to
  update the stack-pointer register _before_ storing the old stack
  pointer (see the sketch after this list)
- branch_ctr_ok (bad calc for 'branch if %ctr zero' case)
- mcrf: scanning bitfields in the wrong direction
- on spotting the magic sequence, delta += 24
- updated DIPs for +ve-only args
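A sketch of the store-with-update ordering fix mentioned above (the
helpers newTemp, assign, getIReg, putIReg, storeBE, binop, mkU32 and
mkexpr stand in for the front end's local IR-building conveniences):

    IRTemp ea  = newTemp(Ity_I32);
    IRTemp val = newTemp(Ity_I32);
    assign(ea,  binop(Iop_Add32, getIReg(rA), mkU32(d))); /* EA = rA + d            */
    assign(val, getIReg(rS));                             /* value to be stored     */
    putIReg(rA, mkexpr(ea));          /* update rA (the stack ptr) BEFORE the store */
    storeBE(mkexpr(ea), mkexpr(val)); /* so a fault here sees addr >= new stack ptr */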
host-ppc32
~~~~~~~~~~
- fixed CMov reg usage
- fixed Pin_Call in emit_PPC32Instr(): we already know how far we're jumping
- fixed Pin_Goto in emit_PPC32Instr(): vassert the right range of jump deltas
other-ppc32
~~~~~~~~~~~
- exported OFFSET_ppc32_(various) for valgrind
Julian Seward [Wed, 18 May 2005 11:47:47 +0000 (11:47 +0000)]
Handle XCHG rAX, reg for 32-bit regs as well as 64-bit regs. I'm not
sure this is right -- the AMD64 docs are very difficult to interpret
on the subtle point of precisely what is and isn't to be regarded as a
no-op.
Julian Seward [Thu, 12 May 2005 17:55:01 +0000 (17:55 +0000)]
Add the beginnings of what might be a general mechanism to pass
ABI-specific knowledge through the IR compilation pipeline. This
entails a new IR construct, AbiHint.
Currently there is only one kind of hint, and it is generated by the
amd64 front end. This tells whoever wants to know that a function
call or return has happened, and so the 128 bytes below %rsp should be
considered undefined.
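As a sketch only (binop, mkU64, stmt and getIReg64 stand in for the
amd64 front end's local helpers, and the AbiHint constructor's exact
signature is an assumption here), the hint emitted after a call or
return looks something like:

    IRExpr* rsp  = getIReg64(R_RSP);                /* %rsp after the call/return   */
    IRExpr* base = binop(Iop_Sub64, rsp, mkU64(128));
    stmt( IRStmt_AbiHint(base, 128) );              /* [%rsp-128, %rsp) is now junk */

A consumer such as Memcheck can then mark those 128 bytes as undefined.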
Julian Seward [Wed, 11 May 2005 23:16:13 +0000 (23:16 +0000)]
Allow reg-alloc to use %rbx. This is a callee-saved register and
therefore particularly valuable - bringing it into circulation reduces
the volume of code generated by memcheck by about 3%.
Julian Seward [Wed, 11 May 2005 22:55:08 +0000 (22:55 +0000)]
Ah, the joys of register allocation. You might think that giving
reg-alloc as many registers as possible maximises performance. You
would be wrong. Giving it more registers generates more spilling of
caller-saved regs around the innumerable helper calls created by
Memcheck. What we really need are zillions of callee-save registers,
but those are in short supply. Hmm, perhaps I should let it use %rbx
too -- that's listed as callee-save.
Anyway, the current arrangement allows reg-alloc to use 8
general-purpose regs and 10 xmm registers. The x87 registers are not
used at all. This seems to work fairly well.
Julian Seward [Mon, 9 May 2005 22:23:38 +0000 (22:23 +0000)]
Finish off amd64 MMX instructions before they finish me off (it's
either them or me). Honestly, the amd64 insn set has the most complex
encoding I have ever seen.
Julian Seward [Tue, 3 May 2005 12:20:15 +0000 (12:20 +0000)]
x86 guest: generate Iop_Neg* in the x86->IR phase. The intent is to
ensure that the non-shadow (real) computation done by the program will
fail if Iop_Neg* is incorrectly handled somehow. Until this point,
Iop_Neg* was only generated by memcheck, and so it would not be obvious
if it were mishandled. IOW, this commit enhances the verifiability of
the x86->IR->x86 pipeline.
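In sketch form (binop, unop and mkU32 being the front end's local
helpers, and 'src' the operand already loaded as IR), the change for
'negl' amounts to:

    IRExpr* old_way = binop(Iop_Sub32, mkU32(0), src);  /* 0 - src: Neg never exercised */
    IRExpr* new_way = unop(Iop_Neg32, src);             /* -src: exercised by real code */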
Julian Seward [Mon, 2 May 2005 16:16:15 +0000 (16:16 +0000)]
Minor tweakage: use testl rather than andl in three places on the
basis that andl trashes the tested register whereas testl doesn't. In
two out of the three cases this makes no difference since the tested
register is a copy of some other register anyway, but hey.
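A small self-contained C model of the difference (not VEX code; only
ZF is modelled, for brevity):

    #include <stdint.h>

    static uint32_t zf_from(uint32_t r) { return r == 0 ? 1u : 0u; }

    static uint32_t do_andl(uint32_t* dst, uint32_t src)   /* andl src, dst  */
    {
       *dst &= src;                   /* trashes the tested register */
       return zf_from(*dst);
    }

    static uint32_t do_testl(uint32_t dst, uint32_t src)   /* testl src, dst */
    {
       return zf_from(dst & src);     /* flags only; dst is untouched */
    }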